This project extends the concepts introduced in assignment 1 to build a simulation of a piece of cloth falling under gravity on a stationary spherical object. The cloth is modelled as a 2-D rectangular network, where node (x,y) (x and y are integers) in an (N*N) square network interacts with all other nodes (x+delta,y+delta) where delta is some (low value) integer. Thus if delta=1 a typical node will interact with 8 neighbouring nodes, while if delta=2 there are 24 interactions to consider. When a node is located on the cloth edge there may be fewer interactions. Thus in contrast to the MD simulation, each node in the network interacts with a finite number of other nodes, making the evaluation of the total interaction potential scale as O(N), whereN is the number of nodes.
Between each pair of nodes, we define an interaction given byHooke's law.
PE12= K * ( R12- Eq12)2
Fx12= K * (R12- Eq12) * (X1- X2) / R12
Where K is the force constant determining how stiff the spring is, R12is the Euclidean distance between the two nodes and Eq12is the equilibrium distance between these two nodes. For example if node (1,1) and node (1,2) have Eq12=1*d then node (1,1) and node (2,2) have Eq12=sqrt(2)*d.
Each node we will assign a mass, noting that the force and acceleration are related by F=ma. The cloth is initially positioned in the xz plane and subjected to a gravitational force g in the y direction. As a consequence, the cloth will slowly fall under gravity.
Positioned below the cloth is a ball of radius, r, such that the centre of the cloth is located 1+r units above the centre of the ball. The cloth is allowed to fall under gravity until it collides with the ball. You follow the motions of the nodes using the same velocity verlet algorithm that you used for the MD. The difference arises when a node in the cloth hits the ball. You detect this by noticing that the updated position of the node is within the radius of the ball. At this point, you move the node to the nearest point on the surface of the ball.
While building on the understanding you gained in assignment 1 concerning numerical simulation, an important part of this assignment will be parallelising your simulation code.
STEP 1: Collision Correction
I wrote the python codecloth.pythat performs this basic simulation. Download and run this code. Make sure you understand what this code is doing. As discussed in lectures, something is not quite right! If you inspect the code you will see that if the new coordinates for the cloth are within the ball, then that node of the cloth is moved back to the surface of the ball. This happens here
for node in nodes:
dist = node.pos-vector(myball.x,myball.y,myball.z)
if dist.mag
fvector=dist/dist.mag*myball.radius
node.pos=vector(myball.x,myball.y,myball.z)+fvector
A problem arises in that the velocity of the node remains the same. In this step modify the code such that the component of the velocity that is in the direction of the fvector above is set to zero. In other words, only the component of the velocity that is tangential to the surface is allowed to be non-zero following a collision of a cloth node with the ball. (If you are uncertain how to do this, try using the forum, and if you are still confused - ask me!).
Download theassign2.tar.gzfile. As discussed in lectures it contains a C version of the cloth Simulation code that runs using openGL, and another version that just contains the numerical kernel. It is envisaged that you use the kernel only version for performance testing on Gadi. Your overall task is to vectorize and parallelise the code using OpenMP.
- Modify the code as provided so that when the cloth collides with the ball only the component of the velocity that is directed towards the centre of the ball is set to zero, i.e. what you did above for the Python code.
- Run the code for a range of different problem sizes and gather some basic performance data. Data that you will then use to compare with when tuning and parallelising the code. (Include this performance data in your write-up.)
- The supplied code is coded rather poorly in the computationally intensive parts. Identify as many problems as you can and fix them. Tune the code to perform as best as possible using a single core. (You are required to submit a copy of this code.)
STEP 3: SSE and OpenMP Vectorisation
- Vectorise the above code by having it use SSE intrinsic functions. (You are required to submit a copy of this code.)
- Vectorise the code using OpenMP. Use Intel Advisor to produce a report indicating which loops have been successfully vectorized. Include this information in your write-up. (You are required to submit a copy of this code)
- Gather data to compare the performance of the vectorised and non-vectorized code. (Include this performance data in your write-up.)
STEP 4: Parallelisation using OpenMP
Extend your SSE vectorised code as follows:
- Add another input flag to your program denoted-p, that reads in an integer that is used to set the maximum number of OpenMP threads used by your code.
- Parallelise your code using OpenMP by distributing a number of consecutive rows of nodes in the cloth to threads using a round-robin approach, i.e. use SCHEDULE (STATIC, CHUNKSIZE) directive. (You are required to submit a copy of this code.)
- Analyse the performance of your OpenMP code on Gadi giving consideration to the number of threads, problem size, chunksize and relating your results to Amdahl's law, cost of thread synchronization etc. (Include the performance data in your write-up.)
STEP 5: Roofline analysis
- Produce a roofline plot (using Intel Advisor) for the bottleneck of your program. Analyse your roofline plot, discussing whether your code is compute- or memory-bound. Motivate your answer.
You are required to submit a tar file that contains the following
·
A single makefile or a CMake buildsystemthat if run will build all the different versions of the code required
·
A version of the code that is modified such thatonly the velocity component of a node of the cloth colliding with the ball that is directed towards the centre of the ball is set to zero, and that has been optimised for execution on a single core. Call this version of the code kernel_main_opt
·
A version of kernel_main_opt that has been vectorised using SSE intrinsic functions. Call this version of the code kernel_main_sse
·
A version of kernel_main_opt that has been vectorised using OpenMP. Call this version of the code kernel_main_vect_omp
·
A version of kernel_main_sse that has been parallelised using OpenMP. Call this version of the code kernel_main_omp
·
A file called writeup.pdf that contains performance data with suitable commentary and analysis, including also your roofline analysis.
·
A README file that details the content of the tar file and specifies how to build the different versions of the code.