1) Convert the following pseudo code to C/C++ (vector triad benchmark from chapter 1 of the book) Run this code on the ACI-ICS cluster. Note* this should be run on a single core, you should run it on...

1 answer below »

1) Convert the following pseudo code to C/C++ (vector triad benchmark from chapter 1 of the book) Run this code on the ACI-ICS cluster. Note* this should be run on a single core, you should run it on both the aci-i and aci-b nodes.



double precision, dimension(N) :: A,B,C,D


double precision :: S,E,MFLOPS


do i=1,N // fill arrays with data



A(i) = 0.d0;B(i) = 1.d0



C(i) = 2.d0; D(i) = 3.d0


enddo


call get_walltime(S) // get start time


do j=1,R



do i=1,N



A(i) = B(i) + C(i) * D(i) // 3 loads, 1 store



enddo



if(A(2).lt.0) call dummy(A,B,C,D) // prevent loop interchange


enddo


call get_walltime(E) // get end time stamp


MFLOPS = R*N*2.d0/((E-S)*1.d6) // calculate MFLOPS



Use the following routine for get_wall_time()


#include



void get_walltime(double* wcTime) {



struct timeval tp;



gettimeofday(&tp, NULL);



*wcTime = (double)(tp.tv_sec + tp.tv_usec/1000000.0);


}



2) Explore appropriate values for R and N. Observe the results. Fix R to some value and generate a plot for N that best summarizes your observations. (MFPLOS vs. N).



*** R does not have to be fixed!!!! **** MFLOPS calculation takes care of this!



3) Your homework submission should be a zip file containing two files: a pdf write-up and a zip of your code. The short write-up should be 1-2 pages(including a graph) discussing the findings of your benchmarks. Your discussion should tie the benchmark results with the hardware under test. Your source code should include a Makefile.



Expectations:


Your code should use dynamic allocation for the arrays under test. No credit will be given if the code does not compile. Points will be lost if the code does not accurately represent the equivalent pseudo code above (actually fortran). This is a high performance computing course, the range of values for R and N should reflect that (Big Data!). None of the tests have to run for more than a few minutes.


Your write-up will be evaluated for quality of discussion and the accuracy of the graph.

Answered Same DaySep 28, 2020

Answer To: 1) Convert the following pseudo code to C/C++ (vector triad benchmark from chapter 1 of the book)...

Meenakshi answered on Oct 02 2020
150 Votes
Low level benchmark: Vector Triad
In this assignment we are discussing about the Memory Hierarchies and their performances. First we understand the memory concept and check the memory performance . We study and analysis the concept about memory hierarchies. For memory performance we work on algorithm and convert this algorithm into C language .
For this lab we have two header file and one c code
That header file having a function that is for walltime or delay of process
void get_walltime(double* wcTime) {
struct timeval tp;
gettimeo
fday(&tp, NULL);
*wcTime = (double)(tp.tv_sec + tp.tv_usec/1000000.0);
}
In this lab we develop a program another header file that is vector.h where we define the vector method that we will use for our array we use the array value in vector
For this program we define four vector array double A,B,C,D and precision :: S,E,MFLOPS with the help of loop we fill the vector using loop that run N times
Where A(i) = 0.d0;
B(i) = 1
C(i) = 2.
D(i) = 3
call get_walltime(S) // get start time
we set R for loop and calculate the r and n times in nested loop
A(i) = B(i) + C(i) * D(i) // 3 loads, 1 store
We call the dummy function that is for prevent loop interchange
call get_walltime(E)
MFLOPS = R*N*2.d0/((E-S)*1.d6) //
In this assignment we change N memory hierarchy analysis
We change the R and N for checking the memory performance and analysis the graph
We run the code and change the N and R them evaluate the memory performance
    jam
do i=1,N,2
do j=1,N
c(i)=c(i)+a(i,j)*b(j)
enddo
do j=1,N
c(i+1)=c(i+1)+a(i+1,j)*b(j) enddo
enddo
do i=1,N,2
do j=1,N
c(i)=c(i)+a(i,j)
* b(j)
c(i+1)=c(i+1)+a(i+1,j)* b(j)
enddo
enddo
b(j) can be re-used once
from register → save 1 LD operation
Lowers Bc from 1 to ¾ W/F
May 26, 2011
PTfS 2011 28
Data access – general guidelines
O(N2)/O(N2) algorithms cont’d
Data access pattern for 2-way unrolled dense MVM:
Vector b[] now only loaded
N/2 times!
= + •
Remainder loop handled
separately
Data transfers can further be reduced by more aggressive unrolling (i.e., m-way instead of 2-way)
Significant code bloat (try to use compiler directives if possible)
Main memory limit: b[] only be loaded once from memory (Bc ≈ ½ W/F) (can be achieved by high unrolling OR large outer level caches)
Outer loop unrolling can also be beneficial to reduce traffic within caches!
Beware: CPU registers are a limited resource
Excessive unrolling can cause register spills to memory
May 26, 2011
PTfS 2011 29
Data access – general guidelines
O(N2)/O(N2) algorithms cont’d: Spatial Blocking
Split up inner loop in small chunks (pref. multiple cache lines)
Add outer loop pointing to the start of the innermost loop
bs: blocking size
bs=m*CLS, nb=N/bs
do k=1,nb
do i=1,N
do j=(k-1)*bs+1,k*bs y(i)=y(i)+a(j,i)*x(j)
end do
end do
end do
What is the problem here?
Each chunk of x is loaded only once to inner cache level and (re-)used N times.
CLS: Cache Line Size
Blocking size (bs) should be CLS or multiple of it. Upper limit is imposed by inner cache size
May 26, 2011
PTfS 2011 30
Data access – general guidelines
O(N2)/O(N2) algorithms cont’d: Spatial Blocking
Spatially blocked MVM Matrix-Vector Multiply
y
=
A
x
bs:
blocking size
Save loads (x) from memory at the cost of additional store operations (y):
Vector x is loaded only once instead of N times
Vector y is loaded nb times instead of only once
May 26, 2011
PTfS 2011 31
Data access – general guidelines
Case 3: O(N3)/O(N2) algorithms
Most favorable case – computation outweighs data traffic by factor of N
Examples: Dense matrix diagonalization, dense matrix-matrix multiplication
Huge optimization potential: proper optimization can render the problem cache-bound if N is large enough
Example: dense matrix-matrix multiplication
    do i=1,N
    
    Core task: dense MVM
    
    
    
    
    
    do j=1,N
    
    
    
    
    
    (O(N2)/O(N2))
    
    do k=1,N
    
    
    
    
    
    → memory bound
    
    c(j,i)=c(j,i)+a(k,i)*b(k,j)
    
    
    
    
    
    
    
    enddo
    
    
    
    enddo
    
    
    
    
    
    
    
    enddo
    
    
    
    
    
    
3 N2 data elements involved BUT main memory traffic ~ N3 for large N Reason: b(:,:) needs to be reloaded N times!
Implement spatial blocking!
May 26, 2011
PTfS 2011 32
Data access – case study: Dense matrix vector multiplication
Stride one access is implemented by counting inner loops over first index (FORTRAN) or last index (C/C++) either by
choosing the right data layout or
arranging nested loops in the right order (loop interchange):
    do i=1,N
    do i=1,N
    do j=1,N
    do j=1,N
    c(i)=c(i)+A(i,j)*b(j)
    c(i)=c(i)+A(j,i)*b(j)
    enddo
    enddo
    enddo
    enddo
    Stride N access to A
    Stride 1 access to A
    „non-consecutive“
    „transpose“
This can be done automatically by some compilers!
May 26, 2011
PTfS 2011 33
Data access – case study: Dense MVM
Consider data transfer to registers for “transpose” version
(consider double precision)
    do i=1,N
    
    c(i) loaded once and kept in register
    tmp = c(i)
    
    Balance of inner loop kernel:
    do j=1,N
    
    
    
    
    2 W / 2 FLOPs = 1 W/F
    tmp = tmp + A(j,i)* b(j)
    
    
    
    enddo
    
    Complete vector B(1:N) is loaded N
    c(i) = tmp
    
    times from source to register!
    enddo
    
    do i=1,N,2
    
    
    tmp0 = c(i)
    m-way outer loop unrolling:
    
    tmp1 = c(i+1)
    BC = (m+1)W / 2mF =0.5 (1+1/m) W/F
    do...
SOLUTION.PDF

Answer To This Question Is Available To Download

Related Questions & Answers

More Questions »

Submit New Assignment

Copy and Paste Your Assignment Here