[15/25/25]As we saw in Section G.5, some loop structures are not easily vectorized. One common structure is a reduction—a loop that reduces an array to a single value by repeated application of an...

[15/25/25]As we saw in Section G.5, some loop structures are not easily

vectorized. One common structure is a reduction—a loop that reduces an array to a


single value by repeated application of an operation. This is a special case of a


recurrence. A common example occurs in dot product:


dot=0.0


do 10 i=1,64


10 dot=dot+A(i) * B(i)


This loop has an obvious loop-carried dependence (on dot) and cannot be vectorized


in a straightforward fashion. The first thing a good vectorizing compiler


would do is split the loop to separate out the vectorizable portion and the recurrence


and perhaps rewrite the loop as:


do 10 i=1,64


10 dot(i)=A(i) * B(i)


do 20 i=2,64


20 dot(1)=dot(1)+dot(i)


The variable dot has been expanded into a vector; this transformation is called


scalar expansion. We can try to vectorize the second loop either relying strictly on


the compiler (part (a)) or with hardware support as well (part (b)). There is an


important caveat in the use of vector techniques for reduction. To make reduction


work, we are relying on the associativity of the operator being used for the reduction.



Because of rounding and finite range, however, floating-point arithmetic is


not strictly associative. For this reason, most compilers require the programmer


to indicate whether associativity can be used to more efficiently compile


reductions.


a. [15]One simple scheme for compiling the loop with the recurrence is to


add sequences of progressively shorter vectors—two 32-element vectors, then


two 16-element vectors, and so on. This technique has been called recursive


doubling. It is faster than doing all the operations in scalar mode. Show how


the FORTRAN code would look for execution of the second loop in the preceding


code fragment using recursive doubling.


b. [25]In some vector processors, the vector registers are addressable, and


the operands to a vector operation may be two different parts of the same vector


register. This allows another solution for the reduction, called partial sums.


The key idea in partial sums is to reduce the vector to m sums where m is the


total latency through the vector functional unit, including the operand read and


write times. Assume that the VMIPS vector registers are addressable (e.g., you


can initiate a vector operation with the operand V1(16), indicating that the input


operand began with element 16). Also, assume that the total latency for adds,


including operand read and write, is eight cycles. Write a VMIPS code


sequence that reduces the contents of V1 to eight partial sums. It can be done


with one vector operation.


c. [25]Discuss how adding the extension in part (b) would affect a


machine that had multiple lanes.


May 18, 2022
SOLUTION.PDF

Get Answer To This Question

Submit New Assignment

Copy and Paste Your Assignment Here