[15/25/25]As we saw in Section G.5, some loop structures are not easily
vectorized. One common structure is a reduction—a loop that reduces an array to a
single value by repeated application of an operation. This is a special case of a
recurrence. A common example occurs in dot product:
dot=0.0
do 10 i=1,64
10 dot=dot+A(i) * B(i)
This loop has an obvious loop-carried dependence (on dot) and cannot be vectorized
in a straightforward fashion. The first thing a good vectorizing compiler
would do is split the loop to separate out the vectorizable portion and the recurrence
and perhaps rewrite the loop as:
do 10 i=1,64
10 dot(i)=A(i) * B(i)
do 20 i=2,64
20 dot(1)=dot(1)+dot(i)
The variable dot has been expanded into a vector; this transformation is called
scalar expansion. We can try to vectorize the second loop either relying strictly on
the compiler (part (a)) or with hardware support as well (part (b)). There is an
important caveat in the use of vector techniques for reduction. To make reduction
work, we are relying on the associativity of the operator being used for the reduction.
Because of rounding and finite range, however, floating-point arithmetic is
not strictly associative. For this reason, most compilers require the programmer
to indicate whether associativity can be used to more efficiently compile
reductions.
a. [15]One simple scheme for compiling the loop with the recurrence is to
add sequences of progressively shorter vectors—two 32-element vectors, then
two 16-element vectors, and so on. This technique has been called recursive
doubling. It is faster than doing all the operations in scalar mode. Show how
the FORTRAN code would look for execution of the second loop in the preceding
code fragment using recursive doubling.
b. [25]In some vector processors, the vector registers are addressable, and
the operands to a vector operation may be two different parts of the same vector
register. This allows another solution for the reduction, called partial sums.
The key idea in partial sums is to reduce the vector to m sums where m is the
total latency through the vector functional unit, including the operand read and
write times. Assume that the VMIPS vector registers are addressable (e.g., you
can initiate a vector operation with the operand V1(16), indicating that the input
operand began with element 16). Also, assume that the total latency for adds,
including operand read and write, is eight cycles. Write a VMIPS code
sequence that reduces the contents of V1 to eight partial sums. It can be done
with one vector operation.
c. [25]Discuss how adding the extension in part (b) would affect a
machine that had multiple lanes.