[15/25/25]As we saw in Section G.5, some loop structures are not easily vectorized. One common structure is a reduction—a loop that reduces an array to a single value by repeated application of an...

[15/25/25]As we saw in Section G.5, some loop structures are not easily

vectorized. One common structure is a reduction—a loop that reduces an array to a

single value by repeated application of an operation. This is a special case of a

recurrence. A common example occurs in dot product:

dot=0.0

do 10 i=1,64

10 dot=dot+A(i) * B(i)

This loop has an obvious loop-carried dependence (on dot) and cannot be vectorized

in a straightforward fashion. The first thing a good vectorizing compiler

would do is split the loop to separate out the vectorizable portion and the recurrence

and perhaps rewrite the loop as:

do 10 i=1,64

10 dot(i)=A(i) * B(i)

do 20 i=2,64

20 dot(1)=dot(1)+dot(i)

The variable dot has been expanded into a vector; this transformation is called

scalar expansion. We can try to vectorize the second loop either relying strictly on

the compiler (part (a)) or with hardware support as well (part (b)). There is an

important caveat in the use of vector techniques for reduction. To make reduction

work, we are relying on the associativity of the operator being used for the reduction.

Because of rounding and finite range, however, floating-point arithmetic is

not strictly associative. For this reason, most compilers require the programmer

to indicate whether associativity can be used to more efficiently compile

reductions.

a. [15]One simple scheme for compiling the loop with the recurrence is to

add sequences of progressively shorter vectors—two 32-element vectors, then

two 16-element vectors, and so on. This technique has been called recursive

doubling. It is faster than doing all the operations in scalar mode. Show how

the FORTRAN code would look for execution of the second loop in the preceding

code fragment using recursive doubling.

b. [25]In some vector processors, the vector registers are addressable, and

the operands to a vector operation may be two different parts of the same vector

The key idea in partial sums is to reduce the vector to m sums where m is the

total latency through the vector functional unit, including the operand read and

write times. Assume that the VMIPS vector registers are addressable (e.g., you

can initiate a vector operation with the operand V1(16), indicating that the input

operand began with element 16). Also, assume that the total latency for adds,

including operand read and write, is eight cycles. Write a VMIPS code

sequence that reduces the contents of V1 to eight partial sums. It can be done

with one vector operation.

c. [25]Discuss how adding the extension in part (b) would affect a

machine that had multiple lanes.

May 18, 2022

Get Answer To This Question