[20/22/22] In this exercise, we will look at how a common vector loop runs on statically and dynamically scheduled versions of the RISC-V pipeline. The loop is the so-called DAXPY loop (discussed extensively in Appendix G), the central operation in Gaussian elimination. The loop implements the vector operation Y = a*X + Y for a vector of length 100. Here is the RISC-V code for the loop:

foo: fld    f2, 0(x1)       ; load X(i)
     fmul.d f4, f2, f0      ; multiply a*X(i)
     fld    f6, 0(x2)       ; load Y(i)
     fadd.d f6, f4, f6      ; add a*X(i) + Y(i)
     fsd    f6, 0(x2)       ; store Y(i)
     addi   x1, x1, 8       ; increment X index
     addi   x2, x2, 8       ; increment Y index
     sltiu  x3, x1, done    ; test if done
     bnez   x3, foo         ; loop if not done
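As a reference for what the loop computes, the following is a minimal Python sketch of the same DAXPY operation. It is not part of the exercise: the function name daxpy and the sample values are illustrative assumptions, and the comments map each statement to the instruction it stands for.

```python
def daxpy(a, x, y):
    """Compute y[i] = a*x[i] + y[i] in place, one element per loop iteration."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        xi = x[i]          # fld    f2, 0(x1)   load X(i)
        prod = a * xi      # fmul.d f4, f2, f0  multiply a*X(i)
        yi = y[i]          # fld    f6, 0(x2)   load Y(i)
        y[i] = prod + yi   # fadd.d + fsd       add, then store Y(i)
    return y


# Illustrative data only (assumed, not from the exercise): 100 elements.
n = 100
x = [float(i) for i in range(n)]
y = [1.0] * n
daxpy(2.0, x, y)
```

Each trip through the Python loop corresponds to one iteration of the assembly loop; the two addi instructions play the role of advancing i by one 8-byte element.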

For parts (a) to (c), assume that integer operations issue and complete in 1 clock cycle (including loads) and that their results are fully bypassed. You will use the FP latencies (only) shown in Figure C.29, but assume that the FP unit is fully pipelined.

For scoreboards below, assume that an instruction waiting for a result from another functional unit can pass through read operands at the same time the result is written. Also assume that an instruction in WB completing will allow a currently active instruction that is waiting on the same functional unit to issue in the same clock cycle in which the first instruction completes WB.
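Whichever pipeline is used, the stalls come from register dependences between the instructions of the loop. The Python sketch below lists the RAW, WAR, and WAW dependences inside a single iteration. It is an illustrative aid, not part of the exercise: the table LOOP and the helper find_hazards are assumed names, the store is written in standard assembler operand order, and dependences carried from one iteration to the next (through x1 and x2) are not covered.

```python
# One loop iteration as (text, destination register or None, source registers).
# "done" in sltiu is an immediate, so it does not appear as a source register.
LOOP = [
    ("fld f2, 0(x1)",      "f2", ["x1"]),
    ("fmul.d f4, f2, f0",  "f4", ["f2", "f0"]),
    ("fld f6, 0(x2)",      "f6", ["x2"]),
    ("fadd.d f6, f4, f6",  "f6", ["f4", "f6"]),
    ("fsd f6, 0(x2)",      None, ["f6", "x2"]),
    ("addi x1, x1, 8",     "x1", ["x1"]),
    ("addi x2, x2, 8",     "x2", ["x2"]),
    ("sltiu x3, x1, done", "x3", ["x1"]),
    ("bnez x3, foo",       None, ["x3"]),
]


def find_hazards(instrs):
    """List (kind, earlier index, later index, register) for one iteration."""
    found = []
    for j, (_, dst, srcs) in enumerate(instrs):
        earlier = instrs[:j]
        # RAW: each source depends on its most recent earlier writer.
        for reg in srcs:
            writers = [i for i, (_, d, _) in enumerate(earlier) if d == reg]
            if writers:
                found.append(("RAW", writers[-1], j, reg))
        if dst is None:
            continue
        writers = [i for i, (_, d, _) in enumerate(earlier) if d == dst]
        last_writer = writers[-1] if writers else -1
        # WAW: this write must stay ordered after the previous write.
        if writers:
            found.append(("WAW", last_writer, j, dst))
        # WAR: this write must not overtake reads made since that write.
        for i in range(last_writer + 1, j):
            if dst in earlier[i][2]:
                found.append(("WAR", i, j, dst))
    return found


HAZARDS = find_hazards(LOOP)
for kind, i, j, reg in HAZARDS:
    print(f"{kind} on {reg}: {LOOP[i][0]}  ->  {LOOP[j][0]}")
```

The RAW entries are the ones that can stall the statically scheduled pipeline; the WAR and WAW entries become relevant once instructions may complete out of order, as under a scoreboard.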

a. [20] For this problem, use the RISC-V pipeline of Section C.5 with the pipeline latencies from Figure C.29, but with a fully pipelined FP unit, so the initiation interval is 1. Draw a timing diagram, similar to Figure C.32, showing the timing of each instruction's execution. How many clock cycles does each loop iteration take, counting from when the first instruction enters the WB stage to when the last instruction enters the WB stage?

b. [20] Statically reorder the instructions to minimize the stalls for this loop, renaming registers where necessary. Use all the same assumptions as in (a). Draw a timing diagram, similar to Figure C.32, showing the timing of each instruction's execution. How many clock cycles does each loop iteration take, counting from when the first instruction enters the WB stage to when the last instruction enters the WB stage?

c. [20] Using the original code above, consider how the instructions would have executed using scoreboarding, a form of dynamic scheduling. Draw a timing diagram, similar to Figure C.32, showing the timing of the instructions through stages IF, IS (issue), RO (read operands), EX (execution), and WR (write result). How many clock cycles does each loop iteration take, counting from when the first instruction enters the WB stage to when the last instruction enters the WB stage?

May 18, 2022
