Question 1 (50 Points) Pipelining
PART 1 (15 points) Assume you have a single cycle processor operating at 1 GHz. You are going to
make a 5-stage pipeline out of this processor. Although the processor can potentially operate at a
higher frequency, overheads associated with pipelining force you to operate the pipelined processor at
3 GHz. In a given program, assume that 40% of the instructions are memory instructions, 50% are ALU
instructions, and the remaining 10% are branch instructions. 10% of the memory instructions cause stalls
of 20 clock cycles each due to cache misses, and 50% of the branch instructions cause stalls of 4 cycles
each. Assume that
there are no stalls associated with the execution of ALU instructions. For this program, what is the
speedup achieved by the pipelined processor over the single cycle processor?
Answer:
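One way to work this out is sketched below. It assumes (the question does not say so explicitly) that the single-cycle processor has a CPI of 1 and does not incur the stall penalties, and that the pipelined processor has a base CPI of 1 with the stall cycles added on top. The function and variable names are illustrative only.

```python
# Sketch of the Part 1 speedup calculation.
# Assumption: the single-cycle processor runs at CPI = 1 with no stalls.
# Assumption: the pipelined processor has a base CPI of 1 plus stall cycles.

def pipelined_cpi(mem_frac, mem_stall_frac, mem_stall_cycles,
                  br_frac, br_stall_frac, br_stall_cycles):
    """Average CPI = 1 + expected stall cycles per instruction."""
    mem_penalty = mem_frac * mem_stall_frac * mem_stall_cycles
    br_penalty = br_frac * br_stall_frac * br_stall_cycles
    return 1.0 + mem_penalty + br_penalty

# Clock periods in nanoseconds (1 GHz -> 1 ns, 3 GHz -> 1/3 ns).
single_cycle_period_ns = 1.0 / 1.0
pipelined_period_ns = 1.0 / 3.0

# 40% memory (10% of them stall 20 cycles), 10% branch (50% of them stall 4 cycles).
cpi = pipelined_cpi(0.40, 0.10, 20, 0.10, 0.50, 4)

# Average time per instruction = CPI * clock period.
single_cycle_time_ns = 1.0 * single_cycle_period_ns
pipelined_time_ns = cpi * pipelined_period_ns

speedup = single_cycle_time_ns / pipelined_time_ns
print(f"pipelined CPI = {cpi:.2f}, speedup = {speedup:.2f}")
```

Under these assumptions the pipelined CPI is 1 + 0.8 + 0.2 = 2.0, so the speedup is 1 ns / (2.0 * 1/3 ns) = 1.5.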
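PART 2 Consider two pipeline implementations, A and B, with the following stage latencies:
Stage   1        2        3        4        5        6
A       250 ps   180 ps   400 ps   200 ps   150 ps   -
B       200 ps   150 ps   250 ps   250 ps   150 ps   180 ps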
a) What are the maximum clock rates for the two implementations? Note that 1 ps = 10^-12 seconds.
b) Consider a program which requires 2 billion instructions to execute on pipeline A with a CPI
of 1.5, whereas it requires 1.5 billion instructions to execute on pipeline B with a CPI of 4.
Which implementation would you prefer for this program?
T = IC * CPI * Tc
T_A = 2 * 10^9 * 1.5 * 400 * 10^-12 s = 1.2 s
T_B = 1.5 * 10^9 * 4 * 250 * 10^-12 s = 1.5 s
Therefore, A is faster for this program and should be chosen.
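The calculation above, together with the clock rates asked for in part a), can be checked with a short script. This is a sketch that assumes the clock period equals the latency of the slowest stage, with no additional pipeline-register overhead. The helper names are illustrative.

```python
# Sketch: maximum clock rate and execution time for pipelines A and B.
# Assumption: clock period = slowest stage latency (no register overhead).

stage_latencies_ps = {
    "A": [250, 180, 400, 200, 150],
    "B": [200, 150, 250, 250, 150, 180],
}

# (instruction count, CPI) for the program on each pipeline.
program = {
    "A": (2.0e9, 1.5),
    "B": (1.5e9, 4.0),
}

def clock_period_ps(latencies):
    """The slowest stage sets the clock period."""
    return max(latencies)

def max_clock_ghz(latencies):
    """f = 1 / Tc; with Tc in ps, 1000 / Tc gives GHz."""
    return 1000.0 / clock_period_ps(latencies)

def execution_time_s(instruction_count, cpi, period_ps):
    """T = IC * CPI * Tc, converting Tc from ps to seconds."""
    return instruction_count * cpi * period_ps * 1e-12

results = {}
for name, latencies in stage_latencies_ps.items():
    ic, cpi = program[name]
    tc = clock_period_ps(latencies)
    results[name] = (max_clock_ghz(latencies), execution_time_s(ic, cpi, tc))
    print(f"{name}: Tc = {tc} ps, f_max = {results[name][0]:.2f} GHz, "
          f"T = {results[name][1]:.2f} s")
```

With the latencies in the table this gives 2.5 GHz for A and 4 GHz for B, so B has the faster clock but A still finishes this program sooner.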
PART 3 (20 points) Assume you have a 6 stage pipeline which is composed of the following stages:
F D X1 X2 M W
Note that the execute stage requires two clock cycles (X1 and X2). Also, the register file is designed
so that there is NO early write and late read. Assuming that the execute stage is designed so that a new
execution can begin while the previous one is still in progress, we have a pipeline which can
theoretically start (and complete) one instruction per clock cycle. However, hazards complicate things,
and unavoidable stalls will result in a CPI greater than 1. Assume that branch decisions are made in
the X1 stage. The following code needs to be run:
Consider only 2 iterations of the loop, that is, a total of 3 x 2 = 6 instructions:
a) How many clock cycles does this code take in an ideal world if there were no control dependencies
or data dependencies?
b) In a table similar to the following, show which stage of each instruction is executed in each
cycle (F, D, X1, X2, M, W), using the information given above and assuming that the pipeline has
forwarding hardware. Also, clearly show forwarding with arrows between stages (if any). Make sure
that you explicitly show stalls (if any).
I1 add F D X1 X2 M W
I2 lw F D - X1 X2 M W
I3 beq F - D - - X1 X2 M W
I1 add - - - - F D X1 X2 M W
I2 lw F D - X1 X2 M W
I3 beq F - D - - X1 X2 M W
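For part a), the ideal case can be sketched as follows. With no stalls, each instruction enters the pipeline one cycle after the previous one, so n instructions on a k-stage pipeline take k + (n - 1) cycles. The script below is an illustration only: the function names are hypothetical, and the instruction labels are taken from the table above.

```python
# Sketch: ideal (stall-free) timing for the 6-stage pipeline.
# Assumption: one instruction is fetched every cycle, with no hazards at all.

STAGES = ["F", "D", "X1", "X2", "M", "W"]
INSTRUCTIONS = ["I1 add", "I2 lw", "I3 beq"] * 2  # two loop iterations

def ideal_cycles(num_stages, num_instructions):
    """First instruction needs num_stages cycles; each later one adds 1."""
    return num_stages + (num_instructions - 1)

def ideal_schedule(stages, instructions):
    """Return (label, row) pairs; row[c] is the stage occupied in cycle c+1."""
    total = ideal_cycles(len(stages), len(instructions))
    rows = []
    for i, label in enumerate(instructions):
        row = ["."] * total
        for s, stage in enumerate(stages):
            row[i + s] = stage  # instruction i starts in cycle i+1
        rows.append((label, row))
    return rows

print("cycles:", ideal_cycles(len(STAGES), len(INSTRUCTIONS)))
for label, row in ideal_schedule(STAGES, INSTRUCTIONS):
    print(f"{label:<7}" + " ".join(f"{cell:<2}" for cell in row))
```

For 6 stages and 6 instructions this gives 6 + 5 = 11 cycles, which is the baseline against which the stalled schedule in part b) can be compared.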