hw2 Solns
College of Engineering
Computer Science Division — EECS
Solution Set #2
CS252 Graduate Computer Architecture
5.1 (9 points)
This problem explores unfair memory-system benchmarks. We have two cache configurations to compare:
cache A is 2-way set associative, 128 sets, 32-byte blocks, write-through, no-write allocate; cache B is 1-way
associative, 256 sets, 32-byte blocks, write-back, write-allocate. A word on the terminology in this problem:
we are told that the miss time is 10 times the hit time. Miss time here is the total time spent processing a
miss, not just the miss penalty; you can think of it as miss time = miss penalty + hit time.
(a) We want a program that makes A as much faster than B as possible. The basic idea is to write a
program that generates a large amount of conflict misses in the direct-mapped cache B, but generates
no misses in cache A. One way you might think about doing this is writing a program that repeatedly
reads from two alternating addresses α and β such that α mod (256 × 32) = β mod (256 × 32) (so that
α and β map to the same line in B, and to the two ways of the same set in A). There is a subtlety,
however. Both cache A and B are unified caches, so for every memory read, there must also be an
instruction fetch from the cache. If there are no conflicts in the instruction fetches, the instruction
fetch times will be the same for both caches, and we will not see as large a performance difference
between A and B as we would if the instruction fetches themselves conflicted. One way to guarantee maximum performance
difference between A and B is to only use instruction conflicts. Consider the following code:
a: j b ; absolute jump to b
.
.
.
b: j a ; absolute jump to a
where a and b refer to addresses that collide in the direct-mapped cache but not in the 2-way cache
(for example, a mod 8192 = b mod 8192). For cache A, the instruction fetches map to different entries
in a set, so each instruction fetch is a hit. For cache B, the fetches map to the same set, so they cause
conflict misses on each instruction fetch.
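To make the conflict condition concrete, here is a minimal sketch of the set-index arithmetic (the constants follow the two cache configurations; the example addresses are illustrative, not taken from the problem):

#include <stdio.h>
#include <stdint.h>

/* Set-index arithmetic for the two caches in the problem: cache A is 2-way
 * with 128 sets of 32-byte blocks, cache B is direct-mapped with 256 sets of
 * 32-byte blocks. Two addresses 8192 bytes apart fall in the same set of
 * both caches; A can hold both (two ways), B cannot (one way). */
static unsigned set_a(uint32_t addr) { return (addr / 32) % 128; }
static unsigned set_b(uint32_t addr) { return (addr / 32) % 256; }

int main(void) {
    uint32_t a = 0x1000, b = 0x1000 + 8192;   /* illustrative addresses */
    printf("cache A: set %u and set %u (co-resident in a 2-way set)\n", set_a(a), set_a(b));
    printf("cache B: set %u and set %u (conflict in one direct-mapped line)\n", set_b(a), set_b(b));
    return 0;
}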
(b) To make B run faster than A, we need to take advantage of its write-back property. We basically want
a program that writes repeatedly to one cache line (that has never been read before and is thus not
initially in the cache). Cache B can absorb all those writes as cache hits, whereas cache A is forced
to propagate them through to memory. Again there is a complication due to instruction accesses, but
this time the most we can do is assume that instruction fetches don’t conflict in the cache (since we
can’t actually perform writes via instruction fetches). Thus, a simple program that does what we want
is:
l1: sw 0(r1), r0    ; store r0 (zero) to the word at 0(r1)
    sw 0(r1), r0
    .
    .               ; we assume the loop is unrolled n times
    .
    b l1
We assume that the loop can be unrolled n times without causing code conflicts in the cache; n can be
at most 2039 (the number of 4-byte instructions that fit in the cache, minus a slot for the branch and
minus an entire block (32 bytes, or eight slots) reserved for the data being stored).
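For reference, the bound follows from the 8KB capacity shared by both caches (2 × 128 × 32 = 1 × 256 × 32 = 8192 bytes):
\[
\frac{8192}{4} - 1 - \frac{32}{4} = 2048 - 1 - 8 = 2039.
\]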
(c) The program is 10 times faster on cache A than on cache B, if we assume single-cycle jumps. The
analysis is simple: the only cache accesses are for the instruction fetches; all accesses in cache A hit; all
accesses in cache B miss; miss time is 10x hit time. Thus, machine A runs 10x faster than B, assuming
that the machines have the same clock rate.
(d) The analysis is more complicated than in part (c), since we must take into account data accesses and
instruction fetches. We’ll assume the loop is unrolled n ≤ 2039 times, and the value of r1 is carefully
chosen so that no conflict misses occur during execution of the loop. We’ll also assume a DLX-like
in-order pipeline, but with single-cycle branches (e.g., perfect branch target prediction) and a unified,
single-ported instruction/data cache.
Each store instruction on cache B takes two cache cycles (and thus two real cycles since the cache is
single-ported): one to fetch the instruction and one to write the data. Let t_hit be the time for a cache
hit. There are n store instructions, each taking 2t_hit cycles, and one branch instruction, taking
t_hit cycles. So the total time for n iterations on cache B is simply t_hit(2n + 1).
The analysis for cache A is similar, except that each store instruction requires 6t_hit cycles: 5 for the
write and one for the instruction fetch (we're assuming the pipeline stalls until the write completes,
i.e., there is no write buffer). So the total time for n iterations on cache A is t_hit(6n + 1).
The performance ratio of cache B to cache A is:
\[
\frac{\text{Time}_A}{\text{Time}_B} = \frac{t_{hit}(6n+1)}{t_{hit}(2n+1)} = \frac{6n+1}{2n+1}.
\]
If we unroll n = 2039 times, then the ratio is 2.9995, and so cache B is about 3 times faster than cache
A.
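Spelling the n = 2039 case out explicitly:
\[
\frac{6 \times 2039 + 1}{2 \times 2039 + 1} = \frac{12235}{4079} \approx 2.9995.
\]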
5.4 (9 points)
We want to calculate the utilization of memory bus bandwidth for two different cache configurations. The
basic approach in each case is to list all of the different types of memory reference, and, for each, to compute
how many bus transactions it takes and how frequently it occurs.
(a) We have a write-through cache. There are four types of memory reference to consider: read hit (in
the cache), read miss, write hit, and write miss. All blocks in the cache are always clean, so it is
not necessary to distinguish between dirty and clean blocks. Read hits are absorbed by the cache
and generate no memory system traffic; these comprise a 0.95 × 0.75 = 0.7125 fraction of the total
references (since there's a 95% hit rate, and 75% of accesses are reads). Read misses require two
accesses, since both words of the cache block must be loaded; these comprise a 0.05 × 0.75 = 0.0375
fraction of the total references. Write hits require a single memory reference to write the new word back to
memory (only the modified word is written, not the entire block), and comprise 0.95 × 0.25 = 0.2375
of the references. Finally, since the cache is write-allocate, a write miss first fetches the entire old
block from memory (two references), then modifies one word of the block and writes the modified data
through to memory (one reference). So these take three memory references, and occur with a frequency
of 0.05 × 0.25 = 0.0125. These results are summarized in the table below.

Reference type    Memory references per access    Fraction of accesses
Read hit          0                               0.7125
Read miss         2                               0.0375
Write hit         1                               0.2375
Write miss        3                               0.0125
We can now compute the bandwidth used. Using the data above, we first compute the average number
of main memory references per processor access to be:
\[
0.7125 \times 0 + 0.0375 \times 2 + 0.2375 \times 1 + 0.0125 \times 3 = 0.35.
\]
Accesses are generated by the processor at a rate of 10^9 words per second, and on average 35% of
these will reach the memory system. The memory system can sustain 10^9 words per second, so the
utilization of the memory system is simply:
\[
\frac{0.35 \times 10^9}{10^9} = 0.35,
\]
or 35%.
(b) Now we have a write-back cache. Our breakdown of memory references is more complicated, since we
must consider the cases when a miss is to a block in the cache that is dirty. We are told that 30% of
the blocks are dirty on average. If we assume random replacement, then on average 30% of misses will
force a writeback of a dirty line. The following table shows the breakdown of processor accesses, the
number of references generated by each, and the frequency of each:

Reference type             Memory references per access    Fraction of accesses
Read hit                   0                               0.7125
Write hit                  0                               0.2375
Read miss, clean block     2                               0.02625
Read miss, dirty block     4                               0.01125
Write miss, clean block    2                               0.00875
Write miss, dirty block    4                               0.00375
Both read and write hits are absorbed by the cache, so they generate no memory system references.
Read and write misses to clean blocks each generate two references to read in the old block from the
memory system (a write miss behaves like a read miss plus a write hit). Finally, read and write misses
to dirty blocks each generate four references: two to write back the dirty block, and two to read in the
new block from memory.
We can again compute the average number of main memory references per processor access to be:
\[
2 \times (0.02625 + 0.00875) + 4 \times (0.01125 + 0.00375) = 0.07 + 0.06 = 0.13.
\]
Again, accesses are generated by the processor at a rate of 10^9 per second, and the memory system
bandwidth is 10^9 accesses per second, so the utilization is simply:
\[
\frac{0.13 \times 10^9}{10^9} = 0.13,
\]
or 13%. Notice that a write-back cache significantly reduces the memory bandwidth required as
compared to a write-through cache. This is often why processors are built with write-back L2 caches,
especially when they are intended for use on a shared-memory bus where bandwidth is a precious
resource.
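As a quick cross-check of the 35% and 13% figures, here is a small sketch that recomputes the per-access reference counts from the breakdowns above (using the problem's 95% hit rate, 75%/25% read/write mix, two-word blocks, 30% dirty blocks, and equal processor and memory rates of 10^9 words per second):

#include <stdio.h>

/* Recompute the bus utilizations of parts (a) and (b) from the breakdowns above. */
int main(void) {
    double hit = 0.95, miss = 0.05, rd = 0.75, wr = 0.25, dirty = 0.30;

    /* (a) write-through, write-allocate */
    double refs_wt = (hit * rd) * 0      /* read hit: no traffic            */
                   + (miss * rd) * 2     /* read miss: load two-word block  */
                   + (hit * wr) * 1      /* write hit: write word through   */
                   + (miss * wr) * 3;    /* write miss: fetch block + write */

    /* (b) write-back, write-allocate (hits generate no traffic) */
    double refs_wb = miss * ((1.0 - dirty) * 2 + dirty * 4);

    printf("write-through: %.2f references/access -> %.0f%% utilization\n", refs_wt, refs_wt * 100);
    printf("write-back:    %.2f references/access -> %.0f%% utilization\n", refs_wb, refs_wb * 100);
    return 0;
}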
5.5 (10 points)
In this problem, we will be calculating performance, using the following equations
\[
\text{CPU Time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory Stall Cycles}}{\text{Instruction}} \right) \times \text{Clock Cycle Time}
\]
The following equation is used to calculate the CPI without any memory stalls:
\[
\text{CPI}_{\text{execution}} = \sum_{\text{instruction classes}} f_{\text{class}} \times \text{CPI}_{\text{class}}
\]
The following equation is used to calculate the memory stall cycles per instruction.
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = \frac{\text{I-Fetch Stall Cycles}}{\text{Instruction}} + \frac{\text{Data Stall Cycles}}{\text{Instruction}}
\]
where the stall cycles per instruction for both instructions and data can be calculated by:
\[
\frac{\text{Stall Cycles}}{\text{Instruction}} = \frac{\text{Memory Accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty}
\]
For fetch stalls, the number of memory accesses per instruction is 1, but in data stalls, only loads and stores
cause memory accesses. If R is a miss rate, P a miss penalty, and f a frequency (with subscript i for
instruction fetches and d for data accesses), then:
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = R_i P_i + (f_{\text{stores}} + f_{\text{loads}})\, R_d P_d
\]
(a) Now we only need to plug in the right values into the above equations. For both caches, the CPI
without stalls is the same.
The frequencies of various instructions can be obtained from Figure 2.26 of the textbook.
For both caches, the instruction cache miss rate is 0.5% and the penalty is 50 cycles. For both caches,
the data cache miss rate is 1%, but the penalties are different. In a write-through cache, nothing in the
cache is ever dirty, and the writes themselves are serviced by the write buffer, which is infinitely large in this
case. Thus, for both loads and stores, the penalty is the cache miss penalty, which is 50 cycles.
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = (0.5\% \times 50) + (f_{\text{stores}} + f_{\text{loads}})(1.0\% \times 50)
\]
In a write-back cache, a miss on either a load or a store can either simply replace a clean line
(50 cycles) or knock out a dirty line, which must be written back (an additional 50 cycles, and this
happens 50% of the time).
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = (0.5\% \times 50) + (f_{\text{stores}} + f_{\text{loads}})\bigl(1.0\% \times (50 + 50\% \times 50)\bigr)
\]
By plugging in the right frequencies from Figure 2.26 into these three equations, and then into the very
first equation, the CPU time can be calculated for both systems. The following are the CPI numbers (CPI
is given rather than execution time because the clock cycle time and instruction count are the same
for both cases).
Program Execution CPI W.T. Stalls W.T. CPI W.B. Stalls W.B. CPI
compress 1.05 0.38 1.43 0.44 1.49
eqntott 1.00 0.41 1.41 0.48 1.48
espresso 1.04 0.38 1.42 0.45 1.49
gcc 1.14 0.44 1.58 0.53 1.67
li 1.16 0.49 1.65 0.61 1.77
average 1.08 0.42 1.50 0.50 1.58
The write back cache is roughly 5% slower than the write through cache using average numbers. This is
because the write back cache sees the latency of writing back dirty lines, while the write through cache
does not see the latency of writes because it has a magically infinite write buffer. Because stalls caused
by a real (finite) write buffer, consistency issues raised by the write buffer, and the extra memory bandwidth
consumed by a write-through cache are not considered here, the write-back cache appears to be the slower one.
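As a sanity check on the stall columns, the two stall expressions above reduce to 0.25 + 0.5(f_loads + f_stores) for write-through and 0.25 + 0.75(f_loads + f_stores) for write-back. A minimal sketch (the memory-reference fraction below is illustrative, not a Figure 2.26 value):

#include <stdio.h>

/* Stall cycles per instruction, per the two equations above:
 *   instruction fetches: 1 access/instruction * 0.5% miss rate * 50 cycles = 0.25
 *   data accesses: (f_loads + f_stores) * 1% miss rate * penalty, where the
 *   write-back penalty is 50 + 50% * 50 = 75 cycles to cover dirty writebacks. */
static double wt_stalls(double f_mem) { return 0.005 * 50.0 + f_mem * 0.01 * 50.0; }
static double wb_stalls(double f_mem) { return 0.005 * 50.0 + f_mem * 0.01 * 75.0; }

int main(void) {
    double f_mem = 0.30;   /* illustrative loads+stores fraction, NOT a Figure 2.26 value */
    printf("stalls/instruction: write-through %.2f, write-back %.2f\n",
           wt_stalls(f_mem), wb_stalls(f_mem));
    return 0;
}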
(b) This part of the question is almost identical to the previous part. The only difference is that for write
through caches, stores take 1 cycle.
\[
\text{CPI}_{\text{execution}} = (f_{\text{loads}} \times 1) + (f_{\text{stores}} \times 1) + (f_{\text{other}} \times 1)
\]
So the values in the table for the write back cache stay the same. For the write through cache, the
stalls stay the same. But the execution CPI of the write through cache now becomes 1 for all cases
because all instructions now take only 1 cycle, and so the total CPI of the write through cache changes.
Program W.T. Execution CPI W.T. Stalls W.T. CPI W.B. CPI
compress 1.00 0.38 1.38 1.49
eqntott 1.00 0.41 1.41 1.48
espresso 1.00 0.38 1.38 1.49
gcc 1.00 0.44 1.44 1.67
li 1.00 0.49 1.49 1.77
average 1.00 0.42 1.42 1.58
This is not a good general solution, however, since it requires that the cache be big enough to hold the
entire a[] and b[] arrays. Even if the arrays fit into the cache, this scheme significantly increases the cache
working set (amount of data that must stay in the cache to be useful) over what it could be.
A better approach is to prefetch the data as the loops walk over the array. In order to do this, we have
to split the loops into several parts. We begin with an initial loop that prefetches the first eight elements
needed, a[0][0..7] and b[0..7][0]:
/* SOLUTION PART 1 */
prefetch(a[0][0]);
for (j = 0; j < 8; j += 2) {
    prefetch(b[j][0]);
    prefetch(b[j+1][0]);
    prefetch(a[0][j+1]);
}
A few notes about this loop: first, we prefetch a[0][0] manually and then unroll the loop so that it
prefetches only the odd elements of a (one prefetch per cache block, since consecutive pairs of elements
share a block). If we knew the array was aligned with a cache block, the manual prefetch wouldn't be
necessary; if the array is not aligned, it guarantees that we actually prefetch all needed elements. Note
that we unroll the loop rather than using mod (%) to
prefetch every other element of a; this is because the mod operation can be very expensive compared to
unrolling the loop. Also, we put the prefetches in a loop to reduce the code footprint of the program; this
would be particularly important if the cache was unified. Next, notice that the prefetches are scheduled to
prefetch the b array first. This is because the values of the b array are needed first in the computational
instruction. One might wonder why we bother prefetching the first few iterations worth of data, rather than
just allowing them to cache-miss. The key is that the prefetches are non-blocking, and, since we assumed
that the memory system is pipelined with a large number of outstanding requests, the prefetches can all run
in parallel. So, when the first computational instruction tries to access b[0][0], the prefetch will not have
completed (since only about 12 cycles have elapsed) and the load will block until the data is ready. However,
all the other prefetches will continue (separated in time by just one cycle), so, by the time the first load is
ready and its value consumed, the remaining data will be ready, and thus there will be no cache misses on
any of the initial computation iterations.
Now we consider the main part of the loop. It must do the computation and also prefetch for subsequent
iterations. Again, we unroll the loop to avoid prefetching twice for each cache block. Finally, note that the
loop only runs until j == 90; this is to avoid extra unnecessary prefetches.
/* SOLUTION PART 2 */
for (j = 0; j < 92; j += 2) {
    prefetch(b[j+8][0]);
    prefetch(a[0][j+8]);
    a[0][j] = b[j][0] * b[j+1][0];
    prefetch(b[j+9][0]);
    a[0][j+1] = b[j+1][0] * b[j+2][0];
}
prefetch(b[100][0]);
Finally, we have a last loop that finishes the computation without prefetching past the bounds of the array.
Note, however, that up to this point we’ve only been working on the i == 0 iteration of the outer loop.
Since there are two more outer iterations to do, we begin prefetching for the next outer iteration in this last
part of the i == 0 loop:
/* SOLUTION PART 3 */
prefetch(a[1][0]);
for (j = 92; j < 100; j += 2) {
    prefetch(a[1][j-91]);
    a[0][j] = b[j][0] * b[j+1][0];
    a[0][j+1] = b[j+1][0] * b[j+2][0];
}
Note that we don’t need to prefetch the b array, since it is already in the cache and is used unchanged in
the next iteration.
OK, now we’ve got the first outer loop iteration done. The next iteration is similar to the previous one;
it begins with a main loop that prefetches as it computes, and ends with a loop that prefetches for the last
(i == 2) iteration:
/* SOLUTION PART 4 */
for (j = 0; j < 92; j += 2) {
    prefetch(a[1][j+8]);
    a[1][j] = b[j][0] * b[j+1][0];
    a[1][j+1] = b[j+1][0] * b[j+2][0];
}
prefetch(a[2][0]);
for (j = 92; j < 100; j += 2) {
    prefetch(a[2][j-91]);
    a[1][j] = b[j][0] * b[j+1][0];
    a[1][j+1] = b[j+1][0] * b[j+2][0];
}
And to complete the problem, we do the last outer iteration, prefetching only in the main loop. We don’t
need to prefetch in the last part of the loop since there are no more iterations to do.
/* SOLUTION PART 5 */
for (j = 0; j < 92; j += 2) {
    prefetch(a[2][j+8]);
    a[2][j] = b[j][0] * b[j+1][0];
    a[2][j+1] = b[j+1][0] * b[j+2][0];
}
for (j = 92; j < 100; j += 1) {
    a[2][j] = b[j][0] * b[j+1][0];
}
Lastly, let’s compute the performance of this solution. The first code fragment runs in 13 cycles. The
second code fragment would take 783 cycles to execute, but it has to stall in cycle 16 on the first access to
b[0][0]; this data was prefetched in cycle 2 and is thus available in cycle 52, causing 36 stall cycles. There
are no further stalls in the code, so we can continue adding up cycles. The third fragment takes 61 cycles;
the fourth fragment takes 751, and the fifth fragment takes 746. Adding these together, we get 2390 cycles
in total, or 44% faster than the simple prefetch code on page 404, and 6.13× faster than the non-prefetched
code.
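Summing the fragments explicitly (charging the 36 stall cycles to the second fragment):
\[
13 + (783 + 36) + 61 + 751 + 746 = 2390 \text{ cycles}.
\]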
One other note on this problem. The original (unprefetched) loop is rather poorly written. It can be
simplified significantly by performing a loop interchange optimization. Performing this optimization also
simplifies the prefetch problem. Here’s the code after loop interchange:
/* Solution with loop interchange */
prefetch(a[0][0]);
prefetch(a[1][0]);
prefetch(a[2][0]);
for (j = 0; j < 8; j += 2) {
    prefetch(b[j][0]);
    prefetch(b[j+1][0]);
    prefetch(a[0][j+1]);
    prefetch(a[1][j+1]);
    prefetch(a[2][j+1]);
}
for (j = 0; j < 92; j += 2) {
    prefetch(b[j+8][0]);
    prefetch(a[0][j+8]);
    prefetch(a[1][j+8]);
    prefetch(a[2][j+8]);
    t1 = b[j][0] * b[j+1][0];
    a[2][j] = a[1][j] = a[0][j] = t1;
    prefetch(b[j+9][0]);
    t1 = b[j+1][0] * b[j+2][0];
    a[2][j+1] = a[1][j+1] = a[0][j+1] = t1;
}
for (j = 92; j < 100; j += 2) {
    t1 = b[j][0] * b[j+1][0];
    a[2][j] = a[1][j] = a[0][j] = t1;
    t1 = b[j+1][0] * b[j+2][0];
    a[2][j+1] = a[1][j+1] = a[0][j+1] = t1;
}
5.7 (8 points)
In this problem, we consider a pseudo-associative cache in which we do not swap blocks on a hit to the
“slow” part of the cache. We also assume that a hit to the slow part takes only one extra cycle, not two
(since we don’t have to swap). Finally, we are assuming a random replacement policy.
(a) We first want to derive the AMAT for this cache in terms of miss rates and miss penalties of 1-way
(direct-mapped) and 2-way caches. We use a slightly different formulation of AMAT than the usual
one:
\[
\text{AMAT}_{\text{pseudo}} = \text{Hit rate}_{\text{pseudo}} \times \text{Hit time}_{\text{pseudo}} + \text{Miss rate}_{\text{pseudo}} \times \text{Miss time}_{\text{pseudo}}
\]
The reason for using this formulation (rather than the traditional AMAT = Hit time + Miss rate ×
Miss penalty) is that the miss time in a pseudo-associative cache is different than the hit time plus
the miss penalty. Here’s why: in a traditional cache, the hit time is a constant. When there’s a miss
in a traditional cache, that same constant amount of time is spent checking for a hit before the miss
penalty is paid. In a pseudo-associative cache, the hit time for a given hit can be one of two values,
and thus the overall hit time is just a statistical measure. On a miss in a pseudo-associative cache, it
is the slow hit time that must be paid before starting to process a miss, not the statistical average hit
time.
Given that, we can now examine the terms of the equation. The miss and hit rates are the same as for a
standard 2-way set-associative cache of the same size, since the two types of caches behave identically in
terms of hits and misses (there are two possible locations for each datum):
\[
\text{Hit rate}_{\text{pseudo}} = \text{Hit rate}_{\text{2-way}}, \qquad \text{Miss rate}_{\text{pseudo}} = \text{Miss rate}_{\text{2-way}}.
\]
The miss time is just the miss penalty of a 1-way cache (since the underlying memory system and fill
path are the same) plus the slow hit time in the associative cache. So at this point our AMAT equation
looks like:
\[
\text{AMAT}_{\text{pseudo}} = \text{Hit rate}_{\text{2-way}} \times \text{Hit time}_{\text{pseudo}} + \text{Miss rate}_{\text{2-way}} \times \left( \text{Hit time}_{\text{pseudo-slow}} + \text{Miss penalty}_{\text{1-way}} \right).
\]
The only remaining term in the AMAT formula is the overall hit time for the cache. We start by
expressing the overall hit time in terms of the percentage of accesses and the access time to the fast
and slow blocks, as follows:
\[
\text{Hit time}_{\text{pseudo}} = \text{Hit fraction}_{\text{fast block}} \times \text{Hit time}_{\text{fast block}} + \text{Hit fraction}_{\text{slow block}} \times \text{Hit time}_{\text{slow block}}.
\]
Because we are using random replacement, and because we are not swapping blocks, the likelihood of
a given block being in the fast position is the same as that of it being in the slow position, so:
\[
\text{Hit fraction}_{\text{fast block}} = \text{Hit fraction}_{\text{slow block}} = \frac{1}{2}.
\]
So:
\[
\text{Hit time}_{\text{pseudo}} = \frac{1}{2}\left( \text{Hit time}_{\text{fast block}} + \text{Hit time}_{\text{slow block}} \right).
\]
Putting this together with the AMAT formula and simplifying gives:
\[
\text{AMAT}_{\text{pseudo}} = \frac{\text{Hit time}_{\text{pseudo-fast}} + \text{Hit time}_{\text{pseudo-slow}}}{2} + \text{Miss rate}_{\text{2-way}} \left( \text{Hit time}_{\text{pseudo-slow}} + \text{Miss penalty}_{\text{1-way}} - \frac{\text{Hit time}_{\text{pseudo-fast}} + \text{Hit time}_{\text{pseudo-slow}}}{2} \right).
\]
Now, recalling that Hit time_fast block = 1, Hit time_slow block = 2, and Miss penalty_1-way = 50, we can
compute the final AMAT formula as:
\[
\text{AMAT}_{\text{pseudo}} = \frac{1 + 2}{2} + \text{Miss rate}_{\text{2-way}} \times \left( 2 + 50 - \frac{1 + 2}{2} \right) = 1.5 + 50.5 \times \text{Miss rate}_{\text{2-way}}.
\]
(b) We assume that the values in Figure 5.9 in the text still apply for a cache with random replacement,
and we assume a 50-cycle miss penalty. For each cache size (e.g., the 2KB cache), the AMAT then follows
by substituting the corresponding 2-way miss rate into the formula above.
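Since the Figure 5.9 miss rates are not reproduced here, the following is a minimal sketch of that substitution, with placeholder miss rates (not the Figure 5.9 values):

#include <stdio.h>

/* AMAT sketch for part (b). The miss rates below are placeholders, NOT the
 * Figure 5.9 values; substitute the 1-way and 2-way miss rates for each cache
 * size from the figure. Hit times (1 fast, 2 slow) and the 50-cycle miss
 * penalty are the ones assumed in the text; the direct-mapped line is one
 * natural point of comparison, not part of the derivation above. */
int main(void) {
    double miss_1way = 0.10, miss_2way = 0.08;                  /* placeholder miss rates */
    double amat_pseudo = 1.5 + miss_2way * (2.0 + 50.0 - 1.5);  /* formula derived above  */
    double amat_1way   = 1.0 + miss_1way * 50.0;                /* plain direct-mapped    */
    printf("AMAT pseudo = %.2f cycles, AMAT 1-way = %.2f cycles\n", amat_pseudo, amat_1way);
    return 0;
}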
5.9 (3 points)
(a) This problem effectively asks us to produce a formula for the number of bits needed as input to a ROM
in order to look up the result of a mod operation. We are given that there are 2^N − 1 memory banks,
and an input address that is M bits wide. We will assume that N is chosen such that 2^N − 1 is prime.
According to the problem in the text, after step 3, an input address A will be represented in the form:
\[
A = \sum_{i=0}^{N-1} \mathrm{Term}_i
\]
where the i'th term Term_i is 2^i times the sum of the binary digits a_i, a_{i+N}, a_{i+2N}, .... Each term
contains ⌈M/N⌉ of these digits, since the last bit we have is a_{M−1} and the interval between bits within a
term is N. So:
\[
\mathrm{Term}_i = 2^i \times \sum_{j=0}^{\lceil M/N \rceil - 1} a_{i+N\cdot j}
\]
A simple upper bound on this sum for Term_i is obtained by treating the upper summation bound as just
M/N and all bits a_k as 1. Then
\[
\mathrm{Term}_i \le 2^i \cdot \frac{M}{N}
\]
and
\[
A \le \sum_{i=0}^{N-1} 2^i \cdot \frac{M}{N} = \frac{M}{N} \sum_{i=0}^{N-1} 2^i = \frac{M}{N} \cdot (2^N - 1).
\]
Finally, we want to compute the number of bits needed to represent this maximum value (which is
equal to the number of input lines to the mod lookup ROM). This is just the ceiling of the base-2 log
of the maximum value:
\[
n_{\text{bits}} = \left\lceil \log_2\left( \frac{M}{N} \cdot (2^N - 1) \right) \right\rceil = \left\lceil \log_2\frac{M}{N} + \log_2(2^N - 1) \right\rceil \le \left\lceil N + \log_2\frac{M}{N} \right\rceil.
\]
For the example with M = 32 and N = 3, an upper bound on the maximum size of the address is
⌈3 + log_2(32/3)⌉ = ⌈3 + 3.42⌉ = 7 bits.
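A quick check of the bound for these example values (a sketch, assuming M = 32 and N = 3 as above):

#include <stdio.h>

/* The worst-case sum after step 3 is (M/N) * (2^N - 1); its width is the
 * ROM's input width. */
int main(void) {
    const int M = 32, N = 3;
    double max_val = ((double)M / N) * ((1 << N) - 1);   /* (32/3) * 7 = 74.67 */
    int nbits = 0;
    while ((1u << nbits) <= (unsigned)max_val)           /* bits needed to hold max_val */
        nbits++;
    printf("worst-case sum <= %.2f, ROM input width = %d bits\n", max_val, nbits);
    return 0;
}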
(b) In this part we want to come up with simple hardware to pick the correct bank out of 7 given a 32-bit
address. We are told to assume that the bank width is 8 bytes, and thus the lower 3 bits of the address
are used to select a byte within a row of a bank. Thus, we effectively have an M = 29-bit bank address.
Again, N = 3.
The following diagram shows a simple way of implementing the bank number generator. Since the
ROM has 7 input lines and returns a 3-bit value (the mod-7 of the input), it must have 2^7 = 128 3-bit
entries, or 384 total bits. The blocks labeled as “bit-counters” return (as a 4-bit number) the number
of set (high) input lines. These could be implemented as LUTs, a rippled chain of 2-bit adders, a tree
of wider adders, or something more clever.
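In software form, the same reduction can be sketched as follows (assumed parameters from the text: 8-byte-wide banks, 7 = 2^3 − 1 banks, 32-bit addresses; the hardware version does the per-bit-position counts with the bit-counters and the final reduction with the 128-entry ROM):

#include <stdio.h>
#include <stdint.h>

/* Summing the 3-bit groups of the bank address gives the same result mod 7
 * as the address itself, because 2^3 = 8 is congruent to 1 mod 7. */
static unsigned bank_number(uint32_t addr) {
    uint32_t bank_addr = addr >> 3;            /* drop the byte-within-row bits */
    unsigned sum = 0;
    for (int i = 0; i < 32; i += 3)
        sum += (bank_addr >> i) & 0x7;         /* add up the 3-bit groups */
    return sum % 7;                            /* the ROM lookup step */
}

int main(void) {
    uint32_t addr = 0x12345678u;               /* arbitrary example address */
    printf("bank(0x%08x) = %u (direct mod-7: %u)\n",
           (unsigned)addr, bank_number(addr), (unsigned)((addr >> 3) % 7));
    return 0;
}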
5.20 (6 points)
In answering this question, we must consider the four functions of a virtual memory system as presented in
class:
1. to provide a larger effective memory space by using a paging disk to back physical memory
3. to provide translation between multiple virtual address spaces and one physical address space
(a) This part of the question can be answered by making a table of the number of transactions that each
component in the system can handle and then finding the bottleneck and using its TP/s value. Using
Figure 6.24 from the textbook, we find that we need 10GB of disk space (either twenty 500MB disks
or eight 1,250MB disks). Each transaction requires (2 reads + 2 writes) × 15,000 + 40,000 instructions,
which is a total of 100,000 instructions (0.1 million instructions) per transaction.
For the small disks, the TP/s limitation is the disk system: twenty disks × 30 I/Os per second ÷ 4 I/Os
per transaction = 150 TP/s. For the big disks, the limitation is again the disk system: eight disks × 30
I/Os per second ÷ 4 I/Os per transaction = 60 TP/s.
(b) This part is trivial. The cost of each system is added up and then divided by the number of TP/s it supports.
(c) This part is also trivial. We already know that the current bus can handle 2,621,440 TP/s and the
current CPU can handle 8,000 TP/s. So the CPU would have to be 2,621,440 / 8,000 = 328 times faster
than the current CPU. Since the current CPU is 800 MIPS, the new one would have to be 262,400 MIPS.
Ouch.
(d) The software approach would reduce the number of instructions per transaction and the number of
disk reads and writes, so that the new instruction count would be:
15,000 × (1 read + 1 write) + 30,000 = 60,000 instructions/transaction
This reduces the load on the old CPU to 0.06 million instructions per transaction, and the overall
throughput of the old CPU would be 13,333 TP/s. The old approach needed 100,000 instructions per
transaction, so the new CPU would have to be 100,000 / 60,000 = 1.67 times faster than the original CPU.
(e) (Don't trust those sneaky MTP I/O people - they listen at doors to sekret presentations!) Because the
old CPU was limited to 8,000 TP/s, the new CPU will be limited to 16,000 TP/s. To provide enough
small disks, each of which provides 30 I/Os per second, we would need 16,000 × 4 / 30 = 2,134 small
disks to handle all the I/Os. Each disk costs $100, and so the system cost is:
Cost = $50,000 + (2,134 × $100) = $263,400
(f) We originally had 20 small disks. To match the CPU's 16,000 TP/s, we would need disks that can
support 16,000 × 4 / 20 = 3,200 I/Os per second per disk. If the new disks cost the same as the old disks,
the new system cost will be the same as the old system, which was $52,000.
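A compact recomputation of the throughput limits used in parts (a), (c), and (e), coded from the constants quoted above (a sketch, not part of the original solution):

#include <stdio.h>

int main(void) {
    double cpu_mips = 800.0;          /* current CPU */
    double inst_per_tx = 100000.0;    /* (2 reads + 2 writes) * 15,000 + 40,000 */
    double ios_per_tx = 4.0;          /* 2 reads + 2 writes */
    double ios_per_disk = 30.0;       /* quoted for the small disks; the same rate
                                         reproduces the 60 TP/s big-disk figure */

    double cpu_tps        = cpu_mips * 1e6 / inst_per_tx;     /* 8,000 TP/s */
    double small_disk_tps = 20.0 * ios_per_disk / ios_per_tx; /* 150 TP/s   */
    double big_disk_tps   = 8.0  * ios_per_disk / ios_per_tx; /* 60 TP/s    */

    /* Part (e): disks needed to keep up with a 16,000 TP/s CPU, and the cost. */
    double disks_needed = 16000.0 * ios_per_tx / ios_per_disk;  /* 2,133.3 -> 2,134 */
    double cost = 50000.0 + 2134.0 * 100.0;                     /* $263,400 */

    printf("CPU: %.0f TP/s, small disks: %.0f TP/s, big disks: %.0f TP/s\n",
           cpu_tps, small_disk_tps, big_disk_tps);
    printf("disks for 16,000 TP/s: %.1f (round up to 2,134), cost $%.0f\n",
           disks_needed, cost);
    return 0;
}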
By multiplying the second and third columns together and adding the results, we get:
Memory Accesses per Instruction = 1.37
By multiplying the second and fourth columns together and adding the results, we get:
Memory Reads per Instruction = 1.23, 90% of all memory accesses
By multiplying the second and fifth columns together and adding the results, we get:
Memory Writes per Instruction = 0.14, 10% of all memory accesses
Thus:
CPI_ideal = 1.37
For write through caches, we can construct the following table. The frequencies are calculated by
multiplying the probability of the event in the first column happening by the probability of the event
in the second column happening. The average stalls per access can be calculated by multiplying the
last two columns and adding the results. Then we simply substitute the results into the equation
above.
Access Hits Cache? Access Type Frequency Cycles Bus is Busy
Yes Read 95% × 90% = 85.5% 0
Yes Write 95% × 10% = 9.5% 16
No Read 5% × 90% = 4.5% 23
No Write 5% × 10% = 0.5% 39
Multiplying the last two columns and summing gives 0.855 × 0 + 0.095 × 16 + 0.045 × 23 + 0.005 × 39 =
2.75 bus-busy cycles per access, so:
\[
\text{Traffic Ratio} = \frac{1.37 \times 2.75}{1.37 + 1.37 \times 2.75} = 73.3\%
\]
For the write-back cache, the corresponding average is 1.27 bus-busy cycles per access, so:
\[
\text{Traffic Ratio} = \frac{1.37 \times 1.27}{1.37 + 1.37 \times 1.27} = 56.0\%
\]
(b) 80% of the bus bandwidth is available. For write through caches, 73.3% of the bandwidth is consumed
by the cache, thus leaving 6.7% for I/O. For write back caches, 56% of the bandwidth is consumed,
leaving 24% for I/O.
(c) The following equation can be used to calculate the real CPI of each machine. For write-through caches,
the value is:
\[
\text{CPI} = \text{CPI}_{\text{ideal}} + \frac{\text{Memory References}}{\text{Instruction}} \times \frac{\text{Memory Stall Cycles}}{\text{Memory Reference}} = 1.37 + 1.37 \times 2.75 = 5.14
\]
Thus to execute one million instructions, 5.14 × 10^6 cycles are needed. Of those cycles, 73.3% of the
time the bus is used by the cache. The following table summarizes this information for both cache
systems.

$ Style CPI % Bus Used by $ % Available for IO Cycles for IO
Write Through 5.14 73.3% 6.7% 0.34 × 10^6
Write Back 3.11 56.0% 24.0% 0.75 × 10^6
A disk operation takes a total of 102,000 cycles (101,000 cycles to initiate the operation and 1,000 cycles
on the IO bus to do the actual transfer). For the write-through cache, we have 5.14 × 10^6 cycles to
play with. We can do ⌊5.14 × 10^6 / 102,000⌋ = 50 disk operations. 50 disk operations require 50 × 1,000 = 50,000
cycles on the bus, which we do have available.
For the write-back cache, we have 3.11 × 10^6 cycles to play with. We can do ⌊3.11 × 10^6 / 102,000⌋ = 30 disk
operations. 30 disk operations require 30 × 1,000 = 30,000 cycles on the bus, which we do have
available.
When the miss rate is reduced to 2.5%, the traffic ratios for the write through and write back caches
drop down to 68.5% and 38.8% respectively (just recalculate the last 2 tables and set of equations in
part a). The CPIs decrease to 4.35 and 2.24 respectively.
$ Style CPI % Bus Used by $ % Available for IO Cycles for IO
Write Through 4.35 68.5% 11.5% 0.50 × 10^6
Write Back 2.24 38.8% 41.2% 0.92 × 10^6
The number of disk operations that can be supported is now 42 for the write-through cache and 21 for
the write-back cache.
(d) Previously, each disk operation had to complete in entirety before another could be started. Now, since
we have multiple disks, we can overlap the 100,000 cycles it takes each disk to find the data. So we
can initiate a disk access every 1000 cycles (and we do have plenty of cpu execution cycles) but the
limiting factor now is the number of disk accesses that can return on the IO bus because each return of
data requires 1000 cycles on the IO bus. Using the numbers from the above table (for the standard 5%
miss rate caches), we find that for write-through caches, the number of accesses that can occur every
million instructions is ⌊0.34 × 10^6 / 1,000⌋ = 340, and for write-back caches, that number is
⌊0.75 × 10^6 / 1,000⌋ = 750.
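A compact recomputation of the write-through numbers used above (a sketch; 1.37 memory accesses per instruction and 2.75 bus-busy cycles per access are the 5% miss rate values from the tables):

#include <stdio.h>

int main(void) {
    double refs_per_inst = 1.37, bus_cycles_per_ref = 2.75;  /* write-through, 5% miss rate */
    double cpi = refs_per_inst + refs_per_inst * bus_cycles_per_ref;  /* 1.37 + 1.37*2.75 = 5.14 */
    double traffic = (refs_per_inst * bus_cycles_per_ref) / cpi;      /* 73.3% of bus cycles */
    double cycles = cpi * 1e6;                     /* cycles per million instructions */
    double io_cycles = (0.80 - traffic) * cycles;  /* bus cycles left for I/O (80% usable) */
    int serial_ops = (int)(cycles / 102000.0);     /* one 102,000-cycle disk op at a time */
    printf("CPI = %.2f, traffic = %.1f%%, I/O cycles = %.2f million\n",
           cpi, traffic * 100.0, io_cycles / 1e6);
    printf("serialized disk operations per million instructions = %d\n", serial_ops);
    return 0;
}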