hw2 Solns
College of Engineering
Computer Science Division — EECS
Solution Set #2
CS252 Graduate Computer Architecture
5.1 (9 points)
This problem explores unfair memory-system benchmarks. We have two cache configurations to compare:
cache A is 2-way set associative, 128 sets, 32-byte blocks, write-through, no-write allocate; cache B is 1-way
associative, 256 sets, 32-byte blocks, write-back, write-allocate. A word on the terminology in this problem:
we are told that the miss time is 10 times the hit time. Miss time here is the total time spent processing a
miss, not just the miss penalty; you can think of it as miss time = miss penalty + hit time.
(a) We want a program that makes A as much faster than B as possible. The basic idea is to write a
program that generates a large amount of conflict misses in the direct-mapped cache B, but generates
no misses in cache A. One way you might think about doing this is writing a program that repeatedly
reads from two alternating addresses α and β such that α mod (256 × 32) = β mod (256 × 32) (so that
α and β map to the same line in B, and to the two ways of the same set in A). There is a subtlety,
however. Both cache A and B are unified caches, so for every memory read, there must also be an
instruction fetch from the cache. If there are no conflicts in the instruction fetches, the instruction
fetch times will be the same for both caches, and we will not see as large a performance difference
between A and B as we would if the instruction fetches themselves conflicted. One way to guarantee maximum performance
difference between A and B is to only use instruction conflicts. Consider the following code:
a: j b ; absolute jump to b
.
.
.
b: j a ; absolute jump to a
where a and b refer to addresses that collide in the direct-mapped cache but not in the 2-way cache
(for example, a mod 8192 = b mod 8192). For cache A, the instruction fetches map to different entries
in a set, so each instruction fetch is a hit. For cache B, the fetches map to the same set, so they cause
conflict misses on each instruction fetch.
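To make the conflict condition concrete, here is a minimal sketch of the set-index arithmetic (the constants follow the two cache configurations; the example addresses are illustrative, not taken from the problem):

#include <stdio.h>
#include <stdint.h>

/* Set-index arithmetic for the two caches in the problem: cache A is 2-way
 * with 128 sets of 32-byte blocks, cache B is direct-mapped with 256 sets of
 * 32-byte blocks. Two addresses 8192 bytes apart fall in the same set of
 * both caches; A can hold both (two ways), B cannot (one way). */
static unsigned set_a(uint32_t addr) { return (addr / 32) % 128; }
static unsigned set_b(uint32_t addr) { return (addr / 32) % 256; }

int main(void) {
    uint32_t a = 0x1000, b = 0x1000 + 8192;   /* illustrative addresses */
    printf("cache A: set %u and set %u (co-resident in a 2-way set)\n", set_a(a), set_a(b));
    printf("cache B: set %u and set %u (conflict in one direct-mapped line)\n", set_b(a), set_b(b));
    return 0;
}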
(b) To make B run faster than A, we need to take advantage of its write-back property. We basically want
a program that writes repeatedly to one cache line (that has never been read before and is thus not
initially in the cache). Cache B can absorb all those writes as cache hits, whereas cache A is forced
to propagate them through to memory. Again there is a complication due to instruction accesses, but
this time the most we can do is assume that instruction fetches don’t conflict in the cache (since we
can’t actually perform writes via instruction fetches). Thus, a simple program that does what we want
is:
l1: sw 0(r1), r0    ; store r0 (zero) to the word at 0(r1)
    sw 0(r1), r0
    .
    .               ; we assume the loop is unrolled n times
    .
    b l1
We assume that the loop can be unrolled n times without causing code conflicts in the cache; n can be
at most 2039 (the number of 4-byte instructions that fit in the cache, minus a slot for the branch and
minus an entire block (32 bytes, or eight slots) reserved for the data being stored).
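For reference, the bound follows from the 8KB capacity shared by both caches (2 × 128 × 32 = 1 × 256 × 32 = 8192 bytes):
\[
\frac{8192}{4} - 1 - \frac{32}{4} = 2048 - 1 - 8 = 2039.
\]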
(c) The program is 10 times faster on cache A than on cache B, if we assume single-cycle jumps. The
analysis is simple: the only cache accesses are for the instruction fetches; all accesses in cache A hit; all
accesses in cache B miss; miss time is 10x hit time. Thus, machine A runs 10x faster than B, assuming
that the machines have the same clock rate.
(d) The analysis is more complicated than in part (c), since we must take into account data accesses and
instruction fetches. We’ll assume the loop is unrolled n ≤ 2039 times, and the value of r1 is carefully
chosen so that no conflict misses occur during execution of the loop. We’ll also assume a DLX-like
in-order pipeline, but with single-cycle branches (e.g., perfect branch target prediction) and a unified,
single-ported instruction/data cache.
Each store instruction on cache B takes two cache cycles (and thus two real cycles since the cache is
single-ported): one to fetch the instruction and one to write the data. Let t_hit be the time for a cache
hit. There are n store instructions, each taking 2t_hit cycles, and one branch instruction, taking
t_hit cycles. So the total time for n iterations on cache B is simply t_hit(2n + 1).
The analysis for cache A is similar, except that each store instruction requires 6t_hit cycles: 5 for the
write and one for the instruction fetch (we're assuming the pipeline stalls until the write completes,
i.e., there is no write buffer). So the total time for n iterations on cache A is t_hit(6n + 1).
The performance ratio of cache B to cache A is:
\[
\frac{\text{Time}_A}{\text{Time}_B} = \frac{t_{hit}(6n+1)}{t_{hit}(2n+1)} = \frac{6n+1}{2n+1}.
\]
If we unroll n = 2039 times, then the ratio is 2.9995, and so cache B is about 3 times faster than cache
A.
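Spelling the n = 2039 case out explicitly:
\[
\frac{6 \times 2039 + 1}{2 \times 2039 + 1} = \frac{12235}{4079} \approx 2.9995.
\]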
5.4 (9 points)
We want to calculate the utilization of memory bus bandwidth for two different cache configurations. The
basic approach in each case is to list all of the different types of memory reference, and, for each, to compute
how many bus transactions it takes and how frequently it occurs.
(a) We have a write-through cache. There are four types of memory reference to consider: read hit (in
the cache), read miss, write hit, and write miss. All blocks in the cache are always clean, so it is
not necessary to distinguish between dirty and clean blocks. Read hits are absorbed by the cache
and generate no memory system traffic; these comprise a 0.95 × 0.75 = 0.7125 fraction of the total
references (since there's a 95% hit rate, and 75% of accesses are reads). Read misses require two
accesses, since both words of the cache block must be loaded; these comprise a 0.05 × 0.75 = 0.0375
fraction of the total references. Write hits require a single memory reference to write the new word back to
memory (only the modified word is written, not the entire block), and comprise 0.95 × 0.25 = 0.2375
of the references. Finally, since the cache is write-allocate, a write miss first fetches the entire old
block from memory (two references), then modifies one word of the block and writes the modified data
through to memory (one reference). So these take three memory references, and occur with a frequency
of 0.05 × 0.25 = 0.0125. These results are summarized in the table below.

Reference type    Memory references per access    Fraction of accesses
Read hit          0                               0.7125
Read miss         2                               0.0375
Write hit         1                               0.2375
Write miss        3                               0.0125
We can now compute the bandwidth used. Using the data above, we first compute the average number
of main memory references per processor access to be:
\[
0.7125 \times 0 + 0.0375 \times 2 + 0.2375 \times 1 + 0.0125 \times 3 = 0.35.
\]
Accesses are generated by the processor at a rate of 10^9 words per second, and on average 35% of
these will reach the memory system. The memory system can sustain 10^9 words per second, so the
utilization of the memory system is simply:
\[
\frac{0.35 \times 10^9}{10^9} = 0.35,
\]
or 35%.
(b) Now we have a write-back cache. Our breakdown of memory references is more complicated, since we
must consider the cases when a miss is to a block in the cache that is dirty. We are told that 30% of
the blocks are dirty on average. If we assume random replacement, then on average 30% of misses will
force a writeback of a dirty line. The following table shows the breakdown of processor accesses, the
number of references generated by each, and the frequency of each:

Reference type             Memory references per access    Fraction of accesses
Read hit                   0                               0.7125
Write hit                  0                               0.2375
Read miss, clean block     2                               0.02625
Read miss, dirty block     4                               0.01125
Write miss, clean block    2                               0.00875
Write miss, dirty block    4                               0.00375
Both read and write hits are absorbed by the cache, so they generate no memory system references.
Read and write misses to clean blocks each generate two references to read in the old block from the
memory system (a write miss behaves like a read miss plus a write hit). Finally, read and write misses
to dirty blocks each generate four references: two to write back the dirty block, and two to read in the
new block from memory.
We can again compute the average number of main memory references per processor access to be:
\[
2 \times (0.02625 + 0.00875) + 4 \times (0.01125 + 0.00375) = 0.07 + 0.06 = 0.13.
\]
Again, accesses are generated by the processor at a rate of 10^9 per second, and the memory system
bandwidth is 10^9 accesses per second, so the utilization is simply:
\[
\frac{0.13 \times 10^9}{10^9} = 0.13,
\]
or 13%. Notice that a write-back cache significantly reduces the memory bandwidth required as
compared to a write-through cache. This is often why processors are built with write-back L2 caches,
especially when they are intended for use on a shared-memory bus where bandwidth is a precious
resource.
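As a quick cross-check of the 35% and 13% figures, here is a small sketch that recomputes the per-access reference counts from the breakdowns above (using the problem's 95% hit rate, 75%/25% read/write mix, two-word blocks, 30% dirty blocks, and equal processor and memory rates of 10^9 words per second):

#include <stdio.h>

/* Recompute the bus utilizations of parts (a) and (b) from the breakdowns above. */
int main(void) {
    double hit = 0.95, miss = 0.05, rd = 0.75, wr = 0.25, dirty = 0.30;

    /* (a) write-through, write-allocate */
    double refs_wt = (hit * rd) * 0      /* read hit: no traffic            */
                   + (miss * rd) * 2     /* read miss: load two-word block  */
                   + (hit * wr) * 1      /* write hit: write word through   */
                   + (miss * wr) * 3;    /* write miss: fetch block + write */

    /* (b) write-back, write-allocate (hits generate no traffic) */
    double refs_wb = miss * ((1.0 - dirty) * 2 + dirty * 4);

    printf("write-through: %.2f references/access -> %.0f%% utilization\n", refs_wt, refs_wt * 100);
    printf("write-back:    %.2f references/access -> %.0f%% utilization\n", refs_wb, refs_wb * 100);
    return 0;
}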
5.5 (10 points)
In this problem, we will be calculating performance, using the following equations
\[
\text{CPU Time} = \text{IC} \times \left( \text{CPI}_{\text{execution}} + \frac{\text{Memory Stall Cycles}}{\text{Instruction}} \right) \times \text{Clock Cycle Time}
\]
The following equation is used to calculate the CPI without any memory stalls:
\[
\text{CPI}_{\text{execution}} = \sum_{\text{instruction classes}} f_{\text{class}} \times \text{CPI}_{\text{class}}
\]
The following equation is used to calculate the memory stall cycles per instruction.
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = \frac{\text{I-Fetch Stall Cycles}}{\text{Instruction}} + \frac{\text{Data Stall Cycles}}{\text{Instruction}}
\]
where the stall cycles per instruction for both instructions and data can be calculated by:
\[
\frac{\text{Stall Cycles}}{\text{Instruction}} = \frac{\text{Memory Accesses}}{\text{Instruction}} \times \text{Miss Rate} \times \text{Miss Penalty}
\]
For fetch stalls, the number of memory accesses per instruction is 1, but in data stalls, only loads and stores
cause memory accesses. If R is a miss rate, P a miss penalty, and f a frequency (with subscript i for
instruction fetches and d for data accesses), then:
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = R_i P_i + (f_{\text{stores}} + f_{\text{loads}})\, R_d P_d
\]
(a) Now we only need to plug in the right values into the above equations. For both caches, the CPI
without stalls is the same.
The frequencies of various instructions can be obtained from Figure 2.26 of the textbook.
For both caches, the instruction cache miss rate is 0.5% and the penalty is 50 cycles. For both caches,
the data cache miss rate is 1%, but the penalties are different. In a write-through cache, nothing in the
cache is ever dirty, and the writes themselves are serviced by the write buffer, which is infinitely large in this
case. Thus, for both loads and stores, the penalty is the cache miss penalty, which is 50 cycles.
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = (0.5\% \times 50) + (f_{\text{stores}} + f_{\text{loads}})(1.0\% \times 50)
\]
In a write-back cache, a miss on either a load or a store can either simply replace a clean line
(50 cycles) or knock out a dirty line, which must be written back (an additional 50 cycles, and this
happens 50% of the time).
\[
\frac{\text{Memory Stall Cycles}}{\text{Instruction}} = (0.5\% \times 50) + (f_{\text{stores}} + f_{\text{loads}})\bigl(1.0\% \times (50 + 50\% \times 50)\bigr)
\]
By plugging in the right frequencies from Figure 2.26 into these three equations, and then into the very
first equation, the CPU time can be calculated for both systems. The following are the CPI numbers (CPI
is given rather than execution time because the clock cycle time and instruction count are the same
for both cases).
Program Execution CPI W.T. Stalls W.T. CPI W.B. Stalls W.B. CPI
compress 1.05 0.38 1.43 0.44 1.49
eqntott 1.00 0.41 1.41 0.48 1.48
espresso 1.04 0.38 1.42 0.45 1.49
gcc 1.14 0.44 1.58 0.53 1.67
li 1.16 0.49 1.65 0.61 1.77
average 1.08 0.42 1.50 0.50 1.58
The write back cache is roughly 5% slower than the write through cache using average numbers. This is
because the write back cache sees the latency of writing back dirty lines, while the write through cache
does not see the latency of writes because it has a magically infinite write buffer. Because stalls caused
by a real (finite) write buffer, consistency issues raised by the write buffer, and the extra memory bandwidth
consumed by a write-through cache are not considered here, the write-back cache appears to be the slower one.
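As a sanity check on the stall columns, the two stall expressions above reduce to 0.25 + 0.5(f_loads + f_stores) for write-through and 0.25 + 0.75(f_loads + f_stores) for write-back. A minimal sketch (the memory-reference fraction below is illustrative, not a Figure 2.26 value):

#include <stdio.h>

/* Stall cycles per instruction, per the two equations above:
 *   instruction fetches: 1 access/instruction * 0.5% miss rate * 50 cycles = 0.25
 *   data accesses: (f_loads + f_stores) * 1% miss rate * penalty, where the
 *   write-back penalty is 50 + 50% * 50 = 75 cycles to cover dirty writebacks. */
static double wt_stalls(double f_mem) { return 0.005 * 50.0 + f_mem * 0.01 * 50.0; }
static double wb_stalls(double f_mem) { return 0.005 * 50.0 + f_mem * 0.01 * 75.0; }

int main(void) {
    double f_mem = 0.30;   /* illustrative loads+stores fraction, NOT a Figure 2.26 value */
    printf("stalls/instruction: write-through %.2f, write-back %.2f\n",
           wt_stalls(f_mem), wb_stalls(f_mem));
    return 0;
}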
(b) This part of the question is almost identical to the previous part. The only difference is that for write
through caches, stores take 1 cycle.
\[
\text{CPI}_{\text{execution}} = (f_{\text{loads}} \times 1) + (f_{\text{stores}} \times 1) + (f_{\text{other}} \times 1)
\]
So the values in the table for the write back cache stay the same. For the write through cache, the
stalls stay the same. But the execution CPI of the write through cache now becomes 1 for all cases
because all instructions now take only 1 cycle, and so the total CPI of the write through cache changes.
Program W.T. Execution CPI W.T. Stalls W.T. CPI W.B. CPI
compress 1.00 0.38 1.38 1.49
eqntott 1.00 0.41 1.41 1.48
espresso 1.00 0.38 1.38 1.49
gcc 1.00 0.44 1.44 1.67
li 1.00 0.49 1.49 1.77
average 1.00 0.42 1.42 1.58
This is not a good general solution, however, since it requires that the cache be big enough to hold the
entire a[] and b[] arrays. Even if the arrays fit into the cache, this scheme significantly increases the cache
working set (amount of data that must stay in the cache to be useful) over what it could be.
A better approach is to prefetch the data as the loops walk over the array. In order to do this, we have
to split the loops into several parts. We begin with an initial loop that prefetches the first eight elements
needed, a[0][0..7] and b[0..7][0]:
/* SOLUTION PART 1 */
prefetch(a[0][0]);
for (j = 0; j < 8; j += 2) {
    prefetch(b[j][0]);
    prefetch(b[j+1][0]);
    prefetch(a[0][j+1]);
}
A few notes about this loop: first, we prefetch a[0][0] manually and then unroll the loop so that it
prefetches only the odd elements of a (one prefetch per cache block, since consecutive pairs of elements
share a block). If we knew the array was aligned with a cache block, the manual prefetch wouldn't be
necessary; if the array is not aligned, it guarantees that we actually prefetch all needed elements. Note
that we unroll the loop rather than using mod (%) to
prefetch every other element of a; this is because the mod operation can be very expensive compared to
unrolling the loop. Also, we put the prefetches in a loop to reduce the code footprint of the program; this
would be particularly important if the cache was unified. Next, notice that the prefetches are scheduled to
prefetch the b array first. This is because the values of the b array are needed first in the computational
instruction. One might wonder why we bother prefetching the first few iterations worth of data, rather than
just allowing them to cache-miss. The key is that the prefetches are non-blocking, and, since we assumed
that the memory system is pipelined with a large number of outstanding requests, the prefetches can all run
in parallel. So, when the first computational instruction tries to access b[0][0], the prefetch will not have
completed (since only about 12 cycles have elapsed) and the load will block until the data is ready. However,
all the other prefetches will continue (separated in time by just one cycle), so, by the time the first load is
ready and its value consumed, the remaining data will be ready, and thus there will be no cache misses on
any of the initial computation iterations.
Now we consider the main part of the loop. It must do the computation and also prefetch for subsequent
iterations. Again, we unroll the loop to avoid prefetching twice for each cache block. Finally, note that the
loop only runs until j == 90; this is to avoid extra unnecessary prefetches.
/* SOLUTION PART 2 */
for (j = 0; j < 92; j += 2) {
    prefetch(b[j+8][0]);
    prefetch(a[0][j+8]);
    a[0][j] = b[j][0] * b[j+1][0];
    prefetch(b[j+9][0]);
    a[0][j+1] = b[j+1][0] * b[j+2][0];
}
prefetch(b[100][0]);
Finally, we have a last loop that finishes the computation without prefetching past the bounds of the array.
Note, however, that up to this point we’ve only been working on the i == 0 iteration of the outer loop.
Since there are two more outer iterations to do, we begin prefetching for the next outer iteration in this last
part of the i == 0 loop:
/* SOLUTION PART 3 */
prefetch(a[1][0]);
for (j = 92; j < 100; j += 2) {
    prefetch(a[1][j-91]);
    a[0][j] = b[j][0] * b[j+1][0];
    a[0][j+1] = b[j+1][0] * b[j+2][0];
}
Note that we don’t need to prefetch the b array, since it is already in the cache and is used unchanged in
the next iteration.
OK, now we’ve got the first outer loop iteration done. The next iteration is similar to the previous one;
it begins with a main loop that prefetches as it computes, and ends with a loop that prefetches for the last
(i == 2) iteration:
/* SOLUTION PART 4 */
for (j = 0; j < 92; j += 2) {
    prefetch(a[1][j+8]);
    a[1][j] = b[j][0] * b[j+1][0];
    a[1][j+1] = b[j+1][0] * b[j+2][0];
}
prefetch(a[2][0]);
for (j = 92; j < 100; j += 2) {
    prefetch(a[2][j-91]);
    a[1][j] = b[j][0] * b[j+1][0];
    a[1][j+1] = b[j+1][0] * b[j+2][0];
}
And to complete the problem, we do the last outer iteration, prefetching only in the main loop. We don’t
need to prefetch in the last part of the loop since there are no more iterations to do.
/* SOLUTION PART 5 */
for (j = 0; j < 92; j += 2) {
    prefetch(a[2][j+8]);
    a[2][j] = b[j][0] * b[j+1][0];
    a[2][j+1] = b[j+1][0] * b[j+2][0];
}
for (j = 92; j < 100; j += 1) {
    a[2][j] = b[j][0] * b[j+1][0];
}
Lastly, let’s compute the performance of this solution. The first code fragment runs in 13 cycles. The
second code fragment would take 783 cycles to execute, but it has to stall in cycle 16 on the first access to
b[0][0]; this data was prefetched in cycle 2 and is thus available in cycle 52, causing 36 stall cycles. There
are no further stalls in the code, so we can continue adding up cycles. The third fragment takes 61 cycles;
the fourth fragment takes 751, and the fifth fragment takes 746. Adding these together, we get 2390 cycles
in total, or 44% faster than the simple prefetch code on page 404, and 6.13× faster than the non-prefetched
code.
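Summing the fragments explicitly (charging the 36 stall cycles to the second fragment):
\[
13 + (783 + 36) + 61 + 751 + 746 = 2390 \text{ cycles}.
\]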
One other note on this problem. The original (unprefetched) loop is rather poorly written. It can be
simplified significantly by performing a loop interchange optimization. Performing this optimization also
simplifies the prefetch problem. Here’s the code after loop interchange:
/* Solution with loop interchange */
prefetch(a[0][0]);
prefetch(a[1][0]);
prefetch(a[2][0]);
for (j = 0; j < 8; j += 2) {
    prefetch(b[j][0]);
    prefetch(b[j+1][0]);
    prefetch(a[0][j+1]);
    prefetch(a[1][j+1]);
    prefetch(a[2][j+1]);
}
for (j = 0; j < 92; j += 2) {
    prefetch(b[j+8][0]);
    prefetch(a[0][j+8]);
    prefetch(a[1][j+8]);
    prefetch(a[2][j+8]);
    t1 = b[j][0] * b[j+1][0];
    a[2][j] = a[1][j] = a[0][j] = t1;
    prefetch(b[j+9][0]);
    t1 = b[j+1][0] * b[j+2][0];
    a[2][j+1] = a[1][j+1] = a[0][j+1] = t1;
}
for (j = 92; j < 100; j += 2) {
    t1 = b[j][0] * b[j+1][0];
    a[2][j] = a[1][j] = a[0][j] = t1;
    t1 = b[j+1][0] * b[j+2][0];
    a[2][j+1] = a[1][j+1] = a[0][j+1] = t1;
}
5.7 (8 points)
In this problem, we consider a pseudo-associative cache in which we do not swap blocks on a hit to the
“slow” part of the cache. We also assume that a hit to the slow part takes only one extra cycle, not two
(since we don’t have to swap). Finally, we are assuming a random replacement policy.
(a) We first want to derive the AMAT for this cache in terms of miss rates and miss penalties of 1-way
(direct-mapped) and 2-way caches. We use a slightly different formulation of AMAT than the usual
one:
\[
\text{AMAT}_{\text{pseudo}} = \text{Hit rate}_{\text{pseudo}} \times \text{Hit time}_{\text{pseudo}} + \text{Miss rate}_{\text{pseudo}} \times \text{Miss time}_{\text{pseudo}}
\]
The reason for using this formulation (rather than the traditional AMAT = Hit time + Miss rate ×
Miss penalty) is that the miss time in a pseudo-associative cache is different than the hit time plus
the miss penalty. Here’s why: in a traditional cache, the hit time is a constant. When there’s a miss
in a traditional cache, that same constant amount of time is spent checking for a hit before the miss
penalty is paid. In a pseudo-associative cache, the hit time for a given hit can be one of two values,
and thus the overall hit time is just a statistical measure. On a miss in a pseudo-associative cache, it
is the slow hit time that must be paid before starting to process a miss, not the statistical average hit
time.
Given that, we can now examine the terms of the equation. The miss and hit rates are the same as for a
standard 2-way set-associative cache of the same size, since the two types of caches behave identically in
terms of hits and misses (there are two possible locations for each datum):
\[
\text{Hit rate}_{\text{pseudo}} = \text{Hit rate}_{\text{2-way}}, \qquad \text{Miss rate}_{\text{pseudo}} = \text{Miss rate}_{\text{2-way}}.
\]
The miss time is just the miss penalty of a 1-way cache (since the underlying memory system and fill
path are the same) plus the slow hit time in the associative cache. So at this point our AMAT equation
looks like:
\[
\text{AMAT}_{\text{pseudo}} = \text{Hit rate}_{\text{2-way}} \times \text{Hit time}_{\text{pseudo}} + \text{Miss rate}_{\text{2-way}} \times \left( \text{Hit time}_{\text{pseudo-slow}} + \text{Miss penalty}_{\text{1-way}} \right).
\]
The only remaining term in the AMAT formula is the overall hit time for the cache. We start by
expressing the overall hit time in terms of the percentage of accesses and the access time to the fast
and slow blocks, as follows:
\[
\text{Hit time}_{\text{pseudo}} = \text{Hit fraction}_{\text{fast block}} \times \text{Hit time}_{\text{fast block}} + \text{Hit fraction}_{\text{slow block}} \times \text{Hit time}_{\text{slow block}}.
\]
Because we are using random replacement, and because we are not swapping blocks, the likelihood of
a given block being in the fast position is the same as that of it being in the slow position, so:
\[
\text{Hit fraction}_{\text{fast block}} = \text{Hit fraction}_{\text{slow block}} = \frac{1}{2}.
\]
So:
\[
\text{Hit time}_{\text{pseudo}} = \frac{1}{2}\left( \text{Hit time}_{\text{fast block}} + \text{Hit time}_{\text{slow block}} \right).
\]
Putting this together with the AMAT formula and simplifying gives:
\[
\text{AMAT}_{\text{pseudo}} = \frac{\text{Hit time}_{\text{pseudo-fast}} + \text{Hit time}_{\text{pseudo-slow}}}{2} + \text{Miss rate}_{\text{2-way}} \left( \text{Hit time}_{\text{pseudo-slow}} + \text{Miss penalty}_{\text{1-way}} - \frac{\text{Hit time}_{\text{pseudo-fast}} + \text{Hit time}_{\text{pseudo-slow}}}{2} \right).
\]
Now, recalling that Hit time_fast block = 1, Hit time_slow block = 2, and Miss penalty_1-way = 50, we can
compute the final AMAT formula as:
\[
\text{AMAT}_{\text{pseudo}} = \frac{1 + 2}{2} + \text{Miss rate}_{\text{2-way}} \times \left( 2 + 50 - \frac{1 + 2}{2} \right) = 1.5 + 50.5 \times \text{Miss rate}_{\text{2-way}}.
\]
(b) We assume that the values in Figure 5.9 in the text still apply for a cache with random replacement,
and we assume a 50-cycle miss penalty. For each cache size (e.g., the 2KB cache), the AMAT then follows
by substituting the corresponding 2-way miss rate into the formula above.
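Since the Figure 5.9 miss rates are not reproduced here, the following is a minimal sketch of that substitution, with placeholder miss rates (not the Figure 5.9 values):

#include <stdio.h>

/* AMAT sketch for part (b). The miss rates below are placeholders, NOT the
 * Figure 5.9 values; substitute the 1-way and 2-way miss rates for each cache
 * size from the figure. Hit times (1 fast, 2 slow) and the 50-cycle miss
 * penalty are the ones assumed in the text; the direct-mapped line is one
 * natural point of comparison, not part of the derivation above. */
int main(void) {
    double miss_1way = 0.10, miss_2way = 0.08;                  /* placeholder miss rates */
    double amat_pseudo = 1.5 + miss_2way * (2.0 + 50.0 - 1.5);  /* formula derived above  */
    double amat_1way   = 1.0 + miss_1way * 50.0;                /* plain direct-mapped    */
    printf("AMAT pseudo = %.2f cycles, AMAT 1-way = %.2f cycles\n", amat_pseudo, amat_1way);
    return 0;
}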
5.9 (3 points)
(a) This problem effectively asks us to produce a formula for the number of bits needed as input to a ROM
in order to look up the result of a mod operation. We are given that there are 2^N − 1 memory banks,
and an input address that is M bits wide. We will assume that N is chosen such that 2^N − 1 is prime.
According to the problem in the text, after step 3, an input address A will be represented in the form:
\[
A = \sum_{i=0}^{N-1} \mathrm{Term}_i
\]
where the i'th term Term_i is 2^i times the sum of the binary digits a_i, a_{i+N}, a_{i+2N}, .... Each term
contains ⌈M/N⌉ of these digits, since the last bit we have is a_{M−1} and the interval between bits within a
term is N. So:
\[
\mathrm{Term}_i = 2^i \times \sum_{j=0}^{\lceil M/N \rceil - 1} a_{i+N\cdot j}
\]
A simple upper bound on this sum for Term_i is obtained by treating the upper summation bound as just
M/N and all bits a_k as 1. Then
\[
\mathrm{Term}_i \le 2^i \cdot \frac{M}{N}
\]
and
\[
A \le \sum_{i=0}^{N-1} 2^i \cdot \frac{M}{N} = \frac{M}{N} \sum_{i=0}^{N-1} 2^i = \frac{M}{N} \cdot (2^N - 1).
\]
Finally, we want to compute the number of bits needed to represent this maximum value (which is
equal to the number of input lines to the mod lookup ROM). This is just the ceiling of the base-2 log
of the maximum value:
\[
n_{\text{bits}} = \left\lceil \log_2\left( \frac{M}{N} \cdot (2^N - 1) \right) \right\rceil = \left\lceil \log_2\frac{M}{N} + \log_2(2^N - 1) \right\rceil \le \left\lceil N + \log_2\frac{M}{N} \right\rceil.
\]
For the example with M = 32 and N = 3, an upper bound on the maximum size of the address is
⌈3 + log_2(32/3)⌉ = ⌈3 + 3.42⌉ = 7 bits.
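A quick check of the bound for these example values (a sketch, assuming M = 32 and N = 3 as above):

#include <stdio.h>

/* The worst-case sum after step 3 is (M/N) * (2^N - 1); its width is the
 * ROM's input width. */
int main(void) {
    const int M = 32, N = 3;
    double max_val = ((double)M / N) * ((1 << N) - 1);   /* (32/3) * 7 = 74.67 */
    int nbits = 0;
    while ((1u << nbits) <= (unsigned)max_val)           /* bits needed to hold max_val */
        nbits++;
    printf("worst-case sum <= %.2f, ROM input width = %d bits\n", max_val, nbits);
    return 0;
}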
(b) In this part we want to come up with simple hardware to pick the correct bank out of 7 given a 32-bit
address. We are told to assume that the bank width is 8 bytes, and thus the lower 3 bits of the address
are used to select a byte within a row of a bank. Thus, we effectively have an M = 29-bit bank address.
Again, N = 3.
The following diagram shows a simple way of implementing the bank number generator. Since the
ROM has 7 input lines and returns a 3-bit value (the mod-7 of the input), it must have 2^7 = 128 3-bit
entries, or 384 total bits. The blocks labeled as “bit-counters” return (as a 4-bit number) the number
of set (high) input lines. These could be implemented as LUTs, a rippled chain of 2-bit adders, a tree
of wider adders, or something more clever.
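In software form, the same reduction can be sketched as follows (assumed parameters from the text: 8-byte-wide banks, 7 = 2^3 − 1 banks, 32-bit addresses; the hardware version does the per-bit-position counts with the bit-counters and the final reduction with the 128-entry ROM):

#include <stdio.h>
#include <stdint.h>

/* Summing the 3-bit groups of the bank address gives the same result mod 7
 * as the address itself, because 2^3 = 8 is congruent to 1 mod 7. */
static unsigned bank_number(uint32_t addr) {
    uint32_t bank_addr = addr >> 3;            /* drop the byte-within-row bits */
    unsigned sum = 0;
    for (int i = 0; i < 32; i += 3)
        sum += (bank_addr >> i) & 0x7;         /* add up the 3-bit groups */
    return sum % 7;                            /* the ROM lookup step */
}

int main(void) {
    uint32_t addr = 0x12345678u;               /* arbitrary example address */
    printf("bank(0x%08x) = %u (direct mod-7: %u)\n",
           (unsigned)addr, bank_number(addr), (unsigned)((addr >> 3) % 7));
    return 0;
}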
5.20 (6 points)
In answering this question, we must consider the four functions of a virtual memory system as presented in
class:
1. to provide a larger effective memory space by using a paging disk to back physical memory
3. to provide translation between multiple virtual address spaces and one physical address space
(a) This part of the question can be answered by making a table of the number of transactions that each
component in the system can handle and then finding the bottleneck and using its TP/s value. Using
Figure 6.24 from the textbook, we find that we need 10GB of disk space (either twenty 500MB disks
or eight 1,250MB disks). Each transaction requires (2 reads + 2 writes) × 15,000 + 40,000 instructions,
which is a total of 100,000 instructions (0.1 million instructions) per transaction.
For the small disks, the TP/s limitation is the disk system: twenty disks × 30 I/Os per second ÷ 4 I/Os
per transaction = 150 TP/s. For the big disks, the limitation is again the disk system: eight disks × 30
I/Os per second ÷ 4 I/Os per transaction = 60 TP/s.
(b) This part is trivial. The cost of each system is added up and then divided by the number of TP/s it supports.
(c) This part is also trivial. We already know that the current bus can handle 2,621,440 TP/s and the
current CPU can handle 8,000 TP/s. So the CPU would have to be 2,621,440 / 8,000 = 328 times faster
than the current CPU. Since the current CPU is 800 MIPS, the new one would have to be 262,400 MIPS.
Ouch.
(d) The software approach would reduce the number of instructions per transaction and the number of
disk reads and writes, so that the new instruction count would be:
15,000 × (1 read + 1 write) + 30,000 = 60,000 instructions/transaction
This reduces the load on the old CPU to 0.06 million instructions per transaction, and the overall
throughput of the old CPU would be 13,333 TP/s. The old approach needed 100,000 instructions per
transaction, so the new CPU would have to be 100,000 / 60,000 = 1.67 times faster than the original CPU.
(e) (Don't trust those sneaky MTP I/O people - they listen at doors to sekret presentations!) Because the
old CPU was limited to 8,000 TP/s, the new CPU will be limited to 16,000 TP/s. To provide enough
small disks, each of which provides 30 I/Os per second, we would need 16,000 × 4 / 30 = 2,134 small
disks to handle all the I/Os. Each disk costs $100, and so the system cost is:
Cost = $50,000 + (2,134 × $100) = $263,400
(f) We originally had 20 small disks. To match the CPU's 16,000 TP/s, we would need disks that can
support 16,000 × 4 / 20 = 3,200 I/Os per second per disk. If the new disks cost the same as the old disks,
the new system cost will be the same as the old system, which was $52,000.
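A compact recomputation of the throughput limits used in parts (a), (c), and (e), coded from the constants quoted above (a sketch, not part of the original solution):

#include <stdio.h>

int main(void) {
    double cpu_mips = 800.0;          /* current CPU */
    double inst_per_tx = 100000.0;    /* (2 reads + 2 writes) * 15,000 + 40,000 */
    double ios_per_tx = 4.0;          /* 2 reads + 2 writes */
    double ios_per_disk = 30.0;       /* quoted for the small disks; the same rate
                                         reproduces the 60 TP/s big-disk figure */

    double cpu_tps        = cpu_mips * 1e6 / inst_per_tx;     /* 8,000 TP/s */
    double small_disk_tps = 20.0 * ios_per_disk / ios_per_tx; /* 150 TP/s   */
    double big_disk_tps   = 8.0  * ios_per_disk / ios_per_tx; /* 60 TP/s    */

    /* Part (e): disks needed to keep up with a 16,000 TP/s CPU, and the cost. */
    double disks_needed = 16000.0 * ios_per_tx / ios_per_disk;  /* 2,133.3 -> 2,134 */
    double cost = 50000.0 + 2134.0 * 100.0;                     /* $263,400 */

    printf("CPU: %.0f TP/s, small disks: %.0f TP/s, big disks: %.0f TP/s\n",
           cpu_tps, small_disk_tps, big_disk_tps);
    printf("disks for 16,000 TP/s: %.1f (round up to 2,134), cost $%.0f\n",
           disks_needed, cost);
    return 0;
}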
By multiplying the second and third columns together and adding the results, we get:
Memory Accesses per Instruction = 1.37
By multiplying the second and fourth columns together and adding the results, we get:
Memory Reads per Instruction = 1.23, 90% of all memory accesses
By multiplying the second and fifth columns together and adding the results, we get:
Memory Writes per Instruction = 0.14, 10% of all memory accesses
Thus:
CPI_ideal = 1.37
For write through caches, we can construct the following table. The frequencies are calculated by
multiplying the probability of the event in the first column happening by the probability of the event
in the second column happening. The average stalls per access can be calculated by multiplying the
last two columns and adding the results. Then we simply substitute the results into the equation
above.
Access Hits Cache? Access Type Frequency Cycles Bus is Busy
Yes Read 95% × 90% = 85.5% 0
Yes Write 95% × 10% = 9.5% 16
No Read 5% × 90% = 4.5% 23
No Write 5% × 10% = 0.5% 39
Multiplying the last two columns and summing gives 0.855 × 0 + 0.095 × 16 + 0.045 × 23 + 0.005 × 39 =
2.75 bus-busy cycles per access, so:
\[
\text{Traffic Ratio} = \frac{1.37 \times 2.75}{1.37 + 1.37 \times 2.75} = 73.3\%
\]
For the write-back cache, the corresponding average is 1.27 bus-busy cycles per access, so:
\[
\text{Traffic Ratio} = \frac{1.37 \times 1.27}{1.37 + 1.37 \times 1.27} = 56.0\%
\]
(b) 80% of the bus bandwidth is available. For write through caches, 73.3% of the bandwidth is consumed
by the cache, thus leaving 6.7% for I/O. For write back caches, 56% of the bandwidth is consumed,
leaving 24% for I/O.
(c) The following equation can be used to calculate the real CPI of each machine. For write-through caches,
the value is:
\[
\text{CPI} = \text{CPI}_{\text{ideal}} + \frac{\text{Memory References}}{\text{Instruction}} \times \frac{\text{Memory Stall Cycles}}{\text{Memory Reference}} = 1.37 + 1.37 \times 2.75 = 5.14
\]
Thus to execute one million instructions, 5.14 × 10^6 cycles are needed. Of those cycles, 73.3% of the
time the bus is used by the cache. The following table summarizes this information for both cache
systems.

$ Style CPI % Bus Used by $ % Available for IO Cycles for IO
Write Through 5.14 73.3% 6.7% 0.34 × 10^6
Write Back 3.11 56.0% 24.0% 0.75 × 10^6
A disk operation takes a total of 102,000 cycles (101,000 cycles to initiate the operation and 1,000 cycles
on the IO bus to do the actual transfer). For the write-through cache, we have 5.14 × 10^6 cycles to
play with. We can do ⌊5.14 × 10^6 / 102,000⌋ = 50 disk operations. 50 disk operations require 50 × 1,000 = 50,000
cycles on the bus, which we do have available.
For the write-back cache, we have 3.11 × 10^6 cycles to play with. We can do ⌊3.11 × 10^6 / 102,000⌋ = 30 disk
operations. 30 disk operations require 30 × 1,000 = 30,000 cycles on the bus, which we do have
available.
When the miss rate is reduced to 2.5%, the traffic ratios for the write through and write back caches
drop down to 68.5% and 38.8% respectively (just recalculate the last 2 tables and set of equations in
part a). The CPIs decrease to 4.35 and 2.24 respectively.
$ Style CPI % Bus Used by $ % Available for IO Cycles for IO
Write Through 4.35 68.5% 11.5% 0.50 × 10^6
Write Back 2.24 38.8% 41.2% 0.92 × 10^6
The number of disk operations that can be supported is now 42 for the write-through cache and 21 for
the write-back cache.
(d) Previously, each disk operation had to complete in entirety before another could be started. Now, since
we have multiple disks, we can overlap the 100,000 cycles it takes each disk to find the data. So we
can initiate a disk access every 1000 cycles (and we do have plenty of cpu execution cycles) but the
limiting factor now is the number of disk accesses that can return on the IO bus because each return of
data requires 1000 cycles on the IO bus. Using the numbers from the above table (for the standard 5%
miss rate caches), we find that for write-through caches, the number of accesses that can occur every
million instructions is ⌊0.34 × 10^6 / 1,000⌋ = 340, and for write-back caches, that number is
⌊0.75 × 10^6 / 1,000⌋ = 750.
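A compact recomputation of the write-through numbers used above (a sketch; 1.37 memory accesses per instruction and 2.75 bus-busy cycles per access are the 5% miss rate values from the tables):

#include <stdio.h>

int main(void) {
    double refs_per_inst = 1.37, bus_cycles_per_ref = 2.75;  /* write-through, 5% miss rate */
    double cpi = refs_per_inst + refs_per_inst * bus_cycles_per_ref;  /* 1.37 + 1.37*2.75 = 5.14 */
    double traffic = (refs_per_inst * bus_cycles_per_ref) / cpi;      /* 73.3% of bus cycles */
    double cycles = cpi * 1e6;                     /* cycles per million instructions */
    double io_cycles = (0.80 - traffic) * cycles;  /* bus cycles left for I/O (80% usable) */
    int serial_ops = (int)(cycles / 102000.0);     /* one 102,000-cycle disk op at a time */
    printf("CPI = %.2f, traffic = %.1f%%, I/O cycles = %.2f million\n",
           cpi, traffic * 100.0, io_cycles / 1e6);
    printf("serialized disk operations per million instructions = %d\n", serial_ops);
    return 0;
}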