Par - 2nd In-Term Exam - Course 2019/20-Q1
Problem 1 (3 points) Assume the following serial code computing a two–dimensional NxN matrix u:
void compute(int N, double *u) {
    int i, j;
    double tmp;
    for (i = 1; i < N-1; i++)
        for (j = 1; j < N-1; j++) {
            tmp = u[N*(i+1) + j] + u[N*(i-1) + j] +   // elements u[i+1][j] and u[i-1][j]
                  u[N*i + (j+1)] + u[N*i + (j-1)] -   // elements u[i][j+1] and u[i][j-1]
                  4 * u[N*i + j];                     // element u[i][j]
            u[N*i + j] = tmp/4;                       // element u[i][j]
        }
}
The code is parallelised on three processors (P0−2 ) with the assignment of iterations to processors shown on
the left part of Figure 1.
Figure 1: (Left) Assignment of iterations to processors. (Right) Mapping of array elements to memory
modules in the NUMA system.
Observe that each processor is assigned the computation of N/3 consecutive iterations of the i loop (except
iterations 0 and N-1), starting with processor P0 (the number of processors, 3, perfectly divides the number of
rows and columns N); each processor executes its assigned computation in 3 tasks, each one computing N/3
consecutive iterations of the j loop. Due to the dependences in this code, you should already know that, for
example, task11 can only be executed by P1 once processor P0 finishes the execution of task01 and the
same processor (P1) finishes the execution of task10.
The three processors compose a multiprocessor architecture with 3 NUMA nodes sharing access to main
memory, each NUMA node with a single processor Pp, a cache memory (of sufficient size to store all the
lines required to execute all tasks) and a portion of main memory Mp (p in the range 0-2). Each memory Mp
has an associated directory to maintain coherence at the NUMA level. Each entry in the directory uses
2 bits for the status (M, S and U) and 3 bits for the sharers list. The rows of matrix u are distributed among
the NUMA nodes as shown on the right part of Figure 1. In that figure, rectangles represent the memory lines
that are involved in the computation of task11, for a specific case in which N=24 and each cache line is able
to host 4 consecutive elements of matrix u.
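For instance, assuming row-major storage and the block row distribution suggested by the figure (rows 0-7
mapped to M0, rows 8-15 to M1 and rows 16-23 to M2), the first iteration of task11 is i=8, j=8: element
u[8][8] lies at offset 8*24+8 = 200, i.e. in memory line 200/4 = 50, whose home node is M1, whereas its
neighbour u[7][8] lies at offset 7*24+8 = 176, i.e. in memory line 44, whose home node is M0.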
We ask you:
1. Assuming U status for all memory lines at the beginning of the parallel execution, what will the
contents of the directories be for the lines shown on the right part of Figure 1 when task11 is ready
for execution? Use the provided answer sheet for this question, indicating for each memory line the
contents of the directory (e.g. S011, meaning that the line is in S state with copies in NUMA nodes 1
and 0).
2. Indicate the sequence of coherence actions (RdReq, WrReq, UpgrReq, Fetch, Invalidate, Dreply, Ack or
WriteBack) that will occur when processor P1 executes the first iteration in task11, i.e. the iteration
in the upper left corner of task11 on the left part of Figure 1.
3. What will the contents of the directories be for the same lines once task11 finishes its execution?
At that time you should assume that task02 and task20 have also finished their execution. Use the
provided answer sheet for this question.
Solution: the figure on the left represents the contents in the directories before the execution (question
1.1); the figure on the right after the execution (question 1.3).
Solution figures for question 1.1 (before execution) and question 1.3 (after execution): directory contents
for the memory lines of Figure 1 in modules M0, M1 and M2 (figures not reproduced here).
Regarding question 1.2, processor P1 first performs several read accesses, one of them to a memory
position in state M in memory M0; to read it, P1 issues RdReq1→0 to the home node M0, which
responds with the contents of the line in a Dreply0→1. After the computation, P1 has to write one
element for which it is the home node; since the line containing that element is in S state, with a copy
in P0's cache, M1 has to send an Invalidate1→0 command, which is acknowledged with Ack0→1.
Problem 2 (4 points) A ticket lock is a lock implemented using two shared counters, next_ticket and
now_serving, both initialised to 0. A thread wanting to acquire the lock uses an atomic operation to
fetch the current value of next_ticket as its unique sequence number and to increment it by 1, generating
the next sequence number. The thread then waits until now_serving is equal to its sequence number.
Releasing the lock consists of incrementing now_serving in order to pass the lock to the next waiting
thread.
Given the following data structure and incomplete implementation of the primitives that support the ticket
lock mechanism:
typedef struct {
    int next_ticket;
    int now_serving;
} tTicket_lock;

void ticket_lock_init (tTicket_lock *lock) {
    lock->now_serving = 0;
    lock->next_ticket = 0;
}

void ticket_lock_acquire (tTicket_lock *lock) {
    // obtain my unique sequence number from next_ticket
    // generate the next_ticket sequence number
    // wait until my sequence number is equal to now_serving
}

void ticket_lock_release (tTicket_lock *lock) {
    lock->now_serving++;
}
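For reference, a minimal usage pattern of these primitives (the lock variable name is illustrative) would be:

tTicket_lock l;
ticket_lock_init(&l);
...
ticket_lock_acquire(&l);    /* obtain a ticket and wait for my turn */
/* ... critical section ... */
ticket_lock_release(&l);    /* pass the lock to the next waiting thread */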
We ask:
1. Complete the code for the ticket_lock_acquire primitive to be executed on two different platforms
that provide the following different atomic operations:
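The descriptions of the atomic operations provided by the two platforms are not reproduced in this excerpt.
Only as an illustration, assuming one of the platforms offers an atomic fetch-and-add (written here with
GCC's __sync_fetch_and_add builtin), the acquire primitive could be completed along these lines:

void ticket_lock_acquire (tTicket_lock *lock) {
    /* obtain my sequence number and generate the next one in a single atomic step */
    int my_ticket = __sync_fetch_and_add(&lock->next_ticket, 1);
    /* wait until my sequence number is equal to now_serving */
    while (lock->now_serving != my_ticket);
}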
2. Consider the following implementation for the basic lock explained in class using test-test-and-set:
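The implementation referred to in the statement is not reproduced in this excerpt. As an assumed reference
only, a typical test-and-test-and-set spin lock (written here with GCC's atomic builtins) looks like:

void lock_acquire (volatile int *lock) {
    do {
        while (*lock != 0);                          /* test: spin on the locally cached copy */
    } while (__sync_lock_test_and_set(lock, 1));     /* test&set: atomically try to take the lock */
}

void lock_release (volatile int *lock) {
    __sync_lock_release(lock);                       /* write 0, making the lock available again */
}

The trace to be filled in on the answer sheet at the end of this document follows the load / t&s / store
accesses to variable lock that this kind of implementation generates.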
Problem 3 (3 points)
We have the parallel code shown in the following code excerpt:
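The code excerpt itself is not reproduced in this text version. From the description that follows, its
structure is an initialization stage followed by a computation stage repeated num_iters times, with loops
controlled by variable i; a minimal serial skeleton, assumed only for illustration (the loop bodies, the exact
number of loops and the helper names init_element and compute_element are hypothetical), would be:

for (i = 0; i < N; i++)
    init_element(i);                      /* initialization stage */
for (int iter = 0; iter < num_iters; iter++)
    for (i = 0; i < N; i++)
        compute_element(i);               /* computation stage, repeated num_iters times */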
Since the computation part is repeated num_iters times, we want to exploit locality in all levels of the
memory hierarchy in order to speed up the parallel execution of the code. The first time a memory location is
accessed by a thread, the operating system will assign memory in the NUMA node where the thread doing
that first access is being executed (provided there is free space). This can help us get the memory used by each
thread within the memory of the NUMA node where it executes. Therefore, accesses to memory will be
faster if we manage to have each thread execute the very same iterations in all three loops controlled by variable
i. Locality will then be exploited by implementing a block data decomposition.
1. (1.5 points) Given the indications above, we ask you to write a faster parallel code for both the
initialization and computation stages using the appropriate OpenMP pragmas and invocations
to intrinsic functions, assuming the following constraints: 1) you cannot make use of the for work-
sharing construct in OpenMP to distribute work among threads; 2) you cannot assume that the
number of threads evenly divides the problem size N; 3) physical memory has not been assigned yet by
the time we reach the initialization loop; 4) parallelization overheads should always be kept as low as
possible (due to thread/task creation, synchronization, load imbalance, ...).
Solution:
We need to make sure that a thread executes the very same iterations for all the executions of each
loop, both in the initialization and computation stages.
#define N ...
int i;
...
#pragma omp parallel private(i)
{
    int id        = omp_get_thread_num();
    int howmany   = omp_get_num_threads();
    int basesize  = N / howmany;
    int remainder = N % howmany;
    int extra     = id < remainder;           /* 1 if this thread gets one extra iteration */
    int extraprev = extra ? id : remainder;   /* extra iterations assigned to lower thread ids */
    int lb = id * basesize + extraprev;       /* Loop lower bound */
    int ub = lb + basesize + extra;           /* Loop upper bound */
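    /* Hypothetical use of the computed bounds (the actual loop bodies are not reproduced in this
       excerpt; num_iters is the repetition count mentioned in the statement): each thread first-touches,
       and then repeatedly computes, exactly the iterations lb..ub-1, so first-touch allocation places
       its data in the local NUMA node and keeps it in the local cache. */
    for (i = lb; i < ub; i++)
        init_iteration(i);                    /* assumed initialization body for iteration i */
    #pragma omp barrier                       /* assumed: computation may read data initialized by other threads */
    for (int iter = 0; iter < num_iters; iter++)
        for (i = lb; i < ub; i++)
            compute_iteration(i);             /* assumed computation body for iteration i */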
}
...
After the previous code finishes its computation the code continues with the following function call:
...
final_processing (a, b, c);
...
Contrary to the computation stage studied above, the loop that appears in the final_processing
stage is executed only once. However, we need to solve another problem, namely, that function zoo presents
a highly variable execution time depending on the actual numerical values received as inputs. Consequently,
the block distribution used in the previous stages could easily cause load imbalance in this final processing
stage. Additionally, assume cache lines are 64 bytes long and integers occupy 4 bytes.
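The body of final_processing is not reproduced in this excerpt. A plausible shape, assumed only for
illustration (the types of a and b and the signature of zoo are guesses; c is taken as a vector of int,
consistent with the solution below), is:

void final_processing (int *a, int *b, int *c) {
    for (int i = 0; i < N; i++)
        c[i] = zoo(a[i], b[i]);    /* zoo's execution time varies strongly with its inputs */
}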
2. (1.5 points) We ask you to provide an efficient parallelization of the loop above. As before, you are
not allowed to make use of the for work–sharing construct in OpenMP to distribute work among
threads.
Solution:
We are interested in avoiding load imbalance while also avoiding false sharing in the accesses to vector c.
For this, we parallelize the loop following a block-cyclic geometric data decomposition for the output
vector c. The block size should allow a processor to use all the elements in a cache line, benefiting
from spatial locality and avoiding false sharing. To implement the block-cyclic decomposition we need
two nested loops, with an outer loop jumping from block to block cyclically and an inner loop traversing
all the elements in each block, as follows:
#define CACHE_LINE_SIZE 64
...
#pragma omp parallel
{
    int id = omp_get_thread_num();
    int howmany = omp_get_num_threads();
    /* Block size: number of elements of c that fit in one cache line */
    int block_size = CACHE_LINE_SIZE / sizeof(int);
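    /* Assumed continuation (the loop body from final_processing is not reproduced in this excerpt;
       c[i] = zoo(a[i], b[i]) is used as a stand-in, consistent with the call final_processing(a, b, c)).
       The outer loop jumps from block to block cyclically among threads; the inner loop traverses the
       elements of one block, also handling the case in which the blocks do not evenly divide N. */
    for (int ii = id * block_size; ii < N; ii += howmany * block_size)
        for (int i = ii; i < ii + block_size && i < N; i++)
            c[i] = zoo(a[i], b[i]);
}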
Answer sheet for question 1.1 (before execution) and question 1.3 (after execution): grids of the memory
lines in modules M0, M1 and M2 in which to write the directory entry for each memory line (for example,
S011 to indicate a line in state S with copies in nodes 1 and 0).
Student name: ………………………………………………………………………………
Answer sheet for the test-and-set lock trace (Problem 2, question 2). For every step, fill in the bus
transaction generated and the resulting status of the cache line holding lock in each of the three caches
(P0, P1 and P2), together with the value of lock; the CPU event of each step is already given. Initially
the line is in state I in the three caches and lock = 0.

 Step  Instruction                 CPU event
 ----  --------------------------  ----------
   1   T0 tries to acquire lock    load lock
   2   T1 tries to acquire lock    load lock
   3   T2 tries to acquire lock    load lock
   4   T0 acquires lock            t&s lock
   5   T1 tries to acquire lock    t&s lock
   6   T2 tries to acquire lock    t&s lock
   7   T1 tries to acquire lock    load lock
   8   T2 tries to acquire lock    load lock
   9   T0 releases lock            store lock
  10   T1 tries to acquire lock    load lock
  11   T2 tries to acquire lock    load lock
  12   T1 acquires lock            t&s lock
  13   T2 tries to acquire lock    t&s lock
  14   T1 releases lock            store lock
  15   T2 tries to acquire lock    load lock
  16   T2 acquires lock            t&s lock
  17   T2 releases lock            store lock