Lab3 PAP
E. Ayguadé
Spring 2024-25
Contents
1 The environment
1.1 Library constructor and destructor functions
1.2 Contributing to the Transversal Competence in PAP
2 Implementing parallel
2.1 Implementing a pool of threads
3 Thread synchronisations
6 Implementing task
6.1 Taskwait synchronisation
1 The environment
In this laboratory assignment you will build a simplified OpenMP runtime system that gives support to the code generated by the GNU gcc compiler, as an alternative to the gomp runtime system that is distributed as part of the gcc distribution. Your implementation will be based on the POSIX Pthreads standard library, which provides the basic support for thread creation and synchronisation. The newly created library will be named miniomp (libminiomp.so).
[Figure: the unmodified gcc compiler (gcc -fopenmp) translates the annotated OpenMP source code into an executable binary; at run time the binary invokes the OpenMP runtime library found in LD_LIBRARY_PATH (libgomp.so, to be replaced with libminiomp.so), which is built on top of Pthreads.]
The assignment is divided into two parts, each one implementing different functionalities of the runtime:
• Part 1 – Implementation of parallel regions and thread synchronizations (barrier and critical). This corresponds to sections 2 to 4 in this document.
• Part 2 – Implementation of the tasking model: task, taskloop and task synchronizations. This corresponds to sections 5 to 7 in this document.
In each part some optional components will be proposed.
As always in this course, the first thing to do is to copy the compressed file with all the files that you will need to do the assignment. Once located in your "home" directory:
> ./setup.sh lab3
> cd sessions/miniomp
This will leave in the miniomp directory all the files necessary to do this laboratory assignment, organised in three different directories:
• src with a set of files where you will implement the functionalities requested for the runtime (most
of them are empty at this moment or just contain the prototypes for the functions to be implemented
during the sessions devoted to this laboratory assignment). The directory also includes a Makefile
to compile the library.
• lib where the compiled libminiomp.so library will be generated.
• test with some simple codes to benchmark the library at the different stages of its implementation.
The directory also contains a Makefile to compile them and scripts to execute and instrument their
parallel execution. You should extend this basic set in order to make sure that you appropriately
test all functionalities requested and their performance.
Next proceed to do the following steps in order to check that everything is appropriately setup:
1. Go into src inside the miniomp directory, list the existing files to get familiar with their names
and type "make libminiomp.so". This should compile the miniomp library in its current imple-
mentation status and generate a .so file in lib. No errors should be reported at this point.
2. Next go to the test directory and compile the first OpenMP benchmark code by typing "make
tparallel1-omp" and execute it by submitting the binary to the execution queue: "sbatch
submit-omp.sh tparallel1-omp 8". Check that the result is the expected one by inspecting
the source code of tparallel1.c and verifying the correctness of the output returned.
3. You can also check if the functionality of miniomp conforms to the original gomp. Type "make
tparallel1-gomp" to compile and "sbatch submit-omp.sh tparallel1-gomp 8" to submit its
execution. Check if the result returned is the same as with miniomp.
4. Of course you can (I mean, should) generate Extrae traces for Paraver visualisation. Just use the submit-extrae.sh script with the same arguments to submit the execution and generate the trace, and then visualise it with wxparaver. Compare the two traces generated with "sbatch submit-extrae.sh tparallel1-omp 8" and "sbatch submit-extrae.sh tparallel1-gomp 8", after loading each trace and using the OpenMP → Implicit tasks in parallel constructs hint.
The paths defined in environment.bash (that you should source when initiating a session) assume that the miniomp directory has been uncompressed in your home directory. We recommend that you follow this unless you want to set up the MINIOMP and LD_LIBRARY_PATH environment variables in a different way.
2 Implementing parallel
Before going into the assignment, let's first write a trivial implementation of the OpenMP parallel region using Pthreads. In this implementation threads are created and terminated in each parallel region, paying the full overhead of thread management. In order to see which functions from the original gomp library are invoked by gcc, we can take a look at the assembly code generated by the compiler. Go into the test directory and type "make tparallel1-asm". Open with the editor the tparallel1-asm file generated and look for function invocations starting with GOMP. The usual interface used by gcc to activate a parallel region is:
void GOMP_parallel (void (*fn) (void *), void *data, unsigned num_threads, unsigned int flags);
which receives the pointer fn to the function that encapsulates the body of the parallel region, to be executed by each thread, and a pointer data to a structure used to communicate data in and out of that function. The number of threads num_threads is 1 if an if clause is present and evaluates to false, the value of the num_threads clause if present, or 0 otherwise; flags is related with the proc_bind clause, which you do not have to implement. Please also take a look at how the compiler encapsulates the bodies of the parallel regions into functions called foo._omp_fn.0 to foo._omp_fn.4, together with the 5 invocations of GOMP_parallel, trying to relate the assembly and the source C code.
The current definition and implementation of GOMP_parallel is found in parallel.c and parallel.h, which contain the data definitions and a serial implementation of the function invoked by gcc to implement parallel regions. Let's jointly modify it to make it parallel. Once done, we will check its correctness and generate a trace to visualize its behaviour.
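As a reference, a minimal sketch of such a trivial implementation could look as follows; the miniomp_wrapper_t helper type and the default number of threads are illustrative assumptions, not part of the provided skeleton:

#include <pthread.h>

typedef struct {
    void (*fn)(void *);   // outlined function with the body of the region
    void *data;           // argument block prepared by the compiler
} miniomp_wrapper_t;      // illustrative helper type

static void *worker(void *arg) {
    miniomp_wrapper_t *w = (miniomp_wrapper_t *) arg;
    w->fn(w->data);       // each created thread executes the body once
    return NULL;
}

void GOMP_parallel(void (*fn)(void *), void *data,
                   unsigned num_threads, unsigned flags) {
    if (num_threads == 0)
        num_threads = 4;  // hypothetical default; in real code it should
                          // come from OMP_NUM_THREADS or the num_threads ICV
    pthread_t tid[num_threads];
    miniomp_wrapper_t w = { fn, data };
    for (unsigned i = 1; i < num_threads; i++)   // thread 0 is the caller
        pthread_create(&tid[i], NULL, worker, &w);
    fn(data);                                    // the caller also runs the body
    for (unsigned i = 1; i < num_threads; i++)   // join acts as the implicit
        pthread_join(tid[i], NULL);              // barrier ending the region
}

The wrapper is needed because the signature of a Pthreads start routine differs from that of fn; it also leaves room to store per-thread information later, such as the identifier returned by omp_get_thread_num.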
The runtime library also provides the implementation of some OpenMP API calls (such as omp_get_thread_num and omp_get_num_threads); you may need to modify their implementation depending on what you decide to do in your implementation.
How to test your implementation?
• To test your implementation please use the OpenMP program in file tparallel1.c. Check if the
output of the program is what you would expect and check if the functionality of miniomp conforms
to gomp. Observe that all the data clauses (shared, private, firstprivate and reduction) are
handled by the compiler, i.e. you don’t need to do anything in your runtime implementation
for them. We also recommend testing your implementation by running the Extrae instrumented version of tparallel1 and visualising the traces generated with Paraver (queue the execution of the ./submit-extrae.sh script with the appropriate arguments and use wxparaver to visualise the trace generated).
• Once you have checked correctness, you can test your implementation with tparallel2.c. With this new test program you can check both correctness and performance, comparing again with the implementation using gomp. Again, we recommend also testing your implementation by running the Extrae instrumented version of tparallel2 and visualising the traces generated with Paraver.
3 Thread synchronisations
Next you will do a restricted implementation of some of the OpenMP thread synchronisation constructs, using the mechanisms offered by Pthreads:
• Explicit barrier.
• Unnamed critical regions (i.e. no name provided in the critical directive).
• Named critical regions (i.e. with a name provided in the critical directive).
The atomic construct is handled by the compiler directly, generating machine instructions preceded by the lock prefix, which forces the memory access to be executed atomically.
The files that you need to look at or modify in this section are: libminiomp.c and libminiomp.h
which should contain the allocation/initialization of the synchronization objects implemented in this
section; and synchronization.c and synchronization.h which contain the data definitions and pro-
totypes for the functions invoked by gcc to implement unnamed critical sections and explicit barrier
constructs. The declarations of the global variables used to implement them (miniomp_default_lock and miniomp_barrier) are included in synchronization.h.
In order to see which functions from the original gomp library are invoked by gcc, we can take a look
at the assembly code generated by the compiler for tsynch1.c. Open with the editor the tsynch1-asm
generated in your last compilation and look for function invocations starting with GOMP. As you can see,
the interface used by gcc to enter and exit an unnamed critical section is as follows:
void GOMP_critical_start (void);
void GOMP_critical_end (void);
while for the implementation of the named version the compiler includes one argument that points to a void * associated with the name provided by the programmer in the pragma:
void GOMP_critical_name_start (void **pptr);
void GOMP_critical_name_end (void **pptr);
Finally, for the implementation of a barrier the compiler simply injects a call to:
void GOMP_barrier(void);
All the initialisation needed to properly execute the barrier should be done somewhere in your code before the invocation to GOMP_barrier.
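A direct way to implement it is to rely on Pthreads barriers. The sketch below assumes that miniomp_barrier is the global declared in synchronization.h and that it has been initialised elsewhere (for example when the parallel region is created) with the number of threads in the team:

#include <pthread.h>

pthread_barrier_t miniomp_barrier;   // declared in synchronization.h

void GOMP_barrier(void) {
    // Blocks until as many threads as specified when the barrier was
    // initialised (pthread_barrier_init) have reached this point.
    pthread_barrier_wait(&miniomp_barrier);
}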
Also verify in the assembly code how the compiler performs the translation for the atomic construct, looking for additional information about the lock prefix.
What to do?
1. Do the implementation of GOMP_critical_start, GOMP_critical_name_start, GOMP_critical_end and GOMP_critical_name_end using Pthreads mutexes.
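A minimal sketch for the four functions follows. The unnamed case simply uses the miniomp_default_lock global; for the named case, the lazy allocation shown is racy and is kept only for brevity (a correct version should install the lock atomically, for example with __sync_bool_compare_and_swap):

#include <pthread.h>
#include <stdlib.h>

pthread_mutex_t miniomp_default_lock = PTHREAD_MUTEX_INITIALIZER;

void GOMP_critical_start(void) {
    pthread_mutex_lock(&miniomp_default_lock);
}

void GOMP_critical_end(void) {
    pthread_mutex_unlock(&miniomp_default_lock);
}

// gcc passes a pointer to a per-name void * that the runtime can use to
// cache a lock associated with that name; allocate it on first use.
void GOMP_critical_name_start(void **pptr) {
    pthread_mutex_t *lock = *pptr;
    if (lock == NULL) {                  // racy lazy allocation (see above)
        lock = malloc(sizeof(pthread_mutex_t));
        pthread_mutex_init(lock, NULL);
        *pptr = lock;
    }
    pthread_mutex_lock(lock);
}

void GOMP_critical_name_end(void **pptr) {
    pthread_mutex_unlock((pthread_mutex_t *) *pptr);
}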
4
In the second part of this laboratory assignment you will implement a simplified version of the tasking model in OpenMP, including the single work-sharing and task constructs, as well as the basic taskwait task synchronization construct. But before continuing with the second part, a couple of remarks:
1. To make a more complete testing of your implementation up to this point, you can use the OpenMP version, making use of implicit tasks, from laboratory assignment Lab1 (Sieve of Eratosthenes). Check correctness and performance, for example by doing a strong scalability evaluation and trace analysis with Paraver, to make sure you understand where differences appear when compared to the original gomp implementation.
2. Although optional, and highly recommended towards a high mark in the transversal competence, we propose extending your implementation in order to relax some of the constraints initially defined and/or to consider some additional features initially left out. You don't have to do them now; you can advance with Part 2 of the assignment and then come back to do some of the optimizations here or the others we will propose in Part 2. For now, you can consider:
• Testing and evaluating your implementation with any other program you consider appropriate, always making use of the features implemented so far.
• Implementation of your own locks, using the atomic intrinsic operations available in the gcc compiler (a minimal sketch is shown after this list).
• Implementation of your own barrier construct, using the atomic intrinsic operations available
for the gcc compiler or any other synchronization object available for Pthreads.
• Adding thread-to-processor affinity in your implementation of parallel, so that threads are mapped to processors in a fixed way (you can do a simple interpretation of the OMP_PROC_BIND environment variable, assuming that values CLOSE and SPREAD correspond to consecutive threads in the same socket or in consecutive sockets).
• Nested parallel regions: basically, each parallel region is totally independent, with its own
barriers, worksharing (single) constructs and number of threads. The only construct that
is global to all parallel regions is critical. Extend your implementation to consider test
programs such as tnested.c.
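For the own-locks item above, a minimal spin-lock sketch based on the gcc __atomic builtins could serve as a starting point; the type and function names are illustrative:

typedef struct {
    volatile int locked;                  // 0 = free, 1 = taken
} miniomp_spinlock_t;                     // illustrative type

static void spin_lock(miniomp_spinlock_t *l) {
    // Atomically set locked to 1; keep trying while the previous value was 1.
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
        while (l->locked) ;               // spin on plain reads while taken
}

static void spin_unlock(miniomp_spinlock_t *l) {
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}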
5 Implementing single
The usual pattern in parallel programs using tasking is based on the use of the single worksharing construct, which restricts the execution of a code region to only one thread. This region of code is where tasks are initially generated for the other threads to execute; those threads wait at the implicit barrier at the end of the construct, checking for the availability of tasks to execute. In your implementation of single you don't need to support the nowait clause (which would imply that several consecutive single constructs may be active at a time); you can leave the implementation of this clause as optional.
For this part, in addition to the files that you already know, you will need to modify the single.c file, which contains the prototype for the function invoked by gcc to implement single. The declaration of the global variable used to implement it (miniomp_single) should be completed in single.h.
In order to see which functions from the original gomp library are invoked by gcc, we can take a look
at the assembly code generated by the compiler for the test file tsingle.c: type "make tsingle-asm"
and open with the editor the tsingle-asm file generated. Look for function invocations starting with
GOMP; the following function is used to implement single:
bool GOMP_single_start (void);
which is called by all threads encountering the single but returns true only for the thread that should execute the body of the single construct, i.e. the first one reaching it. Looking at the assembly code,
you can see how the result of the function is used.
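A minimal sketch without nowait support is shown below: the first thread that atomically flips the flag executes the body. The flag has to be reset at a safe point, for example in the barrier that closes the construct, which is omitted here; treating miniomp_single as a plain integer flag is one possible completion of the declaration in single.h:

#include <stdbool.h>

int miniomp_single = 0;   // 0 = body not yet taken by any thread

bool GOMP_single_start(void) {
    // The first thread that swaps 0 -> 1 wins and executes the body;
    // all the others get false and skip it.
    return __sync_bool_compare_and_swap(&miniomp_single, 0, 1);
}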
What to do?
1. Do the implementation of GOMP_single_start, without considering the support for nowait.
2. Optionally, section 7 will propose improving your implementation to consider the possibility of having nowait, bypassing the implicit barrier at the end of the worksharing construct and allowing multiple instances of the same single region to be active.
How to test your implementation? To test your implementation please first use the OpenMP program in file tsingle.c. As in the previous section, check if the output of the program is what you would expect and whether the functionality of miniomp conforms to gomp.
6 Implementing task
Next, in this second part of the laboratory assignment you will do a restricted implementation of the tasking model in OpenMP, which should include:
• Support for the if clause.
• No support for task dependencies.
In addition to the files you already know, the files that you need to look at or modify in this section are:
task.c and task.h which contain the prototypes for the functions invoked by gcc to implement task, as well as the declaration of the main global variable used to implement them: miniomp_taskqueue.
In order to see which functions from the original gomp library are invoked by gcc, we can take a look
at the assembly code generated by the compiler for ttask1.c. Open with the editor the ttask1-asm
generated when typing "make ttask1-asm" and look for function invocations starting with GOMP. The
following function is invoked every time a task is found:
void GOMP_task (void (*fn) (void *), void *data,
void (*cpyfn) (void *, void *), long arg_size, long arg_align,
bool if_clause, unsigned flags, void **depend, int priority);
which you can easily match to the original OpenMP directives. For GOMP_task we will only consider the arguments related with the pointer fn to the function that encapsulates the body of the task and the pointer data to the arguments needed to execute it. The compiler may provide a helper function cpyfn and some additional arguments to correctly access them (arg_size and arg_align). There are also some flags related to additional clauses in the task construct, as well as the information necessary for task dependencies, which we will not consider in this implementation (except for the if clause, whose value arrives in if_clause).
We already provide you with the implementation of a circular task queue, with the following interface:
miniomp_taskqueue_t * TQinit(int max_elements); // Initialises the task queue
bool TQis_empty(miniomp_taskqueue_t *task_queue); // Checks if the task queue is empty
bool TQis_full(miniomp_taskqueue_t *task_queue); // Checks if the task queue is full
void TQenqueue(miniomp_taskqueue_t *task_queue, miniomp_task_t *task_descriptor); // Enqueues
// the task descriptor at the tail of the task queue
miniomp_task_t * TQdequeue(miniomp_taskqueue_t *task_queue); // Dequeues the task descriptor
// at the head of the task queue
The definition of this interface as well as the types miniomp_taskqueue_t and miniomp_task_t can be found in task.h. A non-thread-safe implementation of the interface is in task.c. You don't necessarily have to use this definition and implementation; you can build your own data structure and implementation for the task queue.
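As a reference, a minimal sketch of GOMP_task on top of that interface follows. It assumes (this is not guaranteed by the handout) that miniomp_task_t has fn and data fields, and it serialises queue accesses with a hypothetical miniomp_taskqueue_lock mutex, since the provided queue implementation is not thread-safe:

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include "task.h"   // miniomp_taskqueue_t, miniomp_task_t, TQ* interface

pthread_mutex_t miniomp_taskqueue_lock = PTHREAD_MUTEX_INITIALIZER; // hypothetical
extern miniomp_taskqueue_t *miniomp_taskqueue;

void GOMP_task(void (*fn)(void *), void *data,
               void (*cpyfn)(void *, void *), long arg_size, long arg_align,
               bool if_clause, unsigned flags, void **depend, int priority) {
    if (!if_clause) {             // if(false): undeferred, execute immediately
        fn(data);
        return;
    }
    // Deferred task: the argument block must be copied, since the stack
    // frame of the encountering function may be gone before execution.
    miniomp_task_t *t = malloc(sizeof(miniomp_task_t));
    t->fn = fn;                   // assumed field names (see lead-in)
    t->data = malloc(arg_size);   // malloc alignment assumed to satisfy arg_align
    if (cpyfn) cpyfn(t->data, data);
    else       memcpy(t->data, data, arg_size);
    pthread_mutex_lock(&miniomp_taskqueue_lock);
    if (TQis_full(miniomp_taskqueue)) {
        pthread_mutex_unlock(&miniomp_taskqueue_lock);
        t->fn(t->data);           // fallback: execute immediately
        free(t->data); free(t);
        return;
    }
    TQenqueue(miniomp_taskqueue, t);
    pthread_mutex_unlock(&miniomp_taskqueue_lock);
}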
What to do?
1. Do the implementation of the restricted GOMP_task.
2. Update the implementation of the implicit barrier at the end of parallel and/or single so that threads arriving there can grab tasks from the task queue if tasks are available for execution. Threads should leave the barrier once all threads have arrived at it and all tasks in the task pool have been executed (a sketch of this idea is shown after this list).
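The sketch below illustrates the idea in step 2. All the names are illustrative; the barrier is not reusable as written (the counter is never reset), and tasks still being executed by another thread (which may create children) would need an additional in-flight counter before the exit condition is really safe:

extern miniomp_taskqueue_t *miniomp_taskqueue;
extern pthread_mutex_t miniomp_taskqueue_lock;   // from the GOMP_task sketch

static int arrived = 0;        // threads that have reached the barrier
static int team_size = 4;      // illustrative; set when the team is created

void task_aware_barrier(void) {
    __atomic_add_fetch(&arrived, 1, __ATOMIC_SEQ_CST);
    for (;;) {
        // Try to grab a pending task and execute it while waiting.
        pthread_mutex_lock(&miniomp_taskqueue_lock);
        miniomp_task_t *t = TQis_empty(miniomp_taskqueue) ? NULL
                          : TQdequeue(miniomp_taskqueue);
        pthread_mutex_unlock(&miniomp_taskqueue_lock);
        if (t) { t->fn(t->data); continue; }
        // Leave only when everybody has arrived and no tasks remain.
        if (__atomic_load_n(&arrived, __ATOMIC_SEQ_CST) == team_size)
            break;
    }
}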
How to test your implementation? To test your implementation please use the OpenMP program
in file ttask1.c. Once verified, use the program in file ttask2.c to test functionality and performance compared to the gomp implementation.
6.1 Taskwait synchronisation
To implement the taskwait construct, the gcc compiler injects a call to:
void GOMP_taskwait (void);
which you can easily match to the original OpenMP directive. You will also find invocations to two other functions that implement taskgroup; we leave their implementation for the optional part of the assignment, and for now it is implemented as a taskwait.
What to do?
1. Do the implementation of GOMP_taskwait.
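A minimal sketch is shown next: the encountering thread helps execute pending tasks until the queue drains. Strict taskwait semantics only require waiting for the child tasks of the current task, and the version below also fails to wait for tasks currently being executed by other threads, so take it only as a starting point; the locking names are the same assumptions as in the GOMP_task sketch:

void GOMP_taskwait(void) {
    for (;;) {
        pthread_mutex_lock(&miniomp_taskqueue_lock);
        miniomp_task_t *t = TQis_empty(miniomp_taskqueue) ? NULL
                          : TQdequeue(miniomp_taskqueue);
        pthread_mutex_unlock(&miniomp_taskqueue_lock);
        if (!t) break;        // no pending tasks left in the queue
        t->fn(t->data);       // execute a pending task while waiting
    }
}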
7
To make a final testing of your implementation up to this point, you can use the OpenMP versions of the two previous laboratory assignments (sieve and kmeans), either with implicit or explicit tasks. Try to avoid functionalities not implemented in miniomp, replacing them with other mechanisms. Check correctness and performance, for example by doing a strong scalability evaluation and trace analysis with Paraver, to make sure you understand where differences appear when compared to the original gomp implementation.
Finally, although optional, and highly recommended towards a high mark in the transversal competence, we propose adding some additional functionalities to your implementation: nowait for single, taskgroup, taskloop, and reductions in the tasking model. They are listed in increasing order of complexity, but also of interest to learn and challenge yourself.
1. Do the implementation of GOMP_taskgroup_start and GOMP_taskgroup_end, considering no nesting of taskgroup constructs.
How to test your implementation? To test your implementation please use the OpenMP program
in file tsynchtasks.c which exercises the use of the task synchronisation constructs.
The gcc compiler invokes the following function to implement the taskloop functionalities:
void GOMP_taskloop (void (*fn) (void *), void *data, void (*cpyfn) (void *, void *),
long arg_size, long arg_align, unsigned flags,
unsigned long num_tasks, int priority,
long start, long end, long step)
Most of the arguments are similar to the arguments of GOMP_task; the different ones are briefly described next. num_tasks is used to indicate either the number of tasks to generate (if the num_tasks clause is used) or the granularity of the tasks to generate (if the grainsize clause is used); one of the bits in flags is used to indicate which of the two options applies. Arguments start, end and step capture the iteration bounds of the loop to which the taskloop construct applies. You can check all this by looking at the assembly code generated for ttaskloop.c.
What to do?
1. Do the implementation of the restricted GOMP_taskloop. Your implementation should handle the possibility of specifying one of num_tasks or grainsize, or none of them, in which case the number of tasks could be whatever you consider appropriate, as for example the number of threads in the parallel region or something proportional to it. A minimal sketch is shown below.
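The sketch divides the iteration space [start, end) into num_tasks chunks and creates one task per chunk by rewriting the bounds in a private copy of the argument block. Following gomp's convention, the first two long fields of that block are assumed to hold the chunk's bounds (an assumption worth verifying in the assembly generated for ttaskloop.c); decoding the grainsize bit in flags is omitted, and each chunk is executed immediately instead of being deferred:

#include <string.h>
#include <stdlib.h>

void GOMP_taskloop(void (*fn)(void *), void *data,
                   void (*cpyfn)(void *, void *), long arg_size, long arg_align,
                   unsigned flags, unsigned long num_tasks, int priority,
                   long start, long end, long step) {
    if (num_tasks == 0) num_tasks = 4;   // fallback, e.g. number of threads
    long niters = (end - start + step - 1) / step;       // assumes step > 0
    long chunk  = (niters + num_tasks - 1) / num_tasks;  // iterations per task
    for (unsigned long t = 0; t < num_tasks; t++) {
        long lo = start + (long) t * chunk * step;
        long hi = lo + chunk * step;
        if (lo >= end) break;            // no iterations left for this task
        if (hi > end) hi = end;
        char *arg = malloc(arg_size);    // private copy of the argument block
        if (cpyfn) cpyfn(arg, data); else memcpy(arg, data, arg_size);
        ((long *) arg)[0] = lo;          // chunk lower bound (assumed layout)
        ((long *) arg)[1] = hi;          // chunk upper bound (assumed layout)
        fn(arg);                         // a better version would defer this
        free(arg);                       // through the task queue instead
    }
}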
How to test your implementation? To test your implementation please use the OpenMP program in
file ttaskloop.c. You should also use the OpenMP version of sieve1.c. Be aware of the fact that task reductions have not been implemented up to this point, so if your version of sieve2.c includes them you should replace them with another synchronization mechanism (for example, atomic). Check correctness and performance, for example by doing a strong scalability evaluation and trace analysis with Paraver, to make sure you understand where differences appear when compared to the original gomp.
To support reductions in the tasking model, the gcc compiler invokes the following functions:
• void GOMP_taskgroup_reduction_register, which should register each variable in the list of reduction variables, allocating enough space to store per-thread copies for each of them, all allocated in consecutive memory locations.
• void GOMP_taskgroup_reduction_unregister, which should unregister each variable in the list of reduction variables, deallocating the space for them.
• void GOMP_task_reduction_remap, which should remap the original address of each reduction variable to the new address for the specific thread executing the task.
Additional details for all of them can be found in the source code in taskreductions.c. You have the ttaskreduction.c file to test your implementation. You can also test the versions of sieve1.c and sieve2.c that make use of reductions to protect the access to the shared variable in the program.