
UNIT-2

PARALLEL PROGRAMMING CHALLENGES

PERFORMANCE

(1) Speedup and efficiency

The goal for speedup and efficiency is to divide the work equally among the cores, while at the same time introducing no additional work for the cores.

If we succeed in doing this, and we run our program with p cores, one thread or process on each core, then our parallel program will run p times faster than the serial program. If we call the serial run-time Tserial and our parallel run-time Tparallel, then the best we can hope for is Tparallel = Tserial / p. When this happens, we say that our parallel program has linear speedup.

In practice, we’re unlikely to get linear speedup because the use of multiple processes/threads
almost invariably introduces some overhead. For example, shared memory programs will almost
always have critical sections, which will require that we use some mutual exclusion mechanism
such as a mutex.

The calls to the mutex functions are overhead that’s not present in the serial program, and the use
of the mutex forces the parallel program to serialize execution of the critical section.

Distributed-memory programs will almost always need to transmit data across the network, which is usually much slower than local memory access. Serial programs, on the other hand, won't have these overheads.

Thus, it will be very unusual for us to find that our parallel programs get linear speedup. Furthermore, it's likely that the overheads will increase as we increase the number of processes or threads; that is, more threads will probably mean more threads need to access a critical section, and more processes will probably mean more data needs to be transmitted across the network. So if we define the speedup of a parallel program to be

S = Tserial / Tparallel

then linear speedup has S = p, which is unusual. Furthermore, as p increases, we expect S to become a smaller and smaller fraction of the ideal, linear speedup p.

Another way of saying this is that S/p will probably get smaller and smaller as p increases. This value, S/p, is sometimes called the efficiency of the parallel program. If we substitute the formula for S, we see that the efficiency is

E = S/p = (Tserial / Tparallel) / p = Tserial / (p × Tparallel)

When we increase the problem size, the speedups and the efficiencies increase, while they decrease when we decrease the problem size.

This behavior is quite common. Many parallel programs are developed by dividing the work of the serial program among the processes/threads and adding in the necessary "parallel overhead," such as mutual exclusion or communication. Therefore, if Toverhead denotes this parallel overhead, it's often the case that

Tparallel = Tserial / p + Toverhead

Furthermore, as the problem size is increased, Toverhead often grows more slowly than Tserial. When this is the case, the speedup and the efficiency will increase. This is what your intuition should tell you: there's more work for the processes/threads to do, so the relative amount of time spent coordinating the work of the processes/threads should be less.
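
To make the definitions concrete, the minimal sketch below computes S and E directly from two measured run-times; the timings and core count are hypothetical placeholders.

#include <stdio.h>

int main(void)
{
    /* Hypothetical measured run-times in seconds (placeholders). */
    double t_serial   = 20.0;   /* Tserial */
    double t_parallel = 4.0;    /* Tparallel, measured with p cores */
    int    p          = 8;      /* number of cores used */

    double speedup    = t_serial / t_parallel;   /* S = Tserial / Tparallel */
    double efficiency = speedup / p;             /* E = S / p */

    printf("Speedup    S = %f\n", speedup);
    printf("Efficiency E = %f\n", efficiency);
    return 0;
}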

A final issue to consider is what values of Tserial should be used when reporting speedups and
efficiencies. Some authors say that Tserial should be the run-time of the fastest program on the
fastest processor available.

Other authors say that Tserial should be the run-time of the serial program on which the parallel program was based, run on a single processor of the parallel system. So, when reporting the performance of a parallel shell sort program, authors in the first group might use a serial radix sort or quicksort on a single core of the fastest system available, while authors in the second group would use a serial shell sort on a single processor of the parallel system.

(2) Amdahl’s law


Back in the 1960s, Gene Amdahl made an observation [2] that's become known as Amdahl's law. It says, roughly, that unless virtually all of a serial program is parallelized, the possible speedup is going to be very limited, regardless of the number of cores available.
Suppose, for example, that we're able to parallelize 90% of a serial program. Further suppose that the parallelization is "perfect," that is, regardless of the number of cores p we use, the speedup of this part of the program will be p. If the serial run-time is Tserial = 20 seconds, then the run-time of the parallelized part will be 0.9Tserial/p = 18/p and the run-time of the "unparallelized" part will be 0.1Tserial = 2. The overall parallel run-time will be

Tparallel = 0.9Tserial/p + 0.1Tserial = 18/p + 2

And the speedup will be

S = Tserial / (0.9Tserial/p + 0.1Tserial) = 20 / (18/p + 2)

Now as p gets larger and larger, 0.9Tserial/p = 18/p gets closer and closer to 0, so the total parallel run-time can't be smaller than 0.1Tserial = 2. That is, the denominator in S can't be smaller than 0.1Tserial = 2. The fraction S must therefore satisfy

S ≤ Tserial / (0.1Tserial) = 20/2 = 10

That is, S ≤ 10. This is saying that even though we've done a perfect job in parallelizing 90% of the program, and even if we have, say, 1000 cores, we'll never get a speedup better than 10.

More generally, if a fraction r of our serial program remains unparallelized, then Amdahl's law says we can't get a speedup better than 1/r. In our example, r = 1 - 0.9 = 1/10, so we couldn't get a speedup better than 10. Therefore, if a fraction r of our serial program is "inherently serial," that is, cannot possibly be parallelized, then we can't possibly get a speedup better than 1/r. Thus, even if r is quite small, say 1/100, and we have a system with thousands of cores, we can't possibly get a speedup better than 100.
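
The bound is easy to evaluate directly. The following sketch (with an illustrative amdahl_speedup() helper) computes S = 1 / (r + (1 - r)/p) for the example's r = 0.1 and shows S approaching, but never exceeding, 1/r = 10:

#include <stdio.h>

/* Amdahl's law: speedup with p cores when a fraction r of the program
   is inherently serial and the remainder parallelizes perfectly. */
double amdahl_speedup(double r, int p)
{
    return 1.0 / (r + (1.0 - r) / p);
}

int main(void)
{
    double r = 0.1;                       /* 10% serial, as in the example */
    int cores[] = { 10, 100, 1000 };
    for (int i = 0; i < 3; i++)
        printf("p = %4d   S = %f\n", cores[i], amdahl_speedup(r, cores[i]));
    /* S approaches 1/r = 10 as p grows, but never exceeds it. */
    return 0;
}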

There are several reasons not to be too worried by Amdahl’s law. First, it doesn’t take into
consideration the problem size. For many problems, as we increase the problem size the
“inherently serial” fraction of the program decreases in size; a more mathematical version of this
statement is known as Gustafson’s law [25]. Second, there are thousands of programs used by
scientists and engineers that routinely obtain huge speedups on large distributed-memory
systems.

(3) Scalability
The word “scalable” has a wide variety of informal uses. Indeed, we’ve used it

several times already. Roughly speaking, a technology is scalable if it can handle

ever-increasing problem sizes. However, in discussions of parallel program performance,


scalability has a somewhat more formal definition.

Suppose we run a parallel program with a fixed number of processes/threads and a fixed input
size, and we obtain an efficiency E. Suppose we now increase the number of processes/threads
that are used by the program. If we can find a corresponding rate of increase in the problem size
so that the program always has efficiency E, then the program is scalable.

As an example, suppose that Tserial = n, where the units of Tserial are microseconds and n is also the problem size. Also suppose that Tparallel = n/p + 1. Then the efficiency is

E = n / (p × (n/p + 1)) = n / (n + p)

To see if the program is scalable, we increase the number of processes/threads by a factor of k, and we want to find the factor x that we need to increase the problem size by so that E is unchanged. The number of processes/threads will be kp and the problem size will be xn, and we want to solve the following equation for x:

E = n / (n + p) = xn / (xn + kp)

If x = k, there will be a common factor of k in the denominator, xn + kp = kn + kp = k(n + p), and we can reduce the fraction to get

xn / (xn + kp) = kn / (k(n + p)) = n / (n + p) = E

In other words, if we increase the problem size at the same rate that we increase the number of processes/threads, the efficiency is unchanged, and the program is scalable.

If, when we increase the number of processes/threads, we can keep the efficiency fixed without increasing the problem size, the program is said to be strongly scalable. If we can keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, then the program is said to be weakly scalable. The program in our example would be weakly scalable.
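
For the example above (Tserial = n, Tparallel = n/p + 1, so E = n/(n + p)), a small sketch can confirm weak scalability numerically; the baseline values of n and p are arbitrary placeholders:

#include <stdio.h>

/* Efficiency for the example Tserial = n and Tparallel = n/p + 1,
   which gives E = n / (n + p). */
double efficiency(double n, double p)
{
    return n / (n + p);
}

int main(void)
{
    double n = 1000.0;   /* arbitrary baseline problem size */
    double p = 8.0;      /* arbitrary baseline number of processes/threads */

    for (int k = 1; k <= 16; k *= 2)
        /* Scale the problem size at the same rate as the core count:
           the efficiency stays fixed, so the program is weakly scalable. */
        printf("k = %2d   E = %f\n", k, efficiency(k * n, k * p));
    return 0;
}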

(4) Taking timings


The first thing to note is that there are at least two different reasons for taking timings. During program development we may take timings in order to determine if the program is behaving as we intend. For example, in a distributed-memory program we might be interested in finding out how much time the processes are spending waiting for messages, because if this value is large, there is almost certainly something wrong either with our design or our implementation.

On the other hand, once we've completed development of the program, we're often interested in determining how good its performance is. Perhaps surprisingly, the way we take these two timings is usually different. For the first timing, we usually need very detailed information about where the program spends its time.

Second, we’re usually not interested in the time that elapses between the program’s start and the
program’s finish. We’re usually interested only in some part of the program. For example, if we
write a program that implements bubble sort, we’re probably only interested in the time it takes
to sort the keys, not the time it takes to read them in and print them out. We probably can’t use
something like the Unix shell command time, which reports the time taken to run a program
from start to finish. Third, we’re usually not interested in “CPU time.” This is the time reported
by the standard C function clock. It’s the total time the program spends in code executed as part
of the program. It would include the time for code we’ve written; it would include the time we
spend in library functions such as pow or sin; and it would include the time the operating system
spends in functions we call, such as printf and scanf. It would not include time the program was
idle, and this could be a problem.

The function Get_current_time() is a hypothetical function that's supposed to return the number of seconds that have elapsed since some fixed time in the past; it's just a placeholder. The actual function that is used will depend on the API. For example, MPI has a function MPI_Wtime that could be used here, and the OpenMP API for shared-memory programming has a function omp_get_wtime. Both functions return wall clock time instead of CPU time.

There may be an issue with the resolution of the timer function. The resolution is the unit of measurement on the timer: it's the duration of the shortest event that can have a nonzero time.

When we’re timing parallel programs, we need to be a little more careful about how the timings
are taken. In our example, the code that we want to time is probably being executed by multiple
processes or threads and our original timing will result in the output of p elapsed times.

private double start, finish;
...
start = Get_current_time();
/* Code that we want to time */
...
finish = Get_current_time();
printf("The elapsed time = %e seconds\n", finish - start);

However, what we're usually interested in is a single time: the time that has elapsed from when the first process/thread began execution of the code to the time the last process/thread finished execution of the code. We often can't obtain this exactly, since there may not be any correspondence between the clock on one node and the clock on another node. We usually settle for a compromise that looks something like this:

shared double global_elapsed;
private double my_start, my_finish, my_elapsed;

/* Synchronize all processes/threads */
Barrier();
my_start = Get_current_time();
/* Code that we want to time */
...
my_finish = Get_current_time();
my_elapsed = my_finish - my_start;

/* Find the max across all processes/threads */
global_elapsed = Global_max(my_elapsed);
if (my_rank == 0)
    printf("The elapsed time = %e seconds\n", global_elapsed);

Here, we first execute a barrier function that approximately synchronizes all of the processes/threads. We would like for all the processes/threads to return from the call simultaneously, but such a function usually can only guarantee that all the processes/threads have started the call when the first process/thread returns. We then execute the code as before and each process/thread finds the time it took. Then all the processes/threads call a global maximum function, which returns the largest of the elapsed times, and process/thread 0 prints it out.
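
As a concrete illustration, here is a minimal sketch of the same pattern using OpenMP's omp_get_wtime (which returns wall clock time); the maximum is taken with a critical section rather than the hypothetical Global_max():

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double global_elapsed = 0.0;

    #pragma omp parallel
    {
        #pragma omp barrier                 /* synchronize all threads */
        double my_start = omp_get_wtime();  /* wall clock time, not CPU time */

        /* Code that we want to time would go here. */

        double my_finish  = omp_get_wtime();
        double my_elapsed = my_finish - my_start;

        #pragma omp critical                /* find the max across all threads */
        if (my_elapsed > global_elapsed)
            global_elapsed = my_elapsed;
    }

    printf("The elapsed time = %e seconds\n", global_elapsed);
    return 0;
}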

We also need to be aware of the variability in timings. When we run a program several times, it’s
extremely likely that the elapsed time will be different for each run. This will be true even if each
time we run the program we use the same input and the same systems.

It might seem that the best way to deal with this would be to report either a mean or a median run-time. However, the variability is almost always caused by outside events slowing the program down; it's unlikely that some outside event could actually make our program run faster than its best possible run-time. So instead of reporting the mean or median time, we usually report the minimum time.

Running more than one thread per core can cause dramatic increases in the variability of timings. More importantly, if we run more than one thread per core, the system will have to take extra time to schedule and deschedule threads, and this will add to the overall run-time. Therefore, we rarely run more than one thread per core.

Finally, as a practical matter, since our programs won't be designed for high-performance I/O, we'll usually not include I/O in our reported run-times.

Data Races
Data races are the most common programming error found in parallel code. A data race

occurs when multiple threads use the same data item and one or more of those threads

are updating it. It is best illustrated by an example. Suppose you have the code shown in

Listing 4.1, where a pointer to an integer variable is passed in and the function increments

the value of this variable by 4.

Listing 4.1 Updating the Value at an Address

void update(int * a)
{
    *a = *a + 4;
}

Listing 4.2 SPARC Disassembly for Incrementing a Variable Held in Memory

ld  [%o0], %o1    // Load *a
add %o1, 4, %o1   // Add 4
st  %o1, [%o0]    // Store *a

Suppose this code occurs in a multithreaded application and two threads try to increment

the same variable at the same time. Table 4.1 shows the resulting instruction stream.

Table 4.1 Two Threads Updating the Same Variable

Value of variable a = 10

Thread 1                                 Thread 2
ld  [%o0], %o1   // Load %o1 = 10        ld  [%o0], %o1   // Load %o1 = 10
add %o1, 4, %o1  // Add  %o1 = 14        add %o1, 4, %o1  // Add  %o1 = 14
st  %o1, [%o0]   // Store %o1            st  %o1, [%o0]   // Store %o1

Value of variable a = 14

In the example, each thread adds 4 to the variable, but because they do it at exactly

the same time, the value 14 ends up being stored into the variable. If the two threads had

executed the code at different times, then the variable would have ended up with the

value of 18.

This is the situation where both threads are running simultaneously. This illustrates a

common kind of data race and possibly the easiest one to visualize.

Another situation might be when one thread is running, but the other thread has

been context switched off of the processor. Imagine that the first thread has loaded the

value of the variable a and then gets context switched off the processor. When it eventually

runs again, the value of the variable a will have changed, and the final store of the

restored thread will cause the value of the variable a to regress to an old value.

Consider the situation where one thread holds the value of a variable in a register and

a second thread comes in and modifies this variable in memory while the first thread is

running through its code. The value held in the register is now out of sync with the
value held in memory.

The point is that a data race situation is created whenever a variable is loaded and

another thread stores a new value to the same variable: One of the threads is now working

with “old” data.

Data races can be hard to find. Take the previous code example to increment a variable.

It might reside in the context of a larger, more complex routine. It can be hard to

identify the sequence of problem instructions just by inspecting the code. The sequence

of instructions causing the data race is only three long, and it could be located within a

whole region of code that could be hundreds of instructions in length.

Not only is the problem hard to see from inspection, but the problem would occur

only when both threads happen to be executing the same small region of code. So even

if the data race is readily obvious and can potentially happen every time, it is quite possible

that an application with a data race may run for a long time before errors are observed.

In the example, unless you were printing out every value of the variable a and actually

saw the variable take the same value twice, the data race would be hard to detect.

The potential for data races is part of what makes parallel programming hard. It is a

common error to introduce data races into a code, and it is hard to determine, by

inspection, that one exists. Fortunately, there are tools to detect data races.

Using Tools to Detect Data Races


The code shown in Listing 4.3 contains a data race. It uses POSIX threads (see "Using POSIX Threads"). The code creates two threads, both of which execute the routine func(). The main thread then waits for both the child threads to complete their work.

Listing 4.3 Code Containing Data Race

#include <pthread.h>

int counter = 0;

void * func( void * params )
{
    counter++;
    return 0;
}

int main()
{
    pthread_t thread1, thread2;
    pthread_create( &thread1, 0, func, 0 );
    pthread_create( &thread2, 0, func, 0 );
    pthread_join( thread1, 0 );
    pthread_join( thread2, 0 );
    return 0;
}

Both threads will attempt to increment the variable counter. We can compile this code with GNU gcc and then use Helgrind, which is part of the Valgrind suite, to identify the data race. Valgrind is a tool that enables an application to be instrumented and its runtime behavior examined. The Helgrind tool uses this instrumentation to gather data about data races. Listing 4.4 shows the output from Helgrind.

Listing 4.4 Using Helgrind to Detect Data Races

$ gcc -g race.c -lpthread

$ valgrind --tool=helgrind ./a.out

...

==4742==

==4742== Possible data race during write of size 4 at 0x804a020 by thread #3

==4742== at 0x8048482: func (race.c:7)

==4742== by 0x402A89B: mythread_wrapper (hg_intercepts.c:194)

==4742== by 0x40414FE: start_thread

(in /lib/tls/i686/cmov/libpthread-2.9.so)

==4742== by 0x413849D: clone (in /lib/tls/i686/cmov/libc-2.9.so)

==4742== This conflicts with a previous write of size 4 by thread #2

==4742== at 0x8048482: func (race.c:7)

==4742== by 0x402A89B: mythread_wrapper (hg_intercepts.c:194)

==4742== by 0x40414FE: start_thread

(in /lib/tls/i686/cmov/libpthread-2.9.so)

==4742== by 0x413849D: clone (in /lib/tls/i686/cmov/libc-2.9.so)

The output from Helgrind shows that there is a potential data race between two threads,

both executing line 7 in the file race.c. This is the anticipated result, but it should be

pointed out that the tools will find some false positives. The programmer may write code

where different threads access the same variable, but the programmer may know that

there is an enforced order that stops an actual data race. The tools, however, may not be

able to detect the enforced order and will report the potential data race.

Another tool that is able to detect potential data races is the Thread Analyzer in

Oracle Solaris Studio. This tool requires an instrumented build of the application, data

collection is done by the collect tool, and the graphical interface is launched with the

command tha.

Listing 4.5 Detecting Data Races Using the Sun Studio Thread Analyzer
$ cc -g -xinstrument=datarace race.c

$ collect -r on ./a.out

Recording experiment tha.1.er ...

$ tha tha.1.er&

The initial screen of the tool displays a list of data races, as shown in Figure 4.1.

Once the user has identified the data race they are interested in, they can view the source code for the two locations in the code where the problem occurs. In the example shown in Figure 4.2, both threads are executing the same source line.
Figure 4.2 Source code with data race shown in Solaris Studio
Thread Analyzer

Avoiding Data Races

Although it can be hard to identify data races, avoiding them can be very simple: Make

sure that only one thread can update the variable at a time. The easiest way to do this is

to place a synchronization lock around all accesses to that variable and ensure that before

referencing the variable, the thread must acquire the lock. Listing 4.6 shows a modified

version of the code. This version uses a mutex lock, described in more detail in the next

section, to protect accesses to the variable counter. Although this ensures the correctness

of the code, it does not necessarily give the best performance.


Listing 4.6 Code Modified to Avoid Data Races

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

void * func( void * params )
{
    pthread_mutex_lock( &mutex );
    counter++;
    pthread_mutex_unlock( &mutex );
    return 0;
}

Synchronization Primitives
Synchronization is used to coordinate the activity of multiple threads. There are various

situations where it is necessary; this might be to ensure that shared resources are not

accessed by multiple threads simultaneously or that all work on those resources is complete

before new work starts.

Most operating systems provide a rich set of synchronization primitives. It is usually

most appropriate to use these rather than attempting to write custom methods of synchronization.

There are two reasons for this. Synchronization primitives provided by the

operating system will usually be recognized by the tools provided with that operating

system. Hence, the tools will be able to do a better job of detecting data races or correctly

labeling synchronization costs. The operating system will often provide support for

sharing the primitives between threads or processes, which can be hard to do efficiently

without operating system support.

(1) Mutexes and Critical Regions


The simplest form of synchronization is a mutually exclusive (mutex) lock. Only one

thread at a time can acquire a mutex lock, so they can be placed around a data structure

to ensure that the data structure is modified by only one thread at a time.
Placing Mutex Locks Around Accesses to Variables:

int counter;
mutex_lock mutex;

void Increment()
{
    acquire( &mutex );
    counter++;
    release( &mutex );
}

void Decrement()
{
    acquire( &mutex );
    counter--;
    release( &mutex );
}

In the example, the two routines Increment() and Decrement() will either increment

or decrement the variable counter. To modify the variable, a thread has to first

acquire the mutex lock. Only one thread at a time can do this; all the other threads that

want to acquire the lock need to wait until the thread holding the lock releases it. Both

routines use the same mutex; consequently, only one thread at a time can either increment

or decrement the variable counter.

If multiple threads are attempting to acquire the same mutex at the same time, then

only one thread will succeed, and the other threads will have to wait. This situation is

known as a contended mutex.

The region of code between the acquisition and release of a mutex lock is called a
critical section, or critical region. Code in this region will be executed by only one thread at

a time.

As an example of a critical section, imagine that an operating system does not have

an implementation of malloc() that is thread-safe, or safe for multiple threads to call at

the same time. One way to fix this is to place the call to malloc() in a critical section

by surrounding it with a mutex lock.

Placing a Mutex Lock Around a Region of Code

void * threadSafeMalloc( size_t size )
{
    acquire( &mallocMutex );
    void * memory = malloc( size );
    release( &mallocMutex );
    return memory;
}

If all the calls to malloc() are replaced with the threadSafeMalloc() call, then

only one thread at a time can be in the original malloc() code, and the calls to

malloc() become thread-safe.

Threads block if they attempt to acquire a mutex lock that is already held by another

thread. Blocking means that the threads are sent to sleep either immediately or after a

few unsuccessful attempts to acquire the mutex.

One problem with this approach is that it can serialize a program. If multiple threads

simultaneously call threadSafeMalloc(), only one thread at a time will make progress.

This causes the multithreaded program to have only a single executing thread, which

stops the program from taking advantage of multiple cores.
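
As a concrete illustration using POSIX threads, the same wrapper might look like the sketch below; the mutex name is illustrative:

#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t mallocMutex = PTHREAD_MUTEX_INITIALIZER;

void * threadSafeMalloc( size_t size )
{
    pthread_mutex_lock( &mallocMutex );   /* only one thread at a time... */
    void * memory = malloc( size );       /* ...can be inside this critical section */
    pthread_mutex_unlock( &mallocMutex );
    return memory;
}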


(2) Spin Locks
Spin locks are essentially mutex locks. The difference between a mutex lock and a spin

lock is that a thread waiting to acquire a spin lock will keep trying to acquire the lock

without sleeping. In comparison, a mutex lock may sleep if it is unable to acquire the

lock. The advantage of using spin locks is that they will acquire the lock as soon as it is

released, whereas a mutex lock will need to be woken by the operating system before it

can get the lock. The disadvantage is that a spin lock will spin on a virtual CPU monopolizing

that resource. In comparison, a mutex lock will sleep and free the virtual CPU

for another thread to use.

Often mutex locks are implemented to be a hybrid of spin locks and more traditional

mutex locks. The thread attempting to acquire the mutex spins for a short while before

blocking. There is a performance advantage to this. Since most mutex locks are held for

only a short period of time, it is quite likely that the lock will quickly become free for

the waiting thread to acquire. So, spinning for a short period of time makes it more

likely that the waiting thread will acquire the mutex lock as soon as it is released.

However, continuing to spin for a long period of time consumes hardware resources that

could be better used in allowing other software threads to run.
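
As a rough sketch of the idea (not how any particular library implements it), a minimal spin lock can be built from a C11 atomic flag; the waiting thread simply keeps retrying instead of sleeping:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void spin_lock(void)
{
    /* Keep retrying until the flag was previously clear: busy-wait, never sleep. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;
}

void spin_unlock(void)
{
    /* Clear the flag so the next spinning thread can acquire the lock immediately. */
    atomic_flag_clear_explicit(&lock, memory_order_release);
}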

(3) Semaphores
Semaphores are counters that can be either incremented or decremented. They can be

used in situations where there is a finite limit to a resource and a mechanism is needed

to impose that limit. An example might be a buffer that has a fixed size. Every time an

element is added to a buffer, the number of available positions is decreased. Every time

an element is removed, the number available is increased.

Semaphores can also be used to mimic mutexes; if there is only one element in the

semaphore, then it can be either acquired or available, exactly as a mutex can be either
locked or unlocked.

Semaphores will also signal or wake up threads that are waiting on them to use available

resources; hence, they can be used for signaling between threads. For example, a thread

might set a semaphore once it has completed some initialization. Other threads could

wait on the semaphore and be signaled to start work once the initialization is complete.

Depending on the implementation, the method that acquires a semaphore might be

called wait, down, or acquire, and the method to release a semaphore might be called post,

up, signal, or release. When the semaphore no longer has resources available, the threads

requesting resources will block until resources are available.
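
As a concrete sketch of the signaling use described above, the POSIX semaphore calls sem_init, sem_post, and sem_wait can be used to let a worker thread wait for an initializer thread; the thread roles here are illustrative:

#include <pthread.h>
#include <semaphore.h>

sem_t ready;                          /* 0 = initialization not yet complete */

void * initializer( void * params )
{
    /* ... perform the initialization ... */
    sem_post( &ready );               /* signal: initialization is complete */
    return 0;
}

void * worker( void * params )
{
    sem_wait( &ready );               /* block until the initializer posts */
    /* ... start work that depends on the initialization ... */
    return 0;
}

int main(void)
{
    sem_init( &ready, 0, 0 );         /* shared between threads, initial value 0 */
    pthread_t t1, t2;
    pthread_create( &t2, 0, worker, 0 );
    pthread_create( &t1, 0, initializer, 0 );
    pthread_join( t1, 0 );
    pthread_join( t2, 0 );
    sem_destroy( &ready );
    return 0;
}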

(4) Barriers
There are situations where a number of threads have to all complete their work before

any of the threads can start on the next task. In these situations, it is useful to have a barrier

where the threads will wait until all are present.

One common example of using a barrier arises when there is a dependence between

different sections of code. For example, suppose a number of threads compute the values

stored in a matrix. The variable total needs to be calculated using the values stored in

the matrix. A barrier can be used to ensure that all the threads complete their computation

of the matrix before the variable total is calculated.

Using a Barrier to Order Computation

Compute_values_held_in_matrix();

Barrier();

total = Calculate_value_from_matrix();

The variable total can be computed only when all threads have reached the barrier.

This avoids the situation where one of the threads is still completing its computations

while the other threads start using the results of the calculations. Notice that another
barrier could well be needed after the computation of the value for total if that value

is then used in further calculations.

Use of Multiple Barriers

Compute_values_held_in_matrix();

Barrier();

total = Calculate_value_from_matrix();

Barrier();

Perform_next_calculation( total );
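
As a concrete sketch using POSIX thread barriers, the pattern above might look like the following; the per-thread computation is a stand-in for the real matrix work:

#include <pthread.h>

#define NTHREADS 4

pthread_barrier_t barrier;
double values[NTHREADS];              /* stand-in for the matrix */
double total;

void * worker( void * params )
{
    int id = *(int *)params;
    values[id] = id * 2.0;            /* stand-in for computing part of the matrix */

    pthread_barrier_wait( &barrier ); /* wait until every thread has finished */

    if ( id == 0 )                    /* one thread computes the total */
    {
        total = 0.0;
        for ( int i = 0; i < NTHREADS; i++ )
            total += values[i];
    }
    return 0;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];

    pthread_barrier_init( &barrier, 0, NTHREADS );
    for ( int i = 0; i < NTHREADS; i++ )
    {
        ids[i] = i;
        pthread_create( &threads[i], 0, worker, &ids[i] );
    }
    for ( int i = 0; i < NTHREADS; i++ )
        pthread_join( threads[i], 0 );
    pthread_barrier_destroy( &barrier );
    return 0;
}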

DEADLOCK:
A process requests resources; if the resources are not available at that time, the process enters the waiting state. If the requested resources are held by other processes that are themselves waiting, so that all of them remain in the waiting state indefinitely, the situation is said to be a "deadlock".

CONDITIONS FOR DEADLOCK

A deadlocked system must satisfy the following four conditions:

1. MUTUAL EXCLUSION
2. HOLD & WAIT
3. NO PREEMPTION
4. CIRCULAR WAIT

Deadlock Avoidance:
Avoid actions that may lead to a deadlock.

Think of the system as a state machine moving from one state to another as each instruction is executed.

We can avoid deadlock by means of:

1. Safe State
2. Banker's Algorithm
3. Resource Allocation Graph


A safe state is one where the system is not deadlocked and there is some sequence by which all requests can be satisfied. To avoid deadlocks, we try to make only those transitions that take us from one safe state to another, and we avoid transitions to an unsafe state (a state that is not deadlocked, but is not safe).

Banker's Algorithm

When a request is made, check whether, after the request is satisfied, there is (at least one!) sequence of moves that can satisfy all the remaining requests, i.e., whether the new state is safe. If so, satisfy the request; otherwise make the request wait.
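
A minimal sketch of the safety check at the heart of the Banker's algorithm is shown below; the array sizes are hypothetical, and the request-handling step (tentatively granting the request before calling the check) is omitted:

#include <stdbool.h>
#include <string.h>

#define NPROC 3    /* hypothetical number of processes */
#define NRES  2    /* hypothetical number of resource types */

/* Returns true if some order exists in which every process can obtain its
   remaining need and finish, i.e., the state is safe. */
bool is_safe( int available[NRES], int allocation[NPROC][NRES], int need[NPROC][NRES] )
{
    int  work[NRES];
    bool finished[NPROC] = { false };

    memcpy( work, available, sizeof(work) );

    int done = 0;
    while ( done < NPROC )
    {
        bool progressed = false;
        for ( int p = 0; p < NPROC; p++ )
        {
            if ( finished[p] ) continue;

            bool can_finish = true;
            for ( int r = 0; r < NRES; r++ )
                if ( need[p][r] > work[r] ) { can_finish = false; break; }

            if ( can_finish )
            {
                /* Pretend the process runs to completion and releases its resources. */
                for ( int r = 0; r < NRES; r++ )
                    work[r] += allocation[p][r];
                finished[p] = true;
                progressed  = true;
                done++;
            }
        }
        if ( !progressed )
            return false;   /* no remaining process can finish: the state is unsafe */
    }
    return true;
}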

Resource Allocation Graph

If we have a resource allocation system with only one instance of each resource type, a variant of the resource allocation graph can be used for deadlock avoidance.

DEADLOCK PREVENTION
The difference from avoidance is that here the system itself is built in such a way that there are no deadlocks: we make sure that at least one of the four deadlock conditions is never satisfied.

This may, however, be even more conservative than a deadlock avoidance strategy.

Deadlock Detection
The detection mechanism is different for single and multiple instances of a resource type. We can detect deadlocks using a wait-for graph when there is a single instance of each resource type, and using a detection algorithm when there are multiple instances of a resource type.

SINGLE INSTANCE OF RESOURCE TYPE:

A single instance of a resource type means the system has only one resource of each type. We can detect this type of deadlock with the help of a wait-for graph.

[Wait-for graph: P1 waits for P2 (request R1), P2 waits for P3 (request R2), P2 waits for P1 (request R3), and P3 waits for P2 (request R4).]

A system is in a deadlock state if and only if the wait-for graph contains a cycle, so we can detect deadlocks by looking for cycles. In the figure there are two cycles, one from P1 to P2 and back to P1, and a second from P2 to P3 and back to P2, so the system contains deadlocks.

SEVERAL INSTANCES OF RESOURCE TYPE:

The wait-for graph is not applicable when there are several instances of a resource type, so we need another method: a deadlock detection algorithm. This algorithm looks like the Banker's algorithm and employs several data structures that are similar to those used in the Banker's algorithm.

Deadlock recovery
Once a deadlock has been detected, some strategy is needed for recovery. The various approaches to recovering from deadlock are:

PROCESS TERMINATION

RESOURCE PREEMPTION

PROCESS TERMINATION

Process termination is one method of recovering from deadlock. There are two ways to do it:

ABORT ALL DEADLOCKED PROCESSES: Release all the processes in the deadlocked state and start the allocation again from the beginning. This is a very expensive method.

ABORT ONE PROCESS AT A TIME UNTIL THE DEADLOCK CYCLE IS ELIMINATED: First abort one of the processes in the deadlocked state and allocate its resources to some other process in the deadlocked state, then check whether the deadlock has been broken. If not, abort another process from the deadlocked state. Continue until the deadlock is removed. This method is also expensive, but compared with the first it is better.

RESOURCE PREEMPTION

To eliminate deadlocks using resource preemption, we preempt some resources from processes and give these resources to other processes until the deadlock cycle is broken. Three issues must be addressed when using resource preemption:

SELECTING A VICTIM: Select a victim resource in the deadlocked state and preempt it.

ROLLBACK: If a resource is preempted from a process, something must be done with that process. The process must be rolled back to some safe state and restarted from that state.

STARVATION: It must be guaranteed that resources will not always be preempted from the same process, to avoid the starvation problem.

Livelock:
A livelock is a situation in which two or more processes continuously change their states in response to changes in the other process(es) without doing any useful work. It is somewhat similar to deadlock, but the difference is that the processes are "being polite" and letting the other go first, so none of them makes progress. This can happen when a process is trying to avoid a deadlock.

In concurrent computing, a deadlock is a state in which each member of a group of processes is waiting for some other member to release a lock.

A livelock is similar to a deadlock, except that the states of the processes involved in the livelock
constantly change with regard to one another, none progressing. Livelock is a special case of
resource starvation; the general definition only states that a specific process is not progressing.

A real-world example of livelock occurs when two people meet in a narrow corridor, and each
tries to be polite by moving aside to let the other pass, but they end up swaying from side to side
without making any progress because they both repeatedly move the same way at the same time.

Livelock is a risk with some algorithms that detect and recover from deadlock. If more than one
process takes action, the deadlock detection algorithm can be repeatedly triggered. This can be
avoided by ensuring that only one process (chosen randomly or by priority) takes action.

COMMUNICATION BETWEEN THREADS AND PROCESS:


All parallel applications require some element of communication between either the

threads or the processes. There is usually an implicit or explicit action of one thread

sending data to another thread. For example, one thread might be signaling to another
that work is ready for them. We have already seen an example of this where a semaphore

might indicate to waiting threads that initialization has completed. The thread signaling

the semaphore does not know whether there are other threads waiting for that signal.

Alternatively, a thread might be placing a message on a queue, and the message would be

received by the thread tasked with handling that queue.

These mechanisms usually require operating system support to mediate the sending of

messages between threads or processes. Programmers can invent their own implementations, but it can be more efficient to rely on the operating system to put a thread to

sleep until a condition is true or until a message is received.

The following sections outline various mechanisms to enable processes or threads to

pass messages or share data.

MEMORY, SHARED MEMORY AND MEMORY-MAPPED FILES:

The easiest way for multiple threads to communicate is through memory. If two threads

can access the same memory location, the cost of that access is little more than the

memory latency of the system. Of course, memory accesses still need to be controlled to

ensure that only one thread writes to the same memory location at a time. A multithreaded application will share memory between the threads by default, so this can be a

very low-cost approach. The only things that are not shared between threads are variables

on the stack of each thread (local variables) and thread-local variables, which will be discussed later.

Sharing memory between multiple processes is more complicated. By default, all

processes have independent address spaces, so it is necessary to preconfigure regions of

memory that can be shared between different processes.

To set up shared memory between two processes, one process will make a library call

to create a shared memory region. The call will use a unique descriptor for that shared
memory. This descriptor is usually the name of a file in the file system. The create call

returns a handle identifier that can then be used to map the shared memory region into

the address space of the application. This mapping returns a pointer to the newly mapped

memory. This pointer is exactly like the pointer that would be returned by malloc()

and can be used to access memory within the shared region.

When each process exits, it detaches from the shared memory region, and then the

last process to exit can delete it.

Creating and Deleting a Shared Memory Segment

ID = Open Shared Memory( Descriptor );

Memory = Map Shared Memory( ID );

...

Memory[100]++;

...

Close Shared Memory( ID );

Delete Shared Memory( Descriptor );

The listing below shows the process of attaching to an existing shared memory segment. In

this instance, the shared region of memory is already created, so the same descriptor used

to create it can be used to attach to the existing shared memory region. This will provide

the process with an ID that can be used to map the region into the process.

Attaching to an Existing Shared Memory Segment

ID = Open Shared Memory( Descriptor );

Memory = Map Shared Memory( ID );

...

Close Shared Memory( ID );

A shared memory segment may remain on the system until it is removed, so it is


important to plan on which process has responsibility for creating and removing it.
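
On a POSIX system, the pseudocode above roughly corresponds to shm_open() and mmap(); the segment name in the sketch below is illustrative and error checking is omitted:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *descriptor = "/demo_shm";    /* illustrative segment name */
    size_t size = 4096;

    /* Create (or open) the shared memory region and set its size. */
    int id = shm_open( descriptor, O_CREAT | O_RDWR, 0600 );
    ftruncate( id, size );

    /* Map the region into this process's address space. */
    int *memory = mmap( 0, size, PROT_READ | PROT_WRITE, MAP_SHARED, id, 0 );

    memory[100]++;                            /* use it like ordinary memory */

    munmap( memory, size );                   /* detach from the region */
    close( id );
    shm_unlink( descriptor );                 /* the last process deletes it */
    return 0;
}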

CONDITION VARIABLES:

Condition variables communicate readiness between threads by enabling a thread to be

woken up when a condition becomes true. Without condition variables, the waiting

thread would have to use some form of polling to check whether the condition had

become true.

Condition variables work in conjunction with a mutex. The mutex is there to ensure

that only one thread at a time can access the variable. For example, the producer-

consumer model can be implemented using condition variables. Suppose an application

has one producer thread and one consumer thread. The producer adds data onto a

queue, and the consumer removes data from the queue. If there is no data on the queue,

then the consumer needs to sleep until it is signaled that an item of data has been placed

on the queue

Listing 4.17 Producer Thread Adding an Item to the Queue

Acquire Mutex();

Add Item to Queue();

If ( Only One Item on Queue )

Signal Conditions Met();

Release Mutex();

The producer thread needs to signal a waiting consumer thread only if the queue was

empty and it has just added a new item into that queue. If there were multiple items

already on the queue, then the consumer thread must be busy processing those items and

cannot be sleeping. If there were no items in the queue, then it is possible that the con-
sumer thread is sleeping and needs to be woken up.

Listing 4.18 Code for Consumer Thread Removing Items from Queue

Acquire Mutex();

Repeat

Item = 0;

If ( No Items on Queue() )

Wait on Condition Variable();

If (Item on Queue())

Item = remove from Queue();

Until ( Item != 0 );

Release Mutex();

The consumer thread will wait on the condition variable if the queue is empty. When

the producer thread signals it to wake up, it will first check to see whether there is anything on the queue. It is quite possible for the consumer thread to be woken only to

find the queue empty; it is important to realize that the thread waking up does not

imply that the condition is now true, which is why the code is in a repeat loop in the

example. If there is an item on the queue, then the consumer thread can handle that

item; otherwise, it returns to sleep.

The interaction with the mutex is interesting. The producer thread needs to acquire

the mutex before adding an item to the queue. It needs to release the mutex after adding

the item to the queue, but it still holds the mutex when signaling. The consumer thread
cannot be woken until the mutex is released. The producer thread releases the mutex

after the signaling has completed; releasing the mutex is necessary for the consumer

thread to make progress.

The consumer thread acquires the mutex; it will need it to be able to safely modify

the queue. If there are no items on the queue, then the consumer thread will wait for an

item to be added. The call to wait on the condition variable will cause the mutex to be

released, and the consumer thread will wait to be signaled. When the consumer thread

wakes up, it will hold the mutex; either it will release the mutex when it has removed an

item from the queue or, if there is still nothing in the queue, it will release the mutex

with another call to wait on the condition variable.

The producer thread can use two types of wake-up calls: either it can wake up a single thread or it can broadcast to all waiting threads. Which one to use depends on the

context. If there are multiple items of data ready for processing, it makes sense to wake

up multiple threads with a broadcast. On the other hand, if the producer thread has

added only a single item to the queue, it is more appropriate to wake up only a single

thread. If all the threads are woken, it can take some time for all the threads to wake up,

execute, and return to waiting, placing an unnecessary burden on the system. Notice that

because each thread has to own the mutex when it wakes up, the process of waking all

the waiting threads is serial; only a single thread can be woken at a time.

The other point to observe is that when a wake-up call is broadcast to all threads,

some of them may be woken when there is no work for them to do. This is one reason

why it is necessary to place the wait on the condition variable in a loop.
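
A concrete sketch of the producer/consumer pattern above using POSIX condition variables follows; the queue is reduced to a simple item counter, and the item counts are arbitrary:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond  = PTHREAD_COND_INITIALIZER;
int items = 0;                                    /* number of items on the "queue" */

void * producer( void * params )
{
    for ( int i = 0; i < 10; i++ )
    {
        pthread_mutex_lock( &mutex );
        items++;                                  /* add an item to the queue */
        if ( items == 1 )
            pthread_cond_signal( &cond );         /* queue was empty: wake the consumer */
        pthread_mutex_unlock( &mutex );
    }
    return 0;
}

void * consumer( void * params )
{
    for ( int consumed = 0; consumed < 10; )
    {
        pthread_mutex_lock( &mutex );
        while ( items == 0 )                      /* loop: waking up does not imply items exist */
            pthread_cond_wait( &cond, &mutex );   /* releases the mutex while sleeping */
        items--;                                  /* remove an item from the queue */
        consumed++;
        pthread_mutex_unlock( &mutex );
    }
    return 0;
}

int main(void)
{
    pthread_t p, c;
    pthread_create( &c, 0, consumer, 0 );
    pthread_create( &p, 0, producer, 0 );
    pthread_join( p, 0 );
    pthread_join( c, 0 );
    printf( "items left on the queue = %d\n", items );
    return 0;
}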

The other problem to be aware of with condition variables is the lost wake-up. This

occurs when the signal to wake up the waiting thread is sent before the thread is ready to receive it. Listing 4.19 shows a version of the consumer thread code. This version of the code can suffer from the lost wake-up problem.

Listing 4.19 Consumer Thread Code with Potential Lost Wake-Up Problem

Repeat

Item = 0;

If ( No Items on Queue() )

Acquire Mutex();

Wait on Condition Variable();

Release Mutex();

Acquire Mutex();

If ( Item on Queue() )

Item = remove from Queue();

Release Mutex();

Until ( Item!=0 );

The problem with the code is the first if condition. If there are no items on the

queue, then the mutex lock is acquired, and the thread waits on the condition variable.

However, the producer thread could have placed an item and signaled the consumer

thread between the consumer thread executing the if statement and acquiring the

mutex. When this happens, the consumer thread waits on the condition variable indefinitely because the producer thread, in Listing 4.17, signals only when it places the first

item into the queue.


SIGNALS AND EVENTS:

Signals are a UNIX mechanism where one process can send a signal to another process

and have a handler in the receiving process perform some task upon the receipt of the

message. Many features of UNIX are implemented using signals. Stopping a running

application by pressing ^C causes a SIGINT signal to be sent to the process.

Windows has a similar mechanism for events. The handling of keyboard presses and

mouse moves are performed through the event mechanism. Pressing one of the buttons

on the mouse will cause a click event to be sent to the target window.

Signals and events are really optimized for sending limited or no data along with the

signal, and as such they are probably not the best mechanism for communication when

compared to other options.

Listing 4.20 shows how a signal handler is typically installed and how a signal can be

sent to that handler. Once the signal handler is installed, sending a signal to that thread

will cause the signal handler to be executed

Listing 4.20 Installing and Using a Signal Handler

void signalHandler( void * signal )
{
    ...
}

int main()
{
    installHandler( SIGNAL, signalHandler );
    sendSignal( SIGNAL );
}
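
A concrete version using the standard C signal interface might look like the sketch below; signal() and raise() stand in for the placeholder installHandler() and sendSignal(), and SIGUSR1 is an arbitrary choice:

#include <signal.h>
#include <stdio.h>

/* Handler executed when the signal is delivered to the process. */
void signalHandler( int sig )
{
    /* Real handlers should restrict themselves to async-signal-safe calls;
       printf is used here only to keep the sketch short. */
    printf( "Received signal %d\n", sig );
}

int main(void)
{
    signal( SIGUSR1, signalHandler );   /* install the handler */
    raise( SIGUSR1 );                   /* send the signal to this process */
    return 0;
}
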
MESSAGE QUEUE:

A message queue is a structure that can be shared between multiple processes. Messages

can be placed into the queue and will be removed in the same order in which they were

added. Constructing a message queue looks rather like constructing a shared memory

segment. The first thing needed is a descriptor, typically the location of a file in the file

system. This descriptor can either be used to create the message queue or be used to

attach to an existing message queue. Once the queue is configured, processes can place

messages into it or remove messages from it. Once the queue is finished, it needs to be

deleted.

Listing 4.21 shows code for creating and placing messages into a queue. This code is

also responsible for removing the queue after use.

Listing 4.21 Creating and Placing Messages into a Queue

ID = Open Message Queue( Descriptor );
Put Message in Queue( ID, Message );
...
Close Message Queue( ID );
Delete Message Queue( Descriptor );

Listing 4.22 shows the process for receiving messages from a queue. Using the descriptor for an existing message queue enables two processes to communicate by sending and

receiving messages through the queue.

Listing 4.22 Opening a Queue and Receiving Messages

ID = Open Message Queue( Descriptor );

Message=Remove Message from Queue(ID);

...

Close Message Queue(ID);
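
On a POSIX system, the pseudocode maps onto the mq_* calls; a sketch of the sending side, with an illustrative queue name, is shown below:

#include <fcntl.h>
#include <mqueue.h>
#include <string.h>

int main(void)
{
    const char *descriptor = "/demo_queue";           /* illustrative queue name */

    struct mq_attr attr;
    memset( &attr, 0, sizeof(attr) );
    attr.mq_maxmsg  = 10;                              /* queue capacity */
    attr.mq_msgsize = 64;                              /* maximum message size */

    /* Create the queue (or attach to it if it already exists). */
    mqd_t id = mq_open( descriptor, O_CREAT | O_WRONLY, 0600, &attr );

    const char *message = "work item";
    mq_send( id, message, strlen(message) + 1, 0 );    /* place a message on the queue */

    mq_close( id );
    mq_unlink( descriptor );                           /* delete the queue after use */
    return 0;
}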


NAMED PIPES:

UNIX uses pipes to pass data from one process to another. For example, the output from

the command ls, which lists all the files in a directory, could be piped into the wc command, which counts the number of lines, words, and characters in the input. The combination of the two commands would be a count of the number of files in the directory.

Named pipes provide a similar mechanism that can be controlled programmatically.

Named pipes are file-like objects that are given a specific name that can be shared

between processes. Any process can write into the pipe or read from the pipe. There is

no concept of a “message”; the data is treated as a stream of bytes. The method for using

a named pipe is much like the method for using a file: the pipe is opened, data is written into it or read from it, and then the pipe is closed.

Listing 4.23 shows the steps necessary to set up and write data into a pipe, before

closing and deleting the pipe. One process needs to actually make the pipe, and once it

has been created, it can be opened and used for either reading or writing. Once the

process has completed, the pipe can be closed, and one of the processes using it should

also be responsible for deleting it.

Listing 4.23 Setting Up and Writing into a Pipe

Make Pipe( Descriptor );

ID = Open Pipe( Descriptor );

Write Pipe( ID, Message, sizeof(Message) );

...

Close Pipe( ID );

Delete Pipe( Descriptor );

Listing 4.24 shows the steps necessary to open an existing pipe and read messages from

it. Processes using the same descriptor can open and use the same pipe for communication.
Listing 4.24 Opening an Existing Pipe to Receive Messages

ID=Open Pipe( Descriptor );

Read Pipe( ID, buffer, sizeof(buffer) );

...

Close Pipe( ID );
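
On a POSIX system, the corresponding calls are mkfifo(), open(), write()/read(), close(), and unlink(); a sketch of the writing side, with an illustrative pipe path, follows:

#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    const char *descriptor = "/tmp/demo_pipe";     /* illustrative pipe path */

    mkfifo( descriptor, 0600 );                    /* make the pipe */

    /* Open for writing; this blocks until another process opens it for reading. */
    int id = open( descriptor, O_WRONLY );

    const char *message = "hello through the pipe";
    write( id, message, strlen(message) );         /* the data is just a stream of bytes */

    close( id );
    unlink( descriptor );                          /* delete the pipe */
    return 0;
}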
