PDC Lecture 02
Fall 2024
Modern classification (Sima, Fountain, Kacsuk)
The modern classification proposed by Sima, Fountain, and Kacsuk focuses on how
parallelism is achieved in computer architectures. Here are the key categories:
Data Parallelism: The same operation is applied to many data elements simultaneously; parallelism is exploited at the data level.
Function Parallelism: Multiple different functions are performed in parallel; parallelism is exploited at the functional level.
Control Parallelism: Different tasks are executed concurrently according to the program's control flow (task parallelism).
[Figure: taxonomy tree in which parallel architectures split into data-parallel architectures and function-parallel architectures]
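To make the distinction concrete, here is a brief illustrative sketch (my example, not from the lecture) that expresses data parallelism with an OpenMP parallel loop and function parallelism with OpenMP sections; the arrays and operations are arbitrary:

#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    double sum = 0.0, maxval = 0.0;

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Data parallelism: the same operation is applied to
       many data elements at once. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Function parallelism: two different functions (a sum and a
       maximum search) run concurrently as separate sections. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < N; i++) sum += c[i]; }
        #pragma omp section
        { for (int i = 0; i < N; i++) if (c[i] > maxval) maxval = c[i]; }
    }

    printf("sum = %.1f, max = %.1f\n", sum, maxval);
    return 0;
}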
Performance
In the context of parallel and distributed systems, performance refers to how efficiently a system
or application accomplishes its tasks. Let’s delve into the details:
1. Throughput: This measures the rate at which a system processes tasks or transactions. It
indicates how many tasks can be completed per unit of time. High throughput is desirable for
systems handling large workloads.
2. Latency: Latency is the time delay between initiating an action and receiving a response. In
parallel and distributed systems, minimizing latency is crucial for responsiveness. Examples
include network latency (time for data to travel between nodes) and memory access latency.
3. Scalability: Scalability assesses how well a system can handle increased load. It can be vertical
(adding more resources to a single node) or horizontal (adding more nodes). Good scalability
ensures consistent performance as the workload grows.
4. Speedup: Speedup measures the improvement achieved by parallelizing a task. It’s the ratio of
the execution time on a single processor (sequential) to the execution time on multiple
processors (parallel). Ideally, speedup should be close to the number of processors used.
Performance
Efficiency: Efficiency quantifies how effectively resources are utilized in a parallel system. It’s
calculated as the speedup divided by the number of processors. High efficiency means minimal
resource wastage.
Load Balancing: In distributed systems, load balancing ensures that tasks are evenly distributed
among nodes. Balanced loads prevent bottlenecks and maximize system performance.
Overhead: Overhead refers to additional computational costs incurred due to parallelization.
Examples include communication overhead (data exchange between nodes) and synchronization
overhead (ensuring consistency).
Remember that achieving optimal performance involves trade-offs, considering factors like
communication costs, synchronization, and hardware limitations.
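A quick worked example tying speedup and efficiency together (the timings below are made-up numbers, not measurements from the lecture):

#include <stdio.h>

int main(void) {
    /* Hypothetical timings: 100 s sequentially, 20 s on 8 processors. */
    double t_seq = 100.0, t_par = 20.0;
    int p = 8;

    double speedup = t_seq / t_par;      /* 100 / 20 = 5.0 */
    double efficiency = speedup / p;     /* 5.0 / 8 = 0.625 */

    printf("speedup = %.2f, efficiency = %.3f\n", speedup, efficiency);
    return 0;
}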
Performance
Theoretical peak performance (FLOPS) = (number of sockets) × (cores per socket) × (clock cycles per second) × (FLOPs per cycle)
Performance Evaluation
The primary attributes used to measure the performance of a computer system are as follows.
Cycle time (τ): the unit of time for all the operations of a computer system. It is the inverse of the clock rate (1/f) and is typically expressed in nanoseconds.
Cycles Per Instruction (CPI): different instructions take different numbers of cycles to execute; CPI measures the average number of cycles per instruction.
Instruction count (Ic): the number of instructions in a program. If we assume that all instructions take the same number of cycles, then the total execution time of a program = number of instructions in the program × number of cycles required by one instruction × time of one cycle.
Hence, execution time T = Ic × CPI × τ seconds. In practice, the clock frequency of the system is specified in MHz, and processor speed is often measured in millions of instructions per second (MIPS).
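A small worked example of these formulas, using invented machine parameters (the instruction count, CPI, and clock rate below are illustrative, not from the lecture):

#include <stdio.h>

int main(void) {
    /* Hypothetical program and machine parameters. */
    double ic  = 2.0e9;   /* instruction count: 2 billion instructions */
    double cpi = 1.5;     /* average cycles per instruction */
    double f   = 500e6;   /* clock rate: 500 MHz */

    double tau  = 1.0 / f;          /* cycle time in seconds */
    double t    = ic * cpi * tau;   /* execution time T = Ic * CPI * tau */
    double mips = ic / (t * 1e6);   /* million instructions per second */

    printf("T = %.2f s, MIPS = %.1f\n", t, mips);  /* T = 6.00 s, MIPS ~ 333.3 */
    return 0;
}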
FLOPS UNITS
Unit                 Order    Comments
KFLOPS (kiloFLOPS)   10^3
MFLOPS (megaFLOPS)   10^6
GFLOPS (gigaFLOPS)   10^9     Intel i9-9900K CPU at 98.88 GFLOPS; AMD Ryzen 9 3950X at 170.56 GFLOPS
TFLOPS (teraFLOPS)   10^12    Nvidia RTX 3090 at 36 TFLOPS (2002 No. 1 supercomputer NEC Earth Simulator at 36 TFLOPS)
PFLOPS (petaFLOPS)   10^15
EFLOPS (exaFLOPS)    10^18    This is the next milestone for supercomputers (exa-scale computing). We are almost there: Fugaku at 0.442 EFLOPS
o "#pragma omp parallel sections" specifies task creation and termination
To overcome that,
o Decompose the problem into two tasks, each computing partial sums T[size-1] (see the sketch below)
[Figure: two computation graphs over nodes A through G]
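A minimal sketch of the two-task decomposition described above, assuming the goal is to sum an array into a final total (the array name t and the even split are my assumptions, not the lecture's code):

#include <stdio.h>

#define SIZE 1000000

static double t[SIZE];

int main(void) {
    double sum_lo = 0.0, sum_hi = 0.0;

    for (int i = 0; i < SIZE; i++) t[i] = 1.0;  /* sample data */

    /* Two tasks, each computing a partial sum over half of the array. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < SIZE / 2; i++) sum_lo += t[i]; }
        #pragma omp section
        { for (int i = SIZE / 2; i < SIZE; i++) sum_hi += t[i]; }
    }

    /* Combine the two partial results sequentially. */
    printf("total = %.1f\n", sum_lo + sum_hi);
    return 0;
}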
What are the work, span and parallelism of the following CG?
[Figure: computation graph with node costs A: 1, B: 2, C: 4, D: 1, E: 2, F: 1, G: 2, H: 1, I: 2]
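As a partial worked answer (the edges are in the figure, so only the work can be computed from the costs listed here): work = 1 + 2 + 4 + 1 + 2 + 1 + 2 + 1 + 2 = 16; span = the cost of the most expensive path from the entry node to the exit node; parallelism = work / span.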
Scheduling of a computation graph
Assume each node takes 1 time step.
A task can be allocated only when all of its predecessors have been executed.
[Figure: computation graph with levels A; B, C; D, E, F, G; H, I; J]
Task scheduling (2 processors):
Time step   P0   P1
1           A    -
2           B    C
3           D    E
4           H    F
5           G    -
6           I    -
7           J    -
Scheduling of a computation graph
Another schedule, better than the previous one:
[Figure: same computation graph]
Task scheduling (2 processors):
Time step   P0   P1
1           A    -
2           B    C
3           D    E
4           F    G
5           H    I
6           J    -
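A quick sanity check: the graph has 10 unit-cost nodes, so with 2 processors no schedule can finish in fewer than 10/2 = 5 time steps; the 6-step schedule above is therefore within one step of that lower bound.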
Greedy Scheduling
A greedy scheduling algorithm assigns tasks to processors as soon as they become available, without
waiting for other tasks to complete. This ensures that processors are always busy if there are tasks
ready to be executed.
Example: Task Dependency Graph
Consider a task dependency graph where nodes represent tasks and edges represent dependencies
between tasks. Here’s a simple example:
      A
     / \
    B   C
   / \ / \
  D   E   F
• Task A must be completed before tasks B and C can start.
• Tasks B and C must be completed before tasks D, E, and F can start.
Greedy Scheduling Steps
1.Initial State:
• Ready tasks: A
• Processors: P1, P2, P3
2.Step 1:
• Assign task A to P1.
• Ready tasks: B, C (after A completes)
3.Step 2:
• Assign task B to P1 (since P1 is free after completing A).
• Assign task C to P2.
• Ready tasks: D, E, F (after B and C complete)
4.Step 3:
• Assign task D to P1 (since P1 is free after completing B).
• Assign task E to P2 (since P2 is free after completing C).
• Assign task F to P3.
Visualization
Here’s a visual representation of the greedy scheduling process:
Time Step 1:  P1: A  P2: -  P3: -
Time Step 2:  P1: B  P2: C  P3: -
Time Step 3:  P1: D  P2: E  P3: F
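The greedy process above is easy to simulate. The sketch below (my illustration, not the lecture's code) assigns ready tasks of the example graph to three processors one unit-time step at a time; it reproduces the three-step trace shown above:

#include <stdio.h>

#define NT 6  /* tasks A..F */
#define NP 3  /* processors */

int main(void) {
    /* pred_count[i] = number of unfinished predecessors of task i.
       Dependencies: A -> B, C;  B -> D, E;  C -> E, F. */
    int pred_count[NT] = {0, 1, 1, 1, 2, 1};
    int succ[NT][2] = {{1, 2}, {3, 4}, {4, 5}, {-1, -1}, {-1, -1}, {-1, -1}};
    int done[NT] = {0};
    const char *name = "ABCDEF";
    int finished = 0, step = 0;

    while (finished < NT) {
        int scheduled[NP], ns = 0;
        /* Greedy: grab up to NP ready tasks (all predecessors done). */
        for (int i = 0; i < NT && ns < NP; i++)
            if (!done[i] && pred_count[i] == 0)
                scheduled[ns++] = i;

        printf("Time step %d:", ++step);
        for (int p = 0; p < NP; p++)
            printf("  P%d: %c", p + 1, p < ns ? name[scheduled[p]] : '-');
        printf("\n");

        /* Retire the scheduled tasks and release their successors. */
        for (int s = 0; s < ns; s++) {
            int t = scheduled[s];
            done[t] = 1;
            finished++;
            for (int j = 0; j < 2; j++)
                if (succ[t][j] >= 0)
                    pred_count[succ[t][j]]--;
        }
    }
    return 0;
}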
Greedy Scheduling
Key Points
•No Idle Processors: At each step, all available processors are assigned tasks if there are ready tasks.
•Minimized Idle Time: Processors are never idle if there are tasks that can be executed.
•Efficient Utilization: This approach ensures efficient utilization of processor resources.
Conclusion
A greedy schedule ensures that processors are always busy if there are tasks ready to be
executed, leading to efficient use of resources and reduced overall execution time. This is
particularly useful in parallel computing where maximizing processor utilization is crucial
for performance.
Greedy schedule
A greedy schedule is one that never forces a processor to be idle when one or
more nodes are ready for execution.
A node is ready for execution if all its predecessors have been executed.
With one processor, let T_1 be the time to execute the CG.
o With any greedy schedule, T_1 = WORK(CG).
With an infinite number of processors, let T_inf be the time to execute the CG.
o With any greedy schedule, T_inf = SPAN(CG).
Greedy schedule
o Hence, T_P ≥ max(T_1 / P, T_inf)
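A numeric illustration with invented values: if T_1 = 100 and T_inf = 10, then on P = 4 processors any schedule needs at least max(100/4, 10) = 25 time steps, while on P = 20 processors the span dominates and the bound becomes max(5, 10) = 10.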
Greedy schedule
When using 1 processor, the sequential program runs for 100 seconds.
When we use 10 processors, should the program run 10 times faster?
This works only for embarrassingly parallel computations – parallel computations that can
be divided into completely independent computations that can be executed simultaneously.
There may be no interaction between the separate processes; sometimes the results need
to be collected.
Embarrassingly parallel applications are the kind that can scale up to a very large
number of processors. Examples: Monte Carlo analysis, numerical integration, 3D
graphics rendering, and many more.
In other types of applications, the computation components interact and have
dependencies, which prevents the applications from using a large number of processors.
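As a concrete illustration of an embarrassingly parallel computation, here is a sketch of Monte Carlo estimation of π with OpenMP; the sample count and the use of the POSIX rand_r generator are my choices, and the lecture only names Monte Carlo analysis as a category:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void) {
    const long n = 10000000;  /* total random samples */
    long hits = 0;

    /* Each thread draws points independently -- no interaction between
       threads; only the final counts are collected (reduction). */
    #pragma omp parallel reduction(+:hits)
    {
        unsigned int seed = 1234u + (unsigned int)omp_get_thread_num();
        #pragma omp for
        for (long i = 0; i < n; i++) {
            double x = (double)rand_r(&seed) / RAND_MAX;
            double y = (double)rand_r(&seed) / RAND_MAX;
            if (x * x + y * y <= 1.0) hits++;
        }
    }

    printf("pi ~= %f\n", 4.0 * hits / n);
    return 0;
}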
Scalability
Speedup(P) = T_1 / T_P
o Factor by which the use of P processors speeds up execution time relative to
1 processor for the same problem.
o Since the problem size is fixed, this is referred to as “Strong scaling”.
o Given a computation graph, what is the highest speedup that can be
achieved?
Speedup
Speedup(P) = T_1 / T_P
Typically, 1 ≤ Speedup(P) ≤ P.
The speedup is ideal if Speedup(P) = P.
Linear speedup: Speedup(P) = c × P for some constant 0 < c < 1.
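For example, with made-up timings: if T_1 = 100 s and T_10 = 20 s, then Speedup(10) = 5, which is linear speedup with c = 0.5; dividing by P = 10 gives an efficiency of 0.5, the metric defined next.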
Efficiency