PDC Lecture 02

The document discusses modern classifications of parallelism in computer architectures, including data, function, and control parallelism, which help in understanding advanced architecture designs. It also covers performance metrics such as throughput, latency, scalability, and speedup, along with the importance of efficiency and load balancing in parallel and distributed systems. Additionally, it explains parallel computing and programming concepts, computation graphs, scheduling algorithms, and the principles of speedup and scalability in relation to processor utilization.


CS-402 Parallel and Distributed Systems

Fall 2024

Lecture No. 02
Modern classification (Sima, Fountain, Kacsuk)

 The modern classification proposed by Sima, Fountain, and Kacsuk focuses on how
parallelism is achieved in computer architectures. Here are the key categories:

 Data Parallelism: In this approach, the same function operates on many data elements simultaneously. It
emphasizes parallelism at the data level.
 Function Parallelism: Multiple functions are performed in parallel. This category emphasizes parallelism at
the functional level.
 Control Parallelism: Different tasks are executed concurrently based on the program's control flow; together with task parallelism, it is a form of function parallelism distinguished by the level at which the parallelism appears.

These classifications help us understand the design spaces of advanced architectures.


Modern classification (Sima, Fountain, Kacsuk)

 Based on how parallelism is achieved


o Data parallelism: same function operating on many data
o Function parallelism: performing many functions in parallel
 Control parallelism or task parallelism, depending on the level of the functional parallelism.

Parallel architectures (classification tree):

 Data-parallel architectures
 Function-parallel architectures
  o Instruction-level parallel architectures (ILPs): pipelined, VLIW, and superscalar processors
  o Thread-level parallel architectures
  o Process-level parallel architectures (MIMDs): shared-memory MIMD and distributed-memory MIMD
Performance
In the context of parallel and distributed systems, performance refers to how efficiently a system
or application accomplishes its tasks. Let’s delve into the details:

1. Throughput: This measures the rate at which a system processes tasks or transactions. It
indicates how many tasks can be completed per unit of time. High throughput is desirable for
systems handling large workloads.
2. Latency: Latency is the time delay between initiating an action and receiving a response. In
parallel and distributed systems, minimizing latency is crucial for responsiveness. Examples
include network latency (time for data to travel between nodes) and memory access latency.
3. Scalability: Scalability assesses how well a system can handle increased load. It can be vertical
(adding more resources to a single node) or horizontal (adding more nodes). Good scalability
ensures consistent performance as the workload grows.
4. Speedup: Speedup measures the improvement achieved by parallelizing a task. It’s the ratio of
the execution time on a single processor (sequential) to the execution time on multiple
processors (parallel). Ideally, speedup should be close to the number of processors used.
Performance
Efficiency: Efficiency quantifies how effectively resources are utilized in a parallel system. It’s
calculated as the speedup divided by the number of processors. High efficiency means minimal
resource wastage.
Load Balancing: In distributed systems, load balancing ensures that tasks are evenly distributed
among nodes. Balanced loads prevent bottlenecks and maximize system performance.
Overhead: Overhead refers to additional computational costs incurred due to parallelization.
Examples include communication overhead (data exchange between nodes) and synchronization
overhead (ensuring consistency).
Remember that achieving optimal performance involves trade-offs, considering factors like
communication costs, synchronization, and hardware limitations.
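A quick worked example (numbers assumed for illustration): if a job takes 60 seconds on one processor and 10 seconds on 8 processors, the speedup is 60 / 10 = 6 and the efficiency is 6 / 8 = 0.75; the missing 25% is typically lost to communication and synchronization overhead and to imperfect load balancing.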
Performance

 Time and performance: Machine A is n times faster than Machine B if and only if

   Time(B) / Time(A) = n

 Program execution time = Instruction count × CPI (cycles per instruction) × cycle time

 Clock rate × (1 / CPI) can be approximated as instructions per second (IPS).
  o In a system with variable instruction cycles, IPS is program dependent – this metric can be misleading, and not very useful.
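As a worked example (assumed numbers): a program with an instruction count of 2 × 10^9, an average CPI of 1.5, and a 2 GHz clock (cycle time 0.5 ns) takes 2 × 10^9 × 1.5 × 0.5 ns = 1.5 seconds to execute.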
Performance

 MIPS (million instructions per second) – vendors sometimes report this value as an indication of computing speed. It is not a good performance metric.
 MFLOPS (million floating-point operations per second): This is more meaningful when it is measured over the time for completing a complex task.

   FLOPS = FP operations in the program / execution time

 For a CPU (or GPU), vendors compute the peak value with a formula of the form

   Peak FLOPS = sockets × cores per socket × clock frequency × FLOPs per cycle
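For instance, for a hypothetical CPU with 1 socket, 8 cores, a 3.0 GHz clock, and 16 FLOPs per cycle per core, the formula gives 1 × 8 × 3.0 × 10^9 × 16 ≈ 384 GFLOPS of peak performance.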
Performance Evaluation
The primary attributes used to measure the performance of a computer system are as follows.
Cycle time (T): The unit of time for all the operations of a computer system. It is the inverse of the clock rate (1/f). The cycle time is typically expressed in nanoseconds.
Cycles Per Instruction (CPI): Different instructions take different numbers of cycles to execute. CPI is the (average) number of cycles per instruction.
Instruction count (Ic): The number of instructions in a program. If we assume that all instructions take the same number of cycles, then the total execution time of a program = number of instructions in the program × number of cycles required by one instruction × time of one cycle.

Hence, execution time = Ic × CPI × T seconds. In practice, the clock frequency of the system is specified in MHz (or GHz), and processor speed is often quoted in millions of instructions per second (MIPS).
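For example (assumed values): with a clock frequency f = 500 MHz and an average CPI of 2, the processor executes f / CPI = 250 × 10^6 instructions per second, i.e. it is rated at 250 MIPS.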
FLOPS UNITS
Units                Order    Comments
KFLOPS (kiloFLOPS)   10^3
MFLOPS (megaFLOPS)   10^6
GFLOPS (gigaFLOPS)   10^9     Intel i9-9900K CPU at 98.88 GFLOPS, AMD Ryzen 9 3950X at 170.56 GFLOPS
TFLOPS (teraFLOPS)   10^12    Nvidia RTX 3090 at 36 TFLOPS (2002 No. 1 supercomputer NEC Earth Simulator at 36 TFLOPS)
PFLOPS (petaFLOPS)   10^15
EFLOPS (exaFLOPS)    10^18    The next milestone for supercomputers (exa-scale computing). We are almost there: Fugaku at 0.442 EFLOPS

You can find the FLOPS for up-to-date CPUs at


https://setiathome.berkeley.edu/cpu_list.php
Peak and sustained performance

 The FLOPS value reported by vendors is the peak performance that no program can achieve.
 Sustained FLOPS is the FLOPS rate that a program can achieve over its entire run.
 Peak FLOPS is usually much larger than sustained FLOPS.
 A set of standard benchmarks exists to measure the sustained performance.
  o LINPACK for supercomputers
Parallel computing and parallel programming

 Parallel computing: using multiple processing elements in parallel to solve problems more quickly than with a single processing element.
  o Fully utilize the computing power in contemporary computing systems.
 Parallel programming:
  o Specification of operations that can be executed in parallel
  o A parallel program is decomposed into sequential sub-computations (tasks)
  o Parallel programming constructs define task creation, termination, and interaction.
Parallel programming example

[Figure: computation graph of sum_omp.cpp – node A at the top, nodes B and C in parallel, node D at the bottom.]

o “#pragma omp parallel sections” specifies task creation and termination.

o “#pragma omp section” specifies the two tasks in the program.

o Notice that the parallel program runs slower when the array size is small.

o sum_omp.cpp consists of four sub-tasks:

 A: the sequential part before the parallel region

 B, C: the two parallel sub-tasks

 D: the part after the parallel sub-tasks join back.

Parallel programming example

 Notice that this parallel program is non-trivial.

 The calculation of sum in the loop has a dependency inside: each iteration adds the next element (T[0], T[1], ..., T[size-1]) into the running sum.

 To overcome that:
  o Decompose the problem into two tasks, each computing a partial sum (a sketch of such a program follows below).
  o Combine the results to get the final answer.
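The slides refer to sum_omp.cpp but do not include its source. Below is a minimal sketch of what such a program might look like, assuming a plain array summation split into two OpenMP sections; the array size, variable names, and output are illustrative assumptions, not the actual course code.

#include <omp.h>
#include <iostream>
#include <vector>

int main() {
    const std::size_t size = 100000000;          // assumed array size
    std::vector<double> T(size, 1.0);            // A: sequential part before the parallel region

    double sum1 = 0.0, sum2 = 0.0;               // partial sums for the two tasks

    #pragma omp parallel sections                // task creation and termination
    {
        #pragma omp section                      // B: first half of the array
        {
            for (std::size_t i = 0; i < size / 2; ++i) sum1 += T[i];
        }
        #pragma omp section                      // C: second half of the array
        {
            for (std::size_t i = size / 2; i < size; ++i) sum2 += T[i];
        }
    }

    double sum = sum1 + sum2;                    // D: combine after the sub-tasks join back
    std::cout << "sum = " << sum << std::endl;
    return 0;
}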


Modeling the execution of a parallel program

 The execution of a parallel program can be represented as a


computation graph (CG) or parallel program dependence graph that
allows for reasoning about the execution.
o Nodes are sequential subtasks
o Edges represent the dependency constraints
 Both control dependency and data dependency are captured.
o A CG is a directed acyclic graph (DAG) since a node cannot depend on itself.

 CG describes a set of computational tasks and the dependencies


between them.
Computation graph example

 The computation graph for sum_omp.cpp: node A at the top, nodes B and C in parallel below it, and node D at the bottom.
 Node A must be executed before Node B if there is a path from A to B in the graph.
 CG can be used as a visualization technique to help us understand the complexity of the algorithms.
 CG can also be used as a data structure for the compiler or the system to schedule the execution of the sub-tasks.
Complexity with computation graph
 Let T(N) be the execution time of node N.
 Work(CG) = Σ over all nodes N in CG of T(N) is the total work to be executed in CG.
  o This is the execution time with a single processor.
 Let span(CG) be the longest path in CG when adding the execution time of all nodes in the path.
  o span(CG) is the smallest possible execution time for the CG regardless of how many processors are used!
  o Note that CG is a DAG.
 CG's degree of parallelism is defined as parallelism = Work(CG) / span(CG). The parallelism of a computation provides a rough estimate of the maximum number of processors that can be used efficiently.
  o Consider two situations: parallelism = 1 and parallelism = N.
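As a small illustration of these definitions (not part of the slides), the following sketch computes work, span, and parallelism for a computation graph stored as a predecessor list. The node names follow the sum_omp.cpp example, and the execution times (B and C taking 2 units each) are assumptions made for the example.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Node execution times T(N); the values here are illustrative.
    std::map<std::string, int> time = {{"A", 1}, {"B", 2}, {"C", 2}, {"D", 1}};
    // Predecessors of each node: A -> B, A -> C, B -> D, C -> D.
    std::map<std::string, std::vector<std::string>> preds = {
        {"A", {}}, {"B", {"A"}}, {"C", {"A"}}, {"D", {"B", "C"}}};

    // Work(CG) = sum of the execution times of all nodes.
    int work = 0;
    for (const auto& entry : time) work += entry.second;

    // span(CG) = longest path, computed over a topological order
    // (written out by hand for this small DAG).
    std::vector<std::string> topo = {"A", "B", "C", "D"};
    std::map<std::string, int> finish;           // longest path ending at each node
    int span = 0;
    for (const auto& node : topo) {
        int start = 0;
        for (const auto& p : preds[node]) start = std::max(start, finish[p]);
        finish[node] = start + time[node];
        span = std::max(span, finish[node]);
    }

    std::cout << "work = " << work << ", span = " << span
              << ", parallelism = " << static_cast<double>(work) / span << std::endl;
    return 0;
}

With these assumed times, work = 6, span = 4, and parallelism = 1.5, so using more than about two processors would not help this graph.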
Let the time for each node be 1, compute the work, span, and
parallelism for the following two computation graphs

[Figure: two computation graphs with nodes A–G; the first has levels A; B, C, D, E; F; G and the second has levels A; B, C, D; E, F; G, with edges as drawn on the slide.]
What are the work, span and parallelism of the
following CG?

[Figure: a computation graph whose node execution times are A: 1, B: 2, C: 4, D: 1, E: 2, F: 1, G: 2, H: 1, I: 2, with edges as drawn on the slide.]
Scheduling of a computation graph
[Figure: computation graph with node A at the top; B and C; D, E, F, and G; H and I; and J at the bottom.]

 Assume each node takes 1 time step.
 A task can be allocated only when all of its predecessors have been executed.

Task scheduling (2 processors):

Time step   P0   P1
1           A    -
2           B    C
3           D    E
4           H    F
5           G    -
6           I    -
7           J    -
Scheduling of a computation graph
[Figure: the same computation graph as on the previous slide.]

 Another schedule: better than the previous one.

Task scheduling (2 processors):

Time step   P0   P1
1           A    -
2           B    C
3           D    E
4           F    G
5           H    I
6           J    -
Greedy Scheduling
A greedy scheduling algorithm assigns tasks to processors as soon as they become available, without
waiting for other tasks to complete. This ensures that processors are always busy if there are tasks
ready to be executed.
Example: Task Dependency Graph
Consider a task dependency graph where nodes represent tasks and edges represent dependencies
between tasks. Here’s a simple example:
     A
    / \
   B   C
  / \ / \
 D   E   F

• Task A must be completed before tasks B and C can start.
• Tasks B and C must be completed before tasks D, E, and F can start.
Greedy Scheduling Steps
1. Initial State:
   • Ready tasks: A
   • Processors: P1, P2, P3
2. Step 1:
   • Assign task A to P1.
   • Ready tasks: B, C (after A completes)
3. Step 2:
   • Assign task B to P1 (since P1 is free after completing A).
   • Assign task C to P2.
   • Ready tasks: D, E, F (after B and C complete)
4. Step 3:
   • Assign task D to P1 (since P1 is free after completing B).
   • Assign task E to P2 (since P2 is free after completing C).
   • Assign task F to P3.

Visualization
Here’s a visual representation of the greedy scheduling process:
Time Step 1:  P1: A   P2: -   P3: -
Time Step 2:  P1: B   P2: C   P3: -
Time Step 3:  P1: D   P2: E   P3: F
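The steps above can be simulated mechanically. The following is a minimal sketch (not from the slides) of a greedy scheduler for unit-time tasks, using the task graph of this example with three processors; the data structures and variable names are illustrative assumptions.

#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Successor lists for the example graph: A -> B, C;  B -> D, E;  C -> E, F.
    std::map<std::string, std::vector<std::string>> succs = {
        {"A", {"B", "C"}}, {"B", {"D", "E"}}, {"C", {"E", "F"}},
        {"D", {}}, {"E", {}}, {"F", {}}};

    // Count how many unfinished predecessors each task still has.
    std::map<std::string, int> indegree;
    for (const auto& entry : succs) indegree[entry.first];
    for (const auto& entry : succs)
        for (const auto& s : entry.second) ++indegree[s];

    const std::size_t P = 3;                     // number of processors
    std::vector<std::string> ready;
    for (const auto& entry : indegree)
        if (entry.second == 0) ready.push_back(entry.first);

    int step = 0;
    while (!ready.empty()) {
        ++step;
        // Greedy rule: never leave a processor idle while a task is ready.
        std::size_t n = std::min(P, ready.size());
        std::vector<std::string> running(ready.begin(), ready.begin() + n);
        ready.erase(ready.begin(), ready.begin() + n);

        std::cout << "Time step " << step << ":";
        for (const auto& t : running) std::cout << " " << t;
        std::cout << std::endl;

        // Finished tasks make their successors ready once all predecessors are done.
        for (const auto& t : running)
            for (const auto& s : succs[t])
                if (--indegree[s] == 0) ready.push_back(s);
    }
    return 0;
}

Running this sketch reproduces the three time steps shown in the visualization above.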
Greedy Scheduling
Key Points
• No Idle Processors: At each step, all available processors are assigned tasks if there are ready tasks.
• Minimized Idle Time: Processors are never idle if there are tasks that can be executed.
• Efficient Utilization: This approach ensures efficient utilization of processor resources.
Conclusion

A greedy schedule ensures that processors are always busy if there are tasks ready to be
executed, leading to efficient use of resources and reduced overall execution time. This is
particularly useful in parallel computing where maximizing processor utilization is crucial
for performance.
Greedy schedule

 A greedy schedule is one that never forces a processor to be idle when one or more nodes are ready for execution.
 A node is ready for execution if all its predecessors have been executed.
 With one processor, let T_1 be the time to execute the CG.
  o With any greedy schedule, T_1 = Work(CG).
 With an infinite number of processors, let T_∞ be the time to execute the CG.
  o With any greedy schedule, T_∞ = span(CG).
Greedy schedule

 With P processors, let T_P be the execution time of a schedule for the CG.
 For any greedy schedule:
  o T_P ≥ Work(CG) / P
  o T_P ≥ span(CG)
  o Hence, T_P ≥ max(Work(CG) / P, span(CG))
Greedy schedule

 For any greedy schedule:
  o T_P ≤ Work(CG) / P + span(CG)
  o A step is complete when all P processors are used, incomplete otherwise.
  o Number of complete steps ≤ Work(CG) / P
  o Number of incomplete steps ≤ span(CG)
  o Total steps ≤ Work(CG) / P + span(CG)

 Graham, R. L. “Bounds on Multiprocessing Timing Anomalies.” SIAM Journal on Applied Mathematics 17, no. 2 (1969): 416–29.
Greedy schedule

 Combining the results:

   max(Work(CG) / P, span(CG)) ≤ T_P ≤ Work(CG) / P + span(CG)

 Any greedy scheduler achieves a T_P that is within a factor of 2 of the optimal.
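As a quick check with assumed numbers: if Work(CG) = 100, span(CG) = 10, and P = 4, then any greedy schedule satisfies max(100/4, 10) = 25 ≤ T_P ≤ 100/4 + 10 = 35, so it is at most 35/25 = 1.4 times slower than the best possible 4-processor schedule.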
Speedup and Scalability
 Speedup, Scalability, strong scaling, weak scaling
 Amdahl’s law
 Gustafson’s law
Performance expectation

 When using 1 processor, the sequential program runs for 100 seconds. When we use 10 processors, should the program run 10 times faster?
 This works only for embarrassingly parallel computations – parallel computations that can be divided into completely independent computations that can be executed simultaneously. There may be no interaction between the separate processes; sometimes the results need to be collected.
 Embarrassingly parallel applications are the kind that can scale up to a very large
number of processors. Examples: Monte Carlo analysis, numerical integration, 3D
graphics rendering, and many more.
 In other types of applications, the computation components interact and have
dependencies, which prevents the applications from using a large number of processors.
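As an illustration of one of the embarrassingly parallel examples listed above (Monte Carlo analysis), here is a minimal OpenMP sketch (not from the slides) of Monte Carlo estimation of π: every thread samples points independently, and the only interaction is the final reduction that collects the counts. The sample count, seed, and variable names are assumptions made for the example.

#include <omp.h>
#include <cstdio>
#include <random>

int main() {
    const long long samples = 50000000;          // assumed number of random points
    long long hits = 0;

    // Each thread generates its own points independently; the reduction
    // combines the per-thread hit counts at the end.
    #pragma omp parallel reduction(+ : hits)
    {
        std::mt19937_64 gen(12345 + omp_get_thread_num());   // per-thread seed (assumed)
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for
        for (long long i = 0; i < samples; ++i) {
            double x = dist(gen), y = dist(gen);
            if (x * x + y * y <= 1.0) ++hits;    // point falls inside the quarter circle
        }
    }

    std::printf("pi ~= %f\n", 4.0 * hits / samples);
    return 0;
}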
Scalability

 Scalability of a program measures how many processors the program can make effective use of.
  o For a computation represented by a computation graph, parallelism is a good indicator of scalability.
Speedup and Strong scaling

 Let T_1 be the execution time for a computation on 1 processor and T_P be the execution time for the same computation (with the same input – same problem) on P processors.

   Speedup(P) = T_1 / T_P

  o The factor by which the use of P processors speeds up execution time relative to 1 processor for the same problem.
  o Since the problem size is fixed, this is referred to as “strong scaling”.
  o Given a computation graph, what is the highest speedup that can be achieved?
Speedup

 =

 Typically, 1 ≤ ≤
 The speedup is ideal if =
 Linear speedup: = × for some constant 0 < <
1
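For example (assumed timings): if T_1 = 100 s and T_16 = 8 s, then Speedup(16) = 100 / 8 = 12.5, which is a linear speedup with c = 12.5 / 16 ≈ 0.78.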
Efficiency

 The efficiency of an algorithm using P processors is

   Efficiency = Speedup(P) / P

  o Efficiency estimates how well-utilized the processors are in running the parallel program.
  o Ideal speedup means Efficiency = 1 (100% efficiency).
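For instance (assumed numbers): a speedup of 12 on 16 processors gives Efficiency = 12 / 16 = 0.75, while the same speedup of 12 on 32 processors gives only 12 / 32 = 0.375 – the extra processors are mostly idle.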
