
Unit 3

Parallel Computing (CS 14.403)

Performance and Scalability of Parallel Systems

By
Niranjan Lal
Analytical Modeling of Parallel
Programs
• A sequential algorithm is usually evaluated in terms of its
execution time, expressed as a function of the size of its
input.
• The execution time of a parallel algorithm depends not only
on input size but also on the number of processing elements
used, and their relative computation and inter-process
communication speeds.
• Hence, a parallel algorithm cannot be evaluated in isolation
from a parallel architecture without some loss in accuracy.
• A parallel system is the combination of an algorithm and
the parallel architecture on which it is implemented.
In this chapter, we study various metrics for evaluating the
performance of parallel systems.
Performance

• Why do we care about performance evaluation?
  – Purchasing perspective: given a collection of machines, which has the
    • best performance?
    • least cost?
    • best performance / cost?
  – Design perspective: faced with design options, which has the
    • best performance improvement?
    • least cost?
    • best performance / cost?
How to measure, report, and summarize performance?
• Performance metric
• Benchmark
Which of these airplanes has the best performance?

• What metric defines performance?
  – Capacity, cruising range, or speed?
• Speed itself can be defined in two ways:
  – Taking one passenger from one point to another in the least time
  – Transporting 450 passengers from one point to another in the least time
Performance Metrics for Parallel Systems

• Note that an algorithm may have different performance on
  different parallel architectures.
• For example, an algorithm may perform differently on a
  linear array of processors and on a hypercube of processors.
• It is important to study the performance of parallel
programs with a view to determining the best
algorithm, evaluating hardware platforms, and
examining the benefits from parallelism.
• A number of metrics have been used based on the
desired outcome of performance analysis.
Performance Metrics for Parallel Systems

1. Execution Time
• The serial runtime of a program is the time
elapsed between the beginning and the end of its
execution on a sequential computer.
• The parallel runtime is the time that elapses from
the moment a parallel computation starts to the
moment the last processing element finishes
execution.
We denote the serial runtime by TS and the parallel
runtime by TP.

An Interesting Question

If two machines have the same instruction set architecture (ISA),
which of our quantities (e.g., clock rate, CPI, execution time,
# of instructions, MIPS) will always be identical?
Performance Metrics for Parallel Systems
2. Total Parallel Overhead
• The overheads incurred by a parallel program are encapsulated into a
  single expression referred to as the overhead function.
• We define the overhead function, or total overhead, of a parallel system as
  the total time collectively spent by all the processing elements over and
  above that required by the fastest known sequential algorithm for solving
  the same problem on a single processing element.
• We denote the overhead function of a parallel system by the symbol To.
• The total time spent in solving a problem, summed over all processing
  elements, is pTP. TS units of this time are spent performing useful work,
  and the remainder is overhead.
• Therefore, the overhead function To is given by

  To = pTP - TS
Performance Metrics for Parallel Systems
3. Speedup
• When evaluating a parallel system, we are often interested in knowing
  how much performance gain is achieved by parallelizing a given
  application over a sequential implementation.
• Speedup is a measure that captures the relative benefit of solving a
  problem in parallel.
• It is defined as the ratio of the time taken to solve a problem on a single
  processing element to the time required to solve the same problem on a
  parallel computer with p identical processing elements (p = number of cores).
• We denote speedup by the symbol S:

  S = TS / TP
3. Speedup (Example-1 Adding n numbers using n processing
elements)
• Consider the problem of adding n numbers by using n processing
  elements. Initially, each processing element is assigned one of the
  numbers to be added and, at the end of the computation, one of the
  processing elements stores the sum of all the numbers.
• Assuming that n is a power of two, we can perform this operation in log n
  steps by propagating partial sums up a logical binary tree of processing
  elements.
• The next figure illustrates the procedure for n = 16. The processing
  elements are labeled from 0 to 15.
• Similarly, the 16 numbers to be added are labeled from 0 to 15. The sum
  of the numbers with consecutive labels from i to j is denoted by Σ(i..j).

[Figure: Computing the global sum of 16 partial sums using 16 processing
elements. Σ(i..j) denotes the sum of the numbers with consecutive labels
from i to j.]
Example-1 Adding n numbers using n
processing elements cont…
• Each step shown in the previous figure consists of one addition and the
  communication of a single word. The addition can be performed in
  some constant time, say tc, and the communication of a single word
  can be performed in time ts + tw.
• Therefore, each of the log n steps takes a constant amount of time. Thus,

  TP = Θ(log n)

• Since the problem can be solved in Θ(n) time on a single processing
  element, its speedup is

  S = Θ(n / log n)
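A minimal Python sketch of this analysis (the helper tree_sum_steps and the constants tc, ts, tw are illustrative assumptions, not part of the slides): it simulates the pairwise tree reduction for n = 16, confirms that log2 n = 4 parallel steps suffice, and then evaluates the constant-time-per-step model.

```python
import math

def tree_sum_steps(values):
    """Simulate the pairwise (binary-tree) reduction of Example 1 and
    count how many parallel steps it takes."""
    steps = 0
    while len(values) > 1:
        # In one parallel step every odd-indexed element is sent to its
        # even-indexed neighbour and added there.
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = tree_sum_steps(list(range(16)))
print(total, steps)                  # 120 4  -> log2(16) parallel steps

# Constant-time-per-step model with illustrative constants (assumptions):
tc, ts, tw = 1.0, 2.0, 0.5           # add time, message startup, per-word transfer
n = 1 << 20
Tp = (tc + ts + tw) * math.log2(n)   # Theta(log n)
Ts = tc * n                          # Theta(n) on one processing element
print(Ts / Tp)                       # speedup, Theta(n / log n)
```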

• For a given problem, more than one sequential algorithm may be
  available, but all of these may not be equally suitable for
  parallelization. When a serial computer is used, it is natural to use the
  sequential algorithm that solves the problem in the least amount of
  time.
Example-1 Adding n numbers using n
processing elements cont…
• Given a parallel algorithm, it is fair to judge its performance with respect to
the fastest sequential algorithm for solving the same problem on a single
processing element. Sometimes, the asymptotically fastest sequential
algorithm to solve a problem is not known, or its runtime has a large constant
that makes it impractical to implement.
• In such cases, we take the fastest known algorithm that would be a practical
choice for a serial computer to be the best sequential algorithm.
• We compare the performance of a parallel algorithm to solve a problem with
that of the best sequential algorithm to solve the same problem. We formally
define the speedup S as the ratio of the serial runtime of the best sequential
algorithm for solving a problem to the time taken by the parallel algorithm to
solve the same problem on p processing elements.
• The p processing elements used by the parallel algorithm are assumed to be
identical to the one used by the sequential algorithm.
Example-2 Computing speedups of parallel
programs
• Consider the example of parallelizing bubble sort.
• Assume that a serial version of bubble sort of 10^5 records takes 150
  seconds and a serial quicksort can sort the same list in 30 seconds.
• If a parallel version of bubble sort, also called odd-even sort, takes 40
  seconds on four processing elements, it would appear that the parallel
  odd-even sort algorithm results in a speedup of 150/40, or 3.75.
• However, this conclusion is misleading: in reality the parallel
  algorithm achieves a speedup of only 30/40, or 0.75, with respect to the
  best serial algorithm.
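The two speedup figures can be checked with a few lines of Python (the variable names are ours; the runtimes are those quoted in Example 2):

```python
# Runtimes from Example 2 (seconds)
t_serial_bubble    = 150.0   # serial bubble sort of 1e5 records
t_serial_quick     = 30.0    # best known serial algorithm (quicksort)
t_parallel_oddeven = 40.0    # odd-even sort on 4 processing elements

naive_speedup = t_serial_bubble / t_parallel_oddeven   # 3.75 (misleading)
true_speedup  = t_serial_quick  / t_parallel_oddeven   # 0.75 (vs. best serial)
print(naive_speedup, true_speedup)
```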
Performance Metrics for Parallel Systems

4. Efficiency
• Only an ideal parallel system containing p processing elements
  (p = number of cores) can deliver a speedup equal to p.
• In practice, ideal behavior is not achieved because, while
  executing a parallel algorithm, the processing elements
  cannot devote 100% of their time to the computations of the
  algorithm.
• As we saw in Example 1, part of the time required by the
  processing elements to compute the sum of n numbers is
  spent idling (and communicating in real systems).
Performance Metrics for Parallel Systems

• Efficiency is a measure of the fraction of time for which a
  processing element is usefully employed; it is defined as
  the ratio of speedup to the number of processing elements.
• In an ideal parallel system, speedup is equal to p and
  efficiency is equal to one. In practice, speedup is less than
  p and efficiency is between zero and one, depending on the
  effectiveness with which the processing elements are
  utilized.
• We denote efficiency by the symbol E. Mathematically, it
  is given by

  E = S / p

Example 3: Efficiency of adding n numbers on n processing elements

• From the speedup equation and the preceding definition, the
  efficiency of the algorithm for adding n numbers on n
  processing elements is

  E = Θ(n / log n) / n = Θ(1 / log n)

• We also illustrate the process of deriving the parallel
  runtime, speedup, and efficiency while preserving the various
  constants associated with the parallel platform.
Example 5 . Edge detection on images

• Given an n x n pixel image, the problem of detecting edges
  corresponds to applying a 3 x 3 template to each pixel.
• The process of applying the template corresponds to
  multiplying pixel values with the corresponding template
  values and summing across the template (a convolution
  operation).
• This process is illustrated in Figure (a) along with typical
  templates (Figure (b)). Since we have nine multiply-add
  operations for each pixel, if each multiply-add takes time
  tc, the entire operation takes time 9 tc n^2 on a serial
  computer.
Example 5. Edge detection on images
• A simple parallel algorithm for this problem partitions the image
equally across the processing elements and each processing element
applies the template to its own subimage.
• Note that for applying the template to the boundary pixels, a
processing element must get data that is assigned to the adjoining
processing element.
• Specifically, if a processing element is assigned a vertically sliced
subimage of dimension n x (n/p), it must access a single layer of n
pixels from the processing element to the left and a single layer of n
pixels from the processing element to the right (note that one of these
accesses is redundant for the two processing elements assigned the
subimages at the extremities).
Example 5 . Edge detection on images
• This is illustrated in Figure 5.4(c). On a message passing machine, the
algorithm executes in two steps:
• (i) exchange a layer of n pixels with each of the two adjoining
processing elements; and
• (ii) apply template on local subimage.
• The first step involves two n-word messages (assuming each pixel
  takes one word to communicate its RGB data). This takes time 2(ts + tw n).
• The second step takes time 9 tc n^2 / p.
• The total time for the algorithm is therefore given by:

  TP = 9 tc n^2 / p + 2(ts + tw n)
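A small sketch of this timing model; the formulas come from the example, while the numeric values chosen for tc, ts, tw, n and p below are illustrative assumptions:

```python
# Timing model for the parallel edge-detection example.
tc, ts, tw = 1e-9, 1e-6, 1e-8   # multiply-add time, message startup, per-word transfer (s)

def edge_detect_times(n, p):
    """Serial and parallel runtimes of the n x n edge-detection example."""
    t_serial   = 9 * tc * n**2
    t_parallel = 9 * tc * n**2 / p + 2 * (ts + tw * n)
    return t_serial, t_parallel

n, p = 4096, 16
Ts, Tp = edge_detect_times(n, p)
S = Ts / Tp        # speedup
E = S / p          # efficiency
print(f"S = {S:.2f}, E = {E:.3f}")
```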
Performance Metrics for Parallel Systems

5. Cost
• We define the cost of solving a problem on a parallel system
  as the product of parallel runtime and the number of
  processing elements used:  Cost = p TP.
• Cost reflects the sum of the time that each processing
  element spends solving the problem.
• Efficiency can also be expressed as the ratio of the execution
  time of the fastest known sequential algorithm for solving a
  problem to the cost of solving the same problem on p
  processing elements:  E = TS / (p TP).
Performance Metrics for Parallel Systems
• The cost of solving a problem on a single processing element is the
  execution time of the fastest known sequential algorithm.
• A parallel system is said to be cost-optimal if the cost of solving a
  problem on a parallel computer has the same asymptotic growth (in Θ
  terms), as a function of the input size, as the fastest known sequential
  algorithm on a single processing element.
• Since efficiency is the ratio of sequential cost to parallel cost, a cost-
  optimal parallel system has an efficiency of Θ(1). Cost is sometimes
  referred to as work or processor-time product, and a cost-optimal
  system is also known as a pTP-optimal system.
Example 6 Cost of adding n numbers on n
processing elements
• The algorithm given in Example 1 for adding n
  numbers on n processing elements has a
  processor-time product of Θ(n log n).
• Since the serial runtime of this operation is Θ(n),
  the algorithm is not cost-optimal.
• Cost-optimality is a very important practical
  concept, although it is defined in terms of
  asymptotics.
• We illustrate this using the following example.
Example 7 Performance of non-cost
optimal algorithms
• Consider a sorting algorithm that uses n processing elements to sort
  a list of n numbers in time (log n)^2. Since the serial runtime of a
  (comparison-based) sort is n log n, the speedup and efficiency of this
  algorithm are n/log n and 1/log n, respectively.
• The pTP product of this algorithm is n(log n)^2. Therefore, this
  algorithm is not cost-optimal, but only by a factor of log n. Let us
  consider a realistic scenario in which the number of processing
  elements p is much less than n.
• An assignment of these n tasks to p < n processing elements gives us a
  parallel time less than n(log n)^2 / p.
Example 7 Performance of non-cost
optimal algorithms
• This follows from the fact that if n processing elements take time
  (log n)^2, then one processing element would take time n(log n)^2, and p
  processing elements would take time n(log n)^2 / p.
• The corresponding speedup of this formulation is p/log n. Consider the
  problem of sorting 1024 numbers (n = 1024, log n = 10) on 32
  processing elements.
• The speedup expected is only p/log n, or 3.2. This number gets worse
  as n increases. For n = 10^6, log n = 20, and the speedup is only 1.6.
  Clearly, there is a significant cost associated with not being cost-optimal
  even by a small factor. This emphasizes the practical importance of
  cost-optimality.
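These numbers are easy to reproduce; a quick Python check (the function name is ours), assuming a comparison-based serial sort taking n log n time:

```python
import math

def simulated_sort_speedup(n, p):
    """Speedup of the non-cost-optimal (log n)^2 sort when its n tasks
    are folded onto p < n processing elements (Example 7)."""
    t_serial   = n * math.log2(n)              # comparison-based serial sort
    t_parallel = n * (math.log2(n) ** 2) / p   # p PEs simulating n PEs
    return t_serial / t_parallel               # = p / log2(n)

print(simulated_sort_speedup(1024, 32))    # 3.2
print(simulated_sort_speedup(10**6, 32))   # ~1.6, since log2(1e6) ~ 20
```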
The Effect of Granularity on Performance

• In parallel computing, the granularity (or grain size)
  of a task is a measure of the amount of work (or
  computation) performed by that task.
• Granularity can also be defined as the ratio of computation
  time to communication time, where computation time is the
  time required to perform the computation of a task and
  communication time is the time required to exchange data
  between processors.
Effects of Granularity on Cost-Optimality

• Assume an algorithm designed for n (virtual) processing elements.
  – If only p physical PEs are available (p < n), then each physical PE
    simulates n/p virtual PEs.
  – The computation at each physical PE therefore increases by a
    factor of n/p.
• Note: even with p < n, this simulation does not necessarily yield a
  cost-optimal algorithm.
Example 5.9 Adding n numbers on p
processing elements
• Consider the problem of adding n numbers on p processing
elements such that p < n and both n and p are powers of 2.
We use the same algorithm as in Example 5.1 and simulate
n processing elements on p processing elements.
• The steps leading to the solution are shown in Next Figure
5.5 for n = 16 and p = 4. Virtual processing element i is
simulated by the physical processing element labeled i
mod p; the numbers to be added are distributed similarly.
• The first log p of the log n steps of the original algorithm
are simulated in (n/p) log p steps on p processing elements.
Example 5.9 Adding n numbers on p
processing elements
• In the remaining steps, no communication is required because the
processing elements that communicate in the original algorithm are
simulated by the same processing element; hence, the remaining
numbers are added locally.
• The algorithm takes Θ((n/p) log p) time in the steps that require
  communication, after which a single processing element is left with
  n/p numbers to add, taking time Θ(n/p).
• Thus, the overall parallel execution time of this parallel system is
  Θ((n/p) log p). Consequently, its cost is Θ(n log p), which is
  asymptotically higher than the Θ(n) cost of adding n numbers
  sequentially.
• Therefore, the parallel system is not cost-optimal.
Example 5.9 Adding n numbers on p
processing elements
• Figure 5.5. Four processing elements simulating 16 processing
  elements to compute the sum of 16 numbers. Σ(i..j) denotes the sum of
  the numbers with consecutive labels from i to j. The first two steps are
  shown in panels (a) and (b); the last three steps in panels (c) through (e).
[Figure 5.5 panels:
 (a) Four processors simulating the first communication step of 16 processors
     (substeps 1-4)
 (b) Four processors simulating the second communication step of 16 processors
     (substeps 1-4)
 (c) Simulation of the third step in two substeps
 (d) Simulation of the fourth step
 (e) Final result]


• Example 5.1 showed that n numbers can be added on an n-processor machine
  in time Θ(log n). When using p processing elements to simulate n virtual
  processing elements (p < n), the expected parallel runtime is Θ((n/p) log n).
• However, in Example 5.9 this task was performed in time Θ((n/p) log p)
  instead. The reason is that not every communication step of the original
  algorithm has to be simulated; at times, communication takes place between
  virtual processing elements that are simulated by the same physical processing
  element.
• For these operations, there is no associated overhead. For example, the
  simulation of the third and fourth steps (Figure 5.5(c) and (d)) did not require
  any communication.
• However, this reduction in communication was not enough to make the
  algorithm cost-optimal. Example 5.10 illustrates that the same problem
  (adding n numbers on p processing elements) can be performed cost-optimally
  with a smarter assignment of data to processing elements.
Example 5.10 Adding n numbers cost-
optimally
• An alternate method for adding n numbers using p
processing elements is illustrated in Next Figure 5.6 for n
= 16 and p = 4.
Example 5.10 Adding n numbers cost-
optimally
• In the first step of this algorithm, each processing element
  locally adds its n/p numbers in time Θ(n/p).
• Now the problem is reduced to adding the p partial sums
  on p processing elements, which can be done in time Θ(log p)
  by the method described in Example 5.1. The parallel
  runtime of this algorithm is Θ(n/p + log p), and its cost
  is Θ(n + p log p). As long as n = Ω(p log p), the cost is
  Θ(n), which is the same as the serial runtime. Hence, this
  parallel system is cost-optimal.
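A rough numeric comparison of the two formulations, under the same unit-time assumptions used elsewhere in this unit (the helper functions and the chosen n and p values are illustrative, not from the slides): Example 5.9's cost grows like n log p, while Example 5.10's stays Θ(n) as long as n = Ω(p log p).

```python
import math

def cost_naive_simulation(n, p):
    """Cost (p * Tp) of Example 5.9: p PEs simulate n virtual PEs."""
    tp = (n / p) * math.log2(p) + n / p   # (n/p) log p communication steps + local adds
    return p * tp

def cost_local_first(n, p):
    """Cost of Example 5.10: add n/p numbers locally, then tree-sum p partial sums."""
    tp = n / p + 2 * math.log2(p)         # local additions + log p combine steps
    return p * tp

n = 1 << 20
for p in (4, 16, 64, 256):
    print(p, round(cost_naive_simulation(n, p)), round(cost_local_first(n, p)))
# The first cost grows like n log p; the second stays close to n (cost-optimal).
```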
Scalability of Parallel Systems

• Very often, programs are designed and tested for smaller
  problems on fewer processing elements. However, the real
  problems these programs are intended to solve are much
  larger, and the machines contain a larger number of
  processing elements. Whereas code development is
  simplified by using scaled-down versions of the machine
  and the problem, the performance and correctness of the
  programs are much more difficult to establish based on
  scaled-down systems.
• In this section, we will investigate techniques for
evaluating the scalability of parallel programs using
analytical tools.
5.4.1 Scaling Characteristics of Parallel
Programs
• The efficiency of a parallel program can be written as:

  E = S / p = TS / (p TP)
Example 5.11 Why is performance
extrapolation so difficult?
• Consider three parallel algorithms for computing an n-
point Fast Fourier Transform (FFT) on 64 processing
elements. Next Figure 5.7 illustrates speedup as the value
of n is increased to 18 K.
• Keeping the number of processing elements constant, at
smaller values of n, one would infer from observed
speedups that binary exchange and 3-D transpose
algorithms are the best.
• However, as the problem is scaled up to 18 K points or
more, it is evident from Next Figure 5.7 that the 2-D
transpose algorithm yields best speedup.
Example 5.11 Why is performance extrapolation so difficult?

[Figure 5.7: Speedup of the binary exchange, 2-D transpose, and 3-D transpose
FFT algorithms on 64 processing elements as a function of problem size n.]
Example 5.12 Speedup and efficiency as functions
of the number of processing elements
• Consider the problem of adding n numbers on p processing elements.
We use the same algorithm as in Example 5.10(Example 5.10 Adding
n numbers cost-optimally).
• However, to illustrate actual speedups, we work with constants here
instead of asymptotics.
• Assuming unit time for adding two numbers, the first phase (local
summations) of the algorithm takes roughly n/p time.
• The second phase involves log p steps with a communication and an
addition at each step.
Example 5.12 Speedup and efficiency as functions
of the number of processing elements

• If a single communication takes unit time as well, the time
  for this phase is 2 log p. Therefore, we can derive the parallel
  time, speedup, and efficiency as:

  TP = n/p + 2 log p
  S  = n / (n/p + 2 log p)
  E  = 1 / (1 + (2 p log p) / n)
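The following sketch evaluates these three expressions for a fixed problem size and a growing number of processing elements (the specific n and p values are ours); the efficiencies it prints for n = 64 match the table shown later in this unit.

```python
import math

def add_n_on_p(n, p):
    """Parallel time, speedup and efficiency of Example 5.12
    (unit-time additions and communications)."""
    tp = n / p + 2 * math.log2(p)
    s  = n / tp
    e  = s / p
    return tp, s, e

for p in (1, 4, 8, 16, 32):
    tp, s, e = add_n_on_p(64, p)
    print(f"p={p:2d}  Tp={tp:6.1f}  S={s:5.2f}  E={e:.2f}")
# Efficiency drops as p grows for a fixed problem size n = 64.
```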
Amdahl's Law (1967)

• The achievable speedup is limited by the fraction of the program that
  must run serially. If α is the serial fraction of the work, then on p
  processors

  S = 1 / (α + (1 - α)/p)  ≤  1/α

• Even with an unlimited number of processors, the speedup can never
  exceed 1/α.
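A one-function sketch of the law (the example value α = 0.05 is an assumption chosen for illustration):

```python
def amdahl_speedup(alpha, p):
    """Amdahl's law: speedup on p processors when a fraction alpha
    of the work is inherently serial."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

for p in (4, 16, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))
# With alpha = 0.05 the speedup saturates below 1/alpha = 20, however large p is.
```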
Gustafson’s Law (1988)

• Also known as the Gustafson‐Barsis’s Law


• Any sufficiently large problem can be efficiently
  parallelized, with a scaled speedup

  S = p - α(p - 1)

  where p is the number of processors and α is the serial
  portion of the problem.
• Gustafson proposed a fixed-time concept, which leads to
  scaled speedup for larger problem sizes.
• Basically, we use larger systems with more processors to
  solve larger problems.
Gustafson’s Law (1988)
• The execution time of the program on a parallel computer is
  (a + b), where a is the sequential time and b is the parallel time.
• The total amount of work to be done in parallel varies
  linearly with the number of processors, so b is fixed as p
  is varied. The total serial run time is therefore (a + p*b).
• The speedup is (a + p*b)/(a + b).
• Define α = a/(a + b), the sequential fraction of the
  execution time; then

  S = α + p(1 - α) = p - α(p - 1)
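The corresponding scaled-speedup sketch, using the same illustrative α = 0.05 as in the Amdahl example above:

```python
def gustafson_speedup(alpha, p):
    """Gustafson-Barsis scaled speedup with serial fraction alpha."""
    return p - alpha * (p - 1)

for p in (4, 16, 64, 1024):
    print(p, round(gustafson_speedup(0.05, p), 1))
# Unlike Amdahl's fixed-size bound, the scaled speedup keeps growing ~ (1 - alpha) * p.
```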
Scalability (cont.)

• Increasing the number of processors -> decreases efficiency.
• Increasing the problem size -> increases efficiency.
• Can a parallel system keep its efficiency constant by increasing the
  number of processors and the problem size simultaneously?
  – Yes -> scalable parallel system
  – No  -> non-scalable parallel system
• A scalable parallel system can always be made cost-optimal
  by adjusting the number of processors and the problem size.
ISO-efficiency Metric of Scalability

• We summarize the discussion above with the following two observations:
1. For a given problem size, as we increase the
number of processing elements, the overall
efficiency of the parallel system goes down. This
phenomenon is common to all parallel systems
2. In many cases, the efficiency of a parallel system
increases if the problem size is increased while
keeping the number of processing elements
constant.
ISO-efficiency Metric of Scalability
• These two phenomena are illustrated in Figure 5.9(a) and (b),
respectively. Following from these two observations, we define a
scalable parallel system as one in which the efficiency can be kept
constant as the number of processing elements is increased, provided
that the problem size is also increased.
• It is useful to determine the rate at which the problem size must
increase with respect to the number of processing elements to keep the
efficiency fixed. For different parallel systems, the problem size must
increase at different rates in order to maintain a fixed efficiency as the
number of processing elements is increased. This rate determines the
degree of scalability of the parallel system.
• As we shall show, a lower rate is more desirable than a higher growth
rate in problem size. Let us now investigate metrics for quantitatively
determining the degree of scalability of a parallel system. However,
before we do that, we must define the notion of problem size precisely.
ISO-efficiency Metric of Scalability

• Degree of scalability: the rate at which the problem size must increase,
  as the number of processors changes, in order to maintain efficiency.
• Problem size (W): the number of basic computation steps in the best
  sequential algorithm that solves the problem on a single processor.
• Overhead function (To): the part of the parallel system cost
  (processor-time product) that is not incurred by the fastest
  known serial algorithm on a serial computer.
Iso-efficiency Function
• Parallel execution time can be expressed as a function of the
  problem size, the overhead function, and the number of
  processing elements. We can write the parallel runtime as:

  TP = (W + To(W, p)) / p

• The resulting expression for speedup is

  S = W / TP = W p / (W + To(W, p))

• Finally, we write the expression for efficiency as

  E = S / p = W / (W + To(W, p)) = 1 / (1 + To(W, p)/W)     (Equation 5.12)
Iso-efficiency Function
• In the efficiency equation (5.12) above, if the problem size W is
  kept constant and p is increased, the efficiency decreases
  because the total overhead To increases with p.
• If W is increased while keeping the number of
  processing elements fixed, then for scalable parallel systems
  the efficiency increases.
• This is because To grows slower than Θ(W) for a fixed p.
  For these parallel systems, efficiency can be maintained at a
  desired value (between 0 and 1) for increasing p, provided W
  is also increased.
Iso-efficiency Function
• Solving the efficiency equation for W gives

  W = (E / (1 - E)) To(W, p)                                   (Equation 5.13)

• Let K = E/(1 - E) be a constant that depends on the efficiency to be
  maintained. Since To is a function of W and p, Equation 5.13 can be
  rewritten as

  W = K To(W, p)                                               (Equation 5.14)

• The isoefficiency function determines the growth rate of W
  required to keep the efficiency fixed as p increases.
• Highly scalable systems have a small isoefficiency function.
• From Equation 5.14, the problem size W can usually be
  obtained as a function of p by algebraic manipulation. This
  function dictates the growth rate of W required to keep the efficiency
  fixed as p increases. We call this function the isoefficiency function
  of the parallel system.
Example 5.14 Isoefficiency function of
adding numbers
• The overhead function for the problem of adding n
  numbers on p processing elements is
  approximately 2 p log p, as given by Equations 5.9
  and 5.1. Substituting To = 2 p log p in Equation
  5.14, we get

  W = 2 K p log p
Example 5.14 Isoefficiency function of
adding numbers
• Thus, the asymptotic isoefficiency function for
  this parallel system is Θ(p log p).
• This means that, if the number of processing
elements is increased from p to p', the problem
size (in this case, n) must be increased by a factor
of (p' log p')/(p log p) to get the same efficiency as
on p processing elements. In other words,
increasing the number of processing elements by a
factor of p'/p requires that n be increased by a
factor of (p' log p')/(p log p) to increase the
speedup by a factor of p'/p.
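A small sketch of this scaling rule (the helper names are illustrative): the constant K = E/(1 - E) cancels in the ratio, so the required growth factor depends only on p and p'.

```python
import math

def required_problem_size(p, K=1.0):
    """Isoefficiency relation for adding n numbers: W = 2 K p log p."""
    return 2 * K * p * math.log2(p)

def growth_factor(p_old, p_new):
    """Factor by which the problem size must grow when the number of
    PEs increases from p_old to p_new at fixed efficiency."""
    return required_problem_size(p_new) / required_problem_size(p_old)

print(growth_factor(4, 16))   # (16 * log2 16) / (4 * log2 4) = 8.0
```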
• Efficiency of adding n numbers on a p-processor hypercube
  – for the cost-optimal algorithm:

  S = n / (n/p + 2 log p) = n p / (n + 2 p log p)
  E = S / p = n / (n + 2 p log p) = E(n, p)

    n     p=1    p=4    p=8    p=16   p=32
    64    1.0    .80    .57    .33    .17
   192    1.0    .92    .80    .60    .38
   320    1.0    .95    .87    .71    .50
   512    1.0    .97    .91    .80    .62

• n = Θ(p log p) keeps E constant at 0.80:
  – for n = 64  and p = 4:  n = 8 p log p
  – for n = 192 and p = 8:  n = 8 p log p
  – for n = 512 and p = 16: n = 8 p log p
• Conclusion, for adding n numbers on p processors with the cost-optimal
  algorithm:
  – the algorithm is cost-optimal if n = Ω(p log p)
  – the algorithm is scalable if n increases in proportion to p log p as p
    is increased
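The table and the constant-efficiency diagonal above can be reproduced with a few lines of Python (the formatting choices are ours):

```python
import math

def efficiency(n, p):
    """E(n, p) = n / (n + 2 p log p) for the cost-optimal addition algorithm."""
    return n / (n + 2 * p * math.log2(p))

for n in (64, 192, 320, 512):
    row = [f"{efficiency(n, p):.2f}" for p in (1, 4, 8, 16, 32)]
    print(n, *row)
# Along the diagonal n = 8 p log p (n=64,p=4; n=192,p=8; n=512,p=16) E stays at 0.80.
```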
• Problem size
  – for matrix multiplication:
    • input n  => W = O(n^3);  n' = 2n  => W = O(n'^3) = O(8 n^3)
  – for matrix addition:
    • input n  => W = O(n^2);  n' = 2n  => W = O(n'^2) = O(4 n^2)