Parallel computing
Parallel computing is a type of computation in
which many calculations or processes are carried
out simultaneously.[1] Large problems can often be
divided into smaller ones, which can then be solved
at the same time. There are several different forms
of parallel computing: bit-level, instruction-level,
data, and task parallelism. Parallelism has long
been employed in high-performance computing,
but has gained broader interest due to the physical
constraints preventing frequency scaling.[2] As
power consumption (and consequently heat
generation) by computers has become a concern in recent years,[3] parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.[4]

[Figure: IBM's Blue Gene/P massively parallel supercomputer]
Parallel computing is closely related to concurrent computing—they are frequently used together,
and often conflated, though the two are distinct: it is possible to have parallelism without
concurrency, and concurrency without parallelism (such as multitasking by time-sharing on a
single-core CPU).[5][6] In parallel computing, a computational task is typically broken down into
several, often many, very similar sub-tasks that can be processed independently and whose results
are combined afterwards, upon completion. In contrast, in concurrent computing, the various
processes often do not address related tasks; when they do, as is typical in distributed computing,
the separate tasks may have a varied nature and often require some inter-process communication
during execution.
Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism, with multi-core and multi-processor computers having multiple processing elements
within a single machine, while clusters, MPPs, and grids use multiple computers to work on the
same task. Specialized parallel computer architectures are sometimes used alongside traditional
processors, for accelerating specific tasks.
Background
Traditionally, computer software has been written for serial computation. To solve a problem, an
algorithm is constructed and implemented as a serial stream of instructions. These instructions
are executed on a central processing unit on one computer. Only one instruction may execute at a
time—after that instruction is finished, the next one is executed.[8]
Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve
a problem. This is accomplished by breaking the problem into independent parts so that each
processing element can execute its part of the algorithm simultaneously with the others. The
processing elements can be diverse and include resources such as a single computer with multiple
processors, several networked computers, specialized hardware, or any combination of the
above.[8] Historically, parallel computing was used for scientific computing and the simulation of scientific problems, particularly in the natural and engineering sciences, such as meteorology. This led to the design of parallel hardware and software, as well as high-performance computing.[9]
Frequency scaling was the dominant reason for improvements in computer performance from the
mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied
by the average time per instruction. Maintaining everything else constant, increasing the clock
frequency decreases the average time it takes to execute an instruction. An increase in frequency
thus decreases runtime for all compute-bound programs.[10] However, power consumption P by a
chip is given by the equation P = C × V² × F, where C is the capacitance being switched per clock
cycle (proportional to the number of transistors whose inputs change), V is voltage, and F is the
processor frequency (cycles per second).[11] Increases in frequency increase the amount of power
used in a processor. Increasing processor power consumption led ultimately to Intel's May 8, 2004
cancellation of its Tejas and Jayhawk processors, which is generally cited as the end of frequency
scaling as the dominant computer architecture paradigm.[12]
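As a rough illustration of why multi-core designs became attractive, a minimal C sketch can plug hypothetical numbers into P = C × V² × F (the capacitance, voltage, and frequency values below are assumed for illustration only) to compare one fast, high-voltage core against two slower, lower-voltage cores:

#include <stdio.h>

/* Dynamic power P = C * V^2 * F (capacitance in farads, voltage in volts,
   frequency in hertz). All numbers below are hypothetical. */
static double dynamic_power(double c, double v, double f) {
    return c * v * v * f;
}

int main(void) {
    /* One core at 3.4 GHz and 1.3 V versus two cores at 1.7 GHz and 1.0 V. */
    double single = dynamic_power(1e-9, 1.3, 3.4e9);
    double dual   = 2.0 * dynamic_power(1e-9, 1.0, 1.7e9);
    printf("single fast core: %.2f W\n", single);  /* ~5.75 W */
    printf("two slower cores: %.2f W\n", dual);    /* ~3.40 W */
    return 0;
}

With these assumed values, the two slower cores deliver comparable aggregate clock cycles per second at markedly lower power, which is the trade-off that drove the shift to multi-core designs.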
To deal with the problem of power consumption and overheating the major central processing unit
(CPU or processor) manufacturers started to produce power efficient processors with multiple
cores. The core is the computing unit of the processor and in multi-core processors each core is
independent and can access the same memory concurrently. Multi-core processors have brought
parallel computing to desktop computers. Thus parallelisation of serial programmes has become a
mainstream programming task. In 2012 quad-core processors became standard for desktop
computers, while servers have 10+ core processors. From Moore's law it can be predicted that the
number of cores per processor will double every 18–24 months. This could mean that after 2020 a
typical processor will have dozens or hundreds of cores; however, in reality the standard is
somewhere in the region of 4 to 16 cores, with some designs having a mix of performance and
efficiency cores (such as ARM's big.LITTLE design) due to thermal and design constraints.[13]
An operating system can ensure that different tasks and user programmes are run in parallel on
the available cores. However, for a serial software programme to take full advantage of the multi-
core architecture, the programmer needs to restructure and parallelise the code. A speed-up of application software runtime will no longer be achieved through frequency scaling; instead, programmers will need to parallelise their software code to take advantage of the increasing computing power of multicore architectures.[14]
Optimally, the speedup from parallelization would be linear—doubling the number of processing
elements should halve the runtime, and doubling it a second time should again halve the runtime.
However, very few parallel algorithms achieve optimal speedup. Most of them have a near-linear
speedup for small numbers of processing elements, which flattens out into a constant value for
large numbers of processing elements.
The potential speedup of an algorithm on a parallel computing platform is given by Amdahl's law:[15]

S(s) = 1 / ((1 - p) + p/s)

where S is the potential speedup in latency of the execution of the whole task, s is the speedup in latency of the part of the task that benefits from parallelization, and p is the fraction of the execution time of the whole task that the parallelizable part originally occupied. Since the serial part of a program is unaffected by additional processing elements, the speedup can never exceed 1/(1 - p), which is why the speedup curve flattens out for large numbers of processing elements.
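As a quick numeric illustration, a short C sketch can tabulate the speedup predicted by Amdahl's law for an assumed parallel fraction of 90% and show it approaching the 1/(1 - p) ceiling:

#include <stdio.h>

/* Amdahl's law: speedup S(s) = 1 / ((1 - p) + p / s). */
static double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    const double p = 0.90;                       /* assumed parallelizable fraction */
    int counts[] = {1, 2, 4, 8, 16, 64, 1024};
    for (int i = 0; i < (int)(sizeof counts / sizeof counts[0]); i++)
        printf("%5d processing elements -> speedup %.2f\n",
               counts[i], amdahl_speedup(p, counts[i]));
    /* As s grows, S(s) approaches 1 / (1 - p) = 10, the serial-fraction ceiling. */
    return 0;
}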
Dependencies
Understanding data dependencies is fundamental
in implementing parallel algorithms. No program
can run more quickly than the longest chain of dependent calculations (known as the critical path),
since calculations that depend upon prior calculations in the chain must be executed in order.
However, most algorithms do not consist of just a long chain of dependent calculations; there are
usually opportunities to execute independent calculations in parallel.
Let Pi and Pj be two program segments. Bernstein's conditions[19] describe when the two are independent and can be executed in parallel. For Pi, let Ii be all of the input variables and Oi the output variables, and likewise for Pj. Pi and Pj are independent if they satisfy

Ij ∩ Oi = ∅,
Ii ∩ Oj = ∅,
Oi ∩ Oj = ∅.
Violation of the first condition introduces a flow dependency, corresponding to the first segment
producing a result used by the second segment. The second condition represents an anti-
dependency, when the second segment produces a variable needed by the first segment. The third
and final condition represents an output dependency: when two segments write to the same
location, the result comes from the logically last executed segment.[20]
1: function Dep(a, b)
2: c := a * b
3: d := 3 * c
4: end function
In this example, instruction 3 cannot be executed before (or even in parallel with) instruction 2,
because instruction 3 uses a result from instruction 2. It violates condition 1, and thus introduces a
flow dependency.
1: function NoDep(a, b)
2: c := a * b
3: d := 3 * b
4: e := a + b
5: end function
In this example, there are no dependencies between the instructions, so they can all be run in
parallel.
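The conditions can also be checked mechanically. In the following minimal C sketch (the segments and their read/write sets are hypothetical), each program segment is described by the variables it reads and writes, and two segments are reported independent only when all three intersections above are empty:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* A program segment described by the variables it reads (inputs)
   and the variables it writes (outputs); lists are NULL-terminated. */
struct segment {
    const char *reads[4];
    const char *writes[4];
};

/* True if the two NULL-terminated name lists share any variable. */
static bool intersects(const char *const *a, const char *const *b) {
    for (int i = 0; a[i]; i++)
        for (int j = 0; b[j]; j++)
            if (strcmp(a[i], b[j]) == 0)
                return true;
    return false;
}

/* Bernstein's conditions: Ij∩Oi, Ii∩Oj and Oi∩Oj must all be empty. */
static bool independent(const struct segment *pi, const struct segment *pj) {
    return !intersects(pj->reads, pi->writes) &&   /* no flow dependency   */
           !intersects(pi->reads, pj->writes) &&   /* no anti-dependency   */
           !intersects(pi->writes, pj->writes);    /* no output dependency */
}

int main(void) {
    /* Mirrors the Dep example above: c := a * b, then d := 3 * c. */
    struct segment s1 = { .reads = {"a", "b", NULL}, .writes = {"c", NULL} };
    struct segment s2 = { .reads = {"c", NULL},      .writes = {"d", NULL} };
    printf("Dep segments independent? %s\n",
           independent(&s1, &s2) ? "yes" : "no");   /* prints "no" */
    return 0;
}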
Bernstein's conditions do not allow memory to be shared between different processes. For that,
some means of enforcing an ordering between accesses is necessary, such as semaphores, barriers
or some other synchronization method.
Subtasks in a parallel program are often called threads. Some parallel computer architectures use
smaller, lightweight versions of threads known as fibers, while others use bigger versions known as
processes. However, "threads" is generally accepted as a generic term for subtasks.[21] Threads will
often need synchronized access to an object or other resource, for example when they must update
a variable that is shared between them. Without synchronization, the instructions between the two
threads may be interleaved in any order. For example, consider the following program, in which both threads increment a shared variable V:

Thread A                        Thread B
1A: Read variable V             1B: Read variable V
2A: Add 1 to variable V         2B: Add 1 to variable V
3A: Write back to variable V    3B: Write back to variable V

If instruction 1B is executed between 1A and 3A, or if instruction 1A is executed between 1B and 3B, the program will produce incorrect data. This is known as a race condition. The programmer must use a lock to provide mutual exclusion. A lock is a programming-language construct that allows one thread to take control of a variable and prevent other threads from reading or writing it, until that variable is unlocked:

Thread A                        Thread B
1A: Lock variable V             1B: Lock variable V
2A: Read variable V             2B: Read variable V
3A: Add 1 to variable V         3B: Add 1 to variable V
4A: Write back to variable V    4B: Write back to variable V
5A: Unlock variable V           5B: Unlock variable V
One thread will successfully lock variable V, while the other thread will be locked out—unable to
proceed until V is unlocked again. This guarantees correct execution of the program. Locks may be
necessary to ensure correct program execution when threads must serialize access to resources,
but their use can greatly slow a program and may affect its reliability.[22]
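A minimal POSIX Threads sketch of this locking pattern (assuming a POSIX system and compilation with -pthread) has two threads increment a shared counter, with a mutex serializing the read-modify-write sequence so that no update is lost:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                      /* shared variable "V" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *increment(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);            /* lock variable V      */
        counter = counter + 1;                /* read, add 1, write   */
        pthread_mutex_unlock(&lock);          /* unlock variable V    */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld\n", counter);       /* always 2000000 with the lock */
    return 0;
}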
Locking multiple variables using non-atomic locks introduces the possibility of program deadlock.
An atomic lock locks multiple variables all at once. If it cannot lock all of them, it does not lock any
of them. If two threads each need to lock the same two variables using non-atomic locks, it is
possible that one thread will lock one of them and the second thread will lock the second variable.
In such a case, neither thread can complete, and deadlock results.[23]
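The all-or-nothing locking described above can be approximated with POSIX mutexes, as in the following sketch (an illustration, not a production recipe): each thread tries to take both locks, and if the second is unavailable it releases the first and retries, so it never waits while already holding a lock:

#include <pthread.h>
#include <sched.h>

static pthread_mutex_t lock_x = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_y = PTHREAD_MUTEX_INITIALIZER;

/* Acquire both mutexes or neither: if the second one is busy, release the
   first and retry, so a thread never blocks while holding a lock. */
static void lock_both(pthread_mutex_t *first, pthread_mutex_t *second) {
    for (;;) {
        pthread_mutex_lock(first);
        if (pthread_mutex_trylock(second) == 0)
            return;                      /* got both locks            */
        pthread_mutex_unlock(first);     /* could not get both: retry */
        sched_yield();                   /* let the other thread run  */
    }
}

/* Thread A takes x then y; thread B takes y then x. With plain blocking
   locks this ordering could deadlock; with the all-or-nothing acquire it
   cannot. */
static void *thread_a(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock_both(&lock_x, &lock_y);
        pthread_mutex_unlock(&lock_y);
        pthread_mutex_unlock(&lock_x);
    }
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        lock_both(&lock_y, &lock_x);
        pthread_mutex_unlock(&lock_x);
        pthread_mutex_unlock(&lock_y);
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}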
Many parallel programs require that their subtasks act in synchrony. This requires the use of a
barrier. Barriers are typically implemented using a lock or a semaphore.[24] One class of
algorithms, known as lock-free and wait-free algorithms, altogether avoids the use of locks and
barriers. However, this approach is generally difficult to implement and requires correctly
designed data structures.[25]
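A minimal barrier sketch using POSIX threads (assuming a platform that provides pthread_barrier_t and compilation with -pthread): each thread finishes its share of a first phase, waits at the barrier, and only then does one thread combine the results:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

static pthread_barrier_t barrier;
static double partial[NUM_THREADS];

static void *worker(void *arg) {
    int id = (int)(long)arg;
    partial[id] = id * 10.0;               /* phase 1: local computation */
    pthread_barrier_wait(&barrier);        /* wait for every thread      */
    if (id == 0) {                         /* phase 2: safe to combine   */
        double sum = 0.0;
        for (int i = 0; i < NUM_THREADS; i++)
            sum += partial[i];
        printf("sum = %.1f\n", sum);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NUM_THREADS];
    pthread_barrier_init(&barrier, NULL, NUM_THREADS);
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}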
Not all parallelization results in speed-up. Generally, as a task is split up into more and more
threads, those threads spend an ever-increasing portion of their time communicating with each
other or waiting on each other for access to resources.[26][27] Once the overhead from resource
contention or communication dominates the time spent on other computation, further
parallelization (that is, splitting the workload over even more threads) increases rather than
decreases the amount of time required to finish. This problem, known as parallel slowdown,[28]
can be improved in some cases by software analysis and redesign.[29]
Applications are often classified according to how often their subtasks need to synchronize or
communicate with each other. An application exhibits fine-grained parallelism if its subtasks must
communicate many times per second; it exhibits coarse-grained parallelism if they do not
communicate many times per second, and it exhibits embarrassing parallelism if they rarely or
never have to communicate. Embarrassingly parallel applications are considered the easiest to
parallelize.
Flynn's taxonomy
Michael J. Flynn created one of the earliest classification systems for parallel (and sequential)
computers and programs, now known as Flynn's taxonomy. Flynn classified programs and
computers by whether they were operating using a single set or multiple sets of instructions, and
whether or not those instructions were using a single set or multiple sets of data.
According to David A. Patterson and John L. Hennessy, "Some machines are hybrids of these
categories, of course, but this classic model has survived because it is simple, easy to understand,
and gives a good first approximation. It is also—perhaps because of its understandability—the
most widely used scheme."[31]
Types of parallelism
Bit-level parallelism
For example, where an 8-bit processor must add two 16-bit integers, it must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction.
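As an illustration, the following C sketch (with hypothetical operand values) emulates how an 8-bit machine would add two 16-bit integers, adding the low bytes first and carrying into the sum of the high bytes:

#include <stdint.h>
#include <stdio.h>

/* Add two 16-bit values using only 8-bit additions, as an 8-bit CPU would:
   one ADD for the low bytes, one ADD-with-carry for the high bytes. */
static uint16_t add16_with_8bit_ops(uint16_t a, uint16_t b) {
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint16_t low  = (uint16_t)a_lo + b_lo;           /* 8-bit add        */
    uint8_t carry = (low > 0xFF) ? 1 : 0;            /* carry flag       */
    uint8_t high  = (uint8_t)(a_hi + b_hi + carry);  /* add-with-carry   */

    return (uint16_t)((high << 8) | (low & 0xFF));
}

int main(void) {
    printf("0x12FF + 0x0101 = 0x%04X\n", add16_with_8bit_ops(0x12FF, 0x0101));
    /* prints 0x1400, matching a single 16-bit addition */
    return 0;
}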
Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit
microprocessors. This trend generally came to an end with the introduction of 32-bit processors,
which has been a standard in general-purpose computing for two decades. Not until the early
2000s, with the advent of x86-64 architectures, did 64-bit processors become commonplace.
Instruction-level parallelism
Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism.
Task parallelism
Task parallelism is the characteristic of a parallel program that "entirely different calculations can
be performed on either the same or different sets of data".[35] This contrasts with data parallelism,
where the same calculation is performed on the same or different sets of data. Task parallelism
involves the decomposition of a task into sub-tasks and then allocating each sub-task to a
processor for execution. The processors would then execute these sub-tasks concurrently and often
cooperatively. Task parallelism does not usually scale with the size of a problem.[36]
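A brief OpenMP sketch of task parallelism (assuming an OpenMP-capable C compiler, e.g. invoked with -fopenmp): two entirely different calculations, here a sum and a minimum over the same hypothetical array, are handed to different threads via sections:

#include <stdio.h>

int main(void) {
    double data[1000];
    for (int i = 0; i < 1000; i++)
        data[i] = (i % 7) - 3.0;          /* arbitrary sample values */

    double sum = 0.0, min = data[0];

    /* Each section is a different calculation on the same data,
       executed by a different thread. */
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            for (int i = 0; i < 1000; i++)
                sum += data[i];
        }
        #pragma omp section
        {
            for (int i = 0; i < 1000; i++)
                if (data[i] < min)
                    min = data[i];
        }
    }
    printf("sum = %.1f, min = %.1f\n", sum, min);
    return 0;
}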
Superword level parallelism is a vectorization technique based on loop unrolling and basic block
vectorization. It is distinct from loop vectorization algorithms in that it can exploit parallelism of
inline code, such as manipulating coordinates, color channels or in loops unrolled by hand.[37]
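A sketch of the superword idea using x86 SSE intrinsics (assuming an SSE-capable compiler and CPU): four independent, hand-unrolled scalar additions on coordinate-like data are packed into a single 4-wide vector addition:

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};   /* e.g. x, y, z, w */
    float b[4] = {0.5f, 0.5f, 0.5f, 0.5f};
    float c[4];

    /* Scalar form an SLP pass would start from:
         c[0] = a[0] + b[0];  c[1] = a[1] + b[1];
         c[2] = a[2] + b[2];  c[3] = a[3] + b[3];
       Packed into one 4-wide operation: */
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));

    printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
    return 0;
}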
Hardware
Main memory in a parallel computer is either shared memory (shared between all processing
elements in a single address space), or distributed memory (in which each processing element has
its own local address space).[38] Distributed memory refers to the fact that the memory is logically
distributed, but often implies that it is physically distributed as well. Distributed shared memory
and memory virtualization combine the two approaches, where the processing element has its own
local memory and access to the memory on non-local processors. Accesses to local memory are
typically faster than accesses to non-local memory. On supercomputers, a distributed shared memory space can be implemented using a programming model such as PGAS. This model allows processes on one compute node to transparently access the remote memory of another compute node. All compute nodes are also connected to an external shared memory system via a high-speed interconnect such as InfiniBand; this external shared memory system is known as a burst buffer, which is typically built from arrays of non-volatile memory physically distributed across multiple I/O nodes.
Computer systems make use of caches—small and fast memories located close to the processor
which store temporary copies of memory values (nearby in both the physical and logical sense).
Parallel computer systems have difficulties with caches that may store the same value in more than
one location, with the possibility of incorrect program execution. These computers require a cache
coherency system, which keeps track of cached values and strategically purges them, thus ensuring
correct program execution. Bus snooping is one of the most common methods for keeping track of
which values are being accessed (and thus should be purged). Designing large, high-performance
cache coherence systems is a very difficult problem in computer architecture. As a result, shared
memory computer architectures do not scale as well as distributed memory systems do.[38]
Parallel computers based on interconnected networks need to have some kind of routing to enable
the passing of messages between nodes that are not directly connected. The medium used for
communication between the processors is likely to be hierarchical in large multiprocessor
machines.
Parallel computers can be roughly classified according to the level at which the hardware supports
parallelism. This classification is broadly analogous to the distance between basic computing
nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are
relatively common.
Multi-core computing
A multi-core processor is a processor that includes multiple processing units (called "cores") on
the same chip. This processor differs from a superscalar processor, which includes multiple
execution units and can issue multiple instructions per clock cycle from one instruction stream
(thread); in contrast, a multi-core processor can issue multiple instructions per clock cycle from
multiple instruction streams. IBM's Cell microprocessor, designed for use in the Sony PlayStation
3, is a prominent multi-core processor. Each core in a multi-core processor can potentially be
superscalar as well—that is, on every clock cycle, each core can issue multiple instructions from
one thread.
Simultaneous multithreading (of which Intel's Hyper-Threading is the best known) was an early
form of pseudo-multi-coreism. A processor capable of concurrent multithreading includes
multiple execution units in the same processing unit—that is, it has a superscalar architecture—and
can issue multiple instructions per clock cycle from multiple threads. Temporal multithreading on
the other hand includes a single execution unit in the same processing unit and can issue one
instruction at a time from multiple threads.
Symmetric multiprocessing
A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that
share memory and connect via a bus.[39] Bus contention prevents bus architectures from scaling.
As a result, SMPs generally do not comprise more than 32 processors.[40] Because of the small size
of the processors and the significant reduction in the requirements for bus bandwidth achieved by
large caches, such symmetric multiprocessors are extremely cost-effective, provided that a
sufficient amount of memory bandwidth exists.[39]
Distributed computing
Cluster computing

Massively parallel computing
A massively parallel processor (MPP) is a single computer with many networked processors. MPPs
have many of the same characteristics as clusters, but MPPs have specialized interconnect
networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger
than clusters, typically having "far more" than 100 processors.[47] In an MPP, "each CPU contains
its own memory and copy of the operating system and application. Each subsystem communicates
with the others via a high-speed interconnect."[48]
IBM's Blue Gene/L, the fifth fastest supercomputer in the world according to the June 2009
TOP500 ranking, is an MPP.
Grid computing
Reconfigurable computing with field-programmable gate arrays

FPGAs can be programmed with hardware description languages such as VHDL[50] or Verilog.[51]
Several vendors have created C to HDL languages that attempt to emulate the syntax and
semantics of the C programming language, with which most programmers are familiar. The best
known C to HDL languages are Mitrion-C, Impulse C, and Handel-C. Specific subsets of SystemC
based on C++ can also be used for this purpose.
AMD's decision to open its HyperTransport technology to third-party vendors has become the
enabling technology for high-performance reconfigurable computing.[52] According to Michael R.
D'Amour, Chief Operating Officer of DRC Computer Corporation, "when we first walked into
AMD, they called us 'the socket stealers.' Now they call us their partners."[52]
General-purpose computing on graphics processing units (GPGPU)

In the early days, GPGPU programs used the normal graphics APIs for executing programs.
However, several new programming languages and platforms have been built to do general
purpose computation on GPUs with both Nvidia and AMD releasing programming environments
with CUDA and Stream SDK respectively. Other GPU programming languages include BrookGPU,
PeakStream, and RapidMind. Nvidia has also released specific products for computation in their
Tesla series. The technology consortium Khronos Group has released the OpenCL specification,
which is a framework for writing programs that execute across platforms consisting of CPUs and
GPUs. AMD, Apple, Intel, Nvidia and others are supporting OpenCL.
Application-specific integrated circuits

Several application-specific integrated circuit (ASIC) approaches have been devised for dealing with parallel applications.[54][55][56]
Because an ASIC is (by definition) specific to a given application, it can be fully optimized for that
application. As a result, for a given application, an ASIC tends to outperform a general-purpose
computer. However, ASICs are created by UV photolithography. This process requires a mask set,
which can be extremely expensive. A mask set can cost over a million US dollars.[57] (The smaller
the transistors required for the chip, the more expensive the mask will be.) Meanwhile,
performance increases in general-purpose computing over time (as described by Moore's law) tend
to wipe out these gains in only one or two chip generations.[52] High initial cost, and the tendency
to be overtaken by Moore's-law-driven general-purpose computing, has rendered ASICs unfeasible
for most parallel computing applications. However, some have been built. One example is the
PFLOPS RIKEN MDGRAPE-3 machine which uses custom ASICs for molecular dynamics
simulation.
Vector processors
Software
Concurrent programming languages, libraries, APIs, and parallel programming models (such as
algorithmic skeletons) have been created for programming parallel computers. These can generally
be divided into classes based on the assumptions they make about the underlying memory
architecture—shared memory, distributed memory, or shared distributed memory. Shared
memory programming languages communicate by manipulating shared memory variables.
Distributed memory uses message passing. POSIX Threads and OpenMP are two of the most
widely used shared memory APIs, whereas Message Passing Interface (MPI) is the most widely
used message-passing system API.[59] One concept used in programming parallel programs is the
future concept, where one part of a program promises to deliver a required datum to another part
of a program at some future time.
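A minimal message-passing sketch in C using MPI (assuming an MPI installation; typically built with mpicc and launched with mpirun): each process computes a partial sum in its own local memory, and an explicit communication call combines the results, the distributed-memory counterpart of updating a shared variable:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process sums its own slice of 1..1000 in local memory. */
    long local = 0;
    for (long i = rank + 1; i <= 1000; i += size)
        local += i;

    /* Combine the partial results on rank 0 via message passing. */
    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %ld\n", total);   /* 500500 regardless of process count */

    MPI_Finalize();
    return 0;
}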
Efforts to standardize parallel programming include an open standard called OpenHMPP for
hybrid multi-core parallel programming. The OpenHMPP directive-based programming model
offers a syntax to efficiently offload computations on hardware accelerators and to optimize data
movement to/from the hardware memory using remote procedure calls.
The rise of consumer GPUs has led to support for compute kernels, either in graphics APIs
(referred to as compute shaders), in dedicated APIs (such as OpenCL), or in other language
extensions.
Automatic parallelization
Mainstream parallel programming languages remain either explicitly parallel or (at best) partially
implicit, in which a programmer gives the compiler directives for parallelization. A few fully
implicit parallel programming languages exist—SISAL, Parallel Haskell, SequenceL, SystemC (for FPGAs), Mitrion-C, VHDL, and Verilog.
Application checkpointing
As a computer system grows in complexity, the mean time between failures usually decreases.
Application checkpointing is a technique whereby the computer system takes a "snapshot" of the
application—a record of all current resource allocations and variable states, akin to a core dump—and
this information can be used to restore the program if the computer should fail. Application
checkpointing means that the program has to restart from only its last checkpoint rather than the
beginning. While checkpointing provides benefits in a variety of situations, it is especially useful in
highly parallel systems with a large number of processors used in high performance computing.[61]
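A toy illustration of the checkpoint/restart idea (the file name and the saved state are hypothetical; real HPC checkpointing libraries capture far more state): progress is periodically written to a file, and on startup the program resumes from the last saved state if one exists:

#include <stdio.h>

/* Minimal checkpoint: persist the loop counter and running total to disk. */
static void save_checkpoint(long i, double total) {
    FILE *f = fopen("checkpoint.dat", "w");
    if (f) {
        fprintf(f, "%ld %lf\n", i, total);
        fclose(f);
    }
}

/* Returns 1 and fills in the saved state if a checkpoint file exists. */
static int load_checkpoint(long *i, double *total) {
    FILE *f = fopen("checkpoint.dat", "r");
    if (!f)
        return 0;
    int ok = (fscanf(f, "%ld %lf", i, total) == 2);
    fclose(f);
    return ok;
}

int main(void) {
    long i = 0;
    double total = 0.0;
    if (load_checkpoint(&i, &total))
        printf("restarting from iteration %ld\n", i);

    for (; i < 100000000; i++) {
        if (i % 10000000 == 0)
            save_checkpoint(i, total);     /* periodic snapshot            */
        total += 1.0 / (i + 1);            /* stand-in for real work       */
    }
    printf("total = %f\n", total);
    return 0;
}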
Algorithmic methods
As parallel computers become larger and faster, we are now able to solve problems that had
previously taken too long to run. Fields as varied as bioinformatics (for protein folding and
sequence analysis) and economics (for mathematical finance) have taken advantage of parallel
computing. Common types of problems in parallel computing applications include:[62]
Fault tolerance
Parallel computing can also be applied to the design of fault-tolerant computer systems,
particularly via lockstep systems performing the same operation in parallel. This provides
redundancy in case one component fails, and also allows automatic error detection and error
correction if the results differ. These methods can be used to help prevent single-event upsets
caused by transient errors.[64] Although additional measures may be required in embedded or
specialized systems, this method can provide a cost-effective approach to achieve n-modular
redundancy in commercial off-the-shelf systems.
History
The origins of true (MIMD) parallelism go back to Luigi
Federico Menabrea and his Sketch of the Analytic Engine
Invented by Charles Babbage.[66][67][68]
In 1969, Honeywell introduced its first Multics system, a symmetric multiprocessor system
capable of running up to eight processors in parallel.[70] C.mmp, a multi-processor project at
Carnegie Mellon University in the 1970s, was among the first multiprocessors with more than a
few processors. The first bus-connected multiprocessor with snooping caches was the Synapse N+1
in 1984.[67]
SIMD parallel computers can be traced back to the 1970s. The motivation behind early SIMD
computers was to amortize the gate delay of the processor's control unit over multiple
instructions.[72] In 1964, Slotnick had proposed building a massively parallel computer for the
Lawrence Livermore National Laboratory.[70] His design was funded by the US Air Force, and the resulting machine, ILLIAC IV, became the earliest SIMD parallel-computing effort.[70] The key to its design was a fairly
high parallelism, with up to 256 processors, which allowed the machine to work on large datasets
in what would later be known as vector processing. However, ILLIAC IV was called "the most
infamous of supercomputers", because the project was only one-fourth completed, but took
11 years and cost almost four times the original estimate.[65] When it was finally ready to run its
first real application in 1976, it was outperformed by existing commercial supercomputers such as
the Cray-1.
In the early 1970s, at the MIT Computer Science and Artificial Intelligence Laboratory, Marvin
Minsky and Seymour Papert started developing the Society of Mind theory, which views the biological brain as a massively parallel computer. In 1986, Minsky published The Society of Mind,
which claims that "mind is formed from many little agents, each mindless by itself".[73] The theory
attempts to explain how what we call intelligence could be a product of the interaction of non-
intelligent parts. Minsky says that the biggest source of ideas about the theory came from his work
in trying to create a machine that uses a robotic arm, a video camera, and a computer to build with
children's blocks.[74]
Similar models (which also view the biological brain as a massively parallel computer, i.e., the
brain is made up of a constellation of independent or semi-independent agents) were also
described by:
Thomas R. Blakeslee,[75]
Michael S. Gazzaniga,[76][77]
Robert E. Ornstein,[78]
Ernest Hilgard,[79][80]
Michio Kaku,[81]
George Ivanovich Gurdjieff,[82]
Neurocluster Brain Model.[83]
See also
Computer multitasking
Concurrency (computer science)
Content Addressable Parallel Processor
List of distributed computing conferences
Manchester dataflow machine
Manycore
Parallel programming model
Serializability
Synchronous programming
Transputer
Vector processing
References
1. Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing (http://dl.acm.org/citation.c
fm?id=160438). Redwood City, Calif.: Benjamin/Cummings. ISBN 978-0-8053-0177-9.
2. S.V. Adve et al. (November 2008). "Parallel Computing Research at Illinois: The UPCRC
Agenda" (https://graphics.cs.illinois.edu/sites/default/files/upcrc-wp.pdf) Archived (https://web.a
rchive.org/web/20180111165735/https://graphics.cs.illinois.edu/sites/default/files/upcrc-wp.pdf)
2018-01-11 at the Wayback Machine (PDF). Parallel@Illinois, University of Illinois at Urbana-
Champaign. "The main techniques for these performance benefits—increased clock frequency
and smarter but increasingly complex architectures—are now hitting the so-called power wall.
The computer industry has accepted that future performance increases must largely come
from increasing the number of processors (or cores) on a die, rather than making a single core
go faster."
3. Asanovic et al. Old [conventional wisdom]: Power is free, but transistors are expensive. New
[conventional wisdom] is [that] power is expensive, but transistors are "free".
4. Asanovic, Krste et al. (December 18, 2006). "The Landscape of Parallel Computing Research:
A View from Berkeley" (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pd
f) (PDF). University of California, Berkeley. Technical Report No. UCB/EECS-2006-183. "Old
[conventional wisdom]: Increasing clock frequency is the primary method of improving
processor performance. New [conventional wisdom]: Increasing parallelism is the primary
method of improving processor performance… Even representatives from Intel, a company
generally associated with the 'higher clock-speed is better' position, warned that traditional
approaches to maximizing performance through maximizing clock speed have been pushed to
their limits."
5. "Concurrency is not Parallelism", Waza conference Jan 11, 2012, Rob Pike (slides (https://talk
s.golang.org/2012/waza.slide) Archived (https://web.archive.org/web/20150730203124/http://ta
lks.golang.org/2012/waza.slide) 2015-07-30 at the Wayback Machine) (video (https://vimeo.co
m/49718712))
6. "Parallelism vs. Concurrency" (https://wiki.haskell.org/Parallelism_vs._Concurrency). Haskell
Wiki.
7. Hennessy, John L.; Patterson, David A.; Larus, James R. (1999). Computer organization and
design: the hardware/software interface (https://archive.org/details/computerorganiz000henn)
(2. ed., 3rd print. ed.). San Francisco: Kaufmann. ISBN 978-1-55860-428-5.
8. Barney, Blaise. "Introduction to Parallel Computing" (http://www.llnl.gov/computing/tutorials/par
allel_comp/). Lawrence Livermore National Laboratory. Retrieved 2007-11-09.
9. Thomas Rauber; Gudula Rünger (2013). Parallel Programming: for Multicore and Cluster
Systems. Springer Science & Business Media. p. 1. ISBN 9783642378010.
10. Hennessy, John L.; Patterson, David A. (2002). Computer architecture / a quantitative
approach (3rd ed.). San Francisco, Calif.: International Thomson. p. 43. ISBN 978-1-55860-
724-8.
11. Rabaey, Jan M. (1996). Digital integrated circuits : a design perspective. Upper Saddle River,
N.J.: Prentice-Hall. p. 235. ISBN 978-0-13-178609-7.
12. Flynn, Laurie J. (8 May 2004). "Intel Halts Development Of 2 New Microprocessors" (https://w
ww.nytimes.com/2004/05/08/business/08chip.html?ex=1399348800&en=98cc44ca97b1a562&
ei=5007). New York Times. Retrieved 5 June 2012.
13. Thomas Rauber; Gudula Rünger (2013). Parallel Programming: for Multicore and Cluster
Systems. Springer Science & Business Media. p. 2. ISBN 9783642378010.
14. Thomas Rauber; Gudula Rünger (2013). Parallel Programming: for Multicore and Cluster
Systems. Springer Science & Business Media. p. 3. ISBN 9783642378010.
15. Amdahl, Gene M. (1967). "Validity of the single processor approach to achieving large scale
computing capabilities" (http://dl.acm.org/citation.cfm?id=160438). Proceeding AFIPS '67
(Spring) Proceedings of the April 18–20, 1967, Spring Joint Computer Conference: 483–485.
doi:10.1145/1465482.1465560 (https://doi.org/10.1145%2F1465482.1465560).
ISBN 9780805301779.
16. Brooks, Frederick P. (1996). The mythical man month essays on software engineering (https://
archive.org/details/mythicalmonth00broo) (Anniversary ed., repr. with corr., 5. [Dr.] ed.).
Reading, Mass. [u.a.]: Addison-Wesley. ISBN 978-0-201-83595-3.
17. Michael McCool; James Reinders; Arch Robison (2013). Structured Parallel Programming:
Patterns for Efficient Computation. Elsevier. p. 61.
18. Gustafson, John L. (May 1988). "Reevaluating Amdahl's law" (https://web.archive.org/web/200
70927040654/http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html).
Communications of the ACM. 31 (5): 532–533. CiteSeerX 10.1.1.509.6892 (https://citeseerx.is
t.psu.edu/viewdoc/summary?doi=10.1.1.509.6892). doi:10.1145/42411.42415 (https://doi.org/1
0.1145%2F42411.42415). S2CID 33937392 (https://api.semanticscholar.org/CorpusID:339373
92). Archived from the original (http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Am
dahls.html) on 2007-09-27.
19. Bernstein, A. J. (1 October 1966). "Analysis of Programs for Parallel Processing". IEEE
Transactions on Electronic Computers. EC-15 (5): 757–763. doi:10.1109/PGEC.1966.264565
(https://doi.org/10.1109%2FPGEC.1966.264565).
20. Roosta, Seyed H. (2000). Parallel processing and parallel algorithms : theory and computation.
New York, NY [u.a.]: Springer. p. 114. ISBN 978-0-387-98716-3.
21. "Processes and Threads" (https://msdn.microsoft.com/en-us/library/windows/desktop/ms68484
1(v=vs.85).aspx). Microsoft Developer Network. Microsoft Corp. 2018. Retrieved 2018-05-10.
22. Krauss, Kirk J (2018). "Thread Safety for Performance" (https://web.archive.org/web/20180513
081315/http://www.developforperformance.com/ThreadSafetyForPerformance.html). Develop
for Performance. Archived from the original (http://www.developforperformance.com/ThreadSaf
etyForPerformance.html) on 2018-05-13. Retrieved 2018-05-10.
23. Tanenbaum, Andrew S. (2002-02-01). Introduction to Operating System Deadlocks (http://ww
w.informit.com/articles/article.aspx?p=25193). Informit. Pearson Education, Informit. Retrieved
2018-05-10.
24. Cecil, David (2015-11-03). "Synchronization internals – the semaphore" (https://www.embedde
d.com/design/operating-systems/4440752/Synchronization-internals----the-semaphore).
Embedded. AspenCore. Retrieved 2018-05-10.
25. Preshing, Jeff (2012-06-08). "An Introduction to Lock-Free Programming" (http://preshing.com/
20120612/an-introduction-to-lock-free-programming/). Preshing on Programming. Retrieved
2018-05-10.
26. "What's the opposite of "embarrassingly parallel"?" (https://stackoverflow.com/questions/80656
9/whats-the-opposite-of-embarrassingly-parallel). StackOverflow. Retrieved 2018-05-10.
27. Schwartz, David (2011-08-15). "What is thread contention?" (https://stackoverflow.com/questio
ns/1970345/what-is-thread-contention). StackOverflow. Retrieved 2018-05-10.
28. Kukanov, Alexey (2008-03-04). "Why a simple test can get parallel slowdown" (https://software.
intel.com/en-us/blogs/2008/03/04/why-a-simple-test-can-get-parallel-slowdown). Retrieved
2015-02-15.
29. Krauss, Kirk J (2018). "Threading for Performance" (https://web.archive.org/web/20180513081
501/http://www.developforperformance.com/ThreadingForPerformance.html). Develop for
Performance. Archived from the original (http://www.developforperformance.com/ThreadingFor
Performance.html) on 2018-05-13. Retrieved 2018-05-10.
30. Flynn, Michael J. (September 1972). "Some Computer Organizations and Their Effectiveness"
(https://www.cs.utah.edu/~hari/teaching/paralg/Flynn72.pdf) (PDF). IEEE Transactions on
Computers. C-21 (9): 948–960. doi:10.1109/TC.1972.5009071 (https://doi.org/10.1109%2FTC.
1972.5009071).
31. Patterson and Hennessy, p. 748.
32. Culler, David; Singh, J. P. (1997). Parallel computer architecture ([Nachdr.] ed.). San Francisco: Morgan Kaufmann Publ. p. 15. ISBN 978-1-55860-343-1.
33. Culler et al. p. 15.
34. Patt, Yale (April 2004). "The Microprocessor Ten Years From Now: What Are The Challenges,
How Do We Meet Them? (http://users.ece.utexas.edu/~patt/Videos/talk_videos/cmu_04-29-04.
wmv) Archived (https://web.archive.org/web/20080414141000/http://users.ece.utexas.edu/~pat
t/Videos/talk_videos/cmu_04-29-04.wmv) 2008-04-14 at the Wayback Machine (wmv).
Distinguished Lecturer talk at Carnegie Mellon University. Retrieved on November 7, 2007.
35. Culler et al. p. 124.
36. Culler et al. p. 125.
37. Samuel Larsen; Saman Amarasinghe. "Exploiting Superword Level Parallelism with Multimedia
Instruction Sets" (http://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf) (PDF).
38. Patterson and Hennessy, p. 713.
39. Hennessy and Patterson, p. 549.
40. Patterson and Hennessy, p. 714.
56. Acken, Kevin P.; Irwin, Mary Jane; Owens, Robert M. (July 1998). "A Parallel ASIC Architecture
for Efficient Fractal Image Coding". The Journal of VLSI Signal Processing. 19 (2): 97–113.
doi:10.1023/A:1008005616596 (https://doi.org/10.1023%2FA%3A1008005616596).
S2CID 2976028 (https://api.semanticscholar.org/CorpusID:2976028).
57. Kahng, Andrew B. (June 21, 2004) "Scoping the Problem of DFM in the Semiconductor
Industry (http://www.future-fab.com/documents.asp?grID=353&d_ID=2596) Archived (https://w
eb.archive.org/web/20080131221732/http://www.future-fab.com/documents.asp?grID=353&d_I
D=2596) 2008-01-31 at the Wayback Machine." University of California, San Diego. "Future
design for manufacturing (DFM) technology must reduce design [non-recoverable expenditure]
cost and directly address manufacturing [non-recoverable expenditures]—the cost of a mask
set and probe card—which is well over $1 million at the 90 nm technology node and creates a
significant damper on semiconductor-based innovation."
58. Patterson and Hennessy, p. 751.
59. The Sidney Fernbach Award given to MPI inventor Bill Gropp (http://awards.computer.org/ana/
award/viewPastRecipients.action?id=16) Archived (https://web.archive.org/web/201107251911
03/http://awards.computer.org/ana/award/viewPastRecipients.action?id=16) 2011-07-25 at the
Wayback Machine refers to MPI as "the dominant HPC communications interface"
60. Shen, John Paul; Mikko H. Lipasti (2004). Modern processor design : fundamentals of
superscalar processors (1st ed.). Dubuque, Iowa: McGraw-Hill. p. 561. ISBN 978-0-07-
057064-1. "However, the holy grail of such research—automated parallelization of serial
programs—has yet to materialize. While automated parallelization of certain classes of
algorithms has been demonstrated, such success has largely been limited to scientific and
numeric applications with predictable flow control (e.g., nested loop structures with statically
determined iteration counts) and statically analyzable memory access patterns. (e.g., walks
over large multidimensional arrays of float-point data)."
61. Encyclopedia of Parallel Computing, Volume 4 by David Padua 2011 ISBN 0387097651 page
265
62. Asanovic, Krste, et al. (December 18, 2006). "The Landscape of Parallel Computing Research:
A View from Berkeley" (http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.pd
f) (PDF). University of California, Berkeley. Technical Report No. UCB/EECS-2006-183. See
table on pages 17–19.
63. David R., Helman; David A., Bader; JaJa, Joseph (1998). "A Randomized Parallel Sorting
Algorithm with an Experimental Study" (http://www.cc.gatech.edu/~bader/papers/JPDC-98146
2.pdf) (PDF). Journal of Parallel and Distributed Computing. 52: 1–23. Retrieved 26 October
2012.
64. Dobel, B., Hartig, H., & Engel, M. (2012) "Operating system support for redundant
multithreading". Proceedings of the Tenth ACM International Conference on Embedded
Software, 83–92. doi:10.1145/2380356.2380375 (https://doi.org/10.1145%2F2380356.238037
5)
65. Patterson and Hennessy, pp. 749–50: "Although successful in pushing several technologies
useful in later projects, the ILLIAC IV failed as a computer. Costs escalated from the $8 million
estimated in 1966 to $31 million by 1972, despite the construction of only a quarter of the
planned machine . It was perhaps the most infamous of supercomputers. The project started in
1965 and ran its first real application in 1976."
66. Menabrea, L. F. (1842). Sketch of the Analytic Engine Invented by Charles Babbage (http://ww
w.fourmilab.ch/babbage/sketch.html). Bibliothèque Universelle de Genève. Retrieved on
November 7, 2007. quote: "when a long series of identical computations is to be performed,
such as those required for the formation of numerical tables, the machine can be brought into
play so as to give several results at the same time, which will greatly abridge the whole
amount of the processes."
67. Patterson and Hennessy, p. 753.
68. R.W. Hockney, C.R. Jesshope. Parallel Computers 2: Architecture, Programming and
Algorithms, Volume 2 (https://books.google.com/books?id=6HcBQ67-Fb4C). 1988. p. 8 quote:
"The earliest reference to parallelism in computer design is thought to be in General L. F.
Menabrea's publication in… 1842, entitled Sketch of the Analytical Engine Invented by Charles
Babbage".
69. "Parallel Programming", S. Gill, The Computer Journal Vol. 1 #1, pp2-10, British Computer
Society, April 1958.
70. Wilson, Gregory V. (1994). "The History of the Development of Parallel Computing" (http://ei.c
s.vt.edu/~history/Parallel.html). Virginia Tech/Norfolk State University, Interactive Learning with
a Digital Library in Computer Science. Retrieved 2008-01-08.
71. Anthes, Gary (November 19, 2001). "The Power of Parallelism" (https://web.archive.org/web/20
080131205427/http://www.computerworld.com/action/article.do?command=viewArticleBasic&a
rticleId=65878). Computerworld. Archived from the original (http://www.computerworld.com/acti
on/article.do?command=viewArticleBasic&articleId=65878) on January 31, 2008. Retrieved
2008-01-08.
72. Patterson and Hennessy, p. 749.
73. Minsky, Marvin (1986). The Society of Mind (https://archive.org/details/societyofmind00marv/p
age/17). New York: Simon & Schuster. pp. 17 (https://archive.org/details/societyofmind00marv/
page/17). ISBN 978-0-671-60740-1.
74. Minsky, Marvin (1986). The Society of Mind (https://archive.org/details/societyofmind00marv/p
age/29). New York: Simon & Schuster. pp. 29 (https://archive.org/details/societyofmind00marv/
page/29). ISBN 978-0-671-60740-1.
75. Blakeslee, Thomas (1996). Beyond the Conscious Mind. Unlocking the Secrets of the Self (htt
ps://archive.org/details/beyondconsciousm00blak). pp. 6–7 (https://archive.org/details/beyondc
onsciousm00blak/page/6). ISBN 9780306452628.
76. Gazzaniga, Michael; LeDoux, Joseph (1978). The Integrated Mind. pp. 132–161.
77. Gazzaniga, Michael (1985). The Social Brain. Discovering the Networks of the Mind (https://arc
hive.org/details/socialbraindisco0000gazz). pp. 77–79 (https://archive.org/details/socialbraindis
co0000gazz/page/77). ISBN 9780465078509.
78. Ornstein, Robert (1992). Evolution of Consciousness: The Origins of the Way We Think (http
s://archive.org/details/evolutionofconsc0000orns). pp. 2 (https://archive.org/details/evolutionofc
onsc0000orns/page/2).
79. Hilgard, Ernest (1977). Divided consciousness: multiple controls in human thought and action.
New York: Wiley. ISBN 978-0-471-39602-4.
80. Hilgard, Ernest (1986). Divided consciousness: multiple controls in human thought and action
(expanded edition). New York: Wiley. ISBN 978-0-471-80572-4.
81. Kaku, Michio (2014). The Future of the Mind.
82. Ouspenskii, Pyotr (1992). "Chapter 3". In Search of the Miraculous. Fragments of an Unknown
Teaching. pp. 72–83.
83. "Official Neurocluster Brain Model site" (http://neuroclusterbrain.com). Retrieved July 22, 2017.
Further reading
Rodriguez, C.; Villagra, M.; Baran, B. (29 August 2008). "Asynchronous team algorithms for
Boolean Satisfiability". Bio-Inspired Models of Network, Information and Computing Systems,
2007. Bionetics 2007. 2nd: 66–69. doi:10.1109/BIMNICS.2007.4610083 (https://doi.org/10.110
9%2FBIMNICS.2007.4610083). S2CID 15185219 (https://api.semanticscholar.org/CorpusID:1
5185219).
Sechin, A.; Parallel Computing in Photogrammetry. GIM International. #1, 2016, pp. 21–23.
External links
Lawrence Livermore National Laboratory: Introduction to Parallel Computing (http://www.llnl.go
v/computing/tutorials/parallel_comp/)
Designing and Building Parallel Programs, by Ian Foster (http://www-unix.mcs.anl.gov/dbpp/)
Internet Parallel Computing Archive (https://web.archive.org/web/20021012122919/http://wotu
g.ukc.ac.uk/parallel/)