The Parallel Universe
Issue 5, November 2010

Letter from the Editor
by James Reinders

Intel® Parallel Building Blocks: The Answer(s) to Cracking the Parallelism Puzzle
by David Sekowski
CONTENTS

Letter from the Editor: High Performance Options Have Never Been Greater
BY JAMES REINDERS
James Reinders focuses on the latest Intel® software developer tools designed to tap into the performance offered by today's computers.
© 2010, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, and Intel VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
High Performance Options Have Never Been Greater

This issue of The Parallel Universe magazine focuses on the latest Intel® software developer tools that help us tap into the performance offered by today's computers.
On November 9 we introduced the latest installments in our most popular software developer tools. These numerous tools come together to form two comprehensive studios: Intel® Parallel Studio XE and Intel® Cluster Studio. Intel Parallel Studio XE addresses the advanced performance challenges of today's machines for C, C++, and Fortran developers. Intel Cluster Studio offers tools uniquely tailored for distributed computing, specifically helping with programs using MPI, C, C++, and Fortran.

There are a lot of new features to be excited about, including version 12.0 Intel® compilers, a new version of the Intel® VTune™ Performance Analyzer, new concurrency-compatible memory checking capabilities, code analysis for security and robustness, advanced MPI support, Intel® Parallel Building Blocks for C/C++, co-array Fortran support, and new threading error detection that handles not only compiled languages, but also .NET code.

Intel Parallel Studio XE is available for Linux* and Windows* developers. It includes Intel® Composer XE (compilers and libraries), Intel® VTune™ Amplifier XE, and Intel® Inspector XE.

Intel Cluster Studio is available for Linux and Windows developers. It includes Intel Composer XE, the Intel® MPI Library and MPI Benchmarks, and the Intel® Trace Analyzer and Collector.

Our tools offer a rich selection of parallel programming methods to meet the numerous needs of different applications. They have no equal in providing robust ways to express parallelism: OpenMP*, MPI, co-array Fortran, Intel® Math Kernel Library (Intel® MKL), Intel® Integrated Performance Primitives (Intel® IPP), and Intel® Parallel Building Blocks (Intel® PBB), which includes Intel® Threading Building Blocks (Intel® TBB), Intel® Cilk™ Plus, and Intel® Array Building Blocks (Intel® ArBB).

To explore the advantages of these innovative tools, you'll find articles on Intel Parallel Building Blocks and the parts that make it up, "what's new and exciting in Fortran after all these years," and Intel MKL.

Eliminating defects is an important topic that gets attention in our tools as well, and in a way that's easy to utilize in your build environment. Our tools offer solutions for code quality, security, and application robustness, all applicable to parallel programs. The next-generation correctness analyzers combine memory, threading, and code analysis for security. The article "Intel® Inspector XE: An essential tool during development along with Intel® Composer XE" advocates this tool as an essential and regular part of your development cycle. The case for it seems clear.

Performance profiling is essential for high performance: detecting hotspots, and helping you alleviate them with additional insight into what is actually happening on your system. Our next-generation profiler, Intel VTune Amplifier XE, provides easy-to-use, yet detailed, insight into the most pressing performance issues.

For the cluster developer, we have our highly scalable implementation of MPI, in the Intel MPI Library, architected to scale to the largest systems. The article "On a path to petascale with commodity clusters and Intel MPI" highlights Intel MPI Library advancements in our cluster tools for HPC.

Continuous development of software for high performance is a complex undertaking. Intel® software development tools work with existing and emerging Intel architecture, extending Intel leadership in processor technology, and in multicore and manycore processors. As customers, you look for predictability in your software development, and an assurance that the software investments you make today will continue to reap benefits in years to come. Our mission in the software tools group is to simplify the tools, and the way you purchase, install, develop, and support them.

Our Beta customers said very nice things about our new tools prior to release. I believe you will find Intel Parallel Studio XE and Intel Cluster Studio taking significant strides in advancing the innovation bar for programming, productivity, and programmability for high performance.
Enjoy!
JAMES REINDERS
Portland, Oregon
November 2010
Simplifying High Performance with Intel® Parallel Studio XE and Intel® Cluster Studio Tool Suites
By Sanjay Goil, Product Marketing Manager, and John McHugh, Marketing Communication Manager, Intel® Software Development Products
In September, Intel introduced Intel® Parallel Studio 2011, a tool suite for Microsoft* Windows* Visual Studio* C++ developers, with the singular objective of providing the essential performance tools for application development on Intel® Architecture. These tools provide significant innovation, and enable unprecedented developer productivity when building, debugging, and tuning parallel applications for multicore. With the introduction of Intel® Parallel Building Blocks (Intel® PBB), developers have methods to introduce and extend parallelism in C/C++ applications for higher performance and efficiencies.

This month Intel is extending the reach of next-generation Intel tools to developers of applications on both Windows and Linux in C/C++ and Fortran who need advanced performance for multicore today and forward scaling to manycore. Intel Parallel Studio XE 2011 contains C/C++ and Fortran compilers; the Intel® Math Kernel Library (Intel® MKL) and Intel® Integrated Performance Primitives (Intel® IPP) performance libraries; the Intel PBB libraries, Intel® Threading Building Blocks (Intel® TBB), Intel® Cilk™ Plus, and Intel® Array Building Blocks (Intel® ArBB); the Intel® Inspector XE correctness analyzer; and the Intel® VTune™ Amplifier XE performance profiler.

Introducing New Tool Suites
Software developers of high performance applications require a complete set of development tools. While traditionally these tools include compilers, debuggers, and performance and parallel libraries, more often the issues in development come in error correctness and performance profiling. The code doesn't run correctly, or exhibits error-prone behavior on some runs, pointing to data races, deadlocks, or performance bottlenecks in locks for synchronization, or exposes security risks at runtime. To this end, Intel's correctness analyzers and performance profilers are a great addition to the development environment for highly robust and secure code development (Figure 1).

Figure 1

For advanced and distributed performance, Intel is simplifying the procurement, deployment, and use of HPC tools on IA-32 and Intel® Architecture and compatible platforms, and HPC clusters programmed with the Message Passing Interface (MPI). Figure 2 shows this as a cycle: Architectural Analysis (assist), Add Parallelism (code), Optimize/Tune (performance), and Quality & Robustness (correctness).

Boost Performance. Code Reliably. Scale Forward.
HPC programmers have traditionally been able to use all the compute power made available to them. Even with the performance leaps that Moore's Law has allowed Intel architecture to deliver over the past decade, the hunger for additional performance continues to thrive. There are big unsolved problems in science and engineering, physical simulations at higher granularities, and problems where the economically viable compute power provides lower resolution or piecemeal simulation of smaller portions of the larger problem. This is what makes serving the HPC market so exciting for Intel, and it is a significant driver for innovation in both hardware and software methodologies for parallelism and performance.

Intel® Cluster Studio introduces tools for HPC cluster development with MPI, including the scalable Intel® MPI Library and the Intel® Trace Analyzer and Collector performance profiler, with the industry-leading C/C++ and Fortran compilers for a complete cluster development tool suite. This is combined with the ease of deployment offered by the Intel® Cluster Ready program, making deployment of cluster applications highly efficient.

Advanced correctness: Intel® Inspector XE, a memory and threading error checking tool for higher code reliability and quality, increases productivity and lowers cost by catching memory and threading defects early.
Figure 2. A software development project goes through several steps to get optimal performance on the target platform. Most often, the developer gets a rudimentary performance profile of the application run to show hotspots. Once opportunities for optimization are identified, the coding aspects are handled by the compilers and performance and parallel libraries to add parallelism, presenting task-level, data-level, and vectorization opportunities. Finally, the correctness tools make robust code possible by checking for threading and memory errors, and identifying security vulnerabilities. This cycle typically repeats itself to find higher application efficiencies.
The tools introduced in Intel Parallel Studio XE 2011 are next-generation revisions of industry-leading tools for C/C++ and Fortran developers seeking cross-platform capabilities for the latest x86 processors on Windows* and Linux* platforms. Those familiar with Intel's industry-leading tools will see that some product names have transitioned in this new release, in all cases with significant additional capabilities; other names remain the same. (Figure 3)

Figure 3
Jorge Martinis, Research and Development Engineer, BR&E Inc.
Figure 4
Figure 5

Emmanuel Weber, Software Architect, BlueJeans Network
Figure 6

Alex Migdalski, CEO and CTO, OTRADA Inc.
Figure 7
Figure 9

Intel VTune Amplifier XE 2011 is the next generation of the Intel VTune Performance Analyzer, a powerful tool for quickly finding multicore performance bottlenecks and providing greater insight into them. It removes the guesswork and analyzes performance behavior in Windows* and Linux* applications, providing quick access to scalability bottlenecks for faster and improved decision making. (Figures 8 and 9)
Figure 10
Software security starts very early in the development phase, and Intel Parallel Studio XE 2011 makes it faster to identify, locate, and fix software issues prior to software deployment. This helps identify and prevent critical software security vulnerabilities early in the development cycle, where the cost of finding and fixing errors is lowest. (Figures 10 and 11)
>> Easier, faster setup and ramp to get static analysis results
>> Simple approach to configure and run static analysis
>> Discovers and fixes defects at any phase of the development cycle
>> Finds more than 250 security errors, such as:
• Buffer overruns and uninitialized variables
• Unsafe library usage and arithmetic overflow
• Unchecked input and heap corruption
>> Tracks state associated with issues, even as source evolves and line numbers change
>> Displays problem sets and location of source
>> Provides filters, assignment of priority, and maintenance of problem set state
>> Intuitive standalone GUI and command line interface for Windows and Linux
Feature | Benefit
Support for both Linux* and Windows* platforms | Development capability with the same set of tools on both Windows* and Linux* platforms; enhanced performance, productivity, and programmability
C/C++ compilers with Intel® Parallel Building Blocks | Breakthrough in providing choice of parallelism for applications (task, data, vector), with mix and match for optimizing application performance; C/C++ standards support
Fortran compilers with key Fortran 2008 standards support, including Co-Array Fortran (CAF) | Advances in the industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters (cluster support available separately with Intel® Cluster Studio 2011); Fortran standards support
Memory, threading, and security analysis tools in one package | Enhances developer productivity and efficiencies by simplifying and speeding the process of detecting difficult-to-find coding errors
Updated performance libraries | Multicore performance for common math and data processing tasks, with simple linking to these automatically parallel libraries
Updated performance profiler | Several ease-of-use enhancements, deeper microarchitectural insights, enhanced GUI, and quicker, more robust performance
Figure 11
Figure 12
>> Visualize and understand parallel application behavior
>> Evaluate profiling statistics and load balancing
>> Analyze performance of subroutines or code blocks
>> Identify communication hotspots

>> Scalability and High Performance: The interconnect-tuned and multicore-optimized Intel® MPI Library delivers application performance on thousands of Intel Architecture and compatible multicore processors.
>> Built-in Optimization: Utilize optimizing compilers and libraries in Intel® Composer XE to get the most out of advanced processor technologies. The C/C++ optimizing compiler now includes Intel PBB, which expands the types of problems that can be solved more easily in parallel, and with increased reliability. For Fortran developers, it now offers Co-Array Fortran (CAF) and additional support for the Fortran 2008 standard. Intel® compilers also deliver advanced vectorization support with SIMD pragmas.
>> Ease of MPI Tuning: Intel® Trace Analyzer and Collector has been enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications.
>> Target Applications to Multiple Operating Systems: Leverage the same source code in Intel® compilers and libraries, which bring advanced optimizations to Windows and Linux.
>> Intel® Cluster Ready Qualified: This program defines cluster architectures to increase uptime and productivity and reduce total cost of ownership (TCO) for IA-based HPC clusters.
>> Compatibility and Support: Intel Cluster Studio offers excellent compatibility with leading development environments and compilers, while providing optimal support for multiple generations of Intel processors and compatibles. Intel offers broad support through its forums and Intel® Premier Support, which provides fast answers and covers all software updates for one year.
Feature | Benefit
Analysis tools for MPI developers; diagram; ideal interconnect simulator | Enhanced developer productivity and efficiencies by simplifying and speeding the detection of load imbalance errors and offering performance profiling of MPI messages
Scalable Intel MPI Library with multirail IB support and Application Tuner | Scale to tens of thousands of cores with one of the most scalable and robust commercial MPI libraries in the industry; ease of use with dynamic and configurable support across multiple cluster fabrics and multi-rail IB support
C/C++ compilers with Intel® Parallel Building Blocks | Breakthrough in providing choice of parallelism for applications (process, task, data, vector), with mix and match for optimizing application performance on clusters of SMP nodes; C/C++ standards support
Fortran compilers with key Fortran 2008 standards support, including co-array Fortran (CAF) on clusters (available on Linux now and Windows later) | Advances in the industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters; Fortran standards supported include key features in Fortran 2008 and more complete Fortran 2003 support
Updated performance libraries, Intel® MKL and Intel IPP | Multicore performance for common math and data processing tasks, with simple linking to these automatically parallel libraries
Support for both Linux* and Windows* platforms | Development capability with the same set of tools on both Windows and Linux platforms for enhanced performance, productivity, and programmability
Figure 13
Summary
With the introduction of Intel Parallel Studio XE and Intel Cluster Studio, Intel is extending the
reach of the next-generation Intel tools to Windows and Linux C/C++ and Fortran developers
needing advanced performance for multicore today and forward scaling to manycore.
The Intel Parallel Studio XE 2011 bundle contains the latest versions of the Intel C/C++ and Fortran compilers, the Intel MKL and Intel IPP performance libraries, the Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus), the Intel Inspector XE correctness analyzer, and the Intel VTune Amplifier XE performance profiler.
The Intel Cluster Studio 2011 bundle contains the latest versions of the Intel MPI Library, Intel Trace Analyzer and Collector, Intel C/C++ and Fortran compilers, Intel MKL and Intel IPP performance libraries, and Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus).
Motivation
As microprocessors transition from clock speed as the primary vehicle for performance gains to features such as multiple cores, it is increasingly important for developers to optimize their applications to take advantage of these platform capabilities. In the past, the same application would automatically perform better on a newer CPU due to increasingly higher clock speeds. However, when customers buy computers with the latest CPUs, they may not see a corresponding performance increase in applications that are written in serial or designed to take advantage of only one processing element. Therefore, hardware designers have to find other ways to deliver superior application performance, and they're turning to an old standby from high performance computing: parallel hardware platforms. This represents a new challenge for most software developers who haven't had much experience with writing parallel software.

Since Amdahl's Law was originally coined in 1967, high-performance computing experts have known a thing or two about the answer to this problem. Namely, that by putting more processors on the job we can reduce total application runtime through software parallelism. With the addition of Gustafson's Law in 1988, the upper limits on scalability implied by Amdahl's Law were effectively removed. Taken together, these principles opened an alternative path to improving software performance in mainstream applications without increasing clock speed. Welcome to the era of multicore processors.
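In symbols (a standard formulation, not from the original article): if a fraction p of a program's runtime can be parallelized across N processing elements, Amdahl's Law bounds the speedup at

$$S_{\text{Amdahl}}(N) = \frac{1}{(1-p) + p/N},$$

which approaches $1/(1-p)$ no matter how large N grows. Gustafson's Law observes that problem sizes grow with the machine, so for a scaled workload the speedup is

$$S_{\text{Gustafson}}(N) = (1-p) + pN,$$

which keeps increasing with N. For example, with p = 0.95, Amdahl caps the speedup at 20x, while Gustafson's scaled speedup on 64 processing elements is roughly 60.8x.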
Today and in the future, cutting-edge applications will turn to parallelism to harness the profound power of dual-, quad-, and even-more-core processors found in most common mainstream computers. In his book Only the Paranoid Survive, Andy Grove talks about how strategic inflection points happen in business when a new technology has the ability to improve performance by an order of magnitude. It is at these inflection points that there is a possibility to revolutionize instead of "evolutionize" an industry. Multicore, and soon manycore, processors represent just such an opportunity to mainstream software application developers, if they can harness the added performance and functionality potential.

Figure 1. The hardware hierarchy, from single-node systems down to a single CPU or core. Single-cluster nodes, desktops/workstations, and laptops/netbooks: motherboards are single-, dual-, or quad-socketed; CPUs are single-, multi-, or manycore; shared memory structures. Single CPU or core: cores contain both scalar processors and vector arithmetic units; scalar processors handle single, often complex, operations on single data items; vector arithmetic units handle single, often simple, operations on multiple data items; shared cache structures.
Distributed and Shared Memory Systems
It is useful when talking about software parallelism to begin by talking about hardware platforms from the highest to the lowest performance (Figure 1). This is because we can apply lessons learned from the high end to the mainstream. High-performance computing uses massively parallel hardware and software platforms to solve some of the world's largest problems, from climate change to decoding the human genome. These systems range from grid computers using idle cycles on widely dispersed systems over the Internet to clusters using message passing to communicate across various nodes located relatively close to each other (e.g., using the Intel® Message Passing Interface library (Intel® MPI) to synchronize information across cluster nodes). Both of these types of computing systems utilize distributed memory, as compared to shared memory like that found in a single node within a distributed system, a workstation, a netbook, etc.

With distributed-memory systems, there is a need for high-level coordination across nodes as well as parallelism within nodes. Normally an explicit message-passing model is used to coordinate multiple nodes. As those nodes have become more powerful, with multiple processors, each with multiple cores, multiple scalar processors, and vector arithmetic units, there has been a need to mix message passing across nodes with parallel software within each node.

Task Parallelism, Data Parallelism, and Vectorization
The widely accepted industry terms for node-level parallelism can be quite confusing, and often have different meanings depending on a variety of factors. For the sake of discussing how Intel is providing solutions for software parallelism, this article will define three key types of parallelism (Figure 2).

Task parallelism is the highest level of software parallelism. Tasking is generally needed for problems with irregular control structures that operate on irregular data sets. It allows a developer to break their application into logically distinct pieces, such as the render pipeline, AI, physics, and network I/O modules in a game engine. Each of these logical elements is assigned to a task or group of tasks to be completed concurrently. Other, more general forms of task parallelism patterns include message passing, tasking, eventing, and pipelining. However, these types of parallel code generally do not scale well with additional processing elements, since the number of logical parts of an application rarely grows over time, such that it is not possible to exploit Gustafson's Law. Nevertheless, task parallelism is still a vital first step to creating scalable software by reducing the serial portions of code in a given application.

Data parallelism is complementary to task parallelism. In fact, some of the most well-known and successful solutions for data parallel patterns are implemented using task parallel programming models. Data parallelism uses algorithms with regular control structures to operate on concurrent containers and other regular data structures. Much of the potential scalability of today's applications exists in such regular forms; examples include encoding audio or video. In addition, it is often easier to begin parallelizing a serial application by taking a serial control flow construct, such as a "for" loop, and turning it into a parallel for loop (see the sketch after this section), before investigating the benefits that can be offered by rearranging the logical elements of the application with task parallelism. Examples of data parallelism patterns and algorithms include parallel "loops" (also known as "maps"), sorts, reductions, and scans.

Vectorization: Both task and data parallelism are ways a developer can spread application work across multiple processing elements like multiple cores or processors. Vectorization is a subset of data parallelism that allows you to take advantage of the vector arithmetic units within each core.

Figure 2. Diagram contrasting data parallelism (shared, regular structures: loops, sorts, reductions) with irregular structures (trees, graphs, lists).
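To make the "for loop to parallel for loop" step concrete, here is a minimal sketch using Intel TBB's parallel_for, one of the Intel PBB models discussed in this article. The array size and the scaling operation are arbitrary illustrations, not from the original text.

#include <vector>
#include "tbb/parallel_for.h"
#include "tbb/blocked_range.h"

int main() {
    std::vector<float> data(1000000, 1.0f);

    // Serial version:
    //   for (size_t i = 0; i < data.size(); ++i) data[i] *= 2.0f;
    //
    // Data-parallel version: TBB splits the index range into chunks
    // and runs the loop body concurrently on worker threads.
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, data.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0f;  // same body as the serial loop
        });
    return 0;
}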
There are a number of ways that computer science researchers have …

Compilers & Libraries: can be published as standards; generally adhere to standards compliance.

Figure 4. Selection criteria for the Intel PBB models:
>> Abstract: models must reduce the need to write parallelism using OS-dependent threads and synchronization primitives.
>> Supported: models must work with Intel tools like Parallel Amplifier for performance tuning and Parallel Inspector for debugging.
>> Usable: models provide easy-to-implement and easy-to-test abstractions to utilize all available hardware parallelism.
>> Interoperable: models must be able to coexist within an application and reliably exchange data.
>> Composable: models must be able to be nested and otherwise combined, with reliable and performant behaviors.
>> Reliable: models are not prone to common multithreading errors, and they can be mixed and matched in interesting and useful ways.
>> Performant: models must provide sufficient per-thread performance alongside productivity.
>> Scalable: models must provide additional performance scaling when adding processing elements.
>> Future-Proof: models provide automatic forward scaling to more processing elements along with sufficient per-thread performance.
>> Open: models should work on multiple platforms where applicable, so developers can use them anywhere.
>> Standard: models should provide published standards where applicable, so others can interoperate.
>> Portable: models can be used on multiple OS, HW, IDE, and compiler platforms, with flexible licensing.
Intel® TBB: paid commercial version, www.threadingbuildingblocks.com
Intel® ArBB: Beta download, http://software.intel.com/en-us/articles/intel-array-building-blocks
Figure 5
Intel ArBB, formerly Intel® Ct Technology, is a C++ library like Intel TBB, which offers highly performant and scalable data parallelism and vectorization. It utilizes a JIT, or just-in-time, compiler to dynamically optimize for any given target heterogeneous hardware platform, including both manycore processors and vector arithmetic units. It can generate highly optimized machine code on the fly to take best advantage of the multiple processors, cores, and vector arithmetic units available on both CPUs and accelerators, which may comprise a distributed memory system. By using dynamic code generation, it can also overcome the modularity overhead of C++. For instance, it can support virtual functions without their runtime cost. To learn more about Intel ArBB, refer to the next article.

These three models for parallelism (Intel TBB, Intel Cilk Plus, and Intel ArBB) all share a common infrastructure. Each individual model also adheres to strict selection criteria that guarantee a number of compelling value propositions by default (Figure 5). These models are complementary, and each provides unique value in particular application contexts. In combination, they form a single comprehensive solution for task parallelism, data parallelism, and vectorization, with interfaces implemented using both language extensions and libraries. Together, they provide a unified solution to parallelism and are available as part of Intel® Parallel Studio 2011 as Intel® Parallel Building Blocks (Intel® PBB). Interested developers can get started with Intel PBB today by referring to the above links. o
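Of the three models, Intel Cilk Plus is the only one not illustrated elsewhere in this issue, so here is a minimal sketch of its keyword style. It is a hypothetical example using the cilk_spawn/cilk_sync and cilk_for keywords from the cilk/cilk.h header; the functions and loop body are illustrative only, not from the original article.

#include <cilk/cilk.h>

// Task parallelism: recursive fork/join with cilk_spawn and cilk_sync.
long fib(int n) {
    if (n < 2) return n;
    long a = cilk_spawn fib(n - 1);  // may run as a parallel task
    long b = fib(n - 2);
    cilk_sync;                       // wait for the spawned task
    return a + b;
}

// Data parallelism: cilk_for parallelizes a counted loop.
void scale(float* x, int n) {
    cilk_for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;
}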
Intel® Array Building Blocks
By Michael McCool, Software Architect, Intel Corporation
Intel® Array Building Blocks (Intel® ArBB) is a sophisticated and powerful platform for portable data-parallel software development. Intel ArBB will be available as a component of Intel® Parallel Building Blocks, along with several other tools and libraries for parallel programming.

Intel ArBB can be used to parallelize compute-intensive applications within a structured, deterministic-by-default framework. It also provides powerful runtime generic programming mechanisms, yet can be used with existing compilers. In particular, it has been verified to work with the Intel, Microsoft*, and gcc C++ compilers. Intel ArBB is currently in Beta, and feedback is appreciated; it can be downloaded today from http://intel.com/go/ArBB for either Windows or Linux.

Is Intel ArBB a language or a library? Yes: both at the same time. Intel ArBB is the answer to the following question: How can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages? The answer is an embedded language. Intel ArBB is a language extension implemented as an API. It has a library interface, but includes a capability for the dynamic generation and optimization of parallelized and vectorized machine language.
Modern processors include many mechanisms for increasing performance through parallelism: multiple cores, hyperthreading, superscalar instruction issue, pipelining, and single-instruction, multiple data (SIMD) vector instructions. The first two, multiple cores and hyperthreading, can be accessed through threads, although for efficiency, one may want to use lightweight tasks that share hardware threads. Instruction-level parallelism, such as superscalar instruction issue and pipelining, is invoked automatically by the processor, as long as the instruction stream avoids unnecessary data dependencies. However, the last form of parallelism, SIMD vector parallelism, can only be accessed by generating special instructions that explicitly invoke multiple operations at once: SIMD instructions. SIMD instructions perform the same operation on multiple components of a vector at once, so they are sometimes also called SIMD vector instructions.

Intel® Parallel Building Blocks (Intel® PBB) is a set of comprehensive parallel development models that supports multiple approaches to parallelism.

Figure 1. Practical testing and performance expectation.
SIMD vector instructions are very powerful, and they are becoming more powerful over time. In current processors that support streaming SIMD extensions (SSE), four single-precision floating point operations can be executed with a single SSE SIMD instruction. In next-generation AVX processors, the width of the SIMD instructions will double, so eight such operations can be executed at once. In the Intel® Many Integrated Core (MIC) architecture, the width doubles again, so 16 such operations can be executed at once.
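As a concrete illustration (not from the original article), the four-wide SSE case looks like this with compiler intrinsics; _mm_add_ps issues one instruction that performs four single-precision additions:

#include <xmmintrin.h>  // SSE intrinsics

// Add two arrays of four floats with a single SSE instruction.
void add4(const float* a, const float* b, float* out)
{
    __m128 va = _mm_loadu_ps(a);     // load four floats
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);  // one instruction, four additions
    _mm_storeu_ps(out, vr);          // store four results
}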
Computations on ArBB collections can be expressed in two ways: as sequences of operations over entire collections (vector mode), or as functions replicated over every element of a collection (elemental mode). Vector mode is the simplest: arithmetic operations on collections apply in parallel to the corresponding members of the collections. This works even if the element type is user-defined and the user has overloaded the operator themselves. For example, suppose we have four dense<f32> collections called A, B, C and D, all of the same size. Then the following expression will operate in parallel on all the elements of these collections:
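A representative expression of this kind, mirroring the elemental kernel shown below (a reconstruction, not verbatim from the article), would be:

A += (B / C) * D;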
You can also write "elemental" functions over scalar Intel ArBB types:

void
kernel(f32& a, f32 b, f32 c, f32 d)
{
    a += (b/c)*d;
}

You can invoke elemental functions from inside a call by using the map operation. A map operation replicates the function over every element of the input containers.

void
doit(dense<f32>& A, dense<f32> B, dense<f32> C, dense<f32> D)
{
    map(kernel)(A, B, C, D);
}
call(doit)(A,B,C,D);

Some of the collection arguments can also be replaced with scalars:

void
doit(dense<f32>& A, f32 b, f32 c, dense<f32> D)
{
    map(kernel)(A, b, c, D);
}
call(doit)(A,b,c,D);

This uses the same kernel function, but with the types of b and c matching the corresponding function argument exactly; in this case, f32. There will still be as many parallel instances of the kernel as there are elements in the collections A and D, but every instance will get a copy of the same value of b and c. In summary, call arguments need to match exactly, but map functions are polymorphic and any argument can either be a single element or a collection.

In addition to using these two basic patterns to express parallel operations, users of ArBB also have access to several collective operations that act on or take an entire container as an input. These operations can shift the contents of containers around, take cumulative sums (prefix scans), perform sets of reads and writes (known as scatters and gathers), discard elements and pack the remainder into a contiguous sequence (known as pack; the inverse is unpack), or simply combine all elements into a single element. Combination of all the elements of a container into a single element is called a reduction. For example, the following computes the dot product (sum of pairwise products) of two containers A and B:

f32 A_dot_B = add_reduce(A * B);

Elemental functions can also use control flow. During the process described above, ordinary C++ control flow is actually executed only when the function is "captured." This is incredibly useful for generic programming in order to specify variants, and to reduce the overhead of modularity and configuration. However, in order for control flow to be visible to ArBB and be compiled into the vector machine code generated by it, special macros need to be used to express "embeddable" control flow.
This is best shown by an example. The following program computes the Mandelbrot set, the famous fractal found by counting the number of iterations required for a complex quadratic to diverge from a given starting point. Plotting this number of iterations over a region of the complex plane results in the image shown in Figure 1. An elemental function computes a single pixel of this image.
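A minimal sketch of what such an elemental function could look like, using the embeddable control flow macros described below; the iteration bound max_count, the _if/_end_if conditional, and the norm() operation on ArBB complex values are assumptions for illustration, not the original listing:

void mandel(i32& d, std::complex<f32> p)
{
    std::complex<f32> z = 0.0f;
    i32 count = 0;
    i32 i;
    _for (i = 0, i < max_count, i++) {
        // Advance the iterate only while |z|^2 has not escaped
        // the radius-2 circle; otherwise keep the count fixed.
        _if (norm(z) <= 4.0f) {
            z = z * z + p;       // the complex quadratic iteration
            count = count + 1;
        } _end_if;
    } _end_for;
    d = count;                   // iterations before divergence
}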
One subtlety in such a function is a parameter, like the iteration bound, whose value we may want to change between invocations; this can be done with closures. If we use call, we will capture and "freeze" the value of this variable the first time we invoke this function. If we use capture, we can change the value and capture different versions and use closure objects to manage them. Finally, note that ArBB control flow has a few differences from C++ control flow: it uses a leading underscore, but also has closing keywords (such as _end_for) and the arguments to _for are separated by commas, not semicolons.

To actually get useful work done, we have to get data in and out of ArBB collections. This can be done in a variety of ways. ArBB actually supports an efficient STL-friendly interface based on iterators for sophisticated applications. However, the simplest way to get data in and out of ArBB is to simply associate an ArBB collection with a C++ array using bind, as follows. Note that we also have to use a helper call function "doit" to invoke the elemental function inside a map.

void
doit(dense<i32,2>& D, dense<std::complex<f32>,2> P)
{
    map(mandel)(D,P);
}

dense<std::complex<f32>,2> pos;
bind(pos, c_pos, cols, rows);
dense<i32,2> dest;
bind(dest, c_dest, cols, rows);
call(doit)(dest, pos);
Automatic Parallelism with the Intel® Math Kernel Library (Intel® MKL)
The Intel® Math Kernel Library (Intel® MKL) provides software developers optimized and automatically parallelized mathematical library routines. Our routines are thread-safe and are applicable to many engineering, science, financial, and other applications. In this article, we provide an overview of the techniques Intel MKL uses to achieve the highest level of parallelism, as well as the hooks and knobs useful for getting the most from these threaded hotspots.

Intel MKL has a number of domains useful to developers who create applications for desktops, servers, and clusters. This includes industry standards like the Basic Linear Algebra Subroutines (BLAS) and the latest version of the Linear Algebra PACKage (LAPACK), as well as Fast Fourier Transforms (FFTs), Vector Math and Statistics Libraries (VML, VSL), a direct sparse solver (PARDISO), and sparse BLAS. To help lower the barrier to programming distributed memory architectures (clusters), Intel MKL includes ScaLAPACK, Parallel BLAS, and Cluster FFTs. Intel MKL is available on the latest versions of Linux, Windows, and Mac OS X. We have tuned code for Intel and AMD* hardware, including both IA-32 and Intel® 64 architectures.

The primary advantage of Intel MKL is that it makes the highest performance levels easily accessible to software developers. Within the software, we do automated dispatching to amortize the value of the underlying hardware features. This means users calling an industry-standard subroutine like DGEMM from the BLAS get performance improvements on different systems without having to re-link their applications. Intel MKL simply detects the hardware and dispatches code optimized for that processor, requiring no effort on behalf of the user. For example, the same application when linked with Intel MKL should run optimally on Intel® Core™2 Duo and Intel® Core™ i7 processors because kernels optimized for both processors are already built in and dispatched during runtime.

Multicore machines are the latest trend in computing, offering high degrees of parallelism. While the potential for even higher performance is a natural side effect of an increased number of cores, the challenge of extracting that performance (and parallelism) falls squarely on the shoulders of the software developer. A library such as Intel MKL, where we have threaded most of the commonly used routines, is a simple and effective means of obtaining that parallelism.

Threading within Intel MKL is based on the industry-standard OpenMP* specification. We thread in several of our domains: the direct sparse solver, LAPACK, BLAS, Sparse BLAS, VML, FFTs, and Cluster FFTs. For industry-standard components like LAPACK and the BLAS, our tuning goes beyond what one can find in the public domain. For example, in LAPACK we have added threading to some of the computational linear equation routines, orthogonal factorizations, singular value decompositions, and eigenproblems. In all cases, Intel MKL is thread-safe, so simultaneous execution of routines from multiple threads works correctly.
Users can set Intel MKL-specific variables such as MKL_NUM_THREADS to specify the number of OpenMP* threads so as not to interfere with a user's other OpenMP* routines and environment variables. Intel MKL checks the Intel MKL-specific variables first; however, we are constrained by the underlying OpenMP environment. If neither the Intel MKL routines nor the OpenMP routines/variables are set, the underlying OpenMP default system will take precedence. If a developer builds an application for their own customers and wants control over the OpenMP environment (as opposed to allowing their users to experiment with environment variables), they can call Intel MKL threading service functions. Our threading service functions take precedence over our environment variables. Intel MKL works with the Intel® Compiler's OpenMP* libraries, in addition to those of Microsoft and GNU.

Developers can control the number of threads not only on an Intel MKL-wide level, but also on a domain-specific level with the MKL_DOMAIN_NUM_THREADS environment variable or its corresponding service function, mkl_domain_set_num_threads(). For instance, if one wants all of Intel MKL to use two threads and the BLAS instead to use four, a user can set the variable with "MKL_ALL=2, MKL_BLAS=4."
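A minimal sketch of the service-function route follows. It is a hypothetical example, assuming the mkl.h entry points mkl_set_num_threads() and mkl_domain_set_num_threads() with the MKL_DOMAIN_BLAS constant, and an arbitrary matrix size:

#include <mkl.h>

int main()
{
    mkl_set_num_threads(2);                          // all of Intel MKL: 2 threads
    mkl_domain_set_num_threads(4, MKL_DOMAIN_BLAS);  // BLAS domain: 4 threads

    const int n = 512;
    double *a = (double*)mkl_malloc(n * n * sizeof(double), 64);
    double *b = (double*)mkl_malloc(n * n * sizeof(double), 64);
    double *c = (double*)mkl_malloc(n * n * sizeof(double), 64);
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    // DGEMM runs with the BLAS domain setting (four threads here).
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);

    mkl_free(a); mkl_free(b); mkl_free(c);
    return 0;
}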
All environment variables are read only once in the course of a run. To change the behavior in the middle of a run requires calls to the service functions.

For developers building applications using a threading model other than OpenMP, such as the Intel® Cilk™ Plus runtime library, Intel® Threading Building Blocks, or pthreads in Linux, we suggest threading with the user's method of choice at the highest level, and either linking in the sequential Intel MKL, or setting MKL_NUM_THREADS to one in the threaded version.

This usage model works well because threading is most effective when applied at the highest possible level, as it is in the current Intel MKL. For example, the original design and current public implementation of LAPACK depends on parallelism within the underlying BLAS routines. We found we could obtain better performance when we did threading at the LAPACK level and called sequential BLAS, rather than relying on threading only in the BLAS. The critical observation is that the advantage of threading at a higher level increases with the number of threads. If we take the LAPACK routine DGETRF (double precision general matrix factorization via Gaussian elimination with partial row pivoting), fix a large matrix size on a manycore machine, and test the gap between threading just at the BLAS level versus at the LAPACK routine level, the performance advantage increases as the number of threads increases.

Additionally, there are times when it is useful to take advantage of multiple levels and styles of parallelism, such as in a distributed memory cluster running MPI (Message Passing Interface) between the nodes. The Intel MKL benchmark MP LINPACK (which solves a cluster problem similar to DGETRF) uses hybrid MPI-OpenMP* parallelism for even greater performance. This is analogous to our previous statement regarding threading at the highest level. While running one MPI process per core is the most basic mechanism for parallelism on a cluster, running fewer MPI processes and putting OpenMP* calls into the code raises the threading level higher in the application and should yield performance gains.

Intel MKL depends on the underlying OpenMP* software to determine the number of threads. When the presence of MPI is detected and MPI has not been initialized for multithreading, Intel MKL will default to one thread. Likewise, when called from inside an OpenMP parallel region, the default will be one thread. If OpenMP gives us more threads than the number of physical cores (which might happen when HT is enabled), we will scale down the number of threads to match the number of physical cores. But there are times when our default choice may not be optimal, because of other aspects of the application we cannot detect. A user can set MKL_DYNAMIC to FALSE (its default is TRUE) or call mkl_set_dynamic() to try to override the number of threads we think will run optimally. Note that FFTs require both a setup and an execute stage, and the number of threads should be the same for both.

By taking advantage of the automatic parallelism Intel MKL provides, applications can get higher performance on modern multicore architectures. o

Web Bibliography:
[BLAS] http://www.netlib.org/blas/index.html
[LAPACK] http://www.netlib.org/lapack/index.html
[MKL] http://software.intel.com/en-us/intel-mkl/
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[MPI] http://www.intel.com/go/mpi
[OPENMP] http://www.openmp.org
[SCALAPACK] http://www.netlib.org/scalapack/index.html
By Don Gunning, Nick Meng, and Paul Besl
Figure 1. Elapsed time (hours) versus model size, comparing the real timing on the existing system, the real timing on the new system, and the estimation for the new system.

Needless to say, there were some issues and diagnosis was necessary.

With Intel Trace Analyzer and Collector, the diagnosis was straightforward. Basically, you add "-trace" to your MPI command line or "--mpi-options -trace" in your run script file. Here is the run command with the ITAC tool:

source /opt/intel/itac/8.0.1.001/bin/itacvars.sh
export LD_PRELOAD=/opt/intel/itac/8.0.1.001/slib/libVT.so
# 8p run
runexec small_model.pre -np 8 --mpi-options -trace --machines-file $PBS_NODEFILE
# 256p run
runexec large_model.pre -np 256 --mpi-options -trace --machines-file $PBS_NODEFILE

We tested two test cases with the Intel Trace Analyzer and Collector tool: a small test case with eight cores and a large test case with 256 cores. After we got STF trace files, we collected statistical data with the analyzer tool. Figures 2 and 3 show the statistical data collected from the small test case in graph mode. In Figure 2, we can see the communication from P0 to P1-P7 is hot, and P0 to P1 is the hottest (see the outlined area).

Figure 2. Total MPI collective function time in each MPI process generated by Intel® Trace Analyzer and Collector. (Red indicates hot [busiest]; blue, cool [not so busy]. Annotation: MPI BCAST.)
Figure 3. Total time of MPI collective function in each MPI process generated by Intel® Trace Analyzer and Collector. (Red indicates hot; blue, cool. Annotation: MPI BCAST.)
Figure 4. Load balance of large test case in pie mode on 256 cores generated by Intel® Trace Analyzer and Collector. (Red indicates MPI code percent execution time; blue indicates application percent execution time.)
Figure 5. Profile data of large test case and total time of MPI collective function in each MPI process generated by Intel® Trace Analyzer and Collector.
Figure 6. The activities of functions from 0 to 60 seconds generated by Intel® Trace Analyzer and Collector.
Figure 7. The activities of functions from 21.136 to 21.176 seconds generated by Intel® Trace Analyzer and Collector. (Red indicates MPI code; blue indicates application code. The horizontal axis represents time. The vertical axis represents separate MPI processes/ranks.)

In Figure 3, you can see MPI_Bcast is the hottest function in the test case (see the outlined area). We also did deep performance investigation with a large test case on 256 cores. Figures 4, 5, 6, and 7 show the same phenomena in the large test case.

Figure 4 displays the percentage of functions in pie mode: the blue area indicates the computing cost of applications; the red area indicates the cost of MPI communications. You can see that the master MPI process pie is almost blue, and all slave MPI process pies are almost red. It means that there is a very serious load imbalance issue here. After we look at Figures 6 and 7, we find the serious load imbalance is from the MPI_BCAST function. Figures 6 and 7 show the activities of MPI functions, and
the blue area indicates the application function activity. Obviously, the large test case spent a lot of time in MPI functions, which is extremely abnormal.

In Figure 4, we noticed the master MPI process (pie in upper left corner) is blue and all slave MPI processes are almost red. Meanwhile, Figures 5, 6, and 7 indicate that the load imbalance issue is clearly from an MPI_BCAST function. This means that the default (algorithm) setting of the MPI_BCAST function is not appropriate for this case. We need to select the right algorithm. We did some tests with small cases and discovered the best setting for MPI_BCAST in this case: I_MPI_ADJUST_BCAST=4.
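In the bash environment used by the run scripts shown earlier, that setting would be applied along these lines (the variable and value are from the text; its placement in the script is an assumption):

# Select the fourth MPI_Bcast algorithm for Intel MPI collectives
export I_MPI_ADJUST_BCAST=4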
In this situation, the root cause of an unexpected problem was found quickly and the solution was easily implemented, in large part due to the Intel Trace Analyzer and Collector's graphical data mining capability and the ISV developers' knowledge of the software. Finally, the user got reasonable performance on the new Intel 64 platform.

Now we will look at a more complex challenge.

Solving challenge tasks with mixed mode parallelism
Livermore Software Technology Corporation (LSTC) is continuously being challenged by users to deliver results faster on ever-increasing data set sizes. Further, in many cases the results must be consistent. LSTC offers shared and distributed memory versions of their software, with the distributed memory version offering better scalability than the shared memory version. The shared memory version uses OpenMP. The distributed memory version uses MPI.

We will now illustrate how the Intel Trace Analyzer and Collector was applied to LSTC LS-DYNA to enable handling of significantly larger data sets on Intel® multicore architecture.

The challenge
LSTC supplies LS-DYNA, a general-purpose transient dynamic finite element program capable of simulating complex real-world problems. In very simple terms, LSTC sells crash simulation software that is used in manufacturing: automobile design, aerospace, consumer products, and bioengineering. To improve solution accuracy, the problem size is continually increasing, with a non-trivial increase in computer time (e.g., solving a 10 million element problem can take 43 hours on a given cluster).

The challenge LSTC was faced with is the need for numerical consistency combined with limited network bandwidth, fixed node memory, and limited memory bandwidth, which prevented LSTC users from scaling beyond a certain node count as problem size increased. This can significantly increase runtimes.

BLOG highlights
Condition Variable Support in Intel® Threading Building Blocks
WOOYOUNG KIM

One feature present in the proposed C++ standard specification (i.e., N3092) threading support library, which we began supporting in Intel® Threading Building Blocks (Intel® TBB) 3.0, is the condition variable. As the C++1x proposal approaches final approval, we expect using threads in conjunction with condition variables will become more popular. For example, Microsoft has already been supporting condition variables natively since Windows* Vista. Until Intel® TBB 3.0, Intel TBB used to provide only half of it (cf., std::thread; it used to be called tbb::tbb_thread). The following code example shows how to use the Intel TBB condition variable.

#include "tbb/compat/condition_variable"
using namespace std;

condition_variable my_condition;
tbb::mutex my_mtx;
bool present = false;

void producer() {
    unique_lock<tbb::mutex> ul( my_mtx );
    present = true;
    my_condition.notify_one();
}

void consumer() {
    unique_lock<tbb::mutex> ul( my_mtx );
    while( !present )
        my_condition.wait( ul );
}

READ THE REST OF WOOYOUNG'S POST: Visit Go-Parallel.com
Figure 8. Profile data of a customer model on four nodes generated by Intel® Trace Analyzer and Collector.
Figure 9. Profile data of a customer model on 32 nodes generated by Intel® Trace Analyzer and Collector.
Figure 11. Profile data of the 1M model on 128 nodes generated by Intel® Trace Analyzer and Collector.
Figure 12. Load balance of the 1M model in pie mode on 128 nodes generated by Intel® Trace Analyzer and Collector. (Legend: application functions, MPI_BCAST, other MPI functions, MPI_RECV.)
Figure 13. Load balance of the 1M model in pie mode on 128 nodes generated by Intel® Trace Analyzer and Collector.
Conclusion
We believe, and have shown, that the Intel cluster tools can significantly reduce the time and effort needed to achieve performance increases, from years to months. Further, the tools can help when performance is not what is expected.
Finally, we have worked to reduce the learning curve so that experienced parallelism enablers
and knowledgeable application developers can quickly gain the additional insights that the cluster
tools provide. o
Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize
for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction
sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors.
For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they
implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library
routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other
microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and
Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will
get extra performance on Intel microprocessors.
Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree
for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and
Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.
While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on
Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine
which best meet your requirements. We hope to win your business by striving to offer the best performance of any
compiler or library; please let us know if you find we do not.
Notice revision #20101101
New from the makers of Intel® VTune™ Performance Analyzer and Intel® Visual Fortran Compiler: Intel® Parallel Studio XE

Achieve enhanced developer productivity. Intel® Parallel Studio XE 2011 combines ease-of-use innovations with advanced functionality for high performance, scalability, and code robustness on both Linux and Windows.
Take Performance to the Extreme. Introducing Intel® Parallel Studio XE.

From one-person start-ups to enterprises with thousands of developers working on a single application, Intel® Parallel Studio XE 2011 extends industry-leading development tools for unprecedented application performance and reliability.

© 2010, Intel Corporation. All rights reserved. Intel, the Intel logo, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.