
THE PARALLEL UNIVERSE
Issue 5
November 2010

Letter from the EDITOR
by James Reinders

INTEL® PARALLEL
BUILDING BLOCKS
The Answer(s) to Cracking
the Parallelism Puzzle
by David Sekowski

Simplifying High Performance with
INTEL® PARALLEL STUDIO XE
AND INTEL® CLUSTER STUDIO TOOL SUITES
by Sanjay Goil and John McHugh
INTEL® PARALLEL STUDIO HAS GONE EXTREME.
FROM THE MAKERS OF Intel® VTune™ Performance Analyzer and Intel® Visual Fortran Compiler comes the ultimate all-in-one performance toolkit:
INTRODUCING INTEL® PARALLEL STUDIO XE

Intel Parallel Studio XE for Linux‡

Purchase the tools individually or get the complete suite and save. See the table for important name change information.

PREVIOUS PRODUCT NAMES → NEW PRODUCT NAMES
(new) → Intel® Parallel Studio XE
(new) → Intel® C++ Studio XE
Intel® Cluster Toolkit Compiler Edition → Intel® Cluster Studio
Intel® Compiler Suite Professional Edition → Intel® Composer XE
Intel® C++ Compiler Professional Edition → Intel® C++ Composer XE
Intel® Visual Fortran Compiler Professional Edition → Intel® Visual Fortran Composer XE
Intel® Fortran Compiler Professional Edition → Intel® Fortran Composer XE
Intel® VTune™ Performance Analyzer (including Intel® Thread Profiler) → Intel® VTune™ Amplifier XE
Intel® Thread Checker → Intel® Inspector XE
‡ Intel Parallel Studio XE also supports Windows and Mac OS X.

The ultimate all-in-one performance toolkit


Intel Parallel Studio XE combines Intel’s industry-leading C/C++ and Fortran compilers, performance
and parallel libraries, correctness analyzers, code-quality tools, and performance profilers with ease-of-use
innovations to create an integrated tool suite that helps high-performance computing and enterprise
developers boost application performance, reliability, and security.

Advanced compilers and libraries: Intel® Composer XE
Advanced memory, threading, and security analyzer: Intel® Inspector XE
Advanced performance profiler: Intel® VTune™ Amplifier XE

Rock your code. Rock your world.


Get a free 30-day trial of Intel Parallel Studio XE today at http://software.intel.com/en-us/articles/intel-parallel-studio-xe/.
For more information regarding performance and optimization choices in Intel software products, visit http://software.intel.com/en-us/articles/optimization-notice.
© 2010, Intel Corporation. All rights reserved. Intel, the Intel logo, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
THE PARALLEL UNIVERSE

CONTENTS
Letter from the Editor
High Performance Options Have Never Been Greater
BY JAMES REINDERS
James Reinders focuses on the latest Intel® software developer tools designed to tap into the performance offered by today's computers.

Simplifying High Performance with Intel® Parallel Studio XE and Intel® Cluster Studio Tool Suites
BY SANJAY GOIL AND JOHN MCHUGH
With the introduction of Intel® Parallel Studio XE and Intel® Cluster Studio, Intel extends the reach of next-generation development tools to Windows* and Linux* C/C++ and Fortran developers.

Intel® Parallel Building Blocks: The Answer(s) to Cracking the Parallelism Puzzle
BY DAVID SEKOWSKI
Examine three models for parallelism (Intel® Threading Building Blocks (Intel® TBB), Intel® Cilk Plus, and Intel® Array Building Blocks (Intel® ArBB)), which together form a single comprehensive solution for task parallelism, data parallelism, and vectorization.

Intel® Array Building Blocks
BY MICHAEL MCCOOL
Intel® Array Building Blocks (Intel® ArBB) is the answer to the following question: How can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages?

Automatic Parallelism with the Intel® Math Kernel Library (Intel® MKL)
BY GREG HENRY AND SHANE STORY
Explore the techniques Intel® Math Kernel Library (Intel® MKL) uses to achieve the highest level of parallelism, as well as the hooks and knobs useful for getting the most from these threaded hotspots.

When Print Statements and Timer are Not Enough: Making the Parallelism Investment More Effective
BY DON GUNNING, NICK MENG, AND PAUL BESL
See how real-world developers are applying Intel® Trace Analyzer and Collector to find issues that would be undetectable with print statements and a timer, correct those issues, and deliver scaling performance.

© 2010, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, and Intel VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Sign up for future issues | Share with a friend

LETTER FROM THE EDITOR

High Performance Options Have Never Been Greater

This issue of The Parallel Universe magazine focuses on the latest Intel® software developer tools that help us tap into the performance offered by today's computers.

From a technology standpoint, the hardware has never been more complex. Even a single feature such as parallelism is present at every level of the computer architecture: superscalar processors with expanding issue rates, SIMD instructions with expanding widths, processors with expanding numbers of cores, and systems with expanding numbers of processors.

As systems become more complex, our tools have evolved to help software developers keep pace. Your feedback has been most helpful. Over the past year we've taken major strides in delivering developer tools for existing and emerging hardware, while simplifying the tools you need for high performance. Our focus has been to offer choice, protect your software investments, and help you scale forward to emerging and future hardware.

James Reinders is Chief Software Evangelist and Director of Software Development Products at Intel Corporation. His articles and books on parallelism include Intel Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism.


On November 9 we introduced the latest installments in our most popular software developer tools. These numerous tools come together to form two comprehensive studios: Intel® Parallel Studio XE and Intel® Cluster Studio. Intel Parallel Studio XE addresses the advanced performance challenges of today's machines for C, C++, and Fortran developers. Intel Cluster Studio offers tools uniquely tailored for distributed computing, specifically helping with programs using MPI, C, C++, and Fortran.

There are a lot of new features to be excited about, including version 12.0 Intel® compilers, a new version of Intel® VTune™ Performance Analyzer, new concurrency-compatible memory checking capabilities, code analysis for security and robustness, advanced MPI support, Intel® Parallel Building Blocks for C/C++, co-array Fortran support, and new threading error detection that handles not only compiled languages, but also .NET code.

Intel Parallel Studio XE is available for Linux* and Windows* developers. It includes Intel® Composer XE (compilers and libraries), Intel® VTune Amplifier XE, and Intel® Inspector XE.

Intel Cluster Studio is available for Linux and Windows developers. It includes Intel Composer XE, Intel® MPI Library and MPI Benchmarks, and the Intel® Trace Analyzer and Collector.

Our tools offer a rich selection of parallel programming methods to meet the numerous needs of different applications. They have no equal in providing robust ways to express parallelism: OpenMP*, MPI, co-array Fortran, Intel® Math Kernel Library (Intel® MKL), Intel® Integrated Performance Primitives (Intel® IPP), and Intel® Parallel Building Blocks (Intel® PBB), which includes Intel® Threading Building Blocks (Intel® TBB), Intel® Cilk Plus, and Intel® Array Building Blocks (Intel® ArBB).

To explore the advantages of these innovative tools, you'll find articles on Intel Parallel Building Blocks and the parts that make it up, "what's new and exciting in Fortran after all these years," and Intel MKL.

Eliminating defects is an important topic that gets attention in our tools as well, and in a way that's easy to utilize in your build environment. Our tools offer solutions for code quality, security, and application robustness, all applicable to parallel programs. The next-generation correctness analyzers combine memory, threading, and code analysis for security. The article "Intel® Inspector XE: An essential tool during development along with Intel® Composer XE" advocates this tool as an essential and regular part of your development cycle. The case for it seems clear.

Performance profiling is essential for high performance in detecting hotspots, and helping you alleviate them with additional insight into what is actually happening on your system. Our next-generation profiler, the Intel VTune Amplifier XE, provides easy-to-use, yet detailed, insight into the most pressing performance issues.

For the cluster developer, we have our highly scalable implementation of MPI, in the Intel MPI library, architected to scale to the largest systems. The article "On a path to petascale with commodity clusters and Intel MPI" highlights Intel MPI library advancements in our cluster tools for HPC.

Continuous development of software for high performance is a complex undertaking. Intel® software development tools work with existing and emerging Intel architecture, extending Intel leadership in processor technology, and in multicore and manycore processors. As customers, you look for predictability in your software development, and an assurance that the software investments you make today will continue to reap benefits in years to come. Our mission in the software tools group is to simplify the tools, and the way you purchase, install, develop, and support them.

Our Beta customers said very nice things about our new tools prior to release. I believe you will find Intel Parallel Studio XE and Intel Cluster Studio taking significant strides in advancing the innovation bar for programming, productivity, and programmability for high performance.

Enjoy!

JAMES REINDERS
Portland, Oregon
November 2010


Simplifying High Performance with
INTEL® PARALLEL STUDIO XE
AND INTEL® CLUSTER STUDIO TOOL SUITES
By Sanjay Goil, Product Marketing Manager
John McHugh, Marketing Communication Manager
Intel® Software Development Products


Figure 1. Intel® Parallel Studio XE 2011: Intel® Composer XE (optimizing compiler and libraries), Intel® Inspector XE (memory, thread, and security analyzer), and Intel® VTune™ Amplifier XE (performance profiler).

In September, Intel introduced Intel® Parallel Studio 2011, a tool suite for Microsoft* Windows* Visual Studio* C++ developers, with the singular objective of providing the essential performance tools for application development on Intel® Architecture. These tools provide significant innovation, and enable unprecedented developer productivity when building, debugging, and tuning parallel applications for multicore. With the introduction of Intel® Parallel Building Blocks (Intel® PBB), developers have methods to introduce and extend parallelism in C/C++ applications for higher performance and efficiency.

This month Intel is extending the reach of next-generation Intel tools to developers of applications on both Windows and Linux in C/C++ and Fortran who need advanced performance for multicore today and forward scaling to manycore. Intel Parallel Studio XE 2011 contains C/C++ and Fortran compilers; the Intel® Math Kernel Library (Intel® MKL) and Intel® Integrated Performance Primitives (Intel® IPP) performance libraries; the Intel PBB libraries: Intel® Threading Building Blocks (Intel® TBB), Intel® Cilk™ Plus, and Intel® Array Building Blocks (Intel® ArBB); the Intel® Inspector XE correctness analyzer; and the Intel® VTune™ Amplifier XE performance profiler.

Introducing New Tool Suites
Software developers of high performance applications require a complete set of development tools. While traditionally these tools include compilers, debuggers, and performance and parallel libraries, more often the issues in development come in error correctness and performance profiling. The code doesn't run correctly, exhibits error-prone behavior on some runs (pointing to data races, deadlocks, or performance bottlenecks in locks for synchronization), or exposes security risks at runtime. To this end, Intel's correctness analyzers and performance profilers are a great addition to the development environment for highly robust and secure code development. Figure 1.

For advanced and distributed performance, Intel is simplifying the procurement, deployment, and use of HPC tools on IA-32 and Intel® Architecture and compatible platforms, and HPC clusters programmed with the Message Passing Interface (MPI). Figure 2.

Boost Performance. Code Reliably. Scale Forward.
HPC programmers have traditionally been able to use all the compute power made available to them. Even with the performance leaps that Moore's law has allowed Intel architecture to deliver over the past decade, the hunger for additional performance continues to thrive. There are big unsolved problems in science and engineering, physical simulations at higher granularities, and problems where the economically viable compute power provides lower resolution or piecemeal simulation of smaller portions of the larger problem. This is what makes serving the HPC market so exciting for Intel, and it is a significant driver for innovation in both hardware and software methodologies for parallelism and performance.

Intel® Cluster Studio introduces tools for HPC cluster development with MPI, including the scalable Intel® MPI Library and the Intel® Trace Analyzer and Collector performance profiler, with the industry-leading C/C++ and Fortran compilers for a complete cluster development tool suite. This is combined with the ease of deployment offered by the Intel® Cluster Ready program, making deployment of cluster applications highly efficient.

[Development-cycle diagram: Assist/Architectural Analysis → Code/Add Parallelism → Performance/Optimize-Tune → Correctness/Quality & Robustness]


Phase: Advanced Code
Productivity Tool: Intel® Composer XE
Feature: C/C++ and Fortran compilers, performance libraries, and parallel models
Benefit: >> Drives application performance and scalability benefits of multicore and forward scales to manycore. Additionally, provides code robustness and security.

Phase: Advanced Correctness
Productivity Tool: Intel® Inspector XE
Feature: Memory and threading error checking tool for higher code reliability and quality
Benefit: >> Increases productivity and lowers cost by catching memory and threading defects early

Phase: Advanced Performance
Productivity Tool: Intel® VTune™ Amplifier XE
Feature: Performance profiler to optimize performance and scalability
Benefit: >> Removes guesswork, saves time, and makes it easier to find performance and scalability bottlenecks. Combines ease of use with deeper insights.

Figure 2

Highlights of Intel® Parallel Studio XE 2011

>> Available for Multiple Operating Systems: Intel® Parallel Studio XE provides the same set of tools to aid development for both Windows* and Linux* platforms. C/C++ and Fortran compilers, and performance and parallelism libraries, bring advanced optimizations to Mac OS* X.
>> Robustness: Intel® Inspector XE's memory and thread analyzer finds and pinpoints memory and threading errors before they happen.
>> Code Quality: Intel Parallel Studio XE enables developers to effectively find software security vulnerabilities through static security analysis.
>> Advanced Optimization: The compilers and libraries in Intel® Composer XE offer advanced vectorization support, including support for Intel® AVX. The C/C++ optimizing compiler now includes the Intel® PBB library, expanding the types of problems that can be solved more easily in parallel with increased scalability and reliability. For Fortran developers, it now offers co-array Fortran and additional support for the Fortran 2008 standard.
>> Performance: The Intel® VTune™ Amplifier XE performance profiler finds bottlenecks in serial and parallel code that limit performance. Improvements include a more intuitive interface, fast statistical call graph, and timeline view. The Intel® MKL and Intel® IPP performance libraries provide robust multicore performance for commonly used math and data processing routines. Simply linking the application with these libraries is an easy first step toward multicore parallelism.
>> Compatibility and Support: Intel Parallel Studio XE excels at compatibility with leading development environments and compilers. Intel offers broad support with forums and Intel® Premier Support, which provides fast answers and covers all software updates for one year.

A software development project goes through several steps to get optimal performance on the target platform. Most often, the developer gets a rudimentary performance profile of the application run to show hotspots. Once opportunities for optimization are identified, the coding aspects are handled by the compilers and performance and parallel libraries to add parallelism, presenting task-level, data-level, and vectorization opportunities. Finally, the correctness tools make robust code possible by checking for threading and memory errors, and identifying security vulnerabilities. This cycle typically repeats itself to find higher application efficiencies.

8 For more information regarding performance and optimization choices in Intel software products, visit http://software.intel.com/en-us/articles/optimization-notice.
THE PARALLEL UNIVERSE

Old Name → New Name

Intel® Compiler Suite Professional Edition → Intel® Composer XE
Intel® C++ Compiler Professional Edition → Intel® C++ Composer XE
Intel® Visual Fortran Compiler Professional Edition → Intel® Visual Fortran Composer XE
Intel® Visual Fortran Compiler Professional Edition with IMSL → Intel® Visual Fortran Composer XE with IMSL
Intel® VTune™ Performance Analyzer (including Intel® Thread Profiler) → Intel® VTune™ Amplifier XE
Intel® Thread Checker → Intel® Inspector XE
Intel® Cluster Toolkit Compiler Edition → Intel® Cluster Studio

Figure 3

The tools introduced in Intel Parallel Studio XE 2011 are next-generation revisions of industry-leading tools for C/C++ and Fortran developers seeking cross-platform capabilities for the latest x86 processors on Windows* and Linux* platforms. Those familiar with Intel's industry-leading tools will see that the product names have transitioned in this new release, in all cases with significant additional capabilities; other names remain the same. Figure 3.


"Intel® Parallel Studio XE 2011 is a great software development tool for performance-oriented Windows*-based C++ software developers. I achieved an astonishing boost in performance by using Intel® Cilk Plus and Array features in my code. If you need performance, try Intel Parallel Studio XE 2011."

Jorge Martinis
Research and Development Engineer, BR&E Inc.

Figure 4

Figure 5. Introducing SIMD pragmas for vectorization.


"BlueJeans Network is working on the next-generation video cloud-processing solution. We process large volumes of audio, video, and data content, and these processes are highly CPU intensive. Intel® IPP 7.0 worked great for us. Its comprehensive set of audio and video processing functionality was the perfect solution for our needs. It was a tremendous timesaver for us, as building these from scratch would have taken us forever! I definitely recommend Intel IPP 7.0."

Emmanuel Weber
Software Architect
BlueJeans Network

Figure 6. Hotspots in the application; thread-based CPU usage.

What’s New in Intel® Composer XE


Intel Composer XE contains next-generation C/C++ and Fortran compilers (v12.0) and performance and parallel libraries: Intel MKL 10.3, Intel IPP 7.0, and Intel TBB 3.0. Figure 2.

The latest Intel C/C++ compiler, Intel® C++ Compiler XE 12.0, is optimized for the latest Intel architecture processor (code-named Sandy Bridge) with Intel AVX support. The product contains Intel PBB, which includes advances in mixing and matching task, vector, and data parallelism in applications to better map to the multicore optimization opportunities: Intel Cilk Plus; Intel TBB; and Intel ArBB (in beta, available separately). Figure 4. There are vector optimizations with Intel AVX, SIMD pragmas, and array notation, in addition to GAP (Guided Auto-Parallelization), a tool to help in auto-parallelization for the highest performance and parallelism on the latest generation of x86 multicore CPUs. For Windows users, support for Visual Studio 2010* is included.

Intel® Fortran Compiler XE 12.0 includes several advances: more complete support for the Fortran 2003 standard and some support for the Fortran 2008 standard, including co-array Fortran; vector optimizations with AVX; and help with auto-parallelization for the highest performance and parallelism on the latest x86 multicore CPUs. Figure 5.

The performance libraries continue to provide an easy way to include highly optimized and automatically parallel math and scientific functions, and data processing routines for high performance users. The math library, Intel MKL 10.3, includes enhancements such as better Intel AVX support, a summary statistics library, and enhanced C language support for LAPACK. The data processing library, Intel IPP 7.0, includes improved data compression and codecs, and support for Intel AVX and AES instructions, continuing to excel at data processing intensive application domains.


Intel® Inspector XE
Memory, Threading, and Security Checker
>> Increase application reliability and security

"It was an easy and fast ramp to start using the Intel® Inspector XE 2011 tool. We were able to set the analysis level, obtain a visual interpretation of the collected data, and get helpful information on hidden data races in the code quickly."

Alex Migdalski
CEO and CTO
OTRADA Inc.

Figure 7

Enhanced Developer Productivity with Correctness Analyzers and Performance Profilers

Intel Parallel Studio XE 2011 combines ease-of-use innovations, introduced in Intel Parallel Studio, with advanced functionality for high performance, scalability, and code robustness for Linux and Windows. Intel has traditionally offered developer tools on both Windows and Linux, and strives to offer the same functionality across both platforms, which is especially important for developing applications to run on both operating systems. Figure 6.

With the capabilities in the correctness analyzer, Intel Inspector XE (Figure 7), the product helps C/C++ and Fortran developers with static and dynamic code analysis, through threading and memory analysis tools, to develop highly robust, secure, and highly optimized applications.

New capabilities in the Intel® Inspector XE correctness analyzer include:

>> Simplified configuration and run analysis


>> Finds coding defects quickly, such as:
• Memory leaks and memory corruption
• Threading data races and deadlocks
>> Supports native threads, understands any parallel model built on top of threads
>> Dynamic instrumentation works on standard builds and binaries
>> Timeline view to explore context of the respective threads
>> Intuitive standalone GUI and command line interface for Windows and Linux
>> Advanced command line reporting


Figure 8. Intel® VTune™ Amplifier XE - Hotspot. Performance profiler: find performance bottlenecks; functions sorted by amount of CPU time.

Figure 9. Intel® VTune™ Amplifier XE - Concurrency. Performance profiler: color shows the number of cores utilized; click {+} to view call stacks.

"Intel VTune Amplifier XE 2011 is the next generation of the Intel VTune Analyzer…"

Intel VTune Amplifier XE 2011 is the next generation of the Intel VTune Performance Analyzer, a powerful tool to quickly find and provide greater insights into multicore performance bottlenecks. It removes the guesswork and analyzes performance behavior in Windows* and Linux* applications, providing quick access to scalability bottlenecks for faster and improved decision making. Figures 8, 9.

The next-generation Intel® VTune™ performance profiler has new features, including:

>> Easy, predefined analyses
>> Fast hotspot analysis (hot functions and call stack)
>> Powerful filtering
>> Threading timeline
>> Frame analysis
>> Attach to a running process (Windows)
>> Event multiplexing
>> Simplified remote collection
>> Improved compare results
>> Tight Visual Studio* integration
>> Non-root Linux* install
>> Only EBS driver install needs root


Intel® Inspector XE - Security Checker
Memory, Threading, & Security Checker

>> Improve code security with static analysis
>> Find buffer overruns, unsafe library usage, uninitialized variables, bad pointers

When used with Intel® Parallel Studio XE

"Intel® static security analysis (SSA) allowed us to easily find lots of potential flaws, thus preventing future bugs or misuse. Ultimately, we expect the SSA to help us not reproduce typical security flaws."

Mikael Le Guerroue
Senior Architecture Engineer
Envivio

Figure 10

Software security starts very early in the development phase, and Intel Parallel Studio XE 2011 makes it faster to identify, locate, and fix software issues prior to software deployment. This helps identify and prevent critical software security vulnerabilities early in the development cycle, where the cost of finding and fixing errors is lowest. Figures 10, 11.

Intel's static security analysis (SSA), included in the Intel® Parallel Studio XE bundle, provides unique advantages for robust code development:

>> Easier, faster setup and ramp to get static analysis results
>> Simple approach to configure and run static analysis
>> Discovers and fixes defects at any phase of the development cycle
>> Finds more than 250 security errors, such as:
• Buffer overruns and uninitialized variables
• Unsafe library usage and arithmetic overflow
• Unchecked input and heap corruption
>> Tracks state associated with issues, even as source evolves and line numbers change
>> Displays problem sets and location of source
>> Provides filters, assignment of priority, and maintenance of problem set state
>> Intuitive standalone GUI and command line interface for Windows and Linux


Feature: Support for both Linux* and Windows* platforms
Benefit: Development capability with the same set of tools on both Windows* and Linux* platforms; enhanced performance, productivity, and programmability

Feature: C/C++ compilers with Intel® Parallel Building Blocks
Benefit: Breakthrough in providing choice of parallelism for applications (task, data, vector) with mix and match for optimizing application performance. C/C++ standards support

Feature: Fortran compilers with key Fortran 2008 standards support, including Co-Array Fortran (CAF)
Benefit: Advances in the industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters (cluster support available separately with Intel® Cluster Studio 2011); Fortran standards support

Feature: Memory, threading, and security analysis tools in one package
Benefit: Enhances developer productivity and efficiency by simplifying and speeding the process of detecting difficult-to-find coding errors

Feature: Updated performance libraries
Benefit: Multicore performance for common math and data processing tasks, with a simple linking with these automatically parallel libraries

Feature: Updated performance profiler
Benefit: Several ease-of-use enhancements, deeper microarchitectural insights, enhanced GUI, and quicker, more robust performance

Figure 11

Intel® Cluster Studio
Distributed Performance

Contains:
>> Intel® Composer XE Compiler and Libraries
>> Intel® MPI Library
>> Intel® Trace Analyzer and Collector

Increase Performance and Scalability of HPC Cluster Computing

Intel® Cluster Studio 2011 sets a new standard in distributed parallelism on Intel architecture-based clusters. This premier tool suite provides development flexibility for enabling MPI-based application performance for highly parallel shared-memory and cluster systems based on IA-32 and Intel® 64 architectures. The newly architected Intel MPI Library 4.0 is key to achieving these advantages by providing new levels of cluster scalability, improved interconnect support across many fabrics, faster on-node messaging, support for hybrid parallelization, and an application tuning capability that adjusts to the cluster and application structure. For the developer, the Intel Trace Analyzer and Collector 8.0 is enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications. The latest Intel C/C++ and Fortran compiler technology, along with Intel MKL 10.3, Intel IPP 7.0, and Intel PBB (also sold as Intel® Composer XE), complements the suite to further optimize and parallelize application execution on each computing node. Co-array Fortran is supported on clusters in this package.

Along with Intel Cluster Ready (ICR), a program to define cluster architectures for increasing uptime, increasing productivity, and reducing total cost of ownership (TCO) for IA-based HPC clusters, Intel Cluster Studio 2011 makes it easy to code, debug, and optimize to gain higher scalability for MPI-based cluster applications, up to petascale; it is also the premier suite for developing and tuning hybrid-parallel codes that can mix MPI with multithreading paradigms such as OpenMP or Intel PBB.

Intel Cluster Studio 2011 provides an extensive software package containing Intel C/C++ Compilers and Intel® Fortran Compilers for all Intel architectures, plus all the Intel® Cluster Tools that help you develop, analyze, and optimize performance of parallel applications on Linux or Windows. By combining all the compilers and tools into one license package, Intel can provide single installation, interoperability, and support for the best-in-class cluster software tools.


Figure 12. P0 is the source of the problem now.

Intel® Trace Analyzer and Collector

>> Visualize and understand parallel application behavior
>> Evaluate profiling statistics and load balancing
>> Analyze performance of subroutines or code blocks
>> Identify communication hotspots

Highlights of Intel® Cluster Studio 2011

>> Scalability and High Performance: The interconnect-tuned and multicore-optimized Intel® MPI Library delivers application performance on thousands of Intel Architecture and compatible multicore processors.
>> Built-in Optimization: Utilize optimizing compilers and libraries in Intel® Composer XE to get the most out of advanced processor technologies. The C/C++ optimizing compiler now includes Intel PBB, which expands the types of problems that can be solved more easily in parallel, and with increased reliability. For Fortran developers, it now offers Co-array Fortran (CAF) and additional support for the Fortran 2008 standard. Intel® compilers also deliver advanced vectorization support with SIMD pragmas.
>> Ease of MPI Tuning: Intel® Trace Analyzer and Collector has been enhanced with new features that accelerate the analysis and tuning cycle of MPI-based cluster applications.
>> Target Applications to Multiple Operating Systems: Leverage the same source code in Intel® compilers and libraries, which bring advanced optimizations to Windows and Linux.
>> Intel® Cluster Ready Qualified: This program defines cluster architectures to increase uptime and productivity and reduce total cost of ownership (TCO) for IA-based HPC clusters.
>> Compatibility and Support: Intel Cluster Studio offers excellent compatibility with leading development environments and compilers, while providing optimal support for multiple generations of Intel processors and compatibles. Intel offers broad support through its forums and Intel® Premier Support, which provides fast answers and covers all software updates for one year.

16 For more information regarding performance and optimization choices in Intel software products, visit http://software.intel.com/en-us/articles/optimization-notice.

Feature — Benefit

Analysis tools for MPI developers — Enhanced developer productivity and efficiencies by simplifying and speeding the detection of load imbalance errors and offering performance profiling of MPI messages (diagram; ideal interconnect simulator).

Scalable Intel MPI Library with multirail IB support and Application Tuner — Scale to tens of thousands of cores with one of the most scalable and robust commercial MPI libraries in the industry. Ease of use with dynamic and configurable support across multiple cluster fabrics and multi-rail IB support.

C/C++ Compilers with Intel® Parallel Building Blocks — Breakthrough in providing choice of parallelism for applications (process, task, data, vector) with mix and match for optimizing application performance on clusters of SMP nodes. C/C++ standards support.

Fortran compilers with key Fortran 2008 standards support, including co-array Fortran (CAF) on clusters (available on Linux now and Windows later) — Advances in the industry-leading Fortran compilers with new support for scalable parallelism on nodes and clusters. Fortran standards supported include key features in Fortran 2008, plus more complete Fortran 2003 support.

Updated performance libraries, Intel® MKL and Intel® IPP — Multicore performance for common math and data processing tasks, with simple linking with these automatically parallel libraries.

Support for both Linux* and Windows* platforms — Development capability with the same set of tools on both Windows and Linux platforms for enhanced performance, productivity, and programmability.

Figure 13

Summary
With the introduction of Intel Parallel Studio XE and Intel Cluster Studio, Intel is extending the
reach of the next-generation Intel tools to Windows and Linux C/C++ and Fortran developers
needing advanced performance for multicore today and forward scaling to manycore.
The Intel Parallel Studio XE 2011 bundle contains the latest versions of the Intel C/C++ and Fortran compilers, the Intel MKL and Intel IPP performance libraries, the Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus), the Intel Inspector XE correctness analyzer, and the Intel VTune Amplifier XE performance profiler.
The Intel Cluster Studio 2011 bundle contains the latest versions of the Intel MPI Library, Intel Trace Analyzer and Collector, Intel C/C++ and Fortran compilers, Intel MKL and Intel IPP performance libraries, and Intel PBB libraries (Intel TBB, Intel ArBB [in beta], and Intel Cilk Plus).

For more information, please visit http://www.intel.com/software/products. o


INTEL® PARALLEL BUILDING BLOCKS

THE ANSWER(S) TO CRACKING THE PARALLELISM PUZZLE
By David Sekowski
Program Manager
Intel® Corporation


Editor’s note:
Intel® Threading Building Blocks (Intel® TBB) has grown fantastically popular with C++ developers over the past five years. It has been ported to many platforms and used in many applications, including recently in well-known Adobe* products. This article introduces an expanded family of parallel models, with Intel TBB at the very center. The author introduces the complementary models that expand upon what Intel TBB can do in a compatible and complementary manner that makes Intel® Parallel Building Blocks (Intel® PBB) well worth understanding and using.

Motivation
As microprocessors transition from clock speed as the primary vehicle for performance gains to features such as multiple cores, it is increasingly important for developers to optimize their applications to take advantage of these platform capabilities. In the past, the same application would automatically perform better on a newer CPU due to increasingly higher clock speeds. However, when customers buy computers with the latest CPUs, they may not see a corresponding performance increase in applications that are written in serial or designed to take advantage of only one processing element. Therefore, hardware designers have to find other ways to deliver superior application performance, and they’re turning to an old standby from high-performance computing: parallel hardware platforms. This represents a new challenge for most software developers who haven’t had much experience with writing parallel software.
Since Amdahl’s Law was originally coined in 1967, high-performance computing experts have known a thing or two about the answer to this problem. Namely, that by putting more processors on the job we can reduce total application runtime through software parallelism. With the addition of Gustafson’s Law in 1988, the upper limits on scalability implied by Amdahl’s Law were effectively removed. Taken together, these principles opened an alternative path to improving software performance in mainstream applications without increasing clock speed. Welcome to the era of multicore processors.
Today and in the future, cutting-edge applications will turn to parallelism to harness the profound power of the dual-, quad-, and even-more-core processors found in most common mainstream computers. In his book Only the Paranoid Survive, Andy Grove talks about how strategic inflection points happen in business when a new technology has the ability to improve performance by an order of magnitude. It is at these inflection points that there is a possibility to revolutionize instead of “evolutionize” an industry. Multicore, and soon manycore, processors represent just such an opportunity for mainstream software application developers, if they can harness the added performance and functionality potential.

Multiple-Node Systems (Clusters) — apply lessons learned from HPC into mainstream computing:
> Systems are extensible to any number of single-node systems
> Distributed memory structures

Single-Node Systems (single-cluster nodes, desktops/workstations, laptops/netbooks):
> Motherboards are single-, dual-, or quad-socketed
> CPUs are single, multi-, or manycore
> Shared memory structures

Single CPU or Core (scalar processors and vector arithmetic units):
> Cores contain both scalar processors and vector arithmetic units
> Scalar processors handle single, often complex, operations on single data items
> Vector arithmetic units handle single, often simple, operations on multiple data items
> Shared cache structures

Figure 1
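A compact statement of the two laws just mentioned, writing p for the parallelizable fraction of the work and N for the number of processing elements (this notation is mine, not the article's):

```latex
% Amdahl's Law (fixed problem size): speedup is bounded by the serial fraction
S_{\text{Amdahl}}(N) = \frac{1}{(1-p) + p/N}
  \;\xrightarrow{\,N\to\infty\,}\; \frac{1}{1-p}

% Gustafson's Law (problem size grows with N): scaled speedup is unbounded
S_{\text{Gustafson}}(N) = (1-p) + pN
```

Amdahl's bound of 1/(1-p) is what limited scalability for fixed workloads; Gustafson's observation that workloads grow along with the machine is the sense in which that upper limit was "effectively removed."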


Distributed and Shared Memory Systems
It is useful when talking about software parallelism to begin by talking about hardware platforms from the highest to the lowest performance (Figure 1). This is because we can apply lessons learned from the high end to the mainstream. High-performance computing uses massively parallel hardware and software platforms to solve some of the world’s largest problems, from climate change to decoding the human genome. These systems range from grid computers using idle cycles on widely dispersed systems over the Internet, to clusters using message passing to communicate across various nodes located relatively close to each other (e.g., using the Intel® Message Passing Interface library (Intel® MPI) to synchronize information across cluster nodes). Both of these types of computing systems utilize distributed memory, as compared to the shared memory found in a single node within a distributed system, a workstation, a netbook, etc.
With distributed-memory systems, there is a need for high-level coordination across nodes as well as parallelism within nodes. Normally an explicit message-passing model is used to coordinate multiple nodes. As those nodes have become more powerful—with multiple processors, each with multiple cores, multiple scalar processors, and vector arithmetic units—there has been a need to mix message passing across nodes with parallel software within each node.

Task Parallelism, Data Parallelism, and Vectorization
The widely accepted industry terms for node-level parallelism can be quite confusing, and often have different meanings depending on a variety of factors. For the sake of discussing how Intel is providing solutions for software parallelism, this article will define three key types of parallelism (Figure 2).
Task parallelism is the highest level of software parallelism. Tasking is generally needed for problems with irregular control structures that operate on irregular data sets. It allows a developer to break their application into logically distinct pieces, such as the render pipeline, AI, physics, and network I/O modules in a game engine. Each of these logical elements is assigned to a task or group of tasks to be completed concurrently. Other, more general forms of task parallelism patterns include message passing, tasking, eventing, and pipelining. However, these types of parallel code generally do not scale well with additional processing elements, since the number of logical parts of an application rarely grows over time, so it is not possible to exploit Gustafson’s Law. Nevertheless, task parallelism is still a vital first step to creating scalable software by reducing the serial portions of code in a given application.
Data parallelism is complementary to task parallelism. In fact, some of the most well-known and successful solutions for data-parallel patterns are implemented using task-parallel programming models. Data parallelism uses algorithms with regular control structures to operate on concurrent containers and other regular data structures. Much of the potential scalability of today’s applications exists in such regular forms; examples include encoding audio or video. In addition, it is often easier to begin parallelizing a serial application by taking a serial control flow construct, such as a “for” loop, and turning it into a parallel for loop, before investigating the benefits that can be offered by rearranging the logical elements of the application with task parallelism. Examples of data parallelism patterns and algorithms include parallel “loops” (also known as “maps”), sorts, reductions, and scans.
Vectorization: Both task and data parallelism are ways a developer can spread application work across multiple processing elements like multiple cores or processors. Vectorization is a subset of data parallelism that allows you to take advantage of the vector

Parallelism Types | Memory Structure | Control Structure | Algorithm Types
Task Parallelism | Distributed, Shared | Irregular | Message Passing, Events, Tasks, Pipelines
Data Parallelism | Shared | Regular | Loops, Sorts, Reductions, Trees, Graphs, Lists
Vectorization | Shared (Cache) | Regular | Simple Array & Vector Operations, Elemental Functions, SIMD

Figure 2


arithmetic units within most modern microprocessors. Vectorization, or Single Instruction Multiple Data (SIMD) as it is sometimes called, allows a developer to perform a simple operation on multiple pieces of data at the same time. Adding multiple arrays of vectors, for instance, can be easily and quickly performed by a vector arithmetic unit. When some published results show performance of math kernels in scientific, financial, and other computations, they are highlighting the use of such vector arithmetic units by mathematical libraries. Applications that spend much of their runtime performing relatively simple data-parallel operations can usually benefit greatly from vectorization.

Compilers & Libraries
There are a number of ways that computer science researchers have found to make implementing software parallelism easier. There are entirely new languages, extensions for current languages, automatic compiler features, and runtime libraries comprised of low-level constructs in current languages and operating systems. Intel is a leader in researching and providing software products supporting each of these methods. Today, Intel uses two primary methods to help developers write parallel code: (1) through the use of compiler features and language extensions, and (2) through libraries (Figure 3).
When considering whether to use a language extension or a library implementation for the user’s specific application, it is important to consider the potential benefits of each solution. Language extensions are enabled by compilers that can offer many benefits like lighter-weight scheduling and, therefore, better performance for fine-grain parallelism. Compilers also offer greater levels of abstraction that can make it easier to implement parallelism. These benefits may come at the cost of control, generality, and applicability. Depending on the exact design of the language extensions, the developer may not be able to explicitly control the implementation of parallelism in their code because they are only providing hints via keywords or pragmas to the compiler.
In contrast, libraries can work with existing compilers and infrastructure, but may have to use a less standard and less compact syntax to expose their features. Library-based solutions may offer greater control, more comprehensive feature sets, and better performance for coarse-grain parallelism, although they generally introduce additional overhead compared to compiler-based solutions. However, both compilers and libraries are effective and symbiotic in accessing and expressing parallelism.

Parallelism Model Implementations (Language Extensions vs. Libraries)
Compiler: Dependent / Independent
Standards Compliance: Can be published as Standards / Generally adhere to Standards
Portability: Less / More
Ease of Use: More / Less
User Control: Less / More
Parallelism Grain Size: Fine / Coarse

Figure 3

Pragma
A pragma is a compiler directive, essentially a “hint” embedded in the source code to indicate some direction to the compiler. If the compiler understands the hint, it can perform a special compilation. If it doesn’t, it simply ignores it and does not generate a warning.
Finally, the implementations of parallelism in both compiler extensions and libraries utilize low-level constructs available in the operating system, such as OS threads and locking mechanisms. Therefore,
it should be obvious that on any particular machine, a savvy developer
can always match or exceed the per-thread performance and
system-level scalability offered by these abstractions given the time,
expertise, and inclination. On the other hand, coding directly to
low-level mechanisms instead of high-level abstractions may limit
the portability and future scaling of an application. Intel offers a
host of high-level abstractions that help developers overcome
these limitations.


Intel® Parallel Building Blocks — Selection Criteria and Value Propositions

Abstract: models must reduce the need to write parallelism using OS-dependent threads and synchronization primitives
Supported: models must work with Intel tools like Parallel Amplifier for performance tuning and Parallel Inspector for debugging
Usable: models provide easy-to-implement and easy-to-test abstractions to utilize all available hardware parallelism
Interoperable: models must be able to coexist within an application and reliably exchange data
Composable: models must be able to be nested and otherwise combined with reliable and performant behaviors
Reliable: models are not prone to common multithreading errors, and they can be mixed and matched in interesting and useful ways
Performant: models must provide sufficient per-thread performance for productivity
Scalable: models must provide additional performance scaling when adding processing elements
Future-Proof: models provide automatic forward scaling to more processing elements along with sufficient per-thread performance
Open: models should work on multiple platforms where applicable so developers can use them anywhere
Standard: models should provide published standards where applicable so others can interoperate
Portable: models can be used on multiple OS, HW, IDE, and compiler platforms with flexible licensing

Figure 4

Intel’s Family of Parallel Models
Understanding the different types of parallelism available to a developer, and the ways in which they can utilize abstractions to create parallel applications, lays a framework for evaluating parallel models. Intel’s family of parallel models supports a smorgasbord of options for developers. These include native threads, auto-vectorization, auto-parallelization, OpenMP, OpenCL, and many others. However, there is no universal solution. Each of these solutions has drawbacks or does not address all of the types of parallelism that developers need to be successful.
For example, OpenMP is ideal for Fortran and C applications that are meant to fully utilize a hardware system, complete some work, and return the answer. But because of this behavior of assuming control over the entire hardware system, OpenMP does not compose well with itself nor interoperate with other parallelism methods by default. In high-performance computing, OpenMP is a great solution for problems that are so large they can use all available resources fully to solve a problem, but it has limited applicability in common applications running on a personal laptop, such as Web browsers, media players, and email clients. That being said, Intel has been and will continue to provide industry-leading support for OpenMP in our compilers.

Intel® Parallel Building Blocks
In response to these concerns when applying parallelism to the average application, Intel has made it easier to utilize task and data parallelism through Intel TBB. Intel TBB is a C++ template library with a broad range of features to specify and execute both task and data parallelism. It uses dynamic task scheduling, scalable memory allocation, parallel algorithms and data structures, synchronization primitives, and portable threads to offer a composable and interoperable solution for node-level parallelism. This generality is afforded by additional overhead compared to compiler-based models and sufficiently complex APIs. There are some capabilities that Intel TBB does not support by design: (1) utilizing vector arithmetic units within modern processors, and (2) automatically targeting Intel’s manycore co-processors. These technical limitations motivated creation of the two newest members of Intel’s family of parallel models: Intel® Cilk Plus and Intel® Array Building Blocks (Intel® ArBB) (Figure 5).
Intel Cilk Plus comprises C and C++ language extensions implemented in the Intel® C/C++ Compiler. If you are new to parallel programming, this is the easiest way to get started. The extensions include three simple keywords to expose both task and data parallelism, as well as an easy syntax to explicitly vectorize portions of an application. It has the additional benefits of serial semantics and deterministic output.


Intel® TBB
Free open source version: www.threadingbuildingblocks.org
Paid commercial version: www.threadingbuildingblocks.com

Intel® Cilk™ Plus
General information: http://cilk.com
Intel® Cilk++ SDK for Microsoft C++ compiler users on Windows* and GCC compiler users on Linux*: http://software.intel.com/en-us/articles/intel-cilk/

Intel® ArBB
General information: http://intel.com/go/ArBB
Beta download: http://software.intel.com/en-us/articles/intel-array-building-blocks

Figure 5

Intel ArBB, formerly Intel® Ct Technology, is a C++ library like Intel TBB, which offers highly performant and scalable data parallelism and vectorization. It utilizes a JIT, or just-in-time, compiler to dynamically optimize for any given target heterogeneous hardware platform, including both manycore processors and vector arithmetic units. It can generate highly optimized machine code on the fly to take best advantage of the multiple processors, cores, and vector arithmetic units available on both CPUs and accelerators, which may comprise a distributed memory system. By using dynamic code generation, it can also overcome the modularity overhead of C++. For instance, it can support virtual functions without their runtime cost. To learn more about Intel ArBB, refer to the next article.
These three models for parallelism (Intel TBB, Intel Cilk Plus, and Intel ArBB) all share a common infrastructure. Each individual model also adheres to strict selection criteria that guarantee a number of compelling value propositions by default (Figure 4). These models are complementary, and each provides unique value in particular application contexts. In combination, they form a single comprehensive solution for task parallelism, data parallelism, and vectorization, with interfaces implemented using both language extensions and libraries. Together, they provide a unified solution to parallelism and are available as part of Intel® Parallel Studio 2011 as Intel® Parallel Building Blocks (Intel® PBB). Interested developers can get started with Intel PBB today by referring to the above links. o



INTEL® ARRAY BUILDING BLOCKS
By Michael McCool
Software Architect
Intel® Corporation

Intel® Array Building Blocks (Intel® ArBB) is a sophisticated and powerful platform for portable data-parallel software development. Intel ArBB will be available as a component of Intel® Parallel Building Blocks, along with several other tools and libraries for parallel programming.

Intel ArBB can be used to parallelize compute-intensive applications within a structured, deterministic-by-default framework. It also provides powerful runtime generic programming mechanisms, yet can be used with existing compilers. In particular, it has been verified to work with the Intel, Microsoft*, and gcc C++ compilers. Intel ArBB is currently in beta, and feedback is appreciated; it can be downloaded today from http://intel.com/go/ArBB for either Windows or Linux.
Is Intel ArBB a language or a library? Yes—both at the same time. Intel ArBB is the answer to the following question: How can parallelism mechanisms in modern processor hardware, including vector SIMD instructions, be targeted in a portable, general way within existing programming languages? The answer is an embedded language. Intel ArBB is a language extension implemented as an API. It has a library interface, but includes a capability for the dynamic generation and optimization of parallelized and vectorized machine language.
Modern processors include many mechanisms for increasing performance through parallelism: multiple cores, hyperthreading, superscalar instruction issue, pipelining, and single-instruction, multiple-data (SIMD) vector instructions. The first two—multiple cores and hyperthreading—can be accessed through threads, although for efficiency, one may want to use lightweight tasks that share hardware threads. Instruction-level parallelism, such as superscalar instruction issue and pipelining, is invoked automatically by the processor, as long as the instruction stream avoids unnecessary data dependencies. However, the last form of parallelism, SIMD vector parallelism, can only be accessed by generating special instructions that explicitly invoke multiple operations at once: SIMD instructions. SIMD instructions perform the same operation on multiple components of a vector at once, so they are sometimes also called SIMD vector instructions.

Intel® Parallel Building Blocks (Intel® PBB) is a set of comprehensive parallel development models that supports multiple approaches to parallelism.

Figure 1. Practical testing and performance expectation

SIMD vector instructions are very powerful, and they are becoming more powerful over time. In current processors that support streaming SIMD extensions (SSE), four single-precision floating-point operations can be executed with a single SSE SIMD instruction. In next-generation AVX processors, the width of the SIMD instructions will double, so eight such operations can be executed at once. In the Intel® Many Integrated Core (MIC) architecture, the width doubles again, so 16 such operations can be executed at once.


The theoretical peak floating-point performance of a processor is represented by the product of the number of cores, the width of the vector units, and the clock rate. While the clock rate is no longer scaling significantly, the number of cores and the SIMD vector width of each core continue to scale. Vectorization—that is, expressing computations using SIMD vector instructions—is essential to attain the peak performance of modern processors.
However, there are two problems. First, using SIMD vector units requires use of specific machine-language vector instructions. Second, different processors have different SIMD vector instruction extensions. The SSE, AVX, and MIC vector instructions are all different. While AVX machines can execute SSE instructions, this will not access the full performance potential of AVX processors. This latter issue is not so critical, since current compiler technology does permit the generation of multiple code paths in a single binary. For example, when using the Intel® C++ Compiler, a single-source program can be compiled for both SSE and AVX machines, and the resulting program will use AVX code when possible. However, when using static compilers, developers still need to know in advance which set of processors they wish to target, and the problem remains: how is efficient vectorized code to be generated?
The traditional approach to supporting instruction set extensions is to modify the compiler to emit the new instructions, and then to recompile programs as necessary. However, for SIMD vector instructions this is not so easy. It is very difficult for a compiler to automatically identify serial structures in a program that can be mapped to SIMD vector instructions. It can be done sometimes, but it is better for the programmer to explicitly indicate which operations in the program should use SIMD vector operations and how. This requires new constructs in the programming language that can be easily and reliably vectorized. Unfortunately, there is as yet no widely accepted machine-independent standard for specifying vectorization in C and C++.
Intel® Parallel Building Blocks (Intel® PBB) actually includes three separate strategies for accessing vector operations in a portable manner. The first strategy, which should not be overlooked, is to use a fixed-function library: Intel® Math Kernel Library (Intel® MKL) and Intel® Integrated Performance Primitives (Intel® IPP) include many mathematical operations that have already been vectorized. If the operation you need is part of these optimized libraries, that is often the best solution. If not, and you have to code the algorithm yourself, there are two other strategies available. First, you could use Intel® Cilk Plus, an extension to C and C++ that includes a notation to specify explicit vector operations on arrays. This notation is an extension to C/C++ available in the Intel C/C++ Compiler. The second general-purpose mechanism is Intel ArBB.
Intel ArBB is an embedded language, implemented as a C++ API, that in theory works with any ISO-standard C++ compiler. It uses standard C++ mechanisms for its syntax, declaring types for collections of data and overloading operators so that operations can be expressed over those collections. In other words, it looks like a typical matrix-vector math library. However, there is a difference. In an ordinary library, the C/C++ compiler generates the code statically. ArBB machine code is generated by the library itself, dynamically.
ArBB is very simple to use; we’ve provided a few examples below. To set the stage, however, we first need to discuss some basics. The ArBB C++ API defines both types and operations. Types include scalar types for floating-point numbers, integers, and Booleans, as well as types for representing collections of these types and user-defined types based on them. The ArBB scalar types are used in place of the ordinary C++ types for floats and integers, and have names like f32 (for single-precision float), i32 (for signed 32-bit integers), and so forth. Using an ArBB scalar type indicates to ArBB that the corresponding machine language for operations on this type should be generated dynamically by ArBB and not statically by C++. There are also types to manage large collections of data. The simplest of these is called dense<T,D> and represents a contiguously stored (dense) multidimensional array with element type T and dimensionality D. The dimensionality is optional and defaults to 1. The element type T can be any ArBB scalar type, or structures or classes with ArBB scalar types as elements.


BLOG highlights

Intel® Cilk™ Plus Specification and Runtime ABI Published for Free Download Now
JAMES REINDERS, DIRECTOR OF SOFTWARE DEVELOPMENT PRODUCTS

A Cilk Plus specification without an implementation would be noise. That is why we released a serious implementation first, followed shortly by a specification. Serious evaluation, production usage, and feedback are all possible as a result.

On November 2, we published the specification for the language and the runtime ABI for Intel Cilk Plus on cilk.com. This is an important step as we encourage adoption of these important capabilities in all compilers. We are in the early stages of discussions with others on how best to do this, and all agree that publishing a specification is a very important next step for the success of Cilk Plus.

We know that promoting a specification without an implementation would be a poor way to promote a language. Having an implementation that allows serious evaluation is a must. That is why we chose the order we did: implementation first, followed shortly by a specification.

We have full support for Cilk Plus in Intel's released compilers on Windows and Linux. These compilers and specifications build upon people, expertise, and technology acquired from Cilk Arts last year.

READ THE REST OF JAMES' POST: Visit Go-Parallel.com

Browse other blogs exploring a range of related subjects at Go Parallel: Translating Multicore Power into Application Performance.

There are two basic ways to specify parallel computations in ArBB: as sequences of operations over entire collections (vector mode), or as functions replicated over every element of a collection (elemental mode). Vector mode is the simplest: arithmetic operations on collections apply in parallel to the corresponding members of the collections. This works even if the element type is user-defined and the user has overloaded the operator themselves. For example, suppose we have four dense<f32> collections called A, B, C, and D, all of the same size. Then the following expression will operate in parallel on all the elements of these collections:

A += (B/C) * D;

Note that, in general, when a collection appears on both the left and right side of an expression, ArBB generates a result "as if" all the inputs were read before any outputs are written. In practice, we have to put this expression inside a function and invoke it with a call operation. However, any sequence of parallel vector operations can be inside such a function:

void
doit(dense<f32>& A, dense<f32> B,
     dense<f32> C, dense<f32> D)
{
    A += (B/C) * D;
}
...
call(doit)(A,B,C,D);

The way call actually works is that it invokes the C++ function doit precisely once and observes (rather than actually performs) the sequence of ArBB type constructions, operations, and destructions generated by this function. It records this sequence, compiles it into optimized machine language, executes it (in parallel), and then caches it. The next time the same function is called, call does not invoke the C++ function again; it will just retrieve the internally generated machine code from its cache.

For simple uses of Intel ArBB this is exactly what you want. In more advanced use cases, however, you may want to generate different versions of the operation from the same C++ function. For example, you can parameterize the sequence of Intel ArBB operations by ordinary C++ variables and control flow, and you can use this to generate variants of a computation. Managing this powerful mechanism for generic programming is enabled by another Intel ArBB type called a closure. A closure is an object that represents a captured Intel ArBB function; it is conceptually similar to a lambda function, but is dynamically generated. The return type of call is actually an appropriately typed closure. Another function, capture, is also available. It is similar to call in that it creates a closure, but it does not cache it, so it can be called repeatedly on the same C++ function to generate variants. Again, for simple uses of Intel ArBB explicit use of closures is not necessary, and you can just think of call as a straightforward function invocation.
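The capture-compile-cache behavior of call can be pictured as memoizing a code-generation step. The sketch below is a plain C++ analogy under that assumption; the cache, the call_like name, and the twice function are invented for illustration, and this is not how ArBB is implemented internally:

```cpp
#include <cassert>
#include <functional>
#include <map>

// Analogy for call(): the expensive capture-and-compile step runs only
// the first time a given C++ function is seen; subsequent invocations
// reuse the cached result instead of re-running the function.
static std::map<int (*)(int), std::function<int(int)>> g_cache;
static int g_compile_count = 0;  // how many times we "compiled"

std::function<int(int)> call_like(int (*f)(int)) {
    auto it = g_cache.find(f);
    if (it != g_cache.end()) return it->second;  // cache hit: no re-capture
    ++g_compile_count;   // stand-in for tracing + JIT compilation
    g_cache[f] = f;      // the cached "compiled artifact"
    return g_cache[f];
}

int twice(int x) { return 2 * x; }  // example function to "capture"
```

Repeated calls to call_like(&twice) keep returning the cached artifact, so the "compile" counter stays at one, mirroring how call skips the C++ function after the first invocation.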

You can also write "elemental" functions over scalar Intel ArBB types:

void
kernel(f32& a, f32 b, f32 c, f32 d)
{
    a += (b/c)*d;
}

You can invoke elemental functions from inside a call by using the map operation. A map operation replicates the function over every element of the input containers.

void
doit(dense<f32>& A, dense<f32> B,
     dense<f32> C, dense<f32> D)
{
    map(kernel)(A, B, C, D);
}
call(doit)(A,B,C,D);

It is also possible, from inside an elemental function, to access neighboring elements of the input. This makes it very easy to write stencil operations, such as convolutions. You can also pass in either an entire container or a single element to every argument of the map. Single-element arguments are replicated to match the size and shape of any containers used as arguments. For example, suppose we use the following:

void
doit(dense<f32>& A, f32 b, f32 c,
     dense<f32> D)
{
    map(kernel)(A, b, c, D);
}
call(doit)(A,b,c,D);

with the same kernel function, but with the types of b and c matching the corresponding function argument exactly; in this case, f32. There will still be as many parallel instances of the kernel as there are elements in the collections A and D, but every instance will get a copy of the same value of b and c. In summary, call arguments need to match exactly, but map functions are polymorphic, and any argument can be either a single element or a collection.

In addition to using these two basic patterns to express parallel operations, users of ArBB also have access to several collective operations that act on or take an entire container as an input. These operations can shift the contents of containers around, take cumulative sums (prefix scans), perform sets of reads and writes (known as scatters and gathers), discard elements and pack the remainder into a contiguous sequence (known as pack; the inverse is unpack), or simply combine all elements into a single element. Combining all the elements of a container into a single element is called a reduction. For example, the following computes the dot product (sum of pairwise products) of two containers A and B:

f32 A_dot_B = add_reduce(A * B);

Elemental functions can also use control flow. During the process described above, ordinary C++ control flow is actually executed only when the function is "captured." This is incredibly useful for generic programming in order to specify variants, and to reduce the overhead of modularity and configuration. However, in order for control flow to be visible to ArBB and be compiled into the vector machine code generated by it, special macros need to be used to express "embeddable" control flow.
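For readers who want the sequential semantics of these collective operations spelled out, here is a plain standard C++ sketch (not ArBB code) of a reduction, an inclusive prefix scan, and the dot product shown above; the _ref names are invented for this sketch:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// add_reduce analogue: combine all elements into a single value.
float add_reduce_ref(const std::vector<float>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0f);
}

// Inclusive prefix scan analogue: out[i] = v[0] + ... + v[i].
std::vector<float> scan_ref(const std::vector<float>& v) {
    std::vector<float> out(v.size());
    std::partial_sum(v.begin(), v.end(), out.begin());
    return out;
}

// The article's dot product: add_reduce(A * B), written out serially.
float dot_ref(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}
```

ArBB's versions compute the same results, but in parallel and with vector instructions.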


These embeddable control-flow macros are best shown by an example. The following program computes the Mandelbrot set, the famous fractal found by counting the number of iterations required for a complex quadratic to diverge from a given starting point. Plotting this number of iterations over a region of the complex plane results in the image shown in Figure 1. An elemental function to compute a single pixel of this image is given by the following ArBB code:

int max_count = MAX_COUNT;
void mandel(i32& d, std::complex<f32> c) {
    i32 i;
    std::complex<f32> z = 0.0f;
    _for (i = 0, i < max_count, i++) {
        _if (abs(z) >= 2.0f) {
            _break;
        } _end_if;
        z = z*z + c;
    } _end_for;
    d = i;
}

There are a few interesting things to note about this example. First, complex numbers can be expressed simply by using the std::complex type with an ArBB element type. This also works for user-defined types and, as mentioned above, for operator overloading on user types; such operator overloads even work for vector operations applied to collections of user types. Second, this function refers to a non-local C++ variable, max_count. In theory, we might want to capture this function with different values of this variable. This can be done with closures. If we use call, we will capture and "freeze" the value of this variable the first time we invoke this function. If we use capture, we can change the value, capture different versions, and use closure objects to manage them. Finally, note that ArBB control flow has a few differences from C++ control flow: it uses a leading underscore, but also has closing keywords (such as _end_for), and the arguments to _for are separated by commas, not semicolons.

To actually get useful work done, we have to get data in and out of ArBB collections. This can be done in a variety of ways. ArBB supports an efficient STL-friendly interface based on iterators for sophisticated applications. However, the simplest way to get data in and out of ArBB is to associate an ArBB collection with a C++ array using bind, as follows. Note that we also have to use a helper call function "doit" to invoke the elemental function inside a map.

void
doit(dense<i32,2>& D,
     dense<std::complex<f32>,2> P)
{
    map(mandel)(D,P);
}
dense<std::complex<f32>,2> pos;
bind(pos, c_pos, cols, rows);
dense<i32,2> dest;
bind(dest, c_dest, cols, rows);
call(doit)(dest, pos);

This article has presented a brief introduction to Intel Array Building Blocks. This system provides a portable mechanism for sophisticated and efficient data-parallel computation. In order to target vector instructions while being processor and compiler independent, Intel ArBB includes a capability for dynamic code generation. This capability allows Intel ArBB to avoid the overhead of C++ in many cases, since the ArBB code generation is separate from that of the "host language," C++. Intel ArBB is a sophisticated and powerful system that provides access to a simple means of expressing efficient data-parallel computations, and also supports unique and powerful mechanisms for generic, modular programming.

If you are interested in learning more about Intel ArBB or experimenting with it (again, it is currently in Beta, and feedback is appreciated), go to http://intel.com/go/ArBB.


Automatic Parallelism with the Intel® Math Kernel Library (Intel® MKL)

By Greg Henry, Intel® MKL Architect, and Shane Story, Engineering Manager, Intel Corporation

The Intel® Math Kernel Library (Intel® MKL) provides software developers optimized and automatically parallelized mathematical library routines. Our routines are thread-safe and are applicable to many engineering, science, financial, and other applications. In this article, we provide an overview of the techniques Intel MKL uses to achieve the highest level of parallelism, as well as the hooks and knobs useful for getting the most from these threaded hotspots.

Intel MKL has a number of domains useful to developers who create applications for desktops, servers, and clusters. This includes industry standards like the Basic Linear Algebra Subroutines (BLAS) and the latest version of the Linear Algebra PACKage (LAPACK), as well as Fast Fourier Transforms (FFTs), Vector Math and Statistics Libraries (VML, VSL), a direct sparse solver (PARDISO), and sparse BLAS. To help lower the barrier to programming distributed memory architectures (clusters), Intel MKL includes ScaLAPACK, Parallel BLAS, and Cluster FFTs. Intel MKL is available on the latest versions of Linux, Windows, and Mac OS X. We have tuned code for Intel and AMD* hardware, including both IA-32 and Intel® 64 architectures.

The primary advantage of Intel MKL is that it makes the highest performance levels easily accessible to software developers. Within the software, we do automated dispatching to extract the value of the underlying hardware features. This means users calling an industry-standard subroutine like DGEMM from the BLAS get performance improvements on different systems without having to re-link their applications. Intel MKL simply detects the hardware and dispatches code optimized for that processor, requiring no effort on behalf of the user. For example, the same application, when linked with Intel MKL, should run optimally on Intel® Core™2 Duo and Intel® Core™ i7 processors because kernels optimized for both processors are already built in and dispatched during runtime.

Multicore machines are the latest trend in computing, offering high degrees of parallelism. While the potential for even higher performance is a natural side effect of an increased number of cores, the challenge of extracting that performance (and parallelism) falls squarely on the shoulders of the software developer. A library such as Intel MKL, where we have threaded most of the commonly used routines, is a simple and effective means of obtaining that parallelism.

Threading within Intel MKL is based on the industry-standard OpenMP* specification. We thread in several of our domains: the direct sparse solver, LAPACK, BLAS, Sparse BLAS, VML, FFTs, and Cluster FFTs. For industry-standard components like LAPACK and the BLAS, our tuning goes beyond what one can find in the public domain. For example, in LAPACK we have added threading to some of the computational linear equation routines, orthogonal factorizations, singular value decompositions, and eigenproblems. In all cases, Intel MKL is thread-safe, so simultaneous execution of routines from multiple threads works correctly.
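As a reminder of what a routine like DGEMM actually computes, here is a naive, untuned reference sketch in plain C++. MKL's DGEMM computes the same result, but vectorized, threaded, and dispatched per processor; the dgemm_ref name and row-major layout are assumptions of this sketch:

```cpp
#include <cassert>
#include <vector>

// Reference semantics of the BLAS routine DGEMM:
//   C = alpha * A * B + beta * C
// with A (m x k), B (k x n), and C (m x n), all stored row-major in
// flat vectors. A real BLAS call adds transpose options and strides.
void dgemm_ref(int m, int n, int k, double alpha,
               const std::vector<double>& A,
               const std::vector<double>& B,
               double beta, std::vector<double>& C) {
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int p = 0; p < k; ++p)
                acc += A[i * k + p] * B[p * n + j];  // row of A dot column of B
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

The triple loop is the whole contract; everything MKL adds on top (blocking, SIMD, threading, dispatch) changes only the speed, never the result.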


Users can set Intel MKL-specific variables such as MKL_NUM_THREADS to specify the number of OpenMP* threads so as not to interfere with a user's other OpenMP* routines and environment variables. Intel MKL checks the Intel MKL-specific variables first; however, we are constrained by the underlying OpenMP environment. If neither the Intel MKL variables nor the OpenMP routines/variables are set, the underlying OpenMP default will take precedence. If a developer builds an application for their own customers and wants control over the OpenMP environment (as opposed to allowing their users to experiment with environment variables), they can call the Intel MKL threading service functions. Our threading service functions take precedence over our environment variables. Intel MKL works with the Intel® Compiler's OpenMP* libraries, in addition to those of Microsoft and GNU.

Developers can control the number of threads not only on an Intel MKL-wide level, but also on a domain-specific level with the MKL_DOMAIN_NUM_THREADS environment variable or its corresponding service function, mkl_domain_set_num_threads(). For instance, if one wants all of Intel MKL to use two threads but the BLAS to use four, a user can set the variable to "MKL_ALL=2, MKL_BLAS=4." All environment variables are read only once in the course of a run. To change the behavior in the middle of a run requires calls to the service functions.

For developers building applications using a threading model other than OpenMP, such as the Intel® Cilk™ Plus Runtime Library, Intel® Threading Building Blocks, or pthreads on Linux, we suggest threading with the user's method of choice at the highest level, and either linking in the sequential Intel MKL or setting MKL_NUM_THREADS to one in the threaded version.

This usage model works well because threading is most effective when applied at the highest possible level, as it is in the current Intel MKL. For example, the original design and current public implementation of LAPACK depends on parallelism within the underlying BLAS routines. We found we could obtain better performance when we did threading at the LAPACK level and called sequential BLAS, rather than relying on threading only in the BLAS. The critical observation is that the advantage of threading at a higher level increases with the number of threads. If we take the LAPACK routine DGETRF (double-precision general matrix factorization via Gaussian elimination with partial row pivoting), fix a large matrix size on a manycore machine, and test the gap between threading just at the BLAS level versus at the LAPACK routine level, the performance advantage increases as the number of threads increases.

Additionally, there are times when it is useful to take advantage of multiple levels and styles of parallelism, such as in a distributed memory cluster running MPI (Message Passing Interface) between the nodes. The Intel MKL benchmark MP LINPACK (which solves a cluster problem similar to DGETRF) uses hybrid MPI-OpenMP* parallelism for even greater performance. This is analogous to our previous statement regarding threading at the highest level. While running one MPI process per core is the most basic mechanism for parallelism on a cluster, running fewer MPI processes and putting OpenMP* calls into the code raises the threading level higher in the application and should yield performance gains.

Intel MKL depends on the underlying OpenMP* software to determine the number of threads. When the presence of MPI is detected and MPI has not been initialized for multithreading, Intel MKL will default to one thread. Likewise, when called from inside an OpenMP parallel region, the default will be one thread. If OpenMP gives us more threads than the number of physical cores (which might happen when Hyper-Threading is enabled), we will scale down the number of threads to match the number of physical cores. But there are times when our default choice may not be optimal, because of other aspects of the application we cannot detect. A user can set MKL_DYNAMIC to FALSE (its default is TRUE) or call mkl_set_dynamic() to try to override the number of threads we think will run optimally. Note that FFTs require both a setup and an execute stage, and the number of threads should be the same for both.

By taking advantage of the automatic parallelism Intel MKL provides, applications can get higher performance on modern multicore architectures.

Web Bibliography:
[BLAS] http://www.netlib.org/blas/index.html
[LAPACK] http://www.netlib.org/lapack/index.html
[MKL] http://software.intel.com/en-us/intel-mkl/
[MPI] http://www.mcs.anl.gov/research/projects/mpi/
[Intel MPI] http://www.intel.com/go/mpi
[OPENMP] http://www.openmp.org
[SCALAPACK] http://www.netlib.org/scalapack/index.html



When Print Statements and Timer Are Not Enough: Making the Parallelism Investment More Effective

By Don Gunning, Nick Meng, and Paul Besl

As Intel® Architecture evolves, cluster software users are going to have to make their next investments in systems that employ greater amounts of parallelism. To support those investments, software developers will have to make changes to their software that can significantly impact performance. The changes usually involve many years of work and can lead to the implementation of mixed mode parallelism. In a small number of cases, the changes can be relatively minor. In either case, our experience indicates that the changes are not obvious.

We suggest that the high-level features of the Intel® Cluster Toolkit Compiler Edition meet the above challenge, particularly the new features in Intel® MPI 4.0 and the benefits of the Intel® Trace Analyzer and Collector.

In this article, we will see how real-world developers are applying Intel Trace Analyzer and Collector to find issues that would be undetectable with print statements and timer, correct those issues, and then deliver scaling performance with Intel® MPI 4.0 that is 30 to 50 percent higher than previously achieved (e.g., LSTC DYNA is scaling past 1,500 cores; FLUENT is eclipsing 3,000 cores). Finally, we will suggest that a knowledgeable developer can deliver effective results in months that would take years to deliver with traditional tools.

Let's start by looking at how the Intel Cluster Toolkit Compiler Edition 4.0 can be applied to real problems that happen to developers when they use the same software on new systems or apply the software to larger data sets. We will show how Intel Trace Analyzer and Collector can be used to diagnose issues when previously estimated speedups are not achieved, and to ensure that the quality of results is maintained when mixed mode parallelism is implemented to enable handling of larger data sets.

I just spent the money and the application runs slower

Frequently, software users will come to a software ISV and ask what the effect of a new Intel® multicore architecture will be on their workloads. An ISV's user asked this question and got a very favorable estimate (yellow dashed line in Figure 1). After having paid the money and installed the new Intel® 64 system, they saw the results (red line in Figure 1).


[Figure 1: line chart of elapsed time (hours, 0 to 5.5) versus problem size (elements, 0 to 20k), comparing real timing on the existing system, the estimation for the new system, and real timing on the new system.]
Figure 1. Practical testing and performance expectation

source /opt/intel/itac/8.0.1.001/bin/itacvars.sh
export LD_PRELOAD=/opt/intel/itac/8.0.1.001/slib/libVT.so
# 8p run
runexec small_model.pre -np 8 --mpi-options -trace --machines-file $PBS_NODEFILE
# 256p run
runexec large_model.pre -np 256 --mpi-options -trace --machines-file $PBS_NODEFILE

Needless to say, there were some issues, and diagnosis was necessary. With Intel Trace Analyzer and Collector, the diagnosis was straightforward. Basically, you add "-trace" to your MPI command line or "--mpi-options -trace" in your run script file; the run commands with the ITAC tool are shown above. We tested two test cases with the Intel Trace Analyzer and Collector tool: a small test case with eight cores and a large test case with 256 cores. After we got the STF trace files, we collected statistical data with the analyzer tool. Figures 2 and 3 show the statistical data collected from the small test case in graph mode. In Figure 2, we can see the communication from P0 to P1-P7 is hot, and P0 to P1 is the hottest (see the outlined area).


[Figure 2 annotation: MPI_BCAST]

Figure 2. Total MPI collective function time in each MPI process generated by Intel® Trace
Analyzer and Collector Tool. (Red indicates hot [busiest]; blue, cool [not so busy].)

[Figure 3 annotation: MPI_BCAST]

Figure 3. Total time of MPI collective function in each MPI process generated by Intel® Trace
Analyzer and Collector. (Red indicates hot; blue, cool.)


Figure 4. Load balance of large test case in pie mode on 256 cores generated by Intel® Trace
Analyzer and Collector. (Red indicates MPI code percent execution time; blue indicates applica-
tion percent execution time.)

Figure 5. Profile data of large test case and total time of MPI collective function in each MPI
process generated by Intel® Trace Analyzer and Collector.

In Figure 3, you can see MPI_Bcast is the hottest function in the test case (see the outlined area). We also did a deep performance investigation with a large test case on 256 cores. Figures 4, 5, 6, and 7 show the same phenomena in the large test case.

Figure 4 displays the percentage of functions in pie mode: the blue area indicates the computing cost of application functions; the red area indicates the cost of MPI communications. You can see that the master MPI process pie is almost blue, and all slave MPI process pies are almost red. This means that there is a very serious load imbalance issue here. After we look at Figures 6 and 7, we find the serious load imbalance is from the MPI_BCAST function.


Figure 6.The activities of functions from 0 to 60 seconds generated by Intel® Trace Analyzer
and Collector.

Figure 7. The activities of functions from 21.136 to 21.176 seconds generated by Intel® Trace
Analyzer and Collector. (Red indicates MPI code; blue indicates application code. The horizontal
axis represents time. The vertical axis represents separate MPI processes/ranks.)


Figures 6 and 7 show the activities of MPI functions and application functions. The red area indicates MPI function activity; the blue area indicates application function activity. Obviously, the large test case spent a lot of time in MPI functions, which is extremely abnormal.

In Figure 4, we noticed the master MPI process (the pie in the upper left corner) is blue and all slave MPI processes are almost red. Meanwhile, Figures 5, 6, and 7 indicate that the load imbalance issue is clearly from the MPI_BCAST function. This means that the default (algorithm) setting of the MPI_BCAST function is not appropriate for this case. We need to select the right algorithm. We did some tests with small cases and discovered the best setting for MPI_BCAST in this case: I_MPI_ADJUST_BCAST=4.

In this situation, the root cause of an unexpected problem was found quickly, and the solution was easily implemented, in large part due to the Intel Trace Analyzer and Collector's graphical data mining capability and the ISV developers' knowledge of the software. Finally, the user got reasonable performance on the new Intel 64 platform.

Now we will look at a more complex challenge.

Solving challenge tasks with mixed mode parallelism

Livermore Software Technology Corporation (LSTC) is continuously being challenged by users to deliver results faster on ever-increasing data set sizes. Further, in many cases the results must be consistent. LSTC offers shared and distributed memory versions of their software, with the distributed memory version offering better scalability than the shared memory version. The shared memory version uses OpenMP. The distributed memory version uses MPI.

We will now illustrate how the Intel Trace Analyzer and Collector was applied to LSTC DYNA to enable handling of significantly larger data sets on Intel® multicore architecture.

The challenge

LSTC supplies LS-DYNA, a general-purpose transient dynamic finite element program capable of simulating complex real-world problems. In very simple terms, LSTC sells crash simulation software that is used in manufacturing: automobile design, aerospace, consumer products, and bioengineering. To improve solution accuracy, the problem size is continually increasing, with a non-trivial increase in computer time (e.g., solving a 10 million element problem can take 43 hours on a given cluster).

The challenge LSTC was faced with is the need for numerical consistency combined with limited network bandwidth, fixed node memory, and limited memory bandwidth, which prevented LSTC users from scaling beyond a certain node count as problem size increased. This can significantly increase runtimes.

BLOG highlights

Condition Variable Support in Intel® Threading Building Blocks
WOOYOUNG KIM

One feature present in the proposed C++ standard specification (i.e., N3092) threading support library, which we began supporting in Intel® Threading Building Blocks (Intel® TBB) 3.0, is the condition variable. As the C++1x proposal approaches final approval, we expect using threads in conjunction with condition variables to become more popular. For example, Microsoft has supported condition variables natively since Windows* Vista. Until Intel® TBB 3.0, Intel TBB used to provide only half of it (cf. std::thread; it used to be called tbb::tbb_thread). The following code example shows how to use the Intel TBB condition variable.

#include "tbb/compat/condition_variable"
using namespace std;

condition_variable my_condition;
tbb::mutex my_mtx;
bool present = false;

void producer() {
    unique_lock<tbb::mutex> ul( my_mtx );
    present = true;
    my_condition.notify_one();
}

void consumer() {
    while( !present ) {
        unique_lock<tbb::mutex> ul( my_mtx );
        my_condition.wait( ul );
    }
}

READ THE REST OF WOOYOUNG'S POST: Visit Go-Parallel.com

Browse other blogs exploring a range of related subjects at Go Parallel: Translating Multicore Power into Application Performance.
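The broadcast-algorithm tuning described above (selecting a different algorithm via I_MPI_ADJUST_BCAST) matters because broadcast algorithms load the root process very differently. The following is an illustrative cost model in plain C++, an assumption-laden sketch rather than Intel MPI internals:

```cpp
#include <cassert>

// Illustrative cost model (not Intel MPI internals): a flat broadcast
// needs P-1 sequential sends at the root, while a binomial-tree
// broadcast finishes in ceil(log2(P)) rounds, because every process
// that already holds the data forwards it in each round.
int flat_bcast_steps(int p) { return p - 1; }

int tree_bcast_steps(int p) {
    int steps = 0;
    for (int have = 1; have < p; have *= 2)  // processes holding the data
        ++steps;
    return steps;
}
```

At 256 processes, the gap is 255 root-serialized sends versus 8 rounds, which is why a poor algorithm choice shows up in the trace as all slaves waiting in MPI_BCAST.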


Figure 8. Profile data of a customer model on four nodes generated by Intel® Trace Analyzer and Collector.

Figure 9. Profile data of a customer model on 32 nodes generated by Intel® Trace Analyzer and Collector.


                                              Intel® Xeon® 7560           Intel® Xeon® 5560 cluster
Cluster configuration                         1 node with 32 cores        8 nodes, 8 cores per node (64 cores total)
MPP (MPI) version elapsed time                44,013 s                    18,521 s
Hybrid MPI and OpenMP version elapsed time    7,047 s                     5,541 s
Speedup                                       6.25                        3.34
Figure 10. LSTC standard implicit model benchmark CYL1E6.

The search for a solution

With a large, complex problem, LSTC, in collaboration with Intel, applied Intel® Cluster Tools to introduce hybrid scaling. The following illustrates how Intel Trace Analyzer and Collector was used to discover issues that print statements and timer would never show. The main point is that these methods were applied on over a hundred routines and enabled solution discovery in months rather than years.

MPP DYNA performance

As the number of MPI processes increases, so does the number of sub-domains and communication costs. This can cause load imbalances and high communication overhead. Ultimately, it resulted in poor parallel efficiency.

Intel Trace Analyzer and Collector was used to quickly and efficiently pinpoint this aspect of the problem. Figures 8 and 9 show that MPI collective functions are performing well on a four-node configuration, but becoming a performance inhibitor on a 32-node configuration.

As these screens demonstrate, what's needed is to combine OpenMP within one node (which engages all cores there) together with MPI across nodes. We can reduce the number of MPI processes as much as possible. Then, we can reduce the overhead of MPI functions. This procedure was performed on over 100 routines to assess the change.

The solution

LSTC was able to combine the shared memory version of LS-DYNA with MPP DYNA to obtain HYBRID LS-DYNA. This combined parallelism version did the following:

Maintained the numerically consistent results feature required by the user
>> Once OpenMP* code and MPI code are combined into one code, we can take advantage of consistent features in the OpenMP code under certain conditions. As a result, customers get higher quality and numerically consistent results during the design cycle.

Increased the scalability of LS-DYNA and the effectiveness of Intel multicore node architecture
>> As a result of reducing the overhead cost of MPI functions, customers can efficiently run their production model with more cores on Intel multicore node architecture. Especially for implicit solver users, they can use all cores and get maximum performance with fixed memory space and restricted I/O performance.

Figure 10 illustrates some of the performance increases that the Intel Cluster Tools helped to achieve.
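The accounting behind the hybrid approach can be sketched in a few lines of plain C++. This is illustrative only: the struct and function names are invented, and the timings used below come from the Figure 10 benchmark:

```cpp
#include <cassert>

// Illustrative accounting (not LS-DYNA code): pure MPP runs one MPI
// rank per core; the hybrid version runs one rank per node and fills
// the node's cores with OpenMP threads, shrinking the number of
// communicating MPI processes by a factor of cores_per_node.
struct Layout { int mpi_ranks; int omp_threads_per_rank; };

Layout pure_mpp(int nodes, int cores_per_node) {
    return Layout{nodes * cores_per_node, 1};
}

Layout hybrid(int nodes, int cores_per_node) {
    return Layout{nodes, cores_per_node};
}

// Elapsed-time speedup of one configuration over another.
double speedup(double baseline_seconds, double new_seconds) {
    return baseline_seconds / new_seconds;
}
```

On the 8-node, 64-core cluster from Figure 10, this drops the rank count from 64 to 8 while keeping all cores busy, and the measured elapsed times (44,013 s versus 7,047 s on the Xeon 7560 system) give the 6.25 speedup shown in the table.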


Figure 11. Profile data of the 1M model on 128 nodes generated by Intel® Trace Analyzer and Collector.

[Figure 12 legend: application functions, MPI_BCAST, other MPI functions, MPI_RECV]

Figure 12. Load balance of the 1M model in pie mode on 128 nodes generated by Intel® Trace
Analyzer and Collector.

40 For more information regarding performance and optimization choices in Intel software products, visit http://software.intel.com/en-us/articles/optimization-notice.

Figure 13. Load balance of the 1M model in pie mode on 128 nodes generated by Intel® Trace Analyzer and Collector, highlighting the severe load imbalance.

Solving problems with ITAC during development


During code development, we selected a one-million (1M) model as a verification test case. The model was run with HYBRID LS-DYNA using the Intel® MPI Library 4.0. During the verification test, we encountered a serious performance issue, which we quickly reproduced with Intel Trace Analyzer and Collector 8.0.1. Figures 11, 12, and 13 show the root cause.
In each pie, the blue area indicates the percentage of computing cost spent in application functions, the green area the percentage spent in the MPI_RECV function, and the yellow area the percentage spent in the MPI_BCAST function. With the Intel Trace Analyzer and Collector, we quickly and easily found the root cause: a serious load imbalance originating in the MPI_RECV function. The issue was fixed quickly, and the target performance was achieved.

Conclusion
We believe, and have shown, that the Intel cluster tools can significantly reduce the time and effort needed to achieve performance increases, cutting the tuning cycle from years to months. Further, the tools can help diagnose the problem when performance is not what was expected.
Finally, we have worked to reduce the learning curve, so that experienced parallelism enablers and knowledgeable application developers can quickly gain the additional insights that the cluster tools provide.



Optimization Notice

Intel® compilers, associated libraries and associated development tools may include or utilize options that optimize
for instruction sets that are available in both Intel® and non-Intel microprocessors (for example SIMD instruction
sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel
compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors.
For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they
implicate, please refer to the “Intel® Compiler User and Reference Guides” under “Compiler Options.” Many library
routines that are part of Intel® compiler products are more highly optimized for Intel microprocessors than for other
microprocessors. While the compilers and libraries in Intel® compiler products offer optimizations for both Intel and
Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will
get extra performance on Intel microprocessors.

Intel® compilers, associated libraries and associated development tools may or may not optimize to the same degree
for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations
include Intel® Streaming SIMD Extensions 2 (Intel® SSE2), Intel® Streaming SIMD Extensions 3 (Intel® SSE3), and
Supplemental Streaming SIMD Extensions 3 (Intel® SSSE3) instruction sets and other optimizations. Intel does not
guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on
Intel® and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine
which best meet your requirements. We hope to win your business by striving to offer the best performance of any
compiler or library; please let us know if you find we do not.
Notice revision #20101101

Subscribe today: The Parallel Universe is a free quarterly magazine.


Sign up for future issue alerts and share the magazine with friends at
http://bit.ly/ParallelUniverseMag.

New from the makers of
Intel® VTune™ Performance Analyzer
and Intel® Visual Fortran Compiler
INTEL® PARALLEL STUDIO XE

Achieve enhanced
developer productivity
Intel® Parallel Studio XE 2011
combines ease-of-use innovations
with advanced functionality for high
performance, scalability, and code
robustness on both Linux
and Windows.

The ultimate all-in-one performance toolkit


The integrated suite helps high-performance computing and enterprise developers boost performance, reliability, and security. Squeeze the
most out of applications that never have enough performance, including simulation, video rendering, seismic analysis, and medical imaging.

Advanced compilers and libraries: Intel® Composer XE
Advanced memory, threading, and security analyzer: Intel® Inspector XE
Advanced performance profiler: Intel® VTune™ Amplifier XE

Rock your code. Rock your world.


Get a free 30-day trial of Intel Parallel Studio XE today at http://software.intel.com/en-us/articles/intel-parallel-studio-xe/.

© 2010, Intel Corporation. All rights reserved. Intel, the Intel logo, and VTune are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.
TAKE PERFORMANCE
TO THE EXTREME.
INTRODUCING INTEL® PARALLEL STUDIO XE
From one-person start-ups to enterprises with thousands of developers working on a single application,
Intel® Parallel Studio XE 2011 extends industry-leading development tools for unprecedented application
performance and reliability.

Advanced compilers and libraries: Intel® Composer XE
Advanced memory, threading, and security analyzer: Intel® Inspector XE
Advanced performance profiler: Intel® VTune™ Amplifier XE

Rock your code. Rock your world.


Get a free 30-day trial of Intel Parallel Studio XE today at http://software.intel.com/en-us/articles/intel-parallel-studio-xe/.

