
PHYSICS DATA ANALYSIS AND DATA REDUCTION AT SCALE WITH APACHE SPARK

E. Motesnitsalis1, V. Khristenko1, M. Migliorini1, R. Castellotti1, L. Canali1, M. Girone1,
D. Olivito1, M. Cremonesi2, J. Pivarski3, O. Gutsche2
1) CERN, Geneva, Switzerland; 2) Fermilab, Batavia, USA; 3) Princeton University

GOALS AND MOTIVATIONS

• The main goal of the project is to perform Data Analysis and Data Reduction at scale using Big Data Technologies over physics data acquired by CMS and made public on the CERN Open Data portal
• We are interested in investigating new ways to analyse physics data and in enabling further development with Streaming and Machine Learning workloads
• We want to adopt new technologies widely used in industry, with modern APIs, development environments, and platforms (notebooks etc.)
• This opens data processing for High Energy Physics (HEP) to a larger community of data scientists and data engineers, bringing together domain experts from industry and academia

CURRENT PROCEDURES AND MILESTONES

• Until today, the vast majority of high energy physics analysis has been done with the ROOT Framework, processing physics data stored in ROOT format files
• To use big data tools, we solved two key data engineering challenges:
  1. Read files in ROOT format using Spark
  2. Access files stored in EOS directly from Hadoop/Spark
• This enabled us to produce and optimize physics analysis workloads with input up to 1 PB. The Spark infrastructure is now used by several physics analysis groups
SPARK-ROOT AND HADOOP-XROOTD CONNECTOR

• Spark-root is an open source Scala library which can read ROOT TTrees, infer their schema, and import them into Spark DataFrames
• Hadoop-XRootD Connector is an open source Java library that connects to the XRootD client via the Java Native Interface (JNI)
• A parameterized "readAhead" buffer is used to improve performance when reading files from the EOS Service

Hadoop – XRootD Connector Architecture
[Figure: the Hadoop-XRootD Connector (Java) plugs into the Hadoop HDFS API and calls the XRootD Client (C++) through the Java Native Interface (JNI) to reach the EOS Storage Service.]
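To make the two building blocks concrete, the following is a minimal Scala sketch of how a ROOT file on EOS could be read into a Spark DataFrame through spark-root and the Hadoop-XRootD Connector. The EOS path, the TTree name "Events", and the exact configuration property and class names are illustrative assumptions, not taken from the poster; the libraries' documentation gives the authoritative options.

// Minimal sketch, under assumed property names and paths.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dimuon-data-reduction")
  // Assumption: the Hadoop-XRootD Connector registers a Hadoop FileSystem for the
  // "root://" scheme; the property key and implementation class here are placeholders.
  .config("spark.hadoop.fs.root.impl", "ch.cern.eos.XRootDFileSystem")
  .getOrCreate()

// spark-root exposes a Spark DataSource; the format string and the "tree" option
// follow the open source library's usage, and "Events" is a hypothetical TTree name.
val muons = spark.read
  .format("org.dianahep.sparkroot")
  .option("tree", "Events")
  .load("root://eospublic.cern.ch//eos/opendata/cms/<dataset>/file.root")

muons.printSchema()   // schema inferred from the ROOT TTree branches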

RESULTS FROM SCALABILITY TESTS

Test Workload
• The data processing job of this project performs event selection (i.e. Data Reduction) and then computes the dimuon invariant mass (see the sketch after this list)
• On a single thread/core with a single file as input, the workload reads one branch and completes the calculation in approximately 10 minutes for a 4 GB file
• The results of running at scale are displayed in Graphs 1-5 and Table 1
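The selection and the invariant mass computation can be expressed as ordinary DataFrame operations. The sketch below reuses the muons DataFrame from the earlier snippet; the column names (mu1_*, mu2_*, charges), the cuts, and the Parquet output are assumptions for illustration, since the poster does not show the actual branch names or cuts of the workload.

// Illustrative dimuon selection and invariant mass calculation (assumed columns).
import org.apache.spark.sql.functions._

val dimuons = muons
  .filter(col("mu1_charge") * col("mu2_charge") < 0)            // opposite-charge muon pair
  .withColumn("mass",                                           // m^2 = (E1+E2)^2 - |p1+p2|^2
    sqrt(
      pow(col("mu1_E")  + col("mu2_E"),  2) -
      pow(col("mu1_px") + col("mu2_px"), 2) -
      pow(col("mu1_py") + col("mu2_py"), 2) -
      pow(col("mu1_pz") + col("mu2_pz"), 2)
    ))
  .select("mass")

// Data reduction step: persist only the reduced column; the output format and
// path are assumptions, not stated on the poster.
dimuons.write.mode("overwrite").parquet("hdfs:///user/<user>/dimuon_mass")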

Architecture and File-Task Mapping
[Figure: the Spark driver, holding the dimuon invariant mass calculation source code, runs on the IT Hadoop and Spark Service (analytix); the executors run tasks 1 … x, and each task reads one ROOT file from the EOS Storage Service.]

Table 1: Key workload metrics and time spent, measured with Spark custom instrumentation for 1 PB of input with 100 Spark executors, 800 logical cores (8 logical cores per Spark executor), and a variable "readAhead" size between 16 KB and 64 KB.

Metric Name             | Total Time Spent (Sum Over Executors)           | Percentage (Compared to Execution Time)
Total Execution Time    | ~3000 - 3500 hours                              | 100%
CPU Time                | ~1200 hours                                     | 40%
EOS Read Time           | ~1200 - 1800 hours, depending on readAhead size | 40-50%
Garbage Collection Time | ~200 hours                                      | 7-8%

Graph 3: Executor CPU usage throughout job execution for 1 PB of input with a 64 KB "readAhead" buffer, 100 Spark executors, and 8 logical cores per executor.
Graph 4: Read throughput throughout job execution for 1 PB of input with a 64 KB "readAhead" buffer, 100 Spark executors, and 8 logical cores per executor.
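As a rough cross-check of these figures (an estimate, not a number from the poster): ~3000-3500 executor-hours spread over 800 logical cores corresponds to a wall-clock time of roughly 3000/800 ≈ 3.8 to 3500/800 ≈ 4.4 hours for 1 PB of input, consistent with the sub-5-hour milestone discussed in the Conclusions.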

Performance Results at Scale
[Graphs 1 and 2: runtime (minutes, 0-250) versus dataset size, from 22 TB up to 110 TB in Graph 1 and from 22 TB up to 1 PB in Graph 2.]

Graph 1: Runtime performance in minutes for different input sizes with 407 Spark executors, 2 cores per Spark executor, and 7 GB per Spark executor. The "readAhead" connector buffer is set to 32 MB.
Graph 2: Runtime performance in minutes for different input sizes with 100 Spark executors, 8 cores per Spark executor, and 7 GB per Spark executor. The "readAhead" connector buffer is set to 64 KB, which drastically improved performance compared to Graph 1.
Graph 5: Number of concurrent active tasks throughout job execution for 1 PB of input with a 64 KB "readAhead" buffer, 100 Spark executors, and 8 logical cores per executor.

EOS Service
A disk-based, low-latency storage service with a highly scalable hierarchical namespace, which enables data access through the XRootD protocol. It provides storage for both physics and user use cases via different service instances such as EOSPUBLIC, EOSCMS, etc.

FUTURE STEPS

• Repeat the workload scalability tests on top of virtualized/containerized infrastructure with Kubernetes, on larger infrastructure and on public clouds
• Extend the features of the "Hadoop-XRootD Connector" library (e.g. write to EOS, better packaging, monitoring etc.)
• Extend the workloads to different and more complex use cases of Physics Data Processing, as well as use cases for Machine Learning and Online Data Processing (Streaming)

CONCLUSIONS

• We have solved two important data engineering challenges:
  – Hadoop-XRootD Connector can directly access files from the EOS Service
  – Spark-root can read ROOT files and infer their schema into Spark
• Did we achieve the project milestone of reducing 1 PB in 5 hours?
  – Yes, we even dropped below 4 hours in our latest tests
• Through this project we achieved:
  – Efficient and fast processing of physics data
  – Connecting libraries between Big Data Technologies and HEP tools
  – Adoption of Big Data Technologies by CMS physics groups (e.g. University of Padova, Fermilab)

http://cern.ch/IT   © CERN CC-BY-SA 4.0
