
PHYSICS DATA ANALYSIS AND DATA REDUCTION AT SCALE WITH APACHE SPARK

E. Motesnitsalis1, V. Khristenko1, M. Migliorini1, R. Castellotti1, L. Canali1, M. Girone1,
D. Olivito1, M. Cremonesi2, J. Pivarski3, O. Gutsche2
1) CERN, Geneva, Switzerland; 2) Fermilab, Batavia, USA; 3) Princeton University

GOALS AND MOTIVATIONS

• The main goal of the project is to perform Data Analysis and Data Reduction at scale using Big Data Technologies over physics data acquired by CMS and made public on the CERN Open Data portal
• We are interested in investigating new ways to analyse physics data and in enabling further development with Streaming and Machine Learning workloads
• We want to adopt new technologies widely used in industry, with modern APIs, development environments, and platforms (notebooks etc.)
• This opens data processing for High Energy Physics (HEP) to a larger community of data scientists and data engineers, bringing together domain experts from industry and academia

CURRENT PROCEDURES AND MILESTONES

• Until today, the vast majority of high energy physics analysis has been done with the ROOT Framework, processing physics data stored in ROOT format files
• To use big data tools, we solved two key data engineering challenges:
  1. Read files in ROOT format using Spark
  2. Access files stored in EOS directly from Hadoop/Spark
• This enabled us to produce and optimize physics analysis workloads with input up to 1 PB. The Spark infrastructure is now used by several physics analysis groups
SPARK-ROOT AND HADOOP-XROOTD CONNECTOR

• Spark-root is an open source Scala library which can read ROOT TTrees, infer their schema, and import them into Spark DataFrames
• Hadoop-XRootD Connector is an open source Java library that connects to the XRootD client via the Java Native Interface (JNI)
• A parameterized "readAhead" buffer is used to improve performance when reading files from the EOS Service

Hadoop – XRootD Connector Architecture
[Figure: the Hadoop-XRootD Connector (Java) plugs into the Hadoop HDFS API and calls the XRootD Client (C++) through the Java Native Interface (JNI) to reach the EOS Storage Service.]
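To make the two building blocks concrete, the following is a minimal Scala sketch of how a ROOT file on EOS could be read into a Spark DataFrame through spark-root and the Hadoop-XRootD Connector. The EOS path, the TTree name "Events", and the exact configuration property and class names are illustrative assumptions, not taken from the poster; the libraries' documentation gives the authoritative options.

// Minimal sketch, under assumed property names and paths.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dimuon-data-reduction")
  // Assumption: the Hadoop-XRootD Connector registers a Hadoop FileSystem for the
  // "root://" scheme; the property key and implementation class here are placeholders.
  .config("spark.hadoop.fs.root.impl", "ch.cern.eos.XRootDFileSystem")
  .getOrCreate()

// spark-root exposes a Spark DataSource; the format string and the "tree" option
// follow the open source library's usage, and "Events" is a hypothetical TTree name.
val muons = spark.read
  .format("org.dianahep.sparkroot")
  .option("tree", "Events")
  .load("root://eospublic.cern.ch//eos/opendata/cms/<dataset>/file.root")

muons.printSchema()   // schema inferred from the ROOT TTree branches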

RESULTS FROM SCALABILITY TESTS

Test Workload
• The data processing job of this project performs event selection (i.e. Data Reduction) and then computes the dimuon invariant mass (see the sketch after this list)
• On a single thread/core with a single file as input, the workload reads one branch and completes the calculation in approximately 10 minutes for a 4 GB file
• The results of running at scale are displayed in Graphs 1-5 and Table 1
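The selection and the invariant mass computation can be expressed as ordinary DataFrame operations. The sketch below reuses the muons DataFrame from the earlier snippet; the column names (mu1_*, mu2_*, charges), the cuts, and the Parquet output are assumptions for illustration, since the poster does not show the actual branch names or cuts of the workload.

// Illustrative dimuon selection and invariant mass calculation (assumed columns).
import org.apache.spark.sql.functions._

val dimuons = muons
  .filter(col("mu1_charge") * col("mu2_charge") < 0)            // opposite-charge muon pair
  .withColumn("mass",                                           // m^2 = (E1+E2)^2 - |p1+p2|^2
    sqrt(
      pow(col("mu1_E")  + col("mu2_E"),  2) -
      pow(col("mu1_px") + col("mu2_px"), 2) -
      pow(col("mu1_py") + col("mu2_py"), 2) -
      pow(col("mu1_pz") + col("mu2_pz"), 2)
    ))
  .select("mass")

// Data reduction step: persist only the reduced column; the output format and
// path are assumptions, not stated on the poster.
dimuons.write.mode("overwrite").parquet("hdfs:///user/<user>/dimuon_mass")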

Architecture and File-Task Mapping
[Figure: the Spark driver, holding the dimuon invariant mass calculation source code, runs on the IT Hadoop and Spark Service (analytix); the executors run tasks 1 … x, and each task reads one ROOT file from the EOS Storage Service.]

Table 1: Key workload metrics and time spent, measured with Spark custom instrumentation for 1 PB of input with 100 Spark executors, 800 logical cores (8 logical cores per Spark executor), and a variable "readAhead" size between 16 KB and 64 KB.

Metric Name             | Total Time Spent (Sum Over Executors)           | Percentage (Compared to Execution Time)
Total Execution Time    | ~3000 - 3500 hours                              | 100%
CPU Time                | ~1200 hours                                     | 40%
EOS Read Time           | ~1200 - 1800 hours, depending on readAhead size | 40-50%
Garbage Collection Time | ~200 hours                                      | 7-8%

Graph 3: Executor CPU usage throughout job execution for 1 PB of input with a 64 KB "readAhead" buffer, 100 Spark executors, and 8 logical cores per executor.
Graph 4: Read throughput throughout job execution for 1 PB of input with a 64 KB "readAhead" buffer, 100 Spark executors, and 8 logical cores per executor.
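As a rough cross-check of these figures (an estimate, not a number from the poster): ~3000-3500 executor-hours spread over 800 logical cores corresponds to a wall-clock time of roughly 3000/800 ≈ 3.8 to 3500/800 ≈ 4.4 hours for 1 PB of input, consistent with the sub-5-hour milestone discussed in the Conclusions.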

Performance Results at Scale
[Graphs 1 and 2: runtime (minutes, 0-250) versus dataset size, from 22 TB up to 110 TB in Graph 1 and from 22 TB up to 1 PB in Graph 2.]

Graph 1: Runtime performance in minutes for different input sizes with 407 Spark executors, 2 cores per Spark executor, and 7 GB per Spark executor. The "readAhead" connector buffer is set to 32 MB.
Graph 2: Runtime performance in minutes for different input sizes with 100 Spark executors, 8 cores per Spark executor, and 7 GB per Spark executor. The "readAhead" connector buffer is set to 64 KB, which drastically improved performance compared to Graph 1.
Graph 5: Number of concurrent active tasks throughout job execution for 1 PB of input with a 64 KB "readAhead" buffer, 100 Spark executors, and 8 logical cores per executor.

EOS Service
A disk-based, low-latency storage service with a highly scalable hierarchical namespace, which enables data access through the XRootD protocol. It provides storage for both physics and user use cases via different service instances such as EOSPUBLIC, EOSCMS, etc.

FUTURE STEPS

• Repeat the workload scalability tests on top of virtualized/containerized infrastructure with Kubernetes, on larger infrastructure and on public clouds
• Extend the features of the "Hadoop-XRootD Connector" library (e.g. write to EOS, better packaging, monitoring etc.)
• Extend the workloads to different and more complex use cases of Physics Data Processing, as well as use cases for Machine Learning and Online Data Processing (Streaming)

CONCLUSIONS

• We have solved two important data engineering challenges:
  – Hadoop-XRootD Connector can directly access files from the EOS Service
  – Spark-root can read ROOT files and infer their schema into Spark
• Did we achieve the project milestone of reducing 1 PB in 5 hours?
  – Yes, we even dropped below 4 hours in our latest tests
• Through this project we achieved:
  – Efficient and fast processing of physics data
  – Connecting libraries between Big Data Technologies and HEP tools
  – Adoption of Big Data Technologies by CMS physics groups (e.g. University of Padova, Fermilab)

http://cern.ch/IT   © CERN CC-BY-SA 4.0
