
How to create a Pipeline capable of processing 2.5 Billion records/day in just 3 Months

Josef Habdank
Lead Data Scientist & Data Platform Architect at Infare Solutions
@jahabdank
jha@infare.com
linkedin.com/in/jahabdank
You are in the right place!
Spark is currently the de facto standard for Big Data
• Using Spark is currently the most fundamental skill in the Big Data and Data Science world
• Main reason: it helps solve most Big Data problems: processing, transformation, abstraction and Machine Learning
• Pretty much all serious Big Data players are using Spark

[Google Trends chart for “Apache Spark”: interest exploded during 2013-2015, leaving the good old‘n’slow Hadoop MapReduce days behind]
What is this talk about?
The presentation consists of 4 parts:
• Quick intro to Spark internals and optimization
• N-billion rows/day system architecture
  • How exactly we did what we did
  • Focus on Spark’s performance, getting maximum bang for the buck
• Data Warehouse and Messaging
• (optional) How to deploy Spark so it does not backfire
The Story
• “Hey guys, we might land a new cool project”
• “It might be 5-10x as much data as we have so far”
• “In 1+ year it will probably be much more than 10x”
• “Oh, and can you do it in 6 months?”

“Let’s do that in 3 months!”
The Result
• 5 low-cost servers (8-core, 64GB RAM)
• Located on Amazon with Hosted Apache Spark
• A fraction of the cost of any other technology
• Initial max capacity load tested at 2.5bn/day
• Currently improved to a max capacity of 6-8bn/day, ~250-350 million/hour (with no extra hardware required)
• As Spark scales with hardware, we could do 15bn with 10-15 machines
• Delivered in 3 months, in production for 1.5 years now 


Developing code for
distributed systems
Code in Notebooks, they are awesome
• Development on cluster systems is by nature not easy
• The best you can do locally is to know that the code compiles, the unit tests pass and the code runs on some sample data
• You do not actually know if it works until you test-run on the PreProd/Dev cluster, as the data defines correctness, not the syntax

Normal workflow:
• Code locally on your machine
• Compile and assemble
• Upload the JAR + make sure dependencies are present on all nodes
• Run the job and test if it works, spend time looking for results
+ Can support the Git lifecycle

Notebook workflow:
• Write code online
• Shift+Enter to compile (on master), send to cluster nodes, run and show results in the browser 
+ Allows mixing Python/Scala/SQL (which is awesome )
Traps of Notebooks
• Code is compiled on the fly
• When a chunk of code is executed as a Spark job (on the whole cluster), all the dependent objects will be serialized and packaged with the job
• Sometimes the dependency structure is very non-trivial and the Notebook will start serializing huge amounts of data (completely silently, attaching it to the job)
• PRO TIP: have as few global variables as possible; if needed, use objects
Traps of Notebooks
Code as distributed JAR vs Code as lambda
• Code compiled into the JAR is distributed and bootstrapped to the JVMs across the cluster (when the JAR is initialized on a node it will open the connection there)
• Code written as a lambda in the notebook is serialized on the master and attached to the job (a connection object will fail to work after deserialization)
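A minimal sketch of the two variants (the DbConnection class and its insert method are hypothetical stand-ins, not anything from this deck): keeping the connection inside an object means each executor JVM opens its own connection when the class is first used there, while capturing a connection created on the driver drags it through Java serialization, where it breaks.

%scala
// GOOD: the connection lives inside an object, so it is created lazily
// on each executor JVM when the class is first touched there
object Sink {
  lazy val conn = new DbConnection("db-host:5432")   // hypothetical connection class
  def save(row: String): Unit = conn.insert(row)     // hypothetical insert API
}
rdd.foreach(row => Sink.save(row))

// BAD: the connection is created on the driver and captured by the lambda;
// Spark serializes it with the job, and after deserialization on the executors
// it is either broken or the job fails with a NotSerializableException
val conn = new DbConnection("db-host:5432")
rdd.foreach(row => conn.insert(row))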
Writing high throughput
code in Spark
Spark’s core: the collections
• Spark is just a processing framework
• It works on distributed collections:
  • Collections are partitioned
  • The number of partitions is defined by the source
  • Collections are lazily evaluated (nothing is done until you request results)
  • With a Spark collection you only write a ‘recipe’ for what Spark has to do (called the lineage)
• Types of collections:
  • RDDs: just collections of Java objects. Slowest, but most flexible
  • DataFrames/Datasets: mainly tabular data, can handle structured data but it is not trivial. Much faster serialization/deserialization, more compact, faster memory management, SparkSQL compatible
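A minimal sketch of the two collection types and of lazy evaluation (assuming a notebook where sc and spark already exist; the Fare case class is purely illustrative): nothing runs until the actions at the end.

%scala
case class Fare(route: String, price: Double)
val data = Seq(Fare("CPH-JFK", 412.0), Fare("CPH-JFK", 398.5), Fare("LHR-SIN", 655.0))

// RDD: a distributed collection of plain JVM objects
val rdd = sc.parallelize(data)
val cheapRdd = rdd.filter(_.price < 500)                 // only builds the lineage, nothing executes yet

// Dataset/DataFrame: tabular, Tungsten-managed, SparkSQL compatible
import spark.implicits._
val ds = data.toDS()
val cheapDf = ds.filter($"price" < 500).select("route")  // still just a recipe

println(cheapRdd.count())                                // actions trigger the distributed execution
cheapDf.show()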
Spark’s core: in-memory map reduce
• Spark implements the Map-LocalReduce-LocalShuffle-Shuffle-Reduce paradigm
• Each step in the ‘recipe’/lineage is a combination of the above
• Why this way? The vast majority of Big Data problems can be converted to this paradigm:
  • All SQL queries/data extracts
  • In many cases Data Science (modelling)
Map-LocalReduce-Shuffle-Reduce

%scala
val lirdd = sc.parallelize(
  loremIpsum.split(" ")
)

val wordCount = lirdd
  .map(w => (w, 1))
  .reduceByKey(_ + _)
  .collect

%sql
select
  word,
  count(*) as word_count
from words
group by word

[Diagram: on each node the Map turns words into ("word", 1) pairs, the LocalReduce combines them per node, e.g. ("lorem", 2) and ("Ipsum", 1) on node 1, ("Ipsum", 1) and ("sicut", 2) on node 2; the Shuffle exchanges the partial counts between nodes, and the final Reduce on the master produces ("lorem", 2), ("Ipsum", 2), ("sicut", 2). Slowest part: the shuffle, where data is serialized from objects to BLOBs, sent over the network and deserialized.]
Map only operations
Spark knows the shuffle is expensive and tries to avoid it if it can

%scala
// val rawBytesRDD is defined
// and contains blobs with
// serialized Avro objects
rawBytesRDD
  .map(fromAvroToObj)
  .toDF.write
  .parquet(outputPath)

[Diagram: the incoming blobs on each node are mapped partition by partition (0x00[…] blob → Obj 001-100, Obj 101-200, …) and written directly to output files (File 1 … File 6), one file per partition, with no shuffle.]
Local Shuffle-Map operations
For fragmented collections (with too many partitions)

%scala
// val rawBytesRDD is defined
// and contains blobs with
// serialized Avro objects
rawBytesRDD
  .coalesce(2) // ** never set this low!!
               // This is just an example 
               // Aim at at least 2x the node count.
               // Moreover, if possible,
               // coalesce() or repartition()
               // on binary blobs
  .map(fromAvroToObj)
  .toDF.write
  .parquet(outputPath)

[Diagram: a local shuffle first merges the many small incoming blob partitions down to 2 (one per node), then the map runs on the merged partitions: Obj 001-300 → File 1 on node 1, Obj 301-600 → File 2 on node 2.]
Why Python/PySpark is (generally) slower than Scala
• All rows will be serialized between the JVM and Python (there are exceptions)
• For each .map([...]) the rows go through a JVM -> Python serde, the map runs in Python, then a Python -> JVM serde brings the results back
• This happens within the same machine, so it is very fast, but it is nonetheless a significant overhead
Why Python/PySpark is (generally)
slower than Scala
• In Spark 2.0, with the new version of Catalyst and dynamic code generation, Spark will try to convert Python code to native Spark functions
• This means that on some occasions Python might work as fast as Scala, as the Python code is in fact translated into native Spark calls

df2 = df1 \
    .filter(df1.old_column >= 30) \
    .withColumn("new_column1", ((df1.old_column - 2) % 7) + 1)

• Catalyst and code generation will not be able to do this for RDD map operations or for custom UDFs on DataFrames

df3 = df2 \
    .withColumn("new_column2", custom_function(df2.new_column1))

• PRO TIP: avoid using RDDs, as Spark will serialize whole objects. For UDFs it will only serialize the few columns involved, and will do it in a very efficient way
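A Scala sketch of the same idea (assuming a DataFrame df1 with an integer column old_column; the names are illustrative): the first variant stays entirely inside Catalyst/Tungsten, the UDF variant is opaque to Catalyst but only serializes the columns it touches, and the RDD variant deserializes whole rows into objects.

%scala
import org.apache.spark.sql.functions.{col, udf}

// native column expressions: fully optimized by Catalyst, no object serialization
val native = df1
  .filter(col("old_column") >= 30)
  .withColumn("new_column1", ((col("old_column") - 2) % 7) + 1)

// UDF: a black box for Catalyst, but only the referenced columns are serialized
val customFunction = udf((x: Int) => ((x - 2) % 7) + 1)
val withUdf = df1.withColumn("new_column1", customFunction(col("old_column")))

// RDD map: whole rows are deserialized into JVM objects and back, the slowest option
val asRdd = df1.rdd.map(row => ((row.getAs[Int]("old_column") - 2) % 7) + 1)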
N bn Data Platform design
What Infare does

• Leading provider of Airfare Intelligence Solutions to the Aviation Industry
• Collects and processes 1.5+ billion distinct airfares daily
• Business Intelligence on flight ticket prices, such that airlines know competitors’ prices and market trends
• Advanced Analytics and Data Science predicting prices, ticket demand as well as financial data
What we were supposed to do
[Diagram: multiple Data Collection systems feed a scalable DataWarehouse (aimed at 500+bn rows), which in turn feeds customer-specific Data Warehouses. *Scalable to billions of rows a day]
What we need

[Diagram: Data Collection systems → Scalable Data Processing Framework → fast-access, low-cost temporary storage and permanent storage]

ALL BigData systems in the world look like that 
What we first did 

[Architecture diagram: Data Collection systems → Data Streamer → Message Broker (Kinesis), carrying Snappy-compressed Avro-blob micro batches → Spark Streaming mini batches, which push monitoring/stats to a Monitoring System → Temporary Storage: uncompressed Parquet micro batches on S3, also used for real-time analytics → Preaggregation → Permanent Storage: a partitioned, aggregated Parquet DataWarehouse used for offline analytics and Data Science.]
Did it work?

[Same architecture diagram, with the pain point marked: S3 has latency (and is inconsistent for deletes).]
Why DynamoDB was a failure: Spark’s Parallelize hell

• DynamoDB historically DID NOT SUPPORT Spark WHATSOEVER; we effectively ended up writing our own Spark driver from scratch, WEEKS of wasted effort
• I have to admit that since our initial huge disappointment 1 year ago, Amazon has released a Spark driver, and I do not know how good it is. My opinion is still that a closed-source DB with limited support and usage will always be inferior to other technologies
• No Spark-native driver, so no clustered queries
• Parallelize in the current implementation has a memory leak
How are we doing it now 

[Updated architecture diagram: monitoring/stats now go to Elasticsearch 5.2, which has amazing Spark support, and the Message Broker is the new Kafka 0.10.2, which has great streaming support. Data Collection systems → Data Streamer → Kafka, carrying Snappy-compressed Avro-blob micro batches → Spark Streaming mini batches → Temporary Storage: uncompressed Parquet micro batches on S3, also used for real-time analytics → Preaggregation → Permanent Storage: a partitioned, aggregated Parquet/ORC DataWarehouse used for offline analytics and Data Science.]
Getting maximum out of the Kinesis/Kafka
Serialize and send micro batches of data, not individual messages

Kinesis:
• The messages are max 25kB (if larger, the driver will slice the message into multiple PUT requests)
• Avro-serialize and Snappy-compress the data to max 25kB (~200 data rows per message)
• Obtained 10x throughput compared to sending individual rows (each 180 bytes)

Kafka:
• The max message size is 1MB, but that is very large
• Jay Kreps @ LinkedIn researched the optimal message size for Kafka and it is between 10-100kB
• From his research, those message sizes allow sending as much as the hardware/network allows
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
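A sketch of the idea on the producer side (using the AWS SDK v1 Kinesis client and the Apache Avro file writer; ObjectModelV1Avro is the Avro contract class shown later in this deck, and the stream name is illustrative): buffer ~200 rows, serialize them into one Snappy-compressed Avro blob and send that blob as a single Kinesis record instead of 200 individual PUTs.

%scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import org.apache.avro.file.{CodecFactory, DataFileWriter}
import org.apache.avro.specific.SpecificDatumWriter
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

def sendMicroBatch(rows: Seq[ObjectModelV1Avro]): Unit = {
  // serialize the whole micro batch into one Snappy-compressed Avro blob (~25kB)
  val out = new ByteArrayOutputStream()
  val writer = new DataFileWriter[ObjectModelV1Avro](new SpecificDatumWriter[ObjectModelV1Avro]())
  writer.setCodec(CodecFactory.snappyCodec())
  writer.create(rows.head.getSchema, out)
  rows.foreach(writer.append)
  writer.close()

  // one PUT request per ~200 rows instead of one per row
  val kinesis = AmazonKinesisClientBuilder.defaultClient()
  kinesis.putRecord(new PutRecordRequest()
    .withStreamName("prices-ingest")                 // illustrative stream name
    .withPartitionKey(rows.head.hashCode.toString)   // spread batches across shards
    .withData(ByteBuffer.wrap(out.toByteArray)))
}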
Spark Streaming
[Diagram: Data Collection systems → Message Broker → a stream of mini batches, one every n seconds]

• Creates a stream of MiniBatches, which are RDDs created every n seconds
• The Spark Driver continuously polls the message broker, by default every 200ms
• Each received block (every 200ms) becomes an RDD partition
• Consider using repartition/coalesce, as the number of partitions gets very large quickly (for 60 sec, there will be up to 300 partitions, thus 300 files)
• NOTE: in Spark 2.1 they added Structured Streaming (streaming on DataFrames, not RDDs, very cool but still limited in functionality)

kstream
  .repartition(3 * nodeCount)
  .foreachRDD(rawBytes => {
    [...]
  })
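The deck does not show how kstream itself is created; a sketch of one way to set it up with the spark-streaming-kafka-0-10 direct stream API (broker address, topic and group id are illustrative, and the payload is assumed to be the Avro/Snappy blob as a byte array; nodeCount matches the 5-server cluster from “The Result” slide):

%scala
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val nodeCount = 5
val ssc = new StreamingContext(sc, Seconds(60))      // one mini batch per minute

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka:9092",               // illustrative broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer],
  "group.id" -> "price-pipeline")

val kstream = KafkaUtils
  .createDirectStream[String, Array[Byte]](ssc, PreferConsistent,
    Subscribe[String, Array[Byte]](Seq("prices-ingest"), kafkaParams))
  .map(_.value)                                      // keep only the Avro/Snappy blob payload

kstream
  .repartition(3 * nodeCount)                        // keep partition/file counts under control
  .foreachRDD(rawBytes => { /* decode Avro, write Parquet, push stats */ })

ssc.start()
ssc.awaitTermination()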
Default error handling in Spark Streaming
• Retry the MiniBatch n times (4 by default)
• If all retries fail, kill the streaming job
• Conclusion: you must do your own error handling
In stream error handling, low error rate
• For a low error rate, handle each error individually
• Open a connection to storage, save the error packet for later processing, close the connection
• Will clog the stream for high error rate streams

kstream
  .repartition(concurency)
  .foreachRDD(objects => {
    val res = objects.flatMap(b => {
      try { // can use Option instead of Seq
        [do something, rare error]
      } catch { case e: Exception =>
        [save the error packet]
        Seq[Obj]()
      }
    })
    [do something with res]
  })
Advanced error handling, high error rate
• For a high error rate, you can’t store each error individually
• Unfortunately Spark does not support a Multicast operation (stream splitting)
• Try the high-error-probability action (such as an API request)
• Use transform to return Either[String, Obj]
• The Either class is like a tuple, but it guarantees that only one or the other is present (String for the error, Obj for the success)
• cache() to prevent reprocessing
• Individually process the error stream and the success stream
• NOTE: cache() should be used cautiously

kstream
  .repartition(concurency)
  .foreachRDD(objects => {
    val res = objects.map(b => {
      Try([frequent error])
        .transform(
          { b => Success(Right(b)) },
          { e => Success(Left(e.getMessage)) }
        ).get
    })
    res.cache() // cautious, can be slow
    res
      .filter(ei => ei.isLeft)
      .map(ei => ei.left.get)
      .[process errors as stream]
    res
      .filter(ei => ei.isRight)
      .map(ei => ei.right.get)
      .[process successes as stream]
  })
To cache or not to cache
• Cache is the most abused function in Spark
• It is NOT (!) a simple storing of a pointer to a collection in the memory of the process
• It is a SERIALISATION of the data to a BLOB and storing it in the cluster’s shared memory
• When reusing data which was cached, it has to deserialize the data from the BLOB
• By default it uses the generic Java serializer

[Diagram: Job1 runs Step1 → Step2 → Step3 → …; Step2’s result is serialized into the cache, and Job2 deserializes it from the cache instead of recomputing the earlier steps.]
Standard Spark Streaming Scenario:
• Incoming Avro stream
• Step 1) Stats computation, storing stats
• Step 2) Storing data from the stream
Question: Will caching make it faster? Nope

[Diagram, variant without cache: Message Broker with Avro objects → Deserialize Avro → Compute Stats → Save Stats, and in parallel Message Broker → Deserialize Avro → Save Data. This is the faster variant.]

[Diagram, variant with cache: Message Broker with Avro objects → Deserialize Avro → cache → Compute Stats → Save Stats, and → Save Data. The cache hides an extra Serialize (Java) and Deserialize (Java) step, which costs more than simply deserializing the fast Avro twice.]
To cache or not to cache
• Cache is the most abused function in Spark
• It is NOT (!) a simple storing of a pointer to a collection in the memory of the process
• It is a SERIALISATION of the data to a BLOB and storing it in the cluster’s shared memory
• When reusing data which was cached, it has to deserialize the data from the BLOB
• By default it uses the generic Java serializer, which is SLOW
• Even super fast serdes like Kryo are much slower, as they are generic (the serializer does not know the type at compile time)
• Avro is amazingly fast, as it is a Specific serializer (it knows the type)
• Often you will be quicker reprocessing the data from your source than using the cache, especially for complex objects (pure strings/byte arrays are fast)
• Caching is faster in the DataFrames/Tungsten API, but even then it might be slower than reprocessing
• Pro TIP: when using cache, make sure it actually helps. And monitor CPU consumption too
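A quick way to sanity-check a cache decision (a sketch; df stands for whatever DataFrame your job reuses): time the downstream actions with and without persisting, and look at the Storage tab of the Spark UI to see how much memory the serialized blocks actually take.

%scala
import org.apache.spark.storage.StorageLevel

def timed[A](label: String)(f: => A): A = {
  val t0 = System.nanoTime()
  val res = f
  println(s"$label: ${(System.nanoTime() - t0) / 1e9} s")
  res
}

// variant 1: recompute the lineage from the source for every action
timed("no cache")   { df.count(); df.distinct().count() }

// variant 2: serialize once into the block store, then reuse it
df.persist(StorageLevel.MEMORY_ONLY_SER)
timed("with cache") { df.count(); df.distinct().count() }
df.unpersist()

// also check the Storage tab of the Spark UI for the size of the cached blocks,
// and watch executor CPU while the job runs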
Tungsten Memory Manager + Row Serializer
• Introduced in Spark 1.5, used in DataFrames/Datasets
• Stores data in memory as readable binary blobs, not as Java objects
• Since Spark 2.0 the blobs are in a columnar format (much better compression)
• Does some black magic wizardry with the L1/L2/L3 CPU cache
• Much faster: 10x-100x faster than RDDs
• Rule of thumb: whenever possible use DataFrames and you will get Tungsten

case class MyRow(val col: Option[String], val exception: Option[String])

kstream
  .repartition(concurency)
  .foreachRDD(objects => {
    val res = objects.map(b => {
      Try([frequent error]).transform(
        { b => Success(MyRow(b, None)) },
        { e => Success(MyRow(None, e.getMessage)) }
      ).get
    }).toDF()
    res.cache()
    res.select("exception")
      .filter("exception is not null")
      .[process errors as stream]
    res.select("col")
      .filter("exception is null")
      .[process successes as stream]
  })
Data Warehouse and Messaging
Data Storage: Row Store vs Column Store
• What are databases: collections of objects
• Main difference:
  • Row Store requires only one row at a time to serialize
  • Column Store requires a batch of data to serialize
• Serialization:
  • Row Store can serialize online (as rows come into the serializer, they can be appended to the binary buffer)
  • Column Store requires the whole batch to be present at the moment of serialization, so the data can be processed (index creation, sorting, duplicate removal etc.)
• Reading:
  • Row Store always reads all data from a file
  • Column Store allows reading only selected columns
JSON/CSV Row Store
Pros:
• Human readable (do not underestimate that)
• No dev time required
• Compression algorithms work very well on ASCII text (compressed CSV is ‘only’ 2x larger than compressed Avro)

Cons:
• Large (CSV) and very large (JSON) volume
• Slow serialization/deserialization

Overall: worth considering, especially during the dev phase


Avro Row Store
Pros:
• BLAZING FAST serialization/deserialization
• Apache Avro lib is amazing (buffer based serde)
• Binary/compact storage
• Compresses about 70% with Snappy (200 compressed objects with 50 cols result in 20kB)

Cons:
• Hard to debug; once a BLOB is corrupt it is very hard to find out what went wrong
Avro Scala classes
• Avro serialization/deserialization requires an Avro contract compatible with the Apache Avro library
• In principle the Avro classes are logically similar to the Parquet classes (definition/field accessor)

class ObjectModelV1Avro (
  var dummy_id: Long,
  var jobgroup: Int,
  var instance_id: Int,
  [...]
) extends SpecificRecordBase with SpecificRecord {

  def get(field: Int): AnyRef = { field match {
    case pos if pos == 0 => { dummy_id }.asInstanceOf[AnyRef]
    [...]
  } }

  def put(field: Int, value: Any): Unit = {
    field match {
      case pos if pos == 0 => this.dummy_id = { value }.asInstanceOf[Long]
      [...]
    }
  }

  def getSchema: org.apache.avro.Schema =
    new Schema.Parser().parse("[...]")
}
Avro real-time serialization
• Apache Avro allows serializing on the fly, row by row
• The incoming data stream can be serialized on the fly into a binary buffer

// C# code
byte[] raw;
using (var ms = new MemoryStream())
{
    using (var dfw = DataFileWriter<T>
        .OpenWriter(_datumWriter, ms, _codec))
    {
        // can be yielded
        microBatch.ForEach(dfw.Append);
    }
    raw = ms.ToArray();
}

// Scala
val writer: DataFileWriter[T] =
  new DataFileWriter[T](datumWriter)
writer.create(objs(0).getSchema, outputStream)

// can be streamed
for (obj <- objs) {
  writer.append(obj)
}
writer.close

val encodedByteArray: Array[Byte] =
  outputStream.toByteArray
Parquet Column Store
Pros:
• Meant for large data sets
• Single column searchable
• Compressed (eliminated duplicates etc.)
• Contains in-file stats and metadata located in the TAIL
• Very well supported in Spark:
  • predicate/filter pushdown
  • VECTORIZED READER AND TUNGSTEN INTEGRATION (5-10x faster than the Java Parquet library)

spark
  .read
  .parquet(dwLoc)
  .filter('col1 === "text")
  .select("col2")

Cons:
• Not indexed
More on predicate/filter pushdown
• Processing is separate from the storage
• Predicate/filter pushdown gives Spark a uniform way to push the query to the source
• Spark remains oblivious to how the driver executes the query; it only cares whether the driver can or can’t execute the pushdown request
• If the driver can’t execute the request, Spark will load all the data and filter it in Spark
• Such an abstraction allows easy replacement of the storage; Spark does not care if the storage is S3 files or a database

[Diagram: the Apache Spark DataFrame makes a pushdown request to the Parquet/ORC API; the Parquet/ORC driver executes the pushed request on the binary data in Storage.]
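One way to check whether a pushdown actually happened (a sketch, reusing dwLoc, col1 and col2 from the examples above) is to look at the physical plan, where the Parquet scan lists the pushed predicates:

%scala
val q = spark.read.parquet(dwLoc)
  .filter('col1 === "text")
  .select("col2")

// the physical plan of the Parquet scan shows entries such as
// PushedFilters: [IsNotNull(col1), EqualTo(col1,text)]
q.explain(true)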
ORC Column Store
Pros:
• Meant for large data sets
• Single column searchable
• Even better compression than Parquet (20-30% less)
• Contains in-file stats, metadata and indexes (3 levels: file, block and every 10k rows) located in the TAIL
• Theoretically well supported in Spark:
  • predicate/filter pushdown
  • uses indexes for filter pushdown searches, amazing 

spark
  .read
  .orc(dwLoc)
  .filter('col1 === "text")
  .select("col2")

Cons:
• No vectorized reader; rumors about adding it from Spark 2.2. If this turns out to be true, then ORC should be faster
DataWarehouse building
• Think of it as a collection of read-only files
• Recommended to use Parquet/ORC files in a folder structure (aim at 100-1000MB files, use coalesce)
• Folders are partitions
• Spark supports append for Parquet/ORC
• Compression:
  • Use Snappy (decompression speed ~500MB/sec per core)
  • Gzip (decompression speed ~60MB/sec per core)
  • Note: Snappy is not splittable, keep files under 1GB
• Ordering: if you can (often you can not), order your data, as then columnar deduplication will work better
  • In our case this saves 50% of space, and thus 50% of reading time

df
  .coalesce(fileCount)
  .write
  .option("compression", "snappy")
  .mode("append")
  .partitionBy(
    "year",
    "month",
    "day")
  //.orderBy("some_column")
  .parquet(outputLoc)
  //.orc(outputLoc)
DataWarehouse query execution
dwFolder/year=2016/month=3/day=10/part-[ManyFiles]
dwFolder/year=2016/month=3/day=11/part-[ManyFiles]
[...]

• Partition pruning: Spark will only look for the files in the appropriate folders
• Row group pruning: uses row group stats to skip data (if the filtered data is outside of the min/max values of the Row Group stats in a Parquet file, the data will be skipped; turned off by default, as it is expensive and only gives a benefit for ordered files):
  sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
• Reads only col1 and col2 from the file, with col1 used as the filter (never seen by Spark, handled by the API) and col2 returned to Spark for processing
• If the DW is ORC, it will use the in-file indexes to speed up the scan (Parquet will still scan through the entire column in each scanned file to filter col1)

sqlContext
  .read
  .parquet(dwLoc)
  .filter(
    'year === 2016 &&
    'month === 1 &&
    'day === 1 &&
    'col1 === "text")
  .select("col2")
DataWarehouse schema evolution
the SLOW way
• Schema evolution = columns changing over time
• Spark allows a Schema-on-read paradigm
  • Allows only adding columns
  • Removing is done by predicate pushdown in SELECT
  • Renaming is handled in Spark
• Each file in the DW (both ORC and Parquet) is schema aware
  • Each file can have different columns
  • By default Spark (for speed purposes) assumes all files have the same schema
• In order to enable schema merging, manually set a flag during the read
  • there is a heavy speed penalty for doing this
• How to do it: simply append data with different columns to the already existing store

sqlContext
  .read
  .option(
    "mergeSchema",
    "true")
  .parquet(dwLoc)
  .filter(
    'year === 2016 &&
    'day === 1 &&
    'col1 === "text")
  .select("col2")
  .withColumnRenamed(
    "col2",
    "newAwesomeColumn")
DataWarehouse schema evolution
the RIGHT way
• The much faster way is to create multiple warehouses and merge them by calling UNION
• The union requires the columns and types to be the same in ALL dataframes/datawarehouses
• The dataframes have to be aligned by adding/renaming columns using default values etc.
• The advantage of doing this is that Spark is now dealing with a small(er) number of datawarehouses, where within them it can assume the same types, which can save a massive amount of resources
• Spark is smart enough to figure out how to execute partition pruning and filter/predicate pushdown on all unioned warehouses, therefore this is a recommended way

val df1 = sqlContext
  .read.parquet(dwLoc1)
  .withColumn([...])

val df2 = sqlContext
  .read.parquet(dwLoc2)
  .withColumn([...])

val dfs = Seq(df1, df2)
val df_union = dfs
  .reduce(_ union _)
// df_union is your queryable
// warehouse, including all
// partitions etc
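A sketch of the alignment step (the column names are illustrative): add the columns that are missing from the older warehouse with typed default values and rename where needed, so that both sides end up with identical columns. Since Spark’s union matches columns by position, both sides also select the same columns in the same order.

%scala
import org.apache.spark.sql.functions.lit

val dfOld = sqlContext.read.parquet(dwLoc1)
  .withColumnRenamed("col2", "newAwesomeColumn")          // align names
  .withColumn("new_column1", lit(null).cast("int"))       // column added later, default to null
  .select("year", "day", "col1", "newAwesomeColumn", "new_column1")

val dfNew = sqlContext.read.parquet(dwLoc2)
  .select("year", "day", "col1", "newAwesomeColumn", "new_column1")

val dwAll = Seq(dfOld, dfNew).reduce(_ union _)            // same columns, same order, same types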
Scala Case classes limitation
• Spark can only automatically build DataFrames from RDDs consisting of case classes
• This means that for saving Parquet/ORC you have to use case classes
• In Scala 2.10 case classes can have max 22 fields (a limitation not present in Scala 2.11), thus only 22 columns
• Case classes implicitly extend the Product type; if in need of a DataWarehouse with more than 22 columns, create a POJO class extending the Product type

case class MyRow(
  val col1: Option[String],
  val col2: Option[String]
)

// val rdd: RDD[MyRow]
val df = rdd.toDF()
df.coalesce(fileCount)
  .write
  .parquet(outputLoc)
Scala Product Class example
class ObjectModelV1 (
var dummy_id: Long,
var jobgroup: Int,
var instance_id: Int,
[...]
) extends java.io.Serializable with Product {

def canEqual(that: Any) = that.isInstanceOf[ObjectModelV1]

def productArity = 50

def productElement(idx: Int) = idx match {
  case 0 => dummy_id
  case 1 => jobgroup
  case 2 => instance_id
  [...]
}
}
Scala Avro + Parquet contract combined
• The Avro + Parquet contract can be the same class (no inheritance collision)
• Saves unnecessary object conversion/data copies, which in the 5bn range is actually a large cost
• Spark Streaming can receive objects as Avro and directly convert them to Parquet/ORC

class ObjectModelV1 (
  var dummy_id: Long,
  var jobgroup: Int,
  var instance_id: Int,
  [...]
) extends SpecificRecordBase with SpecificRecord
  with Serializable with Product {
  [...]
}
Summary: Key to high performance
• Incremental aggregation/batching
• Always make sure to have as many write threads as cores in the cluster
• Avoid the reduce phase at all costs; avoid the shuffle unless you have a good reason**
• “If it can wait, do it later in the pipeline”
• Use DataFrames whenever possible
• When using caching, make sure it actually helps 

[Pipeline diagram: Data Collection with the C# Kinesis Uploader (buffer by 25kB, submit stats) → Message Broker (Kinesis) → Spark Streaming mini/micro batches with monitoring/stats (buffer by 1 min, 12 threads, submit stats) → S3 → Preaggregation (build the daily DW, aim at 100-500MB files, submit stats) → Data Warehouse, serving offline analytics and Data Science.]
Want to work with cutting edge
100% Apache Spark projects? We are hiring!!!
2x Senior Data Scientist, working with Apache Spark + R/Python doing Airfare/price forecasting
4x Senior Data Platform Engineer, working with an Apache Spark/S3/Cassandra/Scala backend + MicroAPIs
1x Network Administrator for Big Data systems
2x DevOps Engineer for Big Data, working on Hadoop, Spark, Kubernetes, OpenStack and more
http://www.infare.com/jobs/
job@infare.com
Thank You!!!
Q/A?

And remember, we are hiring!!!


http://www.infare.com/jobs/
job@infare.com
How to deploy Spark
so it does not backfire
Hardware
Own:
+ Fully customizable
+ Cheaper, if you already have enough OPS capacity; best case scenario 30-40% cheaper
- Dealing with bandwidth limits
- Dealing with hardware failures
- No on-demand scalability

Hosted:
+ Much more failsafe
+ On-demand scalability
+ No burden on current OPS
- Deal with dependencies with existing systems (e.g. inter-data center communication)
Data Platform
MapReduce + HDFS/S3:
+ Simple platform
+ Can be fully hosted
- Much slower
- Possibly more coding required, less maintainable (Java/Pig/Hive)
- Less future oriented

Spark + HDFS/S3:
+ More advanced platform
+ Can be fully hosted
+ Possibly less coding thanks to Scala
+ ML enabled (SparkML, Python), future oriented
+ MessageBroker enabled

Spark + Cassandra:
+ The state of the art for BigData systems
+ Might not need a message broker (can easily withstand 100k’s of inserts/sec)
+ Amazing future possibilities
- Can not (yet) be hosted
- Possibly still needs HDFS/S3
Spark on Amazon

[Comparison of three Spark-on-Amazon options (identified by vendor logos in the original slide):]

Option 1:
• Deployment only
• ~132$/month license per 8-core/64GB RAM
• No spot instances
• No support included
• SSH access
• Self customizable
• Zeppelin notebook
• Deployment only, so it requires a lot of system/IT/Unix-related knowledge to get going

Option 2:
• Out of the box platform
• ~500$/month license per 8-core/64GB RAM (min 5)
• Spot instances allowed
• Support with debugging
• SSH access (new in 2017)
• Limited customization
• DataBricks notebook
• All you need for Spark + amazing support; a little pricey but it ‘just works’ and is worth it

Option 3:
• Out of the box platform
• ~170$/month license per 8-core/64GB RAM
• Spot instances allowed
• Platform support
• SSH access
• Support in customization
• Zeppelin notebook
• Cheap and fully customizable platform, needs more low-level knowledge
