How To Create A Pipeline Capable of Processing 2.5 Billion Records/day
in just 3 Months
Josef Habdank
Lead Data Scientist & Data Platform Architect
at Infare Solutions
@jahabdank
jha@infare.com
linkedin.com/in/jahabdank
You are in the right place!
Spark is currently the de facto standard for BigData
• Using Spark is currently the most fundamental skill in the BigData and
Data Science world
• Main reason: it helps to solve most BigData problems:
processing, transformation, abstraction and Machine Learning
Pretty much all serious BigData players are using Spark
2013-2015 was insane
“Let's do that in 3 months!”
The Result
• 5 low-cost servers (8 cores, 64 GB RAM each)
• Located on Amazon, with hosted Apache Spark
• A fraction of the cost of any other technology
• Initial max capacity load-tested at 2.5bn records/day
• Currently improved to a max capacity of 6-8bn/day, ~250-350 million/hour
(with no extra hardware required)
• As Spark scales with hardware, we could do 15bn with 10-15 machines
Conditional Shuffle
[Diagram: on each node, serialized rows (0x00[…]) go through a JVM -> Python serde, the map runs on Python objects (ObjA, ObjB, ObjC), then a Python -> JVM serde sends the results back]
• All rows will be serialized between the JVM and Python
• There are exceptions
• Catalyst and code generation cannot do this for RDD map operations, nor for
custom UDFs in DataFrames:
df3 = df2 \
    .withColumn("new_column2", custom_function(df2.new_column1))
• PRO TIP: avoid using RDDs, as Spark will serialize whole objects. For UDFs
it will only serialize a few columns, and will do so in a very efficient way
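Not from the original slides: a minimal Scala rendering of the PRO TIP above (the snippet above is PySpark). The column names and customFunction are illustrative. The DataFrame UDF only receives the referenced column, while the RDD map forces Spark to hand whole Row objects to the closure.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.appName("udf-vs-rdd").getOrCreate()
import spark.implicits._

val df2 = Seq("a", "b", "c").toDF("new_column1")

// Preferred: a DataFrame UDF - only new_column1 is serialized into the function
val customFunction = udf((s: String) => s.toUpperCase)
val df3 = df2.withColumn("new_column2", customFunction($"new_column1"))

// Avoid: an RDD map - whole Row objects are materialized for the closure
val upper = df2.rdd.map(row => row.getString(0).toUpperCase)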
N bn Data Platform design
What Infare does
[Diagram: scalable Data Collection services feed a scalable Data Processing Framework, backed by fast-access temporary storage and low-cost permanent storage]
[Architecture diagram: Data Collection services -> Data Streamer -> Message Broker (Kinesis) -> micro batches into Temporary Storage (Avro blobs compressed with Snappy, then uncompressed Parquet micro batches) -> mini-batch/batch preaggregation -> Permanent Storage Data Warehouse on S3 (partitioned, aggregated Parquet) serving offline analytics, real-time analytics and Data Science; a Monitoring System collects monitoring/stats across the pipeline]
Did it work?
[Same architecture diagram, annotated: S3 has latency and is inconsistent for deletes]
Why DynamoDB was a failure: Spark’s Parallelize hell
[Same architecture diagram, annotated: the new Kafka 0.10.2 has great streaming support; Avro blobs compressed with Snappy; Parquet/ORC in the Data Warehouse]
Getting the maximum out of Kinesis/Kafka
[Diagram: Data Collection services feed a Data Streamer into the Message Broker]
[Code fragment: the incoming stream is wrapped in Try and split into errors and successes, each processed as a stream; the full listing is on the Tungsten slide]
• NOTE: cache() should be used cautiously
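Not on the original slide: a hedged sketch of a Spark Streaming direct consumer for Kafka 0.10, producing the kstream used in the Tungsten example later on. It assumes a SparkSession named spark; the broker address, topic and group id are made up.

import org.apache.kafka.common.serialization.ByteArrayDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",                 // hypothetical broker
  "key.deserializer"   -> classOf[ByteArrayDeserializer],
  "value.deserializer" -> classOf[ByteArrayDeserializer],
  "group.id"           -> "ingest")
// kstream carries the raw Avro blobs pushed by the Data Streamer
val kstream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte]](
    ssc, PreferConsistent,
    Subscribe[Array[Byte], Array[Byte]](Seq("avro-blobs"), kafkaParams))
  .map(_.value)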
To cache or not to cache
[Diagram: Job1 runs Step1 -> Step2 -> Step3; after Step2 the data is serialized into the cache; Job2 deserializes it and continues from its Step1]
• Cache is the most abused function in Spark
• It is NOT (!) a simple storing of a pointer to a collection in the memory of the process
• It is a SERIALISATION of the data to a BLOB, which is stored in the cluster's shared memory
• When reusing data which was cached, it has to be deserialized from the BLOB
• By default it uses the generic Java serializer
Standard Spark Streaming Scenario:
• Incoming Avro stream
• Step 1) Stats computation, storing stats
• Step 2) Storing data from the stream
Question: Will caching make it faster?
[Diagram, without cache: Message Broker with Avro -> Deserialize Avro -> Avro objects -> Compute Stats -> Save Stats, plus a second Deserialize Avro branch -> Save Data]
[Diagram, with cache: Message Broker with Avro -> Deserialize Avro -> cache of Avro objects -> Compute Stats -> Save Stats, and cache -> Save Data]
Question: Will caching make it faster? Nope
[Same diagrams: the pipeline without cache is marked Faster. The cached path really runs Deserialize Avro -> Serialize (Java) -> cache -> Deserialize (Java) -> Compute Stats / Save Data, so the cache adds a generic Java serde round trip on top of the Avro deserialization]
To cache or not to cache
[Same Job1/Job2 cache diagram as before]
• Cache is the most abused function in Spark
• It is NOT (!) a simple storing of a pointer to a collection in the memory of the process
• It is a SERIALISATION of the data to a BLOB, which is stored in the cluster's shared memory
• When reusing data which was cached, it has to be deserialized from the BLOB
• By default it uses the generic Java serializer, which is SLOW
• Even super-fast serdes like Kryo are much slower, as they are generic
(the serializer does not know the type at compile time)
• Avro is amazingly fast, as it is a specific serializer (it knows the type)
• Often you will be quicker reprocessing the data from your source than
using the cache, especially for complex objects (pure strings/byte arrays cache fast)
• Caching is faster with the DataFrames/Tungsten API, but even then it might be
slower than reprocessing
• PRO TIP: when using cache, make sure it actually helps. And monitor
CPU consumption too
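A minimal sketch of that PRO TIP, assuming a SparkSession named spark and a hypothetical input path: time a cached second pass against simply re-reading from the source, and keep whichever wins (watching CPU while it runs).

import org.apache.spark.storage.StorageLevel

def timed[T](label: String)(block: => T): T = {
  val t0 = System.nanoTime
  val result = block
  println(f"$label: ${(System.nanoTime - t0) / 1e9}%.1f s")
  result
}

val input = "s3://bucket/some-data"   // hypothetical location

// Variant A: cache (serialize into the cluster's shared memory, deserialize on reuse)
val cached = spark.read.parquet(input).persist(StorageLevel.MEMORY_ONLY_SER)
timed("first pass (fills cache)")  { cached.count() }
timed("second pass (from cache)")  { cached.count() }
cached.unpersist()

// Variant B: reprocess from the source both times
timed("re-read from source")       { spark.read.parquet(input).count() }
timed("re-read from source again") { spark.read.parquet(input).count() }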
Tungsten Memory Manager + Row Serializer
• Introduced in Spark 1.5, used in DataFrames/DataSets
• Stores data in memory as readable binary blobs, not as Java objects
• Since Spark 2.0 the blobs are in columnar format (much better compression)
• Does some black magic wizardry with the L1/L2/L3 CPU cache
• Much faster: 10x-100x faster than RDDs
• Rule of thumb: always use DataFrames when possible and you will get Tungsten

case class MyRow(val col: Option[String], val exception: Option[String])

kstream
  .repartition(concurency)
  .foreachRDD(objects => {
    val res = objects.map(b => {
      Try([frequent error]).transform(
        { b => Success(MyRow(b, None)) },
        { e => Success(MyRow(None, e.getMessage)) }
      ).get
    }).toDF()
    res.cache()
    res.select("exception")
      .filter("exception is not null")
      .[process errors as stream]
    res.select("col")
      .filter("exception is null")
      .[process successes as stream]
  })
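A tiny sketch of the rule of thumb, with a made-up Fare case class and a SparkSession named spark: moving from an RDD to a Dataset is all it takes to get Tungsten's binary row format and code generation.

import spark.implicits._

case class Fare(route: String, price: Double)   // hypothetical schema

val rdd = spark.sparkContext.parallelize(Seq(Fare("CPH-LHR", 120.0), Fare("CPH-LHR", 135.0)))
val ds  = rdd.toDS()                            // rows are now stored as Tungsten binary blobs
ds.groupBy("route").avg("price").show()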
Data Warehouse and Messaging
Data Storage: Row Store vs Column Store
• What databases are: collections of objects
• Main difference:
• A row store requires only one row at a time to serialize
• A column store requires a batch of data to serialize
• Serialization:
• A row store can serialize online (rows can be appended to the binary
buffer as they come into the serializer)
• A column store requires the whole batch to be present at the moment of
serialization, so the data can be processed (index creation, sorting,
duplicate removal etc.)
• Reading:
• A row store always reads all data from a file
• A column store allows reading only selected columns
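A small sketch of the Reading point, assuming the (at the time external) spark-avro package is on the classpath; paths and column names are made up. Selecting one column from the Parquet copy reads only that column, while the Avro (row store) copy still has to be scanned row by row.

import spark.implicits._

val events = Seq(("CPH-LHR", 120.0, "2016-03-10")).toDF("route", "price", "day")

events.write.parquet("/tmp/events_parquet")                                  // column store
events.write.format("com.databricks.spark.avro").save("/tmp/events_avro")    // row store

spark.read.parquet("/tmp/events_parquet").select("price")                    // touches only the price column
spark.read.format("com.databricks.spark.avro").load("/tmp/events_avro").select("price")  // whole rows are read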
JSON/CSV Row Store
Pros:
• Human readable (do not underestimate that)
• No dev time required
• Compression algorithms work very well on ASCII text
(compressed CSV is 'only' 2x larger than compressed Avro)
Cons:
• Large (CSV) and very large (JSON) volumes
• Slow serialization/deserialization
Avro Row Store
Cons:
• Hard to debug: once a BLOB is corrupt it is very hard to find out what went wrong
Avro Scala classes
• Avro serialization/deserialization requires an Avro contract compatible with Apache Avro

class ObjectModelV1Avro (
  var dummy_id: Long,
  var jobgroup: Int,
  var instance_id: Int,
  [...]
) extends SpecificRecordBase with SpecificRecord {
  def get(field: Int): AnyRef = { field match {
      case pos if pos == 0 => { dummy_id }.asInstanceOf[AnyRef]
      [...]
    }
  }
}
Avro real-time serialization
• Apache Avro allows serializing on the fly, row by row
• The incoming data stream can be serialized on the fly into a binary buffer

// C# code
byte[] raw;
using (var ms = new MemoryStream())
{
    using (var dfw = DataFileWriter<T>.OpenWriter(_datumWriter, ms, _codec))
    {
        // can be yielded
        microBatch.ForEach(dfw.Append);
    }
    raw = ms.ToArray();
}

// Scala
val writer: DataFileWriter[T] = new DataFileWriter[T](datumWriter)
writer.create(objs(0).getSchema, outputStream)
// can be streamed
for (obj <- objs) {
  writer.append(obj)
}
writer.close
val encodedByteArray: Array[Byte] = outputStream.toByteArray
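A self-contained variant of the Scala column above, as a sketch only: the helper name serializeMicroBatch and its signature are assumptions, but the calls are the standard Avro DataFileWriter API.

import java.io.ByteArrayOutputStream
import org.apache.avro.file.DataFileWriter
import org.apache.avro.specific.{SpecificDatumWriter, SpecificRecordBase}

def serializeMicroBatch[T <: SpecificRecordBase](objs: Seq[T]): Array[Byte] = {
  val outputStream = new ByteArrayOutputStream()
  val datumWriter  = new SpecificDatumWriter[T](objs.head.getSchema)
  val writer       = new DataFileWriter[T](datumWriter)
  writer.create(objs.head.getSchema, outputStream)   // writes the schema header
  objs.foreach(writer.append)                        // rows are appended as they arrive
  writer.close()
  outputStream.toByteArray
}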
Parquet Column Store
Pros:
• Meant for large data sets
• Searchable by single columns
• Compressed (duplicates eliminated etc.)
• Contains in-file stats and metadata, located in the TAIL
• Very well supported in Spark:
• predicate/filter pushdown
• VECTORIZED READER AND TUNGSTEN INTEGRATION
(5-10x faster than the Java Parquet library)
Cons:
• Not indexed

spark
  .read
  .parquet(dwLoc)
  .filter('col1 === "text")
  .select("col2")
More on predicate/filter pushdown
[Diagram: the Apache Spark DataFrame makes a pushdown request to the Parquet/ORC API; the Parquet/ORC driver executes the pushed request on the binary data in Storage]
• Processing is separate from the storage
• Predicate/filter pushdown gives Spark a uniform way to push the query down to the source
• Spark remains oblivious to how the driver executes the query; it only cares
whether the driver can or cannot execute the pushdown request
• If the driver cannot execute the request, Spark will load all the data and filter it in Spark
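A quick, hedged way to check that the pushdown actually happened, reusing dwLoc and the column names from the Parquet slide: explain() should list the filter under PushedFilters in the Parquet scan node.

spark
  .read
  .parquet(dwLoc)
  .filter('col1 === "text")
  .select("col2")
  .explain()
// The physical plan should contain something like:
//   PushedFilters: [IsNotNull(col1), EqualTo(col1,text)]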
ORC Column Store
Cons:
• No vectorized reader; rumors about adding one from Spark 2.2.
If this turns out to be true, then ORC should be faster
DataWarehouse building
• Think of it as a collection of read-only files
• Recommended: use Parquet/ORC files in a folder structure
(aim at 100-1000 MB files, use coalesce)
• Folders are partitions
• Spark supports append for Parquet/ORC
• Compression:
• Use Snappy (decompression speed ~500 MB/sec per core)
• Gzip (decompression speed ~60 MB/sec per core)
• Note: Snappy is not splittable, keep files under 1 GB
• Ordering: if you can (often you cannot), order your data, as
columnar deduplication will then work better
• In our case this saves 50% of space, and thus 50% of reading time

df
  .coalesce(fileCount)
  .write
  .option("compression", "snappy")
  .mode("append")
  .partitionBy(
    "year",
    "month",
    "day")
  //.orderBy("some_column")
  .parquet(outputLoc)
  //.orc(outputLoc)
DataWarehouse query execution
dwFolder/year=2016/month=3/day=10/part-[ManyFiles]
dwFolder/year=2016/month=3/day=11/part-[ManyFiles]
[...]
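Given that folder layout, a hedged sketch of how a query touches only what it needs (column names reuse the earlier examples): filtering on the partition columns prunes whole year=/month=/day= folders before any file is opened, and the remaining filter is pushed down inside the Parquet files.

spark
  .read
  .parquet("dwFolder")
  .filter('year === 2016 && 'month === 3 && 'day === 10)   // partition pruning: only that folder is listed
  .filter('col1 === "text")                                // predicate pushdown inside the files
  .select("col2")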
Spark on Amazon

Feature                      | Option 1             | Option 2                  | Option 3
Deployment model             | Deployment only      | Out-of-the-box platform   | Out-of-the-box platform
License per 8core/64GB RAM   | ~$132/month          | ~$500/month (min 5)       | ~$170/month
Spot instances               | No spot instances    | Spot instances allowed    | Spot instances allowed
Support                      | No support included  | Support with debugging    | Platform support
SSH access                   | SSH access           | SSH access (new in 2017)  | SSH access
Customization                | Self-customizable    | Limited customization     | Support in customization
Notebook                     | Zeppelin             | DataBricks Notebook       | Zeppelin

Summary:
• Option 1: deployment only, so it requires a lot of system/IT/Unix-related knowledge to get going
• Option 2: all you need for Spark + amazing support; a little pricey, but it 'just works' and is worth it
• Option 3: cheap and fully customizable platform; needs more low-level knowledge