Drill Slides
Tomer Shiran
Latency Matters
Ad-hoc analysis with interactive tools
Real-time dashboards
Event/trend detection
Network intrusions
Fraud
Failures
Programming model: MapReduce
Users: Developers
Google project: MapReduce
Open source project: Hadoop MapReduce
GOOGLE DREMEL
Google Dremel
Interactive analysis of large-scale datasets
Trillion records at interactive speeds
Complementary to MapReduce
Used by thousands of Google employees
Paper published at VLDB 2010
Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Model
Nested data model with schema
Most data at Google is stored/transferred in Protocol Buffers
Normalization (to relational) is prohibitive
Implementation
Column-based storage and processing
In-situ data access (GFS and Bigtable)
Tree architecture as in Web search (and databases)
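The two implementation ideas above, column-based storage and a tree of servers, can be illustrated with a small sketch. This is not Dremel's actual code: the function names (`to_columns`, `leaf_partial_sum`, `root_merge`), the two shards, and the "Ana" record are hypothetical, and the sketch assumes flat records that all share the same fields. Each leaf scans only the one column it needs and returns a partial result, and the root merges the partials.

```python
from collections import defaultdict


def to_columns(records):
    """Stripe flat row records into per-field columns (one list per field)."""
    columns = defaultdict(list)
    for record in records:
        for field, value in record.items():
            columns[field].append(value)
    return dict(columns)


def leaf_partial_sum(columns, field):
    """Leaf server: scan a single column and return a partial (sum, count)."""
    values = columns.get(field, [])
    return sum(values), len(values)


def root_merge(partials):
    """Root server: combine the partial (sum, count) pairs from all leaves."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total, count


# Two hypothetical shards, each held by a different leaf server.
shard_a = [{"name": "Tomer", "followers": 100}, {"name": "Maya", "followers": 200}]
shard_b = [{"name": "Ana", "followers": 50}]

partials = [leaf_partial_sum(to_columns(s), "followers") for s in (shard_a, shard_b)]
total, count = root_merge(partials)
print(total, count)
```

Because the leaf touches only the `followers` column, the `name` values are never read, which is the main benefit of column-based storage for this kind of aggregate.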
Google BigQuery
Hosted Dremel (Dremel as a Service)
CLI (bq) and Web UI
Import data from Google Cloud Storage or local files
Files must be in CSV format
Nested data not supported [yet] except built-in datasets
APACHE DRILL
Avro IDL
enum Gender { MALE, FEMALE }
record User {
  string name;
  Gender gender;
  long followers;
}
JSON
{ "name": "Tomer", "gender": "Male", "followers": 100 }
{ "name": "Maya", "gender": "Female", "followers": 200, "zip": "94305" }
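The contrast between the schema-based Avro record and the schema-less JSON documents can be made concrete with a short sketch. `USER_SCHEMA` and `check_against_schema` are hypothetical names, not part of Avro or Drill, and matching enum symbols case-insensitively is an assumption made only so that "Male" lines up with MALE. The sketch reports what a reader holding the User schema would notice in each JSON document.

```python
import json

# Hypothetical Python mirror of the Avro User record: field name -> expected type.
USER_SCHEMA = {"name": str, "gender": str, "followers": int}
GENDERS = {"MALE", "FEMALE"}


def check_against_schema(record):
    """Return a list of problems found when reading `record` with USER_SCHEMA."""
    problems = []
    for field, expected_type in USER_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    for field in record:
        if field not in USER_SCHEMA:
            problems.append(f"field not in schema: {field}")
    gender = record.get("gender")
    # Assumption: enum symbols are compared case-insensitively.
    if isinstance(gender, str) and gender.upper() not in GENDERS:
        problems.append(f"unknown enum symbol: {gender}")
    return problems


docs = [
    json.loads('{"name": "Tomer", "gender": "Male", "followers": 100}'),
    json.loads('{"name": "Maya", "gender": "Female", "followers": 200, "zip": "94305"}'),
]
results = [check_against_schema(doc) for doc in docs]
for doc, problems in zip(docs, results):
    print(doc["name"], problems)
```

The first document fits the schema, while the second carries a `zip` field the schema never declared. A schema-less engine has to accept that extra field as it appears in the data.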
DrQL Example
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
Name
  Cnt: 0
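To see how the query arrives at its result, here is a rough Python sketch that hand-evaluates this one query over this one record. It is not a DrQL engine. `run_query` is a hypothetical helper, the nested record is modelled as plain dicts and lists, and the pruning rule (a Name with no Url fails the REGEXP condition and is dropped) is my reading of the example.

```python
import re

# The input record from the example, modelled as nested dicts and lists.
record = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}


def run_query(rec):
    """Hand-written evaluation of the example query for a single record."""
    # WHERE ... AND DocId < 20
    if not rec["DocId"] < 20:
        return None
    names = []
    for name in rec.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): a Name without a Url is pruned.
        if url is None or not re.search(r"^http", url):
            continue
        codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
        # COUNT(Name.Language.Code) WITHIN Name: one count per Name.
        out = {"Cnt": len(codes)}
        if codes:
            # Name.Url + ',' + Name.Language.Code: one Str per Language.
            out["Language"] = [{"Str": url + "," + code} for code in codes]
        names.append(out)
    return {"Id": rec["DocId"], "Name": names}


result = run_query(record)
print(result)
```

The third Name has no Url, so it is dropped. The second Name passes the filter but has no Language, so it contributes only `Cnt: 0`.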
Data Flow
Architecture
Nested query languages
Pluggable model
DrQL
Mongo Query Language
Design Principles
Flexible
Pluggable query languages
Extensible execution engine
Pluggable data formats
Column-based and row-based
Schema and schema-less
Pluggable data sources
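One way to picture "pluggable data formats" is a registry that maps a format name to a reader function, so that new formats can be added without touching the code that scans data. The sketch below is purely illustrative and does not reflect Drill's real plug-in interfaces. `FORMAT_READERS`, `register_format`, and `scan` are hypothetical names.

```python
import csv
import io
import json

# Hypothetical registry: format name -> function that turns text into rows.
FORMAT_READERS = {}


def register_format(name):
    """Decorator that plugs a reader function into the registry under `name`."""
    def decorator(reader):
        FORMAT_READERS[name] = reader
        return reader
    return decorator


@register_format("json-lines")
def read_json_lines(text):
    """Schema-less reader: one JSON document per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


@register_format("csv")
def read_csv(text):
    """Row-based reader: the header row names the fields; values stay strings."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def scan(format_name, text):
    """The caller only names a format; the registry supplies the reader."""
    return FORMAT_READERS[format_name](text)


rows_json = scan("json-lines", '{"name": "Tomer", "followers": 100}\n'
                               '{"name": "Maya", "followers": 200}')
rows_csv = scan("csv", "name,followers\nTomer,100\nMaya,200\n")
print(rows_json)
print(rows_csv)
```

Under this design, adding a column-based format would mean registering one more reader, and `scan` would stay unchanged.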
Easy
Unzip and run
Zero configuration
Reverse DNS not needed
IP addresses can change
Clear and concise log messages
Dependable
No single point of failure (SPOF)
Instant recovery from crashes
Fast
C/C++ core with Java support
Min latency and max throughput (limited only by hardware)
Full column-based data support including operators
Hadoop Integration
Hadoop data sources
Hadoop FileSystem API (HDFS/MapR-FS)
HBase
MapReduce-based tools to create column-based formats
Hive-based query language and optimizer
Table registry in HCatalog
Run long-running services in YARN