
Drill Slides

Apache Drill is an open source interactive analysis system designed to enable interactive analysis of large-scale datasets. It uses a column-based query engine and supports interactive query speeds on datasets with trillions of records. Drill supports a SQL-like query language called DrQL and can query data stored in files, HDFS, HBase and other data sources using a flexible, pluggable architecture.

Uploaded by Ritu Nathan
Copyright: © Attribution Non-Commercial (BY-NC)

Apache Drill

Interactive Analysis of Large-Scale Datasets

Tomer Shiran

Latency Matters
Ad-hoc analysis with interactive tools
Real-time dashboards
Event/trend detection
  Network intrusions
  Fraud
  Failures

Big Data Processing


                     Batch processing   Interactive analysis      Stream processing
Query runtime        Minutes to hours   Milliseconds to minutes   Never-ending
Data volume          TBs to PBs         GBs to PBs                Continuous stream
Programming model    MapReduce          Queries                   DAG
Users                Developers         Analysts and developers   Developers
Google project       MapReduce          Dremel                    -
Open source project  Hadoop MapReduce   -                         Storm and S4

Introducing Apache Drill

GOOGLE DREMEL

Google Dremel
Interactive analysis of large-scale datasets
  Trillion records at interactive speeds
  Complementary to MapReduce
  Used by thousands of Google employees
  Paper published at VLDB 2010
    Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis

Model
Nested data model with schema
  Most data at Google is stored/transferred in Protocol Buffers
  Normalization (to relational) is prohibitive
SQL-like query language with nested data support

Implementation
Column-based storage and processing
In-situ data access (GFS and Bigtable)
Tree architecture as in Web search (and databases)
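
To make the column-based idea concrete, here is a minimal Python sketch. It is not Dremel's storage format, which also has to encode nesting and repetition; it only pivots a few flat records into one list per field and then aggregates over a single column. The sample records and the `to_columns` helper are hypothetical.

```python
from collections import defaultdict

def to_columns(records):
    """Pivot a list of flat records (dicts) into one list of values per field.

    Assumes every record carries the same fields, so position i in each
    column belongs to record i.
    """
    columns = defaultdict(list)
    for record in records:
        for field, value in record.items():
            columns[field].append(value)
    return dict(columns)

# Hypothetical sample data, used only for illustration.
records = [
    {"url": "http://A", "country": "us", "clicks": 3},
    {"url": "http://B", "country": "gb", "clicks": 5},
    {"url": "http://C", "country": "us", "clicks": 7},
]

columns = to_columns(records)

# An aggregate over one field reads only that column; the "url" and
# "country" values are never touched.
total_clicks = sum(columns["clicks"])
print(total_clicks)
```

With a row-based layout the same sum would have to walk every full record, which is the cost a column layout is meant to avoid.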

Google BigQuery
Hosted Dremel (Dremel as a Service)
CLI (bq) and Web UI
Import data from Google Cloud Storage or local files
  Files must be in CSV format
Nested data not supported [yet] except built-in datasets
Schema definition required

APACHE DRILL

Nested Data Model


The data model in Dremel is Protocol Buffers
  Nested
  Schema

Apache Drill is designed to support multiple data models
  Schema: Apache Avro, Protocol Buffers, ...
  Schema-less: JSON, BSON, ...

Flat records are supported as a special case of nested data
  CSV, TSV, ...
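
A small Python sketch of what schema-less and flat mean in practice. It parses the two records from this slide's JSON example with the standard `json` module; the `is_flat` helper is hypothetical and is not part of Drill.

```python
import json

# The two records from the slide's JSON example, one per line.
lines = [
    '{"name": "Tomer", "gender": "Male", "followers": 100}',
    '{"name": "Maya", "gender": "Female", "followers": 200, "zip": "94305"}',
]
users = [json.loads(line) for line in lines]

# Schema-less: the set of fields is discovered from the data, not declared.
all_fields = sorted({field for user in users for field in user})

# A field missing from a record is read as None instead of raising an error.
zips = [user.get("zip") for user in users]

def is_flat(record):
    """A flat record is a nested record whose values are all scalars."""
    return not any(isinstance(value, (dict, list)) for value in record.values())

print(all_fields)
print(zips)
print(all(is_flat(user) for user in users))
```

Only the second record has a `zip` field, which a fixed schema such as the Avro `User` record would have to declare in advance.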

Avro IDL
  enum Gender { MALE, FEMALE }
  record User {
    string name;
    Gender gender;
    long followers;
  }

JSON
  { "name": "Tomer", "gender": "Male", "followers": 100 }
  { "name": "Maya", "gender": "Female", "followers": 200, "zip": "94305" }

Nested Query Languages


DrQL
  SQL-like query language for nested data
  Compatible with Google BigQuery/Dremel
    BigQuery applications should work with Drill
  Designed to support efficient column-based processing
    No record assembly during query processing

Mongo Query Language
  {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

Other languages/programming models can plug in
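
To show what the Mongo-style query on this slide asks for, here is a rough Python sketch that evaluates it over an in-memory list of documents. It handles only equality filters and simple ordering; real MongoDB supports many more operators. The `run_query` function and the sample documents are hypothetical.

```python
def run_query(documents, spec):
    """Apply equality filters from spec["$query"], then sort by spec["$orderby"].

    Assumes every matched document contains the sort fields.
    """
    criteria = spec.get("$query", {})
    matched = [
        doc for doc in documents
        if all(field in doc and doc[field] == value
               for field, value in criteria.items())
    ]
    # Sort by the last key first; Python's sort is stable, so the first
    # key ends up with the highest priority. 1 is ascending, -1 descending.
    for field, direction in reversed(list(spec.get("$orderby", {}).items())):
        matched.sort(key=lambda doc: doc[field], reverse=(direction == -1))
    return matched

# Hypothetical documents, used only for illustration.
documents = [
    {"x": 3, "y": "abc", "z": 2},
    {"x": 3, "y": "xyz", "z": 1},
    {"x": 1, "y": "abc", "z": 5},
    {"x": 3, "y": "abc", "z": 1},
]

# The query from the slide, written as a Python dict.
spec = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
result = run_query(documents, spec)
print(result)
```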

DrQL Example
Input record:
  DocId: 10
  Links
    Forward: 20
    Forward: 40
    Forward: 60
  Name
    Language
      Code: 'en-us'
      Country: 'us'
    Language
      Code: 'en'
    Url: 'http://A'
  Name
    Url: 'http://B'
  Name
    Language
      Code: 'en-gb'
      Country: 'gb'

Query:
  SELECT DocId AS Id,
    COUNT(Name.Language.Code) WITHIN Name AS Cnt,
    Name.Url + ',' + Name.Language.Code AS Str
  FROM t
  WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Result:
  Id: 10
  Name
    Cnt: 2
    Language
      Str: 'http://A,en-us'
      Str: 'http://A,en'
  Name
    Cnt: 0

* Example from the Dremel paper
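
The following Python sketch walks through the example by hand, which may help in reading the result. It is a simplified illustration for this one query, not how Dremel or Drill evaluate DrQL: the record is written as nested dicts and lists, and the `evaluate` function is hypothetical. For brevity the `Str` values are collected in a list under each `Name` rather than under nested `Language` groups.

```python
import re

# The input record from the example, as nested Python dicts and lists.
record = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}

def evaluate(rec):
    """Hand-written evaluation of the example query for a single record."""
    # WHERE ... DocId < 20
    if not rec["DocId"] < 20:
        return None
    names = []
    for name in rec.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): a Name with no matching Url is dropped.
        if url is None or not re.search(r"^http", url):
            continue
        codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
        names.append({
            "Cnt": len(codes),                           # COUNT(...) WITHIN Name
            "Str": [url + "," + code for code in codes],  # Name.Url + ',' + Code
        })
    return {"Id": rec["DocId"], "Name": names}

result = evaluate(record)
print(result)
```

The third `Name` has no `Url`, so the filter removes it; the second has a `Url` but no `Language`, which is why it appears with a count of 0 and no `Str` values.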

Data Flow

Architecture
Nested query languages
  Pluggable model
  DrQL
  Mongo Query Language

Distributed execution engine
  Extensible model (e.g., Dryad)
  Low-latency
  Fault tolerant
  Column-based and row-based processing

Nested data formats
  Pluggable model
  Column-based (Dremel, AVRO-806/Trevni, RCFile) and row-based (Protocol Buffers, Avro, JSON, BSON, CSV)
  Schema (Protocol Buffers/Dremel, Avro/AVRO-806/Trevni, CSV) and schema-less (JSON, BSON)

Scalable data sources
  Pluggable model
  Hadoop
  NoSQL
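
One way to picture the pluggable data format layer is a registry of readers that all produce the same record shape. The Python sketch below is an assumption about how such a plug-in point could look, not Drill's actual interface; `FORMAT_READERS`, `register_format`, and `scan` are hypothetical names.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator

Record = dict
Reader = Callable[[str], Iterator[Record]]

# Hypothetical registry mapping a format name to a reader function.
FORMAT_READERS: Dict[str, Reader] = {}

def register_format(name: str):
    """Decorator that plugs a reader into the registry under the given name."""
    def decorator(reader: Reader) -> Reader:
        FORMAT_READERS[name] = reader
        return reader
    return decorator

@register_format("json")
def read_json_lines(text: str) -> Iterator[Record]:
    """Schema-less, possibly nested records: one JSON object per line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)

@register_format("csv")
def read_csv(text: str) -> Iterator[Record]:
    """Flat records: the header row names the fields, values stay strings."""
    for row in csv.DictReader(io.StringIO(text)):
        yield dict(row)

def scan(format_name: str, text: str) -> list:
    """Read all records through whichever reader is registered for the format."""
    return list(FORMAT_READERS[format_name](text))

csv_rows = scan("csv", "name,followers\nTomer,100\n")
json_rows = scan("json", '{"name": "Maya", "zip": "94305"}\n')
print(csv_rows)
print(json_rows)
```

Because every reader yields plain records, the code that calls `scan` does not need to know which format the data came from, and a new format only needs one more registered function.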

Design Principles
Flexible
  Pluggable query languages
  Extensible execution engine
  Pluggable data formats
    Column-based and row-based
    Schema and schema-less
  Pluggable data sources

Easy
  Unzip and run
  Zero configuration
  Reverse DNS not needed
  IP addresses can change
  Clear and concise log messages

Dependable
  No SPOF
  Instant recovery from crashes

Fast
  C/C++ core with Java support
  Min latency and max throughput (limited only by hardware)
  Full column-based data support including operators

Hadoop Integration
Hadoop data sources
  Hadoop FileSystem API (HDFS/MapR-FS)
  HBase

Hadoop data formats
  Apache Avro
  RCFile

MapReduce-based tools to create column-based formats
Hive-based query language and optimizer
Table registry in HCatalog
Run long-running services in YARN
