Drill Slides

Apache Drill is an open source system for interactive analysis of large-scale datasets. Modeled on Google Dremel, it uses a column-based query engine and targets interactive query speeds on datasets with trillions of records. Drill supports a SQL-like query language called DrQL and can query data stored in files, HDFS, HBase, and other data sources through a flexible, pluggable architecture.

Uploaded by

Ritu Nathan


Apache Drill

Interactive Analysis of Large-Scale Datasets

Tomer Shiran

Latency Matters
Ad-hoc analysis with interactive tools
Real-time dashboards
Event/trend detection
  Network intrusions
  Fraud
  Failures

Big Data Processing

Batch processing
  Query runtime: Minutes to hours
  Data volume: TBs to PBs
  Programming model: MapReduce
  Users: Developers
  Google project: MapReduce
  Open source project: Hadoop MapReduce

Interactive analysis
  Query runtime: Milliseconds to minutes
  Data volume: GBs to PBs
  Programming model: Queries
  Users: Analysts and developers
  Google project: Dremel

Stream processing
  Query runtime: Never-ending
  Data volume: Continuous stream
  Programming model: DAG
  Users: Developers
  Open source project: Storm and S4

Introducing Apache Drill

GOOGLE DREMEL

Google Dremel
Interactive analysis of large-scale datasets
  Trillion records at interactive speeds
  Complementary to MapReduce
  Used by thousands of Google employees
  Paper published at VLDB 2010
    Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis

Model
Nested data model with schema
  Most data at Google is stored/transferred in Protocol Buffers
  Normalization (to relational) is prohibitive
SQL-like query language with nested data support

Implementation
Column-based storage and processing
In-situ data access (GFS and Bigtable)
Tree architecture as in Web search (and databases)
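
As a rough illustration of why column-based storage helps interactive queries, the sketch below pivots a few flat records into per-field columns and aggregates over one of them. The records and the `to_columns` helper are hypothetical, and the sketch covers flat records only; Dremel's actual striping of nested data (with repetition and definition levels) is not shown here.

```python
# Illustrative flat records (hypothetical values, not from the slides).
rows = [
    {"doc_id": 10, "country": "us", "clicks": 3},
    {"doc_id": 20, "country": "gb", "clicks": 5},
    {"doc_id": 30, "country": "us", "clicks": 7},
]


def to_columns(records):
    """Pivot flat records into one list per field (column-oriented layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads a single column; the other fields
# are never touched and no full record is reassembled.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 15
```

In a row-based layout the same aggregate would have to read every field of every record, which is the cost the column-based design is meant to avoid.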

Google BigQuery
Hosted Dremel (Dremel as a Service)
CLI (bq) and Web UI
Import data from Google Cloud Storage or local files
  Files must be in CSV format
  Nested data not supported [yet] except built-in datasets
  Schema definition required

APACHE DRILL

Nested Data Model

The data model in Dremel is Protocol Buffers
  Nested
  Schema

Apache Drill is designed to support multiple data models
  Schema: Apache Avro, Protocol Buffers, ...
  Schema-less: JSON, BSON, ...

Flat records are supported as a special case of nested data
  CSV, TSV, ...
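
One way to read "flat records are a special case of nested data" is that a single traversal can handle both. The sketch below walks a record and yields dotted field paths (in the style of `Name.Language.Code`) with their leaf values. The `leaf_paths` helper and the sample records are hypothetical and only illustrate the idea; they are not part of Drill.

```python
def leaf_paths(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        # Repeated fields share one path; each element contributes a value.
        for child in value:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix, value


# A flat record (for example, one CSV row) is a nested record of depth one.
flat = {"name": "Tomer", "followers": 100}

# A nested record with a sub-record and a repeated field (hypothetical).
nested = {"name": "Maya", "address": {"zip": "94305"}, "tags": ["a", "b"]}

print(list(leaf_paths(flat)))
print(list(leaf_paths(nested)))
```

For the flat record the paths are just the column names, so code written against nested paths works unchanged on CSV-style input.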

Avro IDL
enum Gender {
  MALE, FEMALE
}
record User {
  string name;
  Gender gender;
  long followers;
}

JSON
{
  "name": "Tomer",
  "gender": "Male",
  "followers": 100
}
{
  "name": "Maya",
  "gender": "Female",
  "followers": 200,
  "zip": "94305"
}

Nested Query Languages

DrQL
SQL-like query language for nested data
Compatible with Google BigQuery/Dremel
  BigQuery applications should work with Drill
Designed to support efficient column-based processing
  No record assembly during query processing

Mongo Query Language

{$query: {x: 3, y: "abc"}, $orderby: {x: 1}}

Other languages/programming models can plug in
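
To make the Mongo-style query above concrete, here is a minimal sketch that evaluates it over in-memory Python dictionaries. It assumes only simple equality matching in `$query` and ascending (1) or descending (-1) sort keys in `$orderby`; the `run_query` function and the sample documents are hypothetical, and this is neither MongoDB's nor Drill's implementation.

```python
def run_query(documents, spec):
    """Apply equality filters from $query, then sort by the $orderby keys."""
    criteria = spec.get("$query", {})
    matched = [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]
    # Sort by the last key first so that the first key dominates
    # (Python's sort is stable). 1 means ascending, -1 descending.
    for field, direction in reversed(list(spec.get("$orderby", {}).items())):
        matched.sort(key=lambda doc: doc[field], reverse=(direction == -1))
    return matched


# Hypothetical documents for illustration.
docs = [
    {"x": 3, "y": "abc", "z": 2},
    {"x": 1, "y": "abc"},
    {"x": 3, "y": "def"},
    {"x": 3, "y": "abc", "z": 1},
]

spec = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
result = run_query(docs, spec)
print(result)
```

Because every match has `x` equal to 3, the sort leaves the two matching documents in their original order.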

DrQL Example
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'

SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
Name
  Cnt: 0

* Example from the Dremel paper
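
The semantics of the example query can be traced by hand. The sketch below encodes the input record as a Python dictionary and evaluates the query with ordinary loops. The `evaluate` function is a hypothetical, hand-written equivalent for this one query, not a DrQL engine, and it flattens the `Language`/`Str` nesting of the result into a list per `Name`.

```python
import re

record = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {
            "Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
            "Url": "http://A",
        },
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}


def evaluate(rec):
    """Hand-written equivalent of the example query for a single record."""
    if not rec["DocId"] < 20:
        return None
    names = []
    for name in rec.get("Name", []):
        url = name.get("Url")
        # WHERE REGEXP(Name.Url, '^http'): a Name without a matching Url is pruned.
        if url is None or not re.search(r"^http", url):
            continue
        codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
        names.append({
            "Cnt": len(codes),  # COUNT(Name.Language.Code) WITHIN Name
            "Str": [f"{url},{code}" for code in codes],  # Name.Url + ',' + Code
        })
    return {"Id": rec["DocId"], "Name": names}


result = evaluate(record)
print(result)
```

The third `Name` has no `Url`, so the `WHERE` clause removes it, and the second `Name` has no `Language`, so its count is 0 and it produces no `Str` values. This matches the two `Name` entries in the result shown above.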

Data Flow

Architecture
Nested query languages
  Pluggable model
  DrQL
  Mongo Query Language

Distributed execution engine
  Extensible model (e.g., Dryad)
  Low-latency
  Fault tolerant
  Column-based and row-based processing

Nested data formats
  Pluggable model
  Column-based (Dremel, AVRO-806/Trevni, RCFile) and row-based (Protocol Buffers, Avro, JSON, BSON, CSV)
  Schema (Protocol Buffers/Dremel, Avro/AVRO-806/Trevni, CSV) and schema-less (JSON, BSON)

Scalable data sources
  Pluggable model
  Hadoop
  NoSQL
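
The pluggable model in the layers above can be pictured as a registry that maps a format name to a reader, so a new format is added without changing the code that scans data. The sketch below shows this for two row-based formats. All names in it (`FORMAT_READERS`, `register_format`, `scan`) are hypothetical and chosen for illustration; they do not reflect Drill's actual interfaces.

```python
import csv
import io
import json

# Hypothetical registry: format name -> function that parses text into records.
FORMAT_READERS = {}


def register_format(name):
    """Decorator that registers a reader function under a format name."""
    def decorator(reader):
        FORMAT_READERS[name] = reader
        return reader
    return decorator


@register_format("json")
def read_json_lines(text):
    """Schema-less reader: one JSON object per non-empty line."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


@register_format("csv")
def read_csv(text):
    """Flat-record reader: the header row supplies the field names."""
    for row in csv.DictReader(io.StringIO(text)):
        yield dict(row)


def scan(format_name, text):
    """Look up the reader by name; the caller never depends on the format."""
    return list(FORMAT_READERS[format_name](text))


print(scan("json", '{"a": 1}\n{"a": 2}'))
print(scan("csv", "a,b\n1,2\n"))
```

The same registration pattern could in principle be applied to query languages and data sources, which is how the slide describes each layer.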

Design Principles
Flexible
  Pluggable query languages
  Extensible execution engine
  Pluggable data formats
    Column-based and row-based
    Schema and schema-less
  Pluggable data sources

Easy
  Unzip and run
  Zero configuration
  Reverse DNS not needed
  IP addresses can change
  Clear and concise log messages

Dependable
  No SPOF
  Instant recovery from crashes

Fast
  C/C++ core with Java support
  Min latency and max throughput (limited only by hardware)
  Full column-based data support including operators

Hadoop Integration
Hadoop data sources
  Hadoop FileSystem API (HDFS/MapR-FS)
  HBase

Hadoop data formats
  Apache Avro
  RCFile

MapReduce-based tools to create column-based formats
Hive-based query language and optimizer
Table registry in HCatalog
Run long-running services in YARN

