Big Data 2021 - 6,7,8 Big Data Technologies

This document discusses big data technologies, including big data analytics flow, the big data stack, and analytics patterns like alpha, beta, gamma, and delta patterns. It also covers Hadoop, explaining how Hadoop can store large volumes and varieties of data at high velocity. Key Hadoop technologies like MapReduce, Hive, HBase, Mahout and Storm are explained. The document discusses scenarios for implementing Hadoop both tactically and strategically and challenges of implementation in cloud and on-premise environments. It also covers data lakes.

Uploaded by

Putri Nur aini


Big Data

Big Data Technologies

Rosita Yanuarti - 2021

Big Data Technologies
Learning Objectives

1. Big Data Analytics Flow

2. Big data stack

3. Big Data analytics Pattern

4. Big Data and Cloud Computing

5. Introducing Hadoop

6. Hadoop Implementation Scenarios

Case Study
Traditional Approach
Types of Analytics
Big Data Analytics Flow
Big Data Stack

The Big Data Stack can be used to describe the various types of analysis and computation
as a whole. It consists of the frameworks used in big data analytics.
Big Data Stack
Raw Data

• Logs: Logs generated by web applications and servers which can be used for performance monitoring

• Transactional Data: Transactional data generated by eCommerce, banking and financial applications

• Social Media: Data generated by social media platforms

• Databases: Structured data residing in relational databases

• Sensor Data: Sensor data generated by Internet of Things (IoT) systems

• Clickstream Data: Clickstream data generated by web applications, which can be used to analyze the browsing patterns of users

• Surveillance Data: Sensor, image and video data generated by surveillance systems

• Healthcare Data: Healthcare data generated by Electronic Health Record (EHR) and other healthcare applications

• Network Data: Network data generated by network devices such as routers and firewalls
Big Data Stack
Data Access Connectors

• The Data Access Connectors include tools and frameworks for collecting and
ingesting data from various sources into the big data storage and analytics
frameworks.

• The choice of the data connector is driven by the type of the data source.

• These connectors can include both wired and wireless connections.
Data access connectors

• Publish-Subscribe Messaging : Publish-subscribe is a communication
model that involves publishers, brokers and consumers. Publishers are the
source of data. Publishers send the data to topics, which are managed by
the broker. Examples of publish-subscribe messaging frameworks include
Apache Kafka and Amazon Kinesis.

• Source-Sink Connectors : Source-sink connectors allow efficiently
collecting, aggregating and moving data from various sources (such as server
logs, databases, social media, streaming sensor data from Internet of Things
devices and other sources) into a centralized data store (such as a distributed
file system).
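The publish-subscribe model described above can be sketched in a few lines. This is a minimal in-memory illustration, not the Kafka or Kinesis API: the `Broker` class, the topic names and the two consumer lists are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy in-memory broker: routes each message to every subscriber of its topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        # A consumer registers interest in a topic, not in a specific publisher.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> None:
        # The publisher only knows the topic, never the consumers behind it.
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
dashboard: List[str] = []  # hypothetical consumer 1
archive: List[str] = []    # hypothetical consumer 2

broker.subscribe("sensor-readings", dashboard.append)
broker.subscribe("sensor-readings", archive.append)

broker.publish("sensor-readings", "temp=21.5")
broker.publish("web-logs", "GET /index.html")  # no subscribers, so nothing receives it
```

Both consumers receive the sensor reading, while the publisher never references either of them. A production broker would add persistence, partitioning and delivery guarantees, which this sketch leaves out.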
Data access connectors

• Database Connectors : Database connectors can be used for importing data from
relational database management systems into big data storage and analytics
frameworks for analysis.

• Messaging Queues : Messaging queues are useful for push-pull messaging where
the producers push data to the queues and the consumers pull the data from the
queues. The producers and consumers do not need to be aware of each other.

• Custom Connectors : Custom connectors can be built based on the source of the
data and the data collection requirements. Some examples of custom connectors
include: custom connectors for collecting data from social networks, custom
connectors for NoSQL databases and connectors for Internet of Things (IoT).
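The push-pull behaviour of a messaging queue can be illustrated with Python's standard `queue` and `threading` modules. This is a single-process sketch only; the event names and the `None` end-of-stream sentinel are assumptions made for the example, and a real deployment would use a dedicated queueing service.

```python
import queue
import threading
from typing import List


def producer(q: "queue.Queue", items: List[str]) -> None:
    """Push items onto the queue, then a sentinel to signal the end."""
    for item in items:
        q.put(item)
    q.put(None)  # sentinel (an assumption of this sketch)


def consumer(q: "queue.Queue", out: List[str]) -> None:
    """Pull items from the queue until the sentinel arrives."""
    while True:
        item = q.get()
        if item is None:
            break
        out.append(item.upper())  # stand-in for real processing


events: "queue.Queue" = queue.Queue()
processed: List[str] = []

t_cons = threading.Thread(target=consumer, args=(events, processed))
t_prod = threading.Thread(target=producer, args=(events, ["login", "click", "logout"]))
t_cons.start()
t_prod.start()
t_prod.join()
t_cons.join()
```

The producer and consumer share only the queue object. Neither holds a reference to the other, which is the decoupling the slide describes.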
Big Data Stack
Data Storage

• The data storage block in the big data stack includes distributed filesystems and
non-relational (NoSQL) databases, which store the data collected from the raw data
sources using the data access connectors.

• The Hadoop Distributed File System (HDFS) is a distributed file system that runs on
large clusters and provides high-throughput access to data. Data stored in HDFS
can be analyzed with various big data analytics frameworks built on top of HDFS.

• For certain analytics applications, it is preferable to store data in a NoSQL database
such as HBase.

• HBase is a scalable, non-relational, distributed, column-oriented database that
provides structured data storage for large tables.
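To make "column-oriented" more concrete, the sketch below models a table as row key → column family → column qualifier → value using nested dictionaries. It illustrates the data model only and is not the HBase client API; the `put` and `get` helpers, the row keys and the column names are hypothetical, and cell timestamps are omitted.

```python
from collections import defaultdict

# table[row_key][column_family][qualifier] = value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    """Store one cell. Rows do not need to share the same set of columns."""
    table[row_key][family][qualifier] = value


def get(row_key, family, qualifier, default=None):
    """Read one cell without creating missing rows or families."""
    return table.get(row_key, {}).get(family, {}).get(qualifier, default)


put("user#1001", "profile", "name", "Ayu")
put("user#1001", "activity", "last_login", "2021-03-01")
put("user#1002", "profile", "name", "Budi")  # sparse row: no 'activity' columns stored
```

Because missing cells take up no space, wide and sparse tables stay compact, which is one reason this layout suits large tables.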
Big Data Stack
Analytics Modes

• Batch analytics : analytics that processes data in batches. The process
involved is called the alpha pattern.

• Real-time analytics : used when the data being processed is real-time in nature,
including data obtained from IoT. It requires a different process from batch
analytics. The framework used for it is called the beta pattern.

• Interactive : Interactive querying lets users try out and experiment with
various assumptions, for example in prescriptive analysis tasks.
The framework used is the delta pattern.
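The difference between the batch and real-time modes can be shown with a small calculation. The sketch below computes a mean once over a complete data set (batch) and incrementally as each value arrives (real-time). The function names and the sample readings are made up for illustration.

```python
from typing import Iterable, Iterator, List


def batch_mean(readings: List[float]) -> float:
    """Batch mode: the full data set is available before processing starts."""
    return sum(readings) / len(readings)


def streaming_mean(readings: Iterable[float]) -> Iterator[float]:
    """Real-time mode: emit an updated result after every incoming value."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [20.0, 22.0, 24.0]  # hypothetical sensor values
batch_result = batch_mean(readings)
stream_results = list(streaming_mean(readings))
```

The streaming version keeps only a running count and total, so it never needs the whole data set in memory. Its last emitted value equals the batch result.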
Big Data Stack
Serving Databases, Web & Visualization Frameworks

• While the various analytics blocks process and analyze the data, the results
are stored in serving databases for subsequent tasks of presentation and
visualization.

• These serving databases allow the analyzed data to be queried and presented
in the web applications.
Big Data Stack
Mapping Analytics Flow to Big Data Stack

• For any big data application, once we come up with an analytics flow, the next step is to
map the analytics flow to specific tools and frameworks in the big data stack.

• For data collection tasks, the choice of a specific tool or framework depends on the type
of the data source (such as log files, machines generating sensor data, social media
feeds, records in a relational database, for instance) and the characteristics of the data.

• For data cleaning and transformation, tools such as OpenRefine and Stanford
DataWrangler can be used. These tools support various file formats such as CSV, Excel,
XML, JSON and line-based formats.

• For the basic statistics analysis type (with analysis such as computing counts, max, min,
mean, top-N, distinct, correlations, for instance), most of the analysis can be done using
the Hadoop-MapReduce framework or with Pig scripts.
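The basic statistics mentioned above fit the map-reduce style because each one can be built from partial results that combine cleanly. The sketch below runs in plain Python rather than on Hadoop; the partitions and function names are assumptions made for illustration.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarise one partition as (count, sum, min, max)."""
    return (len(values), sum(values), min(values), max(values))


def combine(a, b):
    """Reduce step: merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1], min(a[2], b[2]), max(a[3], b[3]))


# Hypothetical data already split across three partitions.
partitions = [[3, 9, 4], [10, 2], [7]]

count, total, minimum, maximum = reduce(combine, map(map_partition, partitions))
mean = total / count  # the mean is derived from the combined count and sum
```

Note that the mean is not averaged per partition. Carrying the count and sum separately keeps the result correct when partitions differ in size.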
Mapping Analytics Flow to Big Data Stack
Mapping Analytics Flow to Big Data Stack
Analytics Patterns

• Alpha Pattern : Batch analytics

• Beta Pattern : Real time analytics

• Gamma Pattern : Combines the batch and real-time analytics patterns

• Delta Pattern : Interactive querying analytics

Analytics Patterns
Alpha Pattern
Analytics Patterns
Beta Pattern
Analytics Patterns
Gamma Pattern
Analytics Patterns
Delta Pattern
Introducing Hadoop
Hadoop
Data Volume

• Hadoop stores files in a distributed file system

• Storage and computation take place across distributed locations

• Files can be spread across several nodes

• Hadoop can store 'a great deal' of data

• Storage resources can be linked to one another

• Very large files can be stored even when they are larger than a single
node
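The idea that one file can be larger than any single node follows from splitting it into blocks. The sketch below cuts a byte string into fixed-size blocks and assigns them to nodes in round-robin order. The block size, node names and round-robin placement are assumptions for the example; HDFS itself uses much larger blocks, replicates each block, and lets the NameNode decide placement.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut the data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: List[bytes], nodes: List[str]) -> Dict[str, List[int]]:
    """Assign block indexes to nodes in round-robin order (illustrative policy only)."""
    placement: Dict[str, List[int]] = {node: [] for node in nodes}
    for index in range(len(blocks)):
        placement[nodes[index % len(nodes)]].append(index)
    return placement


data = b"x" * 10                      # hypothetical 10-byte "file"
blocks = split_into_blocks(data, 4)   # tiny block size, chosen for readability
placement = place_blocks(blocks, ["node-a", "node-b"])
```

No node holds the whole file, yet joining the blocks in index order restores it exactly.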
Hadoop
Data Variety
Hadoop
Data Velocity
Hadoop
Platform

• Governed by the Apache Software Foundation

• Consists of the core services MapReduce, HDFS, and YARN

• Data services for manipulating and moving data
(Hive, HBase, Pig, Flume, Sqoop)

• Operational services that help manage the cluster (Ambari, Falcon, and
Oozie)
Big Data Technologies
Big Data Technology
MapReduce

• Incoming data is split up and distributed across several nodes:

• master node, ZooKeeper, worker.

• The partial results are then reduced, that is, combined back into
a single result.
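The split-then-reduce flow can be illustrated with the classic word count. This is a single-process sketch of the map, shuffle and reduce phases, not Hadoop code; the function names and input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


lines = ["big data big ideas", "data lake"]  # hypothetical input splits
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle(mapped))
```

On a cluster, each line (or block of lines) would be mapped on a different worker, and the shuffle would move pairs with the same key to the same reducer.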
Big Data Technology
Hive

• Hive is used to query data stored in Hadoop using a SQL-like
approach.

• This also applies to other unstructured data such as images and video.
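To show the kind of query a SQL-style layer makes possible, the sketch below runs an aggregate query with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in so the example runs anywhere: it is not Hive, and the `page_views` table and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a Hive table (assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u2", "/home"), ("u1", "/cart")],
)

# A declarative query replaces hand-written map and reduce code.
rows = conn.execute(
    "SELECT page, COUNT(*) AS views "
    "FROM page_views GROUP BY page ORDER BY views DESC, page"
).fetchall()
conn.close()
```

The point of the approach is that an analyst states what result is wanted, and the engine decides how to execute it over the stored data.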

Big Data Technology
HBase, Mahout, Storm

• HBase

• A NoSQL implementation on HDFS.

• A columnar NoSQL database

• Provides flexibility in data processing

• Mahout ==> Machine learning library

• A library of machine learning algorithms to run on data in HDFS

• Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Topic models.

• Storm ==> stream analytics for near-real-time processing

Implementation Scenarios
Scenarios
Hadoop Deployment

• Cloud Computing (PaaS)

• On-Premise (Hadoop)

• Hybrid
Hadoop Implementation Scenarios
Tactical Scenarios

• Data Virtualization

• Data Lake

• Scale out Data Mart

• Integrated Machine learning

Tactical Scenarios

Data Virtualization Data Lake Data Mart

Integrated machine learning
Hadoop Implementation Scenarios
Strategic Scenario 1

• Move the data warehouse's ETL pre-processing to Hadoop.

Hadoop Implementation Scenarios
Strategic Scenario 2 ==> Hot and cold storage

• Move historical data to cold storage on Hadoop.

• The data warehouse is used for hot data that BI and analytics can use.

• When data in cold storage is needed, it can be moved back to the data warehouse.
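The hot and cold storage scenario can be sketched as a simple tiering rule. The code below splits records by age and moves selected records back on demand. The 365-day cutoff, the record layout and the function names are all assumptions for illustration; in practice the tiers would be a data warehouse and a Hadoop cluster rather than Python lists.

```python
from datetime import date, timedelta


def tier_records(records, today, hot_days):
    """Split records into hot (recent) and cold (historical) by date."""
    cutoff = today - timedelta(days=hot_days)
    hot = [r for r in records if r["date"] >= cutoff]
    cold = [r for r in records if r["date"] < cutoff]
    return hot, cold


def restore(cold, hot, predicate):
    """Move the cold records matching predicate back into the hot tier."""
    restored = [r for r in cold if predicate(r)]
    remaining = [r for r in cold if not predicate(r)]
    return hot + restored, remaining


records = [  # hypothetical transactions
    {"id": 1, "date": date(2021, 3, 1)},
    {"id": 2, "date": date(2020, 1, 15)},
    {"id": 3, "date": date(2019, 6, 30)},
]

hot, cold = tier_records(records, today=date(2021, 3, 31), hot_days=365)
hot_after, cold_after = restore(cold, hot, lambda r: r["date"].year == 2020)
```

Only recent records stay in the hot tier by default, and the restore step models pulling a historical year back when an analysis needs it.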
Hadoop Implementation in Industry
Business Intelligence

• Put simply, big data is a means and a solution for
producing business intelligence.

• When a data analyst produces a dashboard and reports that
can be used to support decision-making, the work can be
said to produce, or move toward, "business intelligence".

• Business Intelligence ==> by data scientists, for example: prediction,
classification, pattern analysis, knowledge extraction.
End user
Uses of Hadoop
Hadoop Implementation
Challenges
Hadoop Implementation
Challenges in the Cloud
Hadoop Implementation
Exercise

• Exercise: try implementing Hadoop

Data Lake
Data Lake

• A centralized repository for both structured and unstructured data

• Store data as-is in open-source file formats to enable direct analytics

• A data lake is an architecture; Hadoop is a technology
Why data lake

• Decouple storage from compute, allowing you to scale

• Enable advanced analytics across all of your data sources

• Reduce complexity in ETL and operational overhead

• Future extensibility as new database and analytics technologies are invented

Data Lakes Extend the Traditional Approach
Data Lake
Advantages

• All data in one place

• Quick ingest

• Separating your storage and compute allows you to scale each component as
required

• A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
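Schema-on-read can be illustrated with raw JSON lines that are stored untouched and only given a structure when a consumer reads them. The field names, the two schemas and the `read_with_schema` helper below are hypothetical, chosen to show two teams reading the same raw data differently.

```python
import json

# Raw data stored as-is: no schema was enforced when it was written.
raw_lines = [
    '{"user": "u1", "amount": "19.90", "channel": "web"}',
    '{"user": "u2", "amount": "5"}',
]


def read_with_schema(lines, schema):
    """Apply a schema (field name -> type cast) while reading; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: (cast(record[field]) if field in record else None)
            for field, cast in schema.items()
        }


sales_schema = {"user": str, "amount": float}     # view used by one team
marketing_schema = {"user": str, "channel": str}  # a different view of the same data

sales = list(read_with_schema(raw_lines, sales_schema))
marketing = list(read_with_schema(raw_lines, marketing_schema))
```

Because nothing was fixed at write time, a new question only needs a new schema, not a reload of the data.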
