
Big Data

Big Data Technologies

Rosita Yanuarti - 2021


Big Data Technologies
Learning Objectives

1. Big Data Analytics Flow

2. Big Data Stack

3. Big Data Analytics Patterns

4. Big Data and Cloud Computing

5. Introducing Hadoop

6. Hadoop Implementation Scenarios


Case Study
Traditional Approach
Type of Analytics
Big Data Analytics Flow
Big Data Stack

The Big Data Stack describes, as a whole, the various kinds of analysis and computation; it consists of the frameworks used in big data analytics.
Big Data Stack
Raw Data

• Logs: Logs generated by web applications and servers, which can be used for performance monitoring

• Transactional Data: Transactional data generated by applications such as eCommerce, banking and financial systems

• Social Media: Data generated by social media platforms

• Databases: Structured data residing in relational databases

• Sensor Data: Sensor data generated by Internet of Things (IoT) systems

• Clickstream Data: Clickstream data generated by web applications, which can be used to analyze the browsing patterns of users

• Surveillance Data: Sensor, image and video data generated by surveillance systems

• Healthcare Data: Healthcare data generated by Electronic Health Record (EHR) and other healthcare applications

• Network Data: Network data generated by network devices such as routers and firewalls
Big Data Stack
Data Access Connectors

• The Data Access Connectors include tools and frameworks for collecting and ingesting data from various sources into the big data storage and analytics frameworks.

• The choice of data connector is driven by the type of the data source.

• These connectors can include both wired and wireless connections.
Data Access Connectors

• Publish-Subscribe Messaging: Publish-subscribe is a communication model that involves publishers, brokers and consumers. Publishers are the source of data; they send data to topics, which are managed by the broker. Examples of publish-subscribe messaging frameworks include Apache Kafka and Amazon Kinesis (a minimal producer/consumer sketch follows after this list).

• Source-Sink Connectors: Source-Sink connectors allow efficiently collecting, aggregating and moving data from various sources (such as server logs, databases, social media, streaming sensor data from Internet of Things devices and other sources) into a centralized data store (such as a distributed file system).
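To make the publish-subscribe model concrete, here is a minimal sketch using the third-party kafka-python package; the broker address localhost:9092 and the topic name "clickstream" are assumptions made for the example, not part of the original material.

```python
# Minimal publish-subscribe sketch with kafka-python.
# Assumes a Kafka broker at localhost:9092 and a topic named "clickstream".
from kafka import KafkaProducer, KafkaConsumer

# Publisher: send a record to a topic managed by the broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": "u42", "page": "/home"}')
producer.flush()

# Consumer: subscribe to the topic and pull records from the broker.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read the topic from the beginning
    consumer_timeout_ms=5000,       # stop polling after 5 s of inactivity
)
for message in consumer:
    print(message.value)
```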
Data Access Connectors

• Database Connectors: Database connectors can be used for importing data from relational database management systems into big data storage and analytics frameworks for analysis (see the sketch after this list).

• Messaging Queues : Messaging queues are useful for push-pull messaging where
the producers push data to the queues and the consumers pull the data from the
queues. The producers and consumers do not need to be aware of each other.

• Custom Connectors : Custom connectors can be built based on the source of the
data and the data collection requirements. Some examples of custom connectors
include: custom connectors for collecting data from social networks, custom
connectors for NoSQL databases and connectors for Internet of Things (IoT).
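The following is a hypothetical database/custom connector sketch: it exports rows from a relational database into newline-delimited JSON that an ingestion tool could pick up. sqlite3 and the file names "orders.db" and "orders.jsonl" are stand-ins chosen for the example; a production connector would typically target an RDBMS such as MySQL or PostgreSQL and write into HDFS or a messaging queue.

```python
# Hypothetical connector: relational database -> line-based JSON for ingestion.
import json
import sqlite3

conn = sqlite3.connect("orders.db")          # source: relational database
conn.row_factory = sqlite3.Row               # return rows as dict-like objects

with open("orders.jsonl", "w") as sink:      # sink: file handed to the ingest layer
    for row in conn.execute("SELECT id, customer, amount FROM orders"):
        sink.write(json.dumps(dict(row)) + "\n")

conn.close()
```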
Big Data Stack
Data Storage

• The data storage block in the big data stack includes distributed file systems and non-relational (NoSQL) databases, which store the data collected from the raw data sources using the data access connectors.

• Hadoop Distributed File System (HDFS) is a distributed file system that runs on large clusters and provides high-throughput access to data. With the data stored in HDFS, it can be analyzed with various big data analytics frameworks built on top of HDFS (a small write sketch follows after this list).

• For certain analytics applications, it is preferable to store data in a NoSQL database such as HBase.

• HBase is a scalable, non-relational, distributed, column-oriented database that provides structured data storage for large tables.
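As an illustration of landing collected data in HDFS, here is a minimal sketch using the third-party "hdfs" Python package over WebHDFS; the NameNode URL, user name and paths are assumptions made for the example.

```python
# Minimal sketch: write collected data into HDFS via WebHDFS.
from hdfs import InsecureClient

# Assumed NameNode WebHDFS endpoint and user.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Upload a local log file into the distributed file system.
client.upload("/data/raw/logs/access.log", "access.log", overwrite=True)

# Or write records directly to an HDFS path.
with client.write("/data/raw/events/events.jsonl", overwrite=True) as writer:
    writer.write(b'{"event": "page_view", "user": "u42"}\n')
```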
Big Data Stack
Analytics Modes

• Batch analytics: analytics aimed at processing data in batches; the processing flow used here is called the alpha pattern.

• Real-time analytics: used when the data being processed is real-time in nature, including data obtained from IoT. This requires a different process from batch analytics; the framework pattern used is called the beta pattern.

• Interactive: interactive querying allows users to try out and experiment with different assumptions, for example in prescriptive analysis tasks. The framework pattern used is the delta pattern.
Big Data Stack
Serving Databases, Web & Visualization Frameworks

• While the various analytics blocks process and analyze the data, the results
are stored in serving databases for subsequent tasks of presentation and
visualization.

• These serving databases allow the analyzed data to be queried and presented
in the web applications.
Big Data Stack
Mapping Analytics Flow to Big Data Stack

• For any big data application, once we come up with an analytics flow, the next step is to map the analytics flow to specific tools and frameworks in the big data stack.

• For data collection tasks, the choice of a specific tool or framework depends on the type of the data source (log files, machines generating sensor data, social media feeds, records in a relational database, for instance) and the characteristics of the data.

• For data cleaning and transformation, tools such as OpenRefine and Stanford DataWrangler can be used. These tools support various file formats such as CSV, Excel, XML, JSON and line-based formats.

• For the basic statistics analysis type (computing counts, max, min, mean, top-N, distinct values and correlations, for instance), most of the analysis can be done using the Hadoop MapReduce framework or with Pig scripts.
Mapping Analytics Flow to Big Data Stack
Mapping Analytics Flow to Big Data Stack
Analytics Patterns

• Alpha Pattern: batch analytics

• Beta Pattern: real-time analytics

• Gamma Pattern: combines the batch (alpha) and real-time (beta) analysis patterns

• Delta Pattern: interactive querying analytics


Analytics Patterns
Alpha Pattern
Analytics Patterns
Beta Pattern
Analytics Patterns
Gamma Pattern
Analytics Patterns
Delta Pattern
Introducing Hadoop
Hadoop
Data Volume

• Hadoop stores files in a distributed file system

• Storage and computation are carried out across distributed locations

• Files can be spread across several nodes

• Hadoop can store very large amounts of data

• The storage data sources can be connected to one another

• Fairly large files can be stored even when they are larger than a single node
Hadoop
Data Variety
Hadoop
Data Velocity
Hadoop
Platform

• Governed by the Apache Software Foundation

• Consists of the core services MapReduce, HDFS and YARN

• Data services that make it possible to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop)

• Operational services that help manage the cluster (Ambari, Falcon and Oozie)
Big Data Technologies
Big Data Technologies
MapReduce

• Incoming data is split into chunks that are distributed across several nodes in the cluster (master node, ZooKeeper, workers), and map tasks process the chunks in parallel.

• The intermediate results are then reduced (combined) back into a single output (a word-count sketch for Hadoop Streaming follows below).
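As a standard illustration of the map and reduce phases (not taken from the original slides), here is a word-count sketch written for Hadoop Streaming; the single-file form can also be tested locally with: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce

```python
# wordcount.py: map and reduce phases for a word count, usable with Hadoop
# Streaming (pass "map" or "reduce" as the first argument).
import sys

def mapper():
    # Map phase: emit (word, 1) for every word in the input split.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key, so counts for the same
    # word are adjacent and can be summed in a single pass.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```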
Big Data Technologies
Hive

• Hive is used to query data stored in Hadoop using an SQL-like approach (HiveQL); a small query sketch follows below.

• It is also applied to other unstructured data such as images and video.
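A minimal sketch of querying Hive from Python with the PyHive package; the HiveServer2 host/port, user name and the "web_logs" table are assumptions made for the example.

```python
# Minimal sketch: run a HiveQL query over data stored in Hadoop via PyHive.
from pyhive import hive

conn = hive.Connection(host="hiveserver2.example.com", port=10000, username="hadoop")
cursor = conn.cursor()

# HiveQL looks like SQL but is executed as jobs on the cluster.
cursor.execute("SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")
for status, hits in cursor.fetchall():
    print(status, hits)
```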


Big Data Technologies
HBase, Mahout, Storm

• HBase

• A NoSQL implementation on top of HDFS (a small sketch follows after this list)

• A columnar NoSQL database

• Provides flexibility in data processing

• Mahout ==> machine learning library

• A library of machine learning algorithms to run on data in HDFS

• Collaborative filtering, classification, clustering, dimensionality reduction, topic models

• Storm ==> stream analytics for near-real-time processing
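As an illustration of HBase's column-oriented storage, here is a minimal sketch using the happybase package (which talks to the HBase Thrift server); the server address, table name and column family are assumptions made for the example.

```python
# Minimal sketch: store and read a row in HBase via happybase.
import happybase

connection = happybase.Connection("hbase-thrift.example.com")
table = connection.table("user_events")

# Column-oriented storage: values live under column-family:qualifier keys.
table.put(b"user42#2021-06-01", {
    b"event:type": b"page_view",
    b"event:page": b"/home",
})

row = table.row(b"user42#2021-06-01")
print(row[b"event:page"])
```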


Implementation Scenarios
Scenarios
Hadoop Deployment

• Cloud Computing (PaaS)

• On-Premise (Hadoop)

• Hybrid
Hadoop Implementation Scenarios
Tactical Scenarios

• Data Virtualization

• Data Lake

• Scale-out Data Mart

• Integrated Machine Learning


Tactical Scenarios

Data Virtualization, Data Lake, Data Mart, Integrated Machine Learning
Hadoop Implementation Scenarios
Strategic Scenario 1

• Move ETL pre-processing for the data warehouse to Hadoop.


Hadoop Implementation Scenarios
Strategic Scenario 2 ==> Hot and cold storage

• Move historical data to cold storage on Hadoop.

• The data warehouse is used for hot data that can be consumed by BI and analytics.

• When data from cold storage is needed, it can be moved back into the data warehouse.
Hadoop Implementation in Industry
Business Intelligence

• Put simply, big data is a means and a solution for producing business intelligence.

• When a data analyst produces dashboards and reports that can be used to support decision making, the result can be said to fall under "business intelligence".

• Business Intelligence ==> carried out by data scientists, for example: prediction, classification, pattern analysis, knowledge extraction.
End user
Uses of Hadoop
Hadoop Implementation
Challenges
Hadoop Implementation
Challenges in the Cloud
Hadoop Implementation
Exercise

• Exercise: try implementing Hadoop


Data Lake
Data Lake

• A centralized repository for both structured and unstructured data

• Stores data as-is in open-source file formats to enable direct analytics

• A Data Lake is an architecture; Hadoop is a technology
Why data lake

• Decouple storage from compute, allowing you to scale

• Enable advanced analytics across all of your data sources

• Reduce complexity in ETL and operational overhead

• Future extensibility as new database and analytics technologies are invented


Data Lakes Extend the Traditional Approach
Data Lake
Advantages

• All data in one place

• Quick ingest

• Separating your storage and compute allows you to scale each component as
required

• A Data Lake enables ad-hoc analysis by applying schemas on read, not on write (a small schema-on-read sketch follows below).
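The following is a small sketch of the schema-on-read idea: raw records are landed as-is in an open file format, and a schema is applied only when the data is read for analysis. pandas/pyarrow and the local "lake/" path are assumptions for the example; in practice the lake would sit on HDFS or object storage.

```python
# Schema-on-read sketch: land raw data quickly, apply types only at read time.
import os
import pandas as pd  # parquet support requires pyarrow or fastparquet

os.makedirs("lake", exist_ok=True)

# Ingest: write the raw events exactly as they arrive (no upfront modelling).
raw_events = pd.DataFrame([
    {"ts": "2021-06-01T10:00:00", "user": "u42", "amount": "19.90"},
    {"ts": "2021-06-01T10:05:00", "user": "u17", "amount": "5.00"},
])
raw_events.to_parquet("lake/events.parquet", index=False)

# Analyze: apply the schema on read, casting types as needed for this query.
events = pd.read_parquet("lake/events.parquet")
events["ts"] = pd.to_datetime(events["ts"])
events["amount"] = events["amount"].astype(float)
print(events.groupby("user")["amount"].sum())
```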
