Big Data 2021 - 6,7,8 Big Data Technologies
5. Introducing Hadoop
The Big Data Stack describes the full range of analysis and computation involved in big data work;
it consists of the frameworks used in big data analytics.
Big Data Stack
Raw Data
• Logs: Logs generated by web applications and servers which can be used for performance monitoring
• Transactional Data: Transactional data generated by applications such as eCommerce, Banking and Financial
• Surveillance Data: Sensor, image and video data generated by surveillance systems
• Healthcare Data: Healthcare data generated by Electronic Health Record (EHR) and
other healthcare applications
• Network Data: Network data generated by network devices such as routers and
firewalls
Data Access Connectors
• Database Connectors : Database connectors can be used for importing data from
relational database management systems into big data storage and analytics
frameworks for analysis.
• Messaging Queues : Messaging queues are useful for push-pull messaging where
the producers push data to the queues and the consumers pull the data from the
queues. The producers and consumers do not need to be aware of each other.
• Custom Connectors : Custom connectors can be built based on the source of the
data and the data collection requirements. Some examples of custom connectors
include: custom connectors for collecting data from social networks, custom
connectors for NoSQL databases and connectors for Internet of Things (IoT).
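The push-pull behaviour of a messaging queue described above can be sketched with Python's standard library. This is only an in-process illustration, not a real message broker: the function names (`producer`, `consumer`, `run_pipeline`) and the sample records are hypothetical. The point it shows is that the producer and the consumer share only the queue and never reference each other.

```python
import queue
import threading

def producer(q, records):
    """Push each record onto the queue, then push a sentinel to signal the end."""
    for record in records:
        q.put(record)
    q.put(None)  # sentinel: no more data (an assumption of this sketch)

def consumer(q, sink):
    """Pull records from the queue until the sentinel arrives."""
    while True:
        record = q.get()
        if record is None:
            break
        sink.append(record)

def run_pipeline(records):
    """Run one producer and one consumer that only know about the shared queue."""
    q = queue.Queue()
    sink = []
    t_prod = threading.Thread(target=producer, args=(q, records))
    t_cons = threading.Thread(target=consumer, args=(q, sink))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return sink

received = run_pipeline(["log line 1", "log line 2", "log line 3"])
print(received)
```

In a real deployment the queue would be a separate service, so producers and consumers could run on different machines and at different speeds.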
Data Storage
• The data storage block in the big data stack includes distributed filesystems and
non-relational (NoSQL) databases, which store the data collected from the raw data
sources using the data access connectors.
• Hadoop Distributed File System (HDFS) is a distributed file system that runs on large
clusters and provides high-throughput access to data. Once data is stored in HDFS,
it can be analyzed with the various big data analytics frameworks built on top of HDFS.
• Batch analytics: analytics that processes data in batches;
the processing pattern involved is called the alpha pattern.
• Real-time analytics: used when the data being processed is real-time in nature, including data
obtained from IoT. This requires a different process from batch
analytics; the framework pattern used here is called the beta pattern.
• While the various analytics blocks process and analyze the data, the results
are stored in serving databases for subsequent tasks of presentation and
visualization.
• These serving databases allow the analyzed data to be queried and presented
in the web applications.
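The difference between batch and real-time analytics in the bullets above can be illustrated with a small sketch. It assumes a hypothetical list of sensor readings and uses a plain dictionary as a stand-in for a serving database; none of these names come from a specific framework. The batch function sees the whole stored dataset at once, while the streaming class updates its result one event at a time.

```python
def batch_mean(readings):
    """Batch style: process the complete stored dataset in one pass."""
    return sum(readings) / len(readings)

class StreamingMean:
    """Real-time style: update the result incrementally as each event arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Incremental mean: shift the old mean toward the new value.
        self.mean += (value - self.mean) / self.count
        return self.mean

readings = [20, 22, 24, 26]  # hypothetical sensor values

# Stand-in for a serving database that a web application would query.
serving_db = {}
serving_db["batch_mean"] = batch_mean(readings)

stream = StreamingMean()
for value in readings:
    serving_db["realtime_mean"] = stream.update(value)

print(serving_db)
```

Both paths end with the same value here; what differs is when the result becomes available and how much data must be held at once.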
Mapping Analytics Flow to Big Data Stack
• For any big data application, once we come up with an analytics flow, the next step is to
map the analytics flow to specific tools and frameworks in the big data stack.
• For data collection tasks, the choice of a specific tool or framework depends on the type
of the data source (such as log files, machines generating sensor data, social media
feeds, records in a relational database, for instance) and the characteristics of the data.
• For data cleaning and transformation, tools such as OpenRefine and Stanford
DataWrangler can be used. These tools support various file formats such as CSV, Excel,
XML, JSON and line-based formats.
• For the basic statistics analysis type (with analysis such as computing counts, max, min,
mean, top-N, distinct, correlations, for instance), most of the analysis can be done using
the Hadoop-MapReduce framework or with Pig scripts.
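To make the basic statistics case concrete, the sketch below imitates the map, shuffle and reduce phases in plain Python to compute counts and a top-N list. It is not Hadoop code and does not use the Hadoop API; the log format (status code as the last field) and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (status_code, 1) pair for every log line."""
    for line in lines:
        status = line.split()[-1]  # assumes the status code is the last field
        yield status, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the values of each key to get a count per key."""
    return {key: sum(values) for key, values in groups.items()}

def top_n(counts, n):
    """Return the n most frequent keys; ties are broken by key order."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

log_lines = [
    "GET /index 200",
    "GET /cart 200",
    "GET /missing 404",
    "POST /pay 500",
    "GET /index 200",
]

counts = reduce_phase(shuffle(map_phase(log_lines)))
top2 = top_n(counts, 2)
print(counts, top2)
```

On a cluster, the map and reduce functions would run in parallel on different nodes and the framework would perform the shuffle; the logic of each phase stays the same.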
Analytics Patterns
• Data storage sources can be connected to one another
• Very large files can be stored even when they are larger than the capacity of a single
node
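A file can exceed the capacity of a single node because a distributed file system splits it into blocks and spreads them across nodes. The sketch below illustrates the idea under stated assumptions: the 128 MB block size and replication factor of 3 match the usual HDFS defaults, but the round-robin placement is a simplification invented for this example (HDFS itself uses a rack-aware placement policy), and the node names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128  # common HDFS default; configurable in practice
REPLICATION = 3      # common HDFS default replication factor

def plan_blocks(file_size_mb, nodes,
                block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return, for each block of the file, the nodes holding its replicas.

    Placement is simple round-robin, which is an assumption of this sketch
    and not the actual HDFS placement policy.
    """
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = []
    for block in range(n_blocks):
        replicas = [nodes[(block + r) % len(nodes)] for r in range(replication)]
        placement.append(replicas)
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = plan_blocks(1000, nodes)  # a 1000 MB file

for block_id, replicas in enumerate(plan):
    print(block_id, replicas)
```

No single node has to hold the whole file: each node stores only some of the blocks, and every block exists on several nodes so that one node failure does not lose data.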
Hadoop
Data Variety
Data Velocity
Platform
• Hive is used to query data stored in Hadoop using
an SQL-like approach.
• HBase
• A library of machine learning algorithms that run on data in HDFS
• On-Premise (Hadoop)
• Hybrid
Hadoop Implementation Scenarios
Tactical Scenarios
• Data Virtualization
• Data Lake
• The data warehouse is used for hot data that can be consumed by BI and analytics
• When data in cold storage is needed, it can be moved back into the data warehouse.
Hadoop Implementation in Industry
Business Intelligence
• When a data analyst produces dashboards and reports that
can be used to support decision-making, the work can be
said to fall under "business intelligence".
• Quick ingest
• Separating your storage and compute allows you to scale each component as
required