Big Data Architecture Basics.pptx (1)

Big Data architecture involves the organization of technologies and processes to manage large volumes of data characterized by the five Vs: Volume, Velocity, Variety, Veracity, and Value. It includes various components such as NoSQL databases, distributed file systems, and processing frameworks like Hadoop and Lambda architecture for efficient data handling. Business use cases range from customer behavior analytics to real-time predictions, highlighting the importance of specialized infrastructure for extracting insights from complex datasets.


Big Data Architecture



• Big Data architecture refers to the systematic organization and structure of
technologies, components, and processes designed to handle, process, store, and
analyse large volumes of data.

• Big Data is characterized by the five Vs: Volume, Velocity, Variety, Veracity,
and Value.

• Managing and extracting meaningful insights from massive and diverse datasets
requires specialized infrastructure and methodologies.
Identifying Big Data symptoms
• Data management is more complex than it has ever been before.
• Big Data is everywhere.
• When should I think about employing Big Data?
• Am I ready?
• What should I start with?
• One may choose to start a big data project based on different needs:
• Volume of data
• Variety of data structures the system has
• Scalability issues
• Reducing the cost of data processing
Size matters
• Two main areas: Size + Volume
• Handle new data structures with flexible, schema-less technology
• Big Data is also about extracting added-value information
• Near real-time processing with a real-time architecture
• Execute complex queries against a NoSQL store
Business Use Cases
• Analyzing application logs, web access logs, server logs, DB logs, and social
networks
• Customer Behavior Analytics: used on e-commerce websites
• Sentiment Analysis: the image and reputation of companies as perceived
across social networks
• CRM Onboarding: combine online data sources with offline data sources for
better and more accurate customer segmentation (profile-customized offers)
• Prediction: learning from data, the main big data trend of the past two years.
For example, in the telecom industry:
• Issue or event prediction based on router logs
• Product catalog selection
• Pricing depending on the user's global behavior
Understanding Big Data Project’s Ecosystem
• Choosing:
• Hadoop distribution
• Distributed file system
• SQL like processing language
• Machine learning language
• Scheduler
• Message oriented middleware
• NoSQL data store
• Data visualization

• github.com/zenkay/bigdata-ecosystem#projects-1
• NoSQL - https://www.youtube.com/watch?v=0buKQHokLK8
An Architecture for Big Data – Client Server Architecture

A Conceptual Cluster Architecture for Big Data
Client Level Architecture
• The client level architecture consists of NoSQL databases, distributed file
systems and a distributed processing framework.
• NoSQL databases provide distributed, highly scalable data storage for Big Data.
• Oracle describes the Oracle NoSQL Database as a distributed key-value database designed to
provide highly reliable, scalable and available data storage across a configurable set of systems
that function as storage nodes.
• A popular example of a NoSQL database is Apache HBase.
• The next layers consist of
• a distributed file system that is scalable and can handle a large volume of data, and
• a distributed processing framework that distributes computations over large server clusters.

• Internet-scale file systems include the Google File System, Amazon Simple Storage
Service, and the open-source Hadoop Distributed File System.
Client Level Architecture
• A popular platform is Apache Hadoop.
• The two critical components for Hadoop are the Hadoop distributed file
system (HDFS) and MapReduce.
• HDFS is the storage system and distributes data files over large server clusters and
provides high-throughput access to large data sets.
• MapReduce is the distributed processing framework for parallel processing of large data
sets. It distributes computing jobs to each server in the cluster and collects the results.
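The map/shuffle/reduce contract described above can be sketched in plain Python. This is a single-process toy that mimics what the framework does across a cluster, not the actual Hadoop API; the word-count job and sample documents are illustrative assumptions:

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit (word, 1) for every word in one input split."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key, as the framework would
    before handing each key's values to a reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reduce_phase(key, values):
    """Reduce: aggregate all counts for one word."""
    return (key, sum(values))

documents = ["big data needs big clusters", "data flows into data lakes"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate))
print(result["data"])  # 3
```

In a real Hadoop job, the map and reduce functions run on different servers, and the shuffle moves intermediate pairs over the network; the logical contract is the same.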
Server Level Architecture
• The server level architecture for Big Data consists of parallel computing platforms
that can handle the associated volume and speed.
• There are three prominent parallel computing options:
• Clusters or Grids,
• Massively Parallel Processing (MPP), and
• High Performance Computing (HPC).
• A commonly used architecture for Hadoop consists of client machines and clusters of
loosely coupled commodity servers that serve as the HDFS distributed data storage
and MapReduce distributed data processing.
• There are three major categories of machine roles in a Hadoop deployment that consist of Client
machines, Master nodes and Slave nodes.
• HBase, built on top of HDFS, provides fast record lookups and updates.
• Apache HBase provides random, real-time read/write access to Big Data
Server Level Architecture
• HDFS was originally designed for high-throughput, high-latency batch analytic
systems like MapReduce.
• HBase adds the low-latency performance needed by real-time systems.
• In this architecture
• Hadoop HDFS provides a fault-tolerant and scalable distributed data storage for Big Data
• Hadoop MapReduce provides the fault-tolerant distributed processing over large data sets
across the Hadoop cluster
• HBase provides the real-time random access to Big Data.
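The "real-time random access" that HBase adds on top of HDFS can be illustrated with a toy in-memory model of its data model (row key → column family:qualifier → value). This is not the real HBase client API, just a sketch of the access pattern; the table and row-key scheme are invented for illustration:

```python
class ToyHBaseTable:
    """In-memory stand-in for an HBase table: each row key maps
    'family:qualifier' column names to byte values."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, columns):
        # Writes and updates are the same operation: merge columns into the row.
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        # Random read: direct lookup by row key, no scan of the whole dataset.
        return self.rows.get(row_key, {})

table = ToyHBaseTable()
table.put(b"user#42", {"info:name": b"Ada", "info:city": b"London"})
table.put(b"user#42", {"info:city": b"Paris"})  # an update is just another put
print(table.get(b"user#42")["info:city"])  # b'Paris'
```

The point of the sketch: unlike a MapReduce job, which reads entire files, a get targets one row by key, which is what makes per-record reads and writes fast enough for real-time use.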
[Figure: HBase / Hadoop Cluster Architecture for Big Data. Clients submit jobs to the
JobTracker, which distributes tasks to TaskTrackers across the cluster.]
Lambda Architecture
• Lambda architecture is a way of processing massive quantities of data
(i.e. “Big Data”) that provides access to batch-processing and
stream-processing methods with a hybrid approach.
• Lambda architecture is used to solve the problem of computing
arbitrary functions on arbitrary data in real time.
• The lambda architecture itself is composed of 3 layers
Batch Layer
• New data arrives continuously, as a feed to the data system.
• It is fed to the batch layer and the speed layer simultaneously.
• The batch layer looks at all the data at once and eventually corrects the
results in the speed layer.
• Here we can find lots of ETL and a traditional data warehouse.
• The batch layer has two very important functions:
• To manage the master dataset
• To pre-compute the batch views.
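The batch layer's two functions can be sketched in a few lines of Python: an append-only master dataset, and a batch view recomputed from scratch over all of it. The page-view event schema here is a hypothetical example, not from the source:

```python
# Master dataset: an append-only, immutable log of raw events.
master_dataset = [
    {"page": "/home", "ts": 1},
    {"page": "/buy",  "ts": 2},
    {"page": "/home", "ts": 3},
    {"page": "/home", "ts": 4},
]

def compute_batch_view(events):
    """Pre-compute a batch view (page -> total views) by scanning the
    entire master dataset. Slow, but simple and always correct."""
    view = {}
    for event in events:
        view[event["page"]] = view.get(event["page"], 0) + 1
    return view

batch_view = compute_batch_view(master_dataset)
print(batch_view)  # {'/home': 3, '/buy': 1}
```

Because the view is recomputed from the full, immutable log, any bug in the view logic is fixed simply by recomputing; that is why the batch layer can "correct" the speed layer over time.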
Serving Layer
• The outputs from the batch layer, in the form of batch views, and those
coming from the speed layer, in the form of near real-time views, get
forwarded to the serving layer.
• This layer indexes the batch views so that they can be queried with
low latency on an ad-hoc basis.
Speed Layer (Stream Layer)
• This layer handles the data that have not yet been delivered to the batch
view due to the latency of the batch layer.
• It deals only with recent data, in order to provide a complete view of the
data to the user by creating real-time views.
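In contrast to the batch layer's full recomputation, the speed layer updates its real-time view incrementally, one event at a time. A minimal sketch, reusing the hypothetical page-view events from above:

```python
realtime_view = {}

def on_new_event(event):
    """Speed layer: fold one new event into the real-time view immediately,
    instead of recomputing over the full master dataset."""
    page = event["page"]
    realtime_view[page] = realtime_view.get(page, 0) + 1

# Events that arrived after the last batch run.
for event in [{"page": "/home", "ts": 5}, {"page": "/buy", "ts": 6}]:
    on_new_event(event)
print(realtime_view)  # {'/home': 1, '/buy': 1}
```

Incremental updates are fast but harder to get exactly right (e.g. under retries or reordering), which is why the batch layer's periodic recomputation serves as the source of truth.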
Benefits of Lambda Architecture
• No Server Management
• You do not have to install, maintain, or administer any software.
• Flexible Scaling
• Your application can be scaled either automatically or by adjusting
its capacity.
• Automated High Availability
• Refers to the fact that serverless applications have built-in availability
and fault tolerance.
• It represents a guarantee that all requests will get a response about whether
they were successful or not.
• Business Agility
• React in real-time to changing business/market scenarios
How to implement Lambda Architecture
• We can implement this architecture in the real-world by using
Hadoop data lakes, where HDFS can be used to store the master
dataset, Spark (or Storm) can form the speed layer, HBase (or
Cassandra) can be the serving layer, and Hive creates views that can
be queried.
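At query time, the serving layer answers by merging the (slightly stale) batch view with the speed layer's real-time view. A toy sketch of that merge, using the hypothetical page-view counts from the earlier examples; in the stack named above, this role would fall to HBase or Cassandra plus Hive views:

```python
def query(page, batch_view, realtime_view):
    """Serving-time merge: combine the batch view (complete up to the
    last batch run) with the real-time view (events since that run)."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

batch_view = {"/home": 3, "/buy": 1}   # output of the last batch run
realtime_view = {"/home": 1}           # events that arrived since
print(query("/home", batch_view, realtime_view))  # 4
```

When the next batch run completes, its view absorbs the recent events and the corresponding real-time view entries are discarded, keeping the merged answer consistent.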
Challenges with Lambda Architecture
• Complexity
• Lambda architectures can be highly complex.
• Administrators must typically maintain two separate code bases, one for the
batch layer and one for the streaming layer, which can make debugging difficult.
Lambda Architecture in use
• Yahoo
• For running analytics on its advertising data warehouse, Yahoo has taken a
similar approach, also using Apache Storm, Apache Hadoop, and Druid.

• Netflix
• The Netflix Suro project is the backbone of Netflix’s Data Pipeline that has
separate processing paths for data but does not strictly follow lambda
architecture since the paths may be intended to serve different purposes and
not necessarily to provide the same type of views.

• LinkedIn
• Bridging offline and nearline computations with Apache Calcite.
Thank you
