Big Data Architecture Basics
• Big Data is characterized by the five Vs: Volume, Velocity, Variety, Veracity, and Value.
• Managing and extracting meaningful insights from massive and diverse datasets
requires specialized infrastructure and methodologies.
Identifying Big Data symptoms
• Data management is more complex than ever before
• Big Data is everywhere.
• When should I think about employing Big Data?
• Am I ready?
• What should I start with?
• One may choose to start a big data project based on different needs:
• Volume of data
• Variety of data structures the system must handle
• Scalability issues
• Reducing the cost of data processing
Size matters
• Two main areas: the size of the data and the structure of the data
• Handle new data structures with flexible, schema-less technology
• Big Data is also about extracting added-value information
• Near-real-time processing with a real-time architecture
• Execute complex queries with a NoSQL store
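To make the schema-less point concrete, here is a minimal sketch using MongoDB's Java driver as one possible NoSQL store; the "shop" database, "orders" collection, field names, and server address are made up for illustration. Documents with different shapes live in the same collection and can still be queried.

```java
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class SchemalessExample {
    public static void main(String[] args) {
        // Connect to a MongoDB server (placeholder address).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Hypothetical "shop" database and "orders" collection.
            MongoCollection<Document> orders =
                client.getDatabase("shop").getCollection("orders");

            // Schema-less: documents in the same collection can carry
            // different fields, with no upfront schema migration.
            orders.insertOne(new Document("user", "ada").append("total", 42));
            orders.insertOne(new Document("user", "bob")
                    .append("total", 17)
                    .append("coupon", "WELCOME10"));

            // Query the store on whatever fields a document happens to have.
            for (Document doc : orders.find(Filters.gt("total", 20))) {
                System.out.println(doc.toJson());
            }
        }
    }
}
```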
Business Use Cases
• Analyzing application logs, web access logs, server logs, DB logs, and social networks
• Customer Behavior Analytics: used on e-commerce websites
• Sentiment Analysis: how a company's image and reputation are perceived across social networks.
• CRM Onboarding: combine online data sources with offline data sources for better and more accurate customer segmentation (profile-customized offers).
• Prediction: learning from data has been the main Big Data trend for the past two years. For example, in the telecom industry:
• Issue or event prediction based on router logs
• Product catalog selection
• Pricing depending on the user's global behavior
Understanding Big Data Project’s Ecosystem
• Choosing:
• Hadoop distribution
• Distributed file system
• SQL-like processing language
• Machine learning language
• Scheduler
• Message-oriented middleware
• NoSQL data store
• Data visualization
• github.com/zenkay/bigdata-ecosystem#projects-1
• NoSQL - https://www.youtube.com/watch?v=0buKQHokLK8
An Architecture for Big Data – Client-Server Architecture
A conceptual Cluster Architecture for Big Data
Client level Architecture
• The client level architecture consists of NoSQL databases, distributed file
systems and a distributed processing framework.
• NoSQL databases provide distributed, highly scalable data storage for Big Data.
• Oracle described the Oracle NoSQL Database as a distributed key-value database designed to
provide highly reliable, scalable and available data storage across a configurable set of systems
that function as storage nodes.
• A popular example of a NoSQL database is Apache HBase.
• The next layers consist of
• the distributed file system that is scalable and can handle a large volume of data, and
• a distributed processing framework that distributes computations over large server clusters.
• Internet-scale file systems include the Google File System, Amazon Simple Storage Service, and the open-source Hadoop Distributed File System.
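As a rough sketch of the distributed file system layer, the following uses Hadoop's Java FileSystem API to write a file to HDFS and read it back; the NameNode address hdfs://namenode:8020 and the /demo path are placeholders for a real cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode (placeholder address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into blocks
        // and replicates them across DataNodes.
        Path path = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("Hello, HDFS\n");
        }

        // Read the file back through the same API.
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            System.out.println(reader.readLine());
        }
        fs.close();
    }
}
```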
Client level Architecture
• A popular platform is Apache Hadoop.
• The two critical components for Hadoop are the Hadoop distributed file
system (HDFS) and MapReduce.
• HDFS is the storage system and distributes data files over large server clusters and
provides high-throughput access to large data sets.
• MapReduce is the distributed processing framework for parallel processing of large data
sets. It distributes computing jobs to each server in the cluster and collects the results.
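The classic illustration of this split is word count. The sketch below uses the standard Hadoop MapReduce API: each mapper emits (word, 1) pairs for its slice of the input, and each reducer sums the counts for one word; input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on each server near its slice of the data,
    // emitting (word, 1) for every token in the input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: the framework groups values by key across the
    // cluster, so each reducer sums the counts for one word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Reusing the reducer as a combiner pre-aggregates counts on each mapper's node, which cuts down the data shuffled across the cluster.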
Server Level Architecture
• The server level architecture for Big Data consists of parallel computing platforms
that can handle the associated volume and speed.
• There are three prominent parallel computing options:
• Clusters or Grids,
• Massively Parallel Processing (MPP), and
• High Performance Computing (HPC).
• A commonly used architecture for Hadoop consists of client machines and clusters of loosely coupled commodity servers that provide the HDFS distributed data storage and MapReduce distributed data processing.
• There are three major categories of machine roles in a Hadoop deployment: Client machines, Master nodes, and Slave nodes.
• HBase, built on top of HDFS, provides fast record lookups and updates.
• Apache HBase provides random, real-time read/write access to Big Data.
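A minimal sketch of that access pattern, using the standard HBase Java client; the "users" table, "profile" column family, row key, and ZooKeeper address are assumptions for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // HBaseConfiguration picks up hbase-site.xml; the ZooKeeper
        // quorum address below is a placeholder.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical "users" table with a "profile" column family.
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random real-time write: a Put keyed by row key.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("profile"),
                          Bytes.toBytes("name"),
                          Bytes.toBytes("Ada"));
            table.put(put);

            // Random real-time read: a Get by the same row key,
            // served without scanning the whole data set.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("profile"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```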
Server Level Architecture
• HDFS was originally designed for high-latency, high-throughput batch analytics systems such as MapReduce.
• HBase improves its suitability for real-time systems by providing low-latency performance.
• In this architecture
• Hadoop HDFS provides a fault-tolerant and scalable distributed data storage for Big Data
• Hadoop MapReduce provides the fault-tolerant distributed processing over large data sets
across the Hadoop cluster
• HBase provides the real-time random access to Big Data.
[Diagram: Client machines submitting jobs to the cluster's JobTracker]
• Netflix
• The Netflix Suro project is the backbone of Netflix's data pipeline. It has separate processing paths for data, but it does not strictly follow the Lambda architecture, since the paths may serve different purposes and do not necessarily provide the same type of views.
• LinkedIn
• Bridging offline and nearline computations with Apache Calcite.
Thank you