
Big Data

Class - 1
Introduction

► We now have about 5 zettabytes of data in the world


► One zettabyte is 1,000,000,000,000,000,000,000 (10^21) bytes
► Each zettabyte is equivalent to 1 billion computers with 1 TB of storage
► Projected to grow to 175 zettabytes by 2025
► A stack of DVDs holding 125 zettabytes would circle the Earth 222
times
Introduction

► We now generate 2.5 quintillion bytes of data


every day
► That is 2,500,000,000,000,000,000 bytes of data
► Every day: 300 billion emails
► 65 billion WhatsApp messages
► 26 billion weather forecast requests
► Every day: 5 billion hours watched on YouTube and 500,000 new hours of
video content uploaded
► Every day: 5 billion searches on Google
► 3 billion photos shared, including 300 million Facebook photos and 750 million status updates
► 165 million hours streamed on Netflix
Big Data

► 90% of all data was generated in the last 18 months


► Historically, a number of the large-scale Internet
search, advertising, and social networking companies
pioneered Big Data hardware and software innovations.
► For example, Google analyzes the clicks, links, and
content on 1.5 trillion page views per day and delivers
search results plus personalized advertising in
milliseconds
Big Data

► Big Data describes a holistic information management strategy that includes and integrates many new types of data and data management alongside traditional data.
► While many of the techniques to process and analyze
these data types have existed for some time, it has
been the massive proliferation of data and the lower
cost computing models that have encouraged broader
adoption.
► In addition, Big Data has popularized two foundational
storage and processing technologies: Apache Hadoop
and the NoSQL database
Big Data

► Big Data has also been defined by the four “V”s:


► Volume,
► Velocity,
► Variety,
► Value.

These become a reasonable test to determine whether you should add Big Data to your information architecture
Volume

► The amount of data


► While volume indicates more data, it is the granular
nature of the data that is unique
► Big Data requires processing high volumes of
low-density data, that is, data of unknown value, such
as twitter data feeds, clicks on a web page, network
traffic, sensor-enabled equipment capturing data at the
speed of light, and many more.
► It is the task of Big Data to convert low-density data
into high-density data, that is, data that has value. For
some companies, this might be tens of terabytes, for
others it may be hundreds of petabytes.
Velocity

► The fast rate at which data is received and perhaps acted upon.


► The highest velocity data normally streams directly into
memory versus being written to disk.
► Some Internet of Things (IoT) applications have health and
safety ramifications that require real-time evaluation and
action.
► Other internet-enabled smart products operate in real-time
or near real-time. As an example, consumer ecommerce
applications seek to combine mobile device location and
personal preferences to make time sensitive offers.
Operationally, mobile application experiences have large
user populations, increased network traffic, and the
expectation for immediate response.
Variety

► New unstructured data types


► Unstructured and semi-structured data types, such as
text, audio, and video, require additional processing both
to derive meaning and to build the supporting metadata
► Once understood, unstructured data has many of the
same requirements as structured data, such as
summarization, lineage, auditability, and privacy.
► Further complexity arises when data from a known
source changes without notice. Frequent or real-time
schema changes are an enormous burden for both
transaction and analytical environments.
Value

► Data has intrinsic value—but it must be discovered


► There are a range of quantitative and investigative techniques to
derive value from data – from discovering a consumer preference
or sentiment, to making a relevant offer by location, or for
identifying a piece of equipment that is about to fail.
► The technological breakthrough is that the cost of data storage
and compute has exponentially decreased, thus providing an
abundance of data from which statistical sampling and other
techniques become relevant, and meaning can be derived.
► However, finding value also requires new discovery processes
involving clever and insightful analysts, business users, and
executives.
► The real Big Data challenge is a human one, which is learning to
ask the right questions, recognizing patterns, making informed
assumptions, and predicting behavior.
Why Big Data ?

► Data management is getting more complex than it has ever been before. Big Data is everywhere, on everyone’s mind, and in many different forms: advertising, social graphs, news feeds, recommendations, marketing, healthcare, security, government, and so on.
► In the last three years, thousands of technologies having to do with Big Data acquisition, management, and analytics have emerged; this has given IT teams the hard task of choosing, often without a comprehensive methodology to guide the choice.
How will we make use of the
data?
► Sell new products and services
► Personalize customer experiences
► Sense product maintenance needs
► Predict risk, operational results
► Sell value-added data
Which business processes can
benefit?
► Operational ERP/CRM systems
► BI and Reporting systems
► Predictive analytics, modeling, data mining
Data Ingestion

► Sensor-based real-time events


► Near real-time transaction events
► Real-time analytics
► Near real time analytics
► No immediate analytics
Data Storage

► HDFS (Hadoop plus others)


► File system
► Data Warehouse
► RDBMS
► NoSQL database
Data Processing

► Leave it at the point of capture


► Add minor transformations
► ETL data to analytical platform
► Export data to desktops
Performance

How to maximize the speed of ad hoc queries, data transformations, and analytical modeling?
► Analyze and transform data in real-time
► Optimize data structures for intended use
► Use parallel processing
► Increase hardware and memory
► Database configuration and operations
► Dedicate hardware sandboxes
► Analyze data at rest, in-place
What’s Different about Big
Data?
► Big Data introduces new technology, processes, and
skills to your information architecture and the people
that design, operate, and use them
► With new technology, there is a tendency to separate
the new from the old, but we strongly urge you to resist
this strategy. While there are exceptions, the
fundamental expectation is that finding patterns in this
new data enhances your ability to understand your
existing data.
Identifying Big Data
Symptoms
► The two main things that get people to start thinking
about Big Data are issues related to data size and volume;
although most of the time these are true and legitimate
reasons to think about Big Data, today they are not the
only reasons to go this route.
► There are other symptoms that you should also
consider, the type of data, for example. How will you
manage an increasing variety of data types when
traditional data stores, such as SQL databases, expect
you to do the structuring up front, for example by creating tables?
Identifying Big Data
Symptoms
► This is not feasible without adding a flexible, schemaless
technology that handles new data structures as they
come. When I talk about types of data, you should
imagine unstructured data, graph data, images, videos,
voice, and so on.
► Another symptom comes out of this premise: Big Data is
also about extracting added value information from a
high-volume variety of data. 
► Previously, when there were more read transactions
than write transactions, common caches or databases
paired with weekly ETL (extract, transform, load)
processing jobs were enough.
Identifying Big Data
Symptoms
► Today that’s not the trend any more. Now, you need an
architecture that is capable of handling data as it arrives,
from long-running batch processing to near real-time
processing jobs.
► The architecture should be distributed and not rely on
rigid, high-performance, and expensive mainframes;
instead, it should be based on more available,
performance-driven, and cheaper technology to give it
more flexibility.
Business Use Cases

► In addition to technical and architectural considerations,
you may be facing use cases that are typical Big Data
use cases. Some of them are tied to a specific industry;
others are not specialized and can be applied to various
industries.
► These use cases are generally based on analyzing
application logs, such as web access logs, application
server logs, and database logs, but they can also be
based on other types of data sources such as social
network data.
► When you are facing such use cases, you might want to
consider a distributed Big Data architecture if you want
to be able to scale out as your business grows.
Consumer Behavioral
Analytics
► Knowing your customer, or what we usually call the
“360-degree customer view” might be the most popular
Big Data use case.
► This customer view is usually used on e-commerce
websites and starts with an unstructured
click-stream—in other words, it is made up of the active
and passive website navigation actions that a visitor
performs.
► By counting and analyzing the clicks and impressions on
ads or products, you can adapt the visitor’s user
experience depending on their behavior, while keeping
in mind that the goal is to gain insight in order to
optimize the funnel conversion.
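
To make the clickstream idea concrete, here is a minimal Python sketch that counts product page views from a web access log. The log file name, the field position of the request path, and the /products/ URL pattern are illustrative assumptions, not part of the original material.

# Minimal sketch: count product page views from a clickstream / web access log.
# Assumes a common/combined log format where the request path is the 7th
# whitespace-separated field and product pages look like /products/<id>.
from collections import Counter

def product_views(log_path):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) < 7:
                continue
            url = fields[6]                      # request path in the log line
            if url.startswith("/products/"):
                counts[url.rsplit("/", 1)[-1]] += 1
    return counts

if __name__ == "__main__":
    views = product_views("access.log")          # hypothetical file name
    for product_id, n in views.most_common(10):
        print(product_id, n)

The same counting logic scales out naturally once the logs sit in HDFS and the job runs on Spark or MapReduce, as discussed in the later slides.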
Sentiment Analysis

► Companies care about how their image and reputation are
perceived across social networks; they want to minimize
any negative events that might affect their reputation and
leverage positive events.
► By crawling a large amount of social data in a
near-real-time way, they can extract the feelings and
sentiments of social communities regarding their brand,
and they can identify influential users and contact them
in order to change or empower a trend depending on
the outcome of their interaction with such users.
CRM Onboarding

► You can combine consumer behavioral analytics with sentiment analysis based on data surrounding the visitor’s social activities.
► Companies want to combine these online data sources
with the existing offline data, which is called CRM
(customer relationship management) onboarding, in
order to get better and more accurate customer
segmentation.
► Thus, companies can leverage this segmentation and
build a better targeting system to send
profile-customized offers through marketing actions.
Prediction

► Learning from data has become the main Big Data trend
of the last few years, for example in the telecommunications
industry, where predictive analysis of router logs has become
commonplace.
► Every time an issue is likely to occur on a device, the
company can predict it and order parts to avoid
downtime or lost profits.
► When combined with the previous use cases, you can
use predictive architecture to optimize the product
catalog selection and pricing depending on the user’s
global behavior.
Hadoop Distribution

In a Big Data project that involves Hadoop-related ecosystem technologies, you have two choices:
► Download the project you need separately and try to
create or assemble the technologies in a coherent,
resilient, and consistent architecture.
► Use one of the most popular Hadoop distributions,
which assemble or create the technologies for you.
Hortonworks and Cloudera are the main actors in this
field. There are a couple of differences between the
two vendors, but for starting a Big Data package, they
are equivalent, as long as you don’t pay attention to the
proprietary add-ons.
Cloudera 

► Cloudera adds a set of in-house components to the Hadoop-based components;
► these components are designed to give you better cluster management and search experiences.
Cloudera Components

► Impala: A real-time, parallelized, SQL-based engine that
searches for data in HDFS (Hadoop Distributed File
System) and HBase. Impala is considered to be the fastest
querying engine within the Hadoop distribution vendors
market, and it is a direct competitor of Spark from UC
Berkeley.
► Cloudera Manager: This is Cloudera’s console to
manage and deploy Hadoop components within your
Hadoop cluster.
► Hue: A console that lets the user interact with the data
and run scripts for the different Hadoop components
contained in the cluster.
Cloudera Hadoop Distribution

(Figure: Cloudera Hadoop distribution)
• Orange: Hadoop core stack.
• Pink: Hadoop ecosystem projects.
• Blue: Cloudera-specific components.
Hortonworks HDP

► Hortonworks is 100-percent open source and packages stable component versions rather than the latest versions of the Hadoop projects in its distribution.
► It adds a component management console to the stack
that is comparable to Cloudera Manager.
Hortonworks Hadoop
Distribution
Hadoop Distributed File
System (HDFS)
► The data is stored when it is ingested into the Hadoop
cluster. Generally it ends up in a dedicated file system
called HDFS.
► Key Features:
► Distribution
► High-throughput access
► High availability
► Fault tolerance
► Tuning
► Security
► Load balancing
HDFS

► HDFS is the first-class citizen for data storage in a Hadoop cluster. Data is automatically replicated across the cluster’s data nodes.

Data Replication
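
As a small illustration of interacting with HDFS from Python, here is a minimal sketch using the third-party HdfsCLI package over WebHDFS. The NameNode URL, user name, and paths are assumptions for illustration only.

# Minimal sketch of writing to and reading from HDFS over WebHDFS,
# using the third-party HdfsCLI package (pip install hdfs).
# The NameNode URL, user, and paths below are illustrative assumptions.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file; HDFS replicates its blocks across data nodes automatically.
client.write("/data/raw/access.log", data=b"GET /products/42 HTTP/1.1\n", overwrite=True)

# Read it back.
with client.read("/data/raw/access.log") as reader:
    print(reader.read())

# List the directory to check the file landed where expected.
print(client.list("/data/raw"))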
Apache Flume

► When you are looking to ingest logs, I would
highly recommend that you use Apache Flume; it’s designed
to be reliable and highly available, and it provides a simple,
flexible, and intuitive programming model based on
streaming data flows. Basically, you can configure a data
pipeline without a single line of code, only through
configuration.
► Flume is composed of sources, channels, and sinks. The
Flume source basically consumes an event from an external
source, such as an Apache Avro source, and stores it into the
channel. The channel is a passive storage system like a
file system; it holds the event until a sink consumes it. The
sink consumes the event, deletes it from the channel, and
distributes it to an external target.
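
Since Flume pipelines are defined purely through configuration, here is a minimal sketch of an agent properties file wiring a source, a channel, and a sink together. The agent name, log path, and HDFS destination are illustrative assumptions.

# Minimal sketch of a Flume agent configuration (names and paths are assumptions)
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail the web server access log as a stream of events
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

# Channel: passive buffer that holds events until the sink consumes them
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: deliver the events to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/data/clickstream/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream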
Apache Flume

Flume
Architecture
Apache Flume

► With Flume, the idea is to move the different log
files generated by the web servers to HDFS, for example.
► Remember that we are likely to work on a
distributed architecture that might have load balancers,
HTTP servers, application servers, access logs, and so
on.
► We can leverage all these assets in different ways and
they can be handled by a Flume pipeline. 
Apache Sqoop

► Sqoop is a project designed to transfer bulk data between a structured data store and HDFS.
► You can use it to either import data from an external
relational database to HDFS, Hive, or even HBase, or to
export data from your Hadoop cluster to a relational
database or data warehouse.
► Sqoop supports major relational databases such as
Oracle, MySQL, and Postgres. This project saves you
from writing scripts to transfer the data; instead, it
provides you with high-performance data transfer features.
Apache Sqoop

► Since the data can grow quickly in our relational
database, it’s better to identify fast-growing tables
from the beginning and use Sqoop to periodically
transfer the data into Hadoop so it can be analyzed.
► Then, from the moment the data is in Hadoop, it is
combined with other data, and at the end, we can use
Sqoop export to inject the data into our business
intelligence (BI) analytics tools, as sketched below.
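
Sqoop itself is normally driven from the command line; the minimal Python sketch below simply wraps one such import invocation. The JDBC connection string, credentials, table, and target directory are illustrative assumptions.

# Minimal sketch: drive a periodic Sqoop import of a fast-growing table
# from Python (connection string, credentials, and names are assumptions).
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost:3306/shop",
    "--username", "etl",
    "--password-file", "/user/etl/.pw",
    "--table", "orders",                 # fast-growing table to snapshot
    "--target-dir", "/data/sqoop/orders",
    "-m", "4",                           # 4 parallel map tasks
], check=True)

A matching "sqoop export" invocation, with --export-dir pointing at the processed results, would push aggregates back into the relational database for the BI tools.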
Yarn: NextGen MapReduce

► MapReduce was the main processing framework in the first
generation of the Hadoop cluster; it basically grouped sibling
data together (Map) and then aggregated the data
depending on a specified aggregation operation (Reduce).
► In Hadoop 1.0, users had the option of writing MapReduce
jobs in different languages—Java, Python, Pig, Hive, and so
on. Whatever the users chose as a language, everyone relied
on the same processing model: MapReduce.
► Since Hadoop 2.0 was released, however, a new architecture
has started handling data processing above HDFS. Now
that YARN (Yet Another Resource Negotiator) has been
implemented, other processing models are allowed and
MapReduce has become just one among them. This means
that users now have the ability to use a specific processing
model depending on their particular use case.
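
To make the Map and Reduce phases concrete, here is a minimal plain-Python sketch of the model (no cluster involved): map emits key/value pairs, a shuffle step groups sibling data by key, and reduce aggregates each group. The tiny word-count input is an illustrative assumption.

# Minimal sketch of the MapReduce model in plain Python.
from collections import defaultdict

def map_phase(records):
    for line in records:
        for word in line.split():
            yield word.lower(), 1            # emit a (key, value) pair per word

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)            # group sibling data by key
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}  # aggregate each group

if __name__ == "__main__":
    lines = ["big data class", "big data introduction"]
    print(reduce_phase(shuffle(map_phase(lines))))
    # {'big': 2, 'data': 2, 'class': 1, 'introduction': 1}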
Yarn: NextGen MapReduce

YARN
structure
Hive

► Hive, with its high-level SQL-like language (HiveQL), brings
users the simplicity and power of querying data stored in
HDFS.
► When you use such a language rather than native
MapReduce, the main drawback is performance.
► Hive is used for batch processing, such as long-running,
low-priority processing jobs, as in the sketch below.
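
Here is a minimal sketch of submitting such a batch HiveQL query from Python via the third-party PyHive package. The HiveServer2 host, database, and table names are illustrative assumptions.

# Minimal sketch: run a batch HiveQL query through HiveServer2
# using the third-party PyHive package (pip install pyhive).
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000, database="weblogs")
cursor = conn.cursor()

# A typical long-running, low-priority batch query expressed in SQL-like HiveQL.
cursor.execute("""
    SELECT product_id, COUNT(*) AS views
    FROM access_logs
    GROUP BY product_id
    ORDER BY views DESC
    LIMIT 10
""")
for product_id, views in cursor.fetchall():
    print(product_id, views)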
Spark Streaming

► Spark Streaming lets you write a processing job as you
would do for batch processing in Java, Scala, or Python,
but for processing data as you stream it.
► This can be really appropriate when you deal with high
throughput data sources such as a social network
(Twitter), clickstream logs, or web access logs.
► Spark Streaming is an extension of Spark, which
leverages its distributed data processing framework and
treats streaming computation as a series of
deterministic, micro-batch computations on small time
intervals.
► Spark Streaming can get its data from a variety of
sources, but when it is combined with a source such as
Apache Kafka (covered next), it forms the basis of a
fault-tolerant, high-throughput streaming pipeline.
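
The classic DStream word count gives a feel for the micro-batch model; the sketch below uses PySpark with a socket source, and the host, port, and 5-second batch interval are illustrative assumptions.

# Minimal PySpark sketch: treat a live text stream as a series of
# micro-batches (5-second intervals) and count words in each batch.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="ClickstreamWordCount")
ssc = StreamingContext(sc, batchDuration=5)       # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)   # e.g. fed by a log forwarder
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each micro-batch result

ssc.start()
ssc.awaitTermination()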
Apache Kafka

► Apache Kafka is a distributed publish-subscribe messaging
application written by LinkedIn in Scala.
► Kafka is a persistent, high-throughput messaging system;
it supports both queue and topic semantics, and
it uses ZooKeeper to form the cluster nodes.
► Kafka implements the publish-subscribe enterprise
integration pattern and supports parallelism and
enterprise features for performance and improved fault
tolerance.
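
As a small illustration of the publish-subscribe pattern, here is a minimal sketch using the third-party kafka-python package; the broker address, topic name, and message payload are illustrative assumptions.

# Minimal sketch using the third-party kafka-python package
# (pip install kafka-python).
from kafka import KafkaProducer, KafkaConsumer

# Publish a clickstream event to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": 42, "url": "/products/42"}')
producer.flush()

# Subscribe to the same topic and print incoming events.
consumer = KafkaConsumer("clickstream",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)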
Apache Kafka

Kafka Partitioned Topic Example
Spark MLlib

► MLlib enables machine learning for Spark; it leverages
the Spark Directed Acyclic Graph (DAG) execution engine
and brings a set of APIs that ease machine learning
integration for Spark.
►  It’s composed of various algorithms that go from basic
statistics, logistic regression, k-means clustering, and
Gaussian mixtures to singular value decomposition and
multinomial naive Bayes.
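
For a feel of the API, here is a minimal PySpark MLlib sketch that clusters visitors with k-means; the tiny in-memory (visits, purchases) dataset and the choice of k are illustrative assumptions.

# Minimal PySpark MLlib sketch: cluster (visits, purchases) pairs with k-means.
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="MLlibKMeansSketch")
data = sc.parallelize([
    [1.0, 0.0], [2.0, 1.0],      # low-activity visitors
    [50.0, 20.0], [60.0, 25.0],  # high-activity visitors
])

model = KMeans.train(data, k=2, maxIterations=10)
print(model.clusterCenters)
print(model.predict([55.0, 22.0]))   # which cluster a new visitor falls into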
NoSQL Stores

► NoSQL datastores are fundamental pieces of the data
architecture because they can ingest a very large
amount of data
► They provide scalability and resiliency, and thus high
availability
► NoSQL means Not Only SQL
Couchbase

► Couchbase is a document-oriented NoSQL database that
is easily scalable, provides a flexible model, and delivers
consistently high performance.
►  It exposes a fast key-value store with managed cache
for sub-millisecond data operations, purpose-built
indexers for fast queries and a powerful query engine
for executing SQL-like queries.
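
A minimal sketch of the key-value style of access with the Couchbase Python SDK follows; exact import paths vary slightly between SDK versions, and the connection string, credentials, bucket name, and document are illustrative assumptions.

# Minimal sketch with the Couchbase Python SDK (pip install couchbase).
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions
from couchbase.auth import PasswordAuthenticator

cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("app_user", "secret")))
collection = cluster.bucket("profiles").default_collection()

# Key-value operations on a JSON document.
collection.upsert("visitor::42", {"name": "Jane", "visits": 3})
print(collection.get("visitor::42").content_as[dict])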
ElasticSearch

► ElasticSearch is a NoSQL technology that is very popular for its scalable distributed indexing engine and search features.
► It’s based on Apache Lucene and enables real-time data
analytics and full-text search in your architecture.
► ElasticSearch is part of the ELK platform, which stands
for ElasticSearch + Logstash + Kibana and is delivered
by the company Elastic.
► The three products work together to provide the best
end-to-end platform for collecting, storing, and
visualizing data
ElasticSearch

► Logstash lets you collect data from many kinds of
sources, such as social data, logs, message queues, or
sensors; it then supports data enrichment and
transformation, and finally it transports the data to an
indexing system such as ElasticSearch.
► ElasticSearch indexes the data in a distributed,
scalable, and resilient system. It’s schemaless and
provides libraries for multiple languages so they can
easily and quickly enable real-time search and analytics in
your application.
► Kibana is a customizable user interface in which you
can build a simple to complex dashboard to explore and
visualize data indexed by ElasticSearch.
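
Here is a minimal sketch of indexing and searching a log event with the official Elasticsearch Python client; argument names differ slightly across client major versions, and the node URL, index name, and document are illustrative assumptions.

# Minimal sketch with the Elasticsearch Python client (pip install elasticsearch).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a log event (Logstash would normally do this step).
es.index(index="weblogs", document={"user": 42, "url": "/products/42", "status": 200})

# Full-text / structured search over the indexed events.
result = es.search(index="weblogs", query={"match": {"url": "/products/42"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"])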
Big Data Architecture

► Keeping all the Big Data technology we are going to use in mind, we can now go forward and build the foundation of our architecture.
Architecture Overview
► A web application the visitor can use to navigate in a catalog of
products
► A log ingestion application that is designed to pull the logs and
process them
► A learning application for triggering recommendations for our
visitor
► A processing engine that functions as the central processing
cluster for the architecture
► A search engine to pull analytics for our processed data
Big Data Architecture
Log Ingestion Application

► The log ingestion application is used to consume application logs such as web access logs.
► To ease the use case, a generated web access log is
provided and it simulates the behavior of visitors
browsing the product catalog.
► These logs represent the clickstream logs that are used
for long-term processing but also for real-time
recommendation.
Log Ingestion Application

► There are two options in the architecture:
► the first can be handled by Flume, which transports the
logs as they come in to our processing application;
► the second can be handled by ElasticSearch, Logstash, and
Kibana (the ELK platform) to create access analytics.
Log Ingestion Application
Learning Application

► The learning application receives a stream of data and
builds predictions to optimize our recommendation
engine.
► This application uses a basic algorithm to introduce the
concept of machine learning based on Spark MLlib.
Learning Application
Processing Engine

► The processing engine is the heart of the architecture;
it receives data from multiple kinds of sources and
delegates the processing to the appropriate model.
Search Engine

► The search engine leverages the data processed by the processing engine and exposes a dedicated RESTful API that will be used for analytic purposes.
