
Big Data

Jeffrey L. Popyack and William M. Mongan

What is Big Data?


Big Data is large, diverse, longitudinal, and/or distributed. It is generated from a variety of sources
including sensors, digital equipment, internet transactions, email, video, click streams, phone calls, and
any digital source available today or in the future. We are in an era of observation, and it all generates
data. Data is additionally generated by medical equipment, telescopes, satellites, environmental
networks, scanners, financial transactions, blogs, Twitter, digital photos, geographic maps, and more.

Big Data can be structured, as in the data stored in databases, or unstructured, as in the data contained
within a wiki or product recommendation site. Data can be temporal, where time is part of the value of
the data; the exact time and date a photo was taken may be critical to its value. Data can be spatial,
such as maps, where geolocation is part of the value of the data. Data can also be dynamic, as in real-time
click streams from large e-commerce sites such as Amazon.

One definition of big data is that the data set is too large to store and query using traditional database
techniques (Wikipedia). These data sets are commonly in the tera- to petabyte range, and they keep
growing larger.

Four V’s
The four V’s of Big Data are volume, velocity, variety, and veracity. A fifth V is sometimes included,
value. This is a useful lens for looking at applications or potential uses of big data. More on this later.
source: NSF Solicitation 12-499, Core Techniques and Technologies for Advancing
Big Data Science & Engineering, http://www.nsf.gov/pubs/2012/nsf12499/nsf12499.pdf

How Big is Big Data?


1 byte is needed to store a single letter, digit, or symbol. The words “Big Data” can be stored in 8 bytes.
1 kilobyte = 1000 bytes (or 1024). The text of Dr. Seuss’ “Green Eggs and Ham” is 3.3 kilobytes in size.
1 megabyte = 1000 kilobytes. An mp3 audio recording of The Beatles’ “I Want to Hold Your Hand” occupies 2.76 megabytes.
1 gigabyte = 1000 megabytes. The text of articles in Encyclopædia Britannica is about 0.2 gigabytes; the text of English Wikipedia articles is ~44 gigabytes.
1 terabyte = 1000 gigabytes. The Library of Congress book collection has been estimated at 10 terabytes.
1 petabyte = 1000 terabytes. As of November 2019, the Library of Congress had collected over 1 petabyte of web archive data since 2000 (https://www.loc.gov/programs/web-archiving/about-this-program/frequently-asked-questions/).
1 exabyte = 1000 petabytes. We’re not there yet. YouTube users upload 48 hours of video per minute, or about 15 terabytes of data per hour. At this rate, 8 years = 1 exabyte.
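
As a rough check on that last figure (taking the 15 terabytes per hour rate as given): 15 TB/hour × 24 hours × 365 days ≈ 131,000 TB, or about 131 petabytes per year, so accumulating 1 exabyte (1,000 petabytes) at that rate would take roughly 1000 / 131 ≈ 7.6 years, consistent with the 8-year estimate above.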
Big Data Surrounds Us
Think about the volume of data represented in social networking posts. We can capture real-time
sentiments in Twitter, determine who is connected to whom with LinkedIn, and play “Six Degrees of
Separation” with almost anyone. According to IBM, 90% of all data was created in the last two years
alone.

In the medical field we are using big data to help improve cancer screening and to find patterns in
disease vectors and genetics.

In finance, big data is providing significant analysis capabilities that were not possible before. We can
now determine whether a credit card transaction is likely to be fraudulent, predict what decisions a
consumer is likely to make, and even analyze the market as a whole.

In security and protection services, big data is providing new methods of screening …

“Every two days now we create as much information as we did from the dawn of civilization up
until 2003.” --Eric Schmidt, Google CEO, Aug. 2010

Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even
petabytes—of information.

• Turn 12 terabytes of Tweets created each day into improved product sentiment analysis

• Convert 350 billion annual meter readings to better predict power consumption

Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data
must be used as it streams into your enterprise to maximize its value.

• Scrutinize 5 million trade events created each day to identify potential fraud

• Analyze 500 million daily call detail records in real-time to predict customer churn faster

Variety: Big data is any type of data, structured or unstructured: text, sensor data, audio, video,
click streams, log files, and more. New insights are found when analyzing these data types
together.

• Monitor hundreds of live video feeds from surveillance cameras to target points of interest

• Exploit the 80% data growth in images, video and documents to improve customer
satisfaction

Veracity: 1 in 3 business leaders don’t trust the information they use to make decisions. How can you
act upon information if you don’t trust it? Establishing trust in big data presents a huge challenge as the
variety and number of sources grow.

Big Data is Interdisciplinary


Big Data encompasses several fields of computing such as computer architecture, distributed
computing, artificial intelligence, data science, and systems administration. More importantly, Big Data
reaches many other fields such as medicine, social networking, finance, business intelligence and public
safety.

Applications of Big Data


• The FBI is combining data from social media, CCTV cameras, phone calls and texts to track down
criminals and predict the next terrorist attack.

• Supermarkets are combining their loyalty card data with social media information to detect and
leverage changing buying patterns. For example, it is easy for retailers to predict that a woman is
pregnant simply based on the changing buying patterns. This allows them to target pregnant
women with promotions for baby related goods.

• Facebook is using face recognition tools to compare the photos you have uploaded with those
of others to find potential friends of yours (see Bernard Marr’s post on how Facebook is exploiting
your private information using big data tools).

• Politicians are using social media analytics to determine where they must campaign the hardest
to win the next election.
source: Bernard Marr, “Big Data: The Mega-Trend That Will Impact All Our Lives,”
http://www.linkedin.com/today/post/article/20130827231108-64875646-big-data-the-mega-trend-that-will-impact-all-our-lives

How Can We Analyze It?


Data need not be “this large” to qualify for big data processing techniques. The techniques developed
for working on very large data sets can also be applied to smaller, but still large, datasets. The “Single
Instruction, Multiple Data” paradigm for programming gave rise to techniques such as MapReduce,
which processes data sets on a distributed cluster with relative efficiency.

Open research problems include improving this efficiency by reducing the number of network
transactions, caching the data more intelligently, and more. The important concept is scale: we need
big data analysis techniques when one of the V’s described earlier scales beyond what we can handle
on a single computer. For example, if you can purchase a 1 TB hard drive for $100, how would you use
one computer to process 100 PB (100,000 TB) of data?
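
To make the scale concrete: at $100 per 1 TB drive, simply storing 100 PB would take 100,000 drives and about $10 million in disks alone, before any replication for fault tolerance, and long before considering the time needed to read all of that data through a single machine.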

Machine Learning Techniques


Use the power, speed, capacity, and relentlessness of computing to look for patterns in your data.
Correlations in the data are often sufficient to identify trends; for example, certain search terms are
good indicators of flu outbreaks (http://www.google.org/flutrends/).

Supervised Learning can be used to identify incoming email as spam. The user identifies certain email as
spam, then the machine learns to classify others accordingly.
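
As a deliberately tiny illustration of the idea (not how production spam filters are built), the sketch below “trains” on a few labeled messages by counting which words appear in spam versus normal mail, then scores a new message by which set of words it shares more of. The messages, function names, and scoring rule here are all made up for illustration.

// Minimal supervised-learning sketch: learn word counts from labeled examples,
// then classify a new message by comparing its overlap with each class.
function trainWordCounts(messages) {
  var counts = {};
  for (var i = 0; i < messages.length; i++) {
    var words = messages[i].toLowerCase().split(' ');
    for (var j = 0; j < words.length; j++) {
      counts[words[j]] = (counts[words[j]] || 0) + 1;
    }
  }
  return counts;
}

function classify(message, spamCounts, hamCounts) {
  var words = message.toLowerCase().split(' ');
  var spamScore = 0, hamScore = 0;
  for (var i = 0; i < words.length; i++) {
    spamScore += spamCounts[words[i]] || 0;  // how often this word appeared in labeled spam
    hamScore += hamCounts[words[i]] || 0;    // how often it appeared in normal mail
  }
  return spamScore > hamScore ? 'spam' : 'not spam';
}

// Labeled training data supplied by the user (hypothetical examples).
var spamCounts = trainWordCounts(['win a free prize now', 'free money click now']);
var hamCounts  = trainWordCounts(['meeting notes attached', 'lunch at noon tomorrow']);

console.log(classify('claim your free prize', spamCounts, hamCounts));  // "spam"
console.log(classify('notes from the meeting', spamCounts, hamCounts)); // "not spam"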

Unsupervised Learning can be used to find connections we would never have made ourselves. What traits
identify a potentially good NBA player? Yes, they’re tall, but the “average ratio of arms to height in the NBA
is an astounding 1.06. (To put that in context, a ratio of greater than 1.05 is one of the diagnostic criteria
for Marfan syndrome, a disorder of the body's connective tissues that often results in elongated limbs.)”
(David Epstein, The Sports Gene)
MapReduce
MapReduce is used to process large amounts of data with two parts, a mapper and a reducer. The
mapper breaks data into usable segments to be processed and creates key-value pairs. The reducer
consolidates key-value pairs into meaningful data. There can be many reducers to distribute the
workload.

To facilitate distribution of the workload, we shift from storing data in tables to storing it as objects. The
columns of a table become fields of an object. The objects are representable in common notations such
as JSON, and they can be stored in hashtables as key-value pairs.

A table like this one…

zip     city          state
19063   Middletown    PA
19064   Springfield   PA
19065   Media         PA
22125   Occoquan      VA
22134   Quantico      VA
22150   Springfield   VA
22172   Triangle      VA
22191   Woodbridge    VA

Becomes JSON, with the columns as fields of each object…

[
  { "zip": 19063, "city": "Middletown", "state": "PA" },
  { "zip": 19064, "city": "Springfield", "state": "PA" },
  { "zip": 19065, "city": "Media", "state": "PA" },
  { "zip": 22125, "city": "Occoquan", "state": "VA" },
  { "zip": 22134, "city": "Quantico", "state": "VA" },
  { "zip": 22150, "city": "Springfield", "state": "VA" },
  { "zip": 22172, "city": "Triangle", "state": "VA" },
  { "zip": 22191, "city": "Woodbridge", "state": "VA" }
]

…and, grouped by city, becomes key-value pairs:

{
  "Middletown": ["PA"],
  "Springfield": ["PA", "VA"],
  "Media": ["PA"],
  "Occoquan": ["VA"],
  "Quantico": ["VA"],
  "Triangle": ["VA"],
  "Woodbridge": ["VA"]
}
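
As a small illustration of that transformation (a sketch in plain JavaScript; the variable names are ours, and only a few of the rows are shown), the rows can be represented as objects and then folded into the key-value structure above:

// Rows of the table, represented as objects whose fields are the column names.
var rows = [
  { zip: 19063, city: 'Middletown',  state: 'PA' },
  { zip: 19064, city: 'Springfield', state: 'PA' },
  { zip: 19065, city: 'Media',       state: 'PA' },
  { zip: 22150, city: 'Springfield', state: 'VA' },
  { zip: 22191, city: 'Woodbridge',  state: 'VA' }
];

// Group the rows into a hashtable keyed by city, collecting the states seen.
var statesByCity = {};
for (var i = 0; i < rows.length; i++) {
  var city = rows[i].city;
  if (!(city in statesByCity)) {
    statesByCity[city] = [];
  }
  if (statesByCity[city].indexOf(rows[i].state) === -1) {
    statesByCity[city].push(rows[i].state);
  }
}

console.log(statesByCity);
// { Middletown: ['PA'], Springfield: ['PA', 'VA'], Media: ['PA'], Woodbridge: ['VA'] }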

Computing Systems for Handling Big Data


Hadoop is one such system. Developed by Apache as a MapReduce framework, it is distributed and
cluster-based, and it is used to process very large data sets by Facebook, Twitter, Spotify and many
more (hadoop.apache.org).

Data is stored on the Hadoop Distributed Filesystem (HDFS), which is fault-tolerant and runs on
commodity hardware. Hadoop can also use storage services such as Amazon S3, which offers
99.999999999% durability and 99.99% availability of objects over a given year and is designed to sustain
the concurrent loss of data in two facilities; using S3 is elastic and does not require replicating data across
the cluster itself. For example, Netflix uses this arrangement to study video-streaming trends with a 500+
node query cluster, with results reported and visualized via REST web service endpoints, and an execution
service lets Netflix submit jobs over HTTP.

Harnessing Big Data


Big data can come from user-generated data such as social media dumps or Wikipedia dumps. It can
also come from computer-generated data such as weather sensor data, credit card transactions, or HTTP
server logs. It can be historical trend data such as student grade performance data, medical records, or
crime reports.

Other systems and tools include MapReduce, jsmapreduce (which supports Python and JavaScript),
IBM BigInsights, IBM SPSS, Tableau, and R.

Here is how an example from jsmapreduce proceeds:

• The Mapper takes a “shard” of data to process, like a line of text from a web log file.

• This is an offshoot of a common parallel programming paradigm known as “Single Instruction,
Multiple Data” (SIMD), and is well-suited to symmetric multiprocessing (SMP) environments like
multicore machines.

• The data is already broken up across the computing nodes, but this is transparent to the
Mapper, which sees a traditional filesystem.

• The data is moved as needed.

• The best MapReduce algorithms require little movement of data.

• Replication of that data is provided across the nodes because frequent disk failure at this scale is
expected.

• The Mapper’s job is to “emit” data back into the distributed filesystem that will be processed by
the Reducer.

• This could be statistics about the data, or further instructions on how to process the data as a
whole.

• It is typically done as one or more key/value tuples, e.g. { “forbidden” : 5 } to represent 5 HTTP 403
Forbidden responses found in the shard of data.

• The key/value tuples are then “shuffled,” or arranged such that the values are grouped by
common key (a minimal sketch of this grouping step appears just after this list).

• That is, all the “forbidden” counts emitted by the mappers are put together.

• This data is then broken up by key and distributed to the Reducers for processing.

• …the Reducers might have previously been Mappers.

• Like the Mappers, the Reducers take this data and emit key/value tuples into the distributed
filesystem, typically representing aggregations of the values they received for their key.

• For example, { “forbidden” : 5 } and { “forbidden” : 3 } might result in { “forbidden” : 8 }.
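
jsmapreduce performs this shuffle for you, but as a rough sketch of what the grouping step does (a simplified stand-in, not the framework’s actual code), it amounts to collecting emitted (key, value) pairs into a map of lists:

// Simplified shuffle: group a flat list of emitted [key, value] pairs by key.
function shuffle(emitted) {
  var grouped = {};
  for (var i = 0; i < emitted.length; i++) {
    var key = emitted[i][0];
    var value = emitted[i][1];
    if (!(key in grouped)) {
      grouped[key] = [];
    }
    grouped[key].push(value);
  }
  return grouped;
}

// e.g. pairs emitted by mappers processing different shards:
var emitted = [['forbidden', 5], ['ok', 12], ['forbidden', 3]];
console.log(shuffle(emitted));
// { forbidden: [5, 3], ok: [12] }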
JSMapReduce Example
Data:

hope is the thing with feathers


that perches in the soul
and sings the tune without the words
and never stops at all
and sweetest in the gale is heard
and sore must be the storm
that could abash the little bird
that keeps so many warm
Ive heard it in the chillest land
and on the strangest sea
yet never in extremity
it asked a crumb of me

Mapper:
function Mapper(jsmr_context, data) {
  // Count the number of times each word appears in this line of input ...
  var words_list = data.split(' ');
  var word_counts_map = {};
  for (var i = 0; i < words_list.length; i++) {
    var word = words_list[i];
    if (word in word_counts_map) {
      word_counts_map[word]++;
    } else {
      word_counts_map[word] = 1;
    }
  }
  // ... and Emit() each word along with its count (the reducer will sum these).
  for (var word in word_counts_map) {
    var count = word_counts_map[word];
    jsmr_context.Emit(word, count);
  }
}

Mapper Output:
calling mapper with data="hope is the thing with feathers"
mapper: emitted: key=hope , value=1
mapper: emitted: key=is , value=1
mapper: emitted: key=the , value=1
mapper: emitted: key=thing , value=1
mapper: emitted: key=with , value=1
mapper: emitted: key=feathers , value=1
calling mapper with data="that perches in the soul"
mapper: emitted: key=that , value=1
mapper: emitted: key=perches , value=1
mapper: emitted: key=in , value=1
mapper: emitted: key=the , value=1
mapper: emitted: key=soul , value=1
calling mapper with data="and sings the tune without the words"
mapper: emitted: key=and , value=1
mapper: emitted: key=sings , value=1
mapper: emitted: key=the , value=2
mapper: emitted: key=tune , value=1
mapper: emitted: key=without , value=1
mapper: emitted: key=words , value=1
calling mapper with data="and never stops at all"
mapper: emitted: key=and , value=1
mapper: emitted: key=never , value=1
mapper: emitted: key=stops , value=1
mapper: emitted: key=at , value=1
mapper: emitted: key=all , value=1
calling mapper with data="and sweetest in the gale is heard"
mapper: emitted: key=and , value=1
mapper: emitted: key=sweetest , value=1
mapper: emitted: key=in , value=1
mapper: emitted: key=the , value=1
mapper: emitted: key=gale , value=1
mapper: emitted: key=is , value=1
mapper: emitted: key=heard , value=1
calling mapper with data="and sore must be the storm"
mapper: emitted: key=and , value=1
mapper: emitted: key=sore , value=1
mapper: emitted: key=must , value=1
mapper: emitted: key=be , value=1
mapper: emitted: key=the , value=1
mapper: emitted: key=storm , value=1
calling mapper with data="that could abash the little bird"
mapper: emitted: key=that , value=1
mapper: emitted: key=could , value=1
mapper: emitted: key=abash , value=1
mapper: emitted: key=the , value=1
mapper: emitted: key=little , value=1
mapper: emitted: key=bird , value=1
calling mapper with data="that keeps so many warm"
mapper: emitted: key=that , value=1
mapper: emitted: key=keeps , value=1
mapper: emitted: key=so , value=1
mapper: emitted: key=many , value=1
mapper: emitted: key=warm , value=1
calling mapper with data="Ive heard it in the chillest land"
mapper: emitted: key=Ive , value=1
mapper: emitted: key=heard , value=1
mapper: emitted: key=it , value=1
mapper: emitted: key=in , value=1
mapper: emitted: key=the , value=1
mapper: emitted: key=chillest , value=1
mapper: emitted: key=land , value=1
calling mapper with data= "and on the strangest sea"
mapper: emitted: key=and , value=1
mapper: emitted: key=on , value=1
mapper: emitted: key=the , value=1
mapper: emitted: key=strangest , value=1
mapper: emitted: key=sea , value=1
calling mapper with data= "yet never in extremity"
mapper: emitted: key=yet , value=1
mapper: emitted: key=never , value=1
mapper: emitted: key=in , value=1
mapper: emitted: key=extremity , value=1
calling mapper with data= "it asked a crumb of me"
mapper: emitted: key=it , value=1
mapper: emitted: key=asked , value=1
mapper: emitted: key=a , value=1
mapper: emitted: key=crumb , value=1
mapper: emitted: key=of , value=1
mapper: emitted: key=me , value=1

Shuffle Output:
shuffle result : { "hope":[1], "is":[1,1], "the":[1,1,2,1,1,1,1,1],
"thing":[1], "with":[1], "feathers":[1], "that":[1,1,1], "perches":
[1], "in":[1,1,1,1], "soul":[1], "and":[1,1,1,1,1], "sings":[1],
"tune":[1], "without":[1], "words":[1], "never":[1,1], "stops":[1],
"at":[1], "all":[1], "sweetest":[1], "gale":[1], "heard":[1,1],
"sore":[1], "must":[1], "be":[1], "storm":[1], "could":[1], "abash":
[1], "little":[1], "bird":[1], "keeps":[1], "so":[1], "many":[1],
"warm":[1], "Ive":[1], "it":[1,1], "chillest":[1], "land":[1], "on":
[1], "strangest":[1], "sea":[1], "yet":[1], "extremity":[1],
"asked":[1], "a":[1], "crumb":[1], "of":[1], "me":[1] }
Reducer:
function Reducer(jsmr_context, key) {
  // Sum the per-line word counts for this key to get the total for the word.
  var total_count = 0;
  while (jsmr_context.HaveMoreValues()) {
    var value_str = jsmr_context.GetNextValue();
    total_count += parseInt(value_str, 10);
  }
  jsmr_context.Emit(key + ':' + total_count.toString());
}

Reducer Output:
initialized reducer for key: Ive
initialized reducer for key: a
initialized reducer for key: abash
initialized reducer for key: all
initialized reducer for key: and
initialized reducer for key: asked
initialized reducer for key: at
initialized reducer for key: be
initialized reducer for key: bird
initialized reducer for key: chillest
initialized reducer for key: could
initialized reducer for key: crumb
initialized reducer for key: extremity
initialized reducer for key: feathers
initialized reducer for key: gale
initialized reducer for key: heard
initialized reducer for key: hope
initialized reducer for key: in
initialized reducer for key: is
initialized reducer for key: it
initialized reducer for key: keeps
initialized reducer for key: land
initialized reducer for key: little
initialized reducer for key: many
initialized reducer for key: me
initialized reducer for key: must
initialized reducer for key: never
initialized reducer for key: of
initialized reducer for key: on
initialized reducer for key: perches
initialized reducer for key: sea
initialized reducer for key: sings
initialized reducer for key: so
initialized reducer for key: sore
initialized reducer for key: soul
initialized reducer for key: stops
initialized reducer for key: storm
initialized reducer for key: strangest
initialized reducer for key: sweetest
initialized reducer for key: that
initialized reducer for key: the
initialized reducer for key: thing
initialized reducer for key: tune
initialized reducer for key: warm
initialized reducer for key: with
initialized reducer for key: without
initialized reducer for key: words
initialized reducer for key: yet
shuffle complete; initialized 48 reducer phases
(step)
calling reducer with key="Ive"
reducer: reducer fetched value: 1
reducer: emitted: "Ive:1"
(step)
calling reducer with key="a"
reducer: reducer fetched value: 1
reducer: emitted: "a:1"
(step)
calling reducer with key="abash"
reducer: reducer fetched value: 1
reducer: emitted: "abash:1"
(step)
calling reducer with key="all"
reducer: reducer fetched value: 1
reducer: emitted: "all:1"
(step)
calling reducer with key="and"
reducer: reducer fetched value: 1
reducer: reducer fetched value: 1
reducer: reducer fetched value: 1
reducer: reducer fetched value: 1
reducer: reducer fetched value: 1
reducer: emitted: "and:5"
(step)
calling reducer with key="asked"
reducer: reducer fetched value: 1
reducer: emitted: "asked:1"
(step)
calling reducer with key="at"
reducer: reducer fetched value: 1
reducer: emitted: "at:1"
(step)
calling reducer with key="be"
reducer: reducer fetched value: 1
reducer: emitted: "be:1"
(step)
calling reducer with key="bird"
reducer: reducer fetched value: 1
reducer: emitted: "bird:1"
(step)
calling reducer with key="chillest"
reducer: reducer fetched value: 1
reducer: emitted: "chillest:1"
(step)
calling reducer with key="could"
reducer: reducer fetched value: 1
reducer: emitted: "could:1"
(step)
calling reducer with key="crumb"
reducer: reducer fetched value: 1
reducer: emitted: "crumb:1"
(step)
calling reducer with key="extremity"
reducer: reducer fetched value: 1
reducer: emitted: "extremity:1"
(step)
calling reducer with key="feathers"
reducer: reducer fetched value: 1
reducer: emitted: "feathers:1"
(step)
calling reducer with key="gale"
reducer: reducer fetched value: 1
reducer: emitted: "gale:1"
(step)
calling reducer with key="heard"
reducer: reducer fetched value: 1
reducer: reducer fetched value: 1
reducer: emitted: "heard:2"
(step)
(and so on…)
References

• http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html

• http://queue.acm.org/detail.cfm?id=1961297

• http://www-01.ibm.com/software/data/bigdata/

• http://www.cs.kent.edu/~jin/Cloud12Spring/HbaseHivePig.pptx

• http://marianoguerra.github.io/json.human.js/

• http://jsonprettyprint.com/

• http://stackoverflow.com/questions/12628246/how-to-send-oauth-request-with-python-oauth2

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. CNS-
1301171.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the views of the National Science Foundation.

Supported in part by IBM Big Data Faculty Awards 2013
