Big Data Analytics
1. Introduction
This paper documents the basic concepts relating to big data. It attempts to
consolidate the hitherto fragmented discourse on what constitutes big data, what
metrics define the size and other characteristics of big data, and what tools and
technologies exist to harness the potential of big data.
From corporate leaders to municipal planners and academics, big data are the
subject of attention, and to some extent, fear. The sudden rise of big data has left
many unprepared. In the past, new technological developments first appeared in
technical and academic publications. The knowledge and synthesis later seeped into
other avenues of knowledge mobilization, including books. The fast evolution of big
data technologies and the ready acceptance of the concept by public and private
sectors left little time for the discourse to develop and mature in the academic
domain. Authors and practitioners leapfrogged to books and other electronic media
for immediate and wide circulation of their work on big data. Thus, one finds several
books on big data, including Big Data for Dummies, but not enough fundamental
discourse in academic publications.
The leapfrogging of the discourse on big data to more popular outlets implies that a
coherent understanding of the concept and its nomenclature is yet to develop. For
instance, there is little consensus around the fundamental question of how big the
data has to be to qualify as ‘big data’. Thus, there exists the need to document in
the academic press the evolution of big data concepts and technologies.
A key contribution of this paper is to bring forth the oft-neglected dimensions of big
data. The popular discourse on big data, which is dominated and influenced by the
marketing efforts of large software and hardware developers, focuses on predictive
analytics and structured data. It ignores the largest component of big data, which is
unstructured and is available as audio, images, video, and unstructured text. It is
estimated that the analytics-ready structured data forms only a small subset of big
data. The unstructured data, especially data in video format, is the largest
component of big data, and it is only partially archived.
This paper is organized as follows. We begin the paper by defining big data. We
highlight the fact that size is only one of several dimensions of big data. Other
characteristics, such as the frequency with which data are generated, are equally
important in defining big data. We then expand the discussion on various types of
big data, namely text, audio, video, and social media. We apply the analytics lens to
the discussion on big data. Hence, when we discuss data in video format, we focus
on the methods and tools available to analyze it.
While ubiquitous today, ‘big data’ as a concept is nascent and has uncertain
origins. Diebold (2012) argues that the term “big data … probably originated in
lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid-1990s, in which
John Mashey figured prominently”. Despite these references to the mid-1990s, Fig. 1
shows that the term became widespread as recently as 2011.
The current hype can be attributed to the promotional initiatives by IBM and other
leading technology companies who invested in building the niche analytics market.
Big data definitions have evolved rapidly, which has caused some confusion. This is
evident from an online survey of 154 C-suite global executives conducted by Harris
Interactive on behalf of SAP in April 2012 (“Small and midsize companies look to
make big gains with big data,” 2012). Fig. 2 shows how executives differed in their
understanding of big data, where some definitions focused on what it is, while
others tried to answer what it does.
Fig. 2. Definitions of big data based on an online survey of 154 global executives in
April 2012.
Clearly, size is the first characteristic that comes to mind when considering the
question “What is big data?” However, other characteristics of big data have emerged
recently. For instance, Laney (2001) suggested that Volume, Variety,
and Velocity (or the Three V's) are the three dimensions of challenges in data
management. The Three V's have emerged as a common framework to describe big
data (Chen et al., 2012, Kwon et al., 2014). For example, Gartner, Inc. defines big
data in similar terms:
“Big data is a term that describes large volumes of high-velocity, complex, and
variable data that require advanced techniques and technologies to enable the
capture, storage, distribution, management, and analysis of the information.”
Volume refers to the magnitude of data. Big data sizes are reported in multiple
terabytes and petabytes. A survey conducted by IBM in mid-2012 revealed that just
over half of the 1144 respondents considered datasets over one terabyte to be big
data (Schroeck, Shockley, Smart, Romero-Morales, & Tufano, 2012). One terabyte
stores as much data as would fit on 1500 CDs or 220 DVDs, enough to store around
16 million Facebook photographs. Beaver, Kumar, Li, Sobel, and Vajgel (2010) report
that Facebook processes up to one million photographs per second. One petabyte
equals 1024 terabytes. Earlier estimates suggest that Facebook stored 260 billion
photos using storage space of over 20 petabytes.
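To make these magnitudes concrete, the arithmetic behind the figures above can be scripted directly. The following is a minimal Python sketch; the CD and DVD capacities and the average photo size are nominal values assumed here for illustration:

```python
# Back-of-the-envelope volume arithmetic for the figures quoted above.
TB = 1024**4                  # one terabyte, in bytes (binary convention)
PB = 1024 * TB                # one petabyte equals 1024 terabytes

CD_BYTES = 700 * 1024**2      # nominal CD capacity: 700 MB
DVD_BYTES = int(4.7e9)        # nominal single-layer DVD capacity: 4.7 GB
PHOTO_BYTES = 64 * 1024       # assumed average photo size: ~64 KB

print(f"CDs per terabyte:    {TB / CD_BYTES:,.0f}")     # ~1,500
print(f"DVDs per terabyte:   {TB / DVD_BYTES:,.0f}")    # ~230
print(f"Photos per terabyte: {TB / PHOTO_BYTES:,.0f}")  # ~16 million
# Storage implied by 260 billion photos at the assumed average size:
print(f"260e9 photos: {260e9 * PHOTO_BYTES / PB:,.1f} PB")
```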
Definitions of big data volumes are relative and vary by factors, such as time and the
type of data. What may be deemed big data today may not meet the threshold in
the future because storage capacities will increase, allowing even bigger data sets to
be captured. In addition, the type of data, discussed under variety, defines what is
meant by ‘big’. Two datasets of the same size may require different data
management technologies based on their type, e.g., tabular versus video data. Thus,
definitions of big data also depend upon the industry. These considerations therefore
make it impractical to define a specific threshold for big data volumes.
Variety refers to the structural heterogeneity in a dataset: big data encompasses
structured data (e.g., the tabular data found in spreadsheets and relational
databases), unstructured data (e.g., text, images, audio, and video), and
semi-structured data that falls between the two. XML, a textual language for
exchanging data on the Web, is a typical example of semi-structured data. XML
documents contain user-defined data tags which make them machine-readable.
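To see why such tags matter for analytics, a few lines of Python suffice to extract fields from a semi-structured record. This is a minimal sketch using only the standard library; the tag names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A semi-structured record: the tags are user-defined, yet a program
# can still locate and extract each field by name.
doc = """
<order id="1001">
  <customer>Jane Doe</customer>
  <item sku="A-17" qty="2">running shoes</item>
  <total currency="USD">129.90</total>
</order>
"""

root = ET.fromstring(doc)
print(root.get("id"))                  # 1001
print(root.find("customer").text)      # Jane Doe
print(root.find("item").get("sku"))    # A-17
print(float(root.find("total").text))  # 129.9
```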
A high level of variety, a defining characteristic of big data, is not necessarily new.
Organizations have been hoarding unstructured data from internal sources (e.g.,
sensor data) and external sources (e.g., social media). However, the emergence of
new data management technologies and analytics, which enable organizations to
leverage data in their business processes, is the innovative aspect. For instance,
facial recognition technologies empower brick-and-mortar retailers to acquire
intelligence about store traffic, the age or gender composition of their customers,
and their in-store movement patterns. This invaluable information is leveraged in
decisions related to product promotions, placement, and staffing. Clickstream data
provides a wealth of information about customer behavior and browsing patterns to
online retailers. Clickstream data reveals the timing and sequence of pages viewed by
a customer. Using big data analytics, even small and medium-sized enterprises
(SMEs) can mine massive volumes of semi-structured data to improve website
designs and implement effective cross-selling and personalized product
recommendation systems.
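As a sketch of how clickstream data can feed cross-selling, the co-occurrence of product pages within browsing sessions can be counted and used to recommend what is most often viewed together. The following is a minimal illustration in Python; the session data are invented:

```python
from collections import Counter
from itertools import combinations

# Each session is the sequence of product pages one visitor viewed.
sessions = [
    ["shoes", "socks", "insoles"],
    ["shoes", "socks"],
    ["jacket", "shoes", "socks"],
    ["jacket", "gloves"],
]

# Count how often each unordered pair of pages co-occurs in a session.
pair_counts = Counter()
for s in sessions:
    pair_counts.update(combinations(sorted(set(s)), 2))

def recommend(page, k=2):
    """Return the k pages most often viewed together with `page`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == page:
            scores[b] += n
        elif b == page:
            scores[a] += n
    return [p for p, _ in scores.most_common(k)]

print(recommend("shoes"))  # ['socks', ...]
```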
Velocity refers to the rate at which data are generated and the speed at which they
should be analyzed and acted upon. The proliferation of digital devices such as
smartphones and sensors has led to an unprecedented rate of data creation and is
driving a growing need for real-time analytics and evidence-based planning. Even
conventional retailers are generating high-frequency data. Wal-Mart, for instance,
processes more than one million transactions per hour (Cukier, 2010). The data
emanating from mobile devices and flowing through mobile apps produces torrents
of information that can be used to generate real-time, personalized offers for
everyday customers. This data provides sound information about customers, such as
geospatial location, demographics, and past buying patterns, which can be analyzed
in real time to create real customer value.
Given the soaring popularity of smartphones, retailers will soon have to deal with
hundreds of thousands of streaming data sources that demand real-time analytics.
Traditional data management systems are not capable of handling huge data feeds
instantaneously. This is where big data technologies come into play. They enable
firms to create real-time intelligence from high volumes of ‘perishable’ data.
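A common building block of such real-time intelligence is a sliding-window aggregate that discards ‘perishable’ events as they age out. The following is a minimal sketch in plain Python; the transaction values and time horizon are illustrative assumptions:

```python
import time
from collections import deque

class SlidingWindowSum:
    """Maintain the sum of event values seen in the last `horizon` seconds."""
    def __init__(self, horizon=60.0):
        self.horizon = horizon
        self.events = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, value, ts=None):
        ts = time.time() if ts is None else ts
        self.events.append((ts, value))
        self.total += value
        self._evict(ts)

    def _evict(self, now):
        # Drop events older than the horizon: their value has 'perished'.
        while self.events and now - self.events[0][0] > self.horizon:
            _, old = self.events.popleft()
            self.total -= old

w = SlidingWindowSum(horizon=60.0)
w.add(19.99, ts=0.0)
w.add(5.49, ts=30.0)
w.add(3.00, ts=90.0)      # the first event ages out here
print(round(w.total, 2))  # 8.49
```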
In addition to the three V's, other dimensions of big data have also been
mentioned. These include veracity (the unreliability inherent in some sources of
data), variability (the variation in the flow rates of data), and value (the low
value density of big data in its original form).
The relativity of big data volumes discussed earlier applies to all dimensions. Thus,
universal benchmarks do not exist for volume, variety, and velocity that define big
data. The defining limits depend upon the size, sector, and location of the firm and
these limits evolve over time. Also important is the fact that these dimensions are
not independent of each other. As one dimension changes, the likelihood increases
that another dimension will also change as a result. However, a ‘three-V tipping
point’ exists for every firm beyond which traditional data management and analysis
technologies become inadequate for deriving timely intelligence. The three-V tipping
point is the threshold beyond which firms start dealing with big data. Firms should
then trade off the expected future value of big data technologies against their
implementation costs.
Big data are worthless in a vacuum. Their potential value is unlocked only when
leveraged to drive decision making. To enable such evidence-based decision making,
organizations need efficient processes to turn high volumes of fast-moving and
diverse data into meaningful insights. The overall process of extracting insights from
big data can be broken down into five stages (Labrinidis & Jagadish, 2012), shown
in Fig. 3. These five stages fall into two main sub-processes: data management and
analytics. Data management involves processes and supporting technologies to
acquire and store data and to prepare and retrieve it for analysis. Analytics, on the
other hand, refers to techniques used to analyze and acquire intelligence from big
data. Thus, big data analytics can be viewed as a sub-process in the overall process
of ‘insight extraction’ from big data.
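The decomposition can be expressed as a simple pipeline skeleton. This is a schematic sketch only: the stage names follow the decomposition described above, and the function bodies are toy placeholders, not real implementations:

```python
# Schematic of the five-stage insight-extraction process: the first three
# stages form data management, the last two form analytics.
def acquire_and_record(source):          # stage 1: data management
    return [line.strip() for line in source]

def extract_clean_annotate(records):     # stage 2: data management
    return [r.lower() for r in records if r]

def integrate_aggregate(records):        # stage 3: data management
    return {r: records.count(r) for r in set(records)}

def model_and_analyze(table):            # stage 4: analytics
    return max(table, key=table.get)

def interpret(result):                   # stage 5: analytics
    return f"most frequent record: {result}"

raw = [" buy ", "browse", "buy", "", "browse", "buy"]
print(interpret(model_and_analyze(
    integrate_aggregate(extract_clean_annotate(acquire_and_record(raw))))))
# most frequent record: buy
```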
In the following sections, we briefly review big data analytical techniques for
structured and unstructured data. Given the breadth of the field, an
exhaustive list of techniques is beyond the scope of a single paper. Thus, the
following techniques represent a relevant subset of the tools available for big data
analytics.
Text analytics (text mining) refers to techniques that extract information from
textual data. Social network feeds, emails, blogs, online forums, survey responses,
corporate documents, news, and call center logs are examples of textual data held
by organizations. Text analytics involve statistical analysis, computational linguistics,
and machine learning. Text analytics enable businesses to convert large volumes of
human-generated text into meaningful summaries, which support evidence-based
decision-making. For instance, text analytics can be used to predict stock market movements
based on information extracted from financial news (Chung, 2014). We present a
brief review of text analytics methods below.
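As a toy illustration of the statistical side of text analytics, a bag-of-words polarity score over financial headlines can be computed in a few lines. This is a minimal sketch: the lexicon and headlines are invented, and real systems use far richer models:

```python
import re

# A tiny hand-made polarity lexicon (illustrative only).
POSITIVE = {"beats", "surge", "record", "growth", "upgrade"}
NEGATIVE = {"miss", "plunge", "lawsuit", "recall", "downgrade"}

def polarity(headline: str) -> int:
    """Positive score -> bullish wording, negative -> bearish wording."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

headlines = [
    "Acme beats estimates, revenue hits record growth",
    "Regulator lawsuit triggers plunge in Acme shares",
]
for h in headlines:
    print(polarity(h), h)   # 3 ... / -2 ...
```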
Audio analytics analyze and extract information from unstructured audio data. When
applied to human spoken language, audio analytics is also referred to as speech
analytics. Since these techniques have mostly been applied to spoken audio, the
terms audio analytics and speech analytics are often used interchangeably.
Currently, customer call centers and healthcare are the primary application areas of
audio analytics.
Call centers use audio analytics for efficient analysis of thousands or even millions of
hours of recorded calls. These techniques help improve customer experience,
evaluate agent performance, enhance sales turnover rates, monitor compliance with
different policies (e.g., privacy and security policies), gain insight into customer
behavior, and identify product or service issues, among many other tasks. Audio
analytics systems can be designed to analyze a live call, formulate cross/up-selling
recommendations based on the customer's past and present interactions, and
provide feedback to agents in real time. In addition, automated call centers use the
Interactive Voice Response (IVR) platforms to identify and handle frustrated callers.
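One simple, classical building block of such systems is measuring talk time versus dead air in a recorded call from frame energy. The following is a minimal sketch using only the Python standard library; the file name and silence threshold are illustrative assumptions:

```python
import struct
import wave

def dead_air_ratio(path, frame_ms=20, threshold=500):
    """Fraction of frames whose RMS energy falls below `threshold` (silence)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit PCM"
        n_per_frame = int(w.getframerate() * frame_ms / 1000) * w.getnchannels()
        samples = struct.unpack(f"<{w.getnframes() * w.getnchannels()}h",
                                w.readframes(w.getnframes()))
    silent = total = 0
    for i in range(0, len(samples) - n_per_frame, n_per_frame):
        frame = samples[i:i + n_per_frame]
        rms = (sum(s * s for s in frame) / n_per_frame) ** 0.5
        silent += rms < threshold
        total += 1
    return silent / max(total, 1)

# A call with long stretches of dead air may indicate hold time or a
# struggling agent; such calls can be flagged for supervisor review.
# print(f"{dead_air_ratio('call_0042.wav'):.0%} dead air")  # hypothetical file
```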
Video analytics, also known as video content analysis (VCA), involves a variety of
techniques to monitor, analyze, and extract meaningful information from video
streams. Although video analytics is still in its infancy compared to other types of
data mining (Panigrahi, Abraham, & Das, 2010), various techniques have already
been developed for processing real-time as well as pre-recorded videos. The
increasing prevalence of closed-circuit television (CCTV) cameras and the booming
popularity of video-sharing websites are the two leading contributors to the growth
of computerized video analysis. A key challenge, however, is the sheer size of video
data. To put this into perspective, one second of a high-definition video, in terms of
size, is equivalent to over 2000 pages of text (Manyika et al., 2011). Now consider
that 100 hours of video are uploaded to YouTube every minute (YouTube Statistics,
n.d.).
Big data technologies turn this challenge into opportunity. Obviating the need for
cost-intensive and risk-prone manual processing, big data technologies can be
leveraged to automatically sift through and draw intelligence from thousands of
hours of video. As a result, big data technology is the third factor that has
contributed to the development of video analytics.
The primary application of video analytics in recent years has been in automated
security and surveillance systems. In addition to their high cost, labor-based
surveillance systems tend to be less effective than automatic systems (e.g., Hakeem
et al., 2012 report that security personnel cannot remain focused on surveillance
tasks for more than 20 minutes). Video analytics can efficiently and effectively
perform surveillance functions such as detecting breaches of restricted zones,
identifying objects removed or left unattended, detecting loitering in a specific area,
recognizing suspicious activities, and detecting camera tampering, to name a few.
Upon detection of a threat, the surveillance system may notify security personnel in
real time or trigger an automatic action (e.g., sound alarm, lock doors, or turn on
lights).
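For a flavor of how such systems detect activity, the classical frame-differencing step behind many intrusion detectors fits in a few lines of Python with OpenCV. This is a minimal sketch, not a production detector; the video path and pixel threshold are illustrative:

```python
import cv2  # pip install opencv-python

def detect_motion(path, pixel_thresh=25, area_frac=0.01):
    """Yield frame indices where the scene changed vs. the previous frame."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    prev = cv2.GaussianBlur(cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
        delta = cv2.absdiff(prev, gray)                  # per-pixel change
        _, mask = cv2.threshold(delta, pixel_thresh, 255, cv2.THRESH_BINARY)
        # Flag the frame if enough pixels changed (a crude 'activity' test).
        if cv2.countNonZero(mask) > area_frac * mask.size:
            yield idx
        prev = gray
    cap.release()

# for i in detect_motion("lobby_cam.mp4"):   # hypothetical CCTV recording
#     print("motion at frame", i)
```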
The data generated by CCTV cameras in retail outlets can be extracted for business
intelligence. Marketing and operations management are the primary application
areas. For instance, smart algorithms can collect demographic information about
customers, such as age, gender, and ethnicity. Similarly, retailers can count the
number of customers, measure the time they stay in the store, detect their
movement patterns, measure their dwell time in different areas, and monitor queues
in real time. Valuable insights can be obtained by correlating this information with
customer demographics to drive decisions for product placement, price, assortment
optimization, promotion design, cross-selling, layout optimization, and staffing.
Another potential application of video analytics in retail lies in the study of buying
behavior of groups. Among family members who shop together, only one interacts
with the store at the cash register, causing traditional systems to miss data on
buying patterns of other members. Video analytics can help retailers address this
missed opportunity by providing information about the size of the group, the group's
demographics, and the individual members’ buying behavior.
Automatic video indexing and retrieval constitutes another domain of video analytics
applications. The widespread emergence of online and offline videos has highlighted
the need to index multimedia content for easy search and retrieval. The indexing of
a video can be performed based on different levels of information available in a
video including the metadata, the soundtrack, the transcripts, and the visual content
of the video. In the metadata-based approach, relational database management
systems (RDBMS) are used for video search and retrieval. Audio analytics and text
analytics techniques can be applied to index a video based on the associated
soundtracks and transcripts, respectively. A comprehensive review of approaches
and techniques for video indexing is presented in Hu, Xie, Li, Zeng, and Maybank
(2011).
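In the metadata-based approach mentioned above, retrieval reduces to an ordinary relational query. The following is a minimal sketch using Python's built-in SQLite; the schema and sample rows are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE videos (
    id INTEGER PRIMARY KEY, title TEXT, camera TEXT,
    recorded_at TEXT, duration_s INTEGER, tags TEXT)""")
con.executemany(
    "INSERT INTO videos (title, camera, recorded_at, duration_s, tags) "
    "VALUES (?, ?, ?, ?, ?)",
    [("entrance am", "cam-01", "2014-03-02T08:00", 3600, "entrance,weekday"),
     ("checkout pm", "cam-07", "2014-03-02T17:00", 3600, "checkout,queue"),
     ("entrance pm", "cam-01", "2014-03-02T17:00", 3600, "entrance,rush")])

# Retrieval is a plain relational query over the metadata, not the pixels.
rows = con.execute(
    "SELECT title, recorded_at FROM videos "
    "WHERE camera = ? AND tags LIKE ?", ("cam-01", "%entrance%")).fetchall()
print(rows)  # [('entrance am', ...), ('entrance pm', ...)]
```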
In terms of the system architecture, there exist two approaches to video analytics,
namely server-based and edge-based. In the server-based architecture, the video
captured by cameras is routed to a centralized server that performs the analytics;
bandwidth limits usually require the video to be compressed first, which can lower
the accuracy of the analysis, but the approach offers economies of scale and ease of
maintenance. In the edge-based architecture, the analytics are applied at the ‘edge’
of the system, that is, locally and on the raw video captured by the camera; the
entire content of the stream remains available for analysis, at the cost of the
limited processing power available on the camera.
Social media analytics refer to the analysis of structured and unstructured data from
social media channels. Social media is a broad term encompassing a variety of online
platforms that allow users to create and exchange content. Social media can be
categorized into the following types: Social networks (e.g., Facebook and LinkedIn),
blogs (e.g., Blogger and WordPress), microblogs (e.g., Twitter and Tumblr), social
news (e.g., Digg and Reddit), social bookmarking (e.g., Delicious and StumbleUpon),
media sharing (e.g., Instagram and YouTube), wikis (e.g., Wikipedia and Wikihow),
question-and-answer sites (e.g., Yahoo! Answers and Ask.com) and review sites
(e.g., Yelp, TripAdvisor) (Barbier and Liu, 2011, Gundecha and Liu, 2012). Also,
many mobile apps, such as Find My Friend, provide a platform for social interactions
and, hence, serve as social media channels.
Although research on social networks dates back to the early 1920s, social media
analytics is a nascent field that emerged after the advent of Web 2.0 in the early
2000s. The key characteristic of modern social media analytics is
its data-centric nature. The research on social media analytics spans across several
disciplines, including psychology, sociology, anthropology, computer science,
mathematics, physics, and economics. Marketing has been the primary application of
social media analytics in recent years. This can be attributed to the widespread and
growing adoption of social media by consumers worldwide (He, Zha, & Li, 2013), to
the extent that Forrester Research, Inc., projects social media to be the second-
fastest growing marketing channel in the US between 2011 and 2016 (VanBoskirk,
Overby, & Takvorian, 2011).
User-generated content (e.g., sentiments, images, videos, and bookmarks) and the
relationships and interactions between the network entities (e.g., people,
organizations, and products) are the two sources of information in social media.
Based on this categorization, social media analytics can be classified into two
groups: content-based analytics, which focuses on the data users post (such as text,
images, and videos), and structure-based analytics, which focuses on the
relationships among network entities.
Various techniques have recently emerged to extract information from the structure
of social networks. We briefly discuss these below.
Social influence analysis refers to techniques that are concerned with modeling and
evaluating the influence of actors and connections in a social network. Naturally, the
behavior of an actor in a social network is affected by others. Thus, it is desirable to
evaluate the participants’ influence, quantify the strength of connections, and
uncover the patterns of influence diffusion in a network. Social influence analysis
techniques can be leveraged in viral marketing to efficiently enhance brand
awareness and adoption.
Researchers have developed various frameworks to model the diffusion of influence
through social networks. The Linear Threshold Model (LTM) and Independent Cascade Model
(ICM) are two well-known examples of such frameworks (Sun & Tang, 2011).
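Of the two, the ICM is the easier to sketch: each newly activated node gets a single chance to activate each still-inactive neighbor with some probability. Below is a minimal simulation in plain Python; the toy graph and activation probability are illustrative assumptions:

```python
import random

def independent_cascade(graph, seeds, p=0.2, rng=None):
    """Simulate ICM: each new activation gets one chance, with probability
    p, to activate each still-inactive neighbor."""
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for node in frontier:
            for nb in graph.get(node, ()):
                if nb not in active and rng.random() < p:
                    active.add(nb)
                    nxt.append(nb)
        frontier = nxt
    return active

# A toy follower graph: node -> nodes it can influence.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"], "e": ["f"]}
sizes = [len(independent_cascade(graph, {"a"}, p=0.4, rng=random.Random(i)))
         for i in range(1000)]
print(sum(sizes) / len(sizes))  # expected cascade size from seeding 'a'
```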
At its core, predictive analytics seek to uncover patterns and capture relationships
in data. Predictive analytics techniques fall into two groups. Some techniques, such
as moving averages, attempt to discover historical patterns in the outcome
variable(s) and extrapolate them to the future. Others, such as linear regression,
aim to capture the interdependencies between outcome variable(s) and explanatory
variables and exploit them to make predictions.
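A minimal sketch contrasting the two groups on a toy series, in Python with NumPy; the data are invented:

```python
import numpy as np

# A toy monthly sales series with an upward trend.
y = np.array([10., 12., 11., 14., 15., 17., 16., 19.])
t = np.arange(len(y))

# Group 1: extrapolate the historical pattern of the outcome itself.
window = 3
ma_forecast = y[-window:].mean()        # next value ~ recent average

# Group 2: capture the dependence of the outcome on an explanatory
# variable (here, time) and project it forward.
slope, intercept = np.polyfit(t, y, 1)  # ordinary least squares line
reg_forecast = slope * len(y) + intercept

print(f"moving-average forecast: {ma_forecast:.2f}")   # ~17.33
print(f"regression forecast:     {reg_forecast:.2f}")  # ~19.71
```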
The sheer size and unique characteristics of big data also pose challenges for
conventional statistical techniques (Fan, Han, & Liu, 2014). These include:
• Heterogeneity. Big data are often obtained from different sources and represent
information from different sub-populations. As a result, big data are highly
heterogeneous. In small samples, sub-population data are deemed outliers because of
their insufficient frequency. The sheer size of big data sets, however, creates a
unique opportunity to model the heterogeneity arising from sub-population data,
which requires sophisticated statistical techniques.
• Noise accumulation. Estimating predictive models for big data often involves the
simultaneous estimation of several parameters. The accumulated estimation error (or
noise) across parameters can dominate the magnitudes of the variables that have true
effects within the model. In other words, variables with genuine explanatory power
may be overlooked as a result of noise accumulation. The simulation sketched below
illustrates this effect.
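The following NumPy sketch, under simple Gaussian assumptions invented for illustration, shows how the largest purely spurious estimate grows as more irrelevant parameters are estimated alongside one true effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # observations behind each parameter estimate

for d in (10, 1000, 100_000):   # number of irrelevant parameters
    # d parameters whose true value is 0; each estimate is a sample mean,
    # so its estimation error is roughly N(0, 1/n).
    spurious = rng.normal(0.0, 1.0, size=(d, n)).mean(axis=1)
    print(f"d={d:>6}: largest spurious estimate = "
          f"{np.abs(spurious).max():.2f} (true effect = 1.00)")
# As d grows, the largest purely-noise estimate approaches and can exceed
# the true effect, so genuinely predictive variables get crowded out.
```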
4. Concluding remarks
The objective of this paper is to describe, review, and reflect on big data. The paper
first defined what is meant by big data to consolidate the divergent discourse on big
data. We presented various definitions of big data, highlighting the fact that size is
only one dimension of big data. Other dimensions, such as velocity and variety, are
equally important. The paper's primary focus has been on analytics to gain valid and
valuable insights from big data. We highlight the point that predictive analytics,
which deals mostly with structured data, overshadows other forms of analytics
applied to unstructured data, which constitutes 95% of big data. We reviewed
analytics techniques for text, audio, video, and social media data, as well as
predictive analytics. The paper makes the case for new statistical techniques for big
data to address the peculiarities that differentiate big data from smaller data sets.
Most statistical methods in practice have been devised for smaller data sets
comprising samples.
Although major innovations in analytical techniques for big data have not yet taken
place, one anticipates the emergence of such novel analytics in the near future. For
instance, real-time analytics will likely become a prolific field of research because of
the growth in location-aware social media and mobile apps. Since big data are noisy,
highly interrelated, and often unreliable, their rise will likely spur the
development of statistical techniques better suited to mining big data while
remaining sensitive to their unique characteristics. Going beyond samples,
additional valuable insights could be
obtained from the massive volumes of less ‘trustworthy’ data.
Acknowledgment
The authors would like to acknowledge research and editing support provided by Ms.
Ioana Moca.
References
Aggarwal, C. C. (2011). An introduction to social network data analytics. In C. C. Aggarwal (Ed.), Social network data analytics (pp. 1-15). Springer.
Barbier, G., & Liu, H. (2011). Data mining in social media. In C. C. Aggarwal (Ed.), Social network data analytics (pp. 327-352). Springer.
Beaver, D., Kumar, S., Li, H. C., Sobel, J., & Vajgel, P. (2010). Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the ninth USENIX conference on operating systems design and implementation (pp. 1-8). USENIX Association.
Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165-1188.
Chung, W. (2014). BizPro: Extracting and categorizing business intelligence factors from textual news articles. International Journal of Information Management, 34(2), 272-284.
Cukier, K. (2010, February 25). Data, data everywhere: A special report on managing information. The Economist. Retrieved from http://www.economist.com/node/15557443
Diebold, F. X. (2012). A personal perspective on the origin(s) and development of "big data": The phenomenon, the term, and the discipline (Scholarly Paper No. ID 2202843). Social Science Research Network. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2202843
Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review, 1(2), 293-314.
Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849-911.
Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of the ACM, 56(4), 82-89.
Gartner IT Glossary. (n.d.).