Big Data Analytics

1. Introduction

This paper documents the basic concepts relating to big data. It attempts to
consolidate the hitherto fragmented discourse on what constitutes big data, what
metrics define the size and other characteristics of big data, and what tools and
technologies exist to harness the potential of big data.

From corporate leaders to municipal planners and academics, big data are the
subject of attention, and to some extent, fear. The sudden rise of big data has left
many unprepared. In the past, new technological developments first appeared in
technical and academic publications. The knowledge and synthesis later seeped into
other avenues of knowledge mobilization, including books. The fast evolution of big
data technologies and the ready acceptance of the concept by public and private
sectors left little time for the discourse to develop and mature in the academic
domain. Authors and practitioners leapfrogged to books and other electronic media
for immediate and wide circulation of their work on big data. Thus, one finds several
books on big data, including Big Data for Dummies, but not enough fundamental
discourse in academic publications.

The leapfrogging of the discourse on big data to more popular outlets implies that a
coherent understanding of the concept and its nomenclature is yet to develop. For
instance, there is little consensus around the fundamental question of how big the
data has to be to qualify as ‘big data’. Thus, there exists the need to document in
the academic press the evolution of big data concepts and technologies.

A key contribution of this paper is to bring forth the oft-neglected dimensions of big
data. The popular discourse on big data, which is dominated and influenced by the
marketing efforts of large software and hardware developers, focuses on predictive
analytics and structured data. It ignores the largest component of big data, which is
unstructured and is available as audio, images, video, and unstructured text. It is
estimated that the analytics-ready structured data forms only a small subset of big
data. The unstructured data, especially data in video format, is the largest
component of big data that is only partially archived.

This paper is organized as follows. We begin the paper by defining big data. We
highlight the fact that size is only one of several dimensions of big data. Other
characteristics, such as the frequency with which data are generated, are equally
important in defining big data. We then expand the discussion on various types of
big data, namely text, audio, video, and social media. We apply the analytics lens to
the discussion on big data. Hence, when we discuss data in video format, we focus
on methods and tools to analyze data in video format.

Given that the discourse on big data is contextualized in predictive analytics
frameworks, we discuss how analytics have captured the imaginations of business
and government leaders and describe the state-of-practice of a rapidly evolving
industry. We also highlight the perils of big data, such as spurious correlation, which
have hitherto escaped serious inquiry. The discussion has remained focused on
correlation, ignoring the more nuanced and involved discussion on causation. We
conclude by highlighting the developments expected in big data analytics in the
near future.

2. Defining big data

While the term is ubiquitous today, ‘big data’ as a concept is nascent and has
uncertain origins. Diebold (2012) argues that the term “big data … probably
originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid-
1990s, in which John Mashey figured prominently”. Despite the references to the
mid-nineties, Fig. 1 shows that the term became widespread as recently as 2011.
The current hype can be attributed to the promotional initiatives by IBM and other
leading technology companies who invested in building the niche analytics market.


Fig. 1. Frequency distribution of documents containing the term “big data” in
ProQuest Research Library.

Big data definitions have evolved rapidly, which has raised some confusion. This is
evident from an online survey of 154 C-suite global executives conducted by Harris
Interactive on behalf of SAP in April 2012 (“Small and midsize companies look to
make big gains with big data,” 2012). Fig. 2 shows how executives differed in their
understanding of big data, where some definitions focused on what it is, while
others tried to answer what it does.


Fig. 2. Definitions of big data based on an online survey of 154 global executives in
April 2012.

Clearly, size is the first characteristic that comes to mind when considering the
question “what is big data?” However, other characteristics of big data have emerged
recently. For instance, Laney (2001) suggested that Volume, Variety,
and Velocity (or the Three V's) are the three dimensions of challenges in data
management. The Three V's have emerged as a common framework to describe big
data (Chen et al., 2012, Kwon et al., 2014). For example, Gartner, Inc. defines big
data in similar terms:

“Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for enhanced
insight and decision making.” (Gartner IT Glossary, n.d.)

Similarly, TechAmerica Foundation defines big data as follows:

“Big data is a term that describes large volumes of high-velocity, complex and
variable data that require advanced techniques and technologies to enable the
capture, storage, distribution, management, and analysis of the information.”
(TechAmerica Foundation's Federal Big Data Commission, 2012)

We describe the Three V's below.

Volume refers to the magnitude of data. Big data sizes are reported in multiple
terabytes and petabytes. A survey conducted by IBM in mid-2012 revealed that just
over half of the 1144 respondents considered datasets over one terabyte to be big
data (Schroeck, Shockley, Smart, Romero-Morales, & Tufano, 2012). One terabyte
stores as much data as would fit on 1500 CDs or 220 DVDs, enough to store around
16 million Facebook photographs. Beaver, Kumar, Li, Sobel, and Vajgel (2010) report
that Facebook processes up to one million photographs per second. One petabyte
equals 1024 terabytes. Earlier estimates suggest that Facebook stored 260 billion
photos using storage space of over 20 petabytes.
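
These magnitudes are easy to sanity-check. The short calculation below derives the implied average photo sizes; these per-photo figures are our own estimates, not values reported in the cited studies:

    TB = 1024 ** 4          # bytes in one terabyte
    PB = 1024 * TB          # one petabyte equals 1024 terabytes

    print(TB / 16_000_000 / 1024)             # ~67 KB per photo at 16 million photos/TB
    print(20 * PB / 260_000_000_000 / 1024)   # ~85 KB per photo for 260 billion photos in 20 PB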

Definitions of big data volumes are relative and vary by factors such as time and the
type of data. What may be deemed big data today may not meet the threshold in
the future because storage capacities will increase, allowing even bigger data sets to
be captured. In addition, the type of data, discussed under variety, defines what is
meant by ‘big’. Two datasets of the same size may require different data
management technologies based on their type, e.g., tabular versus video data. Thus,
definitions of big data also depend upon the industry. These considerations therefore
make it impractical to define a specific threshold for big data volumes.

Variety refers to the structural heterogeneity in a dataset. Technological advances
allow firms to use various types of structured, semi-structured, and unstructured
data. Structured data, which constitutes only 5% of all existing data (Cukier, 2010),
refers to the tabular data found in spreadsheets or relational databases. Text,
images, audio, and video are examples of unstructured data, which sometimes lack
the structural organization required by machines for analysis. Spanning a continuum
between fully structured and unstructured data, the format of semi-structured data
does not conform to strict standards. Extensible Markup Language (XML), a textual
language for exchanging data on the Web, is a typical example of semi-structured
data. XML documents contain user-defined data tags, which make them machine-readable.

A high level of variety, a defining characteristic of big data, is not necessarily new.
Organizations have been hoarding unstructured data from internal sources (e.g.,
sensor data) and external sources (e.g., social media). However, the emergence of
new data management technologies and analytics, which enable organizations to
leverage data in their business processes, is the innovative aspect. For instance,
facial recognition technologies empower the brick-and-mortar retailers to acquire
intelligence about store traffic, the age or gender composition of their customers,
and their in-store movement patterns. This invaluable information is leveraged in
decisions related to product promotions, placement, and staffing. Clickstream data
provides a wealth of information about customer behavior and browsing patterns to
online retailers. Clickstream data reveals the timing and sequence of pages viewed by
a customer. Using big data analytics, even small and medium-sized enterprises
(SMEs) can mine massive volumes of semi-structured data to improve website
designs and implement effective cross-selling and personalized product
recommendation systems.

Velocity refers to the rate at which data are generated and the speed at which they
should be analyzed and acted upon. The proliferation of digital devices such as
smartphones and sensors has led to an unprecedented rate of data creation and is
driving a growing need for real-time analytics and evidence-based planning. Even
conventional retailers are generating high-frequency data. Wal-Mart, for instance,
processes more than one million transactions per hour (Cukier, 2010). The data
emanating from mobile devices and flowing through mobile apps produces torrents
of information that can be used to generate real-time, personalized offers for
everyday customers. This data provides sound information about customers, such as
geospatial location, demographics, and past buying patterns, which can be analyzed
in real time to create real customer value.

Given the soaring popularity of smartphones, retailers will soon have to deal with
hundreds of thousands of streaming data sources that demand real-time analytics.

Traditional data management systems are not capable of handling huge data feeds
instantaneously. This is where big data technologies come into play. They enable
firms to create real-time intelligence from high volumes of ‘perishable’ data.

In addition to the three V's, other dimensions of big data have also been mentioned.
These include:

Veracity. IBM coined Veracity as the fourth V, which represents the
unreliability inherent in some sources of data. For example, customer
sentiments in social media are uncertain in nature, since they entail human
judgment. Yet they contain valuable information. Thus the need to deal with
imprecise and uncertain data is another facet of big data, which is addressed
using tools and analytics developed for management and mining of uncertain
data.

Variability (and complexity). SAS introduced Variability and Complexity as two
additional dimensions of big data. Variability refers to the variation in the data
flow rates. Often, big data velocity is not consistent and has periodic peaks
and troughs. Complexity refers to the fact that big data are generated
through a myriad of sources. This imposes a critical challenge: the need to
connect, match, cleanse and transform data received from different sources.

Value. Oracle introduced Value as a defining attribute of big data. Based on
Oracle's definition, big data are often characterized by relatively “low value
density”. That is, the data received in the original form usually has a low
value relative to its volume. However, a high value can be obtained by
analyzing large volumes of such data.

The relativity of big data volumes discussed earlier applies to all dimensions. Thus,
universal benchmarks do not exist for volume, variety, and velocity that define big
data. The defining limits depend upon the size, sector, and location of the firm and
these limits evolve over time. Also important is the fact that these dimensions are
not independent of each other. As one dimension changes, the likelihood increases
that another dimension will also change as a result. However, a ‘three-V tipping
point’ exists for every firm beyond which traditional data management and analysis
technologies become inadequate for deriving timely intelligence. The Three-V tipping
point is the threshold beyond which firms start dealing with big data. Firms should
then trade off the future value expected from big data technologies against their
implementation costs.

3. Big data analytics

Big data are worthless in a vacuum. Their potential value is unlocked only when
leveraged to drive decision making. To enable such evidence-based decision making,
organizations need efficient processes to turn high volumes of fast-moving and
diverse data into meaningful insights. The overall process of extracting insights from
big data can be broken down into five stages (Labrinidis & Jagadish, 2012), shown
in Fig. 3. These five stages form the two main sub-processes: data management and
analytics. Data management involves processes and supporting technologies to
acquire and store data and to prepare and retrieve it for analysis. Analytics, on the
other hand, refers to techniques used to analyze and acquire intelligence from big
data. Thus, big data analytics can be viewed as a sub-process in the overall process
of ‘insight extraction’ from big data.


Fig. 3. Processes for extracting insights from big data.

In the following sections, we briefly review big data analytical techniques for
structured and unstructured data. Given the breadth of the techniques, an
exhaustive list of techniques is beyond the scope of a single paper. Thus, the
following techniques represent a relevant subset of the tools available for big data
analytics.

3.1. Text analytics

Text analytics (text mining) refers to techniques that extract information from
textual data. Social network feeds, emails, blogs, online forums, survey responses,
corporate documents, news, and call center logs are examples of textual data held
by organizations. Text analytics involve statistical analysis, computational linguistics,
and machine learning. Text analytics enable businesses to convert large volumes of
human generated text into meaningful summaries, which support evidence-based
decision-making. For instance, text analytics can be used to predict the stock market
based on information extracted from financial news (Chung, 2014). We present a
brief review of text analytics methods below.

Information extraction (IE) techniques extract structured data from unstructured
text. For example, IE algorithms can extract structured information such as drug
name, dosage, and frequency from medical prescriptions. Two sub-tasks in IE are
Entity Recognition (ER) and Relation Extraction (RE) (Jiang, 2012). ER finds names
in text and classifies them into predefined categories such as person, date, location,
and organization. RE finds and extracts semantic relationships between entities
(e.g., persons, organizations, drugs, genes, etc.) in the text. For example, given the
sentence “Steve Jobs co-founded Apple Inc. in 1976”, an RE system can extract
relations such as FounderOf [Steve Jobs, Apple Inc.] or FoundedIn [Apple Inc.,
1976].
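
As an illustrative sketch of ER, the snippet below uses the open-source spaCy library (assuming the en_core_web_sm model has been installed); RE would then operate on the recognized entities:

    import spacy

    nlp = spacy.load("en_core_web_sm")    # small pretrained English pipeline
    doc = nlp("Steve Jobs co-founded Apple Inc. in 1976.")

    # Entity Recognition: classify names into predefined categories.
    for ent in doc.ents:
        print(ent.text, ent.label_)       # e.g., Steve Jobs PERSON; Apple Inc. ORG; 1976 DATE

    # Relation Extraction (e.g., FounderOf[Steve Jobs, Apple Inc.]) would build
    # on these entities, typically via dependency parses or trained classifiers.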

Text summarization techniques automatically produce a succinct summary of a
single or multiple documents. The resulting summary conveys the key information in
the original text(s). Applications include scientific and news articles, advertisements,
emails, and blogs. Broadly speaking, summarization follows two approaches: the
extractive approach and the abstractive approach. In extractive summarization, a
summary is created from the original text units (usually sentences). The resulting
summary is a subset of the original document. Based on the extractive approach,
formulating a summary involves determining the salient units of a text and stringing
them together. The importance of the text units is evaluated by analyzing their
location and frequency in the text. Extractive summarization techniques do not
require an ‘understanding’ of the text. In contrast, abstractive summarization
techniques involve extracting semantic information from the text. The summaries
contain text units that are not necessarily present in the original text. In order to
parse the original text and generate the summary, abstractive summarization
incorporates advanced Natural Language Processing (NLP) techniques. As a result,
abstractive systems tend to generate more coherent summaries than the extractive
systems do (Hahn & Mani, 2000). However, extractive systems are easier to adopt,
especially for big data.
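
To make the extractive approach concrete, the following minimal sketch scores sentences by the average corpus frequency of their words and keeps the top-ranked ones; production systems add positional features, redundancy control, and proper tokenization:

    import re
    from collections import Counter

    def extractive_summary(text, n_sentences=2):
        """Pick the n most salient sentences, scored by average word frequency."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"[a-z']+", text.lower()))

        def score(sentence):
            tokens = re.findall(r"[a-z']+", sentence.lower())
            return sum(freq[t] for t in tokens) / (len(tokens) or 1)

        top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
        # Emit the selected sentences in their original order.
        return " ".join(s for s in sentences if s in top)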

Question answering (QA) techniques provide answers to questions posed in natural
language. Apple's Siri and IBM's Watson are examples of commercial QA systems.
These systems have been implemented in healthcare, finance, marketing, and
education. Similar to abstractive summarization, QA systems rely on complex NLP
techniques. QA techniques are further classified into three categories: the
information retrieval (IR)-based approach, the knowledge-based approach, and the
hybrid approach. IR-based QA systems often have three sub-components. First is
the question processing, used to determine details, such as the question type,
question focus, and the answer type, which are used to create a query. Second
is document processing which is used to retrieve relevant pre-written passages from
a set of existing documents using the query formulated in question processing. Third
is answer processing, used to extract candidate answers from the output of the
previous component, rank them, and return the highest-ranked candidate as the
output of the QA system. Knowledge-based QA systems generate a semantic
description of the question, which is then used to query structured resources.
Knowledge-based QA systems are particularly useful for restricted domains, such as
tourism, medicine, and transportation, where large volumes of pre-written
documents do not exist. Such domains lack data redundancy, which is required for
IR-based QA systems. Apple's Siri is an example of a QA system that exploits the
knowledge-based approach. In hybrid QA systems, like IBM's Watson, while the
question is semantically analyzed, candidate answers are generated using the IR
methods.
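
The three sub-components of an IR-based QA system can be sketched as function stubs; the heuristics below are deliberately naive placeholders for what are, in practice, sophisticated NLP components:

    def process_question(question):
        """Question processing: infer the answer type and build a query."""
        answer_type = "PERSON" if question.lower().startswith("who") else "OTHER"
        stop_words = {"who", "what", "when", "the", "a"}
        query = [w for w in question.rstrip("?").split() if w.lower() not in stop_words]
        return answer_type, query

    def process_documents(query, passages):
        """Document processing: rank pre-written passages by query-term overlap."""
        overlap = lambda p: sum(w.lower() in p.lower() for w in query)
        return sorted((p for p in passages if overlap(p) > 0), key=overlap, reverse=True)

    def process_answer(answer_type, ranked):
        """Answer processing: return the top-ranked candidate (toy heuristic)."""
        return ranked[0] if ranked else None

    passages = ["Apple Inc. was co-founded by Steve Jobs in 1976.",
                "The first iPhone was released in 2007."]
    answer_type, query = process_question("Who founded Apple?")
    print(process_answer(answer_type, process_documents(query, passages)))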

Sentiment analysis (opinion mining) techniques analyze opinionated text, which
contains people's opinions toward entities such as products, organizations,
individuals, and events. Businesses are increasingly capturing more data about their
customers’ sentiments, which has led to the proliferation of sentiment analysis (Liu,
2012). Marketing, finance, and the political and social sciences are the major
application areas of sentiment analysis. Sentiment analysis techniques are further
divided into three sub-groups, namely document-level, sentence-level, and aspect-
based. Document-level techniques determine whether the whole document
expresses a negative or a positive sentiment. The assumption is that the document
contains sentiments about a single entity. While certain techniques categorize a
document into two classes, negative and positive, others incorporate more sentiment
classes (like Amazon's five-star system) (Feldman, 2013). Sentence-level
techniques attempt to determine the polarity of a single sentiment about a known
entity expressed in a single sentence. Sentence-level techniques must first
distinguish subjective sentences from objective ones. Hence, sentence-level
techniques tend to be more complex compared to document-level techniques.
Aspect-based techniques recognize all sentiments within a document and identify the
aspects of the entity to which each sentiment refers. For instance, customer product
reviews usually contain opinions about different aspects (or features) of a product.
Using aspect-based techniques, the vendor can obtain valuable information about
different features of the product that would be missed if the sentiment is only
classified in terms of polarity.
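
At the simplest, lexicon-counting end of this spectrum, a document-level classifier can be sketched in a few lines; the tiny lexicons below stand in for the large lexicons or trained models used in practice:

    POSITIVE = {"good", "great", "excellent", "love", "superb"}
    NEGATIVE = {"bad", "poor", "terrible", "hate", "awful"}

    def document_polarity(text):
        tokens = text.lower().split()
        score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    # A review mixing aspects comes out 'neutral' at the document level --
    # precisely the detail that aspect-based techniques recover.
    print(document_polarity("the battery life is great but the screen is poor"))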

3.2. Audio analytics

Audio analytics analyze and extract information from unstructured audio data. When
applied to human spoken language, audio analytics is also referred to as speech
analytics. Since these techniques have mostly been applied to spoken audio, the
terms audio analytics and speech analytics are often used interchangeably.
Currently, customer call centers and healthcare are the primary application areas of
audio analytics.

Call centers use audio analytics for efficient analysis of thousands or even millions of
hours of recorded calls. These techniques help improve customer experience,
evaluate agent performance, enhance sales turnover rates, monitor compliance with
different policies (e.g., privacy and security policies), gain insight into customer
behavior, and identify product or service issues, among many other tasks. Audio
analytics systems can be designed to analyze a live call, formulate cross/up-selling
recommendations based on the customer's past and present interactions, and
provide feedback to agents in real time. In addition, automated call centers use the
Interactive Voice Response (IVR) platforms to identify and handle frustrated callers.

In healthcare, audio analytics support diagnosis and treatment of certain medical
conditions that affect the patient's communication patterns (e.g., depression,
schizophrenia, and cancer) (Hirschberg, Hjalmarsson, & Elhadad, 2010). Also, audio
analytics can help analyze an infant's cries, which contain information about the
infant's health and emotional status (Patil, 2010). The vast amount of data recorded
through speech-driven clinical documentation systems is another driver for the
adoption of audio analytics in healthcare.

Speech analytics follows two common technological approaches: the transcript-based
approach (widely known as large-vocabulary continuous speech recognition,
LVCSR) and the phonetic-based approach. These are explained below.



LVCSR systems follow a two-phase process: indexing and searching. In the
first phase, they attempt to transcribe the speech content of the audio. This is
performed using automatic speech recognition (ASR) algorithms that match
sounds to words. The words are identified based on a predefined dictionary.
If the system fails to find the exact word in the dictionary, it returns the most
similar one. The output of the system is a searchable index file that contains
information about the sequence of the words spoken in the speech. In the
second phase, standard text-based methods are used to find the search term
in the index file (a toy sketch of this two-phase process follows these two
approaches).

Phonetic-based systems work with sounds or phonemes. Phonemes are the
perceptually distinct units of sound in a specified language that distinguish
one word from another (e.g., the phonemes /k/ and /b/ differentiate the
meanings of “cat” and “bat”). Phonetic-based systems also consist of two
phases: phonetic indexing and searching. In the first phase, the system
translates the input speech into a sequence of phonemes. This is in contrast
to LVCSR systems, where the speech is converted into a sequence of words.
In the second phase, the system searches the output of the first phase for the
phonetic representation of the search terms.
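
To make the LVCSR-style two-phase process concrete, here is a toy sketch in Python; it assumes an ASR engine has already produced transcripts (the call IDs and transcripts below are invented for illustration):

    from collections import defaultdict

    # Phase 1 (indexing): build an inverted index from words to the
    # recordings in which they occur.
    transcripts = {
        "call_001": "i want to cancel my subscription today",
        "call_002": "thank you the agent resolved my billing issue",
    }
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in text.split():
            index[word].add(call_id)

    # Phase 2 (searching): standard text search against the index file.
    print(sorted(index["cancel"]))   # ['call_001']

A phonetic-based system would be indexed in the same manner, except that the keys would be phoneme sequences rather than dictionary words.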

3.3. Video analytics

Video analytics, also known as video content analysis (VCA), involves a variety of
techniques to monitor, analyze, and extract meaningful information from video
streams. Although video analytics is still in its infancy compared to other types of
data mining (Panigrahi, Abraham, & Das, 2010), various techniques have already
been developed for processing real-time as well as pre-recorded videos. The
increasing prevalence of closed-circuit television (CCTV) cameras and the booming
popularity of video-sharing websites are the two leading contributors to the growth
of computerized video analysis. A key challenge, however, is the sheer size of video
data. To put this into perspective, one second of a high-definition video, in terms of
size, is equivalent to over 2000 pages of text (Manyika et al., 2011). Now consider
that 100 hours of video are uploaded to YouTube every minute (YouTube Statistics,
n.d.).

Big data technologies turn this challenge into opportunity. Obviating the need for
cost-intensive and risk-prone manual processing, big data technologies can be
leveraged to automatically sift through and draw intelligence from thousands of
hours of video. As a result, big data technology is the third factor that has
contributed to the development of video analytics.

The primary application of video analytics in recent years has been in automated
security and surveillance systems. In addition to their high cost, labor-based
surveillance systems tend to be less effective than automatic systems (e.g., Hakeem
et al., 2012 report that security personnel cannot remain focused on surveillance
tasks for more than 20 minutes). Video analytics can efficiently and effectively
perform surveillance functions such as detecting breaches of restricted zones,
identifying objects removed or left unattended, detecting loitering in a specific area,
recognizing suspicious activities, and detecting camera tampering, to name a few.
Upon detection of a threat, the surveillance system may notify security personnel in
real time or trigger an automatic action (e.g., sound alarm, lock doors, or turn on
lights).

The data generated by CCTV cameras in retail outlets can be extracted for business
intelligence. Marketing and operations management are the primary application
areas. For instance, smart algorithms can collect demographic information about
customers, such as age, gender, and ethnicity. Similarly, retailers can count the
number of customers, measure the time they stay in the store, detect their
movement patterns, measure their dwell time in different areas, and monitor queues
in real time. Valuable insights can be obtained by correlating this information with
customer demographics to drive decisions for product placement, price, assortment
optimization, promotion design, cross-selling, layout optimization, and staffing.

Another potential application of video analytics in retail lies in the study of buying
behavior of groups. Among family members who shop together, only one interacts
with the store at the cash register, causing the traditional systems to miss data on
buying patterns of other members. Video analytics can help retailers address this
missed opportunity by providing information about the size of the group, the group's
demographics, and the individual members’ buying behavior.

Automatic video indexing and retrieval constitutes another domain of video analytics
applications. The widespread emergence of online and offline videos has highlighted
the need to index multimedia content for easy search and retrieval. The indexing of
a video can be performed based on different levels of information available in a
video including the metadata, the soundtrack, the transcripts, and the visual content
of the video. In the metadata-based approach, relational database management
systems (RDBMS) are used for video search and retrieval. Audio analytics and text
analytics techniques can be applied to index a video based on the associated
soundtracks and transcripts, respectively. A comprehensive review of approaches
and techniques for video indexing is presented in Hu, Xie, Li, Zeng, and Maybank
(2011).

In terms of the system architecture, there exist two approaches to video analytics,
namely server-based and edge-based:

Server-based architecture. In this configuration, the video captured through
each camera is routed back to a centralized and dedicated server that
performs the video analytics. Due to bandwidth limits, the video generated by
the source is usually compressed by reducing the frame rates and/or the
image resolution. The resulting loss of information can affect the accuracy of
the analysis. However, the server-based approach provides economies of
scale and facilitates easier maintenance.



Edge-based architecture. In this approach, analytics are applied at the ‘edge’
of the system. That is, the video analytics is performed locally and on the raw
data captured by the camera. As a result, the entire content of the video
stream is available for the analysis, enabling a more effective content
analysis. Edge-based systems, however, are more costly to maintain and have
lower processing power than server-based systems (see the rough bandwidth
sketch below).
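
The bandwidth trade-off behind these two architectures can be roughed out in a few lines; the resolutions, frame rates, and bit depths below are illustrative assumptions, not figures from the text:

    def stream_mbps(width, height, fps, bits_per_pixel):
        """Approximate stream bandwidth in megabits per second."""
        return width * height * fps * bits_per_pixel / 1e6

    # Edge-based analytics sees the raw stream; a server typically receives a
    # stream reduced in resolution and frame rate and heavily compressed.
    raw_at_edge = stream_mbps(1920, 1080, 30, bits_per_pixel=12)     # raw 4:2:0 video
    sent_to_server = stream_mbps(1280, 720, 15, bits_per_pixel=0.1)  # compressed feed
    print(f"raw at edge: {raw_at_edge:.0f} Mbps; sent to server: {sent_to_server:.1f} Mbps")

The several-hundredfold gap between the two figures illustrates why server-based systems compress aggressively, and why that compression can cost analytical accuracy.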

3.4. Social media analytics

Social media analytics refer to the analysis of structured and unstructured data from
social media channels. Social media is a broad term encompassing a variety of online
platforms that allow users to create and exchange content. Social media can be
categorized into the following types: Social networks (e.g., Facebook and LinkedIn),
blogs (e.g., Blogger and WordPress), microblogs (e.g., Twitter and Tumblr), social
news (e.g., Digg and Reddit), social bookmarking (e.g., Delicious and StumbleUpon),
media sharing (e.g., Instagram and YouTube), wikis (e.g., Wikipedia and Wikihow),
question-and-answer sites (e.g., Yahoo! Answers and Ask.com) and review sites
(e.g., Yelp, TripAdvisor) (Barbier and Liu, 2011, Gundecha and Liu, 2012). Also,
many mobile apps, such as Find My Friend, provide a platform for social interactions
and, hence, serve as social media channels.

Although research on social networks dates back to the early 1920s, social media
analytics is a nascent field that emerged after the advent of Web 2.0 in the early
2000s. The key characteristic of modern social media analytics is
its data-centric nature. The research on social media analytics spans across several
disciplines, including psychology, sociology, anthropology, computer science,
mathematics, physics, and economics. Marketing has been the primary application of
social media analytics in recent years. This can be attributed to the widespread and
growing adoption of social media by consumers worldwide (He, Zha, & Li, 2013), to
the extent that Forrester Research, Inc., projects social media to be the second-
fastest growing marketing channel in the US between 2011 and 2016 (VanBoskirk,
Overby, & Takvorian, 2011).

User-generated content (e.g., sentiments, images, videos, and bookmarks) and the
relationships and interactions between the network entities (e.g., people,
organizations, and products) are the two sources of information in social media.
Based on this categorization, social media analytics can be classified into two
groups:

Content-based analytics. Content-based analytics focuses on the data posted
by users on social media platforms, such as customer feedback, product
reviews, images, and videos. Such content on social media is often
voluminous, unstructured, noisy, and dynamic. Text, audio, and video
analytics, as discussed earlier, can be applied to derive insight from such
data. Also, big data technologies can be adopted to address the data
processing challenges.

Structure-based analytics. Also referred to as social network analytics, this
type of analytics is concerned with synthesizing the structural attributes of a
social network and extracting intelligence from the relationships among the
participating entities. The structure of a social network is modeled through a
set of nodes and edges, representing participants and relationships,
respectively. The model can be visualized as a graph composed of the nodes
and the edges. We review two types of network graphs, namely social
graphs and activity graphs (Heidemann, Klier, & Probst, 2012). In social
graphs, an edge between a pair of nodes only signifies the existence of a link
(e.g., friendship) between the corresponding entities. Such graphs can be
mined to identify communities or determine hubs (i.e., the users with a
relatively large number of direct and indirect social links). In activity
graphs, however, the edges represent actual interactions between pairs
of nodes. The interactions involve exchanges of information (e.g., likes and
comments). Activity graphs are preferable to social graphs because an active
relationship is more relevant to analysis than a mere connection.

Various techniques have recently emerged to extract information from the structure
of social networks. We briefly discuss these below.

Community detection, also referred to as community discovery, extracts implicit
communities within a network. For online social networks, a community refers to a
sub-network of users who interact more extensively with each other than with the
rest of the network. Often containing millions of nodes and edges, online social
networks tend to be colossal in size. Community detection helps to summarize huge
networks, which then facilitates uncovering existing behavioral patterns and
predicting emergent properties of the network. In this regard, community detection
is similar to clustering (Aggarwal, 2011), a data mining technique used to partition a
data set into disjoint subsets based on the similarity of data points. Community
detection has found several application areas, including marketing and the World
Wide Web (Parthasarathy, Ruan, & Satuluri, 2011). For example, community
detection enables firms to develop more effective product recommendation systems.
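
As a minimal sketch, community detection can be run with the open-source NetworkX library; the toy network below (two tightly knit triangles joined by one bridge edge) is invented, and greedy modularity maximization is just one of many detection algorithms:

    import networkx as nx
    from networkx.algorithms import community

    G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"),    # first friend group
                  ("d", "e"), ("d", "f"), ("e", "f"),    # second friend group
                  ("c", "d")])                           # bridge between groups

    # Greedy modularity maximization recovers the two communities.
    for group in community.greedy_modularity_communities(G):
        print(sorted(group))    # ['a', 'b', 'c'] and ['d', 'e', 'f']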

Social influence analysis refers to techniques that are concerned with modeling and
evaluating the influence of actors and connections in a social network. Naturally, the
behavior of an actor in a social network is affected by others. Thus, it is desirable to
evaluate the participants’ influence, quantify the strength of connections, and
uncover the patterns of influence diffusion in a network. Social influence analysis
techniques can be leveraged in viral marketing to efficiently enhance brand
awareness and adoption.

A salient aspect of social influence analysis is to quantify the importance of the
network nodes. Various measures have been developed for this purpose, including
degree centrality, betweenness centrality, closeness centrality, and eigenvector
centrality (for more details refer to Tang & Liu, 2010). Other measures evaluate the
strength of connections represented by edges or model the spread of influence in
social networks. The Linear Threshold Model (LTM) and the Independent Cascade
Model (ICM) are two well-known examples of such frameworks (Sun & Tang, 2011).
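
The centrality measures named above are all available in NetworkX; on the same toy graph used earlier, betweenness centrality singles out the two nodes that broker all traffic between the groups:

    import networkx as nx

    G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"),
                  ("d", "e"), ("d", "f"), ("e", "f"),
                  ("c", "d")])

    print(nx.degree_centrality(G))       # share of direct connections
    print(nx.betweenness_centrality(G))  # brokerage: highest for "c" and "d"
    print(nx.eigenvector_centrality(G))  # ties to other well-connected nodes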

Link prediction specifically addresses the problem of predicting future linkages
between the existing nodes in the underlying network. Typically, the structure of
social networks is not static and continuously grows through the creation of new
nodes and edges. Therefore, a natural goal is to understand and predict the
dynamics of the network. Link prediction techniques predict the occurrence of
interaction, collaboration, or influence among entities of a network in a specific time
interval. Link prediction techniques outperform pure chance by factors of 40–50,
suggesting that the current structure of the network indeed contains latent
information about future links (Liben-Nowell & Kleinberg, 2003).
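
A minimal link prediction sketch, again with NetworkX: the Adamic–Adar index scores currently unlinked pairs, and shared neighbors raise a pair's score (the graph is invented):

    import networkx as nx

    G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])

    # Score every pair of nodes with no current edge; here the unlinked
    # pairs a-d and b-d each share the neighbor "c".
    for u, v, score in nx.adamic_adar_index(G):
        print(f"{u}-{v}: {score:.2f}")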

In biology, link prediction techniques are used to discover links or associations in
biological networks (e.g., protein–protein interaction networks), eliminating the need
for expensive experiments (Hasan & Zaki, 2011). In security, link prediction helps to
uncover potential collaborations in terrorist or criminal networks. In the context of
online social media, the primary application of link prediction is in the development
of recommendation systems, such as Facebook's “People You May Know”, YouTube's
“Recommended for You”, and Netflix's and Amazon's recommender engines.

3.5. Predictive analytics

Predictive analytics comprise a variety of techniques that predict future outcomes
based on historical and current data. In practice, predictive analytics can be applied
to almost all disciplines – from predicting the failure of jet engines based on the
stream of data from several thousand sensors, to predicting customers’ next moves
based on what they buy, when they buy, and even what they say on social media.

At its core, predictive analytics seek to uncover patterns and capture relationships in
data. Predictive analytics techniques are subdivided into two groups. Some
techniques, such as moving averages, attempt to discover the historical patterns in
the outcome variable(s) and extrapolate them to the future. Others, such as linear
regression, aim to capture the interdependencies between outcome variable(s) and
explanatory variables, and exploit them to make predictions. Based on the
underlying methodology, techniques can also be categorized into two groups:
regression techniques (e.g., multinomial logit models) and machine learning
techniques (e.g., neural networks). Another classification is based on the type of
outcome variables: techniques such as linear regression address continuous outcome
variables (e.g., sale price of houses), while others such as Random Forests are
applied to discrete outcome variables (e.g., credit status).
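
The two outcome types can be contrasted in a short scikit-learn sketch; the data below are synthetic stand-ins for historical business records:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))              # three explanatory variables

    # Continuous outcome (e.g., sale price of houses): linear regression.
    y_price = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
    print(LinearRegression().fit(X, y_price).coef_)    # approx. [2.0, -1.0, 0.0]

    # Discrete outcome (e.g., credit status): a machine-learning classifier.
    y_credit = (X[:, 0] + X[:, 2] > 0).astype(int)
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    print(forest.fit(X, y_credit).score(X, y_credit))  # in-sample accuracy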

Predictive analytics techniques are primarily based on statistical methods. Several
factors call for developing new statistical methods for big data. First, conventional
statistical methods are rooted in statistical significance: a small sample is obtained
from the population and the result is compared with chance to examine the
significance of a particular relationship. The conclusion is then generalized to the
entire population. In contrast, big data samples are massive and represent the
majority of, if not the entire, population. As a result, the notion of statistical
significance is not that relevant to big data. Secondly, in terms of computational
efficiency, many conventional methods for small samples do not scale up to big data.
The third factor corresponds to the distinctive features inherent in big data:
heterogeneity, noise accumulation, spurious correlations, and incidental endogeneity
(Fan, Han, & Liu, 2014). We describe these below.

Heterogeneity. Big data are often obtained from different sources and
represent information from different sub-populations. As a result, big data are
highly heterogeneous. The sub-population data in small samples are deemed
outliers because of their insufficient frequency. However, the sheer size of big
data sets creates the unique opportunity to model the heterogeneity arising
from sub-population data, which would require sophisticated statistical
techniques.



Noise accumulation. Estimating predictive models for big data often involves
the simultaneous estimation of several parameters. The accumulated
estimation error (or noise) for different parameters could dominate the
magnitudes of variables that have true effects within the model. In other
words, some variables with significant explanatory power might be overlooked
as a result of noise accumulation.

Spurious correlation. For big data, spurious correlation refers to uncorrelated
variables being falsely found to be correlated due to the massive size of the
dataset. Fan and Lv (2008) show this phenomenon through a simulation
example, where the correlation coefficient between independent random
variables is shown to increase with the size of the dataset. As a result, some
variables that are scientifically unrelated (due to their independence) may
erroneously be found to be correlated as a result of high dimensionality (a short
simulation follows this list of features).

Incidental endogeneity. A common assumption in regression analysis is the
exogeneity assumption: the explanatory variables, or predictors, are
independent of the residual term. The validity of most statistical methods
used in regression analysis depends on this assumption. In other words, the
existence of incidental endogeneity (i.e., the dependence of the residual term
on some of the predictors) undermines the validity of the statistical methods
used for regression analysis. Although the exogeneity assumption is usually
met in small samples, incidental endogeneity is commonly present in big data.
It is worthwhile to mention that, in contrast to spurious correlation, incidental
endogeneity refers to a genuine relationship between variables and the error
term.
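
The spurious correlation phenomenon is easy to reproduce. The following NumPy simulation, in the spirit of Fan and Lv's (2008) example but with our own parameter choices, holds the number of observations fixed and grows the number of independent predictors:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 50                       # fixed number of observations
    y = rng.normal(size=n)       # target, independent of every predictor

    for p in (10, 100, 1000, 10000):
        X = rng.normal(size=(n, p))   # p mutually independent predictors
        max_corr = max(abs(np.corrcoef(y, X[:, j])[0, 1]) for j in range(p))
        print(f"p = {p:>5}: max |sample correlation| = {max_corr:.2f}")
    # The maximum spurious correlation keeps growing with dimensionality even
    # though every predictor is, by construction, unrelated to the target.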

The irrelevance of statistical significance, the challenges of computational efficiency,
and the unique characteristics of big data discussed above highlight the need to
develop new statistical techniques to gain insights from predictive models.

4. Concluding remarks

The objective of this paper is to describe, review, and reflect on big data. The paper
first defined what is meant by big data to consolidate the divergent discourse on big
data. We presented various definitions of big data, highlighting the fact that size is
only one dimension of big data. Other dimensions, such as velocity and variety, are
equally important. The paper's primary focus has been on analytics to gain valid and
valuable insights from big data. We highlight the point that predictive analytics,
which deals mostly with structured data, overshadows other forms of analytics
applied to unstructured data, which constitutes 95% of big data. We reviewed
analytics techniques for text, audio, video, and social media data, as well as
predictive analytics. The paper makes the case for new statistical techniques for big
data to address the peculiarities that differentiate big data from smaller data sets.
Most statistical methods in practice have been devised for smaller data sets
comprising samples.

Technological advances in storage and computation have enabled cost-effective
capture of the informational value of big data in a timely manner. Consequently, one
observes a proliferation in real-world adoption of analytics that were not
economically feasible for large-scale applications prior to the big data era. For
example, sentiment analysis (opinion mining) has been known since the early
2000s (Pang & Lee, 2008). However, big data technologies enabled businesses to
adopt sentiment analysis to glean useful insights from millions of opinions shared on
social media. The processing of unstructured text, fueled by the massive influx of
social media data, is currently generating business value through conventional
(pre-big data) sentiment analysis techniques, which may not be ideally suited to
leverage big data.

Although major innovations in analytical techniques for big data have not yet taken
place, one anticipates the emergence of such novel analytics in the near future. For
instance, real-time analytics will likely become a prolific field of research because of
the growth in location-aware social media and mobile apps. Since big data are noisy,
highly interrelated, and unreliable, this will likely lead to the development of
statistical techniques more readily suited to mining big data while remaining
sensitive to its unique characteristics. Going beyond samples, additional valuable
insights could be
obtained from the massive volumes of less ‘trustworthy’ data.

Acknowledgment

The authors would like to acknowledge research and editing support provided by Ms.
Ioana Moca.


References

Aggarwal, C. C. (2011). An introduction to social network data analytics. In C. C. Aggarwal (Ed.), Social network data analytics (pp. 1–15). United States: Springer.

Barbier, G., & Liu, H. (2011). Data mining in social media. In C. C. Aggarwal (Ed.), Social network data analytics (pp. 327–352). United States: Springer.

Beaver, D., Kumar, S., Li, H. C., Sobel, J., & Vajgel, P. (2010). Finding a needle in Haystack: Facebook's photo storage. In Proceedings of the ninth USENIX conference on operating systems design and implementation (pp. 1–8). Berkeley, CA, USA: USENIX Association.

Chen, H., Chiang, R. H. L., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–1188.

Chung, W. (2014). BizPro: Extracting and categorizing business intelligence factors from textual news articles. International Journal of Information Management, 34(2), 272–284.

Cukier, K. (2010, February 25). Data, data everywhere: A special report on managing information. The Economist. Retrieved from http://www.economist.com/node/15557443

Diebold, F. X. (2012). A personal perspective on the origin(s) and development of “big data”: The phenomenon, the term, and the discipline (Scholarly Paper No. ID 2202843). Social Science Research Network. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2202843

Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review, 1(2), 293–314.

Fan, J., & Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5), 849–911.

Feldman, R. (2013). Techniques and applications for sentiment analysis. Communications of the ACM, 56(4), 82–89.

Gartner IT Glossary. (n.d.). Retrieved from http://www.gartner.com/it-glossary/big-data/

Gundecha, P., & Liu, H. (2012). Mining social media: A brief introduction. Tutorials in Operations Research, 1(4).

Hahn, U., & Mani, I. (2000). The challenges of automatic summarization. Computer, 33(11), 29–36.

Hakeem, A., Gupta, H., Kanaujia, A., Choe, T. E., Gunda, K., Scanlon, A., et al. (2012). Video analytics for business intelligence. In C. Shan, F. Porikli, T. Xiang, & S. Gong (Eds.), Video analytics for business intelligence (pp. 309–354). Berlin, Heidelberg: Springer.

Hasan, M. A., & Zaki, M. J. (2011). A survey of link prediction in social networks. In C. C. Aggarwal (Ed.), Social network data analytics (pp. 243–275). United States: Springer.

He, W., Zha, S., & Li, L. (2013). Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management, 33(3), 464–472.

Heidemann, J., Klier, M., & Probst, F. (2012). Online social networks: A survey of a global phenomenon. Computer Networks, 56(18), 3866–3878.

Hirschberg, J., Hjalmarsson, A., & Elhadad, N. (2010). “You're as sick as you sound”: Using computational approaches for modeling speaker state to gauge illness and recovery. In A. Neustein (Ed.), Advances in speech recognition (pp. 305–322). United States: Springer.

Hu, W., Xie, N., Li, L., Zeng, X., & Maybank, S. (2011). A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 41(6), 797–819.

Jiang, J. (2012). Information extraction from text. In C. C. Aggarwal & C. Zhai (Eds.), Mining text data (pp. 11–41). United States: Springer.

Kwon, O., Lee, N., & Shin, B. (2014). Data quality management, data usage experience and acquisition intention of big data analytics. International Journal of Information Management, 34(3), 387–394.

Labrinidis, A., & Jagadish, H. V. (2012). Challenges and opportunities with big data. Proceedings of the VLDB Endowment, 5(12), 2032–2033.

Laney, D. (2001, February 6). 3-D data management: Controlling data volume, velocity and variety. Application Delivery Strategies by META Group Inc., p. 949. Retrieved from http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

Liben-Nowell, D., & Kleinberg, J. (2003). The link prediction problem for social networks. In Proceedings of the twelfth international conference on information and knowledge management (pp. 556–559). New York, NY, USA: ACM.

Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1), 1–167.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., et al. (2011). Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute. Retrieved from http://www.citeulike.org/group/18242/article/9341321

Pang, B., & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1–2), 1–135.

Panigrahi, B. K., Abraham, A., & Das, S. (2010). Computational intelligence in power engineering. Springer.

Parthasarathy, S., Ruan, Y., & Satuluri, V. (2011). Community discovery in social networks: Applications, methods and emerging trends. In C. C. Aggarwal (Ed.), Social network data analytics (pp. 79–113). United States: Springer.

Patil, H. A. (2010). “Cry baby”: Using spectrographic analysis to assess neonatal health status from an infant's cry. In A. Neustein (Ed.), Advances in speech recognition (pp. 323–348). United States: Springer.

Schroeck, M., Shockley, R., Smart, J., Romero-Morales, D., & Tufano, P. (2012). Analytics: The real-world use of big data. How innovative enterprises extract value from uncertain data. IBM Institute for Business Value. Retrieved from http://www-03.ibm.com/systems/hu/resources/the_real_word_use_of_big_data.pdf

Small and midsize companies look to make big gains with “big data,” according to recent poll conducted on behalf of SAP. (2012, June 26). Retrieved from http://global.sap.com/corporate-en/news.epx?PressID=19188

Sun, J., & Tang, J. (2011). A survey of models and algorithms for social influence analysis. In C. C. Aggarwal (Ed.), Social network data analytics (pp. 177–214). United States: Springer.

Tang, L., & Liu, H. (2010). Community detection and mining in social media. Synthesis Lectures on Data Mining and Knowledge Discovery, 2(1), 1–137.

TechAmerica Foundation's Federal Big Data Commission. (2012). Demystifying big data: A practical guide to transforming the business of government. Retrieved from http://www.techamerica.org/Docs/fileManager.cfm?f=techamerica-bigdatareport-final.pdf

VanBoskirk, S., Overby, C. S., & Takvorian, S. (2011). US interactive marketing forecast 2011 to 2016. Forrester Research, Inc. Retrieved from https://www.forrester.com/US+Interactive+Marketing+Forecast+2011+To+2016/fulltext/-/E-RES59379

YouTube Statistics. (n.d.). Retrieved from http://www.youtube.com/yt/press/statistics.html

Amir Gandomi is an assistant professor at the Ted Rogers School of Information
Technology Management, Ryerson University. His research lies at the intersection of
marketing, operations research and IT. He is specifically focused on big data
analytics as it relates to marketing. His research has appeared in journals such as
OMEGA - The International Journal of Management Science, The International
Journal of Information Management, and Computers & Industrial Engineering.

Murtaza Haider is an associate professor at the Ted Rogers School of
Management, Ryerson University, in Toronto. Murtaza is also the Director of the
consulting firm Regionomics Inc. He specializes in applying statistical methods to
forecast demand and/or sales. His research interests include human development in
Canada and South Asia, forecasting housing market dynamics, transport and
infrastructure planning and development. Murtaza Haider is working on a
book, Getting Started with Data Science: Making Sense of Data with
Analytics (ISBN 9780133991024), which will be published by Pearson/IBM Press in
Spring 2015. He is an avid blogger and blogs weekly about socio-economics in South
Asia for the Dawn newspaper and for the Huffington Post.

Copyright © 2014 The Authors. Published by Elsevier Ltd.