420 CHAPTER 9 • DEVELOPING AND ACQUIRING INFORMATION SYSTEMS

CASE 2 Big Data and the Elephant


It may seem obvious that the amount of data in the world keeps getting larger and larger, but what is really meant by Big Data? The physical size of the storage device? The number of records? How do we effectively store it, and how do we approach doing useful things with it?

Big Data can be defined using three attributes: volume, velocity, and variety (often, the term veracity is added to refer to the often unknown origins of the data). Volume refers to traditional measures of size—how many bytes. A byte is an encoding of information using 8 bits—on or off states, like 1s and 0s. Large numbers of bytes are quantified using the same prefixes as metric units: kilo- means "a thousand," mega- "a million," giga- "a billion," and so on. A single e-mail may be a few kilobytes of data, including the message itself and the associated address information. A word processor document may be anywhere from a few kilobytes to a few megabytes, depending on how long the contents are and whether there are any embedded images. A high-definition movie takes up around 5 gigabytes when encoded and stored on a Blu-ray disc. An example of the high velocity of Big Data is the amount of video submitted to YouTube—every minute, more than 500 additional hours of video are uploaded. The variety aspect of Big Data represents all the different types of digital data we encounter—everything from e-mails, to homework assignments, to fitness tracker data, to photos, status updates, and VR videos. Veracity refers to the quality of the data and the ability to generate value from using accurate data. As the Internet of Things expands and more and more devices are connected all the time, the volume, velocity, and variety of data will only continue to increase, and ensuring veracity will be crucial to deriving value from Big Data.

So, how do we store and do something useful with this much data being generated so fast? Traditional mechanisms for storing and searching large data sets don't scale well. In the early 2000s, Google started utilizing a new approach that allowed cheap hardware to be used to easily store and process large data sets. The key to the approach is to expect failure of any given component of the system and to design the system such that it can easily and rapidly recover when such failures do occur. The Google File System (GFS) spreads data across multiple servers and incorporates redundancy such that if any one server goes down, the others can pick up where it left off and no data are lost. To do something useful with this much data, however, requires even more ingenuity. Google developed an approach to split queries into multiple steps that can be distributed across multiple servers, much as the data files themselves get split up in GFS. First, the input data files are filtered and sorted into chunks (referred to as "mapping"). Then, these chunks are distributed across multiple servers for processing. Each chunk gets processed into a smaller output set (the "reduce" step). The algorithm is designed in such a way that these steps can be performed on each chunk on different servers simultaneously. Just like GFS, if any given server fails, the others can smoothly recover and keep processing. These smaller output sets are then combined back together, and the process is repeated until a solution is reached. These steps lead to the algorithm's name: MapReduce.

MapReduce was a proprietary technology that belonged to Google and was a key part of its competitive advantage in the early 2000s. However, the underlying computer science research was published openly and known outside Google. As a result, many other projects implemented a similar approach. One of these projects, an open source effort, became very popular and widely used. The project was named Hadoop after a stuffed elephant belonging to the primary developer's son. Hadoop implements the core functions of GFS as the Hadoop Distributed File System (HDFS) and of MapReduce as Hadoop MapReduce. Because it is open source technology and can be freely incorporated into anyone's software system, it has been widely deployed. In keeping with the elephant name and logo, the suite of tools that grew up around Hadoop is called Mahout (a term for an elephant handler). Mahout and Hadoop were incorporated into the Apache Software Foundation suite of open source technologies.

Google has moved past MapReduce as its primary Big Data processing model, and Mahout has also moved on to more capable processing models that are enabled by, or are improvements on, Hadoop and MapReduce. Apache Pig is a high-level language for developing and implementing Hadoop programs. Hive is a data warehouse platform built on top of Hadoop and HDFS. New approaches to distributed processing beyond MapReduce being pursued by the Apache Foundation include Spark and Flink—both provide resilient distributed data set functionality and implement modern programming architectures using languages like Java and Scala. These technologies are a key part of the behind-the-scenes infrastructure that makes our modern world work. Any platform, like an app store, a fitness app, or a social network, must deal with the challenges of volume, velocity, and variety. Technologies like Hadoop, Mahout, and their kin are what make it all possible.
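The volume and velocity figures in the case can be made concrete with a little arithmetic. The sketch below uses the metric prefixes the case defines (kilo = 10^3, mega = 10^6, giga = 10^9); the 1-gigabyte-per-hour figure for compressed video is a hypothetical assumption for illustration, not a number from the case.

```python
# Decimal (metric) prefixes, as described in the case:
# kilo = a thousand, mega = a million, giga = a billion.
PREFIXES = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}

def to_bytes(value, prefix):
    """Convert a prefixed size (e.g., 5 GB) to a raw byte count."""
    return value * PREFIXES[prefix]

# Examples from the case:
email = to_bytes(4, "KB")   # a short e-mail: a few kilobytes
movie = to_bytes(5, "GB")   # an HD movie on Blu-ray: ~5 gigabytes

# Velocity: more than 500 hours of video uploaded to YouTube per minute.
# Assuming (hypothetically) ~1 GB per hour of compressed video:
per_minute = 500 * to_bytes(1, "GB")
per_day = per_minute * 60 * 24

print(movie)                      # 5000000000
print(per_day / PREFIXES["TB"])   # 720.0 terabytes per day
```

Even under this conservative per-hour assumption, a single platform ingests hundreds of terabytes of video a day—the scale that motivates distributed systems like GFS and Hadoop.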

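The map, shuffle, and reduce steps the case describes can be sketched on a single machine. This is a minimal illustration of the pattern (the classic word-count example), not Google's or Hadoop's implementation: in a real cluster, the map and reduce calls run in parallel on different servers, and GFS/HDFS-style replication lets the system recover when a server fails.

```python
from collections import defaultdict

def map_phase(document):
    # "Mapping": turn one input chunk into intermediate (key, value) pairs.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group intermediate values by key, as the framework does when
    # routing mapper output to the reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # "Reduce": collapse one key's values into a smaller output set.
    return key, sum(values)

documents = ["big data big ideas", "big elephants"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 3, 'data': 1, 'ideas': 1, 'elephants': 1}
```

Because each document is mapped independently and each key is reduced independently, both phases parallelize naturally—which is exactly why the approach scales across many cheap, failure-prone servers.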
Questions
9-43. How much data do you generate on a daily, weekly, monthly, and annual basis? Think about every digital encounter you have and what gets stored.
9-44. What advantages are there to storing and processing Big Data? How can companies and individuals benefit?
9-45. What skills might help you to pursue opportunities in the area of Big Data?

Based on:
Apache Hadoop. (2020, June 20). In Wikipedia, The Free Encyclopedia. Retrieved July 16, 2020, from https://en.wikipedia.org/w/index.php?title=Apache_Hadoop&oldid=963539845
Big Data. (2020, July 1). In Wikipedia, The Free Encyclopedia. Retrieved July 16, 2020, from https://en.wikipedia.org/w/index.php?title=Big_data&oldid=965530277
MapReduce. (2020, June 29). In Wikipedia, The Free Encyclopedia. Retrieved July 16, 2020, from https://en.wikipedia.org/w/index.php?title=MapReduce&oldid=965086280
Smith, K. (2020, February 21). 57 fascinating and incredible YouTube statistics. Brandwatch. Retrieved July 16, 2020, from https://www.brandwatch.com/blog/youtube-stats
