Chapter 2
Data Science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms,
and systems to extract knowledge and insights from structured, semi-structured, and
unstructured data. Data science is much more than simply analysing data: it offers a range of
roles and requires a range of skills.
Let’s consider this idea by thinking about some of the data involved in buying a box of cereal
from the store or supermarket:
• Whatever your cereal preferences (teff, wheat, or barley), you prepare for the purchase by
writing “cereal” in your notebook. This planned purchase is a piece of data, even though it is
written in pencil and only you can read it.
• When you get to the store, you use your data as a reminder to grab the item and put it in
your cart. At the checkout line, the cashier scans the barcode on your container, and the
cash register logs the price. Back in the warehouse, a computer tells the stock manager that
it is time to request another order from the distributor because your purchase was one of
the last boxes in the store.
• You also have a coupon for your big box, and the cashier scans that, giving you a
predetermined discount. At the end of the week, a report of all the scanned manufacturer
coupons gets uploaded to the cereal company so they can issue a reimbursement to the
grocery store for all of the coupon discounts they have handed out to customers. Finally,
at the end of the month, a store manager looks at a colourful collection of pie charts showing
all the different kinds of cereal that were sold and, on the basis of the strongest-selling cereals,
decides to give those varieties more of the store’s limited shelf space next month.
So, the small piece of information that began as a scribble in your notebook ended up in
many different places, most notably on the desk of a manager as an aid to decision making.
Activity 4:
➢ List and discuss the characteristics of big data
➢ Describe the big data life cycle. Which step do you think is the most useful, and why?
➢ List and describe each technology or tool used in the big data life cycle.
➢ Discuss the three methods of computing over a large dataset.
Big Data processing can be described as a life cycle with four stages: Ingest, Processing,
Analyse, and Access.
The first stage is Ingest. The data is ingested, or transferred, into Hadoop from various
sources such as relational databases, external systems, or local files. Sqoop transfers data
from relational database management systems (RDBMS) to HDFS, whereas Flume transfers
event data.
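Sqoop and Flume are command-line tools rather than programming libraries, so the following
is only a rough Python analogue of what a Sqoop import achieves: a minimal PySpark sketch
that pulls one table from a relational database into HDFS over JDBC. The connection URL,
table name, credentials, and HDFS path are all hypothetical placeholders.

    from pyspark.sql import SparkSession

    # Minimal ingest sketch: copy one RDBMS table into HDFS as Parquet.
    spark = SparkSession.builder.appName("ingest-sketch").getOrCreate()

    # Hypothetical connection details -- substitute your own database.
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://db-host:3306/shop")  # assumed MySQL server
              .option("dbtable", "orders")                      # assumed source table
              .option("user", "etl_user")
              .option("password", "etl_password")
              .load())

    # Land the table in the distributed file system for the next stage.
    orders.write.mode("overwrite").parquet("hdfs:///warehouse/orders")

In practice the matching JDBC driver has to be on Spark's classpath; Sqoop performs the same
transfer with a single sqoop import command and parallelises it across map tasks.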
The second stage is Processing. In this stage, the data is stored and processed. The data is
stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
Spark and MapReduce perform the data processing.
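To make the map-and-reduce idea concrete, here is a minimal word-count sketch in PySpark,
the canonical MapReduce example: the map step splits lines into words, and the reduce step
sums the occurrences of each word. The HDFS input and output paths are assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("processing-sketch").getOrCreate()

    # Read raw text that the Ingest stage landed in HDFS (path is hypothetical).
    lines = spark.sparkContext.textFile("hdfs:///data/raw/logs.txt")

    counts = (lines
              .flatMap(lambda line: line.split())  # map: split each line into words
              .map(lambda word: (word, 1))         # map: pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b))    # reduce: sum the counts per word

    counts.saveAsTextFile("hdfs:///data/processed/word_counts")

Because the dataset is spread across HDFS blocks, each map task works on its own block and
only the per-word partial sums travel across the network during the reduce step.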
The third stage is Analyse. Here, the data is analysed by processing frameworks such as
Pig, Hive, and Impala. Pig converts the data using map and reduce steps and then analyses it.
Hive is also based on MapReduce programming and is most suitable for structured
data.
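Hive lets analysts write ordinary SQL that is compiled down to map and reduce jobs. As a
rough sketch of the same style of analysis, the query below can be run from PySpark with
Hive support enabled; the sales table and its product and quantity columns are hypothetical.

    from pyspark.sql import SparkSession

    # Hive-backed session: tables registered in the Hive metastore become queryable.
    spark = (SparkSession.builder
             .appName("analyse-sketch")
             .enableHiveSupport()
             .getOrCreate())

    # A typical analytical query over a hypothetical `sales` table.
    top_products = spark.sql("""
        SELECT product, SUM(quantity) AS total_sold
        FROM sales
        GROUP BY product
        ORDER BY total_sold DESC
    """)

    top_products.show(10)  # inspect the ten best-selling products

This is the kind of aggregation behind the store manager's pie charts in the cereal example:
grouping sales by product and ranking the totals.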
The fourth stage is Access, which is performed by tools such as Hue and Cloudera Search. In
this stage, users can access the analysed data, for example by browsing it through Hue’s web
interface or searching it with Cloudera Search.