

INTRODUCTION TO EMERGING TECHNOLOGIES


(EMTE1012)

CHAPTER – 2
INTRODUCTION TO DATA SCIENCE

❖ Overview of Data Science
❖ Definition of data and information
❖ Definition of knowledge and wisdom
❖ Data Types and Representation
❖ Data Value Chain
❖ Basic Concepts of Big Data
What is Data Science?

▪ Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data.
▪ Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and
analyze actual phenomena" with data.
▪ It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science.

▪ Data science is much more than simply analyzing data.
▪ There are many people who enjoy analyzing data and who could
happily spend all day looking at histograms and averages, but for
those who prefer other activities, data science offers a range of roles
and requires a range of skills.

▪ Data is raw. It simply exists and has no significance beyond its
existence (in and of itself).
▪ It does not have meaning of itself.
▪ Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, suitable for communication,
interpretation, or processing by humans or electronic machines.
▪ It can exist in any form, usable or not.
▪ It can be described as unprocessed facts and figures.
▪ It is represented with the help of characters such as letters (A-Z, a-z),
digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
▪ Information is data that has been given meaning by way of relational
connection.
▪ In computer parlance, a relational database makes information from
the data stored within it.
▪ Information is the processed data on which decisions and actions are
based.
▪ It is data that has been processed into a form that is meaningful to
the recipient and is of real or perceived value in the recipient's current
or prospective actions or decisions.
▪ Furthermore, information is interpreted data: created from
organized, structured, and processed data in a particular context.
▪ Data processing is the restructuring or reordering of data by people
or machines to increase its usefulness and add value for a
particular purpose.
▪ Data processing consists of three steps that constitute the data
processing cycle: input, processing, and output.

Almost all programming languages explicitly include the notion of data
type, though different languages may use different terminology.
Common data types include:
• Integer (int): used to store whole numbers, mathematically
known as integers
• Boolean (bool): used to store a value restricted to one of two
values: true or false
• Character (char): used to store a single character
• Floating-point number (float): used to store real numbers
• Alphanumeric string (string): used to store a combination of
characters and numbers
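▪ A minimal sketch in Python illustrating these common data types (note that
Python has no separate char type, so a single character is simply a one-character
string; the variable names and values below are purely illustrative):

```python
# Illustrative values only; Python infers each data type from the literal.
age = 27                    # integer (int): a whole number
is_registered = True        # Boolean (bool): restricted to True or False
grade = "A"                 # character (char): a single character (a 1-character string in Python)
temperature = 36.6          # floating-point number (float): a real number
student_id = "ETS0142-12"   # alphanumeric string (str): letters and digits combined

for value in (age, is_registered, grade, temperature, student_id):
    print(type(value).__name__, value)
```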
▪ Data: It is Raining.
▪ Information: The temperature dropped 15 degrees and then it started
raining.
▪ Knowledge: If the humidity is very high and the temperature drops
substantially, the atmosphere is often unable to hold the moisture,
so it rains.
▪ Wisdom: It rains because it rains. And this encompasses an
understanding of all the interactions that happen between raining,
evaporation, air currents, temperature gradients, changes, and
raining.

▪ Data can take many material forms including numbers, text, symbols,
images, sound, electromagnetic waves, or even a blankness or silence
(an empty space is itself data).
▪ These are typically divided into two broad categories.
• Qualitative and
• Quantitative

▪ Structured data are those that can be easily organized, stored and
transferred in a defined data model, such as numbers/text set out in a
table or relational database that have a consistent format (e.g., name,
date of birth, address, gender, etc).
▪ Such data can be processed, searched, queried, combined, and
analyzed relatively straightforwardly using calculus and algorithms,
and can be visualized using various forms of graphs and maps, and
easily processed by computers.
▪ Often structured data is managed using Structured Query Language
(SQL).
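▪ A minimal sketch of structured data, using Python's built-in sqlite3 module:
rows with a consistent format stored in a relational table and queried with SQL
(the table, column names, and values are illustrative assumptions):

```python
import sqlite3

# A throwaway in-memory relational database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: every row follows the same predefined schema.
cur.execute("CREATE TABLE person (name TEXT, date_of_birth TEXT, address TEXT, gender TEXT)")
cur.execute("INSERT INTO person VALUES ('Abebe', '1998-05-12', 'Addis Ababa', 'M')")
cur.execute("INSERT INTO person VALUES ('Sara',  '2001-11-03', 'Bahir Dar',   'F')")

# Because the format is consistent, the data can be queried straightforwardly.
for row in cur.execute("SELECT name, address FROM person WHERE gender = 'F'"):
    print(row)
conn.close()
```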

▪ A much bigger percentage of all the data in our world is unstructured
data.
▪ Unstructured data is data that cannot be contained in a row-column
database and doesn’t have an associated data model.
▪ Think of the text of an email message.
▪ Other examples of unstructured data include photos, video and audio
files, text files, social media content, satellite imagery, presentations,
PDFs, open-ended survey responses, websites and call center
transcripts/recordings.
▪ Instead of spreadsheets or relational databases, unstructured data is
usually stored in data lakes, NoSQL databases, applications and data
warehouses.
▪ Beyond structured and unstructured data, there is a third category,
which is essentially a mix of the two.
▪ Semi-structured data are loosely structured data that have no
predefined data model/schema and thus cannot be held in a relational
database.
▪ The type of data defined as semi-structured data has some defining
or consistent characteristics but doesn’t conform to a structure as
rigid as is expected with a relational database.
▪ Email messages are a good example. While the actual content is
unstructured, it does contain structured data such as name and email
address of sender and recipient, time sent, etc.
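▪ A minimal sketch of semi-structured data: an email represented as JSON has
some consistent, self-describing fields (sender, recipient, time sent) but no rigid
relational schema, and the body itself is unstructured free text (the addresses
and content below are made up):

```python
import json

email = {
    "from": "student@example.com",
    "to": "instructor@example.com",
    "sent": "2024-02-16T13:34:52",
    "subject": "Chapter 2 question",
    # The body is unstructured free text embedded inside the structured fields.
    "body": "Hello, could you clarify the difference between data and information?",
}

# JSON is a common semi-structured format: self-describing, but with no fixed schema.
print(json.dumps(email, indent=2))
```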
▪ The last category of data type is metadata.
▪ From a technical point of view, this is not a separate data structure,
but it is one of the most important elements for Big Data analysis and
big data solutions.
▪ Metadata is data about data.
▪ It provides additional information about a specific set of data.
▪ The data value chain describes the process of data creation and use
from first identifying a need for data to its final use and possible
reuse.
▪ A value chain is made up of a series of subsystems, each with inputs,
transformation processes, and outputs.
▪ In a Data Value Chain, information flow is described as a series of
steps needed to generate value and useful insights from data.

▪ Data Acquisition is the process of gathering, filtering, and cleaning
data before it is put in a data warehouse or any other storage solution
on which data analysis can be carried out.
▪ Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
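▪ A minimal sketch of data acquisition (gathering, filtering, and cleaning data
before it is placed in storage), using pandas; the file name and column names
are illustrative assumptions:

```python
import pandas as pd

raw = pd.read_csv("sales_raw.csv")                # gather: read data from a source system
filtered = raw[raw["amount"] > 0]                 # filter: drop obviously invalid records
clean = filtered.dropna(subset=["customer_id"])   # clean: remove rows missing key fields

# Load the cleaned data into a storage layer on which analysis can be carried out.
clean.to_parquet("sales_clean.parquet")
```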

▪ Data Analysis is concerned with making the raw data acquired
amenable to use in decision-making as well as domain-specific usage.
▪ Data analysis is the process of evaluating data using analytical and
statistical tools to discover useful information and aid in business
decision making.
▪ Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and extracting
useful hidden information with high potential from a business point of
view.
▪ Related areas include data mining, business intelligence, and machine
learning.
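▪ A minimal sketch of data analysis with pandas: exploring and summarizing the
cleaned data to surface information useful for decision-making (the file and the
"region" and "amount" columns are assumptions carried over from the sketch above):

```python
import pandas as pd

sales = pd.read_parquet("sales_clean.parquet")

print(sales.describe())                               # explore: basic descriptive statistics

by_region = sales.groupby("region")["amount"].sum()   # transform/model: aggregate by region
print(by_region.sort_values(ascending=False).head())  # highlight: the strongest regions first
```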
▪ Data Curation is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for its
effective usage.
▪ Data curation is the organization and integration of data collected
from various sources.
▪ Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
▪ It involves annotation, publication and presentation of the data such
that the value of the data is maintained over time, and the data
remains available for reuse and preservation.
▪ Data Storage is the persistence and management of data in a scalable
way that satisfies the needs of applications that require fast access to
the data.
▪ Relational Database Management Systems (RDBMS) have been the
main, and almost unique, solution to the storage paradigm for nearly
40 years.
▪ NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.

▪ Data Usage covers the data-driven business activities that need access
to data, its analysis, and the tools needed to integrate the data
analysis within the business activity.
▪ The process of decision-making includes reporting, exploration of
data (browsing and lookup), and exploratory search (finding
correlations, comparisons, what-if scenarios, etc.).

▪ Data has not only become the lifeblood of any organization, but is also
growing exponentially.
▪ Data generated today is several magnitudes larger than what was
generated just a few years ago.
▪ Big Data is not simply a large amount of data
▪ Leading IT industry research group Gartner defines Big Data as:
“Big Data are high-volume, high-velocity, and/or
high-variety information assets that require new forms
of processing to enable enhanced decision making, insight
discovery and process optimization.”

▪ The Big Data definition is based on the three Vs (with veracity often
added as a fourth):
✓ Volume: the size of the data (how big it is)
✓ Velocity: how fast the data is being generated
✓ Variety: the variation of data types, including source, format, and
structure (data can be unstructured, semi-structured, or multi-structured)
✓ Veracity: can we trust the data? How accurate is it?

Importance of Big Data
▪ New generation data is changing in both quantity (volume) and format
(variety).
▪ Explosive growth (velocity) is the most obvious example of data
change.
• IBM estimates 2.5 quintillion bytes of data are generated each day.
• Ninety percent of the data in the world is less than two years old.

▪ The data explosion is due to new technologies that generate and
collect vast amounts of data.
▪ These sources include
• Scientific sensors such as global mapping, meteorological tracking,
medical imaging, and DNA research
• Point of Sale (POS) tracking and inventory control systems
• Social media such as Facebook posts and Twitter Tweets
• Internet and intranet websites across the world

Hadoop
▪ Open-source software from Apache
Software Foundation to store and
process large non-relational data sets via
a large, scalable distributed model.
▪ It is a scalable fault-tolerant system for
processing large datasets across a cluster
of commodity servers.
▪ The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using simple programming models.
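▪ A minimal sketch (plain Python, run locally) of the MapReduce programming
model that Hadoop implements: a word count split into a map phase, a
shuffle/sort phase, and a reduce phase. On a real cluster, Hadoop would run the
map and reduce steps in parallel across many machines; the sample documents
below are made up:

```python
from collections import defaultdict

documents = ["big data needs big clusters", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group all emitted values by key (Hadoop does this between phases).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 2, 'needs': 1, ...}
```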
▪ Because of the qualities of big data, individual
computers are often inadequate for handling
the data at most stages.
▪ To better address the high storage and
computational needs of big data, computer
clusters are a better fit.

▪ Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.

▪ The assembled computing cluster often acts as a foundation that
other software interfaces with to process the data.

▪ The machines involved in the computing cluster are also typically
involved with the management of a distributed storage system,
which we will talk about when we discuss data persistence.
▪ Cluster membership and resource allocation can be handled by
software like Hadoop's YARN (which stands for Yet Another
Resource Negotiator).

▪ Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
▪ Resource Pooling: Combining the available storage space to hold
data is a clear benefit, but CPU and memory pooling is also
extremely important.
▪ High Availability: Clusters can provide varying levels of fault
tolerance and availability.
▪ Easy Scalability: The system can react to changes in resource
requirements without expanding the physical resources on a
machine.

▪ Hadoop is an open-source framework intended
to make interaction with big data easier.
▪ It is a framework that allows for the distributed
processing of large datasets across clusters of
computers using simple programming models.

▪ It is inspired by a technical document published by Google.
▪ The four key characteristics of Hadoop are:

▪ Economical: Its systems are highly economical, as ordinary computers
can be used for data processing.
▪ Reliable: It is reliable, as it stores copies of the data on different
machines and is resistant to hardware failure.
▪ Scalable: It is easily scalable, both horizontally and vertically. A few
extra nodes help in scaling up the framework.
▪ Flexible: It is flexible: you can store as much structured and
unstructured data as you need and decide how to use it later.

▪ Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.
▪ It is continuously growing to meet the needs of Big Data.
▪ It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
▪ Ingesting data into the system (The first stage of Big Data)
▪ The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas Flume
transfers event data.

▪ Processing the data in storage (The second stage of Big Data)
▪ In this stage, the data is stored and processed.
▪ The data is stored in the distributed file system, HDFS, and in the
NoSQL distributed database, HBase. Spark and MapReduce perform
the data processing.
▪ Computing and analyzing data (The third stage is to Analyze)
▪ Here, the data is analyzed by processing frameworks such as Pig,
Hive, and Impala. Pig converts the data using map and reduce and
then analyzes it. Hive is also based on map and reduce
programming and is most suitable for structured data.
▪ Visualizing the results (The fourth stage is Access)
▪ This is performed by tools such as Hue and Cloudera Search. In
this stage, the analyzed data can be accessed by users.

THANK YOU
