

INTRODUCTION TO EMERGING TECHNOLOGIES


(EMTE1012)

CHAPTER – 2
INTRODUCTION TO DATA SCIENCE

❖ Overview of Data Science
❖ Definition of data and information
❖ Definition of knowledge and wisdom
❖ Data Types and Representation
❖ Data Value Chain
❖ Basic Concepts of Big Data
What is Data Science?

▪ Data science is a multi-disciplinary field that uses scientific methods,
processes, algorithms and systems to extract knowledge and insights
from structured and unstructured data.
▪ Data science is a "concept to unify statistics, data analysis, machine
learning and their related methods" in order to "understand and
analyze actual phenomena" with data.
▪ It employs techniques and theories drawn from many fields within the
context of mathematics, statistics, computer science, and information
science.

▪ Data science is much more than simply analyzing data.
▪ There are many people who enjoy analyzing data and who could
happily spend all day looking at histograms and averages, but for
those who prefer other activities, data science offers a range of roles
and requires a range of skills.

▪ Data is raw. It simply exists and has no significance beyond its
existence (in and of itself).
▪ It does not have meaning of itself.
▪ Data can be defined as a representation of facts, concepts, or
instructions in a formalized manner, suitable for communication,
interpretation, or processing by humans or electronic machines.
▪ It can exist in any form, usable or not.
▪ It can be described as unprocessed facts and figures.
▪ It is represented with the help of characters such as letters (A-Z, a-z),
digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
▪ Information is data that has been given meaning by way of relational
connection.
▪ In computer parlance, a relational database makes information from
the data stored within it.
▪ Information is the processed data on which decisions and actions are
based.
▪ It is data that has been processed into a form that is meaningful to
the recipient and is of real or perceived value in the recipient's current
or prospective actions or decisions.
▪ Furthermore, information is interpreted data: created from
organized, structured, and processed data in a particular context.
▪ Data processing is the restructuring or reordering of data by people
or machines to increase its usefulness and add value for a
particular purpose.
▪ Data processing consists of three steps that constitute the data
processing cycle: input, processing, and output.

Almost all programming languages explicitly include the notion of data
type, though different languages may use different terminology.
Common data types include:
• Integer (int): used to store whole numbers, mathematically
known as integers
• Boolean (bool): used to store a value restricted to one of two
values: true or false
• Character (char): used to store a single character
• Floating-point number (float): used to store real numbers
• Alphanumeric string (string): used to store a combination of
characters and numbers
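▪ A minimal sketch in Python illustrating these common data types (note that
Python has no separate char type, so a single character is simply a one-character
string; the variable names and values below are purely illustrative):

```python
# Illustrative values only; Python infers each data type from the literal.
age = 27                    # integer (int): a whole number
is_registered = True        # Boolean (bool): restricted to True or False
grade = "A"                 # character (char): a single character (a 1-character string in Python)
temperature = 36.6          # floating-point number (float): a real number
student_id = "ETS0142-12"   # alphanumeric string (str): letters and digits combined

for value in (age, is_registered, grade, temperature, student_id):
    print(type(value).__name__, value)
```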
▪ Data: It is Raining.
▪ Information: The temperature dropped 15 degrees and then it started
raining.
▪ Knowledge: If the humidity is very high and the temperature drops
substantially, the atmosphere is often unable to hold the moisture,
so it rains.
▪ Wisdom: It rains because it rains. And this encompasses an
understanding of all the interactions that happen between raining,
evaporation, air currents, temperature gradients, changes, and
raining.

▪ Data can take many material forms including numbers, text, symbols,
images, sound, electromagnetic waves, or even a blankness or silence
(an empty space is itself data).
▪ These are typically divided into two broad categories.
• Qualitative and
• Quantitative

▪ Structured data are those that can be easily organized, stored and
transferred in a defined data model, such as numbers/text set out in a
table or relational database that have a consistent format (e.g., name,
date of birth, address, gender, etc).
▪ Such data can be processed, searched, queried, combined, and
analyzed relatively straightforwardly using calculus and algorithms,
and can be visualized using various forms of graphs and maps, and
easily processed by computers.
▪ Often structured data is managed using Structured Query Language
(SQL).
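▪ A minimal sketch of structured data, using Python's built-in sqlite3 module:
rows with a consistent format stored in a relational table and queried with SQL
(the table, column names, and values are illustrative assumptions):

```python
import sqlite3

# A throwaway in-memory relational database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: every row follows the same predefined schema.
cur.execute("CREATE TABLE person (name TEXT, date_of_birth TEXT, address TEXT, gender TEXT)")
cur.execute("INSERT INTO person VALUES ('Abebe', '1998-05-12', 'Addis Ababa', 'M')")
cur.execute("INSERT INTO person VALUES ('Sara',  '2001-11-03', 'Bahir Dar',   'F')")

# Because the format is consistent, the data can be queried straightforwardly.
for row in cur.execute("SELECT name, address FROM person WHERE gender = 'F'"):
    print(row)
conn.close()
```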

▪ A much bigger percentage of all the data in our world is unstructured
data.
▪ Unstructured data is data that cannot be contained in a row-column
database and doesn’t have an associated data model.
▪ Think of the text of an email message.
▪ Other examples of unstructured data include photos, video and audio
files, text files, social media content, satellite imagery, presentations,
PDFs, open-ended survey responses, websites and call center
transcripts/recordings.
▪ Instead of spreadsheets or relational databases, unstructured data is
usually stored in data lakes, NoSQL databases, applications and data
warehouses.
▪ Beyond structured and unstructured data, there is a third category,
which is essentially a mix of the two.
▪ Semi-structured data are loosely structured data that have no
predefined data model/schema and thus cannot be held in a relational
database.
▪ The type of data defined as semi-structured data has some defining
or consistent characteristics but doesn’t conform to a structure as
rigid as is expected with a relational database.
▪ Email messages are a good example. While the actual content is
unstructured, it does contain structured data such as name and email
address of sender and recipient, time sent, etc.
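▪ A minimal sketch of semi-structured data: an email represented as JSON has
some consistent, self-describing fields (sender, recipient, time sent) but no rigid
relational schema, and the body itself is unstructured free text (the addresses
and content below are made up):

```python
import json

email = {
    "from": "student@example.com",
    "to": "instructor@example.com",
    "sent": "2024-02-16T13:34:52",
    "subject": "Chapter 2 question",
    # The body is unstructured free text embedded inside the structured fields.
    "body": "Hello, could you clarify the difference between data and information?",
}

# JSON is a common semi-structured format: self-describing, but with no fixed schema.
print(json.dumps(email, indent=2))
```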
▪ The last category of data type is metadata.
▪ From a technical point of view, this is not a separate data structure,
but it is one of the most important elements for Big Data analysis and
big data solutions.
▪ Metadata is data about data.
▪ It provides additional information about a specific set of data.
▪ The data value chain describes the process of data creation and use
from first identifying a need for data to its final use and possible
reuse.
▪ A value chain is made up of a series of subsystems, each with inputs,
transformation processes, and outputs.
▪ In a Data Value Chain, information flow is described as a series of
steps needed to generate value and useful insights from data.

▪ Data Acquisition is the process of gathering, filtering, and cleaning
data before it is put in a data warehouse or any other storage solution
on which data analysis can be carried out.
▪ Data acquisition is one of the major big data challenges in terms of
infrastructure requirements.
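▪ A minimal sketch of data acquisition (gathering, filtering, and cleaning data
before it is placed in storage), using pandas; the file name and column names
are illustrative assumptions:

```python
import pandas as pd

raw = pd.read_csv("sales_raw.csv")                # gather: read data from a source system
filtered = raw[raw["amount"] > 0]                 # filter: drop obviously invalid records
clean = filtered.dropna(subset=["customer_id"])   # clean: remove rows missing key fields

# Load the cleaned data into a storage layer on which analysis can be carried out.
clean.to_parquet("sales_clean.parquet")
```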

▪ Data Analysis is concerned with making the raw data acquired
amenable to use in decision-making as well as domain-specific usage.
▪ Data analysis is the process of evaluating data using analytical and
statistical tools to discover useful information and aid in business
decision making.
▪ Data analysis involves exploring, transforming, and modeling data
with the goal of highlighting relevant data, synthesizing and extracting
useful hidden information with high potential from a business point of
view.
▪ Related areas include data mining, business intelligence, and machine
learning.
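▪ A minimal sketch of data analysis with pandas: exploring and summarizing the
cleaned data to surface information useful for decision-making (the file and the
"region" and "amount" columns are assumptions carried over from the sketch above):

```python
import pandas as pd

sales = pd.read_parquet("sales_clean.parquet")

print(sales.describe())                               # explore: basic descriptive statistics

by_region = sales.groupby("region")["amount"].sum()   # transform/model: aggregate by region
print(by_region.sort_values(ascending=False).head())  # highlight: the strongest regions first
```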
▪ Data Curation is the active management of data over its life cycle to
ensure it meets the necessary data quality requirements for its
effective usage.
▪ Data curation is the organization and integration of data collected
from various sources.
▪ Data curation processes can be categorized into different activities
such as content creation, selection, classification, transformation,
validation, and preservation.
▪ It involves annotation, publication and presentation of the data such
that the value of the data is maintained over time, and the data
remains available for reuse and preservation.
▪ Data Storage is the persistence and management of data in a scalable
way that satisfies the needs of applications that require fast access to
the data.
▪ Relational Database Management Systems (RDBMS) have been the
main, and almost unique, solution to the storage paradigm for nearly
40 years.
▪ NoSQL technologies have been designed with the scalability goal in
mind and present a wide range of solutions based on alternative data
models.

▪ Data Usage covers the data-driven business activities that need access
to data, its analysis, and the tools needed to integrate the data
analysis within the business activity.
▪ The process of decision-making includes reporting, exploration of
data (browsing and lookup), and exploratory search (finding
correlations, comparisons, what-if scenarios, etc.).

▪ Data has not only become the lifeblood of any organization, but is also
growing exponentially.
▪ Data generated today is several magnitudes larger than what was
generated just a few years ago.
▪ Big Data is not simply a large amount of data
▪ Leading IT industry research group Gartner defines Big Data as:
“Big Data are high-volume, high-velocity, and/or
high-variety information assets that require new forms
of processing to enable enhanced decision making, insight
discovery and process optimization.”

▪ The Big Data definition is based on the three Vs (with veracity often
added as a fourth):
✓ Volume: the size of the data (how big it is)
✓ Velocity: how fast the data is being generated
✓ Variety: the variation of data types, including source, format, and
structure (data can be unstructured, semi-structured, or multi-structured)
✓ Veracity: can we trust the data? How accurate is it?

Importance of Big Data
▪ New generation data is changing in both quantity (volume) and format
(variety).
▪ Explosive growth (velocity) is the most obvious example of data
change.
• IBM estimates 2.5 quintillion bytes of data are generated each day.
• Ninety percent of the data in the world is less than two years old.

▪ The data explosion is due to new technologies that generate and
collect vast amounts of data.
▪ These sources include
• Scientific sensors such as global mapping, meteorological tracking,
medical imaging, and DNA research
• Point of Sale (POS) tracking and inventory control systems
• Social media such as Facebook posts and Twitter Tweets
• Internet and intranet websites across the world

Hadoop
▪ Open-source software from Apache
Software Foundation to store and
process large non-relational data sets via
a large, scalable distributed model.
▪ It is a scalable fault-tolerant system for
processing large datasets across a cluster
of commodity servers.
▪ The Apache Hadoop software library is a framework that allows for
the distributed processing of large data sets across clusters of
computers using simple programming models.
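▪ A minimal sketch (plain Python, run locally) of the MapReduce programming
model that Hadoop implements: a word count split into a map phase, a
shuffle/sort phase, and a reduce phase. On a real cluster, Hadoop would run the
map and reduce steps in parallel across many machines; the sample documents
below are made up:

```python
from collections import defaultdict

documents = ["big data needs big clusters", "hadoop processes big data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort: group all emitted values by key (Hadoop does this between phases).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 2, 'needs': 1, ...}
```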
▪ Because of the qualities of big data, individual
computers are often inadequate for handling
the data at most stages.
▪ To better address the high storage and
computational needs of big data, computer
clusters are a better fit.

▪ Using clusters requires a solution for managing cluster membership,
coordinating resource sharing, and scheduling actual work on
individual nodes.

▪ The assembled computing cluster often acts as a foundation that
other software interfaces with to process the data.

▪ The machines involved in the computing cluster are also typically
involved with the management of a distributed storage system,
which we will talk about when we discuss data persistence.
▪ Cluster membership and resource allocation can be handled by
software like Hadoop's YARN (which stands for Yet Another
Resource Negotiator).

▪ Big data clustering software combines the resources of many
smaller machines, seeking to provide a number of benefits:
▪ Resource Pooling: Combining the available storage space to hold
data is a clear benefit, but CPU and memory pooling is also
extremely important.
▪ High Availability: Clusters can provide varying levels of fault
tolerance and availability.
▪ Easy Scalability: The system can react to changes in resource
requirements without expanding the physical resources on a
machine.

▪ Hadoop is an open-source framework intended
to make interaction with big data easier.
▪ It is a framework that allows for the distributed
processing of large datasets across clusters of
computers using simple programming models.

▪ It is inspired by a technical document published by Google.
▪ The four key characteristics of Hadoop are:

▪ Economical: Its systems are highly economical, as ordinary computers
can be used for data processing.
▪ Reliable: It is reliable, as it stores copies of the data on different
machines and is resistant to hardware failure.
▪ Scalable: It is easily scalable, both horizontally and vertically. A few
extra nodes help in scaling up the framework.
▪ Flexible: It is flexible: you can store as much structured and
unstructured data as you need and decide how to use it later.

▪ Hadoop has an ecosystem that has evolved from its four core
components: data management, access, processing, and storage.
▪ It is continuously growing to meet the needs of Big Data.
▪ It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query-based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling
▪ Ingesting data into the system (The first stage of Big Data)
▪ The data is ingested or transferred to Hadoop from various
sources such as relational databases, systems, or local files.
Sqoop transfers data from an RDBMS to HDFS, whereas Flume
transfers event data.

▪ Processing the data in storage (The second stage of Big Data)
▪ In this stage, the data is stored and processed.
▪ The data is stored in the distributed file system, HDFS, and in the
NoSQL distributed database, HBase. Spark and MapReduce perform
the data processing.
▪ Computing and analyzing data (The third stage is to Analyze)
▪ Here, the data is analyzed by processing frameworks such as Pig,
Hive, and Impala. Pig converts the data using map and reduce and
then analyzes it. Hive is also based on map and reduce
programming and is most suitable for structured data.
▪ Visualizing the results (The fourth stage is Access)
▪ This is performed by tools such as Hue and Cloudera Search. In
this stage, the analyzed data can be accessed by users.

THANK YOU
