
Chapter - 1

The document introduces Big Data, defining it through the concepts of volume, velocity, and variety, emphasizing the need for innovative processing methods. It explains the role of data science in extracting knowledge from diverse data types and highlights the essential skills required for data scientists. Additionally, it distinguishes data science from business intelligence, showcasing the depth of insights and future-oriented questions data science addresses.

Uploaded by

ENBAKOM ZAWUGA
Copyright
© All Rights Reserved

Data Science and Big Data Analytics

CHAPTER ONE
INTRODUCTION
Outline
▪ Introduction to Big Data

▪ Data Science and Business Intelligence

▪ The Skillset of Data Scientists

▪ Summary

11/8/2024 DATA SCIENCE AND BIG DATA ANALYTICS, CHAPTER ONE


What is “Big Data“?
Is this really about size?



Naive Definition
▪ Naive definition:
• Big data only depends on the data size
• 1 Gigabyte? 1 Terabyte? 1 Petabyte?

▪ Naive interpretation misses important aspects


• Time:
✓ Analyzing 1 Gigabyte of data per day is different from analyzing 1 Gigabyte of data per second
• Diversity:
✓ Analyzing spreadsheets with numeric data is different from analyzing Web pages that contain a mixture of text and images
• Distribution:
✓ Analyzing data from a single source is different from analyzing data from multiple sources
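
The time aspect can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a decimal gigabyte (10^9 bytes) and a steady arrival rate; the function name `required_throughput` is hypothetical and only serves the illustration.

```python
# Hypothetical workload: the same 1 Gigabyte arrives once per day or once per second.
GIGABYTE = 10**9                 # bytes, decimal definition (assumption)
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds


def required_throughput(total_bytes: int, window_seconds: int) -> float:
    """Average bytes per second needed to keep up with the incoming data."""
    return total_bytes / window_seconds


per_day = required_throughput(GIGABYTE, SECONDS_PER_DAY)
per_second = required_throughput(GIGABYTE, 1)

print(f"1 GB per day:    {per_day:,.0f} bytes/s")
print(f"1 GB per second: {per_second:,.0f} bytes/s")
print(f"Ratio:           {per_second / per_day:,.0f}x")
```

The data size is identical in both cases, but the per-second case needs 86,400 times the sustained throughput, which is why size alone is a naive criterion.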



Definition of Big Data
▪ Following Gartner's IT Glossary:
• Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
▪ The three Vs define big data!
• Volume
• Velocity
• Variety
▪ Some people actually use 10 Vs, adding:
• Variability
• Veracity
• Validity
• Vulnerability
• Volatility
• Visualization
• Value
The 3 Vs: Volume
▪ Scale of the data must be “big“
• No clear definition
• “that demand […] innovative forms of information processing” (Gartner)

[Figure: Data center storage worldwide. © Statista 2018]



The 3 Vs: Velocity
▪ Speed at which new data is created
▪ Speed at which data must be processed and analyzed
• Often close to real-time
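
One way to picture near-real-time processing is a sliding window that is updated with every incoming event. The sketch below is only an illustration under stated assumptions: the class `SlidingWindowCounter` and the timestamps are hypothetical, and events are assumed to arrive in time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts the events seen during the last `window_seconds` seconds."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp: float) -> int:
        """Record one event and return the number of events inside the window."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the window (assumes ordered input).
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


# Hypothetical event times in seconds.
counter = SlidingWindowCounter(window_seconds=10)
counts = [counter.add(t) for t in [0, 1, 2, 11, 12, 30]]
print(counts)
```

Each event is handled as it arrives and old events are discarded immediately, so the result is always current without re-reading the whole data set.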



The 3 Vs: Variety
▪ Diversity in data types and data sources

Structured
• Data with defined types and structure
• Example: comma separated values

Semi-Structured
• Textual data with parseable pattern
• Example: XML files with schema

Quasi-Structured
• Textual data with erratic formats that can be formatted with effort
• Example: stream data

Unstructured
• Data that has no inherent structure, often with multiple formats
• Example: Web site, videos
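
The four categories differ mainly in how much work is needed before the data can be analyzed. The following sketch uses only the Python standard library; all sample data is invented for illustration, and the log line standing in for quasi-structured data is an assumption, not an example from the slides.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: comma separated values with a fixed set of columns.
csv_text = "customer_id,amount\n17,19.99\n42,5.00\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: XML can be parsed by following its tags.
xml_text = "<orders><order id='1'><amount>19.99</amount></order></orders>"
root = ET.fromstring(xml_text)
amounts = [float(order.findtext("amount")) for order in root.iter("order")]

# Quasi-structured: an erratic text format needs a hand-written pattern.
log_line = "2024-11-08 12:00:01 GET /products?id=42 200"
pattern = re.compile(
    r"(?P<date>\S+) (?P<time>\S+) (?P<method>\w+) (?P<path>\S+) (?P<status>\d{3})"
)
match = pattern.match(log_line)
fields = match.groupdict() if match else {}

# Unstructured: free text has no schema, so only generic steps such as
# splitting into tokens apply directly.
review = "Great shop, fast delivery!"
tokens = review.lower().split()

print(rows[0], amounts, fields["status"], tokens)
```

The effort grows from top to bottom: the CSV reader needs no configuration, the XML needs knowledge of the tag names, the log line needs a custom pattern, and the free text yields no fields at all.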
Examples for data types
[Figure: example panels for Structured, Semi-Structured, Quasi-Structured, and Unstructured data]



The 3 Vs: A Case Study
▪ A retail company, YoursMart, has recently started
collecting data from various sources. They have
customer transaction data from their point-of-sale
systems, online shopping behaviors from their
website, social media interactions, and inventory
levels from their warehouses. With this data, they
aim to improve customer experience, optimize
inventory management, and enhance their
marketing strategies.

1. What does the "Volume" aspect of Big Data primarily refer to?
A. The speed at which data is generated
B. The amount of data being collected
C. The different formats of data
D. The cost associated with storing data

Answer: B) The amount of data being collected

2. In the context of YoursMart, how would you define "Velocity"?
A. The total number of transactions made in a year
B. The frequency at which customer interactions occur
C. The different types of data formats YoursMart is using
D. The rate at which data is stored

Answer: B) The frequency at which customer interactions occur


3. What does "Variety" in Big Data indicate in relation to YoursMart's
data sources?
A. The size of the data collected
B. The speed of data processing
C. The different forms of data (e.g., structured, unstructured)
D. The methods used to analyze the data

Answer: C) The different forms of data (e.g., structured, unstructured)

4. Which of the following statements best captures a naive definition of
Volume?
A. "Volume is how fast data is generated and processed."
B. "Volume refers to the quantity of data that a company collects."
C. "Volume is about the diversity of data types."
D. "Volume deals with the analysis techniques used."

Answer: B) "Volume refers to the quantity of data that a company collects."

5. How might the naive definition of Velocity misrepresent its
significance in YoursMart's strategy?
A. It might emphasize the importance of data formats.
B. It could lead to neglecting the time-sensitive nature of customer
behavior analysis.
C. It might focus too much on storage capacity.
D. It may not highlight the diversity of data types.
Answer: B) It could lead to neglecting the time-sensitive nature of customer
behavior analysis.
Defining Data Science
▪ Unfortunately, there is no clear definition (yet?)

▪ Goal is the extraction of knowledge from data

▪ Combination of techniques from different disciplines

▪ Scientific principles guide the data analysis



What is “Data Science“?

Tools? Big Data? Machine Learning?



Mathematical Aspects

• Computational Geometry
• Optimization
• Stochastics
• Scientific Computing
• Machine Learning
Computer Science Aspects

• Data Structures and Algorithms
• Databases
• Distributed Computing
• Software Engineering
• Artificial Intelligence
• Machine Learning
Statistical Aspects

• Linear Models
• Statistical Tests
• Inference
• Time Series Analysis
• Machine Learning
Applications

• Intelligent Systems
• Robotics
• Marketing
• Medicine
• Autonomous Driving
• Social Networks
Data Science vs. Business Intelligence
▪ Business Intelligence (Gartner IT Glossary)
• […] best practices that enable access to and analysis of information to improve and optimize decisions and performance.
▪ Techniques
• Business Intelligence: dashboards, alerts, queries
• Data Science: optimization, predictive modelling, forecasting
▪ Data types
• Business Intelligence: structured, data warehouses
• Data Science: any kind, often unstructured
▪ Common questions
• Business Intelligence: What happened…? How much did…? When did…?
• Data Science: What if…? What will…? How can we…?

[Figure: depth of insights (low to high) over time (past, present, future), with Business Intelligence at low depth and Data Science at high depth]
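
The difference between the two kinds of questions can be sketched on the same data. The example below is a simplified illustration under stated assumptions: the sales figures are invented, and an ordinary least-squares line stands in for "predictive modelling"; real forecasting would use richer models.

```python
# Hypothetical monthly sales figures (month 0 to month 3).
monthly_sales = [100.0, 110.0, 120.0, 130.0]


def total_sales(sales):
    """Business Intelligence style: 'How much did we sell?' (looks at the past)."""
    return sum(sales)


def fit_line(ys):
    """Least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Data Science style: 'What will we sell next month?' (looks at the future).
slope, intercept = fit_line(monthly_sales)
next_month = len(monthly_sales)
forecast = intercept + slope * next_month

print(f"Total so far: {total_sales(monthly_sales):.1f}")
print(f"Forecast for month {next_month}: {forecast:.1f}")
```

The first function only aggregates what already happened, while the fitted line extrapolates beyond the observed data and therefore carries uncertainty that a query result does not.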



More Data → More Opportunities
[Figure: volume of information (small to large) over time]
• 1990s: terabytes; relational databases & data warehouses
• 2000s: petabytes; content management & unstructured data
• 2010s: exabytes; key-value storages



What are Data Scientists?
▪ Not computer scientists
• But should know about databases, data structures, algorithms, etc.

▪ Not mathematicians
• But should know about optimization, stochastics, etc.

▪ Not statisticians
• But should know about regression, statistical tests, etc.

▪ Not domain experts


• But must work together with them



Skills of Data Scientists
Quantitative
• Maths
• Algorithms
• Statistics

Collaborative
• Teamwork
• Communication skills

Technical
• Programming
• Infrastructures

Skeptical
• Create hypotheses, but be uncertain about them



Different Types of Data Scientists
According to Microsoft Research:
◦ Polymath: “Do it all”
◦ Data Evangelist: Data analysis, disseminating and acting on insights
◦ Data Preparer: Querying existing data, preparing data for analysis
◦ Data Shapers: Analyzing and preparing data
◦ Data Analyzer: Analyzing data
◦ Platform Builder: Collect data and create infrastructures
◦ Moonlighters (50%/20%): “Spare time” data scientists
◦ Insight Actors: Use the outcome and act on insights



Summary
▪ Big data has a high volume, velocity, and variety

▪ Different data structures
• Structured, semi-structured, quasi-structured, unstructured

▪ Data science is a very diverse discipline
• Maths, computer science, statistics, applications

→ Data scientists require a diverse skillset
