
CSE 2027 - Fundamentals of Data Analysis

Module 1 - Introduction to Data Analysis

Introducing Data, overview of data analysis: Data in the Real World, Data vs. Information, Many "Vs" of Data, Structured Data and Unstructured Data, Types of Data, Data Analysis Defined, Types of Variables, Central Tendency of Data, Scales of Data, Sources of Data, Data preparation: Cleaning the data, Removing variables, Data Transformations.
Introducing Data
• Facts and statistics collected together for reference or analysis.
• Data has to be transformed into a form that is efficient for movement or processing.

Overview of Data Analysis
• Data analysis is defined as the process of cleaning, transforming, and modeling data to discover useful information for business decision-making.
• The purpose of data analysis is to extract useful information from data and to make decisions based upon that analysis.
• A simple example of data analysis occurs whenever we make a decision in our day-to-day life by thinking about what happened last time, or about what is likely to happen if we choose a particular option.

• This is nothing but analyzing our past or future and making decisions based on it.
• For that, we gather memories of our past or dreams of our future.
• That is, in essence, data analysis. When an analyst does the same thing for business purposes, it is called Data Analysis.

Data Analysis Tools

Data in the Real World

Many "Vs" of Data
A. Volume
• Volume refers to the magnitude or scale of data.
• Massive amounts of data generated from multiple sources cannot be handled through traditional means such as a single database.
• This large volume of data is a composition of multiple data types, much of it unstructured in nature.
• This kind of data can be in the form of audio, video, tweets, likes, etc.
B. Velocity
• Velocity refers to the speed at which gigantic amounts of data are being generated, collected, and scrutinized.
• With every passing second, data is being searched for on the internet.
• On a day-to-day basis, social networking sites like Facebook, Twitter, LinkedIn, etc. are sharing large amounts of data.
• Analyzing this constantly generated data requires keeping up with its speed while maintaining easy access to it.
C. Variety
• In big data terms, Variety refers to the composition of structured and unstructured kinds of data.
• The data collected from different sources like mobile phones, laptops, etc. is not homogeneous in nature.
• Apart from text, audio, and video files, there may be log files, clicks, likes or dislikes, etc.
D. Value
• Value refers to converting the investigated data into something of value to the organization.
• Value is one of the most important characteristics of big data: data must be collected and analyzed in order to boost the performance of an organization and to build a better understanding of its customers.
• With access to this useful data, one must analyze it well in order to extract real benefits.
E. Variability
• Variability refers to unpredictable changes in the data.
• It may happen because of the multiple data types involved and the speed with which data is generated and loaded into the database.
F. Veracity
• Veracity refers to the trustworthiness and accuracy of data.
• Only if the data is accurate can you think of it as meaningful data.
• For example, consider a dataset of thirty students on which we have to make an analysis of the reasons they got a distinction.
• As an analyst, you can ask questions like:
• What methodology did you adopt to get good marks in all the subjects?
• How much time do you devote to each individual subject?
• Do you learn some subjects with the help of daily-life activities like sports, etc.?
• Have you ever been a scholar?
• By getting answers to questions like these, it becomes easier to determine the accuracy of the information, which can then easily be maintained in statistical form.
G. Validity
• The two big data terms veracity and validity seem alike but are quite different.
• Validity refers to whether the data is valid and suitable for an accurate analysis, so that optimized results can be obtained.
H. Vulnerability
• Vulnerability is one of the major challenges in big data: data generated from multiple sources at such an erratic speed has a high chance of being harmed by an intruder.
• For example, a Belgian court recently threatened Facebook with a heavy fine for breaching user privacy.
I. Volatility
• Volatility refers to how long the collected data remains useful to us and how it is to be kept.
• To analyze this, it is necessary to develop new rules and techniques through which rapid access to the information remains possible.
J. Visualization
• Data visualization is one of the most complex challenges in big data.
• In this information age, data is not only growing beyond limits but is also composed of different data types.
• So there is a need to communicate the information by visualizing it in special ways, with special functionalities such as web-based approaches, statistical analysis, etc.
• Traditional tools of data visualization face severe challenges like low response time, complex methods of scalability, and precision in reporting time.
• So it is a challenge to decide which way of communicating the data is most suitable in order to make visualization more effective.
Typical human-generated unstructured data includes:
• Text files: Word processing, spreadsheets, presentations, email, logs.
• Email: Email has some internal structure thanks to its metadata, and we sometimes
refer to it as semi-structured. However, its message field is unstructured and
traditional analytics tools cannot parse it.
• Social Media: Data from Facebook, Twitter, LinkedIn.
• Website: YouTube, Instagram, photo sharing sites.
• Mobile data: Text messages, locations.
• Communications: Chat, IM, phone recordings, collaboration software.
• Media: MP3, digital photos, audio and video files.
• Business applications: MS Office documents, productivity applications

Typical machine-generated unstructured data includes:
• Satellite imagery: Weather data, land forms, military movements.
• Scientific data: Oil and gas exploration, space exploration, seismic imagery, atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Types of Digital Data

Data Analysis - Types
• There are several types of data analysis techniques, depending on the business and the technology. The major data analysis methods are:
• Text Analysis
• Statistical Analysis
• Diagnostic Analysis
• Predictive Analysis
• Prescriptive Analysis

Text Analysis
• Text Analysis is also referred to as Data Mining. It is one of the methods of data analysis used to discover patterns in large data sets using databases or data mining tools.
• It is used to transform raw data into business information. Business Intelligence tools available in the market are used to take strategic business decisions. Overall, it offers a way to extract and examine data, derive patterns, and finally interpret the data, as in the small sketch below.
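A minimal sketch of this idea, using only Python's standard library on a few made-up example documents (the documents and the "pattern" of interest are hypothetical, not taken from the slides):

# Hypothetical illustration: surface recurring themes by counting word
# frequencies across a small collection of made-up documents.
from collections import Counter
import re

documents = [
    "Customers liked the fast delivery and friendly support",
    "Support was friendly but delivery was slow",
    "Fast delivery, great support, will order again",
]

word_counts = Counter()
for doc in documents:
    # Tokenize each document into lowercase words before counting.
    word_counts.update(re.findall(r"[a-z]+", doc.lower()))

# The most common words hint at recurring patterns in the text.
print(word_counts.most_common(5))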

Statistical Analysis
• Statistical Analysis shows "What happened?" by using past data in the form of dashboards. Statistical analysis includes the collection, analysis, interpretation, presentation, and modeling of data. It analyses a set of data or a sample of data.
• There are two categories of this type of analysis: Descriptive Analysis and Inferential Analysis.

Descriptive Analysis
• Analyses complete data or a sample of summarized numerical data. It shows the mean and deviation for continuous data, and the percentage and frequency for categorical data (see the sketch below).
Inferential Analysis
• Analyses a sample drawn from the complete data. In this type of analysis, you can reach different conclusions from the same data by selecting different samples.
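A minimal sketch of descriptive analysis using Python's statistics module; the weights and eye colors below are made-up values, not data from the slides:

# Descriptive summaries: mean and standard deviation for a continuous
# variable, frequency and percentage counts for a categorical variable.
import statistics
from collections import Counter

weights_kg = [61.2, 58.4, 70.1, 65.5, 72.3, 68.0]          # continuous
eye_colors = ["brown", "blue", "brown", "green", "brown"]  # categorical

print("mean:", statistics.mean(weights_kg))
print("std dev:", statistics.stdev(weights_kg))

counts = Counter(eye_colors)
for color, n in counts.items():
    print(f"{color}: {n} ({100 * n / len(eye_colors):.0f}%)")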

Diagnostic Analysis
• Diagnostic Analysis shows "Why did it happen?" by finding the cause from the insights found in statistical analysis. This analysis is useful for identifying behavior patterns in data. If a new problem arrives in your business process, you can look into this analysis to find similar patterns of that problem, and there is a chance that similar prescriptions can be applied to the new problem.

Predictive Analysis
• Predictive Analysis shows "What is likely to happen?" by using previous data. The simplest example: if last year I bought two dresses based on my savings, and this year my salary doubles, then I can expect to buy four dresses. Of course it is not that simple, because you have to think about other circumstances, such as the chance that clothing prices increase this year, or that instead of dresses you want to buy a new bike, or need to buy a house!
• So this analysis makes predictions about future outcomes based on current or past data. Forecasting is just an estimate; its accuracy depends on how much detailed information you have and how deeply you dig into it. A toy version of the dresses example is sketched below.
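A toy sketch of the dresses example in Python; the salary figures are hypothetical, and the naive proportional rule merely stands in for a real predictive model:

# Naive prediction: purchases grow in proportion to income. Real predictive
# analysis would also account for prices and competing expenses.
last_year_salary = 30000
last_year_dresses = 2

this_year_salary = 60000  # salary doubles this year

predicted_dresses = last_year_dresses * (this_year_salary / last_year_salary)
print(int(predicted_dresses))  # 4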

Prescriptive Analysis
• Prescriptive Analysis combines the insights from all the previous analyses to determine which action to take on a current problem or decision. Most data-driven companies use prescriptive analysis because predictive and descriptive analysis alone are not enough to improve performance. Based on current situations and problems, they analyze the data and make decisions.

Types of Variables
• Categorical (qualitative) variables have values that can only be placed into categories, such as "yes" and "no."
• Numerical (quantitative) variables have values that represent quantities.
• Discrete variables arise from a counting process.
• Continuous variables arise from a measuring process.

Types of Variables
• Variables
  • Categorical (defined categories). Examples: Marital Status, Political Party, Eye Color
  • Numerical
    • Discrete (counted items). Examples: Number of Children, Defects per hour
    • Continuous (measured characteristics). Examples: Weight, Voltage
Central Tendency - Mode
• The mode is the most commonly reported value for a particular variable.
• It is illustrated using the following variable whose values are: 3, 4, 5, 6, 7, 7, 7, 8, 8, 9.
• The mode would be the value 7, since there are three occurrences of 7 (more than any other value).
• In the following set of values, both 7 and 8 are reported three times: 3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9. The mode may be reported as {7, 8} or as 7.5.

Median
• The median is the middle value of a variable once it has been sorted from low to high. For variables with an even number of values, the mean of the two values closest to the middle is selected (sum the two values and divide by 2).
• The following set of values will be used to illustrate: 3, 4, 7, 2, 3, 7, 4, 2, 4, 7, 4.
• Before identifying the median, the values must be sorted: 2, 2, 3, 3, 4, 4, 4, 4, 7, 7, 7. Since there are 11 values, the median is the middle (sixth) value: 4.

Mean
• The mean (arithmetic average) is the sum of all the values of a variable divided by the number of values. For the set 2, 2, 3, 3, 4, 4, 4, 4, 7, 7, 7 the mean is 47/11, or approximately 4.27.
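A minimal sketch computing the three measures of central tendency with Python's statistics module, using the example values from the preceding slides:

import statistics

mode_values = [3, 4, 5, 6, 7, 7, 7, 8, 8, 9]
two_modes = [3, 4, 5, 6, 7, 7, 7, 8, 8, 8, 9]
median_values = [3, 4, 7, 2, 3, 7, 4, 2, 4, 7, 4]

print("mode:", statistics.mode(mode_values))        # 7
print("modes:", statistics.multimode(two_modes))    # [7, 8] (Python 3.8+)
print("median:", statistics.median(median_values))  # 4, the middle of the sorted values
print("mean:", statistics.mean(median_values))      # sum of the values / number of values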
Sources of Data
• Surveys or polls
A survey or poll can be useful for gathering data to
answer specific questions
• Experiments:
Experiments measure and collect data to answer a
specific question in a highly controlled manner. The data
collected should be reliably measured, that is, repeating
the measurement should not result in different values.
Experiments attempt to understand cause-and-effect phenomena by controlling other factors that may be important.

• Observational and other studies: In certain situations it is impossible, on either logistical or ethical grounds, to conduct a controlled experiment. In these situations, a large number of observations are measured and care must be taken when interpreting the results.
• Operational databases: These databases
contain ongoing business transactions. They
are accessed constantly and updated regularly.
Examples include supply chain management
systems, customer relationship management
(CRM) databases and manufacturing
production databases.

• Data warehouses: A data warehouse is a copy of data
gathered from other sources within an organization
that has been cleaned, normalized, and optimized for
making decisions. It is not updated as frequently as
operational databases.
• Historical databases: Databases are often used to
house historical polls, surveys and experiments.
• Purchased data: In many cases data from in-house
sources may not be sufficient to answer the questions
now being asked of it. One approach is to combine this
internal data with data from other sources.

Scales of Data
• Nominal: Scale describing a variable with a
limited number of different values. This scale
is made up of the list of possible values that
the variable may take. It is not possible to
determine whether one value is larger than
another.
• Ordinal: This scale describes a variable whose
values are ordered; however, the difference
between the values does not describe the
magnitude of the actual difference.

• Interval: Scales that describe values where the
interval between the values has meaning.
• Ratio: Scales that describe variables where the same difference between values has the same meaning (as in interval), but where a doubling, tripling, etc. of the values implies a doubling, tripling, etc. of the measurement.

Cleaning the Data
• Since the data available for analysis may not have
been originally collected with this project’s goal in
mind, it is important to spend time cleaning the data.
• It is also beneficial to understand the accuracy with which the data was collected, as well as to correct any errors.
• For variables measured on a nominal or ordinal scale
(where there are a fixed number of possible values), it
is useful to inspect all possible values to uncover
mistakes and/or inconsistencies.
• Any assumptions made concerning possible values
that the variable can take should be tested.

• For example, a variable Company may include
a number of different spellings for the same
company such as:
• General Electric Company
• General Elec. Co
• GE
• Gen. Electric Company
• General electric company
• G.E. Company

• These different terms, where they refer to the same company, should be consolidated into one value for analysis (a small sketch of this is shown below).
• In addition, subject matter expertise may be needed when cleaning these variables.
• For example, a company name may refer to one of the divisions of the General Electric Company, and for the purposes of this specific project it should be recorded as "General Electric Company."
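A minimal sketch, using a hand-built lookup table, of consolidating the variant spellings of "General Electric Company" listed above into a single canonical value before analysis:

# Map known variant spellings (lowercased) to one canonical company name.
company_map = {
    "general electric company": "General Electric Company",
    "general elec. co": "General Electric Company",
    "ge": "General Electric Company",
    "gen. electric company": "General Electric Company",
    "g.e. company": "General Electric Company",
}

def clean_company(name: str) -> str:
    # Normalize case and surrounding whitespace before the lookup;
    # unknown names are passed through unchanged for later review.
    return company_map.get(name.strip().lower(), name.strip())

raw = ["General Electric Company", "GE", "Gen. Electric Company", "General electric company"]
print([clean_company(n) for n in raw])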

Removing Variables
• On the basis of an initial categorization of the variables, it may be possible to remove some variables from consideration at this point.
• For example, constants and variables with too many missing data points should be considered for removal.
• Further analysis of the correlations between multiple variables may identify variables that provide no additional information to the analysis and hence could be removed (see the sketch below).
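A minimal sketch, assuming pandas and a purely numeric table, of two common removal steps: dropping constant columns and dropping one column from each highly correlated pair. The table, column names, and the 0.95 threshold are all illustrative:

import pandas as pd

df = pd.DataFrame({
    "store_id":  [1, 2, 3, 4, 5],
    "country":   [1, 1, 1, 1, 1],                   # constant -> remove
    "sales":     [10.0, 12.0, 15.0, 11.0, 14.0],
    "sales_eur": [9.2, 11.0, 13.8, 10.1, 12.9],     # near-duplicate of sales
})

# 1. Remove constant columns (only one distinct value).
constants = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constants)

# 2. Remove one column from each pair with very high correlation.
corr = df.corr().abs()
cols = list(corr.columns)
to_drop = set()
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if corr.loc[a, b] > 0.95:
            to_drop.add(b)
df = df.drop(columns=list(to_drop))

print(df.columns.tolist())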

Data Transformation
Normalization
• Normalization is a process where numeric columns are transformed using a mathematical function to a new range. It is important for two reasons.
• First, the analysis should treat all variables equally, so that one column does not have more influence over another simply because their ranges are different.
• For example, when analyzing customer credit card data, the credit limit value should not be given more weight in the analysis than the customer's age.
• Second, certain data analysis and data mining methods, such as neural networks or k-nearest neighbors, require the data to be normalized prior to analysis. A small sketch of two common normalization schemes is shown below.
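A minimal sketch, with made-up credit limit values, of two common normalization schemes: min-max scaling to the range [0, 1], and z-score standardization (subtract the mean, divide by the standard deviation):

import statistics

credit_limit = [500.0, 1200.0, 3000.0, 7500.0, 10000.0]  # hypothetical values

# Min-max scaling: map the smallest value to 0 and the largest to 1.
lo, hi = min(credit_limit), max(credit_limit)
min_max = [(x - lo) / (hi - lo) for x in credit_limit]

# Z-score standardization: mean 0, standard deviation 1.
mu = statistics.mean(credit_limit)
sigma = statistics.stdev(credit_limit)
z_scores = [(x - mu) / sigma for x in credit_limit]

print("min-max:", [round(v, 3) for v in min_max])
print("z-score:", [round(v, 3) for v in z_scores])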

