
DATA SCIENCE AND VISUALIZATION
21CS644
MODULE-1 Statistical Inference
Chapter 2
Statistical Inference

 The world is complex, random, and sometimes uncertain; in fact, it is a big-data-generating machine!
 Once data is captured, we have captured the world, or at least certain traces of it.
 These captured traces must then be simplified into something more comprehensible and concise, such as mathematical models or functions. These are called statistical estimators.
 This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.
 Definition: statistical inference is the discipline concerned with the development of procedures, methods, and theorems that allow the extraction of meaning and information from data generated by stochastic processes.
Population and Samples

 Population – any set of objects or units (tweets, photographs, stars).
 If the characteristics of all the objects can be measured or extracted, then we have the complete set of observations, of size N.
Population and Samples

 A subset of the units, of size n, considered in order to make observations, draw conclusions, and make inferences about the population is called a sample.
 Taking a subset may introduce biases into the data and distort it.
When should samples be used?

• When studying a large population where it is impractical or impossible to collect data from every individual.
• When resources such as time, cost, and manpower are limited, making it more feasible to collect data from a subset of the population.
• When conducting research or experiments where it is important to minimize potential biases in data collection.
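
As a rough illustration of drawing a simple random sample, here is a minimal Python sketch (not from the slides; the synthetic population and the sample size are assumptions) showing how a sample of n units can estimate a population quantity such as the mean:

```python
# Minimal sketch (not from the slides): drawing a simple random sample of
# n units from a synthetic population of N units and comparing means.
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=10.0, size=1_000_000)    # N = 1,000,000 units

sample = rng.choice(population, size=1_000, replace=False)  # a sample of n = 1,000 units

print(f"population mean (N = {population.size}): {population.mean():.3f}")
print(f"sample mean     (n = {sample.size}):     {sample.mean():.3f}")
```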
Population and Samples of Big Data

 In the age of big data, do we still need to take samples?
 Sampling solves some engineering problems.
 How much data is needed depends on the goal:
 For analysis or inference, there is no need to store all the data all the time.
 For serving purposes, you may need all of it, all the time, in order to render correct information.
Population and Samples of Big Data

 Types of data
 Traditional – numerical, categorical and binary
 Text – emails, tweets, reviews, news articles
 Records – user-level data, timestamped event data, JSON-formatted log files
 Geo-based location data – housing data
 Network
 Sensor data
 Images
Populations and Samples of Big Data

 New data requires new strategies for sampling.
 If Facebook user-level data aggregated from timestamped event logs is analyzed for a week, can any conclusions be drawn that are relevant next week, or next year?
 How do we sample from a network and preserve the complex network structure?
 Many of these questions are open research questions.
BIG Data

 BIG is a moving target – when the size of the data becomes a challenge, we refer to it as big.
 Big is when you can't fit all the data on one machine.
 BIG data is a cultural phenomenon.
 The 4 Vs – volume, variety, velocity, and value.


Big Data Can Mean Big Assumptions

 The big data revolution consists of three things:
 Collecting and using a lot of data rather than small samples
 Accepting messiness in your data
 Can n = all?
 Very often it cannot, and the data misses exactly the things we should consider most.
 The n = 1 case – a sample size of 1:
 For a single person, we can actually record a lot of information.
 We might even sample from all the actions they took in order to make inferences about them.
 This is user-level modeling.
Big Data Can Mean Big Assumptions

 Data is not objective – data does not speak for itself!
 Example: an algorithm for hiring. Consider an organization that did not treat its female employees well. When comparing men and women with the same qualifications, the data showed that women tended to leave more often, get promoted less often, and give more negative feedback on the work environment than men. An automated model built on this data will likely favor a man over a woman when a man and a woman with the same qualifications show up for an interview.
 Ignoring causation can be a flaw, rather than a feature, and can add to historical problems rather than address them.
 Data is just a quantitative representation of the events of our society.
Modeling

 Model – humans try to understand the world around them by representing it in different ways.
 Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself.
 A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.
 A model is an artificial construction where all irrelevant detail has been removed or abstracted.
Modeling

 A model is an artificial construction where all irrelevant detail has been removed or abstracted.
 Attention must always be paid to these abstracted details after a model has been analysed, to see what might have been overlooked.
 In the case of a statistical model, we may have mistakenly excluded key variables, included irrelevant ones, or assumed a mathematical structure divorced from reality.
Statistical Modeling

 What comes first? What influences what? What causes what? What's a test of that?
 These kinds of relations are expressed as mathematical expressions that are general enough that they have to include parameters, but the parameter values are not yet known.
 Example: if there are two columns of data, x and y, and there is a linear relationship between them, then we can represent it as
y = ax + b
where a and b are parameters whose values are not yet known.
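
To make this concrete, here is a minimal sketch (not from the slides; the simulated data and the use of least squares via numpy.polyfit are assumptions) of estimating a and b from observed x and y values:

```python
# Minimal sketch (not from the slides): estimating the unknown parameters a and b
# of y = ax + b from observed data, using least squares on simulated points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # true a = 2.5, b = 1.0, plus noise

a_hat, b_hat = np.polyfit(x, y, deg=1)  # fit a first-degree polynomial y = a*x + b
print(f"estimated a = {a_hat:.3f}, estimated b = {b_hat:.3f}")
```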
How to build a model?

 Start with exploratory data analysis (EDA):
 Make plots and build intuition for the particular dataset.
 Plot histograms and look at scatterplots to get a feel for the data.
 Write down representative functions – start with a simple linear function and see if it makes sense.
 Write down complete sentences and try to express the words as equations and code.
 Simple plots may be easier to interpret and understand.
 Trade-off: a simple model may get you 90% of the way and take a few hours to build, whereas a complex model may get you up to 92% and take months to build.
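
A minimal EDA sketch along these lines (not from the slides; the file name "data.csv" and the column names "x" and "y" are hypothetical placeholders for your own dataset):

```python
# Minimal EDA sketch (not from the slides): a histogram and a scatterplot,
# assuming a CSV file "data.csv" with numeric columns "x" and "y".
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["x"].hist(ax=ax1, bins=30)          # histogram: the distribution of a single column
ax1.set_title("histogram of x")
df.plot.scatter(x="x", y="y", ax=ax2)  # scatterplot: the relationship between two columns
ax2.set_title("y vs. x")
plt.tight_layout()
plt.show()
```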
Probability Distributions

 Probability distributions are the foundations of statistical models.
 Examples – the normal (Gaussian) distribution, Poisson distribution, Weibull distribution, gamma distribution, and exponential distribution.
 https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/
Probability Distributions

 A probability distribution is a mathematical function that defines the likelihood of different outcomes or values of a variable.
 This function is commonly represented by a graph or a probability table, and it provides the probabilities of the various possible results of an experiment or random phenomenon, based on the sample space and the probabilities of events.
Probability Distribution

 A random variable, denoted by x or y, can be assumed to have a corresponding probability distribution p(x), which maps x to a nonnegative real number.
 In order to be a probability density function, we are restricted to the set of functions such that if we integrate p(x) to get the area under the curve, it is 1, so it can be interpreted as a probability.
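
A minimal numerical check of this property (not from the slides; the particular density and the use of scipy are assumptions):

```python
# Minimal sketch (not from the slides): checking numerically that the density
# p(x) = 2e^(-2x) integrates to 1 over [0, infinity), so it is a valid pdf.
import numpy as np
from scipy import integrate

p = lambda x: 2 * np.exp(-2 * x)
area, _ = integrate.quad(p, 0, np.inf)
print(f"area under p(x): {area:.6f}")  # approximately 1.0
```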
Probability Distribution

 Example: let x be the amount of time until the next bus arrives. x is a random variable because there is variation and uncertainty in the amount of time until the next bus.
 Suppose we know that the time until the next bus has the probability density function p(x) = 2e^(-2x). If we want to know the likelihood of the next bus arriving between 12 and 13 minutes from now, we find the area under the curve between 12 and 13, i.e., the integral of 2e^(-2x) from 12 to 13.
 How do we know that this distribution is correct?
 We can conduct an experiment where we show up at the bus stop at a random time, measure how much time passes until the next bus, and repeat this experiment over and over again. Then we look at the measurements, plot them, and approximate the function.
 Because "waiting time" is a common enough real-world phenomenon, a distribution called the exponential distribution has been invented to describe it; it takes the form p(x) = λe^(-λx).
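
A minimal sketch of computing that area under the curve (not from the slides; the numerical integration with scipy and the closed-form comparison are assumptions):

```python
# Minimal sketch (not from the slides): the probability that the next bus arrives
# between 12 and 13 minutes from now, for waiting times with density p(x) = 2e^(-2x).
import numpy as np
from scipy import integrate

lam = 2.0
p = lambda x: lam * np.exp(-lam * x)

prob, _ = integrate.quad(p, 12, 13)                  # area under the curve on [12, 13]
closed_form = np.exp(-lam * 12) - np.exp(-lam * 13)  # exponential CDF difference

print(f"numerical integral: {prob:.3e}, closed form: {closed_form:.3e}")
```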
Joint Distribution and Conditional Distribution

 Multivariate functions called joint distributions do the same thing for more than one random variable.
 In the case of two random variables, for example, we could denote our distribution by a function p(x, y); it takes values in the plane and gives us nonnegative values.
 We also have what is called a conditional distribution, p(x|y), which is to be interpreted as the density function of x given a particular value of y.
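
For intuition, here is a minimal discrete sketch (not from the slides; the probability table values are made up) of a joint distribution p(x, y) and the conditional distribution p(x|y) derived from it:

```python
# Minimal sketch (not from the slides): a small made-up discrete joint distribution
# p(x, y) and the conditional distribution p(x|y) derived from it.
import numpy as np

# Rows index values of x, columns index values of y; all entries sum to 1.
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_y = p_xy.sum(axis=0)        # marginal distribution of y
p_x_given_y = p_xy / p_y      # p(x|y): each column divided by its column sum

print("p(y):", p_y)
print("p(x | y = 1):", p_x_given_y[:, 1])  # each column of p(x|y) sums to 1
```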
Fitting a Model

 Estimate the parameters of the model using the observed data.
 The data is used as evidence to help approximate the real-world mathematical process that generated the data.
 Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to help get the parameters.
 When you estimate the parameters, they are actually estimators, meaning they themselves are functions of the data.
 Fitting the model is when you start actually coding: your code will read in the data, and you'll specify the functional form that you wrote down on paper. Then R or Python will use built-in optimization methods to give you the most likely values of the parameters given the data.
 Initially you should have an understanding that optimization is taking place and how it works, but you don't have to code this part yourself – it underlies the R or Python functions.
Fitting a Model

 Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained.
 A model that is well fitted produces more accurate outcomes.
 A model that is overfitted matches the training data too closely.
 A model that is underfitted doesn't match the training data closely enough.
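
A minimal sketch of underfitting vs. overfitting (not from the slides; the noisy quadratic data and the choice of polynomial degrees are assumptions):

```python
# Minimal sketch (not from the slides): underfitting vs. overfitting, shown by
# fitting polynomials of different degrees to the same noisy quadratic data.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.size)   # the true relationship is quadratic

x_new = np.linspace(-3, 3, 100)                 # unseen points from the same range
y_new = x_new**2

for degree in (1, 2, 10):                       # underfit, good fit, overfit
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:7.2f}, test MSE {test_mse:7.2f}")
```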
 What is bias?
 Bias refers to the distance between predicted and actual values, i.e., how far the predicted values are from the actual values.
 High bias – the average predictions are far away from the actual values.
 Low bias – the distance between predictions and actual values is minimal.
 High bias will cause the algorithm to miss a dominant pattern or relationship between the input and output variables. If bias is too high, the model performs very badly and accuracy will be low, which causes underfitting.
 The training data is used to train the machine learning algorithm.
 The testing data is used to evaluate the accuracy of the trained algorithm.
 The training data should represent the data the algorithm will encounter in the real world.
 What is variance?
 If a model predicts well on the training dataset but fails on independent unseen data (the testing dataset), then it is evident that the model has variance; i.e., variance conveys how scattered the predicted values are from the actual values, or how much the predictions are scattered from each other.
 High variance – the predictions are scattered significantly; the model has been trained on a lot of noise and irrelevant data, which causes overfitting.
 Low variance – the predictions are less scattered.
 Case (i) – LB & LV (Low Bias & Low Variance): the ideal and best scenario – the best model.
 Case (ii) – LB & HV (Low Bias & High Variance): models are somewhat accurate but inconsistent.
 Case (iii) – HB & LV (High Bias & Low Variance): models are consistent but inaccurate.
 Case (iv) – HB & HV (High Bias & High Variance): models are inconsistent and inaccurate.
Bias & Variance

Bias-Consistent

Bias- Straight line cannot curve like Bias – No Bias, passes through every
the true relationship point of Train Data

*Variance – Difference in
fit between datasets

Variance – Very High


Variance - Low Bad with Testing Set
Performs better with Testing Overfitting
 An example case study: predicting whether a customer will default on a bank loan. Assume we have a dataset of 100,000 customers containing features such as demographics, income, loan amount, credit history, employment record, and default status, and we split the data into training and test sets.
 Our training dataset contains 80,000 customers, while our test dataset contains 20,000 customers. On the training dataset we observe that our model has 97% accuracy, but in prediction on the test data we get only 50% accuracy. This shows that we have an overfitting problem.
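
A minimal sketch of that train/test comparison (not from the slides; the file name "loans.csv", the column names, and the choice of classifier are hypothetical):

```python
# Minimal sketch (not from the slides) of the loan-default case study: an 80/20
# train/test split and a comparison of training vs. test accuracy.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("loans.csv")                  # 100,000 customers
X = df.drop(columns=["default_status"])
y = df["default_status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier()               # an unconstrained tree tends to overfit
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"training accuracy: {train_acc:.2%}, test accuracy: {test_acc:.2%}")
# A large gap (e.g. 97% on training vs. 50% on test) signals overfitting.
```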
Overfitting

 Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data.
 Overfitting happens due to several reasons, such as:
 The training data size is too small and does not contain enough data samples to accurately represent all possible input data values.
How Do Bias and Variance Cause Overfitting and Underfitting?

 The dataset is divided into training and testing data.
 We build a model using the training dataset and validate the model using the testing dataset.
 Bias results from the training error, and variance is related to the testing error.
END
