
DATA SCIENCE AND VISUALIZATION
21CS644
MODULE-1 Statistical Inference
Chapter 2
Statistical Inference

 The world is complex, random, and sometimes uncertain; in fact, it is a big-data-generating machine!
 Once data is captured, we have captured the world, or at least certain traces of it.
 These captured traces must then be simplified into something more comprehensible and concise, such as mathematical models or functions. These are called statistical estimators.
 This overall process of going from the world to the data, and then from the data back to the world, is the field of statistical inference.
 Definition: statistical inference is the discipline concerned with the development of procedures, methods, and theorems that allow the extraction of meaning and information from data generated by stochastic processes.
Population and Samples

 Population – any set of objects or units (tweets, photographs, stars).
 If the characteristics of all the objects can be measured or extracted, then we have the complete set of observations, of size N.
Population and Samples

 A subset of the units, of size n, considered in order to make observations, draw conclusions, and make inferences about the population is called a sample.
 Taking a subset may introduce biases into the data and distort it.
When should samples be used?

• When studying a large population where it is impractical or impossible to collect data from every individual.
• When resources such as time, cost, and manpower are limited, making it more feasible to collect data from a subset of the population.
• When conducting research or experiments where it is important to minimize potential biases in data collection.
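
As a rough illustration of drawing a simple random sample, here is a minimal Python sketch (not from the slides; the synthetic population and the sample size are assumptions) showing how a sample of n units can estimate a population quantity such as the mean:

```python
# Minimal sketch (not from the slides): drawing a simple random sample of
# n units from a synthetic population of N units and comparing means.
import numpy as np

rng = np.random.default_rng(42)
population = rng.exponential(scale=10.0, size=1_000_000)    # N = 1,000,000 units

sample = rng.choice(population, size=1_000, replace=False)  # a sample of n = 1,000 units

print(f"population mean (N = {population.size}): {population.mean():.3f}")
print(f"sample mean     (n = {sample.size}):     {sample.mean():.3f}")
```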
Population and Samples of Big Data

 In the age of big data, do we still need to take samples?
 Sampling solves some engineering problems.
 How much data is needed depends on the goal:
 For analysis or inference, there is no need to store all the data all the time.
 For serving purposes, you may need all of it, all the time, in order to render correct information.
Population and Samples of Big Data

 Types of data
 Traditional – numerical, categorical and binary
 Text – emails, tweets, reviews, news articles
 Records – user-level data, timestamped event data, JSON-formatted log files
 Geo-based location data – housing data
 Network
 Sensor data
 Images
Populations and Samples of Big Data

 New data requires new strategies for sampling.
 If Facebook user-level data aggregated from timestamped event logs is analyzed for a week, can any conclusions be drawn that are relevant next week, or next year?
 How do we sample from a network and preserve the complex network structure?
 Many of these questions are open research questions.
BIG Data

 BIG is a moving target – when the size of the data becomes a challenge, we refer to it as big.
 Big is when you can't fit all the data on one machine.
 BIG data is a cultural phenomenon.
 The 4 Vs – volume, variety, velocity, and value.


Big Data Can Mean Big Assumptions

 The big data revolution consists of three things:
 Collecting and using a lot of data rather than small samples
 Accepting messiness in your data
 Can n = all?
 Very often it cannot, and the data misses exactly the things we should consider most.
 The n = 1 case – a sample size of 1:
 For a single person, we can actually record a lot of information.
 We might even sample from all the actions they took in order to make inferences about them.
 This is user-level modeling.
Big Data Can Mean Big Assumptions

 Data is not objective – data does not speak for itself!
 Example: an algorithm for hiring. Consider an organization that did not treat its female employees well. When comparing men and women with the same qualifications, the data showed that women tended to leave more often, get promoted less often, and give more negative feedback on the work environment than men. An automated model built on this data will likely favor a man over a woman when a man and a woman with the same qualifications show up for an interview.
 Ignoring causation can be a flaw, rather than a feature, and can add to historical problems rather than address them.
 Data is just a quantitative representation of the events of our society.
Modeling

 Model – humans try to understand the world around them by representing it in different ways.
 Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions that express the shape and structure of the data itself.
 A model is our attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.
 A model is an artificial construction where all irrelevant detail has been removed or abstracted.
Modeling

 A model is an artificial construction where all irrelevant detail has been removed or abstracted.
 Attention must always be paid to these abstracted details after a model has been analysed, to see what might have been overlooked.
 In the case of a statistical model, we may have mistakenly excluded key variables, included irrelevant ones, or assumed a mathematical structure divorced from reality.
Statistical Modeling

 What comes first? What influences what? What causes what? What's a test of that?
 These kinds of relations are expressed as mathematical expressions that are general enough that they have to include parameters, but the parameter values are not yet known.
 Example: if there are two columns of data, x and y, and there is a linear relationship between them, then we can represent it as
y = ax + b
where a and b are parameters whose values are not yet known.
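
To make this concrete, here is a minimal sketch (not from the slides; the simulated data and the use of least squares via numpy.polyfit are assumptions) of estimating a and b from observed x and y values:

```python
# Minimal sketch (not from the slides): estimating the unknown parameters a and b
# of y = ax + b from observed data, using least squares on simulated points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # true a = 2.5, b = 1.0, plus noise

a_hat, b_hat = np.polyfit(x, y, deg=1)  # fit a first-degree polynomial y = a*x + b
print(f"estimated a = {a_hat:.3f}, estimated b = {b_hat:.3f}")
```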
How to build a model?

 Start with exploratory data analysis (EDA):
 Make plots and build intuition for the particular dataset.
 Plot histograms and look at scatterplots to get a feel for the data.
 Write down representative functions – start with a simple linear function and see if it makes sense.
 Write down complete sentences and try to express the words as equations and code.
 Simple plots may be easier to interpret and understand.
 Trade-off: a simple model may get you 90% of the way and take a few hours to build, whereas a complex model may get you up to 92% and take months to build.
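
A minimal EDA sketch along these lines (not from the slides; the file name "data.csv" and the column names "x" and "y" are hypothetical placeholders for your own dataset):

```python
# Minimal EDA sketch (not from the slides): a histogram and a scatterplot,
# assuming a CSV file "data.csv" with numeric columns "x" and "y".
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["x"].hist(ax=ax1, bins=30)          # histogram: the distribution of a single column
ax1.set_title("histogram of x")
df.plot.scatter(x="x", y="y", ax=ax2)  # scatterplot: the relationship between two columns
ax2.set_title("y vs. x")
plt.tight_layout()
plt.show()
```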
Probability Distributions

 Probability distributions are the foundations of statistical models.
 Examples – the normal (Gaussian) distribution, Poisson distribution, Weibull distribution, gamma distribution, and exponential distribution.
 https://www.analyticsvidhya.com/blog/2017/09/6-probability-distributions-data-science/
Probability Distributions

 A probability distribution is a mathematical function that defines the likelihood of different outcomes or values of a variable.
 This function is commonly represented by a graph or a probability table, and it provides the probabilities of the various possible results of an experiment or random phenomenon, based on the sample space and the probabilities of events.
Probability Distribution

 A random variable, denoted by x or y, can be assumed to have a corresponding probability distribution p(x), which maps x to a nonnegative real number.
 In order to be a probability density function, we are restricted to the set of functions such that if we integrate p(x) to get the area under the curve, it is 1, so it can be interpreted as a probability.
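
A minimal numerical check of this property (not from the slides; the particular density and the use of scipy are assumptions):

```python
# Minimal sketch (not from the slides): checking numerically that the density
# p(x) = 2e^(-2x) integrates to 1 over [0, infinity), so it is a valid pdf.
import numpy as np
from scipy import integrate

p = lambda x: 2 * np.exp(-2 * x)
area, _ = integrate.quad(p, 0, np.inf)
print(f"area under p(x): {area:.6f}")  # approximately 1.0
```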
Probability Distribution

 Example: let x be the amount of time until the next bus arrives. x is a random variable because there is variation and uncertainty in the amount of time until the next bus.
 Suppose we know that the time until the next bus has the probability density function p(x) = 2e^(-2x). If we want to know the likelihood of the next bus arriving between 12 and 13 minutes from now, we find the area under the curve between 12 and 13, i.e., the integral of 2e^(-2x) from 12 to 13.
 How do we know that this distribution is correct?
 We can conduct an experiment where we show up at the bus stop at a random time, measure how much time passes until the next bus, and repeat this experiment over and over again. Then we look at the measurements, plot them, and approximate the function.
 Because "waiting time" is a common enough real-world phenomenon, a distribution called the exponential distribution has been invented to describe it; it takes the form p(x) = λe^(-λx).
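
A minimal sketch of computing that area under the curve (not from the slides; the numerical integration with scipy and the closed-form comparison are assumptions):

```python
# Minimal sketch (not from the slides): the probability that the next bus arrives
# between 12 and 13 minutes from now, for waiting times with density p(x) = 2e^(-2x).
import numpy as np
from scipy import integrate

lam = 2.0
p = lambda x: lam * np.exp(-lam * x)

prob, _ = integrate.quad(p, 12, 13)                  # area under the curve on [12, 13]
closed_form = np.exp(-lam * 12) - np.exp(-lam * 13)  # exponential CDF difference

print(f"numerical integral: {prob:.3e}, closed form: {closed_form:.3e}")
```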
Joint Distribution and Conditional Distribution

 Multivariate functions called joint distributions do the same thing for more than one random variable.
 In the case of two random variables, for example, we could denote our distribution by a function p(x, y); it takes values in the plane and gives us nonnegative values.
 We also have what is called a conditional distribution, p(x|y), which is to be interpreted as the density function of x given a particular value of y.
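
For intuition, here is a minimal discrete sketch (not from the slides; the probability table values are made up) of a joint distribution p(x, y) and the conditional distribution p(x|y) derived from it:

```python
# Minimal sketch (not from the slides): a small made-up discrete joint distribution
# p(x, y) and the conditional distribution p(x|y) derived from it.
import numpy as np

# Rows index values of x, columns index values of y; all entries sum to 1.
p_xy = np.array([[0.10, 0.30],
                 [0.20, 0.40]])

p_y = p_xy.sum(axis=0)        # marginal distribution of y
p_x_given_y = p_xy / p_y      # p(x|y): each column divided by its column sum

print("p(y):", p_y)
print("p(x | y = 1):", p_x_given_y[:, 1])  # each column of p(x|y) sums to 1
```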
Fitting a Model

 Estimate the parameters of the model using the observed data.
 The data is used as evidence to help approximate the real-world mathematical process that generated the data.
 Fitting the model often involves optimization methods and algorithms, such as maximum likelihood estimation, to help get the parameters.
 When you estimate the parameters, they are actually estimators, meaning they themselves are functions of the data.
 Fitting the model is when you start actually coding: your code will read in the data, and you'll specify the functional form that you wrote down on paper. Then R or Python will use built-in optimization methods to give you the most likely values of the parameters given the data.
 Initially you should have an understanding that optimization is taking place and how it works, but you don't have to code this part yourself – it underlies the R or Python functions.
Fitting a Model

 Model fitting is a measure of how well a machine learning model generalizes to data similar to that on which it was trained.
 A model that is well fitted produces more accurate outcomes.
 A model that is overfitted matches the training data too closely.
 A model that is underfitted doesn't match the training data closely enough.
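
A minimal sketch of underfitting vs. overfitting (not from the slides; the noisy quadratic data and the choice of polynomial degrees are assumptions):

```python
# Minimal sketch (not from the slides): underfitting vs. overfitting, shown by
# fitting polynomials of different degrees to the same noisy quadratic data.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.size)   # the true relationship is quadratic

x_new = np.linspace(-3, 3, 100)                 # unseen points from the same range
y_new = x_new**2

for degree in (1, 2, 10):                       # underfit, good fit, overfit
    coeffs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:7.2f}, test MSE {test_mse:7.2f}")
```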
 What is bias?
 Bias refers to the distance between predicted and actual values, i.e., how far the predicted values are from the actual values.
 High bias – the average predictions are far away from the actual values.
 Low bias – the distance between predictions and actual values is minimal.
 High bias will cause the algorithm to miss a dominant pattern or relationship between the input and output variables. If bias is too high, the model performs very badly and accuracy will be low, which causes underfitting.
 The training data is used to train the machine learning algorithm.
 The testing data is used to evaluate the accuracy of the trained algorithm.
 The training data should represent the data the algorithm will encounter in the real world.
 What is variance?
 If a model predicts well on the training dataset but fails on independent unseen data (the testing dataset), then it is evident that the model has variance; i.e., variance conveys how scattered the predicted values are from the actual values, or how much the predictions are scattered from each other.
 High variance – the predictions are scattered significantly; the model has been trained on a lot of noise and irrelevant data, which causes overfitting.
 Low variance – the predictions are less scattered.
 Case (i) – LB & LV (Low Bias & Low Variance): the ideal and best scenario – the best model.
 Case (ii) – LB & HV (Low Bias & High Variance): models are somewhat accurate but inconsistent.
 Case (iii) – HB & LV (High Bias & Low Variance): models are consistent but inaccurate.
 Case (iv) – HB & HV (High Bias & High Variance): models are inconsistent and inaccurate.
Bias & Variance

Bias-Consistent

Bias- Straight line cannot curve like Bias – No Bias, passes through every
the true relationship point of Train Data

*Variance – Difference in
fit between datasets

Variance – Very High


Variance - Low Bad with Testing Set
Performs better with Testing Overfitting
 An example case study: predicting whether a customer will default on a bank loan. Assume we have a dataset of 100,000 customers containing features such as demographics, income, loan amount, credit history, employment record, and default status, and we split the data into training and test sets.
 Our training dataset contains 80,000 customers, while our test dataset contains 20,000 customers. On the training dataset we observe that our model has 97% accuracy, but in prediction on the test data we get only 50% accuracy. This shows that we have an overfitting problem.
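
A minimal sketch of that train/test comparison (not from the slides; the file name "loans.csv", the column names, and the choice of classifier are hypothetical):

```python
# Minimal sketch (not from the slides) of the loan-default case study: an 80/20
# train/test split and a comparison of training vs. test accuracy.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

df = pd.read_csv("loans.csv")                  # 100,000 customers
X = df.drop(columns=["default_status"])
y = df["default_status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier()               # an unconstrained tree tends to overfit
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"training accuracy: {train_acc:.2%}, test accuracy: {test_acc:.2%}")
# A large gap (e.g. 97% on training vs. 50% on test) signals overfitting.
```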
Overfitting

 Overfitting is the term used to mean that you used a dataset to estimate the parameters of your model, but your model isn't that good at capturing reality beyond your sampled data.
 Overfitting happens due to several reasons, such as:
 The training data size is too small and does not contain enough data samples to accurately represent all possible input data values.
How Do Bias and Variance Cause Overfitting and Underfitting?

 The dataset is divided into training and testing data.
 We build a model using the training dataset and validate the model using the testing dataset.
 Bias results from the training error, and variance is related to the testing error.
END
