Data Science and Visualization
21CS644
MODULE-1 Statistical Inference
Chapter 2
Statistical Inferencing
The world is complex, random, and sometimes uncertain; in fact, it is one big
data-generating machine!
Once the data is captured, we have captured the world, or at least certain traces of it.
These traces must then be simplified into something more comprehensible and concise,
such as mathematical models or functions. These are called
statistical estimators.
This overall process of going from the world to the data, and then from the data back to
the world, is the field of statistical inference.
Definition: Statistical inference is the discipline concerned with the development of
procedures, methods, and theorems that allow the extraction of meaning and information
from data generated by stochastic processes.
Population and Samples
• When resources such as time, cost, and manpower are limited, it is more feasible to
collect data from a subset of the population (a sample) than from the whole population.
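As a minimal sketch of the idea above, the following Python draws a small random sample from a larger population and compares the two means. The population here is synthetic (made-up "heights"), and the sizes and seed are arbitrary assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 synthetic "heights" in cm (made-up data).
population = [random.gauss(170, 10) for _ in range(100_000)]

# Collect a much cheaper subset: 500 units chosen uniformly at random.
sample = random.sample(population, 500)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print("population mean:", round(population_mean, 2))
print("sample mean:    ", round(sample_mean, 2))
```

The sample mean is an estimate of the population mean; it will generally be close to it but not identical, and it changes from sample to sample.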
Types of data
Traditional – numerical, categorical and binary
Text – emails, tweets, reviews, news articles
Records – user-level data, timestamped event data, JSON-formatted log files
Geo-based location data – housing data
Network
Sensor data
Images
Populations and Samples of Big Data
Data is "big" when you can’t fit all of it on one machine.
Model - humans try to understand the world around them by representing it in different
ways
Statisticians and data scientists capture the uncertainty and randomness of
data-generating processes with mathematical functions that express the shape and
structure of the data itself.
A model is our attempt to understand and represent the nature of reality through a
particular lens, be it architectural, biological, or mathematical.
A model is an artificial construction where all irrelevant detail has been removed or
abstracted
Modeling
What comes first? What influences what? What causes what? What’s a
test of that?
Modeling means expressing these kinds of relations as mathematical expressions
that are general enough that they have to include parameters, but
whose parameter values are not yet known.
Example: if there are two columns of data, x and y, and there is a linear
relationship between them, then we can represent it as
$y = ax + b$
where a and b are parameters whose values are not yet known.
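One common way to pin down the unknown parameters a and b is ordinary least squares. The sketch below is an illustration under that assumption, not a method stated in the notes; the helper name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate a and b in y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b


# Made-up data that lies exactly on a straight line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

a, b = fit_line(xs, ys)
print("a =", a, "b =", b)
```

With real, noisy data the points will not lie exactly on the line, and a and b become estimates rather than exact values.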
How to build a model?
Example: Let x be the amount of time until the next bus arrives. x is a random variable
because there is variation and uncertainty in the amount of time until the next bus.
Suppose we know that the time until the next bus has a probability density function of
$p(x) = 2e^{-2x}$. If we want to know the likelihood of the next bus arriving between 12 and 13
minutes, then we find the area under the curve between 12 and 13: $\int_{12}^{13} 2e^{-2x}\,dx$.
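The area under the density 2e^(-2x) between 12 and 13 has the closed form e^(-24) - e^(-26). The sketch below evaluates that expression and cross-checks it with a simple midpoint-rule integration. The function names and the step count are illustrative choices, not part of the notes.

```python
import math


def pdf(x, lam=2.0):
    """Exponential density p(x) = lam * e^(-lam * x)."""
    return lam * math.exp(-lam * x)


def prob_between(lo, hi, lam=2.0):
    """Closed-form integral of the density from lo to hi."""
    return math.exp(-lam * lo) - math.exp(-lam * hi)


def midpoint_integral(f, lo, hi, steps=10_000):
    """Numerical check: midpoint rule with equal-width slices."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


exact = prob_between(12, 13)
approx = midpoint_integral(pdf, 12, 13)
print("closed form:  ", exact)
print("midpoint rule:", approx)
```

The result is extremely small. With this density the average wait is half a minute, so a wait of 12 to 13 minutes is very unlikely.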
How do we know that the distribution is correct?
We can conduct an experiment where we show up at the bus stop at a random time,
measure how much time until the next bus, and repeat this experiment over and over
again. Then we look at the measurements, plot them, and approximate the function.
We can also rely on prior knowledge: "waiting time" is a common enough real-world
phenomenon that a distribution called the exponential distribution has been invented to
describe it, so we know that it takes the form $p(x) = \lambda e^{-\lambda x}$.
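The repeated bus-stop experiment can be imitated in code. The sketch below simulates waiting times from an exponential distribution with an assumed true rate of 2 and recovers the rate from the measurements. It uses the standard estimate of λ as one over the average wait. The rate, sample size, and seed are assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

true_lam = 2.0  # assumed "real world" rate, unknown to the observer

# Each entry is one simulated visit: the time waited until the next bus.
waits = [random.expovariate(true_lam) for _ in range(20_000)]

# For an exponential distribution the mean wait is 1 / lambda,
# so the reciprocal of the sample mean estimates lambda.
lam_hat = 1.0 / statistics.fmean(waits)

print("estimated lambda:", round(lam_hat, 3))
```

Plotting a histogram of `waits` against the curve λe^(-λx) with the estimated λ is one way to check visually whether the exponential form is a reasonable description.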
Joint Distribution and Conditional Distribution
Multivariate functions called joint distributions do the same thing for more
than one random variable.
So in the case of two random variables, for example, we could denote our
distribution by a function p(x, y), which takes points in the plane as input and
returns nonnegative values.
We also have what is called a conditional distribution, p(x|y), which is to be
interpreted as the density function of x given a particular value of y.
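For intuition, here is a discrete analogue: a small table of joint probabilities p(x, y) and the conditional distribution p(x|y) computed as p(x, y) / p(y). The notes describe densities of continuous variables, so this is a simplified sketch, and the probability values are invented.

```python
# Hypothetical joint probabilities p(x, y) for x in {0, 1} and y in {0, 1}.
joint = {
    (0, 0): 0.10, (0, 1): 0.30,
    (1, 0): 0.20, (1, 1): 0.40,
}


def marginal_y(joint, y):
    """p(y): sum the joint probabilities over every value of x."""
    return sum(p for (_, yj), p in joint.items() if yj == y)


def conditional_x_given_y(joint, y):
    """p(x | y) = p(x, y) / p(y) for each x, at one fixed value of y."""
    py = marginal_y(joint, y)
    return {x: p / py for (x, yj), p in joint.items() if yj == y}


cond = conditional_x_given_y(joint, 1)
print("p(y=1)     =", marginal_y(joint, 1))
print("p(x | y=1) =", cond)
```

Dividing by p(y) rescales the row of the table where y = 1 so that it sums to one, which is what makes p(x|y) a distribution over x.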
Fitting a Model
Bias-Consistent
Bias – Straight line cannot curve like the true relationship
Bias – No bias, passes through every point of the training data
*Variance – Difference in fit between datasets
Overfitting is the term used to mean that you used a dataset to estimate the
parameters of your model, but your model isn’t that good at capturing reality
beyond your sampled data.
Overfitting happens for several reasons, such as:
• The training data size is too small and does not contain enough data samples
to accurately represent all possible input data values.
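Overfitting on a small training set can be sketched as follows. A "memorizer" returns the y value of the nearest training point, so it passes through every training point. It is compared with a straight-line fit on fresh data. The true relationship, noise level, dataset sizes, and helper names are all assumptions made for this illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable


def true_relationship(x):
    return 2.0 * x + 1.0  # assumed ground truth for this sketch


def make_data(n):
    """Draw n noisy observations of the assumed true relationship."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [true_relationship(x) + random.gauss(0, 1) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_memorizer(xs, ys):
    """Predict the y of the nearest training x: fits the training set exactly."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(10)   # deliberately small training set
test_x, test_y = make_data(500)    # fresh data the models never saw

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

for name, model in [("line", line), ("memorizer", memorizer)]:
    print(name,
          "train MSE:", round(mse(model, train_x, train_y), 3),
          "test MSE:", round(mse(model, test_x, test_y), 3))
```

The memorizer has zero error on the training data because it has also learned the noise, and that is why it does worse on the new data. The straight line does not hit every training point but generalizes better here, because the assumed true relationship is itself linear.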
How do bias and variance cause overfitting and underfitting?