Unit - 1 Introduction-Statistical Inference
The process of collecting data from a small subset of the population and then
using it to generalize about the entire population is called Sampling.
Sample
Samples are used when:
• The population is too large to collect data from every member.
• Data cannot be reliably collected from the entire population.
• The population is hypothetical and is unlimited in size. Take the example of a study that
documents the results of a new medical procedure. It is unknown how the procedure will
affect people across the globe, so a test group is used to find out how people react to it.
Population Parameter:
Mean: μ = (ΣX) / N, where ΣX is the sum of all values in the population and N is the size of the
population
Standard Deviation: σ = √[(Σ(X-μ)²) / N], where X is a value in the population, μ is the population
mean, and N is the size of the population
Sample Statistic:
Mean: x̄ = (Σx) / n, where Σx is the sum of all values in the sample and n is the size of the sample
Standard Deviation: s = √[(Σ(x-x̄)²) / (n-1)], where x is a value in the sample and x̄ is the sample
mean
Note that the formulas for the population parameter and sample statistic are similar, but they use
different notation and have slightly different calculations. The population parameter uses the entire
population, while the sample statistic uses a subset (i.e., sample) of the population.
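As a quick illustration of that difference, here is a minimal Python sketch (the numbers are made
up) that computes the population parameters with a divisor of N and the sample statistics with the
n − 1 divisor:

```python
# Minimal sketch: population parameters vs. sample statistics.
# The "population" values below are invented for illustration.
import math
import random

population = [4, 8, 6, 5, 3, 7, 9, 5, 6, 4, 8, 7]

# Population parameters: use every value, divide the variance by N.
N = len(population)
mu = sum(population) / N
sigma = math.sqrt(sum((x - mu) ** 2 for x in population) / N)

# Sample statistics: use a subset, divide the variance by n - 1.
sample = random.sample(population, 6)
n = len(sample)
x_bar = sum(sample) / n
s = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

print(f"population: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"sample:     x_bar = {x_bar:.2f}, s = {s:.2f}")
```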
Examples of Statistical Inference Using Population and Sample Data
Statistical inference using population and sample data can be applied in various fields.
Here are some examples:
• Medical Research: In medical research, clinical trials are conducted on a sample of the
population to estimate the effects of a drug or treatment. Statistical inference is used to
estimate the effect size and determine the probability that the results are due to chance.
In applications like these, statistical inference using population and sample data is used to
draw conclusions or make predictions about the population of interest. By using
probability theory and statistical methods, researchers can estimate population parameters,
such as proportions or means, and determine the likelihood that the results are due to
chance.
Statistical Inference
The main types of statistical inference are:
• Estimation
• Hypothesis testing
Estimation
• Statistics from a sample are used to estimate population parameters.
• The most likely value is called a point estimate.
• There is always uncertainty when estimating.
• The uncertainty is often expressed as confidence intervals defined by a likely lowest
and highest value for the parameter.
• An example could be a confidence interval for the number of bicycles a Dutch person
owns:
"The average number of bikes a Dutch person owns is between 3.5 and 6."
Hypothesis Testing
• Hypothesis testing is a method to check if a claim about a population is true. More
precisely, it checks how likely it is that a hypothesis is true, based on the sample data.
• The steps of the test depend on:
• Type of data (categorical or numerical)
• If you are looking at:
❑ A single group
❑ Comparing one group to another
❑ Comparing the same group before and after a change
Some examples of claims or questions that can be checked with hypothesis testing:
• 90% of Australians are left-handed
• Is the average weight of dogs more than 40 kg?
• Do doctors make more money than lawyers?
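As an illustration, the dog-weight question above could be checked with a one-sample t-test. This
sketch uses made-up weights and assumes SciPy's ttest_1samp with its one-sided alternative option:

```python
# Minimal sketch: one-sample t-test for "average dog weight exceeds 40 kg".
import numpy as np
from scipy import stats

weights = np.array([38.5, 42.0, 45.3, 39.8, 41.2, 44.7, 40.1, 43.6])  # invented

# H0: mean weight <= 40 kg, H1: mean weight > 40 kg (one-sided test)
t_stat, p_value = stats.ttest_1samp(weights, popmean=40, alternative="greater")

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: the data support a mean weight above 40 kg.")
else:
    print("Fail to reject H0: not enough evidence that the mean exceeds 40 kg.")
```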
What is a Data Model?
• A data model organizes data elements and standardizes how the data elements relate to
one another. Since data elements document real-life people, places, and things and the
events between them, the data model represents reality. For example, a house has many
windows, or a cat has two eyes.
• Data models are often used as an aid to communication between the business people
defining the requirements for a computer system and the technical people defining the
design in response to those requirements. They are used to show the data needed and
created by business processes.
• A data model explicitly determines the structure of data. Data models are specified in
a data modeling notation, which is often graphical in form.
• The creation of the data model is the critical first step that must be taken after business
requirements for analytics and reporting have been defined.
• A data model is sometimes referred to as a data structure, especially in the context
of programming languages. Data models are often complemented by function models.
• A model is an artificial construction where all extraneous detail has been removed or
abstracted.
What is Statistical Modeling?
• Statistical modeling is the process of describing the connections between variables in a
dataset using mathematical equations and statistical approaches.
• Statistical modeling is used to identify correlations between variables, generate
predictions, and influence decision-making in a range of professions and sectors. It can
be used in any case where we wish to improve our understanding of how different
variables are connected and make predictions based on that information.
What is Statistical Modeling?
• Predicting the number of people who will travel on a specific rail route is an example of
statistical modeling. To develop a statistical model, we would collect data on the number
of passengers who utilize the train route over time, as well as data on variables that
might affect passenger counts, such as time of day, day of the week, and weather.
• Then, using statistical approaches such as regression analysis, we can determine the
correlations between these factors and the number of passengers utilizing the railway
route. For example, we might discover that the number of passengers is higher during
rush hour and on weekdays, and lower when it is raining. We can use this information to build
a statistical model that forecasts the number of people who will use the railway route
depending on the time of day, day of the week, and weather conditions. This model can
then be used to anticipate future passenger numbers and make resource allocation
choices, such as adding trains during rush hour or running promotions during
severe weather.
• It is essential in statistical modeling to pick an appropriate statistical model that fits the
data and to evaluate the model to ensure accuracy and reliability. This might include
running the model on a new set of data or employing statistical tests to assess the
model’s performance.
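A minimal sketch of the ridership model described above, using scikit-learn; the feature names and
passenger counts are invented assumptions for illustration:

```python
# Minimal sketch: regression model for rail ridership from time/weather features.
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up observations: indicator features plus observed passenger counts.
data = pd.DataFrame({
    "rush_hour":  [1, 0, 1, 0, 1, 0, 1, 0],   # 1 = rush hour
    "weekday":    [1, 1, 0, 0, 1, 1, 0, 0],   # 1 = weekday
    "raining":    [0, 0, 0, 0, 1, 1, 1, 1],   # 1 = raining
    "passengers": [520, 310, 280, 190, 450, 260, 220, 150],
})

X = data[["rush_hour", "weekday", "raining"]]
y = data["passengers"]

model = LinearRegression().fit(X, y)

# Forecast ridership for a rainy weekday rush hour.
forecast = model.predict(pd.DataFrame(
    {"rush_hour": [1], "weekday": [1], "raining": [1]}))
print(f"expected passengers: {forecast[0]:.0f}")
```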
Types of Statistical Models
• There are several statistical models, each suited to a specific research question or data
format. Here are a few common types of statistical models and their applications:
• Linear regression models: These models are used to represent the connection between a
continuous result variable and one or more predictor variables. For example, depending on a
person’s height, age, and gender, a linear regression model may be used to estimate their weight.
• Logistic regression models: Logistic regression models are used to represent the connection
between a binary outcome variable (for example, yes/no) and one or more predictor variables.
For example, depending on age, blood pressure, and cholesterol levels, a logistic regression
model may be used to predict whether a patient will have a heart attack (a sketch follows this list).
• Time series models: Time series models are used to model data that changes over time, such as
stock prices, weather trends, or monthly sales numbers. These types of models may be applied to
data to find trends, seasonal patterns, and other forms of temporal correlations.
• Multilevel models: These models are used to model data having a hierarchical structure, such as
pupils in schools or patients in hospitals. Multilevel models can be used to investigate how
individual-level and group-level factors impact outcomes, as well as to account for the fact that
people in the same group may be more similar to each other than those in different groups.
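Picking up the heart-attack example from the logistic regression bullet above, here is a minimal
scikit-learn sketch; all patient values and the 0/1 outcomes are invented for illustration:

```python
# Minimal sketch: logistic regression for a binary heart-attack outcome.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: age, systolic blood pressure, cholesterol (all made up).
X = np.array([
    [45, 120, 180], [61, 145, 240], [38, 115, 170], [66, 150, 260],
    [52, 130, 210], [70, 160, 280], [41, 118, 190], [58, 140, 230],
])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 1 = heart attack

clf = LogisticRegression().fit(X, y)

# Predicted probability of a heart attack for a new patient.
p = clf.predict_proba([[55, 135, 220]])[0, 1]
print(f"estimated risk: {p:.2f}")
```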
Types of Statistical Models
• Structural equation models: These types of models are used to represent complicated
interactions between several variables. Structural equation models can be used to
evaluate ideas regarding causal links between variables and to quantify their strength and
direction.
• Clustering models: Clustering models are used to group comparable
observations based on their similarity in features. Clustering algorithms can be
used to uncover patterns in data that would be difficult to detect using other approaches, as
sketched below.
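A minimal clustering sketch using k-means from scikit-learn; the 2-D points are made up and chosen
so that two groups are easy to separate:

```python
# Minimal sketch: k-means groups points by feature similarity.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([
    [1.0, 1.1], [0.9, 1.3], [1.2, 0.8],   # one tight group
    [5.0, 5.2], [5.3, 4.8], [4.9, 5.1],   # a second group
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print("labels:", kmeans.labels_)          # cluster assignment for each point
print("centers:\n", kmeans.cluster_centers_)
```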
What is Linear Regression?
• Linear regression is a type of statistical analysis used to model the relationship between
two variables.
• It assumes a linear relationship between the independent variable and the dependent
variable, and aims to find the best-fitting line that describes the relationship.
• The line is determined by minimizing the sum of the squared differences between the
predicted values and the actual values.
Simple Linear Regression
• In simple linear regression, there is one independent variable and one dependent
variable. The model estimates the slope and intercept of the line of best fit, which
represents the relationship between the variables.
• The slope represents the change in the dependent variable for each unit change in the
independent variable, while the intercept represents the predicted value of the
dependent variable when the independent variable is zero.
• Linear regression shows the linear relationship between the independent (predictor)
variable, plotted on the X-axis, and the dependent (output) variable, plotted on the
Y-axis. If there is a single input variable X (independent variable), such linear
regression is called simple linear regression.
Simple Linear Regression
• To calculate the best-fit line, linear regression uses the traditional slope-intercept form
given below:
Yi = β0 + β1Xi
where Yi = Dependent variable, β0 = Intercept, β1 = Slope, Xi = Independent variable.
• In regression, the difference between the observed value of the dependent variable (yi)
and the predicted value (ŷi) is called the residual or random error.
• εi = yi − ŷi
• where ŷi = β0 + β1Xi
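A minimal sketch of estimating β0 and β1 by ordinary least squares and computing the residuals; the
x and y values are made up:

```python
# Minimal sketch: simple linear regression by ordinary least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form OLS estimates of slope (beta1) and intercept (beta0).
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

y_hat = beta0 + beta1 * x        # predicted values
residuals = y - y_hat            # epsilon_i = y_i - y_hat_i

print(f"y = {beta0:.2f} + {beta1:.2f} x")
print("residuals:", np.round(residuals, 2))
```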
Fitting a model
• Model Fitting is a measurement of how well a machine learning model adapts to data that is
similar to the data on which it was trained.
• The fitting process is generally built into models and is automatic.
• A well-fit model will accurately approximate the output when given new data, producing
more precise results. A model is fitted by adjusting the parameters within the model, leading
to improvements in accuracy. During the fitting process, the algorithm is run on labeled
data, i.e., data for which the true target values are known. Once the algorithm has finished
running, its predictions are compared to the real, observed values of the target variable in
order to measure the accuracy of the model.
• Using the results, the parameters of the algorithm can be further adjusted to better uncover
relationships and patterns between the inputs, outputs, and targets. The process can be
repeated until valid and accurate insights are obtained; a minimal sketch of this workflow follows.
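A minimal sketch of the fit-then-evaluate workflow described above, using scikit-learn on synthetic
data (the true relationship y = 3x + 2 plus noise is invented so the fit can be checked):

```python
# Minimal sketch: fit on labeled training data, evaluate against held-out
# observed values of the target variable.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1.0, size=100)  # known relationship

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # the automatic fitting step

# Compare predictions to the real, observed target values on unseen data.
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {mse:.3f}")
```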
What Does a Well-Fit Model Look Like?
• A well-fit model should closely match the available data while also following the general
shape of the underlying trend. No model will be able to perfectly match the input data, but a
well-fit model will closely match the data and its general shape. In a typical plot of such a
model, the fitted line does not pass through every individual data point, but it does
capture the general curve.
Why is model fitting important?
Underfitting
Underfitting occurs when a model oversimplifies the data and fails to capture enough information on
the relationships within it. This is frequently a result of insufficient training or an overly
simple model. An underfit model can be identified when it performs poorly even on the training data.
Overfitting
Overfitting is the opposite of underfitting. It occurs when a model is overly sensitive to its
training data and ends up memorizing noise rather than general patterns. Overfitting is generally a
result of overtraining on the training data set. It can be identified when a model performs well on
the data used for training but does not adapt to new data and performs poorly; the sketch below
makes both failure modes concrete.
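A minimal sketch contrasting the two failure modes on synthetic quadratic data: a degree-1
polynomial tends to underfit, a high-degree polynomial tends to overfit, and degree 2 matches the
process that generated the data. All numbers are invented:

```python
# Minimal sketch: underfitting vs. overfitting with polynomial models.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-3, 3, 30))
y = x ** 2 + rng.normal(0, 1.0, 30)       # training data: quadratic + noise
x_new = np.sort(rng.uniform(-3, 3, 30))   # fresh data from the same process
y_new = x_new ** 2 + rng.normal(0, 1.0, 30)

for degree in (1, 2, 10):
    coeffs = np.polyfit(x, y, degree)     # fit a polynomial on training data
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    # Underfit: high train and test error. Overfit: low train, high test error.
    print(f"degree {degree:2d}: train MSE {train_mse:6.2f}, test MSE {test_mse:6.2f}")
```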
A Well-Fit Model
A well-fit model performs well on training data and on evaluation data, because its fitted
parameters capture the relationships between the input variables and the target variable. Generally,
fitting is an automatic process in which the parameters are adjusted to best suit the data
provided. The use of well-fit models enables users to make better decisions and
draw accurate insights.