
Unit 3 Statistical References

Population and Sample:

A population refers to the entire set of individuals, objects, or data points that you want to study. It
can be large or small depending on the scope of your research. For example, all students in a school
or all people in a country.

A sample is a subset of the population that is selected for analysis. It’s used when studying the entire
population is impractical or impossible. Sampling allows for inferences about the population using
statistical techniques.

Population vs Sample

Population: The population includes all members of a specified group.
Sample: A sample is a subset of the population.

Population: Collecting data from an entire population can be time-consuming, expensive, and sometimes impractical or impossible.
Sample: Samples offer a more feasible approach to studying populations, allowing researchers to draw conclusions based on smaller, manageable datasets.

Population (example): Includes all residents in the city.
Sample (example): Consists of 1,000 households, a subset of the entire population.

Types of Population:

Finite Population

A population is called finite if it is possible to count its individuals. It may also be called a countable
population. The number of vehicles crossing a bridge every day, the number of births per year and
the number of words in a book are finite populations. The number of units in a finite population is
denoted by N; thus N is the size of the population.

Infinite Population

Sometimes it is not possible to count the units contained in the population. Such a population is
called infinite or uncountable. Suppose, for example, that we want to examine whether a coin is fair.
We could toss it an unlimited number of times to observe the number of heads, so the set of all
possible tosses forms an infinite population. The number of germs in the body of a sick patient is
another quantity that is, for practical purposes, uncountable.

Existent Population

An existent population is defined as a population of concrete individuals. In other words, a
population whose units are available in physical form is known as an existent population. Examples
are books, students, etc.

Hypothetical Population

A population whose units are not available in physical form is known as a hypothetical population. A
population consists of sets of observations, objects, etc. that all have something in common, and in
some situations the population is only hypothetical. Examples are the outcomes of rolling a die or
tossing a coin.

Population And Sample Formulas
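In standard notation (N = population size, n = sample size), the commonly used formulas are: population mean μ = (Σ xᵢ) / N, sample mean x̄ = (Σ xᵢ) / n, population variance σ² = Σ (xᵢ − μ)² / N, and sample variance s² = Σ (xᵢ − x̄)² / (n − 1). A small sketch of the population-versus-sample difference, assuming NumPy is available (the data values are arbitrary):

```python
# Population vs. sample statistics with NumPy; the data values are arbitrary.
import numpy as np

x = np.array([4.0, 8.0, 6.0, 5.0, 3.0])

mean = x.mean()             # same formula for population mean and sample mean
pop_var = x.var(ddof=0)     # population variance: divide by N
sample_var = x.var(ddof=1)  # sample variance: divide by n - 1
print(mean, pop_var, sample_var)
```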

Types of Probability Distribution:


A probability distribution describes the possible outcomes of a random event and their probabilities.
A random experiment is an experiment whose outcome cannot be predicted in advance. For
example, if we toss a coin, we cannot predict whether it will come up Heads or Tails. A possible
result of a random experiment is called an outcome, and the set of all possible outcomes is called
the sample space.
Types of Probability Distribution
There are two broad types of probability distribution, used for different purposes and for different
kinds of data generation processes.
1. Normal or Cumulative or Continuous Probability Distribution
2. Binomial or Discrete Probability Distribution
Let us discuss now both the types along with their definition, formula and examples.

Normal or Cumulative Probability Distribution


A continuous probability distribution is one in which the set of possible outcomes can take on
values in a continuous range.
For example, a variable measured over the real numbers, such as the temperature during the day,
follows a continuous distribution: it can take any value within an interval. Such a distribution is
described by a probability density function. The most important continuous distribution is the
normal distribution, whose density is

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where,
• μ = Mean value
• σ = Standard deviation
• x = Normal random variable
• If the mean (μ) = 0 and the standard deviation (σ) = 1, the distribution is called the
standard normal distribution.
Normal Distribution Examples

Since the normal distribution approximates so many natural phenomena well, it has become a standard
reference for many probability problems. Some examples are:

• Heights of people in a population

• The sum or average of many dice rolls (approximately normal when the number of rolls is large)

• Intelligence Quotient (IQ) scores of children

• The number of heads in a large number of coin tosses (approximately normal for large n)

• Income distribution in a country's economy

• The sizes of women's shoes

• Weights of newborn babies

• Average marks of students based on their performance
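As a brief illustrative sketch (assuming SciPy is available), the normal density and probabilities can be evaluated with scipy.stats.norm; the mean and standard deviation below are arbitrary example values, not figures from this unit.

```python
# Illustrative sketch: evaluating a normal distribution with SciPy.
# The mean (mu) and standard deviation (sigma) are arbitrary example values.
from scipy.stats import norm

mu, sigma = 170.0, 10.0          # e.g. hypothetical heights in centimetres
dist = norm(loc=mu, scale=sigma)

print(dist.pdf(170.0))                    # density f(x) at the mean
print(dist.cdf(180.0) - dist.cdf(160.0))  # P(160 <= X <= 180), roughly 0.68
```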

Discrete Probability Distribution


A discrete probability distribution describes outcomes that can be counted (a finite or countable set
of outcomes). Common examples of discrete probability distributions include the Bernoulli, binomial
and Poisson distributions.
For example, if a die is rolled, the possible outcomes are discrete, and the function that assigns a
probability to each outcome is called a probability mass function.
The binomial distribution describes the number of successes in n repeated trials, where each trial
either results in success or failure. The formula for the binomial distribution is

P(X = r) = nCr · p^r · (1 − p)^(n − r)

Where,
• n = Total number of trials
• r = Total number of successful trials
• p = Probability of success on a single trial
• nCr = n! / (r! (n − r)!)
• 1 − p = Probability of failure
Binomial Distribution Examples
As noted above, the binomial distribution gives the probability of each possible number of successes.
In real life, the concept is used, for example:
• To find the number of used and unused materials while manufacturing a product.
• To summarize positive and negative feedback from people in a survey.
• To estimate how many viewers watch a particular channel from YES/NO survey responses.
• To model the number of men and women working in a company.
• To count the votes for a candidate in an election, and many more.
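A short sketch of the binomial formula in code, using scipy.stats.binom; the values of n, r and p are arbitrary examples (say, a YES/NO survey with 10 respondents).

```python
# Illustrative sketch: binomial probabilities with SciPy.
# n, r and p are arbitrary example values.
from scipy.stats import binom

n, p = 10, 0.3   # 10 trials, success probability 0.3 per trial
r = 4            # number of successes of interest

print(binom.pmf(r, n, p))   # P(X = r) = nCr * p**r * (1 - p)**(n - r)
print(binom.cdf(r, n, p))   # P(X <= r)
```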
What is Negative Binomial Distribution?
In probability theory and statistics, the negative binomial distribution is the discrete probability
distribution of the number of successes in a series of independent and identically distributed
Bernoulli trials before a specified number of failures occurs. Here the number of failures is denoted
by 'r'. For instance, suppose we throw a die and treat the occurrence of 1 as a failure and all non-1
outcomes as successes. If we throw the die repeatedly until 1 appears for the third time (r = 3
failures), then the probability distribution of the number of non-1 outcomes that appeared would be
a negative binomial distribution.

What is Poisson Probability Distribution?


The Poisson probability distribution is a discrete probability distribution that gives the probability of
a given number of events happening in a fixed interval of time or space, when these events occur
with a known constant rate and independently of the time since the last event.
The Poisson distribution can also be applied to the number of events occurring in other specified
intervals such as distance, area or volume. Some real-life examples are:
• The number of patients arriving at a clinic between 10 and 11 AM.
• The number of emails received by a manager during office hours.
• The number of apples sold by a shopkeeper between 12 pm and 4 pm daily.
Poisson distribution formula:
P(X = k) = (λ^k · e^(−λ)) / k!
Where,
• P(X = k) is the probability of observing k events
• e is the base of the natural logarithm (approximately 2.71828)
• λ is the average rate of occurrence of events
• k is the number of events that occur
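A short sketch evaluating the Poisson formula with scipy.stats.poisson; the rate λ and count k below are arbitrary example values (for instance, patients arriving per hour).

```python
# Illustrative sketch: Poisson probabilities with SciPy.
# lam (the average rate) and k are arbitrary example values.
from scipy.stats import poisson

lam = 5.0   # average rate of occurrence (lambda)
k = 3       # number of events

print(poisson.pmf(k, lam))   # P(X = k) = (lam**k * exp(-lam)) / k!
print(poisson.cdf(k, lam))   # P(X <= k)
```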

Types of Statistical Modelling:


Statistical modelling is a process of using data to create mathematical representations of
real-world phenomena. For instance, predicting housing prices based on factors like location,
size, and features is a statistical model.
By applying statistical models to raw data, data scientists can generate comprehensible
visualisations, discover correlations between variables and generate predictions. Census data, public
health data, and social media data are examples of typical data sets for statistical analysis.
Statistical modeling in Python involves using libraries like StatsModels or scikit-learn to build
models. It enables data scientists to perform regression, hypothesis testing, and other
analyses.
Types of Statistical Modelling:
1. Linear Regression: Linear regression is a widely used statistical model that predicts
the relationship between a dependent variable and one or more independent
variables. It assumes a linear relationship between the variables and is often used for
forecasting and estimating future trends.
Example: If we want to predict a house price, we consider various factors such as house age,
distance from the main road, location, area and number of rooms. Linear regression uses all these
parameters to predict the house price, as it assumes a linear relationship between these features
and the price of the house. A minimal code sketch is given after the assumptions below.
a. Linearity: The independent and dependent variables have a linear relationship with
one another. This implies that changes in the dependent variable follow those in the
independent variable(s) in a linear fashion. This means that there should be a straight
line that can be drawn through the data points. If the relationship is not linear, then
linear regression will not be an accurate model.

b. Independence: The observations in the dataset are independent of each other. This
means that the value of the dependent variable for one observation does not
depend on the value of the dependent variable for another observation. If the
observations are not independent, then linear regression will not be an accurate
model.
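A minimal sketch of fitting a linear regression with scikit-learn; the tiny housing dataset (areas, rooms, ages and prices) is made up purely for illustration.

```python
# Minimal linear regression sketch with scikit-learn.
# The tiny housing dataset below is made up purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Features: [area in square metres, number of rooms, house age in years]
X = np.array([[50, 2, 30], [80, 3, 10], [120, 4, 5], [65, 2, 20]])
y = np.array([150_000, 240_000, 360_000, 190_000])  # hypothetical prices

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # learned linear relationship
print(model.predict([[100, 3, 8]]))    # predicted price for a new house
```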
2. Logistic Regression: Logistic regression is a binary classification model that predicts
the probability of an event occurring or not. It is widely used in healthcare, finance,
and marketing to predict outcomes such as whether a patient will develop a disease
or whether a customer will buy a product.
Working:
• Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving an exact
value of 0 or 1, it gives probabilistic values which lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
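A minimal sketch of logistic regression with scikit-learn; the toy hours-studied/pass-fail data is made up for illustration, and predict_proba returns the probabilistic values between 0 and 1 mentioned above.

```python
# Minimal logistic regression sketch with scikit-learn.
# The toy data (hours studied vs. pass/fail) is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # hours studied
y = np.array([0, 0, 0, 1, 1, 1])                          # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.5]]))         # predicted class (0 or 1)
print(clf.predict_proba([[3.5]]))   # probabilities between 0 and 1
```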

3. Decision Trees: Decision trees are a visual representation of a set of rules that lead to
a decision. They are often used in machine learning and data mining to predict
outcomes and classify data. A decision tree is a supervised learning algorithm used
for both classification and regression tasks.
A small code sketch of how a decision tree can be built is shown below.
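A minimal sketch of fitting a decision tree with scikit-learn on the built-in iris dataset; export_text prints the learned rules as a textual stand-in for a tree diagram.

```python
# Minimal decision tree sketch with scikit-learn, using the built-in iris dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned decision rules as text.
print(export_text(tree, feature_names=load_iris().feature_names))
```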
4. Random Forest: A random forest is an ensemble learning technique that uses
multiple decision trees to make predictions. It is often used in data mining and
machine learning for classification and regression tasks.
As an example of the random forest technique, suppose we have a dataset of voters from various
locations and our goal is to predict the election outcome; a code sketch follows.
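A minimal sketch of a random forest with scikit-learn; the voter dataset below is hypothetical (each row is [age, income_bracket, region_code]) and exists only to illustrate the API.

```python
# Random forest sketch with scikit-learn.
# The voter dataset is hypothetical: each row is [age, income_bracket, region_code]
# and the label is the voter's choice (0 or 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[25, 1, 0], [40, 2, 1], [60, 3, 1], [35, 2, 0],
              [50, 1, 2], [30, 3, 2], [45, 2, 0], [70, 1, 1]])
y = np.array([0, 1, 1, 0, 1, 0, 1, 1])

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[38, 2, 1]]))        # predicted outcome for a new voter
print(forest.predict_proba([[38, 2, 1]]))  # per-class vote share across the trees
```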

5. Support Vector Machines: Support vector machines are a popular supervised learning technique
used for classification and regression tasks. They work by identifying a hyperplane that separates
the data into different classes.
Support Vector Machine (SVM) Terminology:
• Hyperplane: A decision boundary separating different classes in feature space,
represented by the equation wx + b = 0 in linear classification.
• Support Vectors: The closest data points to the hyperplane, crucial for determining
the hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
• Kernel: A function that maps data to a higher-dimensional space, enabling SVM to
handle non-linearly separable data.
• Hard Margin: A maximum-margin hyperplane that perfectly separates the data
without misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack variables, balancing
margin maximization and misclassification penalties when the data is not perfectly
separable (in effect, a few outliers can be ignored).
• C: A regularization term balancing margin maximization and misclassification
penalties. A higher C value enforces a stricter penalty for misclassifications.
How SVM Works?
The key idea behind the SVM algorithm is to find the hyperplane that best separates two
classes by maximizing the margin between them. This margin is the distance from the
hyperplane to the nearest data points (support vectors) on each side.

The best separating hyperplane is the one that maximizes the distance between the hyperplane and the
nearest data points from both classes; when it separates the data perfectly it is called a "hard margin"
classifier. This ensures a clear separation between the classes.
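A minimal sketch of a linear SVM with scikit-learn; the two-class blob data is synthetic, and the kernel and C values are illustrative choices rather than tuned settings.

```python
# SVM sketch with scikit-learn on a synthetic two-class dataset.
# The kernel choice and C value are illustrative, not tuned.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # linear kernel: hyperplane w.x + b = 0
print(clf.support_vectors_[:3])               # a few of the support vectors
print(clf.predict(X[:5]))                     # predicted classes for some points
```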

Advantages of SVM:

• High-Dimensional Performance: SVM excels in high-dimensional spaces, making it suitable
for image classification and gene expression analysis.
• Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM effectively
handles nonlinear relationships.
• Outlier Resilience: The soft margin feature allows SVM to ignore outliers, enhancing
robustness in spam detection and anomaly detection.
• Binary and Multiclass Support: SVM is effective for both binary classification and multiclass
classification, suitable for applications in text classification.
• Memory Efficiency: SVM focuses on support vectors, making it memory efficient compared to
other algorithms.

6. Naive Bayes: Naive Bayes is a probabilistic model that calculates the likelihood of an
event based on prior knowledge. It is widely used in natural language processing and
text classification.
The main idea behind the Naive Bayes classifier is to use Bayes' Theorem to classify data
based on the probabilities of different classes given the features of the data. It is used mostly
in high-dimensional text classification.
• The Naive Bayes classifier is a simple probabilistic classifier with very few parameters,
which makes it possible to build ML models that predict faster than many other
classification algorithms.
• It is called naive because it assumes that each feature in the model is independent of the
others; in other words, each feature contributes to the prediction with no relation to the
other features.
• The Naive Bayes algorithm is used in spam filtering, sentiment analysis, classifying
articles and many more applications.

What is Bayes Rule?

The Bayes Rule provides the formula for the probability of Y given X:

P(Y | X) = P(X | Y) · P(Y) / P(X)

But in real-world problems, you typically have multiple X variables. When the features are
independent, we can extend the Bayes Rule to what is called Naive Bayes:

P(Y | X1, ..., Xn) ∝ P(Y) · P(X1 | Y) · P(X2 | Y) · ... · P(Xn | Y)

It is called 'Naive' because of the naive assumption that the X's are independent of each other.
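A minimal sketch of a Naive Bayes text classifier with scikit-learn; the example sentences and spam labels are made up for illustration.

```python
# Naive Bayes sketch for tiny text classification with scikit-learn.
# The example sentences and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free offer, claim your prize", "project update attached"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["claim your free prize"])))  # likely spam (1)
```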

7. K-Nearest Neighbors: K-Nearest Neighbors is a non-parametric classification model that predicts
the class of a new data point based on the classes of its k nearest neighbors. It is widely used in
pattern recognition.
K-Nearest Neighbors (KNN) is a simple way to classify things by looking at what's nearby.
Imagine a streaming service wants to predict whether a new user is likely to cancel their
subscription (churn) based on their age. It checks the ages of its existing users and whether
they churned or stayed. If most of the K users closest in age to the new user cancelled their
subscription, KNN will predict that the new user might churn too. The key idea is that users
with similar ages tend to have similar behaviours, and KNN uses this closeness to make decisions.
In the k-Nearest Neighbours (k-NN) algorithm, k is just a number that tells the algorithm how
many nearby points (neighbours) to look at when it makes a decision.
Example:
Imagine you're deciding which fruit a new item is, based on its shape and size. You compare it
to fruits you already know.
• If k = 3, the algorithm looks at the 3 closest fruits to the new one.
• If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is
an apple, because most of its neighbours are apples.
The value of k is critical in KNN as it determines the number of neighbors to consider
when making predictions. Selecting the optimal value of k depends on the characteristics
of the input data. If the dataset has significant outliers or noise, a higher k can help smooth
out the predictions and reduce the influence of noisy data. However, choosing a very high
value can lead to underfitting, where the model becomes too simplistic.
Statistical Methods for Selecting k:
• Cross-Validation: A robust method for selecting the best k is to perform cross-validation.
This involves splitting the data into several folds, training the model on some folds, testing
it on the remaining one, and repeating this for each fold. The value of k that results in the
highest average validation accuracy is usually the best choice.
• Elbow Method: In the elbow method, we plot the model's error rate or accuracy for
different values of k. As we increase k, the error usually decreases initially; after a certain
point it starts to decrease more slowly. The point where the curve forms an "elbow" is
considered the best k.
• Odd Values for k: It is also recommended to choose an odd value for k, especially in
classification tasks, to avoid ties when deciding the majority class.
The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts
the label or value of a new data point by considering the labels or values of its K nearest
neighbors in the training dataset.

Step 1: Selecting the optimal value of K


• K represents the number of nearest neighbors that need to be considered while making
the prediction.
Step 2: Calculating distance
• To measure the similarity between the target point and the training data points, the
Euclidean distance is typically used. The distance is calculated between each data point
in the dataset and the target point.
Step 3: Finding Nearest Neighbors
• The k data points with the smallest distances to the target point are the nearest neighbors.
Step 4: Voting for Classification or Taking Average for Regression
• When you want to classify a data point into a category (like spam or not spam), the
K-NN algorithm looks at the K closest points in the dataset. These closest points are
called neighbors. The algorithm then looks at which category the neighbors belong to
and picks the one that appears the most. This is called majority voting.
• In regression, the algorithm still looks for the K closest points. But instead of voting
for a class in classification, it takes the average of the values of those K neighbors.
This average is the predicted value for the new point for the algorithm.
For example, KNN can be used to check whether a particular review is positive or negative by
looking at the labels of the most similar reviews; a minimal code sketch follows.
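A minimal KNN sketch with scikit-learn, mirroring the churn-by-age example above; the ages and churn labels are made up for illustration.

```python
# KNN sketch with scikit-learn.
# The toy data (user age vs. churned or not) is made up for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

ages = np.array([[18], [22], [25], [30], [35], [40], [45], [50]])
churned = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # 1 = churned, 0 = stayed

knn = KNeighborsClassifier(n_neighbors=3).fit(ages, churned)   # k = 3
print(knn.predict([[24]]))        # majority vote among the 3 closest ages
print(knn.predict_proba([[24]]))  # share of neighbours in each class
```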

8. Principal Component Analysis: Principal Component Analysis is a dimensionality reduction
technique that reduces the dimensionality of a dataset while retaining the most important
features. It is widely used in data visualization and feature selection.
Principal component analysis can be broken down into five steps.
Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial variables so
that each one of them contributes equally to the analysis.
If there are large differences between the ranges of the initial variables, the variables with larger
ranges will dominate over those with smaller ranges (for example, a variable that ranges between
0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased
results. Transforming the data to comparable scales prevents this problem.

Once the standardization is done, all the variables will be transformed to the same scale.
Step 2: Compute the covariance matrix to identify correlations
The aim of this step is to see whether there is any relationship between the variables, because
variables are sometimes so highly correlated that they contain redundant information. In order to
identify these correlations, we compute the covariance matrix.
What do the covariances that we have as entries of the matrix tell us about the
correlations between the variables?
• If positive then: the two variables increase or decrease together (correlated)
• If negative then: one increases when the other decreases (Inversely correlated)
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
Step 4: Create a feature vector to decide which principal components to keep
Step 5: Recast the data along the principal component axes. A minimal code sketch of the whole
pipeline follows.
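A minimal sketch of the five PCA steps with scikit-learn on the built-in iris dataset; StandardScaler performs the standardization step and PCA handles the covariance, eigenvector and projection steps.

```python
# PCA sketch with scikit-learn: standardize, then project onto 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_std = StandardScaler().fit_transform(X)   # Step 1: standardization
pca = PCA(n_components=2).fit(X_std)        # Steps 2-4: covariance, eigenvectors, selection
X_reduced = pca.transform(X_std)            # Step 5: recast onto the principal axes

print(pca.explained_variance_ratio_)        # variance captured by each component
print(X_reduced[:3])                        # first few projected points
```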
9. Clustering: Clustering in machine learning is a technique that groups similar data
points together. It helps identify patterns or structures within data by dividing it into
clusters based on shared characteristics.
Types of Clustering Algorithm:

K-Means Clustering
• Strengths: Simple, efficient for large datasets, works well with spherical clusters.
• Weaknesses: Struggles with non-spherical clusters, sensitive to initial centroids.
• Ideal Use Cases: Customer segmentation, document classification, image compression.

Mean-Shift Clustering
• Strengths: Automatically determines the number of clusters, effective for irregular clusters.
• Weaknesses: Computationally expensive, sensitive to the bandwidth parameter.
• Ideal Use Cases: Image segmentation, traffic pattern analysis.

Density-Based Clustering
• Strengths: Handles noise, identifies arbitrary-shaped clusters.
• Weaknesses: Difficult to define parameters like minimum points and distance threshold.
• Ideal Use Cases: Fraud detection, geographical data grouping.

Hierarchical Clustering
• Strengths: Builds a visual dendrogram for hierarchical relationships.
• Weaknesses: Computationally intensive, not suitable for large datasets.
• Ideal Use Cases: Genealogy analysis, protein structure analysis, multi-level customer segmentation.

Distribution-Based Clustering
• Strengths: Handles overlapping clusters, based on probability distributions.
• Weaknesses: Requires assumptions about the data distribution, not ideal for all datasets.
• Ideal Use Cases: Traffic flow modelling, customer segmentation with shared characteristics.

Hybrid Clustering Methods
• Strengths: Combines the strengths of several algorithms, improves accuracy, adapts to complex datasets.
• Weaknesses: Increased complexity, may require more computation.
• Ideal Use Cases: Customer segmentation with shared characteristics, large-scale genomic data analysis.
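A minimal K-Means sketch with scikit-learn on synthetic data; the choice of 3 clusters is illustrative.

```python
# K-Means sketch with scikit-learn on a synthetic dataset.
# The number of clusters (3) is an illustrative choice.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)   # coordinates of the learned centroids
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
```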

10. Neural Networks: Neural networks are a set of algorithms modelled on the human
brain that can learn and recognize patterns. They are widely used in image and
speech recognition, natural language processing, and robotics.
Neural networks are capable of learning and identifying patterns directly from data without
pre-defined rules. These networks are built from several key components:
1. Neurons: The basic units that receive inputs, each neuron is governed by a threshold
and an activation function.
2. Connections: Links between neurons that carry information, regulated by weights
and biases.
3. Weights and Biases: These parameters determine the strength and influence of
connections.
4. Propagation Functions: Mechanisms that help process and transfer data across
layers of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve
accuracy.
Learning in neural networks follows a structured, three-stage process:
1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network generates an
output.
3. Iterative Refinement: The network refines its output by adjusting weights and biases,
gradually improving its performance on diverse tasks.
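A minimal sketch of a single forward pass through a tiny network using NumPy only; the weights and biases are arbitrary example numbers, whereas in a real network they would be adjusted by the learning rule (e.g. backpropagation).

```python
# Minimal sketch of one forward pass through a tiny neural network (NumPy only).
# The weights and biases are arbitrary example values, not learned parameters.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # activation function

x = np.array([0.5, -1.2, 3.0])        # input: one data point with 3 features

W1 = np.array([[0.2, -0.4, 0.1],      # weights of a hidden layer with 2 neurons
               [0.7, 0.3, -0.5]])
b1 = np.array([0.1, -0.2])            # biases of the hidden layer

W2 = np.array([[0.6, -0.8]])          # weights of a single output neuron
b2 = np.array([0.05])

h = sigmoid(W1 @ x + b1)              # hidden-layer activations
y_hat = sigmoid(W2 @ h + b2)          # network output (a value between 0 and 1)
print(y_hat)
```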
11. Markov Chains: Markov chains are a stochastic model that describes a sequence of
events where the probability of each event depends only on the state of the previous
event. They are widely used in finance, speech recognition, and genetics.

Markov chains, named after Andrey Markov, are stochastic models that depict a sequence of
possible events in which the prediction or probability of the next state is based solely on the
current state, not on the states before it. In simple words, the probability that the (n+1)th step
will be x depends only on the nth step, not on the complete sequence of steps that came before
n. This property is known as the Markov property or memorylessness.
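A minimal sketch of simulating a two-state Markov chain with NumPy; the states and transition probabilities are made up for illustration.

```python
# Markov chain sketch: simulate a simple two-state weather chain with NumPy.
# The transition probabilities are made up for illustration.
import numpy as np

states = ["Sunny", "Rainy"]
# P[i][j] = probability of moving from state i to state j
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

rng = np.random.default_rng(0)
state = 0                               # start in "Sunny"
sequence = [states[state]]
for _ in range(10):
    state = rng.choice(2, p=P[state])   # next state depends only on the current one
    sequence.append(states[state])
print(sequence)
```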
12. Time Series Analysis: Time series analysis is a statistical technique used to analyze time
series data to identify patterns and make forecasts. It is widely used in finance,
economics, and engineering.
Time series data is commonly represented graphically as a line plot, with time on the horizontal
x-axis and the variable of interest on the vertical y-axis. This graphical representation makes it
easier to visualize trends, patterns, and fluctuations in the variable over time, aiding the analysis
and interpretation of the data.

Preprocessing Time Series Data


Time series preprocessing refers to the steps taken to clean, transform, and prepare
time series data for analysis or forecasting. It involves techniques aimed at improving
data quality, removing noise, handling missing values, and making the data suitable for
modeling. Preprocessing tasks may include removing outliers, handling missing values
through imputation, scaling or normalizing the data, detrending, deseasonalizing, and
applying transformations to stabilize variance. The goal is to ensure that the time series
data is in a suitable format for subsequent analysis or modeling.
• Handling Missing Values : Dealing with missing values in the time series data to ensure
continuity and reliability in analysis.
• Dealing with Outliers: Identifying and addressing observations that significantly
deviate from the rest of the data, which can distort analysis results.
• Stationarity and Transformation: Ensuring that the statistical properties of the time
series, such as mean and variance, remain constant over time. Techniques like
differencing, detrending, and deseasonalizing are used to achieve stationarity.
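A minimal preprocessing sketch with pandas showing two of the steps above (imputing missing values and differencing toward stationarity); the sales series is made up for illustration.

```python
# Time series preprocessing sketch with pandas: fill missing values and
# difference the series to help make it stationary. The data is made up.
import pandas as pd

dates = pd.date_range("2024-01-01", periods=8, freq="D")
sales = pd.Series([100, 102, None, 110, 108, None, 120, 125], index=dates)

sales_filled = sales.interpolate()         # handle missing values by interpolation
sales_diff = sales_filled.diff().dropna()  # first-order differencing (detrending step)

print(sales_filled)
print(sales_diff)
```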

Distance Metrics:
Suppose you created clusters using a clustering algorithm such as K-Means, or used the k-nearest
neighbours (kNN) algorithm, which relies on nearest neighbours to solve a classification or
regression problem. How do you define the similarity between different observations? How can we
say that two points are similar to each other? They are similar if their features are similar; when we
plot such points, they lie close to each other in distance.
Types of Distance Metrics in Machine Learning:
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
Euclidean Distance:
The Euclidean distance is the most widely used distance measure in clustering. It calculates the
straight-line distance between two points in n-dimensional space. The formula for the Euclidean
distance between points p and q is:

d(p, q) = √( Σᵢ (pᵢ − qᵢ)² )

Geometrically, if the two points are plotted, the Euclidean distance is the length of the straight
line that joins them.
Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city block distance, is the
total of the absolute differences between the Cartesian coordinates of two points. Envision
manoeuvring across a city grid in which you can only move horizontally and vertically: the
Manhattan distance is the total distance travelled along each dimension to reach the other point.
When it comes to categorical data, this metric can be more effective than the Euclidean distance,
since it is less susceptible to outliers. The formula is:

d(p, q) = Σᵢ |pᵢ − qᵢ|

Geometrically, it corresponds to the grid-line path between the two points rather than the
straight line.
Minkowski Distance:
Minkowski distance is a generalized form of both the Euclidean and Manhattan distances,
controlled by a parameter p:

d(x, y) = ( Σᵢ |xᵢ − yᵢ|^p )^(1/p)

When p = 1, it is equivalent to the Manhattan distance; when p = 2, it is the Euclidean distance.

Hamming Distance:
Hamming distance is a metric for comparing two binary data strings. While comparing two
binary strings of equal length, Hamming distance is the number of bit positions in which the
two bits are different.
The Hamming distance between two strings a and b is denoted as d(a, b) or H(a, b).
To calculate the Hamming distance between two binary strings a and b, we perform their XOR
operation (a ⊕ b) and then count the total number of 1s in the resulting string.
Example
Suppose there are two strings 11011001 and 10011101.
11011001 ⊕ 10011101 = 01000100. Since this contains two 1s, the Hamming distance is
d(11011001, 10011101) = 2.
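A minimal sketch implementing the four distance metrics above in Python/NumPy; the sample points and strings are chosen so the outputs can be checked by hand (the Hamming example matches the one above).

```python
# Sketch of the four distance metrics described above, using plain Python/NumPy.
import numpy as np

def euclidean(p, q):
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def minkowski(p, q, power=3):
    diff = np.abs(np.asarray(p) - np.asarray(q))
    return np.sum(diff ** power) ** (1.0 / power)

def hamming(a, b):
    # a and b are equal-length binary strings, e.g. "11011001"
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

print(euclidean([1, 2], [4, 6]))           # 5.0 (straight-line distance)
print(manhattan([1, 2], [4, 6]))           # 7 (grid distance)
print(minkowski([1, 2], [4, 6], power=1))  # 7.0 (same as Manhattan when p = 1)
print(hamming("11011001", "10011101"))     # 2, matching the example above
```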
