Unit 3 Statistical References
A population refers to the entire set of individuals, objects, or data points that you want to study. It
can be large or small depending on the scope of your research. For example, all students in a school
or all people in a country.
A sample is a subset of the population that is selected for analysis. It’s used when studying the entire
population is impractical or impossible. Sampling allows for inferences about the population using
statistical techniques.
Population vs Sample
Population: Collecting data from an entire population can be time-consuming, expensive, and sometimes impractical or impossible.
Sample: Samples offer a more feasible approach to studying populations, allowing researchers to draw conclusions based on smaller, manageable datasets.
Types of Population:
Finite Population
A population is called finite if it is possible to count its individuals. It may also be called a countable
population. The number of vehicles crossing a bridge every day, the number of births per year and
the number of words in a book are finite populations. The number of units in a finite population is
denoted by N. Thus N is the size of the population.
Infinite Population
Sometimes it is not possible to count the units contained in the population. Such a population is
called infinite or uncountable. Let us suppose that we want to examine whether a coin is fair or not.
We shall toss it a very large number of times to observe the number of heads. All the tosses will
make an infinite, or uncountably infinite, population. The number of germs in the body of a sick
patient is perhaps something which is uncountable.
Existent Population
An existent population is defined as a population of concrete individuals. In other words, a
population whose units are available in physical (solid) form is known as an existent population.
Examples are books, students, etc.
Hypothetical Population
A population whose units are not available in physical (solid) form is known as a hypothetical
population. A population consists of a set of observations, objects, etc. that all have something in
common. In some situations, the population is only hypothetical. Examples are the outcomes of
rolling a die or tossing a coin.
The probability density function of the normal distribution is:

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Where,
• μ = Mean value
• σ = Standard deviation
• If the mean (μ) = 0 and the standard deviation (σ) = 1, then the distribution is known as the
standard normal distribution.
• x = Normal random variable
Normal Distribution Examples
Since the normal distribution approximates many natural phenomena so well, it has become a
standard of reference for many probability problems. Some of the examples are:
• Tossing a coin
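As a minimal sketch of the density formula above (the function name and the sample inputs are illustrative, not from the source), the normal PDF can be evaluated directly in Python:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Evaluate the normal probability density f(x) for mean mu and std dev sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Standard normal (mu = 0, sigma = 1): density at the mean is about 0.3989
print(normal_pdf(0.0))          # ~0.39894
print(normal_pdf(1.0, mu=0.0))  # ~0.24197
```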
The binomial distribution formula is:

P(X = r) = nCr · p^r · (1 − p)^(n − r)

Where,
• n = Total number of events
• r = Total number of successful events
• p = Probability of success on a single trial
• nCr = n! / [r!(n − r)!]
• 1 − p = Probability of failure
Binomial Distribution Examples
As we already know, the binomial distribution gives the probability of a different set of outcomes.
In real life, the concept is used:
• To find the number of used and unused materials while manufacturing
a product.
• To take a survey of positive and negative feedback from people about
something.
• To check how many viewers watch a particular channel by running a
YES/NO survey.
• To count the number of men and women working in a company.
• To count the votes for a candidate in an election, and many more.
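A minimal sketch of the binomial formula above, assuming Python 3.8+ for math.comb (the coin-toss numbers are illustrative):

```python
import math

def binomial_pmf(n, r, p):
    """Probability of exactly r successes in n trials with success probability p."""
    return math.comb(n, r) * (p ** r) * ((1 - p) ** (n - r))

# e.g. probability of exactly 3 heads in 5 fair coin tosses
print(binomial_pmf(5, 3, 0.5))  # 0.3125
```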
What is Negative Binomial Distribution?
In probability theory and statistics, if a discrete probability distribution gives the number of
successes in a series of independent and identically distributed Bernoulli trials before a
specified number of failures occurs, then it is termed the negative binomial
distribution. Here the number of failures is denoted by ‘r’. For instance, suppose we roll a die
and treat the occurrence of 1 as a failure and all non-1s as successes. Now, if we roll
the die repeatedly until 1 appears the third time, i.e. r = 3 failures, then the probability
distribution of the number of non-1s that appeared would be the negative binomial
distribution.
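A hedged sketch of the dice example above, using the parameterisation that counts successes before the r-th failure (here p is the assumed probability of a non-1):

```python
import math

def neg_binomial_pmf(k, r, p):
    """Probability of exactly k successes before the r-th failure,
    where each trial succeeds with probability p."""
    return math.comb(k + r - 1, k) * (p ** k) * ((1 - p) ** r)

# Dice example: success = rolling a non-1 (p = 5/6), stop at the 3rd "1" (r = 3).
# Probability that exactly 5 non-1s appear before the third 1:
print(neg_binomial_pmf(5, 3, 5 / 6))  # ~0.039
```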
b. Independence: The observations in the dataset are independent of each other. This
means that the value of the dependent variable for one observation does not
depend on the value of the dependent variable for another observation. If the
observations are not independent, then linear regression will not be an accurate
model.
2. Logistic Regression: Logistic regression is a binary classification model that predicts
the probability of an event occurring or not. It is widely used in healthcare, finance,
and marketing to predict outcomes such as whether a patient will develop a disease
or whether a customer will buy a product.
Working:
• Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value.
• It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact
values 0 and 1, it gives probabilistic values which lie between 0 and 1.
• In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic
function, which predicts two maximum values (0 or 1).
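A minimal sketch of the S-shaped logistic (sigmoid) function described above; the weights, inputs, and the 0.5 cut-off are illustrative assumptions, not values from the source:

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real value into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

# A linear score w*x + b is passed through the sigmoid to get a probability.
w, b = 2.0, -1.0
for x in (-2.0, 0.0, 2.0):
    prob = sigmoid(w * x + b)
    label = 1 if prob >= 0.5 else 0   # conventional 0.5 cut-off
    print(f"x={x:+.1f}  P(y=1)={prob:.3f}  predicted class={label}")
```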
3. Decision Trees: Decision trees are a visual representation of a set of rules that lead to
a decision. They are often used in machine learning and data mining to predict
outcomes and classify data. A decision tree is a supervised learning algorithm used
for both classification and regression tasks.
The pictures below are examples of how a decision tree can be constructed:
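In addition, a minimal scikit-learn sketch (assuming scikit-learn is installed; the toy age/income data is made up for illustration) shows how such a tree can be fitted and its rules printed:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy dataset: [age, income] -> whether the person bought a product (1) or not (0)
X = [[25, 30000], [40, 60000], [35, 45000], [50, 80000], [23, 20000], [45, 70000]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned rules and classify a new observation
print(export_text(tree, feature_names=["age", "income"]))
print(tree.predict([[30, 52000]]))
```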
4. Random Forest: A random forest is an ensemble learning technique that uses
multiple decision trees to make predictions. It is often used in data mining and
machine learning for classification and regression tasks.
Below is an example of the random forest technique, where we have a dataset of
voters from various locations and our goal is to predict the election outcome:
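A minimal sketch of this idea, assuming scikit-learn; the voter features and labels below are hypothetical, chosen only to mirror the example:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical voter features: [age, region_code, past_turnout] -> party voted for (0/1)
X = [[25, 1, 0], [60, 2, 1], [45, 1, 1], [33, 3, 0], [52, 2, 1], [29, 3, 0]]
y = [0, 1, 1, 0, 1, 0]

# An ensemble of decision trees; each tree sees a bootstrap sample of the voters
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Predict the likely outcome for a new voter profile
print(forest.predict([[40, 2, 1]]))
```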
The best hyperplane, also known as the “hard margin,” is the one that maximizes the distance
between the hyperplane and the nearest data points from both classes. This ensures a clear separation
between the classes. So, from the above figure, we choose L2 as the hard margin.
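A minimal sketch, assuming scikit-learn and linearly separable toy points; a very large C value approximates the hard-margin behaviour described above:

```python
from sklearn.svm import SVC

# Two linearly separable classes in 2-D (made-up points)
X = [[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

# A very large C penalises misclassification heavily, approximating a hard margin
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.support_vectors_)           # the points closest to the separating hyperplane
print(clf.predict([[3, 2], [7, 6]]))  # class predictions for new points
```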
Advantages of SVM:
6. Naive Bayes: Naive Bayes is a probabilistic model that calculates the likelihood of an
event based on prior knowledge. It is widely used in natural language processing and
text classification.
The main idea behind the Naive Bayes classifier is to use Bayes’ Theorem to classify data
based on the probabilities of different classes given the features of the data. It is used mostly
in high-dimensional text classification.
• The Naive Bayes Classifier is a simple probabilistic classifier with a small number of
parameters, which makes it possible to build ML models that can predict at a
faster speed than other classification algorithms.
• It is called naive because it assumes that each feature in the model is
independent of the existence of any other feature. In other words, each feature
contributes to the predictions with no relation to the others.
• The Naïve Bayes algorithm is used in spam filtering, sentiment analysis, classifying
articles and many more.
The Bayes Rule provides the formula for the probability of Y given X:

P(Y | X) = P(X | Y) · P(Y) / P(X)

It is called ‘Naive’ because of the naive assumption that the X’s are independent of each other.
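A minimal sketch of the naive independence assumption on a tiny, made-up spam example (the priors and word likelihoods are assumed values, purely for illustration):

```python
# P(Y | X) is proportional to P(Y) * product of P(x_i | Y), treating each word
# as independent given the class (the "naive" assumption).
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {  # P(word | class), assumed values for illustration
    "spam": {"free": 0.30, "offer": 0.20, "meeting": 0.01},
    "ham":  {"free": 0.02, "offer": 0.03, "meeting": 0.25},
}

def classify(words):
    scores = {}
    for label in priors:
        score = priors[label]
        for w in words:
            score *= likelihoods[label].get(w, 1e-6)  # small floor for unseen words
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}  # normalise (Bayes' rule)

print(classify(["free", "offer"]))    # spam probability dominates
print(classify(["meeting"]))          # ham probability dominates
```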
Once the standardization is done, all the variables will be transformed to the same scale.
Step 2: Compute the covariance matrix to identify correlations
The aim of this step is to see whether there is any relationship between the variables, because
sometimes variables are highly correlated in such a way that they contain redundant information.
So, in order to identify these correlations, we compute the covariance matrix.
What do the covariances that we have as entries of the matrix tell us about the
correlations between the variables?
• If positive then: the two variables increase or decrease together (correlated)
• If negative then: one increases when the other decreases (Inversely correlated)
Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components
Step 4: Create a feature vector to decide which principal components to keep
Step 5: Recast the data along the principal components axes.
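A minimal NumPy sketch of steps 2–5 above (the small data matrix is assumed to be already standardised, and its values are made up for illustration):

```python
import numpy as np

# Assume X has already been standardised (step 1): rows = observations, columns = variables
X = np.array([[0.5, -1.2, 0.3],
              [-1.0, 0.8, -0.5],
              [0.2, 0.1, 1.1],
              [0.3, 0.3, -0.9]])

# Step 2: covariance matrix of the variables
cov = np.cov(X, rowvar=False)

# Step 3: eigenvectors/eigenvalues of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: feature vector = eigenvectors of the k largest eigenvalues
order = np.argsort(eigenvalues)[::-1]
k = 2
feature_vector = eigenvectors[:, order[:k]]

# Step 5: recast the data along the principal component axes
X_pca = X @ feature_vector
print(X_pca)
```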
9. Clustering: Clustering in machine learning is a technique that groups similar data
points together. It helps identify patterns or structures within data by dividing it into
clusters based on shared characteristics.
Types of Clustering Algorithm:
K-Means Clustering
• Strengths: Simple, efficient for large datasets, works well with spherical clusters
• Weaknesses: Struggles with non-spherical clusters, sensitive to initial centroids
• Ideal Use Cases: Customer segmentation, document classification, image compression

Mean-Shift Clustering
• Strengths: Automatically determines the number of clusters, effective for irregular clusters
• Weaknesses: Computationally expensive, sensitive to the bandwidth parameter
• Ideal Use Cases: Image segmentation, traffic pattern analysis

Density-Based Clustering
• Strengths: Handles noise, identifies arbitrary-shaped clusters
• Weaknesses: Difficult to define parameters like minimum points and distance threshold
• Ideal Use Cases: Fraud detection, geographical data grouping

Hierarchical Clustering
• Strengths: Builds a visual dendrogram for hierarchical relationships
• Weaknesses: Computationally intensive, not suitable for large datasets
• Ideal Use Cases: Genealogy analysis, protein structure analysis, multi-level customer segmentation

Distribution-Based Clustering
• Strengths: Handles overlapping clusters, based on probability distributions
• Weaknesses: Requires assumptions about data distribution, not ideal for all datasets
• Ideal Use Cases: Traffic flow modelling, customer segmentation with shared characteristics

Hybrid Clustering Methods
• Strengths: Combines strengths of algorithms, improves accuracy, adapts to complex datasets
• Weaknesses: Increased complexity, may require more computation
• Ideal Use Cases: Customer segmentation with shared characteristics, large-scale genomic data analysis
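As one concrete example from the list above, a K-Means sketch with scikit-learn (the 2-D points are made up to form two rough groups):

```python
from sklearn.cluster import KMeans

# Toy 2-D points forming two rough groups
X = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
     [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster assignment for each point
print(kmeans.cluster_centers_)  # the two learned centroids
```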
10. Neural Networks: Neural networks are a set of algorithms modelled on the human
brain that can learn and recognize patterns. They are widely used in image and
speech recognition, natural language processing, and robotics.
Neural networks are capable of learning and identifying patterns directly from data without
pre-defined rules. These networks are built from several key components:
1. Neurons: The basic units that receive inputs; each neuron is governed by a threshold
and an activation function.
2. Connections: Links between neurons that carry information, regulated by weights
and biases.
3. Weights and Biases: These parameters determine the strength and influence of
connections.
4. Propagation Functions: Mechanisms that help process and transfer data across
layers of neurons.
5. Learning Rule: The method that adjusts weights and biases over time to improve
accuracy.
Learning in neural networks follows a structured, three-stage process:
1. Input Computation: Data is fed into the network.
2. Output Generation: Based on the current parameters, the network generates an
output.
3. Iterative Refinement: The network refines its output by adjusting weights and biases,
gradually improving its performance on diverse tasks.
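A minimal sketch of this three-stage loop for a single logistic neuron (the data, learning rate, and single-neuron architecture are assumptions chosen only for illustration):

```python
import math
import random

random.seed(0)

# Toy data: label is 1 when x1 + x2 > 1, otherwise 0
data = [((0.2, 0.3), 0), ((0.9, 0.8), 1), ((0.5, 0.9), 1), ((0.1, 0.4), 0)]

w1, w2, b = random.random(), random.random(), 0.0
lr = 0.5

for epoch in range(500):
    for (x1, x2), target in data:
        # 1. Input computation + 2. Output generation
        z = w1 * x1 + w2 * x2 + b
        out = 1.0 / (1.0 + math.exp(-z))   # sigmoid activation
        # 3. Iterative refinement: adjust weights and bias toward the target
        error = out - target
        w1 -= lr * error * x1
        w2 -= lr * error * x2
        b  -= lr * error

print(round(1.0 / (1.0 + math.exp(-(w1 * 0.9 + w2 * 0.9 + b))), 3))  # well above 0.5
print(round(1.0 / (1.0 + math.exp(-(w1 * 0.1 + w2 * 0.2 + b))), 3))  # well below 0.5
```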
11. Markov Chains: Markov chains are a stochastic model that describes a sequence of
events where the probability of each event depends only on the state of the previous
event. They are widely used in finance, speech recognition, and genetics.
Markov chains, named after Andrey Markov, are a stochastic model that depicts a sequence of
possible events where predictions or probabilities for the next state are based solely on the
previous event's state, not the states before it. In simple words, the probability that the (n+1)th step
will be x depends only on the nth step, not on the complete sequence of steps that came before
n. This property is known as the Markov Property, or memorylessness.
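A minimal sketch of the memoryless property, simulating a two-state weather chain; the states and transition probabilities are assumed values for illustration:

```python
import random

random.seed(1)

# P(next state | current state) -- assumed transition probabilities
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def next_state(current):
    """Choose the next state using only the current state (Markov property)."""
    states = list(transitions[current].keys())
    weights = list(transitions[current].values())
    return random.choices(states, weights=weights, k=1)[0]

state = "sunny"
chain = [state]
for _ in range(10):
    state = next_state(state)
    chain.append(state)

print(chain)
```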
12. Time Series Analysis: Time series analysis is a statistical technique used to analyze time
series data to identify patterns and make forecasts. It is widely used in finance,
economics, and engineering.
Time series data is often represented graphically as a line plot, with time on the horizontal
x-axis and the variable's values on the vertical y-axis. This graphical representation facilitates
the visualization of trends, patterns, and fluctuations in the variable over time, aiding in the
analysis and interpretation of the data.
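A minimal sketch of exposing a trend by smoothing a (made-up) monthly series with a simple moving average; no plotting library is assumed:

```python
# Hypothetical monthly sales figures
series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118]

def moving_average(values, window=3):
    """Average each value with its neighbours to smooth short-term fluctuations."""
    smoothed = []
    for i in range(len(values) - window + 1):
        smoothed.append(sum(values[i:i + window]) / window)
    return smoothed

print(moving_average(series))   # shorter, smoother series revealing the trend
```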
Distance Metrics:
Suppose you created clusters using a clustering algorithm such as K-Means, or used the k-nearest
neighbours algorithm (kNN), which relies on nearest neighbours to solve a classification or
regression problem. How will you define the similarity between different observations? How
can we say that two points are similar to each other? This will happen if their features are
similar, right? When we plot such points, they will be closer to each other by distance.
Types of Distance Metrics in Machine Learning:
1. Euclidean Distance
2. Manhattan Distance
3. Minkowski Distance
4. Hamming Distance
Euclidean Distance:
The Euclidean distance is the most widely used distance measure in clustering. It calculates
the straight-line distance between two points in n-dimensional space. The formula for the
Euclidean distance between points x and y is:

d(x, y) = √((x1 − y1)² + (x2 − y2)² + ... + (xn − yn)²)

The two points between which we are computing the Euclidean distance are represented by
blue dots in the figure; the Euclidean distance, represented by the black line connecting them,
is the distance measured in a straight line.
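A minimal sketch using math.dist (Python 3.8+); the points are illustrative:

```python
import math

p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)

# Straight-line distance: sqrt((1-4)^2 + (2-6)^2 + (3-3)^2) = 5.0
print(math.dist(p, q))
```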
Manhattan Distance
The Manhattan distance, sometimes referred to as the L1 distance or city block distance, is the
sum of the absolute differences between the points' Cartesian coordinates. Envision
manoeuvring across a city grid in which your only directions are horizontal and vertical. The
Manhattan distance, which computes the total distance travelled along each dimension to
reach a different data point, represents this movement. When it comes to categorical data, this
metric is more effective than Euclidean distance since it is less susceptible to outliers. The
formula is:

d(x, y) = |x1 − y1| + |x2 − y2| + ... + |xn − yn|

The two points are represented in blue in the plot, and the grid-line-based path is used
to determine the Manhattan distance.
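The same illustrative points, measured with the city-block rule:

```python
p = (1.0, 2.0, 3.0)
q = (4.0, 6.0, 3.0)

# Sum of absolute coordinate differences: |1-4| + |2-6| + |3-3| = 7
manhattan = sum(abs(a - b) for a, b in zip(p, q))
print(manhattan)
```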
Minkowski Distance:
Minkowski distance is a generalized form of both Euclidean and Manhattan distances,
controlled by a parameter p:

d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + ... + |xn − yn|^p)^(1/p)

When p = 1, it is equivalent to the Manhattan distance; when p = 2, it is the Euclidean distance.
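A short sketch showing how the parameter p recovers the previous two metrics (same illustrative points):

```python
def minkowski(x, y, p=3):
    """Generalised distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, p=1))  # 7.0  (Manhattan)
print(minkowski(x, y, p=2))  # 5.0  (Euclidean)
print(minkowski(x, y, p=3))  # ~4.5
```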
Hamming Distance:
Hamming distance is a metric for comparing two binary data strings. While comparing two
binary strings of equal length, Hamming distance is the number of bit positions in which the
two bits are different.
The Hamming distance between two strings a and b is denoted as d(a, b) or H(a, b).
In order to calculate the Hamming distance between two strings a and b, we perform their XOR
operation, (a ⊕ b), and then count the total number of 1s in the resultant string.
Example
Suppose there are two strings, 11011001 and 10011101.
11011001 ⊕ 10011101 = 01000100. Since this contains two 1s, the Hamming distance is
d(11011001, 10011101) = 2.
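A minimal sketch of the XOR-and-count procedure from the worked example:

```python
def hamming(a, b):
    """Count bit positions where two equal-length binary strings differ."""
    assert len(a) == len(b)
    xor = int(a, 2) ^ int(b, 2)          # a XOR b
    return bin(xor).count("1")           # number of 1s in the result

print(hamming("11011001", "10011101"))   # 2, matching the example above
```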