
Unit-1 - Machine Learning

machine learning unit 1

Uploaded by

do7760
Copyright
© © All Rights Reserved

21CSC305P - MACHINE LEARNING
UNIT-I

Machine learning: What and Why, Supervised learning,
Unsupervised learning, Polynomial curve fitting, Probability theory:
discrete random variables, Fundamental rules, Bayes rule,
Independence and conditional independence, Continuous random
variables, Quantiles, Mean and variance, Probability densities,
Expectation and covariance
Machine learning- What and Why
• The rise of big data demands machine learning for efficient data analysis and
decision-making.
• For instance, there are around 1 trillion web pages, and every second, one hour of
video content is uploaded to YouTube, equating to 10 years of content every day.
Additionally, thousands of human genomes, each consisting of approximately 3.8
billion base pairs, have been sequenced, and Walmart handles over 1 million
transactions per hour, resulting in databases containing more than 2.5 petabytes of
information.
• Machine learning comprises techniques that can automatically detect patterns within
data and leverage these patterns to predict future data or make decisions under
uncertainty.
• The optimal approach to addressing such challenges is through probability theory,
which applies to any problem involving uncertainty.
Types of Machine Learning
Predictive or Supervised Learning:
•Goal: Learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs
•Training set D = {(xi, yi)}, i = 1,…,N: the set of input-output pairs, where N is the number of training examples.
•Training input xi:
• Typically a D-dimensional vector of numbers.
• Represents features, attributes, or covariates (e.g., height and weight of a person).
• Can be complex structured objects (e.g., images, sentences, time series, molecular shapes, graphs).
•Output or response variable yi:
• Can be categorical/nominal (e.g., male or female) or real-valued (e.g., income level).
• Categorical problems are known as classification or pattern recognition.
• Real-valued problems are known as regression.
• Ordinal regression: Label space Y has a natural ordering (e.g., grades A-F).
Descriptive or Unsupervised Learning:
•Goal: Find "interesting patterns" in the data.
•Given only inputs D = {xi}, i = 1,…,N, with no outputs. Also known as knowledge discovery.
•A much less well-defined problem, since we are not told in advance what kinds of patterns to look for.
Reinforcement Learning:
•Useful for learning how to act or behave when given occasional reward or punishment signals.
•Example: How a baby learns to walk.
Supervised learning
1. Classification:
• Goal of Classification: Learn a mapping from inputs x to outputs y,
where y∈{1,…,C} with C being the number of classes.
• Binary Classification: When C=2, known as binary classification (e.g.,
y∈{0,1})
• Multiclass Classification: When C>2, known as multiclass classification.
• Multi-label Classification: When class labels are not mutually exclusive
(e.g., someone classified as tall and strong), predicting multiple related
binary class labels (multiple output model).
• One way to formalize the problem is as function approximation.
Assume y=f(x) for an unknown function f; learning aims to estimate f
using a labeled training set and then predict with ŷ = f̂(x).
• Generalization: The main goal is to make predictions on novel inputs not
seen before, emphasizing the importance of generalization over fitting the
training set.
Supervised learning-Cont.
Example
• Two classes of objects with labels 0 and 1.
• Inputs are colored shapes, described by D features or attributes.
• Features are stored in an N×D design matrix.
• Input features x can be discrete, continuous, or both. Vector of training labels y.
• Test objects: blue crescent, yellow circle, and blue arrow.
• These test objects have not been seen before, requiring generalization beyond the training set.
Generalization:
•Blue crescent likely has y=1 since all blue shapes in the training set are labeled 1.
•Yellow circle's label is unclear due to mixed labels for yellow objects and circles.
•Blue arrow's label is also unclear due to lack of specific information from the training set.
Supervised learning-Cont.

The need for probabilistic predictions:


• In classification, ambiguous cases should be handled by returning a probability
distribution over possible labels given the input and training set, denoted by p(y∣x,D)
•Compute the best guess as the most probable class label using
$$\hat{y} = \hat{f}(x) = \arg\max_{c = 1, \dots, C} \; p(y = c \mid x, D)$$
•This is the mode of the distribution p(y∣x,D) and is known as a MAP estimate (maximum a
posteriori).
• Confidence in predictions is crucial, especially in risk-averse domains like medicine
and finance.
• IBM's Watson for Jeopardy uses a confidence module to decide when to answer.
• Google's SmartASS (ad selection system) predicts the click-through rate (CTR) to
maximize expected profit.
• Systems like Watson and SmartASS assess the risk of their predictions, making
decisions based on confidence levels to optimize performance and minimize errors.
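As a minimal sketch of the MAP rule above, assume a classifier has already produced the predictive distribution p(y∣x,D) as a dictionary of class probabilities. The labels, the numbers, and the 0.7 confidence threshold below are made up for illustration; they are not taken from Watson or any real system.

```python
def map_estimate(posterior):
    """Return (label, probability) for the mode of p(y | x, D)."""
    label = max(posterior, key=posterior.get)
    return label, posterior[label]


def predict_or_abstain(posterior, threshold=0.7):
    """Answer only when the MAP probability reaches the (assumed) threshold."""
    label, prob = map_estimate(posterior)
    return label if prob >= threshold else None


# Hypothetical predictive distributions for two inputs.
confident = {"spam": 0.85, "not spam": 0.15}
ambiguous = {"spam": 0.55, "not spam": 0.45}

print(map_estimate(confident))        # the most probable label and its probability
print(predict_or_abstain(ambiguous))  # None: too uncertain to answer
```

Returning the probability alongside the label is what lets a risk-averse application decline to answer in ambiguous cases.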
Supervised learning-Cont.

Real-world applications:
(i) Document classification and email spam filtering
• In document classification, the primary objective is to categorize documents like web
pages or email messages into predefined classes C, determining p(y=c∣x,D), where x
represents the document's text representation.
• A classic example is email spam filtering, where classes are typically labeled as spam (
y=1 ) or non-spam ( y=0).
• Most classifiers assume a fixed-size input vector x. To handle variable-length documents,
a common approach is the bag of words (BoW) representation.
Bag of Words (BoW):
• Documents are transformed into fixed-size feature vectors.
• Each vector element corresponds to a word from a predefined vocabulary.
• If a word appears in the document, its corresponding vector element is set to 1; otherwise,
it remains 0.
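The binary bag-of-words encoding described above can be sketched in a few lines. The vocabulary and the two example messages are hypothetical, and tokenisation is simplified to lower-casing and splitting on whitespace (punctuation handling is left out).

```python
def bag_of_words(document, vocabulary):
    """Map a document to a fixed-size 0/1 vector, one entry per vocabulary word."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


# Hypothetical vocabulary and messages, for illustration only.
vocabulary = ["free", "money", "meeting", "tomorrow", "winner"]
spam_vector = bag_of_words("You are a WINNER claim your free money", vocabulary)
ham_vector = bag_of_words("Project meeting moved to tomorrow", vocabulary)

print(spam_vector)  # [1, 1, 0, 0, 1]
print(ham_vector)   # [0, 0, 1, 1, 0]
```

Both documents map to vectors of the same length even though they contain different numbers of words, which is what a classifier expecting a fixed-size input x needs.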
Supervised learning-Cont.
(ii) Classifying flowers
• The goal is to classify iris flowers into three types: setosa, versicolor, and virginica,
based on four extracted features: sepal length, sepal width, petal length, and petal
width.
Supervised learning-Cont.
(iii) Image classification and handwriting recognition
• Image classification involves categorizing images
based on their content, such as indoor vs. outdoor
scenes, orientation (horizontal vs. vertical), or
presence of specific objects like dogs.
• MNIST (which stands for “Modified National
Institute of Standards and Technology”) is a widely used dataset for
handwritten digit recognition, containing 60,000
training images and 10,000 test images of digits (0-9).
• Each image is grayscale, sized 28x28 pixels, and
represents handwritten digits by various individuals.
• Images are represented as feature vectors, where each
pixel's grayscale value (ranging from 0 to 255) serves
as a feature.
Supervised learning-Cont.
(iv) Face detection and recognition
• Object detection, or localization, involves
identifying specific objects within an image. A
notable application is face detection, which is
crucial for tasks like autofocus in cameras and
privacy features in services like Google's
StreetView.
• One approach to face detection is the sliding
window detector method. It divides the image into
small overlapping patches at various locations,
scales, and orientations.
• Each patch is classified based on whether it
exhibits face-like textures or features. Locations
where the probability of containing a face is high
are identified as potential face locations.
• Modern digital cameras often integrate face
detection systems to assist with autofocus by
identifying and focusing on faces within the frame.
• Services like Google's StreetView use face
detection to automatically blur faces to protect
privacy.
Supervised learning-Cont.

2. Regression:
• Regression is just like classification except the response variable is continuous.

Here are some examples of real-world regression problems.


• Predict tomorrow’s stock market price given current market conditions and other possible side information.
• Predict the age of a viewer watching a given video on YouTube.
• Predict the location in 3D space of a robot arm end effector, given control signals (torques) sent to its various motors.
• Predict the amount of prostate specific antigen (PSA) in the body as a function of a number of different clinical
measurements.
• Predict the temperature at any location inside a building using weather data, time, door sensors, etc.
Unsupervised learning
• The goal is to discover “interesting structure” in the data; this is sometimes called
knowledge discovery.
• Supervised learning involves conditional density estimation p(yi∣xi,θ), where yi is the
target variable. In contrast, unsupervised learning focuses on unconditional density
estimation p(xi∣θ), where xi represents feature vectors.
• In unsupervised learning, xi is typically a vector of features, necessitating the creation
of multivariate probability models to capture dependencies between different
features.
• Supervised learning often uses simpler univariate probability models with
input-dependent parameters, focusing on predicting a single variable yi. This
simplification is not applicable in unsupervised settings due to the absence of labeled
output.
• It is more widely applicable than supervised learning since it does not require costly
and often sparse labeled data, making it feasible for modeling complex systems
where labeled data is limited or unavailable.
Unsupervised learning-Cont.

1. Discovering clusters:

•Clustering involves grouping data points into clusters based on similarities in their features,
without predefined labels.
•The goal is to estimate the distribution p(K∣D) over the number of clusters K, indicating
the presence of subgroups within the data.
•Model selection in clustering aims to determine the optimal number of clusters K∗, often
approximated by the mode of p(K∣D). Unlike supervised learning where classes are
predefined, unsupervised learning allows flexibility in choosing the number of clusters that
best represent the underlying structure of the data.
•Each data point i is assigned to a cluster zi∈{1,…,K} based on the probability
p(zi=k∣xi,D), where xi is the feature vector of the data point.

•Assignments zi∗ are inferred to determine the cluster membership of each data point,
illustrated by different colors representing clusters in visualizations.
Unsupervised learning-Cont.

Applications of Clustering:
•Astronomy: Clustering methods like Autoclass have been used to discover new
types of stars based on astrophysical measurements.
•E-commerce: Clustering users based on purchasing or web-surfing behavior allows
for targeted advertising and personalized recommendations.
•Biology: Clustering flow-cytometry data helps identify different sub-populations of
cells, aiding in biological research such as understanding disease mechanisms.
Unsupervised learning-Cont.
2. Discovering latent factors:
• Dimensionality reduction involves projecting high-dimensional data into a
lower-dimensional subspace that captures essential characteristics of the data.
• Despite high-dimensional appearances, data often exhibit variability across a smaller
number of latent factors. Dimensionality reduction helps in focusing on these key
factors, such as lighting, pose, or identity in face image modeling.
• PCA is a common approach for dimensionality reduction, resembling an unsupervised
form of multi-output linear regression.
• Given high-dimensional responses y, PCA infers latent low-dimensional factors z that
explain most of the variability in y.
Unsupervised learning-Cont.

Applications:

• In biology, it is common to use PCA to interpret gene microarray data, to account for
the fact that each measurement is usually the result of many genes which are correlated
in their behavior by the fact that they belong to different biological pathways.
• In natural language processing, it is common to use a variant of PCA called latent
semantic analysis for document retrieval.
• In signal processing (e.g., of acoustic or neural signals), it is common to use ICA (which
is a variant of PCA) to separate signals into their different sources.
• In computer graphics, it is common to project motion capture data to a low dimensional
space, and use it to create animations.
Unsupervised learning-Cont.
3. Discovering graph structure
• Learning sparse graphical models involves representing relationships between correlated
variables using a graph G, where nodes depict variables and edges denote direct
dependencies.
• This approach is pivotal in both discovering new knowledge and enhancing joint
probability density estimators.
• In systems biology, sparse graphical models are used to uncover relationships among
biological entities. For instance, graphs derived from protein phosphorylation data reveal
complex interactions within cellular networks.
• In fields like financial portfolio management, sparse graphs help model covariance
between stocks for better prediction and decision-making.
• Applications extend to traffic prediction systems, such as JamBayes, which leverage
learned graphical models to forecast traffic flow dynamics.
Unsupervised learning-Cont.
4. Matrix completion
• Sometimes we have missing data, that is, variables whose values are unknown. For example, we might have
conducted a survey, and some people might not have answered certain questions.
• The corresponding design matrix will then have “holes” in it; these missing entries are often represented by
NaN, which stands for “not a number”. The goal of imputation is to infer plausible values for the missing
entries. This is sometimes called matrix completion.
• Image Inpainting: Technique to fill in missing parts of images due to scratches or occlusions, achieved by
modeling joint probability of pixels from clean images.
• Collaborative Filtering: Predicting user preferences for items (like movies) based on sparse ratings matrices,
aiming to fill in missing ratings for better recommendation systems.
• Market basket analysis:
❖ Involves examining a large, sparse binary matrix where columns represent items/products and rows
represent transactions.
❖ Each entry in the matrix indicates whether an item was purchased in a specific transaction. By analyzing
correlations among items often bought together, predictions can be made about additional items a consumer
might buy based on partial transaction data.
❖ This technique is also applicable in other domains, such as predicting file dependencies in software systems.
❖ Common methods for market basket analysis include frequent itemset mining, which generates association
rules, and probabilistic modeling, which fits a joint density model to the data.
❖ Data mining emphasizes interpretability of models, whereas machine learning focuses on model accuracy.
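To make the market basket idea concrete, here is a minimal sketch that estimates how likely one item is to be bought given that another is in the basket. Each transaction is stored as the set of items whose entry in the binary matrix is 1. The transactions are invented, and the simple counting estimate (the "confidence" of an association rule) is only one of the approaches mentioned above.

```python
def conditional_purchase_prob(transactions, given, target):
    """Estimate p(target bought | given bought) by counting co-occurrences."""
    with_given = [t for t in transactions if given in t]
    if not with_given:
        return None  # the conditioning item never appears
    return sum(1 for t in with_given if target in t) / len(with_given)


# Hypothetical transactions: each set lists the nonzero columns of one row.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
    {"bread", "butter", "jam"},
]

p_butter_given_bread = conditional_purchase_prob(transactions, "bread", "butter")
p_milk_given_bread = conditional_purchase_prob(transactions, "bread", "milk")
print(p_butter_given_bread, p_milk_given_bread)  # 0.75 0.5
```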
Polynomial Curve Fitting
• Consider the example of recognizing handwritten digits, illustrated in Figure 1.1.
• Each digit corresponds to a 28×28 pixel image and so can be represented by a vector x comprising 784 real
numbers.
• The goal is to build a machine that will take such a vector x as input and that will produce the identity of the digit
0…9 as the output.
• This is not a simple problem due to the wide variability of handwriting. It could be tackled using rules or
heuristics for distinguishing the digits based on the shapes of the strokes.
• But this leads to a proliferation of rules, and of exceptions to the rules, and so on, and in practice gives poor results.

Fig. 1.1 Hand-written digits


Polynomial Curve Fitting-Cont.

• Better results can be obtained by adopting a machine learning approach in which a large set of N digits {x1,..., xN},
called a training set, is used to tune the parameters of an adaptive model.
• The categories of the digits in the training set are known in advance (hand-labelling). We can express the
category of a digit using target vector t, which represents the identity of the corresponding digit. There is one
such target vector t for each digit image x.
• The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit
image x as input and that generates an output vector y, encoded in the same way as the target vectors.
• The precise form of the function y(x) is determined during the training phase (learning phase) on the basis of the
training data.
• Once the model is trained it can then determine the identity of new digit images (a test set). The ability to
categorize correctly new examples that differ from those used for training is known as generalization (goal of
pattern recognition).
• The original input variables are pre-processed to transform them into some new space of variables where the
pattern recognition problem will be easier to solve.
Polynomial Curve Fitting-Cont.
• Example: In the digit recognition problem, the images of the digits are translated and scaled so that each digit is
contained within a box of a fixed size. This greatly reduces the variability within each digit class, because the
location and scale of all the digits are now the same, which makes it much easier for a subsequent pattern
recognition algorithm to distinguish between the different classes. This pre-processing is sometimes called feature
extraction.
• Pre-processing might also be performed to speed up computation. Example: if the goal is real-time face detection
in a high-resolution video stream, the computer must handle huge numbers of pixels per second.
• Instead, the aim is to find useful features that are fast to compute (a form of dimensionality reduction) and yet preserve useful
discriminatory information enabling faces to be distinguished from non-faces.
• These features are then used as the inputs to the pattern recognition algorithm.
• Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as
supervised learning problems.
• Ex: Classification - the digit recognition assigns each input vector to one of a finite number of discrete categories
• Regression -the desired output consists of one or more continuous variables
• Ex: prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations
of reactants, the temperature, and the pressure.
Polynomial Curve Fitting-Cont.

• Unsupervised learning problems: The training data consists of a set of input vectors x without any corresponding
target values.
• Clustering - to find groups of similar examples within the data.
• Density estimation - to determine the distribution of data within the input space.
• Visualization - to project the data from a high-dimensional space down to two or three dimensions.

Fig. 1.2 Plot of a training data set of N=10 points (blue circles), each comprising an observation of the input variable
x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the
data. The goal is to predict the value of t for some new value of x, without knowledge of the green curve.
Polynomial Curve Fitting-Cont.

• Consider a simple regression problem. Suppose we observe a real-valued input variable x and wish to use this observation to predict
the value of a real-valued target variable t.
• For example, the data is generated from the function sin(2πx) with random noise included in the
target values. Now suppose that we are given a training set comprising N observations of x, written
x ≡ (x1,...,xN)ᵀ, together with corresponding observations of the values of t, denoted t ≡ (t1,...,tN)ᵀ.
• Figure 1.2 shows a plot of a training set comprising N = 10 data points. The input data set x was
generated by choosing values of xn, for n = 1,...,N, spaced uniformly in range [0, 1], and the target
data set t was obtained by first computing the corresponding values of the function sin(2πx) and then
adding a small level of random noise having a Gaussian distribution to each such point in order to
obtain the corresponding value tn.
• In this way, we capture a property of many real data sets, namely that they possess an underlying
regularity, which we wish to learn, but that individual observations are corrupted by random noise.
Polynomial Curve Fitting-Cont.
• This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay
but more typically is due to there being sources of variability that are unobserved.
• The goal is to exploit this training set to make predictions of the value t of the target variable for some
new value x of the input variable.
• The observed data are corrupted with noise, and for a given x there is uncertainty as to the appropriate
value for t.
• Probability theory provides a framework for expressing such uncertainty in a precise and quantitative
manner.
• Decision theory allows us to exploit this probabilistic representation in order to make predictions that are
optimal according to appropriate criteria.
• Consider a simple approach based on curve fitting. The data can be fit using a polynomial function of
the form
$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j \qquad (1.1)$$
where M is the order of the polynomial. The polynomial coefficients w0,...,wM are collectively denoted
by the vector w.
Polynomial Curve Fitting-Cont.

• Although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w.
• Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are
called linear models.
• The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done
by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w,
and the training set data points.
• The error function is given by the sum of the squares of the errors between the predictions y(xn, w) for each data
point xn and the corresponding target values tn, so that we minimize
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 \qquad (1.2)$$
where the factor of 1/2 is included for later convenience.


• The error function is a nonnegative quantity that would be zero if, and only if, the function y(x, w) were to pass
exactly through each training data point. The geometrical interpretation of the sum-of-squares error function is
illustrated in Figure 1.3.
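The fitting procedure can be sketched in plain Python. Because y(x, w) is linear in w, minimising the sum-of-squares error reduces to solving a small linear system (the normal equations). The sketch below generates N = 10 noisy samples of sin(2πx) as in the running example; the noise standard deviation of 0.3, the random seed, and all function names are assumptions made for illustration. Solving the normal equations directly is adequate for small M but becomes numerically fragile for high-order polynomials.

```python
import math
import random


def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (aug[i][n] - tail) / aug[i][i]
    return w


def fit_polynomial(xs, ts, M):
    """Minimise E(w) by solving the normal equations (Phi^T Phi) w = Phi^T t."""
    Phi = [[x ** j for j in range(M + 1)] for x in xs]
    A = [[sum(row[i] * row[j] for row in Phi) for j in range(M + 1)]
         for i in range(M + 1)]
    b = [sum(row[i] * t for row, t in zip(Phi, ts)) for i in range(M + 1)]
    return solve(A, b)


def predict(x, w):
    """Evaluate y(x, w) = sum_j w_j x^j."""
    return sum(wj * x ** j for j, wj in enumerate(w))


def sum_of_squares_error(xs, ts, w):
    """E(w) = 1/2 * sum_n (y(x_n, w) - t_n)^2."""
    return 0.5 * sum((predict(x, w) - t) ** 2 for x, t in zip(xs, ts))


random.seed(0)                                   # assumed seed, for repeatability
N = 10
xs = [n / (N - 1) for n in range(N)]             # uniformly spaced in [0, 1]
ts = [math.sin(2 * math.pi * x) + random.gauss(0, 0.3) for x in xs]

w0 = fit_polynomial(xs, ts, M=0)                 # constant fit
w3 = fit_polynomial(xs, ts, M=3)                 # cubic fit
print(sum_of_squares_error(xs, ts, w0), sum_of_squares_error(xs, ts, w3))
```

On the training set the cubic achieves a smaller error than the constant, as expected, since the M = 0 model is a special case of the M = 3 model.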
Polynomial Curve Fitting-Cont.

Fig. 1.3 The error function (1.2) corresponds to (one half of) the sum of the squares of
the displacements (vertical green bars) of each data point from the function y(x,w)
Polynomial Curve Fitting-Cont.

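It is sometimes more convenient to use the root-mean-square (RMS) error defined by
$$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^\star) / N} \qquad (1.3)$$
where w⋆ is the coefficient vector that minimizes E(w). The division by N allows data sets of different sizes to be compared on an equal footing, and the square root puts the error on the same scale as the target variable t.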
Polynomial Curve Fitting-Cont.

Fig. 1.5 Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent
test set for various values of M
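As a small illustration of the RMS error, the sketch below applies the standard definition E_RMS = sqrt(2E/N), where E is the halved sum of squared errors and N is the number of points. It takes predictions and targets directly, so the same function can be applied to a training set or a test set. The three example values are made up.

```python
import math


def rms_error(predictions, targets):
    """E_RMS = sqrt(2 E / N), with E = 1/2 * sum of squared errors."""
    N = len(targets)
    E = 0.5 * sum((y - t) ** 2 for y, t in zip(predictions, targets))
    return math.sqrt(2 * E / N)


# Hypothetical predictions and targets: two exact, one off by 2.
example_rms = rms_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(example_rms)  # sqrt(4/3), roughly 1.155
```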
Polynomial Curve Fitting-Cont.

Table 1.1.
Polynomial Curve Fitting-Cont.

Fig. 1.8 Graph of the root-mean-square error (1.3) versus ln λ for the M=9 polynomial
Probability Theory
What is probability?
• We are all familiar with the phrase “the probability that a coin will land heads is 0.5”.
• There are actually at least two different interpretations of probability.
• One is called the frequentist interpretation. In this view, probabilities represent long run frequencies of
events. For example, the above statement means that, if we flip the coin many times, we expect it to land
heads about half the time.
• The other interpretation is called the Bayesian interpretation of probability. In this view, probability is
used to quantify our uncertainty about something; hence it is fundamentally related to information rather
than repeated trials.
• One big advantage of the Bayesian interpretation is that it can be used to model our uncertainty about
events that do not have long term frequencies.
Eg:
• To compute the probability that a received email message is spam.
Discrete Random Variables
• The expression p(A) denotes the probability that the event A is true. For example, A might be the logical expression
“it will rain tomorrow”.
• We require that 0 ≤ p(A) ≤ 1, where p(A)=0 means the event definitely will not happen, and p(A)=1 means the
event definitely will happen.
• We write p(Ā) to denote the probability of the event not A; this is defined to be p(Ā) = 1 − p(A).
• We will often write A = 1 to mean the event A is true, and A = 0 to mean the event A is false.
• We can extend the notion of binary events by defining a discrete random variable X, which can take on any value
from a finite or countably infinite set X.
• We denote the probability of the event that X = x by p(X = x), or just p(x) for short. Here p() is called a probability
mass function or pmf.
• This satisfies the properties 0 ≤ p(x) ≤ 1 and Σ_{x∈X} p(x) = 1.
• Figure 2.1 shows two pmf’s defined on the finite state space X = {1, 2, 3, 4, 5}.
• On the left we have a uniform distribution, p(x)=1/5, and on the right, we have a degenerate distribution,
p(x) = I(x = 1), where I() is the binary indicator function.
• This distribution represents the fact that X is always equal to the value 1, in other words, it is a constant.
Fundamental Rules
• Probability of a union of two events
Given two events, A and B, we define the probability of A or B as follows:
$$p(A \lor B) = p(A) + p(B) - p(A \land B)$$
which reduces to p(A) + p(B) if A and B are mutually exclusive.
• Joint probabilities
We define the probability of the joint event A and B as follows:
$$p(A, B) = p(A \land B) = p(A \mid B)\, p(B)$$
This is sometimes called the product rule.
Given a joint distribution on two events p(A, B), we define the marginal distribution as follows:
$$p(A) = \sum_{b} p(A, B = b) = \sum_{b} p(A \mid B = b)\, p(B = b)$$
where we are summing over all possible states of B.
We can define p(B) similarly. This is sometimes called the sum rule or the rule of total probability.
The product rule can be applied multiple times to yield the chain rule of probability:
$$p(X_{1:D}) = p(X_1)\, p(X_2 \mid X_1)\, p(X_3 \mid X_1, X_2) \cdots p(X_D \mid X_{1:D-1})$$
where we introduce the Matlab-like notation 1 : D to denote the set {1, 2,...,D}.
• Conditional Probability
We define the conditional probability of event A, given that event B is true, as follows:
$$p(A \mid B) = \frac{p(A, B)}{p(B)} \quad \text{if } p(B) > 0$$
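The union, sum, and product rules can be checked numerically on a small joint distribution. The table below over weather (A) and sky (B) is invented for illustration; any non-negative table summing to 1 would do.

```python
# Hypothetical joint distribution p(A, B); the entries sum to 1.
joint = {
    ("rain", "cloudy"): 0.30,
    ("rain", "clear"): 0.05,
    ("dry", "cloudy"): 0.25,
    ("dry", "clear"): 0.40,
}


def marginal_a(joint, a):
    """Sum rule: p(A=a) = sum over b of p(A=a, B=b)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)


def marginal_b(joint, b):
    """Sum rule: p(B=b) = sum over a of p(A=a, B=b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)


def conditional_a_given_b(joint, a, b):
    """p(A=a | B=b) = p(A=a, B=b) / p(B=b), defined when p(B=b) > 0."""
    return joint[(a, b)] / marginal_b(joint, b)


p_rain = marginal_a(joint, "rain")
p_cloudy = marginal_b(joint, "cloudy")
p_rain_given_cloudy = conditional_a_given_b(joint, "rain", "cloudy")

# Product rule: p(A, B) = p(A | B) p(B) recovers the joint entry.
reconstructed_joint = p_rain_given_cloudy * p_cloudy

# Union rule: p(A or B) = p(A) + p(B) - p(A and B).
p_rain_or_cloudy = p_rain + p_cloudy - joint[("rain", "cloudy")]

print(p_rain, p_cloudy, p_rain_given_cloudy, p_rain_or_cloudy)
```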
Bayes Rule
Combining the definition of conditional probability with the product and sum rules yields Bayes rule, also called Bayes
Theorem:
$$p(X = x \mid Y = y) = \frac{p(X = x, Y = y)}{p(Y = y)} = \frac{p(X = x)\, p(Y = y \mid X = x)}{\sum_{x'} p(X = x')\, p(Y = y \mid X = x')}$$
1. Example: medical diagnosis
As an example of how to use this rule, consider the following medical diagnosis problem. Suppose you are a woman in your
40s, and you decide to have a medical test for breast cancer called a mammogram. If the test is positive, what is the
probability you have cancer? That obviously depends on how reliable the test is. Suppose you are told the test has a
sensitivity of 80%, which means, if you have cancer, the test will be positive with probability 0.8. In other words,
$$p(x = 1 \mid y = 1) = 0.8$$
where x=1 is the event the mammogram is positive, and y=1 is the event you have breast cancer. Many people conclude
they are therefore 80% likely to have cancer.
But this is false! It ignores the prior probability of having breast cancer, which fortunately is quite low:
$$p(y = 1) = 0.004$$
Ignoring this prior is called the base rate fallacy. We also need to take into account the fact that the test may be a
false positive or false alarm. Unfortunately, such false positives are quite likely (with current screening technology):
$$p(x = 1 \mid y = 0) = 0.1$$
Combining the three terms using Bayes rule, we can compute the correct answer as follows:
$$p(y = 1 \mid x = 1) = \frac{p(x = 1 \mid y = 1)\, p(y = 1)}{p(x = 1 \mid y = 1)\, p(y = 1) + p(x = 1 \mid y = 0)\, p(y = 0)} = \frac{0.8 \times 0.004}{0.8 \times 0.004 + 0.1 \times 0.996} \approx 0.031$$
where p(y=0)=1−p(y=1)=0.996. In other words, if you test positive, you only have about a 3% chance of
actually having breast cancer.
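The mammogram calculation can be reproduced in a few lines. The sensitivity of 0.8 and the prior of 0.004 (equivalently p(y=0) = 0.996) come from the example; the false-positive rate of 0.1 is taken here as an assumption, chosen because it is consistent with the roughly 3% answer quoted in the example.

```python
def posterior_given_positive(sensitivity, false_positive_rate, prior):
    """Bayes rule for a binary test: p(y=1 | x=1)."""
    numerator = sensitivity * prior                        # p(x=1 | y=1) p(y=1)
    evidence = numerator + false_positive_rate * (1 - prior)  # p(x=1)
    return numerator / evidence


p_cancer_given_positive = posterior_given_positive(
    sensitivity=0.8,           # p(x=1 | y=1), from the example
    false_positive_rate=0.1,   # p(x=1 | y=0), assumed value
    prior=0.004,               # p(y=1), from p(y=0) = 0.996
)
print(round(p_cancer_given_positive, 3))  # about 0.031
```

Changing the prior while holding the test characteristics fixed shows how strongly the answer depends on the base rate.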
Bayes Rule(Contd..)
2. Example: Generative classifiers

We can generalize the medical diagnosis example to classify feature vectors x of arbitrary type as follows:
$$p(y = c \mid x) = \frac{p(y = c)\, p(x \mid y = c)}{\sum_{c'} p(y = c')\, p(x \mid y = c')}$$
This is called a generative classifier, since it specifies how to generate the data using the class conditional density
p(x|y=c) and the class prior p(y=c). An alternative approach is to directly fit the class posterior, p(y=c|x); this is known
as a discriminative classifier.
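A generative classifier can be sketched for a single real-valued feature by choosing a class prior p(y=c) and a class-conditional density p(x|y=c) for each class and then applying Bayes rule. The Gaussian class-conditionals, their parameters, and the two class names below are assumptions for illustration; they are not fitted to any data.

```python
from statistics import NormalDist


def class_posterior(x, priors, class_conditionals):
    """p(y=c | x) is proportional to p(y=c) p(x | y=c); normalise over classes."""
    unnormalised = {c: priors[c] * class_conditionals[c].pdf(x) for c in priors}
    evidence = sum(unnormalised.values())
    return {c: value / evidence for c, value in unnormalised.items()}


# Assumed model: petal length (cm) is Gaussian within each class.
priors = {"setosa": 0.5, "versicolor": 0.5}
class_conditionals = {
    "setosa": NormalDist(mu=1.5, sigma=0.2),
    "versicolor": NormalDist(mu=4.3, sigma=0.5),
}

posterior = class_posterior(1.7, priors, class_conditionals)
print(max(posterior, key=posterior.get))  # the most probable class for x = 1.7
```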
Independence and conditional independence
• X and Y are unconditionally independent or marginally independent, denoted X ⊥ Y, if we can represent the
joint as the product of the two marginals:
$$X \perp Y \iff p(X, Y) = p(X)\, p(Y)$$
• A set of variables is mutually independent if the joint can be written as a product of marginals.
• Unconditional independence is rare, because most variables can influence most other variables.
Independence and conditional independence—Contd.

• X and Y are conditionally independent (CI) given Z if the conditional joint can be written as a product of
conditional marginals:
$$X \perp Y \mid Z \iff p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$$
• The assumption can be represented as a graph X−Z−Y, indicating that all dependencies between X and Y are
mediated through Z.
• For instance, the probability of rain tomorrow (X) is independent of whether the ground is wet today (Y) if we
know whether it is raining today (Z). This is because Z influences both X and Y, so once Z is known,
Y provides no further information about X, and vice versa.

• Another characterization of conditional independence (CI) is provided by the following theorem:
“X is conditionally independent of Y given Z (X ⊥ Y | Z) if and only if there exist functions g and h such that
$$p(x, y \mid z) = g(x, z)\, h(y, z)$$
for all values of x, y, and z where p(z)>0”.


Independence and conditional independence—Contd.

Example
• Let’s consider an example. Suppose A is the height of a child, and B is the
number of words that the child knows. It seems that when A is high, B is high
too.

• However, there is a single piece of information that makes A and B
independent once it is known. What would that be?

• The child’s age.

• The height and the number of words known by the child are NOT independent,
but they are conditionally independent given the child’s age.
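The height and vocabulary example can be illustrated with a small simulation. In the sketch below, both quantities depend on age plus independent noise; every number in the generating model (age range, slopes, noise levels, sample size, seed) is invented. Correlation is used as a rough proxy for dependence: it is high across all children but close to zero once age is held fixed.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


random.seed(1)  # assumed seed, for repeatability
children = []
for _ in range(9000):
    age = random.randint(2, 10)                  # Z: age in years
    height = 75 + 6 * age + random.gauss(0, 4)   # A depends only on age
    words = 600 * age + random.gauss(0, 400)     # B depends only on age
    children.append((age, height, words))

overall = pearson([h for _, h, _ in children], [w for _, _, w in children])

five_year_olds = [(h, w) for a, h, w in children if a == 5]
within_age = pearson([h for h, _ in five_year_olds],
                     [w for _, w in five_year_olds])

print(overall, within_age)  # strongly correlated overall, near zero at fixed age
```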
Random Variable
Introduction to Random Variables
• A random variable is a mathematical function that assigns a numerical value
to every possible outcome of a random experiment. It is called a
random variable because its value depends on the outcome of a random
experiment. Random variables are usually symbolized by capital letters, such
as X, Y, or Z.
• For example, let's consider the outcome of flipping a coin. If the coin lands
heads, we assign a value of 1 to the random variable X, and if it lands tails, we
assign a value of 0 to X. Therefore, X is a random variable that takes on two
possible values, 0 or 1, depending on the outcome of the coin flip. We can
represent the possible values of X using a probability distribution, which tells
us the probability of each possible value.
Random Variable- Contd.
For the case of tossing a coin, the probability distribution is as follows:
• P(X = 0) = 1 / 2 (since the probability of getting tails is 1 / 2)
• P(X = 1) = 1 / 2 (since the probability of getting heads is also 1 / 2)
• In general, a probability distribution for a random variable specifies the
probability of each possible value of the variable.
Random Variable- Contd.
• There are two specific types of Random variables:
1. Discrete Random variables
2. Continuous Random variables.
1. Discrete Random Variables:
• Discrete random variables take on a finite or countably infinite set of
possible values. In other words, the range of a discrete random variable is a
discrete set of numbers. For example, the number of heads obtained when
flipping a coin five times is a discrete random variable that can take on values
0, 1, 2, 3, 4, or 5.
• The probability distribution for a discrete random variable is called a
probability mass function (PMF). The PMF gives the probability that the
random variable takes on a particular value.
Random Variable- Contd.
Sample Space (S): This is the set of all possible outcomes when three
coins are tossed simultaneously.
S={HHH,HHT,HTH,HTT,THH,THT,TTH,TTT}
• Each outcome represents the result of the three coins, where 'H' stands
for heads and 'T' stands for tails.
Random Variable (X): The number of tails in each outcome is
considered as the random variable X.
Values of X:
• Here, x_i represents the number of tails in the i-th outcome. Listing the
number of tails for each outcome, in the order of the sample space above:
• 0, 1, 1, 2, 1, 2, 2, 3
• X therefore takes values in the set {0, 1, 2, 3}.
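The three-coin example can be reproduced by enumerating the sample space. This is a minimal sketch; the assumption that the coins are fair (so each of the 8 outcomes has probability 1/8) is added here in order to turn the counts into a PMF.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 2**3 = 8 outcomes, generated in the order HHH, HHT, HTH, ..., TTT
sample_space = ["".join(coins) for coins in product("HT", repeat=3)]

# The random variable X maps each outcome to its number of tails
tails_per_outcome = [outcome.count("T") for outcome in sample_space]

# Assuming fair coins, every outcome is equally likely (probability 1/8)
counts = Counter(tails_per_outcome)
pmf = {x: Fraction(n, len(sample_space)) for x, n in sorted(counts.items())}

for outcome, x in zip(sample_space, tails_per_outcome):
    print(outcome, "->", x)
print({x: str(p) for x, p in pmf.items()})
```

Several outcomes map to the same value of X, which is why the PMF has only four entries even though the sample space has eight outcomes.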
Continuous Random variables
• Continuous random variables take on any value in a continuous range of
possible values. In other words, the range of a continuous random variable is
an uncountable set of numbers.
• For example, the height of a person can be modeled as a continuous random
variable that can take on any value in a continuous range from zero to infinity.
• The probability distribution for a continuous random variable is called a
probability density function (PDF).
• Unlike the PMF, the PDF does not give the probability that the random
variable takes on a particular value, but rather the probability density at each
point in the range of possible values.
Continuous Random variables- Contd
• To find the probability that a continuous random variable X takes a value
in the interval [0.5,1] given the probability density function (PDF) f(x),
integrate the PDF over that interval: P(0.5 ≤ X ≤ 1) = ∫ f(x) dx, taken from x = 0.5 to x = 1.
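An interval probability of this kind is the area under the PDF over the interval. The sketch below approximates that area numerically. The density used, f(x) = 2x on [0, 1], is a hypothetical choice made only for illustration, and `integrate` is an illustrative midpoint-rule helper rather than a library function.

```python
def pdf(x):
    """Hypothetical PDF for illustration: f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

def integrate(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

prob = integrate(pdf, 0.5, 1.0)   # P(0.5 <= X <= 1)
total = integrate(pdf, 0.0, 1.0)  # area under the whole PDF, should be 1

print("P(0.5 <= X <= 1) ~", round(prob, 6))
print("total area       ~", round(total, 6))
```

For this assumed density the exact answer is 1² − 0.5² = 0.75, which the numerical estimate should match closely.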
Quantiles
• A quantile is a statistical measure that divides a data set into equal-sized, contiguous intervals, or
partitions.
• Quartiles: Divide data into four equal parts. The first quartile (Q1) is the 25th percentile, the second
quartile (Q2) is the median or 50th percentile, and the third quartile (Q3) is the 75th percentile.
• In the context of probability distributions, a quantile is a value below which a given percentage of the
distribution lies. It effectively divides the probability distribution into intervals with equal probabilities.
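Both senses of "quantile" described above can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the sample `data` is made up, and the `"inclusive"` interpolation method is one of several conventions for sample quartiles, so other tools may report slightly different Q1 and Q3 values.

```python
import statistics

# Made-up sample, used only to illustrate quartiles of a data set
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15, 18]

# n=4 asks for the three cut points that split the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
print("Q1, Q2 (median), Q3:", q1, q2, q3)

# Quantile of a probability distribution: the value below which a given
# fraction of the distribution lies (here for a standard normal)
standard_normal = statistics.NormalDist(mu=0.0, sigma=1.0)
z_975 = standard_normal.inv_cdf(0.975)
print("97.5th percentile of N(0, 1):", round(z_975, 4))
```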
Mean and variance
To find the expected value E(X):
Var(X) = 0.26
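As a separate illustration (not the example that produced the value quoted above), the sketch below computes the mean and variance of a discrete random variable from its PMF, using E(X) = Σ x·P(X = x) and Var(X) = Σ (x − E(X))²·P(X = x). It reuses the number of tails in three coin tosses and assumes fair coins; the helper names are hypothetical.

```python
from fractions import Fraction

# PMF of X = number of tails in three tosses, assuming fair coins
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def expectation(pmf):
    """E(X): probability-weighted average of the values."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X): probability-weighted average squared distance from the mean."""
    mean = expectation(pmf)
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

mean = expectation(pmf)
var = variance(pmf)
print("E(X)   =", mean)
print("Var(X) =", var)

# Equivalent shortcut: Var(X) = E(X^2) - E(X)^2
second_moment = sum(x * x * p for x, p in pmf.items())
print("E(X^2) - E(X)^2 =", second_moment - mean**2)
```

Using `Fraction` keeps the arithmetic exact, so the two variance formulas can be compared without floating-point rounding.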
Probability densities
• In probability theory and statistics, a probability density function (PDF) is a function that
describes the relative likelihood that a continuous random variable takes a value near a given
point. Unlike discrete random variables, which have probabilities assigned to specific outcomes,
continuous random variables are described using probability densities.
• The PDF provides a powerful tool for modeling and analyzing the behavior of continuous random
variables.
Probability Density Function (PDF)
Properties of PDF
Some commonly used PDFs in probability and statistics for modeling different types of data:
• Uniform distribution
• Normal distribution (Gaussian distribution)
• Exponential distribution
• Gamma distribution
• Beta distribution
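Three of the densities listed above have simple closed forms, sketched below with their standard textbook formulas. The parameter defaults, function names, and integration ranges are assumptions made for illustration. The check at the end reflects two defining properties of any PDF: it is never negative, and the total area under it is 1.

```python
import math

def uniform_pdf(x, a=0.0, b=1.0):
    """Uniform density on [a, b]: constant 1 / (b - a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal (Gaussian) density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def exponential_pdf(x, lam=1.0):
    """Exponential density with rate lam: lam * exp(-lam * x) for x >= 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def integrate(f, a, b, steps=200_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

# The finite ranges below capture essentially all of the probability mass
areas = {
    "uniform": integrate(uniform_pdf, 0.0, 1.0),
    "normal": integrate(normal_pdf, -10.0, 10.0),
    "exponential": integrate(exponential_pdf, 0.0, 50.0),
}
for name, area in areas.items():
    print(f"{name:12s} area under PDF ~ {area:.6f}")
```

Note that a density value can exceed 1 (for example, a uniform density on [0, 0.5] equals 2 everywhere on that interval); only the area, not the height, is a probability.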
Uniform distribution
Normal distribution
The normal distribution is the most widely used distribution for continuous random variables.
It approximately describes many natural phenomena, e.g. height, blood pressure, test scores, measurement error, and IQ scores.
Exponential distribution
Rate parameter (λ).
Expectation
• The expected value of a random variable X is denoted by E(X).
– It can be thought of as the “average” value attained by the random variable.
The variance of a Random Variable
Variance, Var(X)
– A non-negative quantity that measures the spread of the distribution of
the random variable about its mean value
– Larger values of the variance indicate that the distribution is more
spread out
Standard Deviation
– The positive square root of the variance
The variance of a Random Variable
– Suppose that the diameter of a metal cylinder has a p.d.f
What are the variance and standard deviation of the metal cylinder diameters?
COVARIANCE
• Covariance is a measure of the association or dependence between
two random variables X and Y.
• Covariance can be either positive or negative. (Variance is never negative.)
Suppose that X and Y have the following joint probability mass function, in which
the six central cells give the discrete joint probabilities f(x,y) of the six hypothetical
realizations (x, y) ∈ S = {(5,8), (6,8), (7,8), (5,9), (6,9), (7,9)}
COVARIANCE: Intuition
Properties
Variance vs. Covariance vs. Correlation
Variance tells us how much a quantity varies w.r.t. its mean. It is the spread of data
around the mean value.
A random variable is compared against itself.
Var(X) = E(X·X) − E(X)·E(X)

Covariance tells us the direction in which two quantities vary with each other.
Two random variables are compared against each other.
Cov(X,Y) = E(X·Y) − E(X)·E(Y)

Correlation shows us both the direction and the magnitude of how two quantities
vary with each other.
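The three quantities can be computed side by side from a joint PMF. The sketch below is illustrative: the probabilities in `joint_pmf` are assumed values placed on the support {5, 6, 7} × {8, 9} and are not taken from the slides. Correlation is computed as Cov(X,Y) divided by the product of the two standard deviations, which rescales covariance to lie between −1 and 1.

```python
import math

# Assumed joint probabilities f(x, y), for illustration only; they sum to 1
joint_pmf = {
    (5, 8): 0.0, (6, 8): 0.4, (7, 8): 0.1,
    (5, 9): 0.3, (6, 9): 0.0, (7, 9): 0.2,
}

def expect(g):
    """E[g(X, Y)]: sum of g(x, y) * f(x, y) over the joint support."""
    return sum(g(x, y) * p for (x, y), p in joint_pmf.items())

mean_x = expect(lambda x, y: x)
mean_y = expect(lambda x, y: y)

var_x = expect(lambda x, y: x * x) - mean_x**2      # Var(X) = E(X·X) − E(X)·E(X)
var_y = expect(lambda x, y: y * y) - mean_y**2
cov_xy = expect(lambda x, y: x * y) - mean_x * mean_y  # Cov(X,Y) = E(X·Y) − E(X)·E(Y)
corr_xy = cov_xy / math.sqrt(var_x * var_y)

print("E(X), E(Y):    ", round(mean_x, 4), round(mean_y, 4))
print("Var(X), Var(Y):", round(var_x, 4), round(var_y, 4))
print("Cov(X,Y):      ", round(cov_xy, 4))
print("Corr(X,Y):     ", round(corr_xy, 4))
```

With these assumed probabilities the covariance comes out negative, meaning larger values of X tend to occur with smaller values of Y; the correlation shows that this tendency is fairly weak.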
Summary:
