6 Multivariate Gaussian
• Proof:
• Follows from the transformation X := Y – µ, i.e., X = g⁻¹(Y) for g(X) := X + µ
Multivariate Gaussian – Composite Transformations
• If Y is multivariate Gaussian,
then Z := BY + c is multivariate Gaussian,
where matrix B is square invertible
• Proof:
• Because Y is multivariate Gaussian, we have Y = AW + µ, where A is invertible
• Thus, Z = B (AW + µ) + c = (BA)W + (Bµ + c), where the matrix BA is invertible (a product of two invertible matrices)
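• To make this concrete, a minimal NumPy sketch (illustrative choices of A, B, µ, c): draw standard-normal W, form Y = AW + µ and Z = BY + c, and check that the sample mean and covariance of Z match Bµ + c and (BA)(BA)T, as the derivation above predicts.

    import numpy as np

    rng = np.random.default_rng(0)
    D, N = 2, 200_000
    A = np.array([[2.0, 0.5], [0.0, 1.0]])   # invertible (illustrative)
    B = np.array([[1.0, -1.0], [1.0, 1.0]])  # invertible (illustrative)
    mu = np.array([1.0, -2.0])
    c = np.array([0.5, 0.5])

    W = rng.standard_normal((D, N))          # i.i.d. standard Gaussian samples
    Y = A @ W + mu[:, None]                  # Y = A W + mu
    Z = B @ Y + c[:, None]                   # Z = B Y + c = (BA) W + (B mu + c)

    BA = B @ A
    print(np.cov(Z))                         # sample covariance of Z
    print(BA @ BA.T)                         # predicted covariance (BA)(BA)^T
    print(Z.mean(axis=1), B @ mu + c)        # sample mean vs. predicted mean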
Multivariate Statistics – Mean and Covariance
Multivariate Statistics – Mean
• For a general random (column) vector X, the mean vector is
EP(X)[X] = a (column) vector whose i-th component is EP(X)[Xi] = EP(Xi)[Xi]
Multivariate Statistics – Covariance
• Covariance matrix for a general random (column) vector Y is defined
as:
C := EP(Y) [ (Y – E[Y]) (Y – E[Y])T ]
• So, Cij = EP(Y) [ (Yi – E[Yi]) (Yj – E[Yj]) ] = EP(Yi,Yj) [ (Yi – E[Yi]) (Yj – E[Yj]) ] = Cov (Yi, Yj)
Multivariate Statistics – Covariance
• More properties of covariance matrix C (for a general random vector X)
• Notes:
• Take the expectation of first component of AW, i.e.,
EP(W) [ A11W1 + A12W2 + … + A1DWD ]
= A11 EP(W) [W1] + A12 EP(W) [W2] + … + A1D EP(W) [WD]
• So, for the whole vector: EP(W) [AW] = A EP(W) [W]
Multivariate Gaussian – Covariance
• The covariance matrix of X := AW + µ is AAT, because Cov(X) = E[ (AW)(AW)T ] = A E[WWT] AT = A I AT = AAT (W has zero mean and identity covariance)
Multivariate Gaussian – Different Cases
Multivariate Gaussian – Special Cases
• Diagonal matrix
• Orthogonal matrix
• Definition: Real square matrix Q whose columns and rows are orthogonal unit vectors (i.e., orthonormal vectors): Q QT = QT Q = Identity matrix
• Determinant det(Q) is either +1 or -1
• “orthogonal” is an over-used term in mathematics
• Rotation matrix
• When det(Q) = +1, then Q is a rotation matrix
• When det(Q) = -1, then Q models either a reflection (called an improper rotation) or a combination of rotation and reflection
• “Rotation” is an over-used term (sometimes includes improper rotations) in
mathematics
• Reflection matrix
• An orthogonal matrix that is also symmetric
Multivariate Gaussian – Special Cases
• Property (Rotation and/or Reflection):
If µ = 0 and A = R, where R is orthogonal,
then Y := RW has the same PDF as W: P(y) = (2π)^(-D/2) exp( - yTy / 2 )
• Proof:
• Transformation of random vectors
• |det(R)| = 1
• Inverse transformation is W = RT Y (and || RT y || = || y || because R is orthogonal)
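• Putting the three steps together (change of variables for Y = g(W) := RW, |det(R)| = 1, and W = RT Y), a short worked version of the argument:
\[
P_Y(y) \;=\; P_W\!\left(R^T y\right)\,\bigl|\det(R^T)\bigr|
\;=\; (2\pi)^{-D/2}\exp\!\left(-\tfrac{1}{2}\, y^T R\, R^T y\right)
\;=\; (2\pi)^{-D/2}\exp\!\left(-\tfrac{1}{2}\, y^T y\right),
\]
so Y has exactly the same (standard) multivariate Gaussian PDF as W.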
Multivariate Gaussian – Special Cases
• Property (Scaling):
If µ = 0 and A = S, a square diagonal matrix with positive entries on the diagonal,
then Y := SW has PDF: P(y) = (2π)^(-D/2) det(S)^(-1) exp( - yT S^(-2) y / 2 ), an axis-aligned Gaussian with covariance SST = S2
Multivariate Gaussian – Special Cases
• Property (first Scaling, and then Rotation and/or Reflection):
If µ = 0 and A = RS (R orthogonal, S diagonal with positive diagonal entries),
then Y := RSW has PDF: P(y) = (2π)^(-D/2) det(C)^(-1/2) exp( - yT C^(-1) y / 2 ), where C := RS2RT
Multivariate Gaussian – General Case
• If X := A W is a multivariate Gaussian,
then Y := X + µ is a multivariate Gaussian with mean µ and covariance C := AAT
• Then c2 = (Ax)T(Ax) = xT(AAx) (using the symmetry of A)
• The RHS can take the value c2 only if x is parallel to AAx (and then AAx = c2 x)
• Then we can claim that:
1. Either Ax is parallel to x; so x is a real-valued eig.vector of A (orthogonal to v1) (eig.value ±c)
2. Or (Ax + cx) is a real-valued eig.vector of A (orthogonal to v1, as both Ax and x are) with eig.value c,
because A (Ax + cx) = AAx + cAx = c2x + cAx = c (cx + Ax)
Linear Algebra – Eigen Decomposition
• Every NxN real symmetric positive definite (SPD) matrix M (e.g., a covariance matrix C) has an eigen-decomposition with all eigenvalues positive
• Proof:
• Let eigen decomposition for real symmetric matrix M be: M = Q D QT
• Where Q is real orthogonal and D is real diagonal
• Then, vT M v = vT Q D QT v = uT D u, where u := QT v (simply “rotated” v)
• For a PD matrix M, vTMv must be positive for every non-zero v
• So, uTDu must be positive for every non-zero u
• So, all values on diagonal of D must be positive
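• A minimal NumPy check of this statement, with an SPD matrix built as M = AAT from an illustrative invertible A; np.linalg.eigh returns the eigen-decomposition of a symmetric matrix, and all eigenvalues come out positive.

    import numpy as np

    A = np.array([[2.0, 0.3], [0.1, 1.5]])              # illustrative invertible matrix
    M = A @ A.T                                          # symmetric positive definite

    eigvals, Q = np.linalg.eigh(M)                       # M = Q diag(eigvals) Q^T
    print(eigvals)                                       # all strictly positive
    print(np.allclose(Q @ np.diag(eigvals) @ Q.T, M))    # reconstructs M
    print(np.allclose(Q.T @ Q, np.eye(2)))               # Q is orthogonal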
Multivariate Gaussian – Level Sets
• If X = A W is a multivariate Gaussian,
then Y = X + µ is a multivariate Gaussian with mean µ and covariance AAT; the level sets of its PDF are hyper-ellipsoids centered at µ
peterroelants.github.io/posts/multivariate-normal-primer/
Multivariate Gaussian – Level Sets
• If X = A W is a multivariate Gaussian, then Y = X + µ is a multivariate Gaussian with mean µ and covariance AAT
• More examples
• https://en.wikipedia.org/wiki/Normally_distributed_and_uncorrelated_does_not_imply_independent
Multivariate Gaussian – Marginals
• Marginal PDFs being Gaussian doesn't imply that the joint PDF is multivariate Gaussian
• In the referenced figure, only the top-row left, top-row middle, and bottom-row left panels are bivariate Gaussian
• All marginals are Gaussian
Multivariate Gaussian – Conditionals
• Conditional PDFs
• If multivariate Gaussian X
is partitioned into X1 and X2,
then conditional PDF P(X1|X2=x2)
is also a multivariate Gaussian
• P(X1|X2=x2) = P(X1,X2=x2) / P(X2=x2)
Multivariate Gaussian – Conditionals
• “Conditional” PDFs
• What about this way of slicing ?
• Yes, the profile on the line has a Gaussian shape (it is proportional to a 1-D Gaussian PDF)
Multivariate Gaussian – ML Estimation
• Data: {y1, …, yN}
• Take log-likelihood function
• ML estimate (MLE) for mean vector (= sample mean)
• Take the derivative with respect to (w.r.t.) µ, set it to zero, and solve.
• Key derivative: ∂/∂µ [ (µ - x)T C-1 (µ - x) ] = 2 C-1 (µ - x)
• Quadratic form: aTBa = ΣiΣj ai aj Bij
• Partial derivative w.r.t. ak: Σj aj Bkj + Σi ai Bik = 2 Σj Bkj aj (because B is symmetric) = 2 (k-th row of B) · a
• Stacking over k: ∂(aTBa)/∂a = 2Ba; with B = C-1 and a = µ - x, this gives the key derivative above
• Scalar function, say f(a), of multiple scalar variables in column-vector ‘a’
• Jacobian df/da will be a row vector of the same length as ‘a’
• Change in function value (df) = derivative (df/da) * change in variable (da)
• Can be reshaped/rearranged into a column vector of the same shape as ‘a’
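• Using the derivative above (with B = C-1 and a = µ - yn), a short worked version of the mean-MLE step (the log det and constant terms of the log-likelihood do not depend on µ):
\[
\frac{\partial}{\partial \mu} \sum_{n=1}^{N} \tfrac{1}{2}(\mu - y_n)^T C^{-1} (\mu - y_n)
\;=\; \sum_{n=1}^{N} C^{-1}(\mu - y_n) \;=\; 0
\;\;\Longrightarrow\;\;
\hat{\mu} \;=\; \frac{1}{N}\sum_{n=1}^{N} y_n .
\]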
Multivariate Gaussian – ML Estimation
• Data: {y1, …, yN}
• Take log-likelihood function
• MLE for covariance matrix (= sample covariance; uncorrected/biased)
• Take the derivative w.r.t. C, set it to zero, and solve.
• Need partial derivatives w.r.t. Cij
• Scalar function, say f(C), of multiple scalar variables in C
• Consider a (column)-vectorized form of C
• Jacobian df/dC will be a row vector of the same length as (column)-vectorized C
• Can be reshaped/rearranged into a matrix of the same shape as C
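• A sketch of the covariance step, writing S := (1/N) Σn (yn - µ̂)(yn - µ̂)T and using two standard matrix-calculus identities for symmetric C, ∂(log det C)/∂C = C-1 and ∂ tr(C-1 S)/∂C = -C-1 S C-1 (see the references on the next slide):
\[
\log L(C) \;=\; -\frac{N}{2}\log\det C \;-\; \frac{N}{2}\,\mathrm{tr}\!\left(C^{-1} S\right) \;+\; \text{const},
\qquad
\frac{\partial \log L}{\partial C} \;=\; -\frac{N}{2} C^{-1} + \frac{N}{2}\, C^{-1} S\, C^{-1} \;=\; 0
\;\;\Longrightarrow\;\; \hat{C} \;=\; S .
\]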
Multivariate Gaussian – ML Estimation
• “Matrix Calculus”
• http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/calculus.html
• https://en.wikipedia.org/wiki/Matrix_calculus
• http://www.matrixcalculus.org/
Principal Component Analysis (PCA)
• Principal Component Analysis (PCA)
• What is it about ?
• What does it tell us about the distribution underlying the data ?
• What does it tell us about the distribution underlying the data,
when the data is known to have a multivariate Gaussian distribution ?
• Applications
Principal Component Analysis (PCA)
• “Modes of variation”
• Set of vectors (directions and magnitudes) that are used to depict the
variation in a population or sample, around the mean
https://statistics.laerd.com/spss-tutorials/pearsons-product-moment-correlation-using-spss-statistics.php
Principal Component Analysis (PCA)
• “Modes of variation”
• Set of vectors (directions and magnitudes) that are used to depict the
variation in a population or sample, around the mean
• Can we do it for the
distribution of images of each digit ?
Principal Component Analysis (PCA)
• Directions of maximal variance
• Consider a general multivariate random variable X with PDF P(X)
• We aren’t assuming it to be a Gaussian yet
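• A compact statement of the standard formulation (v constrained to unit norm, C the covariance matrix of X):
\[
v^\star \;=\; \arg\max_{\|v\|_2 = 1} \ \mathrm{Var}\!\left(v^T X\right)
\;=\; \arg\max_{\|v\|_2 = 1} \ v^T C\, v ,
\]
whose maximizer is a unit eigenvector of C for its largest eigenvalue, that eigenvalue being the attained variance.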
• When covariance matrix C is diagonal (sample mean at origin)
• Let d-th element on diagonal of C be Cdd
• Let d-th element in vector ‘v’ be vd
• What do the eigenvectors look like ? Compute them and see for yourself.
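• A quick NumPy check (illustrative diagonal covariance): for diagonal C the eigenvectors are the cardinal axes, and the largest diagonal entry gives the direction of maximal variance.

    import numpy as np

    C = np.diag([4.0, 1.0, 0.25])   # illustrative diagonal covariance
    eigvals, V = np.linalg.eigh(C)  # columns of V are eigenvectors
    print(eigvals)                  # 0.25, 1.0, 4.0 (ascending)
    print(V)                        # cardinal axes (identity up to sign and column order)

    v = np.array([0.6, 0.8, 0.0])   # any unit-norm vector
    print(v @ C @ v)                # v^T C v = sum_d Cdd vd^2  <=  max_d Cdd = 4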
Principal Component Analysis (PCA)
• Example 1: rng(0); N = 1e5; data = randn (2,N)
Principal Component Analysis (PCA)
• Example 2: rng(0); N = 1e5; data = rand (2,N)
Principal Component Analysis (PCA)
• Example 3: rng(0); N = 1e5; data = [[1 -1];[1 1]]/sqrt(2) * (rand (2,N))
Principal Component Analysis (PCA)
• Example 4: rng(0); N = 1e5; data = exprnd (1,[2,N])
Principal Component Analysis (PCA)
• Example 5: rng(0); N=1e5; data = [[1 -1];[1 1]]/sqrt(2) * exprnd (1,[2,N])
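• The examples above are MATLAB one-liners; a NumPy sketch of the same five experiments (approximate equivalents: randn → standard_normal, rand → uniform on [0,1), exprnd(1,...) → exponential with mean 1), printing the eigenvalues and eigenvectors of each sample covariance:

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000
    R = np.array([[1.0, -1.0], [1.0, 1.0]]) / np.sqrt(2.0)   # rotation by 45 degrees

    datasets = {
        "Example 1 (standard normal)":     rng.standard_normal((2, N)),
        "Example 2 (uniform)":             rng.random((2, N)),
        "Example 3 (rotated uniform)":     R @ rng.random((2, N)),
        "Example 4 (exponential)":         rng.exponential(1.0, (2, N)),
        "Example 5 (rotated exponential)": R @ rng.exponential(1.0, (2, N)),
    }

    for name, data in datasets.items():
        C = np.cov(data)                       # 2x2 sample covariance
        eigvals, eigvecs = np.linalg.eigh(C)   # principal directions and variances
        print(name)
        print("  eigenvalues :", np.round(eigvals, 4))
        print("  eigenvectors (columns):")
        print(np.round(eigvecs, 4))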
Principal Component Analysis (PCA)
• What happens to covariance matrix when we rotate the data ?
• In general
• Let CX := E[ (X - µX)(X - µX)T ] = E[XXT] - µX µXT
• Let Y := RX (for any invertible R)
• Then, µY := E[Y] = E[RX] = R E[X] = R µX
• Then, CY := E[ (Y - µY)(Y - µY)T ]
• = E[YYT] - µY µYT
• = E[ R X XT RT ] - R µX µXT RT
• = R ( E[XXT] - µX µXT ) RT
• = R CX RT
• If CX = aI and R is orthogonal, then CY = R (aI) RT = aI = CX
• Thus, eigenvalues don’t change, and eigenvectors can be any orthogonal basis
• Thus, directions of maximal variance aren’t unique
Principal Component Analysis (PCA)
• Spaces of maximal variance
• What if we want to find multi-D lower-dimensional spaces that maximize
“total dispersion/variance” ?
• Total dispersion/variance is empirical average of squared distance from mean
• When covariance matrix C is diagonal (and sample mean is at origin):
• Any orthogonal basis spanning the space spanned by the first 2 cardinal axes (the two with the largest diagonal entries Cdd) will be a solution
• Similar arguments will hold for lower-dimensional spaces of dimensions 3, 4, …, D-1
• Similar arguments will also hold for a general SPD covariance matrix C
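• A compact way to state the result above (standard; U is a D×k matrix whose orthonormal columns uj span the candidate k-dimensional subspace): the total dispersion captured by the subspace is
\[
\mathrm{tr}\!\left(U^T C\, U\right) \;=\; \sum_{j=1}^{k} u_j^T C\, u_j ,
\]
which is maximized by choosing the uj as eigenvectors of C for its k largest eigenvalues; any orthonormal basis of that same eigen-subspace attains the same value.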
Principal Component Analysis (PCA)
• PCA applied to data from a multivariate Gaussian distribution
• Consider X is multivariate Gaussian
• If X := AW + b, then:
• Principal modes of variation are directions given by
eigenvectors of covariance matrix C := AAT
• Principal modes of variation are along
axes of hyper-ellipsoids that are level sets of P(X)
• Variances along principal modes of variation are
the eigenvalues of C
• If X := RSW + b, then:
• Principal modes of variation are
column vectors of orthogonal matrix R
i.e., eigenvectors of C = RS2RT
• Variances along principal modes of variation are
the eigenvalues of C, i.e., diagonal elements in S2
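• A NumPy sketch of the X := RSW + b case (illustrative R, S, b): PCA on the samples recovers the columns of R as the principal modes (up to sign and ordering) and the diagonal of S2 as the variances along them.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200_000
    theta = np.deg2rad(30.0)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])   # rotation (orthogonal, det = +1)
    S = np.diag([3.0, 0.5])                           # positive scalings
    b = np.array([1.0, 2.0])

    W = rng.standard_normal((2, N))
    X = R @ S @ W + b[:, None]                        # X = R S W + b

    C = np.cov(X)                                     # sample covariance ~ R S^2 R^T
    eigvals, eigvecs = np.linalg.eigh(C)              # ascending eigenvalues
    print(np.round(eigvals, 3))                       # ~ [0.25, 9.0] = diag(S^2), sorted
    print(np.round(eigvecs, 3))                       # columns ~ columns of R (up to sign/order)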
Principal Component Analysis (PCA)
• Applications: Dimensionality reduction
• Intrinsic dimension: Minimum number of variables (degrees of freedom)
required to represent the signal
• Consider a multivariate random vector X of N scalar variables: x = (x1, …, xN)
• If there is a function g(.) and M<N scalar variables a1, …, aM such that
every x~P(X) can be written as x = g (a1, …, aM) for some values a1, …, aM,
then the signal X needs only M variables for its representation
• Here, “intrinsic dimension” of X is M, instead of the “representation dimension” = N
Principal Component Analysis (PCA)
• Applications: Dimensionality reduction
https://medium.com/analytics-vidhya/dimensionality-reduction-using-principal-component-analysis-pca-41e364615766
Principal Component Analysis (PCA)
• Applications: Dimensionality reduction
• Acquired data is corrupted with errors
• e.g., measurement errors
• Such errors make the signal representation seem to be of a dimension higher than
intrinsic dimension
• Dimensionality reduction:
transformation of data from a higher-dimensional space into a lower-dimensional space,
so that the lower-dimensional representation (ideally of dimension close to the intrinsic dimension)
retains some meaningful properties of the original data
• PCA can perform linear dimensionality reduction
Principal Component Analysis (PCA)
• Applications:
Dimensionality reduction
• Using PCA
• X may be N dimensional
• PCA finds an M-dimensional space (often with M << N) that captures most of the variability (total dispersion) in the data
http://bennymachinelearning.blogspot.com/2017/08/machine-learning-principal-component.html
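• A minimal NumPy sketch of PCA-based linear dimensionality reduction (synthetic data for illustration): project centered data onto the top-M eigenvectors of the sample covariance, and measure how much of the total dispersion the M-dimensional representation retains.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D, M = 50_000, 5, 2

    # synthetic data: 2 strong latent directions plus small isotropic "measurement" noise
    latent = rng.standard_normal((2, N))
    mixing = rng.standard_normal((D, 2))
    data = mixing @ latent + 0.05 * rng.standard_normal((D, N))

    mean = data.mean(axis=1, keepdims=True)
    centered = data - mean
    C = np.cov(centered)                       # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending order
    U = eigvecs[:, -M:]                        # top-M principal directions (D x M)

    codes = U.T @ centered                     # M-dimensional representation
    reconstruction = U @ codes + mean          # back in the original D-dim space

    retained = eigvals[-M:].sum() / eigvals.sum()
    print("fraction of total dispersion retained by M =", M, "dims:", round(retained, 4))
    print("reconstruction RMS error:", np.sqrt(np.mean((reconstruction - data) ** 2)))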
Multivariate Gaussian – Mahalanobis Distance
• Figure: blue = normal, red = anomalous. One panel uses Euclidean distance from the mean of the normal data; the other uses Mahalanobis distance from the mean of the normal data, computed using the covariance of the normal data.
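• For reference, the Mahalanobis distance of a point x from a mean µ under covariance C is sqrt( (x - µ)T C-1 (x - µ) ). A minimal NumPy sketch (synthetic anisotropic "normal" data for illustration) contrasting it with the Euclidean distance:

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[3.0, 0.0], [0.0, 0.3]])      # anisotropic "normal" data model
    normal = A @ rng.standard_normal((2, 5000))

    mu = normal.mean(axis=1)
    C = np.cov(normal)
    C_inv = np.linalg.inv(C)

    def euclidean(x):
        return np.linalg.norm(x - mu)

    def mahalanobis(x):
        d = x - mu
        return np.sqrt(d @ C_inv @ d)

    x_along  = np.array([3.0, 0.0])   # far along the high-variance axis
    x_across = np.array([0.0, 1.0])   # off the low-variance axis

    for name, x in [("along high-variance axis", x_along),
                    ("off low-variance axis   ", x_across)]:
        print(name, " Euclidean:", round(euclidean(x), 2),
              " Mahalanobis:", round(mahalanobis(x), 2))

• The point off the low-variance axis is closer in Euclidean terms, but farther (more anomalous) in Mahalanobis terms.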
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification (in 1D)
http://sar.kangwon.ac.kr/etc/rs_note/rsnote/cp11/cp11-7.htm
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification (in 2D)
• If det(C1) = det(C2), then likelihood-based classification is equivalent to Mahalanobis-distance-based classification
https://onlinelibrary.wiley.com/doi/full/10.1111/maps.13314
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• What do the decision boundaries look like ?
• P(x|Class1) = G (x; m1, C1)
• P(x|Class2) = G (x; m2, C2)
• Decision surface comprises all points ‘x’ at which likelihoods are equal
• { x : P(x|Class1) = P(x|Class2) }
• { x : 0 = log ( P(x|Class1) / P(x|Class2) ) }
• At any point in the domain ‘x’, the log likelihood-ratio is:
log (P(x|Class1) / P(x|Class2)) = - 0.5 (x-m1)T C1-1 (x-m1) - 0.5 log (det (C1)) + 0.5 (x-m2)T C2-1 (x-m2) + 0.5 log (det (C2))
• In general, decision surface is a hyper-quadric
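• A minimal sketch of likelihood-based classification with two Gaussian class models (illustrative means and covariances), using scipy.stats.multivariate_normal for the log-densities; the sign of the log likelihood-ratio decides the class:

    import numpy as np
    from scipy.stats import multivariate_normal

    # illustrative class-conditional models
    m1, C1 = np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
    m2, C2 = np.array([3.0, 1.0]), np.array([[1.0, -0.3], [-0.3, 1.5]])

    pdf1 = multivariate_normal(mean=m1, cov=C1)
    pdf2 = multivariate_normal(mean=m2, cov=C2)

    def classify(x):
        # log likelihood-ratio; positive -> Class 1, negative -> Class 2
        llr = pdf1.logpdf(x) - pdf2.logpdf(x)
        return (1 if llr > 0 else 2), llr

    for x in [np.array([0.5, 0.2]), np.array([2.5, 1.2]), np.array([1.5, 0.5])]:
        label, llr = classify(x)
        print(x, "-> Class", label, "(log likelihood-ratio =", round(llr, 2), ")")

• Because C1 and C2 differ here, the implied decision surface { x : log likelihood-ratio = 0 } is a hyper-quadric, as noted above.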
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• Decision boundaries
http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect2.pdf
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• Decision boundaries
• When C1 = C2 = C, then decision boundary is:
• 0 = log (P(x|Class1) / P(x|Class2)) = - 0.5 (x-m1)T C-1 (x-m1) + 0.5 (x-m2)T C-1 (x-m2)
• Expanding and cancelling the common quadratic term xT C-1 x:
0 = (m2-m1)T C-1 x + 0.5 m1T C-1 m1 - 0.5 m2T C-1 m2
• Decision surface is a hyper-plane
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• Example (Data taken from R. A. Fisher's classic 1936 paper)
• UCI ML repository Iris dataset http://archive.ics.uci.edu/ml/datasets/Iris/
http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect2.pdf
Datasets
• UCI Machine Learning Repository
• https://archive.ics.uci.edu/ml/
Singular Value Decomposition (SVD)
• Singular Value Decomposition (SVD)
• What is it about ?
• What can we say about existence ?
• What can we say about uniqueness ?
• How does it help us understand the multivariate Gaussian ?
Singular Value Decomposition (SVD)
• Matrix factorization
• Let matrix A be size MxN
• When A is real-valued, the SVD is A = U S VT, where:
• V is orthogonal of size NxN
• When A is complex: V is unitary
• U is orthogonal of size MxM
• When A is complex: U is unitary
• S is (rectangular) diagonal with size MxN
• Values on diagonal = singular values
• Singular values are non-negative real (even when A, U, V are complex-valued)
• If the m-th columns of U and V are um and vm, respectively, then: A = Σm sm um vmT (a sum of rank-1 matrices, one per singular value sm)
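• A small NumPy example of the factorization and of the rank-1 expansion above (random rectangular A for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 3))              # M x N, real-valued

    U, s, Vt = np.linalg.svd(A)                  # A = U @ S @ V^T, with S rectangular diagonal
    print(U.shape, s, Vt.shape)                  # (4, 4), 3 non-negative singular values, (3, 3)

    # rank-1 expansion: A = sum_m s_m * u_m v_m^T
    A_rebuilt = sum(s[m] * np.outer(U[:, m], Vt[m, :]) for m in range(len(s)))
    print(np.allclose(A_rebuilt, A))             # True

    print(np.allclose(U.T @ U, np.eye(4)))       # U orthogonal
    print(np.allclose(Vt @ Vt.T, np.eye(3)))     # V orthogonal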
Singular Value Decomposition (SVD)
• A = U S VT
• An example, in pictures: A = U S VT
Singular Value Decomposition (SVD)
• Geometric interpretation of the action of a matrix A on a vector
• In this example, A is square: applying A = U S VT means first applying VT (a rotation/reflection), then S (an axis-aligned scaling), then U (another rotation/reflection)
Singular Value Decomposition (SVD)
• Matrix norm
• Induced by a vector norm: norm(A) := max over unit-norm vectors x of norm(Ax)
Singular Value Decomposition (SVD)
• Uniqueness analysis
• Why is norm(B) <= norm(A) ?
• We know that A = USVT, where U and V are orthogonal, and S is block-diagonal: its top-left entry is the first singular value and its bottom-right block is the smaller matrix B
• Let β := norm(B)
• By definition of norm(B),
there exists a unit-norm column-vector y such that norm(By) = β
• Use that y to construct a longer (but still unit-norm) column vector x := V [0, yT]T
• norm (A x)
= norm ( U S VT V [0, yT]T )
= norm ( S [0, yT]T ) (because U and V are orthogonal: VTV = I and U preserves norms)
= norm ( [ 0, (By)T ]T )
= norm (By)
= β
• Thus, there exists a vector x such that norm(Ax) = β,
which implies that norm(A) cannot be less than β, i.e., norm(A) >= β = norm(B)
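• A related numeric check (a standard fact, closely tied to this analysis): the matrix 2-norm induced by the vector 2-norm equals the largest singular value, which NumPy exposes as np.linalg.norm(A, 2).

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 3))

    sigma = np.linalg.svd(A, compute_uv=False)    # singular values, descending
    print(sigma[0], np.linalg.norm(A, 2))         # largest singular value = induced 2-norm

    # brute-force check of the induced-norm definition: max over unit-norm x of norm(Ax)
    xs = rng.standard_normal((3, 100_000))
    xs /= np.linalg.norm(xs, axis=0)              # normalize columns to unit norm
    print(np.max(np.linalg.norm(A @ xs, axis=0))) # close to (and never above) sigma[0]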