CS 215

Data Analysis and Interpretation


Multivariate Statistics: Multivariate Gaussian
Suyash P. Awate
Multivariate Gaussian – Definition
• Consider a vector random variable X := [X1; X2; …; XD]
• Column vector of length D
Multivariate Gaussian – Identity A
• What are the level sets of the PDF?
Multivariate Gaussian – Identity A
• Isotropic / spherical multivariate Gaussian
• Level sets
Multivariate Gaussian – Diagonal A
• X = A W + µ
• What is the PDF q(X) for a non-singular square diagonal matrix A, and some µ?
• X1 = A11 W1 + µ1 : Gaussian with mean µ1, standard deviation σ1 = |A11|
• X2 = A22 W2 + µ2 : Gaussian with mean µ2, standard deviation σ2 = |A22|
• …
• XD = ADD WD + µD : Gaussian with mean µD, standard deviation σD = |ADD|
• P(X) = P(X1, X2, …, XD) = G(X1; µ1, σ1²) G(X2; µ2, σ2²) … G(XD; µD, σD²)
• Any level set of the PDF q(X) is a hyper-ellipsoid with:
• Center at µ
• Axes aligned with the cardinal axes
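• A minimal MATLAB/Octave sketch of this case (the diagonal entries of A and the vector µ below are assumed example values, not from the slides): sample X = A W + µ and check that each component's sample standard deviation matches |Add|.

% Sketch (assumed example values): a diagonal A gives independent Gaussian components
rng(0);
N  = 1e5;
A  = diag([2, 0.5]);       % non-singular diagonal matrix (assumed)
mu = [1; -3];              % assumed mean vector
W  = randn(2, N);          % independent standard-normal components
X  = A*W + mu;             % samples of X
disp(std(X, 0, 2)')        % approx [2, 0.5], the |Add| values
disp(mean(X, 2)')          % approx mu'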
Multivariate Gaussian – Diagonal A
• X = A W + µ
• What is the PDF q(X) for a non-singular square diagonal matrix A, and some µ?
• P(X) = P(X1, X2, …, XD) = G(X1; µ1, σ1²) G(X2; µ2, σ2²) … G(XD; µD, σD²)
• Examples 1–3 (left to right):
means (µ1, µ2) are (0,0), (0,0), (0,0);
variances (σ1², σ2²) are (4,4), (6,2), (2,6)
Multivariate Gaussian – Non-Singular A
• X = A W + µ
• What is the PDF q(X) for a non-singular square matrix A and µ = 0?
• Transformation of random variables (multivariate case)
• Transformation is X := g(W) := A W
• Inverse transformation is W = g-1(X) = A-1X
• Univariate case
• We wanted magnitude of derivative of g-1(.)
• Measured local scaling in lengths caused by g-1(.)
• Multivariate case
• Measure local scaling in volumes caused by g-1(.)
• We want the magnitude of the volume-scaling given by Jacobian of g-1(.)
• Magnitude of determinant of Jacobian of g-1(.), where the Jacobian is A-1
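• A short MATLAB/Octave sketch of this change of variables (the matrix A and the test point below are assumed examples): evaluate q(x) = p(A-1 x) |det(A-1)| and compare it with the closed-form Gaussian density with covariance AAT.

% Sketch: change of variables for X = A*W with mu = 0 (assumed example matrix)
A = [2 1; 0 1];                         % assumed non-singular square matrix
D = size(A, 1);
x = [0.7; -1.2];                        % an arbitrary test point
w = A \ x;                              % inverse transformation w = A^-1 * x
q_transform = exp(-0.5*(w'*w)) / (2*pi)^(D/2) * abs(det(inv(A)));
C = A*A';                               % covariance of X (established later in the slides)
q_gaussian  = exp(-0.5 * x' * (C \ x)) / sqrt((2*pi)^D * det(C));
disp([q_transform, q_gaussian])         % the two values agree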
Multivariate Gaussian – Non-Singular A
• Linear transformation W := A-1 X
Multivariate Gaussian – Non-Singular A
• Linear transformation W := A-1 X
• Transformation A-1 maps
an infinitesimal hyper-cube (dX) of side lengths δ × δ × … × δ (D times) →
an infinitesimal hyper-parallelepiped (dW)
• If the axes of the hyper-cube were unit vectors along the cardinal axes,
then the axes of the hyper-parallelepiped are the columns of A-1
• If the volume of the hyper-cube (dX) is δ^D,
then the volume of the hyper-parallelepiped (dW) is δ^D |det(A-1)| = δ^D / |det(A)|
Multivariate Gaussian – Non-Singular A
• Volume of a parallelepiped (in 3D)
• Scalar triple product
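• A quick MATLAB/Octave check of the 3D case (the side vectors below are assumed examples): the scalar triple product |a · (b × c)| equals |det([a b c])|.

% Sketch: volume of a 3D parallelepiped via scalar triple product vs. determinant
a = [1; 0; 0];  b = [1; 2; 0];  c = [0; 1; 3];   % assumed side vectors
vol_triple = abs(dot(a, cross(b, c)));
vol_det    = abs(det([a b c]));
disp([vol_triple, vol_det])                      % both equal 6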
Multivariate Gaussian – Non-Singular A
• Why is volume of hyper-parallelepiped given by
determinant of matrix with columns as sides of hyper-parallelepiped ?
• An inductive proof exists. We consider the following non-inductive reasoning.
• Two important properties from linear algebra and geometry:
a) Adding multiples of one column to another
doesn’t change the determinant, because the determinant function is multi-linear and alternating
b) Adding multiples of one side vector to another
doesn’t change volume, because it causes a skew translation of hyper-parallelepiped
1. Using Gram-Schmidt orthogonalization,
transform matrix A-1 to a matrix, say, Borthog with orthogonal columns
(NOT orthonormal columns; that would force the determinant magnitude to be 1)
• This doesn’t change determinant or volume
Multivariate Gaussian – Non-Singular A
1. Gram–Schmidt
orthogonalization
• {v1,v2, …} to {u1,u2, …}
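• A compact MATLAB/Octave sketch of classical Gram–Schmidt with orthogonal (not orthonormal) columns, as used in step 1 (the matrix below is an assumed stand-in for A-1): each update only subtracts multiples of earlier columns, so the determinant is unchanged.

% Sketch: Gram-Schmidt orthogonalization keeping non-normalized orthogonal columns
Ainv = [2 1 0; 0 1 1; 1 0 1];      % assumed example matrix playing the role of A^-1
B = Ainv;
for k = 2:size(B, 2)
    for j = 1:k-1
        % subtract the projection of column k onto (already processed) column j
        B(:,k) = B(:,k) - (B(:,j)' * B(:,k)) / (B(:,j)' * B(:,j)) * B(:,j);
    end
end
disp(B' * B)                       % off-diagonal entries ~0: columns are orthogonal
disp([det(Ainv), det(B)])          % determinants agree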
Multivariate Gaussian – Non-Singular A
• Why is volume of hyper-parallelepiped given by
determinant of matrix with columns as sides of hyper-parallelepiped ?
• 2 important properties from linear algebra and geometry:
Adding multiples of one column, or side vector, to another:
a) doesn’t change the determinant, because the determinant function is multi-linear and alternating
b) doesn’t change volume, because it causes a skew translation of hyper-parallelepiped
1. Using Gram-Schmidt orthogonalization,
transform matrix A-1 to a matrix, say, Borthog with orthogonal columns
(NOT orthonormal columns; that would force the determinant magnitude to be 1)
2. Rotate Borthog to diagonal form (align its columns with the cardinal axes)
• This doesn’t change determinant or volume
Multivariate Gaussian – Non-Singular A
2. Rotate {u1,u2, …} to align it to cardinal axes
Multivariate Gaussian – Non-Singular A
• Why is volume of hyper-parallelepiped given by
determinant of matrix with columns as sides of hyper-parallelepiped ?
• 2 important properties from linear algebra and geometry:
Adding multiples of one column, or side vector, to another:
a) doesn’t change the determinant, because the determinant function is multi-linear and alternating
b) doesn’t change volume, because it causes a skew translation of hyper-parallelepiped
1. Using Gram-Schmidt orthogonalization,
transform matrix A-1 to a matrix, say, Borthog with orthogonal columns
(NOT orthonormal columns; that would force the determinant magnitude to be 1)
2. Rotate Borthog to diagonal form (align its columns with the cardinal axes)
3. For this diagonal matrix (aligned hyper-rectangle),
determinant magnitude (= product of diagonal-entries' magnitudes) =
volume of a hyper-rectangle (= product of side lengths)
4. Now trace back all operations
Multivariate Gaussian – Non-Singular A
• X = A W + µ
• What is the PDF q(X) for a non-singular square matrix A and µ = 0?
• Transformation of random variables (multivariate case)
• Transformation is X := g(W) := A W
• Inverse transformation is W = g-1(X) = A-1X
• Multivariate case
• Measure local scaling in volumes caused by g-1(.)
• We want the magnitude of the determinant of the Jacobian of g-1(.)
Multivariate Gaussian – Non-Singular A, Non-Zero µ
• If X := A W is a multivariate Gaussian,
then Y := X + µ is a multivariate Gaussian with

• Proof:
• Follows from the transformation X := Y – µ := g-1(Y)
Multivariate Gaussian – Composite Transformations
• If Y is multivariate Gaussian,
then Z := BY + c is multivariate Gaussian,
where matrix B is square invertible
• Proof:
• Because Y is multivariate Gaussian, we have Y = AW + µ, where A is invertible
• Thus, Z = B(AW + µ) + c = (BA)W + (Bµ + c), where the matrix BA is invertible
Multivariate Statistics – Mean and Covariance
Multivariate Statistics – Mean
• For a general random (column) vector X,
the mean vector is
EP(X)[X]
= a (column) vector with the i-th component as EP(X)[Xi] = EP(Xi)[Xi]
Multivariate Statistics – Covariance
• Covariance matrix for a general random (column) vector Y is defined
as:
C := EP(Y) [ (Y – E[Y]) (Y – E[Y])T ]
• So,
Cij = ?
= EP(Y) [ (Yi – E[Yi]) (Yj – E[Yj]) ]
= EP(Yi,Yj) [ (Yi – E[Yi]) (Yj – E[Yj]) ]
= Cov (Yi, Yj)
Multivariate Statistics – Covariance
• More properties of covariance matrix C (for a general random vector X)

• If C = AAT with A invertible, then C is invertible and positive definite (PD), i.e.,
aTCa > 0 for any non-zero vector a
Multivariate Gaussian – Mean and Covariance
Multivariate Gaussian – Mean
• The mean vector of X := AW+µ is µ
• Proof:
• When X = AW + µ,
EP(X)[X] = EP(W)[AW+µ] = µ + EP(W)[AW] = µ + A EP(W)[W] = µ

• Notes:
• Take the expectation of first component of AW, i.e.,
EP(W) [ A11W1 + A12W2 + … A1DWD ]
= A11 EP(W) [W1] + A12 EP(W) [W2] + … + A1D EP(W) [WD]
• So, for the whole vector: EP(W) [AW] = A EP(W) [W]
Multivariate Gaussian – Covariance
• The covariance matrix of X := AW + µ is AAT
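• A quick MATLAB/Octave Monte-Carlo check of both results (A and µ below are assumed example values): the sample mean approaches µ and the sample covariance approaches AAT.

% Sketch: empirical check that E[X] = mu and Cov(X) = A*A' for X = A*W + mu
rng(0);
N  = 2e5;
A  = [1 0.5; -0.3 2];      % assumed non-singular matrix
mu = [2; -1];              % assumed mean
W  = randn(2, N);
X  = A*W + mu;
disp(mean(X, 2)')          % approx mu'
disp(cov(X'))              % approx A*A'
disp(A*A')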
Multivariate Gaussian – Different Cases
Multivariate Gaussian – Special Cases
• Diagonal matrix
• Orthogonal matrix
• Definition: A real square matrix Q whose columns (and rows) are orthonormal vectors,
i.e., Q QT = QT Q = Identity matrix
• Determinant det(Q) is either +1 or -1
• “orthogonal” is an over-used term in mathematics
• Rotation matrix
• When det(Q) = +1, then Q is a rotation matrix
• When det(Q) = -1, then Q models either a reflection (called an improper rotation) or a
combination of rotation and reflection
• “Rotation” is an over-used term (sometimes includes improper rotations) in
mathematics
• Reflection matrix
• An orthogonal matrix that is also symmetric
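• A small MATLAB/Octave illustration of these definitions (the angle below is an arbitrary assumption): a 2D rotation matrix has determinant +1, and composing it with a reflection flips the sign of the determinant while keeping Q QT = I.

% Sketch: orthogonal matrices - rotation (det = +1) vs. improper rotation (det = -1)
theta = pi/6;                                          % assumed rotation angle
R = [cos(theta) -sin(theta); sin(theta) cos(theta)];   % rotation matrix
F = [1 0; 0 -1];                    % reflection about the x-axis (symmetric orthogonal)
Q = R * F;                          % improper rotation (rotation composed with reflection)
disp(R'*R)                          % identity matrix
disp([det(R), det(F), det(Q)])      % +1, -1, -1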
Multivariate Gaussian – Special Cases
• Property (Rotation and/or Reflection):
If µ = 0; and A = R where R is orthogonal;
then Y := RW has PDF:

• Proof:
• Transformation of random vectors
• |det(R)| = 1
• Inverse transformation is
W = RT Y
Multivariate Gaussian – Special Cases
• Property (Scaling):
If µ = 0; and A = S square diagonal with positive entries on diagonal;
then Y := SW has PDF:
Multivariate Gaussian – Special Cases
• Property (first Scaling, and then Rotation and/or Reflection):
If µ = 0; A = RS,
then Y := RSW has the PDF:
Multivariate Gaussian – General Case
• If X := A W is a multivariate Gaussian,
then Y := X + µ is a multivariate Gaussian with

• What are the level sets of this PDF ?


• We need some linear algebra
• Analyze properties of covariance matrix C that is:
• In general: real symmetric positive semi-definite
• When C = AAT, and A is invertible, then C is positive definite
Probability and Statistics
• Reference books specifically for
multivariate Gaussian
• Basic Probability Theory, by Robert Ash
• faculty.math.illinois.edu/~r-ash/BPT.html
Linear Algebra
• Reference books
Linear Algebra – Eigen Decomposition
• Eigenvalue and Eigenvector
• For any square NxN matrix A,
an eigenvector is a non-zero vector ‘v’ s.t. Av = λv.
Then, λ is the associated eigenvalue

• Square matrix A is diagonalizable if it is “similar” to a diagonal matrix,


i.e., if there exists an invertible matrix P and a diagonal matrix D
such that P-1AP = D
Linear Algebra – Eigen Decomposition
• If A is diagonalizable,
then it has
N linearly-independent eigenvectors
• The eigenvectors needn’t be
orthogonal to each other
Linear Algebra – Eigen Decomposition
• Invertible doesn’t imply diagonalizable
• A non-diagonalizable matrix is called a defective matrix
• e.g., the 2x2 matrix A = [1 1; 0 1] (as shown); B = inv (A); [V D] = eig (A)
• Doesn’t have a complete basis of eigenvectors
• Intuition: Action of matrix is to map vector (x,y) to (x+y,y)
So, any eigenvalue must be 1, any eigenvector must have y=0
Linear Algebra – Eigen Decomposition
• Diagonalizable doesn’t imply invertible
• e.g., some eigenvalues can be zero
Linear Algebra – Eigen Decomposition
• Eigenvalue and Eigenvector
• For any square NxN matrix A,
an eigenvector is a non-zero vector ‘v’ s.t. Av = λv.
Then, λ is the associated eigenvalue
• Theorem:
Every real symmetric matrix (e.g., covariance C) is diagonalizable
• There exists an invertible matrix Q such that Q-1 C Q is diagonal
• This implies C has N linearly-independent eigenvectors
• Theorem:
Every real symmetric matrix (e.g., covariance C) is diagonalizable by
an orthogonal matrix
• There exists an orthogonal matrix Q such that QT C Q is diagonal
Linear Algebra – Eigen Decomposition
• Spectral Theorem: If A is a real symmetric NxN matrix, then
A has N real eigenvalues with N real-valued orthogonal eigenvectors
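• A MATLAB/Octave sketch of the spectral theorem on a small example (the symmetric matrix below is an assumed example): eig returns real eigenvalues and orthonormal eigenvectors, so Q D QT reconstructs A.

% Sketch: eigen-decomposition of a real symmetric matrix
A = [4 1 0; 1 3 1; 0 1 2];     % assumed real symmetric matrix
[Q, D] = eig(A);               % columns of Q are eigenvectors, D is diagonal
disp(diag(D)')                 % N real eigenvalues
disp(Q'*Q)                     % identity: the eigenvectors are orthonormal
disp(norm(Q*D*Q' - A))         % ~0, i.e., A = Q D Q'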
Linear Algebra – Eigen Decomposition
• In general:
• If {u,v} are distinct/non-parallel eigenvectors associated with same eigenvalue
λ, then any vector within span of {u,v} is also an eigenvector with eigenvalue λ.
• Because A(au + bv) = aAu + bAv = aλu + bλv = λ(au + bv)
• The above equation holds for real/complex-valued a,b, u,v, λ, A
• If {u,v} are distinct eigenvectors with distinct eigenvalues {λ1,λ2}, then action of
A on any vector within span of {u,v} outputs a vector within the span of {u,v}.
• Because A(au + bv) = aAu + bAv = aλ1u + bλ2v = (aλ1)u + (bλ2)v
• Notation: (.)* = conjugate and (.)T = transpose (e.g., converts col to row)
• For a real-symmetric matrix A
• Inner product of Ax and y = inner product of x and Ay
because (Ax)T y* = xT AT y* = xT A y* = xT A* y* = xT (Ay)* (A is "self-adjoint")
• If v is an eigenvector and if u is orthogonal to v (i.e., if uTv* = 0), then
action of A on u produces a vector also orthogonal to v (A maintains orthogonality)
• Because (Au)Tv* = uT(Av*) = uT(Av)* = uTλ*v* = 0
Linear Algebra – Eigen Decomposition
• Spectral Theorem: If A is a real symmetric NxN matrix, then
A has N real eigenvalues with N real-valued orthogonal eigenvectors
• First, we show that A has all real eigenvalues
(i.e., A cannot have a complex-valued eigenvalue)
Linear Algebra – Eigen Decomposition
• Spectral Theorem: If A is a real symmetric NxN matrix, then
A has N real eigenvalues with N real-valued orthogonal eigenvectors
• Let us analyze the special/simple case where all real eigenvalues = λ
• Then, for all v in RN, linear operator A’s action on v simply scales v by factor λ
• We can exactly model such an operator A simply by the diagonal matrix λI (of size N×N)
• Or model A as A = Q (λI) QT,
where Q is any orthogonal basis for RN with real-valued column vectors
• Thus, QT A Q is a diagonal matrix
• So, the N columns of Q are N real-valued eigenvectors of A
• Of course, Q isn’t unique
Linear Algebra – Eigen Decomposition
• Spectral Theorem: If A is a real symmetric NxN matrix, then
A has N real eigenvalues with N real-valued orthogonal eigenvectors
• For real-symmetric A,
eigenvectors corresponding to distinct (real) eigenvalues are orthogonal
• Notation: (.)T = transpose, (.)* = conjugate
• Proof:
• Let A have eigenvector v1 with real eigenvalue λ1
• Let A have eigenvector v2 with real eigenvalue λ2 ≠ λ1
• Then, λ1 v1T v2* = (λ1 v1)T v2* = (A v1)T v2*
• = v1T (A v2*) (because A is symmetric)
• = v1T (A v2)* (because A is real)
• = v1T (λ2 v2)* = v1T λ2 v2* (because λ2 is real)
• = λ2 v1T v2*
• Because λ2 ≠ λ1, we get v1T v2* = 0, i.e., v1 is orthogonal to v2
Linear Algebra – Eigen Decomposition
• Spectral Theorem: If A is a real symmetric NxN matrix, then
A has N real eigenvalues with N real-valued orthogonal eigenvectors
• How do we know that real-valued eigenvectors exist for A?
• Let the real-positive scalar b := max of ||Ay||2 over { y ∈ RN : ||y||2 = 1 }
• Let the real-valued vector x := arg max of ||Ay||2 over { y ∈ RN : ||y||2 = 1 }
• If there are multiple such unit-norm vectors y, then we pick one of them as x
• Thus, the action of A scales the norm of any vector by at most a factor of b
• Then b² = (Ax)T (Ax) = xT (AAx)
• The RHS can take the value b² only if x is parallel to AAx (and then AAx = b²x)
• Then we can claim that:
1. Either Ax is parallel to x; so v1 := x is a real-valued eigenvector, with real eigenvalue ±b
2. Or v1 := Ax + bx (non-zero) is a real-valued eigenvector, with real eigenvalue b,
because A(Ax + bx) = AAx + bAx = b²x + bAx = b(bx + Ax)
Linear Algebra – Eigen Decomposition
• Spectral Theorem: If A is a real symmetric NxN matrix, then
A has N real eigenvalues with N real-valued orthogonal eigenvectors
• How do we know that real-valued eigenvectors exist for A ?
• Repeat the following:
• Let the real-positive scalar c := max of ||Ay||2 over { y ∈ RN : yT v1 = 0, ||y||2 = 1 }
• Let the real-valued vector x := arg max of ||Ay||2 over { y ∈ RN : yT v1 = 0, ||y||2 = 1 }
• Then c² = (Ax)T (Ax) = xT (AAx)
• The RHS can take the value c² only if x is parallel to AAx (and then AAx = c²x)
• Then we can claim that:
1. Either Ax is parallel to x; so x is a real-valued eigenvector (orthogonal to v1), with real eigenvalue ±c
2. Or (Ax + cx) is a real-valued eigenvector of A (orthogonal to v1, as both Ax and x are),
because A(Ax + cx) = AAx + cAx = c²x + cAx = c(cx + Ax)
Linear Algebra – Eigen Decomposition
• Every NxN real symmetric positive definite (SPD) matrix M
(e.g., covariance matrix C) has an eigen-decomposition with
all eigenvalues as positive
• Proof:
• Let eigen decomposition for real symmetric matrix M be: M = Q D QT
• Where Q is real orthogonal and D is real diagonal
• Then, vT M v = vT Q D QT v = uT D u, where u := QT v (simply “rotated” v)
• For a PD matrix M, vTMv must be positive for every non-zero v
• So, uTDu must be positive for every non-zero u
• So, all values on diagonal of D must be positive
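• A small MATLAB/Octave check of this (the matrix A below is an assumed invertible example): forming C = AAT gives an SPD matrix, and all eigenvalues returned by eig are positive.

% Sketch: the eigenvalues of an SPD matrix C = A*A' are all positive
rng(0);
A = randn(3) + 3*eye(3);       % assumed (invertible) matrix
C = A*A';                      % symmetric positive definite when A is invertible
disp(eig(C)')                  % all entries are > 0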
Multivariate Gaussian – Level Sets
• If X = A W is a multivariate Gaussian,
then Y = X + µ is a multivariate Gaussian with

• What are the level sets of this PDF ?


• Let C = Q D QT. Then, C-1 = Q D-1 QT, which is also SPD
• Each level set satisfies (y-μ)T C-1 (y-μ) = a, where a >= 0
• Because C-1 is SPD, 'a' becomes zero iff y = μ
• So, (y-μ)T Q D-1 QT (y-μ) = a
• Change to the roto-reflected coordinate system represented by the orthogonal basis Q
• Where y maps to y' = QT y, and μ maps to μ' = QT μ
• Then, (y'-μ')T D-1 (y'-μ') = a, which is a hyper-ellipsoid:
• In the roto-reflected coordinate system, the center is at μ' and the axes are along the cardinal axes
• The half-length of the axis along dimension d is sqrt(a Ddd), i.e., proportional to the square root of the d-th eigenvalue of C
Multivariate Gaussian – Level Sets
• If X = A W is a multivariate Gaussian, then Y = X + µ is a multivariate
Gaussian with

• What are the level sets of this PDF ?

peterroelants.github.io/posts/multivariate-normal-primer/
Multivariate Gaussian – Marginals and Conditionals
Multivariate Gaussian – Marginals
• Marginal PDFs
• Property: The 1D marginal PDF of
multivariate Gaussian X,
for any single variable,
is (univariate) Gaussian
• Proof:
• From the definition, we know that:
• (1) each component Xi = Ai1 W1 + Ai2 W2 + … + AiD WD + µi, where the Wj are independent standard-normal RVs
• (2) transformations of scaling and/or translation on a univariate Gaussian RV
lead to another univariate Gaussian RV
• (3) sum of 2 independent univariate Gaussian RVs leads to another univariate
Gaussian RV
Multivariate Gaussian – Marginals
• Marginal PDFs
• Property: Marginal PDFs of multivariate Gaussian X in N dimensions,
over any chosen subset of the variables (subset size M < N),
are (multivariate) Gaussian
• Proof:
• Choose transformation B as a projection matrix of size MxN, where M < N
• Each row has all zeros except a 1 at one position
• e.g., row [1 0 … 0 ] will select the first component of X
• If we consider multivariate Gaussian X := AW + µ, where A is invertible,
then BX = (BA)W + (Bµ)
• Note: Because A is invertible (full rank), BA has rank M
• By definition, BX is also multivariate Gaussian
• Mean = Bµ, Covariance = (BA)(BA)T = BAATBT = BCBT = C’,
where C’ is a square sub-matrix of C corresponding to the chosen M variables
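• A MATLAB/Octave sketch of this argument (A, µ, and the selected components below are assumed examples): selecting components with a 0/1 projection matrix B gives samples whose covariance matches the corresponding sub-matrix of C = AAT.

% Sketch: marginal of a multivariate Gaussian over a chosen subset of variables
rng(0);
N  = 2e5;
A  = [2 0 0; 1 1 0; 0.5 -1 3];     % assumed invertible 3x3 matrix
mu = [1; 2; 3];
X  = A*randn(3, N) + mu;           % samples of the 3D Gaussian
B  = [1 0 0; 0 0 1];               % select components 1 and 3 (M = 2 < N = 3)
Y  = B*X;                          % samples of the 2D marginal
C  = A*A';
disp(cov(Y'))                      % approx B*C*B' = C([1 3],[1 3])
disp(B*C*B')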
Multivariate Gaussian – Marginals
• Marginal PDFs being Gaussian
doesn’t imply
joint PDF is multivariate Gaussian
• Example
• Let X be a standard Normal
• Let Y = X (2B – 1)
where B is Bernoulli with parameter 0.5

• More examples
• https://en.wikipedia.org/wiki/Normally_distributed_and_uncorrelated_does_not_imply_independent
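• A MATLAB/Octave sketch of this example: Y = X(2B - 1) is itself standard normal and uncorrelated with X, yet (X, Y) is not jointly Gaussian (e.g., X + Y equals exactly 0 about half the time).

% Sketch: marginally Gaussian and uncorrelated, but not jointly Gaussian
rng(0);
N = 1e5;
X = randn(1, N);                 % standard normal
B = rand(1, N) < 0.5;            % Bernoulli with parameter 0.5
Y = X .* (2*B - 1);              % flips the sign of X with probability 0.5
disp([mean(Y), std(Y)])          % ~0 and ~1: Y is (marginally) standard normal
disp((X*Y') / N)                 % ~0: X and Y are uncorrelated
disp(mean(abs(X + Y) < 1e-12))   % ~0.5: X+Y has an atom at 0, so (X,Y) is not jointly Gaussian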
Multivariate Gaussian – Marginals
• Marginal PDFs being Gaussian doesn’t imply joint PDF is multivariate
Gaussian
• Only top-row left, top-row middle, bottom-row left are bivariate Gaussian
• All marginals are Gaussian
Multivariate Gaussian – Conditionals
• Conditional PDFs
• If multivariate Gaussian X
is partitioned into X1 and X2,
then conditional PDF P(X1|X2=x2)
is also a multivariate Gaussian
• P(X1|X2=x2) = P(X1, X2=x2) / P(X2=x2)
Multivariate Gaussian – Conditionals
• Conditional PDFs
• If multivariate Gaussian X is partitioned into X1 and X2,
then the conditional PDF P(X1|X2=x2) is also a multivariate Gaussian
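• For reference, the standard closed-form result: if µ = [µ1; µ2] and C = [C11 C12; C21 C22], then P(X1|X2=x2) is Gaussian with mean µ1 + C12 C22-1 (x2 − µ2) and covariance C11 − C12 C22-1 C21. A MATLAB/Octave sketch with assumed example values:

% Sketch: conditional mean and covariance of a partitioned multivariate Gaussian
mu  = [1; 2; 0];                           % assumed mean, partitioned as [mu1; mu2]
C   = [4 1 0.5; 1 3 1; 0.5 1 2];           % assumed SPD covariance
i1  = 1:2;  i2 = 3;                        % X1 = components 1-2, X2 = component 3
x2  = 1.5;                                 % conditioning value
mu_cond = mu(i1) + C(i1,i2) * (C(i2,i2) \ (x2 - mu(i2)));
C_cond  = C(i1,i1) - C(i1,i2) * (C(i2,i2) \ C(i2,i1));
disp(mu_cond')
disp(C_cond)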
Multivariate Gaussian – Conditionals
• “Conditional” PDFs
• What about this way of slicing ?
• Yes, profile on the line has a Gaussian shape
Multivariate Gaussian – ML Estimation
Multivariate Gaussian – ML Estimation
• Data: {y1, …, yN}
• Take log-likelihood function
• ML estimate (MLE) for mean vector (= sample mean)
• Take derivative with respect to (w.r.t.) µ, and assign to zero. Solve.
• Quadratic form aTBa = Σi Σj ai aj Bij
• Partial derivative w.r.t. ak
= Σj aj Bkj + Σi ai Bik
= 2 Σj Bkj aj (because B is symmetric)
= 2 (k-th row of B * column-vector a)
• Hence, ∂/∂µ [ (µ − x)T C-1 (µ − x) ] = 2 C-1 (µ − x)
• Scalar function, say f(a), of multiple scalar variables in column-vector ‘a’
• Jacobian df/da will be a row vector of the same length as ‘a’
• Change in function value (df) = derivative (df/da) * change in variable (da)
• Can be reshaped/rearranged into a column vector of the same shape as ‘a’
Multivariate Gaussian – ML Estimation
• Data: {y1, …, yN}
• Take log-likelihood function
• MLE for covariance matrix (= sample covariance; uncorrected/biased)
• Take derivative w.r.t. C, and assign to zero. Solve.
• Need partial derivatives w.r.t. Cij
• Scalar function, say f(C), of multiple scalar variables in C
• Consider a (column)-vectorized form of C
• Jacobian df/dC will be a row vector of the same length as (column)-vectorized C
• Can be reshaped/rearranged into a matrix of the same shape as C
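• A MATLAB/Octave sketch of the resulting estimators (the closed forms are the sample mean and the uncorrected/biased sample covariance; the data below is simulated under assumed parameters):

% Sketch: ML estimates for a multivariate Gaussian - sample mean and biased sample covariance
rng(0);
N  = 5e4;
A  = [1.5 0; 0.7 1];                         % assumed ground-truth A
mu = [2; -1];                                % assumed ground-truth mean
Y  = A*randn(2, N) + mu;                     % data {y1, ..., yN}, one column per sample
mu_hat = mean(Y, 2);                         % MLE of the mean vector
R      = Y - mu_hat;                         % centered data
C_hat  = (R * R') / N;                       % MLE of the covariance (divides by N, not N-1)
disp(mu_hat')
disp(C_hat)
disp(A*A')                                   % compare with the true covariance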
Multivariate Gaussian – ML Estimation
• “Matrix Calculus”
• http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/calculus.html
• https://en.wikipedia.org/wiki/Matrix_calculus
• http://www.matrixcalculus.org/
Principal Component Analysis (PCA)
Principal Component Analysis (PCA)
• Principal Component Analysis (PCA)
• What is it about ?
• What does it tell us about the distribution underlying the data ?
• What does it tell us about the distribution underlying the data,
when the data is known to have a multivariate Gaussian distribution ?
• Applications
Principal Component Analysis (PCA)
• “Modes of variation”
• Set of vectors (directions and magnitudes) that are used to depict the
variation in a population or sample, around the mean

https://statistics.laerd.com/spss-tutorials/pearsons-product-moment-correlation-using-spss-statistics.php
Principal Component Analysis (PCA)
• “Modes of variation”
• Set of vectors (directions and magnitudes) that are used to depict the
variation in a population or sample, around the mean
• Can we do it for the
distribution of images of each digit ?
Principal Component Analysis (PCA)
• Directions of maximal variance
• Consider a general multivariate random variable X with PDF P(X)
• We aren’t assuming it to be a Gaussian yet
Principal Component Analysis (PCA)
• Directions of maximal variance
• When covariance matrix C is diagonal (sample mean at origin)
• Let d-th element on diagonal of C be Cdd
• Let d-th element in vector ‘v’ be vd

• The objective function is a convex combination of { Cdd } with weights { (vd)² }


Principal Component Analysis (PCA)
• Directions of maximal variance
• When covariance matrix C is diagonal (and sample mean at origin):
• The minor axis corresponds to the dimension d with the
smallest 1/Cdd, i.e., the largest Cdd
• The point on the hypersphere that
maximizes the objective function
lies at the end of the minor axis
of one of the hyper-ellipsoids
• These level sets being ellipsoids isn't because of any Gaussian assumption.
(Figure: level sets plotted over dimensions d = 1 and d = 2.)
Principal Component Analysis (PCA)
• Directions of maximal variance
• When covariance matrix C is diagonal (and sample mean at origin):

• The second mode of variation is the second cardinal axis (another eigenvector).
Variance along that mode = second-largest eigenvalue = C22
• Similar arguments hold for 3rd, 4th, ... directions
• Thus, for any P(X) with a diagonal covariance matrix C, modes of variation are
cardinal directions that maximize variance of projected data
Principal Component Analysis (PCA)
• Directions of maximal variance
• For a general SPD covariance matrix C (and sample mean at origin):
Principal Component Analysis (PCA)
• Directions of maximal variance
• Data

• What do the eigenvectors look like ? Compute them and see for yourself.
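• A MATLAB/Octave sketch in the spirit of the examples on the next slides (the data-generating R and S below are assumed): estimate the covariance and inspect its eigenvectors (principal directions) and eigenvalues (variances along them).

% Sketch: PCA via eigen-decomposition of the sample covariance
rng(0);
N = 1e5;
R = [[1 -1]; [1 1]] / sqrt(2);          % rotation by 45 degrees (as in Examples 3 and 5)
S = diag([2, 0.5]);                     % assumed scalings
data = R * S * randn(2, N);             % samples with covariance R*S^2*R'
C_hat = cov(data');                     % sample covariance (rows of data' are samples)
[V, D] = eig(C_hat);                    % eigenvectors = principal directions
[evals, idx] = sort(diag(D), 'descend');
disp(evals')                            % approx [4, 0.25] = diag(S).^2
disp(V(:, idx))                         % columns approx +/- the columns of R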
Principal Component Analysis (PCA)
• Example 1: rng(0); N = 1e5; data = randn (2,N)
Principal Component Analysis (PCA)
• Example 2: rng(0); N = 1e5; data = rand (2,N)
Principal Component Analysis (PCA)
• Example 3: rng(0); N = 1e5; data = [[1 -1];[1 1]]/sqrt(2) * (rand (2,N))
Principal Component Analysis (PCA)
• Example 4: rng(0); N = 1e5; data = exprnd (1,[2,N])
Principal Component Analysis (PCA)
• Example 5: rng(0); N=1e5; data = [[1 -1];[1 1]]/sqrt(2) * exprnd (1,[2,N])
Principal Component Analysis (PCA)
• What happens to covariance matrix when we rotate the data ?
• In general:
• Let CX := E[ (X − µX)(X − µX)T ] = E[XXT] − µX µXT
• Let Y := RX (for any invertible R)
• Then, µY := E[Y] = E[RX] = R E[X] = R µX
• Then, CY := E[ (Y − µY)(Y − µY)T ]
• = E[YYT] − µY µYT
• = E[ R X XT RT ] − R µX µXT RT
• = R ( E[XXT] − µX µXT ) RT
• = R CX RT
• If CX = aI and R is orthogonal, then CY = R (aI) RT = aI = CX
• Thus, the eigenvalues don't change, and the eigenvectors can be any orthogonal basis
• Thus, the directions of maximal variance aren't unique
Principal Component Analysis (PCA)
• Spaces of maximal variance
• What if we want to find multi-D lower-dimensional spaces that maximize
“total dispersion/variance” ?
• Total dispersion/variance is empirical average of squared distance from mean
Principal Component Analysis (PCA)
• Spaces of maximal variance
• What if we want to find multi-D lower-dimensional spaces that maximize
“total dispersion/variance” ?
• Total dispersion/variance is empirical average of squared distance from mean
• When covariance matrix C is diagonal (and sample mean is at origin):
Principal Component Analysis (PCA)
• Spaces of maximal variance
• What if we want to find multi-D lower-dimensional spaces that maximize
“total dispersion/variance” ?
• When covariance matrix C is diagonal (and sample mean at origin):

• Now this problem is similar to what we had solved before


• Like before, we have a “convex” combination of {Cdd} with weights {0 ≤ ad ≤ 1}
• So, increase a1 to its limit (i.e., 1) and then increase a2 to its limit (i.e., 1)
Principal Component Analysis (PCA)
• Spaces of maximal variance
• What if we want to find multi-D lower-dimensional spaces that maximize
“total dispersion/variance” ?
• When covariance matrix C is diagonal (and sample mean at origin):

• Any orthogonal basis spanning the space spanned by first 2 cardinal axes will be a solution
• Similar arguments will hold for lower-dimensional spaces of dimensions 3, 4, …, D-1
• Similar arguments will also hold for a general SPD covariance matrix C
Principal Component Analysis (PCA)
• PCA applied to data from a multivariate Gaussian distribution
• Consider X is multivariate Gaussian
• If X := AW + b, then:
• Principal modes of variation are directions given by
eigenvectors of covariance matrix C := AAT
• Principal modes of variation are along
axes of hyper-ellipsoids that are level sets of P(X)
• Variances along principal modes of variation are
the eigenvalues of C
• If X := RSW + b, then:
• Principal modes of variation are
column vectors of orthogonal matrix R
i.e., eigenvectors of C = RS2RT
• Variances along principal modes of variation are
the eigenvalues of C, i.e., diagonal elements in S2
Principal Component Analysis (PCA)
• Applications: Dimensionality reduction
• Intrinsic dimension: Minimum number of variables (degrees of freedom)
required to represent the signal
• Consider a multivariate random vector X of N scalar variables: x = (x1, …, xN)
• Consider a function g(.), and M<N scalar variables a1, …, aM such that
every x~P(X) can be written as x = g (a1, …, aM) for some a1, …, aM,
then signal X needs only M variables for representation
• Here, “intrinsic dimension” of X is M, instead of the “representation dimension” = N
Principal Component Analysis (PCA)
• Applications: Dimensionality reduction

https://medium.com/analytics-vidhya/dimensionality-reduction-using-principal-component-analysis-pca-41e364615766
Principal Component Analysis (PCA)
• Applications: Dimensionality reduction
• Acquired data is corrupted with errors
• e.g., measurement errors
• Such errors make the signal representation seem to be of a dimension higher than
intrinsic dimension
• Dimensionality reduction:
Transformation of data
from a higher-dimensional space into a lower-dimensional space
so that
the lower-D representation (ideally close to its intrinsic dimension)
retains some meaningful properties of the original data
• PCA can perform linear dimensionality reduction
Principal Component Analysis (PCA)
• Applications:
Dimensionality reduction
• Using PCA
• X may be N dimensional
• PCA finds an
M-dimensional space
that captures most of the
variability (total dispersion)
in the data

http://bennymachinelearning.blogspot.com/2017/08/machine-learning-principal-component.html
Principal Component Analysis (PCA)
• Applications: Dimensionality reduction
• Using PCA
• X may be N dimensional
• PCA can find an
M-dimensional space
(often when M << N)
that captures most of
the variability
(total dispersion)
in the data
Multivariate Gaussian – Mahalanobis Distance
Multivariate Gaussian – Mahalanobis Distance

• The term (y-µ)T C-1 (y-µ) appearing in the exponent
= squared Mahalanobis distance
of the point y from the mean µ
• d(y,µ; C)² := (y-µ)T C-1 (y-µ) (where C is SPD)
• Prasanta Chandra Mahalanobis
founded
Indian Statistical Institute (ISI) in Kolkata
Multivariate Gaussian – Mahalanobis Distance
• d(y,µ; C)2 := (y-µ)T C-1 (y-µ) (where C is SPD)
• Generalizes Euclidean distance in a multidimensional space
• When C is Identity:
• Mahalanobis distance = Euclidean distance
• When C is diagonal:
• Mahalanobis distance rescales “units” along each dimension
based on standard deviation of the marginal along that dimension
• A level set of a Multivariate Gaussian PDF is
the locus of points with equal Mahalanobis distance from the mean
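• A MATLAB/Octave sketch (µ, C, and y below are assumed example values): the Mahalanobis distance of a point from the mean, and a check that it reduces to the Euclidean distance when C is the identity.

% Sketch: squared Mahalanobis distance d(y, mu; C)^2 = (y - mu)' * inv(C) * (y - mu)
mu = [1; 2];
C  = [4 1; 1 2];                      % assumed SPD covariance
y  = [3; 1];
d2_mahal  = (y - mu)' * (C \ (y - mu));
d2_euclid = (y - mu)' * (y - mu);     % the same formula with C = eye(2)
disp([d2_mahal, d2_euclid])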
Multivariate Gaussian – Mahalanobis Distance
• d(y,µ; C)2 := (y-µ)T C-1 (y-µ) (where C is SPD)
• Property: The Mahalanobis distance is a “distance metric”
• Proof:
• A distance metric is a function d(·,·) → R that needs to satisfy 3 properties:
• (1) identity of indiscernibles: d(x,y) = 0 iff x = y
• (2) symmetry: d(x,y) = d(y,x)
• (3) triangle inequality: d(x,y) <= d(x,z) + d(z,y)
• These imply non-negativity (i.e., d(x,y) >= 0, for all x,y):
0 = d(x,x) <= d(x,y) + d(y,x) = 2 d(x,y)
• In our case of SPD matrix C:
• C being SPD implies: d(x,y; C) >= 0 for all x,y
• C being SPD implies: d(x,y; C) = 0 iff x=y
• Because of the specific quadratic form of d(.,.; C)2 : d(x,y; C) = d(y,x; C)
Multivariate Gaussian – Mahalanobis Distance
• Property: The Mahalanobis distance is a true distance metric
• Proof (when covariance matrix C is diagonal):

• Showing LHS <= RHS is equivalent to showing LHS² <= RHS² (since both sides are non-negative)


Multivariate Gaussian – Mahalanobis Distance
• Property: The Mahalanobis distance is a true distance metric
• Proof (when covariance matrix C is diagonal):
Multivariate Gaussian – Mahalanobis Distance
• Property: The Mahalanobis distance is a true distance metric
• Proof (for a general covariance matrix C):
Multivariate Gaussian – Mahalanobis Distance
• A level set of a Multivariate Gaussian is the locus of points with the
same Mahalanobis distance from the mean
• Scaling the coordinate frame: X := SW
• How does the Mahalanobis distance change (w.r.t. the case when C = Identity)?
• How do the level sets change ?
Multivariate Gaussian – Mahalanobis Distance
• A level set of a Multivariate Gaussian is the locus of points with the
same Mahalanobis distance from the mean
• Scaling + “Rotating” (proper + improper) coordinate frame: Y := USW
• How does the Mahalanobis distance change (w.r.t. the case when C = Identity)?
• How do the level sets change ?
Multivariate Gaussian – Applications
Multivariate Gaussian – Applications
• Multivariate Gaussian (Mahalanobis distance) for anomaly detection

(Two figures, each with Blue = normal and Red = anomalous:
one uses the Euclidean distance from the mean of the normal data;
the other uses the Mahalanobis distance from the mean of the normal data,
with the covariance of the normal data.)
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification (in 1D)

http://sar.kangwon.ac.kr/etc/rs_note/rsnote/cp11/cp11-7.htm
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification (in 2D)
• If det(C1) = det(C2),
then likelihood-based classification is equivalent to
Mahalanobis-distance-based classification

https://onlinelibrary.wiley.com/doi/full/10.1111/maps.13314
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• What do the decision boundaries look like?
• P(x|Class1) = G (x; m1, C1)
• P(x|Class2) = G (x; m2, C2)
• Decision surface comprises all points ‘x’ at which likelihoods are equal
• { x : P(x|Class1) = P(x|Class2) }
• { x : 0 = log ( P(x|Class1) / P(x|Class2) ) }
• At any point in the domain ‘x’, the log likelihood-ratio is:
log (P(x|Class1) / P(x|Class2))
=
- 0.5 (x-m1)T C1-1 (x-m1) - 0.5 log (det (C1))
+ 0.5 (x-m2)T C2-1 (x-m2) + 0.5 log (det (C2))
• In general, decision surface is a hyper-quadric
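• A MATLAB/Octave sketch of this decision rule (the class parameters and the test point below are assumed examples): classify a point by the sign of the log likelihood-ratio between the two Gaussian class-conditional densities.

% Sketch: maximum-likelihood classification with two Gaussian class models
m1 = [0; 0];   C1 = [2 0.5; 0.5 1];        % assumed class-1 parameters
m2 = [3; 2];   C2 = [1 0; 0 1.5];          % assumed class-2 parameters
x  = [1.5; 1.0];                           % point to classify
llr = -0.5*(x - m1)'*(C1\(x - m1)) - 0.5*log(det(C1)) ...
      +0.5*(x - m2)'*(C2\(x - m2)) + 0.5*log(det(C2));
if llr > 0
    disp('assign x to Class 1')            % P(x|Class1) > P(x|Class2)
else
    disp('assign x to Class 2')
end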
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• Decision boundaries

http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect2.pdf
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• Decision boundaries
• When C1 = C2 = C, then decision boundary is:
• 0 = log( P(x|Class1) / P(x|Class2) )
= - 0.5 (x-m1)T C-1 (x-m1) + 0.5 (x-m2)T C-1 (x-m2)
• Equivalently, 0 = (m2-m1)T C-1 x + 0.5 m1T C-1 m1 - 0.5 m2T C-1 m2
• Decision surface is a hyper-plane
Multivariate Gaussian – Applications
• Multivariate Gaussian for maximum-likelihood classification
• Example (Data taken from R. A. Fisher's classic 1936 paper)
• UCI ML repository Iris dataset http://archive.ics.uci.edu/ml/datasets/Iris/

http://mi.eng.cam.ac.uk/~mjfg/local/4F10/lect2.pdf
Datasets
• UCI Machine Learning Repository
• https://archive.ics.uci.edu/ml/
Singular Value Decomposition (SVD)
• Singular Value Decomposition (SVD)
• What is it about?
• What can we say about existence?
• What can we say about uniqueness?
• How does it help us understand the multivariate Gaussian?
Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD)
• Matrix factorization
• Let matrix A be size MxN
• When A is real valued, then SVD of A = U S VT, where:
• V is orthogonal of size NxN
• When A is complex: V is unitary
• U is orthogonal of size MxM
• When A is complex: U is unitary
• S is (rectangular) diagonal with size MxN
• Values on diagonal = singular values
• Singular values are non-negative real (even when A, U, V are complex-valued)
• If the m-th columns of U and V are um and vm, respectively, then A = Σm Smm um vmT (a sum of rank-1 terms)
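• A MATLAB/Octave sketch (the rectangular matrix below is an assumed example): svd returns orthogonal U and V and non-negative singular values, and the rank-1 sum reconstructs A.

% Sketch: SVD of a rectangular real matrix, A = U*S*V'
A = [3 1 0; 1 2 1];                      % assumed 2x3 example (M = 2, N = 3)
[U, S, V] = svd(A);                      % U: 2x2, S: 2x3, V: 3x3
disp(diag(S)')                           % non-negative singular values
disp(norm(U*S*V' - A))                   % ~0
% rank-1 reconstruction: A = sum over m of S(m,m) * u_m * v_m'
A_sum = zeros(size(A));
for m = 1:min(size(A))
    A_sum = A_sum + S(m,m) * U(:,m) * V(:,m)';
end
disp(norm(A_sum - A))                    % ~0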
Singular Value Decomposition (SVD)
• A = U S VT
• An example, in pictures:

A = U S VT
Singular Value Decomposition (SVD)
• Geometric interpretation of the action of a matrix A on a vector
• In this example, A is square
• (Figure: the action of A decomposed as VT first, then S, then U.)
Singular Value Decomposition (SVD)
Singular Value Decomposition (SVD)
• Matrix norm
• Induced by a vector norm

• Geometric interpretation related to 2-norm


• Apply “linear operator” A to all unit-norm vectors x (starting at origin)
• Let y := A x, for all such x
• Then, among all vectors y, pick the norm of the vector y’ that has the largest norm
Singular Value Decomposition (SVD)
• Matrix norm
• Induced by a vector norm
Singular Value Decomposition (SVD)
• Existence, for any real matrix A
Singular Value Decomposition (SVD)
• Existence
• How to analyze S further ? Induction on size of A, i.e., MxN
Singular Value Decomposition (SVD)
• Properties of singular values, vectors
• What does A = U S VT imply ?
• Some insights via algebra and geometry
• Let i-th column of V be vi
• Let j-th column of U be uj
• What is Avi ? For example, take i = 2.
(assume S is at least of size 2x2)
• Av2 = USVT v2
= U S [0 1 0 … 0]T
= U [0 S22 0 … 0]T
= S22 u2
• Thus, Av1 is along u1, and, hence, orthogonal to all other columns of U
• Also, Av2 is along u2, and, hence, orthogonal to all other columns of U, …
• Also, if any vector v is orthogonal to v1, then Av is orthogonal to Av1, i.e., to u1
Singular Value Decomposition (SVD)
• Uniqueness analysis of singular values and singular vectors
• (Figure-based argument involving the vectors v1, w, x and the angle θ.)
Singular Value Decomposition (SVD)
• Uniqueness analysis
• Why is norm(B) <= norm(A) ?
• We know that A = USVT , where U and V are orthogonal, and S is as shown above
• Let β := norm(B)
• By definition of norm(B),
there exists a unit-norm column-vector y such that norm(By) = β
• Use that y to construct a longer (but still) unit-norm column vector x := V [0,yT]T
• norm (A x)
= norm ( USVT V[0,yT]T )
= norm ( S [0,yT]T )
= norm ( [ 0, (By)T ]T )
= norm (By)

• Thus, there exists a vector x such that norm(Ax) = β,
which implies that norm(A) cannot be less than β, i.e., norm(A) >= β = norm(B)
Singular Value Decomposition (SVD)
• Uniqueness analysis

• Properties of other singular values and singular vectors follows by induction


• Thus, if all singular values are distinct, then all singular vectors are unique
(upto sign)
Singular Value Decomposition (SVD)
• How does SVD help us in understanding the multivariate Gaussian ?
• Consider X := AW, where:
• Components of W are independent standard-normal. A is of size MxN, where M < N.
• We use A := USVT , where:
• S is MxN (rectangular) diagonal. U is MxM orthogonal. V is NxN orthogonal.
• AW
= USVT W
= U S W’ (where components of W’ are also independent standard-normal)
= U S’ W’’ (where S’ is square with columns as the first M columns of S,
W’’ is first M components of W’)
= A’ W’’ (where A’=US’ is MxM, and W’’ is Mx1)
• Covariance(X) = C = AAT = U SST UT = A’A’T, where:
• SST is square diagonal of size MxM
• For matrix C to be SPD, the rank of S needs to be M (M non-zero singular values)
