Module 08: Polynomial Regression, Splines, and GAMs
Figure: high VIF / structural multicollinearity (after centering).
Fitting Smoothing Splines
Figure: cubic spline and smoothing spline fits with df = 16; smoothing spline with df = 6.79 selected by LOOCV.
Local regression
• Local regression is a different approach for fitting flexible non-linear functions: the fit at a target point $x_0$ is computed using only the nearby training observations.
• Local regression is sometimes referred to as a memory-based procedure, because
like nearest-neighbours, we need all the training data each time we wish to
compute a prediction.
• Choices to be made (see the loess() sketch below):
• How to define the weighting function $K$
• The form of regression fitted locally (constant, linear, or quadratic)
• The span $s$, which plays the role of a tuning parameter (smaller $s$: more local fit, higher variance)
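A minimal sketch of local regression with base R's loess() on simulated data; the span values and degree below are illustrative choices, not taken from the slides.

# Illustrative local regression fits with loess() on simulated data
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

fit_wide   <- loess(y ~ x, span = 0.75, degree = 2)  # larger span: smoother, lower variance
fit_narrow <- loess(y ~ x, span = 0.20, degree = 2)  # smaller span: more local, higher variance

plot(x, y, col = "grey", main = "Local regression with two spans")
lines(x, predict(fit_wide),   col = "red",  lwd = 2)
lines(x, predict(fit_narrow), col = "blue", lwd = 2)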
Generalized additive models (GAMs)
$y_i = \beta_0 + \sum_{j=1}^{p} f_j(x_{ij}) + \epsilon_i$
• This is an example of a GAM, where the linear component $\beta_j x_{ij}$ of the multiple linear regression model is replaced by a smooth non-linear function $f_j(x_{ij})$.
• It is called an additive model because we calculate a separate $f_j$ for each $X_j$ and then add together all of their contributions.
Generalized additive models (GAMs)
• In the regression setting GAM has the form
$E[Y \mid X_1, \ldots, X_p] = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$
• The 𝑓𝑗 ()s are unspecified, smooth, non-parametric functions
• Each function is fitted using a scatter-plot smoother (e.g. a cubic smoothing spline or kernel smoother), and all $p$ functions are then estimated simultaneously using an algorithm such as backfitting.
• An additive logistic regression model is represented by
$\log\left(\dfrac{P(Y=1 \mid X)}{1 - P(Y=1 \mid X)}\right) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p)$
• The above model can also be extended to other generalized linear models, including the linear, logit, probit, gamma, negative-binomial, and log-linear models.
• Linear and other parametric forms can be mixed with the nonlinear terms, a
necessity when some of the inputs are qualitative variables (factors).
Generalized additive models (GAMs)
• Additive models can replace linear models, e.g. the additive decomposition of a time series (see the sketch below):
$Y_t = S_t + T_t + \epsilon_t$
where $S_t$ is the seasonal component, $T_t$ is the trend, and $\epsilon_t$ is the error term.
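A minimal sketch of an additive seasonal-trend decomposition in base R using decompose(); the built-in AirPassengers series is used purely for illustration (it is often modelled multiplicatively, but the additive form matches the equation above).

# Additive decomposition Y_t = S_t + T_t + e_t of a monthly time series
y   <- AirPassengers                    # built-in monthly series, illustrative only
dec <- decompose(y, type = "additive")  # estimates seasonal, trend, and random components
plot(dec)                               # panels: observed, trend, seasonal, random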
Model fitting with GAMs
• The additive model has the form
$Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \epsilon, \qquad E[\epsilon] = 0$
• A penalized residual sum of squares, analogous to the smoothing-spline criterion, can be used to fit this model:
$\mathrm{PRSS}(\alpha, f_1, \ldots, f_p) = \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij}) \Big)^2 + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j$
• The minimizer of the PRSS is an additive cubic spline model in the $X_j$, with knots at each of the unique values of $x_{ij}$, $i = 1, \ldots, N$.
Backfitting Algorithm
1. Initialize: $\hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N} y_i$, $\hat{f}_j \equiv 0$ for all $i, j$.
2. Cycle over $j = 1, \ldots, p$:
$\hat{f}_j \leftarrow \mathcal{S}_j\Big[\big\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik}) \big\}_{1}^{N}\Big]$
$\hat{f}_j \leftarrow \hat{f}_j - \frac{1}{N} \sum_{i=1}^{N} \hat{f}_j(x_{ij})$
until the change in the $\hat{f}_j$ is smaller than some threshold.
• $\mathcal{S}_j$ is a cubic smoothing spline applied to the targets $\big\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik}) \big\}_{1}^{N}$ to obtain a new estimate of $\hat{f}_j$.
Backfitting Algorithm
• The operation of the smoother $\mathcal{S}_j$ at the training points can be represented by an $N \times N$ operator matrix $\mathbf{S}_j$.
• The degrees of freedom for the $j$th term are then (approximately) computed as $df_j = \mathrm{trace}[\mathbf{S}_j] - 1$.
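A minimal sketch of the backfitting algorithm in R, using smooth.spline() as the smoother $\mathcal{S}_j$; the simulated data, df = 5, and the convergence tolerance are illustrative assumptions.

# Backfitting with cubic smoothing splines (illustrative data and settings)
set.seed(1)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)
X  <- cbind(x1, x2)

alpha <- mean(y)                  # 1. initialize alpha-hat; all f_j start at 0
f     <- matrix(0, n, ncol(X))    # columns hold f_j evaluated at the training points

for (iter in 1:50) {              # 2. cycle over j until the f_j stop changing
  f_old <- f
  for (j in 1:ncol(X)) {
    partial <- y - alpha - rowSums(f[, -j, drop = FALSE])  # partial residuals
    sj      <- smooth.spline(X[, j], partial, df = 5)      # apply smoother S_j
    f[, j]  <- predict(sj, X[, j])$y
    f[, j]  <- f[, j] - mean(f[, j])                       # re-center so f_j has mean zero
  }
  if (max(abs(f - f_old)) < 1e-6) break
}

fitted_values <- alpha + rowSums(f)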
Advantages and Disadvantages
• Advantages:
1) GAMs allow us to fit a non-linear $f_j$ for each $X_j$, so that we can automatically model non-linear relationships that standard linear regression will miss.
2) The non-linear fits can potentially make more accurate predictions for the response $Y$.
3) Because the model is additive, we can examine the effect of each $X_j$ on $Y$ individually while holding all of the other variables fixed.
4) The smoothness of the function $f_j$ for the variable $X_j$ can be summarized via degrees of freedom.
• Disadvantages:
1) The model is restricted to be additive, so with many variables important interactions can be missed. However, as with linear regression, we can manually add interaction terms, or include low-dimensional interaction functions fitted with two-dimensional smoothers such as local regression (see the sketch below).
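Following the ISLR lab, a two-dimensional local regression surface can be added to a GAM to capture an interaction between two predictors; the span value here is an illustrative choice.

library(ISLR)   # Wage data
library(gam)
# Interaction between year and age fitted by a two-dimensional local regression surface
gam.lo.i <- gam(wage ~ lo(year, age, span = 0.5) + education, data = Wage)
# plotting the bivariate surface requires the akima package: library(akima); plot(gam.lo.i)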
Regression with GAM
library(ISLR)   # provides the Wage data set
library(gam)
# Additive model: smoothing splines in age and year, plus education as a qualitative term
gam1 <- gam(wage ~ s(age, df = 4) + s(year, df = 4) + education, data = Wage)
par(mfrow = c(1, 3))
plot(gam1, se = TRUE)
Logit with GAM
# Additive logistic regression: model P(wage > 250) with smooth terms in age and year
gam2 <- gam(I(wage > 250) ~ s(age, df = 4) + s(year, df = 4) + education,
            data = Wage, family = binomial)
par(mfrow = c(1, 3))
plot(gam2)
Kernel Density Estimation
• Kernel density estimation (KDE) is a non-parametric method for estimating the probability density function of a random variable, using kernels as weights.
• Let $(x_1, x_2, \ldots, x_n)$ be independent and identically distributed samples drawn from some univariate distribution with unknown density $f$. The kernel density estimate at any given point $x$ is
$\hat{f}_\lambda(x) = \frac{1}{n} \sum_{i=1}^{n} K_\lambda(x - x_i)$
𝐾 is the kernel — a non-negative function — and 𝜆 > 0 is a smoothing parameter called
the bandwidth.
• A range of kernel functions are commonly used: uniform, triangular, biweight, triweight,
Epanechnikov, normal, and others.
• KDE works by placing a kernel at each observation and summing the contributions into a smooth curve of the distribution (see the sketch below).
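A minimal sketch of kernel density estimation with base R's density(), assuming a standard-normal sample; the bandwidths mirror those in the figure on the bandwidth-selection slide.

# Kernel density estimates of a standard-normal sample with three bandwidths
set.seed(1)
x <- rnorm(100)
plot(density(x, kernel = "gaussian", bw = 0.337), main = "Gaussian KDE")  # moderate bandwidth
lines(density(x, bw = 0.05), col = "red")     # small bandwidth: noisy, under-smoothed
lines(density(x, bw = 2),    col = "green")   # large bandwidth: over-smoothed
curve(dnorm(x), add = TRUE, col = "grey", lty = 2)  # true N(0, 1) density for reference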
Bandwidth Selection
• The most common optimality criterion used to select the bandwidth is the mean integrated squared error (MISE).
• In each of the kernels $K_\lambda$, $\lambda$ is a parameter that controls the width:
• For the Epanechnikov or tri-cube kernel with metric width, $\lambda$ is the radius of the support region.
• For the Gaussian kernel, $\lambda$ is the standard deviation.
• For $k$-nearest neighbourhoods, $\lambda$ is the number $k$ of nearest neighbours, often expressed as a fraction or span $k/N$ of the total training sample.
Figure: kernel density estimates with different bandwidths. Red: KDE with $\lambda = 0.05$; black: KDE with $\lambda = 0.337$; green: KDE with $\lambda = 2$; grey curve: normal density with mean 0 and variance 1 (source: Wikipedia).
Kernel Smoothing
KNN and Kernel Smoothing
• The KNN average is computed as
$\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$
• Here $N_k(x)$ is the set of $k$ points nearest to $x$ in squared distance.
• Moving $x_0$ from left to right, the $k$-nearest neighbourhood remains constant until a point $x_i$ to the right of $x_0$ becomes closer than the furthest point $x_{i'}$ in the neighbourhood to the left of $x_0$, at which time $x_i$ replaces $x_{i'}$.
• This leads to a discontinuous $\hat{f}(x)$.
• Alternatively, assign weights that die off smoothly with distance from
the target point
Kernel Smoothing
• Nadaraya-Watson kernel-weighted average:
$\hat{f}(x_0) = \dfrac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{N} K_\lambda(x_0, x_i)}$
• Epanechnikov quadratic kernel:
$K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{\lambda}\right)$
where
$D(t) = \begin{cases} \tfrac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$
Kernel Smoothing
$K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{\lambda}\right)$
• $\lambda$ represents the width of the kernel: the larger the value of $\lambda$, the wider the kernel and the smoother the resulting fit (see the sketch below).
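A minimal sketch of the Nadaraya-Watson kernel-weighted average with the Epanechnikov kernel, assuming simulated data; the value $\lambda = 0.2$ is an illustrative choice.

# Nadaraya-Watson kernel-weighted average with the Epanechnikov kernel
epanechnikov <- function(t) ifelse(abs(t) <= 1, 0.75 * (1 - t^2), 0)

nw_fit <- function(x0, x, y, lambda) {
  w <- epanechnikov(abs(x - x0) / lambda)  # weights die off smoothly with distance from x0
  sum(w * y) / sum(w)                      # weighted average of the responses
}

set.seed(1)
x <- sort(runif(200))
y <- sin(4 * x) + rnorm(200, sd = 0.3)

x_grid <- seq(0, 1, length.out = 100)
f_hat  <- sapply(x_grid, function(x0) nw_fit(x0, x, y, lambda = 0.2))

plot(x, y, col = "grey", main = "Nadaraya-Watson kernel smoothing")
lines(x_grid, f_hat, col = "red", lwd = 2)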