Mapping Regional Climate
D G Rossiter
Cornell University, Section of Soil & Crop Sciences
ISRIC–World Soil Information
Contents
1 Introduction 1
2 Example dataset 2
3 Data exploration 3
3.1 Feature-space summary . . . . . . . . . . . . . . . . . . . . . 3
3.2 Station locations . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.3 Postplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.4 * Viewing in geographic context . . . . . . . . . . . . . . . . . 9
10 Data-driven models 79
10.1 Regression trees . . . . . . . . . . . . . . . . . . . . . . . . . . 80
10.1.1 Fitting a regression tree model . . . . . . . . . . . . . 80
10.1.2 Regression tree prediction over the study area . . . . . 89
10.2 Random forests . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10.2.1 Fitting a Random Forest model . . . . . . . . . . . . . 91
10.2.2 Random Forest prediction over the study area . . . . . 98
10.3 Tuning data-driven models . . . . . . . . . . . . . . . . . . . 100
10.4 Cubist . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
10.5 Additional covariables . . . . . . . . . . . . . . . . . . . . . . 115
10.6 Models with the extended set of predictors . . . . . . . . . . . 116
10.6.1 Relation among predictors . . . . . . . . . . . . . . . . 116
10.6.2 Random forest with additional covariables . . . . . . . 119
10.6.3 Variable importance in the extended model . . . . . . 123
10.7 Shapley values . . . . . . . . . . . . . . . . . . . . . . . . . . 124
10.7.1 *Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.7.2 Practice . . . . . . . . . . . . . . . . . . . . . . . . . . 125
10.7.3 Shapley Additive exPlanations (SHAP) . . . . . . . . 127
10.8 Extended vs. base model . . . . . . . . . . . . . . . . . . . . . 130
14 Answers 178
15 Challenge 185
References 187
“Advance step-by-step; slow but steady wins the race”
1 Introduction
This exercise presents various methods for regional mapping of climate vari-
ables from station information. The methods all relate to the universal
model of spatial distribution of a variable:
and a few challenges. §15 is an exercise to apply these techniques to a
different study area and/or a different climate variable.
Note: The source for this document is a text file that includes ordinary LaTeX source and "chunks" of R source code, using the Noweb syntax. The formatted R source code, R text output, and R graphs in this document were automatically generated and incorporated into a LaTeX source file by running the Noweb source document through R, using the knitr package [30]. The LaTeX source was then compiled into the PDF you are reading. The source code is provided in file MappingRegionalClimate.R.
2 Example dataset
In this tutorial we use agricultural climate as an example of a spatial points dataset. The Northeast Regional Climate Center has kindly provided a set of point ESRI shapefiles with various variables related to agricultural climate measured from 1971-2000. This dataset was developed by the Unified Climate Access Network (UCAN), and covers the entire United States of America and its dependencies. It consists of several variables relevant for agricultural climate: mean monthly and annual precipitation, mean monthly and annual temperature, annual freeze-free period base 32°F and 28°F, annual extreme minimum temperature, and monthly and annual growing degree-days, base 50°F (relevant for C4 crops) and base 40°F (relevant for C3 crops).

In this tutorial we use as an example growing degree-days, base 50°F (GDD50).
In §13 we compare models and predictions of this variable with models and
predictions for others from the same dataset.
For the purposes of this tutorial, we have prepared the dataset of GDD50,
along with State boundaries and a set of environmental covariables associ-
ated with agricultural climate. If you are interested in how this was prepared,
for example if you want to apply this method for other agricultural climate
variables or in other regions of the USA, the details are in the related tuto-
rial, “Tutorial: Regional mapping of climate variables from point samples:
Data preparation”.
3 Data exploration
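A numerical summary of the target variable and its possible predictors can be produced with summary; a minimal sketch, in which the object and column selection are assumptions chosen to match the output shown next:

# summarize the coordinates, the target variable, and the elevation covariable
summary(ne.df[, c("E", "N", "ANN_GDD50", "ELEVATION_")])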
## E N ANN_GDD50 ELEVATION_
## Min. :-375745 Min. :-393923 Min. : 795 Min. : 5.0
## 1st Qu.:-156768 1st Qu.:-208100 1st Qu.:2100 1st Qu.: 330.0
## Median : 47130 Median :-105357 Median :2463 Median : 711.0
## Mean : 3231 Mean : -77213 Mean :2518 Mean : 799.7
## 3rd Qu.: 151648 3rd Qu.: 36720 3rd Qu.:2930 3rd Qu.:1160.0
## Max. : 318960 Max. : 276614 Max. :4021 Max. :3950.0
The stations go from almost at sea level up to more than 4 000’ (1 220 m.a.s.l.).
The maximum annual GDD50 is more than five times the minimum. So in
both we have a good range to find statistical relations.
## [1] MULTIPOLYGON
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON ... TRIANGLE
## [1] MULTILINESTRING
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON ... TRIANGLE
3.3 Postplots
We now display the locations, but also using the point symbol represent the
data value. This is called a postplot, because it “posts” (puts into geographic
position) the data values.
Task 5 : Display a postplot of the annual growing degree days above 50°F (attribute ANN_GDD50). •

6 See the help for the palette function
[Figure: station locations in the E-N coordinate plane, labelled with abbreviated station names]
For this we use the "Grammar of Graphics" [27] as implemented in the ggplot2 package [25, 26]. This is part of the so-called "tidyverse" set of packages from Hadley Wickham. The tidyverse web site has a comprehensive introduction to ggplot2.
require(ggplot2)
The ggplot2 concept is that a graphic is initialized with ggplot and then
elements are added to the graphic, each separated by a + operator, which in
this context means “add to the graph”, not arithmetic addition.
In this example we first open the plot with the ggplot function, specifying
the source of the data for the plot with the data argument, and then:
1. specify how the graph should be set up on the page with the aes
function: the x axis from the E coördinate and the y axis from the N
coördinate;
2. add points with the geom_point "geometry" function;
3. add axis labels with the xlab and ylab functions;
4. add a fixed-scale coördinate system for the graphic, with coord_fixed.
The points have an "aesthetic" (how they are displayed), specified with the aes function, together with the name of the data frame where the names used in the aesthetic can be found. We make the size of the points proportional to the degree-days, and the colour of the points proportional to the elevation, so we can visually assess whether there is any relation with the coördinates and with elevation. We specify a printing character with the shape argument to geom_point.
ggplot(data=ne.df) +
aes(x=E, y=N) +
geom_point(aes(size=ANN_GDD50,
colour=ELEVATION_),
shape=20) +
xlab("E") + ylab("N") + coord_fixed()
Q1 : Does there appear to be any regional trend with North or East? with
elevation? Jump to A1 •
7 https://www.tidyverse.org
8 https://ggplot2.tidyverse.org
[Figure: postplot of the stations, point size proportional to ANN_GDD50, colour by ELEVATION_]
[Figure: stations in geographic coördinates, point size by GDD base 50F, colour by elevation (feet a.s.l.)]
[Figure: the same map re-drawn with the GDD base 50F scale 1500-4000, i.e., with the lowest-GDD50 station excluded]
Note the one very small GDD50, which causes the remainder of the points to look quite similar. We can remove this for display purposes, and then show it separately as a red point.
(which.lowest.gdd <- which.min(ne.m$ANN_GDD50))
## [1] 293
ggplot() +
geom_sf(data = ne.m[-which.lowest.gdd, ],
aes(size=ANN_GDD50,
colour = ELEVATION_), shape = 10) +
geom_sf(data = ne.m[which.lowest.gdd, ], colour = "red") +
labs(x = "Longitude", y = "Latitude",
size = "GDD base 50F",
colour = "Elevation, feet a.s.l.") +
geom_sf(data = state.ne.m.boundary, col = "darkgray", size = 2)
Figure 5: Stations shown on Google Earth
kml_layer(ne.wgs84, colour=ANN_GDD50,
colour_scale=SAGA_pal[[1]], shape=shape,
points_names=station.names)
## Writing to KML...
kml_close('ne.stations.kml')
## Closing ne.stations.kml
Open the resulting KML file in Google Earth to see the station locations in
their geographic context; see Figure 5.
4 Trend surface: a linear model solved by Ordinary Least Squares
An obvious approach to prediction is to develop a model of GDD50 based on
one or more of the covariables in the dataframe: (1) Easting, (2) Northing
coördinates and (3) elevation. We know that in general higher elevations
and more northerly latitudes are cooler; in this Atlantic region maybe the
more easterly longitudes are warmer. We saw these relations spatially with
the 2.5D postplots of the previous section, here we look at the relations in
feature space for each possible covariable separately.
print(p3)
print(p2)
print(p1)
[Figure: scatterplots of ANN_GDD50 vs. Northing, Easting, and ELEVATION_, coloured by state (NJ, NY, PA, VT)]
We now examine the scatterplots to see if these relations are useful predictors of GDD50, and if so, whether the relation appears to be linear or a transformation to linearity is needed.
Q2 : Describe the relations between GDD50 and the three possible predic-
tors. Jump to A2
[Figure: ANN_GDD50 vs. sqrt(ELEVATION_), coloured by state]
Task 7 : Re-display the relation between GDD50 and elevation, but with
the elevation square-root transformed to correspond to the inverse parabolic
shape perceived in the original graph. •
ggplot() +
geom_point(
aes(x=sqrt(ELEVATION_), y=ANN_GDD50, colour=STATE),
data=ne.df
)
Note: This does not agree with theory, in that the adiabatic lapse rate (due
to thinner air) of temperature with elevation is linear.
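The values printed next are the linear correlations of annual GDD50 with the candidate predictors; a minimal sketch of the calls, in which the pairing of the three printed values with Northing, Easting, and the square root of elevation is an assumption based on the order of the scatterplots above:

# linear correlations of the target with each candidate predictor;
# the order (N, E, sqrt(elevation)) is assumed
cor(ne.df$ANN_GDD50, ne.df$N)
cor(ne.df$ANN_GDD50, ne.df$E)
cor(ne.df$ANN_GDD50, sqrt(ne.df$ELEVATION_))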
## [1] -0.728741
## [1] -0.009425215
## [1] -0.7468479
Q3 : Which linear correlations are strongest? What does the sign (±) of
the correlation indicate? Jump to A3 •
Task 9 : Fit a linear model, using ordinary least squares (OLS), of annual
GDD predicted by the square root of elevation. Display the model summary.
•
The workhorse lm “linear modelling” function fits the model.
m.ols.elev <- lm(ANN_GDD50 ~ sqrt(ELEVATION_), data=ne.df)
summary(m.ols.elev)
##
## Call:
## lm(formula = ANN_GDD50 ~ sqrt(ELEVATION_), data = ne.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1346.24 -278.03 2.65 258.74 1012.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3472.287 53.520 64.88 <2e-16 ***
## sqrt(ELEVATION_) -37.000 1.893 -19.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 382.3 on 303 degrees of freedom
## Multiple R-squared: 0.5578,Adjusted R-squared: 0.5563
## F-statistic: 382.2 on 1 and 303 DF, p-value: < 2.2e-16
Q4 : How much of the variability of GDD50 over the four states is explained
by this model? Jump to A4 •
Task 10 : Fit a linear model, using ordinary least squares (OLS), of an-
nual GDD predicted by the additive effects of Northing and square root of
elevation. Display the model summary. •
Again we use the lm “linear modelling” function to fit the model:
m.ols <- lm(ANN_GDD50 ~ sqrt(ELEVATION_) + N, data=ne.df)
summary(m.ols)
##
## Call:
## lm(formula = ANN_GDD50 ~ sqrt(ELEVATION_) + N, data = ne.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -899.15 -156.45 -9.78 153.12 660.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.119e+03 3.409e+01 91.49 <2e-16 ***
## sqrt(ELEVATION_) -2.924e+01 1.138e+00 -25.69 <2e-16 ***
## N -1.981e-03 8.051e-05 -24.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220.9 on 302 degrees of freedom
## Multiple R-squared: 0.8528,Adjusted R-squared: 0.8518
## F-statistic: 874.9 on 2 and 302 DF, p-value: < 2.2e-16
Now the model explains much more (85.2%) of the variation in GDD50; we
conclude that the North coördinate is an important predictor.
Task 11 : Plot the actual vs. fits as a scatterplot, adding a 1:1 line. •
plot(ne.m$ANN_GDD50 ~ fitted(m.ols),
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by OLS", ylab="Actual",
main="Annual GDD50")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid()
abline(0,1)
Q5 : How well does the model fit the observations? Are any observations
poorly-fit? Jump to A5 •
• Residuals are normally-distributed;
• No relation between the fitted value and the distribution of the resid-
uals, either their mean or their spread;
• No observations with high leverage (i.e., influence on the regression co-
efficients) have large residuals. This is measured with Cook’s Distance,
which gives the effect of deleting an observation on the regression co-
efficients.
par(mfrow=c(1,3))
plot(m.ols, which=c(1,2,5))
par(mfrow=c(1,1))
[Figure: OLS model diagnostic plots: residuals vs. fitted values, normal Q-Q plot, and residuals vs. leverage with Cook's distance; observations 3047 and 3070 stand out]
Task 13 : Find and display the points with the maximum and minimum
model residuals. •
The which.min and which.max functions find the index (position) in a vector
of the minimum and maximum value, respectively.
(ix <- c(which.max(residuals(m.ols)), which.min(residuals(m.ols))))
## 3087 3070
## 113 96
ne.df[ix,]
ne.df[ix,c("LATITUDE_D", "LONGITUDE_")]
## LATITUDE_D LONGITUDE_
## 3087 42.52 -74.97
## 3070 43.30 -75.15
Note: If you wish, find the approximate location on Google Earth by enter-
ing the geographic coördinates; note these are on NAD27 so will not match
the WGS84 of Google Earth exactly. You can also find this on Google Maps
and display the terrain.
Note that 3911’ = 1192 m; the strong suspicion is that this height which
should be in feet was assumed to be in meters and then multiplied by
(3.28084 feet m-2), when in fact it was in feet. So, the incorrectly high
elevation lead to an incorrectly low predicted value. After some research we
find9 revised records for this station10 :
BEGINS ENDS LATITUDE LONGITUDE
NUM DIV ST COUNTY COOP STATION YEARMODY YEARMODY D M S D M S ELEV
305113 02 NY OTSEGO MARYLAND 6 SW 19831026 19881001 42 31 00 -074 58 00 1192
305113 02 NY OTSEGO MARYLAND 6 SW 19881001 20080702 42 31 00 -074 58 00 1192
Here the elevation is given as 1192’, which matches well with the terrain;
6 miles SW from the village of Maryland is along Schenvus Creek , whose
9 https://wrcc.dri.edu/Monitoring/Stations/station_inventory_show.php?snet=
coop&sstate=NY
10 In fact this station was moved in July 2008, after our time period 1971-2000, so there
is another record for it at a slightly different location
16
bluffs are at about 1200’. Further this matches the assumed mistake. So we
feel justified in correcting this record accordingly.
The over-prediction is two miles SW of Hinckley, NY in Oneida County, near Barneveld. Again this appears to be a mistake: the elevation is listed as 114' but is in fact about 1200', so the model predicts as if the station were at a much lower elevation, i.e., the prediction is too high. It seems that here a digit was simply dropped; the database record is:
BEGINS ENDS LATITUDE LONGITUDE
NUM DIV ST COUNTY COOP STATION YEARMODY YEARMODY D M S D M S ELEV
303889 06 NY ONEIDA HINCKLEY 2 SW 19871014 19930301 43 18 00 -075 09 00 1141
We can correct these two points with their elevations from this database, which appears to be correct. We make sure to keep a record of these changes and our reasons for making them, in case other analysts want to check our work.
Task 14 : Correct the elevation attribute of these two points in the dataset.
•
We correct the ELEVATION_ field, which we are using for modelling. But we
see that the dataset also has an ELEV_FT field, which duplicates this infor-
mation – it is unclear why. At any rate, we do not want an inconsistent
dataset, so we correct both elevation fields to the same value.
ne.m[ix[1],"ELEVATION_"] <- ne.m[ix[1],"ELEV_FT"] <- 1192
ne.df[ix[1],"ELEVATION_"] <- ne.df[ix[1],"ELEV_FT"] <- 1192
ne.m[ix[2],"ELEVATION_"] <- ne.m[ix[2],"ELEV_FT"] <- 1141
ne.df[ix[2],"ELEVATION_"] <- ne.df[ix[2],"ELEV_FT"] <- 1141
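The linear model is then re-fit with the corrected elevations and its summary displayed; a minimal sketch, repeating the earlier call:

# re-fit the OLS model on the corrected data frame
m.ols <- lm(ANN_GDD50 ~ sqrt(ELEVATION_) + N, data=ne.df)
summary(m.ols)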
##
## Call:
## lm(formula = ANN_GDD50 ~ sqrt(ELEVATION_) + N, data = ne.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -532.76 -153.15 -7.84 155.44 641.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.151e+03 3.321e+01 94.87 <2e-16 ***
## sqrt(ELEVATION_) -3.040e+01 1.113e+00 -27.33 <2e-16 ***
## N -1.952e-03 7.730e-05 -25.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 211.6 on 302 degrees of freedom
## Multiple R-squared: 0.865,Adjusted R-squared: 0.8641
## F-statistic: 967.3 on 2 and 302 DF, p-value: < 2.2e-16
The model coefficients have changed and the R² has increased by about 1.5%, as a result of correcting just these two (out of a total of 305) records.
What about the regression diagnostics?
par(mfrow=c(1,3))
plot(m.ols, which=c(1,2,5))
par(mfrow=c(1,1))
[Figure: diagnostic plots for the corrected model: residuals vs. fitted, normal Q-Q, and residuals vs. leverage with Cook's distance; extreme residuals are now observations 3047 and 3031, with much lower leverage]
These are much better. Notice how the points with high absolute residuals
now have much lower leverage, and the normal Q-Q plot no longer has points
well off the expected 1:1 line.
Redo the actual vs. fitted plot:
plot(ne.m$ANN_GDD50 ~ fitted(m.ols),
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by OLS (corrected data)",
ylab="Actual",
main="Annual GDD50")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid()
abline(0,1)
Challenge: Check the records for the largest residuals in the corrected
model; if there is good evidence to do so, adjust the database accordingly
and re-fit the model.
Challenge: Add Easting to the additive linear model and see if the model co-
efficient is significantly different from zero. You can try a pure trend surface
(without elevation), a trend surface with elevation, and a trend surface with
an elevation interaction – i.e., the effect of elevation is not the same across
the region. Interpret the “best” model. Does this have a physical interpretation, in the same way we can relate GDD50 to Northing and elevation?
Compare with ANOVA and/or AIC; confirm good linear model diagnostics.
Identify the stations with the largest positive and negative residuals; try to
explain why.
Are the model residuals spatially correlated? If so, that violates the as-
sumption of independent residuals that is necessary for the OLS fit to be
optimum.
Task 17 : Add the linear model residuals to the sf object, so we can display
them spatially. •
ne.m$ols.resid <- residuals(m.ols)
Note: This code uses the ability of R to build a command string using the
paste function, parse it into R internal format with the parse function, and
then evaluate it in the current environment with the eval function.
Arguments:
.point.obj.name : Name of the point object, not the object itself
.field.name : Name of the data column (“field”) in the point object to be plotted
.field.label : A text label for the field
.title : Plot title, default none.
bubble.sf <- function(.point.obj.name, .field.name, .field.label, .title="") {
  # make a plus/minus indicator
  eval(parse(
    text = paste0("pm <- factor(",
                  .point.obj.name, "$",
                  .field.name, " > 0)")
  ))
  # rename them
  levels(pm) <- c("-, overprediction", "+, underprediction")
  # plot
  eval(parse(
    text = paste0("ggplot(",
                  .point.obj.name,
                  ") + geom_sf(aes(colour = pm, size = abs(",
                  .field.name,
                  ")), shape = 1) + labs(size = paste('+/-', .field.label),
                  colour = '', title = '",
                  .title, "') +
                  scale_colour_manual(values = c('red', 'green'))")
  ))
}
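A call such as the following would then produce the bubble plot of the OLS residuals shown next; the field label is an assumption chosen to match the plot legend:

bubble.sf("ne.m", "ols.resid", "residuals, GDD50F", "OLS")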
[Figure: bubble plot of the OLS residuals; symbol size = absolute residual (GDD50F), colour distinguishes over- from under-prediction]
trend surface. Jump to A7 •
\gamma(\mathbf{s}_i, \mathbf{s}_j) = \frac{1}{2} \left[ z(\mathbf{s}_i) - z(\mathbf{s}_j) \right]^2 \qquad (3)
where s is a geographic point and 𝑧(s) is its attribute value, in this case the
residual ANN_GDD50 from the OLS model. Each pair of points is separated
by a vector h, generally computed as the Euclidean distance between the
points:
\mathbf{h} = \| \mathbf{x}_i, \mathbf{x}_j \| = \left( \sum_{k=1}^{n} (s_{i,k} - s_{j,k})^2 \right)^{1/2} \qquad (4)
where n is the number of dimensions (in this example, 2). With N observation points there are (N · (N − 1))/2 point-pairs that can be compared this way; in our example this is (305 · 304)/2 = 46 360; clearly we need some way to summarize this.
The model of spatial dependence assumes 2nd-order stationarity, i.e., the
semivariance does not depend on the absolute location. Therefore an em-
pirical variogram averages the individual semivariances in some separation
range called a “bin”:
\gamma(\mathbf{h}) = \frac{1}{2 N_{\mathbf{h}}} \sum_{i=1}^{N_{\mathbf{h}}} \left[ z(\mathbf{s}_i) - z(\mathbf{s}_i + \mathbf{h}) \right]^2 \qquad (5)
where h is a lag vector, i.e., a range of separations.
The analyst chooses the bin widths: wide enough to have enough point-pairs
(>≈ 150) for reliable estimation, narrow enough to reveal the fine structure
of spatial dependence.
Then compute and display the variogram. We use a cutoff of 100 km and
a bin size of 16 km, to have enough points in the closest bin, and to avoid
very local effects.
v.r.ols <- variogram(ols.resid ~ 1, locations=ne.m,
cutoff=100000, width=16000)
plot(v.r.ols, pl=T)
[Figure: empirical variogram of the OLS residuals, with the number of point-pairs in each bin]
Q8 : What is the range of the spatial correlation? That is, the maximum
separation distance at which there is lower semivariance than the maximum
(total sill)? Jump to A8 •
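The fitted model object vmf.r.ols used below can be obtained with the vgm and fit.variogram functions of gstat; a minimal sketch, in which the exponential form and zero initial nugget follow the text, but the initial partial sill and range are illustrative guesses:

# fit an exponential variogram model to the empirical variogram of the OLS residuals;
# initial partial sill and range are eyeball estimates from the empirical variogram
vmf.r.ols <- fit.variogram(v.r.ols,
                           model=vgm(psill=40000, model="Exp",
                                     range=12000, nugget=0))
print(vmf.r.ols)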
plot(v.r.ols, pl=T, model=vmf.r.ols)
[Figure: empirical variogram of the OLS residuals with the fitted exponential model]
The effective range of the fitted model is 36.3 km, that is, pairs of points
within this range partially duplicate each other’s information.
This successful model fit shows that there is spatial correlation among the
residuals, so the OLS fit of §4 is not optimal.
Task 21 : Examine the short-range behaviour with the variogram cloud (all
point-pairs): •
The cloud argument to the variogram function produces a variogram cloud,
i.e., a variogram that shows each point-pair’s semivariance vs. separation,
rather than summarizing in (somewhat arbitrary) bins. This allow us to
identify pairs of points with unusually high or low semivariances for their
separation.
vc <- variogram(ols.resid ~ 1, locations=ne.m, cutoff=12000, cloud=T)
plot(vc, pch=20, cex=2)
[Figure: variogram cloud of the OLS residuals, separations up to 12 km]
Note: The order of the points is arbitrary, since the distance between them
does not depend on which point is the origin and which the destination.
We then look for the highest semivariances at shortest distances (< 8 km).
We use the order function to sort the indices, here in order of increasing separation.
vc.df <- as.data.frame(vc)
vc.close <- subset(vc.df, vc.df$dist < 8000)
## sort by separation, look for anomalies.
vc.close[order(vc.close$dist),c("dist","gamma","left","right")]
This list shows all the point-pairs separated by < 8 km; see field dist. They have semivariances ranging from about 9 000 to 55 000 GDD50², except for the pair of points 106 and 107, which have a very large semivariance, 9.9044 × 10⁴. This shows that the two have quite different residuals from the linear model fit. These two are separated by only about 4.5 km but are quite dissimilar. This is quite an anomaly; let's see the details for these two stations:
print(ne.m[c(106,107),c("STATE","STATION_NA","LATITUDE_D","LONGITUDE_",
"ELEVATION_", "ANN_GDD50","ols.resid")])
Figure 9: Little Falls NY weather station locations. Source: USGS 15 Minute Series, Little Falls NY quadrangle, 1900. Available from http://nationalmap.gov/historical/
These two stations are both near Little Falls (NY), one (3081) on Mill Street
along the Mohawk River and one (3080) on Reservoir Road north of the
village; see Figure 9.
Checking the topographic map, the elevations are correct; the discrepancy
is because the Mill Street station is in a narrow river valley that is much
warmer than predicted by the model, hence the large positive residual (actual
- predicted). The Reservoir Road station is somewhat over-predicted.
So, the data seems to be correct; the lesson is that close-by stations can have
quite different micro-climates, and we have no factor in the model to account
for this. Local interpolators such as kriging will also fail in this situation.
The key difference here is that in the linear model fit by OLS, the residuals ε are assumed to be independently and identically distributed with the same variance σ²:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \qquad (6)

whereas now the residuals are themselves considered to be a random variable η that has a covariance structure:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\eta}, \qquad \boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}, \mathbf{V}) \qquad (7)
In modelling terminology, the coefficients 𝛽 are called fixed effects, because
their effect on the response variable is fixed once the parameters are known.
By contrast the covariance parameters 𝜂 are called random effects, because
their effect on the response variable is stochastic, depending on a random
variable with these parameters.
Models with the form of Equation 7 are called mixed models: some effects
are fixed (here, the relation between elevation or Northing and the GDD50)
and others are random (here, the error variances) but follow a known struc-
ture; these models have many applications and are extensively discussed
in Pinheiro and Bates [19]. Here the random effect 𝜂 represents both the
spatial structure of the residuals from the fixed-effects model, and the un-
explainable (short-range) noise. This latter corresponds to the noise σ² of
the linear model of Equation 6.
To solve Equation 10 we first need to compute V, i.e., estimate the variance
parameters θ = [σ², s, a], use these to compute C with equation 11 and from
this V, after which we can use equation 10 to estimate the fixed effects 𝛽.
But 𝜃 is estimated from the residuals of the fixed-effects regression, which
has not yet been computed. How can this “chicken-and-egg”13 computation
be solved?
The answer is to use residual (sometimes called “restricted”) maximum like-
lihood (REML) to maximize the likelihood of the random effects 𝜃 indepen-
dently of the fixed effects 𝛽.
Here we fit the fixed effects (regression coefficients) at the same time as we
estimate the spatial correlation.
Lark and Cullis [13, Eq. 12] show that the likelihood of the parameters in
Equation 6 can be expanded to include the spatial dependence implicit in
the variance-covariance matrix V, rather than a single residual variance 𝜎 2 .
The log-likelihood is then:
\ell(\boldsymbol{\beta}, \boldsymbol{\theta} \mid \mathbf{y}) = c - \frac{1}{2} \log |\mathbf{V}| - \frac{1}{2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T \mathbf{V}^{-1} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \qquad (12)
where 𝑐 is a constant (and so does not vary with the parameters) and V
is built from the variance parameters 𝜃 and the distances between the ob-
servations. By assuming second-order stationarity, the structure can be summarized by the covariance parameters θ = [σ², s, a], i.e., the total sill, nugget proportion, and range.
However, maximizing this likelihood for the random-effects covariance pa-
rameters 𝜃 also requires maximizing in terms of the fixed-effects regression
parameters 𝛽, which in this context are called nuisance parameters since
at this point we don’t care about their values; we will compute them after
determining the covariance structure.
Both the covariance parameters and the nuisance parameters β must, it seems, be estimated at the same time (the “chicken and egg” problem), but in fact the technique
13 from the question “which came first, the chicken or the egg?”
14 that is, the covariance structure is the same over the entire field, and only depends on
the distance between pairs of points
of REML can be used to first estimate 𝜃 without having to know the nuisance
parameters. Then we can use these to compute C with equation 11 and from
this V, after which we can use equation 10 to estimate the fixed effects 𝛽.
The maximum likelihood estimate of 𝜃 is thus called “restricted”, because
it only estimates the covariance parameters (random effects). Conceptu-
ally, REML estimation of the covariance parameters 𝜃 is ML estimation of
both these and the nuisance parameters β, with the latter integrated out [19, §2.2.5]:

\ell(\boldsymbol{\theta} \mid \mathbf{y}) = \int \ell(\boldsymbol{\beta}, \boldsymbol{\theta} \mid \mathbf{y}) \, d\boldsymbol{\beta} \qquad (13)
Pinheiro and Bates [19, §2.2.5] show how this is achieved, given a likelihood
function, by a change of variable to a statistic sufficient for 𝛽.
The computations are performed with the gls function of the nlme ‘Non-
linear mixed effects models’ package [1].
Task 22 : Set up and solve a GLS model, using the covariance structure
estimated from the variogram of the OLS residuals. •
The linear model formulation is the same as for lm. However:
• It has an additional argument correlation, which specifies the cor-
relation structure.
• This is built with various correlation models; we use corExp for expo-
nential spatial correlation, which is what we fit for the OLS residuals,
– The form names the dimensions, here 2D with the Easting and
Northing.
– We initialize the search for the correlation structure parameters
with the value argument, a list of the initial values. Here we only
specify the range.
– Our fitted variogram model for the OLS residuals showed a zero
nugget, so here the function should not fit a nugget. So we set
the nugget argument to FALSE.15
Note: For a list of the predefined model forms see ?corClasses. Users can
also define their own corStruct classes.
require(nlme)
vmf.r.ols[1:2,]
m.gls <- gls(model=ANN_GDD50 ~ sqrt(ELEVATION_) + N,
             data=ne.df,
             correlation=corExp(
                value=c(vmf.r.ols[2,"range"]),
                form=~E + N,
                nugget=FALSE))
These intervals seem fairly wide, indicating that the model is perhaps not sufficiently specified to capture all the reasons for variation in GDD50 over this area.
Task 26 : Compare with the correlation structure estimated from the OLS
residuals. •
intervals(m.gls, level=0.95)$corStruct
vmf.r.ols[2,"range"]
## [1] 12106.96
Q10 : How closely does the correlation structure fitted by GLS match that
estimated from the variogram of the OLS residuals? Jump to A10 •
[Figure: actual vs. fitted Annual GDD50 from the GLS model, coloured by state, with 1:1 line]
The fit clusters well around the 1:1 line (good accuracy) but is diffuse (low
precision).
As with the OLS model (§4.4), the GLS residuals may show spatial correla-
tion. We examine this with an empirical variogram and then fit a variogram
model.
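A minimal sketch of these steps, mirroring the treatment of the OLS residuals above (the cutoff, bin width, model form, and initial values are assumptions):

# store the GLS residuals in the spatial object
ne.m$gls.resid <- residuals(m.gls)
# empirical variogram of the GLS residuals, and a fitted exponential model
v.r.gls <- variogram(gls.resid ~ 1, locations=ne.m,
                     cutoff=100000, width=16000)
vmf.r.gls <- fit.variogram(v.r.gls,
                           model=vgm(psill=40000, model="Exp",
                                     range=12000, nugget=0))
plot(v.r.gls, pl=T, model=vmf.r.gls)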
[Figure: bubble plot of the GLS residuals (+/- residuals, GDD50F)]
Q11 : Does GLS remove the spatial correlation? Describe the spatial
correlation structure of the GLS residuals. Jump to A11 •
These residuals should show the spatial correlation discovered in the REML
fit.
[Figure: empirical variogram of the GLS residuals with fitted model, with point-pair counts per bin]
Q12 : How well does this model fit the empirical variogram of the GLS
residuals? Jump to A12 •
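The spherical model object vmf.r.ols.sph used below is a second fit to the empirical variogram of the OLS residuals, this time with a spherical model and a nugget; a minimal sketch, with illustrative initial values:

# fit a spherical model with a nugget to the OLS residual variogram
vmf.r.ols.sph <- fit.variogram(v.r.ols,
                               model=vgm(psill=28000, model="Sph",
                                         range=40000, nugget=14000))
print(vmf.r.ols.sph)
plot(v.r.ols, pl=T, model=vmf.r.ols.sph)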
[Figure: empirical variogram of the OLS residuals with a fitted spherical model, showing a substantial nugget]
This does not fit too badly, and clearly has a substantial nugget. Our ini-
tial estimates were close to the fitted values. This must be converted to a
proportional nugget, i.e., the proportion of the total sill represented by the
nugget.
(prop.nugget <- vmf.r.ols.sph[1,"psill"]/sum(vmf.r.ols.sph[,"psill"]))
## [1] 0.3527983
The nugget is about 35% of the total sill, with this model. We now use
this in gls, substituting corSpher to specify a spherical model of spatial
dependence.
To include a nugget variance, we must:
1. specify nugget argument as TRUE;
2. expand the value argument to a two-element list: (1) the starting
value of the range, as before; and (2) the starting value of the nugget.
m.gls.2 <- gls(model=ANN_GDD50 ~ sqrt(ELEVATION_) + N,
data=ne.df,
correlation=corSpher(
value=c(vmf.r.ols.sph[2,"range"], prop.nugget),
form=~E + N,
nugget=TRUE))
intervals(m.gls.2, level=0.95)$corStruct
The nugget proportion has been fit as 0.462, somewhat higher than our
original estimate 0.353.
Recall that the range parameter of the exponential model is 1/3 of the
effective range, so to compare them:
intervals(m.gls)$corStruct["range", 2]*3
## [1] 52370.98
intervals(m.gls.2)$corStruct["range", 2]
## [1] 301499.4
The effective range is much longer, about 300 km for the spherical model with
nugget, compared to about 50 km for the exponential model with no nugget.
This seems unrealistic when we compare it with the residual variogram. The
spherical model is thus not appropriate; we showed it here in order to explain
how to fit a proportional nugget, if the empirical residual variogram suggests
that there is a nugget variance.
We can also compare the regression coefficients:
intervals(m.gls, level=0.95)$coef[,2]
## (Intercept) sqrt(ELEVATION_) N
## 3.136371e+03 -3.004478e+01 -1.909907e-03
intervals(m.gls.2, level=0.95)$coef[,2]
## (Intercept) sqrt(ELEVATION_) N
## 3.209772e+03 -3.241043e+01 -1.658746e-03
The changed spatial correlation structure has considerably changed the re-
gression coefficients.
How much has accounting for the spatial correlation of the model residuals
affected the linear models?
Task 30 : Compare the coefficients of the GLS and OLS models. Compute
the relative change. •
coefficients(m.gls)
## (Intercept) sqrt(ELEVATION_) N
## 3.136371e+03 -3.004478e+01 -1.909907e-03
coefficients(m.ols)
## (Intercept) sqrt(ELEVATION_) N
## 3.150881e+03 -3.040452e+01 -1.952152e-03
round(100*(coefficients(m.gls) - coefficients(m.ols))/coefficients(m.ols),2)
## (Intercept) sqrt(ELEVATION_) N
## -0.46 -1.18 -2.16
Q13 : How much have the linear model coefficients changed from the OLS
to the GLS fit? What explains the change in coefficients? Jump to A13 •
Task 31 : Compute the difference between the GLS and OLS residuals, add
them to the spatial points, and display as a bubble plot. •
We use a different colour scheme to emphasize that this is the difference
between residuals, not the residuals themselves.
ne.m$diff.gls.ols.resid <- (ne.m$gls.resid - ne.m$ols.resid)
summary(ne.m$diff.gls.ols.resid)
[Figure: bubble plot of the difference between GLS and OLS residuals (GLS − OLS)]
Q14 : Where are the largest differences between the OLS and GLS residu-
als? Jump to A14
•
The coefficient for elevation was reduced by a smaller amount, and it is for
the square root. To visualize this effect, we can use a scatterplot of the
change in residuals vs. this marginal predictor.
[Figure: GLS − OLS residual vs. sqrt(Elevation), coloured by state, point size by annual GDD50F]
This shows clearly that GLS residuals are larger at the lower elevations,
and that the largest adjustments tend to be at the largest GDD50. Note
that there would be a confounding effect if the two predictors (elevation and
Northing) were not almost independent, as they are in this case.
So in combination, a southerly, low-lying station will have the largest positive
GLS-OLS residuals, i.e., GLS predicts higher than OLS; a northerly, high-
elevation station the largest negative residuals.
Task 33 : Display the station name, elevation, and coördinates of the most
positive and negative residuals. •
ix <- which.max(ne.m$diff.gls.ols.resid)
ne.df[ix,c("STATION_NA","STATE","ELEVATION_","N","E")]
ix <- which.min(ne.m$diff.gls.ols.resid)
ne.df[ix,c("STATION_NA","STATE","ELEVATION_","N","E")]
The largest negative residual (GLS reduced the prediction the most) is for
Mt. Mansfield (VT), the highest elevation station in the dataset, and near
the N limit. The largest positive residual (GLS increased the prediction
the most) is for Cape May (NJ), just above sea level and the southernmost
station.
Conclusion: accounting for spatial correlation in the residuals significantly
changed the linear model, resulting in differences up to 30 GDD50.
problem. For regional studies we want to predict and visualize over the
entire area; i.e., we want to produce a map of the target variable. For this
we need the predictors used in the models at a set of grid cells covering the
whole study area. Point predictions are then made at the centre of each grid
cell.
These have been prepared in the companion tutorial “Regional mapping of
climate variables from point samples: Data preparation” and loaded with
the points dataset in §3, above.
Task 34 : Predict over the grid with the OLS and GLS models, add the
results to the dataframe, and summarize them. •
dem.ne.m.df$pred.ols <- predict(m.ols, newdata=dem.ne.m.df)
dem.ne.m.df$pred.gls <- predict(m.gls, newdata=dem.ne.m.df)
summary(dem.ne.m.df[,-(1:3)])
We include an option .plot.limits, default NULL, to specify the limits of
the legend scale, to allow side-by-side comparison of maps.
Arguments:
Task 36 : Plot these on the same visual scale, over the entire bounding box.
•
The reason to visualize over the bounding box is to see the effect of extrap-
olation of a linear model beyond its calibration range, in this case Northing
and elevation.
We now call the function for both predictions. The argument names used
within the function are assigned to the names we give when calling the
function. So the function operates on the map. Note that we use the same
limits for the colour ramp, so the two maps can be directly compared.
ols.ix <- which(names(dem.ne.m.df)=="pred.ols")
gls.ix <- which(names(dem.ne.m.df)=="pred.gls")
# set up limits of the scale for the target variable
# use the extremes +/-10 GDD, to make sure that all values are shown
(gdd.pred.lim <- round(
c(min(dem.ne.m.df[,c(ols.ix, gls.ix)])-10, max(dem.ne.m.df[,c(ols.ix, gls.ix)])+10)))
[Figure: Annual GDD, base 50F, OLS prediction over the bounding box]
[Figure: Annual GDD, base 50F, GLS prediction over the bounding box, on the same colour scale]
Task 37 : Compute the differences between the OLS and GLS predictions,
add them to the data frame, and display them. •
summary(dem.ne.m.df$diff.gls.ols <-
dem.ne.m.df$pred.gls - dem.ne.m.df$pred.ols)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -31.2889 -15.3346 -4.9698 -7.6704 0.2453 16.0963
[Figure: map of the difference between the GLS and OLS predictions (+/- GDD50)]
The GLS model predicts slightly higher to the north and at higher elevations.
print(vmf.r.gls)
Task 38 : Summarize the GLS trend surface residuals with a histogram and
numerical summary. •
summary(ne.m$gls.resid)
hist(ne.m$gls.resid, freq=F,
xlab="GDD50",
main="GLS residuals")
rug(ne.m$gls.resid)
16 ‘Kriging’ is named for the South African mining geostatistician Danie Krige (1919–2013), who developed the method in the 1950's for estimating gold reserves. His empirical method was formalized in the 1960's by Georges Matheron (1930–2000), working at the École des Mines in France. The theory had been previously developed in the 1930's by Andrey Kolmogorov (1903–1987) but was not practical until digital computers had been developed.
[Figure: histogram of the GLS residuals (GDD50), with density scale and rug]
Notice that the mean residual is not zero. This is because GLS trades
unbiasedness for precision of the trend coefficients.
Now that we have (1) the known points; (2) a fitted authorized variogram
model, we can predict at any location. The following optional section ex-
plains the mathematics of the OK system.
z_0 = \sum_{i=1}^{n} \lambda_i z_i \qquad (14)
Note: Always remember, this “best” is with reference to the fitted variogram
model. And there is no way to objectively know if that model is correct. So
whether OK is “best” in the real world is not provable.
\mathbf{A} \boldsymbol{\lambda} = \mathbf{b} \qquad (15)

where:

\boldsymbol{\lambda} = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_N \\ \psi \end{bmatrix}
\qquad
\mathbf{b} = \begin{bmatrix} \gamma(\mathbf{x}_1, \mathbf{x}_0) \\ \gamma(\mathbf{x}_2, \mathbf{x}_0) \\ \vdots \\ \gamma(\mathbf{x}_N, \mathbf{x}_0) \\ 1 \end{bmatrix}
In the A matrix the upper-left N × N block is the spatial correlation structure of the observations; these semivariances are derived from the fitted variogram model. The last row ensures unbiasedness in estimating the spatial mean. The right-hand column is used to find the Lagrange multiplier ψ that minimizes the variance. The kriging variance at a point is given by the scalar product of the weights (and multiplier) vector λ with the right-hand side of the kriging system:

\hat{\sigma}^2(\mathbf{x}_0) = \mathbf{b}^T \boldsymbol{\lambda} \qquad (16)
Task 39 : Predict the deviation from the trend surface at each location on
the grid, using Ordinary Kriging (OK) of the GLS residuals, and display its
summary. •
Note: We use OK instead of Simple Kriging (SK) because the spatial mean of the GLS residuals may not be zero; unlike in OLS, the non-spatial mean of the GLS residuals is not required to be zero.
There are several R packages that implement kriging. Here we use the krige function of the gstat package, which takes the fitted variogram model as its model argument.
The points on which to krige are centres of the grid cells, which we converted
to an sf geometry.
class(dem.ne.m.sf)
system.time(
ok.gls.resid <- krige(gls.resid ~ 1,
loc=ne.m, newdata=dem.ne.m.sf,
model=vmf.r.gls)
)
summary(ok.gls.resid)
hist(ok.gls.resid$var1.pred,
main = "OK deviations from GLS trend surface", xlab = "GDD50")
abline(v=0, col="red")
rug(ok.gls.resid$var1.pred)
[Figure: histogram of the OK deviations from the GLS trend surface (GDD50), with zero marked in red]
Half of the adjustments are between about ±40 GDD50, so not very many;
however there are some quite large adjustments at the extremes. The kriging
prediction variance has a very low value at a grid cell centre that must be
close to an observation, but otherwise is fairly large; the median prediction
standard deviation is 196 GDD50.
Task 40 : Add this residual and its prediction variance to the GLS trend
surface data frame. •
dem.ne.m.df$ok.gls.resid <- ok.gls.resid$var1.pred
dem.ne.m.df$ok.gls.resid.var <- ok.gls.resid$var1.var
[Figure: map of the OK-interpolated deviations from the GLS trend surface (GDD50)]
Q15 : Where are the largest adjustments to the GLS trend? Jump to
A15 •
The predictions at these points are the spatial mean of the GLS residuals
(not necessarily their arithmetic mean):
mean(dem.ne.m.df$ok.gls.resid) # spatial mean over the grid
## [1] -0.6000178
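The second value, shown next, is presumably the arithmetic (non-spatial) mean of the GLS residuals at the stations, which as noted above is not zero; a sketch of the call that would produce it, under that assumption:

mean(ne.m$gls.resid)   # arithmetic mean of the GLS residuals at the stations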
## [1] 8.495601
[Figure: map of the OK prediction standard deviation of the GLS residuals (GDD50)]
We have two predictions: the trend and the deviations from it. Adding these
will give a final prediction.
Task 43 : Add the kriged GLS residuals to the trend surfaces for a final
GLS-RK prediction. •
summary(dem.ne.m.df$pred.rkgls <-
dem.ne.m.df$pred.gls + dem.ne.m.df$ok.gls.resid)
[Figure: final GLS-RK prediction of annual GDD50 over the bounding box]
Q17 : How does this compare to the GLS trend surface map? Can you see
the local adjustments? Jump to A17 •
In the previous section we predicted over a bounding box covering the four
States. This predicted into areas where there were no observations, e.g.,
parts of CT, MA, NH, MD as well as Ontario. This is called extrapolation,
as opposed to interpolation. Of course, some areas in the four States are
outside the convex hull of the points, e.g., near the borders of these four
States adjacent to States not in the study area. But here, (1) observation
points are never far away, because (2) points were chosen to cover the area.
Only (1) is valid outside the study area, and only for a limited distance.
Q18 : For what areas in the bounding box, outside the four States, are you
confident that the prediction is as good as that inside? (Hint: for the OK
part, see the kriging prediction variance map.) Jump to A18 •
Also, the map user from these four States will expect a map showing only
these.
## class : SpatRaster
## dimensions : 186, 228, 13 (nrow, ncol, nlyr)
## resolution : 3450, 3704 (x, y)
## extent : -392899.3, 393700.7, -399006.4, 289937.6 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=aea +lat_0=42.5 +lon_0=-76 +lat_1=39 +lat_2=44 +x_0=0 +y_0=0 +ellps=WGS84 +units
## source(s) : memory
## names : ELEVATION_, dist.lakes, dist.coast, mrvbf, tri3, pop15, ...
## min values : 0.000, 0.0, 2.326731e+00, 0.000000, 0.000, 0.00000, ...
## max values : 4156.336, 648951.9, 7.087011e+05, 3.991617, 371.698, 3.88183, ...
[Figure: maps of the prediction-grid layers, including ELEVATION_, dist.lakes, dist.coast, mrvbf, tri3, pop15, pop2pt5, pred.ols, and pred.rkgls]
[Figure: RK-GLS prediction, annual GDD50F, over the four-State study area]
z_0 = \sum_{i=1}^{n} \lambda_i z_i \qquad (17)
The difference between KED and OK (§7.1) is that KED also includes co-
variates in the kriging system, so that the linear trend with covariates and
the local deviations at each prediction point are solved together to obtain
the weights 𝜆.
17 KED is mathematically equivalent to what is called Universal Kriging (UK); that term
is often reserved for KED when only coördinates are used as covariables.
8.1 * The Universal Kriging system
The weights λ_i are determined from the Universal Kriging system, which is derived by minimizing an expression for the prediction variance.
Weights are found by solving:
\mathbf{A}_U \boldsymbol{\lambda}_U = \mathbf{b}_U \qquad (18)

where

\mathbf{A}_U = \begin{bmatrix}
\gamma(\mathbf{x}_1, \mathbf{x}_1) & \cdots & \gamma(\mathbf{x}_1, \mathbf{x}_N) & 1 & f_1(\mathbf{x}_1) & \cdots & f_k(\mathbf{x}_1) \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
\gamma(\mathbf{x}_N, \mathbf{x}_1) & \cdots & \gamma(\mathbf{x}_N, \mathbf{x}_N) & 1 & f_1(\mathbf{x}_N) & \cdots & f_k(\mathbf{x}_N) \\
1 & \cdots & 1 & 0 & 0 & \cdots & 0 \\
f_1(\mathbf{x}_1) & \cdots & f_1(\mathbf{x}_N) & 0 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
f_k(\mathbf{x}_1) & \cdots & f_k(\mathbf{x}_N) & 0 & 0 & \cdots & 0
\end{bmatrix}

\boldsymbol{\lambda}_U = \begin{bmatrix} \lambda_1 \\ \vdots \\ \lambda_N \\ \psi_0 \\ \psi_1 \\ \vdots \\ \psi_k \end{bmatrix}
\qquad
\mathbf{b}_U = \begin{bmatrix} \gamma(\mathbf{x}_1, \mathbf{x}_0) \\ \vdots \\ \gamma(\mathbf{x}_N, \mathbf{x}_0) \\ 1 \\ f_1(\mathbf{x}_0) \\ \vdots \\ f_k(\mathbf{x}_0) \end{bmatrix}
The λ_U vector contains the N weights for the sample points and the k + 1 Lagrange multipliers (1 for the overall mean and k for the trend model). The b_U vector is structured like an additional column of A_U, but referring to the point to be predicted: it contains the semivariances of the prediction point vs. the known points, and the values of the trend functions at the prediction point. The kriging variance at a point is given by the scalar product of the weights (and multiplier) vector λ_U with the right-hand side of the kriging system:

\hat{\sigma}^2(\mathbf{x}_0) = \mathbf{b}_U^T \boldsymbol{\lambda}_U \qquad (19)
Good explanations of KED are from Webster and Oliver [24] and Goovaerts
[6]; in §12 we explain Ordinary Kriging (OK), where there is no trend, only
local interpolation.
KED uses the krige method of the gstat package directly with the residual
variogram, and so does not require a separate regression prediction step.
KED as implemented by krige uses GLS to compute the trend component, with a covariance structure specified by the analyst, generally obtained by fitting a variogram model to the residual variogram. This differs from gls, which computes the covariance structure by REML.
In gstat the residuals are estimated from a linear model fit, using the
variogram function with a formula for the trend. Since the covariance struc-
ture is not yet known, this must be by OLS.
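A minimal sketch of this step, producing the residual variogram model vmf.ked used below; the cutoff, bin width, and initial values are assumptions, mirroring the earlier OLS residual variogram:

# empirical variogram of the residuals from the (OLS) trend on the covariables
v.ked <- variogram(ANN_GDD50 ~ sqrt(ELEVATION_) + N, locations=ne.m,
                   cutoff=100000, width=16000)
# fit an exponential model; initial values are illustrative
vmf.ked <- fit.variogram(v.ked,
                         model=vgm(psill=40000, model="Exp",
                                   range=12000, nugget=0))
print(vmf.ked)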
[Figure: empirical variogram of the residuals from the covariable trend, shown without and with the fitted exponential model]
The effective range of the exponential model is three times the range parameter, so here about 36.3 km. This implies that there is local structure, not explained by the covariables, to this range.
With this covariance structure, we can now predict by KED, again specifying
the dependence of the target variable on the covariables.
Task 46 : Compute the KED prediction and its variance over the prediction
grid. •
We call krige with a model formula that shows the linear dependence on
covariables, exactly the same formula that we used to compute the empirical
variogram of the residuals in the previous step. These two formulas must be
identical.
k.ked <- krige(ANN_GDD50 ~ sqrt(ELEVATION_)+ N, locations=ne.m,
newdata=dem.ne.m.sf, model=vmf.ked)
summary(k.ked)
Note that the kriging prediction variance is also computed; this is known
because by definition kriging minimizes it.
[Figure: Annual GDD, base 50F, KED prediction over the bounding box]
The predictions will be slightly different from the RK-GLS predictions, because the spatial correlation of the residuals was estimated from the OLS trend surface, not from the GLS fit.
Task 48 : Compute and display the differences between RK-GLS and KED
over the grid. •
summary(dem.ne.m.df$diff.rkgls.ked <-
dem.ne.m.df$pred.rkgls - dem.ne.m.df$pred.ked)
Half of the differences are quite small, under about 15 GDD50. There are
a few larger differences, but all less than 100 GDD50 (compare with the
sample mean, 2518).
Display the locations of the differences:
display.difference.map("diff.rkgls.ked",
"Difference annual GDD base 50F, RK-GLS - KED",
"+/- GDD50")
[Figure: difference map, annual GDD base 50F, RK-GLS − KED (+/- GDD50)]
The largest positive residuals (RK-GLS predicts higher) are along Lake Erie
and western Lake Ontario. The cooling lake effect which we saw in the OLS
residuals is increased in the GLS residuals, since the GLS trend predicts
somewhat lower than the OLS trend in this area. So the RK is higher here.
The largest negative residuals are in the Catskills and southern VT, where
the GLS trend predicted somewhat higher than the OLS trend.
[Figure: KED prediction [N, sqrt(ELEVATION)], annual GDD50F, over the four-State study area]
Task 50 : Compute and summarize the LOOCV for this KED prediction.
•
The krige.cv function of the gstat package computes this:
kcv.ked <- krige.cv(ANN_GDD50 ~ sqrt(ELEVATION_)+ N, locations=ne.m, model=vmf.ked)
summary(kcv.ked$residual)
Overall the results are fairly good, but there are some large prediction errors
at both extremes. An overall measure is the root of the mean squared error,
RMSE:
(loocv.ked.rmse <- sqrt(sum(kcv.ked$residual^2)/length(kcv.ked$residual)))
## [1] 199.4083
[Figure: bubble plot of the KED LOOCV residuals (+/- GDD50; over- vs. under-prediction)]
There are several regions with intermixed fairly large under- and over-
predictions; this means that in these regions there are local factors not
accounted for. Other regions are consistently over- or under-predicted (Ver-
mont mountains, Lake Ontario plain, respectively).
An advantage of KED over GLS-RK is that KED can be applied in some local
neighbourhood, so that the relation with the covariables (here, Northing,
Easting, square root of elevation) is re-fit at each prediction point. Besides
the obvious computational advantage (fewer points → less computation),
this allows a varying effect of the covariates over space. In our example,
it may be that the effect on GDD50 of +100 km Northing may be more
towards the south of the region than the north, or vice versa; or it may be
a smaller effect in a north-south trending large valley such as the Hudson
or Lake Champlain. The effect of elevation may be more, or less, in the
Adirondacks in northern New York compared to the Allegheny Plateau in
Pennsylvania.
Note: The linear model is not re-solved at each point; the UK system (§8.1) implicitly includes these in the solution. The A_U matrix includes the covariance between the neighbourhood points, as well as their values of the covariates, and the b_U vector includes the covariance between the neighbourhood points and the prediction point, as well as the prediction point's covariate values.
Whether the kriging is global or local, the b_U vector must be computed at each prediction point. For local kriging, the full A_U matrix of all the observation points can be rapidly cut down to the set of local points closest to the prediction point.
Task 52 : Recompute the KED prediction over the study area of §8.4 with
a local neighbourhood. •
The krige function has two optional arguments that can be used to implement this:
1. nmax “maximum number of neighbours to use”;
2. maxdist “maximum distance to a point to use”; this can be used along with nmin “minimum number of neighbours to use” to ensure that no predictions are made with too few points. This will produce NA “not available” values at prediction points that do not have the minimum number of observation points within the maximum distance.
We prefer nmax because here the points are well-distributed, and we can use
the kriging prediction variance to find areas that are too poorly-predicted.
The obvious question is how to determine this number. One way is to
try different numbers and compare their cross-validation statistics (see the
Challenge at the end of this §). Here we choose to use 20% of the points,
i.e., 305/5 ≈ 61 to show how the method works.
ked.n.max <- 61 # change this to try other numbers of neighbours
k.ked.nn <- krige(ANN_GDD50 ~ sqrt(ELEVATION_)+ N, locations=ne.m,
newdata=dem.ne.m.sf, model=vmf.ked,
nmax=ked.n.max)
summary(k.ked.nn)
## var1.pred var1.var geometry
## Min. : 906.4 Min. : 3829 POINT :40052
## 1st Qu.:1963.6 1st Qu.:33697 epsg:NA : 0
## Median :2379.5 Median :38411 +proj=aea ...: 0
## Mean :2468.2 Mean :37254
## 3rd Qu.:2859.3 3rd Qu.:42122
## Max. :3947.8 Max. :69912
[Figure: KED prediction of annual GDD50 with a 61-nearest-neighbour local neighbourhood]
display.difference.map("diff.ked",
paste("Difference annual GDD base 50F, KED global - KED",
ked.n.max,"nearest neighbours"),
"+/- GDD50")
[Figure: difference map, annual GDD base 50F, KED global − KED with 61 nearest neighbours (+/- GDD50)]
summary(kcv.ked$residual)
## [1] 191.7587
## [1] 199.4083
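The two values just printed appear to be the LOOCV root mean squared errors for the local and global KED, respectively; a minimal sketch of how the local cross-validation and its RMSE could be computed (the object name kcv.ked.nn matches its use in the bubble plot below):

# LOOCV for the local-neighbourhood KED, using the same neighbourhood size
kcv.ked.nn <- krige.cv(ANN_GDD50 ~ sqrt(ELEVATION_) + N, locations=ne.m,
                       model=vmf.ked, nmax=ked.n.max)
(loocv.ked.nn.rmse <- sqrt(sum(kcv.ked.nn$residual^2)/length(kcv.ked.nn$residual)))
loocv.ked.rmse   # the global KED LOOCV RMSE, computed earlier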
Q21 : Which KED, global or local, gives the best cross-validation results?
What can you conclude about the strength of regional to local effects on
GDD50? Jump to A21 •
Task 56 : Display a bubble plot of the difference between the global and
local cross-validation residuals. •
summary(kcv.ked$diff <- kcv.ked$residual - kcv.ked.nn$residual)
[Figure: bubble plot of the difference between the global and local KED cross-validation residuals]
Q22 : Where are the largest differences? Does this seem to be geographically-
consistent? Can you explain? Jump to A22
•
Challenge: Since the trend surface is now fit locally, it may be that Easting
is significant in some parts of the region. Refit the global residual variogram
with this included in the linear model, and use it in the kriging prediction
formula. How much and where does this change the prediction?
is the equivalent of using the lm function, but is more convenient because it
directly produces a gridded data structure, as in kriging.
k.ok <- krige(ANN_GDD50 ~ sqrt(ELEVATION_)+ N, locations=ne.m,
newdata=dem.ne.m.sf, model=NULL)
We then krige the residuals from the OLS trend of the known points; we
computed those in §4.3, and computed their variogram in §4.4.
k.okr <- krige(ols.resid ~ 1, locations=ne.m,
newdata=dem.ne.m.sf, model=vmf.r.ols)
We then add these together to get a final prediction of the trend and the
local deviations from it, which is what KED does in one step:
k.ok$rk.pred <- k.ok$var1.pred + k.okr$var1.pred
Finally, compare this to the KED prediction, and compute their differences:
k.ok$diff.pred <- k.ok$rk.pred - k.ked$var1.pred
summary(k.ok$rk.pred); summary(k.ked$var1.pred)
summary(k.ok$diff.pred)
[Figure: difference map, naive RK vs. KED surface, GDD base 50F (+/- GDD50)]
We also saw differences between GLS and OLS for the calibration points,
in §5.5 and for the prediction grid in §6; here the differences are smaller
because we also kriged the residuals from the two trend surfaces.
So this shows that KED as implemented by krige does use GLS, not OLS,
to compute the trend component. The difference with RK/GLS is that the
covariance structure is based on the OLS trend, not fit at the same time as
the trend surface coefficients.
and can not be extrapolated beyond the range of calibration. A further
disadvantage is that the choice of function is arbitrary; it is generally some
smooth function of the predictor, with the degree of smoothness determined
by cross-validation.
Note: The loess function has an span argument, which controls the degree
of smoothing by setting the neighbourhood for the local fit as a proportion
of the number of points. The default span=0.75 thus uses the 3/4 of the
total points closest to each point. These are then weighted so that closer points
have more weight; see ?loess for details. The default works well in most
situations, and here we only want a visual impression, not a “best fit” in a
statistical sense.
g1 <- ggplot(ne.df, aes(x=E, y=ANN_GDD50)) +
geom_point() +
geom_smooth(method="loess")
g2 <- ggplot(ne.df, aes(x=N, y=ANN_GDD50)) +
geom_point() +
geom_smooth(method="loess")
g3 <- ggplot(ne.df, aes(x=ELEVATION_, y=ANN_GDD50)) +
geom_point() +
geom_smooth(method="loess")
g4 <- ggplot(ne.df, aes(x=sqrt(ELEVATION_), y=ANN_GDD50)) +
geom_point() +
geom_smooth(method="loess")
require(gridExtra)
grid.arrange(g1, g2, g3, g4, ncol = 2)
[Figure: scatterplots of ANN_GDD50 vs. E, N, ELEVATION_ and sqrt(ELEVATION_), each with a loess smooth]
Jump to A23 •
GAM can be fit in R with the gam function of the mgcv “Mixed GAM Com-
putation Vehicle” package. This specifies the model with a formula, as with
lm, but terms can now be arbitrary functions of predictor variables, not just
the variables themselves or simple transformations that apply to the whole
range of the variable, e.g. sqrt or log. Smooth functions of one or more
variables are specified with the s function of the mgcv package.
Task 59 : Fit a GAM to the annual GDD50 at the observation stations, with
the predictors being a two-dimensional thin-plate spline of the coördinates
and a one-dimensional penalized regression spline of the elevation. •
m.g.xy <- gam(ANN_GDD50 ~ s(E, N) + s(ELEVATION_), data=ne.df)
summary(m.g.xy)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ANN_GDD50 ~ s(E, N) + s(ELEVATION_)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2517.518 9.986 252.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(E,N) 23.529 27.300 37.8 <2e-16 ***
## s(ELEVATION_) 8.521 8.922 51.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.908 Deviance explained = 91.7%
## GCV = 34111 Scale est. = 30415 n = 305
summary(residuals(m.g.xy))
Q24 : How well does this model fit the calibration observations? Jump to
A24 •
Task 60 : Plot the residuals as a bubble plot, and examine their spatial
structure with a variogram. •
ne.m$resid.m.g.xy <- residuals(m.g.xy)
bubble.sf("ne.m", "resid.m.g.xy", "GDD50", "Residuals from GAM")
[Figure: bubble plot of residuals from the GAM (legend: +/− GDD50; −, overprediction; +, underprediction)]
[Figure: empirical variogram of the GAM residuals (semivariance vs. distance, with point-pair counts)]
Q25 : Does there appear to be any local spatial correlation of the residuals?
Does the empirical variogram support your conclusion? Jump to A25 •
The plot.gam function of the mgcv package displays the marginal smooth
fit. For the 2D surface (model term s(E,N)), this is shown as a wireframe plot
if the optional scheme argument is set to 1. The select argument selects
which model term to display. We orient it to see lowest GDD towards viewer,
using the theta argument:
plot.gam(m.g.xy, rug=T, se=T, select=1,
scheme=1, theta=30+130, phi=30)
[Figure: wireframe plot of the fitted 2D smooth s(E,N)]
Q26 : Does the GAM 2D trend differ from a linear trend surface? Jump
to A26 •
This surface can also be shown with the vis.gam function of the mgcv pack-
age, also showing ± 1 standard error of fit:
vis.gam(m.g.xy, plot.type="persp", color="terrain",
theta=160, zlab="Annual GDD50", se=1.96)
[Figure: vis.gam perspective plot of the fitted Annual GDD50 surface, with ±1.96 standard errors]
[Figure: fitted 1D smooth s(ELEVATION_, 8.52) with confidence band]
Q27 : Does the fitted marginal relation with elevation appear to be linear?
Jump to A27 •
Notice the very large confidence interval at the high elevations – we have so
few points there that the smoother is quite uncertain in this range.
Task 61 : Compare the GAM model fits with the actual values. •
(rmse.gam <- sqrt(sum(residuals(m.g.xy)^2)/length(residuals(m.g.xy))))
## [1] 164.6781
[Figure: actual vs. GAM-fitted annual GDD50, coloured by state (NJ, NY, PA, VT), with 1:1 line]
The RMSE is 164.7; there are no observations that are particularly badly-fit.
Since we now have a model which uses the covariables known across the
prediction grid, we can use the model to predict.
Task 62 : Predict the annual GDD50, and the standard error of prediction,
across the prediction grid, using the fitted GAM, and display the predictions.
•
The predict.gam function predicts from a fitted GAM. The se.fit op-
tional argument specifies that the standard error of prediction should also
be computed.
tmp <- predict.gam(object=m.g.xy, newdata=dem.ne.m.df, se.fit=TRUE)
summary(tmp$fit)
summary(tmp$se.fit)
[Figure: Annual GDD, base 50F, GAM prediction (legend: GDD50)]
This map shows more detail than the OLS and GLS maps, especially in the
high elevations and along the Atlantic coast.
[Figure: Annual GDD base 50F, standard error of GAM prediction (legend: GDD50 s.e.)]
Consistent with the marginal plots, we see that the standard error is much
higher at the highest elevations, where there are few observations to support
the GAM.
An obvious question is where this map differs from the GLS and GLS-RK
maps.
There are some large differences. See where these are located:
display.difference.map("diff.gls.gam",
"Difference annual GDD base 50F, GLS - GAM",
"+/- GDD50")
[Figure: Difference annual GDD base 50F, GLS − GAM (legend: +/− GDD50)]
Q28 : Where are the largest differences between the GAM and GLS pre-
dictions? Jump to A28
•
There are some large differences. See where these are located:
ggplot() +
geom_point(aes(x=E, y=N, colour=diff.rkgls.gam), data=dem.ne.m.df) +
xlab("E") + ylab("N") + coord_fixed() +
ggtitle("Difference annual GDD base 50F, GLS/RK - GAM") +
scale_colour_distiller(name="GDD50", space="Lab", palette="Spectral")
[Figure: Difference annual GDD base 50F, GLS/RK − GAM (legend: GDD50)]
Q29 : Where are the largest differences between the GAM and GLS-RK
predictions? Jump to A29 •
10 Data-driven models
A data-driven model is an alternative to linear modelling. It makes no as-
sumptions about linearity; rather, it uses a set of regression trees. These
partition the feature space of predictors into a set of rectangles in the di-
mensions of the feature space, i.e., defined by limits of each predictor in
feature space. These rectangles each then have a simple prediction model,
in the simplest case just a constant, which is a single predicted value of the
response variable for all combinations of predictor variables in that feature-
space rectangle. The advantages of this approach are:
1. no assumption that the functional form is the same throughout the
range of the predictors;
2. over-fitting can be avoided by specifying large enough rectangles; their
optimum size can be calculated by cost-complexity pruning.
This is a high-variance, low-bias method. This means that different fits of the model with different subsets of the data may result in quite different fitted models (high variance), but the predictions will not be systematically different from the expected values (low bias).
A disadvantage of this approach is that, unlike linear models, it cannot extrapolate outside of its range of calibration, i.e., the multivariate feature-space limits of its predictors. But that might equally be considered an advantage18, because it avoids any assumptions about the relation between target and predictors outside the calibration space.
We start with the simplest data-driven approach: the regression tree. This replaces a regression equation, as developed in the previous section, with a decision tree based only on optimal splitting of the response variable by the predictors.
18 “Elk nadeel heeft zijn voordeel” (“every disadvantage has its advantage”) – Johan Cruijff

Task 67 : Compute a regression tree for the response GDD50, from the predictors N, E, and ELEVATION_. Note that there is no need to transform any predictor. •
The rpart function has several control options: (1) the minimum number
of observations which can be considered for a split (using the minsplit
argument); and (2) the minimum value of a complexity parameter (using
the cp argument). This corresponds to the improvement in R2 with each
split. A small complexity parameter (close to 0) grows a larger tree, which
may be over-fitting.
We set these to allow maximum splitting: split even if only two cases, using
the minsplit optional argument. Also specify a small complexity parameter
with the cp optional argument: keep splitting until there is less than 0.3%
improvement in (unadjusted) R2.
The model formulation is the same as for linear modelling: specify the pre-
dictand (dependent variable) on the left side of the ~ formula operator and
the predictors on the right side, separated by the + formula operator. Note
there is no interaction possible in tree models; the predictors are considered
separately when determining which to use for a split.
m.rt <- rpart(ANN_GDD50 ~ N + E + ELEVATION_,
data=ne.df,
minsplit=2,
cp=0.003)
## n= 305
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 305 100137700.0 2517.518
## 2) N>=-155967.2 183 27412290.0 2182.536
## 4) ELEVATION_>=1275 45 3699167.0 1778.200
## 8) N>=119836.3 12 921737.0 1493.500
## 16) ELEVATION_>=2945 1 0.0 795.000 *
## 17) ELEVATION_< 2945 11 389480.0 1557.000
## 34) ELEVATION_>=1355 10 68720.0 1503.000 *
## 35) ELEVATION_< 1355 1 0.0 2097.000 *
## 9) N< 119836.3 33 1451091.0 1881.727
## 18) E>=-237967.7 31 1059064.0 1854.065
## 36) ELEVATION_>=1721 11 354595.6 1715.818 *
## 37) ELEVATION_< 1721 20 378607.8 1930.100 *
## 19) E< -237967.7 2 612.5 2310.500 *
## 5) ELEVATION_< 1275 138 13957200.0 2314.384
## 10) N>=82678.87 40 2854040.0 2076.125
## 20) ELEVATION_>=520 19 568665.7 1878.263 *
## 21) ELEVATION_< 520 21 868542.6 2255.143
## 42) N>=142190.2 16 195405.8 2173.375 *
## 43) N< 142190.2 5 223838.8 2516.800 *
## 11) N< 82678.87 98 7905649.0 2411.633
## 22) ELEVATION_>=946 34 1936208.0 2197.941
## 44) N>=-36918.74 16 529437.0 2028.250 *
## 45) N< -36918.74 18 536519.1 2348.778 *
## 23) ELEVATION_< 946 64 3592056.0 2525.156
## 46) N>=-104679.8 46 1681030.0 2451.326
## 92) ELEVATION_>=115 44 1336795.0 2433.023
## 184) E>=-156122.8 34 803000.0 2383.382 *
## 185) E< -156122.8 10 165155.6 2601.800 *
## 93) ELEVATION_< 115 2 5202.0 2854.000 *
## 47) N< -104679.8 18 1019503.0 2713.833
## 94) E< 122469.2 11 451684.0 2587.000 *
## 95) E>=122469.2 7 112794.9 2913.143 *
## 3) N< -155967.2 122 21387870.0 3019.992
## 6) ELEVATION_>=395 56 6252615.0 2709.107
## 12) ELEVATION_>=1505 10 202742.9 2189.100 *
## 13) ELEVATION_< 1505 46 2757956.0 2822.152
## 26) N>=-231009.3 25 1153309.0 2706.320 *
## 27) N< -231009.3 21 869901.0 2960.048 *
## 7) ELEVATION_< 395 66 5130590.0 3283.773
## 14) N>=-252628.4 38 2433245.0 3148.526
## 28) ELEVATION_>=25 32 1337325.0 3088.969
## 56) ELEVATION_>=220 11 161948.7 2936.545 *
## 57) ELEVATION_< 220 21 785949.2 3168.810 *
## 29) ELEVATION_< 25 6 377040.8 3466.167 *
## 15) N< -252628.4 28 1058940.0 3467.321
## 30) ELEVATION_>=52.5 20 420733.0 3394.950 *
## 31) ELEVATION_< 52.5 8 271573.5 3648.250 *
require(rpart.plot)
rpart.plot(m.rt, digits=3, type=4, extra=1)
[Figure: rpart.plot of the full regression tree; root node mean 2518, n=305, first split on N at −156e+3]
Q30 : What is the first (root) splitting variable? At what value is the
split? What is the mean value of GDD50 of the whole dataset, and of the
two branches? How many observations in each branch? Jump to A30 •
Although there is no model with coefficients, we can still see which predictor
variables had the most influence on the tree.
This is the proportion of the variance explained by the tree which is due to
each variable.
x <- m.rt$variable.importance
data.frame(variableImportance = 100 * x / sum(x))
## variableImportance
## N 48.90337
## ELEVATION_ 38.60227
## E 12.49436
We now examine the reduction in fitting and cross-validation error with the
printcp “print the complexity parameter” function.
Task 71 : Print and plot the error rate vs. the complexity parameter and
tree size. •
printcp(m.rt)
##
## Regression tree:
## rpart(formula = ANN_GDD50 ~ N + E + ELEVATION_, data = ne.df,
## minsplit = 2, cp = 0.003)
##
## Variables actually used in tree construction:
## [1] E ELEVATION_ N
##
## Root node error: 100137734/305 = 328320
##
## n= 305
##
## CP nsplit rel error xerror xstd
## 1 0.5126697 0 1.000000 1.00622 0.070339
## 2 0.0999090 1 0.487330 0.51823 0.041222
## 3 0.0974250 2 0.387421 0.46287 0.039271
## 4 0.0328739 3 0.289996 0.31868 0.025099
## 5 0.0319311 4 0.257122 0.29848 0.023483
## 6 0.0237411 5 0.225191 0.28081 0.023420
## 7 0.0163615 6 0.201450 0.25288 0.020905
## 8 0.0141488 7 0.185089 0.25099 0.021131
## 9 0.0132452 8 0.170940 0.24478 0.020829
## 10 0.0089030 9 0.157695 0.24123 0.021066
## 11 0.0086905 10 0.148792 0.23276 0.019475
## 12 0.0073373 11 0.140101 0.23617 0.019555
## 13 0.0071789 12 0.132764 0.22297 0.018837
## 14 0.0053152 13 0.125585 0.21269 0.016959
## 15 0.0045440 14 0.120270 0.20473 0.016469
## 16 0.0044868 15 0.115726 0.20596 0.016475
## 17 0.0039088 16 0.111239 0.20617 0.016620
## 18 0.0038889 17 0.107330 0.20208 0.016268
## 19 0.0036613 18 0.103441 0.20385 0.016333
## 20 0.0035335 19 0.099780 0.20161 0.015913
## 21 0.0032541 21 0.092713 0.20369 0.016004
## 22 0.0032032 22 0.089459 0.19978 0.015816
## 23 0.0030000 23 0.086256 0.19658 0.015869
plotcp(m.rt)
[Figure: plotcp output: cross-validated relative error vs. complexity parameter (cp) and size of tree]
Note: Your results will likely be different. This is because the cross-validation
makes a random split of the full dataset into a number of subsets for model
building and evaluation. Each run gives a different random split.
The xerror field in the summary shows the cross-validation error; that is,
applying the model to the original data split 𝐾-fold, each time excluding
some observations. If the model is over-fitted, the cross-validation error
increases; note that the fitting error, given in the error field, always de-
creases. By default, the split is 10-fold; this can be modified by the control
argument to the rpart function.19
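For example, a minimal sketch (not part of the original exercise; the number of folds here is an arbitrary choice) of changing the number of cross-validation folds via rpart.control:
m.rt.cv5 <- rpart(ANN_GDD50 ~ N + E + ELEVATION_, data=ne.df,
                  control=rpart.control(minsplit=2, cp=0.003, xval=5))
printcp(m.rt.cv5)   # xerror is now based on 5-fold cross-validation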
Q32 : Does this model appear to be overfit? Why or why not? What ap-
pears to be the optimum complexity parameter to avoid over-fitting? Jump
to A32 •
Task 72 : Prune the tree back to complexity level estimated from the
previous answer. •
We do this with the prune function, specifying the cp “complexity parameter” argument.
19 See the help for rpart.control.
(m.rt.p <- prune(m.rt, cp=0.0045))
## n= 305
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 305 100137700.0 2517.518
## 2) N>=-155967.2 183 27412290.0 2182.536
## 4) ELEVATION_>=1275 45 3699167.0 1778.200
## 8) N>=119836.3 12 921737.0 1493.500
## 16) ELEVATION_>=2945 1 0.0 795.000 *
## 17) ELEVATION_< 2945 11 389480.0 1557.000 *
## 9) N< 119836.3 33 1451091.0 1881.727 *
## 5) ELEVATION_< 1275 138 13957200.0 2314.384
## 10) N>=82678.87 40 2854040.0 2076.125
## 20) ELEVATION_>=520 19 568665.7 1878.263 *
## 21) ELEVATION_< 520 21 868542.6 2255.143 *
## 11) N< 82678.87 98 7905649.0 2411.633
## 22) ELEVATION_>=946 34 1936208.0 2197.941
## 44) N>=-36918.74 16 529437.0 2028.250 *
## 45) N< -36918.74 18 536519.1 2348.778 *
## 23) ELEVATION_< 946 64 3592056.0 2525.156
## 46) N>=-104679.8 46 1681030.0 2451.326 *
## 47) N< -104679.8 18 1019503.0 2713.833
## 94) E< 122469.2 11 451684.0 2587.000 *
## 95) E>=122469.2 7 112794.9 2913.143 *
## 3) N< -155967.2 122 21387870.0 3019.992
## 6) ELEVATION_>=395 56 6252615.0 2709.107
## 12) ELEVATION_>=1505 10 202742.9 2189.100 *
## 13) ELEVATION_< 1505 46 2757956.0 2822.152
## 26) N>=-231009.3 25 1153309.0 2706.320 *
## 27) N< -231009.3 21 869901.0 2960.048 *
## 7) ELEVATION_< 395 66 5130590.0 3283.773
## 14) N>=-252628.4 38 2433245.0 3148.526
## 28) ELEVATION_>=25 32 1337325.0 3088.969 *
## 29) ELEVATION_< 25 6 377040.8 3466.167 *
## 15) N< -252628.4 28 1058940.0 3467.321 *
rpart.plot(m.rt.p, digits=3, type=4, extra=1)
[Figure: rpart.plot of the pruned regression tree, with 16 terminal nodes]
Q33 : How does this tree differ from the original regression tree? Jump
to A33 •
Task 74 : Use the pruned regression tree to predict at the calibration points.
•
We do this with the predict method applied to an rpart object; this automatically calls the function predict.rpart. The points to predict and the values of the predictor variables at those points are supplied in a dataframe as argument newdata. We count the number of predicted values with the unique function; there is only one value per “box” in the feature space defined by the predictor variables.
summary(p.rt.p <- predict(m.rt.p, newdata=ne.df))
## [1] 16
## [1] 11.16128
[Figure: distribution of the pruned-tree predictions of ANN_GDD50]
# actual vs. fitted values from the pruned regression tree
plot(ne.df$ANN_GDD50 ~ p.rt.p,
     xlab="Fitted by pruned regression tree", ylab="actual",
     pch=20, asp=1,
     xlim=c(500,4200), ylim=c(500,4200),
     col=ne.df$STATE,
     main="Annual GDD50")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid()
abline(0,1)
[Figure: actual vs. pruned-tree fitted annual GDD50, coloured by state (NJ, NY, PA, VT), with 1:1 line]
Q34 : How many unique values are predicted by the pruned regression
tree? How close is the fit to the actual values, compared to the OLS or GLS
models? Explain. Jump to A34 •
Task 78 : Predict over the grid with the regression tree model, add the
results to the dataframe, and summarize them. •
We also add the point set in the same colour scheme; if the point is visible
it means the residual (lack of fit) is large.
display.prediction.map("pred.rt",
"Annual GDD, base 50F, regression tree prediction",
"GDD50")
[Figure: Annual GDD, base 50F, regression tree prediction (legend: GDD50)]
Q35 : What is the spatial pattern of the regression tree prediction? Explain
why. Jump to A35 •
10.2 Random forests
A Random Forest (RF) builds on the single regression tree in several steps:
1. Build many regression trees, each from a bootstrap sample of the observations, i.e., a sample of the same size as the original, drawn with replacement.
2. Save all these trees; when predicting, use all of them and average their
predictions.
3. In addition we can summarize the whole set of trees to see how different
they are, thus how robust is the final model.
4. Also, for each tree we can use observations that were not used to
construct it for true validation, called out-of-bag validation. This gives
a good idea of the true prediction error.
The first step may seem suspicious. The underlying idea is that what we
observe is the best sample we have of reality; if we sampled again we’d expect
to get similar values. So we simulate re-sampling by sampling from this same
set, with replacement, to get a sample of the same size but a different sample.
If this idea bothers you, read Efron and Gong [4], Shalizi [22] or Hastie et al.
[7, §8.2].
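To make the idea concrete, here is a minimal sketch (not part of the original exercise; the seed is an arbitrary choice) of one such bootstrap resample of the station data:
set.seed(42)                      # arbitrary, for reproducibility
ix.boot <- sample(nrow(ne.df), replace=TRUE)
length(unique(ix.boot))           # roughly 2/3 of the stations appear at least once
ne.boot <- ne.df[ix.boot, ]       # same size as ne.df, some stations repeated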
So, how well does the random forest method work for a spatially-distributed
variable?
We know that by definition a linear model, whether OLS or GLS fit, will vary
smoothly with the predictors. That is, if these predictors vary smoothly in
space the map of the linear model predictions will also look smooth. In this
example the predictor Northing is by definition smooth, and the predictor
elevation often changes smoothly, so we can expect a smooth prediction map.
However, random forests are not linear. They are based on an ensemble of regression trees, with no requirement for smoothness across the cut points.
Another attractive feature of random forests is that there is no need for pre-
dictor variable selection to avoid colinearity. Since the predictors to compare
are randomly chosen at each split, it is possible for “minor” predictors which
would not be included in a parametric approach to contribute to some of
the trees, and thus to the ensemble prediction. In this example Easting was
not used in the regression models, but should be used here. The relative
importance of the predictors is reported by the RF function.
Task 80 : Fit a random forest model of GDD50 based on all three possible
covariates: elevation, Northing and Easting. •
There are several R packages that implement random forests. A very fast
implementation, widely used, is provided by the ranger package [29].
require(ranger)
The ranger function of the ranger package fits this model. This model
requires two parameters:
1. the number of trees in the forest, optional argument num.trees, de-
fault value 500;
2. the number of predictors to compare at each split, optional argument
mtry, default value is 1/3 of the predictors.
Here we accept the default for the number of predictors, which in this case will be one of the three predictors, but require more than the default number of trees.
We also specify the optional importance argument as "permutation" to see
two measures of predictor importance, as explained below.
Since this does not assume linearity we can use the untransformed station
elevation.
m.rf <- ranger(ANN_GDD50 ~ ELEVATION_ + N + E,
data=ne.df, num.trees=1200,
importance="permutation")
# proportional importance
ranger::importance(m.rf)/sum(ranger::importance(m.rf))*100
## ELEVATION_ N E
## 45.41475 43.63581 10.94944
ranger::importance(m.rf)/dim(ne.df)[1]
## ELEVATION_ N E
## 724.1184 695.7541 174.5840
Task 81 : Plot the actual vs. fitted values from the random forest model.
Compute the mean error (bias, ME) and the root mean squared error (RMSE).
•
The predict function gives predicted values based on a model. To pre-
dict back at the calibration points, the newdata argument to the predict
function must specify this point set.
Note: Each run of the random forest will produce different fits, so your
graph may not look exactly like this one.
summary(rf.fits <- predict(m.rf, data = ne.df)$predictions)
plot(ne.m$ANN_GDD50 ~ rf.fits,
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by random forest",
ylab="Actual",
main="Annual GDD50")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid()
abline(0,1)
[Figure: actual vs. random-forest-fitted annual GDD50, coloured by state (NJ, NY, PA, VT), with 1:1 line]
## [1] -0.8869138
## [1] 5.67573
The fits are very close. However, this is not a good measure of the prediction
accuracy. For that we use out-of-bag (OOB) cross-validation.
OOB predictions are automatically computed with the ranger function, at
the same time it builds the forest. Each point is predicted as the average
of the RF predictions for those regression trees where that point was not
included in the “bag”.
Task 82 : Plot the actual vs. out-of-bag validation values from the random
forest model. Compute the mean error (bias, ME) and the root mean squared
error (RMSE). •
summary(rf.oob <- m.rf$predictions)
plot(ne.m$ANN_GDD50 ~ rf.oob,
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by random forest (OOB)",
ylab="Actual",
main="Annual GDD50")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid()
abline(0,1)
[Figure: actual vs. random forest out-of-bag (OOB) predictions of annual GDD50, coloured by state, with 1:1 line]
## [1] -2.488285
## [1] 11.59619
Task 83 : Compare the ME and RMSE for the model fits and out-of-bag
residuals. •
(rf.oob.me/rf.me)
## [1] 2.805555
(rf.oob.rmse/rf.rmse)
## [1] 2.043119
Q37 : How does the OOB RMSE compare to the RMSE from the model fits
at the training points? Which is more realistic as a measure of prediction
accuracy? Jump to A37 •
Task 84 : Extract the RF model residuals, summarize them, and show them
as a bubble plot. •
The model residuals are computed from the fits at each point and the actual
values; for this we use the fits, not the out-of-bag estimates.
ne.m$rf.resid <- (ne.df$ANN_GDD50 - rf.fits)
summary(ne.m$rf.resid)
[Figure: bubble plot of random forest residuals at the stations (legend: +/− GDD50; −, overprediction; +, underprediction)]
The residuals are fairly similar in range to those of the linear model. There are some very poorly-fit points. Notice that the mean residual is not zero, as it would be in a least-squares fit of the linear model.
[Figure: bubble plot of Random Forest OOB residuals, actual − predicted (legend: +/− GDD50; −, overprediction; +, underprediction)]
As seen in the 1:1 plots and in the bubble plot legends, the out-of-bag residuals are much larger: the mean ratio of the two is 1.44.
Task 86 : List the eight worst-fit points, sorted by their absolute residuals,
along with the residual from the GLS fit linear model. •
(ix <- order(abs(ne.m$rf.resid.oob), decreasing=TRUE)[1:8])
ne.m[ix,c("STATE","STATION_NA","ELEVATION_",
"gls.resid", "rf.resid")]
## 3842 POINT (49632.7 -297715.7)
## 3089 POINT (129349.2 -113350.6)
## 3836 POINT (-102705.5 -177146.3)
## 3120 POINT (130736.3 -52170.55)
## 3888 POINT (22581.52 -130037)
names(ne.m)
The RF almost always comes closer than the GLS regression to these worst-fit points, because it only uses fairly similar points, in terms of the predictors (N, E, elevation), to predict, and does not try to fit a regional trend, which must consider all the calibration points.
However, the lowest actual GDD value (Mt. Mansfield in VT) is badly under-predicted by the random forest model. This is because it is so unlike any other point, since its elevation is so much higher than the others. In the linear model this point has strong leverage (highest elevation by far, well to the North) and is thus closely fit. The other poorly-predicted stations are also “unusual” in their covariate-space neighbourhood.
[Figure: empirical variogram of the random forest residuals (semivariance vs. distance, with point-pair counts)]
Q38 : Do the residuals have spatial structure? This depends again on each
run of the model; your results will look different from the ones presented
here. Jump to A38 •
summary(ne.m$rf.resid)
summary(ne.m$gls.resid)
sd(ne.m$ols.resid)
## [1] 210.8974
sd(ne.m$rf.resid)
## [1] 99.28129
sd(ne.m$gls.resid)
## [1] 211.0888
The statistics are similar; there is no clear “winner”. Note that the GLS and RF models are not unbiased – the mean residual is not zero.
We can use the RF model to predict over the study area, as we did in §6 for
the OLS and GLS models.
Task 89 : Predict over the grid with the RF model, add the results to the
dataframe, and summarize them. •
dem.ne.m.df$pred.rf <- predict(m.rf, data=dem.ne.m.df)$prediction
summary(dem.ne.m.df$pred.rf)
[Figure: Annual GDD, base 50F, random forest prediction (legend: GDD50)]
Q39 : What is the difference in the spatial pattern between the RF surface
and the RK-GLS surface? Jump to A39 •
Q40 : Why is there only one predicted value in most of Lake Erie, another
in the north-most part of Lake Erie, Lake Ontario, and the adjacent areas
of Ontario (Canada)? Jump to A40 •
[Figure: Annual GDD, base 50F, GLS − RF predictions (legend: +/− GDD50)]
Q41 : Why are the largest discrepancies in the east, especially in Connecti-
cut and Massachusetts, and in the northwest (Ontario)? Jump to A41
•
Q42 : Why is the border of PA with OH on the west visible in this difference
map? Jump to A42 •
Q43 : Why are there positive differences (GLS-RK greater than RF) in the
PA mountains and in the Catskills and Taconics in NY? Why does this not
occur in the Adirondacks? Jump to A43 •
10.3 Tuning data-driven models
Data-driven models have parameters that control their behaviour and can significantly affect their predictive power.
For example, regression trees (§10.1) can be adjusted by the minimum num-
ber of observations which can be considered for a split (using the minsplit
argument) and the minimum value of a complexity parameter (using the cp
argument).
Random forests (§10.2) can also be controlled by the minimum number of
observations in a terminal node (optional argument nodesize), as well as
the number of predictors to compare at each split (optional argument mtry).
In the randomForest function these have default values of 5 and 1/3 of the
number of predictors, respectively. These can have a large influence on the
resulting forest. Too small terminal nodes will result in over-fit trees, too
large in poorer fits. Too many predictors tested at each split will not allow
less powerful predictors into the forest; too few will result in many poorly-
fitted trees.
Note: The number of trees ntree also has an influence on the random forest
model. Too few trees will cause repeated model fits to be too variable, too
many wastes computing time. This parameter is not optimized as such, a
large value is used and the graph of out-of-bag RMSE vs. number of trees is
examined to select an appropriate value.
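A minimal sketch of that procedure (not part of the original exercise; here with ranger, which calls the parameter num.trees, and with arbitrary candidate values):
ntrees <- c(100, 250, 500, 1000, 2000)        # candidate forest sizes
oob.rmse <- sapply(ntrees, function(nt)
  sqrt(ranger(ANN_GDD50 ~ ELEVATION_ + N + E, data=ne.df,
              num.trees=nt)$prediction.error))  # OOB MSE -> RMSE
plot(ntrees, oob.rmse, type="b", xlab="number of trees", ylab="OOB RMSE")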
The caret package is highly flexible and can be used to optimize 239 kinds of data-driven models, including the ones we use in this tutorial. To see the list of possible models:
length(names(getModelInfo()))
## [1] 239
head(names(getModelInfo()),24)
Many more details are given in the caret package on-line book20 . The pack-
age is highly adaptable, and before applying it in a production environment,
carefully read the documentation. Here we just present a simple case.
The principal function of the caret package is train, which implements the
cross-validation procedure and reports the optimum combination of param-
eters. To use this, we have to set up the following arguments: the predictors (x) and response (y), the model type (method = "ranger"), a tuneGrid of candidate values for the ranger tuning parameters mtry, splitrule and min.node.size, and a trControl resampling specification.
For the meaning of each of these, see ?ranger and the explanations in the
journal article. Here we only have three predictors, so mtry can vary from 1
to 3. The splitting rule is set to "variance", the normal criterion for regres-
sion trees: the split is where the between-class variance is maximized and the
within-class variance is minimized. The minimum node size min.node.size
defaults to 5, we try smaller (more complex trees in the forest) and larger
(less complex) values.
20 http://topepo.github.io/caret/index.html
dim(preds <- ne.df[, c("E", "N", "ELEVATION_")])
## [1] 305 3
## [1] 305
system.time(
ranger.tune <- train(x = preds, y = response, method="ranger",
tuneGrid = expand.grid(.mtry = 1:3,
.splitrule = "variance",
.min.node.size = 1:10),
trControl = trainControl(method = 'cv'))
)
## Random Forest
##
## 305 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 274, 274, 273, 274, 276, 275, ...
## Resampling results across tuning parameters:
##
## mtry min.node.size RMSE Rsquared MAE
## 1 1 197.5190 0.8882416 154.9209
## 1 2 198.6177 0.8869913 156.1293
## 1 3 201.1845 0.8846177 157.3081
## 1 4 200.5310 0.8852271 157.3311
## 1 5 200.3875 0.8854635 157.3609
## 1 6 200.3338 0.8867208 157.7577
## 1 7 201.0084 0.8860766 157.6522
## 1 8 202.1988 0.8847521 159.0614
## 1 9 201.6555 0.8861695 159.0470
## 1 10 203.1736 0.8845083 160.0833
## 2 1 200.3364 0.8843045 159.7509
## 2 2 200.1645 0.8845410 158.9134
## 2 3 199.9086 0.8843256 159.6087
## 2 4 199.6293 0.8849575 158.8335
## 2 5 198.7004 0.8864801 157.8884
## 2 6 199.1197 0.8860764 157.8304
## 2 7 199.8209 0.8851734 159.4180
## 2 8 199.6407 0.8849978 158.4018
## 2 9 200.0531 0.8850541 158.5416
## 2 10 200.1777 0.8847316 158.5939
## 3 1 202.7557 0.8809174 163.0506
## 3 2 202.6245 0.8811076 163.1930
## 3 3 202.6026 0.8811257 163.6100
## 3 4 201.9123 0.8818638 162.4594
## 3 5 203.1566 0.8805201 163.1596
## 3 6 202.7799 0.8810873 163.1450
## 3 7 202.7258 0.8811961 162.3679
## 3 8 201.9091 0.8820909 161.7236
## 3 9 202.4731 0.8812187 161.9962
## 3 10 202.3668 0.8816545 161.6169
##
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 1, splitrule
## = variance and min.node.size = 1.
names(ranger.tune$result)
## [1] "mtry" "splitrule" "min.node.size" "RMSE"
## [5] "Rsquared" "MAE" "RMSESD" "RsquaredSD"
## [9] "MAESD"
ix <- which.min(ranger.tune$result$RMSE)
ranger.tune$result[ix, c(1,3,4)]
ix <- which.max(ranger.tune$result$Rsquared)
ranger.tune$result[ix, c(1,3,5)]
ix <- which.min(ranger.tune$result$MAE)
ranger.tune$result[ix, c(1,3,6)]
plot.train(ranger.tune, metric="RMSE")
plot.train(ranger.tune, metric="Rsquared")
plot.train(ranger.tune, metric="MAE")
[Figures: cross-validated RMSE, R² and MAE vs. minimal node size, for 1, 2 and 3 randomly selected predictors (mtry)]
Task 93 : Build an optimal model and display its fit to known points. •
The ranger function requires a formula argument (as does randomForest):
(ranger.rf <- ranger(ANN_GDD50 ~ N + E + ELEVATION_, data=ne.df,
mtry=2, min.node.size=5))
## Ranger result
##
## Call:
## ranger(ANN_GDD50 ~ N + E + ELEVATION_, data = ne.df, mtry = 2, min.node.size = 5)
##
## Type: Regression
## Number of trees: 500
## Sample size: 305
## Number of independent variables: 3
## Mtry: 2
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 40600.49
## R squared (OOB): 0.8767443
plot(ne.df$ANN_GDD50 ~ ranger.fits,
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by ranger",
ylab="Actual",
main="Annual GDD50")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid(); abline(0,1)
[Figure: actual vs. ranger-fitted annual GDD50, coloured by state (NJ, NY, PA, VT), with 1:1 line]
Task 94 : Predict over the grid with the optimized ranger model, add the
results to the dataframe, and summarize them. •
summary(dem.ne.m.df$pred.ranger <-
predict(ranger.rf, data=dem.ne.m.df)$predictions)
[Figure: Annual GDD, base 50F, ranger RF prediction (legend: GDD50)]
10.4 Cubist
A popular data-driven model is Cubist, derived from the C4.5 models [20]
but extensively modified and implemented as R package Cubist, an R port
of the Cubist GPL C code released by RuleQuest21 .
library(Cubist)
Cubist builds a set of rules, each with an associated multivariate linear model, so its predictions are continuous values from these local regressions, not discrete values equal to the number of leaves in the regression tree. The advantage over random forests is that the model can be interpreted, to a certain extent. A disadvantage of Cubist is that its algorithm is not easy to understand; however, its results are generally quite good.
Cubist models can be improved in two ways: (1) with “committees” of mod-
els and (2) by adjusting predictions based on nearest neighbours in feature
(predictor) space.
Committees: a form of boosting. A set of model trees is built in sequence. The first tree is the standard Cubist best tree, using the original data in the training set. Subsequent trees are built from adjusted versions of the training set: if the previous Cubist tree over(under)-predicted a value, the response is adjusted down(up)ward for the next model, before it is fit. The final prediction is the average of the predictions from each model tree. The idea here is that the predictions by the sequence of trees vary around the “true” value.
Nearest neighbours: this prediction from the set of trees can then be adjusted using the values of some number of nearest neighbours in feature space. The idea here is that the overall model fits all the training data, but locally we may have some unknown factor that operates only in a local region of feature space, so if we have data from that region, we should give it more weight. Specifically, if the single model or committee predicts a value $\hat{y}$, the adjusted prediction based on the $K$ nearest neighbours in feature space is:

$$\hat{y}' = \frac{1}{K} \sum_{i=1}^{K} w_i \left[\, t_i + \left( \hat{y} - \hat{t}_i \right) \right] \qquad (22)$$

where $t_i$ is the observed response at neighbour $i$, $\hat{t}_i$ is the model's prediction for that neighbour, and $w_i$ is a weight giving more influence to closer neighbours.
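For a concrete (hypothetical) illustration with $K=2$ and unit weights: suppose the committee predicts $\hat{y} = 2500$ GDD50, and the two nearest neighbours in feature space have observed values $t = (2300, 2600)$ but model predictions $\hat{t} = (2450, 2550)$. Then $\hat{y}' = \frac{1}{2}\left[(2300 + 2500 - 2450) + (2600 + 2500 - 2550)\right] = \frac{1}{2}(2350 + 2550) = 2450$: the prediction is pulled downward because, on balance, the model over-predicted at these neighbours.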
system.time(
cubist.tune <- train(x = all.preds, y = all.resp, "cubist",
tuneGrid = expand.grid(.committees = 1:12,
.neighbors = 0:8),
trControl = trainControl(method = 'cv'))
)
## Cubist
##
## 305 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 273, 275, 273, 275, 273, 273, ...
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 210.2573 0.8718786 167.6044
## 1 1 221.5769 0.8628628 178.9769
## 1 2 210.6112 0.8747445 167.1347
## 1 3 208.1492 0.8769341 163.0249
## 1 4 206.4790 0.8783113 161.3244
## 1 5 203.7662 0.8814272 158.7046
## 1 6 201.8184 0.8838142 157.7601
## 1 7 200.8640 0.8850150 157.6147
## 1 8 200.3459 0.8855277 157.6202
## 2 0 205.0058 0.8790333 163.0586
## 2 1 219.6760 0.8660398 176.0418
## 2 2 208.9411 0.8777240 164.6446
## 2 3 206.5594 0.8799319 160.7576
## 2 4 204.7528 0.8814618 159.4444
## 2 5 201.8613 0.8847895 156.9040
## 2 6 199.7938 0.8873324 155.6350
## 2 7 198.6461 0.8887497 155.3224
## 2 8 197.9578 0.8894739 155.1466
## 3 0 202.8691 0.8811647 161.8315
## 3 1 218.8401 0.8657395 176.6138
## 3 2 207.4001 0.8783374 164.9011
## 3 3 204.8648 0.8806994 160.6794
## 3 4 203.1704 0.8821180 159.2110
## 3 5 200.4419 0.8851970 156.7189
## 3 6 198.4931 0.8875712 155.3964
## 3 7 197.4822 0.8888352 154.9244
## 3 8 196.9101 0.8894136 154.7508
## 4 0 202.9098 0.8815437 160.6203
## 4 1 218.7164 0.8663360 176.2514
## 4 2 207.6422 0.8783791 164.5841
## 4 3 205.1735 0.8806853 160.3500
## 4 4 203.4089 0.8822065 158.9820
## 4 5 200.6324 0.8853992 156.4587
## 4 6 198.6518 0.8878509 155.1050
## 4 7 197.6268 0.8891508 154.6902
## 4 8 197.0396 0.8897766 154.5571
## 5 0 201.8732 0.8826101 160.3219
## 5 1 218.5642 0.8658323 175.6422
## 5 2 207.5171 0.8780473 164.2672
## 5 3 205.1187 0.8802642 160.8382
## 5 4 203.4495 0.8816942 159.7198
## 5 5 200.7792 0.8847665 157.4239
## 5 6 198.8099 0.8872013 156.1177
## 5 7 197.7537 0.8885439 155.7233
## 5 8 197.1678 0.8891688 155.5246
## 6 0 201.7332 0.8827977 159.1467
## 6 1 218.5206 0.8660110 175.0839
## 6 2 207.7230 0.8778793 163.9450
## 6 3 205.3653 0.8800645 160.4224
## 6 4 203.6321 0.8815789 159.0473
## 6 5 200.9397 0.8847180 156.7534
## 6 6 198.9269 0.8872355 155.3609
## 6 7 197.8466 0.8886171 155.0002
## 6 8 197.2299 0.8892757 154.7465
## 7 0 202.6818 0.8820772 160.5479
## 7 1 218.2121 0.8664657 175.3425
## 7 2 207.6407 0.8781200 164.5592
## 7 3 205.2958 0.8802922 160.7983
## 7 4 203.5931 0.8817662 159.6460
## 7 5 200.9625 0.8848071 157.3061
## 7 6 198.9925 0.8872597 155.9881
## 7 7 197.9128 0.8886249 155.6001
## 7 8 197.3177 0.8892481 155.3269
## 8 0 201.7402 0.8829329 159.2934
## 8 1 217.9982 0.8668571 174.9493
## 8 2 207.4035 0.8785513 163.9813
## 8 3 205.1164 0.8806774 160.2740
## 8 4 203.4035 0.8821619 159.1970
## 8 5 200.7747 0.8852188 156.8818
## 8 6 198.8194 0.8876589 155.5882
## 8 7 197.7478 0.8890136 155.2713
## 8 8 197.1300 0.8896540 154.9691
## 9 0 203.0686 0.8816014 160.9398
## 9 1 218.3989 0.8663564 175.6231
## 9 2 207.9935 0.8777747 165.0536
## 9 3 205.7117 0.8798509 161.2064
## 9 4 204.0382 0.8812716 160.1078
## 9 5 201.4694 0.8842362 157.7847
## 9 6 199.5703 0.8866051 156.5587
## 9 7 198.5427 0.8879052 156.2332
## 9 8 197.9861 0.8884772 156.0188
## 10 0 202.3774 0.8819816 160.2178
## 10 1 218.1874 0.8666635 175.0949
## 10 2 207.8090 0.8780487 164.4668
## 10 3 205.6584 0.8799742 161.0972
## 10 4 204.0275 0.8813430 160.0624
## 10 5 201.4581 0.8843262 157.6882
## 10 6 199.5502 0.8867127 156.4890
## 10 7 198.5114 0.8880297 156.1709
## 10 8 197.9246 0.8886317 155.9063
## 11 0 203.3921 0.8809251 161.3420
## 11 1 218.6205 0.8660776 175.5133
## 11 2 208.3638 0.8772632 165.1390
## 11 3 206.2135 0.8791630 161.7741
## 11 4 204.6161 0.8804841 160.7965
## 11 5 202.0548 0.8834497 158.3949
## 11 6 200.1707 0.8858075 157.2140
## 11 7 199.1492 0.8871015 156.8858
## 11 8 198.5938 0.8876680 156.6825
## 12 0 203.0938 0.8808801 160.8818
## 12 1 218.6963 0.8660353 175.5390
## 12 2 208.4334 0.8772327 165.0972
## 12 3 206.3024 0.8791101 161.5688
## 12 4 204.6967 0.8804346 160.5577
## 12 5 202.1413 0.8834106 158.2277
## 12 6 200.2471 0.8857904 157.0856
## 12 7 199.2170 0.8870987 156.7516
## 12 8 198.6299 0.8876975 156.5114
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 3 and
## neighbors = 8.
plot(cubist.tune, metric="RMSE")
plot(cubist.tune, metric="Rsquared")
plot(cubist.tune, metric="MAE")
[Figures: cross-validated RMSE, R² and MAE vs. number of committees, for 0–8 instances (neighbours)]
Your results will be different, because of the randomness in the cross-validation splits. So, your choice of optimal values may also be different from those presented here.
In this case committees improve the model. Although the best result is
with 10 committees, looking at the graphs we can see that 6 gives almost
equally good results, so we prefer the simpler model. Neighbours definitely
do improve the model. Using one neighbour (the closest in feature space)
makes the model much worse – too much fine adjustment to the training set.
Using two to seven neighbours gives improvement, using eight is only slightly
better. This shows that the overall model can benefit by local adjustment.
Task 97 : Build an “optimum” Cubist model. •
The model is fit with the cubist function; the optimal number of committees
is specified at model building with the committees argument. Adjustment
by neighbours is done at prediction, because only then do we know which
point we are predicting, hence which are its nearest neighbours. This is specified by the neighbors argument to the predict function of the Cubist package.
require(Cubist)
c.model <- cubist(x = all.preds, y = all.resp, committees=6)
summary(c.model)
##
## Call:
## cubist.default(x = all.preds, y = all.resp, committees = 6)
##
##
## Cubist [Release 2.07 GPL Edition] Sat May 11 11:01:09 2024
## ---------------------------------
##
## Target attribute `outcome'
##
## Read 305 cases (4 attributes) from undefined.data
##
## Model 1:
##
## Rule 1/1: [109 cases, mean 2114.8, range 795 to 2845, est err 151.9]
##
## if
## N > -20368.92
## then
## outcome = 2715.4 - 0.496 ELEVATION_ - 0.00105 N - 0.00068 E
##
## Rule 1/2: [196 cases, mean 2741.5, range 1335 to 4021, est err 168.7]
##
## if
## N <= -20368.92
## then
## outcome = 2704 - 0.00264 N - 0.606 ELEVATION_ - 0.00021 E
##
## Model 2:
##
## Rule 2/1: [305 cases, mean 2517.5, range 795 to 4021, est err 171.2]
##
## outcome = 2826.9 - 0.599 ELEVATION_ - 0.00205 N - 0.00023 E
##
## Model 3:
##
## Rule 3/1: [174 cases, mean 2171.9, range 795 to 3084, est err 172.4]
##
## if
## N > -145485.8
## then
## outcome = 2642.4 - 0.474 ELEVATION_ - 0.00077 E - 0.00045 N
##
## Rule 3/2: [74 cases, mean 2729.1, range 1774 to 3428, est err 178.6]
##
## if
## N <= -145485.8
## ELEVATION_ > 330
## then
## outcome = 1956.9 - 0.00513 N - 0.457 ELEVATION_ - 9e-05 E
##
## Rule 3/3: [76 cases, mean 3090.3, range 2078 to 4021, est err 182.4]
##
## if
## ELEVATION_ <= 330
## then
## outcome = 3106.6 - 2.148 ELEVATION_ - 0.00188 N
##
## Model 4:
##
## Rule 4/1: [305 cases, mean 2517.5, range 795 to 4021, est err 170.7]
##
## outcome = 2842.7 - 0.613 ELEVATION_ - 0.00206 N - 0.0003 E
##
## Model 5:
##
## Rule 5/1: [174 cases, mean 2171.9, range 795 to 3084, est err 174.1]
##
## if
## N > -145485.8
## then
## outcome = 2629.8 - 0.462 ELEVATION_ - 0.00071 E - 0.00043 N
##
## Rule 5/2: [74 cases, mean 2729.1, range 1774 to 3428, est err 183.9]
##
## if
## N <= -145485.8
## ELEVATION_ > 330
## then
## outcome = 1925.4 - 0.0052 N - 0.45 ELEVATION_ - 6e-05 E
##
## Rule 5/3: [76 cases, mean 3090.3, range 2078 to 4021, est err 182.7]
##
## if
## ELEVATION_ <= 330
## then
## outcome = 3112.7 - 2.176 ELEVATION_ - 0.00185 N
##
## Model 6:
##
## Rule 6/1: [305 cases, mean 2517.5, range 795 to 4021, est err 171.1]
##
## outcome = 2854.2 - 0.623 ELEVATION_ - 0.00208 N - 0.00036 E
##
##
## Evaluation on training data (305 cases):
##
## Average |error| 169.9
## Relative |error| 0.36
## Correlation coefficient 0.93
##
##
## Attribute usage:
## Conds Model
##
## 43% 100% N
## 16% 100% ELEVATION_
## 92% E
##
##
## Time: 0.0 secs
Model 1 splits at N = -20368.92, near the centre of the map, and fits a slightly different linear regression in each of the two halves; the elevation and Northing coefficients are both larger (in absolute value) for the southern half. Model 2 has no split, just a single linear model. It is not the same as the linear model fit in §4, because it is fit to values adjusted by the Model 1 predictions. Model 3 again splits on Northing, but much further south, at N = -145485.8, and also on elevation, at ELEVATION_ = 330.
Task 98 : Examine the fit of the Cubist model to the known points. •
# predictive accuracy
cubist.fits <- predict(c.model, newdata=all.preds,
neighbors=cubist.tune$bestTune$neighbors)
## Test set RMSE
sqrt(mean((cubist.fits - all.resp)^2))
## [1] 164.9617
## [1] 0.918133
plot(ne.df$ANN_GDD50 ~ cubist.fits,
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by cubist",
ylab="Actual",
main="Annual GDD50")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid(); abline(0,1)
[Figure: actual vs. Cubist-fitted annual GDD50, coloured by state (NJ, NY, PA, VT), with 1:1 line]
Notice that this Cubist model severely underpredicts the lowest GDD50 at
Mount Mansfield (VT). This is because it is too far in elevation space from
any other observation points to have any neighbours that could modify the
linear regressions in the rule set, which reduce GDD50 with elevation.
Task 99 : Predict over the study area with the Cubist model and display
the resulting map. •
summary(dem.ne.m.df$pred.cubist <-
  predict(c.model, newdata=dem.ne.m.df,
          neighbors=cubist.tune$bestTune$neighbors))
display.prediction.map("pred.cubist",
"Annual GDD, base 50F, Cubist prediction",
"GDD50")
[Figure: Annual GDD, base 50F, Cubist prediction (legend: GDD50)]
Although several of the rules split on Northing, that is not visible in this
map, because the results of each rule are averaged.
10.5 Additional covariables
Machine learning models are generally applied to problems with large numbers of predictors. In the example above we only used three. To illustrate the more common situation, we add some additional predictors that might be related to agricultural climate.
• The two Great Lakes in this region (Erie and Ontario) may have a
local climate effect, because of the high heat capacity of the lake water,
which can extend the late summer into the early fall.
• The Atlantic Ocean has a local cooling effect along the shore, especially
on Long Island.
• Local terrain may influence climate. For example, narrow valleys in
the Finger Lakes regions are known as “frost pockets” and often have
morning ground fogs in spring and early summer.
– The multiresolution index of valley bottom flatness (MRVBF) [5]
identifies valley bottoms based on their topographic signature as
flat low-lying areas, at increasingly-broad scales, and combines
these into a single index.
– The terrain ruggedness index (TRI) [21] expresses heterogeneity.
It is the sum change in elevation between a grid cell and its eight
neighbours.
• Population density may affect local climate. Urban areas affect the lo-
cal climate, typically making it warmer, whereas very sparsely-populated
rural areas are typically cooler, with more precipitation as snow.
names(dem.ne.m.df)
## N -0.174523066 0.2902955 -0.4805457 -0.38103463
## dist.lakes 0.038256788 -0.0867092 0.5157054 0.35285482
## dist.coast 0.007069881 0.1043530 -0.5880433 -0.40621053
## mrvbf 1.000000000 -0.2890518 0.1215610 0.24153216
## tri3 -0.289051831 1.0000000 -0.2965867 -0.24542933
## pop15 0.121560957 -0.2965867 1.0000000 0.76256951
## pop2pt5 0.241532159 -0.2454293 0.7625695 1.00000000
corrplot::corrplot(cor(ne.df[, pred.field.names]),
diag = FALSE, type = "upper",
method = "ellipse",
addCoef.col = "black")
[Figure: corrplot of the correlation matrix of the predictors]
The distances to lakes and coast are inversely related, and these are related to the coordinates, because of the geography. Of course the two population densities are positively correlated. So there is definite colinearity of predictors.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.9722 1.3402 1.0421 0.90030 0.84913 0.60906
## Proportion of Variance 0.4322 0.1996 0.1207 0.09006 0.08011 0.04122
## Cumulative Proportion 0.4322 0.6317 0.7524 0.84246 0.92257 0.96379
## PC7 PC8 PC9
## Standard deviation 0.44312 0.35714 0.04461
## Proportion of Variance 0.02182 0.01417 0.00022
## Cumulative Proportion 0.98561 0.99978 1.00000
pc$rotation
## N 0.3109224 -0.314313343 0.62114856 -0.12705918
## dist.lakes -0.4193089 -0.212978286 -0.29841764 0.29538055
## dist.coast 0.4326850 0.280778692 0.15863513 -0.18364982
## mrvbf -0.1039718 0.409092410 0.35508637 0.57166123
## tri3 0.1774628 -0.432105357 -0.24569463 -0.23203648
## pop15 -0.4209219 0.114890542 0.03467234 -0.42571320
## pop2pt5 -0.3650623 0.208263038 0.18319644 -0.52134788
## PC5 PC6 PC7 PC8
## ELEVATION_ -0.162470428 0.79755536 0.12334465 0.105094901
## E 0.009735334 0.25594993 0.07815143 0.057295979
## N 0.012651062 0.16333795 -0.03239904 0.167173955
## dist.lakes -0.026389104 -0.07907947 0.14921339 0.685065908
## dist.coast -0.022078707 -0.21343002 -0.01103666 0.674014967
## mrvbf -0.587486966 0.10787655 -0.10906923 -0.016268807
## tri3 -0.736527939 -0.31693438 -0.14230730 -0.066428423
## pop15 -0.056171541 0.30501125 -0.70805527 0.171490413
## pop2pt5 -0.285298260 0.12162744 0.64942480 0.005220769
## PC9
## ELEVATION_ 0.0006230410
## E -0.6118814943
## N 0.5888366804
## dist.lakes 0.3224711541
## dist.coast -0.4178478488
## mrvbf 0.0033655551
## tri3 -0.0004812559
## pop15 -0.0158959011
## pop2pt5 0.0048274545
biplot(pc)
biplot(pc, choices=3:4)
[Figures: PCA biplots of the stations and predictor loadings, PC1 vs. PC2 and PC3 vs. PC4]
The two population densities are closely positively related in PC1 and PC2. Distances to lakes and to the coast are almost perfectly inversely related in PC1 and PC2. Population is correlated with distance to lakes (and so inversely with distance to the coast); this is a geographic accident, because population is denser towards the SE (NYC, NJ). The terrain indices and geography also have a fortuitous relation. We cannot easily reduce these PCs to meaningful factors.
All the previous methods could be extended with this set of variables. Here
we look at one, Random Forest.
dim(preds <- ne.df[, pred.field.names])
## [1] 305 9
## [1] 305
system.time(
ranger.tune <- train(x = preds, y = response,
method="ranger",
tuneGrid = expand.grid(.mtry = 2:7,
.splitrule = "variance",
.min.node.size = 1:10),
trControl = trainControl(method = 'cv'))
)
## 31 5 1 204.2634
ix <- which.max(ranger.tune$result$Rsquared)
ranger.tune$result[ix, c(1,3,5)]
ix <- which.min(ranger.tune$result$MAE)
ranger.tune$result[ix, c(1,3,6)]
plot.train(ranger.tune, metric="RMSE")
plot.train(ranger.tune, metric="Rsquared")
[Figures: cross-validated RMSE and R² vs. minimal node size for the extended predictor set]
All three methods agree on five predictors to try at each split, and node size
of three. There is not much improvement past mtry=4, and min.node.size
from 1 to 3 give similar results.
Build a model with the optimal parameters:
rf.ext <- ranger(model.formula,
data=ne.df, importance="impurity", mtry=4, min.node.size=3,
oob.error=TRUE, num.trees=1024)
print(rf.ext)
## Ranger result
##
## Call:
## ranger(model.formula, data = ne.df, importance = "impurity", mtry = 4, min.node.size = 3, oob.
##
## Type: Regression
## Number of trees: 1024
## Sample size: 305
## Number of independent variables: 9
## Mtry: 4
## Target node size: 3
## Variable importance mode: impurity
## Splitrule: variance
## OOB prediction error (MSE): 41773.81
## R squared (OOB): 0.8731823
str(rf.ext, max.level=1)
## List of 16
## $ predictions : num [1:305] 3484 3465 3563 3505 2891 ...
## $ num.trees : num 1024
## $ num.independent.variables: num 9
## $ mtry : num 4
## $ min.node.size : num 3
## $ variable.importance : Named num [1:9] 14568004 4201191 24527912 27200127 5687762 ...
## ..- attr(*, "names")= chr [1:9] "ELEVATION_" "E" "N" "dist.lakes" ...
## $ prediction.error : num 41774
## $ forest :List of 7
## ..- attr(*, "class")= chr "ranger.forest"
## $ splitrule : chr "variance"
## $ treetype : chr "Regression"
## $ r.squared : num 0.873
## $ call : language ranger(model.formula, data = ne.df, importance = "impurity",
## $ importance.mode : chr "impurity"
## $ num.samples : int 305
## $ replace : logi TRUE
## $ dependent.variable.name : chr "ANN_GDD50"
## - attr(*, "class")= chr "ranger"
round(rf.ext$prediction.error/rf.ext$num.samples)
## [1] 137
[Figure: measured ANN_GDD50 vs. ranger RF fit, 9-predictor model, with 1:1 line]
display.prediction.map("pred.rf.ext",
"Annual GDD, base 50F, ranger prediction, 9 predictors",
"GDD50")
[Figure: Annual GDD, base 50F, ranger prediction with 9 predictors (legend: GDD50)]
Task 100 : Compute and display the minimum depth distribution of the
9-predictor model. •
The effect of variable substitution is clear: when one predictor is not used at
a particular split, a correlated predictor can partially substitute for it. Here
we have N and dist.lakes as one pair; another is pop15 and dist.coast
as a substitute for ELEVATION (highest populations in the NYC area).
require(randomForestExplainer)
tmp <- min_depth_distribution(rf.ext)
plot_min_depth_distribution(tmp)
[Figure: distribution of minimal depth and its mean for each predictor; mean minimal depths: N 1.2, dist.lakes 1.27, ELEVATION_ 1.61, pop15 1.88, pop2pt5 2.57, E 2.65, dist.coast 2.99, tri3 3.07, mrvbf 3.76]
10.7 Shapley values
Shapley values, a concept from cooperative game theory, explain a single prediction by attributing to each predictor its average marginal contribution to that prediction, averaged over all possible coalitions.
10.7.1 *Theory
The Shapley value is defined via a value function $\mathit{val}$ of the players (here, the predictors) in the “game” $S$.
The Shapley value $\phi_j(\mathit{val})$ of a feature $j$'s value is its contribution to the payout, weighted and summed over all possible feature value combinations:
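In the standard formulation (following Molnar's presentation; this sketch assumes $p$ features and writes $\mathit{val}(S)$ for the payout of the feature subset $S$):

$$\phi_j(\mathit{val}) = \sum_{S \subseteq \{1,\dots,p\} \setminus \{j\}} \frac{|S|!\,(p-|S|-1)!}{p!}\,\bigl(\mathit{val}(S \cup \{j\}) - \mathit{val}(S)\bigr)$$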
10.7.2 Practice
Task 101 : Build a Predictor object for use with iml functions. •
require(iml)
# matrix of predictors to be evaluated
X <- ne.df[, pred.field.names]
# the Predictor object wraps the fitted model, the predictor matrix, and the response
predictor <- Predictor$new(model = rf.ext, data = X, y = ne.df[, "ANN_GDD50"])
Task 102 : Compute the Shapley values for the predictors at some inter-
esting stations. •
Now we can compute the Shapley values. For example, the values for the
station with the fewest GDD50:
ix <- which.min(ne.df[, "ANN_GDD50"])
ne.df[ix, 2:3]
## STATE STATION_NA
## 4716 VT MOUNT MANSFIELD
X[ix,]
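A minimal sketch of computing and plotting the Shapley values for this station with the iml package (the exact call used to produce the figure below is an assumption):
shap.station <- Shapley$new(predictor, x.interest = X[ix, ])
plot(shap.station)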
[Figure: Shapley values for the Mount Mansfield station; actual prediction 1134.48, average prediction 2517.12; the value of each predictor is shown on the axis]
The figure shows the actual values of each predictor at the observation point,
and the 𝜙 value, i.e., numerical contribution to the difference between actual
and average prediction. These sum to the difference. At this climate station
all contributions are negative, i.e., they assist in lowering the prediction from
the average. The elevation contributes the most; Northing is also important.
Compare to the Shapley values at a more southerly station, Pittsburgh (PA)
airport:
ix <- which(ne.df[, "STATION_NA"] == "PITTSBURGH INTL AP")
X[ix,]
[Figure: Shapley values (phi, scale 0 to 200) for the Pittsburgh Intl AP station. Actual prediction: 2880.62; average prediction: 2517.12. Feature values: ELEVATION_=1150, N=−213573.1, E=−358143.8, dist.coast=454398.2, dist.lakes=162488.5, tri3=38.1, mrvbf=0.73, pop2pt5=2.46, pop15=2.49]
Here the prediction is greater than the average, with positive contributions
by the negative Northing and the two population densities.
So we can see for each climate station the reason for its prediction. Molnar
emphasizes: “[B]e careful to interpret the Shapley value correctly: [it] is the
average contribution of a feature value to the prediction in different coalitions.
[It] is not the difference in prediction when we would remove the feature from
the model.”
Task 103 : Compute the SHAP values for all predictors and stations. •
Note the use of the nsim argument: “To obtain the most accurate results,
nsim should be set as large as feasibly possible.”
require(fastshap)
# a prediction function
pfun <- function(object, newdata) {
predict(object, data = newdata)$predictions
}
# matrix of predictors to be evaluated
X <- ne.df[ , pred.field.names]
22 https://github.com/bgreenwell/fastshap
fshap <- fastshap::explain(object = rf.ext,
X = X,
shap_only = FALSE, # also return feature and baseline values
pred_wrapper = pfun,
nsim = 24)
names(fshap)
head(fshap$shapley_values)
ne.df[1, 2:3]
## STATE STATION_NA
## 2852 NJ ATLANTIC CITY AP
ne.df[1, pred.field.names]
Each observation has a set of SHAP values. These are the contribution to
the difference between the observed value at that point and the average value
of all observations, i.e., how much this predictor affects the single prediction,
away from the “null” model of the overall average. For the first observation
we see that Northing and distance to the Great Lakes have the most effect;
elevation is also important.
To visualize all the SHAP values for all the predictors and observations we
can use the shapviz package. The shapviz function from this package
requires (1) a matrix of computed SHAP values, (2) a set of features for
which to display the values, and (3) a list of observations.
Once the shapviz visualization object is set up, we can display the SHAP
values with the sv_importance function. We have a fairly small dataset and
number of predictors, so we can display all of them together in a so-called
"beeswarm" plot.
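A minimal sketch of how the shapviz object and the importance plot can be produced (the
constructor arguments are assumptions; the object name sv.fshap matches its later use):
require(shapviz)
# build the visualization object from the SHAP matrix and the feature values
sv.fshap <- shapviz(fshap$shapley_values, X = X)
class(sv.fshap)
# display all SHAP values for all observations, coloured by feature value
sv_importance(sv.fshap, kind = "beeswarm")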
## [1] "shapviz"
[Figure: SHAP importance beeswarm plot; features shown include dist.lakes, ELEVATION_, pop15, E, pop2pt5, dist.coast, tri3, mrvbf; points coloured by feature value (Low–High)]
The dependence of the predictions on any predictor, vs. its value, and its
relation to any other single predictor, can be shown with the sv_dependence
function:
Task 105 : Show the SHAP values for distance to lakes, coloured by Nor-
thing. •
sv_dependence(sv.fshap, v = "dist.lakes", color_var = "N")
ix <- which(fshap$shapley_values[, "dist.lakes"] > 350)
ne.df[ix, 2:5]
[Figure: SHAP values for dist.lakes vs. its value, coloured by Northing (sv_dependence)]
The strongly positive contributions of distance to lakes all occur at stations far from the
lakes and increase towards more negative Northing; these stations are mostly in New Jersey.
Is the extended random forest model (with more covariates) more successful
than the base model (using only Northing and elevation) in modelling the
growing degree days?
Task 106 : Compute and display the difference between the 9-predictor and
2-predictor maps. •
summary(dem.ne.m.df$diff.rf.9.3 <-
dem.ne.m.df$pred.rf.ext - dem.ne.m.df$pred.rf)
display.difference.map("diff.rf.9.3",
"Difference RF 9 and RF 3 predictors",
"+/- GDD50")
[Figure: Difference RF 9 and RF 3 predictors (+/− GDD50)]
The largest differences are outside the study area, mostly (?) due to distance
from the Great Lakes. Within the study area there are small negative adjustments on Long Island,
but also near Lake Ontario (the latter probably not correct).
11.1 * Theory
are overall affine transformation parameters (to center the function in 2D)
and 2𝑘 of which link to the control points.
The general method is to minimize the residual sum of squares (RSS) of
the fitted function, subject to a constraint that the function be “smooth” in
some sense; this is expressed by a roughness penalty which balances the fit
to the observations with smoothness. This is a minimization problem. If xi
is one point in 2D space (i.e., it has two coördinates) and 𝑦 𝑖 is the attribute
value at the same points, the aim is to minimize:
\[
\min_f \sum_{i=1}^{N} \{ y_i - f(\mathbf{x}_i) \}^2 + \lambda J[f] \qquad (23)
\]
\[
J[f] = \int\!\!\int_{\mathbb{R}^2} \left[ \left( \frac{\partial^2 f(\mathbf{x})}{\partial x_1^2} \right)^2
 + 2 \left( \frac{\partial^2 f(\mathbf{x})}{\partial x_1 \, \partial x_2} \right)^2
 + \left( \frac{\partial^2 f(\mathbf{x})}{\partial x_2^2} \right)^2 \right] \mathrm{d}x_1 \, \mathrm{d}x_2 \qquad (24)
\]
where (𝑥 1 , 𝑥 2 ) are the two coördinates of the vector x. In practice the double
integral is discretized over some grid known as knots; these may be defined
by the observations or may be a different set, maybe an evenly-spaced grid.
This penalty can be interpreted as the "bending energy" of a thin plate represented
by the function $f(\mathbf{x})$; by minimizing this energy, the spline function
over the 2D plane behaves as a thin (flexible) plate which, according to the first term
of Equation 23, would be forced to pass through the data points with minimum
bending. However, the second term of Equation 23 allows some smoothing:
the plate does not have to bend so much, since it is allowed to pass "close
to" but not necessarily through the data points. The higher the $\lambda$, the less
exact is the fit. This has two purposes: (1) it allows for measurement error,
so the data points are not taken as exact; (2) it results in a smoother surface.
Cross-validation is then used to determine the degree of smoothness.
The solution to Equation 23 is a linear function:
\[
f(\mathbf{x}) = \beta_0 + \boldsymbol{\beta}^T \mathbf{x} + \sum_{j=1}^{N} \alpha_j h_j(\mathbf{x}) \qquad (25)
\]
where the 𝛽 account for the overall trend and the 𝛼 are the coefficients of
the warping.
The set of functions $h_j(\mathbf{x})$ is the basis kernel, also called a radial basis
function (RBF); for thin-plate splines in 2D this is
\[
h_j(\mathbf{x}) = r^2 \log r
\]
where the norm distance $r = \| \mathbf{x} - \mathbf{x}_j \|$ is also called the radius of the basis
function. The norm is usually the Euclidean (straight-line) distance.
11.2 Practice
Task 107 : Set up for thin-plate splines and compute the minimum-
curvature spline, subject to roughness constraint determined by generalized
cross-validation. •
The Tps function of the fields package computes this; however, the coördinates
must be formatted as a matrix field in the dataframe, using the matrix
function.
require(fields)
ne.tps <- ne.df
ne.tps$coords <- matrix(c(ne.df$E, ne.df$N), byrow=F, ncol=2)
surf.1 <-Tps(ne.tps$coords, ne.tps$ANN_GDD50)
class(surf.1)
summary(surf.1)
## CALL:
## Tps(x = ne.tps$coords, Y = ne.tps$ANN_GDD50)
##
## Number of Observations: 305
## Number of unique points: 305
## Number of parameters in the null space 3
## Parameters for fixed spatial drift 3
## Effective degrees of freedom: 88.5
## Residual degrees of freedom: 216.5
## MLE tau 220.7
## GCV tau 228.6
## MLE sigma 129700000
## Scale passed for covariance (sigma) <NA>
## Scale passed for nugget (tau^2) <NA>
## Smoothing parameter lambda 0.0003755
##
## Residual Summary:
## min 1st Q median 3rd Q max
## -882.40000 -113.60000 -0.01774 127.30000 574.90000
##
## Covariance Model: Rad.cov
## Names of non-default covariance arguments:
## p
##
## DETAILS ON SMOOTHING PARAMETER:
## Method used: GCV Cost: 1
## lambda trA GCV GCV.one GCV.model tauHat
## 3.755e-04 8.846e+01 7.359e+04 7.359e+04 NA 2.286e+02
##
## Summary of all estimates found for lambda
## lambda trA GCV tauHat -lnLike Prof converge
## GCV 0.0003755 88.46 73587 228.6 2149 13
## GCV.model NA NA NA NA NA NA
## GCV.one 0.0003755 88.46 73587 228.6 NA 13
## RMSE NA NA NA NA NA NA
## pure error NA NA NA NA NA NA
## REML 0.0009690 59.35 74532 245.0 2146 7
Task 108 : Set up a grid covering the four States at approximately 9 x 9 km
resolution, and convert to a dataframe with the coördinates as a matrix field.
This last is because the fields package works with coördinate matrices. •
The spsample function of the sp package can make various sampling plans,
including a regular grid, within a study area.
We compute the approximate area of the four States, in km², from the bounding
box in the US Census state shapefile; the bounding-box coördinates are in m, so the
area must be converted from m² to km². We then ask for a grid with each cell covering about
(9 km)².
Note: Recall, this is not for exact prediction, just to get an overview of the
regional distribution of the variable of interest.
resolution <- 9
st_bbox(state.ne.m)
## xmax
## 510213.8
## xmax
## 6299
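A minimal sketch of how such a grid can be built with spsample, assuming the bounding-box
area computed from state.ne.m and the resolution set above (the object names states.grid and
states.grid.df follow their later use):
require(sp)
bb <- st_bbox(state.ne.m)                  # bounding box of the four States, in m
area.km2 <- (bb["xmax"] - bb["xmin"]) * (bb["ymax"] - bb["ymin"]) / 10^6
n.cells <- round(area.km2 / resolution^2)  # number of ~ (9 km)^2 cells
states.grid <- spsample(as_Spatial(state.ne.m), n = n.cells, type = "regular")
# store the coördinates as a matrix field, as required by fields
states.grid.df <- data.frame(coords = I(coordinates(states.grid)))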
Task 109 : Predict over the four-states grid using the fitted thin-plate
spline. •
The predict.Krig method of the fields package computes the prediction
as a matrix.
surf.1.pred <- predict.Krig(surf.1, states.grid.df$coords)
class(surf.1.pred)
dim(surf.1.pred)
## [1] 6285 1
summary(as.vector(surf.1.pred))
[Figure: Annual GDD50, thin-plate spline prediction over the four-State grid]
This map captures the main features of the annual GDD50 fairly well, even
though elevation was not used in the thin-plate spline empirical model. In
particular, it captures the high-GDD areas along the Lake Ontario and Lake
Erie plains, the low-GDD cold spots in the Adirondack, Catskill and Green
Mountains, as well as in the Allegheny State Park area of SW NY/NE PA,
and the very high GDD-area around the Delaware Bay. It does not account
for local variations in GDD because of elevation.
To determine the predicted value at any location, just predict at that point.
To make the target point an sfc_POINT we first create the point geometry
with st_point and then make it a spatial object with Simple Features
geometry with the st_sfc method. And of course we need to specify its CRS.
Since we first specify geographic coördinates, we must then transform to the CRS
used in the grid.
pt <- st_sfc(st_point(x=c(-76.402175, 42.453271)))
st_crs(pt) <- 4326
pt <- st_transform(pt, st_crs(states.grid))
st_coordinates(pt)
## X Y
## [1,] -33055.67 -5118.213
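# (a sketch, assumed): predict with the fitted thin-plate spline at this point,
# passing its coördinates as a one-row matrix
predict.Krig(surf.1, st_coordinates(pt))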
## [,1]
## [1,] 2205.347
The thin-plate spline interpolation predicts 2205 annual GDD50 for this
location.
12 Local interpolators
A purely local approach to prediction is to ignore any causative factors (in
this case, northing and elevation) and just use “nearby” known observations
to predict at any location. This is an operational realization of the well-
known Tober’s First Law of Geography: “everything is related to everything
else, but near things are more related than distant things” [23]23 . In the
current case this is not advisable, because of the strong and useful relation
of the target variable ANN_GDD50 with the covariables, which we have seen
in earlier sections. However, for completeness we illustrate this method.
Local approaches can be model-based or model-free.
The best-known model-based method is Ordinary Kriging, which relies on
a model of local spatial dependence of the target variable. In this case the
universal model of spatial distribution shown in Equation 1 is simplified to:
\[
Z(\mathbf{s}) = \mu + \varepsilon(\mathbf{s}) + \varepsilon'(\mathbf{s})
\]
where the deterministic component is reduced to a single unknown constant mean $\mu$.
We model 𝜀(s) with an authorized model of spatial dependence, usually
with an authorized variogram model. This model is then used in Ordinary
Kriging (OK). This also reveals the magnitude of 𝜀 ′ (s), i.e., the pure noise
that can not be modelled nor predicted.
Good explanations of Ordinary Kriging are from Webster and Oliver [24]
and Goovaerts [6].
## xmax
## 321841.2
## xmax
## 21456.08
[Figure: experimental variogram of ANN_GDD50 (point-pair counts shown at each bin)]
Within this range the variogram is unbounded, which means the maximum
variation has not been reached even between observations at the largest
separations. This is because of the strong regional effect of Northing.
Note: Notice the fairly large nugget variance. In the residual variogram
used in KED and RK there was no nugget. How can this be explained?
[Figures: experimental variograms of ANN_GDD50 and the fitted variogram model (point-pair counts shown at each bin)]
This fits fairly well; our original estimates were not too far off. The range is
fitted to be about 35% longer than our estimate.
Once we have the fitted model and the data points, we can predict at any
location by solving the OK system for the weights $\lambda_i$ to be used in
the weighted average (Eqn. 14); the mathematics of OK were presented in §7.1.
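The prediction itself can be sketched as follows, assuming the fitted variogram model vmf.o
used for cross-validation below, the prediction grid dem.ne.m.sf used elsewhere in this
tutorial, and the 24 nearest stations (see Task 114):
# Ordinary Kriging of annual GDD50 over the prediction grid
k.ok <- krige(ANN_GDD50 ~ 1, locations = ne.m, newdata = dem.ne.m.sf,
              model = vmf.o, nmax = 24)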
summary(k.ok)
Task 114 : List the 24 stations that were used for the local prediction at
the block containing the Ithaca weather station, in order of their separations
from Ithaca. •
e.ith.pt <- subset(ne.m, (ne.m$STATION_NA=="ITHACA CORNELL UNIV"))
dist.pt <- st_distance(ne.m, e.ith.pt)
(ix <- order(dist.pt)[1:24])
print(cbind(ne.df[ix,c('STN_NAME','STATE','ELEVATION_','ANN_GDD50')],
dist=round(dist.pt[ix,]/1000,1)))
## 3023 AURORA RESEARCH FARM NY 830 2590 35.2 [m]
## 3128 TULLY HEIBERG FOREST NY 1899 1746 46.8 [m]
## 3028 BINGHAMTON BROOME CO AP NY 1600 2141 47.7 [m]
## 3055 ELMIRA NY 844 2395 48.4 [m]
## 3141 WAVERLY NY 845 2304 50.5 [m]
## 3056 ENDICOTT NY 827 2245 51.2 [m]
## 3022 AUBURN NY 744 2352 52.9 [m]
## 3107 PENN YAN NY 830 2430 55.0 [m]
## 3059 GENEVA RESEARCH FARM NY 718 2396 67.4 [m]
## 3093 MORRISVILLE 6 SW NY 1300 1818 72.6 [m]
## 3014 ADDISON NY 980 2123 74.0 [m]
## 3027 BATH NY 1120 2055 74.9 [m]
## 3102 NORWICH NY 1020 2203 76.1 [m]
## 3881 TOWANDA 1 ESE PA 750 2447 77.8 [m]
## 3126 SYRACUSE HANCOCK INTL AP NY 410 2467 79.8 [m]
## 3036 CANANDAIGUA 3 S NY 720 2531 81.3 [m]
## 3851 MONTROSE PA 1420 2064 81.3 [m]
## 3119 SHERBURNE 2 S NY 1080 2157 82.7 [m]
## 3025 BAINBRIDGE 2 E NY 994 2063 84.4 [m]
## 3050 DEPOSIT NY 1000 2245 94.1 [m]
## 3121 SODUS CENTER NY 420 2558 95.5 [m]
Notice the wide range of elevations in this group of stations, from 410’ to
almost 1900’.
[Figure: OK prediction of annual GDD50 (var1.pred), shown in geographic coördinates]
For comparison with the maps made with other techniques, we also add the
results to the data.frame covering the same area and then display it with
ggplot:
dim(dem.ne.m.df)
## [1] 40052 32
display.prediction.map("pred.ok",   # field name assumed for the OK predictions added above
                       "Annual GDD base 50F, OK prediction",
                       "GDD50")
[Figure: Annual GDD base 50F, OK prediction]
We see this is much smoother than other methods, mainly because it does
not take elevation into account.
[Figure: Annual GDD base 50F, standard error of the OK prediction (GDD50 s.d.)]
Clearly, the further from local information, the more the prediction is un-
certain, e.g., northwest of Lake Ontario. The prediction uncertainty is least
near stations, especially near a cluster of stations, e.g., the NYC area. We
could choose to not predict at areas with too much uncertainty, based on
user requirements.
The minimum and mean prediction standard deviations:
summary(dem.ne.m.df$sd.ok)
ix <- which.min(dem.ne.m.df$sd.ok)
k.ok[ix,"var1.pred"]
round(100*dem.ne.m.df[ix,"sd.ok"]/k.ok$var1.pred[ix],1)
## [1] 3.2
The minimum is quite small, only about 3% of the prediction at that point,
and the mean standard deviation is only about 2.5 times the minimum.
A better way to evaluate the predictive power OK is by Leave-one-out cross-
validation (LOOCV). Here each point is removed from the dataset in turn,
and predicted by the others, using the fitted variogram model. If the observation
points well represent the total population, as they do here by design
of the weather station network, this gives a good estimate of the prediction
error.
Task 117 : Compute and summarize the LOOCV for this OK prediction. •
The krige.cv function of the gstat package computes this:
kcv.ok <- krige.cv(ANN_GDD50 ~ 1, locations=ne.m, model=vmf.o)
summary(kcv.ok$residual)
Overall the results are fairly good, but there are some extremely bad pre-
dictions. An overall measure is the root of the mean squared error, RMSE:
(loocv.ok.rmse <- sqrt(sum(kcv.ok$residual^2)/length(kcv.ok$residual)))
## [1] 268.1796
[Figure: LOOCV OK residuals, bubble map (+/− GDD50; −, overprediction; +, underprediction)]
There are several regions with intermixed fairly large under- and over-
predictions; this means that in these regions there are local factors, most
notably elevation.
Task 119 : Find the worst predictions, and try to explain them in geographic terms.
•
ne.m[which.min(kcv.ok$residual),2:6]
## Bounding box: xmin: 252835.5 ymin: 230248.4 xmax: 252835.5 ymax: 230248.4
## Projected CRS: +proj=aea +lat_0=42.5 +lat_1=39
## +lat_2=44 +lon_0=-76 +ellps=WGS84 +units=m
## STATE STATION_NA LATITUDE_D LONGITUDE_ ELEVATION_
## 4716 VT MOUNT MANSFIELD 44.53 -72.82 3950
## geometry
## 4716 POINT (252835.5 230248.4)
ne.m[which.max(kcv.ok$residual),2:6]
k.idw <- idw(ANN_GDD50 ~ 1, locations=ne.m, newdata=dem.ne.m.sf,
idp=2, nmax=24)
summary(k.idw$var1.pred)
summary(k.ok$var1.pred)
[Figure: Annual GDD base 50F, IDW prediction]
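The IDW cross-validation object used below can be sketched with krige.cv, assuming the same
inverse-distance power and neighbourhood as the prediction:
# LOOCV for IDW: no variogram model, inverse-distance power 2, 24 neighbours
kcv.idw <- krige.cv(ANN_GDD50 ~ 1, locations = ne.m,
                    nmax = 24, set = list(idp = 2))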
summary(kcv.idw$residual)
summary(kcv.ok$residual)
## [1] 300.457
(loocv.ok.rmse)
## [1] 268.1796
[Figure: LOOCV IDW residuals, bubble map (+/− GDD50; −, overprediction; +, underprediction)]
These show much stronger residual spatial structure than do the OK resid-
uals.
Challenge: Find the optimal power for IDW by testing several inverse-
distance powers, and comparing their cross-validation statistics.
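A minimal sketch of such a comparison, assuming LOOCV RMSE as the comparison statistic:
# LOOCV RMSE for several inverse-distance powers
powers <- c(0.5, 1, 1.5, 2, 2.5, 3, 4)
rmse.idw <- sapply(powers, function(p) {
  kcv <- krige.cv(ANN_GDD50 ~ 1, locations = ne.m,
                  nmax = 24, set = list(idp = p))
  sqrt(mean(kcv$residual^2))
})
data.frame(power = powers, rmse = round(rmse.idw, 1))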
The simplest way to predict a variable at one position is to use the value
of that variable at the nearest neighbour in geographic space. For climate
variables with a fairly dense network, as in this example, this is a common
procedure: look at the climate record for the nearest station and consider
that the local climate can not differ “too much”.
A spatial expression of this is to divide the prediction area into Thiessen
polygons (also known as a Voronoi or Dirichlet tessellation), where each location is in a
polygon whose centroid is its nearest neighbour. The advantage of this approach is that it
requires no statistical model; in particular, there is no assumption of second-order
stationarity as required by kriging.
Task 121 : Compute and display the Thiessen polygons over the study area.
•
The computation is with the voronoi "Voronoi tessellation" function of the
terra package. The optional bnd argument specifies the bounding box for the
tessellation. We use the bounding box of the four States, converted to a
SpatVector.
v <- terra::voronoi(vect(ne.m), bnd=vect(state.ne.m))
class(v)
## [1] "SpatVector"
## attr(,"package")
## [1] "terra"
plot(v)
[Figures: Thiessen (Voronoi) polygons of the climate stations, clipped to the four-State bounding area]
This map now can be used for prediction. Simply, the entire area of each
polygon is predicted with the value from the nearest station, i.e., its centroid.
Task 123 : Predict GDD50 over the study area by assigning the GDD50
from the centroid weather station to the entire polygon. •
The st_join "spatial join" method of the sf package queries its second
argument (the layer from which attributes are queried) and assigns attribute
values to the geometries in the first argument. In this case each polygon
covers an area; within that area is only one climate station in the sf object
ne.m, so the value of the attributes at that point will be assigned to the
polygon.
Note: For overlays where several points are in the same polygon, a user-
specified function must be applied to return a single value.
class(v); dim(v)
## [1] "SpatVector"
## attr(,"package")
## [1] "terra"
## [1] 305 39
## 12 293 0 0
[Figure: map with State labels VT, PA, NY, NJ]
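The join itself can be sketched as follows, assuming the SpatVector is first converted back
to sf (the joined object v3 and the suffixed attribute name ANN_GDD50.y are used in the
plotting code below):
# convert the Voronoi polygons to sf, then join the station attributes;
# duplicated column names get .x (polygon) and .y (station) suffixes
v.sf <- st_as_sf(v)
v3 <- st_join(v.sf, ne.m)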
Task 124 : Plot the Thiessen polygons with their predicted values. •
ggplot(v3) +
geom_sf(aes(bg = ANN_GDD50.y)) +
labs(bg = "Annual GDD50", title = "Thiessen polygon prediction")
[Figure: Thiessen polygon prediction of annual GDD50]
To determine the predicted value at any location, just predict at that point.
To make the target point as a sfc_POINT we first create the point geometry
with st_point and then make it a spatial object with Simple Features ge-
ometry with the st_sfc method. And of course we need to specify its CRS.
We first specify the point location using geographic coördinates, and then
transform to the CRS used in the grid. For illustration, we use the
geocode_OSM function of the tmaptools package to retrieve the coördinates
of a known address from the Open Street Map Nominatim database. Here
we choose Cornell University’s Musgrave Research Farm.
# an arbitrary point of interest
require(tmaptools)
(query.pt <- geocode_OSM("1256 Poplar Ridge Rd, Aurora, NY 13026"))
## $query
## [1] "1256 Poplar Ridge Rd, Aurora, NY 13026"
##
## $coords
## x y
## -76.65689 42.73498
##
## $bbox
## xmin ymin xmax ymax
## -76.65694 42.73493 -76.65684 42.73503
require(sf)
(pt <- st_sfc(st_point(query.pt$coords)))
## Geometry set for 1 feature
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -76.65689 ymin: 42.73498 xmax: -76.65689 ymax: 42.73498
## CRS: NA
class(pt)
## X Y
## [1,] -53753.43 26326.6
pt <- st_as_sf(pt)
class(pt)
class(v3)
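The prediction itself is then a spatial query of the polygon containing the (already
transformed) point; a minimal sketch:
# look up the Thiessen polygon containing the point and return its GDD50
st_join(pt, v3)$ANN_GDD50.y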
The Thiessen polygon interpolation predicts 2590 annual GDD50 for this
location.
dm <- st_distance(ne.m)
str(dm)
## Units: [m] num [1:305, 1:305] 0 14343 64767 34083 159361 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:305] "1" "2" "3" "4" ...
## ..$ : chr [1:305] "1" "2" "3" "4" ...
Task 126 : Find the nearest neighbour of each station, i.e., the one at closest
distance. •
We use the apply function to apply the which.min “which minimum?” func-
tion across the rows of the distance matrix. However, all the diagonals are
zero (distance between a station and itself), so we first have to put a large
distance on the diagonal so the stations themselves won't come out as minima.
diag(dm) <- max(dm)*1.1
nn <- apply(dm, 1, which.min)
str(nn)
This gives the index of each station’s nearest neighbour. For example, station
1’s nearest neighbour is station 2, and vice-versa.
Task 127 : Predict each station from its nearest neighbour, and plot the
regional predictions. •
nn.gdd <- ne.m[nn,"ANN_GDD50"]
str(nn.gdd)
Task 128 : Compare to the actual value and compute evaluation statistics.
•
obs <- ne.m[,"ANN_GDD50"]
summary(diff <- st_drop_geometry(obs - nn.gdd))
## ANN_GDD50 geometry
## Min. :-914.000 POINT :305
## 1st Qu.:-193.000 epsg:NA: 0
## Median : -9.000
## Mean : -4.393
## 3rd Qu.: 175.000
## Max. : 983.000
hist(diff$ANN_GDD50,
xlab = "Annual GDD50",
main="Cross-validation errors, Thiessen polygons")
rug(diff$ANN_GDD50)
(me <- mean(diff$ANN_GDD50))
## [1] -4.393443
(rmse <- sqrt(sum(diff$ANN_GDD50^2)/length(diff$ANN_GDD50)))
## [1] 332.5643
[Figure: histogram of cross-validation errors, Thiessen polygons (Annual GDD50)]
The mean error is close to zero, but the RMSE is quite large, 333 degree-
days.
We can compare these to the evaluation statistics at the same points from
the Random Forest out-of-bag computed above (§10.2.1).
rf.oob.me; rf.oob.rmse
## [1] -2.488285
## [1] 11.59619
Clearly, the predictions from Thiessen polygons, with this density of climate
stations, are quite poor, compared to the data-driven model using coörd-
inates and elevation.
[Figure: X-validation error, Thiessen polygon prediction (+/− GDD50)]
The largest errors are where the topography changes rapidly between sta-
tions, for example at and next to Mt. Mansfield (VT) – these two polygons
have the largest over- and under-predictions, respectively. The smallest er-
rors are where elevations do not change much between neighbour stations,
for example on the Lake Ontario plain.
Task 131 : List the shapefiles in this directory. •
The list.files function lists the files to the console output; you can also
look at the directory in a file manager.
• Its first argument is the directory in which to look.
• The optional pattern argument gives a regular expression to match
the file names. Here we just want the base shape files; there are other
“helper” files with the same name but different file extensions, so we
specify a pattern of the shp extension, at the end of the string, as
symbolized by the special $ regular expression character.
If the files are in the directory where you are connected25 , you can just
specify ".", which is a Unix and R abbreviation for “the current directory”.
If they are in a subdirectory, you need to name that, using the forward slash
"/" character to show you are descending the directory tree.
List the unpacked files:
list.files(".", pattern=".shp$")
Each of these shapefiles has associated metadata, with extension .xml which
can be read in several viewers.
list.files(".", pattern=".xml$")
## [1] "dem_ne_4km_TRI3_IDW2.sdat.aux.xml"
## [2] "extmin_7100j.shp.xml"
## [3] "frz28_7100j.shp.xml"
## [4] "frz32_7100j.shp.xml"
## [5] "gdd40_7100j.shp.xml"
## [6] "gdd50_7100j.shp.xml"
## [7] "maat7100.shp.xml"
## [8] "map7100.shp.xml"
## [9] "mrvbf_ne_4km.sdat.aux.xml"
gdd40_7100j : growing degree days, base 40°F (applies to C3 crops such as spring
wheat and barley)
maat7100 : mean annual air temperature, °F
frz32_7100j : length of frost-free period, consecutive days above 32 °F (for frost-
sensitive crops)
frz28_7100j : length of frost-free period, consecutive days above 28 °F (for frost-
tolerant crops)
extmin_7100j : extreme minimum temperature °F
These all have annual and monthly records, averages over 1971-2000.
In addition, one variable is related to precipitation:
25 use getwd to find this
map7100 : mean annual precipitation, inches26 .
Task 132 : Import the temperature records for the entire USA into a
temporary data frame. •
This is the maat7100 variable.
varname <- "maat7100"
tmp <- st_read(dsn=".", layer=varname,
int64_as_string = FALSE, quiet = TRUE)
head(tmp)
26 1 inch = 2.54 cm
Task 133 : Restrict this dataframe to the four selected states. •
ix <- (tmp$STATE %in% c("NY","NJ","PA","VT"))
tmp <- tmp[ix,]
names(tmp)
Task 134 : Transform this frame to the same metric CRS ne.crs loaded
in §3. •
st_crs(tmp)$proj4string
## [1] "+proj=aea +lat_0=42.5 +lon_0=-76 +lat_1=39 +lat_2=44 +x_0=0 +y_0=0 +ellps=WGS84 +units=m +no_de
Task 135 : Transfer the temperature records to the same object with the
GDD50, and then remove the temporary objects. •
We use the cbind “bind columns” function for this.
names(tmp.m)
Add these coördinates as regular fields, for linear models which use them as
predictors.
ne.coords <- st_coordinates(ne.m)
ne.m <- cbind(ne.m, E = ne.coords[,1], N = ne.coords[ , 2])
rm(ne.coords)
## 795 2100 2463 2518 2930 4021
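The standardized versions of both variables, used below as t.ann.std and gdd50.std, can be
created with scale; a minimal sketch, assuming centring to zero mean and scaling to unit
variance:
t.ann.std <- scale(ne.m$ANN_TNORM_)[, 1]
gdd50.std <- scale(ne.m$ANN_GDD50)[, 1]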
summary(ne.m$ANN_TNORM_)
summary(t.ann.std)
Task 137 : Compare these to see how closely they are correlated. •
cor(ne.m$ANN_GDD50, ne.m$ANN_TNORM_)
## [1] 0.9811989
cor(gdd50.std, t.ann.std)
## [1] 0.9811989
[Figure: standardized GDD50 vs. standardized mean annual temperature, with 1:1 line]
Note that the correlation is the same for both the standardized and unstandardized
variables. It's interesting that the GDD50 are a bit higher
than the mean annual T at both extremes of the 1:1 plot. So they are
closely correlated, but not identical.
We choose to use RK-GLS (§7) and Random Forests (§10.2) as the prediction
methods; the other methods could all be used.
Task 138 : Build OLS models of both standardized variables from the two
coördinates and square root of elevation. Compare their adjusted 𝑅 2 and
coefficients. •
summary(m.ols.t <- lm(t.ann.std~ sqrt(ELEVATION_)+N+E, data=ne.m))
##
## Call:
## lm(formula = t.ann.std ~ sqrt(ELEVATION_) + N + E, data = ne.m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6861 -0.1944 -0.0391 0.1842 0.8651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.096e+00 5.505e-02 19.913 < 2e-16 ***
## sqrt(ELEVATION_) -5.321e-02 1.829e-03 -29.088 < 2e-16 ***
## N -3.614e-06 1.199e-07 -30.148 < 2e-16 ***
## E -9.150e-07 1.134e-07 -8.071 1.68e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2821 on 301 degrees of freedom
## Multiple R-squared: 0.9212,Adjusted R-squared: 0.9204
## F-statistic: 1173 on 3 and 301 DF, p-value: < 2.2e-16
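The analogous model for standardized GDD50 (its formula matches the Call shown in the output
below; the object name m.ols.gdd is used later):
summary(m.ols.gdd <- lm(gdd50.std ~ sqrt(ELEVATION_) + N + E, data = ne.m))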
##
## Call:
## lm(formula = gdd50.std ~ sqrt(ELEVATION_) + N + E, data = ne.m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9447 -0.2365 -0.0221 0.2303 1.0474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.374e+00 6.709e-02 20.483 < 2e-16 ***
## sqrt(ELEVATION_) -6.184e-02 2.230e-03 -27.735 < 2e-16 ***
## N -2.895e-06 1.461e-07 -19.814 < 2e-16 ***
## E -9.386e-07 1.382e-07 -6.793 5.88e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3439 on 301 degrees of freedom
## Multiple R-squared: 0.8829,Adjusted R-squared: 0.8818
## F-statistic: 756.6 on 3 and 301 DF, p-value: < 2.2e-16
coef(m.ols.t)
## (Intercept) sqrt(ELEVATION_) N E
## 1.096150e+00 -5.320974e-02 -3.613649e-06 -9.149859e-07
coef(m.ols.gdd)
## (Intercept) sqrt(ELEVATION_) N E
## 1.374233e+00 -6.183727e-02 -2.894689e-06 -9.386282e-07
coef(m.ols.t)/coef(m.ols.gdd)
## (Intercept) sqrt(ELEVATION_) N E
## 0.7976448 0.8604801 1.2483723 0.9748119
Q45 : Do the two models explain the same amount of spatial variability by
the same predictors? If the two variables have the same spatial structure,
what should be the ratio of the coefficients? Is that the case here? Jump
to A45 •
Task 139 : Display bubble plots of the residuals, and 1:1 plots of actual vs.
fitted, for both variables. •
summary(ne.m$ols.resid.t <- residuals(m.ols.t))
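and likewise for the GDD50 residuals (assumed; the field name ols.resid.gdd matches its later
use):
summary(ne.m$ols.resid.gdd <- residuals(m.ols.gdd))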
[Figures: bubble maps of OLS residuals for the two standardized variables (−, overprediction; +, underprediction)]
Task 140 : Compare the residuals with a bubble plot of their differences. •
summary(ne.m$ols.resid.diff <- ne.m$ols.resid.t - ne.m$ols.resid.gdd)
[Figure: Difference of residuals, standardized T − standardized GDD (−, overprediction; +, underprediction)]
Now we begin to see the pattern, where the regional trend of annual mean
temperature is higher or lower than that for GDD50.
Q46 : What is the pattern of differences between the residuals from the
models for the two variables? Identify the most interesting areas. What
could be an explanation? Jump to A46 •
Task 141 : Model the residual spatial dependence for both variables, i.e.,
fit variogram models to the OLS residuals, for both variables. •
require(gstat)
v.r.ols.t <- variogram(ols.resid.t ~ 1,
locations=ne.m, cutoff=120000, width=12000)
(vmf.r.ols.t <- fit.variogram(v.r.ols.t,
vgm(psill=0.1, model="Exp",
range=20000, nugget=0.02)))
#
v.r.ols.gdd <- variogram(ols.resid.gdd ~ 1,
locations=ne.m, cutoff=120000, width=12000)
(vmf.r.ols.gdd <- fit.variogram(v.r.ols.gdd,
vgm(psill=0.12, model="Exp",
range=20000, nugget=0.02)))
#
# a common y-axis scale
ymax <- max(sum(vmf.r.ols.t[,"psill"]), sum(vmf.r.ols.gdd[,"psill"]))*1.1
p1 <- plot(v.r.ols.t, pl=T, model=vmf.r.ols.t, ylim = c(0, ymax))
p2 <- plot(v.r.ols.gdd, pl=T, model=vmf.r.ols.gdd, ylim = c(0, ymax))
print(p1, split=c(1,1,1,2), more=T)
print(p2, split=c(1,2,1,2), more=F)
[Figures: empirical variograms and fitted models of the OLS residuals, standardized T (top) and standardized GDD50 (bottom)]
Q47 : How strong is the local spatial dependence of the residuals from
the two trend surfaces? Do the two trend surfaces have the same residual
local spatial structure? If not, what is the difference? What does that imply
about the trend surface model and spatial structure of the variables? Jump
to A47 •
Task 142 : Use this estimated spatial dependence among the residuals to
re-fit the models with GLS. •
We estimate starting values for the proportional nugget from the variogram
fit.
# require(nlme)
require(nlme)
(p.nugget <- vmf.r.ols.t[1,"psill"]/sum(vmf.r.ols.t[,"psill"]) + 0.001)
## [1] 0.7053578
## [1] 0.5427663
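A sketch of the GLS fit for standardized MAT, assuming the same formula as the OLS model and
the exponential correlation structure estimated from its residual variogram (the call for
GDD50 follows the same pattern):
m.gls.t <- gls(t.ann.std ~ sqrt(ELEVATION_) + N + E,
               data=ne.m,
               correlation=corExp(
                 value=c(vmf.r.ols.t[2,"range"], p.nugget),
                 form=~E + N,
                 nugget=TRUE))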
m.gls.gdd <- gls(gdd50.std ~ sqrt(ELEVATION_) + N + E,  # first line of the call assumed; formula as in the OLS model
                 data=ne.m,
                 correlation=corExp(
                   value=c(vmf.r.ols.gdd[2,"range"], p.nugget),
                   form=~E + N,
                   nugget=TRUE))
summary(m.gls.t)
summary(m.gls.gdd)
##
## Residual standard error: 0.3611266
## Degrees of freedom: 305 total; 301 residual
## (Intercept) sqrt(ELEVATION_) N E
## 1.035287e+00 -5.104088e-02 -3.489755e-06 -1.018102e-06
coefficients(m.gls.gdd)
## (Intercept) sqrt(ELEVATION_) N E
## 1.272419e+00 -5.797018e-02 -2.688525e-06 -1.117387e-06
coefficients(m.gls.t)/coefficients(m.gls.gdd)
## (Intercept) sqrt(ELEVATION_) N E
## 0.8136364 0.8804679 1.2980184 0.9111453
This shows the ratio of the coefficients for the two variables. If they had
the same regional spatial structure (trend surface), these values would
all be 1.
intervals(m.gls.gdd)$corStruct
intervals(m.gls.t)$corStruct/intervals(m.gls.gdd)$corStruct
Again, all these ratios would be 1 if the structures were identical. The mean
annual temperature has proportionally longer range and higher nugget than
GDD50. The range parameters are 77–95 km, for an effective range of about
240–300 km; this is quite a bit longer than what we estimated by eye from
the residual variograms from the OLS fit.
Task 145 : Add the GLS model residuals to the spatial data frame. •
ne.m$gls.resid.t <- residuals(m.gls.t)
ne.m$gls.resid.gdd <- residuals(m.gls.gdd)
Task 146 : Display the fitted model of spatial correlation with the empirical
variogram of the GLS residuals. •
We first convert the correlation structure found by gls to a variogram model;
the partial sill is estimated as the variance of the residuals, adjusted for the
proportional nugget. We estimate the nugget as a proportion of this total
sill.
(p.nugget <- intervals(m.gls.t)$corStruct["nugget","est."])
## [1] 0.6735544
## [1] 0.08065724
## [1] 0.6175653
## [1] 0.1226937
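A sketch of how such a variogram model can be assembled for the MAT residuals, assuming the
variance of the GLS residuals as the total sill and the corStruct estimates for the range and
proportional nugget (vmf.r.gls.gdd is built analogously):
sill.t <- var(ne.m$gls.resid.t)
(vmf.r.gls.t <- vgm(psill = sill.t * (1 - p.nugget), model = "Exp",
                    range = intervals(m.gls.t)$corStruct["range", "est."],
                    nugget = sill.t * p.nugget))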
[Figures: empirical variograms of the GLS residuals with the fitted correlation models, standardized T (top) and standardized GDD50 (bottom)]
Task 147 : Predict over the regional grid with the GLS model and add the
result to the dataframe. •
dem.ne.m.df$pred.gls.t <- predict(m.gls.t, newdata=dem.ne.m.df)
dem.ne.m.df$pred.gls.gdd <- predict(m.gls.gdd, newdata=dem.ne.m.df)
dem.ne.m.df$diff.gls.t.gdd <-
dem.ne.m.df$pred.gls.t - dem.ne.m.df$pred.gls.gdd
Task 148 : Plot the two regional predictions on the same visual scale. •
We use a different palette from the non-standardized prediction maps, to
emphasize that these are standardized values.
(std.pred.lim <- c(min(dem.ne.m.df[,c("pred.gls.t","pred.gls.gdd")]),
max(dem.ne.m.df[,c("pred.gls.t","pred.gls.gdd")])))
display.prediction.map("pred.gls.t",
"Mean Annual Temperature, standardized, GLS prediction",
"GDD50", std.pred.lim, .palette="RdGy")
display.prediction.map("pred.gls.gdd",
"Annual GDD, base 50F, standardized, GLS prediction",
"GDD50", std.pred.lim, .palette="RdGy")
[Figures: GLS predictions over the grid: Mean Annual Temperature, standardized (top) and Annual GDD, base 50F, standardized (bottom)]
Task 149 : Compute the differences between the two GLS predictions, add
them to the data frame, summarize them, and display them as a map. •
summary(dem.ne.m.df$diff.gls.t.gdd <-
dem.ne.m.df$pred.gls.t - dem.ne.m.df$pred.gls.gdd)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.376528 -0.120566 0.016410 -0.007091 0.102386 0.492284
display.difference.map("diff.gls.t.gdd",
"Difference, MAT - GDD50, standardized",
"+/- s.d.",
.palette="BrBG")
[Figure: Difference, MAT − GDD50, standardized (+/− s.d.)]
Q48 : Describe the spatial distribution of the differences between the MAT
and GDD50 predictions. Jump to A48 •
Although the GLS model residuals do not have strong spatial structure, still
we can krige them to improve the maps.
Task 150 : Predict the deviations from the GLS trend surface at each
location on the grid, using Ordinary Kriging (OK) of the GLS residuals for
both standardized variables; display their summary, and display as maps. •
Recall we build variogram models of the residuals from the correlation struc-
tures found by gls.
ok.gls.resid.t <- krige(gls.resid.t ~ 1, loc=ne.m, newdata=dem.ne.m.sf,
model=vmf.r.gls.t)
# the analogous call for the GDD50 residuals (assumed; vmf.r.gls.gdd is the
# corresponding variogram model)
ok.gls.resid.gdd <- krige(gls.resid.gdd ~ 1, loc=ne.m, newdata=dem.ne.m.sf,
                          model=vmf.r.gls.gdd)
dem.ne.m.df$ok.gls.resid.t <- ok.gls.resid.t$var1.pred
dem.ne.m.df$ok.gls.resid.gdd <- ok.gls.resid.gdd$var1.pred
ggplot() +
geom_point(aes(x=E, y=N, colour=ok.gls.resid.t), data=dem.ne.m.df) +
xlab("E") + ylab("N") + coord_fixed() +
ggtitle("Residuals from GLS trend surface, MAT (std)") +
scale_colour_distiller(name="T (std)", space="Lab", palette="RdBu")
ggplot() +
geom_point(aes(x=E, y=N, colour=ok.gls.resid.gdd), data=dem.ne.m.df) +
xlab("E") + ylab("N") + coord_fixed() +
ggtitle("Residuals from GLS trend surface, GDD base 50F (std)") +
scale_colour_distiller(name="GDD50 (std)", space="Lab", palette="RdBu")
[Figures: kriged residuals from the GLS trend surfaces: MAT (std) and GDD base 50F (std)]
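The difference between the two sets of kriged residuals is added to the data frame before
mapping; a sketch of that step (the field name matches the call below):
dem.ne.m.df$ok.gls.resid.diff.t.gdd <-
  dem.ne.m.df$ok.gls.resid.t - dem.ne.m.df$ok.gls.resid.gdd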
display.difference.map("ok.gls.resid.diff.t.gdd",
"Difference: kriged residuals from GLS trend surface",
"+/- s.d.",
.palette="BrBG")
[Figure: Difference: kriged residuals from GLS trend surface (+/− s.d.)]
Task 152 : Add the kriged GLS residuals to the trend surfaces for a final
GLS-RK prediction. Display the two maps. •
summary(dem.ne.m.df$pred.rkgls.std.t <-
dem.ne.m.df$pred.gls.t + dem.ne.m.df$ok.gls.resid.t)
summary(dem.ne.m.df$pred.rkgls.std.gdd <-
dem.ne.m.df$pred.gls.gdd + dem.ne.m.df$ok.gls.resid.gdd)
display.prediction.map("pred.rkgls.std.t",
"GLS-RK prediction, Mean Annual Temperature, standardized",
"MAAT (std)", std.pred.lim, .palette="YlOrBr")
display.prediction.map("pred.rkgls.std.gdd",
"GLS-RK prediction, Annual GDD, base 50F, standardized",
"GDD50 (std)", std.pred.lim, .palette="YlOrBr")
[Figures: GLS-RK predictions: Mean Annual Temperature, standardized (top) and Annual GDD, base 50F, standardized (bottom)]
Task 153 : Compute and display the difference between the two GLS-RK
predictions. •
#
summary(dem.ne.m.df$diff.rkgls.std.t.gdd <-
(dem.ne.m.df$pred.rkgls.std.t - dem.ne.m.df$pred.rkgls.std.gdd))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.462388 -0.120160 0.037457 0.002504 0.124340 0.477189
display.difference.map("diff.rkgls.std.t.gdd",
"GLS-RK predictions, difference, MAT - GDD50, standardized",
"+/- s.d.",
.palette="BrBG")
[Figure: GLS-RK predictions, difference, MAT − GDD50, standardized (+/− s.d.)]
Another way to compare the variables is with their RF models. This does
not show linear model coefficients or local spatial correlation structure, but
does show variable importance, and also produces two maps.
Task 155 : Compare the variable importance of the two models. •
We use the importance function of the randomForest package:
randomForest::importance(m.rf.std.t)
## %IncMSE IncNodePurity
## ELEVATION_ 44.40325 58.15060
## N 45.70317 80.86923
## E 44.97693 36.19765
## dist.lakes 38.95843 65.83409
## dist.coast 29.96851 56.13357
randomForest::importance(m.rf.std.gdd)
## %IncMSE IncNodePurity
## ELEVATION_ 71.93775 120.02959
## N 82.20071 130.47469
## E 55.55530 45.31475
The Northing is more influential in the MAT model than in the GDD50
model, whereas the elevation is slightly more influential in the GDD50 model.
This agrees with the results of the GLS model (§13.4), where the absolute
Northing coefficient was larger for the MAT model, and the absolute eleva-
tion coefficient larger for the GDD50 model.
Task 156 : Plot the two fits, and the two OOB fits, side-by-side. •
plot(t.ann.std ~ predict(m.rf.std.t, newdata=ne.m),
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by random forest", ylab="Actual",
main="Mean Annual T (std)")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid(); abline(0,1)
plot(gdd50.std ~ predict(m.rf.std.gdd, newdata=ne.m),
col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by random forest", ylab="Actual",
main="Annual GDD50 (std)")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid(); abline(0,1)
plot(t.ann.std ~ predict(m.rf.std.t), col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by random forest (OOB)", ylab="Actual",
main="Mean Annual T (std)")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid(); abline(0,1)
plot(gdd50.std ~ predict(m.rf.std.gdd), col=ne.m$STATE, pch=20, asp=1,
xlab="Fitted by random forest (OOB)", ylab="Actual",
main="Annual GDD50 (std)")
legend("topleft", levels(ne.m$STATE), pch=20, col=1:4)
grid(); abline(0,1)
[Figures: actual vs. fitted scatter plots (top row: fits; bottom row: OOB fits) for Mean Annual T (std) and Annual GDD50 (std), coloured by State]
There is not much difference between these. The GDD50 for Mount Mans-
field (VT) is better fit than the MAT.
Task 157 : Predict over the regional grid with both RF models, and display
the maps. •
dem.ne.m.df$pred.rf.std.t <- predict(m.rf.std.t, newdata=dem.ne.m.df)
dem.ne.m.df$pred.rf.std.gdd <- predict(m.rf.std.gdd, newdata=dem.ne.m.df)
(std.pred.lim <- c(min(dem.ne.m.df[,c("pred.rf.std.t","pred.rf.std.gdd")]),
max(dem.ne.m.df[,c("pred.rf.std.t","pred.rf.std.gdd")])))
display.prediction.map("pred.rf.std.t",
"Mean Annual Temperature, standardized, RF prediction",
"degrees C", std.pred.lim, .palette="YlOrBr")
display.prediction.map("pred.rf.std.gdd",
"Annual GDD, base 50F, standardized, RF prediction",
"GDD50", std.pred.lim, .palette="YlOrBr")
[Figures: RF predictions: Mean Annual Temperature, standardized (top) and Annual GDD, base 50F, standardized (bottom)]
Task 158 : Compute the differences between the two maps and display the
difference map. •
summary(dem.ne.m.df$diff.rf.std.t.gdd <-
dem.ne.m.df$pred.rf.std.t - dem.ne.m.df$pred.rf.std.gdd)
display.difference.map("diff.rf.std.t.gdd",
"RF predictions, difference, MAT - GDD50, standardized",
"+/- s.d.",
.palette="BrBG")
[Figure: RF predictions, difference, MAT − GDD50, standardized (+/− s.d.)]
The summary shows that half the differences between the standardized variables
are quite small, within about ±0.115 standard deviations. This agrees with
the close correlation between the variables.
This map shows similar differences between the variables as the GLS model
difference map (§13.4): normalized MAT is higher than normalized GDD50
in the Appalachian, Taconic and Allegheny mountains, and on eastern Long
Island; the reverse is the case in the Adirondacks and Green Mountains, and
especially in the Lake Champlain valley.
14 Answers
A2 :
Northing : This relation looks linear at the more southerly portions of the map (NJ and
PA except the northern tier of counties) but quite spread out, with a less
obvious relation, in the northerly portion (NY and VT, northern PA).
Easting : Note the much wider spread of GDD in the East, which ranges from southern
NJ to northern VT, than in the continental climate of the West. Easting does
not appear to be predictive of the expected value of GDD – it appears that
a linear regression would have a (near) zero slope.
ELEVATION_ : There is a clear relation (higher elevations have fewer GDD) but it appears
inverse-parabolic rather than linear.
Return to Q2 •
A3 : Elevation has the strongest correlation; Northing is not much weaker. The cor-
relation coefficients are both negative: GDD50 decreases with increasing Northing
and elevation. There is essentially no relation of GDD50 with Easting. Return to
Q3 •
A4 : Square root of elevation explains a bit more than half (55.6%) of the total
variation in GDD50. Return to Q4 •
A5 : In general there is a good fit: most points are near the 1:1 actual:fit line.
However two NY points (red) are poorly fit, one underfit at ≈ (1300, 1950) (fit,
actual) and one overfit at ≈ (2700, 1700). Return to Q5 •
A6 :
The model diagnostics look pretty good: (1) no relation between fits and residuals;
(2) approximately equal variance at all fitted values; (3) residuals mostly normally-
distributed; (4) the high-leverage point is consistent with the others. However
two points are severely under-predicted (positive residuals) and one severely over-
predicted (negative residual). Return to Q6
•
A7 :
There is obvious correlation, with some reasonable geographic interpretation. The
model under-predicts GDD along the Lake Erie and Ontario plains (this is due to the
moderating effect of the lakes) and over-predicts along the Atlantic coast, southeast-
ern VT and the northern tier of PA. The Atlantic coast may stay cooler in spring
than indicated by its very low elevation and southerly position. The model also
under-predicts in SE PA and SW NJ (the Philadelphia area) where there may be
influence from warm southerly winds. Return to Q7 •
A9 : The confidence interval of the range parameter seems quite wide, almost
double from the lower limit 1.285 × 104 to the upper limit 2.3715 × 104 . Return to
Q9 •
A10 :
The range of spatial dependence has been adjusted; from the variogram fit we
estimated in §4.4, i.e., 1.2107×104 ; the REML estimate is somewhat longer, 1.7457×
104 m. These are 1/3 of the effective range, since we fit an exponential model. The
effective range found by gls is thus 5.2371 × 104 m. Return to Q10 •
A11 : The GLS fit does not remove spatial correlation, it just takes it into account
when computing the regression parameters. Return to Q11 •
A12 : This fits reasonably well; we conclude that we’ve detected the true correlation
structure of the residuals. Return to Q12 •
A13 :
The coefficients have changed a small amount; that for Northing was reduced by
over 2%. This shows that some clustered far N and far S elevation points had higher
leverage in the OLS model than the cluster warranted. Return to Q13 •
A14 : We see the effect of the reduced coefficient for Northing: the GLS predicts
somewhat lower in the N, so the residuals are lower than the OLS residuals (red
circles in the bubble plot); the reverse is true in the S. Return to Q14 •
A15 : There is quite some adjustment along the Great Lakes plain (NW) and
near some cities (additional GDD) and the Atlantic coast (lower GDD). There are
several “hotspots” of large adjustments near single climate stations. Return to
Q15 •
A16 : These areas are beyond the range of spatial correlation, here about 50 km,
and so are not affected by “nearby” observations. Return to Q16 •
A17 : Compared to the GLS surface this has more detail and is adjusted locally,
for example along the Lake Ontario plain. The discrepancies (points that can be
seen on the map) are much smaller. Return to Q17 •
A18 : Recall that the fitted variogram model shows local spatial dependence only
to an effective range of 52 km, so that there is no local adjustment further than this
from any point. Hardly any area of the bordering States or Province is within this
distance of an observation point.
The other issue is the extrapolation of the linear trend. All areas are within the
elevation range, so that is not a problem. But for areas further North or South, we
are assuming the linear trend continues unchanged. Return to Q18 •
A19 : (1) The global trend on the covariates is adequate to compute residuals and
their spatial structure; we do not expect it to change much if re-fit locally. (2) It is
not practical because the trend surface would have to be re-computed, the residuals
extracted, and the variogram re-fit at 1000’s of prediction points; this would have
to be done automatically without the possibility of checking for artefacts. In local
KED we can use the single fitted residual variogram model, while the trend is
adjusted locally and the predictions are dependent only on observations in the local
neighbourhood. The model is not changed, the results of applying it are.
Note that with very dense point networks (e.g., precision agriculture) this procedure
is applicable, and implemented by the VESPER computer program27 for Ordinary
Kriging, but not KED. Return to Q19 •
A20 : The largest over-predictions by local KED are extrapolations, outside the
area with points, e.g., west side of Lake Ontario, and the south-central Appalachi-
ans. Within the interpolation area, global KED is generally a bit higher, especially
in north-central PA. This may be due to a stronger Northing effect in the global
model, vs. the effect as found in local neighbourhoods. Return to Q20 •
A23 : The marginal relations are not well-fit by linear relations, although the
square root of elevation is nearly linear, as we saw in the OLS/GLS modelling.
However, in GAM modelling we do not need to select a single transformation over
the whole range of a predictor to linearize the relation. One transformation may
not be applicable to the whole range. Instead, we allow the smooth fit to determine
the local adjustment. Return to Q23 •
A24 : The model fits well; the adjusted 𝑅 2 is 0.908, and the residuals are less spread
than those from the OLS and GLS models. The effective degrees of freedom, i.e.,
accounting for the many local regressions in the splines, was 23.53 for the 2D surface,
and 8.52 for the 1D relation with elevation.
Return to Q24 •
A25 : Both of these plots support the conclusion that there is no spatial dependence
of the model residuals. (1) The bubbles appear to be randomly distributed, both
in colour and size. (2) The variogram shows approximately the same semivariance at all
separations, i.e., no spatial dependence of the residuals. Return to Q25 •
A26 : The GAM 2D trend clearly differs from the linear trend surface, especially
in the Lake Ontario plain (towards the right centre in the figure). It is also
higher than a linear trend in the S Hudson valley, but lower along the Atlantic shore
(upper left in the figure). These are areas we identified with large OLS and GLS
residuals in the linear trend surfaces (§4, §5). Return to Q26 •
27 https://sydney.edu.au/agriculture/pal/software/vesper.shtml
A27 : This is almost linear now, as suggested by physical theory, once the smooth
geographic trend is removed. However at the low elevations there is a wide spread
of GDD50, which is well-fit by an almost vertical portion of the marginal smooth
function. Return to Q27 •
A28 : The largest differences are to the E and W, out of the calibration area –
this illustrates that GAM should not be extrapolated. The much lower GDD50 in
New England is because the GAM trend surface was considerably below the GLS
surface along the Atlantic coast, and this was extrapolated eastward. The GAM
predicts higher values along the Great Lakes plains and lower Hudson valley, as we
saw in the GLS residuals. Return to Q28 •
A29 : The largest differences are quite local, around certain weather stations
where the local deviation from the trend could be accounted for in RK-GLS, but
was somewhat averaged out in GAM. Return to Q29 •
A30 : The first (root) splitting variable is N. The split is at −1.55967 × 105 N. The
mean value of GDD50 of the whole dataset is 2517.52; the mean value of the observa-
tions in the left branch (less than) is 2182.54 and of the right branch (greater than)
is 3019.99. These branches have 183 and 122 observations, respectively. Return to
Q30 •
A31 : Northing is most important; it explains almost 50% of the variance. Then
elevation explains about another 40%, and Easting very little. This agrees with the
linear correlations computed as preparation for fitting the OLS model. Return to
Q31 •
A32 : Each run of the rpart function will give the same tree (if the same parameters
are specified) but slightly different cross-validation statistics. The cross-validation
error reaches an effective minimum around 15 splits, CP about 0.0045. So building
the tree with CP=0.003 was overfitting. Return to Q32 •
A33 : The pruned tree has the same root and higher levels, but fewer splits and
leaves. Return to Q33 •
A34 : There are 16 unique values predicted by the pruned regression tree. The fit
to the actual values is better than for the OLS or GLS models; the RMSE from the regression
tree is 11.16 GDD50, and for the GLS model 12.08. In this case the linear model predicts
more values but on average they deviate more from the true values. Return to
Q34 •
A35 : The regression tree divides the area into "blocks", mostly with the Northing
but in one place with the Easting. It also slices most of the “blocks” according to
elevation zones. These give the maximum between-group variance at the leaves of
the regression tree, without overfitting. Return to Q35 •
A36 : Permuting either elevation or Northing leads to a large change in the
predictions; Easting much less. Return to Q36 •
A37 : The out-of-bag mean errors are from 2 to 3 times that of the fits. This is
a typical result for random forests. The OOB errors are indicative of the errors at
unknown points, i.e., prediction accuracy. Return to Q37 •
A38 : For most runs we see no or weak residual spatial structure, because of the
averaging effect of the repeated bootstrap sampling; in many trees close point-pairs
lose one of the points. Return to Q38 •
A39 : The RF surface shows some irregular patches and abrupt transitions, whereas
the RK-GLS surface is by construction smooth. Return to Q39 •
A40 : The RF prediction is from the “box” containing the elevation (all the same
in the lake), Northing and Easting, which was fit with the most similar points.
These are presumably along the Lake shore. There is no extrapolation via a trend
surface to modify the effect of the coördinates in the model. Return to Q40 •
A41 : The GLS and GLS-RK models have coefficients for the coördinates, which
here are far East, so the predictions change. The RF does not have any information
in this area and so puts it all in the “boxes”. Return to Q41 •
A42 : There are no points in OH so the prediction by RF is made from the fitted
model of the nearby PA stations. These apparently use the Easting. Return to
Q42 •
A43 : The RF model has no way to extrapolate to higher or lower elevations than
in the calibration set. In the Alleghenies and Catskills there are a lot of higher-
elevation area beyond the elevation of weather stations. In the Adirondacks the
stations are all at low elevations, but apparently the Northing is here used to predict
the GDD. This is why the RF over-predicts in the lowlands around Plattsburgh and
Lake Champlain. Return to Q43 •
A44 : The random forest model is clearly not suitable to show a regional trend,
especially outside of the model calibration area. It does allow for non-linearities
and local combinations of factors, for example on the Lake Ontario and Erie plains.
Return to Q44 •
A46 : There is a clear spatial pattern and local spatial dependence. The residuals
from the MAT model are lower than those from the GDD50 model in the N and
S edges, and at the higher elevations (Adirondacks, Green Mountains, Allegheny
Plateau). The reverse is the case in the centre and especially on Long Island (NY)
and northern NJ.
Return to Q46 •
A47 :
The variogram structures are both weak (short range, very high nugget proportion).
That is, the regional trend explains most of the variation in both cases.
There is weaker residual spatial structure for mean annual temperature than for
GDD50 (lower total sill, higher nugget proportion). This implies that the MAT
trend (the predictors N, E and elevation) explains more of the spatial variability.
Note also that the MAT trend surface has a higher R² than the GDD50 trend
surface. Together these imply that MAT is more explained regionally and less by local
variations than annual GDD50. Return to Q47 •
A48 : The MAT is proportionally higher than GDD50 in the higher elevations,
especially towards the SW. The opposite effect is seen in the lowlands, especially
the Lake Champlain valley (NE). Return to Q48 •
15 Challenge
Do a similar analysis either for:
• Over the same study area as the example:
– the growing degree days in one of the growing-season months:
May through September; or
– the annual growing degree days at base 40°F
– one of the other climate variables in one of the other shapefiles.
• Over one of the States within the study area.
• Over some other region of the USA:
– the growing degree days base 50°F, i.e., the same as used in this
example. Note that for this you will have to define a suitable
coördinate reference system (CRS) for that area.
If you choose a different study area, you will need to re-build the points
database and prediction grid, as explained in the companion tutorial “Tuto-
rial: Regional mapping of climate variables from point samples Data prepa-
ration”.
If you have to define a suitable CRS:
• For E-W oriented regions, you can use the same Albers Equal Area
projection as was used in this tutorial, but the parameters will be
different.
• For N-S oriented regions, you will need to select a different projection.
A good choice is Transverse Mercator, PROJ.4 name tmerc. The
parameters for this are28 :
+proj=tmerc +lat_0=Latitude of natural origin
+lon_0=Longitude of natural origin
+k=Scale factor at natural origin
+x_0=False Easting
+y_0=False Northing
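As a minimal sketch, such a CRS can be built in R as a PROJ string and then used
to re-project the station points. The parameter values below are entirely hypothetical
placeholders; replace them with values suited to your own region.
library(sf)
## hypothetical Transverse Mercator parameters: choose lat_0, lon_0, k,
## x_0 and y_0 to suit the region being mapped
tm.crs <- st_crs(paste("+proj=tmerc +lat_0=40 +lon_0=-76.5 +k=0.9999",
                       "+x_0=250000 +y_0=0 +datum=NAD83 +units=m"))
## re-project the station points (object name hypothetical) into this CRS:
## stations.tm <- st_transform(stations, tm.crs)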
7. Mapping the various predictions and their differences.
References
[1] D Bates. Fitting linear mixed models in R. R News, 5(1):27–30, 2005. 28
[2] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification
and regression trees. Wadsworth, 1984. 80
[3] I. C. Briggs. Machine contouring using minimum curvature. Geophysics,
39(1):39–48, January 1974. ISSN 0016-8033, 1942-2156. doi: 10.1190/
1.1440410. 131
[4] Bradley Efron and Gail Gong. A leisurely look at the bootstrap, the
jackknife & cross-validation. American Statistician, 37:36–48, 1983. 91
[5] John C. Gallant and Trevor I. Dowling. A multiresolution index of
valley bottom flatness for mapping depositional areas. Water Re-
sources Research, 39(12):ESG4–1 – ESG4–13, Dec 2003. doi: 10.1029/
2002WR001426. 115
[6] P Goovaerts. Geostatistics for natural resources evaluation. Applied
Geostatistics. Oxford University Press, New York; Oxford, 1997. 54,
137
[7] T Hastie, R Tibshirani, and J H Friedman. The elements of statistical
learning data mining, inference, and prediction. Springer series in statis-
tics. Springer, New York, 2nd ed edition, 2009. ISBN 9780387848587.
67, 80, 90, 91, 131
[8] Tomislav Hengl, Gerard B. M. Heuvelink, and David G. Rossiter. About
regression-kriging: From equations to case studies. Computers & Geo-
sciences, 33(10):1301–1315, 2007. doi: 10.1016/j.cageo.2007.05.001. 43
[9] M. F. Hutchinson. Interpolating mean rainfall using thin plate smooth-
ing splines. International Journal of Geographical Information Science,
9(4):385 – 403, 1995. 131
[10] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
An introduction to statistical learning: with applications in R. Number
103 in Springer texts in statistics. Springer, 2013. ISBN 9781461471370.
67, 80, 90
[11] Max Kuhn. Building predictive models in R using the caret package.
Journal of Statistical Software, 28(5):1–26, 2008. 101
[12] Max Kuhn and Kjell Johnson. Applied Predictive Modeling. Springer,
2013 edition edition, Sep 2013. ISBN 978-1-4614-6848-6. 107
[13] R. M. Lark and B. R. Cullis. Model based analysis using REML for
inference from systematically sampled data on soil. European Journal
of Soil Science, 55(4):799–813, 2004. 26, 27
[14] Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting
Model Predictions. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wal-
lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in
Neural Information Processing Systems 30 (NIPS 2017), volume 30, La
Jolla, 2017. Neural Information Processing Systems (NIPS). 127
[15] G. S. McMaster and W. W. Wilhelm. Growing degree-days: one equa-
tion, two interpretations. Agricultural and Forest Meteorology, 87(4):
291–300, 1997. doi: 10.1016/S0168-1923(97)00027-0. 2
[16] H. Mitasova and J. Hofierka. Interpolation by regularized spline
with tension: II. Application to terrain modeling and surface ge-
ometry analysis. Mathematical Geology, 25(6):657–669, 1993. doi:
10.1007/BF00893172. 131
[17] H. Mitasova and L. Mitas. Interpolation by regularized spline with
tension: I. Theory and implementation. Mathematical Geology, 25(6):
641–655, 1993. doi: 10.1007/BF00893171. 131
[18] Christoph Molnar. Interpretable Machine Learning: A Guide for Mak-
ing Black Box Models Explainable. Leanpub. URL https://leanpub.
com/interpretable-machine-learning. 124
[19] J C Pinheiro and D M Bates. Mixed-effects models in S and S-PLUS.
Springer, 2000. ISBN 0387989579. 27, 28
[20] J. R Quinlan. C4.5: programs for machine learning. The Morgan
Kaufmann series in machine learning. Morgan Kaufmann Publishers,
1993. ISBN 1-55860-238-0. 107
[21] Shawn J. Riley, Stephen D. DeGloria, and Robert Elliot. A terrain
ruggedness index that quantifies topographic heterogeneity. Intermountain
Journal of Sciences, 5(1–4):23–27, 1999. 116
[22] Cosma Shalizi. The bootstrap. American Scientist, 98(3):186–
190, 2010. doi: 10.1511/2010.84.186. URL http://www.
americanscientist.org/issues/pub/2010/3/the-bootstrap/3. 91
[23] W. R. Tobler. A computer movie simulating urban growth in the Detroit
region. Economic Geography, 46:234–240, 1970. ISSN 0013-0095. doi:
10.2307/143141. 136
[24] R. Webster and M. A. Oliver. Geostatistics for environmental scientists.
John Wiley & Sons Ltd., 2nd edition, 2008. 54, 137
[25] Hadley Wickham. ggplot2. URL http://ggplot2.org/. 6
[26] Hadley Wickham. ggplot2: Elegant graphics for data analysis. Use R!
Springer, August 2009. ISBN 0387981403. 6
[27] Leland Wilkinson. The grammar of graphics. Statistics and computing.
Springer, New York, 2nd ed edition, 2005. ISBN 9780387286952. 6
[28] S. N. Wood. Thin plate regression splines. Journal of the Royal Sta-
tistical Society Series B-Statistical Methodology, 65:95–114, 2003. doi:
10.1111/1467-9868.00374. 131
[29] Marvin N. Wright and Andreas Ziegler. ranger: a fast implementation
of random forests for high dimensional data in C++ and R. Journal of
Statistical Software, 77(1):1–17, Mar 2017. doi: 10.18637/jss.v077.i01.
91
[30] Yihui Xie. knitr: Elegant, flexible and fast dynamic report generation
with R, 2011. URL http://yihui.name/knitr/. Accessed 04-Mar-
2016. 2
Task 159 : Load the RColorBrewer package and display the ready-made
palettes for continuous sequences. •
library(RColorBrewer)
display.brewer.all(type="seq")
29 http://colorbrewer2.org/
[Figure: the RColorBrewer sequential palettes: YlOrRd, YlOrBr, YlGnBu, YlGn,
Reds, RdPu, Purples, PuRd, PuBuGn, PuBu, OrRd, Oranges, Greys, Greens, GnBu,
BuPu, BuGn, Blues]
Task 160 : Select one of the palettes and re-display the GDD50 map with
it. •
ggplot(data=ne.df) +
aes(x=E, y=N) +
geom_point(aes(size=ANN_GDD50, colour=ANN_GDD50),
shape=20) +
scale_colour_distiller(space="Lab",
palette="Greens") +
xlab("E") + ylab("N") + coord_fixed()
[Figure: postplot of annual GDD50 (ANN_GDD50), point size and colour proportional
to the value, using the Greens sequential palette]
[Figure: the RColorBrewer diverging palettes: Spectral, RdYlGn, RdYlBu, RdGy,
RdBu, PuOr, PRGn, PiYG, BrBG]
These should be used when the central value is a natural zero and we want
to emphasize divergences from it in two directions, e.g., for residuals.
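The diverging palettes can be listed with display.brewer.all(type="div"). As a
minimal sketch, a diverging palette can be applied to a map of residuals, centred on
zero by setting symmetric colour limits. The data frame resid.df, with coördinates
E, N and a residual field resid, is a hypothetical name standing in for whatever
residuals you have computed.
ggplot(data=resid.df) +
  aes(x=E, y=N) +
  geom_point(aes(colour=resid), shape=20) +
  scale_colour_distiller(palette="RdBu",
                         limits=c(-1, 1) * max(abs(resid.df$resid))) +
  xlab("E") + ylab("N") + coord_fixed()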
Index of Commands
+ formula operator, 81
+ operator, 6
~ formula operator, 81

add argument (plot function), 4
aes (ggplot2 package), 6, 189
apply, 153
as.data.frame, 24

block argument (krige function), 140
bnd argument (voronoi function), 148
brewer.pal (RColorBrewer package), 51

caret package, 101, 102, 108
cbind, 158
cloud argument (variogram function), 23
col argument (plot function), 4
colour argument (aes function), 6, 189
committees argument (cubist function), 112
control argument (rpart function), 85
coord_fixed (ggplot2 package), 6
cor, 12
corExp (nlme package), 28, 33
correlation argument (gls function), 28
corSpher (nlme package), 35
corStruct class, 28
cp argument (prune function), 85
cp argument (rpart function), 81, 100
crop (terra package), 148
Cubist package, 1, 107, 112
cubist (Cubist package), 112

data argument (ggplot function), 6
data.frame class, 141

eval, 19
expand.grid, 102
explain (fastshap package), 127

fastshap package, 127
fields package, 1, 131, 133, 134
fit.variogram (gstat package), 22, 55, 139
form argument (corExp function), 28
function, 39

gam (mgcv package), 69
geocode_OSM (tmaptools package), 151
geom_point (ggplot2 package), 6
geom_sf (ggplot2 package), 6
geom_smooth (ggplot2 package), 67
getModelInfo (train package), 102
getwd, 156
ggplot (ggplot2 package), 6, 39, 141
ggplot2 package, 1, 6, 51, 67, 189
gls, 166, 169
gls (nlme package), 28–30, 33, 35, 54, 179
grid.arrange (gridExtra package), 67
gridExtra package, 67
gstat package, 1, 21, 22, 45, 54, 59, 137, 144

idp argument (idw function), 145
idw (gstat package), 145
iml package, 125
importance (randomForest package), 175
importance (ranger package), 92
intervals (nlme package), 29

kml_close (plotKML package), 10
kml_layer (plotKML package), 10
kml_open (plotKML package), 10
knitr package, 2
krige (gstat package), 45, 54, 56, 61, 64, 66, 145
krige.cv (gstat package), 59, 144

labels argument (text function), 4
list.files, 156
lm, 13, 28, 65, 69
load, 3
loess, 67
log, 69
lwd argument (plot function), 4

matrix, 133
maxdist argument (krige function), 61
method argument (train function), 102, 108, 111
mgcv package, 69, 71, 72
min.node.size argument (ranger function), 102, 105
minsplit argument (rpart function), 81, 100
model argument (krige function), 45
mtry (ranger package), 92
mtry argument (randomForest function), 101
mtry argument (ranger function), 102, 105

neighbors argument (predict.cubist function), 112
newdata (predict package), 92
newdata argument (predict.rpart function), 88
nlme package, 1, 28, 29
nmax argument (krige function), 61, 140
nmin argument (krige function), 61
nodesize argument (randomForest function), 101
nsim argument (explain function), 127
ntree argument (randomForest function), 101
nugget argument (corExp function), 28, 33
nugget argument (corSpher function), 35
num.trees (ranger package), 91

order, 24

palette, 4
parse, 19
paste, 19
pattern argument (list.files function), 156
plot, 4
plot (terra package), 51
plot.gam (mgcv package), 71
plot.xy, 4
plot_min_depth_distribution (randomForestExplainer package), 123
plotKML package, 1, 10
predict, 87
predict (cubist package), 112
predict (randomForest package), 92
predict (ranger package), 127
predict.gam (mgcv package), 75
predict.Krig (fields package), 134
predict.rpart (rpart package), 87
Predictor class, 125
printcp (rpart package), 84
prune (rpart package), 85

randomForest (randomForest package), 101, 105
randomForest package, 1, 175
randomForestExplainer package, 123
ranger (ranger package), 91, 93, 105
ranger package, 1, 91, 102, 106, 127
raster package, 1
RColorBrewer package, 51, 189
rpart (rpart package), 80, 81, 85, 182
rpart class, 87
rpart package, 1, 80
rpart.control (rpart package), 85
rpart.plot (rpart.plot package), 82
rpart.plot package, 82

s (mgcv package), 69
save, 3
scale_colour_brewer (ggplot2 package), 189
scale_colour_distiller (ggplot2 package), 189
scheme argument (plot.gam function), 71
se.fit argument (predict.gam function), 75
select argument (plot.gam function), 71
sf class, 4, 6, 19, 54, 149
sf package, 1, 10, 45, 148, 149, 152
sfc_POINT class, 136, 151
shape argument (geom_point function), 6
shapviz (shapviz package), 128
shapviz class, 128
shapviz package, 128
size argument (aes function), 6
sp package, 134
span argument (loess function), 67
SpatRaster class, 50
SpatVect class, 148
SpatVector class, 148
spsample (sp package), 134
sqrt, 69
st_cast (sf package), 4
st_distance (sf package), 152
st_geometry (sf package), 4
st_join (sf package), 149
st_point (sf package), 136, 151
st_sfc (sf package), 136, 151
st_transform (sf package), 10
sv_dependence (shapviz package), 129
sv_importance (shapviz package), 128

terra package, 50, 51
text, 4
theta argument (plot.gam function), 71
tmaptools package, 151
Tps (fields package), 133
train (caret package), 102, 105, 108
trainControl (caret package), 102
trControl argument (train function), 102
tuneGrid argument (train function), 102

unique, 88

which.max, 16
which.min, 16, 153