

A Machine Learning Tutorial for Operational Meteorology. Part I: Traditional Machine Learning

RANDY J. CHASE,a,b,c DAVID R. HARRISON,b,d,e AMANDA BURKE,b,c GARY M. LACKMANN,f AND AMY MCGOVERNa,b,c

a School of Computer Science, University of Oklahoma, Norman, Oklahoma
b School of Meteorology, University of Oklahoma, Norman, Oklahoma
c NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography, University of Oklahoma, Norman, Oklahoma
d Cooperative Institute for Severe and High-Impact Weather Research and Operations, University of Oklahoma, Norman, Oklahoma
e NOAA/NWS/Storm Prediction Center, Norman, Oklahoma
f Department of Marine, Earth, and Atmospheric Sciences, North Carolina State University, Raleigh, North Carolina

(Manuscript received 14 April 2022, in final form 26 May 2022)

ABSTRACT: Recently, the use of machine learning in meteorology has increased greatly. While many machine learning methods are not new, university classes on machine learning are largely unavailable to meteorology students and are not required to become a meteorologist. The lack of formal instruction has contributed to the perception that machine learning methods are "black boxes," and thus end-users are hesitant to apply machine learning methods in their everyday workflow. To reduce the opaqueness of machine learning methods and lower hesitancy toward machine learning in meteorology, this paper provides a survey of some of the most common machine learning methods. A familiar meteorological example is used to contextualize the machine learning methods while also discussing machine learning topics using plain language. The following machine learning methods are demonstrated: linear regression, logistic regression, decision trees, random forest, gradient boosted decision trees, naïve Bayes, and support vector machines. Beyond discussing the different methods, the paper also contains discussions of the general machine learning process as well as best practices to enable readers to apply machine learning to their own datasets. Furthermore, all code (in the form of Jupyter notebooks and Google Colaboratory notebooks) used to make the examples in the paper is provided in an effort to catalyze the use of machine learning in meteorology.
KEYWORDS: Radars/Radar observations; Satellite observations; Forecasting techniques; Nowcasting;
Operational forecasting; Artificial intelligence; Classification; Data science; Decision trees; Machine learning;
Model interpretation and visualization; Regression; Support vector machines; Other artificial intelligence/machine learning

Denotes content that is immediately available upon publication as open access.

Corresponding author: Randy J. Chase, randychase@ou.edu

DOI: 10.1175/WAF-D-22-0070.1

© 2022 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy (www.ametsoc.org/PUBSReuseLicenses).

(Manuscript received 14 April 2022, in final form 26 May 2022)

1. Introduction

The mention and use of machine learning (ML) within meteorological journal articles is accelerating (Fig. 1; e.g., Burke et al. 2020; Hill et al. 2020; Lagerquist et al. 2020; Li et al. 2020; Loken et al. 2020; Mao and Sorteberg 2020; Muñoz-Esparza et al. 2020; Wang et al. 2020; Bonavita et al. 2021; Cui et al. 2021; Flora et al. 2021; Hill and Schumacher 2021; Schumacher et al. 2021; Yang et al. 2021; Zhang et al. 2021). With a growing number of published meteorological studies using ML methods, it is increasingly important for meteorologists to be well versed in ML. However, the availability of meteorology-specific resources about ML terms and methods is scarce. Thus, this series of papers (two in total) aims to reduce the scarcity of meteorology-specific ML resources.

While many ML methods are generally not new (i.e., published before 2002), there is a concern from ML developers that end users (i.e., non-ML specialists) may be hesitant or concerned about trusting ML. However, early work in this space suggests that nontechnical explanations may be an important part of how end users perceive the trustworthiness of ML guidance (e.g., Cains et al. 2022). Thus, an additional goal of these papers is to enhance the trustworthiness of ML methods through plain-language discussions and meteorological examples.

In practice, ML models are often viewed as a black box, which could also be contributing to user hesitancy. These mystified feelings toward ML methods can lead to an inherent distrust of ML methods, despite their potential. Furthermore, the seemingly opaque nature of ML methods prevents ML forecasts from meeting one of the three requirements of a good forecast outlined by Murphy (1993): consistency. In short, Murphy (1993) explains that in order for a forecast to be good, the forecast must 1) be consistent with the user's prior knowledge, 2) have good quality (i.e., accuracy), and 3) be valuable (i.e., provide benefit). Plenty of technical papers demonstrate how ML forecasts can meet requirements 2 and 3, but, as noted above, if the ML methods are confusing and enigmatic, then it is difficult for ML forecasts to be consistent with a meteorologist's prior knowledge. This series of papers will serve as a reference for meteorologists in order to make the black box of ML more transparent and enhance user trust in ML.
This paper is organized as follows. Section 2 provides an introduction to all ML methods discussed in this paper and defines common ML terms. Section 3 discusses the general ML methods in the context of a simple meteorological example, while also describing the end-to-end ML pipeline. Then, section 4 summarizes this paper and also discusses the topics of the next paper in the series.

FIG. 1. Search results for the Meteorology and Atmospheric Science category when searching abstracts for machine learning methods and severe weather. Machine learning keywords searched were the following: linear regression, logistic regression, decision trees, random forest, gradient-boosted trees, support vector machines, k-means, k-nearest, empirical orthogonal functions, principal component analysis, self-organizing maps, neural networks, convolutional neural networks, and unets. Severe weather keywords searched were the following: tornadoes, hail, hurricanes, and tropical cyclones. (a) Counts of publications per year for all papers in the Meteorology and Atmospheric Science category (black line; reduced by one order of magnitude), machine learning topics (blue line), and severe weather topics (red line). (b) As in (a), but with the two subtopics normalized by the total number of Meteorology and Atmospheric Science papers. (c) Number of neural network papers (including convolutional and unets) published in Meteorology and Atmospheric Science. All data are derived from Clarivate Web of Science.

2. Machine learning methods and common terms

This section will describe a handful of the most common ML methods. Before that, it is helpful to define some terminology used within ML. First, we define ML as any empirical¹ method where parameters are fit (i.e., learned) on a training dataset in order to optimize (e.g., minimize or maximize) a predefined loss (i.e., cost) function. Within this general framework, ML has two categories: supervised and unsupervised learning. Supervised learning refers to ML methods that are trained with prescribed input features and output labels; an example is predicting tomorrow's high temperature at a specific location where we have measurements (i.e., labels). Meanwhile, unsupervised methods do not have a predefined output label (e.g., self-organizing maps; Nowotarski and Jensen 2013). An example of an unsupervised ML task would be clustering all 500-mb geopotential height maps to look for unspecified patterns in the weather. This paper focuses on supervised learning.

¹ By "empirical" we mean any method that uses data as opposed to physics.

The input features for supervised learning, also referred to as input data, predictors, or variables, can be written mathematically as the vector (matrix) X. The desired output of the ML model is usually called the target, predictand, or label, and is mathematically written as the scalar (vector) y. Drawing on the meteorological example of predicting tomorrow's high temperature, the input feature would be tomorrow's forecasted temperature from a numerical weather model (e.g., GFS) and the label would be tomorrow's observed temperature.

Supervised ML methods can be further broken into two subcategories: regression and classification. Regression tasks are ML methods that output a continuous range of values, like the forecast of tomorrow's high temperature (e.g., 75.0°F). Meanwhile, classification tasks are characteristic of ML methods that classify data (e.g., will it rain or snow tomorrow). Reposing tomorrow's high temperature forecast as a classification task would be: "Will tomorrow be warmer than today?" This paper will cover both regression and classification methods. In fact, many ML methods can be used for both tasks.

All ML methods described here have one thing in common: the ML method quantitatively uses the training data to optimize a set of weights (i.e., thresholds) that enable the prediction. These weights are determined either by minimizing the error of the ML prediction or by maximizing the probability of a class label; the two approaches coincide with regression and classification, respectively. Alternative names for error that readers might encounter in the literature are loss or cost.

Now that some of the common ML terms have been discussed, the following subsections will describe the ML methods. They start with the simplest methods (e.g., linear regression) and move to more complex methods (e.g., support vector machines) as the sections proceed. Please note that the following subsections aim to provide an introduction and the intuition behind each method. An example of the methods being applied and helpful application discussion can be found in section 3.
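To make the supervised-learning vocabulary above concrete, the short Python sketch below frames the high-temperature example as both a regression and a classification task using synthetic data. It is illustrative only (the numbers, the 68°F threshold, and the variable names are invented here, not taken from the paper), but it shows the feature-matrix and label-vector shapes and the fit/predict pattern that every scikit-learn method used later follows.

# Minimal sketch of the supervised-learning setup, with synthetic data.
# Feature matrices are shaped (n_samples, n_features) and labels are 1D arrays.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
nwp_forecast = rng.normal(70.0, 10.0, size=200)            # forecasted high temperature (F)
observed_high = nwp_forecast + rng.normal(0.0, 2.0, 200)   # observed high temperature (label)

X = nwp_forecast.reshape(-1, 1)   # input features X, shape (n_samples, n_features)

# Regression task: predict tomorrow's high temperature (a continuous value).
reg = LinearRegression().fit(X, observed_high)
print(reg.predict([[75.0]]))

# Classification task: "Will tomorrow be warmer than today?" (a binary label).
warmer_than_today = (observed_high > 68.0).astype(int)
clf = LogisticRegression().fit(X, warmer_than_today)
print(clf.predict([[75.0]]), clf.predict_proba([[75.0]]))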


a. Linear regression

An important concept in ML is that, when choosing to use ML for a task, one should start with the simpler ML models first. Occam's razor² tells us to prefer the simplest solution that can solve the task or represent the data. While this does not always mean the simplest ML model available, it does mean that simpler models should be tried before more complicated ones (Holte 1993). Thus, the first ML method discussed is linear regression, which has a long history in meteorology (e.g., Malone 1955) and forms the heart of the model output statistics product (i.e., MOS; Glahn and Lowry 1972) that many meteorologists are familiar with. Linear regression is popular because it is a simple method that is also computationally efficient. In its simplest form, linear regression approximates the value you would like to predict (ŷ) by fitting weight terms (w_i) in the following equation:

    \hat{y} = \sum_{i=0}^{D} w_i x_i.   (1)

The first predictor (x_0) is always 1 so that w_0 is a bias term, allowing the function to move from the origin as needed. The term D is the number of features for the task.

As noted before, with ML the objective is to find w_i such that a user-specified loss function (i.e., error function) is minimized. The most common loss function for traditional linear regression is the residual summed squared error (RSS):

    RSS = \sum_{j=1}^{N} (y_j - \hat{y}_j)^2,   (2)

where y_j is a true data point, ŷ_j is the predicted data point, and N is the total number of data points in the training dataset. A graphical example of a linear regression and its residuals is shown in Fig. 2. Linear regression using residual summed squared error can work very well and is a fast learning algorithm, so we suggest it as a baseline method before choosing more complicated methods. The exact minimization method is beyond the scope of this paper, but know that the minimization uses the slope (i.e., derivative) of the loss function to determine how to adjust the trainable weights. If this sounds familiar, that is because it is the same minimization technique learned in most first-year college calculus classes and is a similar technique to what is used in data assimilation for numerical weather prediction (cf. chapter 5 and section 10.5 in Kalnay 2002; Lackmann 2011). The concept of using the derivative to find the minimum is repeated throughout most ML methods given there is often a minimization (or maximization) objective.

FIG. 2. A visual example of linear regression with a single input predictor. The x axis is a synthetic input feature, and the y axis is a synthetic output label. The solid black line is the regression fit, and the red dashed lines are the residuals.

Occasionally datasets can contain irrelevant or noisy predictors, which can cause instabilities in the learning. One approach to address this is to use a modified version of linear regression known as ridge regression (Hoerl and Kennard 1970), which minimizes both the summed squared error (like before) and the sum of the squared weights, called an L2 penalty. Mathematically, the new loss function can be described as

    RSS_ridge = \sum_{j=1}^{N} (y_j - \hat{y}_j)^2 + k \sum_{i=0}^{D} w_i^2.   (3)

Here, k (which is ≥ 0) is a user-defined parameter that controls the weight of the penalty. Likewise, another modified version of linear regression is lasso regression (Tibshirani 1996), which minimizes the sum of the absolute values of the weights. This penalty to learning is also termed an L1 penalty. The lasso loss function mathematically is

    RSS_lasso = \sum_{j=1}^{N} (y_j - \hat{y}_j)^2 + k \sum_{i=0}^{D} |w_i|.   (4)

Both lasso and ridge encourage the learned weights to be small but in different ways. The two penalties are often combined to create the elastic-net penalty (Zou and Hastie 2005):

    RSS_elastic = \sum_{j=1}^{N} (y_j - \hat{y}_j)^2 + k \sum_{i=0}^{D} [\alpha w_i^2 + (1 - \alpha)|w_i|].   (5)

In general, the addition of components to the loss function, as described in Eqs. (3)-(5), is known as regularization and is found in other ML methods. Some recent examples of papers using linear regression include subseasonal prediction of tropical cyclone parameters (Lee et al. 2020), relating mesocyclone characteristics to tornado intensity (Sessa and Trapp 2020), and short-term forecasting of tropical cyclone intensity (Hu et al. 2020).

² https://en.wikipedia.org/wiki/Occam%27s_razor
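As a hedged illustration of Eqs. (1)-(5), the sketch below fits ordinary least squares, ridge, lasso, and elastic-net regression with scikit-learn on synthetic data. In scikit-learn the penalty weight k is called alpha, and l1_ratio controls the L1/L2 mix of the elastic net; the data and hyper-parameter values here are arbitrary choices for demonstration, not the paper's configuration.

# Ordinary, ridge, lasso, and elastic-net linear regression on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))                    # five synthetic predictors
true_w = np.array([2.0, -1.0, 0.0, 0.0, 0.5])    # two predictors are irrelevant
y = X @ true_w + rng.normal(0.0, 0.1, size=500)

models = {
    "ols": LinearRegression(),
    "ridge": Ridge(alpha=1.0),                       # L2 penalty, Eq. (3)
    "lasso": Lasso(alpha=0.1),                       # L1 penalty, Eq. (4)
    "elastic": ElasticNet(alpha=0.1, l1_ratio=0.5),  # combined penalty, Eq. (5)
}

for name, model in models.items():
    model.fit(X, y)
    # The lasso and elastic net tend to shrink the irrelevant weights toward zero.
    print(name, np.round(model.coef_, 2))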


b. Logistic regression

As a complement to linear regression, the first classification method discussed here is logistic regression. Logistic regression is an extension of linear regression in that it uses the same functional form as Eq. (1). The differences lie in how the weights in Eq. (1) are determined and in a minor adjustment to the output of Eq. (1). More specifically, logistic regression applies the sigmoid function (Fig. 3) to the output of Eq. (1), defined as follows:

    S(\hat{y}) = \frac{1}{1 + e^{-\hat{y}}}.   (6)

Large positive values into the sigmoid result in a value of 1, while large negative values result in a value of 0. Effectively, the sigmoid scales the output of Eq. (1) to a range from 0 to 1, which can then be interpreted like a probability. For the simplest case of classification involving just two classes (e.g., rain or snow), the output of the sigmoid can be interpreted as a probability of either class (e.g., rain or snow). The output probability then allows the classification to be formulated as finding the w_i that maximize the probability of a desired class. Mathematically, the classification loss function for logistic regression can be described as

    loss = -\sum_{i=0}^{D} \left\{ y_i \log[S(\hat{y})] + (1 - y_i)\log[1 - S(\hat{y})] \right\}.   (7)

Like before for linear regression, the expression in Eq. (7) is minimized using derivatives. If the reader is interested in more information on the mathematical techniques of minimization, they can find more information in chapter 5 of Kalnay (2002).

FIG. 3. A graphical depiction of the sigmoid function [Eq. (6)]. The x axis is the predicted label value, while the y axis is the now scaled value.

Logistic regression has been used for a long time within meteorology. One of the earliest papers using logistic regression showed skill in predicting the probability of hail greater than 1.9 cm (Billet et al. 1997), while more recent papers have used logistic regression to identify storm mode (Jergensen et al. 2020), for subseasonal prediction of surface temperature (Vigaud et al. 2019), and to predict the transition of tropical cyclones to extratropical cyclones (Bieli et al. 2020).
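The following minimal sketch shows logistic regression on a synthetic rain-versus-snow temperature dataset. The class labels, the noisy transition near 32°F, and the variable names are assumptions made for illustration; predict_proba returns the sigmoid-scaled output of Eq. (6), which can be read as a probability.

# Logistic regression for a two-class (rain vs. snow) problem, synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
temperature_f = rng.uniform(10.0, 55.0, size=1000)
# Synthetic labels: 1 = rain, 0 = snow, with a noisy transition near 32 F.
is_rain = (temperature_f + rng.normal(0.0, 3.0, 1000) > 32.0).astype(int)

X = temperature_f.reshape(-1, 1)
clf = LogisticRegression().fit(X, is_rain)

print(clf.predict_proba([[28.0]]))  # [P(snow), P(rain)] at 28 F
print(clf.predict([[40.0]]))        # hard classification using a 50% threshold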


c. Naïve Bayes

An additional method for classification is known as naïve Bayes (Kuncheva 2006), which is named for its use of Bayes's theorem and can be written as the following:

    P(y|x) = \frac{P(y)\,P(x|y)}{P(x)}.   (8)

In words, Eq. (8) is looking for the probability of some label y (e.g., snow), given a set of input features x [P(y|x); e.g., temperature]. This probability can be calculated from knowing the probability of the label y occurring in the dataset [P(y); e.g., how frequently it snows] times the probability of the input features given that they belong to the class y [P(x|y); e.g., how frequently it is 32°F when it is snowing], divided by the probability of the input features [P(x)]. The naïve part of the naïve Bayes algorithm comes from assuming that all input features x are independent of one another and that the term P(x|y) can be modeled by an assumed distribution (e.g., normal distribution) with parameters determined from the training data. While these assumptions are often not true, the naïve Bayes classifier can be skillful in practice. A few simplification steps result in the following:

    \hat{y} = \arg\max_{y} \left\{ \log[P(y)] + \sum_{i=0}^{N} \log[P(x_i|y)] \right\}.   (9)

Again in words, the predicted class (ŷ) from naïve Bayes is the classification label (y) such that the sum of the log of the probability of that classification [P(y)] and the sum of the logs of all the probabilities of the specific inputs given the classification [P(x_i|y)] is maximized. To help visualize the quantity P(x_i|y), a graphical example is shown in Fig. 4. This example uses surface weather measurements from a station near Marquette, Michigan, where data were compiled when it was raining and snowing. Figure 4 shows the distribution of air temperature (i.e., an input feature) given the two classes (i.e., rain versus snow). To get P(x_i|y), we need to assume an underlying distribution function. The common assumed distribution with naïve Bayes is the normal distribution:

    f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right],   (10)

where μ is the mean and σ is the standard deviation of the training data. While the normal distribution assumption for the temperature distribution in Fig. 4 is questionable due to thermodynamic constraints that lock the temperature at 32°F (i.e., latent cooling/heating), naïve Bayes can still have skill. Initially, it might not seem like any sort of weights/biases are being fit like in the previously mentioned methods (e.g., logistic regression), but μ and σ are being learned from the training data. If performance from the normal distribution is poor, other distributions can be assumed, like a multinomial or a Bernoulli distribution.

FIG. 4. Visualizing the probability of an input feature given the class label. This example is created from 5-min weather station observations from near Marquette, MI (years included: 2005-20). The precipitation phase was determined by the present weather sensor. The histogram is the normalized number of observations in that temperature bin, while the smooth curves are the normal distribution fit to the data. Red is for raining instances and blue is for snowing instances.

A popular use of naïve Bayes classification in the meteorological literature has been the implementation of ProbSevere (e.g., Cintineo et al. 2014, 2018, 2020), which uses various severe storm parameters and observations to classify the likelihood of any storm becoming severe in the next 60 min. Additional examples of naïve Bayes classifiers in meteorology include identifying tropical cyclone secondary eyewall formation from microwave imagery (Kossin and Sitkowski 2009), identifying anomalous propagation in radar data (Peter et al. 2013), and precipitation type (e.g., convective/stratiform) retrievals from geostationary satellites (Grams et al. 2016).
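A minimal sketch of Gaussian naïve Bayes on the same kind of synthetic rain/snow temperatures is given below. GaussianNB learns a mean and standard deviation per feature and class, i.e., the parameters of Eq. (10) used to evaluate P(x_i|y) in Eq. (9). The data are synthetic, and the attribute names follow recent scikit-learn versions.

# Gaussian naïve Bayes for the rain-vs-snow example with synthetic temperatures.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
temperature_f = rng.uniform(10.0, 55.0, size=1000)
is_rain = (temperature_f + rng.normal(0.0, 3.0, 1000) > 32.0).astype(int)

X = temperature_f.reshape(-1, 1)
nb = GaussianNB().fit(X, is_rain)

# The fitted per-class means and variances correspond to the smooth curves in Fig. 4
# (the variance attribute is named sigma_ in older scikit-learn versions).
print(nb.theta_, nb.var_)
print(nb.predict_proba([[30.0]]))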


d. Trees and forests

Decision trees are based on a decision-making method that humans have been using for years: flow charts, where the quantitative decision points within the flowchart are learned automatically from the data. Early use of decision trees in meteorology (e.g., Chisholm et al. 1968) actually predated the formal description of the decision tree algorithm (Breiman 1984; Quinlan 1993; Breiman 2001). Since then, tree-based methods have grown in popularity and have been demonstrated to predict a variety of complex meteorological phenomena. Topics include the following: aviation applications (e.g., Williams et al. 2008a,b; Williams 2014; Muñoz-Esparza et al. 2020), severe weather (e.g., Gagne et al. 2009, 2013; McGovern et al. 2014; Mecikalski et al. 2015; Lagerquist et al. 2017; Gagne et al. 2017; Czernecki et al. 2019; Burke et al. 2020; Hill et al. 2020; Loken et al. 2020; Gensini et al. 2021; Flora et al. 2021; Loken et al. 2022), solar power (e.g., McGovern et al. 2015), precipitation (e.g., Elmore and Grams 2016; Herman and Schumacher 2018b,a; Taillardat et al. 2019; Loken et al. 2020; Wang et al. 2020; Mao and Sorteberg 2020; Li et al. 2020; Hill and Schumacher 2021; Schumacher et al. 2021), satellite and radar retrievals (e.g., Kühnlein et al. 2014; Conrick et al. 2020; Yang et al. 2021; Zhang et al. 2021), and climate-related topics (e.g., Cui et al. 2021).

To start, we will describe decision trees in the context of a classification problem. The decision tree creates splits in the data (i.e., decisions) that are chosen such that either the Gini impurity value or the entropy value decreases after the split. Gini impurity is defined as

    Gini = \sum_{i=0}^{k} p_i (1 - p_i),   (11)

where p_i is the probability of class i (i.e., the number of data points labeled class i divided by the total number of data points), while entropy is defined as

    entropy = -\sum_{i=0}^{k} p_i \log_2(p_i).   (12)

Both functions effectively measure how similar the data point labels are in each one of the groupings of the tree after some split in the data. Envision the flowchart as a tree. The decision is where the tree branches into two directions, resulting in two separate leaves. The goal of a decision tree is to choose the branch that results in a leaf having a minimum of Gini or entropy. In other words, the data split would ideally result in two subgroups of data where all the labels are the same within each subgroup. Figure 5 shows both the Gini impurity and entropy for a two-class problem. Consider the example of classifying winter precipitation as rain or snow. From some example surface temperature dataset, the likely decision threshold would be near 32°F, which would result in the subsequent two groupings of data point labels (i.e., snow/rain) having a dominant class label (i.e., the fraction of class k is near 0 or 1) and thus having a minimum of entropy or Gini (i.e., near 0). The actual output of this tree could be either the majority class label or the ratio of the major class (i.e., a probabilistic output).

FIG. 5. A visual representation of the two functions that can be used in decision trees for classification, entropy (blue) and Gini impurity (red).

While it is helpful to consider a decision tree with a single decision, also known as a tree with a depth of 1, the prediction power of a single decision is limited. A step toward more complexity is to include increasing depth (i.e., more decisions/branches). To continue with the rain/snow example from the previous paragraph, we could include a second decision based on measured wet-bulb temperature. A tree with depth two will likely have better performance, but the prediction power is still somewhat limited.

An additional step to increase the complexity of decision trees, beyond including more predictors, is a commonly used method in meteorology: ensembles. While it might not be clear here, decision trees become over-fit (i.e., work really well for training data but perform poorly on new data) as the depth of the tree increases. An alternative approach is to use an ensemble of trees (i.e., a forest). Using an ensemble of trees forms the basis of two additional tree-based methods: random forests (Breiman 2001) and gradient boosted decision trees (Friedman 2001).

Random forests are a collection of decision trees that are trained on random subsets of data and random subsets of input variables from the initial training dataset. In other words, the mathematics are exactly the same for each tree, and the decisions still aim to minimize the loss (e.g., entropy), but each tree is given a different random subset of data sampled from the original dataset with replacement. Gradient boosted decision trees are an ensemble of trees in which, instead of training multiple trees on random subsets (i.e., random forest), each tree in the ensemble is successively trained on the remaining error from the previous trees. To put it another way, rather than minimizing the total error on random trees, the reduced error from the first decision tree is now minimized on the second tree, the reduced error from trees one and two is then minimized on the third tree, and so on. To come up with a single prediction out of the ensemble of trees, the predictions can be combined through a voting procedure (i.e., count up the predicted classes of each tree) or by taking the average probabilistic output from each tree. Random forests can use either method, while gradient boosted trees are limited to the voting procedure.

While the discussion here has been centered on classification for the tree-based methods, they can be used for regression as well. The main alteration to the decision tree method to convert it to a regression-based problem is the substitution of the loss function [i.e., Eqs. (11) and (12)]. For example, a common loss function for random forest regression and gradient boosted regression is the same loss function as for linear regression described in the previous section [e.g., Eq. (2)], the residual summed squared error.

e. Support vector machines

A support vector machine (commonly referred to as SVM; Vapnik 1963) is an ML method similar to linear and logistic regression. The idea is that a support vector machine uses a linear boundary to make its predictions, which has a similar mathematical form but is written differently to account for vector notation. The equation is

    \hat{y} = \mathbf{w}^T \mathbf{x} + b,   (13)

where w is a vector of weights, x is a vector of input features, b is a bias term, and ŷ is the regression prediction. In the case of classification, only the sign of the right side of Eq. (13) is used. This linear boundary can be generalized beyond two-dimensional problems (i.e., two input features) to three dimensions, where the decision boundary is called a plane, or to any higher-order space, where the boundary is called a hyperplane. The main difference between the linear methods discussed in sections 2a and 2b and support vector machines is that support vector machines include margins along with the linear boundary. Formally, the margin is the area between the linear boundary and the closest training data point for each class label (e.g., closest rain data point and closest snow data point). This is shown schematically with a synthetic dataset in Fig. 6a. While this is an ideal case, usually classes overlap (Fig. 6b), but support vector machines can still handle splitting the classes. The optimization task for support vector machines is stated as the following: find w^T such that the margin is maximized. In other words, support vector machines aim to maximize the distance between the two closest observations on either side of the hyperplane. Mathematically, the margin distance is described as

    margin = \frac{1}{\mathbf{w}^T \mathbf{w}}.   (14)

Like before, the maximization is handled by numerical techniques to optimize the problem, but the resulting solution will be the hyperplane with the largest separation between the classes. A powerful attribute of the support vector machine method is that it can be extended to additional mathematical formulations for the boundary, for example a quadratic function. Thus, the person using support vector machines can decide which function would work best for their data. Recent applications of support vector machines in meteorology include the classification of storm mode (Jergensen et al. 2020), hindcasts of tropical cyclones (Neetu et al. 2020), and evaluating errors with quantitative precipitation retrievals in the United States (Kurdzo et al. 2020).
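To close out the methods, the sketch below fits a decision tree, a random forest, gradient boosted trees, and a support vector machine on synthetic two-class data, illustrating the identical fit/predict syntax emphasized in section 3. The hyper-parameter values shown (max_depth, n_estimators, kernel) are illustrative defaults, not tuned choices from the paper.

# Tree-based methods and an SVM on synthetic two-class data, all with the same syntax.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))                        # two synthetic input features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.3, 1000) > 0).astype(int)

models = {
    "decision_tree": DecisionTreeClassifier(max_depth=3),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "gradient_boosted": GradientBoostingClassifier(n_estimators=100),
    "svm_linear": SVC(kernel="linear"),   # kernel="poly" or "rbf" gives nonlinear boundaries
}

for name, model in models.items():
    model.fit(X, y)
    print(name, "training accuracy:", round(model.score(X, y), 3))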


3. Machine learning application and discussion

This section will discuss the use of all the ML methods with a familiar use case: thunderstorms. Specifically, this section will show two ML applications derived from popular meteorological datasets: radar and satellite. The particular data used are from the Storm Event Imagery dataset (SEVIR; Veillette et al. 2020), which contains over 10 000 storm events from between 2017 and 2019. Each event spans four hours and includes measurements from both GOES-16 and NEXRAD. An example storm event and the five measured variables, namely red channel visible reflectance (0.64 μm; channel 2), midtropospheric water vapor brightness temperature (6.9 μm; channel 9), clean infrared window brightness temperature (10.7 μm; channel 13), vertically integrated liquid (VIL; from NEXRAD), and Geostationary Lightning Mapper (GLM) measured lightning flashes, are found in Fig. 7. In addition to discussing ML in the context of the SEVIR dataset, this section will follow the general steps to using ML and contains helpful discussions of the best practices as well as the most common pitfalls.

FIG. 6. Support vector machine classification examples. (a) Ideal (synthetic) data where the x and y axes are both input features, while the color designates what class each point belongs to. The decision boundary learned by the support vector machine is the solid black line, while the margin is shown by the dashed lines. (b) A real-world example using NAM 1800 UTC forecasts of U and V wind and tipping-bucket measurements of precipitation. Blue plus markers are raining instances, and the red minus signs are non-raining instances. Black lines are the decision boundary and margins.

a. Problem statements

The SEVIR data will be applied to two tasks: 1) Does this image contain a thunderstorm? and 2) How many lightning flashes are in this image? To be explicit, we assume the GLM observations are unavailable, and we need to use the other measurements (e.g., infrared brightness temperature) as features to estimate whether there are lightning flashes (i.e., classification) and how many of them there are (i.e., regression). While both of these tasks might be considered redundant since we have GLM, the goal of this paper is to provide discussion on how to use ML as well as discussion on the ML methods themselves. That being said, a potentially useful application of the trained models herein would be to use them on satellite sensors that do not have lightning measurements. For example, all generations of GOES prior to GOES-16 did not have a lightning sensor collocated with the main sensor. Thus, we could potentially use the ML models trained here to estimate GLM measurements prior to GOES-16 (i.e., November 2016).

b. Data

The first step of any ML project is to obtain data. Here, the data are from a public archive hosted on Amazon Web Services. For information on how to obtain the SEVIR data as well as the code associated with this manuscript, see the data availability statement. One major question at this juncture is as follows: "How much data are needed to do machine learning?" While there does not exist a generic number that can apply to all datasets, the idea is to obtain enough data such that one's training data are diverse. A diverse dataset is desired because any bias found within the training data would be encoded in the ML method (McGovern et al. 2021). For example, if an ML model were trained on only images where thunderstorms were present, then the ML model would likely not know what a non-lightning-producing storm looks like and would be biased. Diversity in the SEVIR dataset is created by including random images (i.e., no storms) from all around the United States (cf. Fig. 2 in Veillette et al. 2020).


After obtaining the data, it is vital to remove as much spurious data as possible before training because the ML model will not know how to differentiate between spurious data and high-quality data. A common anecdote when using ML models is "garbage in, garbage out." The SEVIR dataset has already gone through rigorous quality control, but this is often not the case with raw meteorological datasets. Two examples of quality issues that would likely be found in satellite and radar datasets are satellite artifacts (e.g., the GOES-17 heat pipe; McCorkel et al. 2019) and radar ground clutter (e.g., Hubbert et al. 2009). Cleaning and manipulating the dataset to get it ready for ML often takes a researcher 50%-80% of their time.³ Thus, do not be discouraged if cleaning one's datasets is taking a large amount of time, because a high-quality dataset will be best for having a successful ML model.

³ https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html

FIG. 7. An example storm image from the SEVIR dataset. This event is from 6 Aug 2018. (a) The visible reflectance, (b) the midtropospheric water vapor brightness temperature, (c) the clean infrared brightness temperatures, (d) the vertically integrated liquid retrieved from NEXRAD, and (e) gridded GLM number of flashes. Annotated locations of representative percentiles that were engineered features used for the ML models are shown in (a).

Subsequent to cleaning the data, the next step is to engineer the inputs (i.e., features) and outputs (i.e., labels). One avenue to create features is to use every single pixel in the image as a predictor. While this could work, given the number of pixels in the SEVIR images (589 824 total pixels for one visible image) it is computationally impractical to train an ML model with all pixels. Thus, we are looking for a set of statistics that can be extracted from each image. For the generation of features, domain knowledge is critical because choosing meteorologically relevant quantities will ultimately determine the ML model's skill. For the ML tasks presented in section 3a, information about the storm characteristics (e.g., strength) in the image would be beneficial features. For example, a more intense storm is often associated with more lightning. Proxies for estimating storm strength would be the magnitude of reflectance in the visible channel; how cold the brightness temperatures in the water vapor and clean infrared channels are; and how much vertically integrated water there is. Thus, to characterize these statistics, we extract the following percentiles from each image and variable: 0, 1, 10, 25, 50, 75, 90, 99, and 100.

To create the labels, the number of lightning flashes in the image is summed. For Problem Statement 1, an image is classified as containing a thunderstorm if the image has at least one flash in the last five minutes. For Problem Statement 2, the sum of all lightning flashes in the past five minutes within the image is used as the regression target.
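The sketch below illustrates the feature and label engineering just described, assuming each SEVIR variable for one event is available as a 2D numpy array. The function and variable names (engineer_features, make_labels, the dictionary keys, and the array shapes) are hypothetical stand-ins, not the actual SEVIR reading code from the paper's notebooks.

# Percentile-based features and lightning labels from gridded images (illustrative).
import numpy as np

PERCENTILES = [0, 1, 10, 25, 50, 75, 90, 99, 100]

def engineer_features(images):
    """Extract the percentile statistics from each variable of one storm image.

    `images` is a dict of 2D arrays, e.g. {"vis": ..., "wv": ..., "ir": ..., "vil": ...}.
    Returns a 1D feature vector (one percentile set per variable).
    """
    features = []
    for name in sorted(images):
        features.extend(np.percentile(images[name], PERCENTILES))
    return np.array(features)

def make_labels(glm_flashes):
    """Build both labels from a gridded GLM flash-count image."""
    n_flashes = int(glm_flashes.sum())   # regression target (Problem Statement 2)
    has_storm = int(n_flashes >= 1)      # classification label (Problem Statement 1)
    return has_storm, n_flashes

# Example with random arrays standing in for one SEVIR event:
rng = np.random.default_rng(0)
fake_images = {"vis": rng.random((768, 768)), "ir": rng.random((192, 192))}
X_row = engineer_features(fake_images)
print(X_row.shape, make_labels(rng.poisson(0.01, size=(48, 48))))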


Now that the data have been quality controlled and our features and labels have been extracted, the next step is to split the dataset into three independent subsets named the training, validation, and testing sets. The reason for these three subsets is the relative ease with which ML methods can "memorize" the training data. This occurs because ML models can contain numerous (e.g., hundreds, thousands, or even millions of) learnable parameters; thus, the ML model can learn to perform well on the training data but not generalize to other non-training data, which is called over-fitting. To assess how over-fit an ML model is, it is important to evaluate a trained ML model on data outside of its training data (i.e., the validation and testing sets).

The training dataset is the largest subset of the total amount of data. The reason the training set is the largest is that the aforementioned desired outcome of most ML models is to generalize to a wide variety of examples. Typically, the amount of training data is between 70% and 85% of the total amount of data available. The validation dataset, regularly 5%-15% of the total dataset, is a subset of data used to assess if an ML model is over-fit and is also used for evaluating the best model configurations (e.g., the depth of a decision tree). These model configurations are also known as hyper-parameters. Machine learning models have numerous configurations and permutations that can be varied and could impact the skill of any one trained ML model. Thus, common practice is to systematically vary the available hyper-parameter choices, also called a grid search, and then evaluate the different trained models based on the validation dataset. Hyper-parameters will be discussed in more detail later. The test dataset is the last grouping, which is set aside until the very end of the ML process. The test dataset is often of similar size to the validation dataset, but the key difference is that the test dataset is used after all hyper-parameter variations have been concluded. The reason for this last dataset is that when doing the systematic varying of the hyper-parameters the ML practitioner is inadvertently tuning the ML model to the validation dataset. One will often choose specific hyper-parameters in such a way as to achieve the best performance on the validation dataset. Thus, to provide a truly unbiased assessment of the trained ML model's skill on unseen data, the test dataset is set aside and not used until after training all ML models.

It is common practice outside of meteorology (i.e., in data science) to randomly split the total dataset into the three subsets. However, it is important to strive for independence of the various subsets. A data point in the training set should not be highly correlated with a data point in the test set. In meteorology this level of independence is often challenging given the frequent spatial and temporal autocorrelations in meteorological data. Consider the SEVIR dataset. Each storm event has 4 h of data broken into 5-min time steps. For one storm event, there is a large correlation between adjacent 5-min samples. Thus, randomly splitting the data would likely provide a biased assessment of the true skill of the ML model. To reduce the number of correlated data points across subsets, time is often used to split the dataset. For our example, we choose to split the SEVIR data by training on 1 January 2017-1 June 2019 and splitting every other week in the rest of 2019 into the validation and testing sets. This equates to a 72%, 13%, and 15% split for the training, validation, and test sets, respectively. In the event that the total dataset is small and splitting the data into smaller subsets creates less robust statistics, a resampling method known as k-fold cross validation (e.g., Bischl et al. 2012; Goodfellow et al. 2016) can be used. The SEVIR dataset was sufficiently large that we chose not to do k-fold cross validation, but a meteorological example using it can be found in Shield and Houston (2022).

c. Training and evaluation

1) CLASSIFICATION

As stated in section 3a, task 1 is to classify whether an image contains a thunderstorm. Thus, the classification methods available to do this task are logistic regression, naïve Bayes, decision trees, random forest, gradient boosted trees, and support vector machines. To find an optimal ML model, it is often best to try all methods available. While this might seem like a considerable amount of additional effort, the ML package used in this tutorial (i.e., scikit-learn⁴) uses the same syntax for all methods [e.g., method.fit(X, y), method.predict(Xval)]. Thus, fitting all available methods does not require substantially more effort from the ML practitioner and will likely result in finding a best performing model.

⁴ https://scikit-learn.org/stable/

To start off, all methods are initially trained using their default hyper-parameters in scikit-learn and just one input feature, the minimum infrared brightness temperature (Tb). We choose to use Tb because meteorologically it is a proxy for the depth of the storms in the domain, which is correlated to lightning formation (Yoshida et al. 2009). To assess the predictive power of this variable, the distributions of Tb for thunderstorms and no thunderstorms are shown in Fig. 8. As expected, Tb for thunderstorms shows more frequent lower temperatures than for non-thunderstorm images. Training all methods using Tb achieves an accuracy of 80% on the validation dataset. While accuracy is a common and easy-to-understand metric, it is best to always use more than one metric when evaluating ML methods.

FIG. 8. The normalized distributions of minimum brightness temperature (Tb) from the clean infrared channel for thunderstorm images (blue; T-storm) and non-thunderstorm images (red; No T-storm).
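As a rough sketch of the workflow described above, the code below performs a simplified time-based train/validation/test split and then fits each classifier with its default hyper-parameters, reporting validation accuracy. The synthetic data, the cutoff date, and the week-alternation logic are simplifications for illustration, not a reproduction of the paper's exact SEVIR split.

# Time-based split plus fitting every classifier with default hyper-parameters.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

def split_by_time(X, y, times, cutoff=np.datetime64("2019-06-01")):
    """Train on everything before the cutoff; alternate later weeks into val/test."""
    train = times < cutoff
    week = ((times - cutoff) / np.timedelta64(1, "W")).astype(int)
    val = (~train) & (week % 2 == 0)
    test = (~train) & (week % 2 == 1)
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

rng = np.random.default_rng(0)
n = 600
times = np.datetime64("2017-01-01") + rng.integers(0, 1095, n) * np.timedelta64(1, "D")
X = rng.normal(size=(n, 3))               # stand-in for the percentile features
y = (X[:, 0] > 0).astype(int)             # stand-in for the thunderstorm label

models = {
    "LgR": LogisticRegression(),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(),
    "RF": RandomForestClassifier(),
    "GBT": GradientBoostingClassifier(),
    "SVM": SVC(),
}

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_by_time(X, y, times)
for name, model in models.items():
    model.fit(train_X, train_y)
    print(name, "validation accuracy:", round(model.score(val_X, val_y), 3))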


Another common performance metric for classification tasks is the area under the curve (AUC). More specifically, the common area metric is associated with the receiver operating characteristic (ROC) curve. The ROC curve is calculated from the relationship between the probability of false detection (POFD) and the probability of detection (POD). Both the POFD and POD are calculated from parameters of a contingency table, which are the true positives (both the ML prediction and the label say thunderstorm), false positives (the ML prediction says thunderstorm, the label has no thunderstorm), false negatives (the ML prediction says no thunderstorm, the label shows there is a thunderstorm), and true negatives (the ML prediction says no thunderstorm, the label says no thunderstorm). The POFD and POD are defined by

    POFD = \frac{FalsePositive}{FalsePositive + TrueNegative},   (15)

    POD = \frac{TruePositive}{TruePositive + FalseNegative}.   (16)

All of the ML models, except support vector machines (as coded in sklearn), can provide a probabilistic estimation of the classification (e.g., this image is 95% likely to have lightning in it). When calculating the accuracy before, we assumed a threshold of 50% to designate what the ML prediction was. To get the ROC curve, the threshold probability is instead varied from 0% to 100%. The resulting ROC curves for all of the ML methods except support vector machines are shown in Fig. 9a. We see that for this simple one-feature model, all methods are still very similar and have AUCs near 0.9 (Fig. 9a), which is generally considered good performance.⁵

⁵ No formal peer-reviewed journal states this; it is more of a rule of thumb in machine learning practice.

FIG. 9. Performance metrics from the simple classification (only using Tb). (a) Receiver operating characteristic (ROC) curves for each ML model (except support vector machines): logistic regression (LgR; blue), naïve Bayes (NB; red), decision tree (DT; green), random forest (RF; yellow), and gradient boosted trees (GBT; light green). The area under the ROC curve is reported in the legend. (b) Performance diagram for all ML models [same colors as in (a)]. Color fill is the corresponding CSI value for each success ratio and probability of detection (SR-POD) pair. Dashed contours are the frequency bias.

An additional method for evaluating the performance of a classification method is called a performance diagram (Fig. 9b; Roebber 2009). The performance diagram is also calculated from the contingency table, using the POD again for the y axis, but this time the x axis is the success ratio (SR), which is defined as

    SR = \frac{TruePositive}{TruePositive + FalsePositive}.   (17)

From this diagram, several things can be gleaned about the models' performance. In general, the top-right corner is where "best" performing models are found. This area is characterized by models that capture nearly all events (i.e., thunderstorms), while not predicting a lot of false alarms (i.e., false positives). This corner is also associated with high values of the critical success index (CSI; filled contours in Fig. 9b), defined as

    CSI = \frac{TruePositive}{TruePositive + FalsePositive + FalseNegative},   (18)

which is a metric that shows a model's performance without considering the true negatives. Not considering the true negatives is important because true negatives can dominate ML tasks in meteorology given the often rare nature of events with large impacts (e.g., floods, hail, tornadoes). The last set of lines on this diagram are the frequency bias contours (dashed gray lines in Fig. 9b). These contours indicate whether a model is overforecasting or underforecasting.
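A minimal sketch of the contingency-table metrics in Eqs. (15)-(18), plus the frequency bias and an ROC/AUC computation with scikit-learn, is given below. The labels and predicted probabilities are synthetic stand-ins for the validation labels and a model's predict_proba output.

# Contingency-table metrics and ROC/AUC on synthetic labels and probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def contingency_metrics(y_true, y_pred):
    """POFD, POD, SR, CSI, and frequency bias from binary labels/predictions."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return {
        "POFD": fp / (fp + tn),        # Eq. (15)
        "POD": tp / (tp + fn),         # Eq. (16)
        "SR": tp / (tp + fp),          # Eq. (17)
        "CSI": tp / (tp + fp + fn),    # Eq. (18)
        "freq_bias": (tp + fp) / (tp + fn),
    }

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)
y_prob = np.clip(0.7 * y_true + 0.3 * rng.random(1000), 0, 1)   # fake probabilities

print(contingency_metrics(y_true, (y_prob >= 0.5).astype(int)))
print("AUC:", roc_auc_score(y_true, y_prob))        # area under the ROC curve
pofd, pod, thresholds = roc_curve(y_true, y_prob)   # points along the ROC curve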


For the simple ML models trained, even though most of them have similar accuracy and AUC, the performance diagram suggests their performance is indeed different. Consider the tree-based methods (green box; Fig. 9b). They are all effectively at the same location, with a POD of about 0.9 and an SR of about 0.75, which is a region that has a frequency bias of almost 1.5. Meanwhile, the logistic regression, support vector machine, and naïve Bayes methods are much closer to the frequency bias line of 1, while having a similar CSI to the tree-based methods. Thus, after considering overall accuracy, AUC, and the performance diagram, the best performing model would be the logistic regression, support vector machine, or naïve Bayes. At this junction, the practitioner has the option to consider whether they want a slightly overforecasting system or a slightly underforecasting system. For the thunderstorm/no-thunderstorm task, there are not many implications for overforecasting or underforecasting. However, developers of a tornado prediction model may prefer a system that produces more false positives (overforecasting; storm warned, no tornado) than false negatives (underforecasting; storm not warned, tornado), as missed events could have significant impacts to life and property. It should be clear that without going beyond a single metric, this differentiation between the ML methods would not be possible.

While the previous example was simple by design, we as humans could have used a simple threshold at the intersection of the two histograms in Fig. 8 to achieve similar accuracy (e.g., 81%; not shown). The next logical step with the classification task would be to use all available features. One important thing to mention at this step is that it is good practice to normalize the input features. Some of the ML methods (e.g., random forest) can handle inputs of different magnitudes (e.g., CAPE is on the order of hundreds to thousands, but lifted index is on the order of one to tens), but others (e.g., logistic regression) will be unintentionally biased toward larger-magnitude features if you do not scale your input features. Common scaling methods include min-max scaling and scaling your input features to have a mean of 0 and a standard deviation of 1 (i.e., the standard anomaly), which are defined mathematically as follows:

    minmax = \frac{x - x_{min}}{x_{max} - x_{min}},   (19)

    standard\,anomaly = \frac{x - \mu}{\sigma},   (20)

respectively. In Eq. (19), x_min is the minimum value within the training dataset for some input feature x, while x_max is the maximum value in the training dataset. In Eq. (20), μ is the mean of feature x in the training dataset and σ is the standard deviation. For this paper, the standard anomaly is used.

FIG. 10. As in Fig. 9, but now trained with all available predictors. The annotations from Fig. 9 have been removed.

Using all available input features yields an accuracy of 90%, 84%, 86%, 91%, 90%, and 89% for logistic regression, naïve Bayes, decision tree, random forest, gradient boosted trees, and support vector machines, respectively. Beyond the relatively good accuracy, the ROC curves are shown in Fig. 10a. This time there are generally two sets of curves: a better performing group (logistic regression, random forest, gradient boosted trees, and support vector machines) with AUCs of 0.97, and a worse performing group (naïve Bayes and decision tree) with AUCs around 0.87. This separation coincides with the flexibility of the classification methods. The better performing group is better suited to deal with many features and nonlinear interactions of the features, while the worse performing group is a bit more restricted in how it combines many features. Considering the performance diagram (Fig. 10b), the same group of high-AUC models has higher CSI scores (>0.8) and little to no frequency bias. Meanwhile, the lower-AUC models have lower CSI (0.75), and naïve Bayes has a slight overforecasting bias. Overall, the ML performance on classifying whether an image has a thunderstorm is good with all predictors. While a good performing model is a desired outcome of ML, at this point we do not know how the ML is making its predictions. This is part of the "black box" issue of ML and does not lend itself to being consistent with the ML user's prior knowledge (see the note in the introduction on consistency; Murphy 1993).
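The sketch below shows both scaling options from Eqs. (19) and (20) with scikit-learn. The scalers are fit on the training set only and then applied to the validation set so that no information leaks from the held-out data; the two synthetic columns loosely mimic the CAPE and lifted-index magnitudes mentioned above.

# Min-max and standard-anomaly scaling, fit on training data only.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
train_X = rng.normal(loc=[1000.0, -2.0], scale=[500.0, 4.0], size=(800, 2))  # e.g., CAPE, lifted index
val_X = rng.normal(loc=[1000.0, -2.0], scale=[500.0, 4.0], size=(200, 2))

scaler = StandardScaler().fit(train_X)     # standard anomaly, Eq. (20)
train_scaled = scaler.transform(train_X)
val_scaled = scaler.transform(val_X)       # uses the training mean and std

minmax = MinMaxScaler().fit(train_X)       # min-max scaling, Eq. (19)
print(train_scaled.mean(axis=0).round(2), minmax.transform(val_X).min(axis=0).round(2))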


To alleviate some of the opaqueness of the ML black box, one can interrogate the trained ML models by asking: "What input features are most important to the decision?" and "Are the patterns the ML models learned physical (i.e., do they follow meteorological expectations)?" The techniques named permutation importance (Breiman 2001; Lakshmanan et al. 2015) and accumulated local effects (ALE; Apley and Zhu 2020) are used to answer these two questions, respectively. Permutation importance is a method in which the relative importance of an input feature is quantified by considering the change in an evaluation metric (e.g., AUC) when that input variable is shuffled (i.e., randomized). The intuition is that the most important variables, when shuffled, will cause the largest change to the evaluation metric. There are two main flavors of permutation importance, named single-pass and multi-pass. Single-pass permutation importance goes through the input variables and shuffles them one by one, calculating the change in the evaluation metric. Multi-pass permutation importance uses the result of the single pass, but progressively permutes features. In other words, features are successively permuted in the order that they were determined to be important (most important, then second most important, etc.) from the single pass, but are now left shuffled. The specific name for the method we have been describing is backward multi-pass permutation importance. The backward name comes from the direction of shuffling, starting with all variables unshuffled and shuffling more and more of them. There is the opposite direction, named forward multi-pass permutation importance, where the starting point is that all features are shuffled. Then each feature is unshuffled in order of its importance from the single-pass permutation importance. For visual learners, see the animations (for the backward direction; Figs. ES4 and ES5) in the supplement of McGovern et al. (2019). The reason for doing multi-pass permutation importance is that correlated features could result in falsely identifying unimportant variables when using the single-pass permutation importance. The best analysis of the permutation test is to use both the single-pass and multi-pass tests in conjunction.

FIG. 11. Backward permutation importance test for the best performing classification ML models. Single-pass results are in the top row, while multi-pass results are in the bottom row. Each column corresponds to a different ML method: (a),(d) logistic regression; (b),(e) random forest; and (c),(f) gradient boosted trees. Bars are colored by their source: yellow for the vertically integrated liquid (VIL), red for the infrared (IR), blue for water vapor (WV), and black for visible (VIS). Number subscripts correspond to the percentile of that variable. The dashed black line is the original AUC value when all features are not shuffled.

The top five most important features for the better performing models (i.e., logistic regression, random forest, and gradient boosted trees) as determined by permutation importance are shown in Fig. 11. For all ML methods, both the single- and multi-pass tests show that the maximum vertically integrated liquid is the most important feature, while the minimum brightness temperature from the clean infrared and midtropospheric water vapor channels is found within the top five predictors (except for the multi-pass test for logistic regression). In general, the way to interpret these results is to take the consensus over all models as to which features are important.
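A minimal sketch of single-pass permutation importance using scikit-learn's built-in helper is shown below on a synthetic example. The multi-pass (backward) variant described above is not part of scikit-learn itself; it can be built by repeatedly applying this single-pass step while leaving already-ranked features shuffled.

# Single-pass permutation importance with scikit-learn on a held-out set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 4))
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(0.0, 0.5, 1500) > 0).astype(int)
X_train, X_val = X[:1000], X[1000:]
y_train, y_val = y[:1000], y[1000:]

model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# Shuffle each validation feature several times and record the drop in AUC.
result = permutation_importance(model, X_val, y_val, scoring="roc_auc",
                                n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: mean AUC drop = {result.importances_mean[i]:.3f}")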


AUGUST 2022 CHASE ET AL. 1521

FIG. 12. Accumulated local effects (ALE) for (a) the maximum vertically integrated liquid (VILmax), (b) the minimum brightness tem-
perature from infrared (IRmin), and (c) the minimum brightness temperature from the water vapor channel (WVmin). Lines correspond to
all the ML methods trained (except support vector machines) and colors match Fig. 9. Gray histograms in the background are the counts
of points in each bin.

At this point it is time to consider if the most important predictors make meteorological sense. Vertically integrated liquid has been shown to have a relationship to lightning (e.g., Watson et al. 1995) and is thus plausible to be the most important predictor. Similarly, the minimum brightness temperature at the water vapor and clean infrared channels also makes physical sense because lower temperatures are generally associated with taller storms. We could also reconcile the maximum infrared brightness temperature (Fig. 11a) as a proxy for the surface temperature, which correlates to buoyancy, but note that the relative change in AUC with this feature is quite small. Conversely, any important predictors that do not align with traditional meteorological knowledge may require further exploration to determine why the model is placing such weight on those variables. Does the predictor have some statistical correlation with the meteorological event that is unexplained by past literature, or are there nonphysical characteristics of the data that may be influencing the model during training? In the latter case, it is possible that your model might be getting the right answer for the wrong reasons.

Accumulated local effects (ALE) quantify the change in model output associated with small changes to an input feature. The goal behind ALE is to investigate the relationship between an input feature and the output. ALE is performed by binning the data based on the feature of interest. Then, for each example in each bin, the feature value is replaced by the edges of the bin. The mean difference in the model output from the replaced feature value is then used as the ALE for that bin. This process is repeated for all bins, which results in a curve. For example, the ALE for some of the top predictors from the permutation test is shown in Fig. 12. At this step, the ALEs can mainly be used to see if the ML models have learned physically plausible trends with input features. For the vertically integrated liquid, all models show that as the maximum vertically integrated liquid increases from about 2 to 30 kg m⁻² the average output probability of the model will increase, but values larger than 30 kg m⁻² generally all have the same local effect on the prediction (Fig. 12a). As for the minimum clean infrared brightness temperature, the magnitude of the average change is considerably different across the different models, but generally all have the same pattern. As the minimum temperature increases from -88° to -55°C, the mean output probability decreases; temperatures larger than -17°C have no change (Fig. 12b). Last, all models but the logistic regression show a similar pattern with the minimum water vapor brightness temperature, but notice the magnitude of the y axis (Fig. 12c). Much less change occurs with this feature. For interested readers, additional interpretation techniques and examples can be found in Molnar (2022).
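Because ALE is just binning, perturbing, and averaging, it can be sketched directly with NumPy. The helper below is a minimal first-order ALE for a single feature; it assumes model exposes predict_proba and that X_val is a pandas DataFrame, and the feature name "VILmax" is illustrative rather than taken from the code release.

import numpy as np

def ale_curve(model, X, feature, n_bins=20):
    # Bin the feature of interest on its quantiles.
    edges = np.unique(np.quantile(X[feature], np.linspace(0, 1, n_bins + 1)))
    which_bin = np.clip(np.digitize(X[feature], edges[1:-1]), 0, len(edges) - 2)
    effects, counts = [], []
    for k in range(len(edges) - 1):
        in_bin = X[which_bin == k]
        if len(in_bin) == 0:
            effects.append(0.0)
            counts.append(0)
            continue
        lo = in_bin.copy()
        hi = in_bin.copy()
        lo[feature] = edges[k]
        hi[feature] = edges[k + 1]
        # Mean change in model output when the feature moves across the bin.
        diff = model.predict_proba(hi)[:, 1] - model.predict_proba(lo)[:, 1]
        effects.append(diff.mean())
        counts.append(len(in_bin))
    ale = np.cumsum(effects)                               # accumulate the local effects
    ale -= np.average(ale, weights=np.maximum(counts, 1))  # center the curve
    return edges[1:], ale

bin_edges, ale = ale_curve(model, X_val, "VILmax")

Plotting ale against bin_edges produces curves in the spirit of Fig. 12.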
2) REGRESSION

As stated in section 3a, task 2 is to predict the number of lightning flashes inside an image. Thus, the regression methods available to do this task are linear regression, decision trees, random forests, gradient boosted trees, and support vector machines. Similar to task 1, a simple scenario is considered first, using Tb as the lone predictor. Figure 13 shows the general relationship between Tb and the number of flashes in the image. For Tb > -25°C, most images do not have any lightning, while Tb < -25°C shows a general increase of lightning flashes. Given there are a lot of images with zero flashes (approximately 50% of the total dataset; black points in Fig. 13), the linear methods will likely struggle to capture a skillful prediction.


FIG. 13. The training data relationship between the minimum brightness temperature from infrared (Tb) and the number of flashes detected by GLM. All non-thunderstorm images (number of flashes equal to 0) are in black.

One way to improve performance would be to only predict the number of flashes on images where there are nonzero flashes. While this might not seem like a viable way forward, since non-lightning cases would also need to be predicted, in practice we could leverage the very good performance of the classification model from section 3c(1) and then apply the trained regression only to images that are confidently predicted to have at least one flash in them. An example of this in the literature is Gagne et al. (2017), where hail size predictions were made only if the classification model said there was hail.
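A minimal sketch of that two-stage idea is shown below; clf (the classifier from task 1), reg (a regression model trained on lightning-only images), and the 0.5 probability threshold are placeholders for illustration and are not from the paper's code release.

import numpy as np

# Stage 1: probability that an image contains at least one flash.
p_lightning = clf.predict_proba(X_val)[:, 1]

# Stage 2: regress the flash count only where stage 1 is confident.
n_flashes = np.zeros(len(X_val))      # default prediction: no lightning, zero flashes
confident = p_lightning >= 0.5
n_flashes[confident] = reg.predict(X_val[confident])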
As before, all methods are fit on the training data initially using the default hyper-parameters. A common way to compare regression model performance is to create a one-to-one plot, which has the predicted number of flashes on the x axis and the true measured number of flashes on the y axis. A perfect model will show all points tightly centered along the diagonal of the plot. This is often the quickest qualitative assessment of how a regression model is performing.
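A one-to-one plot takes only a few lines with Matplotlib; the sketch below assumes a fitted regression model named model and validation arrays X_val and y_val.

import matplotlib.pyplot as plt

y_hat = model.predict(X_val)
fig, ax = plt.subplots()
ax.scatter(y_hat, y_val, s=2, alpha=0.3)
lim = [0, max(y_hat.max(), y_val.max())]
ax.plot(lim, lim, "k--", label="one-to-one")   # perfect predictions fall on this line
ax.set_xlabel("Predicted number of flashes")
ax.set_ylabel("Observed number of flashes")
ax.legend()
plt.show()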
While Tb was well suited for the classification of thunderstorm/no-thunderstorm, it is clear that fitting a linear model to the data in Fig. 13 did not do well (Figs. 14a,e), leading to a strong overprediction of the number of lightning flashes in images with fewer than 100 flashes, while underpredicting the number of flashes for images with more than 100 flashes. The tree-based methods tend to do better, but there is still a large amount of scatter and an overestimation for storms with fewer than 100 flashes.

To tie quantitative metrics to the performance of each model, the following common metrics are calculated: mean bias, mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R²). Their mathematical representations are the following:

bias = \frac{1}{N}\sum_{j=1}^{N}(y_j - \hat{y}_j),   (21)

MAE = \frac{1}{N}\sum_{j=1}^{N}|y_j - \hat{y}_j|,   (22)

RMSE = \sqrt{\frac{1}{N}\sum_{j=1}^{N}(y_j - \hat{y}_j)^2},   (23)

R^2 = 1 - \frac{\sum_{j=1}^{N}(y_j - \hat{y}_j)^2}{\sum_{j=1}^{N}(y_j - \bar{y})^2}.   (24)
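Equations (21)-(24) translate directly into NumPy; the helper below is a sketch in which y holds the observed flash counts and y_hat the model predictions (scikit-learn's mean_absolute_error, mean_squared_error, and r2_score give equivalent values).

import numpy as np

def regression_metrics(y, y_hat):
    bias = np.mean(y - y_hat)                                           # Eq. (21)
    mae = np.mean(np.abs(y - y_hat))                                    # Eq. (22)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))                           # Eq. (23)
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # Eq. (24)
    return bias, mae, rmse, r2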
All of these metrics are shown in Fig. 15. In general, the metrics give a more quantitative perspective than the one-to-one plots. The poor performance of the linear methods is evident, with the two worst performances being the support vector machines and linear regression with biases of 71 and 6 flashes, respectively. While no method provides remarkable performance, the random forest and gradient boosted trees perform better with this single-feature model (i.e., they show better metrics holistically).

As before, the next logical step is to use all available features to predict the number of flashes; those results are found in Figs. 16 and 17. As expected, the model performance increases. Now all models show a general correspondence between the predicted number of flashes and the true number of flashes in the one-to-one plot (Fig. 16). Meanwhile, the scatter for the random forest and gradient boosted trees has reduced considerably when compared to the single-input models (Figs. 16c,d). While the bias of the models trained with all predictors is relatively similar, the other metrics are much improved, showing large reductions in MAE and RMSE and increases in R² (Fig. 17) for all methods except decision trees. This reinforces the fact that, similar to the classification example, it is always good to compare more than one metric.

Since the initial fitting of the ML models used the default parameters, there might be room for tuning the models to have better performance. Here we will show an example of some hyper-parameter tuning of a random forest. The common parameters that can be altered in a random forest include the following: the maximum depth of the trees (i.e., the number of decisions in a tree) and the number of trees in the forest. The formal hyper-parameter search will use the full training dataset and systematically vary the depth of the trees from 1 to 10 (in increments of 1) as well as the number of trees from 1 to 100 (1, 5, 10, 25, 50, 100). This results in 60 total models that are trained.
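The sweep described above can be written as two nested loops; the sketch below uses scikit-learn's RandomForestRegressor and assumes training and validation arrays named X_train, y_train, X_val, and y_val (scikit-learn's GridSearchCV offers a cross-validated alternative to a fixed validation split).

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

results = []
for depth in range(1, 11):                   # tree depths 1 through 10
    for n_trees in [1, 5, 10, 25, 50, 100]:  # forest sizes
        rf = RandomForestRegressor(max_depth=depth, n_estimators=n_trees,
                                   random_state=42, n_jobs=-1)
        rf.fit(X_train, y_train)
        results.append({"depth": depth,
                        "n_trees": n_trees,
                        "train_mae": mean_absolute_error(y_train, rf.predict(X_train)),
                        "val_mae": mean_absolute_error(y_val, rf.predict(X_val))})
# 60 configurations; a growing gap between train and validation error flags overfitting.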


FIG. 14. The one-to-one relationship between the predicted number of lightning flashes from the ML models
trained on only Tb (x axis; ŷ) and the number of measured flashes from GLM (y axis; y). Each marker is one observation.
Meanwhile areas with more than 100 points in close proximity are shown in the colored boxes. The lighter the shade of
the color, the higher the density of points. (a) Linear regression (LnR; reds), (b) decision tree (DT; blues), (c) random
forest (RF; oranges), (d) gradient boosted trees (GBT; purples), and (e) linear support vector machines (SVM; grays).

To evaluate which is the best configuration, the same metrics as before are shown in Fig. 18 as a function of the depth of the trees. The random forest quickly gains skill with added depth beyond one, with all metrics improving for both the training (dashed lines) and validation (solid lines) datasets. Beyond a depth of four, the bias, MAE, and RMSE all stagnate, but the R² value increases until a depth of eight, beyond which only the training data continue to improve. There does not seem to be that large of an effect from increasing the number of trees beyond 10 (color change of lines). The characteristic of increasing training metric skill but no increase (or a decrease) in validation skill is the overfitting signal we discussed in section 3b. Thus, the best random forest model choice for predicting lightning flashes is a random forest with a max depth of eight and a total of 10 trees. The reason we choose 10 trees is that, in general, a simpler model is less computationally expensive to use and more interpretable than a model with 1000 trees.

FIG. 15. Validation dataset metrics for all ML models. Colors are as in Fig. 14. Exact numerical value is reported on top of each bar.

d. Testing
As mentioned before, the test dataset is the dataset you
hold out until the end when all hyper-parameter tuning has
finished so that there is no unintentional tuning of the final
model configuration to a dataset. Thus, now that we have eval-
uated the performance of all our models on the validation da-
taset it is time to run the same evaluations as in sections 3c(1)
and 3c(2). These test results are the end performance metrics
that should be interpreted as the expected ML performance
on new data (e.g., the ML applied in practice). For the ML
models here, the metrics are very similar to those from the validation set.
(For brevity the extra figures are included in the appendix
Figs. A1–A3.)
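As a sketch of this last step, the tuned configuration is refit and scored once on the held-out test split; the array names and the chosen hyper-parameters follow the regression example above and are illustrative.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

final_model = RandomForestRegressor(max_depth=8, n_estimators=10, random_state=42)
final_model.fit(X_train, y_train)      # fit with the chosen hyper-parameters
y_hat = final_model.predict(X_test)    # touch the test set exactly once
print("bias:", np.mean(y_test - y_hat))
print("MAE: ", mean_absolute_error(y_test, y_hat))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_hat)))
print("R2:  ", r2_score(y_test, y_hat))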

4. Summary and future work


This manuscript was the first of two machine learning (ML) tutorial papers designed for the operational meteorology community.


FIG. 16. As in Fig. 14, but now the x axis is provided from the ML models trained with all available input features.

This paper supplied a survey of some of the most common ML methods. All ML methods described here are considered supervised methods, meaning the data the models are trained from include pre-labeled truth data. The specific methods covered included linear regression, logistic regression, decision trees, random forests, gradient boosted decision trees, naïve Bayes, and support vector machines. The overarching goal of the paper was to introduce the ML methods in such a way that ML methods are more familiar to readers as they encounter them in the operational community and within the general meteorological literature. Moreover, this manuscript provided ample references of published meteorological examples as well as open-source code to act as catalysts for readers to adapt and try ML on their own datasets and in their workflows.

FIG. 17. As in Fig. 15, but for ML models trained with all available input features.

FIG. 18. Hyper-parameter tuning of a random forest for predicting the number of lightning flashes. All input features are used. Solid lines are the validation dataset while the dashed lines are the training data. The vertical dotted line is the depth of trees where overfitting begins.


FIG. A1. As in Fig. 9, but now for the test dataset.

Additionally, this manuscript provided a tutorial example of how to apply ML to a couple of meteorological tasks using the Storm Event Imagery (SEVIR; Veillette et al. 2020) dataset. We

1) Discussed the various steps of preparing data for ML (i.e., removing artifacts, engineering features, and training/validation/testing splits; section 3b).

2) Conducted a classification task to predict if satellite images had lightning within them. This section included discussions of training, evaluation, and interrogation of the trained ML models [section 3c(1)].

3) Exhibited a regression task to predict the number of lightning flashes in a satellite image. This section also contained discussions of training/evaluation as well as an example of hyper-parameter tuning [section 3c(2)].

4) Released python code to conduct all steps and examples in this manuscript (see data availability statement).

The follow-on paper in this series will discuss a more complex, yet potentially more powerful, grouping of ML methods: neural networks and deep learning.

FIG. A2. As in Fig. 14, but for the test dataset.


Like a lot of the ML methods described in this paper, neural networks are not necessarily new (Rumelhart et al. 1986) and were first applied to meteorology topics decades ago (e.g., Key et al. 1989; Lee et al. 1990). However, given the exponential growth of computing resources and dataset sizes, research using neural networks and deep learning in meteorology has been accelerating (e.g., Fig. 1c; Gagne et al. 2019; Lagerquist et al. 2020; Cintineo et al. 2020; Chase et al. 2021; Hilburn et al. 2021; Lagerquist et al. 2021; Molina et al. 2021; Ravuri et al. 2021). Thus, it is important that operational meteorologists also understand the basics of neural networks and deep learning.

Acknowledgments. We would like to acknowledge and thank the three anonymous reviewers who provided valuable feedback to this manuscript. This material is based upon work supported by the National Science Foundation under Grant ICER-2019758, supporting authors RJC, AM, and AB. Author DRH was provided support by NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma Cooperative Agreements NA16OAR4320115 and NA21OAR4320204, U.S. Department of Commerce. The scientific results and conclusions, as well as any views or opinions expressed herein, are those of the authors and do not necessarily reflect the views of NOAA or the Department of Commerce. We want to acknowledge the work put forth by the authors of the SEVIR dataset (Mark S. Veillette, Siddharth Samsi, and Christopher J. Mattioli) for making a high-quality free dataset. We would also like to acknowledge the open-source python community for providing their tools for free. Specifically, we acknowledge Google Colab (Bisong 2019), Anaconda (Anaconda 2020), scikit-learn (Pedregosa et al. 2011), Pandas (Wes McKinney 2010), Numpy (Harris et al. 2020), and Jupyter (Kluyver et al. 2016).

Data availability statement. As an effort to catalyze the use and trust of machine learning within meteorology, we have supplied a github repository with a code tutorial of a lot of the same things discussed in this paper. The latest version of the github repository can be located here: https://github.com/ai2es/WAF_ML_Tutorial_Part1. If you are interested in the version of the repository that was available at the time of publication, please see the zenodo archive of version 1 here: https://zenodo.org/record/6941510. The original github repo for SEVIR is located here: https://github.com/MIT-AI-Accelerator/neurips-2020-sevir.

APPENDIX

Testing Dataset Figures

This appendix contains the test dataset evaluations for both the classification task (Fig. A1) and the regression task (Figs. A2 and A3). Results are largely the same as for the validation set, so to save space they were included here.

FIG. A3. As in Fig. 15, but for the test dataset.

REFERENCES

Anaconda, 2020: Anaconda software distribution. Anaconda Inc., accessed 1 July 2022, https://docs.anaconda.com/.
Apley, D. W., and J. Zhu, 2020: Visualizing the effects of predictor variables in black box supervised learning models. J. Roy. Stat. Soc., 82B, 1059–1086, https://doi.org/10.1111/rssb.12377.
Bieli, M., A. H. Sobel, S. J. Camargo, and M. K. Tippett, 2020: A statistical model to predict the extratropical transition of tropical cyclones. Wea. Forecasting, 35, 451–466, https://doi.org/10.1175/WAF-D-19-0045.1.
Billet, J., M. DeLisi, B. Smith, and C. Gates, 1997: Use of regression techniques to predict hail size and the probability of large hail. Wea. Forecasting, 12, 154–164, https://doi.org/10.1175/1520-0434(1997)012<0154:UORTTP>2.0.CO;2.
Bischl, B., O. Mersmann, H. Trautmann, and C. Weihs, 2012: Resampling methods for meta-model validation with recommendations for evolutionary computation. Evol. Comput., 20, 249–275, https://doi.org/10.1162/EVCO_a_00069.
Bisong, E., Ed., 2019: Google colaboratory. Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners, Apress, 59–64, https://doi.org/10.1007/978-1-4842-4470-8_7.
Bonavita, M., and Coauthors, 2021: Machine learning for earth system observation and prediction. Bull. Amer. Meteor. Soc., 102, E710–E716, https://doi.org/10.1175/BAMS-D-20-0307.1.
Breiman, L., 1984: Classification and Regression Trees. Routledge, 368 pp.
——, 2001: Random forests. Mach. Learn., 45, 5–32, https://doi.org/10.1023/A:1010933404324.
Burke, A., N. Snook, D. J. Gagne II, S. McCorkle, and A. McGovern, 2020: Calibration of machine learning–based probabilistic hail predictions for operational forecasting. Wea. Forecasting, 35, 149–168, https://doi.org/10.1175/WAF-D-19-0105.1.
Cains, M. G., and Coauthors, 2022: NWS forecasters' perceptions and potential uses of trustworthy AI/ML for hazardous weather risks. 21st Conf. on Artificial Intelligence for Environmental Science, Houston, TX, Amer. Meteor. Soc., 1.3, https://ams.confex.com/ams/102ANNUAL/meetingapp.cgi/Paper/393121.
Chase, R. J., S. W. Nesbitt, and G. M. McFarquhar, 2021: A dual-frequency radar retrieval of two parameters of the snowfall particle size distribution using a neural network. J. Appl. Meteor. Climatol., 60, 341–359, https://doi.org/10.1175/JAMC-D-20-0177.1.


Chisholm, D., J. Ball, K. Veigas, and P. Luty, 1968: The diagnosis of upper-level humidity. J. Appl. Meteor., 7, 613–619, https://doi.org/10.1175/1520-0450(1968)007<0613:TDOULH>2.0.CO;2.
Cintineo, J. L., M. Pavolonis, J. Sieglaff, and D. Lindsey, 2014: An empirical model for assessing the severe weather potential of developing convection. Wea. Forecasting, 29, 639–653, https://doi.org/10.1175/WAF-D-13-00113.1.
——, and Coauthors, 2018: The NOAA/CIMSS ProbSevere model: Incorporation of total lightning and validation. Wea. Forecasting, 33, 331–345, https://doi.org/10.1175/WAF-D-17-0099.1.
——, M. J. Pavolonis, J. M. Sieglaff, L. Cronce, and J. Brunner, 2020: NOAA ProbSevere v2.0—ProbHail, ProbWind, and ProbTor. Wea. Forecasting, 35, 1523–1543, https://doi.org/10.1175/WAF-D-19-0242.1.
Conrick, R., J. P. Zagrodnik, and C. F. Mass, 2020: Dual-polarization radar retrievals of coastal Pacific Northwest raindrop size distribution parameters using random forest regression. J. Atmos. Oceanic Technol., 37, 229–242, https://doi.org/10.1175/JTECH-D-19-0107.1.
Cui, W., X. Dong, B. Xi, and Z. Feng, 2021: Climatology of linear mesoscale convective system morphology in the United States based on the random-forests method. J. Climate, 34, 7257–7276, https://doi.org/10.1175/JCLI-D-20-0862.1.
Czernecki, B., M. Taszarek, M. Marosz, M. Półrolniczak, L. Kolendowicz, A. Wyszogrodzki, and J. Szturc, 2019: Application of machine learning to large hail prediction—The importance of radar reflectivity, lightning occurrence and convective parameters derived from ERA5. Atmos. Res., 227, 249–262, https://doi.org/10.1016/j.atmosres.2019.05.010.
Elmore, K. L., and H. Grams, 2016: Using mPING data to generate random forests for precipitation type forecasts. 14th Conf. on Artificial and Computational Intelligence and its Applications to the Environmental Sciences, New Orleans, LA, Amer. Meteor. Soc., 4.2, https://ams.confex.com/ams/96Annual/webprogram/Paper289684.html.
Flora, M. L., C. K. Potvin, P. S. Skinner, S. Handler, and A. McGovern, 2021: Using machine learning to generate storm-scale probabilistic guidance of severe weather hazards in the Warn-on-Forecast system. Mon. Wea. Rev., 149, 1535–1557, https://doi.org/10.1175/MWR-D-20-0194.1.
Friedman, J., 2001: Greedy function approximation: A gradient boosting machine. Ann. Stat., 29, 1189–1232, https://doi.org/10.1214/aos/1013203451.
Gagne, D., A. McGovern, and J. Brotzge, 2009: Classification of convective areas using decision trees. J. Atmos. Oceanic Technol., 26, 1341–1353, https://doi.org/10.1175/2008JTECHA1205.1.
——, ——, ——, and M. Xue, 2013: Severe hail prediction within a spatiotemporal relational data mining framework. 13th Int. Conf. on Data Mining, Dallas, TX, Institute of Electrical and Electronics Engineers, 994–1001, https://doi.org/10.1109/ICDMW.2013.121.
——, ——, S. Haupt, R. Sobash, J. Williams, and M. Xue, 2017: Storm-based probabilistic hail forecasting with machine learning applied to convection-allowing ensembles. Wea. Forecasting, 32, 1819–1840, https://doi.org/10.1175/WAF-D-17-0010.1.
——, H. Christensen, A. Subramanian, and A. Monahan, 2019: Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz '96 model. J. Adv. Model. Earth Syst., 12, e2019MS001896, https://doi.org/10.1029/2019MS001896.
Gensini, V. A., C. Converse, W. S. Ashley, and M. Taszarek, 2021: Machine learning classification of significant tornadoes and hail in the United States using ERA5 proximity soundings. Wea. Forecasting, 36, 2143–2160, https://doi.org/10.1175/WAF-D-21-0056.1.
Glahn, H. R., and D. A. Lowry, 1972: The use of Model Output Statistics (MOS) in objective weather forecasting. J. Appl. Meteor., 11, 1203–1211, https://doi.org/10.1175/1520-0450(1972)011<1203:TUOMOS>2.0.CO;2.
Goodfellow, I., Y. Bengio, and A. Courville, 2016: Deep Learning. MIT Press, 800 pp., http://www.deeplearningbook.org.
Grams, H. M., P.-E. Kirstetter, and J. J. Gourley, 2016: Naïve Bayesian precipitation type retrieval from satellite using a cloud-top and ground-radar matched climatology. J. Hydrometeor., 17, 2649–2665, https://doi.org/10.1175/JHM-D-16-0058.1.
Harris, C. R., and Coauthors, 2020: Array programming with NumPy. Nature, 585, 357–362, https://doi.org/10.1038/s41586-020-2649-2.
Herman, G., and R. Schumacher, 2018a: Dendrology in numerical weather prediction: What random forests and logistic regression tell us about forecasting. Mon. Wea. Rev., 146, 1785–1812, https://doi.org/10.1175/MWR-D-17-0307.1.
——, and ——, 2018b: Money doesn't grow on trees, but forecasts do: Forecasting extreme precipitation with random forests. Mon. Wea. Rev., 146, 1571–1600, https://doi.org/10.1175/MWR-D-17-0250.1.
Hilburn, K. A., I. Ebert-Uphoff, and S. D. Miller, 2021: Development and interpretation of a neural-network-based synthetic radar reflectivity estimator using GOES-R satellite observations. J. Appl. Meteor. Climatol., 60, 3–21, https://doi.org/10.1175/JAMC-D-20-0084.1.
Hill, A. J., and R. S. Schumacher, 2021: Forecasting excessive rainfall with random forests and a deterministic convection-allowing model. Wea. Forecasting, 36, 1693–1711, https://doi.org/10.1175/WAF-D-21-0026.1.
——, G. R. Herman, and R. S. Schumacher, 2020: Forecasting severe weather with random forests. Mon. Wea. Rev., 148, 2135–2161, https://doi.org/10.1175/MWR-D-19-0344.1.
Hoerl, A. E., and R. W. Kennard, 1970: Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67, https://doi.org/10.1080/00401706.1970.10488634.
Holte, R. C., 1993: Very simple classification rules perform well on most commonly used datasets. Mach. Learn., 11, 63–90, https://doi.org/10.1023/A:1022631118932.
Hu, L., E. A. Ritchie, and J. S. Tyo, 2020: Short-term tropical cyclone intensity forecasting from satellite imagery based on the deviation angle variance technique. Wea. Forecasting, 35, 285–298, https://doi.org/10.1175/WAF-D-19-0102.1.
Hubbert, J. C., M. Dixon, S. M. Ellis, and G. Meymaris, 2009: Weather radar ground clutter. Part I: Identification, modeling, and simulation. J. Atmos. Oceanic Technol., 26, 1165–1180, https://doi.org/10.1175/2009JTECHA1159.1.
Jergensen, G. E., A. McGovern, R. Lagerquist, and T. Smith, 2020: Classifying convective storms using machine learning. Wea. Forecasting, 35, 537–559, https://doi.org/10.1175/WAF-D-19-0170.1.
Kalnay, E., 2002: Atmospheric Modeling, Data Assimilation and Predictability. Cambridge University Press, 341 pp., https://doi.org/10.1017/CBO9780511802270.
Key, J., J. Maslanik, and A. Schweiger, 1989: Classification of merged AVHRR and SMMR Arctic data with neural networks. Photogramm. Eng. Remote Sens., 55, 1331–1338.


Kluyver, T., and Coauthors, 2016: Jupyter Notebooks—A publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas, F. Loizides and B. Schmidt, Eds., IOS Press, 87–90.
Kossin, J. P., and M. Sitkowski, 2009: An objective model for identifying secondary eyewall formation in hurricanes. Mon. Wea. Rev., 137, 876–892, https://doi.org/10.1175/2008MWR2701.1.
Kühnlein, M., T. Appelhans, B. Thies, and T. Nauß, 2014: Precipitation estimates from MSG SEVIRI daytime, nighttime, and twilight data with random forests. J. Appl. Meteor. Climatol., 53, 2457–2480, https://doi.org/10.1175/JAMC-D-14-0082.1.
Kuncheva, L. I., 2006: On the optimality of naïve Bayes with dependent binary features. Pattern Recognit. Lett., 27, 830–837, https://doi.org/10.1016/j.patrec.2005.12.001.
Kurdzo, J. M., E. F. Joback, P.-E. Kirstetter, and J. Y. N. Cho, 2020: Geospatial QPE accuracy dependence on weather radar network configurations. J. Appl. Meteor. Climatol., 59, 1773–1792, https://doi.org/10.1175/JAMC-D-19-0164.1.
Lackmann, G., Ed., 2011: Numerical weather prediction/data assimilation. Midlatitude Synoptic Meteorology: Dynamics, Analysis, and Forecasting, Amer. Meteor. Soc., 274–287.
Lagerquist, R., A. McGovern, and T. Smith, 2017: Machine learning for real-time prediction of damaging straight-line convective wind. Wea. Forecasting, 32, 2175–2193, https://doi.org/10.1175/WAF-D-17-0038.1.
——, ——, C. R. Homeyer, D. J. Gagne II, and T. Smith, 2020: Deep learning on three-dimensional multiscale data for next-hour tornado prediction. Mon. Wea. Rev., 148, 2837–2861, https://doi.org/10.1175/MWR-D-19-0372.1.
——, J. Q. Stewart, I. Ebert-Uphoff, and C. Kumler, 2021: Using deep learning to nowcast the spatial coverage of convection from Himawari-8 satellite data. Mon. Wea. Rev., 149, 3897–3921, https://doi.org/10.1175/MWR-D-21-0096.1.
Lakshmanan, V., C. Karstens, J. Krause, K. Elmore, A. Ryzhkov, and S. Berkseth, 2015: Which polarimetric variables are important for weather/no-weather discrimination? J. Atmos. Oceanic Technol., 32, 1209–1223, https://doi.org/10.1175/JTECH-D-13-00205.1.
Lee, C.-Y., S. J. Camargo, F. Vitart, A. H. Sobel, J. Camp, S. Wang, M. K. Tippett, and Q. Yang, 2020: Subseasonal predictions of tropical cyclone occurrence and ACE in the S2S dataset. Wea. Forecasting, 35, 921–938, https://doi.org/10.1175/WAF-D-19-0217.1.
Lee, J., R. Weger, S. Sengupta, and R. Welch, 1990: A neural network approach to cloud classification. IEEE Trans. Geosci. Remote Sens., 28, 846–855, https://doi.org/10.1109/36.58972.
Li, L., and Coauthors, 2020: A causal inference model based on random forests to identify the effect of soil moisture on precipitation. J. Hydrometeor., 21, 1115–1131, https://doi.org/10.1175/JHM-D-19-0209.1.
Loken, E. D., A. J. Clark, and C. D. Karstens, 2020: Generating probabilistic next-day severe weather forecasts from convection-allowing ensembles using random forests. Wea. Forecasting, 35, 1605–1631, https://doi.org/10.1175/WAF-D-19-0258.1.
——, ——, and A. McGovern, 2022: Comparing and interpreting differently designed random forests for next-day severe weather hazard prediction. Wea. Forecasting, 37, 871–899, https://doi.org/10.1175/WAF-D-21-0138.1.
Malone, T., 1955: Application of statistical methods in weather prediction. Proc. Natl. Acad. Sci. USA, 41, 806–815, https://doi.org/10.1073/pnas.41.11.806.
Mao, Y., and A. Sorteberg, 2020: Improving radar-based precipitation nowcasts with machine learning using an approach based on random forest. Wea. Forecasting, 35, 2461–2478, https://doi.org/10.1175/WAF-D-20-0080.1.
McCorkel, J., J. Van Naarden, D. Lindsey, B. Efremova, M. Coakley, M. Black, and A. Krimchansky, 2019: GOES-17 advanced baseline imager performance recovery summary. 2019 IEEE Int. Geoscience and Remote Sensing Symp. (IGARSS 2019), Yokohama, Japan, Institute of Electrical and Electronics Engineers, 1–4, https://doi.org/10.1109/IGARSS40859.2019.9044466.
McGovern, A., D. Gagne, J. Williams, R. Brown, and J. Basara, 2014: Enhancing understanding and improving prediction of severe weather through spatiotemporal relational learning. Mach. Learn., 95, 27–50, https://doi.org/10.1007/s10994-013-5343-x.
——, ——, J. Basara, T. Hamill, and D. Margolin, 2015: Solar energy prediction: An international contest to initiate interdisciplinary research on compelling meteorological problems. Bull. Amer. Meteor. Soc., 96, 1388–1395, https://doi.org/10.1175/BAMS-D-14-00006.1.
——, R. Lagerquist, D. Gagne, G. Jergensen, K. Elmore, C. Homeyer, and T. Smith, 2019: Making the black box more transparent: Understanding the physical implications of machine learning. Bull. Amer. Meteor. Soc., 100, 2175–2199, https://doi.org/10.1175/BAMS-D-18-0195.1.
——, I. Ebert-Uphoff, D. J. Gagne II, and A. Bostrom, 2021: The need for ethical, responsible, and trustworthy artificial intelligence for environmental sciences. arXiv, 2112.08453, https://arxiv.org/abs/2112.08453.
McKinney, W., 2010: Data structures for statistical computing in Python. Proceedings of the Ninth Python in Science Conference, S. van der Walt and J. Millman, Eds., 56–61, https://doi.org/10.25080/Majora-92bf1922-00a.
Mecikalski, J., J. Williams, C. Jewett, D. Ahijevych, A. LeRoy, and J. Walker, 2015: Probabilistic 0–1-h convective initiation nowcasts that combine geostationary satellite observations and numerical weather prediction model data. J. Appl. Meteor. Climatol., 54, 1039–1059, https://doi.org/10.1175/JAMC-D-14-0129.1.
Molina, M. J., D. J. Gagne, and A. F. Prein, 2021: A benchmark to test generalization capabilities of deep learning methods to classify severe convective storms in a changing climate. Earth Space Sci., 8, e2020EA001490, https://doi.org/10.1029/2020EA001490.
Molnar, C., 2022: Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. 2nd ed. 329 pp., https://christophm.github.io/interpretable-ml-book.
Muñoz-Esparza, D., R. D. Sharman, and W. Deierling, 2020: Aviation turbulence forecasting at upper levels with machine learning techniques based on regression trees. J. Appl. Meteor. Climatol., 59, 1883–1899, https://doi.org/10.1175/JAMC-D-20-0116.1.
Murphy, A. H., 1993: What is a good forecast? An essay on the nature of goodness in weather forecasting. Wea. Forecasting, 8, 281–293, https://doi.org/10.1175/1520-0434(1993)008<0281:WIAGFA>2.0.CO;2.
Neetu, S., M. Lengaigne, J. Vialard, M. Mangeas, C. Menkes, I. Suresh, J. Leloup, and J. Knaff, 2020: Quantifying the benefits of nonlinear methods for global statistical hindcasts of tropical cyclones intensity. Wea. Forecasting, 35, 807–820, https://doi.org/10.1175/WAF-D-19-0163.1.


Nowotarski, C. J., and A. A. Jensen, 2013: Classifying proximity soundings with self-organizing maps toward improving supercell and tornado forecasting. Wea. Forecasting, 28, 783–801, https://doi.org/10.1175/WAF-D-12-00125.1.
Pedregosa, F., and Coauthors, 2011: Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12, 2825–2830.
Peter, J. R., A. Seed, and P. J. Steinle, 2013: Application of a Bayesian classifier of anomalous propagation to single-polarization radar reflectivity data. J. Atmos. Oceanic Technol., 30, 1985–2005, https://doi.org/10.1175/JTECH-D-12-00082.1.
Quinlan, J., 1993: C4.5: Programs for Machine Learning. Morgan Kaufmann, 302 pp.
Ravuri, S., and Coauthors, 2021: Skilful precipitation nowcasting using deep generative models of radar. Nature, 597, 672–677, https://doi.org/10.1038/s41586-021-03854-z.
Roebber, P., 2009: Visualizing multiple measures of forecast quality. Wea. Forecasting, 24, 601–608, https://doi.org/10.1175/2008WAF2222159.1.
Rumelhart, D. E., G. E. Hinton, and R. J. Williams, 1986: Learning representations by back-propagating errors. Nature, 323, 533–536, https://doi.org/10.1038/323533a0.
Schumacher, R. S., A. J. Hill, M. Klein, J. A. Nelson, M. J. Erickson, S. M. Trojniak, and G. R. Herman, 2021: From random forests to flood forecasts: A research to operations success story. Bull. Amer. Meteor. Soc., 102, E1742–E1755, https://doi.org/10.1175/BAMS-D-20-0186.1.
Sessa, M. F., and R. J. Trapp, 2020: Observed relationship between tornado intensity and pretornadic mesocyclone characteristics. Wea. Forecasting, 35, 1243–1261, https://doi.org/10.1175/WAF-D-19-0099.1.
Shield, S. A., and A. L. Houston, 2022: Diagnosing supercell environments: A machine learning approach. Wea. Forecasting, 37, 771–785, https://doi.org/10.1175/WAF-D-21-0098.1.
Taillardat, M., A.-L. Fougères, P. Naveau, and O. Mestre, 2019: Forest-based and semiparametric methods for the postprocessing of rainfall ensemble forecasting. Wea. Forecasting, 34, 617–634, https://doi.org/10.1175/WAF-D-18-0149.1.
Tibshirani, R., 1996: Regression shrinkage and selection via the lasso. J. Roy. Stat. Soc., 58B, 267–288, https://doi.org/10.1111/j.2517-6161.1996.tb02080.x.
Vapnik, V., 1963: Pattern recognition using generalized portrait method. Autom. Remote Control, 24, 774–780.
Veillette, M., S. Samsi, and C. Mattioli, 2020: SEVIR: A storm event imagery dataset for deep learning applications in radar and satellite meteorology. Advances in Neural Information Processing Systems, H. Larochelle et al., Eds., Vol. 33, Curran Associates, Inc., 22 009–22 019, https://proceedings.neurips.cc/paper/2020/file/fa78a16157fed00d7a80515818432169-Paper.pdf.
Vigaud, N., M. K. Tippett, J. Yuan, A. W. Robertson, and N. Acharya, 2019: Probabilistic skill of subseasonal surface temperature forecasts over North America. Wea. Forecasting, 34, 1789–1806, https://doi.org/10.1175/WAF-D-19-0117.1.
Wang, C., P. Wang, D. Wang, J. Hou, and B. Xue, 2020: Nowcasting multicell short-term intense precipitation using graph models and random forests. Mon. Wea. Rev., 148, 4453–4466, https://doi.org/10.1175/MWR-D-20-0050.1.
Watson, A. I., R. L. Holle, and R. E. López, 1995: Lightning from two national detection networks related to vertically integrated liquid and echo-top information from WSR-88D radar. Wea. Forecasting, 10, 592–605, https://doi.org/10.1175/1520-0434(1995)010<0592:LFTNDN>2.0.CO;2.
Williams, J., 2014: Using random forests to diagnose aviation turbulence. Mach. Learn., 95, 51–70, https://doi.org/10.1007/s10994-013-5346-7.
——, D. Ahijevych, S. Dettling, and M. Steiner, 2008a: Combining observations and model data for short-term storm forecasting. Proc. SPIE, 7088, 708805, https://doi.org/10.1117/12.795737.
——, R. Sharman, J. Craig, and G. Blackburn, 2008b: Remote detection and diagnosis of thunderstorm turbulence. Proc. SPIE, 7088, 708804, https://doi.org/10.1117/12.795570.
Yang, L., H. Xu, and S. Yu, 2021: Estimating PM2.5 concentrations in contiguous eastern coastal zone of China using MODIS AOD and a two-stage random forest model. J. Atmos. Oceanic Technol., 38, 2071–2080, https://doi.org/10.1175/JTECH-D-20-0214.1.
Yoshida, S., T. Morimoto, T. Ushio, and Z. Kawasaki, 2009: A fifth-power relationship for lightning activity from Tropical Rainfall Measuring Mission satellite observations. J. Geophys. Res., 114, D09104, https://doi.org/10.1029/2008JD010370.
Zhang, Z., D. Wang, J. Qiu, J. Zhu, and T. Wang, 2021: Machine learning approaches for improving near-real-time IMERG rainfall estimates by integrating cloud properties from NOAA CDR PATMOS-x. J. Hydrometeor., 22, 2767–2781, https://doi.org/10.1175/JHM-D-21-0019.1.
Zou, H., and T. Hastie, 2005: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc., 67B, 301–320, https://doi.org/10.1111/j.1467-9868.2005.00503.x.
