Notes 6: Classification
Padhraic Smyth,
Department of Computer Science
University of California, Irvine
March 2023
As with regression, in the standard classification problem we have a dataset D = {(xi , yi )}, i = 1, . . . , N
consisting of IID samples from some underlying unknown distribution p(x, y) = p(y|x)p(x). We are
interested in learning a model for p(y|x) that can predict a value or distribution for y given a vector of input
values x. However, unlike regression, the Y variable is now categorical (discrete), in general taking one of
K possible values or class labels, i.e., y ∈ {1, . . . , K}. These K values will typically be mapped to semantic
categories that have a particular meaning, e.g., for an image classification problem y = 1 might correspond
to the class “dog”, y = 2 might be the class “cat”, and so on: we will index the class labels as k = 1, . . . , K
to maintain generality. The x input vector to our classification model will in general be d-dimensional with
real or integer-valued components, where one of the components typically is the constant value 1 so that our
model has an intercept term (as in regression modeling).
For classification applications we are typically interested in models that can produce estimates of class
probabilities, since having an estimate of the conditional probability of a particular class k given an in-
put x is very useful in many practical applications. More specifically, let p(y = k|x) be the unknown
true conditional probability of label k given input x: this is what we want to model or learn. Let
f (x; θ) = (f1 (x; θ), . . . , fK (x; θ)) be our model’s estimate of the K-ary vector of true conditional class
probabilities p(y = k|x), k = 1, . . . , K. This is similar to the multinomial model (discussed in earlier notes)
but where now the K conditional probabilities for y are allowed to change as a function of x. Because we are
interpreting the K outputs of our model as conditional probabilities, we will need to impose constraints on the f_k values, i.e., in particular we will need \sum_{k=1}^{K} f_k(x; θ) = 1 and 0 ≤ f_k(x; θ) ≤ 1 for k = 1, . . . , K (we will describe a general way to enforce these constraints below).
Technically, we only ever need to estimate K − 1 class probabilities, and so our prediction model could just have K − 1 outputs instead of K, i.e., f(x; θ) = (f_1(x; θ), . . . , f_{K−1}(x; θ)), with f_K(x; θ) recovered as 1 − \sum_{k=1}^{K-1} f_k(x; θ). In practice, however, it is common in machine learning, for K > 2, to ignore this constraint and build/train models with K outputs rather than K − 1 (and this generally doesn't cause any problems). The exception in machine learning is the special case of K = 2, i.e., binary
classification. For K = 2 it is common in machine learning to just have a single output for the classification
model, namely f (x; θ), which approximates the true p(y = 1|x), with p(y = 0|x) being approximated by
1 − f (x; θ). (The K = 2 case is also different to K > 2 in terms of notation in that it is traditional to index
the labels as {0, 1} rather than {1, 2} but this is not an important detail).
There is a broad range of classification models whose outputs can be written in the general form

f_k(x; \theta) = \frac{\exp\big(z_k(x; \theta)\big)}{\sum_{j=1}^{K} \exp\big(z_j(x; \theta)\big)}, \qquad k = 1, \ldots, K,
where the zk (x; θ)’s are scalar-valued functions of the input that can lie anywhere on the real line, i.e.,
zk (x; θ) ∈ R. This formulation allows us to use any unconstrained modeling method we want in order to
generate the z_k(x; θ)'s, e.g., z_k(x; θ) could be a linear weighted sum that can take positive or negative values. The equation above is then just a convenient way to transform these z's to be positive (numerator) and to ensure that they sum to 1 (denominator), i.e., to satisfy our constraint that the outputs of our model, the f_k's, are valid estimates of a conditional probability distribution. The functional form above is sometimes
referred to as the “softmax” function in machine learning, and shows up in statistics as the functional form
for multinomial logistic regression models.
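As a concrete illustration, here is a minimal sketch (in Python/NumPy; the function name is ours, not from any particular library) of the softmax transformation, including the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Map a vector of K real-valued scores z_k to probabilities f_k
    that are positive and sum to 1 (the softmax transformation)."""
    z = np.asarray(z, dtype=float)
    z = z - np.max(z)              # subtract the max for numerical stability
    expz = np.exp(z)
    return expz / np.sum(expz)

# Example: three unconstrained scores become a valid probability vector.
f = softmax([2.0, -1.0, 0.5])
print(f, f.sum())                  # the probabilities sum to 1
```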
Two of the most well-known models in machine learning of this general form are discussed below. Note
that these two classes of models parallel the use of linear and neural models for regression (see NoteSet 5), but
where now we have the additional aspect of generating K outputs (conditional probability estimates) via the
softmax equation above.
1. Logistic (linear) Classifiers: (also known as logistic regression). For a logistic classifier, the functions z_k(x; θ) are linear, i.e., z_k(x; θ) = β_k^T x, where β_k is a d-dimensional vector of weights, one per class, and the total set of parameters is θ = (β_1, . . . , β_K). This classifier is very simple: given an input x, compute a weighted sum of the inputs, one weighted sum per class (using weights β_k for class k), and then convert the outputs to probabilities using the softmax operation above. Even though the output stage of the model (the softmax transformation) is non-linear, the logistic classifier is usually considered to be a linear classifier since it produces linear decision boundaries between the classes (which we discuss later) and (equivalently) it is computed from linear functions of the inputs. However, it is important to note that its functional form (the f_k's) is not linear in the parameters (due to the softmax operation).
Note also that a direct extension of the logistic model is to augment the input x with additional predefined variables that are functions of x, giving an augmented input x′, e.g., x′ = (1, x_1, x_1^2, x_2, x_2^2, . . .). This is useful, for example, for low-dimensional problems where we want additional flexibility, i.e., non-linearity in the original x space (but still linearity in the augmented x′ space).
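To make the prediction step of the logistic (softmax) classifier concrete, here is a minimal sketch; the function and variable names are illustrative, and the weight matrix B simply stacks the class weight vectors β_k as rows:

```python
import numpy as np

def logistic_predict_proba(x, B):
    """Class probabilities for a logistic (softmax) classifier.
    x: input vector of length d (first component is 1 for the intercept).
    B: K x d matrix whose k-th row is the class weight vector beta_k."""
    z = B @ x                      # K linear scores z_k = beta_k^T x
    z = z - np.max(z)              # numerical stability
    expz = np.exp(z)
    return expz / expz.sum()       # softmax: K class probabilities

# Toy example with K = 3 classes and d = 3 inputs (intercept included).
B = np.array([[ 0.1,  1.0, -0.5],
              [ 0.0, -0.3,  0.8],
              [-0.2,  0.4,  0.1]])
x = np.array([1.0, 2.0, -1.0])
print(logistic_predict_proba(x, B))
```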
2. Neural (nonlinear) Classifiers: This is a very broad class of models, particularly in the area of deep learning, but for our purposes we will view them as being defined as follows: z_k(x; θ) = β_k^T g(x; ϕ), where the z_k's are then transformed to be probabilities using the softmax operation above.
Here g(x; ϕ) is an h-dimensional vector representing some hidden representation that the neural network has learned, one that transforms the d-dimensional input x, using parameters ϕ, into a set of h real-valued numbers represented by the vector-valued g(x; ϕ). For example, if x corresponds to a set of pixels in an image (with a large value of d) then g(x; ϕ) could be some much lower-dimensional representation of x that is (perhaps) translation- and scale-invariant in terms of the classification problem (see Footnote 1).

Footnote 1: For readers familiar with neural networks, g(x; ϕ) would typically be the last hidden layer in the neural model before the output. By characterizing the neural model via g(x; ϕ), we are hiding a huge amount of detail in terms of how these neural models are built, particularly in deep learning; for example, in image classification and in language modeling (where the “class” y is the identity of the next word in a sequence), ϕ could represent billions of parameters (weights) and the function g(x; ϕ) could be very complex (e.g., containing transformer components, etc.). But we are ignoring (on purpose) this level of detail here.
The parameters of the overall neural network model can be defined as θ = {β 1 , . . . , β K , ϕ}, where the
β vectors play the same role as the class-specific weight vectors in a logistic model, and the ϕ parameters
are the feature extraction part of the model (transforming the original inputs x into a representation g that is
useful for prediction).
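As an illustrative sketch (our own construction, not a recipe for any particular deep learning framework), a one-hidden-layer neural classifier of this form could look like the following:

```python
import numpy as np

def neural_predict_proba(x, W, b, B):
    """One-hidden-layer neural classifier.
    W (h x d) and b (h): the parameters phi of the hidden representation g(x; phi).
    B (K x h): the class weight vectors beta_k applied to the hidden features."""
    g = np.tanh(W @ x + b)         # hidden representation g(x; phi), h-dimensional
    z = B @ g                      # K real-valued scores z_k = beta_k^T g(x; phi)
    z = z - np.max(z)              # numerical stability
    expz = np.exp(z)
    return expz / expz.sum()       # softmax over the K scores

# Toy example: d = 4 inputs, h = 5 hidden units, K = 3 classes.
rng = np.random.default_rng(0)
W, b, B = rng.normal(size=(5, 4)), rng.normal(size=5), rng.normal(size=(3, 5))
print(neural_predict_proba(rng.normal(size=4), W, b, B))
```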
What loss function should we use to optimize the parameters θ for a classifier model such as a logistic or
neural network model? One way to approach this is to define a conditional likelihood for our problem. Our
data consists of D = (D_x, D_y) = {(x_i, y_i)}, 1 ≤ i ≤ N, and we want to work with the conditional likelihood p(D_y | D_x, θ). Assuming IID data samples from p(x, y) we can define a multinomial likelihood of the form

L(\theta) = \prod_{i=1}^{N} p(y_i \mid x_i; \theta)
where y_i ∈ {1, . . . , K}. The terms p(y_i | x_i; θ) represent our model-based probabilities, namely the f_k functions, parametrized by θ, that are estimates of the true (unknown) probabilities p(y_i | x_i). In particular, if
yi = k, then we want to use fk (xi ; θ) as the corresponding probability from the model, i.e., if yi = k, then
according to our model p(yi |xi , θ) = fk (xi ; θ). We can represent this notationally by the use of an indicator
function I(yi , k) which takes value 1 if yi = k and value 0 otherwise. Thus, our likelihood becomes:
L(\theta) = \prod_{i=1}^{N} p(y_i \mid x_i; \theta) = \prod_{i=1}^{N} \prod_{k=1}^{K} f_k(x_i; \theta)^{I(y_i, k)}.
In effect the indicator function is selecting the appropriate output of the model to use in the likelihood for each example i: for each datapoint i, the output k such that y_i = k will have value f_k(x_i; θ)^1 = f_k(x_i; θ), and all the other K − 1 terms will have value f_k(x_i; θ)^0 = 1 and be factored out. Another way to write this would be as

L(\theta) = \prod_{k=1}^{K} \; \prod_{i: y_i = k} f_k(x_i; \theta),
which creates K separate products in the likelihood, one per class k. We can see that maximizing this
likelihood L(θ) corresponds to having the model give as high a probability as possible (e.g., close to 1) for
each true class label yi in the training data, where the prediction is generated as a function of the input xi .
We can rewrite this in the form of a log-likelihood as

l(\theta) = \sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i, k) \log f_k(x_i; \theta).
The negative of this quantity, scaled by 1/N, is the empirical risk function used widely for classification problems in machine learning. Thus, minimizing this risk (as a function of θ) is equivalent to maximizing the conditional log-likelihood: we have R(\theta) = \frac{1}{N} \sum_{i=1}^{N} \Delta(y_i, f(x_i; \theta)) where \Delta(y_i, f(x_i; \theta)) = \sum_{k} I(y_i, k) \log \frac{1}{f_k(x_i; \theta)}. This loss is
often referred to as the log-loss or cross-entropy loss in machine learning. For convenience we will define

CE(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i, k) \log \frac{1}{f_k(x_i; \theta)},
where the binary indicator terms I(y_i, k) select the appropriate output of the model to use within the sum for each datapoint.
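As a small illustrative sketch (the function name and array layout are ours), CE(θ) can be computed from a matrix of model probabilities and the integer class labels as follows:

```python
import numpy as np

def cross_entropy(F, y):
    """Cross-entropy (log-loss) CE(theta).
    F: N x K array with F[i, k] = f_k(x_i; theta); each row sums to 1.
    y: length-N array of integer class labels in {0, ..., K-1}."""
    N = len(y)
    # The indicator I(y_i, k) simply picks out F[i, y_i] for each datapoint.
    picked = F[np.arange(N), y]
    return np.mean(np.log(1.0 / picked))   # equivalently, -np.mean(np.log(picked))

# Toy example: 3 datapoints, K = 2 classes.
F = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
y = np.array([0, 1, 0])
print(cross_entropy(F, y))
```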
So, what we have shown above is that finding the parameters that minimize the well-known log-loss/cross-
entropy loss function (used to train many machine learning classifiers, e.g., in image classification, in lan-
guage modeling, etc), corresponds directly to finding the parameters that maximize the log-likelihood of a
multinomial likelihood model with an IID assumption. As long as our model f(x; θ) is differentiable as a function of θ, we can compute gradients with respect to the parameters (the components of θ) and use any of a wide variety of gradient-based methods to maximize l(θ) (or, equivalently, minimize CE(θ)).
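For example, for the logistic (softmax) classifier the gradient of CE(θ) with respect to β_k has the well-known form \frac{1}{N} \sum_i (f_k(x_i; θ) − I(y_i, k)) x_i, which leads to a simple batch gradient descent sketch such as the following (the function name, learning rate, and iteration count are illustrative choices, not prescriptions):

```python
import numpy as np

def fit_logistic_gd(X, y, K, lr=0.1, n_iters=500):
    """Batch gradient descent on the cross-entropy loss for a logistic
    (softmax) classifier. X: N x d (include a column of 1s for the
    intercept), y: integer labels in {0, ..., K-1}."""
    N, d = X.shape
    B = np.zeros((K, d))                    # one weight vector beta_k per class
    Y = np.eye(K)[y]                        # N x K one-hot indicators I(y_i, k)
    for _ in range(n_iters):
        Z = X @ B.T                         # N x K scores z_k(x_i)
        Z -= Z.max(axis=1, keepdims=True)   # numerical stability
        F = np.exp(Z)
        F /= F.sum(axis=1, keepdims=True)   # N x K probabilities f_k(x_i)
        grad = (F - Y).T @ X / N            # K x d gradient of CE(theta)
        B -= lr * grad                      # gradient descent update
    return B
```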
Without any regularization in CE(θ) (or any priors in our likelihood-based setup) this optimization is a
maximum likelihood estimation procedure. With priors, as with regression, the log of the prior will show up
as an additional regularization term added to CE(θ) (e.g., proportional to \sum_j \theta_j^2 for Gaussian priors, i.e., L2 regularization). Minimization of CE(θ) + λ r(θ), where λ is the relative weight of the regularization term and r(θ) is the regularization function itself (e.g., r(θ) = \sum_j \theta_j^2), is equivalent to maximizing the product of the likelihood times the prior, i.e., performing MAP estimation.
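To make this correspondence explicit, here is a short derivation, a sketch under the assumption of independent zero-mean Gaussian priors with variance σ² on each parameter θ_j:

\log p(\theta \mid D_y, D_x) = l(\theta) + \log p(\theta) + \text{const} = -N \, CE(\theta) - \frac{1}{2\sigma^2} \sum_j \theta_j^2 + \text{const},

so maximizing the log-posterior over θ is the same as minimizing CE(θ) + λ \sum_j θ_j^2 with λ = 1/(2Nσ²); a larger prior variance σ² corresponds to weaker regularization (smaller λ).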
And as with regression, we can also be fully Bayesian, averaging over parameter uncertainty, to generate
a predictive distribution p(y|x, Dy , Dx ) to make predictions for any future x. Unfortunately, as with regres-
sion, doing this computation exactly is impossible for most models of interest (even simple models such
as logistic classifiers), and approximate methods such as Monte Carlo sampling methods or deterministic
approximations (Laplace, variational) must be relied on if one wants to be fully Bayesian.
Note on Notation: The notation in the remainder of these notes is a little different from the notation in the earlier sections; e.g., the discriminants g_k(x) introduced below can in principle be any function of x: they could be linear or non-linear, and could be transformed (or not) to sum to 1 and lie between 0 and 1, and so on. So g_k(x) is intended to be very general.
In this section we take a look at classifiers in terms of how they implement mappings from inputs x to “hard decisions” ŷ ∈ {1, . . . , K}, rather than necessarily mapping to K conditional probabilities. Classifiers like the logistic model or neural networks can make hard decisions by selecting a specific label given an input x, e.g., ŷ_x = arg max_k f_k(x; θ), i.e., the most likely class. It is easy to see that the softmax operation doesn't change the identity of the most likely class, i.e., that ŷ_x = arg max_k f_k(x; θ) = arg max_k z_k(x; θ), where f_k() is the softmax transformation of z_k().
We can define a general form for classifiers by using the notion of discriminant functions, defined as
follows:
• For each class k ∈ {1, . . . , K} we have a discriminant function gk (x) that produces a real-valued
scalar discriminant value for each class k, conditioned on an input x. Each discriminant function can
be parametrized by parameters θk (the dependence on θk is suppressed below for simplicity).
• The classifier makes a decision on any input x by computing the K discriminant values and assigning x to the class with the largest value, i.e., ŷ_x = arg max_k g_k(x).
Discriminant functions provide a very general way to think about classifiers, including both probabilistic
and non-probabilistic approaches (e.g., tree-based classifiers).
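A minimal sketch of this decision rule (the discriminant functions here are arbitrary Python callables, purely for illustration):

```python
import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant value g_k(x) is largest.
    discriminants: a list of K callables, one g_k per class."""
    values = [g(x) for g in discriminants]
    return int(np.argmax(values))            # predicted class index (0-based)

# Example with K = 2 linear discriminants g_k(x) = theta_k^T x.
thetas = [np.array([1.0, -0.5]), np.array([-0.2, 0.7])]
gs = [lambda x, t=t: t @ x for t in thetas]
print(classify(np.array([0.5, 1.0]), gs))
```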
Some examples of discriminant functions are:
• Polynomial discriminants where gk (x; θk ) is an mth order polynomial function of x with polynomial
coefficients defined by θk (e.g., m = 2 for a quadratic).
• Non-linear discriminants such as g_k(x) = h(θ_k^T x), where h is a complex non-linear function such as one defined by a neural network.
We define the decision region Rk for class k, 1 ≤ k ≤ K, as the region of the (d-dimensional, real-
valued) input space x where the discriminant function for class k is larger than any of the other discriminant
functions, i.e.,
x ∈ R_k ⇔ ŷ_x = k ⇔ k = arg max_j g_j(x), j = 1, . . . , K
So, decision region R_k is the region in the input space where x is classified as being in class k by the classifier, or equivalently, where g_k(x) > g_j(x) for all j ≠ k. Note that a decision region need not be a single contiguous region in x but could be the union of multiple disjoint subregions (depending on how the discriminant functions are defined).
Decision boundaries between decision regions are defined by equality of the respective discriminant
functions. Consider the two-class case (K = 2) for simplicity, with g1 (x) and g2 (x). Points in the input
space for which g1 (x) = g2 (x) are by definition on the decision boundary between the two classes. This is
equivalent to saying that the equation for the decision boundary is defined by g1 (x) − g2 (x) = 0. In fact for
the two-class case it is clear that we don’t need two discriminant functions, i.e., we can just define a single
discriminant function g(x) = g1 (x) − g2 (x) and predict class 1 if g(x) > 0 and class 2 if g(x) < 0. And if
g(x) = 0 we would randomly select class 1 or 2.
As mentioned earlier, linear discriminants are defined as g_k(x) = θ_k^T x, i.e., an inner product of a weight vector and the input vector x, where θ_k is a d-dimensional weight vector for class k. In particular, for the two-class case, the decision boundary equation is defined as g_1(x) − g_2(x) = θ_1^T x − θ_2^T x = θ^T x = 0, where θ = θ_1 − θ_2, i.e., we only need a single weight vector θ. The equation θ^T x = 0 will in general define the equation of a (d − 1)-dimensional hyperplane in the d-dimensional x space, partitioning the input space into two contiguous decision regions separated by the linear hyperplane. More generally, for K > 2, linear discriminant functions will lead to decision regions with piecewise-linear boundaries in the input space x.
We also note that if our discriminant is defined as g_k(x) = h(θ_k^T x), where h() is some monotonically increasing function, then we have, in effect, a linear discriminant, i.e., we have linear decision boundaries in the input space x, since the maximization operation ŷ_x = arg max_k g_k(x) to select the predicted class is unchanged whether we use h(θ_k^T x) or θ_k^T x as our definition for g_k(x). An example is the logistic classifier, where the non-linear function h() is the logistic function and where the decision boundaries in the input space x are linear for K = 2 and piecewise linear for K > 2.
More generally, if the gk (x) are polynomial functions of order r in x, the decision boundaries in the gen-
eral case will also be polynomials of order r. An example of a discriminant function that produces quadratic
boundaries (r = 2) in the general case is the multivariate Gaussian classifier (which we discuss later). And
if the gk (x) are non-linear functions of x then in general we will get non-linear decision boundaries (e.g.,
for neural networks with one or more hidden layers).
Optimal discriminant functions are discriminant functions that minimize the classification error rate of a classifier on average with respect to some underlying p(x, y). Below we will assume that all misclassification errors incur equal cost (see Footnote 2), known as the 0-1 cost function or 0-1 loss. It is not hard to see that, for 0-1 loss, the optimal discriminant is

g_k(x) = p(y = k | x), 1 ≤ k ≤ K,

which is equivalent to maximizing the posterior probability of the discrete-valued class variable Y given x, i.e., picking the most likely class for each x.
From the definition of the optimal discriminant it is easy to see that another version of the optimal
discriminant (in the sense that it will lead to the same decision for any x) is defined by gk (x) = p(x|y =
k)p(y = k) (because of Bayes rule). Again, this is theoretically optimal if we know it, but in practice
we usually do not. And similarly, any monotonically increasing function of these discriminants, such as log p(x|y = k) + log p(y = k), is also an optimal discriminant.
Note that these are optimal predictions in theory, i.e., if we know the true p(y = k|x) (or some mono-
tonic function of it) exactly. In practice we will usually need to approximate this conditional distribution by
assuming some functional form for it (e.g., the logistic form) and learning the parameters for this functional
form from data. The assumption of a specific functional form may lead to bias (or approximation error)
in our estimate of the optimal discriminant function and learning parameters from a data set will lead to
variance (or estimation error).
As mentioned above, the optimal discriminant for any classification problem is defined by gk (x) = p(y =
k|x) (or some monotonic function of this quantity). Even though in practice we won’t know the precise
functional form or the parameters of p(y = k|x) it is nonetheless useful to look at the error rate that we
would get with the optimal classifier. The error rate of the optimal classifier is known as the Bayes error rate and provides a lower bound on the error rate of any actual classifier (analogous to the unexplainable variance σ_y^2 in regression problems). As we will see below, the Bayes error rate depends on how much
overlap there is between the density functions for each class, p(x|y = k), in the input space x: if there is a
lot of overlap we will get a high Bayes error rate, and with little or no overlap we get a low (near zero) Bayes
error rate. For example, for high Bayes error, think of a two-dimensional x space with K two-dimensional Gaussian class densities that are heavily overlapped; if we could “pull” these Gaussians further and further apart to reduce the overlap, then the Bayes error rate would decrease.
Consider the error rate of the optimal classifier at some particular point x in the input space. The probability of error is

e_x = 1 − p(y = k^* | x) = 1 − max_k p(y = k | x),

where k^* = arg max_k p(y = k | x) is the class selected by the optimal classifier at x.
Footnote 2: More generally we can minimize expected cost, where different errors may have different costs, but we will not pursue that here.
Since our optimal classifier will always select k^* given x, the classifier will be correct a fraction p(y = k^*|x) of the time and incorrect a fraction 1 − p(y = k^*|x) of the time, at x. (We are conveniently ignoring any points x here that might fall exactly on a decision boundary, and also assuming that our classifiers are deterministic as long as an input x is not on a decision boundary.)
To compute the overall error rate (i.e., the probability that the optimal classifier will make an error on a random x drawn from p(x)) we need to compute the expected value of the error rate with respect to p(x):

P_e^* = E_{p(x)}[\,e_x\,] = \int \Big(1 - \max_k \, p(y = k \mid x)\Big)\, p(x)\, dx.

This term P_e^* is known as the Bayes error rate. It is the optimal (lowest possible) error rate for any classifier for some fixed feature space x with respect to p(x, y). Since the optimal classifier predicts class k everywhere in decision region R_k, we can also write this as

P_e^* = \sum_{k=1}^{K} \int_{R_k} \big(1 - p(y = k \mid x)\big)\, p(x)\, dx = \sum_{k=1}^{K} \int_{R_k} \sum_{j \ne k} p(x \mid y = j)\, p(y = j)\, dx,

i.e., as the sum of K different error terms defined over the K decision regions. These error terms, one per class k, are proportional to how much overlap class k's decision region has with each of the other class densities.
Although we can only ever compute this for toy problems where we assume full knowledge of p(y|x)
and p(x), the concept of the Bayes error rate is nonetheless useful and it can be informative to see how it
depends on density overlap.
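As a toy illustration (our own construction, not from these notes), the Bayes error rate for two equally likely one-dimensional Gaussian classes can be estimated by Monte Carlo, and it shrinks as the class means are pulled apart:

```python
import numpy as np

def bayes_error_two_gaussians(mu0, mu1, sigma=1.0, n=200_000, seed=0):
    """Monte Carlo estimate of the Bayes error rate for two equally likely
    1-D Gaussian classes N(mu0, sigma^2) and N(mu1, sigma^2)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)                     # true labels, p(y=0) = p(y=1) = 0.5
    x = rng.normal(np.where(y == 0, mu0, mu1), sigma)  # x ~ p(x | y)
    # With equal priors and equal variances the optimal rule picks the
    # class whose mean is nearest to x.
    yhat = (np.abs(x - mu1) < np.abs(x - mu0)).astype(int)
    return np.mean(yhat != y)

print(bayes_error_two_gaussians(0.0, 1.0))   # heavy overlap: high Bayes error (about 0.31)
print(bayes_error_two_gaussians(0.0, 4.0))   # well separated: near-zero Bayes error
```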
Note that any achievable classifier can never do better than the Bayes error rate in terms of its accuracy:
the only way to improve on the Bayes error rate is to change the input feature vector x, e.g., to add one
or more features to the input space that can potentially separate the class densities more. So in principle it
would seem as if it should always be a good idea to include as many features as we can in a classification problem, since the Bayes error rate in a higher-dimensional space is always at least as low as (if not lower than) that in any subset of the dimensions of that space. This is true in theory, in terms of the optimal error rate in the higher-dimensional space; but in practice, when learning from a finite amount of data N, adding more features (more dimensions) means we need to learn more parameters, so our actual classifier in the higher-dimensional space could in fact be less accurate (due to estimation noise) than a lower-dimensional classifier, even though the higher-dimensional space might have a lower Bayes error rate.
In general, the actual error rate of a classifier can be thought of as having components similar to those
for regression. In regression, for a real-valued y and the squared error loss function, the error rate of a
prediction model can be decomposed into a sum of the inherent variability of y given x, plus a bias term,
plus a variance term. For classification problems the decomposition does not have this simple additive form,
but it is nonetheless useful to think of the actual error rate of a classifier as having components that come
from (1) the Bayes error rate, (2) the bias (approximation error) of the classifier, and (3) variance (estimation
error due to fitting the model to a finite amount of data).
The definitions above of optimal discriminants suggest two different strategies for learning classifiers from
data. In the first we try to learn p(y|x) directly: this is referred to as the conditional or discriminative
approach and examples include logistic regression and neural networks that we discussed earlier.
The second approach, which we discuss in this section, is where we learn a model of the input data, p(x|y = k), for each class k, and then use Bayes rule to make predictions via p(y = k|x). This is sometimes referred to as the generative or joint approach since we are in effect modeling the joint distribution p(x, y), rather than just the conditional p(y|x), allowing us in principle to generate or simulate data from the model (well-known examples are Gaussian classifiers and naive Bayes classifiers).
Generative models have the drawback that modeling p(x|y) can be difficult to do accurately, particularly
as the dimensionality d increases, whereas modeling the conditional distribution p(y|x) can be much easier.
Thus, joint or generative classifiers are less widely-used in practice than their conditional counterparts.
The generative approach can have some advantages, however, compared to the conditional approach,
particularly if we know something about the distribution of the data in the input space x. In particular, gen-
erative models can be useful for dealing with missing data in the input space, for semi-supervised learning,
or for detecting outliers or distribution shifts or data from novel classes in the input space.
Gaussian Classifiers A classical approach to generative classifiers for multivariate d-dimensional data x is to assume that the conditional densities for each class have a multivariate Gaussian distribution, i.e.,

p(x \mid y = k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\Big( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \Big),

where μ_k is a d-dimensional mean and Σ_k is a d × d covariance matrix, and μ_k and Σ_k can be different for each class. In the general case we have discriminant functions (taking logs of p(x|y = k)p(y = k)) that can be written in the form

g_k(x) = -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + C_k,

where C_k involves terms that depend on k (such as det(Σ_k) and p(y = k)) but that don't depend on x. The first term is a quadratic in x. Thus, the discriminant functions for Gaussian generative classifiers are (in general) quadratic functions (see Footnote 3), and consequently the decision boundaries are also quadratic in form. For example, for K = 2 we can solve for the x values that satisfy g_1(x) = g_2(x) to find the quadratic function in x that defines the decision boundaries.
The main weaknesses of Gaussian classifiers are that (i) they require us to make a strong assumption about the parametric form of the distributions of x in the input space, and (ii) they require O(d^2) parameters per class, which can be problematic for problems where d is large (e.g., if d is the number of pixels in an image). To address the O(d^2) problem, one approach is to approximate the full covariance matrix for each class with a diagonal matrix (all off-diagonal covariance terms set to 0): this is obviously a big approximation but may be good enough in some cases for discriminating between classes.
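A minimal sketch of a Gaussian generative classifier, with maximum likelihood fits per class (the function names and the decision to drop the constant −(d/2) log 2π, which is common to all classes, are our own choices):

```python
import numpy as np

def fit_gaussian_classifier(X, y, K):
    """Maximum likelihood estimates of the per-class Gaussian parameters
    and class priors. X: N x d data matrix, y: labels in {0, ..., K-1}."""
    params = []
    for k in range(K):
        Xk = X[y == k]
        mu = Xk.mean(axis=0)                            # class mean mu_k
        Sigma = np.cov(Xk, rowvar=False, bias=True)     # d x d covariance Sigma_k (MLE)
        prior = len(Xk) / len(X)                        # p(y = k)
        params.append((mu, Sigma, prior))
    return params

def gaussian_discriminants(x, params):
    """g_k(x) = log p(x | y = k) + log p(y = k), up to a constant shared
    by all classes."""
    gs = []
    for mu, Sigma, prior in params:
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        quad = diff @ np.linalg.solve(Sigma, diff)
        gs.append(-0.5 * quad - 0.5 * logdet + np.log(prior))
    return np.array(gs)

# Prediction: pick the class with the largest discriminant value.
# yhat = np.argmax(gaussian_discriminants(x, params))
```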
Markov Sequence Classifiers Generative classifiers are often easy to extend to non-vector data. For example, consider data consisting of multiple sequences x_i, 1 ≤ i ≤ N, where each x_i is a categorical sequence and the sequences can have different lengths. Examples might be protein sequences in bioinformatics or sequences of visits to Web pages by different visitors to a Website. Consider also that the sequences are classified with labels y_i, e.g., different types of proteins, or visitors who make purchases on a Website versus those who don't.
We can in principle build a generative model for each class, i.e., p(x|y = k), such as a Markov model,
with discriminant functions log p(x|y = k) + log p(y = k), 1 ≤ k ≤ K. Given a sequence x of any length,
its probability p(x|y = k) can be computed for each class k, via the factorized representation implicit in the
Markov chain. This general approach was the basis of speech recognition systems for many years, where
each class k corresponded to a word in the vocabulary (and K would be quite large, e.g., K = 50,000), and
the underlying generative models per class (or word) were hidden Markov models.
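As an illustrative sketch of the simpler first-order Markov chain case (all names here are hypothetical, and the per-class initial and transition probabilities are assumed to have been estimated already):

```python
import numpy as np

def markov_log_likelihood(seq, init_prob, trans_prob):
    """log p(x | y = k) for a categorical sequence under a first-order
    Markov chain: init_prob is a length-M vector of initial-symbol
    probabilities and trans_prob an M x M transition matrix, where M is
    the alphabet size."""
    logp = np.log(init_prob[seq[0]])
    for prev, curr in zip(seq[:-1], seq[1:]):
        logp += np.log(trans_prob[prev, curr])
    return logp

def classify_sequence(seq, class_models, class_priors):
    """Discriminants log p(x | y = k) + log p(y = k); pick the largest."""
    gs = [markov_log_likelihood(seq, init, trans) + np.log(prior)
          for (init, trans), prior in zip(class_models, class_priors)]
    return int(np.argmax(gs))
```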
Parameter Estimation for Generative Models Estimating the parameters of a generative model is usu-
ally quite straightforward. Since the data are assumed to be fully labeled, we can separate the likelihood (or log-likelihood) into separate products (or sums), one for each class and for the parameters of that class, i.e.,

\log L(\theta) = \sum_{k=1}^{K} \; \sum_{i: y_i = k} \log \big[ p(x_i \mid y_i = k, \theta_k) \, p(y_i = k) \big].

This is for the case where the parameters for each class are independent of the parameters in other classes; if the parameters are tied or linked in some way then we follow a different path, one which is also usually quite straightforward.
We can then partition our training data into K subsets according to the K class labels, and then optimize (separately) the log-likelihood term for each class, i.e., \sum_{i: y_i = k} \log \big[ p(x_i \mid y_i = k, \theta_k)\, p(y_i = k) \big], to find the θ_k's. We can do this parameter estimation per class using any of our favorite methods from density
estimation, e.g., maximum likelihood, maximum a posteriori, etc. We can even be fully Bayesian if we
wish to, by inferring posterior distributions for θk for each class and then averaging over these posterior
distributions when making predictions about the class labels y for a new data point x.
Footnote 3: There are some exceptions, such as when each class has a common covariance, Σ_k = Σ, and the quadratic terms drop out, resulting in linear discriminants.