
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

Multi Class Learning with Individual Sparsity

Ben Zion Vatashsky and Koby Crammer


Department of Electrical Engineering
Technion - Israel Institute of Technology, 32000 Haifa, Israel
vatashsky@gmail.com, koby@ee.technion.ac.il

Abstract

Multi class problems are everywhere. Given an input, the goal is to predict one of a few possible classes. Most previous work reduces learning to minimizing the empirical loss over some training set plus an additional regularization term, promoting simple models or some other prior knowledge. Many learning regularizations promote sparsity, that is, small models or a small number of features, as performed in group LASSO. Yet such models do not always represent the classes well. In some problems, for each class there is a small set of features that represents it well, yet the union of these sets is not small. We propose to use other regularizations that promote this type of sparsity, analyze the generalization properties of such formulations, and show empirically that these regularizations not only perform well, but also promote the desired sparsity structure.

1 Introduction

Regularization is a widely used and well understood tool in supervised learning. Given data samples, called training data, modern algorithms not only seek a model that performs well on the training data, but also require that the model be simple in some sense, where simplicity is measured via some regularization function. Regularization is widely used as a mechanism to prevent overfitting or to impose prior knowledge of structure on a model. Learning binary prediction problems or regression of real numbers has been studied for over half a century, with much of the work focusing on linear models. Given an input $x \in \mathbb{R}^d$, its inner product with some vector $\omega \in \mathbb{R}^d$ is used to make a prediction, $f(\omega^\top x)$. In many cases regularization is defined to be some norm of that vector, $\|\omega\|$.

Among others, SVM [Boser et al., 1992; Cortes and Vapnik, 1995] can be represented as such a classification problem with an $\ell_2$ norm [Mukherjee et al., 2002]. For regression, Ridge Regression [Hoerl, 1962; Hoerl and Kennard, 1970; Tikhonov and Arsenin, 1977] ($\ell_2$ norm) and LASSO [Tibshirani, 1996] ($\ell_1$ norm) are popular.

In more complex problems, such as multi class categorization, in which given an input $x$ the algorithm is required to output one of $c$ possible classes, matrix models $\omega \in \mathbb{R}^{d \times c}$ are used. Here, prediction with a linear model is performed by first computing the inner product of the input with each column of the matrix, and then processing the resulting $c$ scalars, $f(\omega_1^\top x, \ldots, \omega_c^\top x)$. In multi class problems we have $f(a_1, \ldots, a_c) = \arg\max_s a_s$. Natural extensions of norm regularization from vectors to matrices are entry-wise norms, such as the Frobenius norm $\sqrt{\sum_s \sum_t |\omega_{t,s}|^2}$, used in multi class SVMs [Weston and Watkins, 1998; Crammer and Singer, 2001; Lee et al., 2004].

Alternatively, a mixed norm can be used. The most common such usage is group LASSO [Yuan and Lin, 2006; Bakin, 1999], which uses the $\ell_{2,1}$ mixed norm as a mechanism for selecting a group of variables. That is, the regularization promotes choosing a small subset of the features (rows), and then any feature from this subset can be used. However, if we seek sparser models, then the $\ell_{1,1}$ norm (which is the vector $\ell_1$ norm in the mixed norm representation) is often used, as a convex relaxation of counting the number of non-zero elements.

We argue that for many situations these two norms do not capture our requirements from models. Though it is reasonable to believe that many features are redundant, this redundancy might differ among classes. For example, color pattern may be very informative for zebras, but less informative for horses, dogs, cats or chameleons, which may have a variety of color patterns, and thus $\ell_{2,1}$ may not be the right choice. Additionally, the global sparsity promoted by $\ell_{1,1}$ may generate models that are very sparse for some classes and dense for others. This may happen if the data is far from being balanced, with few examples of one class and many of another.

In this work we propose to use another principle for regularization. Instead of forcing a small number of features ($\ell_{2,1}$) or a small model altogether ($\ell_{1,1}$), we propose to promote models in which, for each class independently, there is a small number of relevant features. Such regularization generates small, that is sparse, models, yet does not "favor" certain classes. We formulate learning with such regularization in the next section and provide a robustness analysis and a generalization bound for it, and in fact for all mixed norms. Specifically, we show that our regularization is equivalent to an algorithm that is robust to feature noise that is different per class, yet the worst class's noise is not too large. We also report results on 14 text classification problems with a large range of sizes, numbers of classes and dimensions. We show that our proposed regularization attains higher F1 scores than any other mixed norm regularization we tried, yet results in sparser models. One explanation for that difference is that our algorithm attains higher recall values (over classes) while paying in precision. This demonstrates exactly the point that the obtained models are not too sparse for any class, as models learned with $\ell_{1,1}$ regularization may be (when some classes contain very few non-zero model terms, they may never be predicted).

Related work: Most work with mixed norm regularization focused on the $\ell_{2,1}$ and $\ell_{\infty,1}$ norms [Duchi and Singer, 2009b; 2009a; Bradley and Bagnell, 2009; Zhao et al., 2009; Mairal et al., 2010]. Mixed norm regularization including $\ell_{1,2}$ was proposed for signal estimation [Kowalski, 2009; Kowalski and Torrésani, 2008] and kernel learning for binary classification [Kowalski et al., 2009]. In a work on hierarchical penalization [Szafranski et al., 2008] an $\ell_{4/3,1}$ norm was derived and used. The $\ell_{p,1}$ norm regularization was also used in other contexts [Fornasier and Rauhut, 2008; Teschke and Ramlau, 2007].

Notation: Given a matrix $A$, its mixed norm $\ell_{p,q;r}$ is defined by computing the $\ell_p$ norm of each column (or row) of $A$, and then computing the $\ell_q$ norm of the result. The order of the summation (columns or rows first) is defined by $r \in \{1, 2\}$. We define a mixed norm where we either first sum over rows ($r = 1$) or over columns ($r = 2$),
$$\|A\|_{p,q;1} = \Big( \sum_t \Big( \sum_s |A_{t,s}|^p \Big)^{q/p} \Big)^{1/q}, \qquad \|A\|_{p,q;2} = \Big( \sum_s \Big( \sum_t |A_{t,s}|^p \Big)^{q/p} \Big)^{1/q}.$$
An early work on mixed norm spaces was done by Benedek and Panzone in 1961 [Benedek and Panzone, 1961].
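To make the notation concrete, the following minimal sketch (illustrative only; it assumes a NumPy-style matrix with rows indexed by features and columns by classes, and the function name is ours) evaluates $\|A\|_{p,q;r}$ by applying the inner $\ell_p$ norm along the grouping selected by $r$ and the outer $\ell_q$ norm to the resulting vector.

    import numpy as np

    def mixed_norm(A, p, q, r):
        # ||A||_{p,q;r}: inner l_p norm per row (r = 1) or per column (r = 2),
        # followed by the l_q norm of the resulting vector.
        inner_axis = 1 if r == 1 else 0
        inner = np.linalg.norm(A, ord=p, axis=inner_axis)
        return np.linalg.norm(inner, ord=q)

    # Example: the l_{1,2;2} norm sums absolute values within each class (column)
    # and takes the Euclidean norm of the per-class sums.
    A = np.array([[1.0, 0.0], [-2.0, 3.0]])
    print(mixed_norm(A, p=1, q=2, r=2))   # sqrt(3^2 + 3^2) ~= 4.243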
2 Individual Variable Selection

Group selection, promoted by group LASSO and other mixed norms of the form $\ell_{p,1}$, assumes that one group of variables has good descriptive qualities for the entire problem. Specifically, for feature selection in multi class, the assumption is that a subset of the features is globally able to describe all the classes well. That is, a small set is sought that would work well across all classes. We take a different route and assume that for each class there is a small number of features that describes it well. However, we do not assume that this set overlaps other classes' sets, and also, we do not want to have too small a number of features for some classes. This intuition leads to the following regularization, $\sum_i (\text{zero norm of column } i)^2$. That is, we compute the number of non-zero elements per class and then take the Euclidean norm of this vector of counts. The Euclidean norm is used, as we want to demote classes for which there are many non-zero elements compared with others. As performed in other contexts, we replace the zero norm with the $\ell_1$ norm, which is convex and continuous, and obtain the following relaxation,
$$\Omega(\omega) = \frac{1}{2}\|\omega\|_{1,2;2}^2 = \frac{1}{2} \sum_{s=1}^c \Big( \sum_{t=1}^d |\omega_{t,s}| \Big)^2.$$
Thus, given a training set $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, c\}$, we propose to learn by minimizing the following problem,
$$\min_\omega \sum_{i=1}^n L(\omega, (x_i, y_i)) + \frac{\lambda}{2} \|\omega\|_{1,2}^2,$$
where $L(\omega, (x_i, y_i))$ is some multi class loss, such as the multi class log loss (with a curvature parameter $\nu$),
$$L_{\log}(\omega, x, y^*) = \frac{1}{\nu} \log\Big( 1 + \sum_{y \neq y^*} e^{\nu(1 + \omega_y^\top x - \omega_{y^*}^\top x)} \Big). \qquad (1)$$
Previous work with $\ell_{1,2}$ regularization included signal reconstruction [Kowalski, 2009; Kowalski and Torrésani, 2008] and binary classification with multiple kernels [Kowalski et al., 2009]. To the best of our knowledge no work has used $\ell_{1,2}$ regularization for multi class problems.
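For concreteness, the minimal NumPy sketch below (illustrative only; W is the d x c weight matrix with one column per class, and the names lam and nu stand for the tradeoff parameter $\lambda$ and the curvature parameter $\nu$ above) spells out this objective.

    import numpy as np

    def omega_reg(W):
        # Omega(W) = 0.5 * ||W||_{1,2;2}^2: squared Euclidean norm of the per-class l1 norms.
        per_class_l1 = np.abs(W).sum(axis=0)
        return 0.5 * np.sum(per_class_l1 ** 2)

    def multiclass_log_loss(W, x, y, nu=1.0):
        # Eq. (1) for a single example (x, y):
        # (1/nu) * log(1 + sum_{s != y} exp(nu * (1 + w_s'x - w_y'x))).
        scores = W.T @ x
        margins = 1.0 + scores - scores[y]
        margins = np.delete(margins, y)
        return np.log1p(np.sum(np.exp(nu * margins))) / nu

    def objective(W, X, Y, lam, nu=1.0):
        # Sum of per-example losses plus (lambda / 2) * ||W||_{1,2;2}^2.
        loss = sum(multiclass_log_loss(W, x, y, nu) for x, y in zip(X, Y))
        return loss + lam * omega_reg(W)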
3 Analysis

We now provide an analysis for learning with mixed-norm matrices. First we state an equivalence to robustness over a certain noise, and then a generalization bound based on Gaussian complexity analysis. In both cases we build on previous tools and modify them to our setting. We emphasize that this is not a direct application, but that a non-trivial derivation is needed.

The properties we show in this section provide theoretical guarantees according to prior knowledge. This prior knowledge may guide us to choose our proposed $\ell_{1,2;2}$ regularization. We extend the equivalence of binary SVM to robust optimization [Xu et al., 2009] to a multi class setting. We also state generalization bounds for $\ell_{1,2;2}$ regularization, using Gaussian complexity.

3.1 Equivalence to noise robustness

Robustness to noise is one of regularization's main goals. Recently Xu et al. [Xu et al., 2009] showed that, with some limitations, regularization for binary classification using the hinge loss is equivalent to robust optimization. We extend this notion to the multi class setting by showing an equivalence of mixed norm regularization and robustness to noise for the sum of hinges loss function.

One approach we tried was to reduce multi class learning into a binary classification problem with $c$ times more examples. Applying the result of Xu et al. [Xu et al., 2009] on the new problem provided a robustness equivalence only to $\ell_{p,1;2}$ mixed norms. We thus re-derived this equivalence from scratch. Following the outline of Xu et al. [Xu et al., 2009] we proved the following result (here $y_i \in \{-1, 1\}^c$).

Theorem 1 Given a set of examples $\{x_i, y_i\}_{i=1}^n$, non separable for each class $s = 1, \ldots, c$ (for each class $s = 1, \ldots, c$ and every $\omega_s \in \mathbb{R}^d$, there is an example $j(s)$ such that $y_{j(s),s}(\omega_s^\top x_{j(s)}) < 0$, where $y_{j(s),s}$ is the $s$ term of $y_{j(s)}$), and the two sets,
$$\check{N} = \Big\{ (\delta_1, \cdots, \delta_n) \ \Big|\ \sum_{i=1}^n \|\delta_i\|_{p^*,q^*} \le M \Big\}$$
and
$$\check{N}_0 = \big\{ \delta \ \big|\ \|\delta\|_{p^*,q^*} \le M \big\},$$
for some $p^*, q^* \ge 1$, the following two optimization problems are equivalent:
$$\min_{\omega}\ \sup_{\delta_1, \cdots, \delta_n \in \check{N}}\ \sum_{i=1}^n \sum_{s=1}^c \Big[ 1 - y_{i,s}\big(\omega_s^\top (x_i - \delta_{i,s})\big) \Big]_+$$
$$\begin{aligned}
\min_{\omega, \xi}\ & \sup_{\delta \in \check{N}_0} \langle \omega, \delta \rangle + \sum_{i=1}^n \sum_{s=1}^c \xi_{i,s} \\
\text{s.t.}\quad & \xi_{i,s} \ge 1 - y_{i,s}(\omega_s^\top x_i) \quad i = 1, \ldots, n;\ s = 1, \ldots, c \\
& \xi_{i,s} \ge 0 \quad i = 1, \ldots, n;\ s = 1, \ldots, c,
\end{aligned} \qquad (2)$$
where $\langle A, B \rangle = \mathrm{Tr}(A^\top B)$ is the matrix (Frobenius) inner product.

The proof is omitted due to lack of space. When a norm of the perturbation is bounded, the regularization term is similar to the definition of the dual norm (multiplied by the bound value). This means that $\ell_{p^*,q^*}$ is the dual norm of some $\ell_{p,q}$ norm, where $\frac{1}{p} + \frac{1}{p^*} = \frac{1}{q} + \frac{1}{q^*} = 1$ [Bradley and Bagnell, 2009]. Thus, we get the regularization term:
$$\sup_{\|\delta\|_{p^*,q^*} \le M} \langle \omega, \delta \rangle = M \|\omega\|_{p,q}. \qquad (3)$$
According to Eq. (3), the correspondence between the regularization tradeoff parameter $\lambda$ and the noise bound $M$ is $\lambda = M$. In other words, $\lambda$ may be used to estimate the noise bound $M$ ($= \lambda$) and vice versa. For example, if we know that $\sum_{i=1}^n \|\delta_i\|_{\infty,2;2} \le M$, then $\ell_{1,2;2}$ would be the best regularization. In this type of noise, for each class the noise may be different, yet it does not get too high, as the norm of the largest class's noise, summed over examples, should not be too large.
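As a small numeric illustration of Eq. (3) for the pair used later ($p^* = \infty$, $q^* = 2$, dual to $\ell_{1,2;2}$), the sketch below (illustrative only; the maximizing perturbation is constructed by standard duality arguments and its form is our own working, not a formula stated in the text) checks that the worst feasible perturbation attains exactly $M\|\omega\|_{1,2;2}$, while random feasible perturbations stay below it.

    import numpy as np

    rng = np.random.default_rng(0)
    d, c, M = 5, 3, 2.0
    omega = rng.standard_normal((d, c))

    v = np.abs(omega).sum(axis=0)        # per-class l1 norms
    omega_12 = np.linalg.norm(v)         # ||omega||_{1,2;2}

    # Maximizer of <omega, delta> over {delta : ||delta||_{inf,2;2} <= M}:
    # column s is M * (v_s / ||v||_2) * sign(omega_s).
    delta = M * (v / omega_12) * np.sign(omega)
    assert np.isclose(np.linalg.norm(np.abs(delta).max(axis=0)), M)   # feasible
    assert np.isclose(np.sum(omega * delta), M * omega_12)            # attains M * ||omega||_{1,2;2}

    # Any other feasible perturbation gives a smaller inner product.
    for _ in range(1000):
        z = rng.standard_normal((d, c))
        z *= M / np.linalg.norm(np.abs(z).max(axis=0))                # rescale so ||z||_{inf,2;2} = M
        assert np.sum(omega * z) <= M * omega_12 + 1e-9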
3.2 Generalization bound

We next provide a generalization analysis for the $\ell_{1,2;2}$ regularization using Gaussian complexity. We use the following measure of function complexity given by Bartlett and Mendelson [Bartlett and Mendelson, 2003]:

Definition 1 [Bartlett and Mendelson, 2003] The Gaussian complexity of a function class $\mathcal{F}$ mapping from a set $\mathcal{X}$ to $\mathbb{R}$ is defined as:
$$G_n(\mathcal{F}) = \mathbb{E}_X \mathbb{E}_g \Big[ \sup_{f \in \mathcal{F}} \frac{2}{n} \sum_{i=1}^n g_i f(X_i) \Big],$$
where $X_1, \ldots, X_n$ are samples selected independently from the set $\mathcal{X}$ according to a probability $P$ and $g_1, \ldots, g_n$ are independent Gaussian random variables, where for each $i \in \{1, \ldots, n\}$, $g_i \sim N(0, 1)$.

Based on this definition, Bartlett and Mendelson proved the following theorem.

Theorem 2 [Bartlett and Mendelson, 2003] Let $\mathcal{F}$ be a class of functions mapping from $\mathcal{X}$ to $\mathcal{A} = \mathbb{R}^c$ and let $\mathcal{F}_1, \ldots, \mathcal{F}_c$ be real valued classes, such that $\mathcal{F}$ is a subset of their direct sum. For a given loss function $L : \mathcal{Y} \times \mathcal{A} \to [0, 1]$, let $\phi : \mathcal{Y} \times \mathcal{A} \to [0, 1]$ be a dominating cost function (for all $a \in \mathcal{A}$ and $y \in \mathcal{Y}$, $\phi(y, a) \ge L(y, a)$), such that for all $y \in \mathcal{Y}$, $\phi(y, \cdot)$ is a Lipschitz function (with respect to Euclidean distance on $\mathcal{A}$) with a constant $\Lambda$. Let $(X_i, Y_i)_{i=1}^n$ be samples selected independently according to probability $P$. Then, for any integer $n$ and $0 < \delta < 1$, with a probability of at least $1 - \delta$ over samples of size $n$, for every $f \in \mathcal{F}$ the following holds:
$$\mathbb{E}\, L(Y, f(X)) \le \hat{\mathbb{E}}_n \phi(Y, f(X)) + k\Lambda \sum_{s=1}^c G_n(\mathcal{F}_s) + \sqrt{\frac{8 \ln(2/\delta)}{n}}, \qquad (4)$$
where $k$ is some constant.

We focus on the sum of clipped hinges loss,
$$\phi(Y, f(X)) = \frac{1}{c} \sum_{s=1}^c \min\big(1, \max\big(0, 1 - Y_s \cdot \omega_s^\top X\big)\big),$$
$$\hat{\mathbb{E}}_n \phi(Y, f(X)) = \frac{1}{n} \sum_{i=1}^n \phi(Y_i, f(X_i)) \quad \text{and} \quad f_s \in \mathcal{F}_s^M$$
such that:
$$\mathcal{F}_s^M = \Big\{ x \to \omega_s^\top x \ :\ \omega_s \in \mathbb{R}^d,\ s \in \{1, \ldots, c\},\ \omega = [\omega_1 \ldots \omega_c],\ \|\omega\|_{p,q;r} \le M \Big\} \qquad (5)$$
and $f = (f_1, \ldots, f_c) \in \mathcal{F}^M = \big\{ \mathcal{F}_1^M, \ldots, \mathcal{F}_c^M \big\}$.

In the next lemma we bound the Gaussian complexity term in Eq. (4) for our setting.

Lemma 3 Let $f = \{f_1, \ldots, f_c\} \in \mathcal{F}^M = \big\{\mathcal{F}_1^M, \ldots, \mathcal{F}_c^M\big\}$ belong to a set defined in Eq. (5). Then, the following bound holds for the corresponding Gaussian complexities $\sum_{s=1}^c G_n\big(\mathcal{F}_s^M\big)$:
$$\sum_{s=1}^c G_n\big(\mathcal{F}_s^M\big) \le \frac{2M}{n}\, \mathbb{E}_X \mathbb{E}_g \|Xg\|_{p^*,q^*;r},$$
where $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ is a matrix of $n$ samples, independently selected according to probability $P$, and $g \in \mathbb{R}^{n \times c}$ is a matrix with independent Gaussian variables $g_{s,i}$.

The proof is omitted due to lack of space. We now state and prove the main theorem of the section.

Theorem 4 Let $f = (f_1, \ldots, f_c) \in \mathcal{F}^M = \big\{\mathcal{F}_1^M, \ldots, \mathcal{F}_c^M\big\}$, defined in Eq. (5) with a bounded $\ell_{1,2;2}$ norm. Then, for any integer $n$ and $0 < \delta < 1$, with a probability of at least $1 - \delta$ over samples of length $n$, the following holds:
$$\mathbb{E}\, L(Y, f(X)) \le \hat{\mathbb{E}}_n \phi(Y, f(X)) + 2kM X_\infty^{UB} \sqrt{\frac{2c \ln 2d}{n}} + \sqrt{\frac{8 \ln(2/\delta)}{n}},$$
where $X_\infty^{UB}$ is an upper bound on the $\ell_\infty$ norm of $x$.
Proof: Plugging Lemma 3 in Theorem 2, and using the fact that the Lipschitz constant for the sum of hinges loss is 1, and that $\mathbb{E}_X[\cdot] \le \sup_{X \in \mathcal{X}}[\cdot]$, we get the following generalization bound:
$$\mathbb{E}\, L(Y, f(X)) \le \hat{\mathbb{E}}_n \phi(Y, f(X)) + \frac{2kM}{n} \sup_{X \in \mathcal{X}} \mathbb{E}_g \|Xg\|_{p^*,q^*;r} + \sqrt{\frac{8\ln(2/\delta)}{n}}. \qquad (6)$$
For $p = 1$, $q = 2$, $r = 2$ we have $p^* = \infty$, $q^* = 2$, $r^* = 2$,
$$\sup_{X \in \mathcal{X}} \mathbb{E}_g \|Xg\|_{\infty,2;2} \le \sup_{X \in \mathcal{X}} \mathbb{E}_g \|Xg\|_{\infty,1;2} = \sup_{x_i \in \mathcal{X}} \mathbb{E}_g \sum_{s=1}^c \max_{t=1\ldots d} \Big| \sum_{i=1}^n x_{i,t}\, g_{i,s} \Big|. \qquad (7)$$
Next we bound the right expectation, using a technique used by Massart [Massart, 2000]. Using Jensen's inequality, we get for all $\mu > 0$:
$$\begin{aligned}
\exp\Big( \mu\, \mathbb{E}_g \Big[ \sum_{s=1}^c \max_{t=1\ldots d} \Big| \sum_{i=1}^n x_{i,t}\, g_{i,s} \Big| \Big] \Big)
&\le \mathbb{E}_g \exp\Big( \mu \sum_{s=1}^c \max_{t=1\ldots d} \Big| \sum_{i=1}^n x_{i,t}\, g_{i,s} \Big| \Big) \\
&= \mathbb{E}_g \prod_{s=1}^c \max_{t=1\ldots d} \exp\Big( \mu \Big| \sum_{i=1}^n x_{i,t}\, g_{i,s} \Big| \Big) \\
&\le \prod_{s=1}^c \sum_{t=1}^d \mathbb{E}_g \exp\Big( \mu \Big| \sum_{i=1}^n x_{i,t}\, g_{i,s} \Big| \Big) \\
&\le 2 \prod_{s=1}^c \sum_{t=1}^d \mathbb{E}_g \exp\Big( \mu \sum_{i=1}^n x_{i,t}\, g_{i,s} \Big) \\
&= 2 \prod_{s=1}^c \sum_{t=1}^d \prod_{i=1}^n \mathbb{E}_g \big[ \exp\big( \mu\, x_{i,t}\, g_{i,s} \big) \big] \\
&= 2 \prod_{s=1}^c \sum_{t=1}^d \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\Big( \mu\, x_{i,t}\, g_{i,s} - \frac{1}{2} g_{i,s}^2 \Big)\, dg_{i,s}.
\end{aligned}$$
Using the Gaussian integral, $\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$, we bound the RHS of the last equality with,
$$2 \prod_{s=1}^c \sum_{t=1}^d \prod_{i=1}^n \exp\Big(\frac{1}{2}\mu^2 x_{i,t}^2\Big) = 2 \prod_{s=1}^c \sum_{t=1}^d \exp\Big(\frac{1}{2}\mu^2 \sum_{i=1}^n x_{i,t}^2\Big) \le 2d\, \exp\Big(\frac{1}{2}\mu^2 \sum_{s=1}^c \sum_{i=1}^n \|x_i\|_\infty^2\Big).$$
By taking log and dividing by $\mu$, we get (for $\mu > 0$):
$$\mathbb{E}_g \Big[\sum_{s=1}^c \max_{t=1\ldots d}\Big|\sum_{i=1}^n x_{i,t}\, g_{i,s}\Big|\Big] \le \frac{\ln 2d}{\mu} + \frac{\mu}{2} \sum_{s=1}^c \sum_{i=1}^n \|x_i\|_\infty^2.$$
Setting $\mu = \sqrt{\dfrac{2 \ln 2d}{\sum_{s=1}^c \sum_{i=1}^n \|x_i\|_\infty^2}}$ leads to the following result,
$$\mathbb{E}_g\Big[\sum_{s=1}^c \max_{t=1\ldots d}\Big|\sum_{i=1}^n x_{i,t}\, g_{i,s}\Big|\Big] \le \sqrt{2 \sum_{s=1}^c \sum_{i=1}^n \|x_i\|_\infty^2 \ln 2d} = \sqrt{2c \sum_{i=1}^n \|x_i\|_\infty^2 \ln 2d}. \qquad (8)$$
We denote by $X_\infty^{UB}$ any bound on the $\ell_\infty$ norm of $x$. Substituting Eq. (8) in Eq. (7) yields,
$$\sup_{X \in \mathcal{X}} \mathbb{E}_g \|Xg\|_{\infty,2;2} \le X_\infty^{UB} \sqrt{2nc \ln 2d}. \qquad (9)$$
Plugging the right term of Eq. (9) in Eq. (6) produces the desired bound. $\square$

Using other norms would yield different terms in Eq. (9). This demonstrates that the norm may be selected according to bounds and parameters of the data.
4 Multi Class Learning Optimization

As mentioned above, our multi class optimization problem is a sum of a loss term and a regularization mixed norm term:
$$\min_\omega L(\omega, \{(x_i, y_i)\}) + \frac{\lambda}{q} \|\omega\|_{p,q}^q, \qquad (10)$$
where $L(\omega, \{(x_i, y_i)\}) = \sum_{i=1}^n L(\omega, (x_i, y_i))$ is the sum of losses.

We implemented a proximal splitting algorithm [Combettes and Pesquet, 2011] to solve this problem, which is convex and composed of a sum of a smooth function and a non-differentiable function. The algorithm we used is forward-backward splitting, which is suitable for such problems.

The forward-backward algorithm iterates between two steps: a gradient step for the smooth function, called the forward step, and a proximity step, called the backward step, in which the proximity operator of the second function is applied to the result of the previous step. Pseudo code of the algorithm is given in Alg. 1.

Algorithm 1 Forward-backward algorithm for multi class with $\ell_{p,q}$ regularization
Initialize: $\omega_0$, $k = 0$
Iterate:
1. set $\gamma_k$ according to Eq. (11)
2. $\theta_k = \omega_k - \gamma_k \nabla_{\omega_k} L(\omega_k, \{(x_i, y_i)\})$
3. $\omega_{k+1} = \mathrm{prox}_{\frac{\gamma_k \lambda}{q}\|\cdot\|_{p,q}^q}(\theta_k)$
4. $k \leftarrow k + 1$

For the gradient step we use an adaptive learning rate $\gamma_k$ for each iteration $k$. The value of $\gamma_k$ is selected by starting with a high value and reducing it until the following condition is met [Bach et al., 2011],
$$L(\omega_{\mathrm{new}}) \le L(\omega_k) + \mathrm{Tr}\big( \nabla L(\omega_k)^\top (\omega_{\mathrm{new}} - \omega_k) \big) + \frac{1}{2\gamma_k} \|\omega_{\mathrm{new}} - \omega_k\|_2^2. \qquad (11)$$
In our experiments we found that this method indeed produced better results than a constant $\gamma$. For the third step of the algorithm we use the proximity operator defined by Moreau [Moreau, 1962] for lower semicontinuous convex functions, adapted here for mixed norms,
$$\mathrm{prox}_{\frac{\gamma\lambda}{q}\|\cdot\|_{p,q}^q}(\theta) = \arg\min_\omega\ \frac{\gamma\lambda}{q} \|\omega\|_{p,q}^q + \frac{1}{2}\|\omega - \theta\|_2^2.$$
Proximity operators for mixed norms were developed by Kowalski [Kowalski, 2009], with closed expressions for $p, q \in \{1, 2\}$. These operators were used for binary kernel classification [Kowalski et al., 2009].
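The sketch below (illustrative only; the function names, the generic prox interface and the default parameter values are ours) implements the loop of Alg. 1 with the backtracking rule of Eq. (11). Soft thresholding is shown as the proximity operator for the $p = q = 1$ case; for the $\ell_{1,2;2}$ regularizer of Section 2 one would plug in the corresponding closed-form operator from [Kowalski, 2009] in the same place.

    import numpy as np

    def soft_threshold(theta, tau):
        # Proximity operator of tau * ||.||_{1,1}: elementwise soft thresholding (p = q = 1).
        return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)

    def forward_backward(loss, grad, prox, W0, lam, gamma0=1.0, shrink=0.5, n_iter=200):
        # loss(W): smooth loss value; grad(W): its gradient (same shape as W);
        # prox(theta, t): proximity operator of t times the regularizer; lam: tradeoff parameter.
        W = W0.copy()
        for _ in range(n_iter):
            g = grad(W)
            gamma = gamma0
            while True:
                W_new = prox(W - gamma * g, gamma * lam)   # forward (gradient) then backward (prox) step
                diff = W_new - W
                # Eq. (11): shrink gamma until the quadratic upper bound holds.
                if loss(W_new) <= loss(W) + np.sum(g * diff) + np.sum(diff ** 2) / (2 * gamma):
                    break
                gamma *= shrink
            W = W_new
        return W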
5 Empirical study

In order to empirically analyze $\ell_{1,2;2}$ regularization, we evaluated all six regularization functions obtained from the combinations $p, q \in \{1, 2\}$ (together with the order parameter $r$ for the mixed norms), combined with the multi class log-loss shown in Eq. (1). (We also tried the multi class hinge loss mentioned above, yet its performance was in general inferior to the log-loss and thus it is omitted.)

Fourteen (14) multi class document classification problems, whose properties are summarized in Table 1, were used. The 20 Newsgroups dataset contains approximately 20,000 newsgroup messages. The two Amazon datasets contain product reviews, and given a review, the task is to predict one of seven product categories, or a subset of three categories. The task of the seven Enron datasets is to automatically sort emails into one of ten folders. The three tasks based on the New York Times corpus are to predict the desk that produced the story (Financial, Sports, etc.), the online section to which the article was posted, and the section in which the article was printed. Finally, the Reuters corpus (RCV1-v2) contains a subset of 5,000 newswire stories that should be labeled with one of four general topics: corporate, economics, government and markets. Additional details can be found in a recent paper [Crammer et al., 2009]. The two rightmost columns of Table 1 show the mean and standard deviation of the number of examples per class. Clearly some datasets are well balanced while others are far from being balanced.

Table 1: Multi class datasets

Dataset     | Ex. # | # Features | Cl. # | Ex. per class Mean | Ex. per class STD
20 News     | 18828 | 252122     | 20    | 941   | 97
Amazon 7    | 13580 | 686724     | 7     | 1940  | 0
Amazon 3    | 38781 | 1876019    | 3     | 12927 | 0
Enron bec   | 751   | 7134       | 10    | 75    | 37
Enron far   | 3020  | 13561      | 10    | 302   | 290
Enron kam   | 3172  | 18441      | 10    | 317   | 194
Enron kit   | 2345  | 15688      | 10    | 235   | 163
Enron lok   | 1966  | 16012      | 10    | 197   | 278
Enron san   | 863   | 10921      | 10    | 86    | 108
Enron wil   | 2542  | 8816       | 10    | 254   | 487
NYT desk    | 10000 | 114534     | 26    | 385   | 703
NYT online  | 10000 | 114534     | 25    | 400   | 699
NYT section | 10000 | 114534     | 20    | 500   | 827
Reuters 4   | 5000  | 268170     | 4     | 1250  | 725

We first compared the algorithms (or regularizations) in terms of classification performance. The value of the tradeoff parameter $\lambda$ was set using a random split of the data into a training set consisting of 80% of the examples and a test set with the remaining 20%. This partition was used for the optimal $\lambda$ selection. The goal of the process was to find the parameters with the optimal performance. The final results are based on 8-fold cross validation.

Performance is evaluated using Macro-F1, which averages over classes the per-class F1, i.e. the harmonic mean of that class's precision and recall. We also use macro-precision and macro-recall, the averages of precision and recall over classes. Additionally, we evaluate the sparsity of a model by the fraction of zero elements in it. Higher F1 indicates better prediction performance, while higher sparsity indicates smaller models.
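A minimal sketch of these evaluation measures (illustrative only; the convention of scoring a class with no predictions or no examples as zero is an assumption we make, it is not stated in the text):

    import numpy as np

    def macro_scores(y_true, y_pred, c):
        # Macro precision, recall and F1: per-class scores averaged over the c classes.
        precisions, recalls, f1s = [], [], []
        for s in range(c):
            tp = np.sum((y_pred == s) & (y_true == s))
            p = tp / max(np.sum(y_pred == s), 1)
            r = tp / max(np.sum(y_true == s), 1)
            f1s.append(2 * p * r / (p + r) if p + r > 0 else 0.0)
            precisions.append(p)
            recalls.append(r)
        return np.mean(precisions), np.mean(recalls), np.mean(f1s)

    def sparsity(W):
        # Fraction of zero elements in the model.
        return np.mean(W == 0)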
Fig. 1(a) shows the Macro-F1 vs sparsity for the six regularizations, averaged over all datasets. Higher points indicate better prediction performance, while points to the right indicate sparser models. Both $\ell_{2,2}$ and $\ell_{2,1;2}$ yield similar performance and, as expected, very low sparsity. The performance of $\ell_{1,1}$, $\ell_{1,2;1}$ and $\ell_{2,1;1}$ is lower, yet they yield sparser models. The best algorithm in terms of both performance and sparsity is $\ell_{1,2;2}$. This is surprising, as we evaluate sparsity using the overall sparsity, which is exactly what $\ell_{1,1}$ is minimizing.

To better understand the performance of each algorithm we plot in Fig. 1(b) the precision vs recall. Regularization with the $\ell_{2,2}$ norm yields relatively high precision and recall. The common sparsity regularization $\ell_{1,1}$ is worse both in terms of recall and precision. Our proposed sparsity regularization $\ell_{1,2;2}$ has lower precision than $\ell_{2,2}$ (yet better than $\ell_{1,1}$), but achieves the highest recall, which allows it to have the highest F1 altogether.

[Figure 1: Summary of performance results on 14 multi class tasks. (a) Macro F1 (%) vs. Sparsity (%). (b) Macro Precision (%) vs. Macro Recall (%).]

We next analyze the structure of the models over different rates of sparsity, as it may be the case that we are limited by sparsity constraints. We trained 128 models for all five regularizations (excluding $\ell_{2,2}$) using $\lambda$ values in the range $10^{-5}$ to $10^{2}$.

For each of the models we computed class and feature sparsity statistics, specifically the standard deviation of each group sparsity. First, we compute the class sparsity by computing the non-zero rate of class terms per feature $t$, $\frac{1}{c}\sum_{s=1}^c \mathbb{I}[\omega_{t,s} \neq 0]$, and then taking the standard deviation of these quantities, denoted by STDc. Second, similarly, we compute the feature sparsity by computing the non-zero rate of feature terms per class $s$, $\frac{1}{d}\sum_{t=1}^d \mathbb{I}[\omega_{t,s} \neq 0]$, and then taking the standard deviation of these quantities, denoted by STDf. A low value of STDc indicates that for all features the amount of class sparsity is close to uniform; that is, the number of non-zero elements per row is close to a constant. Similarly, a low STDf indicates a consistent feature sparsity.
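The following sketch (illustrative only; the function and key names are ours) computes these statistics from a $d \times c$ model matrix:

    import numpy as np

    def sparsity_profile(W):
        nz = (W != 0)
        class_rate_per_feature = nz.mean(axis=1)   # (1/c) * sum_s 1[w_{t,s} != 0], one value per feature t
        feature_rate_per_class = nz.mean(axis=0)   # (1/d) * sum_t 1[w_{t,s} != 0], one value per class s
        return {
            "STDc": class_rate_per_feature.std(),  # spread of class sparsity over features
            "STDf": feature_rate_per_class.std(),  # spread of feature sparsity over classes
            "nonzero_rate": nz.mean(),             # fraction of non-zero elements (x-axis of Fig. 2)
        }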
Results for three datasets are given in Fig. 2. The left panels show STDf vs the total non-zero rate for three datasets, and the right panels show STDc vs the total non-zero rate for the same datasets. In all plots the STD goes to zero when the non-zero rate goes to zero (high sparsity), as more and more entries of the model are set to zero.

[Figure 2: Group sparsity STD vs fraction of non-zero elements (log scale). Left: Feature sparsity STD (STDf). Right: Class sparsity STD (STDc). Panels: (a)-(b) Amazon 7, (c)-(d) Enron bec, (e)-(f) NYT online.]

Focusing on the left panels, we see that STDf is zero for $\ell_{2,1;1}$, as this regularization either does not enforce sparsity of a feature, or cancels the entire feature altogether, and thus there is a constant feature sparsity for all classes. The second best (most uniform) regularization is $\ell_{1,2;2}$, then $\ell_{1,1}$, and $\ell_{1,2;1}$ has the least consistent feature sparsity, as it enforces a large amount of zero entries per feature, yet the number of classes is no more than 50 (the options for term removal are limited).

Moving to the right panels, we observe that $\ell_{1,2;1}$ has the lowest STDc, then $\ell_{1,1}$ and $\ell_{1,2;2}$ are close to each other, and $\ell_{2,1;1}$ has the highest STDc. In both measures, the $\ell_{1,1}$ STD values are mediocre compared to the other regularizations' values, which fits its lack of a specific sparsity structure. As expected, $\ell_{1,2;1}$ has a consistent class sparsity (low STDc) and $\ell_{1,2;2}$ has a consistent feature sparsity (low STDf). However, $\ell_{1,2;1}$ has a low feature consistency (high STDf) while $\ell_{1,2;2}$ has a medium class consistency (very close to $\ell_{1,1}$). This can be explained, as before, by the fact that $c \ll d$, which enables scattered non-zero terms for each of the classes, allowing a higher chance for features with a consistent class sparsity. $\ell_{2,1;2}$ yields only dense results, hence it is not relevant for this analysis.

Similar behavior was evident in all other datasets.

6 Conclusions and Future Work

This work presents a novel approach for multi class problems, proposing individual feature selection, which is formulated via the $\ell_{1,2;2}$ norm regularization. This approach was thoroughly investigated and compared with the common $\ell_{2,2}$, $\ell_{1,1}$ and $\ell_{2,1}$ regularizations, examining both performance and sparsity pattern results. The empirical study, conducted on 14 multi class datasets, demonstrates the superiority of the $\ell_{1,2;2}$ regularization for multi class problems, outperforming the other regularizations. Interestingly, the results for $\ell_{1,2;2}$ are not only better but also sparser than those of all other regularizations, including the general sparsity promoting $\ell_{1,1}$. It was also shown that the sparsity patterns matched the expectations, with consistent feature sparsity results for $\ell_{1,2;2}$. In addition, theoretical guarantees were proved using robustness and Rademacher analysis. These guarantees supply theoretical reasoning for choosing the $\ell_{1,2;2}$ norm, given specific prior knowledge.

Future work may analyze and find criteria for the performance advantage of the proposed regularization and further investigate specific loss functions and types of datasets, as well as additional types of problems, e.g. multi task. Another possible direction is exploring combinations (sums) of different regularizations to obtain a mixed effect.
References

[Bach et al., 2011] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press, 2011.

[Bakin, 1999] Sergey Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, Australian National University, 1999.

[Bartlett and Mendelson, 2003] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, March 2003.

[Benedek and Panzone, 1961] A. Benedek and R. Panzone. The space L^p, with mixed norm. Duke Mathematical Journal, (28):301–324, 1961.

[Boser et al., 1992] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 144–152, 1992.

[Bradley and Bagnell, 2009] David Bradley and J. Andrew Bagnell. Convex coding. Technical Report CMU-RI-TR-09-22, Robotics Institute, Pittsburgh, PA, May 2009.

[Combettes and Pesquet, 2011] Patrick Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.

[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.

[Crammer and Singer, 2001] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, December 2001.

[Crammer et al., 2009] Koby Crammer, Mark Dredze, and Alex Kulesza. Multi-class confidence weighted algorithms. In EMNLP, pages 496–504, 2009.

[Duchi and Singer, 2009a] John Duchi and Yoram Singer. Boosting with structural sparsity. In ICML '09, 2009.

[Duchi and Singer, 2009b] John Duchi and Yoram Singer. Efficient learning using forward-backward splitting. In NIPS, pages 495–503, 2009.

[Fornasier and Rauhut, 2008] Massimo Fornasier and Holger Rauhut. Recovery algorithms for vector-valued data with joint sparsity constraints. SIAM Journal on Numerical Analysis, 46(2):577–613, 2008.

[Hoerl and Kennard, 1970] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, (12):55–67, 1970.

[Hoerl, 1962] A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58(3):54–59, 1962.

[Kowalski and Torrésani, 2008] Matthieu Kowalski and Bruno Torrésani. Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal, Image and Video Processing, 3(3):251–264, 2008.

[Kowalski et al., 2009] Matthieu Kowalski, Marie Szafranski, and Liva Ralaivola. Multiple indefinite kernel learning with mixed norm regularization. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 545–552, 2009.

[Kowalski, 2009] Matthieu Kowalski. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303–324, 2009.

[Lee et al., 2004] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, 2004.

[Mairal et al., 2010] Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems 23, pages 1558–1566, 2010.

[Massart, 2000] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX(2):245–303, 2000.

[Moreau, 1962] J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math, (255), 1962.

[Mukherjee et al., 2002] Sayan Mukherjee, Ryan Rifkin, and Tomaso Poggio. Regression and classification with regularization, 2002.

[Szafranski et al., 2008] Marie Szafranski, Yves Grandvalet, and Pierre Morizet-Mahoudeaux. Hierarchical penalization. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.

[Teschke and Ramlau, 2007] Gerd Teschke and Ronny Ramlau. An iterative algorithm for nonlinear inverse problems with joint sparsity constraints in vector-valued regimes and an application to color image inpainting. Inverse Problems, 23(5):1851, 2007.

[Tibshirani, 1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), (58), 1996.

[Tikhonov and Arsenin, 1977] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems, 1977.

[Weston and Watkins, 1998] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, 1998.

[Xu et al., 2009] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, (10):1485–1510, 2009.

[Yuan and Lin, 2006] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.

[Zhao et al., 2009] Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37:3468, 2009.