Multi Class Learning With Individual Sparsity
different per class, yet the worst classes' noise is not too large. We also report results on 14 text classification problems with a large range of sizes, numbers of classes and dimensions. We show that our proposed regularization attains higher F1 scores than any other mixed norm regularization we tried, yet results in sparser models. One explanation for this difference is that our algorithm attains higher recall values (over classes), paying in precision. This demonstrates exactly the point that the obtained models are not too sparse for some classes, as models learned with $\ell_{1,1}$ regularization may be (when some classes contain very few non-zero model terms, they may never be predicted).

Related work: Most work with mixed norm regularization focused on $\ell_{2,1}$ and $\ell_{\infty,1}$ norms [Duchi and Singer, 2009b; 2009a; Bradley and Bagnell, 2009; Zhao et al., 2009; Mairal et al., 2010]. Work on mixed norm regularization including $\ell_{1,2}$ was proposed for signal estimation [Kowalski, 2009; Kowalski and Torrésani, 2008] and kernel learning for binary classification [Kowalski et al., 2009]. In work on hierarchical penalization [Szafranski et al., 2008] an $\ell_{4/3,1}$ norm was derived and used. The $\ell_{p,1}$ norm regularization was also used in other contexts [Fornasier and Rauhut, 2008; Teschke and Ramlau, 2007].

Notation: Given a matrix $A$, its mixed norm, $\ell_{p,q;r}$, is defined by computing the $\ell_p$ norm of each column (or row) of $A$, and then computing the $\ell_q$ norm of the result. The order of the summation (columns or rows first) is defined by $r \in \{1,2\}$. We define a mixed norm where we either first sum over rows ($r=1$) or over columns ($r=2$),
\[
\|A\|_{p,q;1} = \Bigg( \sum_t \Big( \sum_s |A_{t,s}|^p \Big)^{q/p} \Bigg)^{1/q}
\qquad\qquad
\|A\|_{p,q;2} = \Bigg( \sum_s \Big( \sum_t |A_{t,s}|^p \Big)^{q/p} \Bigg)^{1/q} .
\]
An early work on mixed norm spaces was done by Benedek and Panzone in 1961 [Benedek and Panzone, 1961].

2 Individual Variable Selection
Group selection, promoted by group LASSO and other mixed norms of the form $\ell_{p,1}$, assumes that one group of variables has good descriptive qualities for the entire problem. Specifically, for feature selection in multi class, the assumption is that a subset of the features is globally able to describe all the classes well. That is, a small set is sought that would work well across all classes. We take a different route and assume that for each class there is a small number of features that describe it well. However, we do not assume that this set overlaps other classes' sets, and also, we do not want too small a number of features for some classes.

This intuition leads to the following regularization, $\sum_s (\text{zero norm of column } s)^2$. That is, we compute the number of non-zero elements per class and then take the Euclidean norm of this vector of counts. The Euclidean norm is used, as we want to demote classes for which there are many non-zero elements compared with others. As performed in other contexts, we replace the zero norm with the $\ell_1$ norm, which is convex and continuous, and obtain the following relaxation,
\[
\Omega(\omega) = \frac{1}{2}\|\omega\|_{1,2;2}^2 = \frac{1}{2}\sum_{s=1}^{c}\Bigg(\sum_{t=1}^{d}|\omega_{t,s}|\Bigg)^{2} .
\]
Thus, given a training set $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$ and $y_i \in \{1,\ldots,c\}$, we propose to learn by minimizing the following problem,
\[
\min_{\omega}\; \sum_{i=1}^{n} L(\omega,(x_i,y_i)) + \frac{\lambda}{2}\|\omega\|_{1,2}^2 ,
\]
where $L(\omega,(x_i,y_i))$ is some multi class loss, such as the multi class log loss (with a curvature parameter $\nu$),
\[
L_{\log}(\omega, x, y^*) = \frac{1}{\nu}\log\Bigg(1 + \sum_{y \neq y^*} e^{\nu\left(1 + \omega_y^\top x - \omega_{y^*}^\top x\right)}\Bigg) . \qquad (1)
\]
Previous work with $\ell_{1,2}$ regularization included signal reconstruction [Kowalski, 2009; Kowalski and Torrésani, 2008] and binary classification with multiple kernels [Kowalski et al., 2009]. To the best of our knowledge no work has used $\ell_{1,2}$ regularization for multi class problems.
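To make the definitions above concrete, the following is a minimal NumPy sketch (illustrative only, not the authors' code) of the mixed norm $\|\cdot\|_{p,q;r}$, the regularizer $\Omega$, and the multi class log loss of Eq. (1); all function and variable names are choices made here for illustration.

```python
import numpy as np

def mixed_norm(A, p, q, r):
    """||A||_{p,q;r}: l_p norm inside, l_q norm outside.
    r = 1 applies the l_p norm to each row first, r = 2 to each column first."""
    inner_axis = 1 if r == 1 else 0
    inner = np.sum(np.abs(A) ** p, axis=inner_axis) ** (1.0 / p)
    return np.sum(inner ** q) ** (1.0 / q)

def omega(W):
    """Omega(w) = 1/2 * ||W||_{1,2;2}^2 for a weight matrix W of shape (d, c):
    half the squared Euclidean norm of the vector of per-class l1 norms."""
    per_class_l1 = np.abs(W).sum(axis=0)
    return 0.5 * np.sum(per_class_l1 ** 2)

def multiclass_log_loss(W, x, y_star, nu=1.0):
    """Multi class log loss of Eq. (1), with curvature parameter nu;
    y_star is the index of the correct class."""
    scores = W.T @ x                              # one score per class
    margins = 1.0 + scores - scores[y_star]       # 1 + w_y.x - w_{y*}.x
    margins = np.delete(margins, y_star)          # the sum runs over y != y*
    return np.log1p(np.exp(nu * margins).sum()) / nu
```

Here `omega(W)` equals `0.5 * mixed_norm(W, 1, 2, 2) ** 2`, matching the definition of $\Omega$ above.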
3 Analysis
We now provide an analysis of learning with mixed-norm matrices. First we state an equivalence to robustness under a certain noise, and then a generalization bound based on Gaussian complexity analysis. In both cases we build on previous tools and modify them to our setting. We emphasize that this is not a direct application; a non-trivial derivation is needed.

The properties we show in this section provide theoretical guarantees according to prior knowledge. This prior knowledge may guide us to choose our proposed $\ell_{1,2;2}$ regularization. We extend the equivalence of binary SVM to robust optimization [Xu et al., 2009] to a multi class setting. We also state generalization bounds for $\ell_{1,2;2}$ regularization, using Gaussian complexity.

3.1 Equivalence to noise robustness
Robustness to noise is one of regularization's main goals. Recently Xu et al. [Xu et al., 2009] showed that, with some limitations, regularization for binary classification using the hinge loss is equivalent to robust optimization. We extend this notion to the multi class setting by showing an equivalence of mixed norm regularization and robustness to noise for the sum-of-hinges loss function.

One approach we tried was to reduce multi class learning into a binary classification problem with $c$ times more examples. Applying the result of Xu et al. [Xu et al., 2009] to the new problem provided a robustness equivalence only for $\ell_{p,1;2}$ mixed norms. We thus re-derived this equivalence from scratch. Following the outline of Xu et al. [Xu et al., 2009] we proved the following result (here $y_i \in \{-1,1\}^c$).
Theorem 1 Given a set of examples $\{x_i, y_i\}_{i=1}^{n}$, non separable for each class $s = 1,\ldots,c$ (for each class $s = 1,\ldots,c$ and every $\omega_s \in \mathbb{R}^d$, there is an example $j(s)$ such that $y_{j(s),s}\,(\omega_s^\top x_{j(s)}) < 0$, where $y_{j(s),s}$ is the $s$ term of $y_{j(s)}$), and the two sets,
\[
\check{N} = \Bigg\{ (\delta_1, \cdots, \delta_n) \;:\; \sum_{i=1}^{n} \|\delta_i\|_{p^*,q^*} \le M \Bigg\}
\]
and
\[
\check{N}_0 = \big\{ \delta \;:\; \|\delta\|_{p^*,q^*} \le M \big\} ,
\]
for some $p^*, q^* \ge 1$, the following two optimization problems are equivalent:
\[
\min_{\omega}\; \sup_{\delta_1,\cdots,\delta_n \in \check{N}} \;\sum_{i=1}^{n} \sum_{s=1}^{c} \Big[ 1 - y_{i,s}\big(\omega_s^\top (x_i - \delta_{i,s})\big) \Big]_{+}
\]
and
\[
\begin{aligned}
\min_{\omega,\xi}\; \sup_{\delta \in \check{N}_0}\; &\langle \omega, \delta \rangle + \sum_{i=1}^{n} \sum_{s=1}^{c} \xi_{i,s} \\
\text{s.t.}\;\; &\xi_{i,s} \ge 1 - y_{i,s}\,(\omega_s^\top x_i) \qquad i = 1,\ldots,n;\; s = 1,\ldots,c \\
&\xi_{i,s} \ge 0 \qquad\qquad\qquad\quad\;\; i = 1,\ldots,n;\; s = 1,\ldots,c ,
\end{aligned}
\qquad (2)
\]
where $\langle A, B \rangle = \mathrm{Tr}(A^\top B)$ is the matrix (Frobenius) inner product.

The proof is omitted due to lack of space. When a norm of the perturbation is bounded, the regularization term is similar to the definition of the dual norm (multiplied by the bound value). This means that $\ell_{p^*,q^*}$ is the dual norm of some $\ell_{p,q}$ norm.

The generalization bound is stated in terms of the Gaussian complexity $G_n(F)$ of a function class $F$ [Bartlett and Mendelson, 2003], defined with respect to samples $X_1,\ldots,X_n$ selected independently from the set $\mathcal{X}$ according to a probability $P$ and independent Gaussian random variables $g_1,\ldots,g_n$, where for each $i \in \{1,\ldots,n\}$, $g_i \sim N(0,1)$. Based on this definition Bartlett and Mendelson proved the following theorem.

Theorem 2 [Bartlett and Mendelson, 2003] Let $F$ be a class of functions mapping from $\mathcal{X}$ to $A = \mathbb{R}^c$ and let $F_1,\ldots,F_c$ be real valued classes, such that $F$ is a subset of their direct sum. For a given loss function $L: \mathcal{Y} \times A \to [0,1]$, let $\phi: \mathcal{Y} \times A \to [0,1]$ be a dominating cost function (for all $a \in A$ and $y \in \mathcal{Y}$, $\phi(y,a) \ge L(y,a)$), such that for all $y \in \mathcal{Y}$, $\phi(y,\cdot)$ is a Lipschitz function (with respect to Euclidean distance on $A$) with a constant $\Lambda$. Let $(X_i, Y_i)_{i=1}^{n}$ be samples selected independently according to probability $P$. Then, for any integer $n$ and $0 < \delta < 1$, with a probability of at least $1-\delta$ over samples of size $n$, for every $f \in F$ the following holds:
\[
E\, L(Y, f(X)) \le \hat{E}_n \phi(Y, f(X)) + k\Lambda \sum_{s=1}^{c} G_n(F_s) + \sqrt{\frac{8 \ln(2/\delta)}{n}} , \qquad (4)
\]
where $k$ is some constant.

We focus on the sum of clipped hinges loss,
\[
\phi(Y, f(X)) = \frac{1}{c} \sum_{s=1}^{c} \min\Big\{ 1, \max\big\{ 0,\; 1 - Y_s \cdot \omega_s^\top X \big\} \Big\} ,
\]
with
\[
\hat{E}_n \phi(Y, f(X)) = \frac{1}{n} \sum_{i=1}^{n} \phi(Y_i, f(X_i)) \quad \text{and} \quad f_s \in F_s^M
\]
such that
\[
F_s^M = \big\{ x \mapsto \omega_s^\top x \;:\; \omega_s \in \mathbb{R}^d \big\} , \qquad s \in \{1,\ldots,c\} .
\]
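As a concrete reference for the two loss functions used in this section, here is a small NumPy sketch (illustrative, not the authors' code) of the per-example sum-of-hinges loss appearing in problem (2) and the clipped, class-averaged variant $\phi$ used as the dominating cost in Theorem 2; labels are vectors $y \in \{-1,+1\}^c$.

```python
import numpy as np

def sum_of_hinges(W, x, y_vec):
    """Sum over classes of hinge losses for one example; y_vec is in {-1,+1}^c."""
    margins = y_vec * (W.T @ x)                  # y_s * (w_s . x) for every class s
    return np.maximum(0.0, 1.0 - margins).sum()

def clipped_hinges(W, x, y_vec):
    """Dominating cost phi of Theorem 2: the hinge clipped to [0, 1],
    averaged over the c classes, so the value always lies in [0, 1]."""
    margins = y_vec * (W.T @ x)
    return np.minimum(1.0, np.maximum(0.0, 1.0 - margins)).mean()
```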
Proof: Plugging Lemma 3 into Theorem 2, using the fact that the Lipschitz constant for the sum of hinges loss is 1, and that $E_X[\cdot] \le \sup_{X \in \mathcal{X}}[\cdot]$, we get the following generalization bound:
\[
E\, L(Y, f(X)) \le \hat{E}_n \phi(Y, f(X)) + \frac{2kM}{n} \sup_{X \in \mathcal{X}} E_g \|Xg\|_{p^*,q^*;r} + \sqrt{\frac{8 \ln(2/\delta)}{n}} . \qquad (6)
\]
For $p = 1$, $q = 2$, $r = 2$ we have $p^* = \infty$, $q^* = 2$, $r^* = 2$, and
\[
\sup_{X \in \mathcal{X}} E_g \|Xg\|_{\infty,2;2} \le \sup_{X \in \mathcal{X}} E_g \|Xg\|_{\infty,1;2}
= \sup_{x_i \in \mathcal{X}} E_g \sum_{s=1}^{c} \max_{t=1\ldots d} \Bigg| \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg| . \qquad (7)
\]
Next we bound the right expectation, using a technique of Massart [Massart, 2000]. Using Jensen's inequality, we get for all $\mu > 0$:
\begin{align*}
\exp\Bigg( \mu\, E_g \sum_{s=1}^{c} \max_{t=1\ldots d} \Bigg| \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg| \Bigg)
&\le E_g \exp\Bigg( \mu \sum_{s=1}^{c} \max_{t=1\ldots d} \Bigg| \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg| \Bigg) \\
&= E_g \prod_{s=1}^{c} \max_{t=1\ldots d} \exp\Bigg( \mu \Bigg| \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg| \Bigg) \\
&\le \prod_{s=1}^{c} \sum_{t=1}^{d} E_g \exp\Bigg( \mu \Bigg| \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg| \Bigg) \\
&\le 2 \prod_{s=1}^{c} \sum_{t=1}^{d} E_g \exp\Bigg( \mu \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg) \\
&= 2 \prod_{s=1}^{c} \sum_{t=1}^{d} \prod_{i=1}^{n} E_g\big[ \exp(\mu\, x_{i,t}\, g_{i,s}) \big] \\
&= 2 \prod_{s=1}^{c} \sum_{t=1}^{d} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \exp\Big( \mu\, x_{i,t}\, g_{i,s} - \tfrac{1}{2} g_{i,s}^2 \Big)\, dg_{i,s} .
\end{align*}
Using the Gaussian integral, $\int_{-\infty}^{\infty} e^{-x^2} dx = \sqrt{\pi}$, we bound the right-hand side of the last equality with
\[
2 \prod_{s=1}^{c} \sum_{t=1}^{d} \prod_{i=1}^{n} \exp\Big( \tfrac{1}{2} \mu^2 x_{i,t}^2 \Big)
= 2 \prod_{s=1}^{c} \sum_{t=1}^{d} \exp\Big( \tfrac{1}{2} \mu^2 \sum_{i=1}^{n} x_{i,t}^2 \Big)
\le 2d\, \exp\Big( \tfrac{1}{2} \mu^2 \sum_{s=1}^{c} \sum_{i=1}^{n} \|x_i\|_\infty^2 \Big) .
\]
Taking the log and dividing by $\mu$, we get (for $\mu > 0$):
\[
E_g \sum_{s=1}^{c} \max_{t=1\ldots d} \Bigg| \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg|
\le \frac{\ln 2d}{\mu} + \frac{\mu}{2} \sum_{s=1}^{c} \sum_{i=1}^{n} \|x_i\|_\infty^2 .
\]
Setting $\mu = \sqrt{2 \ln 2d \,\Big/\, \sum_{s=1}^{c} \sum_{i=1}^{n} \|x_i\|_\infty^2}$ leads to the following result,
\[
E_g \sum_{s=1}^{c} \max_{t=1\ldots d} \Bigg| \sum_{i=1}^{n} x_{i,t}\, g_{i,s} \Bigg|
\le \sqrt{2 \sum_{s=1}^{c} \sum_{i=1}^{n} \|x_i\|_\infty^2 \,\ln 2d}
= \sqrt{2c \sum_{i=1}^{n} \|x_i\|_\infty^2 \,\ln 2d} . \qquad (8)
\]
We denote by $X_\infty^{UB}$ any bound on the $\ell_\infty$ norm of $x$. Substituting Eq. (8) in Eq. (7) yields
\[
\sup_{X \in \mathcal{X}} E_g \|Xg\|_{\infty,2;2} \le X_\infty^{UB} \sqrt{2nc \ln 2d} . \qquad (9)
\]
Plugging the right term of Eq. (9) into Eq. (6) produces the desired bound.

Using other norms would yield different terms in Eq. (9). This demonstrates that the norm may be selected according to bounds and parameters of the data.

4 Multi Class Learning Optimization
As mentioned above, our multi class optimization problem is a sum of a loss term and a mixed norm regularization term:
\[
\min_{\omega}\; L(\omega, \{(x_i, y_i)\}) + \frac{\lambda}{q} \|\omega\|_{p,q}^{q} , \qquad (10)
\]
where $L(\omega, \{(x_i, y_i)\}) = \sum_{i=1}^{n} L(\omega, (x_i, y_i))$ is the sum of losses.

We implemented a proximal splitting algorithm [Combettes and Pesquet, 2011] to solve this problem, which is convex and composed of a sum of a smooth function and a non-differentiable function. The algorithm we used is forward-backward splitting, which is suitable for such problems.

The forward-backward algorithm iterates between two steps: a gradient step for the smooth function, called the forward step, and a proximity step, called the backward step, in which the proximity operator of the second function is applied to the result of the previous step. Pseudo code of the algorithm is given in Alg. 1.

For the gradient step we use an adaptive learning rate $\gamma_k$ for each iteration $k$. The value of $\gamma_k$ is selected by starting with a high value and reducing it until the following condition is met [Bach et al., 2011],
\[
L(\omega_{new}) \le L(\omega_k) + \mathrm{Tr}\Big( \nabla L(\omega_k)^\top (\omega_{new} - \omega_k) \Big) + \frac{1}{2\gamma_k} \|\omega_{new} - \omega_k\|_2^2 . \qquad (11)
\]
In our experiments we found that this method indeed produced better results than a constant $\gamma$. For the third step of the algorithm we use the proximity operator defined by Moreau [Moreau, 1962] for lower semicontinuous convex functions, adapted here for mixed norms,
\[
\mathrm{prox}_{\frac{\gamma\lambda}{q} \|\cdot\|_{p,q}^{q}}(\theta) = \arg\min_{\omega}\; \frac{\gamma\lambda}{q} \|\omega\|_{p,q}^{q} + \frac{1}{2} \|\omega - \theta\|_2^2 .
\]
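The step-size rule of Eq. (11) can be implemented as a simple backtracking search. The sketch below is illustrative only; the initial value, shrink factor and retry limit are assumptions, not values from the paper, and `loss` and `grad` stand for the smooth loss term $L$ and its gradient.

```python
import numpy as np

def backtracking_gamma(loss, grad, w, gamma_init=1.0, shrink=0.5, max_tries=50):
    """Shrink gamma until the quadratic upper-bound condition of Eq. (11) holds
    for the forward (gradient) step w_new = w - gamma * grad(w)."""
    g = grad(w)
    gamma = gamma_init
    for _ in range(max_tries):
        w_new = w - gamma * g
        lhs = loss(w_new)
        rhs = (loss(w)
               + np.sum(g * (w_new - w))                    # Tr(grad^T (w_new - w))
               + np.sum((w_new - w) ** 2) / (2.0 * gamma))  # (1 / 2*gamma) ||w_new - w||^2
        if lhs <= rhs:
            break
        gamma *= shrink                                     # reduce gamma and retry
    return gamma, w_new
```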
Algorithm 1 Forward-backward algorithm for multi class with $\ell_{p,q}$ regularization
Initialize: $\omega_0$, $k = 0$
Iterate:
1. Set $\gamma_k$ according to Eq. (11)
2. $\theta_k = \omega_k - \gamma_k \nabla_{\omega_k} L(\omega_k, \{(x_i, y_i)\})$
3. $\omega_{k+1} = \mathrm{prox}_{\frac{\gamma_k \lambda}{q} \|\cdot\|_{p,q}^{q}}(\theta_k)$
4. $k \leftarrow k + 1$

Proximity operators for mixed norms were developed by Kowalski [Kowalski, 2009], with closed expressions for $p, q \in \{1, 2\}$. These operators were used for binary kernel classification [Kowalski et al., 2009].

Datasets used in the evaluation (Ex. # = number of examples, Cl. # = number of classes; mean and STD of examples per class):

Dataset       Ex. #   Features   Cl. #   Ex. per class (Mean / STD)
20 News       18828   252122     20      941 / 97
Amazon 7      13580   686724     7       1940 / 0
Amazon 3      38781   1876019    3       12927 / 0
Enron bec     751     7134       10      75 / 37
Enron far     3020    13561      10      302 / 290
Enron kam     3172    18441      10      317 / 194
Enron kit     2345    15688      10      235 / 163
Enron lok     1966    16012      10      197 / 278
Enron san     863     10921      10      86 / 108
Enron wil     2542    8816       10      254 / 487
NYT desk      10000   114534     26      385 / 703
NYT online    10000   114534     25      400 / 699
NYT section   10000   114534     20      500 / 827
Reuters 4     5000    268170     4       1250 / 725
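Returning to Alg. 1, the following sketch instantiates it for the $\ell_{1,2;2}$ penalty of Section 2, $\frac{\lambda}{2}\|\omega\|_{1,2;2}^2$. Since that penalty separates over class columns, its proximity operator reduces to the prox of $\frac{c}{2}(\|v\|_1)^2$ applied column by column; the closed form below follows the standard fixed-point characterization of that prox and is an illustrative implementation written for this sketch, not code from [Kowalski, 2009]. The constant step size and iteration count are arbitrary choices; the adaptive $\gamma_k$ of Eq. (11), sketched above, could be plugged in instead.

```python
import numpy as np

def prox_squared_l1(v, c):
    """prox of f(v) = (c/2) * (||v||_1)^2: soft-threshold v at level c*theta,
    where theta solves theta = sum_j max(|v_j| - c*theta, 0)."""
    a = np.sort(np.abs(v))[::-1]                       # |v| sorted in decreasing order
    thetas = np.cumsum(a) / (1.0 + c * np.arange(1, v.size + 1))
    valid = a > c * thetas                             # candidate active-set sizes
    if not np.any(valid):                              # happens only for v = 0
        return np.zeros_like(v)
    theta = thetas[np.nonzero(valid)[0].max()]
    return np.sign(v) * np.maximum(np.abs(v) - c * theta, 0.0)

def forward_backward_l122(grad, W0, lam, gamma=0.1, n_iters=200):
    """Alg. 1 for  min_W L(W) + (lam/2) * ||W||_{1,2;2}^2  with W of shape (d, c).
    `grad(W)` returns the gradient of the smooth loss term L."""
    W = W0.copy()
    for _ in range(n_iters):
        theta = W - gamma * grad(W)                    # forward (gradient) step
        W = np.column_stack([prox_squared_l1(theta[:, s], gamma * lam)
                             for s in range(theta.shape[1])])   # backward (prox) step
    return W
```

For example, `grad` could return the gradient of the summed multi class log loss of Eq. (1) over the training set.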
[Figure 1: Summary of performance results on 14 multi class tasks. (a) Macro F1 (%) vs. sparsity (%); (b) macro precision (%) vs. macro recall (%). Curves compare the $\ell_{1,2;1}$, $\ell_{1,2;2}$, $\ell_{2,1;1}$, $\ell_{2,1;2}$, $\ell_{1,1}$ and $\ell_{2,2}$ regularizations.]

[Figure 2: Group sparsity STD vs. fraction of non-zero elements (log scale). Left: feature sparsity STD ($STD_f$). Right: class sparsity STD ($STD_c$). Panels: (a), (b) Amazon 7; (c), (d) Enron bec; (e), (f) NYT online.]
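A plausible reading of the two spread measures plotted in Figure 2, consistent with the discussion that follows, is sketched below; the exact definitions and the normalization by $d$ and $c$ are assumptions made here so that both quantities are fractions.

```python
import numpy as np

def sparsity_stds(W):
    """Assumed reading of the spread measures of Figure 2 for a model W of shape (d, c):
    STD_f: std over classes of the fraction of features each class uses
           (low when every class keeps a similarly sized feature set);
    STD_c: std over features of the fraction of classes using each feature
           (low when kept features are shared by similarly many classes)."""
    nz = W != 0
    frac_features_per_class = nz.sum(axis=0) / W.shape[0]   # one value per class
    frac_classes_per_feature = nz.sum(axis=1) / W.shape[1]  # one value per feature
    return frac_features_per_class.std(), frac_classes_per_feature.std()
```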
Figure 2 shows the spread of the group sparsity patterns as a function of the fraction of non-zero elements (sparsity), as more and more entries of the model are set to zero. Focusing on the left panels, we see that $STD_f$ is zero for $\ell_{2,1;1}$, as this regularization either does not enforce sparsity on a feature or cancels the entire feature altogether, and thus there is a constant feature sparsity for all classes. The second best (or most uniform) regularization is $\ell_{1,2;2}$, then $\ell_{1,1}$, and $\ell_{1,2;1}$ has the least consistent feature sparsity, as it enforces a large number of zero entries per feature, yet the number of classes is no more than 50 (so the options for term removal are limited).

Moving to the right panels, we observe that $\ell_{1,2;1}$ has the lowest $STD_c$; then $\ell_{1,1}$ and $\ell_{1,2;2}$ are close to each other, and $\ell_{2,1;1}$ has the highest $STD_c$. In both measures, the $\ell_{1,1}$ STD values are mediocre compared to the other regularizations' values, which fits its lack of a specific sparsity structure. As expected, $\ell_{1,2;1}$ has a consistent class sparsity (low $STD_c$) and $\ell_{1,2;2}$ has a consistent feature sparsity (low $STD_f$). However, $\ell_{1,2;1}$ has low feature consistency (high $STD_f$), while $\ell_{1,2;2}$ has medium class consistency (very close to $\ell_{1,1}$). This can be explained, as before, by the fact that $c \ll d$, which enables scattered non-zero terms for each of the classes, allowing a higher chance for features with a consistent class sparsity. $\ell_{2,1;2}$ yields only dense results, hence it is not relevant for this analysis. Similar behavior was evident in all other datasets.

6 Conclusions and Future Work
This work presents a novel approach for multi class problems, proposing individual feature selection, formulated via the $\ell_{1,2;2}$ norm regularization. This approach was thoroughly investigated and compared with the common $\ell_{2,2}$, $\ell_{1,1}$ and $\ell_{2,1}$ regularizations, examining both performance and sparsity pattern results. The empirical study, conducted on 14 multi class datasets, demonstrates the superiority of the $\ell_{1,2;2}$ regularization for multi class problems, outperforming the other regularizations. Interestingly, the results for $\ell_{1,2;2}$ are not only better, but also sparser than all other regularizations, including the general sparsity promoting $\ell_{1,1}$. It was also shown that the sparsity patterns matched expectations, with consistent feature sparsity results for $\ell_{1,2;2}$. In addition, theoretical guarantees were proved using robustness and Rademacher analysis. These guarantees supply theoretical reasoning for choosing the $\ell_{1,2;2}$ norm, given specific prior knowledge.

Future work may analyze and find criteria for the performance advantage of the proposed regularization and further investigate specific loss functions and types of datasets, as well as additional types of problems, e.g. multi task learning. Another possible direction is exploring combinations (sums) of different regularizations to obtain a mixed effect.
References

[Bach et al., 2011] Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Convex optimization with sparsity-inducing norms. In Optimization for Machine Learning. MIT Press, 2011.
[Bakin, 1999] Sergey Bakin. Adaptive regression and model selection in data mining problems. PhD thesis, Australian National University, 1999.
[Bartlett and Mendelson, 2003] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, March 2003.
[Benedek and Panzone, 1961] A. Benedek and R. Panzone. The space L^p, with mixed norm. Duke Mathematical Journal, 28:301–324, 1961.
[Boser et al., 1992] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, pages 144–152, 1992.
[Bradley and Bagnell, 2009] David Bradley and J. Andrew Bagnell. Convex coding. Technical Report CMU-RI-TR-09-22, Robotics Institute, Pittsburgh, PA, May 2009.
[Combettes and Pesquet, 2011] Patrick Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In H. H. Bauschke, R. S. Burachik, P. L. Combettes, V. Elser, D. R. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
[Cortes and Vapnik, 1995] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–297, 1995.
[Crammer and Singer, 2001] Koby Crammer and Yoram Singer. On the algorithmic implementation of multiclass kernel-based vector machines. Journal of Machine Learning Research, 2:265–292, December 2001.
[Crammer et al., 2009] Koby Crammer, Mark Dredze, and Alex Kulesza. Multi-class confidence weighted algorithms. In EMNLP, pages 496–504, 2009.
[Duchi and Singer, 2009a] John Duchi and Yoram Singer. Boosting with structural sparsity. In ICML '09, 2009.
[Duchi and Singer, 2009b] John Duchi and Yoram Singer. Efficient learning using forward-backward splitting. In NIPS, pages 495–503, 2009.
[Fornasier and Rauhut, 2008] Massimo Fornasier and Holger Rauhut. Recovery algorithms for vector-valued data with joint sparsity constraints. SIAM Journal on Numerical Analysis, 46(2):577–613, 2008.
[Hoerl and Kennard, 1970] A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
[Hoerl, 1962] A. E. Hoerl. Application of ridge analysis to regression problems. Chemical Engineering Progress, 58(3):54–59, 1962.
[Kowalski and Torrésani, 2008] Matthieu Kowalski and Bruno Torrésani. Sparsity and persistence: mixed norms provide simple signal models with dependent coefficients. Signal, Image and Video Processing, 3(3):251–264, 2008.
[Kowalski et al., 2009] Matthieu Kowalski, Marie Szafranski, and Liva Ralaivola. Multiple indefinite kernel learning with mixed norm regularization. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 545–552, 2009.
[Kowalski, 2009] Matthieu Kowalski. Sparse regression using mixed norms. Applied and Computational Harmonic Analysis, 27(3):303–324, 2009.
[Lee et al., 2004] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81, 2004.
[Mairal et al., 2010] Julien Mairal, Rodolphe Jenatton, Guillaume Obozinski, and Francis Bach. Network flow algorithms for structured sparsity. In Advances in Neural Information Processing Systems 23, pages 1558–1566, 2010.
[Massart, 2000] Pascal Massart. Some applications of concentration inequalities to statistics. Annales de la Faculté des Sciences de Toulouse, IX(2):245–303, 2000.
[Moreau, 1962] J. Moreau. Fonctions convexes duales et points proximaux dans un espace hilbertien. CR Acad. Sci. Paris Sér. A Math., 255, 1962.
[Mukherjee et al., 2002] Sayan Mukherjee, Ryan Rifkin, and Tomaso Poggio. Regression and classification with regularization, 2002.
[Szafranski et al., 2008] Marie Szafranski, Yves Grandvalet, and Pierre Morizet-Mahoudeaux. Hierarchical penalization. In Advances in Neural Information Processing Systems 20 (NIPS 2007), 2008.
[Teschke and Ramlau, 2007] Gerd Teschke and Ronny Ramlau. An iterative algorithm for nonlinear inverse problems with joint sparsity constraints in vector-valued regimes and an application to color image inpainting. Inverse Problems, 23(5):1851, 2007.
[Tibshirani, 1996] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58, 1996.
[Tikhonov and Arsenin, 1977] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems, 1977.
[Weston and Watkins, 1998] J. Weston and C. Watkins. Multi-class support vector machines. Technical Report CSD-TR-98-04, 1998.
[Xu et al., 2009] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485–1510, 2009.
[Yuan and Lin, 2006] Ming Yuan and Yi Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
[Zhao et al., 2009] Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Annals of Statistics, 37:3468, 2009.