Annals of the Institute of Statistical Mathematics, vol.60, no.4, pp.699–746, 2008.

Direct Importance Estimation for Covariate Shift Adaptation∗
Masashi Sugiyama
Department of Computer Science, Tokyo Institute of Technology
2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552, Japan
sugi@cs.titech.ac.jp
Taiji Suzuki
Department of Mathematical Informatics, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
s-taiji@stat.t.u-tokyo.ac.jp
Shinichi Nakajima
Nikon Corporation
201-9 Oaza-Miizugahara, Kumagaya-shi, Saitama 360-8559, Japan
nakajima.s@nikon.co.jp
Hisashi Kashima
IBM Research, Tokyo Research Laboratory
1623-14 Shimotsuruma, Yamato-shi, Kanagawa 242-8502, Japan
hkashima@jp.ibm.com
Paul von Bünau
Department of Computer Science, Technical University Berlin
Franklinstr. 28/29, 10587 Berlin, Germany
buenau@cs.tu-berlin.de
Motoaki Kawanabe
Fraunhofer FIRST.IDA
Kekuléstr. 7, 12489 Berlin, Germany
motoaki.kawanabe@first.fraunhofer.de

∗
This is an extended version of an earlier conference paper (Sugiyama et al., 2008). A
MATLAB implementation of the proposed importance estimation method is available from
‘http://sugiyama-www.cs.titech.ac.jp/~sugi/software/KLIEP’.

Abstract
A situation where training and test samples follow different input distributions is called covariate shift. Under covariate shift, standard learning methods such as maximum likelihood estimation are no longer consistent, whereas weighted variants according to the ratio of test and training input densities are consistent. Therefore, accurately estimating the density ratio, called the importance, is one of the key issues in covariate shift adaptation. A naive approach to this task is to first estimate training and test input densities separately and then estimate the importance by taking the ratio of the estimated densities. However, this naive approach tends to perform poorly since density estimation is a hard task particularly in high dimensional cases. In this paper, we propose a direct importance estimation method that does not involve density estimation. Our method is equipped with a natural cross validation procedure and hence tuning parameters such as the kernel width can be objectively optimized. Furthermore, we give rigorous mathematical proofs for the convergence of the proposed algorithm. Simulations illustrate the usefulness of our approach.

Keywords

Covariate shift, Importance sampling, Model misspecification, Kullback-Leibler divergence, Likelihood cross validation.

1 Introduction
A common assumption in supervised learning is that training and test samples follow the
same distribution. However, this basic assumption is often violated in practice and then
standard machine learning methods do not work as desired. A situation where the input
distribution P (x) is different in the training and test phases but the conditional distribu-
tion of output values, P (y|x), remains unchanged is called covariate shift (Shimodaira,
2000). In many real-world applications such as robot control (Sutton and Barto, 1998;
Shelton, 2001; Hachiya et al., 2008), bioinformatics (Baldi and Brunak, 1998; Borgwardt
et al., 2006), spam filtering (Bickel and Scheffer, 2007), brain-computer interfacing (Wol-
paw et al., 2002; Sugiyama et al., 2007), or econometrics (Heckman, 1979), covariate shift
is conceivable and thus learning under covariate shift is gathering a lot of attention these
days.
The influence of covariate shift could be alleviated by weighting the log likelihood
terms according to the importance (Shimodaira, 2000):

$$w(x) := \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)},$$

where $p_{\mathrm{te}}(x)$ and $p_{\mathrm{tr}}(x)$ are test and training input densities. Since the importance is usually unknown, the key issue of covariate shift adaptation is how to accurately estimate the importance¹.
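As a rough illustration of importance weighting, the sketch below fits a straight line by importance-weighted least squares, which corresponds to weighting Gaussian log-likelihood terms. The function name `weighted_line_fit`, the toy data, and the hand-picked weights are assumptions made for this example only; in practice the weights would come from an importance estimate.

```python
def weighted_line_fit(xs, ys, ws):
    """Minimize sum_i w_i * (y_i - a*x_i - b)^2 over slope a and intercept b."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw   # weighted mean of x
    my = sum(w * y for w, y in zip(ws, ys)) / sw   # weighted mean of y
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    a = sxy / sxx
    return a, my - a * mx

# Toy data: y = x^2, so a straight line is a misspecified model.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x for x in xs]
uniform = [1.0, 1.0, 1.0, 1.0]   # ordinary, unweighted fit
shifted = [0.0, 0.0, 1.0, 1.0]   # hypothetical importance: test mass on x >= 2

fit_plain = weighted_line_fit(xs, ys, uniform)
fit_weighted = weighted_line_fit(xs, ys, shifted)
print(fit_plain, fit_weighted)
```

Because the model is misspecified, the two fits differ: the weighted fit tracks the region where the (hypothetical) test inputs concentrate.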
A naive approach to importance estimation would be to first estimate the training
and test densities separately from training and test input samples, and then estimate the
importance by taking the ratio of the estimated densities. However, density estimation is
known to be a hard problem particularly in high-dimensional cases (Härdle et al., 2004).
Therefore, this naive approach may not be effective—directly estimating the importance
without estimating the densities would be more promising.
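The two-step approach described above can be sketched as follows for one-dimensional inputs. This is only a minimal illustration under assumed choices: a Gaussian kernel density estimator with a hand-picked bandwidth `h`, and the hypothetical helper names `gaussian_kde` and `naive_importance`.

```python
import math

def gaussian_kde(samples, h):
    """Return a 1-D Gaussian kernel density estimate with bandwidth h."""
    norm = len(samples) * h * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples) / norm
    return density

def naive_importance(x_tr, x_te, h):
    """Two-step estimate: ratio of separately estimated test and training densities."""
    p_tr = gaussian_kde(x_tr, h)
    p_te = gaussian_kde(x_te, h)
    return lambda x: p_te(x) / p_tr(x)

x_tr = [0.0, 0.5, 1.0, 1.5]   # toy training inputs
x_te = [1.0, 1.5, 2.0, 2.5]   # toy test inputs, shifted to the right
w = naive_importance(x_tr, x_te, h=0.5)
print(w(0.0), w(2.5))         # small on the left, large on the right
```

The division by an estimated training density is what makes this approach fragile: wherever that estimate is close to zero, small errors in it are strongly amplified in the ratio.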
Following this spirit, the kernel mean matching (KMM) method has been proposed
recently (Huang et al., 2007), which directly gives importance estimates without going
through density estimation. KMM is shown to work well, given that tuning parameters
such as the kernel width are chosen appropriately. Intuitively, model selection of impor-
tance estimation algorithms (such as KMM) is straightforward by cross validation (CV)
over the performance of subsequent learning algorithms. However, this is highly unreliable
since the ordinary CV score is heavily biased under covariate shift—for unbiased estima-
tion of the prediction performance of subsequent learning algorithms, the CV procedure
itself needs to be importance-weighted (Sugiyama et al., 2007). Since the importance weight must already be fixed when model selection is carried out by importance-weighted CV, importance-weighted CV cannot be used for model selection of importance estimation algorithms².
The above fact implies that model selection of importance estimation algorithms
should be performed within the importance estimation step in an unsupervised manner.
However, since KMM can only estimate the values of the importance at training input
points, it can not be directly applied in the CV framework; an out-of-sample extension is
needed, but this seems to be an open research issue currently.
In this paper, we propose a new importance estimation method which can overcome
the above problems, i.e., the proposed method directly estimates the importance without
density estimation and is equipped with a natural model selection procedure. Our basic
idea is to find an importance estimate $\hat{w}(x)$ such that the Kullback-Leibler divergence from the true test input density $p_{\mathrm{te}}(x)$ to its estimate $\hat{p}_{\mathrm{te}}(x) = \hat{w}(x) p_{\mathrm{tr}}(x)$ is minimized.
We propose an algorithm that can carry out this minimization without explicitly mod-
eling ptr (x) and pte (x). We call the proposed method the Kullback-Leibler Importance
Estimation Procedure (KLIEP). The optimization problem involved in KLIEP is convex,

1
Covariate shift matters in parameter learning only when the model used for function learning is
misspecified (i.e., the model is so simple that the true learning target function can not be expressed)
(Shimodaira, 2000)—when the model is correctly (or overly) speciﬁed, ordinary maximum likelihood
estimation is still consistent. Following this fact, there is a criticism that importance weighting is not
needed; just the use of a complex enough model can settle the problem. However, too complex models
result in huge variance and thus we practically need to choose a complex enough but not too complex
model. For choosing such an appropriate model, we usually use a model selection technique such as cross
validation (CV). However, the ordinary CV score is heavily biased due to covariate shift and we also need
to importance-weight the CV score (or any other model selection criteria) for unbiasedness (Shimodaira,
2000; Sugiyama and Müller, 2005; Sugiyama et al., 2007). For this reason, estimating the importance is
indispensable when covariate shift occurs.
2
Once the importance weight has been ﬁxed, importance weighted CV can be used for model selection
of subsequent learning algorithms.

so the unique global solution can be obtained. Furthermore, the solution tends to be
sparse, which contributes to reducing the computational cost in the test phase.
Since KLIEP is based on the minimization of the Kullback-Leibler divergence, its
model selection can be naturally carried out through a variant of likelihood CV, which is
a standard model selection technique in density estimation (Härdle et al., 2004). A key
advantage of our CV procedure is that, not the training samples, but the test input samples
are cross-validated. This highly contributes to improving the model selection accuracy
when the number of training samples is limited but test input samples are abundantly
available.
The simulation studies show that KLIEP tends to outperform existing approaches in
importance estimation including the logistic regression based method (Bickel et al., 2007),
and it contributes to improving the prediction performance in covariate shift scenarios.

2 New Importance Estimation Method

In this section, we propose a new importance estimation method.

2.1 Formulation and Notation

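Let $\mathcal{D} \subset \mathbb{R}^d$ be the input domain and suppose we are given i.i.d. training input samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ from a training input distribution with density $p_{\mathrm{tr}}(x)$ and i.i.d. test input samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ from a test input distribution with density $p_{\mathrm{te}}(x)$. We assume that $p_{\mathrm{tr}}(x) > 0$ for all $x \in \mathcal{D}$. The goal of this paper is to develop a method of estimating the importance $w(x)$ from $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$:³

$$w(x) := \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)}.$$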

Our key restriction is that we avoid estimating densities pte (x) and ptr (x) when estimating
the importance w(x).

2.2 Kullback-Leibler Importance Estimation Procedure (KLIEP)

Let us model the importance $w(x)$ by the following linear model:

$$\hat{w}(x) = \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x), \tag{1}$$

3
Importance estimation is a pre-processing step of supervised learning tasks where training output samples $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ at the training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ are also available (Shimodaira, 2000; Sugiyama and Müller, 2005; Huang et al., 2007; Sugiyama et al., 2007). However, we do not use $\{y_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ in the importance estimation step since they are irrelevant to the importance.

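where $\{\alpha_\ell\}_{\ell=1}^{b}$ are parameters to be learned from data samples and $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are basis functions such that

$$\varphi_\ell(x) \ge 0 \quad \text{for all } x \in \mathcal{D} \text{ and for } \ell = 1, 2, \ldots, b.$$

Note that $b$ and $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ could be dependent on the samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$, i.e., kernel models are also allowed; we explain how the basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ are chosen in Section 2.3.
Using the model $\hat{w}(x)$, we can estimate the test input density $p_{\mathrm{te}}(x)$ by

$$\hat{p}_{\mathrm{te}}(x) = \hat{w}(x) p_{\mathrm{tr}}(x).$$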

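We determine the parameters $\{\alpha_\ell\}_{\ell=1}^{b}$ in the model (1) so that the Kullback-Leibler divergence from $p_{\mathrm{te}}(x)$ to $\hat{p}_{\mathrm{te}}(x)$ is minimized⁴:

$$\begin{aligned}
\mathrm{KL}[p_{\mathrm{te}}(x) \,\|\, \hat{p}_{\mathrm{te}}(x)]
&= \int_{\mathcal{D}} p_{\mathrm{te}}(x) \log \frac{p_{\mathrm{te}}(x)}{\hat{w}(x) p_{\mathrm{tr}}(x)} \, dx \\
&= \int_{\mathcal{D}} p_{\mathrm{te}}(x) \log \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)} \, dx - \int_{\mathcal{D}} p_{\mathrm{te}}(x) \log \hat{w}(x) \, dx.
\end{aligned} \tag{2}$$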

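Since the first term in Eq.(2) is independent of $\{\alpha_\ell\}_{\ell=1}^{b}$, we ignore it and focus on the second term. We denote it by $J$:

$$\begin{aligned}
J &:= \int_{\mathcal{D}} p_{\mathrm{te}}(x) \log \hat{w}(x) \, dx \\
&\approx \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \log \hat{w}(x_j^{\mathrm{te}})
= \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \log \left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j^{\mathrm{te}}) \right),
\end{aligned} \tag{3}$$

where the empirical approximation based on the test input samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ is used from the first line to the second line above. This is our objective function to be maximized with respect to the parameters $\{\alpha_\ell\}_{\ell=1}^{b}$, which is concave (Boyd and Vandenberghe, 2004). Note that the above objective function only involves the test input samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$, i.e., we did not use the training input samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ yet. As shown below, $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ will be used in the constraint.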
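For reference, the gradient of this empirical objective follows from the chain rule. In the sketch below, $A$ denotes the $n_{\mathrm{te}} \times b$ matrix with entries $A_{j\ell} = \varphi_\ell(x_j^{\mathrm{te}})$, as in Figure 1(a), and $\widetilde{J}$ is a label introduced here for the empirical approximation of $J$; the derivation is an added illustration and is not spelled out in the original text.

```latex
\widetilde{J}(\alpha) = \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \log (A\alpha)_j,
\qquad
\frac{\partial \widetilde{J}}{\partial \alpha_\ell}
  = \frac{1}{n_{\mathrm{te}}} \sum_{j=1}^{n_{\mathrm{te}}} \frac{A_{j\ell}}{(A\alpha)_j},
\qquad
\nabla_{\alpha} \widetilde{J} = \frac{1}{n_{\mathrm{te}}} A^{\top} (1 ./ A\alpha).
```

Up to the constant factor $1/n_{\mathrm{te}}$, which can be absorbed into the step size, this is the ascent direction $A^{\top}(1./A\alpha)$ that appears in the pseudo code of Figure 1(a).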

$\hat{w}(x)$ is an estimate of the importance $w(x)$, which is non-negative by definition. Therefore, it is natural to impose $\hat{w}(x) \ge 0$ for all $x \in \mathcal{D}$, which can be achieved by restricting

$$\alpha_\ell \ge 0 \quad \text{for } \ell = 1, 2, \ldots, b.$$

4
One may also consider an alternative scenario where the inverse importance $w^{-1}(x)$ is parameterized and the parameters are learned so that the Kullback-Leibler divergence from $p_{\mathrm{tr}}(x)$ to $\hat{p}_{\mathrm{tr}}(x)$ $(= \hat{w}^{-1}(x) p_{\mathrm{te}}(x))$ is minimized. We may also consider using $\mathrm{KL}[\hat{p}_{\mathrm{te}}(x) \,\|\, p_{\mathrm{te}}(x)]$; however, this involves the model $\hat{w}(x)$ in a more complex manner and does not seem to result in a simple optimization problem.

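In addition to the non-negativity, $\hat{w}(x)$ should be properly normalized since $\hat{p}_{\mathrm{te}}(x)$ $(= \hat{w}(x) p_{\mathrm{tr}}(x))$ is a probability density function:

$$\begin{aligned}
1 &= \int_{\mathcal{D}} \hat{p}_{\mathrm{te}}(x) \, dx = \int_{\mathcal{D}} \hat{w}(x) p_{\mathrm{tr}}(x) \, dx \\
&\approx \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \hat{w}(x_i^{\mathrm{tr}})
= \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_i^{\mathrm{tr}}),
\end{aligned} \tag{4}$$

where the empirical approximation based on the training input samples $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ is used from the first line to the second line above.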

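Now our optimization criterion is summarized as follows.

$$\begin{aligned}
&\underset{\{\alpha_\ell\}_{\ell=1}^{b}}{\text{maximize}} && \sum_{j=1}^{n_{\mathrm{te}}} \log \left( \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_j^{\mathrm{te}}) \right) \\
&\text{subject to} && \sum_{i=1}^{n_{\mathrm{tr}}} \sum_{\ell=1}^{b} \alpha_\ell \varphi_\ell(x_i^{\mathrm{tr}}) = n_{\mathrm{tr}} \ \text{ and } \ \alpha_1, \alpha_2, \ldots, \alpha_b \ge 0.
\end{aligned}$$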

This is a convex optimization problem and the global solution can be obtained, e.g., by simply performing gradient ascent and feasibility satisfaction iteratively⁵. A pseudo code is described in Figure 1(a). Note that the solution $\{\hat{\alpha}_\ell\}_{\ell=1}^{b}$ tends to be sparse (Boyd and Vandenberghe, 2004), which contributes to reducing the computational cost in the test phase. We refer to the above method as Kullback-Leibler Importance Estimation Procedure (KLIEP).
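To see why the constraint-satisfaction steps in the pseudo code of Figure 1(a) restore feasibility, write the normalization constraint as $\mathbf{b}^{\top}\alpha = 1$ with $\mathbf{b}_\ell = \frac{1}{n_{\mathrm{tr}}} \sum_{i=1}^{n_{\mathrm{tr}}} \varphi_\ell(x_i^{\mathrm{tr}})$. The bold symbol is used here only to distinguish this vector from the number of basis functions $b$; the short check below is an added sketch, not part of the original derivation.

```latex
\mathbf{b}^{\top}\!\left( \alpha + (1 - \mathbf{b}^{\top}\alpha)\,\frac{\mathbf{b}}{\mathbf{b}^{\top}\mathbf{b}} \right)
  = \mathbf{b}^{\top}\alpha + (1 - \mathbf{b}^{\top}\alpha) = 1,
\qquad
\mathbf{b}^{\top}\!\left( \frac{\alpha}{\mathbf{b}^{\top}\alpha} \right) = 1 .
```

The first identity shows that the additive step lands exactly on the constraint hyperplane. Clipping negative entries to zero can move the point off the hyperplane again, but since every $\mathbf{b}_\ell \ge 0$ it can only increase $\mathbf{b}^{\top}\alpha$, so the final rescaling is well defined and restores the equality while keeping $\alpha \ge 0$.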

2.3 Model Selection by Likelihood Cross Validation

The performance of KLIEP depends on the choice of basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{b}$. Here we explain how they can be appropriately chosen from data samples.
Since KLIEP is based on the maximization of the score $J$ (see Eq.(3)), it would be natural to select the model such that $J$ is maximized. The expectation over $p_{\mathrm{te}}(x)$ involved in $J$ can be numerically approximated by likelihood cross validation (LCV) as follows: First, divide the test samples $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ into $R$ disjoint subsets $\{\mathcal{X}_r^{\mathrm{te}}\}_{r=1}^{R}$. Then obtain an importance estimate $\hat{w}_r(x)$ from $\{\mathcal{X}_j^{\mathrm{te}}\}_{j \ne r}$ and approximate the score $J$ using $\mathcal{X}_r^{\mathrm{te}}$ as

$$\hat{J}_r := \frac{1}{|\mathcal{X}_r^{\mathrm{te}}|} \sum_{x \in \mathcal{X}_r^{\mathrm{te}}} \log \hat{w}_r(x).$$

5
If necessary, we may regularize the solution, e.g., by adding a penalty term (say, $\sum_{\ell=1}^{b} \alpha_\ell^2$) to the objective function or by imposing an upper bound on the solution. The normalization constraint (4) may also be weakened by allowing a small deviation. These modifications are possible without sacrificing the convexity.

Input: m = {φℓ(x)}_{ℓ=1}^{b}, {x_i^tr}_{i=1}^{n_tr}, and {x_j^te}_{j=1}^{n_te}
Output: ŵ(x)

A_{j,ℓ} ←− φℓ(x_j^te) for j = 1, 2, . . . , n_te and ℓ = 1, 2, . . . , b;
b_ℓ ←− (1/n_tr) Σ_{i=1}^{n_tr} φℓ(x_i^tr) for ℓ = 1, 2, . . . , b;
Initialize α (> 0) and ε (0 < ε ≪ 1);
Repeat until convergence
    α ←− α + ε A⊤(1./Aα);          % Gradient ascent
    α ←− α + (1 − b⊤α) b/(b⊤b);    % Constraint satisfaction
    α ←− max(0, α);                 % Constraint satisfaction
    α ←− α/(b⊤α);                   % Constraint satisfaction
end
ŵ(x) ←− Σ_{ℓ=1}^{b} αℓ φℓ(x);

(a) KLIEP main code

Input: M = {m_k | m_k = {φℓ^(k)(x)}_{ℓ=1}^{b^(k)}}, {x_i^tr}_{i=1}^{n_tr}, and {x_j^te}_{j=1}^{n_te}
Output: ŵ(x)

Split {x_j^te}_{j=1}^{n_te} into R disjoint subsets {X_r^te}_{r=1}^{R};
for each model m ∈ M
    for each split r = 1, 2, . . . , R
        ŵ_r(x) ←− KLIEP(m, {x_i^tr}_{i=1}^{n_tr}, {X_j^te}_{j≠r});
        Ĵ_r(m) ←− (1/|X_r^te|) Σ_{x∈X_r^te} log ŵ_r(x);
    end
    Ĵ(m) ←− (1/R) Σ_{r=1}^{R} Ĵ_r(m);
end
m̂ ←− argmax_{m∈M} Ĵ(m);
ŵ(x) ←− KLIEP(m̂, {x_i^tr}_{i=1}^{n_tr}, {x_j^te}_{j=1}^{n_te});

(b) Model selection by LCV

Figure 1: The KLIEP algorithm in pseudo code. ‘./’ indicates the element-wise division and ⊤ denotes the transpose. Inequalities and the ‘max’ operation for vectors are applied element-wise. A MATLAB implementation of the KLIEP algorithm is available from ‘http://sugiyama-www.cs.titech.ac.jp/~sugi/software/KLIEP’.
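The main loop of Figure 1(a) can be sketched in plain Python as follows. This is an illustrative transcription under stated assumptions, not the authors' MATLAB implementation: the function name `kliep`, the fixed step size and iteration count (used here in place of a convergence test), and the one-dimensional toy data are all choices made for this example.

```python
import math

def kliep(phi_te, phi_tr, step=1e-3, n_iter=2000):
    """phi_te[j][l] = phi_l(x_te_j), phi_tr[i][l] = phi_l(x_tr_i). Returns alpha."""
    n_b = len(phi_te[0])
    # b_l: average of the l-th basis function over the training inputs
    b = [sum(row[l] for row in phi_tr) / len(phi_tr) for l in range(n_b)]
    bb = sum(v * v for v in b)
    alpha = [1.0] * n_b
    for _ in range(n_iter):
        # Gradient ascent on sum_j log(sum_l alpha_l * phi_l(x_te_j))
        inv = [1.0 / sum(a * p for a, p in zip(alpha, row)) for row in phi_te]
        grad = [sum(row[l] * r for row, r in zip(phi_te, inv)) for l in range(n_b)]
        alpha = [a + step * g for a, g in zip(alpha, grad)]
        # Constraint satisfaction: move onto b'alpha = 1, clip at zero, rescale
        gap = 1.0 - sum(v * a for v, a in zip(b, alpha))
        alpha = [max(0.0, a + gap * v / bb) for a, v in zip(alpha, b)]
        scale = sum(v * a for v, a in zip(b, alpha))
        alpha = [a / scale for a in alpha]
    return alpha

def gauss(x, c, sigma):
    return math.exp(-(x - c) ** 2 / (2.0 * sigma ** 2))

x_tr = [0.0, 0.5, 1.0, 1.5, 2.0]   # toy training inputs
x_te = [1.0, 1.5, 2.0]             # toy test inputs, concentrated on the right
sigma = 0.5                        # hand-picked kernel width
phi_te = [[gauss(x, c, sigma) for c in x_te] for x in x_te]
phi_tr = [[gauss(x, c, sigma) for c in x_te] for x in x_tr]
alpha = kliep(phi_te, phi_tr)
w_tr = [sum(a * p for a, p in zip(alpha, row)) for row in phi_tr]
print(w_tr)   # estimated importance at each training input
```

The estimated importance values at the training inputs average to one by construction, and in this toy setting they are larger where the test inputs concentrate.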

We repeat this procedure for $r = 1, 2, \ldots, R$, compute the average of $\hat{J}_r$ over all $r$, and use the average $\hat{J}$ as an estimate of $J$:

$$\hat{J} := \frac{1}{R} \sum_{r=1}^{R} \hat{J}_r. \tag{5}$$

For model selection, we compute $\hat{J}$ for all model candidates (the basis functions $\{\varphi_\ell(x)\}_{\ell=1}^{b}$ in the current setting) and choose the one that maximizes $\hat{J}$. A pseudo code of the LCV procedure is summarized in Figure 1(b).
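The fold-splitting and scoring logic of the LCV procedure can be sketched as follows. The names `lcv_score` and `select_model` are hypothetical, and each candidate is represented by a `fit` function that maps held-in test inputs to an importance estimate. The two stand-in candidates below ignore their data and exist only to exercise the plumbing; in practice each `fit` would wrap a KLIEP call with the training inputs held fixed.

```python
import math

def lcv_score(x_te, fit, n_folds):
    """Average held-out mean log-importance over n_folds disjoint test subsets.

    fit(held_in) must return a callable importance estimate with w(x) > 0.
    """
    folds = [x_te[r::n_folds] for r in range(n_folds)]
    scores = []
    for r, held_out in enumerate(folds):
        held_in = [x for k, f in enumerate(folds) if k != r for x in f]
        w = fit(held_in)
        scores.append(sum(math.log(w(x)) for x in held_out) / len(held_out))
    return sum(scores) / n_folds

def select_model(x_te, candidates, n_folds):
    """candidates maps a model label to its fit function; return the best label."""
    return max(candidates, key=lambda m: lcv_score(x_te, candidates[m], n_folds))

# Stand-in candidates on [0, 1] with uniform training density (assumed for the toy).
candidates = {
    "flat": lambda held_in: (lambda x: 1.0),        # w(x) = 1, i.e. no shift
    "linear": lambda held_in: (lambda x: 2.0 * x),  # w(x) = 2x, mass on the right
}
x_te = [0.6, 0.7, 0.8, 0.9, 0.95, 0.85]
best = select_model(x_te, candidates, n_folds=3)
print(best)
```

Only the test inputs are split into folds here, which mirrors the property emphasized in the text: the training inputs are never held out.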

One of the potential limitations of CV in general is that it is not reliable in small sample
cases since data splitting by CV further reduces the sample size. On the other hand, in
our CV procedure, the data splitting is performed only over the test input samples, not
over the training samples. Therefore, even when the number of training samples is small,
our CV procedure does not suﬀer from the small sample problem as long as a large number
of test input samples are available.
A good model may be chosen by the above CV procedure, given that a set of promising model candidates is prepared. As model candidates, we propose using a Gaussian kernel model centered at the test input points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$, i.e.,

$$\hat{w}(x) = \sum_{\ell=1}^{n_{\mathrm{te}}} \alpha_\ell K_\sigma(x, x_\ell^{\mathrm{te}}),$$

where $K_\sigma(x, x')$ is the Gaussian kernel with kernel width $\sigma$:

$$K_\sigma(x, x') := \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right). \tag{6}$$
The reason why we chose the test input points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ as the Gaussian centers, not the training input points $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$, is as follows. By definition, the importance $w(x)$ tends to take large values if the training input density $p_{\mathrm{tr}}(x)$ is small and the test input density $p_{\mathrm{te}}(x)$ is large; conversely, $w(x)$ tends to be small (i.e., close to zero) if $p_{\mathrm{tr}}(x)$ is large and $p_{\mathrm{te}}(x)$ is small. When a function is approximated by a Gaussian kernel model, many kernels may be needed in the region where the output of the target function is large; on the other hand, only a small number of kernels would be enough in the region where the output of the target function is close to zero. Following this heuristic, we decided to allocate many kernels at high test input density regions, which can be achieved by setting the Gaussian centers at the test input points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$.
Alternatively, we may locate $(n_{\mathrm{tr}} + n_{\mathrm{te}})$ Gaussian kernels at both $\{x_i^{\mathrm{tr}}\}_{i=1}^{n_{\mathrm{tr}}}$ and $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$. However, in our preliminary experiments, this did not further improve the performance, but slightly increased the computational cost. When $n_{\mathrm{te}}$ is very large, just using all the test input points $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ as Gaussian centers is already computationally rather demanding. To ease this problem, we practically propose using a subset of $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ as Gaussian centers for computational efficiency, i.e.,

$$\hat{w}(x) = \sum_{\ell=1}^{b} \alpha_\ell K_\sigma(x, c_\ell), \tag{7}$$

where $c_\ell$ is a template point randomly chosen from $\{x_j^{\mathrm{te}}\}_{j=1}^{n_{\mathrm{te}}}$ and $b$ $(\le n_{\mathrm{te}})$ is a prefixed number.
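The kernel model of Eqs.(6) and (7) can be set up as in the following sketch. The helper names `gaussian_kernel`, `choose_centers`, and `design_matrix`, the fixed random seed, and the toy two-dimensional inputs are assumptions made for illustration.

```python
import math
import random

def gaussian_kernel(x, c, sigma):
    """K_sigma(x, c) = exp(-||x - c||^2 / (2 sigma^2)) for points given as sequences."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, c))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def choose_centers(x_te, n_centers, seed=0):
    """Pick min(n_centers, n_te) distinct test points at random as kernel centers."""
    rng = random.Random(seed)
    return rng.sample(x_te, min(n_centers, len(x_te)))

def design_matrix(points, centers, sigma):
    """One row per input point, one column per kernel center."""
    return [[gaussian_kernel(x, c, sigma) for c in centers] for x in points]

x_te = [(0.1 * k, 1.0 - 0.1 * k) for k in range(10)]   # toy 2-D test inputs
centers = choose_centers(x_te, n_centers=4)
A = design_matrix(x_te, centers, sigma=0.3)
print(len(A), len(A[0]))
```

Applying `design_matrix` to the training inputs with the same centers gives the quantities needed for the normalization constraint, so the cost of both matrices scales with the number of centers rather than with the number of test points.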

3 Theoretical Analyses
In this section, we investigate the convergence properties of the KLIEP algorithm. The theoretical statements we prove in this section are roughly summarized as follows.

• When a non-parametric model (e.g., kernel basis functions centered at test samples) is used for importance estimation, KLIEP converges to the optimal solution with convergence rate slightly slower than $O_p(n^{-1/2})$ under $n = n_{\mathrm{tr}} = n_{\mathrm{te}}$ (Theorem 1 and Theorem 2).

• When a fixed set of basis functions is used for importance estimation, KLIEP converges to the optimal solution with convergence rate $O_p(n^{-1/2})$. Furthermore, KLIEP has asymptotic normality around the optimal solution (Theorem 3 and Theorem 4).

3.1 Mathematical Preliminaries

Since we give rigorous mathematical convergence proofs, we first slightly change our notation for clearer mathematical exposition.
Below, we assume that the numbers of training and test samples are the same, i.e.,

$$n = n_{\mathrm{te}} = n_{\mathrm{tr}}.$$

We note that this assumption is just for simplicity; without this assumption, the convergence rate is solely determined by the sample size with the slower rate.
For arbitrary measure $\tilde{P}$ and $\tilde{P}$-integrable function $f$, we express its “expectation” as

$$\tilde{P} f := \int f \, d\tilde{P}.$$

Let $P$ and $Q$ be the probability measures which generate test and training samples, respectively. In a similar fashion, we define the empirical distributions of test and training samples by $P_n$ and $Q_n$, i.e.,

$$P_n f = \frac{1}{n} \sum_{j=1}^{n} f(x_j^{\mathrm{te}}), \qquad Q_n f = \frac{1}{n} \sum_{i=1}^{n} f(x_i^{\mathrm{tr}}).$$

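The set of basis functions is denoted by

$$\mathcal{F} := \{\varphi_\theta \mid \theta \in \Theta\},$$

where $\Theta$ is some parameter or index set. The set of basis functions at $n$ samples is denoted using $\Theta_n \subseteq \Theta$ by

$$\mathcal{F}_n := \{\varphi_\theta \mid \theta \in \Theta_n\} \subset \mathcal{F},$$

which can behave stochastically. The set of finite linear combinations of $\mathcal{F}$ with positive coefficients and its bounded subset are denoted by

$$\mathcal{G} := \left\{ \sum_{l} \alpha_l \varphi_{\theta_l} \,\middle|\, \alpha_l \ge 0, \ \varphi_{\theta_l} \in \mathcal{F} \right\}, \qquad
\mathcal{G}^M := \{ g \in \mathcal{G} \mid \|g\|_\infty \le M \},$$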
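and their subsets at $n$ samples are denoted by

$$\mathcal{G}_n := \left\{ \sum_{l} \alpha_l \varphi_{\theta_l} \,\middle|\, \alpha_l \ge 0, \ \varphi_{\theta_l} \in \mathcal{F}_n \right\} \subset \mathcal{G}, \qquad
\mathcal{G}_n^M := \{ g \in \mathcal{G}_n \mid \|g\|_\infty \le M \} \subset \mathcal{G}^M.$$

Let $\hat{\mathcal{G}}_n$ be the feasible set of KLIEP:

$$\hat{\mathcal{G}}_n := \{ g \in \mathcal{G}_n \mid Q_n g = 1 \}.$$

Under the notations described above, the solution $\hat{g}_n$ of (generalized) KLIEP is given as follows:

$$\hat{g}_n := \arg\max_{g \in \hat{\mathcal{G}}_n} P_n \log(g).$$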

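For simplicity, we assume the optimal solution is uniquely determined. In order to derive the convergence rates of KLIEP, we make the following assumptions.

Assumption 1

1. $P$ and $Q$ are mutually absolutely continuous and have the following property:

$$0 < \eta_0 \le \frac{dP}{dQ} \le \eta_1$$

on the support of $P$ and $Q$. Let $g_0$ denote

$$g_0 := \frac{dP}{dQ}.$$

2. $\varphi_\theta \ge 0$ $(\forall \varphi_\theta \in \mathcal{F})$, and $\exists \epsilon_0, \xi_0 > 0$ such that

$$Q\varphi_\theta \ge \epsilon_0, \quad \|\varphi_\theta\|_\infty \le \xi_0, \quad (\forall \varphi_\theta \in \mathcal{F}).$$

3. For some constants $0 < \gamma < 2$ and $K$,

$$\sup_{\tilde{Q}} \log N(\epsilon, \mathcal{G}^M, L_2(\tilde{Q})) \le K \left( \frac{M}{\epsilon} \right)^{\gamma}, \tag{8}$$

where the supremum is taken over all finitely discrete probability measures $\tilde{Q}$, or

$$\log N_{[]}(\epsilon, \mathcal{G}^M, L_2(Q)) \le K \left( \frac{M}{\epsilon} \right)^{\gamma}. \tag{9}$$

$N(\epsilon, \mathcal{F}, d)$ and $N_{[]}(\epsilon, \mathcal{F}, d)$ are the $\epsilon$-covering number and the $\epsilon$-bracketing number of $\mathcal{F}$ with norm $d$, respectively (van der Vaart and Wellner, 1996).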

We define the (generalized) Hellinger distance with respect to $Q$ as

$$h_Q(g, g') := \left( \int \left( \sqrt{g} - \sqrt{g'} \right)^2 dQ \right)^{1/2},$$

where $g$ and $g'$ are non-negative measurable functions (not necessarily probability densities). The lower bound of $g_0$ that appears in Assumption 1.1 will be used to ensure the existence of a Lipschitz continuous function that bounds the Hellinger distance from the true $g_0$. The bound of $g_0$ is needed only on the support of $P$ and $Q$. Assumption 1.3 controls the complexity of the model. By this complexity assumption, we can bound the tail probability of the difference between the empirical risk and the true risk uniformly over the function class $\mathcal{G}^M$.
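As a small numerical companion to this definition, the sketch below approximates $h_Q(g, g')$ by replacing the integral over $Q$ with an average over samples drawn from $Q$ (the training inputs in this paper's setting). The function name `hellinger_q` and the toy functions are assumptions for illustration.

```python
import math

def hellinger_q(g, g_prime, x_tr):
    """Sample-average approximation of h_Q(g, g') using points x_tr drawn from Q."""
    sq = sum((math.sqrt(g(x)) - math.sqrt(g_prime(x))) ** 2 for x in x_tr)
    return math.sqrt(sq / len(x_tr))

x_tr = [0.1, 0.4, 0.7, 0.9]                      # toy samples standing in for Q
h_const = hellinger_q(lambda x: 4.0, lambda x: 1.0, x_tr)
h_self = hellinger_q(lambda x: 2.0 * x, lambda x: 2.0 * x, x_tr)
print(h_const, h_self)
```

Neither argument needs to integrate to one, which matches the remark that $g$ and $g'$ are not necessarily probability densities.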

3.2 Non-Parametric Case

First, we introduce a very important inequality that is a version of Talagrand’s concentration inequality. The original form of Talagrand’s concentration inequality is an inequality about the expectation of a general function $f(X_1, \ldots, X_n)$ of $n$ variables, so the range of applications is quite large (Talagrand, 1996a,b).
Let

$$\sigma_P(\mathcal{F})^2 := \sup_{f \in \mathcal{F}} \left( P f^2 - (P f)^2 \right).$$

For a functional $Y : \mathcal{G} \to \mathbb{R}$ defined on a set of measurable functions $\mathcal{G}$, we define its norm as

$$\|Y\|_{\mathcal{G}} := \sup_{g \in \mathcal{G}} |Y(g)|.$$

For a class $\mathcal{F}$ of measurable functions such that $\forall f \in \mathcal{F}$, $\|f\|_\infty \le 1$, the following bound holds, which we refer to as the Bousquet bound (Bousquet, 2002):

$$P\left\{ \|P_n - P\|_{\mathcal{F}} \ge E\|P_n - P\|_{\mathcal{F}} + \sqrt{\frac{2t}{n} \left( \sigma_P(\mathcal{F})^2 + 2E\|P_n - P\|_{\mathcal{F}} \right)} + \frac{t}{3n} \right\} \le e^{-t}. \tag{10}$$

We can easily see that $E\|P_n - P\|_{\mathcal{F}}$ and $\sigma_P(\mathcal{F})$ in the Bousquet bound can be replaced by other functions bounding from above. For example, $E\|P_n - P\|_{\mathcal{F}}$ can be upper-bounded by the Rademacher complexity and $\sigma_P(\mathcal{F})$ can be bounded by using the $L_2(P)$-norm (Bartlett et al., 2005). By using the above inequality, we obtain the following theorem. The proof is summarized in Appendix A.

Theorem 1 Let

$$a_0^n := (Q_n g_0)^{-1}, \qquad \gamma_n := \max\{ -P_n \log(\hat{g}_n) + P_n \log(a_0^n g_0), \ 0 \}.$$

Then

$$h_Q(a_0^n g_0, \hat{g}_n) = O_p\left( n^{-\frac{1}{2+\gamma}} + \sqrt{\gamma_n} \right).$$

The technical advantage of using the Hellinger distance instead of the KL-divergence is
that the Hellinger distance is bounded from above by a Lipschitz continuous function
while the KL-divergence is not Lipschitz continuous because log(x) diverges to −∞ as
x → 0. This allows us to utilize uniform convergence results of empirical processes. See
the proof for more details.

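Remark 1 If there exists $N$ such that $\forall n \ge N$, $g_0 \in \mathcal{G}_n$, then $\gamma_n = 0$ $(\forall n \ge N)$. In this setting,

$$h_Q(\hat{g}_n / a_0^n, g_0) = O_p\left( n^{-\frac{1}{2+\gamma}} \right).$$

Remark 2 $a_0^n$ can be removed because

$$h_Q(a_0^n g_0, g_0) = \sqrt{ \int g_0 \left( 1 - \sqrt{a_0^n} \right)^2 dQ } = \left| 1 - \sqrt{a_0^n} \right| = O_p(1/\sqrt{n}) = O_p\left( n^{-\frac{1}{2+\gamma}} \right).$$

Thus,

$$h_Q(\hat{g}_n, g_0) \le h_Q(\hat{g}_n, a_0^n g_0) + h_Q(a_0^n g_0, g_0) = O_p\left( n^{-\frac{1}{2+\gamma}} + \sqrt{\gamma_n} \right).$$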

We can derive another convergence theorem based on a different representation of the bias term from Theorem 1. The proof is also included in Appendix A.

Theorem 2 In addition to Assumption 1, if there is $g_n^* \in \hat{\mathcal{G}}_n$ such that for some constant $c_0$, on the support of $P$ and $Q$

$$\frac{g_0}{g_n^*} \le c_0^2,$$

then

$$h_Q(g_0, \hat{g}_n) = O_p\left( n^{-\frac{1}{2+\gamma}} + h_Q(g_n^*, g_0) \right).$$

Example 1 We briefly evaluate the convergence rate in a simple example in which $d = 1$, the support of $P$ is $[0, 1] \subseteq \mathbb{R}$, $\mathcal{F} = \{K_1(x, x') \mid x' \in [0, 1]\}$, and $\mathcal{F}_n = \{K_1(x, x_j^{\mathrm{te}}) \mid j = 1, \ldots, n\}$ (for simplicity, we consider the case where the Gaussian width $\sigma$ is 1, but we can apply the same argument to another choice of $\sigma$). Assume that $P$ has a density $p(x)$ with a constant $\eta_2$ such that $p(x) \ge \eta_2 > 0$ $(\forall x \in [0, 1])$. We also assume that the true importance $g_0$ is a mixture of Gaussian kernels, i.e.,

$$g_0(x) = \int K_1(x, x') \, dF(x') \quad (\forall x \in [0, 1]),$$

where $F$ is a positive finite measure the support of which is contained in $[0, 1]$. For a measure $F'$, we define $g_{F'}(x) := \int K_1(x, x') \, dF'(x')$. By Lemma 3.1 of Ghosal and van der Vaart (2001), for every $0 < \epsilon_n < 1/2$, there exists a discrete positive finite measure $F'$ on $[0, 1]$ such that

$$\|g_0 - g_{F'}\|_\infty \le \epsilon_n, \qquad F'([0, 1]) = F([0, 1]).$$

Now divide $[0, 1]$ into bins with width $\epsilon_n$, then the number of sample points $x_j^{\mathrm{te}}$ that fall in a bin is a binomial random variable. If $\exp(-\eta_2 n \epsilon_n / 4)/\epsilon_n \to 0$, then by the Chernoff bound⁶, the probability of the event

$$W_n := \left\{ \max_{x \in \mathrm{supp}(F')} \min_{j} |x - x_j^{\mathrm{te}}| \le \epsilon_n \right\}$$

converges to 1 ($\mathrm{supp}(F')$ means the support of $F'$) because the density $p(x)$ is bounded from below across the support. One can show that $|K_1(x, x_1) - K_1(x, x_2)| \le |x_1 - x_2|/\sqrt{e} + |x_1 - x_2|^2/2$ $(\forall x)$ because

$$\begin{aligned}
|K_1(x, x_1) - K_1(x, x_2)|
&= \exp(-(x - x_1)^2/2) \left| 1 - \exp\left( x(x_2 - x_1) + (x_1^2 - x_2^2)/2 \right) \right| \\
&\le \exp(-(x - x_1)^2/2) \left| x(x_2 - x_1) + (x_1^2 - x_2^2)/2 \right| \\
&\le \exp(-(x - x_1)^2/2) \left( |x - x_1| |x_1 - x_2| + |x_1 - x_2|^2/2 \right) \\
&\le |x_1 - x_2|/\sqrt{e} + |x_1 - x_2|^2/2,
\end{aligned}$$

where the second step uses $|1 - e^u| \le |u|$ for $u \le 0$, which may be assumed by exchanging $x_1$ and $x_2$ if necessary since the final bound is symmetric in $x_1$ and $x_2$.
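The kernel inequality above can be spot-checked numerically. The sketch below evaluates both sides on a finite grid; the grid ranges and spacing are arbitrary choices made here, and a grid check is of course an illustration rather than a proof.

```python
import math

def k1(x, c):
    """Gaussian kernel with width 1."""
    return math.exp(-(x - c) ** 2 / 2.0)

def bound(x1, x2):
    """Right-hand side |x1 - x2| / sqrt(e) + |x1 - x2|^2 / 2."""
    d = abs(x1 - x2)
    return d / math.sqrt(math.e) + d * d / 2.0

grid = [k / 20.0 for k in range(-60, 81)]                              # x in [-3, 4]
pairs = [(a / 10.0, b / 10.0) for a in range(11) for b in range(11)]   # x1, x2 in [0, 1]

# Largest value of (left-hand side minus right-hand side) over the grid
worst = max(abs(k1(x, x1) - k1(x, x2)) - bound(x1, x2)
            for x in grid for (x1, x2) in pairs)
print(worst <= 1e-12)
```

The largest gap on the grid is attained when $x_1 = x_2$, where both sides vanish.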
Thus there exists $\tilde{\alpha}_j \ge 0$ $(j = 1, \ldots, n)$ such that for $\tilde{g}_n^* := \sum_j \tilde{\alpha}_j K_1(x, x_j^{\mathrm{te}})$, the following is satisfied on the event $W_n$: $\|\tilde{g}_n^* - g_{F'}\|_\infty \le F'([0, 1])(\epsilon_n/\sqrt{e} + \epsilon_n^2/2) = O(\epsilon_n)$. Now define

$$g_n^* := \frac{\tilde{g}_n^*}{Q_n \tilde{g}_n^*}.$$

Then $g_n^* \in \hat{\mathcal{G}}_n$.
Set $\epsilon_n = 1/\sqrt{n}$. Noticing $|1 - Q_n \tilde{g}_n^*| = |1 - Q_n(\tilde{g}_n^* - g_{F'} + g_{F'} - g_0 + g_0)| \le O(\epsilon_n) + |1 - Q_n g_0| = O_p(1/\sqrt{n})$, we have

$$\|g_n^* - \tilde{g}_n^*\|_\infty = \|g_n^*\|_\infty |1 - Q_n \tilde{g}_n^*| = O_p(1/\sqrt{n}).$$

From the above discussion, we obtain

$$\|g_n^* - g_0\|_\infty = O_p(1/\sqrt{n}).$$

This indicates

$$h_Q(g_n^*, g_0) = O_p(1/\sqrt{n}),$$

and that $g_0/g_n^* \le c_0^2$ is satisfied with high probability.
For the bias term of Theorem 1, set $\epsilon_n = C \log(n)/n$ for sufficiently large $C > 0$ and replace $g_0$ with $a_0^n g_0$. Then we obtain $\gamma_n = O_p(\log(n)/n)$.
6
Here we refer to the Chernoff bound as follows: let $\{X_i\}_{i=1}^{n}$ be independent random variables taking values on 0 or 1, then $P\left( \sum_{i=1}^{n} X_i < (1 - \delta) \sum_{i=1}^{n} E[X_i] \right) < \exp\left( -\delta^2 \sum_{i=1}^{n} E[X_i]/2 \right)$ for any $\delta > 0$.

As for the complexity of the model, a similar argument to Theorem 3.1 of Ghosal and van der Vaart (2001) gives

$$\log N(\epsilon, \mathcal{G}^M, \|\cdot\|_\infty) \le K \left( \log \frac{M}{\epsilon} \right)^2$$

for $0 < \epsilon < M/2$. This gives both conditions (8) and (9) of Assumption 1.3 for arbitrary small $\gamma > 0$ (but the constant $K$ depends on $\gamma$). Thus the convergence rate is evaluated as $h_Q(g_0, \hat{g}_n) = O_p(n^{-1/(2+\gamma)})$ for arbitrary small $\gamma > 0$.

3.3 Parametric Case

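Next, we show asymptotic normality of KLIEP in a finite-dimensional case. We do not assume that $g_0$ is contained in the model, but it can be shown that KLIEP has asymptotic normality around the point that is “nearest” to the true. The finite-dimensional model we consider here is

$$\mathcal{F} = \mathcal{F}_n = \{\varphi_l \mid l = 1, \ldots, b\} \quad (\forall n).$$

We define $\varphi$ as

$$\varphi(x) := \begin{bmatrix} \varphi_1(x) \\ \vdots \\ \varphi_b(x) \end{bmatrix}.$$

$\mathcal{G}_n$ and $\mathcal{G}_n^M$ are independent of $n$ and we can write them as

$$\mathcal{G}_n = \mathcal{G} = \{ \alpha^T \varphi \mid \alpha \ge 0 \}, \qquad
\mathcal{G}_n^M = \mathcal{G}^M = \{ \alpha^T \varphi \mid \alpha \ge 0, \ \|\alpha^T \varphi\|_\infty \le M \}.$$

We define $g_*$ as the optimal solution in the model, and $\alpha_*$ as the coefficient of $g_*$:

$$g_* := \arg\max_{g \in \mathcal{G},\, Qg = 1} P \log g, \qquad g_* = \alpha_*^T \varphi. \tag{11}$$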

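In addition to Assumption 1, we assume the following conditions:

Assumption 2
1. $Q(\varphi\varphi^T) \succ O$ (positive definite).
2. There exists $\eta_3 > 0$ such that $g_* \ge \eta_3$.

Let

$$\psi(\alpha)(x) = \psi(\alpha) := \log(\alpha^T \varphi(x)).$$

Note that if $Q(\varphi\varphi^T) \succ O$ is satisfied, then we obtain the following inequality:

$$\begin{aligned}
\forall \beta \ne 0, \quad \beta^T \nabla\nabla^T P\psi(\alpha_*) \beta
&= \beta^T \nabla P \left. \frac{\varphi^T}{\alpha^T \varphi} \right|_{\alpha = \alpha_*} \beta
= -\beta^T P \frac{\varphi\varphi^T}{(\alpha_*^T \varphi)^2} \beta \\
&= -\beta^T Q\left( \varphi\varphi^T \frac{g_0}{g_*^2} \right) \beta
\le -\beta^T Q(\varphi\varphi^T) \beta \, \eta_0 \epsilon_0^2/\xi_0^2 < 0.
\end{aligned}$$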

Thus, $-\nabla\nabla^T P\psi(\alpha_*)$ is positive definite. We write it as

$$I_0 := -\nabla\nabla^T P\psi(\alpha_*) \quad (\succ O).$$

We set

$$\check{\alpha}_n := \frac{\hat{\alpha}_n}{a_*^n},$$

where $a_*^n := (Q_n g_*)^{-1}$ and $\hat{\alpha}_n^T \varphi = \hat{g}_n$. We first show the $\sqrt{n}$-consistency of $\hat{\alpha}_n / a_*^n$ (i.e., $\|\check{\alpha}_n - \alpha_*\| = O_p(1/\sqrt{n})$). From now on, let $\|\cdot\|_0$ denote a norm defined as

$$\|\alpha\|_0^2 := \alpha^T I_0 \alpha.$$

By the positivity of $I_0$, there exist $0 < \xi_1 < \xi_2$ such that

$$\xi_1 \|\alpha\| \le \|\alpha\|_0 \le \xi_2 \|\alpha\|. \tag{12}$$

Lemma 1 In a finite fixed dimensional model under Assumption 1 and Assumption 2, the KLIEP estimator satisfies

$$\|\hat{\alpha}_n / a_*^n - \alpha_*\| = \|\check{\alpha}_n - \alpha_*\| = O_p(1/\sqrt{n}).$$

From the relationship (12), this also implies $\|\check{\alpha}_n - \alpha_*\|_0 = O_p(1/\sqrt{n})$, which indicates

$$h_Q(\hat{g}_n, a_*^n g_*) = O_p(1/\sqrt{n}).$$

The proof is provided in Appendix B.

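Next we discuss the asymptotic law of the KLIEP estimator. To do this we should introduce an approximating cone which is used to express the neighborhood of $\alpha_*$. Let

$$\mathcal{S} := \{\alpha \mid Q\alpha^T \varphi = 1, \ \alpha \ge 0\}, \qquad
\mathcal{S}_n := \{\alpha \mid Q_n \alpha^T \varphi = 1/a_*^n, \ \alpha \ge 0\}.$$

Note that $\alpha_* \in \mathcal{S}$ and $\check{\alpha}_n, \alpha_* \in \mathcal{S}_n$. Let the approximating cones of $\mathcal{S}$ and $\mathcal{S}_n$ at $\alpha_*$ be $\mathcal{C}$ and $\mathcal{C}_n$, where an approximating cone is defined in the following definition.

Definition 1 Let $\mathcal{D}$ be a closed subset in $\mathbb{R}^k$ and $\theta \in \mathcal{D}$ be a non-isolated point in $\mathcal{D}$. If there is a closed cone $\mathcal{A}$ that satisfies the following conditions, we define $\mathcal{A}$ as an approximating cone at $\theta$:

• For an arbitrary sequence $y_i \in \mathcal{D} - \theta$, $y_i \to 0$

$$\inf_{x \in \mathcal{A}} \|x - y_i\| = o(\|y_i\|).$$

• For an arbitrary sequence $x_i \in \mathcal{A}$, $x_i \to 0$

$$\inf_{y \in \mathcal{D} - \theta} \|x_i - y\| = o(\|x_i\|).$$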
Direct Importance Estimation for Covariate Shift Adaptation 16

Now S and Sn are convex polytopes, so that the approximating cones at α∗ are also
convex polytopes and

C = {λ(α − α∗ ) | α ∈ S, λ ≥ 0, λ ∈ R},
Cn = {λ(α − α∗ ) | α ∈ Sn , λ ≥ 0, λ ∈ R},

for a suﬃciently small ϵ. Without loss of generality, we assume for some j, α∗,i = 0 (i =
1, . . . , j) and α∗,i > 0 (i = j + 1, . . . , b). Let νi := Qφi . Then the approximating cone C
is spanned by µi (i = 1, . . . , b − 1) deﬁned as
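$$\mu_1 := \left[1, 0, \ldots, 0, -\frac{\nu_1}{\nu_b}\right]^T, \ \ldots, \ \mu_{b-1} := \left[0, \ldots, 0, 1, -\frac{\nu_{b-1}}{\nu_b}\right]^T.$$
That is,
$$C = \left\{ \sum_{i=1}^{b-1} \beta_i \mu_i \;\middle|\; \beta_i \ge 0 \ (i \le j), \ \beta_i \in \mathbb{R} \right\}.$$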

Let $N(\mu, \Sigma)$ be a multivariate normal distribution with mean $\mu$ and covariance $\Sigma$; we use the same notation for a degenerate normal distribution (i.e., the Gaussian distribution confined to the range of a rank deficient covariance matrix $\Sigma$). Then we obtain the asymptotic law of $\sqrt{n}(\check{\alpha}_n - \alpha_*)$.

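Theorem 3 Let$^7$ $Z_1 \sim N(0, I_0 - P(\varphi/g_*)P(\varphi/g_*)^T)$ and $Z_2 \sim N(0, Q\varphi\varphi^T - Q\varphi\, Q\varphi^T)$, where $Z_1$ and $Z_2$ are independent. Further define $Z := I_0^{-1}(Z_1 + Z_2)$ and $\lambda_* = \nabla P\psi(\alpha_*) - Q\varphi$. Then
$$\sqrt{n}(\check{\alpha}_n - \alpha_*) \rightsquigarrow \operatorname*{arg\,min}_{\delta \in C,\ \lambda_*^T \delta = 0} \|\delta - Z\|_0 \quad \text{(convergence in law)}.$$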

The proof is provided in Appendix B. If α∗ > 0 (α∗ is an inner point of the feasible
set), asymptotic normality can be proven in a simpler way. Set Rn and R as follows:

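$$R_n := I - \frac{Q_n\varphi\, Q_n\varphi^T}{\|Q_n\varphi\|^2}, \qquad R := I - \frac{Q\varphi\, Q\varphi^T}{\|Q\varphi\|^2}.$$

$R_n$ and $R$ are projection matrices to linear spaces $C_n = \{\delta \mid \delta^T Q_n\varphi = 0\}$ and $C = \{\delta \mid \delta^T Q\varphi = 0\}$ respectively. Note that $R_n(\check{\alpha}_n - \alpha_*) = \check{\alpha}_n - \alpha_*$. Now $\check{\alpha}_n \xrightarrow{p} \alpha_*$ indicates that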
the probability of the event {α̌n > 0} goes to 1. Then on the event {α̌n > 0}, by the
KKT condition
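$$0 = \sqrt{n} R_n(\nabla P_n\psi(\check{\alpha}_n) - a_*^n Q_n\varphi) = \sqrt{n} R_n(\nabla P_n\psi(\check{\alpha}_n) - Q_n\varphi)$$
$$= \sqrt{n} R(\nabla P_n\psi(\alpha_*) - Q_n\varphi) - \sqrt{n} R I_0 R(\check{\alpha}_n - \alpha_*) + o_p(1)$$
$$\Rightarrow \ \sqrt{n}(\check{\alpha}_n - \alpha_*) = \sqrt{n}(R I_0 R)^\dagger R(\nabla P_n\psi(\alpha_*) - \nabla P\psi(\alpha_*) - Q_n\varphi + Q\varphi) + o_p(1)$$
$$\rightsquigarrow (R I_0 R)^\dagger R I_0 Z, \qquad (13)$$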
$^7$ Since $\alpha_*^T (I_0 - P(\varphi/g_*)P(\varphi/g_*)^T)\alpha_* = 0$, $Z_1$ obeys a degenerate normal distribution.

where † means the Moore-Penrose pseudo-inverse and in the third equality we used the
relation ∇P ψ(α∗ ) − Qφ = 0 according to the KKT condition. On the other hand, since
δ = Rδ for δ ∈ C, we have
$$\|Z - \delta\|_0^2 = (Z - \delta)^T I_0 (Z - \delta) = (Z - R\delta)^T I_0 (Z - R\delta)$$
$$= (\delta - (R I_0 R)^\dagger R I_0 Z)^T R I_0 R\, (\delta - (R I_0 R)^\dagger R I_0 Z) + (\text{the terms independent of } \delta).$$
The minimizer of the right-hand side of the above equality in $C$ is $\delta = (R I_0 R)^\dagger R I_0 Z$. Thus the result of Theorem 3 coincides with (13).
In addition to Theorem 3 we can show the asymptotic law of $\sqrt{n}(\hat{\alpha}_n - \alpha_*)$. The proof is also given in Appendix B.

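Theorem 4 Let $Z$, $Z_2$ and $\lambda_*$ be as in Theorem 3. Then
$$\sqrt{n}(\hat{\alpha}_n - \alpha_*) \rightsquigarrow \operatorname*{arg\,min}_{\delta \in C,\ \lambda_*^T \delta = 0} \|\delta - Z\|_0 + (Z^T I_0 \alpha_*)\alpha_* \quad \text{(convergence in law)}.$$
The second term of the right-hand side is expressed by $(Z^T I_0 \alpha_*)\alpha_* = (Z_2^T \alpha_*)\alpha_*$.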

Remark 3 By the KKT condition and the definition of $I_0$, it can be easily checked that
$$\alpha_*^T I_0 \delta = 0 \ \ (\forall \delta \in C \cap \{\delta' \mid \lambda_*^T \delta' = 0\}), \qquad \|\alpha_*\|_0 = \alpha_*^T I_0 \alpha_* = 1.$$
Thus Theorem 4 gives an orthogonal decomposition of the asymptotic law of $\sqrt{n}(\hat{\alpha}_n - \alpha_*)$ to a parallel part and an orthogonal part to $C \cap \{\delta' \mid \lambda_*^T \delta' = 0\}$. Hence in particular, if $\alpha_* > 0$, then $\lambda_* = 0$ and $C$ is a linear subspace so that
$$\sqrt{n}(\hat{\alpha}_n - \alpha_*) \rightsquigarrow Z.$$

4 Illustrative Examples
We have shown that the KLIEP algorithm has favorable convergence properties. In this section, we illustrate the behavior of the proposed KLIEP method and how it can be applied in covariate shift adaptation.

4.1 Setting
Let us consider a one-dimensional toy regression problem of learning
f (x) = sinc(x).
Let the training and test input densities be
ptr (x) = N (x; 1, (1/2)2 ),
pte (x) = N (x; 2, (1/4)2 ),

[Figure 2: Illustrative example. (a) Training input density $p_{tr}(x)$ and test input density $p_{te}(x)$. (b) Target function $f(x)$, training samples $\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$, and test samples $\{(x_j^{te}, y_j^{te})\}_{j=1}^{n_{te}}$. Horizontal axis: $x$.]

where $N(x; \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. We create training output values $\{y_i^{tr}\}_{i=1}^{n_{tr}}$ by

$$y_i^{tr} = f(x_i^{tr}) + \epsilon_i^{tr},$$

where the noise $\{\epsilon_i^{tr}\}_{i=1}^{n_{tr}}$ has density $N(\epsilon; 0, (1/4)^2)$. Test output values $\{y_j^{te}\}_{j=1}^{n_{te}}$ are also generated in the same way. Let the number of training samples be $n_{tr} = 200$ and the number of test samples be $n_{te} = 1000$. The goal is to obtain a function $\hat{f}(x)$ such that the following generalization error $G$ (or the mean test error) is minimized:
$$G := \frac{1}{n_{te}} \sum_{j=1}^{n_{te}} \left( \hat{f}(x_j^{te}) - y_j^{te} \right)^2. \qquad (14)$$

This setting implies that we are considering a (weak) extrapolation problem (see Fig-
ure 2, where only 100 test samples are plotted for clear visibility).
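The setting above can be reproduced with a few lines of NumPy. The following is only a sketch: the function names are illustrative, and it assumes that sinc(x) means sin(πx)/(πx), which is the convention of `np.sinc`; the text does not state which convention is used.

```python
import numpy as np

def make_toy_data(n_tr=200, n_te=1000, seed=0):
    """Draw training/test samples for the toy regression problem of Section 4.1."""
    rng = np.random.default_rng(seed)
    # Assumption: sinc(x) = sin(pi x) / (pi x), i.e. the np.sinc convention.
    f = np.sinc
    x_tr = rng.normal(1.0, 0.5, size=n_tr)    # p_tr(x) = N(x; 1, (1/2)^2)
    x_te = rng.normal(2.0, 0.25, size=n_te)   # p_te(x) = N(x; 2, (1/4)^2)
    y_tr = f(x_tr) + rng.normal(0.0, 0.25, size=n_tr)  # noise N(0, (1/4)^2)
    y_te = f(x_te) + rng.normal(0.0, 0.25, size=n_te)
    return x_tr, y_tr, x_te, y_te

def generalization_error(f_hat, x_te, y_te):
    """Mean squared test error G of Eq.(14)."""
    return float(np.mean((f_hat(x_te) - y_te) ** 2))

def true_importance(x):
    """w(x) = p_te(x) / p_tr(x); the common factor 1/sqrt(2 pi) cancels."""
    log_p_te = -0.5 * ((x - 2.0) / 0.25) ** 2 - np.log(0.25)
    log_p_tr = -0.5 * ((x - 1.0) / 0.5) ** 2 - np.log(0.5)
    return np.exp(log_p_te - log_p_tr)
```

Working with log densities avoids forming two very small numbers before dividing them, which matters in the tails where both densities are close to zero.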

4.2 Importance Estimation by KLIEP

First, we illustrate the behavior of KLIEP in importance estimation, where we only use $\{x_i^{tr}\}_{i=1}^{n_{tr}}$ and $\{x_j^{te}\}_{j=1}^{n_{te}}$.

Figure 3 depicts the true importance and its estimates by KLIEP; the Gaussian kernel model (7) with b = 100 is used and three different Gaussian widths are tested. The graphs show that the performance of KLIEP is highly dependent on the Gaussian width; the estimated importance function $\hat{w}(x)$ fluctuates strongly when σ is small, while it is overly smoothed when σ is large. When σ is chosen appropriately, KLIEP seems to work reasonably well for this example.
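To make the role of the Gaussian width concrete, the sketch below fits a KLIEP-style importance model by projected gradient ascent. It is written under several assumptions that are not fixed by this section: the kernel is taken to be $K_\sigma(x, c) = \exp(-\|x - c\|^2 / (2\sigma^2))$, the centers are drawn at random from the test inputs, and the step size and iteration count are arbitrary illustrative values. The function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    """Kernel matrix with entries exp(-||x_i - c_l||^2 / (2 sigma^2))."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    centers = np.asarray(centers, dtype=float).reshape(len(centers), -1)
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def kliep_fit(x_tr, x_te, sigma, b=100, step=1e-4, n_iter=2000, seed=0):
    """Maximise the test log-likelihood of w(x) = sum_l alpha_l K(x, c_l)
    subject to alpha >= 0 and mean_i w(x_i^tr) = 1 (illustrative settings)."""
    rng = np.random.default_rng(seed)
    x_te = np.asarray(x_te, dtype=float)
    idx = rng.choice(len(x_te), size=min(b, len(x_te)), replace=False)
    centers = x_te[idx]                                   # assumed: centers from test inputs
    A = gaussian_kernel(x_te, centers, sigma)             # shape (n_te, b)
    b_vec = gaussian_kernel(x_tr, centers, sigma).mean(axis=0)  # shape (b,)
    alpha = np.ones(A.shape[1])
    alpha /= b_vec @ alpha
    for _ in range(n_iter):
        alpha = alpha + step * (A.T @ (1.0 / (A @ alpha)))       # gradient ascent step
        alpha = alpha + (1.0 - b_vec @ alpha) * b_vec / (b_vec @ b_vec)  # equality constraint
        alpha = np.maximum(alpha, 0.0)                           # non-negativity
        alpha = alpha / (b_vec @ alpha)                          # renormalise
    return centers, alpha

def kliep_predict(x, centers, alpha, sigma):
    """Evaluate the fitted importance function at the points x."""
    return gaussian_kernel(x, centers, sigma) @ alpha
```

By construction the fitted weights are non-negative and average to one over the training inputs, so comparing `kliep_predict` for a small, a medium, and a large `sigma` isolates the effect of the kernel width on smoothness.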

[Figure 3: Results of importance estimation by KLIEP. $w(x)$ is the true importance function and $\hat{w}(x)$ is its estimate obtained by KLIEP. Panels: (a) Gaussian width σ = 0.02, (b) Gaussian width σ = 0.2, (c) Gaussian width σ = 0.8. Horizontal axis: $x$.]

Figure 4 depicts the values of the true J (see Eq.(3)) and its estimate by 5-fold LCV (see Eq.(5)); the means, the 25th percentiles, and the 75th percentiles over 100 trials are plotted as functions of the Gaussian width σ. This shows that LCV gives a very good estimate of J, which results in an appropriate choice of σ.

4.3 Covariate Shift Adaptation by IWLS and IWCV

Next, we illustrate how the estimated importance could be used for covariate shift adaptation. Here we use $\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$ and $\{x_j^{te}\}_{j=1}^{n_{te}}$ for learning; the test output values $\{y_j^{te}\}_{j=1}^{n_{te}}$ are used only for evaluating the generalization performance.


[Figure 4: Model selection curve for KLIEP. $J$ is the true score of an estimated importance (see Eq.(3)) and $\hat{J}_{LCV}$ is its estimate by 5-fold LCV (see Eq.(5)). Horizontal axis: σ (Gaussian width).]

We use the following polynomial regression model:

$$\hat{f}(x; \theta) := \sum_{\ell=0}^{t} \theta_\ell x^\ell, \qquad (15)$$

where $t$ is the order of polynomials. The parameter vector $\theta$ is learned by importance-weighted least-squares (IWLS):
$$\hat{\theta}_{IWLS} := \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{tr}} \hat{w}(x_i^{tr}) \left( \hat{f}(x_i^{tr}; \theta) - y_i^{tr} \right)^2 \right].$$

It is known that IWLS is consistent when the true importance $w(x_i^{tr})$ is used as weights, whereas ordinary LS is not consistent due to covariate shift, given that the model $\hat{f}(x; \theta)$ is not correctly specified$^8$ (Shimodaira, 2000). For the linear regression model (15), the above minimizer $\hat{\theta}_{IWLS}$ is given analytically by

$$\hat{\theta}_{IWLS} = (X^\top \hat{W} X)^{-1} X^\top \hat{W} y,$$

where

$$[X]_{i,\ell} = (x_i^{tr})^{\ell-1},$$
$$\hat{W} = \mathrm{diag}\left( \hat{w}(x_1^{tr}), \hat{w}(x_2^{tr}), \ldots, \hat{w}(x_{n_{tr}}^{tr}) \right),$$
$$y = (y_1^{tr}, y_2^{tr}, \ldots, y_{n_{tr}}^{tr})^\top. \qquad (16)$$

$^8$ A model $\hat{f}(x; \theta)$ is said to be correctly specified if there exists a parameter $\theta^*$ such that $\hat{f}(x; \theta^*) = f(x)$.

diag (a, b, . . . , c) denotes the diagonal matrix with diagonal elements a, b, . . . , c.
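The closed-form IWLS solution can be written compactly in NumPy. This is a minimal sketch with hypothetical function names; it solves the weighted normal equations rather than forming an explicit inverse, and it never builds the diagonal weight matrix.

```python
import numpy as np

def iwls_polynomial(x_tr, y_tr, w_hat, order):
    """Importance-weighted least squares for the polynomial model (15)."""
    # Columns x^0, x^1, ..., x^order, matching [X]_{i,l} = (x_i)^(l-1).
    X = np.vander(np.asarray(x_tr, dtype=float), N=order + 1, increasing=True)
    # Broadcasting scales column i of X^T by w_i, which equals X^T W for diagonal W.
    XtW = X.T * np.asarray(w_hat, dtype=float)
    return np.linalg.solve(XtW @ X, XtW @ np.asarray(y_tr, dtype=float))

def predict_polynomial(theta, x):
    """Evaluate the fitted polynomial at the points x."""
    X = np.vander(np.asarray(x, dtype=float), N=len(theta), increasing=True)
    return X @ theta
```

Setting all weights to one recovers ordinary LS, which makes it easy to compare the two fits on the same data.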

We choose the order $t$ of polynomials based on importance-weighted CV (IWCV) (Sugiyama et al., 2007). More specifically, we first divide the training samples $\{z_i^{tr} \mid z_i^{tr} = (x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$ into $R$ disjoint subsets $\{Z_r^{tr}\}_{r=1}^{R}$. Then we learn a function $\hat{f}_r(x)$ from $\{Z_j^{tr}\}_{j \ne r}$ by IWLS and compute its mean test error for the remaining samples $Z_r^{tr}$:

$$\hat{G}_r := \frac{1}{|Z_r^{tr}|} \sum_{(x,y) \in Z_r^{tr}} \hat{w}(x) \left( \hat{f}_r(x) - y \right)^2.$$

We repeat this procedure for $r = 1, 2, \ldots, R$, compute the average of $\hat{G}_r$ over all $r$, and use the average $\hat{G}$ as an estimate of $G$:

$$\hat{G} := \frac{1}{R} \sum_{r=1}^{R} \hat{G}_r. \qquad (17)$$

For model selection, we compute $\hat{G}$ for all model candidates (the order $t$ of polynomials in the current setting) and choose the one that minimizes $\hat{G}$. We set the number of folds in IWCV at $R = 5$. IWCV is shown to be unbiased, while ordinary CV with misspecified models is biased due to covariate shift (Sugiyama et al., 2007).
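The IWCV procedure can be sketched as follows. The interface (`fit` and `predict` callables) and the helper `make_poly_fit` are illustrative choices, not part of the original method description. The polynomial fit uses `np.polyfit`, whose `w` argument multiplies the residuals before squaring, so the square root of the importance is passed.

```python
import numpy as np

def iwcv_score(x_tr, y_tr, w_hat, fit, predict, n_folds=5, seed=0):
    """Importance-weighted CV estimate of the generalization error, Eq.(17)."""
    x_tr, y_tr, w_hat = (np.asarray(a, dtype=float) for a in (x_tr, y_tr, w_hat))
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x_tr)), n_folds)
    fold_scores = []
    for r in range(n_folds):
        held_out = folds[r]
        rest = np.concatenate([folds[j] for j in range(n_folds) if j != r])
        model = fit(x_tr[rest], y_tr[rest], w_hat[rest])
        residual = predict(model, x_tr[held_out]) - y_tr[held_out]
        # Weighted mean squared error on the held-out fold (G_r in the text).
        fold_scores.append(np.mean(w_hat[held_out] * residual ** 2))
    return float(np.mean(fold_scores))

def make_poly_fit(order):
    """IWLS fit of a polynomial of the given order (hypothetical helper)."""
    # np.polyfit minimises sum_i (w_i * residual_i)^2, hence the square root.
    return lambda x, y, w: np.polyfit(x, y, order, w=np.sqrt(w))

def select_order(x_tr, y_tr, w_hat, orders=(1, 2, 3)):
    """Return the polynomial order with the smallest IWCV score."""
    scores = {t: iwcv_score(x_tr, y_tr, w_hat, make_poly_fit(t), np.polyval)
              for t in orders}
    return min(scores, key=scores.get), scores
```

Passing a vector of ones as `w_hat` turns the same routine into ordinary CV, which is convenient for the comparison discussed in this section.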
Figure 5 depicts the functions learned by IWLS with diﬀerent orders of polynomials.
The results show that for all cases, the learned functions reasonably go through the test
samples (note that the test output points are not used for obtaining the learned functions).
Figure 6(a) depicts the true generalization error of IWLS and its estimate by IWCV; the means, the 25th percentiles, and the 75th percentiles over 100 runs are plotted as functions of the order of polynomials. This shows that IWCV roughly captures the trend of the true generalization error. For comparison purposes, we also include the results by ordinary LS
and ordinary CV in Figure 5 and Figure 6. Figure 5 shows that the functions obtained by
ordinary LS go through the training samples, but not through the test samples. Figure 6
shows that the scores of ordinary CV tend to be biased, implying that model selection by
ordinary CV is not reliable.
Finally, we compare the generalization error obtained by IWLS/LS and IWCV/CV,
which is summarized in Figure 7 as box plots. This shows that IWLS+IWCV tends
to outperform other methods, illustrating the usefulness of the proposed approach in
covariate shift adaptation.

5 Discussion
In this section, we discuss the relation between KLIEP and existing approaches.

[Figure 5: Learned functions obtained by IWLS and LS, which are denoted by $\hat{f}_{IWLS}(x)$ and $\hat{f}_{LS}(x)$, respectively, shown together with the target function $f(x)$ and the training and test samples. Panels: (a) polynomial of order 1, (b) polynomial of order 2, (c) polynomial of order 3. Horizontal axis: $x$.]

5.1 Kernel Density Estimator

The kernel density estimator (KDE) is a non-parametric technique to estimate a density $p(x)$ from its i.i.d. samples $\{x_k\}_{k=1}^{n}$. For the Gaussian kernel, KDE is expressed as

$$\hat{p}(x) = \frac{1}{n (2\pi\sigma^2)^{d/2}} \sum_{k=1}^{n} K_\sigma(x, x_k), \qquad (18)$$

where Kσ (x, x′ ) is the Gaussian kernel (6) with width σ.

The estimation performance of KDE depends on the choice of the kernel width σ, which can be optimized by LCV (Härdle et al., 2004): a subset of $\{x_k\}_{k=1}^{n}$ is used for density estimation and the rest is used for estimating the likelihood of the held-out samples. Note

[Figure 6: Model selection curves for IWLS/LS and IWCV/CV. $G$ denotes the true generalization error of a learned function (see Eq.(14)), while $\hat{G}_{IWCV}$ and $\hat{G}_{CV}$ denote its estimate by 5-fold IWCV and 5-fold CV, respectively (see Eq.(17)). Panels: (a) IWLS, (b) LS. Horizontal axis: $t$ (order of polynomials).]

[Figure 7: Box plots of generalization errors for IWLS+IWCV, IWLS+CV, LS+IWCV, and LS+CV, with the 5%, 25%, 50%, 75%, and 95% levels marked.]

that model selection based on LCV corresponds to choosing σ such that the Kullback-Leibler divergence from $p(x)$ to $\hat{p}(x)$ is minimized.
KDE can be used for importance estimation by first estimating $\hat{p}_{tr}(x)$ and $\hat{p}_{te}(x)$ separately from $\{x_i^{tr}\}_{i=1}^{n_{tr}}$ and $\{x_j^{te}\}_{j=1}^{n_{te}}$, and then estimating the importance by $\hat{w}(x) = \hat{p}_{te}(x)/\hat{p}_{tr}(x)$. A potential limitation of this approach is that KDE suffers from the curse of dimensionality (Härdle et al., 2004), i.e., the number of samples needed to maintain the same approximation quality grows exponentially as the dimension of the input space increases. Furthermore, model selection by LCV is unreliable in small sample cases since data splitting in the CV procedure further reduces the sample size. Therefore, the KDE-based approach may not be reliable in high-dimensional cases.
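The KDE-based importance estimate described above amounts to evaluating Eq.(18) twice and taking the ratio. The sketch below assumes the Gaussian kernel has the form $\exp(-\|x - x'\|^2/(2\sigma^2))$ and takes the two widths as given (in the text they would be chosen by LCV); the function names are illustrative.

```python
import numpy as np

def kde(x, samples, sigma):
    """Gaussian KDE of Eq.(18), evaluated at each point of x."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    samples = np.asarray(samples, dtype=float).reshape(len(samples), -1)
    n, d = samples.shape
    sq_dist = ((x[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    kernel_sum = np.exp(-sq_dist / (2.0 * sigma ** 2)).sum(axis=1)
    return kernel_sum / (n * (2.0 * np.pi * sigma ** 2) ** (d / 2.0))

def kde_importance(x, x_tr, x_te, sigma_tr, sigma_te):
    """w_hat(x) = p_hat_te(x) / p_hat_tr(x), with separately chosen widths."""
    return kde(x, x_te, sigma_te) / kde(x, x_tr, sigma_tr)
```

The division makes the weakness of this approach visible: wherever the estimated training density is close to zero, small errors in the denominator are amplified in the importance estimate.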

5.2 Kernel Mean Matching

The kernel mean matching (KMM) method avoids density estimation and directly gives
an estimate of the importance at training input points (Huang et al., 2007).
The basic idea of KMM is to find $\hat{w}(x)$ such that the mean discrepancy between nonlinearly transformed samples drawn from $p_{te}(x)$ and $p_{tr}(x)$ is minimized in a universal reproducing kernel Hilbert space (Steinwart, 2001). The Gaussian kernel (6) is an example of kernels that induce universal reproducing kernel Hilbert spaces and it has been shown that the solution of the following optimization problem agrees with the true importance:
$$\min_{w(x)} \left\| \int K_\sigma(x, \cdot)\, p_{te}(x)\, dx - \int K_\sigma(x, \cdot)\, w(x)\, p_{tr}(x)\, dx \right\|_{\mathcal{H}}^2$$
$$\text{subject to } \int w(x)\, p_{tr}(x)\, dx = 1 \ \text{ and } \ w(x) \ge 0,$$

where ∥ · ∥H denotes the norm in the Gaussian reproducing kernel Hilbert space and
Kσ (x, x′ ) is the Gaussian kernel (6) with width σ.
An empirical version of the above problem is reduced to the following quadratic program:
$$\min_{\{w_i\}_{i=1}^{n_{tr}}} \left[ \frac{1}{2} \sum_{i,i'=1}^{n_{tr}} w_i w_{i'} K_\sigma(x_i^{tr}, x_{i'}^{tr}) - \sum_{i=1}^{n_{tr}} w_i \kappa_i \right]$$
$$\text{subject to } \left| \sum_{i=1}^{n_{tr}} w_i - n_{tr} \right| \le n_{tr}\epsilon \ \text{ and } \ 0 \le w_1, w_2, \ldots, w_{n_{tr}} \le B,$$

where
$$\kappa_i := \frac{n_{tr}}{n_{te}} \sum_{j=1}^{n_{te}} K_\sigma(x_i^{tr}, x_j^{te}).$$

$B$ ($\ge 0$) and $\epsilon$ ($\ge 0$) are tuning parameters which control the regularization effects. The solution $\{\hat{w}_i\}_{i=1}^{n_{tr}}$ is an estimate of the importance at the training input points $\{x_i^{tr}\}_{i=1}^{n_{tr}}$.

Since KMM does not involve density estimation, it is expected to work well even in high-dimensional cases. However, the performance is dependent on the tuning parameters B, ϵ, and σ, and they cannot simply be optimized, e.g., by CV, since estimates of the importance are available only at the training input points. Thus, an out-of-sample extension is needed to apply KMM in the CV framework, but this seems to be an open research issue currently.
A relation between KMM and a variant of KLIEP has been studied in Tsuboi et al. (2008).

5.3 Logistic Regression

Another approach to directly estimating the importance is to use a probabilistic classiﬁer.
Let us assign a selector variable δ = −1 to training input samples and δ = 1 to test input

samples, i.e., the training and test input densities are written as

ptr (x) = p(x|δ = −1),

pte (x) = p(x|δ = 1).

An application of the Bayes theorem immediately yields that the importance can be expressed in terms of δ as follows (Bickel et al., 2007):

$$w(x) = \frac{p(x \mid \delta = 1)}{p(x \mid \delta = -1)} = \frac{p(\delta = -1)}{p(\delta = 1)} \cdot \frac{p(\delta = 1 \mid x)}{p(\delta = -1 \mid x)}.$$

The probability ratio of test and training samples may be simply estimated by the ratio of the numbers of samples:
$$\frac{p(\delta = -1)}{p(\delta = 1)} \approx \frac{n_{tr}}{n_{te}}.$$
The conditional probability p(δ|x) could be approximated by discriminating test samples
from training samples using a logistic regression (LogReg) classifier, where δ plays the
role of a class variable. Below, we briefly explain the LogReg method.
The LogReg classifier employs a parametric model of the following form for expressing the conditional probability $p(\delta \mid x)$:
$$\hat{p}(\delta \mid x) := \frac{1}{1 + \exp\left( -\delta \sum_{\ell=1}^{u} \beta_\ell \phi_\ell(x) \right)},$$

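where $u$ is the number of basis functions and $\{\phi_\ell(x)\}_{\ell=1}^{u}$ are fixed basis functions. The parameter $\beta$ is learned so that the negative log-likelihood is minimized:
$$\hat{\beta} := \operatorname*{argmin}_{\beta} \left[ \sum_{i=1}^{n_{tr}} \log\left( 1 + \exp\left( \sum_{\ell=1}^{u} \beta_\ell \phi_\ell(x_i^{tr}) \right) \right) + \sum_{j=1}^{n_{te}} \log\left( 1 + \exp\left( -\sum_{\ell=1}^{u} \beta_\ell \phi_\ell(x_j^{te}) \right) \right) \right].$$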

Since the above objective function is convex, the global optimal solution can be obtained by standard nonlinear optimization methods such as Newton's method, conjugate gradient, or the BFGS method (Minka, 2007). Then the importance estimate is given by
$$\hat{w}(x) = \frac{n_{tr}}{n_{te}} \exp\left( \sum_{\ell=1}^{u} \hat{\beta}_\ell \phi_\ell(x) \right).$$

An advantage of the LogReg method is that model selection (i.e., the choice of basis
functions {ϕℓ (x)}uℓ=1 ) is possible by standard CV, since the learning problem involved
above is a standard supervised classiﬁcation problem.
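The step from the fitted classifier to the importance estimate can be checked numerically: under the LogReg model, the posterior ratio $\hat{p}(\delta = 1 \mid x)/\hat{p}(\delta = -1 \mid x)$ equals $\exp(\sum_\ell \hat{\beta}_\ell \phi_\ell(x))$. The sketch below uses made-up coefficient and basis values purely for illustration; the function names are hypothetical.

```python
import math

def posterior(delta, beta, basis_values):
    """p_hat(delta | x) of the LogReg model, with delta in {-1, +1}."""
    score = sum(b * phi for b, phi in zip(beta, basis_values))
    return 1.0 / (1.0 + math.exp(-delta * score))

def logreg_importance(beta, basis_values, n_tr, n_te):
    """w_hat(x) = (n_tr / n_te) * exp(sum_l beta_l * phi_l(x))."""
    score = sum(b * phi for b, phi in zip(beta, basis_values))
    return (n_tr / n_te) * math.exp(score)

# Illustrative values only: two basis functions evaluated at some point x.
beta = [0.5, -1.0]
phi_x = [2.0, 0.3]
via_bayes = (100 / 1000) * posterior(+1, beta, phi_x) / posterior(-1, beta, phi_x)
via_formula = logreg_importance(beta, phi_x, n_tr=100, n_te=1000)
```

Both routes give the same number up to rounding, which confirms that no explicit evaluation of the two posteriors is needed once the coefficients are available.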

6 Experiments
In this section, we compare the experimental performance of KLIEP and existing ap-
proaches.

6.1 Importance Estimation for Artificial Datasets

Let ptr (x) be the d-dimensional Gaussian density with mean (0, 0, . . . , 0)⊤ and covariance
identity and pte (x) be the d-dimensional Gaussian density with mean (1, 0, . . . , 0)⊤ and
covariance identity. The task is to estimate the importance at training input points:

$$w_i := w(x_i^{tr}) = \frac{p_{te}(x_i^{tr})}{p_{tr}(x_i^{tr})} \quad \text{for } i = 1, 2, \ldots, n_{tr}.$$

We compare the following methods:

KLIEP(σ): $\{w_i\}_{i=1}^{n_{tr}}$ are estimated by KLIEP with the Gaussian kernel model (7). The number of template points is fixed at b = 100. Since the performance of KLIEP is dependent on the kernel width σ, we test several different values of σ.

KLIEP(CV): The kernel width σ in KLIEP is chosen based on 5-fold LCV (see Section 2.3).

KDE(CV): $\{w_i\}_{i=1}^{n_{tr}}$ are estimated by KDE with the Gaussian kernel (18). The kernel widths for the training and test densities are chosen separately based on 5-fold LCV (see Section 5.1).

KMM(σ): $\{w_i\}_{i=1}^{n_{tr}}$ are estimated by KMM (see Section 5.2). The performance of KMM is dependent on B, ϵ, and σ. We set B = 1000 and $\epsilon = (\sqrt{n_{tr}} - 1)/\sqrt{n_{tr}}$ following Huang et al. (2007), and test several different values of σ. We used the CPLEX software for solving quadratic programs in the experiments.

LogReg(σ): Gaussian kernels (7) are used as basis functions, where kernels are put at all training and test input points$^9$. Since the performance of LogReg is dependent on the kernel width σ, we test several different values of σ. We used the LIBLINEAR implementation of logistic regression for the experiments (Lin et al., 2007).

LogReg(CV): The kernel width σ in LogReg is chosen based on 5-fold CV.

We fix the number of test input points at $n_{te} = 1000$ and consider the following two settings for the number $n_{tr}$ of training samples and the input dimension $d$:

(a) ntr = 100 and d = 1, 2, . . . , 20,

$^9$ We also tested another LogReg model where only 100 Gaussian kernels are used and the Gaussian centers are chosen randomly from the test input points. Our preliminary experiments showed that this does not degrade the performance.

(b) d = 10 and ntr = 50, 60, . . . , 150.

We run the experiments 100 times for each $d$, each $n_{tr}$, and each method, and evaluate the quality of the importance estimates $\{\hat{w}_i\}_{i=1}^{n_{tr}}$ by the normalized mean squared error (NMSE):
$$\mathrm{NMSE} := \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \left( \frac{\hat{w}_i}{\sum_{i'=1}^{n_{tr}} \hat{w}_{i'}} - \frac{w_i}{\sum_{i'=1}^{n_{tr}} w_{i'}} \right)^2.$$
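The NMSE compares the two sets of weights after each has been normalized to sum to one, so it is insensitive to a common rescaling of the estimates. A minimal sketch in plain Python (the function name is illustrative):

```python
def nmse(w_hat, w_true):
    """Normalized mean squared error between estimated and true importance values."""
    sum_hat = sum(w_hat)
    sum_true = sum(w_true)
    squared_gaps = (
        (est / sum_hat - ref / sum_true) ** 2 for est, ref in zip(w_hat, w_true)
    )
    return sum(squared_gaps) / len(w_true)
```

For example, `nmse([2.0, 4.0], [1.0, 2.0])` is zero because the estimates differ from the truth only by a constant factor.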

NMSEs averaged over 100 trials are plotted in log scale in Figure 8. Figure 8(a)
shows that the error of KDE(CV) sharply increases as the input dimension grows, while
KLIEP, KMM, and LogReg with appropriate kernel widths tend to give smaller errors
than KDE(CV). This is presumably the benefit of directly estimating the importance without going through density estimation. The graph also shows that the performance of KLIEP,
KMM, and LogReg is dependent on the kernel width σ—the results of KLIEP(CV) and
LogReg(CV) show that model selection is carried out reasonably well. Figure 9(a) sum-
marizes the results of KLIEP(CV), KDE(CV), and LogReg(CV), where, for each input
dimension, the best method in terms of the mean error and comparable ones based on
the t-test at the significance level 5% are indicated by ‘◦’; the methods with significant
difference from the best method are indicated by ‘×’. This shows that KLIEP(CV) works
significantly better than KDE(CV) and LogReg(CV).
Figure 8(b) shows that the errors of all methods tend to decrease as the number
of training samples grows. Again, KLIEP, KMM, and LogReg with appropriate kernel
widths tend to give smaller errors than KDE(CV), and model selection in KLIEP(CV) and LogReg(CV) is shown to work reasonably well. Figure 9(b) shows that KLIEP(CV)
tends to give significantly smaller errors than KDE(CV) and LogReg(CV).
Overall, KLIEP(CV) is shown to be a useful method in importance estimation.

6.2 Covariate Shift Adaptation with Regression and Classification Benchmark Datasets
Here we employ importance estimation methods for covariate shift adaptation in regression and classification benchmark problems (see Table 1).
Each dataset consists of input/output samples $\{(x_k, y_k)\}_{k=1}^{n}$. We normalize all the input samples $\{x_k\}_{k=1}^{n}$ into $[0, 1]^d$ and choose the test samples $\{(x_j^{te}, y_j^{te})\}_{j=1}^{n_{te}}$ from the pool $\{(x_k, y_k)\}_{k=1}^{n}$ as follows. We randomly choose one sample $(x_k, y_k)$ from the pool and accept this with probability $\min(1, 4(x_k^{(c)})^2)$, where $x_k^{(c)}$ is the $c$-th element of $x_k$ and $c$ is randomly determined and fixed in each trial of experiments; then we remove $x_k$ from the pool regardless of its rejection or acceptance, and repeat this procedure until we accept $n_{te}$ samples. We choose the training samples $\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$ uniformly from the rest. Intuitively, in this experiment, the test input density tends to be lower than the training input density when $x_k^{(c)}$ is small. We set the number of samples at $n_{tr} = 100$ and $n_{te} = 500$ for all datasets. Note that we only use $\{(x_i^{tr}, y_i^{tr})\}_{i=1}^{n_{tr}}$ and $\{x_j^{te}\}_{j=1}^{n_{te}}$ for training regressors or classifiers; the test output values $\{y_j^{te}\}_{j=1}^{n_{te}}$ are used only for evaluating the generalization performance.
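The sampling scheme above can be sketched with the standard library alone. The function name, the index-based return values, and the error handling for an exhausted pool are illustrative choices, not part of the original description; inputs are assumed to be already normalized to $[0, 1]^d$.

```python
import random

def split_with_covariate_shift(inputs, n_te, n_tr, seed=0):
    """Return (test_indices, train_indices, c) following the scheme described above.

    `inputs` is a list of feature lists, each already scaled to [0, 1].
    """
    rng = random.Random(seed)
    c = rng.randrange(len(inputs[0]))   # coordinate chosen once per trial
    pool = list(range(len(inputs)))
    rng.shuffle(pool)                   # popping from a shuffled pool = drawing at random
    test_idx = []
    while len(test_idx) < n_te:
        if not pool:
            raise ValueError("pool exhausted before n_te samples were accepted")
        k = pool.pop()                  # removed whether or not it is accepted
        if rng.random() < min(1.0, 4.0 * inputs[k][c] ** 2):
            test_idx.append(k)
    if len(pool) < n_tr:
        raise ValueError("not enough samples left for the training set")
    train_idx = rng.sample(pool, n_tr)  # uniform draw from the remaining pool
    return test_idx, train_idx, c
```

Since the acceptance probability reaches one once the chosen coordinate is at least 1/2, samples in the upper half of that coordinate are always accepted, while samples near zero are almost always rejected.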

[Figure 8: NMSEs averaged over 100 trials in log scale. Panels: (a) when input dimension is changed (horizontal axis: $d$, input dimension), (b) when training sample size is changed (horizontal axis: $n_{tr}$, number of training samples). Vertical axis: average NMSE over 100 trials (in log scale). Methods shown: KLIEP(0.5), KLIEP(2), KLIEP(7), KLIEP(CV), KDE(CV), KMM(0.1), KMM(1), KMM(10), LogReg(0.5), LogReg(2), LogReg(7), LogReg(CV).]


[Figure 9: NMSEs averaged over 100 trials in log scale for KLIEP(CV), KDE(CV), and LogReg(CV). Panels: (a) when input dimension is changed, (b) when training sample size is changed. For each dimension/number of training samples, the best method in terms of the mean error and comparable ones based on the t-test at the significance level 5% are indicated by '◦'; the methods with significant difference from the best method are indicated by '×'.]

We use the following kernel model for regression or classification:

$$\hat{f}(x; \theta) := \sum_{\ell=1}^{t} \theta_\ell K_h(x, m_\ell),$$

where $K_h(x, x')$ is the Gaussian kernel (6) with width $h$ and $m_\ell$ is a template point randomly chosen from $\{x_j^{te}\}_{j=1}^{n_{te}}$. We set the number of kernels$^{10}$ at $t = 50$. We learn the parameter $\theta$ by importance-weighted regularized least-squares (IWRLS) (Sugiyama et al., 2007):
$$\hat{\theta}_{IWRLS} := \operatorname*{argmin}_{\theta} \left[ \sum_{i=1}^{n_{tr}} \hat{w}(x_i^{tr}) \left( \hat{f}(x_i^{tr}; \theta) - y_i^{tr} \right)^2 + \lambda \|\theta\|^2 \right]. \qquad (19)$$

The solution $\hat{\theta}_{IWRLS}$ is analytically given by
$$\hat{\theta}_{IWRLS} = (K^\top \hat{W} K + \lambda I)^{-1} K^\top \hat{W} y,$$

where $I$ is the identity matrix, $y$ is defined by Eq.(16), and

$$[K]_{i,\ell} := K_h(x_i^{tr}, m_\ell),$$
$$\hat{W} := \mathrm{diag}(\hat{w}_1, \hat{w}_2, \ldots, \hat{w}_{n_{tr}}).$$
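The IWRLS solution has the same structure as a ridge regression with per-sample weights. A minimal NumPy sketch follows; the function name is illustrative and the kernel matrix is assumed to be precomputed.

```python
import numpy as np

def iwrls_fit(K, y, w_hat, lam):
    """Solve (K^T W K + lam I) theta = K^T W y with W = diag(w_hat)."""
    K = np.asarray(K, dtype=float)
    # Broadcasting scales column i of K^T by w_i, which equals K^T W for diagonal W.
    KtW = K.T * np.asarray(w_hat, dtype=float)
    lhs = KtW @ K + lam * np.eye(K.shape[1])
    rhs = KtW @ np.asarray(y, dtype=float)
    return np.linalg.solve(lhs, rhs)
```

Using `np.linalg.solve` instead of an explicit inverse is a standard numerical choice; for λ > 0 the system matrix is positive definite, so the solve is well posed.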

The kernel width $h$ and the regularization parameter $\lambda$ in IWRLS (19) are chosen by 5-fold IWCV. We compute the IWCV score by

$$\frac{1}{5} \sum_{r=1}^{5} \frac{1}{|Z_r^{tr}|} \sum_{(x,y) \in Z_r^{tr}} \hat{w}(x)\, L\left( \hat{f}_r(x), y \right),$$

where $Z_r^{tr}$ is the $r$-th held-out sample set (see Section 4.3) and
$$L(\hat{y}, y) := \begin{cases} (\hat{y} - y)^2 & \text{(Regression)}, \\ \frac{1}{2}(1 - \mathrm{sign}\{\hat{y} y\}) & \text{(Classification)}. \end{cases}$$

We run the experiments 100 times for each dataset and evaluate the mean test error:

$$\frac{1}{n_{te}} \sum_{j=1}^{n_{te}} L\left( \hat{f}(x_j^{te}), y_j^{te} \right).$$

The results are summarized in Table 1, where ‘Uniform’ denotes uniform weights, i.e., no
importance weight is used. The table shows that KLIEP(CV) compares favorably with
Uniform, implying that the importance weighting techniques combined with KLIEP(CV)
$^{10}$ We fixed the number of kernels at a rather small number since we are interested in investigating the prediction performance under model misspecification; for over-specified models, importance-weighting methods have no advantage over the no importance method.

Table 1: Mean test error averaged over 100 trials. The numbers in the brackets are
the standard deviation. All the error values are normalized so that the mean error by
‘Uniform’ (uniform weighting, or equivalently no importance weighting) is one. For each
dataset, the best method and comparable ones based on the Wilcoxon signed rank test at
the signiﬁcance level 5% are described in bold face. The upper half are regression datasets
taken from DELVE (Rasmussen et al., 1996) and the lower half are classiﬁcation datasets
taken from IDA (Rätsch et al., 2001). ‘KMM(σ)’ denotes KMM with kernel width σ.
Data      Dim  Uniform     KLIEP(CV)   KDE(CV)     KMM(0.01)   KMM(0.3)    KMM(1)      LogReg(CV)
kin-8fh     8  1.00(0.34)  0.95(0.31)  1.22(0.52)  1.00(0.34)  1.12(0.37)  1.59(0.53)  1.38(0.40)
kin-8fm     8  1.00(0.39)  0.86(0.35)  1.12(0.57)  1.00(0.39)  0.98(0.46)  1.95(1.24)  1.38(0.61)
kin-8nh     8  1.00(0.26)  0.99(0.22)  1.09(0.20)  1.00(0.27)  1.04(0.17)  1.16(0.25)  1.05(0.17)
kin-8nm     8  1.00(0.30)  0.97(0.25)  1.14(0.26)  1.00(0.30)  1.09(0.23)  1.20(0.22)  1.14(0.24)
abalone     7  1.00(0.50)  0.97(0.69)  1.02(0.41)  1.01(0.51)  0.96(0.70)  0.93(0.39)  0.90(0.40)
image      18  1.00(0.51)  0.94(0.44)  0.98(0.45)  0.97(0.50)  0.97(0.45)  1.09(0.54)  0.99(0.47)
ringnorm   20  1.00(0.04)  0.99(0.06)  0.87(0.04)  1.00(0.04)  0.87(0.05)  0.87(0.05)  0.93(0.08)
twonorm    20  1.00(0.58)  0.91(0.52)  1.16(0.71)  0.99(0.50)  0.86(0.55)  0.99(0.70)  0.92(0.56)
waveform   21  1.00(0.45)  0.93(0.34)  1.05(0.47)  1.00(0.44)  0.93(0.32)  0.98(0.31)  0.94(0.33)
Average        1.00(0.38)  0.95(0.35)  1.07(0.40)  1.00(0.36)  0.98(0.37)  1.20(0.47)  1.07(0.36)

are useful for improving the prediction performance under covariate shift. KLIEP(CV)
works much better than KDE(CV); actually KDE(CV) tends to be worse than Uniform,
which may be due to high dimensionality. We tested 10 diﬀerent values of the kernel
width σ for KMM and described three representative results in the table. KLIEP(CV)
is slightly better than KMM with the best kernel width. Finally, LogReg(CV) is overall
shown to work reasonably well, but it performs very poorly for some datasets. As a result,
the average performance is not good.
Overall, we conclude that the proposed KLIEP(CV) is a promising method for covari-
ate shift adaptation.

7 Conclusions
In this paper, we addressed the problem of estimating the importance for covariate shift
adaptation. The proposed method, called KLIEP, does not involve density estimation so it
is more advantageous than a naive KDE-based approach particularly in high-dimensional
problems. Compared with KMM which also directly gives importance estimates, KLIEP
is practically more useful since it is equipped with a model selection procedure. Our
experiments highlighted these advantages and therefore KLIEP is shown to be a promising
method for covariate shift adaptation.
In KLIEP, we modeled the importance function by a linear (or kernel) model, which
resulted in a convex optimization problem with a sparse solution. However, our framework
allows the use of any models. An interesting future direction to pursue would be to search
for a class of models which has additional advantages, e.g., faster optimization (Tsuboi

et al., 2008).
LCV is a popular model selection technique in density estimation and we used a vari-
ant of LCV for optimizing the Gaussian kernel width in KLIEP. In density estimation,
however, it is known that LCV is not consistent under some condition (Schuster and Gre-
gory, 1982; Hall, 1987). Thus it is important to investigate whether a similar inconsistency
phenomenon is observed also in the context of importance estimation.
We used IWCV for model selection of regressors or classiﬁers under covariate shift.
IWCV has smaller bias than ordinary CV and the model selection performance was shown
to be improved by IWCV. However, the variance of IWCV tends to be larger than ordinary
CV (Sugiyama et al., 2007) and therefore model selection by IWCV could be rather
unstable. In practice, slightly regularizing the importance weight involved in IWCV can
ease the problem, but this introduces an additional tuning parameter. Our important
future work in this context is to develop a method to optimally regularize IWCV, e.g.,
following the line of Sugiyama et al. (2004).
Finally, the range of application of importance weights is not limited to covariate shift
adaptation. For example, the density ratio could be used for anomaly detection, feature
selection, independent component analysis, and conditional density estimation. Exploring
possible application areas will be important future directions.

A Proofs of Theorem 1 and Theorem 2

A.1 Proof of Theorem 1
The proof follows the line of Nguyen et al. (2007). From the deﬁnition of γn , it follows
that
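$$-P_n \log \hat{g}_n \le -P_n \log(a_0^n g_0) + \gamma_n.$$
Then, by the convexity of $-\log(\cdot)$, we obtain
$$-P_n \log\left( \frac{\hat{g}_n + a_0^n g_0}{2} \right) \le \frac{-P_n \log \hat{g}_n - P_n \log a_0^n g_0}{2} \le -P_n \log a_0^n g_0 + \frac{\gamma_n}{2}$$
$$\Leftrightarrow \ -P_n \log\left( \frac{\hat{g}_n + a_0^n g_0}{2 a_0^n g_0} \right) - \frac{\gamma_n}{2} \le 0.$$
$\log(g/g')$ is unstable when $g$ is close to 0, while $\log\left( \frac{g + g'}{2g'} \right)$ is a slightly increasing function with respect to $g \ge 0$, its minimum is attained at $g = 0$, and $-\log(2) > -\infty$. Therefore,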
the above expression is easier to deal with than log(ĝn /g0 ). Note that this technique can
be found in van der Vaart and Wellner (1996) and van de Geer (2000).

We set $g' := \frac{a_0^n g_0 + \hat{g}_n}{2 a_0^n}$. Since $Q_n g' = Q_n g_0 = 1/a_0^n$,
$$-P_n \log\left( \frac{\hat{g}_n + a_0^n g_0}{2 a_0^n g_0} \right) - \frac{\gamma_n}{2} \le 0$$
$$\Rightarrow \ (Q_n - Q)(g' - g_0) - (P_n - P) \log\left( \frac{g'}{g_0} \right) - \frac{\gamma_n}{2}$$
$$\le -Q(g' - g_0) + P \log\left( \frac{g'}{g_0} \right)$$
$$\le 2P\left( \sqrt{\frac{g'}{g_0}} - 1 \right) - Q(g' - g_0) = Q\left( 2\sqrt{g' g_0} - 2 g_0 \right) - Q(g' - g_0)$$
$$= Q\left( 2\sqrt{g' g_0} - g' - g_0 \right) = -h_Q(g', g_0)^2. \qquad (20)$$

The Hellinger distance between $\hat{g}_n / a_0^n$ and $g_0$ has the following bound (see Lemma 4.2 in van de Geer, 2000):
$$\frac{1}{16} h_Q(\hat{g}_n / a_0^n, g_0) \le h_Q(g', g_0).$$
Thus it is sufficient to bound $|(Q_n - Q)(g' - g_0)|$ and $\left| (P_n - P) \log\left( \frac{g'}{g_0} \right) \right|$ from above.
From now on, we consider the case where the inequality (8) in Assumption 1.3 is satisfied. The proof for the setting of the inequality (9) can be carried out along the line of Nguyen et al. (2007). We will utilize the Bousquet bound (10) to bound $|(Q_n - Q)(g' - g_0)|$ and $\left| (P_n - P) \log\left( \frac{g'}{g_0} \right) \right|$. In the following, we prove the assertion in 4 steps. In the first and second steps, we derive upper bounds of $|(Q_n - Q)(g' - g_0)|$ and $\left| (P_n - P) \log\left( \frac{g'}{g_0} \right) \right|$, respectively. In the third step, we bound the ∞-norm of $\hat{g}_n$ which is needed to prove the convergence. Finally, we combine the results of Steps 1 to 3 and obtain the assertion. The following statements heavily rely on Koltchinskii (2006).

Step 1. Bounding |(Qn − Q)(g ′ − g0 )|.

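Let
$$\iota(g) := \frac{g + g_0}{2},$$
and
$$\mathcal{G}_n^M(\delta) := \{\iota(g) \mid g \in \mathcal{G}_n^M, \ Q(\iota(g) - g_0) - P \log(\iota(g)/g_0) \le \delta\} \cup \{g_0\}.$$
Let $\phi_n^M(\delta)$ be
$$\phi_n^M(\delta) := \left( (M + \eta_1)^{\gamma/2} \delta^{1-\gamma/2} / \sqrt{n} \right) \vee \left( (M + \eta_1) n^{-2/(2+\gamma)} \right) \vee \left( \delta / \sqrt{n} \right).$$
Then applying Lemma 2 to $\mathcal{F} = \{2(g - g_0)/(M + \eta_1) \mid g \in \mathcal{G}_n^M(\delta)\}$, we obtain that there is a constant $C$ that only depends on $K$ and $\gamma$ such that
$$E_Q\left[ \sup_{g \in \mathcal{G}_n^M,\ \|g - g_0\|_{Q,2} \le \delta} |(Q_n - Q)(g - g_0)| \right] \le C \phi_n^M(\delta), \qquad (21)$$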

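where $\|f\|_{Q,2} := \sqrt{Q f^2}$.
Next, we define the "diameter" of a set $\{g - g_0 \mid g \in \mathcal{G}_n^M(\delta)\}$ as
$$\tilde{D}^M(\delta) := \sup_{g \in \mathcal{G}_n^M(\delta)} \sqrt{Q(g - g_0)^2} = \sup_{g \in \mathcal{G}_n^M(\delta)} \|g - g_0\|_{Q,2}.$$

It is obvious that
$$\tilde{D}^M(\delta) \ge \sup_{g \in \mathcal{G}_n^M(\delta)} \sqrt{Q(g - g_0)^2 - (Q(g - g_0))^2}.$$

Note that for all $g \in \mathcal{G}_n^M(\delta)$,

$$Q(g - g_0)^2 = Q(\sqrt{g} - \sqrt{g_0})^2 (\sqrt{g} + \sqrt{g_0})^2 \le (M + 3\eta_1)\, Q(\sqrt{g} - \sqrt{g_0})^2 = (M + 3\eta_1)\, h_Q(g, g_0)^2.$$

Thus from the inequality (20), it follows that

$$\forall g \in \mathcal{G}_n^M(\delta), \quad \delta \ge Q(g - g_0) - P \log(g/g_0) \ge h_Q(g, g_0)^2 \ge \|g - g_0\|_{Q,2}^2 / (M + 3\eta_1),$$

which implies
$$\tilde{D}^M(\delta) \le \sqrt{(M + 3\eta_1)\delta} =: D^M(\delta).$$
So, by the inequality (21), we obtain
$$E_Q\left[ \sup_{g \in \mathcal{G}_n^M(\delta)} |(Q_n - Q)(g - g_0)| \right] \le C \phi_n^M(D^M(\delta)) \le C_M \left( \frac{\delta^{(1-\gamma/2)/2}}{\sqrt{n}} \vee n^{-2/(2+\gamma)} \vee \frac{\delta^{1/2}}{\sqrt{n}} \right),$$

where $C_M$ is a constant depending on $M$, $\gamma$, $\eta_1$, and $K$.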

Let $q > 1$ be an arbitrary constant. For some $\delta > 0$, let $\delta_j := q^j\delta$, where $j$ is an
integer, and let
\[
\mathcal{H}_\delta^M := \bigcup_{\delta_j \ge \delta} \left\{\frac{\delta}{\delta_j}(g - g_0) \;\middle|\; g \in \mathcal{G}_n^M(\delta_j)\right\}.
\]
Then, by Lemma 3, there exists $K_M$ for all $M > 1$ such that for
\[
U_{n,t}^M(\delta) := K_M\left[\phi_n^M(D^M(\delta)) + \sqrt{\frac{t}{n}}D^M(\delta) + \frac{t}{n}\right],
\]
and an event $E_{n,\delta}^M$ defined by
\[
E_{n,\delta}^M := \left\{\sup_{g \in \mathcal{H}_\delta^M} |(Q_n - Q)g| \le U_{n,t}^M(\delta)\right\},
\]
the following is satisfied:
\[
Q(E_{n,\delta}^M) \ge 1 - e^{-t}.
\]

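Step 2. Bounding $|(P_n - P)\log(g'/g_0)|$.

Along the same arguments as in Step 1, using the Lipschitz continuity of the function
$g \mapsto \log\left(\frac{g + g_0}{2g_0}\right)$ on the support of $P$, we also obtain a similar inequality for
\[
\tilde{\mathcal{H}}_{n,\delta}^M := \bigcup_{\delta_j \ge \delta} \left\{\frac{\delta}{\delta_j}\log\left(\frac{g}{g_0}\right) \;\middle|\; g \in \mathcal{G}_n^M(\delta_j)\right\},
\]
i.e., there exists a constant $\tilde{K}_M$ that depends on $K$, $M$, $\gamma$, $\eta_1$, and $\eta_0$ such that
\[
P(\tilde{E}_{n,\delta}^M) \ge 1 - e^{-t},
\]
where $\tilde{E}_{n,\delta}^M$ is an event defined by
\[
\tilde{E}_{n,\delta}^M := \left\{\sup_{f \in \tilde{\mathcal{H}}_{n,\delta}^M} |(P_n - P)f| \le \tilde{U}_{n,t}^M(\delta)\right\},
\]
and
\[
\tilde{U}_{n,t}^M(\delta) := \tilde{K}_M\left[\phi_n^M(D^M(\delta)) + \sqrt{\frac{t}{n}}D^M(\delta) + \frac{t}{n}\right].
\]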

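Step 3. Bounding the ∞-norm of $\hat{g}_n/a_0^n$.

We can show that all elements of $\hat{\mathcal{G}}_n$ are uniformly bounded from above with high
probability. Let
\[
S_n := \left\{\inf_{\varphi \in \mathcal{F}_n} Q_n\varphi \ge \epsilon_0/2\right\} \cap \{3/4 < a_0^n < 5/4\}.
\]
Then by Lemma 4, we can take a sufficiently large $\bar{M}$ such that $g/a_0^n \in \mathcal{G}_n^{\bar{M}}$ ($\forall g \in \hat{\mathcal{G}}_n$) on
the event $S_n$ and $Q(S_n) \to 1$.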

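Step 4. Combining Steps 1, 2, and 3.

We consider an event
\[
E_n := E_{n,\delta}^{\bar{M}} \cap \tilde{E}_{n,\delta}^{\bar{M}} \cap S_n.
\]
On the event $E_n$, $\hat{g}_n \in \mathcal{G}_n^{\bar{M}}$. For $\psi : \mathbb{R}_+ \to \mathbb{R}_+$, we define the #-transform and the
♭-transform as follows (Koltchinskii, 2006):
\[
\psi^\flat(\delta) := \sup_{\sigma \ge \delta} \frac{\psi(\sigma)}{\sigma}, \qquad \psi^\#(\epsilon) := \inf\{\delta > 0 \mid \psi^\flat(\delta) \le \epsilon\}.
\]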
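To make the two transforms concrete, here is a minimal numerical sketch that evaluates them on a finite grid. The grid, the sample size n = 1000, and the test function ψ(δ) = δ^{1/2}/√n are illustrative assumptions; restricting the supremum and infimum to a grid is only an approximation of the definitions above.

```python
import numpy as np

def flat_transform(psi_vals, grid):
    """psi^flat(delta) = sup_{sigma >= delta} psi(sigma)/sigma, restricted to the grid."""
    ratio = psi_vals / grid
    # Running maximum taken from the right end of the (increasing) grid.
    return np.maximum.accumulate(ratio[::-1])[::-1]

def sharp_transform(psi_vals, grid, eps):
    """psi^sharp(eps) = inf{delta > 0 : psi^flat(delta) <= eps}, restricted to the grid."""
    flat = flat_transform(psi_vals, grid)
    idx = np.flatnonzero(flat <= eps)
    return float(grid[idx[0]]) if idx.size else float("inf")

n = 1000
grid = np.geomspace(1e-8, 1e2, 20001)   # increasing grid of candidate delta values
psi = np.sqrt(grid) / np.sqrt(n)        # psi(delta) = delta^{1/2} / sqrt(n)

eps = 0.25
approx = sharp_transform(psi, grid, eps)
exact = 1.0 / (n * eps ** 2)            # solves delta^{1/2} / (sqrt(n) delta) = eps
print(approx, exact)
```

Because ψ(σ)/σ is decreasing for this choice of ψ, the ♭-transform coincides with ψ(δ)/δ and the #-transform is the root of ψ(δ)/δ = ε, which the grid value approaches from above.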
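Here we set
\[
\begin{aligned}
\delta_n^M(t) &:= (U_{n,t}^M)^\#(1/(4q)), & V_{n,t}^M(\delta) &:= (U_{n,t}^M)^\flat(\delta), \\
\tilde{\delta}_n^M(t) &:= (\tilde{U}_{n,t}^M)^\#(1/(4q)), & \tilde{V}_{n,t}^M(\delta) &:= (\tilde{U}_{n,t}^M)^\flat(\delta).
\end{aligned}
\]
Then on the event $E_n$,
\[
\sup_{g \in \mathcal{G}_n^{\bar{M}}(\delta_j)} |(Q_n - Q)(g - g_0)| \le \frac{\delta_j}{\delta}U_{n,t}^{\bar{M}}(\delta) \le \delta_j V_{n,t}^{\bar{M}}(\delta), \tag{22}
\]
\[
\sup_{g \in \mathcal{G}_n^{\bar{M}}(\delta_j)} \left|(P_n - P)\log\left(\frac{g}{g_0}\right)\right| \le \frac{\delta_j}{\delta}\tilde{U}_{n,t}^{\bar{M}}(\delta) \le \delta_j\tilde{V}_{n,t}^{\bar{M}}(\delta). \tag{23}
\]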

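Take arbitrary $j$ and $\delta$ such that
\[
\delta_j \ge \delta \ge \delta_n^{\bar{M}}(t) \vee \tilde{\delta}_n^{\bar{M}}(t) \vee 2q\gamma_n.
\]
Let
\[
\mathcal{G}_n^{\bar{M}}(a, b) := \mathcal{G}_n^{\bar{M}}(b) \setminus \mathcal{G}_n^{\bar{M}}(a) \quad (a < b).
\]
Here, we assume $\iota(\hat{g}_n/a_0^n) \in \mathcal{G}_n^{\bar{M}}(\delta_{j-1}, \delta_j)$ and derive a contradiction. In these
settings, for $g' := \iota(\hat{g}_n/a_0^n)$,
\[
\begin{aligned}
\delta_{j-1} &\le \left|Q(g' - g_0) - P\log\frac{g'}{g_0}\right| \le |(Q_n - Q)(g' - g_0)| + \left|(P_n - P)\log\frac{g'}{g_0}\right| + \frac{\gamma_n}{2} \\
&\le \delta_j V_{n,t}^{\bar{M}}(\delta) + \delta_j\tilde{V}_{n,t}^{\bar{M}}(\delta) + \frac{\gamma_n}{2},
\end{aligned}
\]
which implies
\[
\frac{3}{4q} \le \frac{1}{q} - \frac{\gamma_n}{2\delta_j} \le V_{n,t}^{\bar{M}}(\delta) + \tilde{V}_{n,t}^{\bar{M}}(\delta). \tag{24}
\]
So, either $V_{n,t}^{\bar{M}}(\delta)$ or $\tilde{V}_{n,t}^{\bar{M}}(\delta)$ is at least $\frac{3}{8q}$. This contradicts the definition of the
#-transform, since $\delta \ge \delta_n^{\bar{M}}(t) \vee \tilde{\delta}_n^{\bar{M}}(t)$ forces both quantities to be at most $\frac{1}{4q}$.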
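We can show that $\delta_n^{\bar{M}}(t) \vee \tilde{\delta}_n^{\bar{M}}(t) = O(n^{-2/(2+\gamma)}t)$. To see this, for some $s > 0$, set
\[
\hat{\delta}_1 = \left(\frac{\delta^{(1-\gamma/2)/2}}{\sqrt{n}}\right)^\#(s), \quad
\hat{\delta}_2 = \left(n^{-2/(2+\gamma)}\right)^\#(s), \quad
\hat{\delta}_3 = \left(\frac{\delta^{1/2}}{\sqrt{n}}\right)^\#(s),
\]
\[
\hat{\delta}_4 = \left(\sqrt{\frac{\delta t}{n}}\right)^\#(s), \quad
\hat{\delta}_5 = \left(\frac{t}{n}\right)^\#(s),
\]
where all the #-transforms are taken with respect to $\delta$. Then they satisfy
\[
s = \frac{\hat{\delta}_1^{(1-\gamma/2)/2}/\sqrt{n}}{\hat{\delta}_1}, \quad
s = \frac{n^{-2/(2+\gamma)}}{\hat{\delta}_2}, \quad
s = \frac{\hat{\delta}_3^{1/2}/\sqrt{n}}{\hat{\delta}_3}, \quad
s = \frac{\sqrt{\hat{\delta}_4 t/n}}{\hat{\delta}_4}, \quad
s = \frac{t/n}{\hat{\delta}_5}.
\]
Thus, by using some constants $c_1, \ldots, c_5$, we obtain
\[
\hat{\delta}_1 = c_1 n^{-2/(2+\gamma)}, \quad \hat{\delta}_2 = c_2 n^{-2/(2+\gamma)}, \quad \hat{\delta}_3 = c_3 n^{-1}, \quad \hat{\delta}_4 = c_4 t/n, \quad \hat{\delta}_5 = c_5 t/n.
\]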
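The five rates can be confirmed by solving each fixed-point equation in closed form and substituting back. The sketch below does this for illustrative values of n, t, γ, and s; these values and the function names are assumptions made for the check only.

```python
import math

def sharp_solutions(n, t, gamma, s):
    """Closed-form roots of the five equations s = psi_k(delta)/delta from the text."""
    rate = n ** (-2.0 / (2.0 + gamma))
    d1 = s ** (-4.0 / (2.0 + gamma)) * rate   # c1 = s^{-4/(2+gamma)}
    d2 = rate / s                             # c2 = 1/s
    d3 = 1.0 / (s ** 2 * n)                   # c3 = 1/s^2
    d4 = t / (n * s ** 2)                     # c4 = 1/s^2
    d5 = t / (n * s)                          # c5 = 1/s
    return d1, d2, d3, d4, d5

def residuals(n, t, gamma, s):
    """Differences between each psi_k(delta_k)/delta_k and s; all should vanish."""
    d1, d2, d3, d4, d5 = sharp_solutions(n, t, gamma, s)
    return [
        d1 ** ((1.0 - gamma / 2.0) / 2.0) / (math.sqrt(n) * d1) - s,
        n ** (-2.0 / (2.0 + gamma)) / d2 - s,
        math.sqrt(d3) / (math.sqrt(n) * d3) - s,
        math.sqrt(d4 * t / n) / d4 - s,
        (t / n) / d5 - s,
    ]

res = residuals(n=10_000, t=3.0, gamma=0.5, s=0.1)
print(res)
```

For t ≥ 1 and 0 < γ < 2, each of the five roots is at most a constant multiple of n^{-2/(2+γ)} t, which is the order claimed in the text.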

Following the line of Koltchinskii (2006), for $\epsilon = \epsilon_1 + \cdots + \epsilon_m$, we have
\[
(\psi_1 + \cdots + \psi_m)^\#(\epsilon) \le \psi_1^\#(\epsilon_1) \vee \cdots \vee \psi_m^\#(\epsilon_m).
\]
Thus we obtain $\delta_n^{\bar{M}}(t) \vee \tilde{\delta}_n^{\bar{M}}(t) = O(n^{-2/(2+\gamma)}t)$.

The above argument results in
\[
\frac{1}{16}h_Q(\hat{g}_n/a_0^n, g_0) \le h_Q(g', g_0) = O_p(n^{-1/(2+\gamma)} + \sqrt{\gamma_n}).
\]

In the following, we show lemmas used in the proof of Theorem 1. We use the same
notations as those in the proof of Theorem 1.

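Lemma 2 Consider a class $\mathcal{F}$ of functions such that $-1 \le f \le 1$ for all $f \in \mathcal{F}$ and
$\sup_{\tilde{Q}} \log N(\epsilon, \mathcal{F}, L_2(\tilde{Q})) \le \frac{T}{\epsilon^\gamma}$, where the supremum is taken over all finitely discrete
probability measures. Then there is a constant $C_{T,\gamma}$ depending on $\gamma$ and $T$ such that for
$\delta^2 = \sup_{f \in \mathcal{F}} Qf^2$,
\[
E[\|Q_n - Q\|_{\mathcal{F}}] \le C_{T,\gamma}\left[(n^{-2/(2+\gamma)}) \vee (\delta^{1-\gamma/2}/\sqrt{n}) \vee (\delta/\sqrt{n})\right]. \tag{25}
\]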

Proof
This lemma can be shown along a similar line to Mendelson (2002), but we shall pay
attention to the point that $\mathcal{F}$ may not contain the constant function 0. Let $(\epsilon_i)_{1 \le i \le n}$ be
i.i.d. Rademacher random variables, i.e., $P(\epsilon_i = 1) = P(\epsilon_i = -1) = 1/2$, and let $R_n(\mathcal{F})$ be the
Rademacher complexity of $\mathcal{F}$ defined as
\[
R_n(\mathcal{F}) = E_Q E_\epsilon \sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i^{tr})\right|.
\]
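For intuition, the inner expectation over the Rademacher signs can be approximated by Monte Carlo when the class is finite. The sketch below is an illustration under assumptions that are not in the paper: the class consists of five sine functions, the inputs are uniform draws, and only the conditional (empirical) quantity for one fixed sample is estimated, not the outer expectation over Q.

```python
import numpy as np

def empirical_rademacher(f_values, n_draws=1000, seed=0):
    """Monte Carlo estimate of E_eps sup_f |(1/n) sum_i eps_i f(x_i)|.

    f_values has shape (num_functions, n) and holds f(x_i) for each function f.
    """
    rng = np.random.default_rng(seed)
    n = f_values.shape[1]
    eps = rng.choice([-1.0, 1.0], size=(n_draws, n))      # i.i.d. Rademacher signs
    sup_abs = np.abs(eps @ f_values.T).max(axis=1) / n    # sup over the finite class
    return float(sup_abs.mean())

def sine_class(x, k_max=5):
    """Hypothetical finite class {x -> sin(kx) : k = 1..k_max}, bounded in [-1, 1]."""
    return np.stack([np.sin(k * x) for k in range(1, k_max + 1)])

rng = np.random.default_rng(42)
r_small = empirical_rademacher(sine_class(rng.uniform(-3.0, 3.0, size=50)))
r_large = empirical_rademacher(sine_class(rng.uniform(-3.0, 3.0, size=2000)))
r_const = empirical_rademacher(np.ones((1, 100)))         # single constant function
print(r_small, r_large, r_const)
```

For the single constant function the quantity reduces to E|n^{-1} Σ ε_i|, which is close to √(2/(πn)) for moderate n; this gives a quick plausibility check on the estimator.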

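Then by Talagrand (1994),
\[
E_Q \sup_{f \in \mathcal{F}} \|Q_n f^2\| \le \sup_{f \in \mathcal{F}} Qf^2 + 8R_n(\mathcal{F}). \tag{26}
\]
Set $\hat{\delta}^2 = \sup_{f \in \mathcal{F}} Q_n f^2$. Then noticing that $\log N(\epsilon, \mathcal{F} \cup \{0\}, L_2(Q_n)) \le \frac{T}{\epsilon^\gamma} + 1$, it can be
shown that there is a universal constant $C$ such that
\[
\begin{aligned}
\frac{1}{n}E_\epsilon \sup_{f \in \mathcal{F}} \sum_{i=1}^n \epsilon_i f(x_i^{tr}) &\le \frac{C}{\sqrt{n}}\int_0^{\hat{\delta}} \sqrt{1 + \log N(\epsilon, \mathcal{F}, L_2(Q_n))}\,d\epsilon \\
&\le \frac{C}{\sqrt{n}}\left(\frac{\sqrt{T}}{1 - \gamma/2}\hat{\delta}^{1-\gamma/2} + \hat{\delta}\right).
\end{aligned}
\tag{27}
\]
See van der Vaart and Wellner (1996) for details. Taking the expectation with respect to $Q$
and employing Jensen's inequality and (26), we obtain
\[
R_n(\mathcal{F}) \le \frac{C_{T,\gamma}}{\sqrt{n}}\left[\left(\delta^2 + R_n(\mathcal{F})\right)^{(1-\gamma/2)/2} + \left(\delta^2 + R_n(\mathcal{F})\right)^{1/2}\right],
\]
where $C_{T,\gamma}$ is a constant depending on $T$ and $\gamma$. Thus we have
\[
R_n(\mathcal{F}) \le C_{T,\gamma}\left[(n^{-2/(2+\gamma)}) \vee (\delta^{1-\gamma/2}/\sqrt{n}) \vee (\delta/\sqrt{n})\right]. \tag{28}
\]
By the symmetrization argument (van der Vaart and Wellner, 1996), we have
\[
E\left[\sup_{f \in \mathcal{F}} |(Q_n - Q)f|\right] \le 2R_n(\mathcal{F}). \tag{29}
\]
Combining (28) and (29), we obtain the assertion.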

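Lemma 3 For all $M > 1$, there exists $K_M$ depending on $\gamma$, $\eta_1$, $q$, and $K$ such that
\[
Q\left(\sup_{g \in \mathcal{H}_\delta^M} |(Q_n - Q)g| \ge K_M\left[\phi_n^M(D^M(\delta)) + \sqrt{\frac{t}{n}}D^M(\delta) + \frac{t}{n}\right]\right) \le e^{-t}.
\]
Proof
Since $\phi_n^M(D^M(\delta))/\delta$ and $D^M(\delta)/\delta$ are monotone decreasing, we have
\[
\begin{aligned}
E\left[\sup_{f \in \mathcal{H}_\delta^M} |(Q_n - Q)f|\right]
&\le \sum_{\delta_j \ge \delta} \frac{\delta}{\delta_j}E\left[\sup_{g \in \mathcal{G}_n^M(\delta_j)} |(Q_n - Q)(g - g_0)|\right] \\
&\le \sum_{\delta_j \ge \delta} \frac{\delta}{\delta_j}C\phi_n^M(D^M(\delta_j)) \le \sum_{\delta_j \ge \delta} \frac{\delta}{\delta_j^{1-\gamma'}}C\frac{\phi_n^M(D^M(\delta_j))}{\delta_j^{\gamma'}} \\
&\le \sum_{\delta_j \ge \delta} \frac{\delta}{\delta_j^{1-\gamma'}}C\frac{\phi_n^M(D^M(\delta))}{\delta^{\gamma'}} = C\phi_n^M(D^M(\delta))\sum_{\delta_j \ge \delta} \frac{\delta^{1-\gamma'}}{\delta_j^{1-\gamma'}} \\
&\le C\phi_n^M(D^M(\delta))\sum_{j \ge 0} q^{-j(1-\gamma')} = c_{\gamma,q}\phi_n^M(D^M(\delta)),
\end{aligned}
\tag{30}
\]
where $c_{\gamma,q}$ is a constant that depends on $\gamma$, $K$, and $q$, and
\[
\begin{aligned}
\sup_{f \in \mathcal{H}_\delta^M} \sqrt{Qf^2} &\le \sup_{\delta_j \ge \delta} \frac{\delta}{\delta_j}\sup_{g \in \mathcal{G}_n^M(\delta_j)} \sqrt{Q(g - g_0)^2} \\
&\le \delta\sup_{\delta_j \ge \delta} \frac{D^M(\delta_j)}{\delta_j} \le \delta\frac{D^M(\delta)}{\delta} = D^M(\delta).
\end{aligned}
\tag{31}
\]
Using the Bousquet bound, we obtain
\[
Q\left(\sup_{g \in \mathcal{H}_\delta^M} |(Q_n - Q)g|/M \ge C\left[c_{\gamma,q}\frac{\phi_n^M(D^M(\delta))}{M} + \sqrt{\frac{t}{n}}\frac{D^M(\delta)}{M} + \frac{t}{n}\right]\right) \le e^{-t},
\]
where $C$ is some universal constant. Thus, there exists $K_M$ for all $M > 1$ such that
\[
Q\left(\sup_{g \in \mathcal{H}_\delta^M} |(Q_n - Q)g| \ge K_M\left[\phi_n^M(D^M(\delta)) + \sqrt{\frac{t}{n}}D^M(\delta) + \frac{t}{n}\right]\right) \le e^{-t}.
\]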

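Lemma 4 For an event $S_n := \{\inf_{\varphi \in \mathcal{F}_n} Q_n\varphi \ge \epsilon_0/2\} \cap \{3/4 < a_0^n < 5/4\}$, we have
\[
Q(S_n) \to 1.
\]
Moreover, there exists a sufficiently large $\bar{M} > 0$ such that $g/a_0^n \in \mathcal{G}_n^{\bar{M}}$ ($\forall g \in \hat{\mathcal{G}}_n$) on the
event $S_n$.

Proof
It is obvious that
\[
(Q_n - Q)g_0 = O_p\left(\frac{1}{\sqrt{n}}\right).
\]
Thus, because of $Qg_0 = 1$,
\[
a_0^n = 1 + O_p\left(\frac{1}{\sqrt{n}}\right).
\]
Moreover, Assumption 1.3 implies
\[
\|Q_n - Q\|_{\mathcal{F}_n} = O_p\left(\frac{1}{\sqrt{n}}\right).
\]
Thus,
\[
\inf_{\varphi \in \mathcal{F}_n} Q_n\varphi \ge \epsilon_0 - O_p(1/\sqrt{n}),
\]
implying
\[
Q(\bar{S}_n) \to 1 \quad \text{for} \quad \bar{S}_n := \left\{\inf_{\varphi \in \mathcal{F}_n} Q_n\varphi \ge \epsilon_0/2\right\}.
\]
On the event $S_n$, all the elements of $\hat{\mathcal{G}}_n$ are uniformly bounded from above:
\[
1 = Q_n\left(\sum_l \alpha_l\varphi_l\right) = \sum_l \alpha_l Q_n(\varphi_l) \ge \sum_l \alpha_l\epsilon_0/2
\quad\Rightarrow\quad \sum_l \alpha_l \le 2/\epsilon_0.
\]
Set $\tilde{M} = 2\xi_0/\epsilon_0$; then on the event $S_n$, $\hat{\mathcal{G}}_n \subset \mathcal{G}_n^{\tilde{M}}$ is always satisfied. Since $a_0^n$ is bounded
from above and below on the event $S_n$, we can take a sufficiently large $\bar{M} > \tilde{M}$ such that
$g/a_0^n \in \mathcal{G}_n^{\bar{M}}$ ($\forall g \in \hat{\mathcal{G}}_n$).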
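The first step of the proof, namely that the normalizing factor (taken here as the reciprocal of Q_n g0, in line with the relation Q_n g0 = 1/a_0^n used in the proof of Theorem 1) deviates from 1 at the rate 1/√n, can be illustrated by simulation. The sketch below assumes hypothetical one-dimensional Gaussian training and test densities, N(0, 1) and N(0.5, 1); these densities, the sample sizes, and the number of repetitions are illustrative choices only.

```python
import numpy as np

def importance(x, shift=0.5):
    """g0(x) = N(x; shift, 1) / N(x; 0, 1) for the assumed Gaussian densities."""
    return np.exp(shift * x - 0.5 * shift ** 2)

def normalizer(n, rng, shift=0.5):
    """Draw n training inputs from Q = N(0, 1) and return 1 / (Q_n g0)."""
    x_tr = rng.standard_normal(n)
    return 1.0 / importance(x_tr, shift).mean()

rng = np.random.default_rng(0)
scaled_std = {}
in_event = {}
for n in (100, 1000, 10_000, 100_000):
    a = np.array([normalizer(n, rng) for _ in range(200)])
    scaled_std[n] = float(np.sqrt(n) * (a - 1.0).std())      # should stay of order 1
    in_event[n] = float(np.mean((a > 0.75) & (a < 1.25)))    # frequency of {3/4 < a < 5/4}
    print(n, scaled_std[n], in_event[n])
```

Under these assumptions the variance of g0 under Q equals e^{0.25} − 1, so the rescaled standard deviation should settle near 0.53, while the frequency of the event {3/4 < a < 5/4} should approach 1.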

A.2 Proof of Theorem 2

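The proof is a version of Theorem 10.13 in van de Geer (2000). We set $g' := \frac{g_n^* + \hat{g}_n}{2}$. Since
$Q_n g' = Q_n\hat{g}_n = 1$,
\[
\begin{aligned}
& -P_n\log\left(\frac{\hat{g}_n + g_n^*}{2g_n^*}\right) \le 0 \\
\Rightarrow\ \delta_n &:= (Q_n - Q)(g' - g_n^*) - (P_n - P)\log\left(\frac{g'}{g_n^*}\right) \le 2P\left(\sqrt{\frac{g'}{g_n^*}} - 1\right) - Q(g' - g_n^*) \\
&= 2P\left[\left(1 - \frac{g_n^*}{g_0}\right)\left(\sqrt{\frac{g'}{g_n^*}} - 1\right)\right] + \left[2P\left(\frac{g_n^*}{g_0}\left(\sqrt{\frac{g'}{g_n^*}} - 1\right)\right) - Q(g' - g_n^*)\right] \\
&= 2Q\left[\left(\sqrt{g_0} - \sqrt{g_n^*}\right)\left(\sqrt{\frac{g_0}{g_n^*}} + 1\right)\left(\sqrt{g'} - \sqrt{g_n^*}\right)\right] - h_Q(g', g_n^*)^2 \\
&\le (1 + c_0)h_Q(g_0, g_n^*)h_Q(g', g_n^*) - h_Q(g', g_n^*)^2.
\end{aligned}
\tag{32}
\]
If $(1 + c_0)h_Q(g_0, g_n^*)h_Q(g', g_n^*) \ge |\delta_n|$, the assertion immediately follows. Otherwise we can
apply the same arguments as in the proof of Theorem 1, replacing $g_0$ with $g_n^*$.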

B Proofs of Lemma 1, Theorem 3 and Theorem 4

B.1 Proof of Lemma 1
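First we prove the consistency of $\check{\alpha}_n$. Note that for $g' = \frac{g_* + \hat{g}_n/a_*^n}{2}$,
\[
P\log\left(\frac{g'}{Q(g')g_*}\right) \le 0, \qquad -P_n\log\left(\frac{g'}{g_*}\right) \le 0.
\]
Thus, we have
\[
-\log Qg' - (P_n - P)\log\left(\frac{g'}{g_*}\right) \le P\log\left(\frac{g'}{Q(g')g_*}\right) \le 0. \tag{33}
\]
In a finite dimensional situation, the inequality (8) is satisfied with arbitrary $\gamma > 0$; see
Lemma 2.6.15 in van der Vaart and Wellner (1996). Thus, we can show that the left-hand
side of (33) converges to 0 in probability in a similar way to the proof of Theorem 1. This
and
\[
\nabla\nabla^T P\log\left(\frac{\alpha^T\varphi + g_*}{2g_*}\right)\bigg|_{\alpha = \alpha_*} = -I_0/4 \prec O
\]
give $\hat{\alpha}_n \xrightarrow{p} \alpha_*$.
Next we prove $\sqrt{n}$-consistency. By the KKT condition, we have
\[
\nabla P_n\psi(\hat{\alpha}_n) - \hat{\lambda} + \hat{s}(Q_n\varphi) = 0, \quad \hat{\lambda}^T\hat{\alpha}_n = 0, \quad \hat{\lambda} \le 0, \tag{34}
\]
\[
\nabla P\psi(\alpha_*) - \lambda_* + s_*(Q\varphi) = 0, \quad \lambda_*^T\alpha_* = 0, \quad \lambda_* \le 0, \tag{35}
\]
with the Lagrange multipliers $\hat{\lambda}, \lambda_* \in \mathbb{R}^b$ and $\hat{s}, s_* \in \mathbb{R}$ (note that KLIEP "maximizes"
$P_n\psi(\alpha)$, thus $\hat{\lambda} \le 0$). Noticing that $\nabla\psi(\alpha) = \frac{\varphi}{\alpha^T\varphi}$, we obtain
\[
\hat{\alpha}_n^T\nabla P_n\psi(\hat{\alpha}_n) + \hat{s}(Q_n\hat{\alpha}_n^T\varphi) = 1 + \hat{s} = 0. \tag{36}
\]
Thus we have $\hat{s} = -1$. Similarly we obtain $s_* = -1$. This gives
\[
\hat{\lambda} = \nabla P_n\psi(\hat{\alpha}_n) - Q_n\varphi, \qquad \lambda_* = \nabla P\psi(\alpha_*) - Q\varphi. \tag{37}
\]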
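The algebra behind (36) and (37) can be checked on a toy problem. The sketch below is not the authors' implementation: it assumes Gaussian kernel basis functions centered at a few test points, one-dimensional Gaussian data, and a simple gradient-ascent loop with a heuristic feasibility correction. All names, the kernel width, the step size, and the iteration count are hypothetical. It then evaluates the multiplier from (37) at the returned parameter.

```python
import numpy as np

def gaussian_design(x, centers, sigma):
    """Matrix with entries exp(-(x_i - c_l)^2 / (2 sigma^2)) for 1-D inputs."""
    d = x[:, None] - centers[None, :]
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def kliep_sketch(x_tr, x_te, centers, sigma=1.0, step=1e-3, n_iter=2000):
    """Approximately maximize P_n log(alpha^T phi) s.t. Q_n alpha^T phi = 1, alpha >= 0."""
    A = gaussian_design(x_te, centers, sigma)                # phi evaluated on test inputs
    b = gaussian_design(x_tr, centers, sigma).mean(axis=0)   # Q_n phi
    alpha = np.ones(len(centers))
    alpha /= b @ alpha
    for _ in range(n_iter):
        alpha = alpha + step * (A.T @ (1.0 / (A @ alpha))) / len(x_te)  # ascent step
        alpha = alpha + (1.0 - b @ alpha) * b / (b @ b)      # move back to the hyperplane
        alpha = np.maximum(alpha, 0.0)                       # enforce non-negativity
        alpha = alpha / (b @ alpha)                          # renormalize: Q_n alpha^T phi = 1
    return alpha, A, b

rng = np.random.default_rng(0)
x_tr = rng.standard_normal(200)            # assumed training inputs, N(0, 1)
x_te = 0.5 + rng.standard_normal(200)      # assumed test inputs, N(0.5, 1)
alpha, A, b = kliep_sketch(x_tr, x_te, centers=x_te[:20])

grad = (A / (A @ alpha)[:, None]).mean(axis=0)   # grad of P_n psi with psi = log(alpha^T phi)
lam = grad - b                                   # multiplier as in (37), using s = -1
print(alpha @ grad, b @ alpha, alpha @ lam)
```

The identity α^T ∇P_n ψ(α) = 1 holds for every α with α^T φ > 0, not only at the optimum, which is why (36) pins down ŝ = −1; the sign condition on the multiplier in (34), by contrast, is only expected to hold at an actual maximizer.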
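Therefore, $\hat{\alpha}_n \xrightarrow{p} \alpha_*$ and $g_* \ge \eta_3 > 0$ give
\[
\hat{\lambda} \xrightarrow{p} \lambda_*.
\]
Thus the probability of $\{i \mid \hat{\lambda}_i < 0\} \supseteq \{i \mid \lambda_{*,i} < 0\}$ goes to 1 ($\hat{\lambda}_i$ and $\lambda_{*,i}$ mean the $i$-th
elements of $\hat{\lambda}$ and $\lambda_*$, respectively). Recalling the complementary condition $\hat{\lambda}^T\hat{\alpha}_n = 0$, the
probability of $\{i \mid \hat{\alpha}_{n,i} = 0\} \supseteq \{i \mid \lambda_{*,i} < 0\}$ goes to 1. Again by the complementary
condition $\lambda_*^T\alpha_* = 0$, the probability of
\[
(\check{\alpha}_n - \alpha_*)^T\lambda_* = 0
\]
goes to 1. In particular
\[
(\check{\alpha}_n - \alpha_*)^T\lambda_* = o_p(1/n).
\]
Set $Z_n' := \sqrt{n}\left(\nabla P_n\psi(\alpha_*) - Q_n\varphi - (\nabla P\psi(\alpha_*) - Q\varphi)\right)$. By the optimality and
consistency of $\check{\alpha}_n$, we obtain
\[
\begin{aligned}
0 &\le P_n\psi(\check{\alpha}_n) - P_n\psi(\alpha_*) \\
&= (\check{\alpha}_n - \alpha_*)^T\nabla P_n\psi(\alpha_*) - \frac{1}{2}(\check{\alpha}_n - \alpha_*)^T I_0(\check{\alpha}_n - \alpha_*) + o_p\left(\|\check{\alpha}_n - \alpha_*\|^2\right) \\
&= (\check{\alpha}_n - \alpha_*)^T\left(\lambda_* + \frac{Z_n'}{\sqrt{n}}\right) - \frac{1}{2}(\check{\alpha}_n - \alpha_*)^T I_0(\check{\alpha}_n - \alpha_*) + o_p\left(\|\check{\alpha}_n - \alpha_*\|^2\right) \\
&= (\check{\alpha}_n - \alpha_*)^T\frac{Z_n'}{\sqrt{n}} - \frac{1}{2}(\check{\alpha}_n - \alpha_*)^T I_0(\check{\alpha}_n - \alpha_*) + o_p\left(\|\check{\alpha}_n - \alpha_*\|^2 + 1/n\right)
\end{aligned}
\tag{38}
\]
because $\nabla\nabla^T P_n\psi(\alpha_*) = -I_0 + o_p(1)$ and $(\check{\alpha}_n - \alpha_*)^T\lambda_* = o_p(1/n)$. Thus noticing
$Z_n'/\sqrt{n} = O_p(1/\sqrt{n})$, we obtain the assertion.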

B.2 Proof of Theorem 3

The proof relies on Self and Liang (1987) and Fukumizu et al. (2004), but we shall pay
attention to the fact that the feasible parameter set stochastically behaves and the true
importance $g_0$ may not be contained in the model. Set
\[
Z_n := \sqrt{n}I_0^{-1}\left(\nabla P_n\psi(\alpha_*) - Q_n\varphi - (\nabla P\psi(\alpha_*) - Q\varphi)\right).
\]
By Lemma 1 and the inequality (38), we obtain
\[
\begin{aligned}
0 &\le (\check{\alpha}_n - \alpha_*)^T\nabla P_n\psi(\alpha_*) - \frac{1}{2}(\check{\alpha}_n - \alpha_*)^T I_0(\check{\alpha}_n - \alpha_*) + o_p(1/n) \\
&= -\frac{1}{2}\|\check{\alpha}_n - \alpha_* - Z_n/\sqrt{n}\|_0^2 + \frac{1}{2}\|Z_n/\sqrt{n}\|_0^2 + o_p(1/n).
\end{aligned}
\]
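The second line above is a completion of the square. The following sketch verifies the underlying algebraic identity numerically, under the assumption (not restated in this section) that the norm with subscript 0 is the one induced by I0, i.e. the squared norm of v is v^T I0 v. The matrix, the vectors, and the dimension are random stand-ins, not quantities from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
b, n = 4, 500

B = rng.standard_normal((b, b))
I0 = B @ B.T + b * np.eye(b)             # symmetric positive definite stand-in for I_0
z_prime = rng.standard_normal(b)         # stands in for sqrt(n) times the centered gradient
z = np.linalg.solve(I0, z_prime)         # stands in for Z_n = I_0^{-1} z_prime
d = rng.standard_normal(b) / np.sqrt(n)  # stands in for the parameter deviation

def norm0_sq(v):
    """Squared norm induced by I0 (assumed meaning of the subscript-0 norm)."""
    return float(v @ I0 @ v)

# Linear term minus half the quadratic term ...
lhs = float(d @ z_prime) / np.sqrt(n) - 0.5 * norm0_sq(d)
# ... equals the completed square.
rhs = -0.5 * norm0_sq(d - z / np.sqrt(n)) + 0.5 * norm0_sq(z / np.sqrt(n))
print(lhs, rhs)
```

In this form the right-hand side is largest when the deviation equals Z_n/√n, which motivates the function ρ introduced next.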

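We define
\[
\rho(\alpha) := \|\alpha - \alpha_* - Z_n/\sqrt{n}\|_0^2,
\]
\[
\tilde{\alpha}_n := \mathop{\arg\min}_{\alpha \in S_n,\ \lambda_*^T\alpha = 0} \rho(\alpha), \qquad \ddot{\alpha}_n := \mathop{\arg\min}_{\alpha \in S,\ \lambda_*^T\alpha = 0} \rho(\alpha).
\]
In the following, we show (Step 1) $\sqrt{n}(\check{\alpha}_n - \tilde{\alpha}_n) = o_p(1)$, (Step 2) $\sqrt{n}(\tilde{\alpha}_n - \ddot{\alpha}_n) = o_p(1)$,
and finally (Step 3) derive the asymptotic law of $\sqrt{n}(\ddot{\alpha}_n - \alpha_*)$, which simultaneously gives
the asymptotic law of $\sqrt{n}(\check{\alpha}_n - \alpha_*)$.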
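Step 1. Derivation of $\sqrt{n}(\check{\alpha}_n - \tilde{\alpha}_n) = o_p(1)$.
$\rho(\alpha_*) \ge \rho(\tilde{\alpha}_n)$ implies
\[
\|\tilde{\alpha}_n - \alpha_*\|_0 \le \|\tilde{\alpha}_n - \alpha_* - Z_n/\sqrt{n}\|_0 + \|Z_n/\sqrt{n}\|_0 \le 2\|Z_n/\sqrt{n}\|_0 = O_p(1/\sqrt{n}).
\]
As shown in the proof of Lemma 1, the probability of $\lambda_*^T\check{\alpha}_n = 0$ goes to 1. This and the
optimality of $\tilde{\alpha}_n$ give
\[
-\frac{1}{2}\rho(\tilde{\alpha}_n) \ge -\frac{1}{2}\rho(\check{\alpha}_n) - o_p(1/n). \tag{39}
\]
Due to the optimality of $\check{\alpha}_n$, and applying the Taylor expansion of the log-likelihood as in
(38) to $\tilde{\alpha}_n$ instead of $\check{\alpha}_n$, we have
\[
-\frac{1}{2}\rho(\tilde{\alpha}_n) \le -\frac{1}{2}\rho(\check{\alpha}_n) + o_p(1/n). \tag{40}
\]
The condition $\lambda_*^T\tilde{\alpha}_n = 0$ is needed to ensure this inequality. If this condition is not
satisfied, we cannot assure more than $\lambda_*^T(\tilde{\alpha}_n - \alpha_*) = O_p(1/\sqrt{n})$. Combining (39) and
(40), we obtain
\[
-o_p(1/n) \le \frac{1}{2}\left(\rho(\check{\alpha}_n) - \rho(\tilde{\alpha}_n)\right) \le o_p(1/n).
\]
By the optimality of $\tilde{\alpha}_n$ and the convexity of $S_n$, we obtain
\[
\begin{aligned}
\|\sqrt{n}(\check{\alpha}_n - \tilde{\alpha}_n)\|_0^2 &\le \|\sqrt{n}(\check{\alpha}_n - \alpha_*) - Z_n\|_0^2 - \|\sqrt{n}(\tilde{\alpha}_n - \alpha_*) - Z_n\|_0^2 \\
&= o_p(1).
\end{aligned}
\tag{41}
\]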

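Step 2. Derivation of $\sqrt{n}(\tilde{\alpha}_n - \ddot{\alpha}_n) = o_p(1)$.
In a similar way to the case of $\tilde{\alpha}_n$, we can show
\[
\ddot{\alpha}_n - \alpha_* = O_p(1/\sqrt{n}).
\]
Let $\tilde{\alpha}_n'$ and $\ddot{\alpha}_n'$ denote the projection of $\tilde{\alpha}_n$ to $S$ and $\ddot{\alpha}_n$ to $S_n$:
\[
\tilde{\alpha}_n' := \mathop{\arg\min}_{\alpha \in S,\ \lambda_*^T\alpha = 0} \|\tilde{\alpha}_n - \alpha\|_0, \qquad \ddot{\alpha}_n' := \mathop{\arg\min}_{\alpha \in S_n,\ \lambda_*^T\alpha = 0} \|\ddot{\alpha}_n - \alpha\|_0.
\]
Then
\[
\begin{aligned}
\|\sqrt{n}(\ddot{\alpha}_n - \alpha_*) - Z_n\|_0 &\ge \|\sqrt{n}(\ddot{\alpha}_n' - \alpha_*) - Z_n\|_0 - \|\sqrt{n}(\ddot{\alpha}_n' - \ddot{\alpha}_n)\|_0 \\
&\ge \|\sqrt{n}(\tilde{\alpha}_n - \alpha_*) - Z_n\|_0 - \|\sqrt{n}(\ddot{\alpha}_n' - \ddot{\alpha}_n)\|_0,
\end{aligned}
\]
and similarly
\[
\|\sqrt{n}(\tilde{\alpha}_n - \alpha_*) - Z_n\|_0 \ge \|\sqrt{n}(\ddot{\alpha}_n - \alpha_*) - Z_n\|_0 - \|\sqrt{n}(\tilde{\alpha}_n' - \tilde{\alpha}_n)\|_0.
\]
Thus
\[
\begin{aligned}
-\|\sqrt{n}(\tilde{\alpha}_n' - \tilde{\alpha}_n)\|_0 &\le \|\sqrt{n}(\tilde{\alpha}_n - \alpha_*) - Z_n\|_0 - \|\sqrt{n}(\ddot{\alpha}_n - \alpha_*) - Z_n\|_0 \\
&\le \|\sqrt{n}(\ddot{\alpha}_n' - \ddot{\alpha}_n)\|_0.
\end{aligned}
\]
So, if we can show
\[
\|\sqrt{n}(\tilde{\alpha}_n' - \tilde{\alpha}_n)\|_0 = o_p(1), \qquad \|\sqrt{n}(\ddot{\alpha}_n' - \ddot{\alpha}_n)\|_0 = o_p(1), \tag{42}
\]
then
\[
\begin{aligned}
\|\sqrt{n}(\ddot{\alpha}_n - \tilde{\alpha}_n)\|_0 &= \|\sqrt{n}(\ddot{\alpha}_n - \alpha_*) - \sqrt{n}(\tilde{\alpha}_n' - \alpha_*) + \sqrt{n}(\tilde{\alpha}_n' - \tilde{\alpha}_n)\|_0 \\
&\le \|\sqrt{n}(\ddot{\alpha}_n - \alpha_*) - \sqrt{n}(\tilde{\alpha}_n' - \alpha_*)\|_0 + \|\sqrt{n}(\tilde{\alpha}_n' - \tilde{\alpha}_n)\|_0 \\
&\le \sqrt{\|\sqrt{n}(\tilde{\alpha}_n' - \alpha_*) - Z_n\|_0^2 - \|\sqrt{n}(\ddot{\alpha}_n - \alpha_*) - Z_n\|_0^2} + o_p(1) \\
&\le \sqrt{o_p(1) + \|\sqrt{n}(\tilde{\alpha}_n - \alpha_*) - Z_n\|_0^2 - \|\sqrt{n}(\ddot{\alpha}_n - \alpha_*) - Z_n\|_0^2} + o_p(1) \\
&\le o_p(1).
\end{aligned}
\tag{43}
\]
Thus it is sufficient to prove (42).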

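Note that as $n \to \infty$, the probabilities of $\ddot{\alpha}_n \in \alpha_* + C$ and $\tilde{\alpha}_n \in \alpha_* + C_n$ tend to 1
because $\|\tilde{\alpha}_n - \alpha_*\|, \|\ddot{\alpha}_n - \alpha_*\| = o_p(1)$. Similar to $\mu_i$, we define $\hat{\mu}_i$ using $\hat{\nu}_i := Q_n\varphi_i$ instead
of $\nu_i$. It can be easily seen that
\[
\hat{\mu}_i \xrightarrow{p} \mu_i,
\]
and with high probability
\[
C_n = \left\{\sum_{i=1}^{b-1} \beta_i\hat{\mu}_i \;\middle|\; \beta_i \ge 0\ (i \le j),\ \beta_i \in \mathbb{R}\right\},
\]
where $j$ is the number satisfying $\alpha_{*,i} = 0$ ($i = 1, \ldots, j$) and $\alpha_{*,i} > 0$ ($i = j + 1, \ldots, b$).

As mentioned above, $\tilde{\alpha}_n - \alpha_* \in C_n$ and $\ddot{\alpha}_n - \alpha_* \in C$ with high probability. Thus,
$\tilde{\alpha}_n$ and $\ddot{\alpha}_n$ can be expressed as $\tilde{\alpha}_n - \alpha_* = \sum \tilde{\beta}_i\hat{\mu}_i$ and $\ddot{\alpha}_n - \alpha_* = \sum \ddot{\beta}_i\mu_i$. Moreover
$\tilde{\alpha}_n - \alpha_* = O_p(1/\sqrt{n})$ and $\ddot{\alpha}_n - \alpha_* = O_p(1/\sqrt{n})$ imply $\tilde{\beta}_i, \ddot{\beta}_i = O_p(1/\sqrt{n})$. Since
$\tilde{\alpha}_n, \ddot{\alpha}_n, \alpha_* \in \{\alpha \mid \lambda_*^T\alpha = 0\}$, $\tilde{\beta}_i = 0$ and $\ddot{\beta}_i = 0$ for all $i$ such that $\lambda_{*,i} \ne 0$. This gives
\[
\sum \tilde{\beta}_i\mu_i \in C \cap \{\delta \mid \lambda_*^T\delta = 0\}, \qquad \sum \ddot{\beta}_i\hat{\mu}_i \in C_n \cap \{\delta \mid \lambda_*^T\delta = 0\}.
\]
Thus, with high probability, the following is satisfied:
\[
\begin{aligned}
\sqrt{n}\|\tilde{\alpha}_n - \tilde{\alpha}_n'\|_0 &\le \sqrt{n}\left\|\sum \tilde{\beta}_i\hat{\mu}_i - \sum \tilde{\beta}_i\mu_i\right\|_0 \le \sqrt{n}\sum |\tilde{\beta}_i|\|\hat{\mu}_i - \mu_i\|_0 = o_p(1), \\
\sqrt{n}\|\ddot{\alpha}_n - \ddot{\alpha}_n'\|_0 &\le \sqrt{n}\left\|\sum \ddot{\beta}_i\mu_i - \sum \ddot{\beta}_i\hat{\mu}_i\right\|_0 \le \sqrt{n}\sum |\ddot{\beta}_i|\|\hat{\mu}_i - \mu_i\|_0 = o_p(1),
\end{aligned}
\]
which imply (42). Consequently (43) is obtained.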

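Step 3. Derivation of the asymptotic law of $\sqrt{n}(\check{\alpha}_n - \alpha_*)$.
By (41) and (43), we have obtained
\[
\sqrt{n}\|\check{\alpha}_n - \ddot{\alpha}_n\|_0 = o_p(1). \tag{44}
\]
By the central limit theorem,
\[
\sqrt{n}(\nabla P_n\psi(\alpha_*) - \nabla P\psi(\alpha_*)) \rightsquigarrow Z_1, \qquad \sqrt{n}(Q_n\varphi - Q\varphi) \rightsquigarrow Z_2.
\]
The independence of $Z_1$ and $Z_2$ follows from the independence of $P_n$ and $Q_n$. Thus by
the continuous mapping theorem, we have
\[
Z_n \rightsquigarrow I_0^{-1}(Z_1 + Z_2).
\]
A projection to a closed convex set is a continuous map. Thus, by the continuous mapping
theorem, it follows that
\[
\sqrt{n}(\ddot{\alpha}_n - \alpha_*) \rightsquigarrow \mathop{\arg\min}_{\delta \in C,\ \lambda_*^T\delta = 0} \|\delta - Z\|_0.
\]
By (44) and Slutsky's lemma,
\[
\sqrt{n}(\check{\alpha}_n - \alpha_*) \rightsquigarrow \mathop{\arg\min}_{\delta \in C,\ \lambda_*^T\delta = 0} \|\delta - Z\|_0.
\]
This concludes the proof.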
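The continuity of the projection used in Step 3 can be illustrated in a deliberately simplified setting. The sketch below assumes that I0 is the identity (so the norm is Euclidean), that the cone only constrains the first j coordinates to be non-negative, and that the equality constraint involving λ∗ is absent; the general case in the proof would require a quadratic program. It checks that the projection is non-expansive and shows that the projected Gaussian puts positive mass on the boundary of the cone.

```python
import numpy as np

def project_cone(z, j):
    """Euclidean projection onto {delta : delta_i >= 0 for i < j}; other coordinates are free."""
    out = np.array(z, dtype=float, copy=True)
    out[..., :j] = np.maximum(out[..., :j], 0.0)
    return out

rng = np.random.default_rng(0)
b, j = 5, 2
Z = rng.standard_normal((100_000, b))     # draws standing in for the Gaussian limit Z
proj = project_cone(Z, j)

# Non-expansiveness: ||proj(u) - proj(v)|| <= ||u - v|| for paired draws u, v.
u, v = Z[:50_000], Z[50_000:]
gap = np.linalg.norm(project_cone(u, j) - project_cone(v, j), axis=1)
dist = np.linalg.norm(u - v, axis=1)
nonexpansive = bool(np.all(gap <= dist + 1e-12))

# The limit is not Gaussian: a constrained coordinate sits exactly at 0 about half the time.
frac_boundary = float(np.mean(proj[:, 0] == 0.0))
print(nonexpansive, frac_boundary)
```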

B.3 Proof of Theorem 4

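Note that
\[
\sqrt{n}(\hat{\alpha}_n - \alpha_*) - \sqrt{n}(\check{\alpha}_n - \alpha_*) = \sqrt{n}(1 - 1/a_*^n)\hat{\alpha}_n.
\]
From the definition, $\sqrt{n}(1/a_*^n - 1) = \sqrt{n}(Q_n(g_*) - 1) \rightsquigarrow \alpha_*^T Z_2$. Now
$\alpha_*^T(I_0 - P(\varphi/g_*)P(\varphi^T/g_*))\alpha_* = 0$, which implies $\alpha_*^T Z_1 = 0$ (a.s.), thus $\alpha_*^T Z_2 = \alpha_*^T I_0 Z$ (a.s.).
Recalling $\hat{\alpha}_n \xrightarrow{p} \alpha_*$, we obtain the assertion by Slutsky's lemma and the continuous
mapping theorem.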

Acknowledgments
This work was supported by MEXT (17700142 and 18300057), the Okawa Foundation,
the Microsoft CORE3 Project, and the IBM Faculty Award.

References
P. Baldi and S. Brunak. Bioinformatics: The Machine Learning Approach. MIT Press,
Cambridge, 1998.

P. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of
Statistics, 33:1487–1537, 2005.

S. Bickel, M. Brückner, and T. Scheffer. Discriminative learning for differing training
and test distributions. In Proceedings of the 24th International Conference on Machine
Learning, 2007.

S. Bickel and T. Scheffer. Dirichlet-enhanced spam filtering based on biased samples.
In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information
Processing Systems 19. MIT Press, Cambridge, MA, 2007.

K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola.
Integrating structured biological data by kernel maximum mean discrepancy.
Bioinformatics, 22(14):e49–e57, 2006.

O. Bousquet. A Bennett concentration inequality and its application to suprema of
empirical process. C. R. Acad. Sci. Paris Ser. I Math., 334:495–500, 2002.

S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press,
Cambridge, 2004.

K. Fukumizu, T. Kuriki, K. Takeuchi, and M. Akahira. Statistics of Singular Models. The
Frontier of Statistical Science 7. Iwanami Syoten, 2004. in Japanese.

S. Ghosal and A. W. van der Vaart. Entropies and rates of convergence for maximum
likelihood and Bayes estimation for mixtures of normal densities. Annals of Statistics,
29:1233–1263, 2001.

H. Hachiya, T. Akiyama, M. Sugiyama, and J. Peters. Adaptive importance sampling
with automatic model selection in value function approximation. In Proceedings of the
Twenty-Third AAAI Conference on Artificial Intelligence (AAAI-08), Chicago, USA,
Jul. 13–17 2008.

P. Hall. On Kullback-Leibler loss and density estimation. The Annals of Statistics, 15(4):
1491–1519, 1987.

W. Härdle, M. Müller, S. Sperlich, and A. Werwatz. Nonparametric and Semiparametric
Models. Springer Series in Statistics. Springer, Berlin, 2004.

J. J. Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153–162,
1979.

J. Huang, A. Smola, A. Gretton, K. M. Borgwardt, and B. Schölkopf. Correcting sample
selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hoffman, editors,
Advances in Neural Information Processing Systems 19, pages 601–608. MIT Press,
Cambridge, MA, 2007.

V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk
minimization. Annals of Statistics, 34:2593–2656, 2006.

C.-J. Lin, R. C. Weng, and S. S. Keerthi. Trust region Newton method for large-scale
logistic regression. Technical report, Department of Computer Science, National Taiwan
University, 2007. URL http://www.csie.ntu.edu.tw/~cjlin/liblinear/.

S. Mendelson. Improving the sample complexity using global data. IEEE Transactions
on Information Theory, 48:1977–1991, 2002.

T. P. Minka. A comparison of numerical optimizers for logistic regression. Technical
report, Microsoft Research, 2007. URL
http://research.microsoft.com/~minka/papers/logreg/minka-logreg.pdf.

X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functions and
the likelihood ratio by penalized convex risk minimization. In Advances in Neural
Information Processing Systems 20, 2007.

C. E. Rasmussen, R. M. Neal, G. E. Hinton, D. van Camp, M. Revow, Z. Ghahramani,
R. Kustra, and R. Tibshirani. The DELVE manual, 1996. URL
http://www.cs.toronto.edu/~delve/.

G. Rätsch, T. Onoda, and K.-R. Müller. Soft margins for adaboost. Machine Learning,
42(3):287–320, 2001.

E. Schuster and C. Gregory. On the non-consistency of maximum likelihood
nonparametric density estimators. In W. F. Eddy, editor, In Computer Science and Statistics:
Proceedings of the 13th Symposium on the Interface, pages 295–298, New York, NY,
USA, 1982. Springer.

S. G. Self and K. Y. Liang. Asymptotic properties of maximum likelihood estimators and
likelihood ratio tests under nonstandard conditions. Journal of the American Statistical
Association, 82:605–610, 1987.

C. R. Shelton. Importance Sampling for Reinforcement Learning with Multiple Objectives.
PhD thesis, Massachusetts Institute of Technology, 2001.

H. Shimodaira. Improving predictive inference under covariate shift by weighting the
log-likelihood function. Journal of Statistical Planning and Inference, 90(2):227–244,
2000.

I. Steinwart. On the influence of the kernel on the consistency of support vector machines.
Journal of Machine Learning Research, 2:67–93, 2001.

M. Sugiyama, M. Kawanabe, and K.-R. Müller. Trading variance reduction with
unbiasedness: The regularized subspace information criterion for robust model selection in
kernel regression. Neural Computation, 16(5):1077–1104, 2004.

M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance
weighted cross validation. Journal of Machine Learning Research, 8:985–1005, May
2007.

M. Sugiyama and K.-R. Müller. Input-dependent estimation of generalization error under
covariate shift. Statistics & Decisions, 23(4):249–279, 2005.

M. Sugiyama, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct
importance estimation with model selection and its application to covariate shift adaptation.
In Advances in Neural Information Processing Systems 20, Cambridge, MA, 2008. MIT
Press.

R. S. Sutton and G. A. Barto. Reinforcement Learning: An Introduction. MIT Press,
Cambridge, MA, 1998.

M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of
Probability, 22:28–76, 1994.

M. Talagrand. New concentration inequalities in product spaces. Inventiones
Mathematicae, 126:505–563, 1996a.

M. Talagrand. A new look at independence. Annals of Statistics, 24:1–34, 1996b.

Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio
estimation for large-scale covariate shift adaptation. In M. J. Zaki, K. Wang, C. Apte, and
H. Park, editors, Proceedings of the Eighth SIAM International Conference on Data
Mining (SDM2008), pages 443–454, Atlanta, Georgia, USA, Apr. 24–26 2008.

S. van de Geer. Empirical Processes in M-Estimation. Cambridge University Press, 2000.

A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes with
Applications to Statistics. Springer, New York, 1996.

J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller, and T. M. Vaughan.
Brain-computer interfaces for communication and control. Clinical Neurophysiology,
113(6):767–791, 2002.
