


Linear regression under model uncertainty

Shuzhen Yang∗ Jianfeng Yao†

November 27, 2023

Abstract

We reexamine the classical linear regression model when the model is subject to two
types of uncertainty: (i) some of the covariates are either missing or completely inaccessible,
and (ii) the variance of the measurement error is undetermined and changes according to a
mechanism unknown to the statistician. Following the recent theory of sublinear expectation,
we propose to characterize such mean and variance uncertainty in the response variable
by two specific nonlinear random variables, which encompass an infinite family of probability
distributions for the response variable in the sense of (linear) classical probability theory.
The approach enables a family of estimators, under various loss functions, for the regression
parameter and for the parameters related to model uncertainty. The consistency of the estimators
is established under mild conditions on the data generation process. Three applications are
introduced to assess the quality of the approach, including a forecasting model for the S&P 500
Index.

Keywords: Robust regression; G-normal distribution; distribution uncertainty; heteroscedastic error; S&P index

∗ Shandong University-Zhong Tai Securities Institute for Financial Studies, Shandong University, PR China (yangsz@sdu.edu.cn). This work was supported by the National Key R&D Program of China (Grant No. 2018YFA0703900, ZR2019ZD41), the National Natural Science Foundation of China (Grant No. 11701330), and the Taishan Scholar Talent Project Youth Project.

† School of Data Science, The Chinese University of Hong Kong, Shenzhen, China (jeffyao@cuhk.edu.cn).

1 Introduction
Robust regression has been actively developed during the years 1970-2000. A long catalogue
of robust estimates for the regression coefficients has appeared in the literature that includes the
L1, M, GM, RM, LMS and LTS, S, MM, τ and SRC estimates, among others.1 According to
Huber, a robust procedure (or stability, see [8, page 5]) is “in the sense that small deviations from
the model assumptions should impair the performance only slightly, that is, the latter (described,
say, in terms of the asymptotic variance of an estimate, or of the level and power of a test)
should be close to the nominal value calculated at the model”. The robust regression estimates
above have been designed to achieve such robustness while improving estimation efficiency and
protecting against unexpected procedure breakdown.
Note that a central assumption in this robust statistics literature is that the majority of the
data under analysis follows a distribution given by an assumed model. Although the assumed
model can be very generic, it must however be unique, as required by statistical theory, in order
to enable inference about the model. When the data under analysis deviate significantly from
the assumed model, inference falls outside the set-up of traditional robust statistics. Quoting again
Huber, “the interpretation of results obtained by blind robust estimators becomes questionable
when the fraction of contaminants is no longer small.” [8, page 198]
Originating from the field of mathematical finance, model uncertainty is a concept that can
help statisticians deal with “no longer small” deviations of the data from an assumed model in
some precise contexts. In an early work, [7] proposed to tackle aversion to model ambiguity through
a family of max-min expected utility functions, in a framework where the data may follow an in-
finite family of models (or distributions). The concept of model uncertainty and its applications
in mathematical finance were further developed in [2, 11, 3, 4, 5]. In particular,
coherent risk measures were introduced in [1] to study both market risks and non-market risks.
Over the last decade, a fundamental concept of sublinear expectation was developed in [12, 13]
which provides a general theory for quantifying uncertainty about probability distribution of ran-
dom variables, and more generally, of stochastic processes.2 One important result of the theory is
a central limit theorem (under sublinear expectation) that bridges the general theory and statisti-
cal data analysis under model or distribution uncertainty. Parallel to the role of the classical central
limit theorem in classical statistical inference, a nonlinear normal distribution is introduced to
approximate the asymptotic distributions of large sums of independent variables. This nonlinear
normal distribution under sublinear expectation is the celebrated G-normal distribution. The theory
is fully developed in the recent monograph [14]. (A review of the relevant results is given in Appendix A.)

1. Actually Huber complained that “the collection of estimates to choose from has become so extensive that it is worse than bewildering, namely counterproductive”. [8, page 195]
2. In fact, the theory covers nonlinear expectations, which are more general than sublinear expectations. However, for the purpose of this paper it is sufficient to consider sublinear expectations.
This new theory of sublinear expectation leads to many questions to explore in data analysis
in situations where distribution uncertainty is inherent to the data generation process under con-
sideration. An example of such exploration is a recent work [15] where we constructed a new
VaR predictor for financial indexes which shows a significant advantage over most of the existing
benchmark VaR predictors. A fundamental idea underlying [15] is that, in parallel to classical
data analysis where the normal distribution is a natural choice for measurement errors or data
fluctuations, the G-normal distribution can serve as a primary tool for analyzing data fluctuations
when distributions of data are subject to high uncertainty. Such high distribution uncertainty is
indeed common in financial indexes such as the NASDAQ and S&P 500 indexes. The results
obtained in [15] for VaR prediction provide a new confirmation of the existence of such distri-
bution uncertainty. They also showcase the power and usefulness of the new theory of sublinear
expectation for data analysis under model or distribution uncertainty.
In this paper we explore the implications of such model uncertainty in the context of regres-
sion analysis. Precisely, consider a q-dimensional deterministic covariate vector X ∈ R^q and a
univariate dependent random variable Y ∈ R within a regression model of the form

    Y = \beta^\top X + \eta + \varepsilon,    (1.1)

where β ∈ R^q is the vector of regression coefficients. The novelty here is the terms η and ε, which
account for mean uncertainty and variance uncertainty, respectively. In layman's terms, β^⊤X
accounts for the contribution of the given covariates X to the response mean, while the unexplained
or remaining part of the mean is inaccessible, either because no further significant covariates are
available or because it varies through some unknown mechanism. This uncertain part of the mean
is modeled by the nonlinear random variable η. Furthermore, the fluctuation of Y around its true
mean, that is, the error ε, cannot be described by a single classical probability distribution; rather,
it follows the nonlinear G-normal distribution in order to capture the underlying uncertainty.
The model (1.1) is referred to as the distribution-uncertain regression model.
Under both uncertainties about the mean and variance of the response variable, is it still
possible to consistently estimate the regression parameter β in (1.1)? To answer the question, we
consider a general loss function φ and introduce two population optimal parameters under model
uncertainty, namely

    \bar\beta(\phi) = \arg\min_{\beta \in \mathbb{R}^q} \mathbb{E}[\phi(Y - \beta^\top X - \eta)],    (1.2)

and

    \beta^*(\phi) = \arg\min_{\beta \in \mathbb{R}^q} -\mathbb{E}[-\phi(Y - \beta^\top X - \eta)].    (1.3)

The particular feature here is that E is the sublinear expectation operator. Possible choices for
the loss function are φ(z) = z² for the square loss, φ(z) = [α − I(z < 0)]z for the quantile loss at
a given level α ∈ (0, 1), and φ(z) = I(z ≤ 0) for the Value-at-Risk (VaR) loss. In general, the
optimal parameters β̄(φ) and β∗(φ) depend on the loss function φ(·) under model uncertainty.
On the other hand, if Y had neither mean uncertainty nor variance uncertainty, that is, if η were a
real constant and ε a classical centred noise variable, then the model (1.1) would become a classical
linear regression model, and we would have β̄(φ) = β∗(φ) ≡ β for a large class of loss functions φ.
As a main contribution of the paper, Theorem 2.1 in Section 2 characterizes the population

optimal parameters β (φ) and β∗ (φ) for a wide class of convex loss functions φ. Next, in Section 3
we apply this characterization to the case of the square loss φ(z) = z2 . Based on this characteri-
zation, we propose a class of estimators for both the regression parameter β and those parameters
that involve in the mean-uncertainty variable η and the variance-uncertainty variable ε. Under
appropriate conditions on the data observation process, we establish large sample consistency of
these estimators.
The related literature on regression analysis under model uncertainty is actually quite limited.
When the error distribution in the regression model belongs to a finite family, [9] constructed a
k-sample maximum expectation regression over the given finite family of distributions. Using the
square loss, several estimators are proposed which are consistent and asymptotically normal. In a
follow-up work, still under the assumption of finite-number uncertainty, [10] investigated a more
general form of maximum expectation regression estimators and established their consistency
and asymptotic normality under appropriate conditions.
The remaining sections of the paper are organized as follows. Section 4 reports simulation experiments assessing
the finite-sample properties of the proposed estimators under model uncertainty. In Section 5, we
develop three applications of our method: to robust regression, to regression under heteroscedastic
errors, and to an analysis of daily returns of the S&P 500 Index. In Appendix A, we recall useful
results from the theory of sublinear expectation which are relevant to the work in this paper. All
technical proofs are gathered in Appendix B.
The main contributions of this paper are as follows:
(i) We explore the implications of model uncertainty in the context of regression analysis. The
novelty here is the terms η and ε, which account for mean uncertainty and variance uncertainty,
respectively. Under both types of uncertainty about the mean and variance of the response variable,
we introduce two population optimal parameters which can be used to estimate the regression
parameter.
(ii) We consider the square loss φ(z) = z² and propose a class of estimators for the
regression parameter β and for the parameters involved in the mean-uncertainty variable η and
the variance-uncertainty variable ε. We find that a moving-block method can be used to estimate
the minimum and maximum mean and variance: the classical LSE is applied to estimate the
relevant parameters within each block, and the block-wise estimates are then combined to
obtain estimates of the regression parameters. Essentially, this is therefore a nonlinear
regression problem.

2 Linear regression under distribution uncertainty


Consider the distribution-uncertain regression model (1.1) and the associated population opti-
mal parameters β̄(φ) and β∗(φ) in (1.2) and (1.3) for a given convex loss function φ. As men-
tioned in the Introduction, standard choices of the loss function cover the least squares estimator,
the quantile regression estimator and a VaR estimator.
Technically, we first construct a specific infinite family of probabilities. Consider a canonical
probability space (Ω, F, P), where Ω = C([0, 1]) is the space of real-valued continuous functions
on [0, 1], and let {B_t}_{0≤t≤1} be a Brownian motion. The parameter space we consider is Θ = L²(Ω ×
[0, 1], [σ, σ̄]), the space of square-integrable, progressively measurable random processes on [0, 1]
with values in the interval [σ, σ̄]. Here the two parameters 0 < σ < σ̄ are, respectively, the lower and
upper limits for the parameter processes θ = (θ_s)_{0≤s≤1} ∈ Θ. The family of probability
measures {P_θ} is defined, for A ∈ F, by

    P_\theta(A) = P \circ \xi_\theta^{-1}(A) = P(\xi_\theta \in A), \qquad \text{where } \xi_\theta(\cdot) = \int_0^{\cdot} \theta_s \, dB_s.

Now we consider the mean-uncertainty variable η. Precisely, under P_θ, the mean-uncertainty
variable η takes a constant value µ_θ ∈ [µ, µ̄], θ ∈ Θ. Furthermore, we assume that for any given µ ∈ [µ, µ̄],
there exists θ_µ ∈ Θ such that η = µ under P_{θ_µ}. The variance-uncertainty variable ε follows
a nonlinear G-normal distribution N_G(0, [σ², σ̄²]), with lower and upper variance parameters
(σ², σ̄²).3 Note that the distribution uncertainty of the error ε encompasses an infinite family of
distributions {F_θ(·)}_{θ∈Θ} = N_G(0, [σ², σ̄²]), where F_θ(·) is determined by P_θ.

Remark 2.1. Note that, under P_θ, the mean-uncertainty variable η takes a constant value, and for any
given µ ∈ [µ, µ̄] there exists θ_µ ∈ Θ such that η = µ under P_{θ_µ}. Thus, we can verify that η
follows a maximal distribution under the sublinear expectation E[·]: since η is a constant under
each P_θ, θ ∈ Θ, it is a particular case of the maximal distribution. The definition of the maximal
distribution is given in Appendix A.

By the representation theorem for sublinear expectations (Theorem A.1), we can express the
nonlinear expectation of any function of ε as

    \mathbb{E}[\phi(\varepsilon)] = \max_{\theta \in \Theta} E_\theta[\phi(\varepsilon)],

where E_θ[·] is the classical linear expectation under P_θ. Therefore the population optimal param-
eters in (1.2)-(1.3) take the form

    \bar\beta(\phi) = \arg\min_{\beta \in \mathbb{R}^q} \mathbb{E}[\phi(Y - \beta^\top X - \eta)] = \arg\min_{\beta \in \mathbb{R}^q} \max_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)],    (2.1)

and

    \beta^*(\phi) = \arg\min_{\beta \in \mathbb{R}^q} -\mathbb{E}[-\phi(Y - \beta^\top X - \eta)] = \arg\min_{\beta \in \mathbb{R}^q} \min_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)].    (2.2)

In other words, β̄(φ) and β∗(φ) are optimal for the min-max loss and the min-min loss strategies,
respectively, over the infinite family of probabilities {P_θ}_{θ∈Θ}.
Next, we have a technical lemma giving an exchange rule between the minimization and the maximization
(or minimization) steps in (2.1) and (2.2).

Lemma 2.1. Assume that the loss function φ(·) ∈ C_{l.Lip}(R) is convex. Then the following exchange
formulas hold for (2.1) and (2.2):

    \min_{\beta \in \mathbb{R}^q} \max_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)] = \max_{\theta \in \Theta} \min_{\beta \in \mathbb{R}^q} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)],    (2.3)

and

    \min_{\beta \in \mathbb{R}^q} \min_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)] = \min_{\theta \in \Theta} \min_{\beta \in \mathbb{R}^q} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)].    (2.4)

3. The details of the G-normal distribution are given in Appendix A.1.

As a consequence of Lemma 2.1, the two optimal parameters β̄(φ) and β∗(φ) can actually
be determined under two classical normal distributions N(µ_θ̄_φ, σ̄²) and N(µ_θ_φ, σ²), with specific
mean parameters µ_θ̄_φ and µ_θ_φ. This characterization of the parameters is instrumental
for the construction of the estimators presented in Section 3.

Theorem 2.1. Assume that the loss function φ(·) ∈ C_{l.Lip}(R) is convex. There exists an optimal
distribution parameter θ̄_φ(s) = σ̄, 0 ≤ s ≤ 1, such that

    \bar\beta(\phi) = \arg\min_{\beta \in \mathbb{R}^q} E_{\bar\theta_\phi}\big[\phi(Y - \beta^\top X - \mu_{\bar\theta_\phi})\big].

Similarly, there exists another optimal distribution parameter θ_φ(s) = σ, 0 ≤ s ≤ 1, such that

    \beta^*(\phi) = \arg\min_{\beta \in \mathbb{R}^q} E_{\theta_\phi}\big[\phi(Y - \beta^\top X - \mu_{\theta_\phi})\big].

The proofs of Lemma 2.1 and Theorem 2.1 are given in Appendix B.1 and B.2, respectively.

In order to calculate the two optimal parameters β̄(φ) and β∗(φ), we can use Theorem 2.1
with the following two-step procedure.

(1) Find the optimal linear expectations E_θ̄_φ[·] and E_θ_φ[·] based on the criterion function φ(·),
such that

    E_{\bar\theta_\phi}[\phi(Y - \beta^\top X - \mu_{\bar\theta_\phi})] = \max_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)],

and

    E_{\theta_\phi}[\phi(Y - \beta^\top X - \mu_{\theta_\phi})] = \min_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)].

(2) Once E_θ̄_φ[·] and E_θ_φ[·] are found, perform a standard regression analysis under the two linear
expectations to find the optimal parameters

    \bar\beta(\phi) = \arg\min_{\beta \in \mathbb{R}^q} E_{\bar\theta_\phi}[\phi(Y - \beta^\top X - \mu_{\bar\theta_\phi})], \qquad \beta^*(\phi) = \arg\min_{\beta \in \mathbb{R}^q} E_{\theta_\phi}[\phi(Y - \beta^\top X - \mu_{\theta_\phi})].

This two-step procedure defines a new mechanism for determining the optimal parameters
β̄(φ) and β∗(φ) under the considered distribution uncertainty. The procedure is valid for a general
convex loss function φ(·) ∈ C_{l.Lip}(R).
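For illustration, the following toy sketch (ours, not part of the original procedure) approximates the min-max and min-min criteria by Monte Carlo when Θ is replaced by a small finite family of parameters, each fixing a mean µ_θ and a volatility σ_θ; the family, the covariate design, the grid of candidate β values and all numerical settings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite family: each theta fixes a mean mu_theta in [mu, mu_bar]
# and a volatility sigma_theta in [sigma, sigma_bar].
family = [(0.0, 0.5), (1.0, 1.0), (2.0, 2.0)]            # (mu_theta, sigma_theta)
beta_true = 1.5
X = rng.uniform(1.0, 3.0, size=5000)                     # deterministic covariate values

def linear_expectation(beta, mu_theta, sigma_theta, phi):
    """Monte Carlo estimate of E_theta[phi(Y - beta*X - mu_theta)] under P_theta."""
    eps = rng.normal(0.0, sigma_theta, size=X.shape)
    Y = beta_true * X + mu_theta + eps                   # data generated under P_theta
    return phi(Y - beta * X - mu_theta).mean()

phi = lambda z: z ** 2                                   # square loss
betas = np.linspace(0.5, 2.5, 201)                       # candidate regression coefficients

# Step (1): for each beta, the sublinear expectation is the maximum over the family;
# Step (2): minimise over beta (min-max criterion), and analogously for min-min.
upper = [max(linear_expectation(b, m, s, phi) for m, s in family) for b in betas]
lower = [min(linear_expectation(b, m, s, phi) for m, s in family) for b in betas]

print("min-max optimal beta:", betas[int(np.argmin(upper))])   # both close to beta_true
print("min-min optimal beta:", betas[int(np.argmin(lower))])
```

Both criteria recover a value close to the true regression coefficient, in line with Corollary 3.1 below for the square loss.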

3 Least squares regression under distribution uncertainty


We now develop the least squares procedure for the estimation of the regression parameter β
under the distribution-uncertain model (1.1). The loss function is thus φ(·) = (·)², and the two
population optimal parameters in (2.1) and (2.2) are

    \bar\beta = \arg\min_{\beta \in \mathbb{R}^q} \mathbb{E}[(Y - \beta^\top X - \eta)^2], \qquad \beta^* = \arg\min_{\beta \in \mathbb{R}^q} -\mathbb{E}[-(Y - \beta^\top X - \eta)^2].    (3.1)

We call β̄ the upper least squares parameter (U-LSE) and β∗ the lower least squares parameter
(L-LSE). Applying the general Theorem 2.1 to the present case, we obtain the following character-
ization of these parameters, as well as of the two variance parameters σ and σ̄.

Theorem 3.1. Consider the distribution-uncertain regression model (1.1) under the square loss
function φ(z) = z².

(i) The U-LSE β̄ can be estimated from observation samples generated by

    Y = \beta^\top X + \mu_{\bar\sigma} + \varepsilon',    (3.2)

where ε′ follows the classical normal distribution N(0, σ̄²).

(ii) The L-LSE β∗ can be estimated from observation samples generated by

    Y = \beta^\top X + \mu_{\sigma} + \varepsilon'',    (3.3)

where ε″ follows the classical normal distribution N(0, σ²).

(iii) The variance parameters σ and σ̄ are characterized as follows:

    \bar\sigma^2 = E_{\bar\sigma}[(Y - \bar\beta^\top X - \mu_{\bar\sigma})^2], \qquad \sigma^2 = E_{\sigma}[(Y - \beta^{*\top} X - \mu_{\sigma})^2],    (3.4)

where E_σ̄[·] and E_σ[·] denote the expectations under the parameter processes θ_{(·)²}(s) = σ̄ and θ_{(·)²}(s) = σ, 0 ≤ s ≤ 1, respectively.

The results of Theorem 3.1 can be summarized as follows. When we adopt the min-max strategy,

    \min_{\beta \in \mathbb{R}^q} \max_{\theta \in \Theta} E_\theta[(Y - \beta^\top X - \mu_\theta)^2],

the U-LSE β̄ is the optimal parameter, and

    \bar\sigma^2 = E_{\bar\sigma}[(Y - \bar\beta^\top X - \mu_{\bar\sigma})^2] = \min_{\beta \in \mathbb{R}^q} \mathbb{E}[(Y - \beta^\top X - \eta)^2].    (3.5)

This characterization enables a sample counterpart of the U-LSE β̄, which will be a consistent
estimator of the parameter β, and subsequently another consistent estimator of the upper
variance σ̄.

Similarly, when we consider the min-min strategy, the L-LSE β∗ is the optimal parameter, and

    \sigma^2 = E_{\sigma}[(Y - \beta^{*\top} X - \mu_{\sigma})^2] = \min_{\beta \in \mathbb{R}^q} -\mathbb{E}[-(Y - \beta^\top X - \eta)^2].    (3.6)

Consistent estimators of both the parameter β and the lower variance σ can likewise be derived
from the sample counterparts of these parameters.

Consequently, we have the following result for the U-LSE β̄ and the L-LSE β∗.

Corollary 3.1. For the square loss function φ(z) = z², we have

    \bar\beta = \beta^* = \beta

for the distribution-uncertain regression model (1.1).

Proof. For the square loss φ(z) = z², part (i) of Theorem 3.1 shows that the U-LSE β̄ can be
estimated from observation samples generated by

    Y = \beta^\top X + \mu_{\bar\sigma} + \varepsilon',

where ε′ follows the classical normal distribution N(0, σ̄²); this implies that β̄ = β. In a
similar manner, β∗ = β. This completes the proof. □

3.1 Consistent estimators for the regression parameter β and the distribution-uncertainty parameters (µ, µ̄, σ², σ̄²)

In order to formulate a theory of consistent estimation, we need to define precisely the data
generation process under consideration, as follows.

Data generation process: The samples {(x_i, y_i)}_{i=1}^{T} satisfy

    y_i = \beta x_i + \eta_j + \varepsilon_i, \qquad 1 + n_0(j-1) \le i \le n_0 j, \quad 1 \le j \le K,    (3.7)

where η_j ∈ [µ, µ̄] and ε_i ~ N(0, σ_j²) for 1 + n_0(j−1) ≤ i ≤ n_0 j, with σ_j² ∈ [σ², σ̄²], and the ε_i, 1 ≤ i ≤ T,
are mutually independent. Thus there are K groups in the sample, and each group
has n_0 elements with mean η_j and variance σ_j². The total number of samples is T = n_0 K.
In model (3.7), the values (η_j, σ_j²), 1 ≤ j ≤ K, are chosen from [µ, µ̄] × [σ², σ̄²].
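A minimal generator for this data generation process is sketched below; the covariate design x_i and the uniform sampling of each group's (η_j, σ_j) from fixed intervals are illustrative assumptions (the process itself only requires the values to lie in [µ, µ̄] × [σ², σ̄²]).

```python
import numpy as np

def generate_block_data(beta=1.0, K=8, n0=200,
                        mean_range=(0.0, 5.0), sd_range=(0.1, 1.0), seed=0):
    """Simulate samples from the block data generation process (3.7):
    K groups of length n0, each with its own mean eta_j and noise sd sigma_j."""
    rng = np.random.default_rng(seed)
    T = K * n0
    x = 1.0 + 0.01 * np.arange(1, T + 1)          # illustrative covariate design
    eta = rng.uniform(*mean_range, size=K)        # group means eta_j
    sigma = rng.uniform(*sd_range, size=K)        # group standard deviations sigma_j
    eps = rng.normal(0.0, np.repeat(sigma, n0))   # eps_i ~ N(0, sigma_j^2) within group j
    y = beta * x + np.repeat(eta, n0) + eps
    return x, y, eta, sigma

x, y, eta, sigma = generate_block_data()
print("true (min mean, max mean):", eta.min(), eta.max())
print("true (min sd, max sd):    ", sigma.min(), sigma.max())
```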

Remark 3.1. The main challenge in estimating the various parameters of the distribution-
uncertain model (1.1) is that the theoretical characterizations of the U-LSE and L-LSE param-
eters given in Theorem 3.1 cannot be used directly, because we do not have at our disposal
samples from the two normal distributions N(µ_σ̄, σ̄²) and N(µ_σ, σ²) that appear in (3.2) and
(3.3), respectively. The difficulty is also due to the fact that, from one sample (x_i, y_i) to the next, the
uncertain mean η and the uncertain error ε can change significantly. We propose a method based
on moving and overlapping blocks that leads to a family of intermediate residuals which are ap-
proximately distributed as N(µ_σ, σ²). These intermediate residuals are then used for consistent
estimation of (β, σ²). Afterwards, we can build consistent estimates of (µ, µ̄). The consistency
of the estimator of σ̄² is not proved.

Data generated under (3.7) can be seen as a practical instance of the general distribution-
uncertain model (1.1). It provides the specification needed to introduce an estimation
theory. A few conditions of the process could be relaxed: for example, the group size n_0
may vary across groups, and the uncertain mean η and uncertain error ε may
have a controlled variation within each group. Importantly, only the samples {(x_i, y_i)}_{i=1}^{T} are
available to us; we have no direct access to the other parameters and variables, such as (i)
the group partition and the group length n_0; (ii) the group means (η_j) that account for the mean
uncertainty; (iii) the error variances (σ_j²) that account for the error uncertainty. Therefore, the
problem of parameter estimation here is not straightforward.

The main idea of our approach is to use moving blocks. The samples {(x_i, y_i)}_{i=1}^{T} are scanned
sequentially as m = T − n + 1 overlapping blocks of a given block length n:

    {1, . . . , n}, {2, . . . , n + 1}, . . . , {T − n + 1, . . . , T}.

Denote the data in the l-th block by B_l = {(x_i, y_i)}_{l ≤ i ≤ l+n−1}, 1 ≤ l ≤ m.


Estimators are constructed in several steps.

Step 1. Estimators for the parameters (β, µ, µ̄, σ²):

(i) For each block 1 ≤ l ≤ m with data B_l = {(x_i, y_i)}_{l ≤ i ≤ l+n−1}, we run an ordinary LSE proce-
dure on the standard regression model

    y_i = \beta_l x_i + \mu_l + \varepsilon_i, \qquad l \le i \le l + n - 1.

Let (β̂_l, µ̂_l) be the resulting estimates of the regression and mean parameters, and denote by
z_i = y_i − β̂_l x_i − µ̂_l the corresponding residuals. Define the mean squared error of the l-th block by

    \hat\sigma_l^2 = \frac{1}{n-1} \sum_{i=l}^{l+n-1} z_i^2.

(ii) Find the block k̂ with minimum mean squared error, that is,

    \hat{k} = \arg\min_{1 \le l \le m} \hat\sigma_l^2.

Let

    w_i = y_i - \hat\beta_{\hat k} x_i, \quad 1 \le i \le T, \qquad
    \tilde\mu_l = \frac{1}{n} \sum_{i=l}^{l+n-1} w_i, \quad 1 \le l \le m.

We introduce the following estimators.

• The lower and upper means (µ, µ̄) are estimated, respectively, by

    \hat\mu = \min_{1 \le l \le m} \tilde\mu_l, \qquad \hat{\bar\mu} = \max_{1 \le l \le m} \tilde\mu_l.    (3.8)

• The regression parameter β and the lower variance σ² are estimated by

    \hat\beta = \hat\beta_{\hat k},    (3.9)
    \hat\sigma^2 = \hat\sigma_{\hat k}^2,    (3.10)

that is, by the regression estimators from the minimum mean squared error block k̂.

Later, we will show that, under appropriate conditions, the estimators (β̂, µ̂, µ̄̂) converge to
(β, µ, µ̄) as n → ∞.

Step 2. Estimator for the upper variance σ̄²: To estimate the upper variance σ̄², we need to
remove the mean uncertainty which is present in the intermediate residuals w_i = y_i − β̂_k̂ x_i, 1 ≤
i ≤ T. Let n_1 < n be a small window size and P = T/n_1 (in practice, values like n_1 = 10, 20, 40
are recommended). Let

    \tilde w_i = w_i - \frac{1}{n_1} \sum_{r=1+(j-1)n_1}^{j n_1} w_r, \qquad 1 + (j-1) n_1 \le i \le j n_1, \quad 1 \le j \le P.

This step centres the data over a local window and is expected to remove the fluctuation
(uncertainty) in the observation means. Define, for 1 ≤ l ≤ m,

    \hat{\tilde\sigma}_l^2 = \frac{1}{n-1} \sum_{i=l}^{l+n-1} \tilde w_i^2.

Finally, we estimate the upper variance by

    \hat{\bar\sigma}^2 = \max_{1 \le l \le m} \hat{\tilde\sigma}_l^2.    (3.11)
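The two steps above translate directly into code. The sketch below is a simplified transcription for a single covariate (helper names are ours; no edge cases are handled, and T is assumed to be a multiple of n_1).

```python
import numpy as np

def robust_lse(x, y, n, n1):
    """Moving-block Robust-LSE of Section 3.1 (simplified sketch).
    Returns (beta_hat, mu_low, mu_up, var_low, var_up)."""
    T = len(y)
    m = T - n + 1

    # Step 1(i): ordinary LSE (slope + intercept) on every moving block of length n.
    betas, mus, mses = np.empty(m), np.empty(m), np.empty(m)
    for l in range(m):
        xb, yb = x[l:l + n], y[l:l + n]
        A = np.column_stack([xb, np.ones(n)])
        (b, mu), *_ = np.linalg.lstsq(A, yb, rcond=None)
        z = yb - b * xb - mu
        betas[l], mus[l], mses[l] = b, mu, z @ z / (n - 1)

    # Step 1(ii): keep the block with the smallest mean squared error.
    k = int(np.argmin(mses))
    beta_hat, var_low = betas[k], mses[k]

    # Block means of the intermediate residuals w_i = y_i - beta_hat * x_i give the
    # estimators of the lower and upper means.
    w = y - beta_hat * x
    block_means = np.array([w[l:l + n].mean() for l in range(m)])
    mu_low, mu_up = block_means.min(), block_means.max()

    # Step 2: centre w over non-overlapping windows of length n1 to remove the mean
    # uncertainty, then take the largest block mean squared error as var_up.
    w_tilde = w.copy()
    for j in range(T // n1):
        seg = slice(j * n1, (j + 1) * n1)
        w_tilde[seg] -= w[seg].mean()
    var_up = max(w_tilde[l:l + n] @ w_tilde[l:l + n] / (n - 1) for l in range(m))

    return beta_hat, mu_low, mu_up, var_low, var_up
```

Applied to data from the generator sketched after (3.7), with a block length n ≤ n_0, the estimates beta_hat, mu_low, mu_up and var_low should approach their targets as n grows, in line with Theorem 3.2 below.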

Remark 3.2. Consider now a simple case of model (3.7) where β = 0. For each block
1 ≤ l ≤ m with data B_l = {y_i}_{l ≤ i ≤ l+n−1},

    y_i = \mu_l + \varepsilon_i, \qquad l \le i \le l + n - 1.

Define the mean value and the mean squared error of the l-th block by

    \hat\mu_l = \frac{1}{n} \sum_{i=l}^{l+n-1} y_i, \qquad
    \hat\sigma_l^2 = \frac{1}{n-1} \sum_{i=l}^{l+n-1} (y_i - \hat\mu_l)^2, \qquad 1 \le l \le m.

The lower variance σ² and the lower and upper means (µ, µ̄) are then estimated, respectively, by

    \hat\sigma^2 = \min_{1 \le l \le m} \hat\sigma_l^2, \qquad \hat\mu = \min_{1 \le l \le m} \hat\mu_l, \qquad \hat{\bar\mu} = \max_{1 \le l \le m} \hat\mu_l.    (3.12)

To estimate the upper variance σ̄², let n_1 < n be a small window size and P = T/n_1. Let

    \tilde y_i = y_i - \frac{1}{n_1} \sum_{r=1+(j-1)n_1}^{j n_1} y_r, \qquad 1 + (j-1) n_1 \le i \le j n_1, \quad 1 \le j \le P.

This step centres the data over a local window and is expected to remove the fluctuation
in the observation means. Define, for 1 ≤ l ≤ m,

    \hat{\tilde\sigma}_l^2 = \frac{1}{n-1} \sum_{i=l}^{l+n-1} \tilde y_i^2.

Finally, we estimate the upper variance by

    \hat{\bar\sigma}^2 = \max_{1 \le l \le m} \hat{\tilde\sigma}_l^2.    (3.13)

The construction of the estimators above is motivated by the following observations.

(i) When two groups of samples, with respective sample means (µ₁, µ₂) and sample variances
(σ₁², σ₂²), are merged into one group, the mean of the resulting group takes a value in the
interval [µ₁ ∧ µ₂, µ₁ ∨ µ₂], and its variance is larger than σ₁² ∧ σ₂².

(ii) If the two groups have the same sample mean µ and different sample variances (σ₁², σ₂²), the
variance of the merged group belongs to the interval [σ₁² ∧ σ₂², σ₁² ∨ σ₂²] (a numerical check of both observations follows below).
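This check uses illustrative group parameters; the specific numbers are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, 10_000)      # group 1: mean 0, sd 1
g2 = rng.normal(5.0, 0.5, 10_000)      # group 2: mean 5, sd 0.5

merged = np.concatenate([g1, g2])
print(merged.mean())                   # ~2.5, inside [0, 5]                      (observation (i))
print(merged.var())                    # ~ (1 + 0.25)/2 + 2.5^2 = 6.875 > 0.25    (observation (i))

g3 = rng.normal(0.0, 0.5, 10_000)      # same mean as group 1, smaller sd
print(np.concatenate([g1, g3]).var())  # ~0.625, inside [0.25, 1]                 (observation (ii))
```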

Furthermore, with reference to the data generation process (3.7), consider a data group of
length n_0, A_j = {(x_i, y_i)}_{i=1+(j−1)n_0}^{j n_0}, where 1 ≤ j ≤ K. When n ≤ n_0, there exists a moving block
B_l = {(x_i, y_i)}_{i=l}^{l+n−1}, 1 ≤ l ≤ m, such that B_l ⊂ A_j. We use the ordinary LSE to estimate the
regression and mean parameters within each of the data blocks {B_l}_{l=1}^{m}, and obtain m estimates
of the regression coefficient β together with the corresponding mean squared errors. Based on observation
(i), the minimum of these m block mean squared errors can be used as an estimator of the
minimum variance of the data groups {A_j}_{j=1}^{K}. This is done with the block k̂ and the mean squared
error σ̂²_k̂ of this block. Then, by Theorem 3.1 and Corollary 3.1, we obtain the estimate
β̂_k̂ of β from the block k̂.

The next question is how to estimate (µ, µ̄, σ̄) via the m sets of residuals {C_l}_{l=1}^{m}, where C_l = {w_i =
y_i − β̂_k̂ x_i}_{i=l}^{l+n−1}. By observation (i), we can compute the means {µ̃_l} of these m sets of residuals; their
minimum and maximum values are good estimators of min_{1≤j≤K} η_j and max_{1≤j≤K} η_j, respec-
tively. As the latter values converge to the lower and upper means, µ and µ̄, respectively, when
K → ∞, these estimators are consistent. In fact, one could always use σ̄̂² to estimate σ̄², but this
estimator may not be consistent; it appears that it can be shown to be biased.

Finally, to estimate the upper variance σ̄² in Step 2, we first remove the mean uncertainty
present in the intermediate residuals {C_l}_{l=1}^{m} by local averaging over smaller blocks
of size n_1 < n. Then, by observation (ii), σ̄ can be estimated by the maximum of the mean
squared errors of the blocks {C_l}_{l=1}^{m} after removal of the mean uncertainty.
The theoretical consistency of these estimators is established in the following theorem.

Theorem 3.2. Consider the data generation process (3.7) and assume that n ≤ n_0. Then, as
n → ∞:

(i) the lower variance estimator is consistent for the minimum variance, that is, σ̂² →
min_{1≤j≤K} σ_j² with probability 1;

(ii) the estimator β̂ of the regression parameter is strongly consistent, that is, β̂ → β with
probability 1;

(iii) the lower and upper mean estimators in (3.12) are consistent for the minimum and max-
imum means, that is, (µ̂, µ̄̂) converge to (min_{1≤j≤K} η_j, max_{1≤j≤K} η_j) with probability 1.

Remark 3.3. The proof of the theorem is given in Appendix B.3. At the moment, we only find in
practical analyses that the upper variance estimator in (3.11) is consistent for the maximum variance,
that is, σ̄̂² → max_{1≤j≤K} σ_j² (see Section 4); however, we cannot give a rigorous proof of this
result. Furthermore, if we assume that, as K → ∞,

    \Big( \min_{1 \le j \le K} \eta_j, \ \max_{1 \le j \le K} \eta_j, \ \min_{1 \le j \le K} \sigma_j^2, \ \max_{1 \le j \le K} \sigma_j^2 \Big) \longrightarrow (\mu, \bar\mu, \sigma^2, \bar\sigma^2),

then (µ̂, µ̄̂, σ̂², σ̄̂²) can be used to estimate (µ, µ̄, σ², σ̄²).

4 Simulation experiments
Simulations are conducted to check the finite-sample performance of the Robust-LSE esti-
mators proposed in Section 3.1. The design for the data generation process (3.7) is as follows:
for 1 ≤ j ≤ K and 1 + n_0(j−1) ≤ i ≤ n_0 j,

• η_j is drawn uniformly from [0, 5] and σ_j uniformly from [0.1, 1]; define

    (\eta_{\min}, \eta_{\max}) = \Big( \min_{1 \le j \le K} \eta_j, \ \max_{1 \le j \le K} \eta_j \Big), \qquad (\sigma_{\min}, \sigma_{\max}) = \Big( \min_{1 \le j \le K} \sigma_j, \ \max_{1 \le j \le K} \sigma_j \Big);

• ε_i ~ N(0, σ_j²);

• y_i = x_i + η_j + ε_i (so that β = 1).

Consider the estimators (β̂, µ̂, µ̄̂, σ̂, σ̄̂) defined in Steps 1 and 2 of Section 3.1. We take
(n_0, n, n_1) = (200, 150, 20) and vary T ∈ {400, 800, 1600, 3200} (equivalently, K = T/n_0 ∈
{2, 4, 8, 16}). For each combination of (T, n_0, n, n_1), we generate 500 independent replications of the data
set {η_j, σ_j}_{j=1}^{K} and the errors {ε_i}_{i=1}^{T}. The average values of (η_min, η_max, σ_min, σ_max) over
the 500 replications are denoted by (η̄_min, η̄_max, σ̄_min, σ̄_max).
Table 1 reports empirical statistics for the Robust-LSE estimators and, for comparison pur-
poses, the ordinary LSE estimator. For each case, we report the average and standard error
of the two estimators of β. The Robust-LSE method indeed provides a better estimator β̂ than
the ordinary LSE, with smaller standard errors for K ∈ {2, 4} and comparable standard errors for
K ∈ {8, 16}. Note that we have drawn the parameters {η_j, σ_j}_{j=1}^{K} uniformly from fixed intervals.
The induced mean and variance uncertainties become less severe as the number of groups K grows,
because an averaging effect then reduces these uncertainties and the ordi-
nary LSE method is able to provide an accurate estimate of the regression parameter. However,
if the uncertain mean and variance values {η_j, σ_j} do not obey any clearly defined distribution
(as they do here), the performance of the ordinary LSE is likely to worsen. Furthermore, by con-
struction, the Robust-LSE method provides consistent estimators (µ̂, µ̄̂, σ̂², σ̄̂²) of the mean and
variance uncertainty parameters in the samples.

Table 1: Empirical statistics of the Robust-LSE estimators and the ordinary LSE estimator
from 500 replications. Averages (with standard errors in parentheses) are reported for β̂. Parameters are β = 1,
(n_0, n, n_1) = (200, 150, 20) and T ∈ {400, 800, 1600, 3200}.

                      β̂                 (µ̂, µ̄̂)             (σ̂, σ̄̂)

  T = 400, (η̄_min, η̄_max) = (1.7405, 3.2692), (σ̄_min, σ̄_max) = (0.3991, 0.7055)
  Robust-LSE    0.9729 (0.5151)    (1.7479, 3.2705)    (0.3742, 0.7129)
  LSE           1.0299 (1.4121)     2.4562              0.7352

  T = 800, (η̄_min, η̄_max) = (0.9819, 3.9324), (σ̄_min, σ̄_max) = (0.2890, 0.8275)
  Robust-LSE    0.9820 (0.3583)    (0.8607, 3.9908)    (0.2703, 0.8399)
  LSE           0.9839 (0.5803)     2.6162              1.1562

  T = 1600, (η̄_min, η̄_max) = (0.5733, 4.4535), (σ̄_min, σ̄_max) = (0.1975, 0.8956)
  Robust-LSE    0.9994 (0.2387)    (0.4028, 4.5665)    (0.1861, 0.9164)
  LSE           1.0147 (0.2105)     2.5216              1.3733

  T = 3200, (η̄_min, η̄_max) = (0.2987, 4.6994), (σ̄_min, σ̄_max) = (0.1525, 0.9491)
  Robust-LSE    0.9916 (0.1153)    (0.0730, 4.9132)    (0.1427, 0.9776)
  LSE           0.9948 (0.0812)     2.4905              1.4729
Figure 1: Samples of regression lines from the ordinary LSE and from the LSE of the minimum mean
squared error block k̂ (Min-LSE), for (n_0, n, n_1) = (200, 150, 20) and T ∈ {400, 800, 1600, 3200}.
[Four panels; plots not reproduced here.]
In Figure 1, we plot sample regression lines from the ordinary LSE and from the LSE of the minimum
mean squared error block k̂, with parameters (β̂_k̂, µ̂_k̂) as given by the Robust-LSE method,
for sample sizes T ∈ {400, 800, 1600, 3200}. The lines from the minimum
mean squared error block k̂ concentrate on the sub-samples with minimum mean squared error,
while the lines from the ordinary LSE fit the whole sample. This explains why, in general, the
Robust-LSE method can provide a better estimator of the regression parameter β.
Now we fix (T, n_0, n_1) = (1600, 200, 20) and examine the behaviour of the Robust-
LSE estimators as the block length n grows, taking n ∈ {60, 80, 160, 200}. Table 2 reports
empirical statistics from 500 replications. From the estimators (β̂, µ̂, µ̄̂, σ̂, σ̄̂), it is again observed
that the Robust-LSE method performs better than the ordinary LSE. Furthermore, the conver-
gence of the Robust-LSE estimators (β̂, µ̂, µ̄̂, σ̂, σ̄̂) as n grows is confirmed. In Figure 2, we
show, as in Figure 1, sample regression lines from the ordinary LSE and from the minimum mean
squared error block k̂ of the Robust-LSE method. We observe that the latter catches the
groups with minimum variance for all the tested block lengths n.

Table 2: Empirical statistics of the Robust-LSE estimators and the ordinary LSE estimator
from 500 replications. Averages (with standard errors in parentheses) are reported for β̂. Parameters are β = 1,
(T, n_0, n_1) = (1600, 200, 20) and n ∈ {60, 80, 160, 200}.

                      β̂                 (µ̂, µ̄̂)             (σ̂, σ̄̂)

  n = 60, (η̄_min, η̄_max) = (0.5642, 4.4594), (σ̄_min, σ̄_max) = (0.1965, 0.8901)
  Robust-LSE    0.9877 (0.3214)    (0.1597, 4.9009)    (0.1633, 0.9994)
  LSE           1.0142 (0.2175)     2.4645              1.3726

  n = 80, (η̄_min, η̄_max) = (0.5683, 4.4583), (σ̄_min, σ̄_max) = (0.2013, 0.9046)
  Robust-LSE    0.9950 (0.2193)    (0.3790, 4.8098)    (0.1755, 0.9810)
  LSE           1.0054 (0.2168)     2.4732              1.3782

  n = 160, (η̄_min, η̄_max) = (0.5857, 4.4779), (σ̄_min, σ̄_max) = (0.2034, 0.8995)
  Robust-LSE    1.0028 (0.1769)    (0.4650, 4.5696)    (0.1926, 0.9123)
  LSE           0.9893 (0.2126)     2.5715              1.3637

  n = 200, (η̄_min, η̄_max) = (0.5610, 4.4350), (σ̄_min, σ̄_max) = (0.2024, 0.8955)
  Robust-LSE    1.0016 (0.1039)    (0.5519, 4.5037)    (0.1959, 0.8887)
  LSE           0.9980 (0.2175)     2.5469              1.3712
Figure 2: Samples of regression lines from the ordinary LSE and from the LSE of the minimum mean
squared error block k̂ (Min-LSE). Parameters are β = 1, (T, n_0, n_1) = (1600, 200, 20) and n ∈
{60, 80, 160, 200}. [Four panels; plots not reproduced here.]

5 Applications

5.1 Robust regression


We apply the Robust-LSE estimators to the traditional robust regression problem. Precisely,
we compare our method with a benchmark robust regression estimator, namely the MM estima-
tor. [17] has given an extensive review and comparison of the existing robust regression
estimators under various scenarios of model contamination; overall, two estimators perform bet-
ter than the other competitors, namely the MM estimator [16] and the REWLSE estimator [6].
Since these two best performers are close to each other, we chose the MM estimator as the reference
in this study.

Following a classical setting in the robust regression literature, we consider a simple linear
model with contamination of the form Y = X + ε, with samples {(x_i, y_i)}_{i=1}^{T}, where x_i = 1 + 0.01 i,
1 ≤ i ≤ T, and six scenarios for the errors {ε_i}: for 1 ≤ m ≤ 6,

    Scenario m: ε_i ~ N(0, 1) for 1 ≤ i ≤ a_m T, and ε_i ~ N(0, 100) for a_m T < i ≤ T.

Here a_m ∈ {0.95, 0.90, 0.80, 0.85, 0.70, 0.60, 0.50}, and 1 − a_m is referred to as the contamina-
tion rate of the base standard normal errors by a normal error with the larger variance 100.
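A sketch of the scenario generator and of the resulting MSE of the ordinary LSE is given below; it is ours and does not reproduce the MM estimator (the Robust-LSE sketch of Section 3.1 can be applied directly to the same arrays).

```python
import numpy as np

def contaminated_sample(T=200, a=0.85, seed=None):
    """Scenario data: y = x + eps, with N(0, 1) errors replaced by N(0, 100)
    errors after the first a*T observations."""
    rng = np.random.default_rng(seed)
    x = 1.0 + 0.01 * np.arange(1, T + 1)
    split = int(a * T)
    eps = np.concatenate([rng.normal(0.0, 1.0, split),
                          rng.normal(0.0, 10.0, T - split)])
    return x, x + eps

def lse_slope(x, y):
    """Ordinary LSE slope (with an intercept)."""
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0][0]

mse = np.mean([(lse_slope(*contaminated_sample(seed=r)) - 1.0) ** 2
               for r in range(500)])
print("ordinary LSE MSE at 15% contamination:", mse)
```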

Under each scenario, we generate 500 replications of the data and compute the ordinary
LSE, the MM estimator and the Robust-LSE estimator of the regression parameter β. Table 3
reports the MSEs of the estimators over the 500 replications; a companion plot of these MSEs
accompanies the table. In general, the ordinary LSE has a large MSE.
The Robust-LSE and MM estimators have almost identical performance in scenarios 1 and
2 with light contamination. In contrast, for scenarios 3, 4 and 5 with heavier contamination,
the Robust-LSE clearly outperforms the MM estimator; in the last case with 50%
contamination, the MM estimator breaks down, with an MSE almost double that of the ordinary
LSE (and about ten times that of the Robust-LSE estimator).

Next we examine the large-sample behaviour of the three estimators by gradually increasing
the sample size from T = 200 to T = 1000. Among the six contamination scenarios, we report
the results for scenarios 1 and 4. The empirical MSEs are reported in Table 4 and displayed
in the accompanying plot. The Robust-LSE estimator performs better than the MM
estimator in scenario 4 (medium contamination), while the two are similar under scenario 1 (light
contamination), both being preferable to the ordinary LSE. In addition, all
three estimators show consistency as the sample size increases.

Table 3: Empirical MSEs of the ordinary LSE, MM and Robust-LSE estimators of the regression
coefficient β. Sample size T = 200 with 500 replications.

  Scenario      MM        LSE       Robust-LSE
  1             0.0205    0.2075    0.0228
  2             0.0264    0.3459    0.0237
  3             0.0603    0.5733    0.0405
  4             0.1823    0.6875    0.0547
  5             0.6079    0.6970    0.1060
  6             1.6230    0.7033    0.1667

[Companion plot of the MSEs across scenarios 1-6 not reproduced here.]
Table 4: Empirical MSEs of the ordinary LSE, MM and Robust-LSE estimators of the regression
coefficient β. Sample size T ∈ {200, 400, 600, 800, 1000} under scenarios 1 and 4, with 500
replications.

                        T = 200   T = 400   T = 600   T = 800   T = 1000
  Scenario 1   MM       0.0205    0.0024    0.0007    0.0003    0.0002
               LSE      0.2075    0.0282    0.0071    0.0031    0.0017
               R-LSE    0.0228    0.0025    0.0007    0.0003    0.0001
  Scenario 4   MM       0.1823    0.0230    0.0064    0.0029    0.0013
               LSE      0.6875    0.0833    0.0251    0.0116    0.0052
               R-LSE    0.0547    0.0059    0.0020    0.0007    0.0004

[Companion plots of the MSEs of β as functions of T for scenarios 1 and 4 not reproduced here.]

5.2 Regression under heteroscedastic errors


In this section, we consider a special regression model with heteroscedastic errors:

    Y_{ij} = \beta X_{ij} + \varepsilon_{i,j}, \qquad 1 \le j \le n_0, \quad 1 \le i \le K, \quad T = n_0 K,

where ε_{i,j} ~ N(0, σ_i²). We set β = 1, K = 10, X_{ij} = 1 + 0.005 (j + (i−1) n_0), and

    {σ_i}_{i=1}^{10} = {0.6995, 0.5851, 0.3481, 0.1304, 0.7165, 0.3344, 0.4721, 0.5211, 0.1955, 0.4851},

so that (min_{1≤i≤10} σ_i, max_{1≤i≤10} σ_i) = (0.1304, 0.7165). This list of standard deviations is quite arbitrary; their
exact values have no particular meaning in our discussion.
The particularity here is that the model has only variance uncertainty. We apply our Robust-
LSE method, without prior knowledge of the heteroscedasticity of the data set, to obtain an
estimate of the regression parameter β and of the underlying minimum and maximum volatilities
(σ_min, σ_max). Table 5 reports the empirical averages of these estimates over 500 replications; the
corresponding ordinary LSE estimates are also given for comparison, and Figure 3 plots these em-
pirical values. The Robust-LSE provides an estimator of β which is as good
as the ordinary LSE; it also provides accurate estimates of the minimum and maximum
volatilities, which the ordinary LSE cannot.

Table 5: Heteroscedastic regression model with (β, σ_min, σ_max) = (1, 0.1304, 0.7165). Averages
of the estimators over 500 replications and sample sizes T ∈ {500, 1000, 1500, 2000}.

                         β         Min. volatility   Max. volatility
  T = 500     R-LSE     0.9770     0.1213            0.7844
              LSE       0.9981     0.4845            0.4845
  T = 1000    R-LSE     1.0066     0.1246            0.7691
              LSE       1.0001     0.4864            0.4864
  T = 1500    R-LSE     1.0050     0.1258            0.7544
              LSE       1.0002     0.4858            0.4858
  T = 2000    R-LSE     1.0013     0.1267            0.7466
              LSE       1.0000     0.4854            0.4854
Figure 3: Plots of the empirical averages of the estimates in Table 5 (see the caption there): the
estimate of β (top) and the estimates of the minimum and maximum volatilities (bottom), as
functions of T, for the R-LSE and the ordinary LSE. [Plots not reproduced here.]
5.3 Real data analysis


We consider a simple linear model

    Y = \beta_1 X_1 + \beta_2 X_2 + \varepsilon, \qquad X_1, X_2 \in \mathbb{R},    (5.1)

where ε follows a normal distribution N(0, σ²). In real markets, it is important to select the factors
for the linear regression model; however, we may fail to observe the factor X_2 and ignore it. Thus,
it is possible that we actually consider the following model:

    Y = \beta_1 X_1 + L + \varepsilon, \qquad X_1 \in \mathbb{R},    (5.2)

where L is a constant. Note that the ordinary LSE can be used to obtain the coefficients of model
(5.2). Based on the distribution-uncertain regression model (1.1), we instead use a mean-uncertainty term
to represent the unknown factor β_2 X_2. The new model is

    Y = \beta_1 X_1 + \eta + \varepsilon, \qquad X_1 \in \mathbb{R},    (5.3)

where η takes values in an interval under the sublinear expectation, and ε has the N(0, σ²) distribution.
We analyze the S&P 500 Index to assess the performance of models (5.2) and (5.3). The
daily closing price data of the index cover the period from January 3, 2000 to July 17, 2020. We
consider a first-order autoregressive version of models (5.2) and (5.3):

    X_{t+1} = \beta X_t + L + \varepsilon_{t+1}, \qquad X_{t+1} = \beta X_t + \eta + \varepsilon_{t+1}.

To estimate the parameters of model (5.3), we choose n = 10 and use the estimates from the
block with the minimum variance to calculate the R² and the F-statistic.
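A minimal sketch of this block-selection fit is given below. The helper names, the use of simulated returns in the usage lines, and the overall F-statistic formula (standard one-regressor form) are our assumptions; the paper's criterion F_{0.01}(2, 247) suggests a possibly different degrees-of-freedom convention, so the sketch illustrates the mechanics rather than reproducing Table 6.

```python
import numpy as np

def min_block_ar1(returns, n=10):
    """Fit X_{t+1} = beta * X_t + mu on every block of length n and return the
    (beta, mu) of the block with the smallest residual variance."""
    x, y = returns[:-1], returns[1:]
    best = None
    for l in range(len(y) - n + 1):
        A = np.column_stack([x[l:l + n], np.ones(n)])
        coef, *_ = np.linalg.lstsq(A, y[l:l + n], rcond=None)
        res = y[l:l + n] - A @ coef
        mse = res @ res / (n - 1)
        if best is None or mse < best[0]:
            best = (mse, coef)
    return best[1]

def goodness_of_fit(returns, beta, mu):
    """R^2 and overall F-statistic of the fitted line over the full sample
    (may be small, or even negative for R^2, since the fit comes from one block)."""
    x, y = returns[:-1], returns[1:]
    resid = y - beta * x - mu
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    f = (len(y) - 2) * r2 / (1.0 - r2)
    return r2, f

# Usage on simulated returns (replace with one year of S&P 500 daily returns):
rng = np.random.default_rng(0)
rets = rng.normal(0.0, 0.01, 250)
beta, mu = min_block_ar1(rets)
print(goodness_of_fit(rets, beta, mu))
```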

Table 6: Regression results of the LSE and the Robust-LSE; significance is assessed against the critical value F_{0.01}(2, 247) = 4.6921.

  Year               Method    β          R²        F-statistic
  201907–202007      R-LSE     -0.4196    0.1802    27.1386
                     LSE       -0.3592    0.1290    18.2981
  201807–201907      R-LSE      0.3250    0.0878    11.8815
                     LSE        0.0117    0.0001     0.0168
  201707–201807      R-LSE     -0.2005    0.0380     4.8738
                     LSE       -0.0523    0.0027     0.3385
  201607–201707      R-LSE     -0.7813    0.3269    59.9758
                     LSE       -0.1764    0.0311     3.9661
  201507–201607      R-LSE     -0.5840    0.2115    33.1283
                     LSE        0.0391    0.0015     0.1890

Table 6 shows that the model fitted with the Robust-LSE performs better than the one fitted with the
ordinary LSE, according to both the R² coefficient and the F-statistic of goodness of fit.
Furthermore, beyond the five years reported in the table, we have repeated the same comparison
for all of the past 20 years of the S&P 500 Index: at the 1% level, the F-statistic is significant in 18 years for
the model fitted with the Robust-LSE, but in only one year for the model fitted with the
ordinary LSE.

6 Conclusion
In this study, a robust linear regression model under both mean and variance uncertainty in the
response variable is investigated. We use a G-normal distribution to represent the variance uncer-
tainty, and another nonlinear random variable for the mean uncertainty. These nonlinear random
variables in fact encompass an infinite family of distributions for the response variable, instead
of the single distribution of the classical regression model. For a given estimation loss criterion,
two estimation strategies, namely the min-max and the min-min strategies, are introduced for es-
timating the regression parameter. The theory of sublinear expectation allows us to characterize
the optimal parameters for the two estimation strategies. For the square loss func-
tion, the method leads to the robust (upper and lower) least squares estimators, which capture the
maximum and minimum volatility of the response variable. Under mild conditions on
the data generation process, the consistency of the estimators of both the regression parameter
and the parameters of mean and variance uncertainty is established. These theoretical results are
confirmed by simulation experiments. The usefulness of the approach is assessed favourably in
three applications, in comparison with existing regression methods including the ordinary LSE
and a benchmark robust regression estimator.
Further investigation of the proposed method would include more extensive real data analysis.
It is also worth investigating alternative data generation processes for the general distribution-
uncertain regression model (1.1).

Acknowledgement
We would like to thank the referees for their careful reading and helpful comments, which have
substantially improved this paper.

A Preliminaries from the sublinear expectation theory
In the following, we introduce the sublinear expectation theory used to describe
the infinite family of distributions. We suppose that an infinite family of probabilities
{P_θ}_{θ∈Θ} lies behind the error ε, with associated distribution functions F_θ(z) = P_θ(ε ≤ z), z ∈
R, θ ∈ Θ, where Θ is a given set. Based on the given infinite family of probabilities {P_θ}_{θ∈Θ}, we
introduce the representation of a sublinear expectation E[·], defined on a linear
space H of real-valued functions on Ω. A sublinear expectation E[·] : H → R satisfies, for
X, Y ∈ H,
(i) monotonicity: E[X] ≤ E[Y] whenever X ≤ Y;
(ii) constant preservation: E[c] = c for c ∈ R;
(iii) sub-additivity: E[X + Y] ≤ E[X] + E[Y];
(iv) positive homogeneity: E[λX] = λE[X] for λ ≥ 0.
The next result represents a sublinear expectation E[·] as a supremum over a family of clas-
sical linear expectations.

Theorem A.1. [14, Theorem 1.2.1] Let E[·] be a sublinear expectation on H. Then there exists an
infinite family of linear expectations {E_θ, θ ∈ Θ} such that

    \mathbb{E}[X] = \max_{\theta \in \Theta} E_\theta[X], \qquad X \in \mathcal{H}.    (A.1)

Define the space C_{l.Lip}(R) of locally Lipschitz functions φ(·): for some positive
constants C and k depending on φ,

    |\phi(x) - \phi(y)| \le C (1 + |x|^k + |y|^k) \, |x - y|, \qquad x, y \in \mathbb{R}.

We have the following nonlinear central limit theorem.

Theorem A.2. [14, Theorem 2.4.4] Let {Z_i}_{i=1}^{∞} be a sequence of real-valued random variables on
a sublinear expectation space (Ω, H, E[·]). Let Z_{i+1} and Z_i be identically distributed, with Z_{i+1}
independent of {Z_1, Z_2, ..., Z_i} for i ≥ 1. In addition, assume that

    \mathbb{E}[Z_1] = \mathbb{E}[-Z_1] = 0,

and E[|Z_1|^{2+δ}] < ∞ for some δ > 0. Then the sequence

    \left\{ \frac{Z_1 + Z_2 + \cdots + Z_n}{\sqrt{n}} \right\}_{n=1}^{\infty}

converges to a G-normally distributed random variable Z under the sublinear expectation E[·]; that
is, for φ(·) ∈ C_{l.Lip}(R),

    \lim_{n \to \infty} \mathbb{E}\left[ \phi\Big( \frac{Z_1 + Z_2 + \cdots + Z_n}{\sqrt{n}} \Big) \right] = \mathbb{E}[\phi(Z)].

The impact of this nonlinear central limit theorem on statistics is as follows. In parallel to
the role of the normal distribution appearing in the limit of the classical central limit theorem,
the nonlinear G-normal random variable Z appearing in this theorem can serve as a natural
model for measurement errors in the nonlinear expectation framework, that is, when variables
are subject not to a single distribution but to a potentially infinite family of unknown distributions.
In this paper, we apply this idea to the measurement error ε in a linear regression model as a way
to capture its distribution uncertainty.
We first introduce the maximal distribution under the sublinear expectation E[·].

Definition A.1 (Maximal distribution). Given a sublinear expectation space (Ω, H, E) and a
random variable η, if there exists an interval [µ, µ̄] ⊂ R such that

    \mathbb{E}[\phi(\eta)] = \max_{y \in [\mu, \bar\mu]} \phi(y), \qquad \forall \phi \in C_{l.Lip}(\mathbb{R}),

then η is called maximally distributed.
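Operationally, Definition A.1 says that the sublinear expectation of a maximally distributed η reduces to a maximisation over the interval; a short numerical reading, with an illustrative interval [µ, µ̄] = [−1, 2]:

```python
import numpy as np

mu, mu_bar = -1.0, 2.0
grid = np.linspace(mu, mu_bar, 10_001)    # dense grid over [mu, mu_bar]

# E[phi(eta)] = max of phi over the interval, for any test function phi.
for phi in (lambda y: y, lambda y: -y, lambda y: (y - 1.0) ** 2):
    print(np.max(phi(grid)))              # 2.0, 1.0 and 4.0 respectively
```

In particular E[η] = µ̄ = 2 while −E[−η] = µ = −1, which is exactly the mean uncertainty used for η in model (1.1).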

In the following, we develop more details on the G-normal distribution.

A.1 The G-normal distribution for variance uncertainty


In the following, we explicitly construct a random variable Z_1 which follows the G-normal
distribution appearing in Theorem A.2. Recall the infinite family of probabilities {P_θ}_{θ∈Θ} introduced
in Section 2, and let Z_1 satisfy

    \mathbb{E}[Z_1] = -\mathbb{E}[-Z_1] = x.

Since E[·] = max_{θ∈Θ} E_θ[·], this relationship means that

    \max_{\theta \in \Theta} E_\theta[Z_1] = \min_{\theta \in \Theta} E_\theta[Z_1] = x,

that is, the maximum mean and the minimum mean of Z_1 over θ ∈ Θ coincide. In other
words, Z_1 has no uncertainty in its mean. The expectations of Z_1 under {P_θ}_{θ∈Θ} are given by

    E_\theta[\phi(Z_1)] = \int_{\mathbb{R}} \phi(z) \, dF_\theta(z),    (A.2)

where φ ∈ C_{l.Lip}(R) is some criterion (test) function.
In general, it is difficult to calculate the sublinear expectation E[φ(Z_1)] directly. We therefore construct
the G-normal distribution by means of a partial differential equation: the PDE allows us to find the
optimal parameter θ_φ such that E[φ(Z_1)] = E_{θ_φ}[φ(Z_1)] and to calculate the expectation E_{θ_φ}[φ(Z_1)]
under the linear expectation E_{θ_φ}[·].

Assumption A.1. Assume that {Z_t}_{0≤t≤1} satisfies the stochastic differential equation

    dZ_t = \theta_t \, dB_t, \qquad Z_0 = 0,

under P_θ, θ ∈ Θ = L²(Ω × [0, 1], [σ, σ̄]), where Θ is the set of all progressively measurable
processes taking values in [σ, σ̄].

The stochastic process {Z_t}_{0≤t≤1} in Assumption A.1 has a time-varying variance under each given
probability measure P_θ; there are therefore infinitely many distributions behind this process.
We define the distribution of Z_1 as the G-normal distribution N_G(0, [σ², σ̄²]).4 Thus, for a
given criterion function φ(·) ∈ C_{l.Lip}(R), we have

    \mathbb{E}[\phi(Z_t)] = \max_{\theta \in \Theta} E_\theta[\phi(Z_t)] = \max_{\theta \in \Theta} E_\theta\Big[ \phi\Big( \int_0^t \theta_s \, dB_s \Big) \Big].

Proposition 2.2.10 of [14] shows that u(t, x) = E[φ(Z_t + x)] is the unique viscosity solution of
the partial differential equation

    \partial_t u(t, x) - G(\partial^2_{xx} u(t, x)) = 0, \qquad t > 0, \ x \in \mathbb{R},    (A.3)

with initial condition u(0, x) = φ(x), x ∈ R, where the function G(·) is defined as

    G(a) = \tfrac{1}{2} \big( \bar\sigma^2 a^+ - \sigma^2 a^- \big), \qquad a^+ = \max(a, 0), \quad a^- = \max(-a, 0).    (A.4)

Note that u(1, 0) = E[φ(Z_1)]. Using the process {Z_t}_{0≤t≤1}, we can therefore compute the
characteristics of the G-normal random variable Z_1 for a given criterion function φ(·) under the
infinite family of distributions {F_θ}_{θ∈Θ}.

4. Based on Assumption A.1, we use N_G(0, [σ², σ̄²]) to represent the infinite family of distributions {F_θ}_{θ∈Θ} behind the random variable Z_1.
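Since u(t, x) = E[φ(Z_t + x)] solves (A.3)-(A.4), the G-expectation E[φ(Z_1)] = u(1, 0) can be computed numerically. The sketch below uses a simple explicit finite-difference scheme on a truncated domain; the domain size, grid resolution and crude boundary handling are illustrative choices, not part of the original construction.

```python
import numpy as np

def g_normal_expectation(phi, sigma_low, sigma_up, L=6.0, nx=401, nt=20_000):
    """Approximate E[phi(Z_1)] for Z_1 ~ N_G(0, [sigma_low^2, sigma_up^2]) by an
    explicit finite-difference scheme for the G-heat equation (A.3)-(A.4) on [-L, L]."""
    x = np.linspace(-L, L, nx)
    dx = x[1] - x[0]
    dt = 1.0 / nt
    assert dt <= dx**2 / sigma_up**2, "time step too large for stability"
    u = phi(x)                              # initial condition u(0, x) = phi(x)
    for _ in range(nt):
        uxx = np.zeros_like(u)
        uxx[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2
        G = 0.5 * (sigma_up**2 * np.maximum(uxx, 0) - sigma_low**2 * np.maximum(-uxx, 0))
        u = u + dt * G                      # boundary values stay frozen (crude truncation)
    return np.interp(0.0, x, u)             # u(1, 0) = E[phi(Z_1)]

# For a convex phi the G-expectation coincides with the N(0, sigma_up^2) expectation:
print(g_normal_expectation(lambda z: z**2, 0.5, 1.0))                  # ~ 1.0
# For a non-convex phi the optimal theta is generally time-varying and the value can
# exceed both constant-volatility normal expectations:
print(g_normal_expectation(lambda z: np.minimum(z**2, 1.0), 0.5, 1.0))
```

This also makes concrete why, for the convex losses used in this paper, the optimal parameter process in Theorem 2.1 is the constant θ̄_φ(s) = σ̄.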

B Proofs

B.1 Proof of Lemma 2.1


In the first step, we prove that,

minq max Eθ [φ(Y − β> X − µθ )] = max minq Eθ [φ(Y − β> X − µθ )].


β∈R θ∈Θ θ∈Θ β∈R

For any given β ∈ Rq , since X is a deterministic vector variable, from (1.1), ε = Y − β> X − η
satisfies a G-normal distribution NG (a, [σ2 , σ2 ]), where a is a constant, which depends on β.
Note that by assumption, φ(·) is convex. Let

1
Z ∞

(y−a t)2
u(t, x) = p φ(y + x)e 2σ2 t dy.
2πσ2 t −∞

Because the equation (A.3) admits a unique classical solution, we can verify that u(t, x) is this
solution, with initial condition limt→0 u(t, x) = φ(x). Thus, we can take θ(s) = σ, 0 ≤ s ≤ 1 such
that
u(1, 0) = Eσ [φ(Y − β> X − µσ )] = E[φ(Y − β> X − η)].

By Theorem A.1, we have

E[φ(Y − β> X − η)] = max Eθ [φ(Y − β> X − µθ )],


θ∈Θ

and thus
Eσ [φ(Y − β> X − µσ )] = max Eθ [φ(Y − β> X − µθ )].
θ∈Θ

It follows that

minq Eσ [φ(Y − β> X − µσ )] = minq max Eθ [φ(Y − β> X − µθ )].


β∈R β∈R θ∈Θ

Obviously,
minq Eσ [φ(Y − β> X) − µσ ] ≤ max minq Eθ [φ(Y − β> X − µθ )],
β∈R θ∈Θ β∈R

which implies that

minq max Eθ [φ(Y − β> X − µθ )] ≤ max minq Eθ [φ(Y − β> X − µθ )].


β∈R θ∈Θ θ∈Θ β∈R

On the other hand, it is easy to verify that

minq max Eθ [φ(Y − β> X − µθ )] ≥ max minq Eθ [φ(Y − β> X − µθ )].


β∈R θ∈Θ θ∈Θ β∈R

30
Thus, we have

minq max Eθ [φ(Y − β> X − µθ )] = max minq Eθ [φ(Y − β> X − µθ )], (B.1)
β∈R θ∈Θ θ∈Θ β∈R

and
minq Eσ [φ(Y − β> X − µσ )] = minq max Eθ [φ(Y − β> X − µθ )]. (B.2)
β∈R β∈R θ∈Θ

Similarly, we can obtain the ”min-min=min-min” exchange rule:

minq min Eθ [φ(Y − β> X − µθ )] = min minq Eθ [φ(Y − β> X − µθ )], (B.3)
β∈R θ∈Θ θ∈Θ β∈R

and
minq Eσ [φ(Y − β> X − µσ )] = minq min Eθ [φ(Y − β> X − µθ )]. (B.4)
β∈R β∈R θ∈Θ

This completes the proof. 

B.2 Proof of Theorem 2.1


Note that φ(·) is convex. By the representation results (B.2) and (B.4) of Lemma 2.1, we have

    E_{\bar\sigma}[\phi(Y - \beta^\top X - \mu_{\bar\sigma})] = \max_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)],

and

    E_{\sigma}[\phi(Y - \beta^\top X - \mu_{\sigma})] = \min_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)].

This implies that

    \bar\beta(\phi) = \arg\min_{\beta \in \mathbb{R}^q} E_{\bar\sigma}[\phi(Y - \beta^\top X - \mu_{\bar\sigma})] = \arg\min_{\beta \in \mathbb{R}^q} \max_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)],

and

    \beta^*(\phi) = \arg\min_{\beta \in \mathbb{R}^q} E_{\sigma}[\phi(Y - \beta^\top X - \mu_{\sigma})] = \arg\min_{\beta \in \mathbb{R}^q} \min_{\theta \in \Theta} E_\theta[\phi(Y - \beta^\top X - \mu_\theta)].

This completes the proof. □

B.3 Proof of Theorem 3.2


For notational simplicity, set A_j = {(x_i, y_i)}_{i=1+n_0(j−1)}^{n_0 j}, 1 ≤ j ≤ K, with η_j ∈ [µ, µ̄] and
ε_i ~ N(0, σ_j²), σ_j² ∈ [σ², σ̄²]; the total number of samples is T = n_0 K. For each group A_j, when
n ≤ n_0 there exists an integer k_j such that the samples {(x_i, y_i)}_{i=k_j}^{k_j+n−1} ⊂ A_j. Thus, we can find a
block B_l = {(x_i, y_i)}_{i=l}^{l+n−1} that belongs to the group among {A_j}_{j=1}^{K} with the smallest variance min_{1≤j≤K} σ_j².
(i) Recall that within the l-th block, with data B_l = {(x_i, y_i)}_{i=l}^{l+n−1}, the ordinary LSE defined
in Step 1(i) of the procedure yields the estimates (β̂_l, µ̂_l) of the regression parameter and the block
mean; the mean squared error σ̂_l² of the block is also easily obtained. Recall observation (ii)
given below (3.11): if a block B_l overlaps two groups A_{j_l} and A_{j_l+1}, the
mean squared error σ̂_l² will be larger than if B_l were contained in a single group A_j. Therefore, the
minimum of these mean squared errors is achieved by a block B_l which is included in a
single A_j.

There are K groups in model (3.7), and the observations of each group realize one sub-model. On the
given probability space (Ω, F, P), the probability measure P is such that ε_i follows the normal
distribution N(0, σ_j²) for n_0(j − 1) + 1 ≤ i ≤ n_0 j, 1 ≤ j ≤ K. Note that, for any given sufficiently
small δ > 0, we have

    P\Big( \hat\sigma_{\hat k}^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta \Big)
    = P\Big( \hat\sigma_1^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta, \ \hat\sigma_2^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta, \ \cdots, \ \hat\sigma_m^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta \Big)
    \le P\Big( \hat\sigma_{k_1}^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta, \ \hat\sigma_{k_2}^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta, \ \cdots, \ \hat\sigma_{k_K}^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta \Big)
    \le P\Big( \hat\sigma_{\bar k}^2 > \min_{1 \le j \le K} \sigma_j^2 + \delta \Big),

where the block B_k̄ belongs to the group with the smallest variance min_{1≤j≤K} σ_j². By the classical
law of large numbers, lim_{n→∞} P(σ̂²_k̄ ≥ min_{1≤j≤K} σ_j² + δ) = 0. By induction, it follows
that

    P\Big( \hat\sigma_{\hat k}^2 < \min_{1 \le j \le K} \sigma_j^2 - \delta \Big)
    = 1 - P\Big( \hat\sigma_1^2 \ge \min_{1 \le j \le K} \sigma_j^2 - \delta, \ \hat\sigma_2^2 \ge \min_{1 \le j \le K} \sigma_j^2 - \delta, \ \cdots, \ \hat\sigma_m^2 \ge \min_{1 \le j \le K} \sigma_j^2 - \delta \Big)
    \le 2 - P\Big( \hat\sigma_1^2 \ge \min_{1 \le j \le K} \sigma_j^2 - \delta \Big) - P\Big( \hat\sigma_2^2 \ge \min_{1 \le j \le K} \sigma_j^2 - \delta, \ \cdots, \ \hat\sigma_m^2 \ge \min_{1 \le j \le K} \sigma_j^2 - \delta \Big)
    \le m - \sum_{l=1}^{m} P\Big( \hat\sigma_l^2 \ge \min_{1 \le j \le K} \sigma_j^2 - \delta \Big).

For each 1 ≤ l ≤ m, by the classical law of large numbers, lim_{n→∞} P(σ̂_l² ≥ min_{1≤j≤K} σ_j² −
δ) = 1. Hence σ̂²_k̂ → min_{1≤j≤K} σ_j² with probability 1 under P as n → ∞.
In the following, we show that β̂_k̂ → β with probability 1 under P as n → ∞. For any given
sufficiently small δ > 0, it follows that

    P(|\hat\beta_{\hat k} - \beta| > \delta)
    = P\Big(|\hat\beta_{\hat k} - \beta| > \delta, \ |\hat\sigma_{\hat k}^2 - \min_{1 \le j \le K}\sigma_j^2| < \delta \Big) + P\Big(|\hat\beta_{\hat k} - \beta| > \delta, \ |\hat\sigma_{\hat k}^2 - \min_{1 \le j \le K}\sigma_j^2| \ge \delta \Big)
    \le P\Big(|\hat\beta_{\hat k} - \beta| > \delta, \ |\hat\sigma_{\hat k}^2 - \min_{1 \le j \le K}\sigma_j^2| < \delta \Big) + P\Big(|\hat\sigma_{\hat k}^2 - \min_{1 \le j \le K}\sigma_j^2| \ge \delta \Big).

As n → ∞, noting that (β̂_k̂, σ̂²_k̂) come from the same block, a proof by contradiction gives

    \lim_{n \to \infty} P\Big(|\hat\beta_{\hat k} - \beta| > \delta, \ |\hat\sigma_{\hat k}^2 - \min_{1 \le j \le K}\sigma_j^2| < \delta \Big) = 0
    \quad \text{and} \quad
    \lim_{n \to \infty} P\Big(|\hat\sigma_{\hat k}^2 - \min_{1 \le j \le K}\sigma_j^2| \ge \delta \Big) = 0.

Thus, lim_{n→∞} P(|β̂_k̂ − β| > δ) = 0.

(ii) In (i) above, we have obtained the consistency of (β̂_k̂, σ̂²_k̂) for the parameters (β, min_{1≤j≤K} σ_j²).
Recalling the observation given in Remark 3.2 and proceeding in a manner similar to (i), the estimators
µ̂ = min_{1≤l≤m} µ̃_l and µ̄̂ = max_{1≤l≤m} µ̃_l of the minimum and maximum means of the groups
(A_j)_{1≤j≤K} are consistent; that is, they converge to min_{1≤j≤K} η_j and
max_{1≤j≤K} η_j with probability 1. We only show that µ̂ converges to min_{1≤j≤K} η_j with probability 1. Note
that, for any given sufficiently small δ > 0,

    P\Big( \hat\mu > \min_{1 \le j \le K} \eta_j + \delta \Big)
    = P\Big( \tilde\mu_1 > \min_{1 \le j \le K} \eta_j + \delta, \ \tilde\mu_2 > \min_{1 \le j \le K} \eta_j + \delta, \ \cdots, \ \tilde\mu_m > \min_{1 \le j \le K} \eta_j + \delta \Big)
    \le P\Big( \tilde\mu_{k_1} > \min_{1 \le j \le K} \eta_j + \delta, \ \tilde\mu_{k_2} > \min_{1 \le j \le K} \eta_j + \delta, \ \cdots, \ \tilde\mu_{k_K} > \min_{1 \le j \le K} \eta_j + \delta \Big)
    \le P\Big( \tilde\mu_{\bar k} > \min_{1 \le j \le K} \eta_j + \delta \Big),

where the block B_k̄ belongs to the group with the smallest mean min_{1≤j≤K} η_j. Denote by
{y_{k̄ i}, x_{k̄ i}}_{i=1}^{n} the elements of B_k̄. Based on the classical law of large numbers, we have

    \lim_{n \to \infty} P\Big( \tilde\mu_{\bar k} \ge \min_{1 \le j \le K} \eta_j + \delta \Big)
    = \lim_{n \to \infty} P\Big( \frac{1}{n}\sum_{i=1}^{n} (y_{\bar k i} - \beta x_{\bar k i}) + (\beta - \hat\beta_{\hat k}) \frac{1}{n}\sum_{i=1}^{n} x_{\bar k i} \ \ge \ \min_{1 \le j \le K} \eta_j + \delta \Big)
    \le \lim_{n \to \infty} \Big[ P\Big( \frac{1}{n}\sum_{i=1}^{n} (y_{\bar k i} - \beta x_{\bar k i}) + (\beta - \hat\beta_{\hat k}) \frac{1}{n}\sum_{i=1}^{n} x_{\bar k i} \ge \min_{1 \le j \le K} \eta_j + \delta, \ (\beta - \hat\beta_{\hat k}) \frac{1}{n}\sum_{i=1}^{n} x_{\bar k i} \le \frac{\delta}{2} \Big)
    \qquad\qquad + P\Big( \frac{1}{n}\sum_{i=1}^{n} (y_{\bar k i} - \beta x_{\bar k i}) + (\beta - \hat\beta_{\hat k}) \frac{1}{n}\sum_{i=1}^{n} x_{\bar k i} \ge \min_{1 \le j \le K} \eta_j + \delta, \ (\beta - \hat\beta_{\hat k}) \frac{1}{n}\sum_{i=1}^{n} x_{\bar k i} > \frac{\delta}{2} \Big) \Big]
    \le \lim_{n \to \infty} \Big[ P\Big( \frac{1}{n}\sum_{i=1}^{n} (y_{\bar k i} - \beta x_{\bar k i}) \ge \min_{1 \le j \le K} \eta_j + \frac{\delta}{2} \Big) + P\Big( \Big| (\beta - \hat\beta_{\hat k}) \frac{1}{n}\sum_{i=1}^{n} x_{\bar k i} \Big| > \frac{\delta}{2} \Big) \Big]
    = 0.

By induction, it follows that

    P\Big( \hat\mu < \min_{1 \le j \le K} \eta_j - \delta \Big)
    = 1 - P\Big( \hat\mu \ge \min_{1 \le j \le K} \eta_j - \delta \Big)
    \le m - \sum_{l=1}^{m} P\Big( \tilde\mu_l \ge \min_{1 \le j \le K} \eta_j - \delta \Big).

For each 1 ≤ l ≤ m, by the classical law of large numbers, lim_{n→∞} P(µ̃_l ≥ min_{1≤j≤K} η_j −
δ) = 1. Hence µ̂ → min_{1≤j≤K} η_j with probability 1 under P as n → ∞.
The proof is complete. □

References
[1] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9:203–228, 1999.

[2] M. Avellaneda, A. Levy, and A. Parás. Pricing and hedging derivative securities in markets with uncertain volatilities. Applied Mathematical Finance, 2:73–88, 1995.

[3] Z. Chen and L. Epstein. Ambiguity, risk, and asset returns in continuous time. Econometrica, 70(4):1403–1443, 2002.

[4] R. Cont. Model uncertainty and its impact on the pricing of derivative instruments. Mathematical Finance, 16:519–547, 2006.

[5] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Third revised and extended edition. Walter de Gruyter & Co., Berlin, 2011.

[6] D. Gervini and V. J. Yohai. A class of robust and fully efficient regression estimators. Annals of Statistics, 30(2):583–616, 2002.

[7] I. Gilboa and D. Schmeidler. Maxmin expected utility with non-unique prior. Journal of Mathematical Economics, 18:141–153, 1989.

[8] P. J. Huber and E. M. Ronchetti. Robust Statistics, Second Edition. Wiley, 2009.

[9] L. Lin, Y. Shi, X. Wang, and S. Yang. k-sample upper expectation linear regression: modeling, identifiability, estimation and prediction. Journal of Statistical Planning and Inference, 170:15–26, 2016.

[10] L. Lin, P. Dong, Y. Song, and L. Zhu. Upper expectation parametric regression. Statistica Sinica, 27:1265–1280, 2017.

[11] T. J. Lyons. Uncertain volatility and the risk-free synthesis of derivatives. Applied Mathematical Finance, 2:117–133, 1995.

[12] S. Peng. Filtration consistent nonlinear expectations and evaluations of contingent claims. Acta Mathematicae Applicatae Sinica, 20:1–24, 2004.

[13] S. Peng. Nonlinear expectations and nonlinear Markov chains. Acta Mathematicae Applicatae Sinica, 26B:159–184, 2005.

[14] S. Peng. Nonlinear Expectations and Stochastic Calculus under Uncertainty. Springer, Berlin, Heidelberg, 2019.

[15] S. Peng, S. Yang, and J. Yao. Improving value-at-risk prediction under model uncertainty. Journal of Financial Econometrics, 1–32, 2021. doi:10.1093/jjfinec/nbaa022.

[16] V. J. Yohai. High breakdown-point and high efficiency robust estimates for regression. The Annals of Statistics, 15:642–656, 1987.

[17] C. Yu and W. Yao. Robust linear regression: A review and comparison. Communications in Statistics: Simulation and Computation, 46(8):6261–6282, 2017.
