[High Dimensional Statistics] Uniform Laws of large numbers
References
- Yonsei Univ. STA9104: Statistical Theory for High-Dimensional and Big Data
- High-Dimensional Statistics: A Non-Asymptotic Viewpoint, 2017, Martin J. Wainwright
Uniform laws have been investigated as part of empirical process theory and have many applications in statistics. In contrast to the usual law of large numbers, which applies to a fixed sequence of random variables, uniform laws of large numbers provide a guarantee that holds uniformly over a collection of random variables. The goal of this post is to introduce tools for studying uniform laws, with an emphasis on non-asymptotic results.
Summary of convergence concepts
- Pointwise convergence
Let $f_n : E \to \mathbb{R}$ be a sequence of functions. We say $f_n$ converges pointwise to $f$ if for each $x \in E$, $\lim_{n \to \infty} f_n(x) = f(x)$,
i.e. for each $x \in E$ and $\epsilon > 0$, there exists $N(x, \epsilon)$ such that $|f_n(x) - f(x)| < \epsilon$ for all $n \ge N(x, \epsilon)$.
- Uniform convergence
Let $f_n : E \to \mathbb{R}$ be a sequence of functions. We say $f_n$ converges uniformly to $f$ if
i.e. for each $\epsilon > 0$, there exists $N(\epsilon)$, independent of $x$, such that $|f_n(x) - f(x)| < \epsilon$ for all $n \ge N(\epsilon)$ and all $x \in E$,
which is the same as $\sup_{x \in E} |f_n(x) - f(x)| \to 0$ as $n \to \infty$.
1. The derivatives of a pointwise (even uniformly) convergent sequence of functions do not have to converge.
Consider $f_n(x) = \frac{\sin(nx)}{\sqrt{n}}$, which converges uniformly to $f \equiv 0$, while $f_n'(x) = \sqrt{n}\cos(nx)$ diverges (e.g. at $x = 0$).
2. The integrals of a pointwise convergent sequence of functions do not have to converge to the integral of the limit.
Consider $f_n = n \cdot \mathbf{1}_{(0, 1/n)}$ on $[0,1]$: $f_n \to 0$ pointwise, but $\int_0^1 f_n \, dx = 1$ for every $n$, while $\int_0^1 0 \, dx = 0$.
3. The limit of a pointwise convergent sequence of continuous functions does not have to be continuous.
Consider $f_n(x) = x^n$ on $[0,1]$. It satisfies pointwise convergence, to $f(x) = 0$ for $x < 1$ and $f(1) = 1$, but the limit is not continuous.
4. Uniform convergence implies pointwise convergence, but not the other way around.
The same example as above works: $f_n(x) = x^n$ on $[0,1]$ converges pointwise.
If the convergence were uniform, the limit of these continuous functions would itself be continuous,
contradicting the discontinuity of the limit observed in item 3.
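As a quick numerical sketch of items 3 and 4 (the function names below are mine, not the post's), the sequence $f_n(x) = x^n$ on $[0,1]$ converges pointwise, yet the sup-deviation from the limit stays near $1$ for every $n$:

```python
# f_n(x) = x**n on [0, 1]: pointwise limit is 0 for x < 1 and 1 at x = 1,
# but sup_{x in [0,1)} |f_n(x) - 0| stays close to 1 for every n,
# so the convergence is pointwise and not uniform.

def f_n(x, n):
    return x ** n

def sup_deviation(n, grid_size=10_000):
    # approximate sup over [0, 1) of |f_n(x) - 0| on a fine grid
    return max(f_n(k / grid_size, n) for k in range(grid_size))

# Pointwise: at a fixed x = 0.9 the values shrink toward 0 ...
print(f_n(0.9, 10), f_n(0.9, 100))
# ... yet the supremum over x remains near 1 for every n.
print(sup_deviation(10), sup_deviation(100))
```

The grid approximation understates the true supremum (which equals $1$ taken over $[0,1)$), but already shows it does not shrink with $n$.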
1. Motivation
1.1. Almost sure convergence
Before starting the demonstration, I will recall two modes of probabilistic convergence: almost sure convergence and convergence in probability.
1.1.1. Convergence in probability
For random variables $X_n$ and $X$, we say $X_n \xrightarrow{p} X$ if for every $\epsilon > 0$,
$$\lim_{n \to \infty} P\left(|X_n - X| > \epsilon\right) = 0,$$
where the probability is over the joint law of $X_n$ and $X$.
1.1.2. Almost sure convergence
For random variables $X_n$ and $X$, we say $X_n \xrightarrow{a.s.} X$ if
$$P\left(\lim_{n \to \infty} X_n = X\right) = 1.$$
A convenient sufficient condition: if $\sum_{n=1}^{\infty} P\left(|X_n - X| > \epsilon\right) < \infty$ for every $\epsilon > 0$, then $X_n \xrightarrow{a.s.} X$.
Here, the implication is true by the Borel-Cantelli lemma.
1.1.3. Comparison of concepts
To understand the difference, the law of large numbers is a good example. Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$, and write $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$.
I referred to the Cross Validated thread "Convergence in probability vs. almost sure convergence" (stackexchange.com) for the explanation here.
The weak law of large numbers states convergence in probability of $\bar{X}_n$ to $\mu$: at any fixed large $n$, the trajectory $n \mapsto \bar{X}_n$ is unlikely to lie outside the band $[\mu - \epsilon, \mu + \epsilon]$, but it may still leave the band infinitely often along the way.
However, in the case of the strong law of large numbers, almost every trajectory eventually enters the band and stays there.
The difference can also be seen from equivalent representations of the definitions:
SLLN: $P\left(\lim_{n \to \infty} \bar{X}_n = \mu\right) = 1$, equivalently $\lim_{N \to \infty} P\left(\sup_{n \ge N} |\bar{X}_n - \mu| > \epsilon\right) = 0$ for every $\epsilon > 0$.
WLLN: $\lim_{n \to \infty} P\left(|\bar{X}_n - \mu| > \epsilon\right) = 0$ for every $\epsilon > 0$,
where the SLLN controls the whole tail of the trajectory at once, while the WLLN controls one $n$ at a time.
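The WLLN-style statement can be illustrated by simulation (a hedged sketch; the Bernoulli setup and function names are mine): the probability that the sample mean lies outside the band $[\mu - \epsilon, \mu + \epsilon]$ at a fixed $n$ shrinks as $n$ grows.

```python
import random

def fraction_outside_band(n, eps, trials=2000, seed=0):
    # Monte Carlo estimate of P(|mean_n - 0.5| > eps) for Bernoulli(0.5) draws
    rng = random.Random(seed)
    outside = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) > eps:
            outside += 1
    return outside / trials

# WLLN flavour: at each fixed n the chance of being outside the band shrinks.
print(fraction_outside_band(10, 0.1))    # noticeably positive
print(fraction_outside_band(1000, 0.1))  # essentially zero
```

The almost-sure statement is about whole trajectories rather than a fixed $n$, so it is not captured by this per-$n$ frequency alone.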
1.2. Cumulative distribution functions
Let $X_1, \ldots, X_n$ be i.i.d. samples from a distribution $P$ on $\mathbb{R}$ with CDF $F(t) = P(X \le t)$, and define the empirical CDF $\hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \le t\}$. For each fixed $t$, the SLLN gives $\hat{F}_n(t) \xrightarrow{a.s.} F(t)$.
Moreover, uniform convergence of the empirical CDF, $\|\hat{F}_n - F\|_\infty \xrightarrow{a.s.} 0$, is the content of the Glivenko-Cantelli theorem.
Uniform convergence is important because many estimation problems start from a functional of the CDF. For example,
- Expectation functionals: $\gamma_g(F) = \mathbb{E}[g(X)] = \int g \, dF$,
which is estimated by the plug-in estimator given by $\gamma_g(\hat{F}_n) = \frac{1}{n}\sum_{i=1}^n g(X_i)$.
- Quantile functionals: $Q_\alpha(F) = \inf\{t \in \mathbb{R} : F(t) \ge \alpha\}$,
which is estimated by the plug-in estimator given by $Q_\alpha(\hat{F}_n)$, the empirical $\alpha$-quantile.
The estimators above are justified by continuity of the functional. That is, once we know that a functional $\gamma$ is
continuous with respect to the sup-norm $\|F - G\|_\infty = \sup_{t} |F(t) - G(t)|$,
uniform convergence $\|\hat{F}_n - F\|_\infty \to 0$ implies $\gamma(\hat{F}_n) \to \gamma(F)$.
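A minimal sketch of the two plug-in estimators above (pure Python; the helper names are mine), using standard normal samples so the targets are known:

```python
import math, random

def empirical_cdf(sample):
    xs = sorted(sample)
    n = len(xs)
    def F_hat(t):
        # fraction of sample points <= t
        return sum(x <= t for x in xs) / n
    return F_hat

def plugin_quantile(sample, alpha):
    # smallest order statistic t with F_hat(t) >= alpha
    xs = sorted(sample)
    return xs[max(0, math.ceil(alpha * len(xs)) - 1)]

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100_000)]

F_hat = empirical_cdf(sample)
second_moment = sum(x * x for x in sample) / len(sample)  # plug-in for E[X^2]

print(F_hat(0.0))                    # close to Phi(0) = 0.5
print(plugin_quantile(sample, 0.5))  # close to the true median 0
print(second_moment)                 # close to E[X^2] = 1
```

Both estimators are exactly functionals of $\hat{F}_n$: the expectation plug-in averages $g$ over the sample, and the quantile plug-in inverts the empirical CDF.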
1.3. Uniform laws for more general function classes
Let $\mathcal{F}$ be a class of integrable real-valued functions on a domain $\mathcal{X}$, and let $X_1, \ldots, X_n$ be i.i.d. samples from a distribution $P$ over $\mathcal{X}$. Define the random variable
$$\|P_n - P\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - \mathbb{E}[f(X)] \right|,$$
which measures the absolute deviation between the sample average and the population average, uniformly over the class $\mathcal{F}$. We say that $\mathcal{F}$ is a Glivenko-Cantelli class for $P$ if $\|P_n - P\|_{\mathcal{F}} \xrightarrow{a.s.} 0$.
- Empirical CDFs
We could see that taking $\mathcal{F} = \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$ recovers $\|P_n - P\|_{\mathcal{F}} = \|\hat{F}_n - F\|_\infty$, so the classical Glivenko-Cantelli theorem states exactly that this class is Glivenko-Cantelli.
- Failure of a uniform law
Not all classes of functions are Glivenko-Cantelli. Let $\mathcal{S}$ be the class of all finite subsets of $[0,1]$, take $\mathcal{F} = \{\mathbf{1}_S : S \in \mathcal{S}\}$, and let $P$ be the uniform distribution on $[0,1]$. Since $P$ is continuous, $\mathbb{E}[\mathbf{1}_S(X)] = 0$ for every finite $S$, while the choice $S = \{X_1, \ldots, X_n\}$ gives $\frac{1}{n}\sum_{i=1}^n \mathbf{1}_S(X_i) = 1$.
Therefore, we conclude that $\|P_n - P\|_{\mathcal{F}} = 1$ for every $n$, and thereby $\mathcal{F}$ is not a Glivenko-Cantelli class.
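The failure mode can be made concrete in a few lines (a sketch, with $P$ uniform on $[0,1]$ as in the counterexample):

```python
import random

# With F the indicators of finite subsets of [0, 1] and P uniform on [0, 1],
# the adversarial choice S = {X_1, ..., X_n} has empirical mean 1 while
# E[1_S(X)] = 0 for a continuous P, so the uniform deviation is always 1.

rng = random.Random(0)
n = 50
sample = [rng.random() for _ in range(n)]

S = set(sample)                                     # adversarial finite subset
empirical_mean = sum(x in S for x in sample) / n    # = 1 by construction
population_mean = 0.0                               # P(X in S) = 0

deviation = abs(empirical_mean - population_mean)
print(deviation)  # 1.0, no matter how large n is
```

The point is that the supremum over $\mathcal{F}$ is taken *after* seeing the data, so a rich enough class can always fit the sample perfectly.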
1.4. Empirical risk minimization
- Suppose that samples $Z_1, \ldots, Z_n$ are drawn i.i.d. according to a distribution $P_{\theta^*}$
for some fixed but unknown $\theta^* \in \Omega$. - A standard approach to estimating $\theta^*$
is based on minimizing a cost function $\mathcal{L}_\theta(Z_i)$ that measures the fit between a parameter $\theta \in \Omega$ and the sample $Z_i$. - We then obtain some estimate $\hat{\theta}$
that minimizes the empirical risk, defined as
$$\hat{R}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \mathcal{L}_\theta(Z_i).$$
One specific example is the maximum likelihood estimator, which corresponds to the cost function $\mathcal{L}_\theta(z) = -\log p_\theta(z)$.
- In contrast, the population risk is defined as
$$R(\theta) = \mathbb{E}\left[\mathcal{L}_\theta(Z)\right],$$
where the expectation is taken over a sample $Z \sim P_{\theta^*}$.
The question is how small the excess risk $R(\hat{\theta}) - \inf_{\theta \in \Omega} R(\theta)$ is, i.e. how the risk of the empirical risk minimizer compares to the best achievable population risk.
For simplicity, assume that there exists some $\theta_0 \in \Omega$ that attains the infimum, $R(\theta_0) = \inf_{\theta \in \Omega} R(\theta)$.
With this notation, the excess risk can be decomposed as
$$R(\hat{\theta}) - R(\theta_0) = \underbrace{R(\hat{\theta}) - \hat{R}_n(\hat{\theta})}_{T_1} + \underbrace{\hat{R}_n(\hat{\theta}) - \hat{R}_n(\theta_0)}_{T_2} + \underbrace{\hat{R}_n(\theta_0) - R(\theta_0)}_{T_3}.$$
By the definition of $\hat{\theta}$ as an empirical risk minimizer, $T_2 \le 0$. The term $T_1$ is controlled by a uniform law of large numbers over the cost functions: $T_1 \le \sup_{\theta \in \Omega} |\hat{R}_n(\theta) - R(\theta)|$.
Also, for the term $T_3$: it involves only the single fixed function $\mathcal{L}_{\theta_0}$, so the ordinary law of large numbers applies; it is also dominated by the same uniform quantity, giving
$$R(\hat{\theta}) - R(\theta_0) \le 2 \sup_{\theta \in \Omega} \left|\hat{R}_n(\theta) - R(\theta)\right|.$$
This shows that the central challenge in analyzing estimators based on empirical risk minimization is to establish a uniform law of large numbers for the loss class $\mathcal{F}(\Omega) = \{z \mapsto \mathcal{L}_\theta(z) : \theta \in \Omega\}$.
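The excess-risk decomposition can be checked numerically for a toy case (a hedged sketch, assuming squared loss $\mathcal{L}_\theta(z) = (z - \theta)^2$ and $Z \sim N(\mu, 1)$, so the empirical minimizer is the sample mean and the population minimizer is $\mu$; writing $T_1 = R(\hat{\theta}) - \hat{R}_n(\hat{\theta})$, $T_2 = \hat{R}_n(\hat{\theta}) - \hat{R}_n(\mu)$, $T_3 = \hat{R}_n(\mu) - R(\mu)$):

```python
import random

rng = random.Random(42)
mu, n = 2.0, 500
z = [rng.gauss(mu, 1.0) for _ in range(n)]

def emp_risk(theta):
    # empirical risk under squared loss
    return sum((zi - theta) ** 2 for zi in z) / n

def pop_risk(theta):
    # E[(Z - theta)^2] = (theta - mu)^2 + Var(Z) for Z ~ N(mu, 1)
    return (theta - mu) ** 2 + 1.0

theta_hat = sum(z) / n               # ERM solution under squared loss
T1 = pop_risk(theta_hat) - emp_risk(theta_hat)
T2 = emp_risk(theta_hat) - emp_risk(mu)   # <= 0: theta_hat minimizes emp_risk
T3 = emp_risk(mu) - pop_risk(mu)

excess = pop_risk(theta_hat) - pop_risk(mu)
print(T2 <= 0)                       # True
print(abs(excess - (T1 + T2 + T3)))  # ~0: the decomposition telescopes exactly
```

The decomposition is purely algebraic (it telescopes), which is why only $T_1$ and $T_3$, the empirical-versus-population gaps, require probabilistic control.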
2. A uniform law via Rademacher complexity
2.1. Definition of Rademacher complexity
- Empirical Rademacher complexity of the function class $\mathcal{F}$
For any fixed collection of points $x_1^n = (x_1, \ldots, x_n)$,
the empirical Rademacher complexity is given by
$$\mathcal{R}\left(\tfrac{\mathcal{F}(x_1^n)}{n}\right) = \mathbb{E}_\epsilon\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i) \right|\right],$$
where $\epsilon_1, \ldots, \epsilon_n$ are i.i.d. Rademacher variables, taking the values $\{-1, +1\}$ with equal probability.
- Rademacher complexity of the function class $\mathcal{F}$
By taking the expectation of the empirical Rademacher complexity over the samples,
$$\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_X\left[\mathcal{R}\left(\tfrac{\mathcal{F}(X_1^n)}{n}\right)\right].$$
From the form of the formula, note that the Rademacher complexity is the maximum correlation between the vector $(f(x_1), \ldots, f(x_n))$ and the pure noise vector $(\epsilon_1, \ldots, \epsilon_n)$, maximized over $f \in \mathcal{F}$.
In this sense, the Rademacher complexity measures the size of a function class, and it plays a key role in establishing uniform convergence results.
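For intuition, the empirical Rademacher complexity $\mathbb{E}_\epsilon[\sup_{f \in \mathcal{F}} |\frac{1}{n}\sum_i \epsilon_i f(x_i)|]$ can be estimated by Monte Carlo for the half-line indicator class $\mathcal{F} = \{x \mapsto \mathbf{1}\{x \le t\}\}$ (a sketch; since each indicator selects the $k$ smallest points, the supremum reduces to a maximum over partial sums of the signs, attached directly to the order statistics because the $\epsilon_i$ are i.i.d.):

```python
import random

def empirical_rademacher(xs, trials=2000, seed=0):
    # Monte Carlo estimate of E_eps[ sup_t | (1/n) sum_i eps_i * 1{x_i <= t} | ]
    n = len(xs)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        best, partial = 0.0, 0
        for e in eps:                 # partial sums over the k smallest points
            partial += e
            best = max(best, abs(partial) / n)
        total += best                 # sup includes the empty set (value 0)
    return total / trials

design = list(range(100))  # the actual point values do not matter here, only n
print(empirical_rademacher(design))  # roughly of order sqrt(log n / n)
```

The estimate decays with $n$, which is exactly what a Glivenko-Cantelli property for this class requires.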
2.2. Glivenko-Cantelli property
A function class $\mathcal{F}$ is $b$-uniformly bounded if $\|f\|_\infty \le b$ for all $f \in \mathcal{F}$.
- Theorem (Glivenko-Cantelli property). For any
$b$-uniformly bounded function class $\mathcal{F}$, any positive integer $n \ge 1$, and any scalar $\delta \ge 0$, we have
$$\|P_n - P\|_{\mathcal{F}} \le 2\mathcal{R}_n(\mathcal{F}) + \delta$$
with probability at least $1 - \exp\left(-\frac{n\delta^2}{2b^2}\right)$. Consequently, if $\mathcal{R}_n(\mathcal{F}) = o(1)$, then $\|P_n - P\|_{\mathcal{F}} \xrightarrow{a.s.} 0$.
proof)
- Borel-Cantelli lemma
Let $A_1, A_2, \ldots$ be a sequence of events. The
Borel-Cantelli lemma says that if $\sum_{n=1}^\infty P(A_n) < \infty$, then $P(A_n \text{ infinitely often}) = 0$.
For the above case, set $A_n = \{\|P_n - P\|_{\mathcal{F}} \ge 2\mathcal{R}_n(\mathcal{F}) + \delta\}$, so that $\sum_{n=1}^\infty P(A_n) \le \sum_{n=1}^\infty e^{-n\delta^2/(2b^2)} < \infty$.
Therefore, since $\mathcal{R}_n(\mathcal{F}) \to 0$, the tail bound implies $\|P_n - P\|_{\mathcal{F}} \xrightarrow{a.s.} 0$,
which leads to uniform convergence in the almost-sure sense, just as for the CDF case.
- Step 1) Concentration around the mean
View $\|P_n - P\|_{\mathcal{F}}$ as a function $G(X_1, \ldots, X_n)$; changing any single coordinate $x_i$ changes $G$ by at most $2b/n$, since the functions are $b$-uniformly bounded.
Then, by McDiarmid's inequality (the bounded differences inequality),
$$\|P_n - P\|_{\mathcal{F}} \le \mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] + \delta$$
with probability at least $1 - e^{-n\delta^2/(2b^2)}$.
- Step 2) Upper bound on the mean
A symmetrization argument shows $\mathbb{E}\left[\|P_n - P\|_{\mathcal{F}}\right] \le 2\mathcal{R}_n(\mathcal{F})$.
Combining the two steps, one gets the stated inequality.
3. Upper bounds on the Rademacher complexity
3.1. Polynomial discrimination
Definition (Polynomial discrimination)
A class $\mathcal{F}$ of functions with domain $\mathcal{X}$ has polynomial discrimination of order $\nu \ge 1$ if, for each positive integer $n$ and each collection of points $x_1^n = (x_1, \ldots, x_n)$, the set $\mathcal{F}(x_1^n) = \{(f(x_1), \ldots, f(x_n)) : f \in \mathcal{F}\} \subseteq \mathbb{R}^n$
has cardinality upper bounded as $|\mathcal{F}(x_1^n)| \le (n+1)^\nu$.
The significance of this property is that it provides a straightforward approach to controlling the empirical Rademacher complexity: the supremum over $\mathcal{F}$ reduces to a finite maximum over at most $(n+1)^\nu$ vectors.
3.2. Bound on the empirical Rademacher complexity
Suppose that $\mathcal{F}$ has polynomial discrimination of order $\nu$. Then for every positive integer $n$ and every collection of points $x_1^n = (x_1, \ldots, x_n)$,
$$\mathbb{E}_\epsilon\left[\sup_{f \in \mathcal{F}} \left|\frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i)\right|\right] \le 2 D(x_1^n) \sqrt{\frac{\nu \log(n+1)}{n}},$$
where $D(x_1^n) = \sup_{f \in \mathcal{F}} \sqrt{\frac{1}{n}\sum_{i=1}^n f^2(x_i)}$ is the $\ell_2$-radius of the set $\mathcal{F}(x_1^n)/\sqrt{n}$.
proof)
For any fixed vector $a = (f(x_1), \ldots, f(x_n)) \in \mathcal{F}(x_1^n)$, the random variable $\frac{1}{n}\sum_{i=1}^n \epsilon_i a_i$ is sub-Gaussian with parameter at most $\frac{\|a\|_2}{n} \le \frac{D(x_1^n)}{\sqrt{n}}$.
Therefore, the empirical Rademacher complexity is upper bounded by the expected supremum of at most $(n+1)^\nu$ such sub-Gaussian variables, to which the following maximal inequality applies.
- Maximal inequality used here
Let $X_1, \ldots, X_N$ be (not necessarily independent) sub-Gaussian random variables with parameter $\sigma$.
Then, $\mathbb{E}\left[\max_{i=1,\ldots,N} |X_i|\right] \le \sigma\sqrt{2\log(2N)}$; applying this with $N = (n+1)^\nu$ and $\sigma = D(x_1^n)/\sqrt{n}$, and using $\log 2 \le \log(n+1)$, yields the claimed bound.
3.3. Glivenko-Cantelli theorem
Corollary (Glivenko-Cantelli)
Let $F$ be any CDF on $\mathbb{R}$. Then for any $\delta \ge 0$,
$$\|\hat{F}_n - F\|_\infty \le 2\sqrt{\frac{\log(n+1)}{n}} + \delta$$
with probability at least $1 - e^{-n\delta^2/2}$; in particular, $\|\hat{F}_n - F\|_\infty \xrightarrow{a.s.} 0$.
proof)
Because, for a given sample $(x_1, \ldots, x_n)$, the half-line indicator class $\mathcal{F} = \{\mathbf{1}_{(-\infty, t]} : t \in \mathbb{R}\}$ generates at most $n+1$ distinct vectors $(f(x_1), \ldots, f(x_n))$, it has polynomial discrimination of order $1$; moreover $D(x_1^n) \le 1$ and $b = 1$, so combining the Rademacher bound with the Glivenko-Cantelli property theorem gives the result.
3.4. Vapnik-Chervonenkis dimension
Above, we saw that polynomial discrimination is useful for finding upper bounds. One way to verify that a given function class has polynomial discrimination is via the theory of VC dimension. Consider a function class of indicators: each class $\mathcal{S}$ of subsets of $\mathcal{X}$ corresponds to the function class $\{\mathbf{1}_S : S \in \mathcal{S}\}$.
Definition (VC dimension) Given a class $\mathcal{S}$ of subsets of $\mathcal{X}$, a finite point set $\{x_1, \ldots, x_n\}$ is shattered by $\mathcal{S}$ if all $2^n$ subsets of $\{x_1, \ldots, x_n\}$ can be written as $S \cap \{x_1, \ldots, x_n\}$ for some $S \in \mathcal{S}$. The VC dimension $\nu(\mathcal{S})$ is the largest $n$ for which some point set of size $n$ is shattered by $\mathcal{S}$.
- Example: two-sided intervals in $\mathbb{R}$
Consider the collection of all two-sided intervals over the real line, namely $\mathcal{S}_{\text{two}} = \{(b, a] : -\infty \le b < a < \infty\}$. Any two distinct points can be shattered, but no three points $x_1 < x_2 < x_3$ can be: no interval picks out $\{x_1, x_3\}$ without also containing $x_2$. Hence $\nu(\mathcal{S}_{\text{two}}) = 2$.
- Sauer's lemma
Consider a set class $\mathcal{S}$ with VC dimension $\nu < \infty$. Then for every point collection $\{x_1, \ldots, x_n\}$,
$$\left|\left\{ S \cap \{x_1, \ldots, x_n\} : S \in \mathcal{S} \right\}\right| \le \sum_{i=0}^{\nu} \binom{n}{i} \le (n+1)^\nu,$$
so the associated indicator class has polynomial discrimination of order $\nu$.
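The interval example can be verified by brute force (a sketch; an interval $(b, a]$ picks out exactly a consecutive run of a sorted point set, plus the empty set):

```python
# Count the distinct subsets that two-sided intervals (b, a] can pick out
# of a point set. Two points are shattered (4 = 2^2 subsets); three are not
# (7 < 2^3, since {x1, x3} without x2 is unreachable). So the VC dim is 2.

def picked_subsets(points):
    pts = sorted(points)
    subsets = {frozenset()}                       # interval missing all points
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            subsets.add(frozenset(pts[i:j + 1]))  # consecutive run pts[i..j]
    return subsets

print(len(picked_subsets([0.0, 1.0])))       # 4, so two points are shattered
print(len(picked_subsets([0.0, 1.0, 2.0]))) # 7, matching Sauer: C(3,0)+C(3,1)+C(3,2)
```

Note that $7 = \binom{3}{0} + \binom{3}{1} + \binom{3}{2}$, so Sauer's lemma is tight for this class at $n = 3$.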
3.5. Multivariate version of Glivenko-Cantelli
Theorem
Let $F$ be the CDF of a random vector $X \in \mathbb{R}^d$, and let $\hat{F}_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \preceq t\}$ be the empirical CDF,
where for vectors $x, t \in \mathbb{R}^d$, $x \preceq t$ means $x_j \le t_j$ for all $j = 1, \ldots, d$.
Then, for any $\delta \ge 0$,
$$\|\hat{F}_n - F\|_\infty \le c\sqrt{\frac{d \log(n+1)}{n}} + \delta$$
with probability at least $1 - e^{-n\delta^2/2}$, where $c$ is a universal constant.
proof)
At first, the class of indicator functions here is characterized by the class of sets $\mathcal{S} = \{(-\infty, t_1] \times \cdots \times (-\infty, t_d] : t \in \mathbb{R}^d\}$, the lower-left orthants.
This class shatters the coordinate vectors $e_1, \ldots, e_d$, but no set of $d+1$ points: among any $d+1$ points, some point $y$ satisfies $y_j \le \max_{x \ne y} x_j$ for every coordinate $j$, so any orthant containing the other $d$ points must also contain $y$, and the subset "all but $y$" cannot be picked out,
which means that $\nu(\mathcal{S}) = d$.
Because the class $\mathcal{S}$ has VC dimension $d$, Sauer's lemma gives polynomial discrimination of order $d$, and applying the Rademacher bound together with the Glivenko-Cantelli property theorem yields the claim
with probability at least $1 - e^{-n\delta^2/2}$.
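A small $d = 2$ sketch of the statement (assuming independent $U(0,1)$ coordinates, so $F(t) = t_1 t_2$; the constant $c = 4$ below is an assumption for illustration, not the theorem's exact constant):

```python
import math, random

rng = random.Random(0)
n, d = 100, 2
pts = [(rng.random(), rng.random()) for _ in range(n)]

def F_hat(t1, t2):
    # empirical CDF: fraction of sample points coordinatewise <= (t1, t2)
    return sum(x <= t1 and y <= t2 for x, y in pts) / n

# the sup over t is approximated on the grid induced by observed coordinates
grid1 = sorted({x for x, _ in pts} | {0.0, 1.0})
grid2 = sorted({y for _, y in pts} | {0.0, 1.0})
dev = max(abs(F_hat(t1, t2) - t1 * t2) for t1 in grid1 for t2 in grid2)

bound = 4 * math.sqrt(d * math.log(n + 1) / n)  # c = 4 is an assumed constant
print(round(dev, 4), round(bound, 4), dev < bound)
```

As in the one-dimensional case, the observed deviation is far below the non-asymptotic bound, which only needs to hold uniformly over all distributions.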