[High Dimensional Statistics] Uniform Laws of Large Numbers


References

  • Yonsei University STA9104: Statistical Theory for High Dimensional and Big Data
  • Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.

Uniform laws have been investigated as part of empirical process theory and have many applications in statistics. In contrast to the usual law of large numbers, which applies to a fixed sequence of random variables, uniform laws of large numbers provide a guarantee that holds uniformly over collections of random variables. The goal of this post is to introduce tools for studying uniform laws, with an emphasis on non-asymptotic results.


Summary about convergence

- Pointwise convergence

Let $(N, d)$ be a metric space, and consider a sequence of functions $f_n : A \to N$, $n = 1, 2, \dots$, together with $f : A \to N$. The sequence is said to converge pointwise to $f$ if $f_n(x) \to f(x)$ for each $x \in A$; i.e., for each $x \in A$ and $\epsilon > 0$, there exists $L = L(x, \epsilon)$ such that $d(f_n(x), f(x)) \le \epsilon$ for all $n \ge L$.

- Uniform convergence

Let $(N, d)$ be a metric space, and consider a sequence of functions $f_n : A \to N$, $n = 1, 2, \dots$, together with $f : A \to N$. The sequence is said to converge uniformly to $f$ if, for each $\epsilon > 0$, there exists $L = L(\epsilon)$, independent of $x$, such that $d(f_n(x), f(x)) \le \epsilon$ for all $x \in A$ and all $n \ge L$, which is the same as

$$\sup_{x \in A} d(f_n(x), f(x)) \le \epsilon, \quad \forall n \ge L.$$

- Some relations
1. The derivatives of a pointwise convergent sequence of functions do not have to converge.

Consider $A = \mathbb{R}$ and $f_n(x) = \frac{1}{n}\sin(n^2 x)$. Then

$$\lim_{n \to \infty} f_n(x) = 0,$$

so the pointwise limit function is $f(x) = 0$; the sequence of functions converges to 0. What about the derivatives of the sequence? We have

$$f_n'(x) = n\cos(n^2 x),$$

and for most $x \in \mathbb{R}$ the sequence of derivatives is unbounded in $n$, which means that it does not converge.
2. The integrals of a pointwise convergent sequence of functions do not have to converge.

Consider $A = [0, 1]$ and $f_n(x) = \frac{2n^2 x}{(1 + n^2 x^2)^2}$. Then

$$\lim_{n \to \infty} f_n(x) = 0.$$

However, the integrals are

$$\int_0^1 \frac{2n^2 x}{(1 + n^2 x^2)^2}\, dx \overset{u = 1 + n^2 x^2}{=} \int_1^{1 + n^2} \frac{du}{u^2} = 1 - \frac{1}{1 + n^2}.$$

Therefore, even though $\lim_{n \to \infty} f_n(x) = 0$ for all $x \in A$, the integral tends to 1 as $n \to \infty$.
3. The limit of a pointwise convergent sequence of continuous functions does not have to be continuous.

Consider $A = [0, 1]$ and $f_n(x) = x^n$. Then

$$f_n(x) \overset{n \to \infty}{\longrightarrow} f(x) = \begin{cases} 0 & (0 \le x < 1) \\ 1 & (x = 1) \end{cases}$$

This satisfies pointwise convergence, but the limit is not continuous.
4. Uniform convergence implies pointwise convergence, but not the other way around.

Consider the same example as above. If $f_n(x) = x^n$ converged uniformly, then the limit function would have to be the pointwise limit, $f(x) = 0$ for $x \in [0, 1)$ and $f(1) = 1$. Uniform convergence implies that for any $\epsilon > 0$ there is $N_\epsilon \in \mathbb{N}$ such that $|x^n - f(x)| < \epsilon$ for all $x \in [0, 1]$ and all $n \ge N_\epsilon$. Consider $\epsilon = \frac{1}{2}$: there would be $N$ such that $|x^n - f(x)| < \frac{1}{2}$ for all $n \ge N$ and all $x$. But if we choose $n = N$ and $x = (3/4)^{1/N}$, then $x < 1$, so $f(x) = 0$, and thus

$$|f_N(x) - f(x)| = x^N = \frac{3}{4} > \frac{1}{2},$$

contradicting our assumption. A numerical sketch of this failure is given below.
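The following is a minimal sketch (plain Python, levels and witness points chosen by me, mirroring the argument above with level $1/2$ instead of $3/4$) that evaluates $|f_n - f|$ at the moving witness point $x^\ast = (1/2)^{1/n}$:

```python
# f_n(x) = x^n on A = [0, 1]: pointwise limit f(x) = 0 for x < 1, f(1) = 1.
# For each n, pick the witness x* = (1/2)^(1/n), so that x* < 1 but (x*)^n = 1/2.
for n in [1, 10, 100, 1000]:
    x_star = 0.5 ** (1.0 / n)
    deviation = abs(x_star ** n - 0.0)   # f(x*) = 0 since x* < 1
    print(f"n = {n:4d}: x* = {x_star:.6f}, |f_n(x*) - f(x*)| = {deviation:.3f}")

# The deviation is 1/2 for every n, so sup_{x in [0,1]} |f_n(x) - f(x)| >= 1/2
# no matter how large n is: pointwise convergence holds, uniform convergence fails.
```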

1. Motivation


1.1. Modes of probabilistic convergence

Before starting the main discussion, I will talk about two modes of probabilistic convergence: almost sure convergence and convergence in probability.


1.1.1. Convergence in probability

For $X_n$ and $X$ on $(\Omega, \mathcal{F}, P)$, a sequence of random variables converges in probability to a random variable $X$, denoted by $X_n \overset{P}{\to} X$, if

$$\lim_{n \to \infty} P\left(s \in S : |X_n(s) - X(s)| < \epsilon\right) = 1, \quad \forall \epsilon > 0,$$

where $S$ is the underlying sample space.

1.1.2. Almost sure convergence

For $X_n$ and $X$ on $(\Omega, \mathcal{F}, P)$, a sequence of random variables converges almost surely to a random variable $X$, denoted by $X_n \overset{a.s.}{\to} X$, if

$$P\left(s \in S : \lim_{n \to \infty} |X_n(s) - X(s)| < \epsilon,\ \forall \epsilon > 0\right) = 1,$$

where $S$ is the underlying sample space. A convenient sufficient condition is

$$\sum_{n=1}^{\infty} P\left(|X_n - X| > \epsilon\right) < \infty, \quad \forall \epsilon > 0,$$

which guarantees that the event $\{|X_n(s) - X(s)| > \epsilon\}$ occurs for only finitely many $n$, almost surely.

Here, the sufficiency of the summability condition follows from the Borel-Cantelli lemma.


1.1.3. Comparison of concepts

To understand the difference, the law of large numbers provides a useful example.

For the explanation here, I referred to random variable - Convergence in probability vs. almost sure convergence - Cross Validated (stackexchange.com).


The weak law of large numbers, which concerns convergence in probability of $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, says that a large proportion of the sample paths will be inside the bands on the right-hand side at time $n$. It only means that the probability that a particular path lies inside the bands at time $n$ approaches one; the path may still leave the bands at later times.

[Figure: sample paths of $\bar{X}_n$, a large fraction of which lie inside the bands at time $n$ (WLLN)]

In the case of the strong law of large numbers, however, almost every sample path eventually stays inside the bands forever.

[Figure: sample paths of $\bar{X}_n$ that remain inside the bands for all $m \ge n$ (SLLN)]

The difference can also be seen from alternative representations of the definitions.

SLLN

$$P\left(s \in S : \lim_{n \to \infty} |X_n(s) - X(s)| < \epsilon\right) = 1, \ \forall \epsilon > 0 \iff \lim_{n \to \infty} P\left(s \in S : \sup_{m \ge n} |X_m(s) - X(s)| > \epsilon\right) = 0, \ \forall \epsilon > 0$$

WLLN

$$\lim_{n \to \infty} P\left(s \in S : |X_n(s) - X(s)| > \epsilon\right) = 0, \quad \forall \epsilon > 0$$

Here, in the case of the law of large numbers, $X_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ and the limit $X$ is the constant $\mathbb{E}[X_1]$. The contrast is visible in the SLLN representation: it controls the supremum of the deviations over all $m \ge n$ at once, not just the deviation at the single time $n$. A small Monte Carlo sketch of this contrast follows.
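The following simulation is a sketch with arbitrary choices of mine (Uniform(0, 1) draws, $\epsilon = 0.05$, and a finite horizon standing in for "all $m \ge n$"); it estimates the probability of the single-time event against the probability of the supremum event:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, horizon, eps = 2000, 5000, 0.05

# Sample paths of the running mean of i.i.d. Uniform(0, 1) draws (so E[X_1] = 0.5).
draws = rng.uniform(0.0, 1.0, size=(n_paths, horizon))
running_mean = np.cumsum(draws, axis=1) / np.arange(1, horizon + 1)
dev = np.abs(running_mean - 0.5)

for n in [100, 1000, 4000]:
    p_now = np.mean(dev[:, n - 1] < eps)                       # WLLN-type event at time n
    p_forever = np.mean(np.all(dev[:, n - 1:] < eps, axis=1))  # SLLN-type event, sup over m >= n
    print(f"n = {n:4d}: P(|mean_n - mu| < eps) ~ {p_now:.3f}, "
          f"P(sup_(m>=n) |mean_m - mu| < eps) ~ {p_forever:.3f}")

# Both probabilities tend to one, but the SLLN-type event (staying in the band for
# all later m, up to the simulation horizon) is the more demanding one.
```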


1.2. Cumulative distribution functions

Let $X$ be a random variable with cumulative distribution function $F(t) = P(X \le t)$, $t \in \mathbb{R}$. A natural estimate of $F$ is the empirical CDF given by

$$\widehat{F}_n(t) = \frac{1}{n}\sum_{i=1}^{n} I_{(-\infty, t]}(X_i).$$

Here, the empirical CDF is an unbiased estimator of the population CDF because

$$\mathbb{E}\left[I_{(-\infty, t]}(X)\right] = P(X \le t) \times 1 + P(X > t) \times 0 = F(t).$$


Moreover, uniform convergence of the empirical CDF to $F$ is guaranteed by the Glivenko-Cantelli theorem, which we revisit below.


Uniform convergence is important because many estimation problems start from a functional of the CDF. For example:

  • Expectation functionals

$$\gamma_g(F) = \int g(x)\, dF(x),$$

which is estimated by the plug-in estimator $\gamma_g(\widehat{F}_n) = \frac{1}{n}\sum_{i=1}^{n} g(X_i)$.

  • Quantile functionals

$$Q_\alpha(F) = \inf\{t \in \mathbb{R} : F(t) \ge \alpha\},$$

which is estimated by the plug-in estimator $Q_\alpha(\widehat{F}_n) = \inf\{t \in \mathbb{R} : \widehat{F}_n(t) \ge \alpha\}$.


The above estimators are justified by continuity of the functional. That is, once we know that $\lVert \widehat{F}_n - F \rVert_\infty \to 0$ in probability, we also have $|\gamma(\widehat{F}_n) - \gamma(F)| \to 0$ in probability for any functional $\gamma$ that is continuous with respect to the sup-norm $\lVert F - G \rVert_\infty = \sup_{t \in \mathbb{R}} |F(t) - G(t)|$.


Continuous with respect to the sup-norm

We say that the functional $\gamma$ is continuous at $F$ in the sup-norm if, for all $\epsilon > 0$, there exists a $\delta > 0$ such that $\lVert F - G \rVert_\infty \le \delta$ implies $|\gamma(F) - \gamma(G)| \le \epsilon$.
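As a concrete illustration, here is a small sketch (assuming numpy; the choices of $P = N(0, 1)$, $g(x) = x^2$, and $\alpha = 0.9$ are mine) of the empirical CDF and the two plug-in estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=0.0, scale=1.0, size=2000)    # i.i.d. sample from P = N(0, 1)

def ecdf(sample, t):
    """Empirical CDF: F_hat_n(t) = (1/n) * sum_i I(X_i <= t)."""
    return np.mean(sample[:, None] <= np.atleast_1d(t), axis=0)

# Plug-in expectation functional gamma_g(F_hat_n) = (1/n) sum_i g(X_i), with g(x) = x^2.
gamma_hat = np.mean(X ** 2)                       # population value: E[X^2] = 1

# Plug-in quantile functional Q_alpha(F_hat_n) = inf{t : F_hat_n(t) >= alpha},
# which equals the ceil(alpha * n)-th order statistic.
alpha = 0.9
q_hat = np.sort(X)[int(np.ceil(alpha * len(X))) - 1]

print(f"F_hat(0)       = {ecdf(X, 0.0)[0]:.3f} (population value 0.5)")
print(f"gamma_g(F_hat) = {gamma_hat:.3f} (population value 1.0)")
print(f"Q_0.9(F_hat)   = {q_hat:.3f} (population value ~1.282)")
```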


1.3. Uniform laws for more general function classes

Let $\mathcal{F}$ be a class of integrable real-valued functions with domain $\mathcal{X}$, and let $\{X_i\}_{i=1}^{n}$ be a collection of i.i.d. samples from some distribution $\mathbb{P}$ over $\mathcal{X}$. Consider the random variable

$$\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}} = \sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} f(X_i) - \mathbb{E}[f(X)] \right|,$$

which measures the absolute deviation between the sample average and the population average, uniformly over the class $\mathcal{F}$. We say that $\mathcal{F}$ is a Glivenko-Cantelli class for $\mathbb{P}$ if $\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}}$ converges to zero in probability as $n \to \infty$.

  • Empirical CDFs

$$\mathcal{F} = \{I_{(-\infty, t]} : t \in \mathbb{R}\}$$

By the Glivenko-Cantelli theorem, this class satisfies the definition above, so $\mathcal{F}$ is a Glivenko-Cantelli class.

  • Failure of a uniform law

Not all classes of functions are Glivenko-Cantelli. Let $\mathcal{S}$ be the class of all subsets $S$ of $[0, 1]$ with a finite number of elements, and consider the function class $\mathcal{F}_S = \{I_S(\cdot) : S \in \mathcal{S}\}$ of indicator functions of such sets. Suppose that samples are drawn from a continuous distribution over $[0, 1]$, so that $\mathbb{P}(S) = 0$ for any $S \in \mathcal{S}$. On the other hand, for any positive integer $n \in \mathbb{N}$, the discrete set $S_n = \{X_1, \dots, X_n\}$ belongs to $\mathcal{S}$, and moreover, by definition of the empirical distribution, we have

$$\frac{1}{n}\sum_{i=1}^{n} I_{S_n}(X_i) = 1.$$

Therefore, we conclude that

$$\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}_S} \ge |1 - 0| = 1 \quad \text{for every positive integer } n,$$

and thereby $\mathcal{F}_S$ is not a Glivenko-Cantelli class for continuous distributions.
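A direct simulation of this failure (a minimal sketch; Uniform$[0, 1]$ is my choice of continuous distribution):

```python
import numpy as np

rng = np.random.default_rng(2)
# P = Uniform(0, 1) is continuous, so P(S) = 0 for every finite set S.
for n in [10, 100, 1000]:
    X = rng.uniform(0.0, 1.0, size=n)
    S_n = set(X.tolist())                          # the finite set S_n = {X_1, ..., X_n}
    emp = np.mean([x in S_n for x in X.tolist()])  # (1/n) sum_i I_{S_n}(X_i) = 1 always
    pop = 0.0                                      # E[I_{S_n}(X)] = P(S_n) = 0
    print(f"n = {n:4d}: sup-deviation over F_S >= |{emp:.0f} - {pop:.0f}| = {emp - pop:.0f}")
```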


1.4. Empirical risk minimization

  • Suppose that samples are drawn i.i.d. according to a distribution $P_{\theta^*}$ for some fixed but unknown $\theta^* \in \Omega$.
  • A standard approach to estimating $\theta^*$ is based on minimizing a cost function $L_\theta(X)$ that measures the fit between a parameter $\theta \in \Omega$ and the sample $X \in \mathcal{X}$.
  • We then obtain an estimate $\widehat{\theta}$ that minimizes the empirical risk, defined as

$$\widehat{R}_n(\theta, \theta^*) = \frac{1}{n}\sum_{i=1}^{n} L_\theta(X_i).$$

One specific example is the maximum likelihood estimator, which corresponds to the cost function $L_\theta(x) = -\log p_\theta(x)$.


  • In contrast, the population risk is defined as

$$R(\theta, \theta^*) = \mathbb{E}_{\theta^*}[L_\theta(X)],$$

where the expectation is taken over a sample $X \sim P_{\theta^*}$. The statistical question of interest is how to bound the excess risk

$$\mathcal{E}(\widehat{\theta}, \theta^*) = R(\widehat{\theta}, \theta^*) - \inf_{\theta \in \Omega} R(\theta, \theta^*),$$

which measures how much larger the population risk of the empirical risk minimizer $\widehat{\theta}$ is than the best population risk achievable over $\Omega$.


For simplicity, assume that there exists some $\theta_0 \in \Omega$ such that $R(\theta_0, \theta^*) = \inf_{\theta \in \Omega} R(\theta, \theta^*)$.

With this notation, the excess risk can be decomposed as

$$\mathcal{E}(\widehat{\theta}, \theta^*) = \underbrace{\{R(\widehat{\theta}, \theta^*) - \widehat{R}_n(\widehat{\theta}, \theta^*)\}}_{T_1} + \underbrace{\{\widehat{R}_n(\widehat{\theta}, \theta^*) - \widehat{R}_n(\theta_0, \theta^*)\}}_{T_2} + \underbrace{\{\widehat{R}_n(\theta_0, \theta^*) - R(\theta_0, \theta^*)\}}_{T_3}.$$

By the definition of $\widehat{\theta}$ as the empirical risk minimizer, $T_2 \le 0$. For $T_1$,

$$T_1 = \mathbb{E}_X[L_{\widehat{\theta}}(X)] - \frac{1}{n}\sum_{i=1}^{n} L_{\widehat{\theta}}(X_i),$$

and since $\widehat{\theta}$ is random, a uniform law of large numbers over the cost function class $\mathcal{F}(\Omega) = \{x \mapsto L_\theta(x) : \theta \in \Omega\}$ is required. With this notation,

$$T_1 \le \sup_{\theta \in \Omega} \left| \frac{1}{n}\sum_{i=1}^{n} L_\theta(X_i) - \mathbb{E}_X[L_\theta(X)] \right| = \lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}(\Omega)}.$$

The term $T_3$ is dominated by the same quantity, since

$$T_3 = \frac{1}{n}\sum_{i=1}^{n} L_{\theta_0}(X_i) - \mathbb{E}_X[L_{\theta_0}(X)] \le \lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}(\Omega)}.$$

Putting the pieces together,

$$\mathcal{E}(\widehat{\theta}, \theta^*) \le 2\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}(\Omega)}.$$

This shows that the central challenge in analyzing estimators based on empirical risk minimization is to establish a uniform law of large numbers for the loss class $\mathcal{F}(\Omega)$.
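To make the bound concrete, here is a sketch with an illustrative setup of my own choosing: squared-error cost $L_\theta(x) = (x - \theta)^2$, $P_{\theta^*} = N(\theta^*, 1)$, and a finite grid over $\Omega = [-3, 3]$ standing in for the supremum:

```python
import numpy as np

rng = np.random.default_rng(3)
theta_star, n = 1.0, 200
X = rng.normal(theta_star, 1.0, size=n)           # sample from P_theta* = N(theta*, 1)

# Grid over Omega = [-3, 3] standing in for the supremum over theta.
Omega = np.linspace(-3.0, 3.0, 601)
emp_risk = np.mean((X[:, None] - Omega[None, :]) ** 2, axis=0)  # R_hat_n(theta)
pop_risk = 1.0 + (Omega - theta_star) ** 2                      # R(theta) = 1 + (theta - theta*)^2

theta_hat = Omega[np.argmin(emp_risk)]            # empirical risk minimizer
excess = (1.0 + (theta_hat - theta_star) ** 2) - np.min(pop_risk)
uniform_dev = np.max(np.abs(emp_risk - pop_risk)) # ||P_n - P|| over the loss class (on the grid)

print(f"theta_hat = {theta_hat:.2f}, excess risk = {excess:.4f}")
print(f"2 * ||P_n - P||_F(Omega) = {2 * uniform_dev:.4f}  (upper bound on the excess risk)")
```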



2. A uniform law via Rademacher complexity

2.1. Definition of Rademacher complexity

  • Empirical Rademacher complexity of the function class $\mathcal{F}$

For any fixed collection $x_1^n = (x_1, \dots, x_n)$, consider the subset of $\mathbb{R}^n$ given by

$$\mathcal{F}(x_1^n) = \{(f(x_1), f(x_2), \dots, f(x_n)) : f \in \mathcal{F}\}.$$

The empirical Rademacher complexity is given by

$$\mathcal{R}(\mathcal{F}(x_1^n)/n) = \mathbb{E}_\epsilon\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \epsilon_i f(x_i) \right|\right],$$

where $\epsilon_1, \dots, \epsilon_n$ are i.i.d. Rademacher random variables, taking the values $\{-1, +1\}$ with equal probability.


  • Rademacher complexity of the function class $\mathcal{F}$

Taking a further expectation over the samples in the empirical Rademacher complexity,

$$\mathcal{R}_n(\mathcal{F}) = \mathbb{E}_{X, \epsilon}\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \epsilon_i f(X_i) \right|\right].$$

From the form of the formula, note that the Rademacher complexity is the maximum correlation between the vector $(f(X_1), f(X_2), \dots, f(X_n))$ and the noise vector $(\epsilon_1, \dots, \epsilon_n)$. If a function class is extremely large, then we can always find a function that has a high correlation with a randomly drawn noise vector. Conversely, if the function class is small, the correlation will be small.

In this sense, the Rademacher complexity measures the size of a function class, and it plays a key role in establishing uniform convergence results. A Monte Carlo sketch for a simple class follows.
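For the threshold class $\mathcal{F} = \{I_{(-\infty, t]} : t \in \mathbb{R}\}$, the realized set $\mathcal{F}(x_1^n)$ contains only the $n + 1$ "prefix" patterns of the ordered sample, so the supremum can be enumerated exactly. A sketch assuming numpy (sample size, seed, and Monte Carlo budget are arbitrary; the comparison value anticipates the bound derived in Section 3.2 below):

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_mc = 50, 20_000

# For F = {I_(-inf, t] : t in R}, the realized set F(x_1^n) consists of the n + 1
# "prefix" vectors (1, ..., 1, 0, ..., 0) of the ordered sample, whatever x_1^n is.
prefixes = (np.arange(n)[None, :] < np.arange(n + 1)[:, None]).astype(float)

# Monte Carlo estimate of E_eps[ sup_f | (1/n) sum_i eps_i f(x_i) | ].
eps = rng.choice([-1.0, 1.0], size=(n_mc, n))      # i.i.d. Rademacher signs
sup_abs = np.max(np.abs(eps @ prefixes.T) / n, axis=1)

print(f"empirical Rademacher complexity ~ {sup_abs.mean():.4f}")
print(f"bound 2 * sqrt(log(n + 1) / n)  = {2 * np.sqrt(np.log(n + 1) / n):.4f}")
```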

2.2. Glivenko-Cantelli property

  • Definition) A function class $\mathcal{F}$ is $b$-uniformly bounded if $\lVert f \rVert_\infty \le b$ for all $f \in \mathcal{F}$.


  • Theorem (Glivenko-Cantelli property) For any $b$-uniformly bounded function class $\mathcal{F}$, any positive integer $n \ge 1$ and any scalar $\delta \ge 0$, we have

$$\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}} \le 2\mathcal{R}_n(\mathcal{F}) + \delta$$

with $\mathbb{P}$-probability at least $1 - \exp\left(-\frac{n\delta^2}{2b^2}\right)$. Consequently, as long as $\mathcal{R}_n(\mathcal{F}) = o(1)$, we have $\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}} \overset{a.s.}{\to} 0$.


proof)

  • Borel-Cantelli lemma

Let $\epsilon > 0$ and define

$$A_n(\epsilon) = \{s \in S : |X_n(s) - X(s)| \ge \epsilon\}.$$

The Borel-Cantelli lemma says that if $\sum_{n=1}^{\infty} P(A_n(\epsilon)) < \infty$ for every $\epsilon > 0$, then $X_n \overset{a.s.}{\to} X$.

For the case above,

$$P(A_n) = P\left(\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}} \ge 2\mathcal{R}_n(\mathcal{F}) + \delta\right) \le \exp\left(-\frac{n\delta^2}{2b^2}\right).$$

Therefore, since $\sum_{n=1}^{\infty} \exp\left(-\frac{n\delta^2}{2b^2}\right) < \infty$, the Borel-Cantelli lemma guarantees $\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}} \overset{a.s.}{\to} 0$ once the high-probability statement holds. Accordingly, the remainder of the argument is devoted to proving the inequality

$$\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}} \le 2\mathcal{R}_n(\mathcal{F}) + \delta,$$

which is the tail bound stated in the theorem.


  • Step 1) Concentration around the mean

Write $G(X_1, \dots, X_n) := \lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}}$. To obtain concentration around the mean, I will show that $G$ satisfies the bounded difference condition with $\max_k L_k \le \frac{2b}{n}$.

Consider two samples that differ only in coordinate $k$, say $(X_1, \dots, X_k, \dots, X_n)$ and $(X_1, \dots, Y_k, \dots, X_n)$. The two empirical averages inside the supremum differ only through the $k$-th term, so taking the supremum on each side gives

$$G(X_1, \dots, X_k, \dots, X_n) - G(X_1, \dots, Y_k, \dots, X_n) \le \sup_{f \in \mathcal{F}} \frac{|f(X_k) - f(Y_k)|}{n} \overset{b\text{-uniformly bounded}}{\le} \frac{2b}{n},$$

and the same bound holds with the roles of $X_k$ and $Y_k$ exchanged, so that

$$\left|G(X_1, \dots, X_k, \dots, X_n) - G(X_1, \dots, Y_k, \dots, X_n)\right| \le \frac{2b}{n}.$$

Then, by McDiarmid's inequality,

$$\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}} \le \mathbb{E}\left[\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}}\right] + t$$

with probability at least $1 - \exp\left(-\frac{nt^2}{2b^2}\right)$.


  • Step 2) Upper bound on the mean

Let $Y_1, \dots, Y_n$ be an independent copy of the sample (a symmetrization argument). Then

$$\begin{aligned}
\mathbb{E}\left[\lVert \mathbb{P}_n - \mathbb{P} \rVert_{\mathcal{F}}\right] &= \mathbb{E}_X\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \{f(X_i) - \mathbb{E}_{Y_i}[f(Y_i)]\} \right|\right] = \mathbb{E}_X\left[\sup_{f \in \mathcal{F}} \left| \mathbb{E}_Y\left[\frac{1}{n}\sum_{i=1}^{n} \{f(X_i) - f(Y_i)\}\right] \right|\right] \\
&\le \mathbb{E}_{X, Y}\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \{f(X_i) - f(Y_i)\} \right|\right] = \mathbb{E}_{X, Y, \epsilon}\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \epsilon_i\{f(X_i) - f(Y_i)\} \right|\right] \\
&\le 2\,\mathbb{E}_{X, \epsilon}\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \epsilon_i f(X_i) \right|\right] = 2\mathcal{R}_n(\mathcal{F}),
\end{aligned}$$

where the first inequality is Jensen's inequality (the supremum of an expectation is at most the expectation of the supremum), and the sign variables $\epsilon_i$ can be introduced because $f(X_i) - f(Y_i)$ is symmetric about zero.

Combining the two steps with $t = \delta$ yields the stated inequality.
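As a numerical sanity check of Step 2, here is a sketch for the CDF class with $P = \text{Uniform}(0, 1)$ (my choice); for this class the realized vectors depend only on the sample ordering, so $\mathcal{R}_n$ can be estimated without redrawing $X$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, n_mc = 100, 2000
i = np.arange(1, n + 1)

# Left side: E||P_n - P||_F for the CDF class and P = Uniform(0, 1); the supremum
# over t is the Kolmogorov-Smirnov statistic, computable exactly from order statistics.
ks = np.empty(n_mc)
for k in range(n_mc):
    U = np.sort(rng.uniform(size=n))
    ks[k] = np.max(np.maximum(i / n - U, U - (i - 1) / n))

# Right side: 2 * R_n(F). For threshold indicators the realized vectors are the
# n + 1 prefix patterns regardless of the sample, so no redraw of X is needed.
prefixes = (np.arange(n)[None, :] < np.arange(n + 1)[:, None]).astype(float)
eps = rng.choice([-1.0, 1.0], size=(n_mc, n))
rad = np.max(np.abs(eps @ prefixes.T) / n, axis=1).mean()

print(f"E||P_n - P||_F ~ {ks.mean():.4f} <= 2 * R_n(F) ~ {2 * rad:.4f}")
```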


3. Upper bounds on the Rademacher complexity

3.1. Polynomial discrimination

Definition (Polynomial discrimination)

A class $\mathcal{F}$ of functions with domain $\mathcal{X}$ has polynomial discrimination of order $\nu \ge 1$ if, for each positive integer $n$ and collection $x_1^n = (x_1, \dots, x_n)$, the set

$$\mathcal{F}(x_1^n) = \{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\}$$

has cardinality upper bounded as

$$\mathrm{card}(\mathcal{F}(x_1^n)) \le (n+1)^\nu.$$

The significance of this property is that it provides a straightforward approach to controlling the empirical Rademacher complexity.

3.2. Bound on the empirical Rademacher complexity

Suppose that $\mathcal{F}$ has polynomial discrimination of order $\nu$. Then, for all positive integers $n$ and any collection of points $x_1^n = (x_1, \dots, x_n)$,

$$\mathbb{E}_\epsilon\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \epsilon_i f(x_i) \right|\right] \le 2D(x_1^n)\sqrt{\frac{\nu \log(n+1)}{n}},$$

where $D(x_1^n) = \sup_{f \in \mathcal{F}} \sqrt{\frac{1}{n}\sum_{i=1}^{n} f^2(x_i)}$ is the $\ell_2$-radius of the set $\mathcal{F}(x_1^n)/\sqrt{n}$.

proof)

For any fixed vector $a = (a_1, \dots, a_n) \in \mathbb{R}^n$, the random variable $\frac{1}{n}\sum_{i=1}^{n} \epsilon_i a_i$ is sub-Gaussian with variance proxy $\sigma^2 = \frac{1}{n^2}\sum_{i=1}^{n} a_i^2$. Since $\mathcal{F}$ has polynomial discrimination of order $\nu$, there are at most $(n+1)^\nu$ vectors of the form $(f(x_1), \dots, f(x_n))$, and each such vector satisfies

$$\sqrt{\frac{1}{n}\sum_{i=1}^{n} f^2(x_i)} \le D(x_1^n).$$

Therefore, the empirical Rademacher complexity is upper bounded by the expected maximum of at most $(n+1)^\nu$ such Rademacher averages, each sub-Gaussian with variance proxy at most $D^2(x_1^n)/n$. Applying the maximal inequality for sub-Gaussians yields

$$\mathbb{E}_\epsilon\left[\sup_{f \in \mathcal{F}} \left| \frac{1}{n}\sum_{i=1}^{n} \epsilon_i f(x_i) \right|\right] \le D(x_1^n)\sqrt{\frac{2\log\left(2(n+1)^\nu\right)}{n}} \le 2D(x_1^n)\sqrt{\frac{\nu\log(n+1)}{n}}.$$
  • The maximal inequality used here

Let $X_1, \dots, X_n$ be zero-mean sub-Gaussian random variables with variance proxy $\sigma^2$.

Then $\mathbb{E}\left[\max_{1 \le i \le n} |X_i|\right] \le \sigma\sqrt{2\log(2n)}$.
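For completeness, the final inequality in the proof can be checked directly: since $\log 2 \le \log(n+1)$ for $n \ge 1$ and $1 + \nu \le 2\nu$ for $\nu \ge 1$,

$$2\log\left(2(n+1)^\nu\right) = 2\log 2 + 2\nu\log(n+1) \le 2(1+\nu)\log(n+1) \le 4\nu\log(n+1),$$

and taking square roots gives the stated factor $2\sqrt{\nu\log(n+1)}$.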


3.3. Glivenko-Cantelli theorem

Corollary (Glivenko-Cantelli)

Let $F(t) = P(X \le t)$ be the CDF of a random variable $X$, and let $\widehat{F}_n$ be the empirical CDF based on $n$ i.i.d. samples $X_i \sim P$. Then, for all $\delta \ge 0$ and $n \ge 1$,

$$\lVert F - \widehat{F}_n \rVert_\infty \le 4\sqrt{\frac{\log(n+1)}{n}} + \delta$$

with probability at least $1 - \exp\left(-\frac{n\delta^2}{2}\right)$ (the theorem above applies with $b = 1$). Therefore, $\lVert F - \widehat{F}_n \rVert_\infty \overset{a.s.}{\to} 0$.


proof )

For a given sample $x_1^n = (x_1, \dots, x_n)$ and the function class $\mathcal{F} = \{I_{(-\infty, t]}(\cdot) : t \in \mathbb{R}\}$, consider the set $\mathcal{F}(x_1^n) = \{(f(x_1), \dots, f(x_n)) : f \in \mathcal{F}\}$. Because the threshold $t$ only divides the real line into $n+1$ regions relative to the ordered sample, the cardinality is at most $n+1$, which means polynomial discrimination of order 1. Since the indicators also give $D(x_1^n) \le 1$, the given inequality follows from the results above.
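A quick simulation of the corollary (a sketch with $P = \text{Uniform}(0, 1)$, my choice, where $F(t) = t$ on $[0, 1]$ and the supremum is the Kolmogorov-Smirnov statistic):

```python
import numpy as np

rng = np.random.default_rng(6)
# For P = Uniform(0, 1), sup_t |F_hat_n(t) - F(t)| is the Kolmogorov-Smirnov
# statistic, computable exactly from the ordered sample.
for n in [100, 1000, 10000]:
    U = np.sort(rng.uniform(size=n))
    i = np.arange(1, n + 1)
    ks = np.max(np.maximum(i / n - U, U - (i - 1) / n))
    bound = 4 * np.sqrt(np.log(n + 1) / n)         # the delta = 0 part of the bound
    print(f"n = {n:5d}: sup|F_hat - F| = {ks:.4f} <= {bound:.4f}")
```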

3.4. Vapnik-Chervonenkis dimension

Above, we saw that polynomial discrimination is useful for finding upper bounds. One way to verify that a given function class has polynomial discrimination is via the theory of VC dimension. Consider a function class $\mathcal{F}$ where each function $f$ is binary-valued, taking values in $\{0, 1\}$ for simplicity. Then the set $\mathcal{F}(x_1^n)$ has at most $2^n$ elements (since each coordinate has two possibilities).


Definition (VC dimension) Given a class $\mathcal{F}$ of binary-valued functions, we say that the set $x_1^n = (x_1, \dots, x_n)$ is shattered by $\mathcal{F}$ if $\mathrm{card}(\mathcal{F}(x_1^n)) = 2^n$. The VC dimension $\nu(\mathcal{F})$ is the largest integer $n$ for which there is some collection $x_1^n = (x_1, \dots, x_n)$ of $n$ points that is shattered by $\mathcal{F}$.


  • Example: Two-sided intervals in $\mathbb{R}$

Consider the collection of all two-sided intervals over the real line, namely $\mathcal{S}_{\text{both}} = \{(b, a] : a, b \in \mathbb{R} \text{ such that } b < a\}$. The class $\mathcal{S}_{\text{both}}$ can shatter any two distinct points. However, given three distinct points $x_1 < x_2 < x_3$, it is not possible to pick out the subset $\{x_1, x_3\}$, showing that the VC dimension is $\nu(\mathcal{S}_{\text{both}}) = 2$.


  • Sauer's Lemma

Consider a set class $\mathcal{F}$ with VC dimension $\nu(\mathcal{F}) < \infty$. Then, for any $n > \nu(\mathcal{F})$ and any collection $x_1^n$, we have

$$\mathrm{card}(\mathcal{F}(x_1^n)) \le (n+1)^{\nu(\mathcal{F})},$$

i.e., $\mathcal{F}$ has polynomial discrimination of order $\nu(\mathcal{F})$. A small enumeration check for the interval class appears below.
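The following sketch (assuming numpy; the enumeration strategy via cut points between ordered samples is mine) lists the realized projections of the two-sided interval class and compares their count against the Sauer bound with $\nu = 2$:

```python
import numpy as np

rng = np.random.default_rng(7)

def interval_projections(points):
    """Enumerate F(x_1^n) for the class of two-sided intervals (b, a], b < a."""
    pts = np.sort(points)
    # Cut points between/around the samples: each (b, a] picks a contiguous run.
    cuts = np.concatenate(([pts[0] - 1.0], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1.0]))
    vecs = {tuple([False] * len(points))}          # the empty pick-out
    for b in cuts:
        for a in cuts:
            if b < a:
                vecs.add(tuple(bool(v) for v in (points > b) & (points <= a)))
    return vecs

for n in [2, 3, 5, 10]:
    x = rng.uniform(size=n)
    m = len(interval_projections(x))
    print(f"n = {n:2d}: card = {m:3d}, shattered = {m == 2 ** n}, "
          f"Sauer bound (n+1)^2 = {(n + 1) ** 2}")
```

Only $n = 2$ points are shattered, consistent with $\nu(\mathcal{S}_{\text{both}}) = 2$, and for larger $n$ the cardinality grows quadratically, well within the $(n+1)^2$ bound.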

3.5. Multivariate version of Glivenko-Cantelli

Theorem

Let $X_1, \dots, X_n$ be i.i.d. observations from a distribution $\mathbb{P}$ on $\mathbb{R}^d$. Let $F$ denote the multivariate cumulative distribution function of $\mathbb{P}$, that is,

$$F(x) = P(X \le x), \quad x \in \mathbb{R}^d,$$

where, for vectors $x = (x_1, \dots, x_d)$ and $y = (y_1, \dots, y_d)$ in $\mathbb{R}^d$, $x \le y$ means that $x_j \le y_j$ for all $j = 1, \dots, d$. Let $\widehat{F}_n$ denote the empirical CDF, i.e.

$$\widehat{F}_n(x) = \frac{1}{n}\sum_{i=1}^{n} I(X_i \le x), \quad x \in \mathbb{R}^d.$$

Then

$$\lVert \widehat{F}_n - F \rVert_\infty \le 2\mathcal{R}_n(\mathcal{F}) + \delta \le 4\sqrt{\frac{d\log(n+1)}{n}} + \delta,$$

with probability at least $1 - \exp\left(-\frac{n\delta^2}{2}\right)$.


proof)

First, the relevant class of indicator functions is characterized by the class of sets $\mathcal{A} = \{(-\infty, x_1] \times (-\infty, x_2] \times \cdots \times (-\infty, x_d] : (x_1, \dots, x_d) \in \mathbb{R}^d\}$, which has VC dimension $d$.

Because $\mathcal{A}$ has VC dimension $d$, applying Sauer's lemma, for fixed $x_1^n$ we have

$$\mathrm{card}(\mathcal{F}(x_1^n)) \le (n+1)^d,$$

which means that $\mathcal{F}$ has polynomial discrimination of order $d$.

Because the class $\mathcal{F}$ is 1-uniformly bounded (it consists of indicator functions), applying the Glivenko-Cantelli property with $b = 1$, for any positive integer $n \ge 1$ and any scalar $\delta \ge 0$ we have

$$\lVert \widehat{F}_n - F \rVert_\infty \le 2\mathcal{R}_n(\mathcal{F}) + \delta \le 4\sqrt{\frac{d\log(n+1)}{n}} + \delta$$

with probability at least $1 - \exp\left(-\frac{n\delta^2}{2}\right)$, that is,

$$P\left(\lVert F - \widehat{F}_n \rVert_\infty \ge 4\sqrt{\frac{d\log(n+1)}{n}} + \delta\right) \le \exp\left(-\frac{n\delta^2}{2}\right).$$
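A closing simulation of the multivariate bound (a sketch with $d = 2$ and $\mathbb{P} = \text{Uniform}$ on $[0,1]^2$, my choices, so that $F(x) = x_1 x_2$; the supremum is approximated on a finite grid):

```python
import numpy as np

rng = np.random.default_rng(8)
d, n = 2, 2000
X = rng.uniform(size=(n, d))                       # P = Uniform on [0, 1]^2, F(x) = x_1 * x_2

# Evaluate the deviation on a grid (a simple approximation to the sup over R^d).
grid = np.linspace(0.0, 1.0, 101)
G1, G2 = np.meshgrid(grid, grid)
F_hat = np.mean(
    (X[:, 0][:, None, None] <= G1) & (X[:, 1][:, None, None] <= G2), axis=0
)
F_true = G1 * G2

dev = np.max(np.abs(F_hat - F_true))
bound = 4 * np.sqrt(d * np.log(n + 1) / n)         # the delta = 0 part of the bound
print(f"d = {d}, n = {n}: sup|F_hat - F| ~ {dev:.4f} <= {bound:.4f}")
```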
