[High Dimensional Statistics] Maximum Mean Discrepancy(MMD)

Date :     Updated :

References

  • Yonsei univ. STA9104 : Statistical theory for high dimensional and big data
  • A Kernel two-sample test , Gretton, 2012.

1. IntroductionPermalink

1. 1. Two sample test problem settingPermalink

X1,,Xniidp and Y1,,Yniidq


Hypothesis : H0:p=q,H1:pq


  • classical method to test two-sample test.

    there are many classical ways for two-sample test. It include,

    1. t-test, in case of 1-dimensional
    2. Hotelling Tsquared test , which is generalization of t-test into Rd
      • However, it only applies when n<d and only sensitive to mean difference
    3. Tests based on empirical CDF, e.g. K.S. test and CVM test
      • Sensitive to any difference in distribution,
      • However, hard to generalize to Rd


  • MMD solve a lot of problems.
    1. It workd well for multivariate data
    2. It is sensitive to general alternatives
    3. It is easy to estimate
    4. It is an example of IPM (Intergral Probability Metrics)


1.2. Integral Probability Metrics(IPM)Permalink

Definition (Integral Probability Metrics) for a class if function F, IPM between p and q is defined as

IPM(p,q)=supfF|Ep[f(x)]Eq[f(x)]|

Examples are total variation distance, wasserstein distance, kolmogorov distance.

2. Maximun Mean Discrepancy(MMD)Permalink

  • If F is too larger, it is difficult to estimate IPM(p,q). Also, if F is too small, IPM(p,q)=0 even if pq.

  • MMD takes middle ground between these two extremes,

    It takes F to be a unit ball in a RKHS H


2.1. Reproducin Kernel Hilbert spacePermalink

Definition (Reproducing kernel Hibert space)

A RKHS is a Hilber space F defined with an inner product <,>F where it has a reproducing kernel property

k:χ×χRsuchthatxχ,fF,<f(x),k(:,x)>F=f(x)ReprodcingProperty

2.2. Definition of MMDPermalink

Definition (Maximun Mean Discrepancy)

Let F to be a unit ball in a RKHS such that fF , ||f||F=<f,f>F1.Then,

MMD(p,q)=supfF|Ep[f(x)]Eq[f(x)]|

, That is, MMD is a IPM with f being unit ball in RKHS.


2.3. Properties of MMDPermalink

  • Property 1

    If RKHS is associated with a “characteristic kernel”, such as gaussian kernel, then

MMD(p,q)=0ifandonlyifp=q

  • Property 2

MMD2(p,q)=[supfF|Ep[f(x)]Eq[f(x)]|]2=[supfF<f,Ep[k(,x)]Eq[k(,x)]]2bythereproducingproperty||f||H2||Ep[k(,x)]Eq[k(,x)||H2||f||H21≤<Ep[k(,x)]Eq[k(,x)],Ep[k(,x)]Eq[k(,x)]>H=<Ep[k(,x)],Ep[k(,x)]>+<Eq[k(,x)],Eq[k(,x)]>2<Ep[k(,x)],Eq[k(,x)]>=EX,Xindp[k(X,X)]+EY,Yindq[k(Y,Y)]2EXindp,Yindq[k(X,Y)]k(x,x)=<k(,x),k(,x)>HRKHSproperty.

​ In fact, the upper bound can be achievd by some fF, hence,

MMD2(p,q)=EX,Xindp[k(X,X)]+EY,Yindq[k(Y,Y)]2EXindp,Yindq[k(X,Y)]

is achived. Even if it is sup, since it can be achieved, we can say that sup is achieved. That is, optimization problem in IPM calculation, which is obtaining supremum, is bypassed.

Also, there is no f in equation, which means that it is very straightforward to estimate.


​ For estimation, consider Plug-in estimator, or a V-statistic in our context :

MMD^2(p,q)=[supfF|1mi=1mf(xi)1nj=1nf(xj)|]2=1m2i,j=1mk(xi,xj)+1n2i,j=1nk(yi,yj)2nmj=1nimk(xi,yj)

2.4. Convergence of MMDPermalink

Let X1,,Xm be i.i.d. observations from p and Y1,,Yn be i.i.d. observation from q. Assum that these two samples are mutually independent. Let F be the unit ball in a reproducing kernel Hilbert space(RKHS) associated with kernel k and define

MMD=supfF|Ep[f(X)]Eq[f(Y)]|MMD^=supfF|1mi=1mf(Xi)1nj=1nf(Yj)|

Also, assume that the kernel k is bounded as 0k(x,y)K for all x,y in the domain X.


Want to show :

P[|MMD^MMD|>2(Km+Kn)+t]exp(t2mn2K(m+n))

proof)

Step 1) obtain upper bound for Rademacher Complexity

Let ϵt,,ϵm be i.i.d. Rademacher random variables taking values of $1,2$where P(ϵ1=1)=P(ϵ1=1)=12. Then, the Rademacher coplexity of F satisfies

Rm(F)=EX,ϵ[supfF|1mi=1mϵif(Xi)|]Km

why?

Rm(F)=EX,ϵ[supfF|1mi=1mϵif(Xi)|]=EX,ϵ[supfF|<f,1mϵik(Xi)>H|]reproducingproperty1mEX,ϵ[||f||H||ϵik(Xi)||H]CauchySchwarzinequality1mEX,ϵ[||ϵik(Xi)||H]Fisunitball=1mEX,ϵ[i=1mϵik(Xi),j=1mϵjk(Xj)H12]=1mEX,ϵ[i=1mj=1mϵiϵjk(Xi),k(Xj)H12]=1mEX,ϵ[(i=1mj=1mϵiϵjk(Xi,Xj))12]reproducingproperty1m(i=1mj=1mEX,ϵ[ϵiϵjk(Xi,Xj)])12Jensensinequality=1m(i=1mEXi[k(Xi,Xi)])12ifij,0E[ϵiϵj]=01m(mK)12=Km

Step 2) Implment McDiarmid’s Inequality

To use McDiardmid’s inequality, we need to derive Bounded difference condition.


Fisrt, bound the |MMD^MMD| by some formula satisfying bounded difference condition .

|MMD^MMD|=|supfF|1mi=1mf(Xi)1nj=1nf(Yj)|supfF|Ep[f(X)]Eq[f(Y)]||supfF|1mi=1mf(Xi)1nj=1nf(Yj)Ep[f(X)]Eq[f(Y)]|Triangleinequality

Here, Let’s denote

supfF|1mi=1mf(Xi)1nj=1nf(Yj)Ep[f(X)]Eq[f(Y)]|

as

:=(X1,,Xm,Ym+1,,Ymn)(X,Y)


Then, we can find bounded difference condition for (X,Y)

for i=1,2,,m,

|(X1,,Xi,,Xm,Ym+1,,Ym+n)(X1,,Xi,,Xm,Ym+1,,Ym+n)|1msupfF|f(Xi)f(Xi)|=1msupfF|f,k(,Xi)Hf,k(,Xi)H|reproducingproperty=1msupfF|f,k(,Xi)k(,Xi)H|1msupfF||f||H||k(,Xi)k(,Xi)||HCauchySchwarz1m||k(,Xi)k(,Xi)||H⟵∵Fisaunitballandnofintherest=1mk(,Xi)k(,Xi),k(,Xi)k(,Xi)H12=1m|k(xi,xi)+k(xi,xi)2k(xi,xi)|12reproducingproperty1m(|k(xi,xi)|+|k(xi,xi)|+|2k(xi,xi)|)12triangleinequality2K12m

Therefore, if we do smiliar procedure to i=m+1,,m+n,

Li={2K12mfori=1,,m2K12nfori=m+1,,m+n

Because i=1m+nLi2=(m+n)4Kmn, applying Mcdiarmid’s inequality

P[(X,Y)E[(X,Y)]>t]exp(t2mn2K(m+n))

Step 3) control E[(X,Y)]

To control E[(X,Y)], Symmtetrization and Rademacher coplexity is used. The result from class is

E[MMD^MMD]=EX,Y[supfF|1mi=1mf(Xi)1nj=1nf(Yj)|supfF|Ep[f(X)]Eq[f(Y)]|]TraidngleineqaualityEX,Y[supfF|1mi=1mf(Xi)1nj=1nf(Yj)Ep[f(X)]Eq[f(Y)]|]Symmetrization=EX,Y[supfF|1mi=1mf(Xi)1nj=1nf(Yj)EX[1mi=1mf(Xi)]EY[1ni=1nf(Yi)]|]JensensinequalityEX,X,Y,Y[supfF|1mi=1m(f(Xi)f(Xi))1nj=1n(f(Yj)f(Yj))|]Ramadecaharvariables=EX,X,Y,Y,ϵ,ϵ[supfF|1mi=1mϵi(f(Xi)f(Xi))1nj=1nϵj(f(Yj)f(Yj))|]TraidngleineqaualityEX,ϵ[|1mi=1mϵif(Xi)|]+EX,ϵ[|1mi=1mϵif(Xi)|]+EY,ϵ[|1mi=1mϵif(Yi)|]+EY,ϵ[|1mi=1mϵif(Yi)|]RamedacherComplexity=2Rm(F)+2Rn(F)

Applying the above result (a), It is easy to verify that

E[|MMD^MMD|]2Km+2Kn

Step 4) Merge the ingredients

Merging two results,

P[(X,Y)>t+E[(X,Y)]]P[(X,Y)>t+2Km+2Kn]exp(t2mn2K(m+n))

2.5. Testing based on MMDPermalink

Suppose that we want to test whether H0:p=q or H1:pq. Based on above ineqaulity, we can construct a valid level α test ϕ:X1,,Xm,Y1,,Yn0,1 such that P(ϕ=1)α under H:p=q.


Because p=q under H0 , MMD=0 and we can treat Yi=Xi. ( both from same distribution p=q). Therefore,

MMD^MMD=MMD^=supfF(1mi=1m(f(xi)f(xi))


applying similar procedure as 2.4. , we can find

P(MMD^>t+2Km)exp(t2m4K)

By letting α=exp(t2m4K) , t=4Klogαm

P[MMD^>2Km+4Klogαm]α

Therefore, we reject H0 if the calculated estimator is larger than 2Km+4Klogαm


  • Because Information distribution is not needed. there is no term related to p or q


2.6. Permutation test based on MMDPermalink

Procedure

  • for i : 1 to B:

    • Randomly permute (X1,,Xm,Y1,,Yn)

      (X~,Y~) =(X~1,,X~m,Y~1,,Y~n)

    • Calculate MMD^i2(X~,Y~)
    • Reject the null of p=q if pval~=1B+1|i=1BI(MMD^>MMD^obs)+1|α

Leave a comment