[High Dimensional Statistics] Concentration Inequalities
References
- Yonsei Univ. STA9104: Statistical Theory for High-Dimensional and Big Data
- High-Dimensional Statistics: A Non-Asymptotic Viewpoint, Martin J. Wainwright, 2017
0. Motivation
1. From Markov to Chernoff
1.1. Markov’s Inequality
1.2. Chebyshev’s Inequality
1.3. Polynomial Markov
1.4. Chernoff bound
2. Sub-Gaussian random variables
2.1. Definition
2.2. Properties
2.3. Equivalent Definitions
2.4. Sub-Gaussian random vectors
2.5. Hoeffding’s inequality
2.6. Maximal Inequality
3. Sub-Exponential random variables
3.1. Definition
3.2. Examples
3.3. Bernstein’s Condition
3.4. The Johnson-Lindenstrauss Lemma
4. Bounded difference conditions
A basic fact behind many concentration results is that if changing the value of any single random variable changes the function only slightly, then the function concentrates around its mean.
Formally, let $X_1, \cdots, X_n$ be independent random variables, where each $X_i \in \mathbb{R}$, and let $f : \mathbb{R}^n \rightarrow \mathbb{R}$ be a function satisfying \(\vert f(x_1, \cdots, x_{k-1}, x_k, x_{k+1}, \cdots , x_n) - f(x_1, \cdots, x_{k-1}, x_k', x_{k+1}, \cdots , x_n) \vert \leq L_k\)
for every $x \in \mathbb{R}^n$, every $x_k' \in \mathbb{R}$, and every $k$. This is known as the bounded difference condition.
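To make the condition concrete, here is a minimal numerical sketch (assuming NumPy; the helper `estimate_bounded_difference` and the grid sweep are illustrative choices of mine, not a standard routine). It estimates $L_k$ by perturbing a single coordinate of $f$'s input, which only lower-bounds the true constant since the supremum runs over a finite grid and a single base point:

```python
import numpy as np

def estimate_bounded_difference(f, x, k, grid):
    """Estimate L_k for f at the point x by sweeping coordinate k over a grid.
    Since the sup is taken over a finite grid (and one base point x),
    this only lower-bounds the true constant L_k."""
    values = []
    for v in grid:
        x_mod = x.copy()
        x_mod[k] = v
        values.append(f(x_mod))
    return max(values) - min(values)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
grid = np.linspace(0, 1, 201)

# Sample mean: true L_k = (b - a)/n = 1/20 = 0.05.
print(estimate_bounded_difference(np.mean, x, k=3, grid=grid))
# Maximum: L_k can be as large as b - a = 1, so the mean
# concentrates much faster than the max for the same data.
print(estimate_bounded_difference(np.max, x, k=3, grid=grid))
```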
4.1. McDiarmid’s inequality
Theorem (McDiarmid’s inequality) If the random variables $X_1, \cdots , X_n$ are independent and a function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ satisfies the bounded difference condition above, then for all $t \geq 0$,
\[\mathbb{P} (\vert f(X_1, \cdots ,X_n ) - \mathbb{E} [f(X_1, \cdots , X_n) ]\vert \geq t) \leq 2 \exp \bigg( - \frac{2t^2} {\sum_{k=1}^n L_k^2}\bigg)\]
Proof)
Claim: Let $V_i = E[f \vert X_1 , \cdots, X_i ] - E[ f \vert X_1, \cdots, X_{i-1}]$. Then $E[ e^{\lambda V_i} \vert X_1, \cdots ,X_{i-1}] \leq e^{\frac{\lambda^2 L_i^2 }{8}}$.
Define
$A_i = \underset{x}{\sup} \big\{ E[f \vert X_1, \cdots ,X_{i-1}, X_i = x] - E[f\vert X_1, \cdots ,X_{i-1}] \big\}$
$B_i = \underset{x}{\inf} \big\{ E[f\vert X_1, \cdots ,X_{i-1}, X_i = x] - E[f\vert X_1, \cdots ,X_{i-1}] \big\}$
Since $B_i \leq V_i \leq A_i$, the claim follows from Hoeffding’s lemma together with the fact that $A_i - B_i \leq L_i$:
- Hoeffding’s Lemma: If $a \leq X \leq b$ almost surely, then $\mathbb{E} [e^{\lambda (X -\mu) }]\leq e^{\frac{\lambda^2 (b-a)^2}{8}}$ for all $\lambda \in \mathbb{R}$, where $\mu = \mathbb{E}[X]$.
- Conditionally on $X_1, \cdots, X_{i-1}$, the variable $V_i$ lies in the interval $[B_i, A_i]$ of length $A_i - B_i \leq L_i$.
Therefore, $E[ e^{\lambda V_i} \vert X_1, \cdots ,X_{i-1}] \leq e^{\frac{\lambda^2 L_i^2 }{8}}$.
Then, noting that $\sum_{i=1}^n V_i = f(X_1, \cdots, X_n) - \mathbb{E}[f(X_1, \cdots, X_n)]$ by telescoping,
\[\begin{align} \mathbb{P} ( f(X_1, \cdots ,X_n ) - \mathbb{E} [f(X_1, \cdots , X_n) ] \geq t) &= \mathbb{P} \bigg( \sum_{i=1}^n V_i \geq t \bigg)\\ &= \mathbb{P} \bigg( e^{\lambda \sum_{i=1}^n V_i }\geq e^{\lambda t} \bigg), \quad \lambda > 0\\ & \leq e^{-\lambda t}\mathbb{E} \big[ e^{\lambda \sum_{i=1}^n V_i }\big]\quad \longleftarrow \quad \text{Markov's inequality} \\ & = e^{-\lambda t} \mathbb{E} \big[e^{\lambda \sum_{i=1}^{n-1} V_i} \mathbb{E}\big[ e^{\lambda V_n} \vert X_1, \cdots, X_{n-1} \big]\big] \\ &\leq \cdots \\ &\leq e^{-\lambda t} e^{\lambda^2 \sum_{i=1}^n L_i^2 /8} \\& = \exp \bigg( - \frac{2t^2} {\sum_{k=1}^n L_k^2}\bigg) \quad \longleftarrow \quad \lambda = \frac{4t}{\sum_{i=1}^n L_i^2} \end{align}\]Applying the same argument to $-f$ and combining the two one-sided bounds, one gets $\mathbb{P} (\vert f(X_1, \cdots ,X_n ) - \mathbb{E} [f(X_1, \cdots , X_n) ]\vert \geq t) \leq 2 \exp \bigg( - \frac{2t^2} {\sum_{k=1}^n L_k^2}\bigg)$
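As a sanity check, the following Monte Carlo sketch compares the empirical two-sided tail of a bounded-difference function against the McDiarmid bound. The balls-in-bins function (number of empty bins) is my illustrative choice here, picked because it satisfies the condition with $L_k = 1$; its exact mean $n(1 - 1/n)^n$ is a standard computation:

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials, t = 100, 20_000, 10.0

# f(X_1,...,X_n) = number of empty bins after throwing n balls uniformly
# into n bins. Moving a single ball changes the count by at most 1,
# so the bounded difference condition holds with L_k = 1 for every k.
balls = rng.integers(0, n, size=(trials, n))
empty_counts = np.array([n - np.unique(row).size for row in balls])

mean_f = n * (1 - 1 / n) ** n          # exact E[f] = n(1 - 1/n)^n
emp_tail = np.mean(np.abs(empty_counts - mean_f) >= t)
bound = 2 * np.exp(-2 * t**2 / n)      # sum_k L_k^2 = n
print(f"empirical tail {emp_tail:.4f} <= McDiarmid bound {bound:.4f}")
```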
4.1.1. Example : Sample mean and Hoeffding’s inequality
A simple example of McDiarmid’s inequality in action is to see that it directly implies the Hoeffding bound.
\(f(X_1 , \cdots, X_n) = \frac{1}{n} \sum_{i=1}^n X_i, \quad \text{for } a \leq X_i \leq b, ~\forall i\) Since $\vert f(x_1, \cdots, x_{k-1}, x_k, x_{k+1}, \cdots , x_n) - f(x_1, \cdots, x_{k-1}, x_k', x_{k+1}, \cdots , x_n) \vert \leq \frac{(b-a)}{n}$,
\(\mathbb{P} ( \vert \bar{X} - \mathbb{E}[\bar {X}] \vert \geq t) \leq 2 \exp \bigg( -\frac{2t^2}{n \times \frac{(b-a)^2}{n^2}}\bigg) = 2 e^{-\frac{2nt^2}{(b-a)^2}}\)
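Inverting this bound gives the sample size needed to pin down a mean to a given accuracy and confidence; a small sketch (the helper name `hoeffding_sample_size` is mine):

```python
import math

def hoeffding_sample_size(t, delta, a=0.0, b=1.0):
    """Smallest n such that 2*exp(-2*n*t^2/(b - a)^2) <= delta,
    i.e. n >= (b - a)^2 * log(2/delta) / (2*t^2)."""
    return math.ceil((b - a) ** 2 * math.log(2 / delta) / (2 * t ** 2))

# Mean of [0,1]-valued variables within t = 0.05 of E[X]
# with probability at least 0.95:
print(hoeffding_sample_size(t=0.05, delta=0.05))  # 738
```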
4.2. U-statistics
A perhaps more interesting example is that of U-statistics.
Definition (U-statistics) Let $X_1, \cdots, X_n$ be $i.i.d.$ random variables from some distribution $P$ supported on $\mathcal{X}$. Consider a function $h$ that is symmetric in its arguments and satisfies $\underset{x_1, \cdots, x_m \in \mathcal{X}}{\sup} \big\vert h(x_1, \cdots ,x_m ) \big\vert \leq K$ for some constant $K$. Let the summation below be taken over all subsets $1 \leq i_1< \cdots < i_m \leq n $ of $\{1,\cdots ,n\}$. Then
\[U_n = \binom{n}{m} ^{-1} \sum_{i_1 < \cdots < i_m} h(X_{i_1}, \cdots , X_{i_m})\]is called a U-statistic with kernel $h$ of order $m$.
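A direct implementation of this definition, enumerating all $m$-subsets with `itertools.combinations`. Note the variance kernel used below is unbounded, so it illustrates only the combinatorial structure, not the boundedness assumption of the definition:

```python
import itertools
import numpy as np

def u_statistic(x, h, m):
    """U_n = binom(n, m)^{-1} * sum over all m-subsets {i_1 < ... < i_m}
    of h(x_{i_1}, ..., x_{i_m})."""
    return np.mean([h(*s) for s in itertools.combinations(x, m)])

rng = np.random.default_rng(2)
x = rng.normal(size=30)

# Order-2 kernel h(x, y) = (x - y)^2 / 2: the resulting U-statistic
# is exactly the unbiased sample variance.
h = lambda a, b: (a - b) ** 2 / 2
print(u_statistic(x, h, m=2))   # matches the line below
print(np.var(x, ddof=1))
```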
4.2.1. Example with order 2 and Bounded difference condition
Suppose $m=2$ and $\vert h(x_i, x_j) \vert \leq b$. Changing a single coordinate $x_k$ affects exactly the $n-1$ terms that involve it, each by at most $2b$, so
\[\vert U(x_1, \cdots, x_{k-1}, x_k, x_{k+1}, \cdots , x_n) - U(x_1, \cdots, x_{k-1}, x_k', x_{k+1}, \cdots , x_n) \vert \leq \binom{n}{2}^{-1} (n-1)(2b) = \frac{4b}{n}\]Then, by McDiarmid’s inequality,
\(\mathbb{P}(\vert U_n - \mathbb{E}[U_n] \vert \geq t) \leq 2 \exp \bigg( - \frac{2t^2}{n \times (4b/n)^2} \bigg) = 2 \exp(-nt^2 / 8b^2)\)
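A Monte Carlo sketch of this bound, using the bounded symmetric kernel $h(x, y) = \vert x - y \vert$ on $[0,1]$ (so $b = 1$ and $\mathbb{E}[U_n] = \mathbb{E}\vert X - Y \vert = 1/3$). The parameter choices are arbitrary, and the bound is valid but typically quite loose relative to the actual fluctuations:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials, b, t = 400, 2_000, 1.0, 0.2

def u_order2(x):
    # U_n = binom(n, 2)^{-1} * sum_{i<j} |x_i - x_j|, via a distance matrix.
    d = np.abs(x[:, None] - x[None, :])
    return d[np.triu_indices(len(x), k=1)].mean()

u = np.array([u_order2(rng.uniform(0, 1, n)) for _ in range(trials)])
emp_tail = np.mean(np.abs(u - 1 / 3) >= t)      # E|X - Y| = 1/3 here
bound = 2 * np.exp(-n * t**2 / (8 * b**2))      # the McDiarmid bound above
print(f"empirical tail {emp_tail:.4f} <= bound {bound:.4f}")
```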
4.2.2. Concentration Inequalities for U-statistics
Theorem) For a U-statistic $U_n$ as defined above, the following concentration inequality holds: \(\mathbb{P} (\vert U_n - \mathbb{E} [U_n] \vert \geq t) \leq 2 \exp{ \left\{ - \frac{t^2 \lfloor{n/m \rfloor} } {2K^2}\right\}}\)
Proof)
The proof follows two steps.
Step 1) Let $k = \lfloor \frac{n}{m} \rfloor$. Then
\[V_n = \frac{1}{k} \big\{ h(X_1, \cdots, X_m) + \cdots + h(X_{(k-1)m+1}, \cdots, X_{km}) \big\}\]has the concentration bound:
\[\mathbb{P} (\vert V_n - \mathbb{E} [V_n ] \vert \geq t ) \leq 2 \exp \left\{ - \frac{t^2\lfloor \frac{n}{m} \rfloor}{2K^2} \right\}\]pf) Write $Y_i = (X_{(i-1)m+1}, \cdots, X_{im})$ for the $i$-th block; then
we can write $V_n (Y_1, \cdots, Y_k) = \frac{1}{k} \{ h(Y_1) + \cdots + h(Y_k) \}$.
Here, one can check the bounded difference condition:
\[\vert V_n (Y_1, \cdots , Y_i , \cdots , Y_k) - V_n (Y_1, \cdots , Y_i ' , \cdots , Y_k) \vert = \big\vert \tfrac{1}{k} (h(Y_i) - h(Y_i ')) \big\vert \leq \frac{2K}{k}\]Then, applying McDiarmid’s inequality,
\[P( \vert V_n - E[V_n]\vert \geq t) \leq 2 \exp \bigg\{ - \frac{t^2k}{2K^2} \bigg\} = 2 \exp \bigg\{ - \frac{t^2\lfloor \frac{n}{m} \rfloor}{2K^2} \bigg\}\]Moreover, the above inequality means that $V_n$ is sub-Gaussian with $\sigma^2 = \frac{K^2}{k}$.
Step 2) Let $X_{\sigma} = (X_{\sigma(1) }, \cdots , X_{\sigma(n)})$ denote a permutation of $X$, and let $S_n$ be the group of all such permutations. We use the fact that the bounded difference bound
\[\vert V_n (Y_{\sigma(1)}, \cdots , Y_{\sigma(i)} , \cdots , Y_{\sigma(k)}) - V_n (Y_{\sigma(1)}, \cdots , Y_{\sigma(i)} ' , \cdots , Y_{\sigma(k)}) \vert = \big\vert \tfrac{1}{k} (h(Y_{\sigma(i)}) - h(Y_{\sigma(i)} ')) \big\vert \leq \frac{2K}{k}\]holds for every permutation $\sigma$. Therefore, each permuted version $V_n(X_\sigma)$ satisfies the same sub-Gaussian bound.
Then,
\[\sum_{\sigma \in S_n} V_n (X_\sigma) = m! (n-m)! \sum_{i_1 < \cdots < i_m} h(X_{i_1}, \cdots, X_{i_m}) = n! U_n \quad \Longrightarrow \quad U_n = \frac{1}{n!} \sum_{\sigma \in S_n} V_n (X_\sigma)\](each $m$-subset appears as a block in exactly $k \cdot m!(n-m)!$ of the $n!$ permutations, and each appearance carries weight $1/k$). To derive the concentration bound,
\[\begin{align} \mathbb{P} [U_n - \mathbb{E}[U_n] \geq t] & \leq e^{-\lambda t} \mathbb{E} [e^{\lambda (U_n - \mathbb{E}[U_n])}] \quad \longleftarrow \lambda>0, \text{ Markov's inequality} \\ & = e^{-\lambda t} \mathbb{E} \big[e^{\frac{\lambda}{n!} \sum_{\sigma \in S_n }(V_n(X_\sigma) - \mathbb{E}[V_n])}\big] \\ & \leq e^{-\lambda t} \frac{1}{n!} \sum_{\sigma \in S_n} \mathbb{E} [e^{\lambda (V_n(X_\sigma) - \mathbb{E}[V_n])}] \quad \longleftarrow \text{Jensen's inequality (convexity of } e^x\text{)}\\ & \leq e^{-\lambda t} e^{\frac{\lambda^2 K^2} {2k}} \quad \longleftarrow \text{sub-Gaussianity of each } V_n(X_\sigma) \\ & =\exp \bigg\{ - \frac{t^2k}{2K^2} \bigg\} \quad \longleftarrow \text{with } \lambda = \frac{kt}{K^2 }\\&= \exp \bigg\{ - \frac{t^2\lfloor \frac{n}{m} \rfloor}{2K^2} \bigg\} \end{align}\]Applying the same argument to $-U_n$ gives the other direction, so
\(\mathbb{P} (\vert U_n - \mathbb{E} [U_n] \vert \geq t) \leq 2 \exp{ \bigg\{ - \frac{t^2 \lfloor{n/m \rfloor} } {2K^2}\bigg\}}\)
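The block estimator $V_n$ from Step 1 is easy to implement and contrasts with the full $U_n$: it averages over only $k = \lfloor n/m \rfloor$ disjoint blocks, which is cheaper and is exactly what makes McDiarmid applicable with $L_i = 2K/k$, at the cost of a larger variance than $U_n$. A minimal sketch:

```python
import numpy as np

def blocked_u_statistic(x, h, m):
    """The estimator V_n from Step 1: average h over the k = floor(n/m)
    disjoint blocks (x_1..x_m), (x_{m+1}..x_{2m}), ..., instead of
    over all binom(n, m) subsets."""
    k = len(x) // m
    return np.mean([h(*x[i * m:(i + 1) * m]) for i in range(k)])

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, 400)
h = lambda a, b: abs(a - b)   # bounded symmetric kernel with K = 1

# Each coordinate enters exactly one term of V_n; that is what gives the
# bounded differences L_i = 2K/k used in the proof.
print(blocked_u_statistic(x, h, m=2))   # close to E|X - Y| = 1/3
```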