Statistical Learning Theory.

Supervised learning setting.

  • $ \mathcal{X} $, input space, e.g. $\mathbb{R}^d$
  • $ \mathcal{Y} $, label space, e.g. $\{0, 1\}$
  • $ \mathcal{Y}' $, prediction space, e.g. $[0,1]$
  • $ \mathcal{D} \in Δ(\mathcal{X} \times \mathcal{Y}) $, the data distribution (unknown, coming from the real world)
  • $ \mathcal{H} $, Hypothesis class
  • $h\in \mathcal{H}: \mathcal{X} \mapsto \mathcal{Y}' $, a hypothesis function; e.g. $\mathcal{H}$ could be the set of all linear classifiers.
  • $l: \mathcal{Y}’ \times \mathcal{Y} \mapsto [0,1]$, loss function
  • $S$, a given data set whose elements are drawn i.i.d. from $\mathcal{D}$
  • $m$ = $ \left\vert S \right\vert $, number of samples in $S$
  • $ {\displaystyle R(h) = \mathop{\operatorname{\mathbb{E}}}_{(x,y) \sim \mathcal{D}}{l(h(x), y)} } $, Risk of a hypothesis
  • $ {\displaystyle \hat{R}_S(h) = \mathop{\operatorname{\mathbb{E}}}_{(x,y) \sim \text{uniform}(S)}{l(h(x), y)} } $, Empirical Risk of a hypothesis
  • $ { \displaystyle h^* = \mathop{\operatorname{argmin}}_{h\in \mathcal{H}}{R(h)} } $, Best hypothesis in $\mathcal{H}$
  • $ { \displaystyle h^{**} = \mathop{\operatorname{argmin}}_{h}{R(h)} } $, Best hypothesis over all possible $h$ (not restricted to $\mathcal{H}$)
  • $ { \displaystyle \hat{h}_S = \mathop{\operatorname{argmin}}_{h\in \mathcal{H}}{\hat{R}_S(h)} } $, Best hypothesis on the sample $S$ (the empirical risk minimizer)
  • .
  • Estimation error: $R(\hat{h}_S) - R(h^*)$
  • Approximation error: $R(h^*) - R(h^{**})$
  • .
  • Growth function

    Suppose $\mathcal{H}$ is a set of binary functions.

    \[\Pi_{\mathcal{H}}(m; \mathcal{X}) = \mathop{\operatorname{max}}_{(x_1,\ldots,x_m) \in \mathcal{X}^m}{ \left\vert \left\{ [ h(x_1),\cdots,h(x_m) ] \vert h \in \mathcal{H} \right\} \right\vert }\]

    i.e., choose the points $x_1,\ldots,x_m$ that maximize the number of distinct label patterns $[h(x_1),\ldots,h(x_m)]$ obtained as $h$ ranges over $\mathcal{H}$ (a brute-force sketch follows the VC-dimension definition below).

  • VC-dimension

    Suppose $\mathcal{H}$ is a set of binary functions.

    \[\operatorname{VCD}(\mathcal{H}; \mathcal{X}) = \mathop{\operatorname{max}}{ \left\{ m\in \mathbb{N}^+ \,\vert\, \Pi_{\mathcal{H}}(m; \mathcal{X}) = 2^m \right\} }\]
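
    A brute-force sketch of both quantities on a toy class: 1-D threshold classifiers $h_t(x) = \mathbb{1}[x \geq t]$ over small finite grids standing in for $\mathcal{H}$ and $\mathcal{X}$. The grids and helper names (`labelings`, `growth_function`) are made up for illustration, not taken from any library.

```python
# Toy sketch (illustrative only): brute-force the growth function Pi_H(m) and the
# VC dimension for 1-D threshold classifiers h_t(x) = 1[x >= t].
from itertools import combinations

thresholds = [t / 10 for t in range(-10, 11)]   # finite stand-in for H
points     = [p / 7 for p in range(-7, 8)]      # finite stand-in for X

def labelings(xs):
    """Distinct label patterns [h_t(x_1), ..., h_t(x_m)] as t ranges over the class."""
    return {tuple(int(x >= t) for x in xs) for t in thresholds}

def growth_function(m):
    """Pi_H(m): maximize the number of distinct patterns over choices of m points."""
    return max(len(labelings(xs)) for xs in combinations(points, m))

for m in (1, 2, 3):
    print(m, growth_function(m))                # thresholds realize m + 1 patterns

# VC dimension: the largest m that is shattered, i.e. Pi_H(m) = 2^m (here: 1).
print("VCD:", max(m for m in (1, 2, 3) if growth_function(m) == 2 ** m))
```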

Most of current ML is just empirical risk minimization, or ERM: return $\hat{h}_S$, the minimizer of $\hat{R}_S$ over $\mathcal{H}$.
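
A minimal sketch of what ERM looks like in code, assuming a synthetic data set, 0-1 loss, and a small finite class of threshold classifiers (all of these are choices made up for the example):

```python
# Minimal ERM sketch: pick the hypothesis in a small finite class H that minimizes
# the empirical risk (0-1 loss) on the sample S. The data-generating process and
# the hypothesis class are illustrative.
import random

random.seed(0)

def true_label(x):                                    # stands in for the unknown D
    return int(x >= 0.3)

m = 200                                               # sample size |S|
S = [(x, true_label(x) ^ (random.random() < 0.1))     # 10% label noise
     for x in (random.random() for _ in range(m))]

H = [lambda x, t=t: int(x >= t) for t in [i / 20 for i in range(21)]]

def empirical_risk(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

h_hat = min(H, key=lambda h: empirical_risk(h, S))    # the ERM hypothesis \hat{h}_S
print("empirical risk of ERM hypothesis:", empirical_risk(h_hat, S))
```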

  • Uniform deviation bound

    Here $ \hat{R} $ is shorthand for $\hat{R}_S$ and $\hat{h}$ for $\hat{h}_S$.

    \[\begin{aligned} & R(\hat{h}) - R(h^*) \\ &= R(\hat{h}) - \hat{R}(\hat{h}) \\ &+ \hat{R}(\hat{h}) - \hat{R}(h^*) \,\,\,\,(\leq 0 \text{, since } \hat{h} \text{ minimizes } \hat{R}) \\ &+ \hat{R}(h^*) - R(h^*) \\ &\leq 2 \sup_{h\in \mathcal{H}}{ \left\vert R(h) - \hat{R}(h) \right\vert } \end{aligned}\]

    The quantity on the right-hand side is called the uniform deviation bound: controlling it controls the estimation error uniformly over $\mathcal{H}$.

  • Rademacher Complexity

    Given a set $ {\displaystyle A\subseteq \mathbb {R} ^{m}} $ and i.i.d. Rademacher random variables $σ_1,\ldots,σ_m$ (each uniform on $\{-1,+1\}$), the Rademacher complexity of $A$ is defined as follows

    \[\operatorname {Rad} (A):={\frac {1}{m}}\mathbb {E} _{\sigma }\left[\sup _{a\in A}\sum _{i=1}^{m}\sigma _{i}a_{i}\right]\]
  • Empirical Rademacher complexity

    Let $ S = (x_1,\ldots,x_m) \in \mathcal{X}^m $; the empirical Rademacher complexity of $ \mathcal{H} $ is

    \[\widehat{\mathfrak{R}}_S(\mathcal{H}) = \mathop{\operatorname{Rad}}(\{ [h(x_1), \ldots, h(x_m)] \vert h\in \mathcal{H} \})\]

    or

    \[\widehat{\mathfrak{R}}_S(\mathcal{H}) = \frac{1}{m} \mathop{\operatorname{\mathbb{E}}}_{σ_{1\ldots m}}{ \left[ \sup_{h\in \mathcal{H}}{\sum_{i=1}^{m}{σ_ih(x_i)}} \right]}\]
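
    A Monte Carlo sketch of this quantity for a binary threshold class on a fixed sample: the expectation over $σ$ is approximated by averaging random sign draws. The class, the sample, and the number of trials are assumptions for illustration.

```python
# Monte Carlo estimate of the empirical Rademacher complexity
#   (1/m) * E_sigma[ sup_h sum_i sigma_i * h(x_i) ]
# for a small class of threshold functions on a fixed sample S (illustrative).
import random

random.seed(0)
S = [random.random() for _ in range(50)]                       # fixed sample x_1..x_m
H = [lambda x, t=t: int(x >= t) for t in [i / 20 for i in range(21)]]

def empirical_rademacher(S, H, n_trials=2000):
    m, total = len(S), 0.0
    for _ in range(n_trials):
        sigma = [random.choice((-1, 1)) for _ in range(m)]     # Rademacher draws
        total += max(sum(s * h(x) for s, x in zip(sigma, S)) for h in H)
    return total / (n_trials * m)

print("empirical Rademacher complexity ≈", empirical_rademacher(S, H))
```
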
  • Rademacher complexity

    Given a distribution $ \mathcal{D} \in Δ(\mathcal{X}) $ and a sample size $m$, the Rademacher complexity of $ \mathcal{H} $ is

    \[\mathfrak{R}_m(\mathcal{H}) = \mathop{\operatorname{\mathbb{E}}}_{S \stackrel{\mathop{\operatorname{i.i.d}}}{\sim} \mathcal{D}^m}[\widehat{\mathfrak{R}}_S(\mathcal{H})]\]
  • Theorem 3.1

    Let $ G $ be a family of functions mapping from $\mathcal{X}$ to $ [0, 1] $, and let $ \mathcal{D} \in Δ(\mathcal{X}) $ be a distribution. Then, for any $δ > 0$, with probability at least $ 1 − δ $ over the draw of an i.i.d. sample $ S $ of size $m$ from $\mathcal{D}$, the following holds:

    \[\Phi(S) := \sup_{g \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] \right\} \leq 2 \mathfrak{R}_m(G) + \sqrt{\frac{\log (1/\delta)}{2m}} \\ \Phi(S) \leq 2 \widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log (2/\delta)}{2m}}\]

    where

    \[\mathop{\operatorname{\mathbb{E}}}[g] = \mathop{\operatorname{\mathbb{E}}}_{X \sim \mathcal{D}}[g(X)] \\ \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] = \mathop{\operatorname{\mathbb{E}}}_{X \sim \mathop{\operatorname{uniform}}(S)}[g(X)]\]
    • Proof

      We can apply McDiarmid's inequality to $ \Phi(S) $. Why does it apply? Suppose $S$ and $S'$ differ in exactly one coordinate, say $z_m$ versus $z_m'$. Then

      \[\begin{aligned} \Phi(S') - \Phi(S) &= \sup_{g \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S'}}[g] \right\} - \sup_{g \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] \right\} \\ &\leq \sup_{g \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S'}}[g] - \left( \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] \right) \right\} \\ &= \sup_{g \in G} \left\{ \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S'}}[g] \right\} \\ &= \sup_{g \in G} \frac{g(z_m) - g(z_m')}{m} \\ &\leq \frac{1}{m} \end{aligned}\]

      (The first inequality uses $\sup_g a_g - \sup_g b_g \leq \sup_g (a_g - b_g)$; the last step uses $g(z) \in [0,1]$.)

      Similarly, we have $ \Phi(S) - \Phi(S') \leq \frac{1}{m} $. So, by McDiarmid's inequality, with probability at least $1 − δ$,

      \[\Phi(S) \leq \mathop{\operatorname{\mathbb{E}}}_{S}[\Phi(S)] + \sqrt{\frac{\log (1/\delta)}{2m}}\]

      TODO: for the rest (bounding $\mathop{\operatorname{\mathbb{E}}}_{S}[\Phi(S)]$ by $2\mathfrak{R}_m(G)$), see the book Foundations of Machine Learning; I will write it down later.
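
    A one-shot numeric sanity check of the second inequality (illustrative only: the bound holds with probability at least $1 − δ$ over the draw of $S$, so a single draw proves nothing). The finite domain, the threshold family $G$, and the Monte Carlo estimate of $\widehat{\mathfrak{R}}_S(G)$ are all assumptions made for the example.

```python
# Check Phi(S) <= 2 * hat{R}_S(G) + 3 * sqrt(log(2/delta) / (2m)) on one sample.
# G is a family of thresholds g_t(x) = 1[x >= t] mapping into [0, 1]; D is uniform
# over a finite grid so that E[g] can be computed exactly. All choices illustrative.
import math, random

random.seed(1)
domain = [i / 100 for i in range(100)]                 # finite X; D = uniform(domain)
G = [lambda x, t=t: float(x >= t) for t in [i / 10 for i in range(11)]]
m, delta = 100, 0.05
S = [random.choice(domain) for _ in range(m)]          # S ~ D^m

def E(g):     return sum(g(x) for x in domain) / len(domain)   # exact E_D[g]
def E_hat(g): return sum(g(x) for x in S) / m                  # empirical mean on S

Phi = max(E(g) - E_hat(g) for g in G)

def empirical_rademacher(S, G, n_trials=2000):
    m, total = len(S), 0.0
    for _ in range(n_trials):
        sigma = [random.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * g(x) for s, x in zip(sigma, S)) for g in G)
    return total / (n_trials * m)

bound = 2 * empirical_rademacher(S, G) + 3 * math.sqrt(math.log(2 / delta) / (2 * m))
print(f"Phi(S) = {Phi:.3f}  <=  bound = {bound:.3f}")
```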

  • Theorem 3.3 Massart’s lemma

    Let $ A \subset \mathbb{R}^m $ be a finite set, with $r = \max_{\textbf{x}\in A}{\left\lVert \textbf{x} \right\rVert_2}$, then

    \[\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \frac{1}{m} \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \leq \frac{r\sqrt{2\log \left\vert A \right\vert }}{m}\]

    where the $σ_i$ are independent Rademacher random variables (uniform on $\{-1,+1\}$), and $x_1,\ldots,x_m$ are the components of the vector $\textbf{x}$.

    Proof: Suppose $λ>0$. (The first inequality below is Jensen's inequality applied to the convex function $\exp$.)

    \[\begin{aligned} & \exp \left( λ\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \right) \\ & \leq \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \exp \left( λ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right) \right] } \\ & = \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A} \exp \left( λ \sum_{i=1}^{m}{σ_ix_i} \right) \right]} \\ & \leq \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \sum_{\textbf{x}\in A} \exp \left( λ \sum_{i=1}^{m}{σ_ix_i} \right) \right]} \\ & = \sum_{\textbf{x}\in A} \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \exp \left( λ \sum_{i=1}^{m}{σ_ix_i} \right) \right]} \\ & = \sum_{\textbf{x}\in A} \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \prod_{i=1}^{m} \exp \left( {λσ_ix_i} \right) \right]} \\ & = \sum_{\textbf{x}\in A} \prod_{i=1}^{m} \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \exp \left( {λσ_ix_i} \right) \right]} & \text{ by independence}\\ & \leq \sum_{\textbf{x}\in A} \prod_{i=1}^{m} \exp \left( {\frac{λ^2(2x_i)^2}{8}} \right) & \text{ by Hoeffding's lemma}\\ & = \sum_{\textbf{x}\in A} \exp \left( \sum_{i=1}^{m}{\frac{λ^2(2x_i)^2}{8}} \right) \\ & \leq \sum_{\textbf{x}\in A} \exp \left( {\frac{λ^2r^2}{2}} \right) \\ & = \left\vert A \right\vert \exp \left( {\frac{λ^2r^2}{2}} \right) \\ \end{aligned}\]

    Now

    \[\begin{aligned} \exp \left( λ\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \right) & \leq \left\vert A \right\vert \exp \left( {\frac{λ^2r^2}{2}} \right) \\ \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } & \leq \frac{\log \left( \left\vert A \right\vert \exp \left( {\frac{λ^2r^2}{2}} \right) \right)}{λ} = \frac{\log \left( \left\vert A \right\vert \right)}{λ} + \frac{λr^2}{2} \end{aligned}\]

    Choosing $λ = \frac{\sqrt{2\log \left\vert A \right\vert }}{r}$ (which minimizes the right-hand side), we get:

    \[\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \leq r \sqrt{2\log \left\vert A \right\vert }\]

    Dividing both sides by $m$ gives the lemma.
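
    A small numeric check of the pre-division bound $\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}[\sup_{\textbf{x}\in A}\sum_i σ_ix_i] \leq r\sqrt{2\log \left\vert A \right\vert}$, with the expectation computed exactly by enumerating all $2^m$ sign vectors. The set $A$ is random and everything here is made up for illustration.

```python
# Exact check of E_sigma[ sup_{x in A} sum_i sigma_i x_i ] <= r * sqrt(2 log |A|)
# for a tiny random A subset of R^m, enumerating all 2^m Rademacher sign vectors.
import math, random
from itertools import product

random.seed(0)
m, n_vectors = 8, 5
A = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n_vectors)]
r = max(math.sqrt(sum(xi * xi for xi in x)) for x in A)           # max L2 norm in A

lhs = sum(max(sum(s * xi for s, xi in zip(sigma, x)) for x in A)  # sup over A
          for sigma in product((-1, 1), repeat=m)) / 2 ** m       # exact E_sigma
rhs = r * math.sqrt(2 * math.log(len(A)))

print(f"E[sup] = {lhs:.3f}  <=  r * sqrt(2 log|A|) = {rhs:.3f}")
```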