CS 7545 Fall 2018 Note

From Foundation of Machine Learning (Mohri)

Show Content

A.0 Notation

We will denote by \(\mathbb{H}\) a vector space whose dimension may be infinite.
A.1 Definition Norms

A mapping \(\| \cdot \| : \mathbb{H} \mapsto \mathbb{R}^+\) is said to define a norm if:
- definiteness: \(\| \textbf{x} \| = 0 \iff \textbf{x} = 0\)
- homogeneity: \(\forall x \in \mathbb{H}, \forall \alpha \in \mathbb{R},\, \| alp \textbf{x} \| = \left\vert \alpha \right\vert \| \textbf{x} \|\)
- triangle inequality: \(\forall \textbf{x}, \textbf{y} \in \mathbb{H}, \| \textbf{x} + \textbf{y} \| \leq \| \textbf{x} \| + \| \textbf{y} \|\).
My comment: according to Wikipedia. \(\mathbb{H}\) must be a vector space over a subfield of \(\mathbb{C}\). This is required for \(\left\vert \alpha \right\vert\) to make sense ( or somehow \(\mathbb{H}\) is endowed with an absolute value ) . In homogeneity it should also be \(\alpha \in \mathbb{F}\) where \(\mathbb{F}\) is a subfield of \(\mathbb{C}\).
p-norm

for \(p \geq 1\), the p-norm of \(\textbf{x} \in \mathbb{C}^n\) is defined as
\[\| \textbf{x} \|_p = \left( \sum_{j=1}^{n} \left\vert x_j \right\vert ^p \right) ^{1/p}\]
equivalency of norms

two norms \(\| \cdot \|\) and \(\| \cdot \|'\) are said to be equivalent iff there exists real number \(\alpha, \beta \gt 0\) such that \(\forall \textbf{x}\)
\[\alpha \| \textbf{x} \| \leq \| \textbf{x} \|' \leq \beta \| \textbf{x} \|\]
More generally, all norms on a finite-dimensional space are equivalent.

My comment: To be specific: all norms on a finite-dimensional Banach space are equivalent. Proof (Cached). I cannot find for now a counter example of a non-equivalent norm on a non-complete finite-dimensional space.
Dual norms

see intuition of dual norm here(cached)

Dual is a norm on the dual space

Let \(f : V \mapsto \mathbb{F}\),
\[\| f \|_* = \sup_{x\neq 0}{\frac{ \left\vert f(x) \right\vert} {\| x \|}}\]
This actually belongs to functional analysis… I don’t know…

Notes about some Convex

B.1 Definition Gradient

Let \(f : \mathcal{X} \subset \mathbb{R}^N \mapsto \mathbb{R}\) be a differentiable function, Then, the gradient of \(f\) at \(x \in \mathcal{X}\) is the vector in \(\mathbb{R}^n\) denoted by \(\nabla (\textbf{x})\) and defined by
\[\nabla f(\textbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\textbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\textbf{x}) \end{bmatrix}\]
B.2 Definition Hessian

Let \(f : \mathcal{X} \subset \mathbb{R}^N \mapsto \mathbb{R}\) be a twice differentiable function, Then, the Hession of \(f\) at \(x \in \mathcal{X}\) is the vector in \(\mathbb{R}^n\) denoted by \(\nabla (\textbf{x})\) and defined by
\[\nabla^2f(\textbf{x}) = \begin{bmatrix} \frac{\partial^2f}{\partial x_1,x_j}(\textbf{x}) \end{bmatrix}\]
Theorem B.1 Fermat’s theorem

Let \(f : \mathcal{X} \subset \mathbb{R}^N \mapsto \mathbb{R}\) be a differentiable function. If \(f\) admits a local extremum at \(\textbf{x}^* \in \mathcal{X}\),T then, \(\nabla f(\textbf{x}^*) = 0\), that is, \(\textbf{x}^*\) is a stationary point.

B.3 Convex set

A set \(\mathcal{X} \in \mathbb{R}^n\) is said to be convex if for any two points \(\textbf{x}, \textbf{y} \in \mathcal{X}\), the segment \([\textbf{x}, \textbf{y}]\) lies in \(\mathcal{X}\), that is
\[\{ \alpha \textbf{x} + (1-\alpha) \textbf{y} : 0 \leq \alpha \leq 1 \} \subset \mathcal{X}\]
TODO: https://xingyuzhou.org/blog/notes/strong-convexity (cached)

Fenchel Conjugate

TODO

Notes on some inequality.

Markov inequality
\[P(X\geq a) \leq \frac{\mathbb{E}[x]}{a}\]
In the language of measure theory, Markov’s inequality states that if \((X, \Sigma, \mu)\) is a measure space, f is a measurable extended real-valued function, and \(\epsilon > 0\), then
\[μ ( \{ x ∈ X : \left\vert f ( x ) \right\vert ≥ ε \} ) ≤ \frac{1}{\epsilon} \int_X { \left\vert f \right\vert \operatorname{d\mu} }.\]
If \(\varphi\) is a monotonically increasing nonnegative function for the nonnegative reals, \(X\) is a random variable, \(a ≥ 0\), and \(\varphi(a) > 0\), then
\[P ( \left\vert X \right\vert \geq a ) \leq \frac{\mathbb{E}[\varphi(\left\vert X \right\vert )]}{\varphi(a)}\]
An immediately corollary, using higher moments of nonnegative \(X\) is
\[P (X \geq a) \leq \frac{\mathbb{E}[X^n]}{a^n}\]
Chebyshev's inequality

Chebyshev’s inequality uses the variance to bound the probability that a random variable deviates far from the mean. Specifically:
\[P( \left\vert X - \mathbb{E}[X] \right\vert \geq a ) \leq \frac{\operatorname{Var}(X)}{a^2},\,\, a > 0\]
for which Markov’s inequality reads
\[{\displaystyle \operatorname {P} \left((X- \mathbb{E}[X])^{2}\geq a^{2}\right)\leq {\frac {\operatorname {Var} (X)}{a^{2}}},}\]
Hoeffding's Lemma

Let \(X\) be any real-valued random variable with expected value \({\displaystyle \mathbb {E} (X)=0}\) and such that \({\displaystyle a\leq X\leq b}\) almost surely. Then, for all \({\displaystyle \lambda \in \mathbb {R} }\),
\[{\displaystyle \mathbb {E} \left[e^{\lambda X}\right]\leq \exp \left({\frac {\lambda ^{2}(b-a)^{2}}{8}}\right).}\]
Note that because of the assumption that the random variable \(X\) has zero expectation, the \(a\) and \(b\) in the lemma must satisfy \(a\leq 0\leq b\)

Show Proof

First note that if one of \(a\) or \(b\) is zero, then \({\displaystyle \textstyle \mathbb {P} \left(X=0\right)=1}\) and the inequality follows. If both are nonzero, then \(a\) must be negative and \(b\) must be positive.

Next, since that \({\displaystyle e^{sx}}\) is a convex function on the real line:
\[\forall x \in [a, b]: \,\, e^{sx}\leq \frac{b-x}{b-a}e^{sa}+\frac{x-a}{b-a}e^{sb}.\]
Applying \(\mathbb {E}\) to both sides of the above inequality gives us:
\[{\displaystyle {\begin{aligned}\mathbb {E} \left[e^{sX}\right]&\leq {\frac {b-\mathbb {E} [X]}{b-a}}e^{sa}+{\frac {\mathbb {E} [X]-a}{b-a}}e^{sb}\\&={\frac {b}{b-a}}e^{sa}+{\frac {-a}{b-a}}e^{sb}&&\mathbb {E} (X)=0\\&=(1-\theta )e^{sa}+\theta e^{sb}&&\theta =-{\frac {a}{b-a}}>0\\&=e^{sa}\left(1-\theta +\theta e^{s(b-a)}\right)\\&=\left(1-\theta +\theta e^{s(b-a)}\right)e^{-s\theta (b-a)}\\\end{aligned}}}\]
Let \(u=s(b-a)\) and define \(\varphi :\mathbb {R} \mapsto \mathbb {R}\) :
\[\varphi (u)=-\theta u+\log \left(1-\theta +\theta e^{u}\right)\]
\(\varphi\) is well defined on \(\mathbb{R}\), to see this we calculate:
\[{\displaystyle {\begin{aligned}1-\theta +\theta e^{u}&=\theta \left({\frac {1}{\theta }}-1+e^{u}\right)\\&=\theta \left(-{\frac {b}{a}}+e^{u}\right)\\&>0&&\theta >0,\quad {\frac {b}{a}}<0\end{aligned}}}\]
The definition of \(\varphi\) implies
\[\mathbb {E} \left[e^{sX}\right]\leq e^{\varphi (u)}.\]
By Taylor’s theorem, for every real \(u\) there exists a \(v\) between \({\displaystyle 0}\) and \(u\) such that
\[\varphi(u)=\varphi(0)+u\varphi'(0)+\tfrac{1}{2} u^2\varphi''(v).\]
Note that:
\[{\displaystyle {\begin{aligned}\varphi (0)&=0\\\varphi '(0)&=-\theta +\left.{\frac {\theta e^{u}}{1-\theta +\theta e^{u}}}\right|_{u=0}\\&=0\\[6pt]\varphi ''(v)&={\frac {\theta e^{v}\left(1-\theta +\theta e^{v}\right)-\theta ^{2}e^{2v}}{\left(1-\theta +\theta e^{v}\right)^{2}}}\\[6pt]&={\frac {\theta e^{v}}{1-\theta +\theta e^{v}}}\left(1-{\frac {\theta e^{v}}{1-\theta +\theta e^{v}}}\right)\\[6pt]&=t(1-t)&&t={\frac {\theta e^{v}}{1-\theta +\theta e^{v}}}\\&\leq {\tfrac {1}{4}}&&t>0\end{aligned}}}\]
Therefore,
\[{\displaystyle \varphi (u)\leq 0+u\cdot 0+{\tfrac {1}{2}}u^{2}\cdot {\tfrac {1}{4}}={\tfrac {1}{8}}u^{2}={\tfrac {1}{8}}s^{2}(b-a)^{2}.}\]
This implies
\[{\displaystyle \mathbb {E} \left[e^{sX}\right]\leq \exp \left({\tfrac {1}{8}}s^{2}(b-a)^{2}\right).}\]
Hoeffding's inequality

Let \(X_1, \ldots, X_n\) be independent random variables bounded by the interval \([0, 1]: 0 ≤ X_i ≤ 1\). We define the empirical mean of these variables by
\[{\displaystyle {\overline {X}}={\frac {1}{n}}(X_{1}+\cdots +X_{n}).}\]
One of the inequalities in Theorem 1 of Hoeffding (1963) states
\[{\displaystyle {\begin{aligned}\operatorname {P} ({\overline {X}}-\mathrm {E} [{\overline {X}}]\geq t)\leq e^{-2nt^{2}}\end{aligned}}}\]
where \({\displaystyle 0\leq t}.\)

Theorem 2 of Hoeffding (1963) is a generalization of the above inequality when it is known that \(X_i\) are strictly bounded by the intervals \([a_i, b_i]\):
\[{\displaystyle {\begin{aligned}\operatorname {P} \left({\overline {X}}-\mathrm {E} \left[{\overline {X}}\right]\geq t\right)&\leq \exp \left(-{\frac {2n^{2}t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right)\\\operatorname {P} \left(\left|{\overline {X}}-\mathrm {E} \left[{\overline {X}}\right]\right|\geq t\right)&\leq 2\exp \left(-{\frac {2n^{2}t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right)\end{aligned}}}\]

Show Proof

Suppose \(X_1,\ldots,X_n\) are n independent random variables such that
\[{\displaystyle \operatorname {P} \left(X_{i}\in [a_{i},b_{i}]\right)=1,\qquad 1\leq i\leq n.}\]
Let \({\displaystyle S_{n}=X_{1}+\cdots +X_{n}.}\)

Then for \(s, t ≥ 0\), Markov’s inequality and the independence of \(X_i\) implies:
\[{\displaystyle {\begin{aligned}\operatorname {P} \left(S_{n}-\mathrm {E} \left[S_{n}\right]\geq t\right)&=\operatorname {P} \left(e^{s(S_{n}-\mathrm {E} \left[S_{n}\right])}\geq e^{st}\right)\\&\leq e^{-st}\mathrm {E} \left[e^{s(S_{n}-\mathrm {E} \left[S_{n}\right])}\right]\\&=e^{-st}\prod _{i=1}^{n}\mathrm {E} \left[e^{s(X_{i}-\mathrm {E} \left[X_{i}\right])}\right]\\&\leq e^{-st}\prod _{i=1}^{n}e^{\frac {s^{2}(b_{i}-a_{i})^{2}}{8}}\\&=\exp \left(-st+{\tfrac {1}{8}}s^{2}\sum _{i=1}^{n}(b_{i}-a_{i})^{2}\right)\end{aligned}}}\]
Note that things in the parenthesis are a quadratic function and achieves its minimum at
\[{\displaystyle s={\frac {4t}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}.}\]
Thus we get
\[{\displaystyle \operatorname {P} \left(S_{n}-\mathrm {E} \left[S_{n}\right]\geq t\right)\leq \exp \left(-{\frac {2t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right).}\]
Bregman Divergence

Let \(F:\Omega \to \mathbb {R}\) be a continuously-differentiable, strictly convex function defined on a closed convex set \(\Omega\).

The Bregman distance associated with F for points \(p,q\in \Omega\) is the difference between the value of F at point p and the value of the first-order Taylor expansion of F around point q evaluated at point p:
\[D_{F}(p,q)=F(p)-F(q)-\langle \nabla F(q),p-q\rangle\]
Martingale sequence

A basic definition of a discrete-time martingale is a discrete-time stochastic process (i.e., a sequence of random variables) \(X_1, X_2, X_3, \ldots\) that satisfies for any time \(n\),
\[\begin{aligned} & \mathbf {E} (\vert X_{n}\vert )<\infty \\ & \mathbf {E} (X_{n+1}\mid X_{1},\ldots ,X_{n})=X_{n} \end{aligned}\]
That is, the conditional expected value of the next observation, given all the past observations, is equal to the most recent observation.
Azuma's inequality

Suppose \(\{ X_k : k = 0, 1, 2, 3, \ldots \}\) is a martingale (or super-martingale) and
\[{\displaystyle |X_{k}-X_{k-1}|<c_{k},\,}\]
almost surely. Then for all positive integers \(N\) and all positive reals \(t\),
\[{\displaystyle P(X_{N}-X_{0}\geq t)\leq \exp \left({-t^{2} \over 2\sum _{k=1}^{N}c_{k}^{2}}\right).}\]
TODO: Proof
McDiarmid’s Inequality

McDiarmid’s Inequality is a generalization of Hoeffding’s inequality.

Consider independent random variabls \(X_1, \ldots, X_n\) whose domain is \(\mathcal{X}\), and a mapping \(f: \mathcal{X}^n \mapsto \mathbb{R}\) If, for all \(i = \{1,\ldots,n\}, x_1, \ldots x_n, x^* \in \mathcal{X}\),
\[\left\vert \phi(x_1,\cdots,x_{i-1},x_i,x_{i+1},\cdots,x_n) - \phi(x_1,\cdots,x_{i-1},x^*,x_{i+1},\cdots,x_n) \right\vert \leq c_i\]
(In other words, replacing the \(i\)-th coordinate \({\displaystyle x_{i}}\) by some other value changes the value of \({\displaystyle f}\) by at most \({\displaystyle c_{i}}\).)

Then
\[\mathop{\operatorname{Pr}}(\phi(X_1, \cdots, X_n) - \mathbb{E}[\phi]\geq ε) \leq \exp \left( - \frac{2ε^2}{\sum_{i=1}^{n}{c_i^2}} \right) \\ \mathop{\operatorname{Pr}}(\phi(X_1, \cdots, X_n) - \mathbb{E}[\phi]\leq -ε) \leq \exp \left( - \frac{2ε^2}{\sum_{i=1}^{n}{c_i^2}} \right)\]
Proof:

Let \(Z_k = \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_{1\ldots k}]\) be a sequence of random variables.

Example:
\[\begin{aligned} Z_0 &= \mathop{\operatorname{\mathbb{E}}}[f],\,\,\,\text{ no randomness. } \\ Z_1 &= \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_1] \\ Z_2 &= \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_1, X_2] \\ Z_n &= \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_{1\ldots n}] = f(X_1, \ldots, X_n) \\ \end{aligned}\]
However, we are not defining \(Z_k\) is independent variable, they are strongly dependent.

Informally, given \(Z_{k-1}\), we cannot determine \(X_1, X_{k-1}\), as there may be multiple cases that lead to the same \(Z_{k-1}\).

\(Z_{k}\) is dependent on \(Z_{k-1}\) in a way that if \(Z_k\) generated by \(X_1, X_{k-1}\), \(Z_k\) is as if generated by same \(X_1, X_{k-1}\), plus \(X_k\)

No formal proof is given, but I believe that \(Z_k\) is martingale.
\[\mathop{\operatorname{\mathbb{E}}}{[Z_k\vert Z_{k-1}]} = Z_{k-1}\]
Now apply azuma inequality:
\[P(Z_{N}-Z_{0}\geq t)\leq \exp \left({-t^{2} \over 2\sum _{k=1}^{N}c_{k}^{2}}\right)\]

Online Learning

Halving Algorithm

Suppose we are making binary predictions, with N experts. Suppose there is one perfect expert.

Let \(\widetilde{y}^t, x_i^t\) be the algorithm and the i-th expert’s prediction at round t. \(x_i^t \in \{ 0, 1 \}\).

Let \(y^t\) be the true value nature reveals.

Let \(w_i^t\) be weight we assign to each expert, initial \(w_i^0\) is \(0\).

Let \(M_T(\text{algorithm})\) and \(M_T(i)\) be the total mistakes algorithm or expert i make after rount t. That is:
\[\begin{aligned} M_T(\text{algorithm}) &= \sum_{i=1}^{T}{𝟙[\hat{y}^t \neq y^t]} \\ M_T(i) &= \sum_{i=1}^{T}{𝟙[w_i^t \neq y^t]} \end{aligned}\]
Define the halving algorithm:
\[\begin{aligned} \widetilde{y}^t = \operatorname{sgn} \sum{w_i^tx_i^t} \\ w_i^t = w_i^t \cdot 𝟙[x_i^t = y^t] \end{aligned}\]
Theorem: \(M_T(\text{algorithm}) \leq \log_2N\)

Worst case: suppose we have \(N = 2^n\) experts, only one of whom is perfect. At each round, half of them predict \(1\), the other half predict \(−1\), which maximally slows down the “shrinking rate”. Then, the Halving algorithm will make exactly \(log_2N\) mistakes to discover the perfect expert, and makes no more mistakes from then on.
Majority Weight Algorithm
- Some basic inequality
  - \[- \log (1-x) \geq x ,\, \forall x < 1 \label{ineq:1}\tag{log ineq:1}\]
  - \[-log(1-x) < x + x^2 ,\,\, x \in (0, ≈0.683803) \label{ineq:2}\tag{log ineq:2}\]
  - \[e^{\alpha x} \leq e^{\alpha \cdot 0} + (\frac{e^{\alpha \cdot 1} - e^{\alpha \cdot 0}}{1-0})x = 1 + (e^\alpha - 1) x,\,\, x \in [0,1] \label{ineq:3}\tag{exp ineq:1}\]
What if the best expert makes a few mistakes?

We decay the expert’s weight!!!

Define the Majority Weight Algothim (MWA) to be:
\[\text{MWA:} \begin{cases} \hat{y}^t = \operatorname{round} \left( \frac {\sum{w_i^t}{x_i^t}}{\sum{w_i^t}} \right) \\ w_i^{t+1} = w_i^t \left( 1 - \epsilon \right)^{𝟙[x_i^t \neq y_1^t]},\,\, 0 < \epsilon < ≈0.683803 \end{cases}\]
Define the total weight at time t to be \(\Phi_t = \sum_i{w_i^t}\). Notice that:
- \[\Phi_1 = \sum ... = N\]
- \[\Phi_T \geq w_i^T = (1-\epsilon)^{M_T(i)}\]
- \[\Phi_T \leq \Phi_1 \cdot \left(1-\frac{\epsilon}{2}\right)^{M_T(\text{MWA})}\]
the third inequality holds because if \(y^t \neq \hat{y}^t\), then, at least half of the weight will decay by \((1-\epsilon)\), so the total weight will decay by at least \(( 1 - \frac{\epsilon}{2} )\).

Hence
\[(1-\epsilon)^{M_T(i)} \leq \Phi_{T} \leq N \left( 1-\frac{\epsilon}{2} \right) ^ {M_T(\text{MWA})}\]
take negative log we get
\[-\log \left( (1-\epsilon)^{M_T(i)} \right) \geq -\log \left( N \left( 1-\frac{\epsilon}{2} \right) ^ {M_T(\text{MWA})} \right)\]
Which is
\[M_T(i) \left( - \log \left( 1-\epsilon \right) \right) \geq - log N + M_T(\text{MWA}) \left( - \log \left( 1-\frac{\epsilon}{2} \right) \right)\]
By \ref{ineq:1} and \ref{ineq:2} we get
\[\begin{aligned} M_T(i) \left( \epsilon + \epsilon^2 \right) + \log N & \geq M_T(\text{MWA}) \frac{\epsilon}{2} \\ 2 M_T(i) \left( 1 + \epsilon \right) + \frac{2}{\epsilon} \log N & \geq M_T(\text{MWA}) \end{aligned}\]
Exponential Weight Algorithm

If instead of rounding \(\frac{\sum{w_i^t}{x_i^t}}{\sum{w_i^t}}\), we use a loss function \(l(\widetilde{y}, y)\) that is convex in \(\widetilde{y}\)

Define expert i’s loss at time T to be \(L_T(i) = \sum{l(x_i^t, y^t)}.\), the algorithm’s loss is defined similarly.

Define the Regret of algorithm is \(\text{Regret} = L_T(\text{Alg}) - \min_i{L_T(i)}\), that is, the additional loss of our algorithm compared to the best expert.
\[\text{Algorithm EWA } : \begin{cases} \widetilde{y}^t = \frac{\sum{w_i^t}{x_i^t}}{\sum{w_i^t}} \\ w_i^{t+1} = w_i^t \exp{ - \eta l(x_i^t,y^t)} \\ \end{cases}\]
Theorem: The EWA has the following properties:
\[L_T(EWA) \leq \frac{\eta L_T(i) + \log N }{1-e^{-\eta}},\,\,\,\, i = 1,\ldots,N\]
corollary By tuning m, we have
\[\text{Regret} = L_T(EWA) - L_T(i) \leq \log N + \sqrt{2 L_T(i) \log (N)}\]
Proof of corollary: just use first order tayloy expansion at 0 to \({e^{-\eta}}\)

First we show an addictive alternative setting: Hedge (or actions), it is not used in the proof, but the same proof applies to both settings.

There are N “actions”, For \(t = 1\ldots T\), Algorithm selects some distrubution \(p^t \in \Delta_N\). Nature reveals cost of action \(l_i^t \in [0,1]\) of action \(i\), So the algorithm pays \(\mathbb{E}[l^t] = \sum_i{p_i^t * l_i^t}\), initially \(\textbf{w} = \textbf{1}\).
\[\text{EWA} \begin{cases} \textbf{p}^t = \frac{ \textbf{w}^t }{ \| \textbf{w} \|_1 } \\ w_i^{t+1} = w_i^t \exp(-\eta l_i^t) \end{cases}\] \[\mathbb{E}[L_T(\text{algorithm})] = \sum_{t=1}^{T}{\textbf{p}^t \cdot \textbf{l}^t}\] \[L_T(i) = \sum_{t=1}^{T}{l_i^t} \text{ loss if you always choose action }i\]
Two settings are almost equalent. Set \(l_i^t === l(x_i^t, y^t)\)
- Lemma
  
  Let X be any random variable in [0,1], \(s \in \mathbb{R}\) then by \ref{ineq:3}
  \[e^{sX} \leq 1 + (e^s-1)X\] \[\mathbb{E}[e^{sX}] \leq 1 + (e^s-1)\mathbb{E}[X]\] \[\log \mathbb{E}[{e^{sX}}] \leq \log \left( { 1 + (e^s-1)\mathbb{E}[X] } \right) \leq (e^s-1)\mathbb{E}[X]\]
- Proof
  
  Let sum of weight \(W^t\) be \(\sum_{i=1}^N { w_i^t}\), Cookup potential \(\Phi_t = -\log W^t\).
  
  Define random variable \(X_t\)
  \[p(X_t = l(x_i^t,y^t)) = \frac{w_i^t}{\sum_j{w_j^t}}\]
  Notice:
  \[\begin{aligned} \Phi_{t+1} - \Phi_{t} &= -log (\frac{\sum_{i=1}^N { w_i^{t+1}}}{\sum_{i=1}^N { w_i^t}} ) \\ &= -log (\frac{\sum_{i=1}^N { w_i^{t} \exp(-\eta l(x_i^t,y^t)) }}{\sum_{i=1}^N { w_i^t}} ) \\ &= -log \mathbb{E}[\exp(-\eta X_t)] \end{aligned}\]
  Using EWA-Lemma
  \[\begin{aligned} \Phi_{t+1} - \Phi_{t} &= -log \mathbb{E}[\exp(-\eta X_t)] \\ &\geq -(e^{-\eta}-1) \mathbb{E}[X_t] \\ &= (1 - e^{-\eta}) \sum_{i=1}^{N}{\frac{w_i^t}{\sum_j{w_j^t}}}l(x_i^t, y^t) \\ &\geq (1 - e^{-\eta}) l(\sum_{i=1}^{N}{\frac{w_i^tx_i^t}{\sum_j{w_j^t}}, y^t}) \text{ by convexity of }l \\ &= (1 - e^{-\eta}) l(\widetilde{y}^t,y^t) \end{aligned}\]
  Hence.
  \[(1 - e^{-\eta}) \sum_{t=1}^{T} l(\widetilde{y}^t,y^t) \leq \sum_{t=1}^{T}{\Phi_{t+1}-\Phi_t} = \Phi_{T+1} + \log N\] \[\Phi_{T+1} \leq -\log {W_i^{T+1}} = \eta \sum { l(x_i^t, y^t) } = \eta L_T(i)\]
  Hence
  \[L_T(EWA(\eta)) \leq \frac{\eta L_T(i) + \log N }{1 - e^{-\eta}}\]
Linear Prediction

Algorithm selects \(\textbf{w}^t \in \mathbb{R}^d\)

Nature selects \(\textbf{x}^t \in \mathbb{R}, \| \textbf{x}_t \|_2 \leq 1\)

Algorithm predicts \(\hat{y}^t = \operatorname{sign}(\textbf{w}^t \cdot \textbf{x}^t) \in \{ -1, 1 \}\)

Nature reveals \(y^t \in \{ -1, 1\}\)
\[M_T(alg) = 𝟙[\hat{y}^t \neq y^t] = \sum \frac{1-\hat{y}^ty^t}{2}\]
Assume there exists \(w^* \in \mathbb{R}^d, \|w\| \leq 1\) s.t.
\[(w^* \cdot x^t) y^t > \gamma \forall t\]
where γ is a margin parameter. Equivalently,
\[\|w^*\| \leq \frac{1}{\gamma},\,\,\,\, (w^* \cdot x^t) \cdot y^t > 1\]
Perceptron Algorithm:
\[\textbf{w}^1 = \textbf{0}\] \[\begin{aligned} \textbf{w}^{t+1} = \textbf{w}^t \text{ if } y^t (\textbf{w}^t \cdot \textbf{x}^t) > 0 \\ \textbf{w}^{t+1} = \textbf{w}^t + \textbf{x}^t y^t \text{ otherwise} \end{aligned}\]
Theorem: Perceptron guarantees
\[M_T \leq \frac{1}{\gamma^2} \text{ assuming } w^*\text{ exists }\]
Proof:

Let \(\Phi_t = \| \textbf{w}^* - \textbf{w}^{t+1} \|_2^2\), then
\[\begin{aligned} \frac{1}{\gamma^2} &> \| \textbf{w}^*\|^2 > \| \textbf{w}^*-\textbf{0}\|_2^2 - \|\textbf{w}^*-\textbf{w}^{t+1}\|_2^2 \\ &= \Phi_0 - \Phi_T = \sum_{t=1}^{T}{ \left( \Phi_{t-1} - \Phi_{t} \right) } \\ &= \sum_{t=1}^{T}{ \left( \| \textbf{w}^* - \textbf{w}^{t}\|_2^2 - \| \textbf{w}^* - \textbf{w}^{t+1}\|_2^2 \right) } \\ &= \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( \| \textbf{w}^* - \textbf{w}^{t} \|_2^2 - \| \textbf{w}^* - (\textbf{w}^{t}+\textbf{x}^ty^t) \|_2^2 \right) } \\ &= \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( \| \textbf{w}^* - \textbf{w}^{t} \|_2^2 - \| (\textbf{w}^* - \textbf{w}^{t}) -\textbf{x}^ty^t) \|_2^2 \right) } \\ &= \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( 2 (\textbf{w}^* - \textbf{w}^{t}) \cdot \textbf{x}^t y^t - y^ty^t \textbf{x}^t\cdot \textbf{x}^t \right) } \\ \end{aligned}\]
By our condition, \(- y^ty^t \textbf{x}^t\cdot \textbf{x}^t \geq -1\), \(- \textbf{w}^t \cdot \textbf{x}^t y^t > 0\) (because we make an error here), and \(\textbf{w*} \cdot \textbf{x}^t y^t > 1\)
\[\frac{1}{\gamma^2} > \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( 2 + 0 + (-1) \right) } = M_T(\text{algorithm})\]