supervised learning setting.
Growth function
Suppose $\mathcal{H}$ is a set of binary functions.
\[\Pi_{\mathcal{H}}(m; \mathcal{X}) = \mathop{\operatorname{max}}_{(x_1,\ldots,x_m) \in \mathcal{X}^m}{ \left\vert \left\{ [ h(x_1),\cdots,h(x_m) ] \vert h \in \mathcal{H} \right\} \right\vert }\]which means: pick the points $x_1,\ldots,x_m$ that maximize the number of distinct label vectors obtained by applying every $h\in \mathcal{H}$ to $x_1,\ldots,x_m$.
VC-dimension
Suppose $\mathcal{H}$ is a set of binary functions.
\[VCD(\mathcal{H}; \mathcal{X}) = \mathop{\operatorname{max}}{ \left\{ m\in \mathbb{N}^+ \,\vert\, \Pi_{\mathcal{H}}(m; \mathcal{X}) = 2^m \right\} }\]
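A quick worked example (mine, not from the lecture): take \(\mathcal{X} = \mathbb{R}\) and the threshold class \(\mathcal{H} = \{ x \mapsto \mathbb{1}[x \geq \theta] : \theta \in \mathbb{R} \}\). On any \(m\) distinct points, sorting them shows the only achievable labelings are of the form \(0\cdots0\,1\cdots1\), so \(\Pi_{\mathcal{H}}(m; \mathcal{X}) = m + 1\). Since \(\Pi_{\mathcal{H}}(1) = 2 = 2^1\) but \(\Pi_{\mathcal{H}}(2) = 3 < 2^2\), we get \(VCD(\mathcal{H}; \mathcal{X}) = 1\).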
Current ML is just empirical risk minimization, or ERM.

Uniform deviation bound
Here $ \hat{R} $ is $\hat{R}_S$, $\hat{h}$ is $\hat{h}_S$
\[\begin{aligned} & R(\hat{h}) - R(h^*) \\ &= R(\hat{h}) - \hat{R}(\hat{h}) \\ &+ \hat{R}(\hat{h}) - \hat{R}(h^*) \,\,\,\,(\leq 0) \\ &+ \hat{R}(h^*) - R(h^*) \\ &\leq 2 \sup_{h\in \mathcal{H}}{ \{ R(h) - \hat{R}(h) \} } \end{aligned}\]The last term is called uniform deviation bound.
Rademacher Complexity
Given a set $ {\displaystyle A\subseteq \mathbb {R} ^{m}} $, the Rademacher complexity of A is defined as follows
\[\operatorname {Rad} (A):={\frac {1}{m}}\mathbb {E} _{\sigma }\left[\sup _{a\in A}\sum _{i=1}^{m}\sigma _{i}a_{i}\right]\]Empirical Rademacher complexity
Let $ S = ( x_1,\ldots,x_m ) \in \mathcal{X}^m $; the empirical Rademacher complexity of $ \mathcal{H} $ is
\[\widehat{\mathfrak{R}}_S(\mathcal{H}) = \mathop{\operatorname{Rad}}(\{ [h(x_1), \ldots, h(x_m)] \vert h\in \mathcal{H} \})\]or
\[\widehat{\mathfrak{R}}_S(\mathcal{H}) = \frac{1}{m} \mathop{\operatorname{\mathbb{E}}}_{σ_{1\ldots m}}{ \left[ \sup_{h\in \mathcal{H}}{\sum_{i=1}^{m}{σ_ih(x_i)}} \right]}\]Rademacher
Given distribution $ \mathcal{D} \in Δ(\mathcal{X}) $ and a sample size $m$, the Rademacher complexity of $ \mathcal{H} $ is
\[\mathfrak{R}_m(\mathcal{H}) = \mathop{\operatorname{\mathbb{E}}}_{S \stackrel{\mathop{\operatorname{i.i.d}}}{\sim} \mathcal{D}^m}[\widehat{\mathfrak{R}}_S(\mathcal{H})]\]Theorem 3.1
Let $ G $ be a family of functions mapping from $\mathcal{X}$ to $ [0, 1] $. Given a distribution $ \mathcal{D} \in Δ(\mathcal{X}) $ and an i.i.d. sample $ S $ of size $m$, for any $δ > 0$, with probability at least $ 1 − δ $, the following holds.
\[\Phi(S) = \sup_{g \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] \right\} \leq 2 \mathfrak{R}_m(G) + \sqrt{\frac{\log 1/\delta}{2m}} \\ \Phi(S) = \sup_{g \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] \right\} \leq 2 \widehat{\mathfrak{R}}_S(G) + 3\sqrt{\frac{\log 2/\delta}{2m}}\]where
\[\mathop{\operatorname{\mathbb{E}}}[g] = \mathop{\operatorname{\mathbb{E}}}_{X \sim \mathcal{D}}[g(X)] \\ \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] = \mathop{\operatorname{\mathbb{E}}}_{X \sim \mathop{\operatorname{uniform}}S}[g(X)]\]Proof
We can apply McDiarmid's inequality to $ \Phi(S) $. Why? Suppose $S$ and $S'$ differ in exactly one coordinate, say $z_m$ versus $z_m'$.
\[\begin{aligned} \Phi(S') - \Phi(S) &= \sup_{g' \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g'] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S'}}[g'] \right\} - \sup_{g \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] \right\} \\ &\leq \sup_{g' \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g'] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S'}}[g'] \right\} - \left\{ \mathop{\operatorname{\mathbb{E}}}[g'] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g'] \right\} \\ &\leq \sup_{g' \in G} \left\{ \mathop{\operatorname{\mathbb{E}}}[g'] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S'}}[g'] - \left( \mathop{\operatorname{\mathbb{E}}}[g'] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g'] \right) \right\} \\ &= \sup_{g \in G} \left\{ \widehat{\mathop{\operatorname{\mathbb{E}}}_{S}}[g] - \widehat{\mathop{\operatorname{\mathbb{E}}}_{S'}}[g] \right\} \\ &= \sup_{g \in G} \frac{g(z_m) - g(z_m')}{m} \\ &\leq \frac{1}{m} \end{aligned}\]Similarly, we have $ \Phi(S) - \Phi(S’) \leq \frac{1}{m} $. So, by McDiarmid’s inequality
TODO: For the rest, see the book Foundations of Machine Learning; I will write it down later.
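Since the definitions above are easy to get wrong, here is a small Monte Carlo sanity check of the empirical Rademacher complexity \(\widehat{\mathfrak{R}}_S(\mathcal{H})\) for a finite hypothesis class (my own sketch; the `preds` matrix and the random hypotheses are made up purely for illustration).

```python
import numpy as np

def empirical_rademacher(preds, n_draws=10000, seed=0):
    """Monte Carlo estimate of R_hat_S(H).

    preds: (|H|, m) array, preds[k, i] = h_k(x_i) in {-1, +1}.
    Estimates (1/m) * E_sigma[ sup_h sum_i sigma_i h(x_i) ].
    """
    rng = np.random.default_rng(seed)
    num_h, m = preds.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)   # Rademacher signs
        total += np.max(preds @ sigma)            # sup over the finite class
    return total / (n_draws * m)

# Example: 5 random "hypotheses" evaluated on m = 20 points.
rng = np.random.default_rng(1)
preds = rng.choice([-1.0, 1.0], size=(5, 20))
print(empirical_rademacher(preds))
```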
Theorem 3.3 Massart’s lemma
Let $ A \subset \mathbb{R}^m $ be a finite set, with $r = \max_{\textbf{x}\in A}{\| \textbf{x} \|_2}$, then
\[\frac{1}{m}\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \leq \frac{r\sqrt{2\log \left\vert A \right\vert }}{m}\]where the $σ_i$ are independent Rademacher random variables and $x_1,\ldots,x_m$ are the components of the vector $\textbf{x}$
Proof: Suppose $λ>0$
\[\begin{aligned} & \exp \left( λ\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \right) \\ & \leq \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \exp \left( λ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right) \right] } \\ & = \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A} \exp \left( λ \sum_{i=1}^{m}{σ_ix_i} \right) \right]} \\ & \leq \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \sum_{\textbf{x}\in A} \exp \left( λ \sum_{i=1}^{m}{σ_ix_i} \right) \right]} \\ & = \sum_{\textbf{x}\in A} \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \exp \left( λ \sum_{i=1}^{m}{σ_ix_i} \right) \right]} \\ & = \sum_{\textbf{x}\in A} \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \prod_{i=1}^{m} \exp \left( {λσ_ix_i} \right) \right]} \\ & = \sum_{\textbf{x}\in A} \prod_{i=1}^{m} \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \exp \left( {λσ_ix_i} \right) \right]} & \text{ by independence}\\ & \leq \sum_{\textbf{x}\in A} \prod_{i=1}^{m} \exp \left( {\frac{λ^2(2x_i)^2}{8}} \right) & \text{ by Hoeffding's lemma}\\ & = \sum_{\textbf{x}\in A} \exp \left( \sum_{i=1}^{m}{\frac{λ^2(2x_i)^2}{8}} \right) \\ & \leq \sum_{\textbf{x}\in A} \exp \left( {\frac{λ^2r^2}{2}} \right) \\ & = \left\vert A \right\vert \exp \left( {\frac{λ^2r^2}{2}} \right) \\ \end{aligned}\]Now
\[\begin{aligned} \exp \left( λ\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \right) & \leq \left\vert A \right\vert \exp \left( {\frac{λ^2r^2}{2}} \right) \\ \mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } & \leq \frac{\log \left( \left\vert A \right\vert \exp \left( {\frac{λ^2r^2}{2}} \right) \right)}{λ} = \frac{\log \left( \left\vert A \right\vert \right)}{λ} + \frac{λr^2}{2} \end{aligned}\]choose $λ = \frac{\sqrt{2\log \left\vert A \right\vert }}{r}$, we get:
\[\mathop{\operatorname{\mathbb{E}}}_{\textbf{σ}}{ \left[ \mathop{\operatorname{sup}}_{\textbf{x}\in A}{ \sum_{i=1}^{m}{σ_ix_i} } \right] } \leq r \sqrt{2\log \left\vert A \right\vert }\]main reference: https://en.wikipedia.org/wiki/Matrix_calculus
Numerator-layout notation
Using numerator-layout notation, we have
Matrix Calculation Formula
Notation:
\(a, b, c, d, e, \textbf{a}, \textbf{b}, \textbf{c}, \textbf{d}, \textbf{e}\) are not a function of \(\textbf{x}\)
\(A, B, C, D, E, \textbf{A}, \textbf{B}, \textbf{C}, \textbf{D}, \textbf{E}\) are not a function of \(\textbf{x}\)
\(f, g, h, u, v, \textbf{f}, \textbf{g}, \textbf{h}, \textbf{u}, \textbf{v}\) are functions of \(\textbf{x}\)
$ \partial $ vector by $ \partial $ vector

| Expression | Numerator-layout result | Note |
|---|---|---|
| $\frac{\partial \textbf{a}}{\partial \textbf{x}}$ | $\textbf{0}$ | Zero matrix |
| $\frac{\partial \textbf{x}}{\partial \textbf{x}}$ | $\textbf{I}$ | Identity matrix |
| $\frac{\partial \textbf{Ax}}{\partial \textbf{x}}$ | $\textbf{A}$ | |
| $\frac{\partial a \textbf{u}}{\partial \textbf{x}}$ | $a \frac{\partial \textbf{u}}{\partial \textbf{x}}$ | |
| $\frac{\partial u \textbf{v}}{\partial \textbf{x}}$ | $u \frac{\partial \textbf{v}}{\partial \textbf{x}} + \textbf{v} \frac{\partial u}{\partial \textbf{x}}$ | |
| $\frac{\partial \textbf{Au}}{\partial \textbf{x}}$ | $\textbf{A} \frac{\partial \textbf{u}}{\partial \textbf{x}}$ | $\textbf{A} \begin{bmatrix} \frac{\partial \textbf{u}}{\partial x_1} & \cdots & \frac{\partial \textbf{u}}{\partial x_n} \end{bmatrix}$ |
| $\frac{\partial (\textbf{u} + \textbf{v})}{\partial \textbf{x}}$ | $\frac{\partial \textbf{u}}{\partial \textbf{x}} + \frac{\partial \textbf{v}}{\partial \textbf{x}}$ | |
| $\frac{\partial \textbf{g}(\textbf{u}) }{\partial \textbf{x}}$ | $\frac{\partial \textbf{g}(\textbf{u})}{\partial \textbf{u}} \frac{\partial \textbf{u}}{\partial \textbf{x}}$ | $\begin{bmatrix} \frac{\partial \textbf{g}}{\partial u_1} & \cdots & \frac{\partial \textbf{g}}{\partial u_m} \end{bmatrix} \begin{bmatrix} \frac{\partial \textbf{u}}{\partial x_1} & \cdots & \frac{\partial \textbf{u}}{\partial x_n} \end{bmatrix}$ |
| $\frac{\partial \textbf{f}(\textbf{g}(\textbf{u})) }{\partial \textbf{x}}$ | $\frac{\partial \textbf{f}(\textbf{g}(\textbf{u}))}{\partial \textbf{g}(\textbf{u})} \frac{\partial \textbf{g}(\textbf{u})}{\partial \textbf{u}} \frac{\partial \textbf{u}}{\partial \textbf{x}}$ | |
$ \partial $ scalar by $ \partial $ vector

| Expression | Numerator-layout result | Note |
|---|---|---|
| $\frac{\partial a}{\partial \textbf{x}}$ | $\textbf{0}^\top$ | |
| $\frac{\partial au}{\partial \textbf{x}}$ | $a \frac{\partial u}{\partial \textbf{x}}$ | |
| $\frac{\partial (u+v)}{\partial \textbf{x}}$ | $\frac{\partial u}{\partial \textbf{x}} + \frac{\partial v}{\partial \textbf{x}}$ | |
| $\frac{\partial uv}{\partial \textbf{x}}$ | $u \frac{\partial v}{\partial \textbf{x}} + v \frac{\partial u}{\partial \textbf{x}}$ | |
| $\frac{\partial g(u)}{\partial \textbf{x}}$ | $\frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \textbf{x}}$ | |
| $\frac{\partial f(g(u))}{\partial \textbf{x}}$ | $\frac{\partial f(g(u))}{\partial g(u)} \frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \textbf{x}}$ | |
| $\frac{\partial \textbf{a} \cdot \textbf{x}}{\partial \textbf{x}}$ | $\textbf{a}^\top$ | |
| $\frac{\partial \textbf{u} \cdot \textbf{v}}{\partial \textbf{x}}$ | $\textbf{u}^\top \frac{\partial \textbf{v}}{\partial \textbf{x}} + \textbf{v}^\top \frac{\partial \textbf{u}}{\partial \textbf{x}}$ | $\begin{bmatrix} \textbf{u} \cdot \frac{\partial \textbf{v}}{\partial x_1} & \cdots & \textbf{u} \cdot \frac{\partial \textbf{v}}{\partial x_n} \end{bmatrix} + \begin{bmatrix} \textbf{v} \cdot \frac{\partial \textbf{u}}{\partial x_1} & \cdots & \textbf{v} \cdot \frac{\partial \textbf{u}}{\partial x_n} \end{bmatrix}$ |
| $\frac{\partial \textbf{u}^\top \textbf{Av}}{\partial \textbf{x}}$ | $\textbf{u}^\top \textbf{A} \frac{\partial \textbf{v}}{\partial \textbf{x}} + (\textbf{Av})^\top \frac{\partial \textbf{u}}{\partial \textbf{x}}$ | |
| $\frac{\partial \textbf{x}^\top \textbf{A} \textbf{x}}{\partial \textbf{x}}$ | $\textbf{x}^\top \textbf{A} + ( \textbf{Ax} )^\top = \textbf{x}^\top(\textbf{A} + \textbf{A}^\top)$ | |
| $\frac{\partial \textbf{a}^\top \textbf{x} \textbf{x}^\top \textbf{b}}{\partial \textbf{x}}$ | $\textbf{x}^\top (\textbf{ab}^\top + \textbf{ba}^\top)$ | Let \(u = \textbf{a}^\top \textbf{x}\) and \(v = \textbf{b}^\top \textbf{x}\) |
$ \partial $ scalar by $ \partial $ matrix

| Expression | Numerator-layout result | Note |
|---|---|---|
| $\frac{\partial a}{\partial \textbf{X}}$ | $\textbf{0}^\top$ | |
| $\frac{\partial au}{\partial \textbf{X}}$ | $a \frac{\partial u}{\partial \textbf{X}}$ | |
| $\frac{\partial (u+v)}{\partial \textbf{X}}$ | $\frac{\partial u}{\partial \textbf{X}} + \frac{\partial v}{\partial \textbf{X}}$ | |
| $\frac{\partial uv}{\partial \textbf{X}}$ | $u \frac{\partial v}{\partial \textbf{X} } + v \frac{\partial u}{\partial \textbf{X}}$ | |
| $\frac{\partial g(u)}{\partial \textbf{X}}$ | $\frac{\partial g(u)}{\partial u} \frac{\partial u}{\partial \textbf{X}}$ | |
| $\frac{\partial g(\textbf{U}) }{\partial X_{ij}}$ | $\operatorname{tr} \left( \frac{\partial g(\textbf{U})}{\partial \textbf{U} } \frac{\partial \textbf{U}}{\partial X_{ij}} \right)$ | |
| $\frac{\partial \textbf{a}^\top \textbf{Xb}}{\partial \textbf{X}}$ | $\textbf{b}\textbf{a}^\top$ | |
| $\frac{\partial \textbf{a}^\top \textbf{X}^\top\textbf{b}}{\partial \textbf{X}}$ | $\textbf{a}\textbf{b}^\top$ | |
| $\frac{\partial (\textbf{Xa})^\top \textbf{C} ( \textbf{Xb} ) }{\partial \textbf{X}}$ | $\textbf{a}(\textbf{CXb})^\top + \textbf{b} (\textbf{Xa})^\top \textbf{C}$ | or $\left( \textbf{CXba}^\top + \textbf{C}^\top \textbf{X} \textbf{a}\textbf{b}^\top \right)^\top$ |
| $\frac{\partial \operatorname{tr}(\textbf{X}) }{\partial \textbf{X}}$ | $\textbf{I}$ | |
| $\frac{\partial \operatorname{tr}(\textbf{U} + \textbf{V})}{\partial \textbf{X}}$ | $\frac{\partial \operatorname{tr}(\textbf{U})}{\partial \textbf{X} } + \frac{\partial \operatorname{tr}(\textbf{V})}{\partial \textbf{X}}$ | |
| $\frac{\partial \operatorname{tr}(a \textbf{U})}{\partial \textbf{X} }$ | $a \frac{\partial \operatorname{tr}(\textbf{U})}{\partial \textbf{X} }$ | |
| $\frac{\partial \operatorname{tr}(\textbf{AX})}{\partial \textbf{X}} = \frac{\partial \operatorname{tr}(\textbf{XA})}{\partial \textbf{X}}$ | $\textbf{A}$ | |
| $\frac{\partial \operatorname{tr}(\textbf{AX}^\top)}{\partial \textbf{X}} = \frac{\partial \operatorname{tr}(\textbf{X}^\top \textbf{A})}{\partial \textbf{X}}$ | $\textbf{A}^\top$ | |
| $\frac{\partial \operatorname{tr}(\textbf{X}^\top \textbf{AX})}{\partial \textbf{X}}$ | $\textbf{X}^\top (\textbf{A} + \textbf{A}^\top)$ | |
| $\frac{\partial \operatorname{tr}(\textbf{X}^{-1}\textbf{A})}{\partial \textbf{X}}$ | $- \textbf{X}^{-1} \textbf{AX}^{-1}$ | Follows from differentiating $\textbf{X}\textbf{X}^{-1} = \textbf{I}$ |
| $\frac{\partial \operatorname{tr}(\textbf{AXB})}{\partial \textbf{X}} = \frac{\partial \operatorname{tr}(\textbf{BAX})}{\partial \textbf{X} }$ | $\textbf{BA}$ | |
| $\frac{\partial \operatorname{tr}(\textbf{AXBX}^\top \textbf{C}) }{\partial \textbf{X}}$ | $\textbf{BX}^\top \textbf{CA} + (\textbf{CAXB})^\top$ | |
| $\frac{\partial \operatorname{tr}(\textbf{X}^n) }{\partial \textbf{X}}$ | $n \textbf{X}^{n-1}$ | $n$ is a positive integer |
| $\frac{\partial \operatorname{tr}(\textbf{AX}^n) }{\partial \textbf{X}}$ | $\sum_{i=0}^{n-1}{\textbf{X}^i \textbf{AX}^{n-i-1}}$ | |
| $\frac{\partial \operatorname{tr}(e^{\textbf{X}}) }{\partial \textbf{X} }$ | $e^{\textbf{X}}$ | $e^{\textbf{X}} = \sum_{i=0}^{\infty}{\frac{\textbf{X}^i}{i!}}$ |
| $\frac{\partial \operatorname{tr}(\sin(\textbf{X}))}{\partial \textbf{X}}$ | $\cos(\textbf{X})$ | $\sin(\textbf{X})$ is defined by its power series in the same way |
| $\frac{\partial \operatorname{tr}(\textbf{g}(\textbf{X})) }{\partial \textbf{X}}$ | $\textbf{g}'(\textbf{X})$ | $\textbf{g}(\textbf{X})$ is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. $e^{\textbf{X}},\, \sin(\textbf{X}),\, \cos(\textbf{X}),\, \ln(\textbf{X})$, etc. using a Taylor series); $g(x)$ is the equivalent scalar function, $g'(x)$ its derivative, and $\textbf{g}'(\textbf{X})$ the corresponding matrix function |
| $\frac{\partial \left\vert \textbf{X} \right\vert }{\partial \textbf{X}}$ | $\operatorname{cofactor}(\textbf{X})^\top = \left\vert \textbf{X} \right\vert \textbf{X}^{-1}$ | because $\frac{\partial \left\vert \textbf{X} \right\vert }{\partial X_{ij}} = \textbf{C}_{ij}$ and $\textbf{C}^\top \textbf{X} = \left\vert \textbf{X} \right\vert \textbf{I}$ |
| $\frac{\partial \ln \left( \left\vert a \textbf{X} \right\vert \right) }{\partial \textbf{X}}$ | $\textbf{X}^{-1}$ | |
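A quick numerical check of one of the identities above, \(\frac{\partial \textbf{x}^\top \textbf{A} \textbf{x}}{\partial \textbf{x}} = \textbf{x}^\top(\textbf{A} + \textbf{A}^\top)\), via finite differences; this is only a sanity-check sketch with arbitrary random \(A\) and \(x\).

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n = 4
A = rng.normal(size=(n, n))
x = rng.normal(size=n)

f = lambda v: v @ A @ v                 # f(x) = x^T A x
analytic = x @ (A + A.T)                # table entry: x^T (A + A^T)
print(np.allclose(numerical_grad(f, x), analytic))   # True
```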
To be continued.
Stochastic simulation: Monte Carlo methods
Direct sampling, acceptance-rejection sampling, importance sampling
Markov Chain
For the definitions, see https://pages.uoregon.edu/dlevin/MARKOV/ (cached)
irreducibility and aperiodicity
A chain P is called irreducible if for any two states $x,y$ there exists an integer t (possibly depending on x and y) such that
\[P^t(x,y) > 0\]This means that it is possible to get from any state to any other state using only transitions of positive probability.
Otherwise, it would be as if every state had a height: states could only move downhill or sideways, never uphill.
Let \(\mathcal{T}(x) = \{ t \in \mathbb{Z}^+ : P^t(x,x) > 0\}\) The period of state $x$ is defined to be the GCD of $ \mathcal{T}(x) $.
The chain will be called aperiodic if all states have period 1. If a chain is not aperiodic, we call it periodic.
MCMC M-H and Gibbs
https://cosx.org/2013/01/lda-math-mcmc-and-gibbs-sampling (cached)
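The links above cover the theory; as a minimal illustration (my own sketch, not from the linked posts), here is a random-walk Metropolis–Hastings sampler. The target density and the Gaussian proposal are assumptions chosen for the example.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings targeting a density proportional to exp(log_p)."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()                # symmetric proposal
        # Accept with probability min(1, p(proposal)/p(x)); the symmetric q cancels.
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal
        samples.append(x)
    return np.array(samples)

# Example: sample from a standard normal.
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=5000)
print(draws.mean(), draws.std())   # roughly 0 and 1
```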
EM Algorithm
Notation
Bold symbols such as $ \textbf{x} $ denote vectors; the superscript in $ \textbf{x}^{(i)} $ denotes the $i$-th sample.
Formula Deduction
\[\begin{align} \ell(θ) &= \sum_{i=1}^{m}{\log{p(\mathbf{x}^{(i)}; θ)}} \;\;\;\; & \text{log likelihood} \\ &= \sum_{i=1}^{m}{\log{\sum_{\textbf{z}^{(i)}}{p(\mathbf{x}^{(i)}, \textbf{z}^{(i)}; θ)}}} & \text{find latent var } \textbf{z}^{(i)} \text{behind } \mathbf{x} \\ &= \sum_{i=1}^{m}{\log{\sum_{\textbf{z}^{(i)}}{Q_{i}(\textbf{z}^{(i)})\frac{p(\mathbf{x}^{(i)}, \textbf{z}^{(i)}; θ)}{Q_{i}(\textbf{z}^{(i)})}}}} & \text{arbitrarily insert a distribution of }z^{(i)} \\ &= \sum_{i=1}^{m}{\log{\underset{\textbf{z}^{(i)}\sim Q_{i}}{E}\left[ \frac{p(\mathbf{x}^{(i)}, \textbf{z}^{(i)}; θ)}{Q_{i}(\textbf{z}^{(i)})} \right]}} & \text{treat }\frac{p(\mathbf{x}^{(i)}, z^{(i)}; θ)}{Q_{i}(z^{(i)})}\text{ as a random variable } \textbf{Z}_i \\ &≥ \sum_{i=1}^{m}{\underset{\textbf{z}^{(i)}\sim Q_{i}}{E}\left[\log \frac{p(\mathbf{x}^{(i)}, \textbf{z}^{(i)}; θ)}{Q_{i}(\textbf{z}^{(i)})} \right]} & \log(E[\mathbf{Z}]) ≥ E[\log(\mathbf{Z})],\text{equal holds iff } p(\mathbf{Z}=E[\mathbf{Z}])=1 \\ &= \sum_{i=1}^{m}{\sum_{\textbf{z}^{(i)}}{Q_{i}(\textbf{z}^{(i)}) \log \frac{p(\mathbf{x}^{(i)}, \textbf{z}^{(i)}; θ)}{Q_{i}(\textbf{z}^{(i)})}}} & \text{ by definition of expectation.} \end{align}\]Equal sign is achieved only when $ \textbf{Z} $ is almost a constant
\[\left\{ \begin{array}{l} Q_{i}{(\textbf{z}^{i})} \propto p(\mathbf{x}^{(i)}, \textbf{z}^{(i)}; θ) \\ \sum_{\textbf{z}^{(i)}}{Q_{i}{(\textbf{z}^{i})}} = 1 \end{array} \right.\]Hence:
\[Q_{i}(\textbf{z}^{(i)} = \textbf{u}) = p(\textbf{z}^{(i)} = \textbf{u} | \mathbf{x}^{(i)}; θ)\]Visualization (Image Source Cached (disable js to prevent force jump))
Algorithm
E-step: Fix $ θ $. Find $ Q $ to maximize $ \mathbb{E}[\log(\textbf{Z})] $ so as to achieve
\[\ell(\theta) = \sum_{i=1}^{m}{\log (\mathbb{E}(\textbf{Z}_i))} = \sum_{i=1}^{m}{\mathbb{E}[\log(\textbf{Z}_i)]}\] \[Q_{i}(\textbf{z}^{(i)} = \textbf{u}) := p(\mathbf{z}^{(i)} = \textbf{u} | \mathbf{x}^{(i)}; θ)\]M-step: Set $ Q_i $ fixed. Find $ θ $ to maximize $ \mathbb{E}[\log(\textbf{Z})] $
\[θ := arg \max_{θ} \sum_{i=1}^{m}{\sum_{\mathbf{z}^{(i)}}{Q_{i}(\mathbf{z}^{(i)}) \log \frac{p(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}; θ)}{Q_{i}(\mathbf{z}^{(i)})}}}\]Monotonicity
Here \(Q_i\) is the distribution chosen at the E-step with \(θ^t\), i.e. \(Q_{i}(\textbf{z}^{(i)}) = p(\textbf{z}^{(i)} \vert \mathbf{x}^{(i)}; θ^t)\).

\[\begin{align} \ell(\theta^{t+1}) &\geq \sum_{i=1}^{m}{\sum_{\mathbf{z}^{(i)}}{Q_{i}(\mathbf{z}^{(i)}) \log \frac{p(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}; θ^{t+1})}{Q_{i}(\mathbf{z}^{(i)})}}} & \text{Jensen's inequality} \\ &= \max_{θ}\sum_{i=1}^{m}{\sum_{\mathbf{z}^{(i)}}{Q_{i}(\mathbf{z}^{(i)}) \log \frac{p(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}; θ)}{Q_{i}(\mathbf{z}^{(i)})}}} & \text{by the M-step} \\ &\geq \sum_{i=1}^{m}{\sum_{\mathbf{z}^{(i)}}{Q_{i}(\mathbf{z}^{(i)}) \log \frac{p(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}; θ^t)}{Q_{i}(\mathbf{z}^{(i)})}}} & \\ &=\ell(\theta^t) & \text{the bound is tight at } θ^t \text{ by the E-step} \\ \end{align}\]Why EM:
It is hard to directly optimize \(\sum_{i=1}^{m}{\log{p(\mathbf{x}^{(i)}; θ)}}\), while easier to deal with \(\sum_{i=1}^{m}{\log{\sum_{z^{(i)}}{p(\mathbf{x}^{(i)}, z^{(i)}; θ)}}}\)
Example: use EM to find GMM parameters:
\[θ := arg \max_{θ} \sum_{i=1}^{m}{\sum_{z^{(i)}}{Q_{i}(z^{(i)}) \log \frac{p(\mathbf{x}^{(i)}, z^{(i)}; θ)}{Q_{i}(z^{(i)})}}}\]where \(θ\) is a set of parameter \((\mathbf{ϕ,μ,Σ})\)
\[\begin{align} \text{Let } w^{(i)}_{j} = Q_i (z^{(i)}=j) &= p(z^{(i)}=j|\mathbf{x}^{(i)};ϕ,μ,Σ) \\ &= \frac{p(z^{(i)}=j,\mathbf{x}^{(i)};ϕ,μ,Σ)}{\sum_{l}{p(z^{(i)}=l,\mathbf{x}^{(i)};ϕ,μ,Σ)}} \end{align}\] \[\begin{align} J(Q,θ) &= \sum_{i=1}^{m}{\sum_{z^{(i)}}{Q_i (z^{(i)}) \log \frac{p(\mathbf{x}^{(i)},z^{(i)}; \mathbf{ϕ},\mathbf{μ},\mathbf{Σ})}{Q_i (z^{(i)})}}} \\ &= \sum_{i=1}^{m}{\sum_{j}{Q_i (z^{(i)}=j) \log \frac{p(\mathbf{x}^{(i)}|z^{(i)}=j;\mathbf{μ},\mathbf{Σ})\,p(z^{(i)} = j;ϕ)}{Q_i (z^{(i)}=j)}}} \\ &= \sum_{i=1}^{m}{\sum_{j}{w^{(i)}_j \log \frac{(2π)^{-\frac{n}{2}} |Σ_j|^{-\frac{1}{2}}\exp{\left(-\frac{1}{2} (x^{(i)}-μ_j)^{T} Σ_j^{-1}(x^{(i)}-μ_j) \right)}ϕ_j}{w^{(i)}_j}}} \end{align}\]\(\partial \mathbf{μ}_l\).
\[\begin{align} \frac{∂J}{∂μ_l} \sim \frac{∂}{∂μ_l} \sum_{i=1}^{m}{w^{(i)}_{l}\left(-\frac{1}{2} (x^{(i)}-μ_l)^{T} Σ_l^{-1}(x^{(i)}-μ_l) \right)} \\ \sim \sum_{i=1}^{m}{w^{(i)}_{l} (x^{(i)}-μ_l)^{T} Σ_l^{-1}(-\mathbf{I})} \\ \sim \sum_{i=1}^{m}{w^{(i)}_{l} (μ_l-x^{(i)})^{T} Σ_l^{-1}} \\ \end{align}\]set to \(0\), we have
\[μ_l = \frac{\underset{i}{\sum{}}{w^{(i)}_{l}\mathbf{x}^{(i)}}}{\underset{i}{\sum{}}{w^{(i)}_{l}}}\]\(\partial ϕ_l\).
\[\begin{align} \frac{∂J}{∂ϕ_l} &= \frac{∂}{∂ϕ_l} \sum_{i=1}^{m}{w^{(i)}_{l} log ϕ_l} \end{align}\]also \(\sum_{j}{ϕ_j}=1\)
Use Lagrange multiplier:
\[\frac{∂}{∂ϕ_l} \sum_{i=1}^{m}{w^{(i)}_{l} log ϕ_l} = λ \frac{∂}{∂ϕ_l} \sum_{j}{ϕ_j} \\ \frac{1}{ϕ_l} \sum_{i=1}^{m}{w^{(i)}_{l}} = λ\]Use \(\sum_{j}{ϕ_j}=1\), we have
\[\sum_{j}{\frac{1}{λ}\sum_{i=1}^{m}{w^{(i)}_{j}}} = 1 \\ λ = m\]Note: another way to do this is by a change of variables: let \(ϕ_j = \frac{\exp \psi_j}{\sum \exp \psi_j }\)
So
\[ϕ_l = \frac{1}{m} \sum_{i=1}^{m}{w^{(i)}_{l}}\]\(\partial \mathbf{Σ}_l\).
\[J(Q,θ) = \sum_{i=1}^{m}{\sum_{j}{w^{(i)}_j log \frac{(2π)^{-\frac{n}{2}} |Σ_j|^{-\frac{1}{2}}exp{\left(-\frac{1}{2} (x^{(i)}-μ_j)^{T} Σ_j^{-1}(x^{(i)}-μ_j) \right)}ϕ_j}{w^{(i)}_j}}} \\ \begin{align} \frac{∂J}{∂({Σ_l}^{-1})} &= \frac{∂}{∂({Σ_l}^{-1})} \sum_{i=1}^{m}{w^{(i)}_{l} log \frac{(2π)^{-\frac{n}{2}} |Σ_l|^{-\frac{1}{2}}exp{\left(-\frac{1}{2} (x^{(i)}-μ_l)^{T} Σ_l^{-1}(x^{(i)}-μ_l) \right)ϕ_j}}{w^{(i)}_{l}}} \\ &= \frac{∂}{∂({Σ_l}^{-1})} \sum_{i=1}^{m}{w^{(i)}_{l} log \left({|Σ_l|^{-\frac{1}{2}}exp{\left(-\frac{1}{2} (x^{(i)}-μ_l)^{T} Σ_l^{-1}(x^{(i)}-μ_l) \right)}} \right)} \\ &= \frac{∂}{∂({Σ_l}^{-1})} \sum_{i=1}^{m}{w^{(i)}_{l} \frac{1}{2} log {|Σ_l|^{-1}}} + \frac{∂}{∂({Σ_l}^{-1})} \sum_{i=1}^{m}{w^{(i)}_{l}\left(-\frac{1}{2} (x^{(i)}-μ_l)^{T} Σ_l^{-1}(x^{(i)}-μ_l) \right)} \\ &= \frac{∂}{∂({Σ_l}^{-1})} \sum_{i=1}^{m}{w^{(i)}_{l} log {|Σ_l|^{-1}}} - \frac{∂}{∂({Σ_l}^{-1})} \sum_{i=1}^{m}{w^{(i)}_{l}\left((x^{(i)}-μ_l)^{T} Σ_l^{-1}(x^{(i)}-μ_l) \right)} \\ &= \sum_{i=1}^{m}{w^{(i)}_{l}{Σ_l}} - \sum_{i=1}^{m}{w^{(i)}_{l}\left((x^{(i)}-μ_l)(x^{(i)}-μ_l)^{T} \right)} \end{align}\]Set to 0, we have
\[Σ_l = \frac{\sum_{i=1}^{m}{w^{(i)}_{l}\left((x^{(i)}-μ_l)(x^{(i)}-μ_l)^{T} \right)}}{\sum_{i=1}^{m}w^{(i)}_{l}}\]
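Putting the E-step and the three M-step updates above together, here is a rough numpy sketch of EM for a GMM. The initialization, the small regularization term added to the covariances, and the toy data are my own choices, not part of the derivation.

```python
import numpy as np

def gmm_em(X, k, n_iter=100, seed=0):
    """EM for a Gaussian mixture, following the w, mu, phi, Sigma updates derived above."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]             # init means from the data
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(n)] * k)

    for _ in range(n_iter):
        # E-step: w_ij proportional to phi_j * N(x_i | mu_j, Sigma_j)
        w = np.zeros((m, k))
        for j in range(k):
            diff = X - mu[j]
            inv = np.linalg.inv(Sigma[j])
            quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
            logdet = np.linalg.slogdet(Sigma[j])[1]
            w[:, j] = np.log(phi[j]) - 0.5 * (quad + logdet + n * np.log(2 * np.pi))
        w = np.exp(w - w.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)

        # M-step: closed-form updates for phi, mu, Sigma
        Nj = w.sum(axis=0)
        phi = Nj / m
        mu = (w.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(n)
    return phi, mu, Sigma, w

# Tiny demo on two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
phi, mu, Sigma, _ = gmm_em(X, k=2)
print(phi, mu)
```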
Def: A two player (finite strategy) game is given by a pair of matrices

\[N \in \mathbb{R}^{n\times m}, M \in \mathbb{R}^{n\times m}\]where
\[M_{i,j} = \text{payoff to player 1 if } p_1 \text{ selects action } i \text{ and } p_2 \text{ selects action } j\]Let’s draw \(M\) here
\[M = \begin{bmatrix} m_{1,1} & \cdots & m_{1,m} \\ m_{2,1} & \cdots & m_{2,m} \\ m_{3,1} & \cdots & m_{3,m} \\ m_{4,1} & \cdots & m_{4,m} \\ m_{5,1} & \cdots & m_{5,m} \\ \vdots & \vdots & \vdots \\ m_{n,1} & \cdots & m_{n,m} \\ \end{bmatrix}\]Note: \(\textbf{p}^T M \textbf{q}\) is the expected gain of player 1 if \(p_i\) is the probability of player 1 taking action \(i\) and \(q_j\) is the probability of player 2 taking action \(j\)
Def: A game is zero sum if
\[N = -M\]Def: A Nash equilibrium is a pair \(\widetilde{p} \in \Delta_n, \widetilde{q} \in \Delta_m,\) s.t.
\[\forall p \in \Delta_n, \widetilde{p}^T M \widetilde{q} \geq p^T M \widetilde{q}\] \[\forall q \in \Delta_m, \widetilde{p}^TN\widetilde{q} \geq \widetilde{p}^TNq\]Nash's theorem: There exists a (possibly non-unique) Nash equilibrium for any 2-player game.
Von Neumann’s min-max theorem:
\[∀M \in \mathbb{R}^{n× m}, \min_{p\in \Delta_n} \max_{q\in \Delta_m} p^T M q = \max_{q\in \Delta_m} \min_{p\in \Delta_n} p^T M q\]We say that an algorithm \(\mathcal{A}\) is no-regret if \(\forall \boldsymbol{\ell}^1, \ldots, \boldsymbol{\ell}^T, \ldots \in [0,1]^n\), with \(\textbf{p}^t \in \Delta_n\) chosen as \(\textbf{p}^t \leftarrow \mathcal{A}(\boldsymbol{\ell}^1,\ldots,\boldsymbol{\ell}^{t-1})\),
\[\frac{1}{T} \left( \sum_{t=1}^T{\textbf{p}^t \cdot \boldsymbol{\ell}^t} - \min_{\textbf{p}\in \Delta_n}{\sum_{t=1}^{T}{\textbf{p} \cdot \boldsymbol{\ell}^t}} \right) = \epsilon_T = o(1)\]Observe:
\[\min_{\textbf{p}\in \Delta_n}{\sum_{t=1}^{T}{\textbf{p} \cdot \boldsymbol{\ell}^t}} = \min_{i=1\ldots n}{\sum_{t=1}^{T}{\textbf{e}_i \cdot \boldsymbol{\ell}^t}}\]Note:
Let \(M\) be \(\mathbb{R}^{n\times m},\,\, \mathcal{A}\) be a no-regret algorithm.
For \(t = 1 \ldots T\): let \(\textbf{p}^t \leftarrow \mathcal{A}(\boldsymbol{\ell}^1,\ldots,\boldsymbol{\ell}^{t-1})\), let \(\textbf{q}^t = \operatorname*{argmax}_{\textbf{q}\in \Delta_m}{\textbf{p}^t \cdot \textbf{M}\textbf{q}}\), and feed the loss vector \(\boldsymbol{\ell}^t = \textbf{M}\textbf{q}^t\) back to \(\mathcal{A}\). Write \(\bar{\textbf{p}} = \frac{1}{T}\sum_t{\textbf{p}^t}\) and \(\bar{\textbf{q}} = \frac{1}{T}\sum_t{\textbf{q}^t}\).
Q1: How happy is q
\[\begin{aligned} \frac{1}{T}\sum_{t=1}^{T}{\textbf{p}^t \cdot \textbf{M} \textbf{q}^t } &= \frac{1}{T} \sum_{t=1}^{T}{\max_{\textbf{q}}\textbf{p}^t\cdot \textbf{M} \textbf{q}} \\ &≥ \frac{1}{T}\max_{\textbf{q}}{\sum_{t=1}^{T}{(\textbf{p}^t \cdot \textbf{M} \textbf{q})}} \\ &= \frac{1}{T}\max_{\textbf{q}}{\sum_{t=1}^{T}{(\textbf{p}^t)}} \cdot \textbf{M} \textbf{q} = \max_{\textbf{q}}{ \bar{\textbf{p}} } \cdot \textbf{M} \textbf{q} \\ &≥ \min_{\textbf{p}}\max_{\textbf{q}} \textbf{p}\cdot \textbf{M} \textbf{q} \end{aligned}\]Q2: How happy is p
\[\begin{aligned} \frac{1}{T}\sum_{t=1}^{T}{\textbf{p}^t \cdot \textbf{M} \textbf{q}^t} &= \frac{1}{T}\sum_{t=1}^{T}{\textbf{p}^t \cdot \boldsymbol{\ell}^t} & \\ &= \frac{1}{T}\min_{\textbf{p}}{\sum_{t=1}^{T}{\textbf{p}\cdot \boldsymbol{\ell}^t}} + \epsilon_T & \text{ by definition of no regret} \\ &= \min_{\textbf{p}}{\frac{1}{T} \sum_{t=1}^{T}{\textbf{p} \cdot \textbf{M} \textbf{q}^t}} + \epsilon_T & \\ &= \min_{\textbf{p}}{\textbf{p} \cdot \textbf{M} \bar{\textbf{q}}} + \epsilon_T & \\ &≤ \max_{\textbf{q}} \min_{\textbf{p}} \textbf{p} \cdot \textbf{M} \textbf{q} + \epsilon_T \end{aligned}\]To summarize:
Corollary:
\(\bar{\textbf{p}}\) and \(\overline{\textbf{q}}\) are \(\epsilon_T\)-optimal Nash eq.
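A small simulation of this corollary (my own sketch): run exponential weights as the no-regret algorithm \(\mathcal{A}\) for the \(\textbf{p}\)-player against a best-responding \(\textbf{q}\)-player and average the strategies. The game matrix and the step size \(\eta\) are arbitrary choices.

```python
import numpy as np

def solve_zero_sum(M, T=2000, eta=0.1):
    """Exponential weights (no-regret) for the row player vs. best response for the column player.

    M[i, j] = loss of the row player; the averaged strategies approximate a Nash equilibrium.
    """
    n, m = M.shape
    w = np.ones(n)
    p_bar = np.zeros(n)
    q_bar = np.zeros(m)
    for _ in range(T):
        p = w / w.sum()
        j = np.argmax(p @ M)          # column player best-responds (maximizes row player's loss)
        loss = M[:, j]                # loss vector fed to the no-regret algorithm
        w *= np.exp(-eta * loss)
        p_bar += p / T
        q = np.zeros(m); q[j] = 1.0
        q_bar += q / T
    return p_bar, q_bar

# Matching-pennies-like game: the value is 0.5 and the equilibrium is uniform.
M = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p_bar, q_bar = solve_zero_sum(M)
print(p_bar, q_bar, p_bar @ M @ q_bar)   # both close to [0.5, 0.5], value close to 0.5
```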
Given \(\textbf{x}_1,\ldots,\textbf{x}_n \in \mathcal{X}\), \(\textbf{y}_1,\ldots, \textbf{y}_n \in \{-1,1\}\), Hypothesis class \(H = \{ h_1,\ldots,h_m \}\) where \(h : \mathcal{X} \mapsto \{ -1, 1 \}\)
Weak Learner Assumption:
\[∀ \textbf{p} \in \Delta_n,\, ∃ h \in H,\,\text{s.t. if } \textbf{x}_i \text{ shows up with probability } p_i,\text{ then }\] \[\operatorname{Pr}\{ h(\textbf{x}_i) \neq y_i \} \leq \frac{1}{2} - \frac{\gamma}{2},\;\; \gamma > 0\]Which is: \(∀ \textbf{p} \in \Delta_n,\, ∃ h \in H,\,\text{s.t. } \sum_{i}{p_i\frac{1 - y_ih(\textbf{x}_i) }{2}} \leq \frac{1}{2} - \frac{\gamma}{2}\)
Alternatively:
\[∀ \textbf{p} \in \Delta_n,\, ∃ h \in H,\,\text{s.t. } \gamma \leq \sum_{i}{p_iy_ih(\textbf{x}_i)}\]Proof of \(WLA \implies SLA\)
Define \(\textbf{M} \in \{ -1, 1 \}^{n×m}\), \(\textbf{M}_{i,j} = h_j(\textbf{x}_i)y_i\), then
\[\sum_{i}{p_iy_ih_j(\textbf{x}_i)} = \textbf{p} \cdot \textbf{M} \textbf{e}_j\]WLA says that for any \(\textbf{p}\) there is a \(j\) with \(\gamma \leq \textbf{p} \cdot \textbf{M} \textbf{e}_j\), so
\[\gamma \leq \max_{\textbf{q} \in \Delta_m}{\textbf{p} \cdot \textbf{M} \textbf{q}} \;\;\text{ for every } \textbf{p} \in \Delta_n, \quad\text{hence}\quad \gamma \leq \min_{\textbf{p} \in \Delta_n}\max_{\textbf{q} \in \Delta_m}{\textbf{p} \cdot \textbf{M} \textbf{q}}\]So, by the min-max theorem,
\[0 < \gamma \leq \max_{\textbf{q} \in \Delta_m}\min_{\textbf{p} \in \Delta_n}{\textbf{p} \cdot \textbf{M} q}\]which is strong Learner assumption:
\[\exists \textbf{q} \in \Delta_m \text{ s.t. } 0 < \min_{\textbf{p} \in \Delta_n}{\textbf{p}^T \textbf{M} \textbf{q}}\]Strong Learning Assumption: there exists \(\textbf{q} \in \Delta_m\) s.t. \(∀ i = 1\ldots n\),
\[\sum_{h\in H}{\textbf{q}_h \cdot h(\textbf{x}_i) y_i} > 0\]How to find \(\textbf{q}\)
If we use a no-regret algorithm to learn \(\textbf{p}\) that maximizes the error of the prediction (a.k.a. minimizes \(\textbf{p⋅Mq}\)), and we choose \(\textbf{q}\) according to \(\textbf{p}\) to maximize \(\textbf{p⋅Mq}\), then by the no-regret argument above (Q1/Q2),
\[\gamma - \epsilon_T \leq \min_{\textbf{p}}\max_{\textbf{q}} \textbf{p}\cdot \textbf{M} \textbf{q} - \epsilon_T \leq \min_{\textbf{p}}{\textbf{p} \cdot \textbf{M} \overline{\textbf{q}}}\]So, whenever \(\epsilon_T < \gamma\), \(\overline{\textbf{q}}\) is what we need.
Boosting by Majority Algorithm:
We use EWA as the no-regret algorithm. (Note: EWA requires that \(\textbf{M} \in [0,1]^{n\times m}\) but here \(\textbf{M} \in \{-1,1\}^{n\times m}\); the professor promised it would work somehow. My thought: let \(\textbf{M}' = \frac{\textbf{M}+\textbf{1}}{2}\), then \(\textbf{p} \cdot \textbf{M}' \textbf{q} = \textbf{p} \cdot \frac{\textbf{M}+\textbf{1}}{2} \textbf{q} = \frac{1}{2} \textbf{p} \cdot \textbf{Mq} + \frac{1}{2}\underbrace{\textbf{p} \cdot \textbf{1}\textbf{q}}_{=1}\), so the optimal \(\textbf{q}\) for \(\textbf{M}'\) is also optimal for \(\textbf{M}\).)
Let \(T > \frac{2\log N}{\gamma^2}\) (so that \(\epsilon_T < \gamma\)), \(\textbf{w}^1 = \textbf{1}\). For \(t = 1 \ldots T\), let
\[\begin{aligned} \textbf{p}^t &= \frac{\textbf{w}^t}{\| \textbf{w}^t \|_1} & \\ h_t &= \operatorname*{argmax}_{h\in \mathcal{H}}{\sum_{i=1}^{N}{\textbf{p}^t_ih(\textbf{x}_i)y_i}} & \text{ we should choose q to maximize } \textbf{p}\cdot \textbf{Mq} \\ & &\text { but optimal value always happen at corner } \\ & &\text { which is equivalent to choose best } h_t \\ \textbf{w}^{t+1}_i &= \textbf{w}^t_i \exp{ \left( -\eta h_t(\textbf{x}_i)y_i \right) } & \end{aligned}\]Output \(\overline{h_T} = \frac{1}{T}\sum_{t=1}^{T}{h_t}\)
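A rough numpy sketch of the loop above (the EWA-over-examples variant described here, not the original AdaBoost). The decision stumps used as the finite class \(H\), and the choices of \(\eta\) and \(T\), are my own assumptions for the demo.

```python
import numpy as np

def boost_by_majority(X, y, hypotheses, T, eta=0.1):
    """EWA over the N examples + best weak learner each round, as in the algorithm above.

    hypotheses: list of functions h(X) -> {-1, +1}^N (a finite class H).
    Returns the indices of the chosen weak learners; predict with the sign of their average.
    """
    N = len(y)
    w = np.ones(N)                                   # weights over examples
    H_vals = np.array([h(X) for h in hypotheses])    # (|H|, N) precomputed predictions
    chosen = []
    for _ in range(T):
        p = w / w.sum()
        scores = H_vals @ (p * y)                    # weighted margins sum_i p_i y_i h(x_i)
        t = int(np.argmax(scores))                   # best weak learner under p
        chosen.append(t)
        w *= np.exp(-eta * H_vals[t] * y)            # exponential update on each example
    return chosen

def predict(X, hypotheses, chosen):
    votes = np.mean([hypotheses[t](X) for t in chosen], axis=0)
    return np.sign(votes)

# Toy example: 1-D data, decision stumps as the weak learners.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=50)
y = np.sign(X + 0.1 * rng.normal(size=50))
stumps = [lambda X, s=s: np.sign(X - s) for s in np.linspace(-1, 1, 21)]
chosen = boost_by_majority(X, y, stumps, T=50)
print(np.mean(predict(X, stumps, chosen) == y))
```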
Settings
Let a set \(\mathcal{K} \subset \mathbb{R}^d\) be convex and compact.
For \(t = 1\ldots T\): the algorithm chooses \(\textbf{x}_t \in \mathcal{K}\), then nature reveals a convex loss function \(f_t\), and the algorithm suffers \(f_t(\textbf{x}_t)\).
Let Regret be \(\left(\sum_{1}^{T}{f_t(\textbf{x}_t)} \right) - \min_{\textbf{x}\in \mathcal{K}}{\sum_{t=1}^{T}{f_t(\textbf{x})}}\)
Note:
Online Gradient Descent Algorithm (OGD)
Define
\[\operatorname{Proj}_{\mathcal{K}}{x} = \operatorname*{argmin}_{y\in \mathcal{K}}{\|y-x\|_2}\]Note: \(\forall \textbf{z} \in \mathcal{K}, \forall \textbf{y}\):
\[\| \operatorname{Proj}(\textbf{y}) - z\|_2 \leq \|y-z\|_2\]OGD Algorithm
Let \(\textbf{x}_1\) be an arbitrary point in \(\mathcal{K}\); for \(t = 1, \ldots, T\), let
\[\textbf{x}_{t+1} = \operatorname{Proj}_{\mathcal{K}}{\left( \textbf{x}_t-\eta \nabla_t \right)}, \text{ where } \nabla_t = \nabla f_t(\textbf{x}_t)\]Theorem
Assume \(\| \nabla f_t(\textbf{x}_t) \| \leq G\) and \(\|\textbf{x}_1 - \textbf{x}^* \| \leq D \,(\forall \textbf{x}^* \in \mathcal{K})\), then
\[\operatorname{Regret}_T(\text{OGD}) \leq GD\sqrt{T}\]Proof
Notice that
\[\begin{aligned} \frac{1}{2} \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 &= \frac{1}{2} \| \operatorname{Proj}_{\mathcal{K}}{\textbf{x}_t - \eta \nabla_t} - \textbf{x}^* \|^2 \\ &\leq \frac{1}{2} \| \textbf{x}_t-\eta \nabla_t - \textbf{x}^* \|^2 \\ &= \frac{1}{2} (\textbf{x}_t - \textbf{x}^* - \eta \nabla_t) \cdot (\textbf{x}_t - \textbf{x}^* - \eta \nabla_t) \\ &= \frac{1}{2} \| \textbf{x}_t - \textbf{x}^* \|^2 + \frac{\eta^2}{2}\| \nabla_t\|^2 - \eta \nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) \\ & & \\ \eta \nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) &\leq \frac{1}{2} \left( \| \textbf{x}_t - \textbf{x}^* \|^2 - \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 \right) + \frac{\eta^2}{2}\| \nabla_t\|^2 \end{aligned}\]Also notice that if \(f\) is convex, then \(f(\textbf{x}^*) - f(\textbf{x}) \geq \nabla f(\textbf{x})(\textbf{x}^* - \textbf{x})\), so
\[\nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) \geq f(\textbf{x}_t) - f(\textbf{x}^*)\]So
\[\begin{aligned} \operatorname{Regret}_T(\text{OGD}) &= \sum { f(\textbf{x}_t) - f(\textbf{x}^*) } \\ &\leq \sum_{t=1}^{T} {\nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) } \\ &\leq \sum_{t=1}^{T} { \left( \frac{1}{2\eta} \left( \| \textbf{x}_t - \textbf{x}^* \|^2 - \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 \right) + \frac{\eta}{2}\| \nabla_t\|^2 \right) } \\ &\leq \sum_{t=1}^{T} { \frac{1}{2\eta} \left( \| \textbf{x}_t - \textbf{x}^* \|^2 - \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 \right) } + \frac{\eta}{2} TG^2 \\ &\leq \frac{1}{2\eta} \left( (\underbrace{\| \textbf{x}_1 - \textbf{x}^* \|^2}_{\leq D^2} + \underbrace{ - \| \textbf{x}_{T+1} - \textbf{x}^* \|^2 }_{\leq 0} \right) + \frac{\eta}{2} TG^2 \\ &\leq \frac{1}{2\eta} D^2 + \frac{\eta}{2} TG^2 \\ \end{aligned}\]Set \(\eta = \frac{D}{G\sqrt{T}}\), we have
\[\operatorname{Regret}_T(\text{OGD}) \leq DG\sqrt{T}\]
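A minimal sketch of projected OGD, run here with the same loss every round so that the averaged iterate solves a constrained convex problem (this previews the reduction in the next section). The quadratic objective and the unit-ball constraint set are made-up examples.

```python
import numpy as np

def ogd(grad, project, x1, T, eta):
    """Online gradient descent with projection: x_{t+1} = Proj_K(x_t - eta * grad_t)."""
    x = np.array(x1, dtype=float)
    iterates = [x.copy()]
    for t in range(T):
        x = project(x - eta * grad(x, t))
        iterates.append(x.copy())
    return iterates

# Example: minimize f(x) = ||x - c||^2 over the unit ball (f_t = f for all t).
c = np.array([2.0, 0.0])
grad = lambda x, t: 2 * (x - c)
project = lambda x: x / max(1.0, np.linalg.norm(x))   # projection onto the unit L2 ball
xs = ogd(grad, project, x1=np.zeros(2), T=500, eta=0.05)
print(np.mean(xs, axis=0))   # averaged iterate, close to the constrained optimum [1, 0]
```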
Convex optimization to OCO

In this setting, we want to minimize a convex loss function $ f $ over a convex compact set \(\mathcal{K}\)
We use OCO.
For \(t = 1, \ldots, T\), the algorithm chooses $ x_t $,
Nature then shows $ f_t = f $ (a.k.a. it always shows the same loss function),
After \(T\) rounds, output \(\overline{x_T} = \frac{1}{T}\sum_{t=1}^{T}{x_t}\)
Claim:
\[f(\overline{x_T}) - \mathop{\operatorname{min}}_{x \in \mathcal{K}}{f(x)} \leq \frac{1}{T} \operatorname{Regret}_T\]Proof: (easy)
\[\begin{aligned} f(\overline{x_T}) &\leq \frac{1}{T} \sum_{t=1}^{T}{f(x_t)} & \text{ by convexity} \\ &= \frac{1}{T}\sum_{t=1}^{T}{f_t(x_t)} & \\ &= \frac{1}{T} \left( \sum_{t=1}^{T}{f_t(x^*) } + \operatorname{Regret}_T \right) & \\ &= \frac{1}{T} \left( \sum_{t=1}^{T}{f(x^*) } + \operatorname{Regret}_T \right) & \\ &= f(x^*) + \frac{1}{T} \operatorname{Regret}_T & \end{aligned}\]Learning in the stochastic setting
Learning in the stochastic setting can be reduced to OCO.
In the stochastic learning setting, we want to find a parameter from a predefined parameter set that minimizes the expected loss (e.g., find neural network parameters that minimize the classification error).
Under the conditions that (1) the loss function is convex and (2) the parameter space is convex, this problem can be reduced to OCO.
Settings:
Let \(X,Y\) be domain of data and set of labels.
Let \((X,Y) \sim D\), which means that X,Y are generated i.i.d from distribution D.
Let \(h_θ\) be a hypothesis function that maps from \(X\) to \(Y\), parameterized by \(θ\).
Let \(\mathcal{H}\) be the set of all \(h\), a.k.a, \(\mathcal{H} = \{ h_θ \vert θ \in Θ \}\)
Let \(\ell(h_θ, x, y)\) be the loss if we use \(h_θ\) on point \((x,y)\)
Let \(\ell(h_θ, x, y)\) be convex in \(θ\) (In realistic scenarios, this may not always be true)
Define Risk of \(θ\):
\[\mathcal{L}(θ) = \mathop{\operatorname{\mathbb{E}}}_{(x,y)\sim D}{\ell(h_θ, x, y)}\]Note: \(\mathcal{L}(θ)\) is convex!
We want to find \(\hat{θ}\) from \(T\) data points (i.i.d from some distribution) s.t.
\[\mathcal{L}(\hat{θ}) - \mathcal{L}(θ^*) \leq ε \\ \text{where } θ^* = \mathop{\operatorname{argmin}}_{θ}{\mathcal{L}(θ)}\]Algorithm:
For \(t = 1,\ldots,T\),
select \(θ_t\) using OCO,
then observe \((x_t, y_t)\) (note: it is important not to observe \((x_t, y_t)\) in advance),
then set loss function \(f_t(θ_t) = \ell(h_{θ_t}, x_t, y_t)\),
then output \(\hat{θ} = \frac{1}{T}\sum_{t=1}^{T}{θ_t}\) (see the sketch below)
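A tiny sketch of this reduction, using plain (unconstrained) OGD as the OCO algorithm; the linear model, the squared loss, and the step size are assumptions for illustration.

```python
import numpy as np

# Stochastic setting reduced to OCO: run OGD on the per-sample convex losses and average the iterates.
rng = np.random.default_rng(0)
theta_true = np.array([1.0, -2.0])

def sample():
    x = rng.normal(size=2)
    y = x @ theta_true + 0.1 * rng.normal()
    return x, y

T, eta = 2000, 0.05
theta = np.zeros(2)
theta_bar = np.zeros(2)
for t in range(T):
    theta_bar += theta / T                # theta_t is chosen before seeing (x_t, y_t)
    x, y = sample()                       # observe (x_t, y_t)
    grad = 2 * (theta @ x - y) * x        # gradient of f_t(theta) = (theta . x_t - y_t)^2
    theta = theta - eta * grad            # OCO (OGD) step
print(theta_bar)                          # close to theta_true
```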
Can we say anything about \(\hat{θ}\)? No, because it depends heavily on the specific sample \(x_1,y_1,\ldots,x_T,y_T\)
We can only say something about the expectation of \(\hat{θ}\)
We want to prove
\[\mathop{\operatorname{\mathbb{E}}}_{(x_1, y_1), \ldots, (x_t, y_t) \sim D}{[\mathcal{L}(\hat{θ})]} - \mathcal{L}(θ^*) \leq \frac{1}{T} \mathop{\operatorname{\mathbb{E}}}_{(x_1, y_1), \ldots, (x_t, y_t) \sim D}{[\text{Regret}_T]}\]It is too long to write \({\displaystyle \mathop{\operatorname{\mathbb{E}}}_{(x_1, y_1), \ldots, (x_t, y_t) \sim D} }\), Let’s use the notation \({\displaystyle \mathop{\operatorname{\mathbb{E}}}_{all}}\)
Proof:
\[\begin{aligned} \mathop{\operatorname{\mathbb{E}}}_{all}{\mathcal{L}(\hat{θ})} &= \mathop{\operatorname{\mathbb{E}}}_{all}{\mathcal{L}\left(\frac{1}{T}\sum_{t=1}^{T}{θ_t}\right)} \\ &\leq \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \sum_{t=1}^{T}{\mathcal{L} \left( θ_t \right)}} & \\ &= \frac{1}{T} \sum_{t=1}^{T} \mathop{\operatorname{\mathbb{E}}}_{all}\mathop{\operatorname{\mathbb{E}}}_{(x,y)\sim D}{\ell(h_{θ_t}, x, y)} & \\ &= \frac{1}{T} \sum_{t=1}^{T} \mathop{\operatorname{\mathbb{E}}}_{all}{\ell(h_{θ_t}, x_t, y_t)} & \text{ tricky part: } θ_t \text{ is independent of } (x_t, y_t) \\ &= \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \sum_{t=1}^{T} \ell(h_{θ_t}, x_t, y_t)} & \text{} \\ &= \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \left( \mathop{\operatorname{min}}_{θ}{\sum_{t=1}^{T} \ell(h_{θ}, x_t, y_t)} + \mathop{\operatorname{Regret}_T} \right) } & \text{ by definition of regret} \\ &\leq \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \left( \sum_{t=1}^{T} \ell(h_{θ^*}, x_t, y_t) + \mathop{\operatorname{Regret}_T} \right) } & \text{ for any }θ^* \\ &= \mathcal{L}(θ^*) + \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \mathop{\operatorname{Regret}_T} } & \text{} \\ \end{aligned}\]Algorithm: Follow the Leader (FTL)
\[x_t = \operatorname{arg}\min_{x\in K}{\sum_{s=1}^{t-1}{f_s(x)}}\]Claim:
\[\text{Regret} \leq \sum_{t=1}^{T}{f_t(x_t) - f_t(x_{t+1})}\]Proof By induction.
T = 1:
\[\text{Regret}_T(\text{FTL}) = f_1(x_1) - f_1(x_2)\]T > 1
\[\begin{aligned} \text{Regret}_T(\text{FTL}) &= \sum_{t=1}^{T}{f_t(x_t) - \sum_{t=1}^{T}f_t(x_{T+1})} \\ &= \sum_{t=1}^{T}{\left( f_t(x_t) - f_t(x_{T+1}) \right) } \\ &= \sum_{t=1}^{T-1}{ \left( f_t(x_t) - f_t(x_{T+1}) \right) } + f_T(x_T) - f_T(x_{T+1}) \\ &\leq \sum_{t=1}^{T-1}{ \left( f_t(x_t) - f_t(x_{T}) \right) } + f_T(x_T) - f_T(x_{T+1}) \\ &= \mathop{\operatorname{Regret}_{T-1}} + f_T(x_T) - f_T(x_{T+1}) \\ &\leq \sum_{t=1}^{T-1}{ \left( f_t(x_t) - f_t(x_{t+1}) \right) } + f_T(x_T) - f_T(x_{T+1}) \\ &= \sum_{t=1}^{T}{ \left( f_t(x_t) - f_t(x_{t+1}) \right) } \end{aligned}\]FTL example
Data $ z_t $ are revealed one by one.
Predict the mean $ \mu $.
See scribed lecture
Linear loss is harder:
Let \(\widetilde{f}_t(x) = \nabla f_t(x_t) \cdot (x-x_t) + f_t(x_t)\)
Note that \(\widetilde{f}_t(x_t) = f_t(x_t)\) and, by convexity, \(\widetilde{f}_t(x) \leq f_t(x)\) for all \(x\), so the regret measured with the linearized losses upper-bounds the regret with the original losses.
Hence, the linear-loss case is the larger (harder) one
Bad performance of FTL (linear regret) on linear losses
Example:
\(X \in [-1, 1]\), loss function
\[f_t(X) = \begin{cases} \frac{1}{2}X & \text{ when } t=1 \\ -X & t = 2,4,6,\ldots \\ X & t = 3,5,7,\ldots \end{cases}\]
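A short simulation of this example (my own sketch): FTL keeps flip-flopping between \(-1\) and \(+1\) and pays roughly 1 per round, while the fixed point \(x = 0\) pays roughly 0, so the regret grows linearly in \(T\).

```python
import numpy as np

def ftl_regret(T):
    """FTL on the alternating linear losses f_t(x) = g_t * x over [-1, 1]."""
    g = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]
    cum, loss_ftl = 0.0, 0.0
    for t in range(T):
        # FTL plays the minimizer of the past cumulative linear loss over [-1, 1]
        x_t = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
        loss_ftl += g[t] * x_t
        cum += g[t]
    best_fixed = min(sum(g) * x for x in (-1.0, 1.0, 0.0))
    return loss_ftl - best_fixed

print(ftl_regret(100), ftl_regret(1000))   # grows roughly linearly in T
```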
Follow the Regularized Leader (FTRL)

\[x_{t+1} = \mathop{\operatorname{argmin}}_{x\in \mathcal{K}}{\left\{ \eta \sum_{s=1}^{t}{f_s(x)} + R(x) \right\}}\]Assume \(f_s\) is linear, which is the hardest case; let \(f_s(x)\) be \((g_s \cdot x)\)
Lemma: \(\eta g_t\cdot(x_t-u) = D_R(u, x_t) - D_R(u, x_{t+1}) + D_R(x_t, x_{t+1})\)
Proof: TODO
Other lemma: TODO.
Expert Setting: full info feedback:
We know the loss function. a.k.a, we know what would happen if we chose another \(x_t\)
Bandit Setting: feedback limited to chosen action
Protocol:
We have \(n\) actions (called arms); for \(t = 1 \ldots T\), the algorithm selects \(i_t\); Nature reveals \(\ell_{i_t}^{t}\) from an unobserved \(\boldsymbol{\ell}^{t} \in [0,1]^n\)
EXP3 algorithm
For adversarial settings.
(Note this is different from the original paper)
Algorithm
\(N\): number of actions (also called arms); \(T\): time horizon; \(\boldsymbol{\ell}^t \in [0,1]^N\): loss vector
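The update rules are only implicit in the analysis below (weights \(w\), sampling from \(p^t = w^t / \|w^t\|_1\), and the importance-weighted estimate \(\hat{\ell}^t_i = \ell^t_i / p^t_i\) for the pulled arm), so here is a sketch reconstructed from them; the synthetic loss matrix and the choice of \(\eta\) are my own.

```python
import numpy as np

def exp3(loss_matrix, eta, seed=0):
    """EXP3 variant described above: exponential weights on unbiased loss estimates l_hat = l/p."""
    rng = np.random.default_rng(seed)
    T, N = loss_matrix.shape
    w = np.ones(N)
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()
        i = rng.choice(N, p=p)                 # pull one arm
        loss = loss_matrix[t, i]               # only this entry is revealed
        total_loss += loss
        l_hat = np.zeros(N)
        l_hat[i] = loss / p[i]                 # importance-weighted, unbiased estimate
        w *= np.exp(-eta * l_hat)
    return total_loss

T, N = 5000, 10
rng = np.random.default_rng(1)
losses = rng.uniform(size=(T, N))
losses[:, 3] -= 0.3                            # arm 3 is better on average
losses = np.clip(losses, 0.0, 1.0)
eta = np.sqrt(2 * np.log(N) / (T * N))         # balances log(N)/eta + eta*T*N/2
print(exp3(losses, eta), losses.sum(axis=0).min())   # EXP3's loss vs. the best fixed arm
```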
Theorem: \(\mathop{\operatorname{\mathbb{E}}}{[\sum_{t=1}^{T}{\ell_{i_t}^t - \ell_{i^*}^t}]} \leq \frac{\log n}{\eta} + \frac{\eta}{2}Tn\)
Cook up a potential:
\[\Phi_t = - \frac{1}{\eta} \log \left( \sum_{i=1}^{N}{w^t_i} \right)\]Note in the following deduction, we reach time t. \(\boldsymbol{w}^t\) is fixed, hence \(\Phi_{t}\) is fixed. \(\boldsymbol{\ell}^t\) is unseen. \(\Phi_{t+1}\) is random variable. \(w_i^{t+1}\) are random variable. \(\ell_i^t\) are random variable.
\[\begin{aligned} \Phi_{t+1} - \Phi_{t} &= - \frac{1}{\eta} \log \left( \frac{\sum_{i=1}^{N}{w_i^{t+1}}}{\sum_{i=1}^{N}{w_i^t}} \right) & \\ &= - \frac{1}{\eta} \log \left( \frac{\sum_{i=1}^{N}{w_i^t \exp(-\eta \hat{\ell}_i^t)}}{\sum_{i=1}^{N}{w_i^t}} \right) & \\ &= - \frac{1}{\eta} \log \left( \sum_{i=1}^{N}{\left(\frac{w_i^t}{\sum_{i=1}^{N}{w_i^t}}\right) \exp(-\eta \hat{\ell}_i^t)} \right) & \\ &= - \frac{1}{\eta} \log \left( \sum_{i=1}^{N}{p^t_i \exp(-\eta \hat{\ell}_i^t)} \right) & \\ &= - \frac{1}{\eta} \log \left( \mathop{\operatorname{\mathbb{E}}}{[\exp(-\eta \hat{\ell}^t)]} \right) & \\ &\geq - \frac{1}{\eta} \log \left( \mathop{\operatorname{\mathbb{E}}}{[1 -\eta \hat{\ell}^t + \frac{1}{2} (\eta \hat{\ell}^t)^2] } \right) & e^{-x} \leq 1 - x + \frac{1}{2}x^2 \text{ when } x\geq 0 \\ &= - \frac{1}{\eta} \log \left( \mathop{1 - \operatorname{\mathbb{E}}}{[ \eta \hat{\ell}^t - \frac{1}{2} (\eta \hat{\ell}^t)^2] } \right) & \\ &\geq \frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}{[ \eta \hat{\ell}^t - \frac{1}{2} (\eta \hat{\ell}^t)^2] } & \log(1-x)\leq -x \\ &= \mathop{\operatorname{\mathbb{E}}}{[ \hat{\ell}^t - \eta \frac{1}{2} ( \hat{\ell}^t)^2] } & \\ &= \sum_{i=1}^{N}{p_i^t} \hat{\ell}_i^t - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} ( \hat{\ell}_i^t)^2 & \\ \end{aligned}\]All the way we set up \(\hat{\ell}_i^t\) is to make it unbiased, so we can take expectation (on arms we pull) at time t)
\[\begin{aligned} \mathop{\operatorname{\mathbb{E}}}_{I_t \sim \boldsymbol{p}^t}{[\Phi_{t+1} - \Phi_{t} \vert I_{1}, I_{2}, \ldots, I_{t-1}]} &\geq \mathop{\operatorname{\mathbb{E}}} [\sum_{i=1}^{N}{p_i^t} \hat{\ell}_i^t - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} ( \hat{\ell}_i^t)^2] \\ &= \sum_{i=1}^{N}{p_i^t} \mathop{\operatorname{\mathbb{E}}}[\hat{\ell}_i^t] - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} \mathop{\operatorname{\mathbb{E}}}[(\hat{\ell}_i^t)^2] \\ &= \sum_{i=1}^{N}{p_i^t} \ell_i^t - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} \frac{ (\ell_i^t)^2 }{p_i^t} \\ &= \boldsymbol{p}^t \cdot \boldsymbol{\ell}^t - \eta \frac{1}{2} \sum_{i=1}^{N}{ (\ell_i^t)^2 } \\ &\geq \boldsymbol{p}^t \cdot \boldsymbol{\ell}^t - \eta \frac{1}{2} N \\ \end{aligned}\]Now, given all the real losses up to time \(T\), that is \(\ell^1, \ldots, \ell^T\), the EXP3 algorithm generates a sequence of actions \(i_1, \ldots, i_T\).
Once the real losses are given, each sequence of actions is generated with a specific probability. Think of it as a tree. So
\[\begin{aligned} & \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{[\Phi_{T+1} - \Phi_1 \vert (i_1,\ldots,i_T)]} \\ &= \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{[\Phi_{T+1} - \Phi_{T} + \Phi_T - \Phi_1 \vert (i_1,\ldots,i_T)]}\\ &= \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{[(\Phi_{T+1} - \Phi_{T} \vert (i_1,\ldots,i_T))]} + \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{[\Phi_T - \Phi_1 \vert (i_1,\ldots,i_T)]}\\ &= \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_{T-1}) \in \{1\ldots N\}^{T-1}}{\left[\mathop{\operatorname{\mathbb{E}}}_{i_T}[(\Phi_{T+1} - \Phi_{T} \vert (i_1,\ldots,i_{T-1}))]\right]} + \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_{T-1}) \in \{1\ldots N\}^{T-1}}{[\Phi_T - \Phi_1 \vert (i_1,\ldots,i_{T-1})]}\\ &= \text{recursive} \\ &\geq \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_{T-1}) \in \{1\ldots N\}^{T-1}}{[(\boldsymbol{p}^T \cdot \boldsymbol{\ell}^T - \eta \frac{1}{2} N)]} + \text{...omitted} \\ &= \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_{T}) \in \{1\ldots N\}^{T}}{\left[\sum_{t=1}^{T}{\ell^t_{i^t}} - \eta \frac{1}{2} N\right]} \\ &= - \eta \frac{NT}{2} + \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_{T}) \in \{1\ldots N\}^{T}}{\left[\sum_{t=1}^{T}{\ell^t_{i^t}}\right]} \\ \end{aligned}\]Moreover, We have
\[\mathop{\operatorname{\mathbb{E}}}{ \left[ \Phi_{T+1} - \Phi_1 \right] } \leq \sum_{t=1}^{T}{\ell_i^t} + \frac{\log N}{\eta} \,\,\,\, \text{ for all } i = 1 \ldots N\]Why: the second term is because, by definition, \(\Phi_1 = -\frac{1}{\eta} \log N\). The first term is because
\[\begin{aligned} \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{\Phi_{T+1}} &= \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{-\frac{1}{\eta} \log { \left( \sum_{i=1}^{N}{w_i^T} \right) }} & \text{ random var is } w_i^T \\ &\leq -\frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{ \log { \left( \sum_{i=i^*}^{i^*}{w_i^T} \right) }} & \\ &= -\frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{ \log { \left( {w_{i^*}^{T-1} \exp (- \eta \hat{\ell}^T_{i^*})} \right) }} & \\ &= -\frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{ (- \eta \hat{\ell}^T_{i^*}) + \log { \left( {w_{i^*}^{T-1} } \right) }} & \\ &= {\ell}^T_{i^*} - \frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_T) \in \{1\ldots N\}^T}{ \log { \left( {w_{i^*}^{T-1} } \right) }} & \\ &= \text{ recursive} \\ &= \sum_{t=1}^{T}{ {\ell}^t_{i^*}} \end{aligned}\]To combine
\[\mathop{\operatorname{\mathbb{E}}}{ \left[ \mathop{\operatorname{Regret}_T} \right] } = \mathop{\operatorname{\mathbb{E}}}_{(i_1,\ldots,i_{T}) \in \{1\ldots N\}^{T}}{\left[\sum_{t=1}^{T}{\ell^t_{i_t}}\right]} - \sum_{t=1}^{T}{\ell_{i^*}^t} \leq \eta \frac{NT}{2} + \frac{\log N}{\eta}\]UCB1 algorithm
For stochastic settings.
Settings:
The best arm to choose is the $i^*$ that maximize reward $ {\displaystyle u_i = u(i) = \mathop{\operatorname{\mathbb{E}}}_{X \sim \mathcal{D}_i}{[X]} } $, a.k.a, the best arm in expectation.
Define regret
\[\mathop{\operatorname{Regret}}_T = Tu(i^*) - \sum_{t=1}^{T}{X^{(t)}_{i_t}}\]which is the reward we lost by not following the best policy.
Example: So let’s take the action with the highest average reward directly.
Now average reward of action 1 will never drop to 0, so we’ll never play action 2
UCB1 intuition
The idea is like this:
Without loss of generality, suppose that #1 is the best arm.
As we explore, by Hoeffding's inequality, with very high probability the estimated mean is close to the true expected value.
Suppose we choose a tuning parameter $ε$. At time $t$, suppose the algorithm chooses arm $i_t ≠ 1$; then consider the following events: the best arm is badly underestimated, or arm $i_t$ is badly overestimated.
These events become less and less likely as we play. So we want to bound two things:
There is one technical issue: when applying Hoeffding's inequality, the number of samples $N$ keeps changing (see below).
Suppose at time $t$ the algorithm chooses $i_t$, and $i_t$ has been chosen $ N^{(t)}_{i_t}$ times; by Hoeffding's inequality:
\[\mathop{\operatorname{Pr}}(\hat{u_j} < u_j - ε ) < e^{-2N_jε^2} \\ \mathop{\operatorname{Pr}}(\hat{u_j} > u_j + ε ) < e^{-2N_jε^2}\]I have not thought through what happens if we optimize over $\epsilon$ directly; instead, a change of variable is used.
Algorithm (somehow different from the original)
let $\delta = e^{-2N_jε^2}$
\[\mathop{\operatorname{Pr}}(\hat{u_j} < u_j - \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}} ) < \delta \\ \mathop{\operatorname{Pr}}(\hat{u_j} > u_j + \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}} ) < \delta\]Now redefine event (smaller, larger)
\[\mathop{\operatorname{Pr}}(S_j: u_j < \hat{u_j} - \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}}) < \delta \\ \mathop{\operatorname{Pr}}(L_j: u_j > \hat{u_j} + \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}}) < \delta\]for time $ t $ from $ 1 $ to $ T $,
First consider the case where the event $ L^{(t)}_1 ∨ S^{(t)}_{i_t}$ occurs.
This means that we badly underestimate the best arm $u_1$ or badly overestimate arm $i_t$; in such a step we suffer a loss of at most 1 (the loss is defined to be in $0\leq \text{loss} \leq 1$). So the total expected loss from such steps is at most $2T\delta$.
Now consider the steps where the event does not occur.
How should we pick?
Suppose our algorithm chooses $i_t$ at time $t$:
\[u_1 < \hat{u_1} + \sqrt{\frac{\log \frac{1}{\delta}}{2N_1}} < \hat{u_{i_t}} + \sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}} < u_{i_t} + 2\sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}}\]Explanation: the first inequality holds because \(L^{(t)}_1\) does not occur; the second because the algorithm picks the arm with the largest optimistic estimate \(\hat{u}_j + \sqrt{\log(1/\delta)/(2N_j)}\); the third because \(S^{(t)}_{i_t}\) does not occur.
Now we have
\[Δ_{i_t} = u_1 - u_{i_t} < 2\sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}}\]This is nice, since when we pick, we always pick the one with highest optimistic value, but as $N_{i_t}$ goes up, confidence interval of $i_t$ shrinks. So we will not choose $i_t$ after certain steps.
\[Δ_{i_t} = u_1 - u_{i_t} < 2\sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}} \\ \implies \left( \frac{Δ_{i_t}}{2} \right)^2 < \frac{\log \frac{1}{\delta}}{2N_{i_t}} \\ \implies N_{i_t} < \frac{2\log \frac{1}{\delta}}{Δ_{i_t}^2} \\\]The maximum loss possible of arm k will be:
\[\sum_{i=1}^{\lfloor {2\log \frac{1}{\delta}} / {Δ_k^2} \rfloor}{2\sqrt{\frac{\log \frac{1}{\delta}}{2}} \sqrt{\frac{1}{i}}}\]Using calculus we know that
\[\sum_{i=1}^{n} \sqrt{\frac{1}{i}} \leq 1 + \int_{1}^{n}{\sqrt{\frac{1}{x}} \mathop{\operatorname{dx}}} \leq 1 + 2\sqrt{n}\]So,
\[\sum_{i=1}^{\lfloor {2\log \frac{1}{\delta}} / {Δ_k^2} \rfloor}{2\sqrt{\frac{\log \frac{1}{\delta}}{2}} \sqrt{\frac{1}{i}}} < 2\sqrt{\frac{\log \frac{1}{\delta}}{2}}{\left(1+2\sqrt{\lfloor {2\log \frac{1}{\delta}} / {Δ_k^2} \rfloor}\right)} \approx 4 \log(1/\delta) / Δ_k\]So, the total regret expected is
\[\mathop{\operatorname{\mathbb{E}}}[\mathop{\operatorname{Regret}_T}] \lessapprox K + 2Tδ + \sum_{k=2}^{K}{4\log(1/δ)/Δ_k}\]Setting $δ = 1/T$, we get something like $\sum_{k=2}^{K}{O(\log(T))/Δ_k}$
My teacher said that factor $1/Δ_k$ is inevitable somehow…, by some theory…
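A sketch of the UCB1 variant described in these notes (optimistic index \(\hat{u}_j + \sqrt{\log(1/\delta)/(2N_j)}\) with \(\delta = 1/T\)); Bernoulli rewards and the particular means are assumptions for the demo.

```python
import numpy as np

def ucb1(means, T, seed=0):
    """Play the arm with the highest optimistic estimate; returns the regret against the best arm."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    reward = 0.0
    log_inv_delta = np.log(T)                                # delta = 1/T
    for t in range(T):
        if t < K:
            j = t                                            # play each arm once first
        else:
            ucb = sums / counts + np.sqrt(log_inv_delta / (2 * counts))
            j = int(np.argmax(ucb))
        r = float(rng.uniform() < means[j])                  # Bernoulli reward
        counts[j] += 1
        sums[j] += r
        reward += r
    return T * max(means) - reward

print(ucb1([0.3, 0.5, 0.6], T=20000))   # grows like log(T)/Delta_k, much smaller than T
```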
2.1.1 Line
Suppose \(x_1 \neq x_2\) are two points in \(\mathbb{R}^n\). Points of the form
\[y = θx_1 + (1 − θ)x_2\]where \(θ ∈ \mathbb{R}\), form the line passing through \(x_1\) and \(x_2\).
2.1.2 Affine sets
(It is like a subspace that need not pass through the origin.)
A set \(C ⊆ \mathbb{R}^n\) is affine if the line through any two distinct points in \(C\) lies in \(C\)
We refer to a point of the form \(θ_1 x_1 + · · · + θ_k x_k,\text{ where }θ_1 + · · · + θ_k = 1\), as an affine combination of the points \(x_1 , . . . , x_k\)
If \(C\) is an affine set and \(x_0 ∈ C\), then the set
\[V = C − x_0 = \{x − x_0 \vert x ∈ C\}\]is a subspace
The set of all affine combinations of points in some set \(C ⊆ \mathbb{R}^n\) is called the affine hull of \(C\), and denoted \(\operatorname{aff} C\):
\[\operatorname{aff} C = \{ θ_1x_1 + \cdots + θ_kx_k \vert x_1,\ldots,x_k ∈ C, θ_1+\ldots+θ_k = 1 \}\]The affine hull is the smallest affine set that contains \(C\)
2.1.3 Affine dimension and relative interior
We define the affine dimension of a set \(C\) as the dimension of its affine hull
We define the relative interior of the set \(C\), denoted \(\operatorname{relint} C\) as its interior relative to \(\operatorname{aff} C\):
\[\operatorname{relint} C = \{x ∈ C \vert B(x, r) ∩ \operatorname{aff} C ⊆ C \text{ for some }r > 0\}\]where \(B(x, r) = \{y \vert \|y − x\| ≤ r\}\), the ball of radius \(r\) and center \(x\) in the norm \(\| \cdot \|\)
We can then define the relative boundary of a set \(C\) as \(\operatorname{cl} C \setminus \operatorname{relint} C\), where \(\operatorname{cl} C\) is the closure of C.
2.1.4 Convex sets
A set \(C\) is convex iff
\[θx_1 + (1 − θ)x_2 ∈ C,\,\,\forall θ∈[0,1], x_1,x_2∈C\]Roughly speaking, a set is convex if every point in the set can be seen by every other point, along an unobstructed straight path between them, where unobstructed means lying in the set. Every affine set is also convex, since it contains the entire line between any two distinct points in it, and therefore also the line segment between the points.
We call a point of the form \(θ_1x_1 + \cdots + θ_kx_k\), where \(θ_1 + \cdots + θ_k = 1\) and \(θ_i ≥ 0\), a convex combination of the points \(x_1, \ldots, x_k\). As with affine sets, it can be shown that a set is convex if and only if it contains every convex combination of its points. A convex combination of points can be thought of as a mixture or weighted average of the points, with \(θ_i\) the fraction of \(x_i\) in the mixture.
The convex hull of a set \(C\), denoted \(\operatorname{conv} C\), is the set of all convex combinations of points in \(C\):
\[\operatorname{conv} C = \{θ_1x_1 + \ldots + θ_kx_k \vert x_i ∈ C,\, θ_i ≥ 0\, θ_1 + \cdots + θ_k = 1\}\]As the name suggests, the convex hull \(\operatorname{conv} C\) is always convex. It is the smallest convex set that contains \(C\)
2.1.5 Cones
A set \(C\) is called a cone, or nonnegative homogeneous, if for every \(x ∈ C\) and \(θ ≥ 0\), we have \(θx ∈ C\). A set \(C\) is a convex cone if it is convex and a cone, which means that for any \(x_1, x_2 ∈ C\) and \(θ_1, θ_2 ≥ 0\), we have
\[θ_1x_1 + θ_2x_2 ∈ C.\]A point of the form \(θ_1x_1 + \cdots + θ_kx_k\) with \(θ_1, \ldots, θ_k ≥ 0\) is called a conic combination (or a nonnegative linear combination) of \(x_1, \ldots, x_k\). If \(x_i\) are in a convex cone \(C\), then every conic combination of \(x_i\) is in \(C\). Conversely, a set \(C\) is a convex cone if and only if it contains all conic combinations of its elements.
The conic hull of a set \(C\) is the set of all conic combinations of points in \(C\), i.e.,
\[\{ θ_1x_1 + \cdots + θ_kx_k \vert x_i ∈ C,\, θ_i ≥ 0,\, i = 1, \ldots , k\}\]which is also the smallest convex cone that contains \(C\).
3.1 Basic properties and examples
3.1.1 Definition
A function \(f: \mathbb{R}^n \mapsto \mathbb{R}\) is convex if \(\operatorname{dom} f\) is a convex set and
\[\forall x,y ∈ \operatorname{dom} f,\,\, 0 ≤ θ ≤ 1,\,\, f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)\]We say \(f\) is concave if \(−f\) is convex, and strictly concave if \(−f\) is strictly convex.
A function is convex if and only if it is convex when restricted to any line that intersects its domain. In other words \(f\) is convex if and only if for all \(x ∈ \operatorname{dom} f\) and all \(v\), the function \(g(t) = f (x + tv)\) is convex (on its domain, \(\{t | x + tv ∈ \operatorname{dom} f \}\)). This property is very useful, since it allows us to check whether a function is convex by restricting it to a line.
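A crude numerical version of this restriction-to-a-line check (my own sketch, testing midpoint convexity of \(g(t) = f(x + tv)\) on random lines; it can only reject convexity, never prove it).

```python
import numpy as np

def looks_convex_on_lines(f, dim, n_lines=200, n_pts=50, seed=0):
    """Check midpoint convexity of g(t) = f(x + t v) on random lines."""
    rng = np.random.default_rng(seed)
    for _ in range(n_lines):
        x = rng.normal(size=dim)
        v = rng.normal(size=dim)
        t = np.sort(rng.uniform(-1, 1, size=n_pts))
        g = np.array([f(x + ti * v) for ti in t])
        mid = np.array([f(x + 0.5 * (t[i] + t[i + 1]) * v) for i in range(n_pts - 1)])
        if np.any(mid > 0.5 * (g[:-1] + g[1:]) + 1e-9):      # violates midpoint convexity
            return False
    return True

print(looks_convex_on_lines(lambda z: np.sum(z ** 2), dim=3))        # True: ||z||^2 is convex
print(looks_convex_on_lines(lambda z: np.sum(np.sin(z)), dim=3))     # False: not convex
```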
3.1.2 Extended-value extensions
It is often convenient to extend a convex function to all of \(\mathbb{R}^n\) by defining its value \(f\) to be \(\infty\) outside its domain.
If f is convex we define its extended-value extension \(\widetilde{f} : \mathbb{R}^n \mapsto \mathbb{R} ∪ \{\infty\}\) by
\[\widetilde{f}(x) = \begin{cases} f(x) & x \in \operatorname{dom} f.\\ \infty & x \not\in \operatorname{dom} f. \end{cases}\]3.1.3 First-order conditions
Suppose \(f\) is differentiable (i.e., its gradient \(∇f\) exists at each point in \(\operatorname{dom} f\), which is open). Then \(f\) is convex if and only if \(\operatorname{dom} f\) is convex and \(f(y)≥f(x) + ∇f(x)^T(y − x)\) holds for all \(x, y ∈ \operatorname{dom} f\)
Strict convexity can also be characterized by a first-order condition: \(f\) is strictly convex if and only if \(\operatorname{dom} f\) is convex and for \(x, y ∈ \operatorname{dom} f, x \neq y\), we have
\[f (y) > f (x) + ∇f(x)^T (y − x).\]3.1.4 Second-order conditions
We now assume that \(f\) is twice differentiable, that is, its Hessian or second derivative \(∇^2f\) exists at each point in \(\operatorname{dom} f\), which is open. Then \(f\) is convex if and only if \(\operatorname{dom} f\) is convex and its Hessian is positive semidefinite: for all \(x ∈ \operatorname{dom} f\)
\[∇^2 f(x) \succeq 0.\]Strict convexity can be partially characterized by second-order conditions. If \(∇^2f (x) \succ 0\) for all \(x ∈ \operatorname{dom} f\), then \(f\) is strictly convex. The converse, however, is not true: for example, the function \(f : \mathbb{R} \mapsto \mathbb{R}\) given by \(f(x) = x^4\) is strictly convex but has zero second derivative at \(x = 0\).
Remark 3.1 The separate requirement that \(\operatorname{dom} f\) be convex cannot be dropped from the first- or second-order characterizations of convexity and concavity. For example, the function \(f(x) = \frac{1}{x^2}\), with \(\operatorname{dom} f = \{x ∈ \mathbb{R} \vert x \neq 0 \}\), satisfies \(f''(x) > 0\) for all \(x ∈ \operatorname{dom} f\), but is not a convex function.
3.1.6 Sublevel sets
The \(α\)-sublevel set of a function \(f : \mathbb{R}^n \mapsto R\) is defined as
\(C_α = \{x ∈ \operatorname{dom} f \vert f(x) ≤ α \}\).
3.1.7 Epigraph
The epigraph of a function \(f : \mathbb{R}^n \mapsto R\) is defined as
\[\operatorname{epi} f = \{(x, t) \vert x ∈ \operatorname{dom} f, f(x) ≤ t \}\]3.1.8 Jensen’s inequality and extensions
3.1.9 Inequalities
4.5 Properties of complex numbers
4.6 Definition of a polynomial
Recall that a function \(p : \mathbb{F} → \mathbb{F}\) is called a polynomial with coefficients in \(\mathbb{F}\) if there exist \(a_0, \ldots, a_m \in \mathbb{F}\) such that
\[p(z) = a_0 + a_1z + a_2z^2 + \cdots + a_mz^m\]If \(p,s \in \mathcal{P}(\mathbb{F})\) and \(s \neq 0\), then there exist \(q,r \in \mathcal{P}(\mathbb{F})\) such that
\[p = sq + r \text{ and } \deg r < \deg s\]Proof: let \(n = \deg p\), \(m = \deg s\). If \(n < m\), simply take \(q = 0\); otherwise:
Define \(T : \mathcal{P}_{n-m}(\mathbb{F}) \times \mathcal{P}_{m-1}(\mathbb{F}) → \mathcal{P}_n(\mathbb{F})\) by:
\[T(q,r) = sq + r\]Plainly speaking, this is \(p / s = q\) with remainder \(r\), where the remainder \(r\) has smaller degree than the divisor \(s\).
\(T\) is clearly a linear map. By its definition, if \(q \neq 0\) then \(\deg sq \geq m\) while \(\deg r \leq m-1\), so \(T(q,r) = 0\) implies \(q = 0,\, r = 0\). Hence \(\text{null }T\) has dimension 0; since the domain and codomain have equal dimension, \(T\) is surjective, so the preimage exists and is unique.
4.9 Definition: zeros of a polynomial
A number \(\lambda \in \mathbb{F}\) is called a zero (or root) of a polynomial \(p \in \mathcal{P}(\mathbb{F})\) if \(p(\lambda) = 0\).
4.10 Definition: factor
A polynomial \(s \in \mathcal{P}(\mathbb{F})\) is called a factor of \(p \in \mathcal{P}(\mathbb{F})\) if there exists a polynomial \(q \in \mathcal{P}(\mathbb{F})\) such that \(p = sq\).
4.11
In short, if \(\lambda\) is a root of \(p\), then
\[p(z) = (z - \lambda)q(z)\]holds.
Proof omitted.
4.13 Fundamental Theorem of Algebra
Every nonconstant polynomial with complex coefficients has a zero
Surprisingly, the book proves this with Liouville's theorem, which I don't follow; see the more elementary proof that I found elsewhere.
4.14 Unique factorization of polynomials over \(\mathbb{C}\)
\[p(z) = c(z-\lambda_1)\cdots(z-\lambda_m)\]4.15 Polynomials with real coefficients have zeros in pairs
The complex conjugate of any root of a polynomial with real coefficients is also a root.
4.16 Quadratic formula for real quadratic polynomials
4.17 Unique factorization of polynomials over \(\mathbb{R}\)
\[p(x) = c(x-\lambda_1)\cdots(x-\lambda_m)(x^2+b_1x+c_1)\cdots(x^2+b_Mx+c_M)\]If \(p(x)\) has a non-real complex root, then
\[\begin{aligned} p(x) &= (x-\lambda)(x-\overline{\lambda})q(x) \\ &= (x^2-2(\operatorname{Re } \lambda)x + \left\vert \lambda \right\vert ^2) q(x) \end{aligned}\]If we can show that \(q(x)\) also has real coefficients, we can recurse until only real roots are left, and then apply 4.14.
By 4.11, \(q(z)\) exists. For every real \(x\), \((x-\lambda)(x-\overline{\lambda})\) is a nonzero real number, so \(q(x)\) is real for all real \(x\); by 4.7, the coefficients of \(q(x)\) are therefore all real.
The chapter introduction is well written: to study operators in \(\mathcal{L}(V) = \mathcal{L}(V,V)\), if some subspace is mapped into the same subspace by the operator, such a subspace deserves a name of its own.
5.1 Notation \(\mathbb{F},\, V\)
5.2 Definition: invariant subspace
If \(T \in \mathcal{L}(V)\) and \(U\) is a subspace of \(V\) with \(T(U) \subset U\), then \(U\) is called an invariant subspace under \(T\).
Another way to write this: \(T|_U \in \mathcal{L}(U)\).
5.5 Definition: eigenvalue (also called characteristic value)
Suppose \(T \in \mathcal{L}(V)\). A number \(\lambda \in \mathbb{F}\) is called an eigenvalue of \(T\) if there exists \(v \in V\) such that \(v \neq 0\) and \(Tv = \lambda v\).
5.6 In the finite-dimensional case,
5.7 Definition: eigenvector
5.10 Linearly independent eigenvectors
If \(T \in \mathcal{L}(V)\) has distinct eigenvalues \(\lambda_i\) with corresponding eigenvectors \(v_i\), then the \(v_i\) are linearly independent.
Proof in the book; note that no restriction on the dimension seems to be needed here.
5.13 Number of eigenvalues
Suppose V is finite-dimensional. Then each operator on V has at most dim V distinct eigenvalues.
Follows immediately from 5.10.
5.14 Definition: \(T \vert _U\) and \(T / U\)
If \(T \in \mathcal{L}(V)\) and \(U\) is an invariant subspace of \(V\) under \(T\), then
The restriction operator:
\(T \vert_U \in \mathcal{L}(U)\) is defined by
\[T \vert_U (u) = Tu\]for \(u \in U\).
The quotient operator \(T / U \in \mathcal{L}(V/U)\) is defined by
\[(T/U)(v+U) = Tv+U\]for \(v \in V\).
5.16 Definition: \(T^m\)
If \(T \in \mathcal{L}(V)\) and \(m\) is a positive integer, define \(T^m = \underbrace{T \cdots T}_{m \text{ copies}}\).
5.17 Definition: \(p(T)\)
If \(T \in \mathcal{L}(V),\, p \in \mathcal{P}(\mathbb{F})\), and
\[p(z) = a_0 + a_1z + a_2z^2 + \cdots + a_mz^m\]then define
\[p(T) = a_0I + a_1T + a_2T^2 + \cdots + a_mT^m\]5.19 Definition: product of polynomials
If \(p, q \in \mathcal{P}(\mathbb{F})\), then \(pq \in \mathcal{P}(F)\) is the polynomial defined by
\[(pq)(z) = p(z)q(z)\]Note: the coefficients of pq are the convolution of the coefficient sequences of p and q.
5.20
5.21 Operators on complex vector spaces have an eigenvalue
Every operator on a finite-dimensional, nonzero, complex vector space has an eigenvalue.
Proof in the book.
5.26 Equivalent conditions for an upper-triangular matrix
Suppose \(T \in \mathcal{L}(V)\) and \(v_1,\ldots,v_n\) is a basis of V. Then the following three conditions are equivalent.
Fairly obvious; no proof here.
5.27 Over C, every operator has an upper-triangular matrix
Suppose \(V\) is a finite-dimensional complex vector space and \(T \in \mathcal{L}(V)\). Then T has an upper-triangular matrix with respect to some basis of \(V\).
The proof is rather long, so I do not copy it. The idea is an induction: use an eigenvalue to obtain an invariant subspace (so this fails over \(\mathbb{R}\), where an eigenvalue may not exist). Let \(U = \text{range }(T- \lambda I)\); by the induction hypothesis choose a suitable basis of \(U\), and extend it to a basis of \(V\). For each newly added basis vector, \(T v_k = (T - \lambda I) v_k + \lambda v_k\), so \(Tv_k \in \operatorname{span}(u_1,\ldots,u_m,v_1,\ldots,v_k)\); conclude by 5.26.
5.30 An operator with an upper-triangular matrix is invertible iff the diagonal entries are all nonzero
The book's proof is somewhat cumbersome; in fact, when some diagonal entry is 0, one can directly construct a nonzero vector v with \(Tv = 0\).
5.36 Definition: eigenspace, \(E(\lambda, T)\)
Suppose \(T \in \mathcal{L}(V)\) and \(\lambda \in \mathbb{F}\). The eigenspace of \(T\) corresponding to \(\lambda\), denoted by \(E(\lambda, T)\), is defined by
\[E(\lambda, T) = \text{null }(T - \lambda I)\]The eigenspace of \(\lambda\) consists of all eigenvectors corresponding to \(\lambda\), together with 0.
5.38 The sum of eigenspaces corresponding to distinct \(\lambda\) is a direct sum, and the sum of their dimensions is at most the dimension of the whole space
Already proved via 5.10.
5.39 Definition: diagonalizable
An operator \(T \in \mathcal{L}(V)\) is called diagonalizable if the operator has a diagonal matrix with respect to some basis of \(V\).
5.41 Equivalent conditions for diagonalizability
There exist one-dimensional subspaces \(U_1,\ldots,U_n\) of \(V\), each invariant under \(T\), such that
\[V = U_1 \oplus \cdots \oplus U_n\]I skipped the proof.
5.44
If \(T\) has \(\dim V\) distinct eigenvalues, then \(T\) is diagonalizable.
6.1 Notation \(\mathbb{F},\, V\)
6.2 Definition: dot product
For \(\mathbb{R}^n\), the dot product is defined as the sum of the element-wise products.
6.3 Definition: inner product
An inner product on \(V\) is a function of two vectors satisfying the following properties.
6.4 Examples
6.5 Definition: inner product space
An inner product space is a vector space \(V\) along with an inner product on \(V\).
6.6 Notation \(V\)
Note: from now on, \(V\) denotes an inner product space.
6.7 Properties of inner products
Proof omitted.
6.8 Definition: norm, \(\| v \|\)
\[\| v \| = \sqrt { \langle v,v \rangle }\]6.11 Definition: orthogonal
u,v are called orthogonal if \(\langle u,v \rangle = 0\).
6.12 Orthogonality and 0
0 is orthogonal to every vector, and 0 is the only vector that is orthogonal to itself.
6.13 Pythagorean Theorem
If \(\langle u,v \rangle = 0\), then
\[\| u + v \|^2 = \| u \|^2 + \| v \|^2\]Proof: expand the left side using the definition of the norm.
6.14 An orthogonal decomposition
If \(u,v \in V,\, v \neq 0\), set \(c = \frac{\langle u,v \rangle}{\| v \|^2}\) and \(w = u - cv\); then
\[\langle w,v \rangle = 0\]6.15 Cauchy-Schwarz Inequality
\[\left\vert \langle u,v \rangle \right\vert \leq \| u \| \| v \|.\]Proof omitted.
6.17 Example
\(\left\vert \int_{-1}^{1}f(x)g(x)\operatorname{dx} \right\vert^2 \leq \left( \int_{-1}^{1} \left( f(x) \right) ^2 \operatorname{dx} \right) \left( \int_{-1}^{1} \left( g(x) \right) ^2 \operatorname{dx} \right)\).
6.18 Triangle inequality
\[\| u+v \| \leq \| u \| + \| v \|\]6.22 Parallelogram Equality
\[\| u+v \|^2 + \| u-v \|^2 = 2(\|u\|^2 + \|v\|^2)\]6.23 Definition: orthonormal
That is, pairwise orthogonal vectors, each of norm 1.
6.25
If \(e_1,\ldots,e_m\) are orthonormal vectors in \(V\), then
\[\| a_1e_1 + \cdots + a_me_m \|^2 = \left\vert a_1 \right\vert^2 + \cdots + \left\vert a_m \right\vert^2\]6.30
If \(e_1,\ldots,e_n\) is an orthonormal basis, then
\[\begin{aligned} v &= \langle v,e_1 \rangle e_1 + \cdots + \langle v,e_n \rangle e_n \\ \| v \|^2 &= \left\vert \langle v,e_1 \rangle \right\vert ^2 + \cdots + \left\vert \langle v,e_n \rangle \right\vert ^2 \end{aligned}\]6.31 Gram-Schmidt Procedure
\[e_j = \frac{v_j - \langle v_j,e_1 \rangle e_1 - \cdots - \langle v_j,e_{j-1} \rangle e_{j-1}}{ \| {v_j - \langle v_j,e_1 \rangle e_1 - \cdots - \langle v_j,e_{j-1} \rangle e_{j-1}} \| }\]6.33 Example
Finds an orthonormal basis of \(1,x,x^2\) under the inner product \(\langle p,q \rangle = \int_{-1}^{1}{p(x)q(x)\operatorname{dx}}\). See the book; a small numerical sketch follows.
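To make 6.31 concrete, here is a minimal numerical sketch in Python/numpy; the input vectors and the helper name `gram_schmidt` are my own illustration, not from the book.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors following 6.31."""
    es = []
    for v in vectors:
        w = v - sum(np.dot(v, e) * e for e in es)  # subtract projections onto previous e_j
        es.append(w / np.linalg.norm(w))           # normalize the remainder
    return es

vs = [np.array([1.0, 1.0, 0.0]), np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])]
es = gram_schmidt(vs)
print(np.round([[np.dot(a, b) for b in es] for a in es], 10))  # prints the identity matrix
```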
6.34 Existence of orthonormal basis
Every finite-dimensional inner product space has an orthonormal basis.
6.35 An orthonormal list of vectors can be extended to an orthonormal basis.
6.37 Upper-triangular matrix with respect to orthonormal basis
Suppose \(T \in \mathcal{L}(V)\). If T has an upper-triangular matrix with respect to some basis of V, then T has an upper-triangular matrix with respect to some orthonormal basis of V.
In other words: if an operator has an upper-triangular matrix with respect to some basis, then it has an upper-triangular matrix with respect to some orthonormal basis.
Proof in the book; I just read it and it is easy to follow.
6.38 Schur’s theorem
On a finite-dimensional complex vector space (\(\mathbb{C}^n\)), every operator has an orthonormal basis with respect to which its matrix is upper triangular. 5.27 says that on \(\mathbb{C}^n\) every operator has some basis giving an upper-triangular matrix; combining this with 6.37 gives the result.
6.39 Definition: linear functional
Exactly the same as the definition in 3.92.
6.42 Riesz Representation Theorem
If \(V\) is finite-dimensional and \(\varphi \in \mathcal{L}(V, \mathbb{F})\) is a linear functional, then there exists a unique \(u \in V\) such that
\[\varphi(v) = \langle v,u \rangle\]I had worked this out on my own before; the proof is similar, using an orthonormal basis.
6.44 Example: Find \(u \in \mathcal{P}_2(\mathbb{R})\) such that
\[\int_{-1}^{1}{p(t) \left( \cos(\pi t) \right) \operatorname{dt}} = \int_{-1}^{1}{p(t)u(t) \operatorname{dt}}\]Integration from −1 to 1 against \(\cos(\pi t)\) is a linear functional on \(\mathcal{P}_2(\mathbb{R})\); the problem is: given \(\varphi(p),\, p \in \mathcal{P}_2(\mathbb{R})\), find \(u\) such that \(\varphi(p) = \langle p,u \rangle\).
6.45 Definition: orthogonal complement
\[U^\perp = \left\{ v \in V : \langle v,u \rangle = 0 \text{ for every } u \in U \right\}\]6.46 Properties of the orthogonal complement
6.47
If \(U\) is a finite-dimensional subspace of \(V\), then
\[V = U \oplus U^\perp\]Note: U is finite-dimensional; there is no restriction on V.
6.50 If \(V\) is finite-dimensional and \(U\) is a subspace of \(V\), then
\[\text{dim }U^\perp = \text{dim }V - \text{dim }U\]6.51
\[U = \left( U^\perp \right) ^\perp\]The proof is a bit messy; I have not read it yet.
6.53 Definition: orthogonal projection, \(P_U\)
If \(U\) is a finite-dimensional subspace of \(V\), the orthogonal projection of \(V\) onto \(U\), \(P_U \in \mathcal{L}(V)\), is:
\[\text{ For } v \in V, \text{ write } v = u+w, \text{ where }u \in U \text{ and } w \in U^\perp \text{. Then } P_Uv = u\]6.56 Minimizing the distance to a subspace
If \(U\) is a finite-dimensional subspace of \(V\), \(v \in V\), and \(u \in U\), then
\[\| v -P_Uv \| \leq \| v - u \|\]7.1 Notation
7.2 Definition: adjoint, \(T^*\)
Suppose \(T \in \mathcal{L}(V,W)\). The adjoint of T is the function \(T^* : W → V\) such that
\[\langle Tv,w \rangle = \langle v,T^*w \rangle\]7.5 The adjoint is a linear map
If \(T \in \mathcal{L}(V,W)\), then \(T^* \in \mathcal{L}(W,V)\).
7.6 Properties of the adjoint
Both 7.5 and 7.6 are proved using 6.3 and 6.7.
7.7 Null space and range of T^*
If \(T \in \mathcal{L}(V,W)\), then
7.10 The matrix of \(T^*\)
If \(T \in \mathcal{L}(V,W)\), \(e_1,\ldots,e_n\) is an orthonormal basis of \(V\), and \(f_1,\ldots,f_m\) is an orthonormal basis of \(W\), then
\[\mathcal{M} \left( T^*, (f_1,\ldots,f_m), (e_1,\ldots,e_n) \right)\]is the conjugate transpose of
\[\mathcal{M} \left( T, (e_1,\ldots,e_n), (f_1,\ldots,f_m) \right)\]
The proof follows directly from the definitions.
7.11 Definition: self-adjoint (some call it Hermitian)
An operator \(T \in \mathcal{L}(V)\) is called self-adjoint if \(T = T^*\)
7.13 Eigenvalues of self-adjoint operators are real
Every eigenvalue of a self-adjoint operator is real.
\[\lambda \| v \| ^2 = \langle \lambda v,v \rangle = \langle Tv,v \rangle = \langle v,Tv \rangle = \langle v,\lambda v \rangle = \overline{\lambda} \| v \|^2\]7.14 Over \(\mathbb{C}\), \(Tv\) is orthogonal to \(v\) for all \(v\) only for the \(\boldsymbol{0}\) operator
If \(V\) is a complex inner product space and \(T \in \mathcal{L}(V)\), then
\[\langle Tv,v \rangle = 0 \text{ for all } v \in V \implies T = 0\]7.15 Over \(\mathbb{C}\), \(\langle Tv,v \rangle\) is real for all v only for self-adjoint operators
If \(V\) is a complex inner product space and \(T \in \mathcal{L}(V)\), then
\[T \text{ is self-adjoint} \iff \langle Tv,v \rangle \in \mathbb{R} \text{ for all } v \in V\]Proof:
\[\langle Tv,v \rangle - \overline{\langle Tv,v \rangle} = \langle Tv,v \rangle - \langle v,Tv \rangle = \langle Tv,v \rangle - \langle T^*v,v \rangle = \langle \left( T-T^* \right) v,v \rangle\]7.16
If \(V\) is a real inner product space and \(T \in \mathcal{L}(V)\) is self-adjoint, then
\[\langle Tv,v \rangle = 0 \text{ for all } v \in V \implies T = 0\]7.18 Definition: normal operator
\(T \in \mathcal{L}(V)\) is normal if
\[TT^* = T^*T\]7.20
\[T \text{ is normal } \iff \|Tv\| = \|T^*v\| \text{ for all } v \in V\]Proof:
\[\begin{aligned} T \text{ is normal} & \iff T^*T-TT^* = 0 \\ & \iff \langle \left( T^*T-TT^* \right) v,v \rangle = 0 \text{ for all } v \\ & \iff \langle T^*Tv,v \rangle = \langle TT^*v,v \rangle \\ & \iff \|Tv\|^2 = \|T^*v\|^2 && \text{(by the definition of } T^* \text{)} \end{aligned}\]7.24 Complex Spectral Theorem
If \(\mathbb{F} = \mathbb{C}\) and \(T \in \mathcal{L}(V)\), then the following conditions are equivalent.
Proof in the book.
7.26 Invertible quadratic expressions
If \(T \in \mathcal{L}(V)\) is self-adjoint and \(b,c \in \mathbb{R}\) with \(b^2 < 4c\), then
\(T^2 + bT + cI\) is invertible.
Proof to be copied when I have time.
7.27 Self-adjoint operators have eigenvalues
Suppose \(V \neq \{ 0 \}\) and \(T \in \mathcal{L}(V)\) is a self-adjoint operator. Then \(T\) has an eigenvalue.
Proof in the book; to be copied later.
7.28 Self-adjoint operators and invariant subspaces
Suppose \(T \in \mathcal{L}(V)\) is self-adjoint and \(U\) is a subspace of \(V\) that is invariant under \(T\). Then
The proofs are getting less and less intuitive; see the book.
7.29 Real Spectral Theorem
If \(\mathbb{F} = \mathbb{R}\) and \(T \in \mathcal{L}(V)\), then the following conditions are equivalent.
Proof in the book; I should work through it carefully when I have time.
7.31 定义 positive operator
An operator \(T \in \mathcal{L}(V)\) is called positive if \(T\) is self-adjoint and
\[\langle Tv,v \rangle \geq 0 \text{ for all } v \in V\]7.33 Definition: square root
R is a square root of T if \(R^2 = T\)
7.35 Characterizations of positive operators
The following conditions are equivalent.
7.37 Definition: isometry
An operator \(S \in \mathcal{L}(V)\) is called an isometry if for all \(v \in V\)
\[\| Sv \| = \| v \|\]That is, S preserves norms.
7.42 Properties of isometries
If \(S \in \mathcal{L}(V)\), the following conditions are equivalent.
7.44 Notation \(\sqrt{T}\)
If T is a positive operator (positive semidefinite), then \(\sqrt{T}\) denotes the unique positive square root of \(T\).
7.45 Polar Decomposition
Suppose \(T \in \mathcal{L}(V)\). Then there exists an isometry \(S \in \mathcal{L}(V)\) such that
\[T = S \sqrt{ T^*T }\]Note: when this is written with matrices, S and T are not necessarily expressed with respect to the same basis.
Proof to be read later.
7.49 Definition: singular values
Suppose \(T \in \mathcal{L}(V)\). The singular values of T are the eigenvalues of \(\sqrt{T^*T}\), with each eigenvalue \(\lambda\) repeated \(\text{dim }E(\lambda, \sqrt{T^*T})\) times.
7.51 Singular Value Decomposition: suppose \(T \in \mathcal{L}(V)\) has singular values \(s_1,\ldots,s_n\). Then there exist orthonormal bases \(e_1,\ldots,e_n\) and \(f_1,\ldots,f_n\) of \(V\) such that for all \(v \in V\)
\[T v = s_1 \langle v,e_1 \rangle f_1 + \cdots + s_n \langle v,e_n \rangle f_n\]Proof in the book.
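As a quick numerical sanity check of 7.49/7.51 (my own example matrix, assuming the standard basis of \(\mathbb{R}^2\) is orthonormal), the singular values reported by numpy's SVD agree with the eigenvalues of \(\sqrt{T^*T}\):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])                               # matrix of T in an orthonormal basis
sing_from_svd = np.linalg.svd(A, compute_uv=False)       # singular values of A
eig_of_sqrt = np.sqrt(np.linalg.eigvalsh(A.T @ A))       # eigenvalues of sqrt(A^T A)
print(np.sort(sing_from_svd), np.sort(eig_of_sqrt))      # the two lists coincide
```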
A.0 Notation
We will denote by \(\mathbb{H}\) a vector space whose dimension may be infinite.
A.1 Definition Norms
A mapping \(\| \cdot \| : \mathbb{H} \mapsto \mathbb{R}^+\) is said to define a norm if:
My comment: according to Wikipedia. \(\mathbb{H}\) must be a vector space over a subfield of \(\mathbb{C}\). This is required for \(\left\vert \alpha \right\vert\) to make sense ( or somehow \(\mathbb{H}\) is endowed with an absolute value ) . In homogeneity it should also be \(\alpha \in \mathbb{F}\) where \(\mathbb{F}\) is a subfield of \(\mathbb{C}\).
p-norm
for \(p \geq 1\), the p-norm of \(\textbf{x} \in \mathbb{C}^n\) is defined as
\[\| \textbf{x} \|_p = \left( \sum_{j=1}^{n} \left\vert x_j \right\vert ^p \right) ^{1/p}\]
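A tiny check of this definition against numpy's built-in norm; the vector is my own example.

```python
import numpy as np

x = np.array([3.0, -4.0, 12.0])
for p in (1, 2, 5):
    manual = np.sum(np.abs(x) ** p) ** (1.0 / p)   # (sum |x_j|^p)^(1/p)
    print(p, manual, np.linalg.norm(x, ord=p))     # both values agree
```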
Equivalence of norms
Two norms \(\| \cdot \|\) and \(\| \cdot \|'\) are said to be equivalent iff there exist real numbers \(\alpha, \beta > 0\) such that for all \(\textbf{x}\)
\[\alpha \| \textbf{x} \| \leq \| \textbf{x} \|' \leq \beta \| \textbf{x} \|\]More generally, all norms on a finite-dimensional space are equivalent.
My comment: to be specific, all norms on a finite-dimensional Banach space are equivalent. Proof (Cached). I could not find a counterexample of non-equivalent norms on a non-complete finite-dimensional space; in fact, over the real or complex scalars every finite-dimensional normed space is automatically complete, so no such counterexample exists.
Dual norms
see intuition of dual norm here(cached)
Dual is a norm on the dual space
Let \(f : V \mapsto \mathbb{F}\),
\[\| f \|_* = \sup_{x\neq 0}{\frac{ \left\vert f(x) \right\vert} {\| x \|}}\]This actually belongs to functional analysis… I don’t know…
B.1 Definition Gradient
Let \(f : \mathcal{X} \subset \mathbb{R}^n \mapsto \mathbb{R}\) be a differentiable function. Then the gradient of \(f\) at \(\textbf{x} \in \mathcal{X}\) is the vector in \(\mathbb{R}^n\) denoted by \(\nabla f(\textbf{x})\) and defined by
\[\nabla f(\textbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(\textbf{x}) \\ \vdots \\ \frac{\partial f}{\partial x_n}(\textbf{x}) \end{bmatrix}\]B.2 Definition Hessian
Let \(f : \mathcal{X} \subset \mathbb{R}^n \mapsto \mathbb{R}\) be a twice differentiable function. Then the Hessian of \(f\) at \(\textbf{x} \in \mathcal{X}\) is the matrix in \(\mathbb{R}^{n \times n}\) denoted by \(\nabla^2 f(\textbf{x})\) and defined by
\[\nabla^2f(\textbf{x}) = \begin{bmatrix} \frac{\partial^2f}{\partial x_i \partial x_j}(\textbf{x}) \end{bmatrix}_{1 \leq i, j \leq n}\]
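A common way to sanity-check these definitions in practice is to compare an analytic gradient against central finite differences; the function f below is a made-up example of mine, not anything from the text.

```python
import numpy as np

def f(x):                                  # example function on R^2
    return x[0] ** 2 + 3.0 * x[0] * x[1]

def grad(x):                               # analytic gradient of f
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

def num_grad(f, x, h=1e-6):                # central differences approximate each partial derivative
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

x0 = np.array([1.0, 2.0])
print(grad(x0), num_grad(f, x0))           # the two vectors agree to ~1e-6
```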
Theorem B.1 Fermat's theorem
Let \(f : \mathcal{X} \subset \mathbb{R}^n \mapsto \mathbb{R}\) be a differentiable function. If \(f\) admits a local extremum at \(\textbf{x}^* \in \mathcal{X}\), then \(\nabla f(\textbf{x}^*) = 0\), that is, \(\textbf{x}^*\) is a stationary point.
B.3 Convex set
A set \(\mathcal{X} \subseteq \mathbb{R}^n\) is said to be convex if for any two points \(\textbf{x}, \textbf{y} \in \mathcal{X}\), the segment \([\textbf{x}, \textbf{y}]\) lies in \(\mathcal{X}\), that is
\[\{ \alpha \textbf{x} + (1-\alpha) \textbf{y} : 0 \leq \alpha \leq 1 \} \subset \mathcal{X}\]TODO: https://xingyuzhou.org/blog/notes/strong-convexity (cached)
TODO
In the language of measure theory, Markov’s inequality states that if \((X, \Sigma, \mu)\) is a measure space, f is a measurable extended real-valued function, and \(\epsilon > 0\), then
\[μ ( \{ x ∈ X : \left\vert f ( x ) \right\vert ≥ ε \} ) ≤ \frac{1}{\epsilon} \int_X { \left\vert f \right\vert \operatorname{d\mu} }.\]If \(\varphi\) is a monotonically increasing nonnegative function for the nonnegative reals, \(X\) is a random variable, \(a ≥ 0\), and \(\varphi(a) > 0\), then
\[P ( \left\vert X \right\vert \geq a ) \leq \frac{\mathbb{E}[\varphi(\left\vert X \right\vert )]}{\varphi(a)}\]An immediate corollary, using higher moments of a nonnegative \(X\), is
\[P (X \geq a) \leq \frac{\mathbb{E}[X^n]}{a^n}\]Chebyshev’s inequality uses the variance to bound the probability that a random variable deviates far from the mean. Specifically:
\[P( \left\vert X - \mathbb{E}[X] \right\vert \geq a ) \leq \frac{\operatorname{Var}(X)}{a^2},\,\, a > 0\]for which Markov’s inequality reads
\[{\displaystyle \operatorname {P} \left((X- \mathbb{E}[X])^{2}\geq a^{2}\right)\leq {\frac {\operatorname {Var} (X)}{a^{2}}},}\]
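A quick Monte Carlo illustration of Markov's and Chebyshev's inequalities; the exponential distribution and the threshold below are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)       # nonnegative, E[X] = 1, Var(X) = 1
a = 3.0
print(np.mean(X >= a), 1.0 / a)                       # Markov:    P(X >= a)         <= E[X]/a
print(np.mean(np.abs(X - 1.0) >= a), 1.0 / a ** 2)    # Chebyshev: P(|X - E X| >= a) <= Var(X)/a^2
```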
Let \(X\) be any real-valued random variable with expected value \({\displaystyle \mathbb {E} (X)=0}\) and such that \({\displaystyle a\leq X\leq b}\) almost surely. Then, for all \({\displaystyle \lambda \in \mathbb {R} }\),
\[{\displaystyle \mathbb {E} \left[e^{\lambda X}\right]\leq \exp \left({\frac {\lambda ^{2}(b-a)^{2}}{8}}\right).}\]Note that because of the assumption that the random variable \(X\) has zero expectation, the \(a\) and \(b\) in the lemma must satisfy \(a\leq 0\leq b\).
First note that if one of \(a\) or \(b\) is zero, then \({\displaystyle \textstyle \mathbb {P} \left(X=0\right)=1}\) and the inequality follows. If both are nonzero, then \(a\) must be negative and \(b\) must be positive.
Next, since \({\displaystyle e^{sx}}\) is a convex function on the real line:
\[\forall x \in [a, b]: \,\, e^{sx}\leq \frac{b-x}{b-a}e^{sa}+\frac{x-a}{b-a}e^{sb}.\]Applying \(\mathbb {E}\) to both sides of the above inequality gives us:
\[{\displaystyle {\begin{aligned}\mathbb {E} \left[e^{sX}\right]&\leq {\frac {b-\mathbb {E} [X]}{b-a}}e^{sa}+{\frac {\mathbb {E} [X]-a}{b-a}}e^{sb}\\&={\frac {b}{b-a}}e^{sa}+{\frac {-a}{b-a}}e^{sb}&&\mathbb {E} (X)=0\\&=(1-\theta )e^{sa}+\theta e^{sb}&&\theta =-{\frac {a}{b-a}}>0\\&=e^{sa}\left(1-\theta +\theta e^{s(b-a)}\right)\\&=\left(1-\theta +\theta e^{s(b-a)}\right)e^{-s\theta (b-a)}\\\end{aligned}}}\]Let \(u=s(b-a)\) and define \(\varphi :\mathbb {R} \mapsto \mathbb {R}\) :
\[\varphi (u)=-\theta u+\log \left(1-\theta +\theta e^{u}\right)\]\(\varphi\) is well defined on \(\mathbb{R}\), to see this we calculate:
\[{\displaystyle {\begin{aligned}1-\theta +\theta e^{u}&=\theta \left({\frac {1}{\theta }}-1+e^{u}\right)\\&=\theta \left(-{\frac {b}{a}}+e^{u}\right)\\&>0&&\theta >0,\quad {\frac {b}{a}}<0\end{aligned}}}\]The definition of \(\varphi\) implies
\[\mathbb {E} \left[e^{sX}\right]\leq e^{\varphi (u)}.\]By Taylor’s theorem, for every real \(u\) there exists a \(v\) between \({\displaystyle 0}\) and \(u\) such that
\[\varphi(u)=\varphi(0)+u\varphi'(0)+\tfrac{1}{2} u^2\varphi''(v).\]Note that:
\[{\displaystyle {\begin{aligned}\varphi (0)&=0\\\varphi '(0)&=-\theta +\left.{\frac {\theta e^{u}}{1-\theta +\theta e^{u}}}\right|_{u=0}\\&=0\\[6pt]\varphi ''(v)&={\frac {\theta e^{v}\left(1-\theta +\theta e^{v}\right)-\theta ^{2}e^{2v}}{\left(1-\theta +\theta e^{v}\right)^{2}}}\\[6pt]&={\frac {\theta e^{v}}{1-\theta +\theta e^{v}}}\left(1-{\frac {\theta e^{v}}{1-\theta +\theta e^{v}}}\right)\\[6pt]&=t(1-t)&&t={\frac {\theta e^{v}}{1-\theta +\theta e^{v}}}\\&\leq {\tfrac {1}{4}}&&t>0\end{aligned}}}\]Therefore,
\[{\displaystyle \varphi (u)\leq 0+u\cdot 0+{\tfrac {1}{2}}u^{2}\cdot {\tfrac {1}{4}}={\tfrac {1}{8}}u^{2}={\tfrac {1}{8}}s^{2}(b-a)^{2}.}\]This implies
\[{\displaystyle \mathbb {E} \left[e^{sX}\right]\leq \exp \left({\tfrac {1}{8}}s^{2}(b-a)^{2}\right).}\]Let \(X_1, \ldots, X_n\) be independent random variables bounded by the interval \([0, 1]: 0 ≤ X_i ≤ 1\). We define the empirical mean of these variables by
\[{\displaystyle {\overline {X}}={\frac {1}{n}}(X_{1}+\cdots +X_{n}).}\]One of the inequalities in Theorem 1 of Hoeffding (1963) states
\[{\displaystyle {\begin{aligned}\operatorname {P} ({\overline {X}}-\mathrm {E} [{\overline {X}}]\geq t)\leq e^{-2nt^{2}}\end{aligned}}}\]where \({\displaystyle 0\leq t}.\)
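A small simulation comparing the empirical deviation probability with the Hoeffding bound above; the sample size and threshold are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 50, 0.1, 100_000
X = rng.uniform(0.0, 1.0, size=(trials, n))          # X_i in [0, 1], E[X_i] = 0.5
dev = X.mean(axis=1) - 0.5                           # empirical mean minus its expectation
print(np.mean(dev >= t), np.exp(-2 * n * t ** 2))    # empirical frequency vs the bound e^{-2nt^2}
```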
Theorem 2 of Hoeffding (1963) is a generalization of the above inequality when it is known that \(X_i\) are strictly bounded by the intervals \([a_i, b_i]\):
\[{\displaystyle {\begin{aligned}\operatorname {P} \left({\overline {X}}-\mathrm {E} \left[{\overline {X}}\right]\geq t\right)&\leq \exp \left(-{\frac {2n^{2}t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right)\\\operatorname {P} \left(\left|{\overline {X}}-\mathrm {E} \left[{\overline {X}}\right]\right|\geq t\right)&\leq 2\exp \left(-{\frac {2n^{2}t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right)\end{aligned}}}\]Suppose \(X_1,\ldots,X_n\) are n independent random variables such that
\[{\displaystyle \operatorname {P} \left(X_{i}\in [a_{i},b_{i}]\right)=1,\qquad 1\leq i\leq n.}\]Let \({\displaystyle S_{n}=X_{1}+\cdots +X_{n}.}\)
Then for \(s, t ≥ 0\), Markov’s inequality and the independence of \(X_i\) implies:
\[{\displaystyle {\begin{aligned}\operatorname {P} \left(S_{n}-\mathrm {E} \left[S_{n}\right]\geq t\right)&=\operatorname {P} \left(e^{s(S_{n}-\mathrm {E} \left[S_{n}\right])}\geq e^{st}\right)\\&\leq e^{-st}\mathrm {E} \left[e^{s(S_{n}-\mathrm {E} \left[S_{n}\right])}\right]\\&=e^{-st}\prod _{i=1}^{n}\mathrm {E} \left[e^{s(X_{i}-\mathrm {E} \left[X_{i}\right])}\right]\\&\leq e^{-st}\prod _{i=1}^{n}e^{\frac {s^{2}(b_{i}-a_{i})^{2}}{8}}\\&=\exp \left(-st+{\tfrac {1}{8}}s^{2}\sum _{i=1}^{n}(b_{i}-a_{i})^{2}\right)\end{aligned}}}\]Note that things in the parenthesis are a quadratic function and achieves its minimum at
\[{\displaystyle s={\frac {4t}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}.}\]Thus we get
\[{\displaystyle \operatorname {P} \left(S_{n}-\mathrm {E} \left[S_{n}\right]\geq t\right)\leq \exp \left(-{\frac {2t^{2}}{\sum _{i=1}^{n}(b_{i}-a_{i})^{2}}}\right).}\]Let \(F:\Omega \to \mathbb {R}\) be a continuously-differentiable, strictly convex function defined on a closed convex set \(\Omega\).
The Bregman distance associated with F for points \(p,q\in \Omega\) is the difference between the value of F at point p and the value of the first-order Taylor expansion of F around point q evaluated at point p:
\[D_{F}(p,q)=F(p)-F(q)-\langle \nabla F(q),p-q\rangle\]A basic definition of a discrete-time martingale is a discrete-time stochastic process (i.e., a sequence of random variables) \(X_1, X_2, X_3, \ldots\) that satisfies for any time \(n\),
\[\begin{aligned} & \mathbf {E} (\vert X_{n}\vert )<\infty \\ & \mathbf {E} (X_{n+1}\mid X_{1},\ldots ,X_{n})=X_{n} \end{aligned}\]That is, the conditional expected value of the next observation, given all the past observations, is equal to the most recent observation.
Suppose \(\{ X_k : k = 0, 1, 2, 3, \ldots \}\) is a martingale (or super-martingale) and
\[{\displaystyle |X_{k}-X_{k-1}|<c_{k},\,}\]almost surely. Then for all positive integers \(N\) and all positive reals \(t\),
\[{\displaystyle P(X_{N}-X_{0}\geq t)\leq \exp \left({-t^{2} \over 2\sum _{k=1}^{N}c_{k}^{2}}\right).}\]TODO: Proof
McDiarmid’s Inequality is a generalization of Hoeffding’s inequality.
Consider independent random variables \(X_1, \ldots, X_n\) whose domain is \(\mathcal{X}\), and a mapping \(f: \mathcal{X}^n \mapsto \mathbb{R}\). If, for all \(i \in \{1,\ldots,n\}\) and all \(x_1, \ldots, x_n, x^* \in \mathcal{X}\),
\[\left\vert f(x_1,\cdots,x_{i-1},x_i,x_{i+1},\cdots,x_n) - f(x_1,\cdots,x_{i-1},x^*,x_{i+1},\cdots,x_n) \right\vert \leq c_i\](In other words, replacing the \(i\)-th coordinate \({\displaystyle x_{i}}\) by some other value changes the value of \({\displaystyle f}\) by at most \({\displaystyle c_{i}}\).)
Then
\[\mathop{\operatorname{Pr}}(f(X_1, \cdots, X_n) - \mathbb{E}[f]\geq ε) \leq \exp \left( - \frac{2ε^2}{\sum_{i=1}^{n}{c_i^2}} \right) \\ \mathop{\operatorname{Pr}}(f(X_1, \cdots, X_n) - \mathbb{E}[f]\leq -ε) \leq \exp \left( - \frac{2ε^2}{\sum_{i=1}^{n}{c_i^2}} \right)\]Proof:
Let \(Z_k = \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_{1\ldots k}]\) be a sequence of random variables.
Example:
\[\begin{aligned} Z_0 &= \mathop{\operatorname{\mathbb{E}}}[f],\,\,\,\text{ no randomness. } \\ Z_1 &= \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_1] \\ Z_2 &= \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_1, X_2] \\ Z_n &= \mathop{\operatorname{\mathbb{E}}}[f(X_1, \ldots, X_n) \vert X_{1\ldots n}] = f(X_1, \ldots, X_n) \\ \end{aligned}\]Note that the \(Z_k\) are not independent; in fact they are strongly dependent.
Informally, given \(Z_{k-1}\) we cannot recover \(X_1, \ldots, X_{k-1}\), since multiple realizations may lead to the same \(Z_{k-1}\).
\(Z_{k}\) depends on \(Z_{k-1}\) in the sense that \(Z_k\) is generated from the same \(X_1, \ldots, X_{k-1}\) that produced \(Z_{k-1}\), plus the new \(X_k\).
No formal proof is given here, but \(Z_k\) is a martingale (the Doob martingale of \(f\)):
\[\mathop{\operatorname{\mathbb{E}}}{[Z_k\vert Z_{k-1}]} = Z_{k-1}\]Now apply Azuma's inequality:
\[P(Z_{N}-Z_{0}\geq t)\leq \exp \left({-t^{2} \over 2\sum _{k=1}^{N}c_{k}^{2}}\right)\]Suppose we are making binary predictions, with N experts. Suppose there is one perfect expert.
Let \(\widetilde{y}^t, x_i^t\) be the algorithm and the i-th expert’s prediction at round t. \(x_i^t \in \{ 0, 1 \}\).
Let \(y^t\) be the true value nature reveals.
Let \(w_i^t\) be the weight we assign to each expert; the initial weight \(w_i^0\) is \(1\).
Let \(M_T(\text{algorithm})\) and \(M_T(i)\) be the total number of mistakes the algorithm or expert i has made after round T. That is:
\[\begin{aligned} M_T(\text{algorithm}) &= \sum_{t=1}^{T}{𝟙[\hat{y}^t \neq y^t]} \\ M_T(i) &= \sum_{t=1}^{T}{𝟙[x_i^t \neq y^t]} \end{aligned}\]Define the halving algorithm:
\[\begin{aligned} \hat{y}^t &= 𝟙\left[ \sum_i{w_i^tx_i^t} \geq \tfrac{1}{2} \sum_i{w_i^t} \right] \\ w_i^{t+1} &= w_i^t \cdot 𝟙[x_i^t = y^t] \end{aligned}\]Theorem: \(M_T(\text{algorithm}) \leq \log_2N\)
Worst case: suppose we have \(N = 2^n\) experts, only one of whom is perfect. At each round, half of the surviving experts predict \(1\) and the other half predict \(0\), which maximally slows down the "shrinking rate". Then the Halving algorithm makes exactly \(\log_2 N\) mistakes before it isolates the perfect expert, and makes no more mistakes from then on. A minimal sketch follows.
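A minimal sketch of the Halving algorithm under the stated assumption that one perfect expert exists; the random data and the helper name are mine.

```python
import numpy as np

def halving(expert_preds, truths):
    """Keep only experts that have never erred; predict with the majority of the survivors."""
    n = expert_preds.shape[0]
    alive = np.ones(n, dtype=bool)
    mistakes = 0
    for t, y in enumerate(truths):
        vote = int(expert_preds[alive, t].mean() >= 0.5)   # majority vote of surviving experts
        mistakes += int(vote != y)
        alive &= (expert_preds[:, t] == y)                 # drop every expert that erred this round
    return mistakes

rng = np.random.default_rng(0)
T, N = 100, 16                                             # N = 2^4 experts, one of them perfect
truths = rng.integers(0, 2, size=T)
preds = rng.integers(0, 2, size=(N, T))
preds[0] = truths                                          # expert 0 never errs
print(halving(preds, truths), np.log2(N))                  # mistakes <= log2(N) = 4
```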
Some basic inequalities (referenced below as ineq. 1, 2, 3): \(-\log(1-\epsilon) \leq \epsilon + \epsilon^2\) for \(0 < \epsilon \lesssim 0.683803\); \(-\log\left(1-\tfrac{\epsilon}{2}\right) \geq \tfrac{\epsilon}{2}\); and \(e^{sx} \leq 1 + (e^s-1)x\) for \(x \in [0,1]\).
What if the best expert makes a few mistakes?
We decay the expert’s weight!!!
Define the Majority Weight Algorithm (MWA) to be:
\[\text{MWA:} \begin{cases} \hat{y}^t = \operatorname{round} \left( \frac {\sum{w_i^t}{x_i^t}}{\sum{w_i^t}} \right) \\ w_i^{t+1} = w_i^t \left( 1 - \epsilon \right)^{𝟙[x_i^t \neq y^t]},\,\, 0 < \epsilon \lesssim 0.683803 \end{cases}\]Define the total weight at time t to be \(\Phi_t = \sum_i{w_i^t}\). Notice that:
the third inequality holds because if \(y^t \neq \hat{y}^t\), then at least half of the total weight decays by the factor \((1-\epsilon)\), so the total weight decays by a factor of at least \(\left( 1 - \frac{\epsilon}{2} \right)\).
Hence
\[(1-\epsilon)^{M_T(i)} \leq \Phi_{T} \leq N \left( 1-\frac{\epsilon}{2} \right) ^ {M_T(\text{MWA})}\]take negative log we get
\[-\log \left( (1-\epsilon)^{M_T(i)} \right) \geq -\log \left( N \left( 1-\frac{\epsilon}{2} \right) ^ {M_T(\text{MWA})} \right)\]Which is
\[M_T(i) \left( - \log \left( 1-\epsilon \right) \right) \geq - \log N + M_T(\text{MWA}) \left( - \log \left( 1-\frac{\epsilon}{2} \right) \right)\]By ineq. 1 and ineq. 2 above we get
\[\begin{aligned} M_T(i) \left( \epsilon + \epsilon^2 \right) + \log N & \geq M_T(\text{MWA}) \frac{\epsilon}{2} \\ 2 M_T(i) \left( 1 + \epsilon \right) + \frac{2}{\epsilon} \log N & \geq M_T(\text{MWA}) \end{aligned}\]If instead of rounding \(\frac{\sum{w_i^t}{x_i^t}}{\sum{w_i^t}}\), we use a loss function \(l(\widetilde{y}, y)\) that is convex in \(\widetilde{y}\)
Define expert i's loss after T rounds to be \(L_T(i) = \sum_t{l(x_i^t, y^t)}\); the algorithm's loss is defined similarly.
Define the regret of the algorithm as \(\text{Regret} = L_T(\text{Alg}) - \min_i{L_T(i)}\), that is, the additional loss of our algorithm compared to the best expert.
\[\text{Algorithm EWA } : \begin{cases} \widetilde{y}^t = \frac{\sum{w_i^t}{x_i^t}}{\sum{w_i^t}} \\ w_i^{t+1} = w_i^t \exp{ - \eta l(x_i^t,y^t)} \\ \end{cases}\]Theorem: The EWA has the following properties:
\[L_T(EWA) \leq \frac{\eta L_T(i) + \log N }{1-e^{-\eta}},\,\,\,\, i = 1,\ldots,N\]Corollary: by tuning \(\eta\), we have
\[\text{Regret} = L_T(EWA) - L_T(i) \leq \log N + \sqrt{2 L_T(i) \log (N)}\]Proof of corollary: apply a first-order Taylor expansion of \(e^{-\eta}\) around 0.
First we describe an alternative setting, Hedge (acting over actions); it is not used in the proof below, but the same proof applies to both settings.
There are N "actions". For \(t = 1,\ldots,T\), the algorithm selects a distribution \(p^t \in \Delta_N\); nature reveals the cost \(l_i^t \in [0,1]\) of each action \(i\); the algorithm pays \(\mathbb{E}[l^t] = \sum_i{p_i^t l_i^t}\). Initially \(\textbf{w}^1 = \textbf{1}\).
\[\text{EWA} \begin{cases} \textbf{p}^t = \frac{ \textbf{w}^t }{ \| \textbf{w}^t \|_1 } \\ w_i^{t+1} = w_i^t \exp(-\eta l_i^t) \end{cases}\] \[\mathbb{E}[L_T(\text{algorithm})] = \sum_{t=1}^{T}{\textbf{p}^t \cdot \textbf{l}^t}\] \[L_T(i) = \sum_{t=1}^{T}{l_i^t} \text{ loss if you always choose action }i\]The two settings are essentially equivalent: set \(l_i^t = l(x_i^t, y^t)\). A small simulation sketch of the Hedge update follows.
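A small simulation sketch of EWA in the Hedge setting; the loss matrix and the step size η are my own choices, and the printed bound is the theorem's \((\eta L_T(i) + \log N)/(1 - e^{-\eta})\), which holds for any η.

```python
import numpy as np

def hedge(losses, eta):
    """Hedge / EWA over N actions; losses[t, i] = l_i^t in [0, 1]."""
    T, N = losses.shape
    w = np.ones(N)
    total = 0.0
    for t in range(T):
        p = w / w.sum()                      # p^t proportional to the current weights
        total += np.dot(p, losses[t])        # expected loss paid at round t
        w *= np.exp(-eta * losses[t])        # exponential weight update
    return total

rng = np.random.default_rng(0)
T, N = 1000, 10
losses = rng.uniform(size=(T, N))
losses[:, 0] *= 0.2                          # action 0 is clearly the best one
eta = np.sqrt(2 * np.log(N) / T)
best = losses.sum(axis=0).min()              # L_T(i) for the best action
bound = (eta * best + np.log(N)) / (1 - np.exp(-eta))
print(hedge(losses, eta), bound)             # the algorithm's total loss stays below the bound
```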
Let X be any random variable taking values in [0,1] and \(s \in \mathbb{R}\); then by ineq. 3 above
\[e^{sX} \leq 1 + (e^s-1)X\] \[\mathbb{E}[e^{sX}] \leq 1 + (e^s-1)\mathbb{E}[X]\] \[\log \mathbb{E}[{e^{sX}}] \leq \log \left( { 1 + (e^s-1)\mathbb{E}[X] } \right) \leq (e^s-1)\mathbb{E}[X]\]Proof
Let the total weight be \(W^t = \sum_{i=1}^N { w_i^t}\), and define the potential \(\Phi_t = -\log W^t\).
Define random variable \(X_t\)
\[p(X_t = l(x_i^t,y^t)) = \frac{w_i^t}{\sum_j{w_j^t}}\]Notice:
\[\begin{aligned} \Phi_{t+1} - \Phi_{t} &= -\log \left(\frac{\sum_{i=1}^N { w_i^{t+1}}}{\sum_{i=1}^N { w_i^t}} \right) \\ &= -\log \left(\frac{\sum_{i=1}^N { w_i^{t} \exp(-\eta l(x_i^t,y^t)) }}{\sum_{i=1}^N { w_i^t}} \right) \\ &= -\log \mathbb{E}[\exp(-\eta X_t)] \end{aligned}\]Using the EWA lemma above,
\[\begin{aligned} \Phi_{t+1} - \Phi_{t} &= -\log \mathbb{E}[\exp(-\eta X_t)] \\ &\geq -(e^{-\eta}-1) \mathbb{E}[X_t] \\ &= (1 - e^{-\eta}) \sum_{i=1}^{N}{\frac{w_i^t}{\sum_j{w_j^t}}}l(x_i^t, y^t) \\ &\geq (1 - e^{-\eta}) \, l\left(\sum_{i=1}^{N}{\frac{w_i^tx_i^t}{\sum_j{w_j^t}}}, y^t\right) \text{ by convexity of }l \\ &= (1 - e^{-\eta}) \, l(\widetilde{y}^t,y^t) \end{aligned}\]Hence.
\[(1 - e^{-\eta}) \sum_{t=1}^{T} l(\widetilde{y}^t,y^t) \leq \sum_{t=1}^{T}{\Phi_{t+1}-\Phi_t} = \Phi_{T+1} + \log N\] \[\Phi_{T+1} \leq -\log {w_i^{T+1}} = \eta \sum_t { l(x_i^t, y^t) } = \eta L_T(i)\]Hence
\[L_T(EWA(\eta)) \leq \frac{\eta L_T(i) + \log N }{1 - e^{-\eta}}\]Algorithm selects \(\textbf{w}^t \in \mathbb{R}^d\)
Nature selects \(\textbf{x}^t \in \mathbb{R}^d, \| \textbf{x}^t \|_2 \leq 1\)
Algorithm predicts \(\hat{y}^t = \operatorname{sign}(\textbf{w}^t \cdot \textbf{x}^t) \in \{ -1, 1 \}\)
Nature reveals \(y^t \in \{ -1, 1\}\)
\[M_T(\text{alg}) = \sum_{t=1}^{T}{𝟙[\hat{y}^t \neq y^t]} = \sum_{t=1}^{T} \frac{1-\hat{y}^ty^t}{2}\]Assume there exists \(w^* \in \mathbb{R}^d, \|w^*\| \leq 1\) s.t.
\[(w^* \cdot x^t) y^t > \gamma \forall t\]where γ is a margin parameter. Equivalently,
\[\|w^*\| \leq \frac{1}{\gamma},\,\,\,\, (w^* \cdot x^t) \cdot y^t > 1\]Perceptron Algorithm:
\[\textbf{w}^1 = \textbf{0}\] \[\begin{aligned} \textbf{w}^{t+1} = \textbf{w}^t \text{ if } y^t (\textbf{w}^t \cdot \textbf{x}^t) > 0 \\ \textbf{w}^{t+1} = \textbf{w}^t + \textbf{x}^t y^t \text{ otherwise} \end{aligned}\]Theorem: Perceptron guarantees
\[M_T \leq \frac{1}{\gamma^2} \text{ assuming } w^*\text{ exists }\]
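A minimal sketch of the Perceptron update and its mistake bound on synthetic separable data; the data generation and variable names are mine, and the formal proof follows below.

```python
import numpy as np

def perceptron(X, y):
    """Online perceptron: X[t] is the feature vector at round t, y[t] in {-1, +1}."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for x_t, y_t in zip(X, y):
        if y_t * np.dot(w, x_t) <= 0:          # wrong (or zero-margin) prediction
            w += y_t * x_t                     # update rule: w <- w + y^t x^t
            mistakes += 1
    return w, mistakes

rng = np.random.default_rng(0)
w_star = np.array([0.6, 0.8])                                      # ||w*||_2 = 1
X = rng.normal(size=(500, 2))
X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1.0)     # enforce ||x^t||_2 <= 1
y = np.sign(X @ w_star)
gamma = np.abs(X @ w_star).min()                                   # empirical margin
w, m = perceptron(X, y)
print(m, 1.0 / gamma ** 2)                                         # mistakes <= 1/gamma^2
```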
Proof: Let \(\Phi_t = \| \textbf{w}^* - \textbf{w}^{t+1} \|_2^2\), then
\[\begin{aligned} \frac{1}{\gamma^2} &> \| \textbf{w}^*\|^2 > \| \textbf{w}^*-\textbf{0}\|_2^2 - \|\textbf{w}^*-\textbf{w}^{t+1}\|_2^2 \\ &= \Phi_0 - \Phi_T = \sum_{t=1}^{T}{ \left( \Phi_{t-1} - \Phi_{t} \right) } \\ &= \sum_{t=1}^{T}{ \left( \| \textbf{w}^* - \textbf{w}^{t}\|_2^2 - \| \textbf{w}^* - \textbf{w}^{t+1}\|_2^2 \right) } \\ &= \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( \| \textbf{w}^* - \textbf{w}^{t} \|_2^2 - \| \textbf{w}^* - (\textbf{w}^{t}+\textbf{x}^ty^t) \|_2^2 \right) } \\ &= \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( \| \textbf{w}^* - \textbf{w}^{t} \|_2^2 - \| (\textbf{w}^* - \textbf{w}^{t}) -\textbf{x}^ty^t) \|_2^2 \right) } \\ &= \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( 2 (\textbf{w}^* - \textbf{w}^{t}) \cdot \textbf{x}^t y^t - y^ty^t \textbf{x}^t\cdot \textbf{x}^t \right) } \\ \end{aligned}\]By our condition, \(- y^ty^t \textbf{x}^t\cdot \textbf{x}^t \geq -1\), \(- \textbf{w}^t \cdot \textbf{x}^t y^t > 0\) (because we make an error here), and \(\textbf{w*} \cdot \textbf{x}^t y^t > 1\)
\[\frac{1}{\gamma^2} > \sum_{t:y^t(\textbf{w}^t\cdot\textbf{x}^t)<0}{ \left( 2 + 0 + (-1) \right) } = M_T(\text{algorithm})\]A quick review of the definitions (these are the same as in the paper):
Let \(G = (V,E)\) be a graph that is undirected, weighted, connected, and has no multiple edges.
Let \(V\) be the set of vertices and \(E\) the set of all edges.
Let \(\omega\) be the weight function such that \(\omega(e) > 0 \text{ for all } e \in E\).
Let \(d_G(s,t)\) be the minimum distance from \(s\) to \(t\)(\(s,t \in V\)), \(d_G(s,s)=0\)
Let \(P_s(v) = \{ u \in V : \{u,v\} \in E,\, d_G(s,v) = d_G(s,u) + \omega(\{u,v\})\}\)
\(P_s(v)\) can be read as: the neighbors of \(v\) through which a shortest path from \(s\) can enter \(v\); equivalently, when walking from \(v\) back toward \(s\) along a shortest path, the vertices that can be chosen as the first step.
It is clear that \(P_s(v) = \emptyset \iff v = s\)
Let \(\sigma_{st}\) be the number of shortest paths from \(s\) to \(t\). Define \(\sigma_{ss} = 1\)
Let \(\sigma_{st}(v)\) be the number of shortest paths from \(s\) to \(t\) containing \(v\).
Let \(\sigma_{st}(v,e)\) be the number of shortest paths from \(s\) to \(t\) containing \(v\) and \(e\).
So it is clear that \(\sigma_{st}(s) = \sigma_{st}(t) = \sigma_{st}\) and \(\sigma_{ss}(s) = 1\) and \(\sigma_{ss}(s') = 0 \text{ where } s'\neq s\)
Let \(\delta_{st}(v)\) be \(\frac{\sigma_{st}(v)}{\sigma_{st}}\).
Let \(\delta_{st}(v,e)\) be \(\frac{\sigma_{st}(v,e)}{\sigma_{st}}\).
So it is clear that \(\delta_{st}(s) = \delta_{st}(t) = 1\). A small BFS sketch for computing these quantities follows.
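For intuition, here is a small BFS sketch that computes \(d_G(s,\cdot)\), \(\sigma_{s\cdot}\), and \(P_s(\cdot)\); it assumes unit edge weights for simplicity (the paper handles general positive weights), and the adjacency list is a toy example of mine.

```python
from collections import deque

def count_shortest_paths(adj, s):
    """Single-source shortest-path counting on an unweighted graph.
    Returns distances d_G(s, .), path counts sigma_{s.}, and predecessor sets P_s(.)."""
    dist = {v: float("inf") for v in adj}
    sigma = {v: 0 for v in adj}
    pred = {v: [] for v in adj}
    dist[s], sigma[s] = 0, 1
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if dist[w] == float("inf"):        # w discovered for the first time
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:         # v lies on a shortest s-w path
                sigma[w] += sigma[v]           # Lemma 2: sigma_sw = sum of sigma_su over u in P_s(w)
                pred[w].append(v)
    return dist, sigma, pred

adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}     # a 4-cycle: two shortest paths from 1 to 4
print(count_shortest_paths(adj, 1)[1])                 # sigma: {1: 1, 2: 1, 3: 1, 4: 2}
```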
Common definition of Centrality
Define the set of predecessors of a vertex v on shortest paths from s as
\(P_s(v) = \{u ∈ V : \{u, v\} ∈ E, d_G(s, v) = d_G (s, u) + ω(u, v)\}\).
Lemma 1: (Bellman criterion)
\[\sigma_{st}(v) > 0 \iff d_G(s,t) = d_G(s,v) + d_G(v,t)\]Lemma 2: (Combinatorial shortest-path counting)
If \(s,v \in V,\, s \neq v\), then
\[\sigma_{sv} = \sum_{u\in P_s(v)}{\sigma_{su}}\]In Page 6 the author define \(\delta_{s\bullet}(v)\) as:
\[\delta_{s\bullet}(v) = \sum_{t\in V}{ \delta_{st}(v) }\]Lemma 5 If there is exactly one shortest path from \(s ∈ V\) to each \(t ∈ V\), the dependency of \(s\) on any \(v ∈ V\) obeys
\[\delta_{s\bullet}(v) = \sum_{w : v\in P_s(w)}{\left(1 + \delta_{s\bullet}(w)\right)}\]Counter example
Suppose a graph with 2 vertices (\(s\) and \(t\)) and 1 edge connecting them.
The right-hand side is an empty sum (hence 0), because there is no \(w\) with \(t\in P_s(w)\); but by the definitions above the left-hand side is \(\delta_{s\bullet}(t) = \delta_{st}(t) = 1\).
In the deduction of page 8, it says:
\[\delta_{s\bullet}(v) = \sum_{t\in V}{\delta_{st}(v)} = \sum_{t\in V}{ \sum_{w : v\in P_s(w)}{ \delta_{st}(v, \{v, w\})}} = \sum_{w : v\in P_s(w)} {\sum_{t\in V} {\delta_{st}(v, \{v, w\})}}\]Counter example:
Suppose a graph with 2 vertices (\(s\) and \(t\)) and 1 edge connecting them.
In page 8, it says:
\[δst (v, \{v, w\}) = \begin{cases} \frac{σ_{sv}}{σ_{sw}} & \text{if } t = w \\ \frac{σ_{sv}}{σ_{sw}} \cdot \frac{σ_{st}(w)}{σ_{st}} & \text{if } t \neq w \end{cases}\]This case split is redundant: if \(t = w\), then \(\frac{σ_{st}(w)}{σ_{st}} = 1\), so in both cases it is just
\[δst (v, \{v, w\}) = \frac{σ_{sv}}{σ_{sw}} \cdot \frac{σ_{st}(w)}{σ_{st}}\]The two parts marked in red in the original notes do not follow from the definition.
Definition:
\[\delta_{s\bullet}(v) = \sum_{t\in V, t\neq s \neq v}{ \delta_{st}(v) }\]We have:
\[\begin{aligned} \delta_{s\bullet}(v) &= \sum_{t\in V, t\neq s \neq v}{ \left( \delta_{st}(v) \right) } \\ &= \sum_{t\in V, t\neq s \neq v}{ \left( \frac{\sigma_{st}(v)}{\sigma_{st}} \right) } \\ &= \sum_{t\in V, t\neq s \neq v}{ \left( \frac{\sum_{w:v\in P_s(w)}{\sigma_{st}(v,\{v,w\})}}{\sigma_{st}} \right) } \\ \end{aligned}\]Note in the parenthesis above that we have
\[\sigma_{st}(v, \{v,w\}) = \frac{\sigma_{sv}}{\sigma_{sw}} \cdot \sigma_{st}(w)\]Why?.The definition of \(w\) means that \(v\) is closer to \(s\) than \(w\). and \(\sigma_{st}(v, \{v,w\}) = \sigma_{st}(w, \{v,w\})\). So the equation means that of all the paths from \(s\) to \(t\) via \(w\), which is \(\sigma_{st}(w)\), we take a fraction by only allowing one way (a.k.a \(\{v,w\}\)) of entering \(w\). We had \(\sigma_{sw}\) ways of entering \(w\) but now we have only \(\sigma_{sv}\). Hence the total possible ways is also fractured by \(\frac{\sigma_{sv}}{\sigma_{sw}}\).
Continue the equation:
\[\begin{aligned} \delta_{s\bullet}(v) &= \sum_{t\in V, t\neq s \neq v}{ \left( \delta_{st}(v) \right) } \\ &= \sum_{t\in V, t\neq s \neq v}{ \left( \frac{\sigma_{st}(v)}{\sigma_{st}} \right) } \\ &= \sum_{t\in V, t\neq s \neq v}{ \left( \frac{\sum_{w:v\in P_s(w)}{\sigma_{st}(v,\{v,w\})}}{\sigma_{st}} \right) } \\ &= \sum_{t\in V, t\neq s \neq v}{ \left( \frac{\sum_{w:v\in P_s(w)}{\frac{\sigma_{sv}}{\sigma_{sw}} \cdot \sigma_{st}(w)}}{\sigma_{st}} \right) } \\ &= \sum_{t\in V, t\neq s \neq v}{ \sum_{w:v\in P_s(w)} { \left( \frac{\sigma_{sv}}{\sigma_{sw}} \cdot \delta_{st}(w) \right) }} \\ &= \sum_{w:v\in P_s(w)} { \sum_{t\in V, t\neq s \neq v} { \left( \frac{\sigma_{sv}}{\sigma_{sw}} \cdot \delta_{st}(w) \right) }} \\ &= \sum_{w:v\in P_s(w)} \left( \frac{\sigma_{sv}}{\sigma_{sw}} \cdot { \sum_{t\in V, t\neq s \neq v} { \left( \delta_{st}(w) \right) }} \right) \\ &= \sum_{w:v\in P_s(w)} \left( \frac{\sigma_{sv}}{\sigma_{sw}} \cdot \left( { \left( \sum_{t\in V, t\neq s \neq w} { \delta_{st}(w) } \right) } + \delta_{sw}(w) - \delta_{sv}(w) \right) \right) \\ &= \sum_{w:v\in P_s(w)} \left( \frac{\sigma_{sv}}{\sigma_{sw}} \cdot ( \delta_{s\bullet}(w) + 1 ) \right) \end{aligned}\]The last step uses \(\delta_{sw}(w) = 1\) and \(\delta_{sv}(w) = 0\) (w is farther from s than v, so no shortest s-v path passes through w).
1.18 Definition of addition and multiplication
1.19 Vector space (also called linear space)
A vector space over a field \(F\) is a set \(V\) together with two operations \(+ : V × V → V\) (addition) and \(· : F × V → V\) (scalar multiplication) satisfying the following properties:
Axiom | Statement |
---|---|
Associativity of vector addition | u + (v + w) = (u + v) + w |
Commutativity of vector addition | u + v = v + u |
Identity element of vector addition | There exists an element 0 ∈ V, called the zero vector, such that u + 0 = u for every u ∈ V |
Inverse elements of vector addition | For every v ∈ V there exists an inverse −v ∈ V such that v + (−v) = 0 |
Compatibility of scalar multiplication with field multiplication | a(bv) = (ab)v, a,b ∈ F |
Identity element of scalar multiplication | The multiplicative identity 1 of F satisfies 1v = v |
Distributivity of scalar multiplication over vector addition | a(u + v) = au + av |
Distributivity of scalar multiplication over field addition | (a + b)v = av + bv |
Note that the addition and scalar multiplication here are required, by definition, to be closed (their results stay in V).
1.20 Definition: vector, point
Elements of a vector space are called vectors or points.
1.21 Definition real vector space, complex vector space
Up to this point, the book is essentially saying that vector arithmetic is intuitive.
If U is a subset of V and U is itself a vector space under the addition and scalar multiplication of V, then U is called a subspace of V.
U is a subspace of V if and only if the following three conditions hold (contains 0, closed under addition, closed under scalar multiplication).
Proof: the forward direction is trivial; the converse also follows from the definitions without much difficulty; omitted.
1.36 Definition: sum of subspaces
Suppose \(\boldsymbol{U_1} \cdots \boldsymbol{U}_m\) are subsets of \(\boldsymbol{V}\).
\[\boldsymbol{U}_1 + \cdots + \boldsymbol{U}_m = \left\{ \boldsymbol{u}_1 + \cdots + \boldsymbol{u}_m : \boldsymbol{u}_1 \in \boldsymbol{U}_1, \cdots, \boldsymbol{u}_m \in \boldsymbol{U}_m \right\}\]1.39 Theorem: the sum of subspaces is the smallest subspace containing all of them
Proof: clearly the sum of subspaces is itself a subspace, and clearly it contains \(\boldsymbol{U}_1,\, \boldsymbol{U}_2,\, \cdots\), so it is large enough.
Conversely, by closure, any subspace containing \(\boldsymbol{U}_1,\, \boldsymbol{U}_2,\, \cdots\) also contains their sum, so the sum is small enough.
1.40 Definition: direct sum
Suppose \(\boldsymbol{U_1} \cdots \boldsymbol{U}_m\) are subspaces of \(\boldsymbol{V}\).
Somewhat analogous to orthogonal complements; it also behaves like a Cartesian product.
1.45 A sum of two subspaces is a direct sum if and only if their intersection contains only the element 0
2.11 Definition polynomial: A function p : F → F is called a polynomial with coefficients in F if there exist \(a_0, \cdots, a_m \in \mathbb{F}\) such that
\[p(z) = a_0 + a_1z + a_2z^2 + \cdots + a_mz^m\]for all \(z \in \boldsymbol{F}\).
\(\mathcal{P}(\mathbb{F})\) is the set of all polynomials with coefficients in \(\mathbb{F}\).
2.32 Every finite-dimensional vector space has a basis
Note: Wikipedia says that, assuming the axiom of choice, every infinite-dimensional space also has a basis, but the author avoids the infinite-dimensional discussion here.
2.34 (In the finite-dimensional case) every subspace has a complement: the whole space is the direct sum (similar to a Cartesian product) of the subspace and another subspace
Suppose \(V\) is finite-dimensional and \(U\) is a subspace of V. Then there is a subspace \(W\) of V such that \(V = U \oplus W\).
2.43 Dimension of a sum: if \(\boldsymbol{U}_1,\, \boldsymbol{U}_2\) are subspaces of a finite-dimensional space, then
\[\text{dim}(\boldsymbol{U}_1 + \boldsymbol{U}_2) = \text{dim}\boldsymbol{U}_1 + \text{dim}\boldsymbol{U}_2 - \text{dim} \left( \boldsymbol{U}_1 \cap \boldsymbol{U}_2 \right)\]Note: in the discussion of linear maps that follows, the finite-dimension restriction is dropped again; the spaces may be infinite-dimensional.
3.1 Notation
3.2 Definition: Linear Map
A Linear Map is a function T : V → W that satisfies
3.3 Notation
The set of all linear maps from V to W is denoted \(\mathcal{L}(V,W)\).
3.5 Linear maps and basis of domain
Suppose \(\boldsymbol{v}_i\) is a basis of \(V\) and \(\boldsymbol{w}_i\) is a list of vectors in \(W\) (with the same number of elements as the \(\boldsymbol{v}_i\)). Then
there exists a unique linear map sending each \(\boldsymbol{v}_i\) to \(\boldsymbol{w}_i\).
Proof in the book.
3.6 Definition: addition and scalar multiplication of linear maps
If \(S,T \in \mathcal{L}(V,W) \text{ and } \lambda \in \mathbb{F}\), then
\(S + T := \backslash \boldsymbol{v} → S \boldsymbol{v} + T \boldsymbol{v}\)
\(\lambda S = \backslash \boldsymbol{v} → \lambda S \boldsymbol{v}\)
3.7 \(\mathcal{L}(V,W)\) is a vector space,
because this addition and scalar multiplication satisfy the vector space axioms.
3.8 Product of Linear Maps
If \(T \in \mathcal{L}(U,V)\) and \(S \in \mathcal{L}(V,W)\), define
\(ST = \backslash \boldsymbol{u} → S ( T ( u ) )\) for \(\boldsymbol{u} \in U\)
3.9 Algebraic properties of products of linear maps
The book leaves the proof to the reader, and I omit it as well.
Exercise 3A.8
Give an example of a function φ: ℝ² → ℝ such that φ(av) = aφ(v) for all a and v, but φ is not linear.
Take φ(x) = | max(x₁, x₂) |; then φ([0,1]) + φ([1,0]) = 2 while φ([1,1]) = 1, so φ is not additive. (Note: this φ only satisfies φ(av) = aφ(v) for a ≥ 0; an example that works for every real a is φ(x) = (x₁³ + x₂³)^{1/3}.)
Exercise 3A.9
Give an example of a function φ: ℂ² → ℂ such that φ(w+z)=φ(w)+φ(z) but φ is not linear.
(Here ℂ is thought of as a complex vector space.) [There also exists such a function on ℝ. However, showing the existence of such a function requires considerably more advanced tools.]
Take φ(x) = Re(x): additivity holds, but φ(i⋅1) = 0 while i⋅φ(1) = i, so homogeneity fails.
3.12 Definition: Null space, null T
For \(T \in \mathcal{L}(V,W)\), the null space of T, denoted null T, is the subset of V consisting of those vectors that T maps to 0:
\[\text{ null } T = \left\{ v \in V : T(v)=0 \right\}\]3.14 The null space is a subspace
First, 0 is in null T; by the linearity of T, the elements of null T are closed under addition and scalar multiplication, so null T satisfies the three subspace criteria: contains 0, closed under addition, closed under scalar multiplication.
3.15 Definition: injective (one-to-one)
A function T: V -> W is called injective if T(u)=T(v) implies u=v.
3.16 Theorem: Injectivity is equivalent to null space equals {0}
The proof is fairly trivial; omitted.
Note: when dimensions and computations are discussed, the finite-dimensionality requirement is added back.
3.22 Theorem: Fundamental Theorem of Linear Maps (only part of it)
If \(V\) is a finite-dimensional vector space and \(T \in \mathcal{L}(V,W)\), then the range of T is finite-dimensional and
\[\text{dim }V = \text{dim null }T + \text{dim range }T\]Note: by 2.34, V can be split directly into a direct sum of two subspaces (null T and a complement), and likewise for W.
3.23 A map into a space of smaller dimension is not injective
dim null T = dim V − dim range T ≥ dim V − dim W > 0;
3.24 A map into a space of larger dimension is not surjective
3.30 Definition: matrix, \(A_{j,k}\)
Let m and n denote positive integers. An m-by-n matrix A is a rectangular array of elements of F with m rows and n columns:
\[A = \begin{pmatrix} A_{1,1} & \cdots & A_{1,n} \\ \vdots & \ddots & \vdots \\ A_{m,1} & \cdots & A_{m,n} \end{pmatrix}\]The notation \(A_{j,k}\) denotes the entry in row j , column k of A.
3.32 定义 matrix of a linear map, \(\mathcal{M}(T)\) Suppose \(T \in \mathcal{L}(V,W)\) and \(v_1, \cdots, v_n\) is a basis of \(V\) and \(w_1, \cdots, w_m\) is a basis of \(W\). The matrix of T with respect to these bases is the m-by-n matrix \(\mathcal{M}(T)\) whose entries \(A_{j,k}\) are defined by
\[T v_k = A_{1,k}w_1 + \cdots + A_{m,k}w_m\]If the bases are not clear from the context, then the notation \(\mathcal{M}\left(T, \left(v_1, \cdots, v_n\right), \left(w_1, \cdots, w_m\right)\right)\) is used
注: \(\boldsymbol{w} = (\mathcal{M}(T)) \boldsymbol{v}\)
3.39 Notation \(\mathbb{F}^{m,n}\)
For m and n positive integers, the set of all m-by-n matrices with entries in \(\mathbb{F}\) is denoted by \(\mathbb{F}^{m,n}\)
3.53 Definition: invertible, inverse
A linear map \(T \in \mathcal{L}(V,W)\) is called invertible if there exists a linear map \(S \in \mathcal{L}(W,V)\) such that \(ST\) equals the identity map on \(V\) and \(TS\) equals the identity map on \(W\).
Such \(S\) is called an inverse of \(T\).
3.58 Definition: Isomorphism, isomorphic
3.59 Dimension shows whether vector spaces are isomorphic
Two finite-dimensional vector spaces over F are isomorphic if and only if they have the same dimension.
3.60 \(\mathcal{L}(V,W)\) and \(\mathbb{F}^{m,n}\) are isomorphic, and the matrix map \(\mathcal{M}\) is the isomorphism.
Proof: omitted.
3.62 Definition: Matrix of a vector, \(\mathcal{M}(v)\)
Suppose \(v \in V\) and \(v_1, \cdots, v_n\) is a basis of \(V\). The matrix of v respect to this basis is the n-by-1 matrix
\[\mathcal{M}(v) = \begin{pmatrix} c_1 \\ \vdots \\ c_n \end{pmatrix}\]where \(v = c_1v_1 + \cdots + c_nv_n\)
3.64 \(\mathcal{M}(T)_{\cdot,k} = \mathcal{M}(Tv_k)\) Note there is a printing error in the book.
Meaning: the k-th column of the matrix of a linear map is the coordinate vector, in the target basis, of the image of the k-th basis vector of the domain.
3.65 Linear maps act like matrix multiplication
Suppose \(T \in \mathcal{L}(V,W),\, v \in V\). \(v_1, \cdots, v_n\) is a basis of \(V\) and \(w_1, \cdots, w_m\) is a basis of \(W\). Then
\[\mathcal{M}(Tv) = \mathcal{M}(T) \mathcal{M}(v)\]Earlier, 3.43 already gave \(\mathcal{M}(S \circ T) = \mathcal{M}(S) \mathcal{M}(T)\), so what is 3.65 adding? (A small numerical illustration follows.)
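A small numerical illustration of 3.64/3.65, using my own example (not from the book): the differentiation operator D on \(\mathcal{P}_2(\mathbb{R})\) in the basis \(1, x, x^2\), where column k of \(\mathcal{M}(D)\) holds the coordinates of D applied to the k-th basis vector.

```python
import numpy as np

# D(1) = 0, D(x) = 1, D(x^2) = 2x, written in the basis 1, x, x^2
M_D = np.array([[0.0, 1.0, 0.0],
                [0.0, 0.0, 2.0],
                [0.0, 0.0, 0.0]])
p = np.array([5.0, 3.0, 4.0])        # M(v) for p(x) = 5 + 3x + 4x^2
print(M_D @ p)                       # [3, 8, 0], the coordinates of p'(x) = 3 + 8x
```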
3.67 Definition: operator, \(\mathcal{L}(V)\)
3.69 Theorem: for an operator on a finite-dimensional space, injective, surjective, and bijective are all equivalent.
3.71 Definition: product of vector spaces
Suppose \(V_1, \cdots, V_m\) are vector spaces over \(\mathbb{F}\).
The product \(V_1 \times \cdots \times V_m\) is defined by
\[V_1 \times \cdots \times V_m = \left\{ ( v_1 , \ldots , v_m ) : v_1 \in V_1 , \ldots , v_m \in V_m \right\}.\]Addition on \(V_1 \times \cdots \times V_m\) is defined by
\[(u_1 , \ldots , u_m) + (v_1 , \ldots , v_m) = (u_1 + v_1 , \ldots , u_m + v_m ).\]Scalar multiplication on \(V_1 \times \cdots \times V_m\) is defined by
\[\lambda (v_1 , \ldots , v_m ) = (\lambda v_1 , \ldots , \lambda v_m ).\]3.73 Product of vector spaces is a vector space
Suppose \(V_1, \ldots, V_m\) are vector spaces over F. Then \(V_1 \times \cdots \times V_m\) is a vector space over F.
3.75 Example Find a basis of \(\mathcal{P}_2(\mathbb{R}) \times \mathbb{R}^2\)
Solution
\[\left(1, (0, 0)\right), \left(x, (0, 0)\right), \left(x^2, (0, 0)\right), \left(0, (1, 0)\right), \left(0, (0, 1)\right).\]3.78 A sum is a direct sum if and only if dimensions add up
The sum is direct if and only if the dimension of the sum equals the sum of the dimensions.
3.79 Definition: v+U
Suppose \(v \in V\) and \(U\) is a subspace of \(V\). Then \(v + U\) is the subset of \(V\) defined by
\[v + U = \left\{v + u : u \in U \right\}.\]3.81 Definition: affine subset, parallel
An affine subset of \(V\) is a subset of \(V\) of the form \(v + U\) for some \(v \in V\) and some subspace \(U\) of \(V\).
the affine subset \(v + U\) is said to be parallel to \(U\).
A side remark before quotient spaces are introduced:
3.83 Definition: quotient space
Suppose \(U\) is a subspace of \(V\). Then the quotient space \(V / U\) is the set of all affine subsets of \(V\) parallel to \(U\). In other words,
\[V / U = \left\{v + U : v \in V \right\}.\]Clearly, the next step is to turn the quotient space into a vector space.
3.85 Two affine subsets parallel to U are equal or disjoint
Suppose \(U\) is a subspace of \(V\) and \(v, w \in V\). Then the following are equivalent:
These conditions show that two affine subsets parallel to U are either equal or disjoint.
3.86 Definition: addition and scalar multiplication on the quotient space
Suppose \(U\) is a subspace of \(V\). Then addition and scalar multiplication are defined on \(V / U\) by
\((v+U) + (w+U) = (v+w) + U\) and \(\lambda(v+U) = (\lambda v) + U\),
for \(v, w \in V\) and \(\lambda \in \mathbb{F}\).
Note: at this point the operations on the quotient space are defined through representatives, so they make sense only if the results do not depend on the chosen representatives.
3.87 Theorem: the quotient space is a vector space
Under the addition and scalar multiplication defined in 3.86, the quotient space is a vector space.
Of course, one must first show that the definitions in 3.86 are proper (well defined).
Suppose \(v + U = v' + U,\, w + U = w' + U\); the definition makes sense only if \((v+w) + U = (v'+w') + U\). The proof is simple and omitted; scalar multiplication is analogous.
The book omits the verification again, so here I check the axioms one by one.
Below, \(\overline{v}\) denotes \(v + U\).
- Associativity of vector addition: \(\overline{u} + (\overline{v} + \overline{w}) = (\overline{u} + \overline{v}) + \overline{w}\) holds; both sides equal \(\overline{u+v+w}\).
- Commutativity of vector addition: \(\overline{u} + \overline{v} = \overline{v} + \overline{u}\) clearly holds; both sides equal \(\overline{u+v}\).
- Identity element of vector addition: it is \(\overline{0}\).
- Inverse elements of vector addition: the inverse of \(\overline{v}\) is \(\overline{-v}\).
- Compatibility of scalar multiplication with field multiplication: \(a(b \overline{v}) = (ab) \overline{v},\, a,b ∈ \mathbb{F}\) holds; both sides equal \(\overline{abv}\).
- Identity element of scalar multiplication: the multiplicative identity 1 of the original field \(\mathbb{F}\) satisfies \(1\overline{v} = \overline{v}\).
- Distributivity of scalar multiplication over vector addition: \(a(\overline{u} + \overline{v}) = \overline{au} + \overline{av}\) holds; both sides equal \(\overline{a(u+v)}\).
- Distributivity of scalar multiplication over field addition: \((a + b) \overline{v} = a \overline{v} + b \overline{v}\) holds; both sides equal \(\overline{(a+b)v}\).
3.88 Definition: quotient map, \(\pi\)
Suppose \(U\) is a subspace of \(V\). The quotient map \(\pi\) is the linear map \(\pi : V → V / U\) defined by
\[\pi(v) = v + U,\, v \in V\]The author again omits the proof that \(\pi\) is a linear map; it is fairly obvious, so I skip it too.
3.89 Theorem: Dimension of a quotient space
Suppose \(V\) is finite-dimensional and \(U\) is a subspace of \(V\). Then
\[\text{dim }V/U = \text{dim }V - \text{dim }U\]This is fairly intuitive: once \(\pi\) from 3.88 is introduced, it follows from the Fundamental Theorem of Linear Maps.
3.90 Definition \(\widetilde{T}\)
Suppose \(T \in \mathcal{L}(V,W)\). Define \(\widetilde{T} : V / (\text{null } T) → W\) by
\[\widetilde{T}(v + \text{null } T) = T(v)\]This is a nice idea: vectors of the domain that differ by an element of the null space are grouped into one equivalence class, and \(\widetilde{T}\) then becomes a bijection onto range T.
3.91 Null space and range of \(\widetilde{T}\)
- \(\widetilde{T}\) is a linear map.
- \(\widetilde{T}\) is injective.
- \(\text{range } \widetilde{T} = \text{range } T\).
- \(V / (\text{null } T)\) is isomorphic to \(\text{range } T\).

Proof omitted.
3.92 Definition: linear functional
A linear functional on V is a linear map from V to F. In other words, a linear functional is an element of \(\mathcal{L}(V,F)\).
3.94 Definition: dual space, \(V'\)
The dual space of \(V\), denoted \(V'\), is the vector space of all linear functionals on \(V\). In other words, \(V' = \mathcal{L}(V,F)\).
3.95 The dual space has the same dimension as the original space
Suppose \(V\) is finite-dimensional. Then \(V'\) is also finite-dimensional and \(\text{dim }V' = \text{dim}(V)\).
Follows directly from 3.61.
3.96 Definition: Dual basis
If \(v_1, \cdots, v_n\) is a basis of \(V\), then the dual basis is the list \(\varphi_1, \ldots, \varphi_n\) of elements of \(V'\), where each \(\varphi_j\) is the linear functional on \(V\) such that
\[\varphi_j (v_k) = \begin{cases} 1 & k = j \\ 0 & k \neq j \end{cases}\]3.98 Dual basis is a basis of the dual space
Suppose V is finite-dimensional. Then the dual basis of a basis of V is a basis of V’.
TODO: note that \(V\) is required to be finite-dimensional here; I still need to ponder why.
Proof in the book.
3.99 Definition: Dual map, \(T'\)
If \(T \in \mathcal{L}(V,W),\) then the dual map of \(T\) is the linear map \(T' \in \mathcal{L}(W',V')\) defined by \(T'(\varphi) = \varphi \circ T\) for \(\varphi \in W'\).
Here we start to see where this is going: the transpose matrix. TODO: but why do it in the dual space? How did the idea of the dual space arise, and where is the intuition?
3.101 Algebraic properties of the dual map
Proof: the first two follow directly from linearity; for the third:
\[(ST)'(\varphi) = \varphi \circ (ST) = (\varphi \circ S) \circ T = T'(S'(\varphi)) = (T'S')(\varphi)\]Our goal in this subsection is to describe null T’ and range T’ in terms of range T and null T. To do this, we will need the following definition.
3.102 Definition: annihilator, \(U^0\)
For \(U \subset V\), the annihilator of \(U\), denoted \(U^0\), is defined by
\[U^0 = \left\{ \varphi \in V' : \varphi(u) = 0 \text{ for all } u \in U \right\}\]Although this definition is concise, it is rather abstract, so let me restate it in my own words:
\(U^0\) is a set of linear functionals (\(V→\mathbb{F}\)); each of them maps every element of \(U\) to \(0\).
3.105 annihilator is a subspace
Suppose \(U \subset V\). Then \(U^0\) is a subspace of \(V'\).
The proof is short; see the book.
3.106 Dimension of the annihilator
Suppose \(V\) is finite-dimensional and \(U\) is a subspace of \(V\). Then
\[\text{ dim } U + \text{ dim } U^0 = \text{ dim } V\]这个定理很直观,但是证明不是很直观。
证:令\(i \in \mathcal{L}(U,V),\, i(\boldsymbol{u}) = (\boldsymbol{u}) \text{ for } \boldsymbol{u} \in U\),则
\[i' = (\backslash f \in V' → i \circ f) \in \mathcal{L}(V',U')\]由线性代数基本定理
\[\text{dim } \text{range}i' + \text{dim }\text{null}i' = \text{dim }V'\]由 \(\text{null }i'\) 的定义和 \(U^0\) 的定义可以看出这两个定义相同,
另一个方向,从U到V再用V→𝔽,不管V的维度比U大还是小,都能等价于从U→𝔽
若 \(\varphi \in U'\),则 \(\varphi\) 可以扩展成 \(\psi \in V'\),所以 \(\varphi = i'(\psi)\),所以 \(U' \subset \text{range }i'\),又 \(\text{range }i' \subset U'\)
3.107 The null space of \(T'\)
Suppose \(V\) and \(W\) are finite-dimensional and \(T \in \mathcal{L}(V,W).\) Then
Proof:
\(\begin {aligned} \text{null }T' &= \left\{ f \in W' \middle| f \circ T = \boldsymbol{0} \in V' \right\} \\ \left( \text{range }T \right)^0 &= \left\{ f \in W' \middle| f (x) = 0 \text{ for } x \in \text{range }T \right\} \\ \left( \text{range }T \right)^0 &= \left\{ f \in W' \middle| f (T (v)) = 0 \text{ for } v \in V \right\} \\ \left( \text{range }T \right)^0 &= \left\{ f \in W' \middle| f \circ T = \boldsymbol{0} \in V' \right\} \end {aligned}\)
\(\begin{aligned} \text{dim } \text{null } T' &= \text{dim } (\text{range} T)^0 \\ &= \text{dim }W - \text{dim }\text{range} T \\ &= \text{dim }W - ( \text{dim }V - \text{dim } \text{null } T ) \\ &= \text{dim } \text{null } T + \text{dim }W - \text{dim }V \end{aligned}\)
3.108 T surjective is equivalent to T′ injective
Suppose \(V\) and \(W\) are finite-dimensional and \(T \in \mathcal{L}(V,W)\). Then \(T\) is surjective if and only if \(T'\) is injective.
Proof The map \(T \in \mathcal{L}(V,W)\) is surjective if and only if \(\text{ range }T=W\), which happens if and only if \((range T)^0 = \{0\}\), which happens if and only if \(\text{null } T' = \{0\}\) [by 3.107(a)], which happens if and only if \(T'\) is injective.
3.109 The range of T’
Suppose \(V\) and \(W\) are finite-dimensional and \(T \in \mathcal{L}(V,W)\). Then
Proof
\[\begin{aligned} \text{dim }\text{range }T' &= \text{dim }W' - \text{dim }\text{null }T' \\ &= \text{dim }W - \text{dim }(\text{range }T)^0 \\ &= \text{dim }\text{range }T \end{aligned}\]First suppose \(\varphi \in \text{range } T'\). Thus there exists \(\psi \in W'\) such that \(\varphi = T'(\psi)\). If \(v \in \text{null } T\), then
\[\varphi(v) = \left( T'(\psi) \right) v = \left( \psi \circ T \right) (v) = \psi(Tv) = \psi(0) = 0\]Moreover,
\[\begin{aligned} \text{dim }\text{range }T' &= \text{dim }\text{range }T \\ &= \text{dim }V - \text{dim }\text{null }T \\ &= \text{dim } \left( \text{null }T \right) ^0 \end{aligned}\]
Since \(\text{range }T' \subseteq \left( \text{null }T \right)^0\) and the dimensions agree (in finite dimensions, by 3.69, injective, surjective, and bijective coincide), the two subspaces are equal, which completes the proof.
TODO: add a diagram (this part may not be very intuitive; I will add a figure later).
3.110 T injective is equivalent to T’ surjective
Suppose \(V\) and \(W\) are finite-dimensional and \(T \in \mathcal{L}(V,W)\). Then \(T\) is injective if and only if \(T'\) is surjective.
Note: this is only half of the duality; we have gone from T to T′ but not from T′ back to T, and of the four fundamental subspaces, two (together with their dual-space counterparts) have now been described.
3.111 Definition: transpose
After all this duality, we finally get to the transpose.
3.113 Transpose of a product
\[(AC)^T = C^TA^T\]3.114
Suppose \(T \in \mathcal{L}(V,W)\). Then \(\mathcal{M}(T') = \left( \mathcal{M}(T) \right)^T\)
The book gives a proof, but I find the notation there quite confusing, so I write my own.
Note that an element of the dual space \(V'\) is a linear map from \(V\) to \(\mathbb{F}\), which in coordinates is just a dot product. So a vector of \(V'\), say \(\boldsymbol{u}\), acts on \(V\) by \(\boldsymbol{u}(\boldsymbol{v}) = \boldsymbol{u} \cdot \boldsymbol{v}\) (on the left, \(\boldsymbol{u}\) is an element of \(V'\); on the right, it is the coordinate representation of that element in the standard basis).
So suppose \(\varphi = \mathcal{M}(T') \psi,\,(\varphi \in V', \psi \in W')\); by the definition of \(T'\),
\[\varphi (v) = \psi ( \mathcal{M}(T)v )\]that is,
\[\varphi^T = \psi^T \mathcal{M}(T) \\ \varphi = \left( \mathcal{M}(T) \right)^T \psi\]Therefore \(\left( \mathcal{M}(T) \right)^T = \mathcal{M}(T')\).
3.115 Definition: row rank, column rank (of an m-by-n matrix)
The row rank of a matrix is the dimension of its row space; the column rank is the dimension of its column space.
3.118 Row rank equals column rank
See the diagram for 3.109.
3.119 Definition: rank
Since the row rank equals the column rank, we simply define this common value to be the rank.
Note: and with that, Chapter 3 ends rather abruptly…
Mean inequalities on the reals
If \(a,b > 0,\, a,b \in \mathbb{R}\), then
\[\sqrt{ab} \leq \frac{a+b}{2} \leq \sqrt{\frac{a^2+b^2}{2}}\]Proof:
\[\begin{aligned} \sqrt{ab} &= \sqrt{ \left( \frac{a+b}{2} + \frac{a-b}{2} \right) \left( \frac{a+b}{2} - \frac{a-b}{2} \right) } \\ &= \sqrt{ \left( \frac{a+b}{2} \right)^2 - \left( \frac{a-b}{2} \right) ^2 } \\ &\leq \sqrt{ \left( \frac{a+b}{2} \right)^2 } \\ &= \frac{a+b}{2} \\ &\leq \sqrt{ \left( \frac{a+b}{2} \right)^2 + \left( \frac{a-b}{2} \right) ^2 } \\ &= \sqrt { \frac{a^2 + b^2}{2} } \end{aligned}\]Note that this kind of bound has some flexibility, e.g.:
\[ab \leq \frac{1}{2}\epsilon a^2 + \frac{1}{2}\frac{b^2}{\epsilon},\, \forall \epsilon > 0\]The AM–GM inequality
Although length, area, and volume have not been formally defined at this point, the intuition is: \(\frac{a+b}{2}\) is the average side length of a rectangle, and \(\sqrt{ab}\) is the side length of the square with the same area.
So \(\sqrt{ab} \leq \frac{a+b}{2}\) says that, among rectangles of equal area, the square has the smallest average side length.
This naturally extends to more dimensions, so we conjecture:
\[\sqrt[n]{x_1x_2\cdots x_n} \leq \frac{x_1 + x_2 + \cdots + x_n}{n},\, x_i > 0\]Proof by induction (From Wikipedia):
Suppose it holds for integers ≤ n.
Let \(\alpha = \frac{1}{n+1} \left( x_1 + \cdots + x_{n+1} \right)\). Suppose \(x_n > \alpha\) and \(x_{n+1} < \alpha\) (after reordering; if no such reordering is possible, then all the \(x_i\) are equal).
Let \(y\) be \(x_n + x_{n+1} - \alpha\), so that \(\alpha\) is also the mean of \(x_1, \cdots, x_{n-1}, y\). By induction we have.
\[\alpha^{n+1} = \alpha^n \alpha \geq x_1x_2\cdots x_{n-1}y\alpha\]All we need to do is to prove \(y\alpha > x_nx_{n+1}\):
\[\begin{aligned} y\alpha - x_nx_{n+1} &= (x_n+x_{n+1}-\alpha)\alpha - x_nx_{n+1} \\ &= (x_n-\alpha)(\alpha-x_{n+1}) > 0 \end{aligned}\]Hence \(\alpha^{n+1} > x_1x_2\cdots x_{n+1}\) if the \(x_i\) are not all equal.
Weighted AM–GM inequality
if \(w_i\) are positive integers, we can easily have
\[\sqrt[w]{x_1^{w_1}x_2^{w_2}\cdots x_n^{w_n}} \leq \frac{w_1x_1 + w_2x_2 + \cdots + w_nx_n}{w},\, x_i > 0\]where \(w = \sum_{i}{w_i}\).
if \(w_i\) are positive rationals, the inequality also holds, by clearing denominators (reducing to the integer case).
if \(w_i\) are real numbers, a limiting argument from basic analysis shows it still holds.
Common inequalities for complex numbers
Let \(z = x + yi\), by definition
\[\left\vert z \right\vert ^2 = z \cdot \overline{z} = x^2 + y^2 = \operatorname{Re } (z)^2 + \operatorname{Im } (z)^2\]Hence
\[\left\vert z \right\vert \geq \left\vert \operatorname{Re } (z) \right\vert ,\,\,\,\, \left\vert z \right\vert \geq \left\vert \operatorname{Im } (z) \right\vert\] \[\begin{aligned} \left\vert z+w \right\vert ^2 &= (z+w) \overline{\left( z+w \right)} \\ &= (z+w) \left( \overline{z} + \overline{w} \right) \\ &= \left\vert z \right\vert^2 + \left\vert w \right\vert ^2 + (z \overline{w} + \overline{z} w) \\ &= \left\vert z \right\vert^2 + \left\vert w \right\vert ^2 + 2 \operatorname{Re } (z \overline{w}) \\ &\leq \left\vert z \right\vert^2 + \left\vert w \right\vert ^2 + 2 \left\vert z \overline{w} \right\vert \\ &= \left\vert z \right\vert^2 + \left\vert w \right\vert ^2 + 2 \left\vert z \right\vert \left\vert w \right\vert \\ &= \left( \left\vert z \right\vert + \left\vert w \right\vert \right) ^2 \end{aligned}\]Hence
\[\left\vert z+w \right\vert \leq \left\vert z \right\vert + \left\vert w \right\vert\]Finally, we have
\[\left\vert \left\vert w \right\vert - \left\vert z \right\vert \right\vert \leq \left\vert w - z \right\vert\]Cauchy–Schwarz inequality
Update: the inner product is the abstraction behind the Cauchy–Schwarz inequality; I met this inequality again in section 6.15 of another set of notes. Parts of what follows are the outdated original content:
The article 向量分析-Cauchy-Schwarz不等式之本質與意義-林琦焜 (cached) is very well written.
When thinking about inner products, one should not lean on Euclidean space or the law of cosines; instead, abstract away from any particular space and then use the inner product to define angles. Remarkably, more than one notion of angle is possible, and it can even be complex.
Proof 1:
If \(a_1, a_2, \cdots, a_n, \text{ and } b_1, b_2, \cdots, b_n\) are all complex numbers, then
\[\left\vert \sum_{i=1}^{n}{a_i\overline{b}_i} \right\vert^2 \leq \left( \sum_{i=1}^{n}{\lvert a_i \rvert^2} \right) \left( \sum_{i=1}^{n}{\lvert b_i \rvert^2} \right)\]Proof: let \(a, b\) be complex vectors with \(a≠0\), let \(λ\) be a scalar, and set \(c = b - λa\).
\[0 ≤ \lvert c \rvert^2 = c \overline{c} = (b-λa) (\overline{b} - λ \overline{a})\]Viewing this as a quadratic in a real \(λ\) and applying the discriminant condition Δ≤0 gives:
\[( b \overline{a} + a \overline{b} )^2 ≤ 4 \lvert a \rvert ^2 \lvert b \rvert ^2 \\ ( b \overline{a} )^2 + (a \overline{b} )^2 ≤ 2 \lvert a \rvert ^2 \lvert b \rvert ^2\]Noting that \(b \overline{a} = \overline{a \overline{b}}\), we obtain
\[( a \overline{b} )^2 ≤ \lvert a \rvert ^2 \lvert b \rvert ^2\]Proof 2:
By the mean inequality, \(\sqrt{ab} \leq \sqrt { \frac{1}{2}a^2 + \frac{1}{2}b^2 }\); let
\[\widetilde{a}_i = \frac{a_i}{\sqrt{\sum_{i=1}^{n}{a_i^2}}},\,\,\,\, \widetilde{b}_i = \frac{b_i}{\sqrt{\sum_{i=1}^{n}{b_i^2}}}\]then
\[\widetilde{a}_i \widetilde{b}_i \leq \frac{1}{2}\widetilde{a}_i^2 + \frac{1}{2}\widetilde{b}_i^2\]Summing both sides over \(i\),
\[\sum_i { \widetilde{a}_i \widetilde{b}_i } \leq \sum_i { \frac{1}{2}\widetilde{a}_i^2 + \frac{1}{2}\widetilde{b}_i^2 }\]a.k.a.
\[\frac{\sum_{i}^{n}a_ib_i}{\sqrt{\sum_{i=1}^{n}{a_i^2}}\sqrt{\sum_{i=1}^{n}{b_i^2}}} \leq 1\]a.k.a
\[\sum_{i}^{n}a_ib_i \leq \sqrt{\sum_{i=1}^{n}{a_i^2}}\sqrt{\sum_{i=1}^{n}{b_i^2}}\]Once the real case is established, using \(\left\vert z+w \right\vert \leq \left\vert z \right\vert + \left\vert w \right\vert\) one can deduce that it also holds over the complex numbers:
\[\left\vert \sum_{i}^{n}a_ib_i \right\vert \leq \sum_{i}^{n}{\left\vert a_ib_i \right\vert} = \sum_{i}^{n}{ \left\vert a_i \right\vert \left\vert b_i \right\vert }\]Young's inequality
\[ab ≤ \int_0^a{f(x)\operatorname{dx}} + \int_0^b{f^{-1}(y)\operatorname{dy}}\]Taking \(f(x) = x^{p-1}\) with \(p > 1\) gives
\[ab ≤ \frac{a^p}{p} + \frac{b^q}{q} \text{ where } a,b>0,\,\frac{1}{p} + \frac{1}{q}=1\]Equality holds when \(b = a^{p-1}\), i.e., \(a = b^{q-1}\), i.e., \(a^p = b^q\).
Another, more algebraic, proof can be found at https://math.stackexchange.com/a/259837
Hölder's inequality (real version)
Given any real numbers \(a_1, \cdots, a_n,\, b_1, \cdots, b_n\) and \(p>1,\, q>1\) with \(\frac{1}{p} + \frac{1}{q} = 1\), we have:
\[\sum_{i=1}^{n}{a_ib_i} ≤ \left( \sum_{i=1}^{n} \left\vert a_i \right\vert ^p \right) ^ { \frac{1}{p} } \left( \sum_{i=1}^{n} \left\vert b_i \right\vert ^q \right) ^ { \frac{1}{q} }\]Proof: following Proof 2 of the Cauchy–Schwarz inequality, let
\[\widetilde{a}_i = \frac{a_i}{\left( \sum_{i=1}^{n} \left\vert a_i \right\vert ^p \right) ^ { \frac{1}{p} }},\, \widetilde{b}_i = \frac{b_i}{\left( \sum_{i=1}^{n} \left\vert b_i \right\vert ^q \right) ^ { \frac{1}{q} }},\]Then, by Young's inequality, summing over \(i\) and simplifying gives \(\sum{\widetilde{a}_i \widetilde{b}_i} ≤ 1\); since the denominators are positive, the numerator is at most the denominator, which is the claim. (A quick numerical check follows.)
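A quick numerical check of Hölder's inequality; the random vectors and the exponent pair p = 3, q = 3/2 are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=10), rng.normal(size=10)
p, q = 3.0, 1.5                                                    # 1/p + 1/q = 1
lhs = np.sum(a * b)
rhs = np.sum(np.abs(a) ** p) ** (1 / p) * np.sum(np.abs(b) ** q) ** (1 / q)
print(lhs, rhs, lhs <= rhs)                                        # Hölder: lhs <= rhs
```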