CS 7545 Fall 2018 Notes 2
Game Theory
Def: A two player (finite strategy) game is given by a pair of matrices
\[M, N \in \mathbb{R}^{n\times m}\]where
\[M_{i,j} = \text{payoff to player 1 if } p_1 \text{ selects action } i \text{ and } p_2 \text{ selects action } j\](and \(N_{i,j}\) is the corresponding payoff to player 2). Let’s draw \(M\) here
\[M = \begin{bmatrix} m_{1,1} & \cdots & m_{1,m} \\ m_{2,1} & \cdots & m_{2,m} \\ m_{3,1} & \cdots & m_{3,m} \\ m_{4,1} & \cdots & m_{4,m} \\ m_{5,1} & \cdots & m_{5,m} \\ \vdots & \vdots & \vdots \\ m_{n,1} & \cdots & m_{n,m} \\ \end{bmatrix}\]Note: \(\textbf{p}^T M \textbf{q}\) is the expected gain of player 1 if \(p_i\) is the probability of player 1 taking action \(i\) and \(q_j\) is the probability of player 2 taking action \(j\)
Def: A game is zero sum if
\[N = -M\]Def: A Nash equilibrium is a pair \(\widetilde{p} \in \Delta_n, \widetilde{q} \in \Delta_m,\) s.t.
\[\forall p \in \Delta_n, \widetilde{p}^T M \widetilde{q} \geq p^T M \widetilde{q}\] \[\forall q \in \Delta_m, \widetilde{p}^TN\widetilde{q} \geq \widetilde{p}^TNq\]Nash’s theorem: There exists a (possibly non-unique) Nash equilibrium for any two-player game.
Von Neumann’s min-max theorem:
\[∀M \in \mathbb{R}^{n× m}, \min_{p\in \Delta_n} \max_{q\in \Delta_m} p^T M q = \max_{q\in \Delta_m} \min_{p\in \Delta_n} p^T M q\]No regret algorithm
We say that an algorithm \(\mathcal{A}\) is no-regret if for every loss sequence \(\boldsymbol{\ell}^1, \ldots, \boldsymbol{\ell}^T, \ldots \in [0,1]^n\), with \(\textbf{p}^t \in \Delta_n\) chosen as \(\textbf{p}^t \leftarrow \mathcal{A}(\boldsymbol{\ell}^1,\ldots,\boldsymbol{\ell}^{t-1})\),
\[\frac{1}{T} \left( \sum_{t=1}^T{\textbf{p}^t \cdot \boldsymbol{\ell}^t} - \min_{\textbf{p}\in \Delta_n}{\sum_{t=1}^{T}{\textbf{p} \cdot \boldsymbol{\ell}^t}} \right) = \epsilon_T = o(1), \text{ i.e., } \epsilon_T \to 0 \text{ as } T \to \infty\]
Observe:
\[\min_{\textbf{p}\in \Delta_n}{\sum_{t=1}^{T}{\textbf{p} \cdot \boldsymbol{\ell}^t}} = \min_{i=1\ldots n}{\sum_{t=1}^{T}{\textbf{e}_i \cdot \boldsymbol{\ell}^t}}\]since a linear function over the simplex is minimized at a vertex.
Note:
- EWA is a no-regret algorithm with \(\epsilon_T \leq \frac{\log N + \sqrt{2 T \log N}}{T} = \frac{\log N}{T} + \sqrt{\frac{2\log N}{T}}\)
- A no-regret algorithm performs well even in the worst case (e.g., losses chosen adversarially against \(\textbf{p}^t\)); a minimal sketch of EWA follows.
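As a concrete reference, here is a minimal sketch of EWA (exponential weights / Hedge) in Python. The learning rate \(\eta = \sqrt{2\log N / T}\) matches the bound above; the random losses and the horizon are made up purely for illustration.

```python
import numpy as np

def ewa(losses, eta):
    """Exponentially Weighted Average (Hedge) on a T x N array of losses in [0, 1].

    Returns the algorithm's total expected loss and the best expert's total loss.
    """
    T, N = losses.shape
    w = np.ones(N)                        # w^1 = 1
    alg_loss = 0.0
    for t in range(T):
        p = w / w.sum()                   # p^t proportional to w^t
        alg_loss += p @ losses[t]         # expected loss p^t . l^t
        w *= np.exp(-eta * losses[t])     # multiplicative update
    best_loss = losses.sum(axis=0).min()  # best single expert in hindsight
    return alg_loss, best_loss

# toy run: N = 5 experts, T = 1000 rounds of arbitrary losses
rng = np.random.default_rng(0)
T, N = 1000, 5
losses = rng.uniform(size=(T, N))
eta = np.sqrt(2 * np.log(N) / T)
alg, best = ewa(losses, eta)
print((alg - best) / T)                   # average regret eps_T, about sqrt(2 log N / T)
```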
No-Regret Happy theorem
-
Let \(M \in \mathbb{R}^{n\times m}\) and let \(\mathcal{A}\) be a no-regret algorithm.
-
For \(t = 1 \ldots T\),
- \(\textbf{p}^t\) is chosen as \(\mathcal{A}(\boldsymbol{\ell}^1,\ldots,\boldsymbol{\ell}^{t-1})\), where \(\boldsymbol{\ell}^s = M\textbf{q}^s\) for \(s = 1\ldots t-1\)
- \(\textbf{q}^t = \operatorname*{argmax}_{\textbf{q}\in \Delta_m}{\textbf{p}^t \cdot \textbf{M} \textbf{q}}\), i.e., nature responds as adversarially as possible
-
Q1: How happy is q
\[\begin{aligned} \frac{1}{T}\sum_{t=1}^{T}{\textbf{p}^t \cdot \textbf{M} \textbf{q}^t } &= \frac{1}{T} \sum_{t=1}^{T}{\max_{\textbf{q}}\textbf{p}^t\cdot \textbf{M} \textbf{q}} \\ &≥ \frac{1}{T}\max_{\textbf{q}}{\sum_{t=1}^{T}{(\textbf{p}^t \cdot \textbf{M} \textbf{q})}} \\ &= \frac{1}{T}\max_{\textbf{q}}{\sum_{t=1}^{T}{(\textbf{p}^t)}} \cdot \textbf{M} \textbf{q} = \max_{\textbf{q}}{ \bar{\textbf{p}} } \cdot \textbf{M} \textbf{q} \\ &≥ \min_{\textbf{p}}\max_{\textbf{q}} \textbf{p}\cdot \textbf{M} \textbf{q} \end{aligned}\] -
Q2: How happy is p
\[\begin{aligned} \frac{1}{T}\sum_{t=1}^{T}{\textbf{p}^t \cdot \textbf{M} \textbf{q}^t} &= \frac{1}{T}\sum_{t=1}^{T}{\textbf{p}^t \cdot \boldsymbol{\ell}^t} & \\ &= \frac{1}{T}\min_{\textbf{p}}{\sum_{t=1}^{T}{\textbf{p}\cdot \boldsymbol{\ell}^t}} + \epsilon_T & \text{ by definition of no regret} \\ &= \min_{\textbf{p}}{\frac{1}{T} \sum_{t=1}^{T}{\textbf{p} \cdot \textbf{M} \textbf{q}^t}} + \epsilon_T & \\ &= \min_{\textbf{p}}{\textbf{p} \cdot \textbf{M} \bar{\textbf{q}}} + \epsilon_T & \\ &≤ \max_{\textbf{q}} \min_{\textbf{p}} \textbf{p} \cdot \textbf{M} \textbf{q} + \epsilon_T \end{aligned}\] -
To summarize, combining Q1 and Q2,
\[\min_{\textbf{p}}\max_{\textbf{q}} \textbf{p}\cdot \textbf{M} \textbf{q} \leq \frac{1}{T}\sum_{t=1}^{T}{\textbf{p}^t \cdot \textbf{M} \textbf{q}^t} \leq \max_{\textbf{q}} \min_{\textbf{p}} \textbf{p} \cdot \textbf{M} \textbf{q} + \epsilon_T\]Since \(\epsilon_T \to 0\) and \(\min\max \geq \max\min\) always holds, this proves von Neumann’s min-max theorem.
Corollary:
\(\bar{\textbf{p}}\) and \(\overline{\textbf{q}}\) form an \(\epsilon_T\)-approximate Nash equilibrium.
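A minimal sketch of this procedure, assuming a payoff matrix \(M\) with entries in \([0,1]\) so that EWA applies directly to the losses \(\boldsymbol{\ell}^t = M\textbf{q}^t\); the matrix below is random and only for illustration.

```python
import numpy as np

def approx_equilibrium(M, T):
    """Row player runs EWA (minimizing p.Mq); column player best-responds.

    M is assumed to have entries in [0, 1]. Returns the average strategies
    (p_bar, q_bar), an eps_T-approximate equilibrium by the corollary above.
    """
    n, m = M.shape
    eta = np.sqrt(2 * np.log(n) / T)
    w = np.ones(n)
    p_bar, q_bar = np.zeros(n), np.zeros(m)
    for t in range(T):
        p = w / w.sum()                  # p^t from the no-regret algorithm
        j = np.argmax(p @ M)             # q^t: best (pure) response to p^t
        p_bar += p / T
        q_bar[j] += 1.0 / T
        w *= np.exp(-eta * M[:, j])      # loss vector l^t = M q^t (column j)
    return p_bar, q_bar

rng = np.random.default_rng(0)
M = rng.uniform(size=(4, 6))
p_bar, q_bar = approx_equilibrium(M, T=5000)
# the two one-sided values sandwich the game value, within roughly eps_T
print(max(p_bar @ M), min(M @ q_bar))
```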
Boosting
Given \(\textbf{x}_1,\ldots,\textbf{x}_n \in \mathcal{X}\), labels \(y_1,\ldots, y_n \in \{-1,1\}\), and a hypothesis class \(H = \{ h_1,\ldots,h_m \}\) where each \(h : \mathcal{X} \mapsto \{ -1, 1 \}\)
-
Weak Learner Assumption:
\[∀ \textbf{p} \in \Delta_n,\, ∃ h \in H,\,\text{s.t. if } \textbf{x}_i \text{ shows up with probability } p_i,\text{ then }\] \[\operatorname{Pr}\{ h(\textbf{x}_i) \neq y_i \} \leq \frac{1}{2} - \frac{\gamma}{2},\;\; \gamma > 0\]Equivalently: \(∀ \textbf{p} \in \Delta_n,\, ∃ h \in H,\,\text{s.t. } \sum_{i}{p_i\frac{1 - y_ih(\textbf{x}_i) }{2}} \leq \frac{1}{2} - \frac{\gamma}{2}\)
Alternatively:
\[∀ \textbf{p} \in \Delta_n,\, ∃ h \in H,\,\text{s.t. } \gamma \leq \sum_{i}{p_iy_ih(\textbf{x}_i)}\] -
Proof of \(WLA \implies SLA\)
Define \(\textbf{M} \in \{ -1, 1 \}^{n×m}\), \(\textbf{M}_{i,j} = h_j(\textbf{x}_i)y_i\), then
\[\sum_{i}{p_iy_ih_j(\textbf{x}_i)} = \textbf{p} \cdot \textbf{M} \textbf{e}_j\]WLA says that for every \(\textbf{p}\) there is a \(j\) with \(\gamma \leq \textbf{p} \cdot \textbf{M} \textbf{e}_j\), so
\[\gamma \leq \min_{\textbf{p} \in \Delta_n}\max_{j}{\textbf{p} \cdot \textbf{M} \textbf{e}_j} = \min_{\textbf{p} \in \Delta_n}\max_{\textbf{q} \in \Delta_m}{\textbf{p} \cdot \textbf{M} \textbf{q}}\]So, by von Neumann’s min-max theorem,
\[0 < \gamma \leq \max_{\textbf{q} \in \Delta_m}\min_{\textbf{p} \in \Delta_n}{\textbf{p} \cdot \textbf{M} \textbf{q}}\]which is exactly the Strong Learning Assumption:
\[\exists q \in \Delta_m \text{ s.t. } 0 < \min_{\textbf{p} \in \Delta_n}{\textbf{p}^T \textbf{M} q}\] -
Strong Learning Assumption: there exists \(\textbf{q} \in \Delta_m\) s.t. \(∀ i = 1\ldots n\),
\[\sum_{h\in H}{\textbf{q}_h \cdot h(\textbf{x}_i) y_i} > 0\] -
How to find \(\textbf{q}\)
If we use a no-regret algorithm to learn \(\textbf{p}\), the distribution over examples that makes prediction hardest (i.e., tends to minimize \(\textbf{p}\cdot\textbf{Mq}\)), and we choose \(\textbf{q}\) in response to \(\textbf{p}\) to maximize \(\textbf{p}\cdot\textbf{Mq}\), then by the no-regret happy theorem
\[\gamma - \epsilon_T \leq \min_{\textbf{p}}\max_{\textbf{q}} \textbf{p}\cdot \textbf{M} \textbf{q} - \epsilon_T \leq \min_{\textbf{p}}{\textbf{p} \cdot \textbf{M} \overline{\textbf{q}}}\]So, whenever \(\epsilon_T < \gamma\), \(\overline{\textbf{q}}\) is what we need.
-
Boosting by Majority Algorithm:
We use EWA as the no-regret algorithm. (Note: EWA was stated for losses in \([0,1]\), but here \(\textbf{M} \in \{-1,1\}^{n\times m}\); the professor promised this works out. One way to see it: let \(\textbf{M}' = \frac{\textbf{M}+\textbf{1}}{2} \in [0,1]^{n\times m}\); then \(\textbf{p} \cdot \textbf{M}' \textbf{q} = \frac{1}{2} \textbf{p} \cdot \textbf{Mq} + \frac{1}{2}\underbrace{\textbf{p} \cdot \textbf{1q}}_{=1}\), so an optimal \(\textbf{q}\) for \(\textbf{M}'\) is also optimal for \(\textbf{M}\).)
Let \(T > \frac{2\log N}{\gamma^2}\) (chosen large enough that \(\epsilon_T < \gamma\)), \(\textbf{w}^1 = \textbf{1}\). For \(t = 1 \ldots T\), let
\[\begin{aligned} \textbf{p}^t &= \frac{\textbf{w}^t}{\| \textbf{w}^t \|_1} & \\ h_t &= \operatorname*{argmax}_{h\in \mathcal{H}}{\sum_{i=1}^{N}{\textbf{p}^t_ih(\textbf{x}_i)y_i}} & \text{ we should choose } \textbf{q} \text{ to maximize } \textbf{p}\cdot \textbf{Mq}, \\ & &\text{ but a linear function over } \Delta_m \text{ is maximized at a corner,} \\ & &\text{ which is equivalent to choosing the single best } h_t \\ \textbf{w}^{t+1}_i &= \textbf{w}^t_i \exp{ \left( -\eta h_t(\textbf{x}_i)y_i \right) } & \end{aligned}\]Output \(\overline{h_T} = \frac{1}{T}\sum_{t=1}^{T}{h_t}\); the final classifier is the majority vote \(\operatorname{sign}(\overline{h_T}(\textbf{x}))\).
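A minimal sketch of this loop, assuming the margins matrix \(M_{i,j} = h_j(\textbf{x}_i)y_i\) is given explicitly; the value of \(\eta\), the horizon, and the random toy hypotheses below are my own choices for illustration (with purely random hypotheses WLA need not hold, so the training error is only likely, not guaranteed, to shrink).

```python
import numpy as np

def boost_by_majority(M, T, eta):
    """Boosting with EWA weights over examples.

    M: n x m matrix with M[i, j] = h_j(x_i) * y_i in {-1, +1}.
    Returns q, the weights over hypotheses of the output vote (1/T) sum_t h_t.
    """
    n, m = M.shape
    w = np.ones(n)                           # weights over examples
    counts = np.zeros(m)                     # how often each h_j is chosen
    for t in range(T):
        p = w / w.sum()
        j = np.argmax(p @ M)                 # weak learner: best h under p
        counts[j] += 1
        w *= np.exp(-eta * M[:, j])          # up-weight examples h_j gets wrong
    return counts / T

# toy data: each hypothesis is correct on roughly 70% of the examples
rng = np.random.default_rng(0)
n, m = 200, 50
M = rng.choice([-1, 1], p=[0.3, 0.7], size=(n, m))
q = boost_by_majority(M, T=100, eta=0.1)
margins = M @ q                              # y_i times the average vote on x_i
print("training error of the majority vote:", np.mean(margins <= 0))
```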
Online Convex Optimization
-
Settings
Let a set \(\mathcal{K} \subset \mathbb{R}^d\) be convex and compact.
-
For \(t = 1\ldots T\),
- The algorithm selects \(\textbf{x}_t \in \mathcal{K}\)
- Nature selects (and reveals) a convex function \(f_t : \mathcal{K} \mapsto \mathbb{R}\)
Define the regret as \(\left(\sum_{t=1}^{T}{f_t(\textbf{x}_t)} \right) - \min_{\textbf{x}\in \mathcal{K}}{\sum_{t=1}^{T}{f_t(\textbf{x})}}\)
-
Note:
- This is more general than the experts (Hedge) setting
- e.g., setting \(\mathcal{K} = \Delta_n,\, f_t(\textbf{x}) = \boldsymbol{\ell}^t \cdot \textbf{x}\) recovers it
-
-
Online Gradient Descent Algorithm (OGD)
-
Define
\[\operatorname{Proj}_{\mathcal{K}}(\textbf{x}) = \operatorname*{argmin}_{\textbf{y}\in \mathcal{K}}{\|\textbf{y}-\textbf{x}\|_2}\]Note: \(\forall \textbf{z} \in \mathcal{K}, \forall \textbf{y}\):
\[\| \operatorname{Proj}_{\mathcal{K}}(\textbf{y}) - \textbf{z}\|_2 \leq \|\textbf{y}-\textbf{z}\|_2\]
OGD Algorithm
Let \(\textbf{x}_1\) be an arbitrary point of \(\mathcal{K}\), and for \(t = 1 \ldots T\) set
\[\textbf{x}_{t+1} = \operatorname{Proj}_{\mathcal{K}}(\textbf{x}_t-\eta \nabla_t), \text{ where } \nabla_t = \nabla f_t(\textbf{x}_t)\]
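A minimal sketch of projected OGD, assuming for concreteness that \(\mathcal{K}\) is a Euclidean ball and that the losses are the toy quadratics \(f_t(x) = \frac{1}{2}\|x - z_t\|^2\); the bounds \(D\) and \(G\) below are rough values for this setup, not part of the notes.

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the L2 ball of the given radius."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd(grad_fns, x0, eta, project=project_ball):
    """Online gradient descent: x_{t+1} = Proj_K(x_t - eta * grad f_t(x_t))."""
    x = np.array(x0, dtype=float)
    iterates = [x.copy()]
    for grad in grad_fns:
        x = project(x - eta * grad(x))
        iterates.append(x.copy())
    return iterates

# toy losses f_t(x) = ||x - z_t||^2 / 2 with gradient (x - z_t), K = unit ball
rng = np.random.default_rng(0)
T, d = 500, 3
zs = rng.normal(scale=0.3, size=(T, d))
grads = [(lambda z: (lambda x: x - z))(z) for z in zs]
D, G = 2.0, 2.0                               # rough diameter / gradient bounds
eta = D / (G * np.sqrt(T))                    # step size from the theorem below
xs = ogd(grads, np.zeros(d), eta)

f = lambda x, z: 0.5 * np.sum((x - z) ** 2)
regret = sum(f(x, z) for x, z in zip(xs[:-1], zs)) - sum(f(zs.mean(axis=0), z) for z in zs)
print(regret)                                 # well below G * D * sqrt(T)
```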
Theorem
Assume \(\| \nabla f_t(\textbf{x}_t) \| \leq G\) for all \(t\) and \(\|\textbf{x}_1 - \textbf{x}^* \| \leq D\) for all \(\textbf{x}^* \in \mathcal{K}\); then
\[\operatorname{Regret}_T(\text{OGD}) \leq GD\sqrt{T}\] -
Proof
Notice that
\[\begin{aligned} \frac{1}{2} \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 &= \frac{1}{2} \| \operatorname{Proj}_{\mathcal{K}}(\textbf{x}_t - \eta \nabla_t) - \textbf{x}^* \|^2 \\ &\leq \frac{1}{2} \| \textbf{x}_t-\eta \nabla_t - \textbf{x}^* \|^2 \\ &= \frac{1}{2} (\textbf{x}_t - \textbf{x}^* - \eta \nabla_t) \cdot (\textbf{x}_t - \textbf{x}^* - \eta \nabla_t) \\ &= \frac{1}{2} \| \textbf{x}_t - \textbf{x}^* \|^2 + \frac{\eta^2}{2}\| \nabla_t\|^2 - \eta \nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) \end{aligned}\]Rearranging,
\[\eta \nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) \leq \frac{1}{2} \left( \| \textbf{x}_t - \textbf{x}^* \|^2 - \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 \right) + \frac{\eta^2}{2}\| \nabla_t\|^2\]Also notice that if \(f\) is convex, then \(f(\textbf{x}^*) - f(\textbf{x}) \geq \nabla f(\textbf{x}) \cdot (\textbf{x}^* - \textbf{x})\); applying this to \(f_t\),
\[\nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) \geq f_t(\textbf{x}_t) - f_t(\textbf{x}^*)\]So
\[\begin{aligned} \operatorname{Regret}_T(\text{OGD}) &= \sum_{t=1}^{T} { \left( f_t(\textbf{x}_t) - f_t(\textbf{x}^*) \right) } \\ &\leq \sum_{t=1}^{T} {\nabla_t \cdot ( \textbf{x}_t - \textbf{x}^* ) } \\ &\leq \sum_{t=1}^{T} { \left( \frac{1}{2\eta} \left( \| \textbf{x}_t - \textbf{x}^* \|^2 - \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 \right) + \frac{\eta}{2}\| \nabla_t\|^2 \right) } \\ &\leq \sum_{t=1}^{T} { \frac{1}{2\eta} \left( \| \textbf{x}_t - \textbf{x}^* \|^2 - \| \textbf{x}_{t+1} - \textbf{x}^* \|^2 \right) } + \frac{\eta}{2} TG^2 \\ &\leq \frac{1}{2\eta} \left( \underbrace{\| \textbf{x}_1 - \textbf{x}^* \|^2}_{\leq D^2} \underbrace{ - \| \textbf{x}_{T+1} - \textbf{x}^* \|^2 }_{\leq 0} \right) + \frac{\eta}{2} TG^2 \\ &\leq \frac{1}{2\eta} D^2 + \frac{\eta}{2} TG^2 \\ \end{aligned}\]Setting \(\eta = \frac{D}{G\sqrt{T}}\), we get
\[\operatorname{Regret}_T(\text{OGD}) \leq DG\sqrt{T}\]
-
More on OCO
-
Convex optimization to OCO
In this setting, we want to minimize a convex function \(f\) over a convex compact set \(\mathcal{K}\).
We use OCO.
For \(t = 1, \ldots, T\), the algorithm chooses \(x_t\); Nature then reveals \(f_t = f\) (i.e., the same loss function every round).
After \(T\) rounds, output \(\overline{x_T} = \frac{1}{T}\sum_{t=1}^{T}{x_t}\).
Claim:
\[f(\overline{x_T}) - \mathop{\operatorname{min}}_{x \in \mathcal{K}}{f(x)} \leq \frac{1}{T} \operatorname{Regret}_T\] -
Proof: (easy)
\[\begin{aligned} f(\overline{x_T}) &\leq \frac{1}{T} \sum_{t=1}^{T}{f(x_t)} & \text{ by convexity} \\ &= \frac{1}{T}\sum_{t=1}^{T}{f_t(x_t)} & \\ &= \frac{1}{T} \left( \sum_{t=1}^{T}{f_t(x^*) } + \operatorname{Regret}_T \right) & \\ &= \frac{1}{T} \left( \sum_{t=1}^{T}{f(x^*) } + \operatorname{Regret}_T \right) & \\ &= f(x^*) + \frac{1}{T} \operatorname{Regret}_T & \end{aligned}\]
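A minimal sketch of this reduction, using OGD as the OCO algorithm; the non-smooth objective, the unit-ball constraint set, and the step size are my own toy choices.

```python
import numpy as np

def minimize_via_oco(grad_f, project, x0, eta, T):
    """Minimize a fixed convex f by running OGD on f_t = f and averaging iterates."""
    x = np.array(x0, dtype=float)
    x_bar = np.zeros_like(x)
    for _ in range(T):
        x_bar += x / T                     # accumulate the average of x_1..x_T
        x = project(x - eta * grad_f(x))   # one OGD step on f_t = f
    return x_bar

# toy problem: f(x) = ||x - c||_1 (convex, non-smooth) over the unit L2 ball
c = np.array([0.3, -0.2])
grad_f = lambda x: np.sign(x - c)                       # a subgradient of f
project = lambda x: x / max(1.0, np.linalg.norm(x))     # projection onto the ball
T = 10000
x_bar = minimize_via_oco(grad_f, project, np.zeros(2), eta=1.0 / np.sqrt(T), T=T)
print(x_bar)   # close to c; suboptimality is about Regret_T / T = O(1 / sqrt(T))
```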
-
-
Learning in the stochastic setting
Learning in the stochastic setting reduces to OCO.
In the stochastic learning setting, we want to find a parameter from a predefined parameter set that minimizes the expected loss (e.g., find neural-network parameters that minimize classification error).
If (1) the loss function is convex in the parameter and (2) the parameter space is convex, this problem reduces to OCO.
-
Settings:
Let \(X,Y\) be the domain of the data and the set of labels.
Let \((x,y) \sim D\) denote a data point generated from the distribution \(D\) (data points are i.i.d.). Let \(h_θ\) be a hypothesis function that maps from \(X\) to \(Y\), parameterized by \(θ\).
Let \(\mathcal{H}\) be the set of all \(h\), a.k.a, \(\mathcal{H} = \{ h_θ \vert θ \in Θ \}\)
Let \(\ell(h_θ, x, y)\) be the loss if we use \(h_θ\) on point \((x,y)\)
Let \(\ell(h_θ, x, y)\) be convex in \(θ\) (in realistic scenarios, this may not always be true). Define the risk of \(θ\):
\[\mathcal{L}(θ) = \mathop{\operatorname{\mathbb{E}}}_{(x,y)\sim D}{\ell(h_θ, x, y)}\]Note: \(\mathcal{L}(θ)\) is convex, being an expectation of functions that are convex in \(θ\).
We want to find \(\hat{θ}\) from \(T\) data points (drawn i.i.d. from \(D\)) s.t.
\[\mathcal{L}(\hat{θ}) - \mathcal{L}(θ^*) \leq ε, \quad \text{where } θ^* = \mathop{\operatorname{argmin}}_{θ}{\mathcal{L}(θ)}\]
Algorithm:
For \(t = 1,\ldots,T\),
select \(θ_t\) using OCO,
then observe \((x_t, y_t)\) (note: it is important not to observe \((x_t, y_t)\) before choosing \(θ_t\)),
then set the loss function \(f_t(θ) = \ell(h_{θ}, x_t, y_t)\),
then output \(\hat{θ} = \frac{1}{T}\sum_{t=1}^{T}{θ_t}\). Can we say anything about \(\hat{θ}\) itself? No, because it depends heavily on the specific sample \((x_1,y_1),\ldots,(x_T,y_T)\).
We can only say something about the expectation (over the sample) of \(\mathcal{L}(\hat{θ})\).
We want to prove
\[\mathop{\operatorname{\mathbb{E}}}_{(x_1, y_1), \ldots, (x_T, y_T) \sim D}{[\mathcal{L}(\hat{θ})]} - \mathcal{L}(θ^*) \leq \frac{1}{T} \mathop{\operatorname{\mathbb{E}}}_{(x_1, y_1), \ldots, (x_T, y_T) \sim D}{[\text{Regret}_T]}\]It is too long to write \({\displaystyle \mathop{\operatorname{\mathbb{E}}}_{(x_1, y_1), \ldots, (x_T, y_T) \sim D} }\), so let’s use the notation \({\displaystyle \mathop{\operatorname{\mathbb{E}}}_{all}}\)
-
Proof:
\[\begin{aligned} \mathop{\operatorname{\mathbb{E}}}_{all}{\mathcal{L}(\hat{θ})} &= \mathop{\operatorname{\mathbb{E}}}_{all}{\mathcal{L}\left(\frac{1}{T}\sum_{t=1}^{T}{θ_t}\right)} \\ &\leq \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \sum_{t=1}^{T}{\mathcal{L} \left( θ_t \right)}} & \text{ by convexity} \\ &= \frac{1}{T} \sum_{t=1}^{T} \mathop{\operatorname{\mathbb{E}}}_{all}\mathop{\operatorname{\mathbb{E}}}_{(x,y)\sim D}{\ell(h_{θ_t}, x, y)} & \\ &= \frac{1}{T} \sum_{t=1}^{T} \mathop{\operatorname{\mathbb{E}}}_{all}{\ell(h_{θ_t}, x_t, y_t)} & \text{ tricky part: } θ_t \text{ is independent of } (x_t, y_t) \\ &= \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \sum_{t=1}^{T} \ell(h_{θ_t}, x_t, y_t)} & \\ &= \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \left( \mathop{\operatorname{min}}_{θ}{\sum_{t=1}^{T} \ell(h_{θ}, x_t, y_t)} + \mathop{\operatorname{Regret}_T} \right) } & \text{ by definition of regret} \\ &\leq \mathop{\operatorname{\mathbb{E}}}_{all}{\frac{1}{T} \left( \sum_{t=1}^{T} \ell(h_{θ^*}, x_t, y_t) + \mathop{\operatorname{Regret}_T} \right) } & \text{ for any fixed }θ^* \\ &= \mathcal{L}(θ^*) + \frac{1}{T}\mathop{\operatorname{\mathbb{E}}}_{all}{ \mathop{\operatorname{Regret}_T} } & \\ \end{aligned}\]
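A minimal sketch of the reduction with OGD as the OCO algorithm, on a toy least-squares problem; the data distribution, the squared loss, the step size, and the omitted projection are my own simplifications for illustration.

```python
import numpy as np

def online_to_batch(sample, grad_loss, theta0, eta, T, project=lambda th: th):
    """Stochastic learning via OCO: run OGD on the losses of freshly drawn points
    and return the averaged iterate theta_hat = (1/T) sum_t theta_t."""
    theta = np.array(theta0, dtype=float)
    theta_bar = np.zeros_like(theta)
    for t in range(T):
        theta_bar += theta / T
        x, y = sample()                        # observe (x_t, y_t) AFTER choosing theta_t
        theta = project(theta - eta * grad_loss(theta, x, y))
    return theta_bar

# toy example: squared loss l(h_theta, x, y) = (theta . x - y)^2 / 2 (convex in theta)
rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5])
def sample():
    x = rng.normal(size=3)
    return x, x @ true_theta + 0.1 * rng.normal()
def grad_loss(theta, x, y):
    return (theta @ x - y) * x

T = 5000
theta_hat = online_to_batch(sample, grad_loss, np.zeros(3), eta=1.0 / np.sqrt(T), T=T)
print(theta_hat, true_theta)                   # theta_hat approaches the risk minimizer
```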
-
Follow the leader (FTL)
-
Algorithm
\[x_t = \operatorname{arg}\min_{x\in K}{\sum_{s=1}^{t-1}{f_s(x)}}\]Claim:
\[\operatorname{Regret}_T(\text{FTL}) \leq \sum_{t=1}^{T}{\left( f_t(x_t) - f_t(x_{t+1}) \right)}\]Proof: by induction on \(T\).
-
T = 1:
\[\text{Regret}_1(\text{FTL}) = f_1(x_1) - \min_{x\in \mathcal{K}}{f_1(x)} = f_1(x_1) - f_1(x_2)\]since \(x_2 = \operatorname{arg}\min_{x\in \mathcal{K}}{f_1(x)}\).
T > 1
\[\begin{aligned} \text{Regret}_T(\text{FTL}) &= \sum_{t=1}^{T}{f_t(x_t)} - \sum_{t=1}^{T}{f_t(x_{T+1})} & x_{T+1} \text{ minimizes } \textstyle\sum_{t=1}^{T}f_t \\ &= \sum_{t=1}^{T}{\left( f_t(x_t) - f_t(x_{T+1}) \right) } \\ &= \sum_{t=1}^{T-1}{ \left( f_t(x_t) - f_t(x_{T+1}) \right) } + f_T(x_T) - f_T(x_{T+1}) \\ &\leq \sum_{t=1}^{T-1}{ \left( f_t(x_t) - f_t(x_{T}) \right) } + f_T(x_T) - f_T(x_{T+1}) & x_T \text{ minimizes } \textstyle\sum_{t=1}^{T-1}f_t \\ &= \mathop{\operatorname{Regret}_{T-1}} + f_T(x_T) - f_T(x_{T+1}) \\ &\leq \sum_{t=1}^{T-1}{ \left( f_t(x_t) - f_t(x_{t+1}) \right) } + f_T(x_T) - f_T(x_{T+1}) & \text{ by the inductive hypothesis} \\ &= \sum_{t=1}^{T}{ \left( f_t(x_t) - f_t(x_{t+1}) \right) } \end{aligned}\]
-
-
FTL example
Data $ z_t $ are revealed one by one; predict the mean $ \mu $ under squared loss. See the scribed lecture for details; a small sketch follows.
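A minimal sketch of this example as I understand it, assuming the loss \(f_t(x) = \frac{1}{2}(x - z_t)^2\): FTL's choice \(x_t\) is then just the running mean of \(z_1,\ldots,z_{t-1}\), and the regret stays small.

```python
import numpy as np

def ftl_squared_loss(zs):
    """FTL with f_t(x) = (x - z_t)^2 / 2: x_t is the mean of z_1, ..., z_{t-1}."""
    preds, cum = [], 0.0
    for t, z in enumerate(zs):
        preds.append(cum / t if t > 0 else 0.0)   # x_1 is arbitrary (here 0)
        cum += z
    return np.array(preds)

zs = np.random.default_rng(0).normal(loc=1.0, size=1000)
preds = ftl_squared_loss(zs)
regret = 0.5 * np.sum((preds - zs) ** 2) - 0.5 * np.sum((zs.mean() - zs) ** 2)
print(regret)   # small relative to T (the squared loss is strongly convex)
```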
-
Linear loss is harder:
Let \(\widetilde{f}_t(x) = \nabla f_t(x_t) \cdot (x-x_t) + f_t(x_t)\), the linearization of \(f_t\) at \(x_t\).
Note that:
- \(\widetilde{f}_t(x_t) = f_t(x_t)\).
- \[\widetilde{f}_t(u) \leq f_t(u) \text{ for all } u \text{ (by convexity)}\]
- \(\sum_t{ \left( f_t(x_t) - f_t(u) \right)} \leq \sum_t { \left( \widetilde{f}_t(x_t) - \widetilde{f}_t(u) \right) }\).
Hence the regret against the linearized (linear) losses upper-bounds the true regret: linear losses are the hardest case.
-
Bad performance (linear regret) of FTL on linear losses
Example:
\(\mathcal{K} = [-1, 1]\), with loss functions
\[f_t(x) = \begin{cases} \frac{1}{2}x & \text{ when } t=1 \\ -x & t = 2,4,6,\ldots \\ x & t = 3,5,7,\ldots \end{cases}\]FTL then alternates between \(x_t = -1\) and \(x_t = +1\) and suffers loss \(1\) in every round after the first, while the fixed point \(x = 0\) suffers \(0\): the regret is linear in \(T\). A quick numerical check follows.
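A minimal sketch checking this, with \(x_1\) fixed arbitrarily to \(1\) (any choice works):

```python
def ftl_linear_example(T):
    """FTL on K = [-1, 1] with f_1(x) = x/2, then f_t(x) = -x, x, -x, x, ..."""
    coeffs = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]
    cum, total_loss = 0.0, 0.0
    for g in coeffs:
        x = -1.0 if cum > 0 else 1.0          # argmin over [-1,1] of cum * x (ties -> 1)
        total_loss += g * x
        cum += g
    best_fixed = min(sum(coeffs) * x for x in (-1.0, 1.0))
    return total_loss, best_fixed

loss, best = ftl_linear_example(T=1000)
print(loss - best)   # roughly T: linear regret
```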
Follow the Regularized leader (FTRL)
-
Algorithm
\[x_{t+1} = \mathop{\operatorname{argmin}}_{x\in \mathcal{K}}{\left( \eta \sum_{s=1}^{t}{f_s(x)} + R(x) \right)}\]Assume each \(f_s\) is linear (the hardest case, by the argument above): \(f_s(x) = g_s \cdot x\).
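A minimal sketch, assuming the quadratic regularizer \(R(x) = \frac{1}{2}x^2\) on \(\mathcal{K} = [-1,1]\), where the argmin has the closed form \(x_{t+1} = \operatorname{clip}(-\eta \sum_{s \leq t} g_s, -1, 1)\); it is run on the same adversarial sequence that breaks FTL above (the choice of \(\eta\) is mine).

```python
import numpy as np

def ftrl_quadratic_1d(gs, eta):
    """FTRL with R(x) = x^2 / 2 on K = [-1, 1] for linear losses f_t(x) = g_t * x."""
    cum, total_loss = 0.0, 0.0
    for g in gs:
        x = float(np.clip(-eta * cum, -1.0, 1.0))   # closed-form argmin
        total_loss += g * x
        cum += g
    best_fixed = min(sum(gs) * x for x in (-1.0, 1.0))
    return total_loss - best_fixed

T = 1000
gs = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]
print(ftrl_quadratic_1d(gs, eta=1.0 / np.sqrt(T)))  # O(sqrt(T)) regret, versus ~T for FTL
```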
-
Lemma: \(\eta g_t\cdot(x_t-u) = D_R(u, x_t) - D_R(u, x_{t+1}) + D_R(x_t, x_{t+1})\), where \(D_R\) denotes the Bregman divergence of \(R\).
-
Proof: TODO
-
Other lemmas: TODO.
-
Bandit Setting
-
Expert Setting: full-information feedback
We know the entire loss vector, i.e., we know what would have happened if we had chosen another \(x_t\).
-
Bandit Setting: feedback limited to chosen action
Protocol:
We have \(n\) actions (called arms). For \(t = 1 \ldots T\), the algorithm selects \(i_t\); Nature reveals only \(\ell_{i_t}^{t}\), from an unobserved loss vector \(\ell^{t} \in [0,1]^n\).
- Adversarial sub-setting: \(\ell^t\) chosen arbitrarily, but in advance (an oblivious adversary).
- Stochastic sub-setting: \(\ell^t_i \sim \mathcal{D}_i\), i.i.d. over time.
-
EXP3 algorithm
For adversarial settings.
(Note this is different from the original paper)
-
Algorithm
\(N\): number of actions (also called arms); \(T\): horizon; losses \(\ell^t \in [0,1]^N\). Initialize \(w^1_i = 1\) for all \(i\). For \(t = 1 \ldots T\): set \(p^t_i = w^t_i / \sum_{j}{w^t_j}\); sample an arm \(I_t \sim \boldsymbol{p}^t\) and observe only \(\ell^t_{I_t}\); form the importance-weighted estimate \(\hat{\ell}^t_i = \ell^t_i / p^t_i\) if \(i = I_t\) and \(\hat{\ell}^t_i = 0\) otherwise; update \(w^{t+1}_i = w^t_i \exp(-\eta \hat{\ell}^t_i)\). (A sketch in code follows.)
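A minimal sketch of this version of EXP3 (which, as noted, differs from the original paper's); the adversarial losses below are just random numbers with one arm made better, and \(\eta = \sqrt{2\log N /(NT)}\) balances the two terms of the bound below.

```python
import numpy as np

def exp3(loss_matrix, eta, rng):
    """EXP3 on a T x N array of losses in [0, 1] (adversarial, fixed in advance).

    Only the loss of the pulled arm is observed; the importance-weighted
    estimate hat_l_i = (l_i / p_i) * 1[i pulled] keeps the update unbiased.
    """
    T, N = loss_matrix.shape
    w = np.ones(N)
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()
        i = rng.choice(N, p=p)                    # pull arm i ~ p^t
        total_loss += loss_matrix[t, i]           # only this entry is observed
        l_hat = np.zeros(N)
        l_hat[i] = loss_matrix[t, i] / p[i]       # unbiased estimate of l^t
        w *= np.exp(-eta * l_hat)
    best_arm_loss = loss_matrix.sum(axis=0).min()
    return total_loss - best_arm_loss

rng = np.random.default_rng(0)
T, N = 20000, 10
losses = rng.uniform(size=(T, N))
losses[:, 3] *= 0.5                               # make arm 3 clearly better
eta = np.sqrt(2 * np.log(N) / (N * T))            # balances the two terms of the bound
print(exp3(losses, eta, rng) / T)                 # small average regret
```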
-
Theorem: \(\mathop{\operatorname{\mathbb{E}}}{[\sum_{t=1}^{T}{\ell_{I_t}^t - \ell_{i^*}^t}]} \leq \frac{\log N}{\eta} + \frac{\eta}{2}TN\), where \(i^*\) is any fixed arm (in particular the best one).
Cook up a potential:
\[\Phi_t = - \frac{1}{\eta} \log \left( \sum_{i=1}^{N}{w^t_i} \right)\]Note: in the following derivation we stand at time \(t\). \(\boldsymbol{w}^t\) is fixed, hence \(\Phi_{t}\) is fixed; \(\boldsymbol{\ell}^t\) is unseen, so \(\Phi_{t+1}\), the \(w_i^{t+1}\), and the estimates \(\hat{\ell}_i^t\) are random variables (they depend on the arm \(I_t\) we pull).
\[\begin{aligned} \Phi_{t+1} - \Phi_{t} &= - \frac{1}{\eta} \log \left( \frac{\sum_{i=1}^{N}{w_i^{t+1}}}{\sum_{i=1}^{N}{w_i^t}} \right) & \\ &= - \frac{1}{\eta} \log \left( \frac{\sum_{i=1}^{N}{w_i^t \exp(-\eta \hat{\ell}_i^t)}}{\sum_{i=1}^{N}{w_i^t}} \right) & \\ &= - \frac{1}{\eta} \log \left( \sum_{i=1}^{N}{\left(\frac{w_i^t}{\sum_{j=1}^{N}{w_j^t}}\right) \exp(-\eta \hat{\ell}_i^t)} \right) & \\ &= - \frac{1}{\eta} \log \left( \sum_{i=1}^{N}{p^t_i \exp(-\eta \hat{\ell}_i^t)} \right) & \\ &= - \frac{1}{\eta} \log \left( \mathop{\operatorname{\mathbb{E}}}{[\exp(-\eta \hat{\ell}^t)]} \right) & \\ &\geq - \frac{1}{\eta} \log \left( \mathop{\operatorname{\mathbb{E}}}{[1 -\eta \hat{\ell}^t + \frac{1}{2} (\eta \hat{\ell}^t)^2] } \right) & e^{-x} \leq 1 - x + \frac{1}{2}x^2 \text{ when } x\geq 0 \\ &= - \frac{1}{\eta} \log \left( 1 - \operatorname{\mathbb{E}}{[ \eta \hat{\ell}^t - \frac{1}{2} (\eta \hat{\ell}^t)^2] } \right) & \\ &\geq \frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}{[ \eta \hat{\ell}^t - \frac{1}{2} (\eta \hat{\ell}^t)^2] } & \log(1-x)\leq -x \\ &= \mathop{\operatorname{\mathbb{E}}}{[ \hat{\ell}^t - \eta \frac{1}{2} ( \hat{\ell}^t)^2] } & \\ &= \sum_{i=1}^{N}{p_i^t} \hat{\ell}_i^t - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} ( \hat{\ell}_i^t)^2 & \\ \end{aligned}\](here the \(\operatorname{\mathbb{E}}\) inside the chain denotes averaging over an index \(i \sim \boldsymbol{p}^t\)). The whole reason we set up \(\hat{\ell}_i^t\) as an unbiased estimate is so that we can now take the expectation over the arm pulled at time \(t\):
\[\begin{aligned} \mathop{\operatorname{\mathbb{E}}}_{I_t \sim \boldsymbol{p}^t}{[\Phi_{t+1} - \Phi_{t} \vert I_{1}, I_{2}, \ldots, I_{t-1}]} &\geq \mathop{\operatorname{\mathbb{E}}} [\sum_{i=1}^{N}{p_i^t} \hat{\ell}_i^t - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} ( \hat{\ell}_i^t)^2] \\ &= \sum_{i=1}^{N}{p_i^t} \mathop{\operatorname{\mathbb{E}}}[\hat{\ell}_i^t] - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} \mathop{\operatorname{\mathbb{E}}}[(\hat{\ell}_i^t)^2] \\ &= \sum_{i=1}^{N}{p_i^t} \ell_i^t - \eta \frac{1}{2} \sum_{i=1}^{N}{p_i^t} \frac{ (\ell_i^t)^2 }{p_i^t} \\ &= \boldsymbol{p}^t \cdot \boldsymbol{\ell}^t - \eta \frac{1}{2} \sum_{i=1}^{N}{ (\ell_i^t)^2 } \\ &\geq \boldsymbol{p}^t \cdot \boldsymbol{\ell}^t - \eta \frac{1}{2} N \\ \end{aligned}\]Now fix all the true losses up to time \(T\), that is \(\ell^1, \ldots, \ell^T\); the EXP3 algorithm then generates a random sequence of actions \(i_1, \ldots, i_T\).
Once the true losses are fixed, each action sequence occurs with a specific probability (think of it as a tree). So
\[\begin{aligned} \mathop{\operatorname{\mathbb{E}}}{[\Phi_{T+1} - \Phi_1]} &= \sum_{t=1}^{T}{\mathop{\operatorname{\mathbb{E}}}{\left[ \mathop{\operatorname{\mathbb{E}}}_{I_t \sim \boldsymbol{p}^t}{[\Phi_{t+1} - \Phi_{t} \vert I_{1}, \ldots, I_{t-1}]} \right]}} & \text{ telescoping and the tower property} \\ &\geq \sum_{t=1}^{T}{\mathop{\operatorname{\mathbb{E}}}{\left[ \boldsymbol{p}^t \cdot \boldsymbol{\ell}^t - \eta \frac{1}{2} N \right]}} & \text{ by the bound above} \\ &= - \eta \frac{NT}{2} + \mathop{\operatorname{\mathbb{E}}}{\left[\sum_{t=1}^{T}{\ell^t_{I_t}}\right]} & \mathop{\operatorname{\mathbb{E}}}[\ell^t_{I_t} \vert I_1,\ldots,I_{t-1}] = \boldsymbol{p}^t \cdot \boldsymbol{\ell}^t \\ \end{aligned}\]where all expectations are over the random action sequence. Moreover, we have
\[\mathop{\operatorname{\mathbb{E}}}{ \left[ \Phi_{T+1} - \Phi_1 \right] } \leq \sum_{t=1}^{T}{\ell_i^t} + \frac{\log N}{\eta} \,\,\,\, \text{ for all } i = 1 \ldots N\]Why: the second term is \(-\Phi_1 = \frac{1}{\eta}\log N\) (since \(w^1_i = 1\) for all \(i\)). The first term is
\[\begin{aligned} \mathop{\operatorname{\mathbb{E}}}{[\Phi_{T+1}]} &= \mathop{\operatorname{\mathbb{E}}}{\left[-\frac{1}{\eta} \log { \left( \sum_{i=1}^{N}{w_i^{T+1}} \right) }\right]} & \text{ the random variables are the } w_i^{T+1} \\ &\leq -\frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}{\left[ \log { \left( {w_{i^*}^{T+1}} \right) }\right]} & \text{ drop every term except } i^* \\ &= -\frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}{\left[ \log { \left( {w_{i^*}^{T} \exp (- \eta \hat{\ell}^T_{i^*})} \right) }\right]} & \\ &= -\frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}{\left[ (- \eta \hat{\ell}^T_{i^*}) + \log { \left( {w_{i^*}^{T} } \right) }\right]} & \\ &= {\ell}^T_{i^*} - \frac{1}{\eta} \mathop{\operatorname{\mathbb{E}}}{\left[ \log { \left( {w_{i^*}^{T} } \right) }\right]} & \mathop{\operatorname{\mathbb{E}}}[\hat{\ell}^T_{i^*}] = \ell^T_{i^*} \\ &= \cdots \text{ (recursing on } T-1, T-2, \ldots \text{)} \\ &= \sum_{t=1}^{T}{ {\ell}^t_{i^*}} \end{aligned}\]To combine,
\[\mathop{\operatorname{\mathbb{E}}}{ \left[ \mathop{\operatorname{Regret}_T} \right] } = \mathop{\operatorname{\mathbb{E}}}{\left[\sum_{t=1}^{T}{\ell^t_{I_t}}\right]} - \sum_{t=1}^{T}{\ell_{i^*}^t} \leq \eta \frac{NT}{2} + \frac{\log N}{\eta}\]
-
-
UCB1 algorithm
For stochastic settings.
-
Settings:
- There are $K$ different distributions governing the rewards of the $K$ actions (arms).
- Each distribution does not change over time.
- At time $t$, the algorithm selects arm $i_t$; the reward $X^{(t)}_{i_t}$ is then revealed, sampled i.i.d. from distribution $\mathcal{D}_{i_t}$.
The best arm is the $i^*$ that maximizes the expected reward $ {\displaystyle u_i = u(i) = \mathop{\operatorname{\mathbb{E}}}_{X \sim \mathcal{D}_i}{[X]} } $, i.e., the best arm in expectation.
-
Define regret
\[\mathop{\operatorname{Regret}}_T = Tu(i^*) - \sum_{t=1}^{T}{X^{(t)}_{i_t}}\]which is the reward we lost by not following the best policy.
-
Example: what goes wrong if we greedily take the action with the highest average reward so far.
- Assume two actions.
- Action 1 has reward of 1 with probability 0.3 and otherwise has reward of 0.
- Action 2 has reward of 1 with probability 0.7 and otherwise has reward of 0.
- Play action 1, get reward of 1.
- Play action 2, get reward of 0.
-
Now the average reward of action 1 will never drop to 0 (it already received a reward of 1), so we will never play action 2.
- This illustrates a classic problem, which is the defining characteristic of decision making:
- the trade-off between exploring and exploiting.
- Exploring means to try new actions to learn their effects.
- Exploiting means to try what we know has worked in the past.
- The algorithm above does not explore sufficiently.
-
UCB1 intuition
The idea is like this:
Without loss of generality, suppose that arm #1 is the best arm.
As we explore, by Hoeffding’s inequality, with very high probability each estimated mean reward is close to the true expected reward.
Suppose we choose a tuning parameter $ε$. At time $t$, suppose the algorithm chooses arm $i_t ≠ 1$; consider the following events:
- $ S^{(t)}_1: \hat{u_1} < u_1 - ε $, i.e., the estimate $\hat{u_1}$ of the best arm is much too low
- $ L^{(t)}_{i_t}: \hat{u_{i_t}} > u_{i_t} + ε $, i.e., the estimate $\hat{u_{i_t}}$ is much too high
These events become less and less likely as we play. So we want to bound two things:
- the number of times $S_1$ or $L_{i_t}$ occurs;
- the total regret incurred when these events do not occur.
There is one technical issue: when applying Hoeffding’s inequality, the number of samples $N_j$ changes over time (see below).
Suppose at time $t$ the algorithm chooses $i_t$, and arm $i_t$ has been chosen $ N^{(t)}_{i_t}$ times so far; by Hoeffding’s inequality:
\[\mathop{\operatorname{Pr}}(\hat{u_j} < u_j - ε ) < e^{-2N_jε^2} \\ \mathop{\operatorname{Pr}}(\hat{u_j} > u_j + ε ) < e^{-2N_jε^2}\]I have not worked out what happens if we try to optimize $ε$ directly; instead, a change of variables is used.
-
Algorithm (slightly different from the original)
Let $\delta = e^{-2N_jε^2}$, i.e., $ε = \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}}$. Then
\[\mathop{\operatorname{Pr}}(S_j: \hat{u_j} < u_j - \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}} ) < \delta \\ \mathop{\operatorname{Pr}}(L_j: \hat{u_j} > u_j + \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}} ) < \delta\]That is, with this choice of $ε$ the events $S_j$ (severe underestimate) and $L_j$ (severe overestimate) each occur with probability less than $\delta$.
The algorithm: play each arm once; then, at each time $ t $ from $ 1 $ to $ T $, play the arm with the largest optimistic estimate $\hat{u_j} + \sqrt{\frac{\log \frac{1}{\delta}}{2N_j}}$ (its "upper confidence bound"). To bound the regret, consider each round $t$:
-
First consider a round where the event $ S^{(t)}_1 ∨ L^{(t)}_{i_t}$ occurs.
This means we heavily underestimate the best arm $u_1$ or heavily overestimate arm $i_t$. In such a round we suffer regret at most 1 (rewards lie in $[0,1]$), and each of the two events has probability less than $\delta$, so the total expected regret from such rounds is at most $2T\delta$.
-
Now consider a round where the event does not occur.
Which arm do we pick?
Suppose the algorithm chooses $i_t \neq 1$ at time $t$; then
\[u_1 < \hat{u_1} + \sqrt{\frac{\log \frac{1}{\delta}}{2N_1}} < \hat{u_{i_t}} + \sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}} < u_{i_t} + 2\sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}}\]-
Explain:
- The first inequality means $S^{(t)}_1$ doesn’t occur.
- The second is by our algorithm
- The third inequality means $L^{(t)}_{i_t}$ doesn’t occur.
Now we have
\[Δ_{i_t} = u_1 - u_{i_t} < 2\sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}}\]This is nice: we always pick the arm with the highest optimistic value, but as $N_{i_t}$ grows, the confidence interval of arm $i_t$ shrinks, so a suboptimal arm stops being chosen after a bounded number of pulls.
\[Δ_{i_t} = u_1 - u_{i_t} < 2\sqrt{\frac{\log \frac{1}{\delta}}{2N_{i_t}}} \\ \implies \left( \frac{Δ_{i_t}}{2} \right)^2 < \frac{\log \frac{1}{\delta}}{2N_{i_t}} \\ \implies N_{i_t} < \frac{2\log \frac{1}{\delta}}{Δ_{i_t}^2} \\\]So the maximum total regret incurred by pulls of a suboptimal arm $k$ (in rounds where the events do not occur) is at most:
\[\sum_{i=1}^{\lfloor {2\log \frac{1}{\delta}} / {Δ_k^2} \rfloor}{2\sqrt{\frac{\log \frac{1}{\delta}}{2}} \sqrt{\frac{1}{i}}}\]Using calculus we know that
\[\sum_{i=1}^{n} \sqrt{\frac{1}{i}} \leq 1 + \int_{1}^{n}{\sqrt{\frac{1}{x}} \mathop{\operatorname{dx}}} = 1 + 2\sqrt{n}\]So,
\[\sum_{i=1}^{\lfloor {2\log \frac{1}{\delta}} / {Δ_k^2} \rfloor}{2\sqrt{\frac{\log \frac{1}{\delta}}{2}} \sqrt{\frac{1}{i}}} < 2\sqrt{\frac{\log \frac{1}{\delta}}{2}}{\left(1+2\sqrt{\lfloor {2\log \frac{1}{\delta}} / {Δ_k^2} \rfloor}\right)} \approx 4 \log(1/\delta) / Δ_k\] -
So, the total regret expected is
\[\mathop{\operatorname{\mathbb{E}}}[\mathop{\operatorname{Regret}_T}] \lessapprox K + 2Tδ + \sum_{k=2}^{K}{4\log(1/δ)/Δ_k}\]Setting $δ = 1/T$, we get something like $\sum_{k=2}^{K}{O(\log(T))/Δ_k}$
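A minimal sketch of the algorithm as described here (play each arm once, then always play the arm with the largest upper confidence bound), on hypothetical Bernoulli arms; the means, the horizon, and $δ = 1/T$ are illustrative choices.

```python
import numpy as np

def ucb1(means, T, delta, rng):
    """UCB1-style algorithm for stochastic Bernoulli arms (rewards in {0, 1})."""
    K = len(means)
    counts = np.zeros(K)              # N_j: number of pulls of arm j
    sums = np.zeros(K)                # total reward collected from arm j
    for t in range(T):
        if t < K:
            j = t                     # initialization: pull each arm once
        else:                         # largest optimistic estimate (UCB)
            ucb = sums / counts + np.sqrt(np.log(1.0 / delta) / (2 * counts))
            j = int(np.argmax(ucb))
        counts[j] += 1
        sums[j] += float(rng.random() < means[j])   # Bernoulli reward
    # pseudo-regret: expected reward lost relative to always playing the best arm
    return T * max(means) - counts @ np.array(means)

rng = np.random.default_rng(0)
T = 20000
print(ucb1([0.3, 0.7, 0.5], T, delta=1.0 / T, rng=rng))  # grows like sum_k log(T) / Delta_k
```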
The instructor said that the factor $1/Δ_k$ is unavoidable, by a matching lower bound.
-
-