06 November 2019
Prof. Varun Kanade
Notes taken by Miroslav Gasparek
(Stochastic bandits) Action set: $A = \{1, ..., k \}$. In each round $t$, the algorithm picks an arm $a_t \in A$ and observes a reward drawn from that arm's distribution, with mean $\mu(a_t)$.
Suppose we play for $T$ rounds. The benchmark is "pick the action with maximum expected value": $\mu^{*}$ is the expected reward of the optimal arm, so the optimal policy gets $T\mu^{*}$ in expectation.
$Regret(Alg) = T\mu^{*} - \mathbb{E}[Reward(Alg)]$
Explore-then-Exploit algorithm (pseudocode):
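The pseudocode itself is not reproduced in these notes; a minimal sketch of the standard version, where `pull(a)` is a hypothetical function returning a reward in $[0,1]$ for arm `a`:

```python
def explore_then_exploit(pull, k, T, N):
    """Explore-then-Exploit sketch (assumes N * k <= T).
    pull(a) is a hypothetical function returning a reward in [0, 1]."""
    counts = [0] * k
    sums = [0.0] * k
    t = 0
    # Exploration phase: try every arm N times.
    for a in range(k):
        for _ in range(N):
            sums[a] += pull(a)
            counts[a] += 1
            t += 1
    # Exploitation phase: commit to the empirically best arm for the rest.
    best = max(range(k), key=lambda a: sums[a] / counts[a])
    total = sum(sums)
    for _ in range(t, T):
        total += pull(best)
    return total
```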
Hoeffding's Inequality:
Let $X_1, \dots, X_n$ be i.i.d. random variables with $X_i \in [0, 1]$ and $\mathbb{E}[X_i] = \mu$;
then $\mathbb{P}(\mid \frac{1}{n}\sum_{i} X_i - \mu \mid > t) \leq 2e^{-2nt^2}$
Question: Are these assumptions stronger than those of the central limit theorem?
Fix some arm $a$ and let $\epsilon_{a}$ be the bad event that its empirical mean (over $N$ samples) is far from its true mean. By Hoeffding, \begin{equation} \mathbb{P}\left( \left| \hat{\mu}(a) - \mu(a) \right| \geq \sqrt{\frac{2 \log T}{N}} \right) \leq 2e^{-2N \frac{2 \log T}{N}} = \frac{2}{T^{4}} \end{equation} so $\mathbb{P}(\epsilon_a) \leq \frac{2}{T^4}$.
The "good event" is that every arm's empirical mean lies in its confidence interval: $\mathbb{P}(\cap_{a} \epsilon^c_{a}) = \mathbb{P}((\cup_a \epsilon_a)^c) = 1 - \mathbb{P}(\cup_a \epsilon_a) \geq 1 - \sum_a \mathbb{P}(\epsilon_a) \geq 1 - \frac{2k}{T^4} \geq 1 - \frac{2}{T^3}$ (assuming $k \leq T$).
Let $a^*$ be an arm with optimal expected reward, and let $a$ be the arm picked by the algorithm after the exploration phase; suppose the good event occurs.
then $\mu(a) + \sqrt{\frac{2 \log T}{N}} \geq \hat{\mu}(a) \geq \hat{\mu}(a^*) \geq \mu^* - \sqrt{\frac{2 \log T}{N}}$
Then
$Regret \leq Nk + 2T \sqrt{\frac{2 \log T}{N}}$
where $Nk$ bounds the regret of the explore phase and the second term bounds the exploit phase. How should we pick $N$?
$N = \left(\frac{2T \sqrt{2\log T}}{k} \right)^{2/3}$ (balancing the two terms)
$Regret = \mathcal{O}(k^{1/3}T^{2/3} (\log T)^{1/3})$
The regret grows sublinearly in $T$, which means we are actually learning; if it grew linearly with $T$, we would not be learning anything.
"Best arm identification" (for the fixed time)...
$\epsilon$-greedy:
For each round $t = 1, \dots, T$: with probability $\epsilon_t$ explore (pick an arm uniformly at random), otherwise exploit (pick the arm with the highest empirical mean so far).
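A minimal sketch of one round of this scheme, assuming an exploration probability $\epsilon_t \approx (k/t)^{1/3}$ (the kind of decaying schedule behind the $T^{2/3}$-type bound; exact constants and log factors are omitted), with `pull(a)` hypothetical as before:

```python
import random

def epsilon_greedy(pull, k, T):
    """epsilon-greedy sketch: with probability eps_t pick a uniformly random arm,
    otherwise play the arm with the best empirical mean so far."""
    counts = [0] * k
    means = [0.0] * k
    for t in range(1, T + 1):
        eps_t = min(1.0, (k / t) ** (1.0 / 3.0))       # decaying exploration rate
        if random.random() < eps_t or 0 in counts:
            a = random.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: means[i])  # exploit
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]         # running average
    return means
```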
What were the problems with this algorithm? Perhaps that we do not exploit the good arms more often once we have identified them.
Successive Elimination: Initially all arms are active
At each phase: play every active arm once; then deactivate any arm whose upper confidence bound falls below the lower confidence bound of some other active arm.
The bounds come from Hoeffding's inequality.
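A minimal sketch of the phase structure, using the Hoeffding radius $\sqrt{2\log T / n(a)}$; `pull(a)` is hypothetical as before:

```python
import math

def successive_elimination(pull, k, T):
    """Successive Elimination sketch: all arms start active; each phase plays
    every active arm once, then drops arms whose UCB falls below the best LCB."""
    active = list(range(k))
    counts, means = [0] * k, [0.0] * k
    t = 0
    while t + len(active) <= T:
        for a in active:                       # one pull per active arm
            r = pull(a)
            counts[a] += 1
            means[a] += (r - means[a]) / counts[a]
            t += 1
        rad = {a: math.sqrt(2 * math.log(T) / counts[a]) for a in active}
        best_lcb = max(means[a] - rad[a] for a in active)
        active = [a for a in active if means[a] + rad[a] >= best_lcb]
    best = max(active, key=lambda a: means[a])
    for _ in range(t, T):                      # spend leftover rounds on the
        pull(best)                             # empirically best surviving arm
```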
UCB Algorithm
How suboptimal is it to pick arm $a$? Define the gap $\Delta(a) = \mu^* - \mu(a)$,
where $a^*$ is an optimal arm (so $\Delta(a^*) = 0$).
Suppose the arm picked at time $t$, $a_t$, is suboptimal. Assume the good event occurs, i.e. every empirical mean always lies within its confidence interval. Then the following must hold.
Then: \begin{equation} \mu(a_t) + 2r_t(a_t) \geq \hat{\mu}(a_t) + r_t(a_t) \geq \mu(a^*) \end{equation}
The quantity $\hat{\mu}(a_t) + r_t(a_t)$ is $UCB_t(a_t)$; the second inequality holds because $a_t$ maximises the UCB and, under the good event, $UCB_t(a^*) \geq \mu(a^*)$. Here $r_t(a) = \sqrt{\frac{2\log T}{n_t(a)}}$ is the confidence radius and $n_t(a)$ is the number of pulls of arm $a$ so far.
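In code, the selection rule simply maximises this index; a minimal sketch (after one initial pass over the arms), with `pull(a)` hypothetical as before:

```python
import math

def ucb1(pull, k, T):
    """UCB1 sketch: play each arm once, then always pick the arm maximising
    the index mean(a) + sqrt(2 * log(T) / n(a))."""
    counts = [0] * k
    means = [0.0] * k
    for t in range(T):
        if t < k:
            a = t                                # initial pass over the arms
        else:
            a = max(range(k), key=lambda i:
                    means[i] + math.sqrt(2 * math.log(T) / counts[i]))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
```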
Also, \begin{equation} \Delta(a_t) = \mu(a^*) - \mu(a_t) \leq 2 \sqrt{\frac{2 \log T}{n_t(a_t)}} \end{equation}
\begin{align} Regret &= \sum^T_{t=1} \Delta(a_t) \leq 2\sqrt{2\log T} \sum^{k}_{a=1} \sum^{n_T(a)}_{s=1}\sqrt{\frac{1}{s}} \\ & \leq 4\sqrt{2 \log T} \sum_{a=1}^k\sqrt{n_T(a)} \\ & \leq 4\sqrt{2 \log T}\, k \sqrt{\frac{T}{k}} = \mathcal{O}(\sqrt{Tk \log T}) \end{align}
NB: the second step uses $\sum_{s=1}^{n}\frac{1}{\sqrt{s}} \leq 2\sqrt{n}$; the last step is Jensen's inequality for the concave function $\sqrt{\cdot}$, together with $\sum_a n_T(a) = T$.
For a suboptimal arm $a$, the same analysis also gives
\begin{align} n_T(a) & \leq c \frac{\log T}{(\Delta(a))^2} \\ Regret &= \sum_{a \ \text{suboptimal}}n_T(a) \Delta(a) \leq c\sum_a \Delta(a) \frac{\log T}{\Delta(a)^2} = c \log T \sum_{a \ \text{suboptimal}} \frac{1}{\Delta(a)} \end{align}
Summary of regret bounds:
Explore then Exploit: $Regret = \mathcal{O}(T^{2/3}(k \log T)^{1/3})$
$\epsilon$-greedy: $Regret = \mathcal{O}(T^{2/3}(k \log T)^{1/3})$
Successive elimination: $Regret = \mathcal{O}(\sqrt{Tk \log T})$ and $\mathcal{O}(\log T \sum_a \frac{1}{\Delta(a)})$
UCB1: $Regret = \mathcal{O}(\sqrt{Tk \log T})$ and $\mathcal{O}(\log T \sum_a \frac{1}{\Delta(a)})$
KL divergence: let $p, q$ be probability distributions over $\Omega$.
\begin{align} KL(p \| q) &= \sum_{x \in \Omega} p(x) \ln \left( \frac{p(x)}{q(x)} \right) = \mathbb{E}_{x \sim p} \left[\ln \left( \frac{p(x)}{q(x)} \right) \right] \end{align}
Properties:
i. $KL(p \| q) \geq 0$ (equality if $p=q$)
ii. Chain rule: if $\Omega = \Omega_1 \times \dots \times \Omega_n$ and $p = p_1 \times \dots \times p_n$, $q = q_1 \times \dots \times q_n$ are product distributions, then
$KL(p \| q) = \sum_i KL(p_i \| q_i)$
iii. (Pinsker) $2(p(A) - q(A))^2 \leq KL(p\|q)$ for every event $A \subseteq \Omega$, i.e.
$\forall A: \ \mid p(A) - q(A) \mid \leq \sqrt{\frac{1}{2}KL(p\|q)}$
iv. For the biased coin $RC_\epsilon$, taken here to be the distribution on $\{0,1\}$ with mean $\frac{1+\epsilon}{2}$ (so $RC_0$ is a fair coin): for $0 < \epsilon < \frac{1}{12}$,
$KL(RC_{\epsilon} \| RC_0) \leq 2\epsilon^2$ and $KL(RC_0 \| RC_\epsilon) \leq \epsilon^2$
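A quick check of the first bound in (iv), under the convention assumed above that $RC_\epsilon$ has mean $\frac{1+\epsilon}{2}$, using $\ln(1+x) \leq x$:
\begin{align} KL(RC_\epsilon \| RC_0) &= \frac{1+\epsilon}{2}\ln(1+\epsilon) + \frac{1-\epsilon}{2}\ln(1-\epsilon) \\ &\leq \frac{1+\epsilon}{2}\,\epsilon + \frac{1-\epsilon}{2}\,(-\epsilon) = \epsilon^2 \leq 2\epsilon^2 \end{align}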
Lower bound: fix $T, k$; for any bandit algorithm there exists a problem instance on which $\mathbb{E}[Regret] \geq c\sqrt{kT}$ (for some absolute constant $c$).
For arm $a$, define the instance $I_a$ by \begin{align} I_a: \quad \mu_a = \frac{1}{2} + \frac{\epsilon}{2}, \qquad \mu_i = \frac{1}{2} \ \text{for } i \neq a, \qquad \epsilon = \sqrt{\frac{k}{T}} \end{align}
The algorithm predicts an arm after $T$ rounds; we would like $\mathbb{P}[\text{prediction after } T \text{ rounds is correct} \mid I_a] \geq 0.99$.
Lemma: For "bandits with prediction", when $T\leq \frac{ck}{\epsilon^2}$ for some constant $c$, any deterministic algortihm has the following property:
there exist at least $k/3$ arms $a$ such that $\mathbb{P}[\text{prediction is } a \mid I_a] \leq 3/4$
Corollary: If the instance is picked uniformly at random, then $\mathbb{P}[\text{prediction is incorrect}] \geq \frac{1}{12}$
Proof:
Let $\epsilon = \sqrt{\frac{ck}{T}}$
Fix any round $t \leq T$
$\mathbb{P}[\text{the arm } a_t \text{ played at time } t \text{ is not the best arm}] \geq 1/12$
$\Delta(a_t) = \mu^* - \mu(a_t)$
$\mathbb{E}[\Delta(a_t)] \geq \epsilon/24$
$\mathbb{E}[Regret] = \sum_{t} \mathbb{E}[\Delta(a_t)] \geq \frac{T\epsilon}{24} \geq \hat{c}\sqrt{kT}$
Case $k = 2$, with instances $I_1, I_2$:
$\Omega$ is a $2 \times T$ grid of rewards (one row per arm, one column per round), $\Omega = \{ 0,1 \}^{2T}$
$A:$ "Algorithm outputs arm 1"
$P_1(A) \geq 3/4, P_2(A) \leq 1/4$ Then
$KL(P_1 \| P_2) = \sum^{2}_{a=1} \sum^{T}_{t=1} KL(P_1^{a,t} \| P_2^{a,t}) \leq 4\epsilon^2 T$ (chain rule plus property (iv), applied per arm and per round)
but then $\mid P_1(A) - P_2(A) \mid \leq \sqrt{\frac{1}{2}KL(P_1 \| P_2)} \leq \epsilon \sqrt{2T}$; with $\epsilon = \frac{1}{4\sqrt{T}}$ this is at most $\frac{1}{2\sqrt{2}} < \frac{1}{2}$, contradicting $P_1(A) - P_2(A) \geq \frac{1}{2}$.
Next...
Algorithm:
$K \subseteq \mathbb{R}^n, x_t \in K$
$Loss(Algorithm) = \sum_t f_t(x_t)$
$Regret = \sum_t f_t (x_t) - \underset{x \in K}{min} \sum_t f_t(x)$
Environment:
$f_t:K \rightarrow [0,1]$
You have $n$ "experts" (e.g. each telling you which stock to pick), $\Delta_n = \{ x \geq 0 \mid \sum_i x_i = 1 \}$
Algorithm: pick $x_t \in \Delta_n$ (a probability distribution over the $n$ experts)
Each expert has a loss $l_{t,i} \in [0,1]$
The algorithm's loss is then $\sum_i x_{t,i}l_{t,i}$; in this (full-information) setting the algorithm observes the whole vector $l_t$.
The regret is the loss of the algorithm minus the loss of the best single expert, $Regret = \sum_t l_t \cdot x_t - \underset{i}{\min} \sum_{t=1}^{T} l_{t,i}$. How can you minimise this?
One natural but wrong strategy, at time $t$:
$i \in \arg\min_i \sum_{s=1}^{t-1} l_{s,i}$
$x_{t,i} = 1$ and $x_{t,j} = 0$ for $j \neq i$
This is called the Follow the Leader algorithm, and it does not work: an adversary can make the leader change every round, forcing the algorithm to incur loss almost every round while some fixed expert does well.
The correct choice is... at time $t$:
$x_t = \mathrm{softmax}\left(-\eta \sum_{s=1}^{t-1} l_{s}\right)$, i.e.
$x_{t,i} \propto \exp\left(- \eta \sum_{s=1}^{t-1} l_{s,i}\right)$, also called Follow the Regularized Leader (with an entropy regularizer), Hedge, Multiplicative Weights Update, or Mirror Descent.
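A minimal sketch of this update in the full-information setting (losses in $[0,1]^n$; `losses` is a hypothetical list of per-round loss vectors):

```python
import math

def hedge(losses, n, eta):
    """Hedge / Multiplicative Weights sketch: keep one weight per expert,
    play the normalised weights, shrink each weight by exp(-eta * loss)."""
    w = [1.0] * n
    total_loss = 0.0
    for l in losses:                       # l = loss vector for this round
        Z = sum(w)
        x = [wi / Z for wi in w]           # distribution over experts
        total_loss += sum(xi * li for xi, li in zip(x, l))
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, l)]
    return total_loss
```
With $\eta = \sqrt{\log n / T}$ this matches the regret bound proved next.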
Proof:
Let the weight on expert $i$ at time $t=1$ be $w_{1,i} = 1$. Then we update the weights as
$w_{t+1,i} = w_{t,i} \exp(-\eta l_{t,i})$
$Z_{t+1} = \sum_{i}w_{t+1,i}$
\begin{align} \frac{Z_{t+1}}{Z_t} &= \sum_i \frac{w_{t,i}}{Z_t} \exp(-\eta l_{t,i}) \\ &\leq \sum_i x_{t,i} \left(1+(e^{-\eta}-1)l_{t,i}\right) \\ &= 1 + (e^{-\eta}-1)(x_t \cdot l_t) \\ &\leq \exp\left((e^{-\eta}-1)\, x_{t} \cdot l_t\right) \\ \prod_{t=1}^{T} \frac{Z_{t+1}}{Z_{t}} &\leq \exp \left( (e^{-\eta}-1)\, loss(Alg) \right) \end{align}
(the second line uses $e^{-\eta l} \leq 1 + (e^{-\eta}-1)l$ for $l \in [0,1]$, the fourth uses $1+z \leq e^z$). Then ($i^*$ is the best expert; note $Z_1 = n$):
\begin{align} \frac{Z_{T+1}}{n} \geq \frac{w_{T+1, i^*}}{n} &= \frac{\exp(-\eta \, loss(i^{*}))}{n} \\ - \eta \, loss(i^*) - \log n &\leq (e^{-\eta}-1)\, loss(Alg) \\ loss(Alg) - loss(i^*) &\leq \frac{e^{-\eta}-(1-\eta)}{\eta}\, loss(Alg) + \frac{\log n}{\eta} \\ &\leq \frac{\eta}{2}\, T + \frac{\log n}{\eta} \\ &\leq 2\sqrt{T \log n} \quad \text{for } \eta = \sqrt{\tfrac{\log n}{T}} \end{align}
Recap for $n$ experts:
Observe $l_t \in [0,1]^n$ and incur loss $l_t \cdot x_t$; $Regret = \sum_t l_t \cdot x_t - \underset{i}{\min} \sum_t l_{t,i}$
MWUA: $Regret = \mathcal{O}(\sqrt{T \log n})$
Adaboost (an instance of online learning with experts):
Weak learning guarantee:
Let us have a set of data points $x_1, ..., x_n$ and classifiers $c_1, c_2, ..., c_T$, where $c_t$ is the classifier produced in round $t$ and \begin{align} l_{t,i} &= \begin{cases} 1, \quad \text{if $c_t$ is correct on $x_i$} \\ 0, \quad \text{otherwise} \end{cases} \end{align}
In round $t$ we choose a distribution $p_t$ on $\{ x_1, ..., x_n\}$; the weak learning guarantee says $l_t \cdot p_t \geq \frac{1}{2} + \gamma$, i.e. the returned classifier has weighted accuracy at least $\frac{1}{2} + \gamma$.
$\sum_{t=1}^T l_t p_t \geq \left( \frac{1}{2} + \gamma \right)T$
$\sum_{t=1}^{T} l_t \cdot p_t - \sum_{t=1}^{T} l_{t,i} \leq 2\sqrt{T \log n} \quad \forall i$ (the MWUA regret bound, with the examples playing the role of experts)
$\left( \frac{1}{2}+\gamma \right)T - \sum_{t=1}^T l_{t,i} \leq 2\sqrt{T \log n}$
$\sum_{t=1}^T l_{t,i} \geq \frac{T}{2} + \gamma T - 2\sqrt{T \log n} > \frac{T}{2}$ once $T > \frac{4 \log n}{\gamma^2}$
So $MAJORITY(c_1, c_2,..., c_T)$ classifies all $n$ examples correctly: every example is classified correctly by more than half of the $c_t$.
Example of a weak learner: with $k$ features in the data, threshold a single feature (a "decision stump") to classify as 0 or 1, and check whether this does better than random guessing.
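Putting the pieces together, a minimal sketch of the boosting loop from this section, assuming a hypothetical `weak_learner(p)` that returns a $\{0,1\}$-valued classifier with weighted accuracy at least $\frac{1}{2}+\gamma$ under the distribution `p` (this is the experts-style view from the notes, not the exact AdaBoost weight schedule):

```python
import math

def boost(weak_learner, xs, ys, T, eta):
    """Boosting-via-MWUA sketch: run Hedge over the n examples; the 'loss' of
    example i in round t is 1 if c_t classifies it correctly, so weight piles
    up on the examples the weak classifiers keep getting wrong."""
    n = len(xs)
    w = [1.0] * n
    classifiers = []
    for _ in range(T):
        Z = sum(w)
        p = [wi / Z for wi in w]           # distribution over examples
        c = weak_learner(p)                # weighted accuracy >= 1/2 + gamma
        classifiers.append(c)
        correct = [1.0 if c(x) == y else 0.0 for x, y in zip(xs, ys)]
        w = [wi * math.exp(-eta * ci) for wi, ci in zip(w, correct)]

    def majority(x):                       # final hypothesis: majority vote
        votes = sum(c(x) for c in classifiers)
        return 1 if votes > T / 2 else 0
    return majority
```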
Algorithm:
Linear programming problem:
Feasibility: does there exist $x$ with $Ax \leq b$?
Relaxed goal: find $x$ with $Ax \leq b + \delta$.
Let's go through this algorithm: start from some $x_1 \in K$;
For $t = 1,\dots,T$: play $x_t$, observe $f_t$, and update $x_{t+1} = \Pi_K\left(x_t - \eta \nabla f_t(x_t)\right)$, i.e. take a gradient step and project back onto $K$.
Then we can expand the quantity $\lVert x_{t+1} - x^* \rVert^2$...
Then $\lVert x_t - x^* \rVert^2 - \lVert x_{t+1} - x^* \rVert^2 = -\eta^2 \lVert \nabla f_t(x_t) \rVert^2 + 2\eta \langle \nabla f_t(x_t), x_t - x^* \rangle$
The 2-norm of the loss vector $l_t \in [0,1]^n$ is bounded ($\lVert l_t \rVert_2 \leq \sqrt{n}$), which is what the analysis needs.
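A minimal sketch of the update just described, assuming hypothetical helpers `grad(t, x)` returning $\nabla f_t(x)$ and `project(x)` projecting onto $K$:

```python
def online_gradient_descent(grad, project, x1, T, eta):
    """Online gradient descent sketch: play x_t, observe the gradient of f_t
    at x_t, step against it, and project back onto the feasible set K."""
    x = list(x1)
    plays = []
    for t in range(1, T + 1):
        plays.append(x)
        g = grad(t, x)                                       # nabla f_t(x_t)
        x = project([xi - eta * gi for xi, gi in zip(x, g)])
    return plays
```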
EXP3 algorithm (bandit feedback). Initialize: $w_{1,i} = 1$
for $t=1,...,T$:
$p_{t,i} = (1-\gamma) \frac{w_{t,i}}{Z_t} + \frac{\gamma}{k}$
$a_t \sim p_t$, receive loss $l_{t,a_t}$
For $j = 1, ..., k$:
Form the fake (importance-weighted) loss vector $\hat{l}_{t}$, so that \begin{align} \hat{l}_{t,j} &= \begin{cases} \frac{l_{t,a_t}}{p_{t,a_t}}, \quad \text{if } j=a_t \\ 0, \quad \text{otherwise} \end{cases} \end{align} (dividing by $p_{t,a_t}$ makes $\hat{l}_{t}$ an unbiased estimate of $l_t$)
$w_{t+1,j} = w_{t,j} \exp(-\frac{\gamma}{k} \hat{l}_{t,j})$
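A minimal sketch of the algorithm above (EXP3-style), with the importance-weighted loss estimate; `observe_loss(a)` is a hypothetical function returning the loss of the pulled arm in $[0,1]$:

```python
import math
import random

def exp3(observe_loss, k, T, gamma):
    """EXP3-style sketch: mix exponential weights with uniform exploration and
    feed back an importance-weighted estimate of the single observed loss."""
    w = [1.0] * k
    for t in range(T):
        Z = sum(w)
        p = [(1 - gamma) * wi / Z + gamma / k for wi in w]
        a = random.choices(range(k), weights=p)[0]   # sample an arm
        loss = observe_loss(a)                        # only this loss is seen
        l_hat = loss / p[a]                           # unbiased estimate
        w[a] *= math.exp(-(gamma / k) * l_hat)
    return w
```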
(Proof follows)
Problem: we have a graph where we start at $s$ and end at $t$; each edge has a weight, which changes at each time step. How do we find the optimal (shortest) path?
(Can use Dijkstra's algorithm)
We know the graph, but we only see the weights of all the edges at the end of each trial. How do we solve this?
Objective: Minimize the regret:
$Regret = \sum_t cost_t(\text{path}_t) - \underset{\text{$s$-$t$ paths } P}{\min} \sum_t cost_t(P)$, the total cost of the paths we chose minus the cost of the best fixed $s$-$t$ path in hindsight.
Environment: $f_t:K \rightarrow \mathbb{R}$
Loss of player at time $t$: $f_t(x_t)$
Algorithms
(Ex.: FTL works if $K$ is convex and the $f_t$ are strongly convex)
Lemma: "Be the leader" has negative regret
The term $f_0(x^*) - f_0(x_1)$ depends only on the regularizer, while $\sum_{t=1}^T (f_t(x_t)-f_t(x_{t+1}))$ depends on the stability of the algorithm.
(In FTRL, at time $0$ we minimize only the regularizer, i.e. $x_1 = \underset{x \in K}{\arg\min}\, R(x)$.)
Lemma: $\sum_{t=0}^T f_t(x_{t+1}) \leq \underset{x \in K}{\min} \sum_{t=0}^T f_t(x)$, where $f_0 = R$
Remark: Whenever working on simplex, entropy is a good regularizer
Experts Problem: $K = \Delta_n = \{ x \in \mathbb{R}^{n}_+ \mid \sum_{i}x_{i} = 1 \}$, $R(x) = -\frac{1}{\eta}H(x)$, losses $l_t \in \mathbb{R}_+^n$
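With this regularizer the FTRL step has a closed form (a standard Lagrange-multiplier computation), and it recovers exactly the exponential-weights update used earlier:
\begin{align} x_t &= \underset{x \in \Delta_n}{\arg\min} \left( \sum_{s=1}^{t-1} l_s \cdot x - \frac{1}{\eta}H(x) \right) \\ \Rightarrow \quad x_{t,i} &\propto \exp\left(-\eta \sum_{s=1}^{t-1} l_{s,i}\right) \end{align}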
The Shortest Path Problem:
And this is the end, my friends!