Bayesian human-in-the-loop optimization

# Bayesian<br />human-in-the-loop<br />optimization
## ELEC-E7851<br />Computational User Interface Design
### Tomi Peltola<br /><a href="mailto:tomi.peltola@aalto.fi" class="email">tomi.peltola@aalto.fi</a>, <a href="http://www.tmpl.fi" class="uri">http://www.tmpl.fi</a><br />
<div style="font-size: 15px; margin: 0px;">
(source code for the slides: <a href="http://www.tmpl.fi/bayesian-optimization/slides.Rmd" class="uri">http://www.tmpl.fi/bayesian-optimization/slides.Rmd</a>)
</div>
### 2018/11/16 (updated: 2018-11-16)

---

# Learning Goals

After this lecture and the related exercises, you should

1. understand the basics of Bayesian optimization,
 2. be able to assess when it can be a useful tool and when not,
 3. know how to use the GPyOpt Python package for your own applications.

---

# Outline

1. Introduction
 2. Bernoulli bandit
 3. Bayesian optimization for continuous domains
 4. Surrogate models: Gaussian processes
 5. Sequential experimental design: Acquisition functions
 6. Human-in-the-loop application: animation design
 7. Summary

`$$\newcommand{\E}{\mathbb{E}}$$`
`$$\DeclareMathOperator*{\argmax}{argmax}$$`
`$$\DeclareMathOperator*{\argmin}{argmin}$$`

---

# Motivation - Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

---

# Motivation - Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

**How would you solve this?**

---

# Motivation - Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

**How would you solve this?**

* Hand-tuning / trial and error,
 * random search,
 * grid search,
 * gradient descend,
 * evolutationary algorithms,
 * different types of programming (linear, integer, etc.),
 * ...

Whichever method you learned in your optimization course that is suitable for the application.

But what if...

---

# Motivation - A/B testing

`$f$`

* can only be evaluated implicitly,
 * with a lot of noise.
 
For example, optimize for click-through rate, retention time, or purchases, implicitly measuring user satisfaction, interest, or revenue of different version of a web site.
 
<div class="center">
<img src="figs/aalto.png" style="width: 99%" />
</div>

---

# Motivation - Animation design

`$f$`

* is a black-box (e.g., contains human judgement),
 * with noisy and expensive/slow evaluations,
 * and multi-modal.

For example, optimizing parameters for procedural animation generation.

<div style="font-size: 10px;"><a href="https://dl.acm.org/citation.cfm?id=1921443">[image: Brochu et al., 2010]</a></div>

---

# Motivation - Wearable devices

`$f$`

* is based on an implicit signal from human,
 * with noisy and expensive/slow evaluations.

For example, optimizing step rate to minimize metabolic cost.

<div style="font-size: 10px;"><a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0184054">[image: Kim et al., 2017]</a></div>

---

# Motivation - Simulators and cognitive models

`$f$`

* contains a parametric simulator,
 * with noisy and expensive/slow evaluations.

For example, fitting parameters of cognitive simulators to experimental data.

<div style="font-size: 10px;"><a href="https://dl.acm.org/citation.cfm?doid=3025453.3025576">[image: Kangasrääsiö et al., 2017]</a></div>

---

# Motivation - Other applications

* Recommender systems
 * Sensor networks
 * AutoML: automatic tuning of machine learning models
 * Robotics and reinforcement learning

---
# Motivation

What if

* `$f$` is a black-box that we can only evaluate point-wise, 
 * `$f$` can be multi-modal,
 * `$f$` is slow or expensive to evaluate,
 * evaluations of `$f$` are noisy,
 * `$f$` has no gradients available (can be used if available).

Natural in many HCI applications.

---

# Motivation

What if

In principle, we can always do, for example, grid search: scales `$O(M^D)$` to fill the domain at constant grid of size `$M$` per axis.

---

# Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

Want to find the minimum with small number of evaluations of `$f$`.

**How to cope with these?**

---

# Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

Want to find the minimum with small number of evaluations of `$f$`.

1. Construct a tractable **statistical surrogate model** of `$f$`.
 2. Turn the optimization problem into **a sequence of easier problems**.

---

# Bernoulli bandit

Task: A/B testing to find which of two versions of a web ad is better.

* ''Experiment'': Show one of the two version for a visitor.
 * ''Observation'': Did the visitor click the ad.

Want to find out which ad is better, while gathering as much clicks as possible.
 
Minimize *regret* `$R$` for `$T$` experiments:
`$$R = T \E[y^*] - \sum_{t=1}^T y_t$$`
 * `$\E[y^*]$` is the expected click rate for the better ad,
 * `$y_t \in \{0,1\}$` is whether visitor `$t$`, who was shown version `$x_t \in \{A, B\}$`, clicked the ad.

---

# Bernoulli bandit - surrogate model

* Model click rates of `$A$` and `$B$` independently. Equations for A below.
 
**Bayes theorem**: `$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)}{p(\mathcal{D})}  p(\theta)$`

updates **prior** knowledge `$p(\theta)$` with **observations** `$\mathcal{D}$` to **posterior** knowledge `$p(\theta \mid \mathcal{D})$`.

Observation model:

`$$\Pr(y_t \mid x_t = A) = \textrm{Bernoulli}(y_t \mid \rho_A) = \rho_A^{y_t} (1 - \rho_A)^{1-y_t}.$$`

Prior model:

`$$p(\rho_A) = \textrm{Beta}(\rho_A \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \rho_A^{\alpha-1} (1 - \rho_A)^{\beta-1}.$$`

Given a dataset of `$t$` observations `$\mathcal{D}_t = \{(x_1, y_1), \ldots, (x_t, y_t)\}$`, the posterior distribution is
`$$p(\rho_A \mid \mathcal{D}_t) = \frac{p(\rho_A) \prod_{t: x_t = A} \Pr(y_t \mid x_t = A)}{\int_0^1 p(\rho_A) \prod_{t: x_t = A} \Pr(y_t \mid x_t = A) d\rho_A} = \textrm{Beta}(\rho_B \mid \alpha + n^{A}_1, \beta + n^{A}_0),$$`
where `$n^{A}_1$` and `$n^{A}_0$` are the total numbers of `$y_t = 1$` and `$y_t = 0$` for `$x_t = A$`.

---

# Bernoulli bandit - surrogate model
 
<img src="slides_files/figure-html/bayes1-1.png" style="display: block; margin: auto;" />
---

# Bernoulli bandit - surrogate model

<img src="slides_files/figure-html/bayes2-1.png" style="display: block; margin: auto;" />
---

# Bernoulli bandit - surrogate model

---

# Bernoulli bandit - surrogate model

---

# Bernoulli bandit - Thompson sampling

When `$t+1$`th visitor comes, which ad, `$A$` or `$B$`, to serve?

* Want to exploit: gather as much clicks as possible.
 * Need to explore: learn about click rates for `$A$` and `$B$`.
 
Thompson sampling is a simple algorithm navigating this trade-off:

1. Sample a value for `$\hat{\rho}_A$` and for `$\hat{\rho}_B$` from `$p(\rho_A \mid \mathcal{D}_t)$` and `$p(\rho_B \mid \mathcal{D}_t)$`.
 2. Show `$A$` if `$\hat{\rho}_A > \hat{\rho}_B$` and `$B$` otherwise.
 
--

We then observe whether the visitor clicked the ad or not and update our posterior distributions and continue to next iteration.

**How does Thompson sampling navigate exploration-exploitation trade-off?**

---

# Bernoulli bandit - simulation

```python
import numpy as np; np.random.seed(123)

rho_A_true = 0.1; rho_B_true = 0.2 # simulated visitor click rates
alpha = 1; beta = 1 # prior parameters
n_A1 = 0; n_A0 = 0; n_B1 = 0; n_B0 = 0 # numbers of clicks/no-clicks
T = 1000 # number of iterations
clicks = np.zeros(T); A_or_B = np.zeros(T)

for t in range(T):
    # Thompson sampling
    rho_A = np.random.beta(alpha + n_A1, beta + n_A0)
    rho_B = np.random.beta(alpha + n_B1, beta + n_B0)
    if rho_A > rho_B: # which ad to show
        y_t = np.random.binomial(1, rho_A_true) # simul. click
        n_A1 += y_t; n_A0 += 1 - y_t; # update posterior of A
    else:
        y_t = np.random.binomial(1, rho_B_true) # simul. click
        n_B1 += y_t; n_B0 += 1 - y_t; # update posterior of B
    # collect statistics
    clicks[t] = y_t; A_or_B[t] = rho_A > rho_B
```

---

# Bernoulli bandit - simulation

---

# Questions so far?

---

# Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

Want to find the minimum with small number of evaluations of `$f$`.

1. Construct a tractable statistical surrogate model of `$f$`.
 2. Turn the optimization problem into a sequence of easier problems.
 
---

# Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

Want to find the minimum with small number of evaluations of `$f$`.

1. **Construct a tractable statistical surrogate model of `$f$`.**
 2. Turn the optimization problem into a sequence of easier problems.
 
---

# Construct a tractable statistical surrogate model of `$f$`

Let `$g(x)$` be our surrogate model of `$f$`.

* `$g$` should be able to capture important aspects of `$f$` from small number of evaluations.
 * Need to be able to update `$g$` when we acquire new evaluations of `$f$`: `$g$` should get better and better as a model of `$f$`.
 * `$g$` should be fast to evaluate and to update.
 * `$g$` needs to cope with noise.
 * Need to be able to quantify uncertainty in `$g$` (navigating exploration-exploitation tradeoff).

---

# Gaussian processes

* Gaussian processes provide a probability distribution over functions.
 * Extends (and uses properties of) the multivariate normal distribution `$\Rightarrow$` computationally tractable.
 * Prior information about the type or behaviour of the modelled function can be included in the *covariance function* and its parameters.

* Alternatives: random forests, Bayesian neural networks.

---

# Gaussian processes - 1D example

![:gpappstart]

---

# Formulation

Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

Want to find the minimum with small number of evaluations of `$f$`.

1. Construct a tractable statistical surrogate model of `$f$`.
 2. **Turn the optimization problem into a sequence of easier problems.**
 
---

# Turn the optimization problem into a sequence of easier problems

Consider having evaluated `$f$` at points `$x_1, \ldots, x_{t-1}$` and having constructed `$p(g \mid \mathcal{D}_{t-1})$`:

How to choose the next `$x$` to evaluate `$f$` at?

**What would you do?**

---

# Turn the optimization problem into a sequence of easier problems

Consider having evaluated `$f$` at points `$x_1, \ldots, x_{t-1}$` and having constructed `$p(g \mid \mathcal{D}_{t-1})$`:

**Guided exploration** using `$p(g \mid \mathcal{D}_t)$`:

* Trade off exploration (reducing uncertainty) and exploitation (sampling near likely places of optima).
 * Formulated by maximizing an acquisition function `$\alpha(x; \mathcal{D}_{t-1})$`.

---

# Turn the optimization problem into a sequence of easier problems

How to choose the next `$x$` to evaluate `$f$` at?

**Guided exploration** using `$p(g \mid \mathcal{D}_{t-1})$`:

* Trade off exploration (reducing uncertainty) and exploitation (sampling near likely places of optima).
 * Formulated by maximizing an acquisition function `$\alpha(x; \mathcal{D}_{t-1})$`.
 
`$$x_{t} = \argmax_x \alpha(x; \mathcal{D}_{t-1})$$`

**Thompson sampling** acquisition function:

* `$\alpha(x; \mathcal{D}_{t-1}) = \hat{g}(x)$`, where `$\hat{g}(x)$` is a sample from `$p(g \mid \mathcal{D}_{t-1})$`.

**Expected improvement** acquisition function:

* Currently best value `$y^* = \max_{s \in \{1,\ldots,t-1\}} y_s$`.
 * Improvement function provides utility of `$x$` given `$g$`: `$I(x, g) = (g(x) - y^*) I(g(x) > y^*)$`.
 * Expected improvement: `$\alpha_{EI}(x; \mathcal{D}_t) = \E_g[I(x, g)]$`.

Many others also exists.

---

# Bayesian Optimization Recipe

**Goal**: Find the minimum of a function `$f(x)$` within some bounded domain `$\mathcal{X} \subset \mathbb{R}^D$`:

`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`

**Key ideas**

1. Construct a tractable statistical surrogate model of `$f$`.
 2. Turn the optimization problem into a sequence of easier problems.

**Algorithm**

1. Initialize dataset `$\mathcal{D}_0$`, surrogate model `$p(g \mid \mathcal{D}_0)$`; choose acquisition function `$\alpha(\cdot)$`.
 2. Loop for `$t = 1,2,\ldots,T$`:
      1. Select next evaluation point: `$x_{t} = \argmax \alpha(x; \mathcal{D}_{t-1})$`.
      2. Evaluate `$f(x_{t})$` to obtain `$y_{t}$`.
      3. Update dataset `$\mathcal{D}_{t} = \{\mathcal{D}_{t-1},(x_t, y_t)\}$`.
      4. Update surrogate model `$p(g \mid \mathcal{D}_t)$`.
 3. Report the found optimum.
 
---

# Questions so far?

---

# GPyOpt - Python package

```python
import GPyOpt
import numpy as np; np.random.seed(12345)
from matplotlib import pyplot as plt

def f_u(x):
    return 0.2 * (x - 0.3)**2 - 0.4 * np.sin(15.0 * x)

plt.figure(); xx = np.linspace(0, 1, 101)
```

```python
plt.plot(xx, f_u(xx), 'k-'); plt.show()
```

---

# GPyOpt

```python
bounds = [{'name': 'var_1', 'type': 'continuous', 'domain': (0,1)}]
myBopt = GPyOpt.methods.BayesianOptimization(
    f=f_u, domain=bounds,        # Function and domain                 
    acquisition_type='EI',       # Expected improvement
    exact_feval=True,            # Noiseless function evaluations
    eps=1e-6,
    normalize_Y=False,           # (for clearer visualization)
    initial_design_numdata=2)    # (for clearer visualization)
```

---

# GPyOpt