### 2018/11/16 (updated: 2018-11-16)
---
# Learning Goals
After this lecture and the related exercises, you should
1. understand the basics of Bayesian optimization,
2. be able to assess when it can be a useful tool and when not,
3. know how to use the GPyOpt Python package for your own applications.
---
# Outline
1. Introduction
2. Bernoulli bandit
3. Bayesian optimization for continuous domains
4. Surrogate models: Gaussian processes
5. Sequential experimental design: Acquisition functions
6. Human-in-the-loop application: animation design
7. Summary
`$$\newcommand{\E}{\mathbb{E}}$$`
`$$\DeclareMathOperator*{\argmax}{argmax}$$`
`$$\DeclareMathOperator*{\argmin}{argmin}$$`
---
# Motivation - Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
---
# Motivation - Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
**How would you solve this?**
---
# Motivation - Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
**How would you solve this?**
* Hand-tuning / trial and error,
* random search,
* grid search,
* gradient descend,
* evolutationary algorithms,
* different types of programming (linear, integer, etc.),
* ...
Whichever method you learned in your optimization course that is suitable for the application.
--
But what if...
---
# Motivation - A/B testing
`\(f\)`
* can only be evaluated implicitly,
* with a lot of noise.
For example, optimize for click-through rate, retention time, or purchases, implicitly measuring user satisfaction, interest, or revenue of different version of a web site.
<div class="center">
<img src="figs/aalto.png" style="width: 99%" />
</div>
---
# Motivation - Animation design
`\(f\)`
* is a black-box (e.g., contains human judgement),
* with noisy and expensive/slow evaluations,
* and multi-modal.
For example, optimizing parameters for procedural animation generation.
<div class="center">
<img src="figs/brochu/anim.png" style="width: 50%" />
</div>
<div style="font-size: 10px;"><a href="https://dl.acm.org/citation.cfm?id=1921443">[image: Brochu et al., 2010]</a></div>
---
# Motivation - Wearable devices
`\(f\)`
* is based on an implicit signal from human,
* with noisy and expensive/slow evaluations.
For example, optimizing step rate to minimize metabolic cost.
<div class="center">
<img src="figs/wearable_devices.png" style="width: 70%" />
</div>
<div style="font-size: 10px;"><a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0184054">[image: Kim et al., 2017]</a></div>
---
# Motivation - Simulators and cognitive models
`\(f\)`
* contains a parametric simulator,
* with noisy and expensive/slow evaluations.
For example, fitting parameters of cognitive simulators to experimental data.
<div class="center">
<img src="figs/abc.png" style="width: 50%" />
</div>
<div style="font-size: 10px;"><a href="https://dl.acm.org/citation.cfm?doid=3025453.3025576">[image: Kangasrääsiö et al., 2017]</a></div>
---
# Motivation - Other applications
* Recommender systems
* Sensor networks
* AutoML: automatic tuning of machine learning models
* Robotics and reinforcement learning
---
# Motivation
What if
* `\(f\)` is a black-box that we can only evaluate point-wise,
* `\(f\)` can be multi-modal,
* `\(f\)` is slow or expensive to evaluate,
* evaluations of `\(f\)` are noisy,
* `\(f\)` has no gradients available (can be used if available).
Natural in many HCI applications.
---
# Motivation
What if
* `\(f\)` is a black-box that we can only evaluate point-wise,
* `\(f\)` can be multi-modal,
* `\(f\)` is slow or expensive to evaluate,
* evaluations of `\(f\)` are noisy,
* `\(f\)` has no gradients available (can be used if available).
In principle, we can always do, for example, grid search: scales `\(O(M^D)\)` to fill the domain at constant grid of size `\(M\)` per axis.
---
# Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
* `\(f\)` is a black-box that we can only evaluate point-wise,
* `\(f\)` can be multi-modal,
* `\(f\)` is slow or expensive to evaluate,
* evaluations of `\(f\)` are noisy,
* `\(f\)` has no gradients available (can be used if available).
Want to find the minimum with small number of evaluations of `\(f\)`.
**How to cope with these?**
---
# Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
Want to find the minimum with small number of evaluations of `\(f\)`.
1. Construct a tractable **statistical surrogate model** of `\(f\)`.
2. Turn the optimization problem into **a sequence of easier problems**.
---
# Bernoulli bandit
Task: A/B testing to find which of two versions of a web ad is better.
* ''Experiment'': Show one of the two version for a visitor.
* ''Observation'': Did the visitor click the ad.
--
Want to find out which ad is better, while gathering as much clicks as possible.
Minimize *regret* `\(R\)` for `\(T\)` experiments:
`$$R = T \E[y^*] - \sum_{t=1}^T y_t$$`
* `\(\E[y^*]\)` is the expected click rate for the better ad,
* `\(y_t \in \{0,1\}\)` is whether visitor `\(t\)`, who was shown version `\(x_t \in \{A, B\}\)`, clicked the ad.
---
# Bernoulli bandit - surrogate model
* Model click rates of `\(A\)` and `\(B\)` independently. Equations for A below.
**Bayes theorem**: `\(p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)}{p(\mathcal{D})} p(\theta)\)`
updates **prior** knowledge `\(p(\theta)\)` with **observations** `\(\mathcal{D}\)` to **posterior** knowledge `\(p(\theta \mid \mathcal{D})\)`.
--
Observation model:
`$$\Pr(y_t \mid x_t = A) = \textrm{Bernoulli}(y_t \mid \rho_A) = \rho_A^{y_t} (1 - \rho_A)^{1-y_t}.$$`
--
Prior model:
`$$p(\rho_A) = \textrm{Beta}(\rho_A \mid \alpha, \beta) = \frac{1}{B(\alpha, \beta)} \rho_A^{\alpha-1} (1 - \rho_A)^{\beta-1}.$$`
--
Given a dataset of `\(t\)` observations `\(\mathcal{D}_t = \{(x_1, y_1), \ldots, (x_t, y_t)\}\)`, the posterior distribution is
`$$p(\rho_A \mid \mathcal{D}_t) = \frac{p(\rho_A) \prod_{t: x_t = A} \Pr(y_t \mid x_t = A)}{\int_0^1 p(\rho_A) \prod_{t: x_t = A} \Pr(y_t \mid x_t = A) d\rho_A} = \textrm{Beta}(\rho_B \mid \alpha + n^{A}_1, \beta + n^{A}_0),$$`
where `\(n^{A}_1\)` and `\(n^{A}_0\)` are the total numbers of `\(y_t = 1\)` and `\(y_t = 0\)` for `\(x_t = A\)`.
---
# Bernoulli bandit - surrogate model
<img src="slides_files/figure-html/bayes1-1.png" style="display: block; margin: auto;" />
---
# Bernoulli bandit - surrogate model
<img src="slides_files/figure-html/bayes2-1.png" style="display: block; margin: auto;" />
---
# Bernoulli bandit - surrogate model
<img src="slides_files/figure-html/bayes3-1.png" style="display: block; margin: auto;" />
---
# Bernoulli bandit - surrogate model
<img src="slides_files/figure-html/bayes4-1.png" style="display: block; margin: auto;" />
---
# Bernoulli bandit - Thompson sampling
When `\(t+1\)`th visitor comes, which ad, `\(A\)` or `\(B\)`, to serve?
* Want to exploit: gather as much clicks as possible.
* Need to explore: learn about click rates for `\(A\)` and `\(B\)`.
Thompson sampling is a simple algorithm navigating this trade-off:
1. Sample a value for `\(\hat{\rho}_A\)` and for `\(\hat{\rho}_B\)` from `\(p(\rho_A \mid \mathcal{D}_t)\)` and `\(p(\rho_B \mid \mathcal{D}_t)\)`.
2. Show `\(A\)` if `\(\hat{\rho}_A > \hat{\rho}_B\)` and `\(B\)` otherwise.
--
We then observe whether the visitor clicked the ad or not and update our posterior distributions and continue to next iteration.
--
**How does Thompson sampling navigate exploration-exploitation trade-off?**
---
# Bernoulli bandit - simulation
```python
import numpy as np; np.random.seed(123)
rho_A_true = 0.1; rho_B_true = 0.2 # simulated visitor click rates
alpha = 1; beta = 1 # prior parameters
n_A1 = 0; n_A0 = 0; n_B1 = 0; n_B0 = 0 # numbers of clicks/no-clicks
T = 1000 # number of iterations
clicks = np.zeros(T); A_or_B = np.zeros(T)
for t in range(T):
# Thompson sampling
rho_A = np.random.beta(alpha + n_A1, beta + n_A0)
rho_B = np.random.beta(alpha + n_B1, beta + n_B0)
if rho_A > rho_B: # which ad to show
y_t = np.random.binomial(1, rho_A_true) # simul. click
n_A1 += y_t; n_A0 += 1 - y_t; # update posterior of A
else:
y_t = np.random.binomial(1, rho_B_true) # simul. click
n_B1 += y_t; n_B0 += 1 - y_t; # update posterior of B
# collect statistics
clicks[t] = y_t; A_or_B[t] = rho_A > rho_B
```
---
# Bernoulli bandit - simulation
<img src="slides_files/figure-html/bandit_comparison_plot-1.png" style="display: block; margin: auto;" />
---
class: inverse, center, middle
# Questions so far?
---
# Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
Want to find the minimum with small number of evaluations of `\(f\)`.
1. Construct a tractable statistical surrogate model of `\(f\)`.
2. Turn the optimization problem into a sequence of easier problems.
---
# Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
Want to find the minimum with small number of evaluations of `\(f\)`.
1. **Construct a tractable statistical surrogate model of `\(f\)`.**
2. Turn the optimization problem into a sequence of easier problems.
---
# Construct a tractable statistical surrogate model of `\(f\)`
Let `\(g(x)\)` be our surrogate model of `\(f\)`.
* `\(g\)` should be able to capture important aspects of `\(f\)` from small number of evaluations.
* Need to be able to update `\(g\)` when we acquire new evaluations of `\(f\)`: `\(g\)` should get better and better as a model of `\(f\)`.
* `\(g\)` should be fast to evaluate and to update.
* `\(g\)` needs to cope with noise.
* Need to be able to quantify uncertainty in `\(g\)` (navigating exploration-exploitation tradeoff).
---
# Gaussian processes
* Gaussian processes provide a probability distribution over functions.
* Extends (and uses properties of) the multivariate normal distribution `\(\Rightarrow\)` computationally tractable.
* Prior information about the type or behaviour of the modelled function can be included in the *covariance function* and its parameters.
* Alternatives: random forests, Bayesian neural networks.
---
# Gaussian processes - 1D example
<div id="gp-outer"></div>
![:gpappstart]
---
# Formulation
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
Want to find the minimum with small number of evaluations of `\(f\)`.
1. Construct a tractable statistical surrogate model of `\(f\)`.
2. **Turn the optimization problem into a sequence of easier problems.**
---
# Turn the optimization problem into a sequence of easier problems
Consider having evaluated `\(f\)` at points `\(x_1, \ldots, x_{t-1}\)` and having constructed `\(p(g \mid \mathcal{D}_{t-1})\)`:
<img src="slides_files/figure-html/gp_unc-1.png" style="display: block; margin: auto;" />
How to choose the next `\(x\)` to evaluate `\(f\)` at?
**What would you do?**
---
# Turn the optimization problem into a sequence of easier problems
Consider having evaluated `\(f\)` at points `\(x_1, \ldots, x_{t-1}\)` and having constructed `\(p(g \mid \mathcal{D}_{t-1})\)`:
<img src="slides_files/figure-html/gp_unc2-1.png" style="display: block; margin: auto;" />
**Guided exploration** using `\(p(g \mid \mathcal{D}_t)\)`:
* Trade off exploration (reducing uncertainty) and exploitation (sampling near likely places of optima).
* Formulated by maximizing an acquisition function `\(\alpha(x; \mathcal{D}_{t-1})\)`.
---
# Turn the optimization problem into a sequence of easier problems
How to choose the next `\(x\)` to evaluate `\(f\)` at?
**Guided exploration** using `\(p(g \mid \mathcal{D}_{t-1})\)`:
* Trade off exploration (reducing uncertainty) and exploitation (sampling near likely places of optima).
* Formulated by maximizing an acquisition function `\(\alpha(x; \mathcal{D}_{t-1})\)`.
`$$x_{t} = \argmax_x \alpha(x; \mathcal{D}_{t-1})$$`
--
**Thompson sampling** acquisition function:
* `\(\alpha(x; \mathcal{D}_{t-1}) = \hat{g}(x)\)`, where `\(\hat{g}(x)\)` is a sample from `\(p(g \mid \mathcal{D}_{t-1})\)`.
--
**Expected improvement** acquisition function:
* Currently best value `\(y^* = \max_{s \in \{1,\ldots,t-1\}} y_s\)`.
* Improvement function provides utility of `\(x\)` given `\(g\)`: `\(I(x, g) = (g(x) - y^*) I(g(x) > y^*)\)`.
* Expected improvement: `\(\alpha_{EI}(x; \mathcal{D}_t) = \E_g[I(x, g)]\)`.
--
Many others also exists.
---
# Bayesian Optimization Recipe
**Goal**: Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
--
**Key ideas**
1. Construct a tractable statistical surrogate model of `\(f\)`.
2. Turn the optimization problem into a sequence of easier problems.
--
**Algorithm**
1. Initialize dataset `\(\mathcal{D}_0\)`, surrogate model `\(p(g \mid \mathcal{D}_0)\)`; choose acquisition function `\(\alpha(\cdot)\)`.
2. Loop for `\(t = 1,2,\ldots,T\)`:
1. Select next evaluation point: `\(x_{t} = \argmax \alpha(x; \mathcal{D}_{t-1})\)`.
2. Evaluate `\(f(x_{t})\)` to obtain `\(y_{t}\)`.
3. Update dataset `\(\mathcal{D}_{t} = \{\mathcal{D}_{t-1},(x_t, y_t)\}\)`.
4. Update surrogate model `\(p(g \mid \mathcal{D}_t)\)`.
3. Report the found optimum.
---
class: inverse, center, middle
# Questions so far?
---
# GPyOpt - Python package
```python
import GPyOpt
import numpy as np; np.random.seed(12345)
from matplotlib import pyplot as plt
def f_u(x):
return 0.2 * (x - 0.3)**2 - 0.4 * np.sin(15.0 * x)
plt.figure(); xx = np.linspace(0, 1, 101)
```
```python
plt.plot(xx, f_u(xx), 'k-'); plt.show()
```
<img src="slides_files/figure-html/gpyopt_example-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
bounds = [{'name': 'var_1', 'type': 'continuous', 'domain': (0,1)}]
myBopt = GPyOpt.methods.BayesianOptimization(
f=f_u, domain=bounds, # Function and domain
acquisition_type='EI', # Expected improvement
exact_feval=True, # Noiseless function evaluations
eps=1e-6,
normalize_Y=False, # (for clearer visualization)
initial_design_numdata=2) # (for clearer visualization)
```
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example3-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example4-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example5-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example6-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example7-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example8-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example9-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example10-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example11-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example12-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example13-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example14-1.png" style="display: block; margin: auto;" />
---
# GPyOpt
```python
myBopt.run_optimization(max_iter=1)
myBopt.plot_acquisition()
```
<img src="slides_files/figure-html/gpyopt_example15-1.png" style="display: block; margin: auto;" />
---
# Is it guaranteed to work?
* Theoretical guarantees (regret bounds) exists under some conditions.
* ''There is still a wide gap between theory and practice.'' - Shahriari et al.
''[...]the careful choice of statistical model is often far more important than the choice of acquisition function heuristic.'' - Shahriari et al.
--
**Some limitations**
* Difficult for high-dimensional spaces.
* Can spend a lot of time on the edges of the space (Siivola et al., MLSP 2018).
* Computation complexity of inference in Gaussian processes scales as `\(O(N^3)\)`, where `\(N\)` is the number of observations (sparse GPs/inducing point approximations can be used; or other types of models).
* Optimizing hyperparameters (controlling the behaviour of the surrogate) can be challenging.
---
# Human-in-the-loop applications
Two types of rather direct human-in-the-loop applications:
* Human provides explicit feedback at `\(x\)`, the value `\(f(x)\)`.
* Human provides implicit feedback at `\(x\)`, for example, `\(f(x)\)` is a completion time of a task with parameters `\(x\)`.
<div class="center">
<img src="figs/brochu/anim.png" style="width: 35%" />
<img src="figs/wearable_devices.png" style="width: 55%" />
</div>
---
# Application: Animation design
<div class="center">
<img src="figs/brochu/anim.png" style="width: 50%" />
</div>
Eric Brochu, Tyson Brochu, Nando de Freitas: **A Bayesian Interactive Optimization Approach to Procedural Animation Design**, Eurographics/ACM SIGGRAPH Symposium on Computer Animation (2010).
---
# Application: Animation design
<div class="center">
<img src="figs/brochu/anim.png" style="width: 50%" />
</div>
Find parameters for generating a procedural fluid animation:
* velocity, radius and magnitude of the (possibly multiple) vortex rings,
* length scale and magnitude of the curl noise,
* relative strengths of vortex rings and curl noise.
--
User can easily tell which animations look good: ''the psychoperceptual process underlying judgment - how well a realization fits what the user has in mind''.
---
# Application: Animation design
<div class="center">
<img src="figs/brochu/ui.png" style="width: 99%" />
</div>
---
# Application: Animation design
<div class="center">
<img src="figs/brochu/bo_alg.png" style="width: 99%" />
</div>
---
# Application: Animation design
* Obtained improved results compared to novice and expert users setting parameters manually.
* Tailored the Bayesian optimization approach to make it work:
1. preferential feedbacks,
2. transfer information over multiple sessions,
3. combined manual parameter tuning and Bayesian optimization.
---
# Summary
Find the minimum of a function `\(f(x)\)` within some bounded domain `\(\mathcal{X} \subset \mathbb{R}^D\)`:
`$$x^* = \argmin_{x \in \mathcal{X}} f(x)$$`
* `\(f\)` is a black-box that we can only evaluate point-wise,
* `\(f\)` can be multi-modal,
* `\(f\)` is slow or expensive to evaluate,
* evaluations of `\(f\)` are noisy,
* `\(f\)` has no gradients available (can be used if available).
--
**Wide range of applications**. Relevant for HCI and human-in-the-loop modelling.
--
**Key ideas**
1. Construct a tractable statistical surrogate model of `\(f\)`, with proper uncertainty quantification.
2. Turn the optimization problem into a sequence of easier problems, navigating the exploration-exploitation tradeoff.
--
Many **software implementations** (e.g., GPyOpt) exists. Relatively easy to start using.
---
# Further resources
* <a href="https://ieeexplore.ieee.org/document/7352306">Shahriari et al., **Taking the Human Out of the Loop: A Review of Bayesian Optimization**, Proceedings of the IEEE, 2016.</a>
* <a href="http://nbviewer.jupyter.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Chapter1_Introduction/Ch1_Introduction_PyMC3.ipynb">Cam Davidson-Pilon, **Chapter 1: Introduction to Bayesian Methods from Bayesian Methods for Hackers**.<a/>
* <a href="http://www.gaussianprocess.org/gpml/">Rasmussen, Williams, **Gaussian Processes for Machine Learning**, MIT Press, 2016.</a>
* <a href="http://sheffieldml.github.io/GPyOpt/">GPyOpt, Python package for Bayesian optimization.</a>
* <a href="http://www.tmpl.fi/gp/">Interactive Gaussian process regression demo.</a>
* <a href="https://dl.acm.org/citation.cfm?id=1921443">Brochu et al., **A Bayesian Interactive Optimization Approach to Procedural Animation Design**, Eurographics/ACM SIGGRAPH Symposium on Computer Animation, 2010.</a>.
* <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0184054">Kim et al., **Human-in-the-loop Bayesian optimization of wearable device parameters**, PLOS ONE, 2017.</a>
* <a href="https://dl.acm.org/citation.cfm?doid=3025453.3025576">Kangasrääsiö et al., **Inferring Cognitive Models from Data using Approximate Bayesian Computation**, CHI 2017.</a>
* <a href="https://arxiv.org/abs/1704.00963">Siivola et al., **Correcting boundary over-exploration deficiencies in Bayesian optimization with virtual derivative sign observations**, MLSP 2018.</a>