Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization

Overview

Instead of using Bayesian optimization, this paper aims at speeding up random search. In particular, the authors claim that Hyperband is simple, flexible, and theoretically sound. It should be seen as a resource-allocation method for randomly sampled configurations, and it provides over an order of magnitude speed-up compared to Bayesian optimization.

The paper covers:

  • Introduction
  • Related work
  • The algorithm (theoretical results)
  • Experiments
  • Discussion and future work

For me the most interesting parts are the algorithm and maybe the theoretical results. That is because I have already heard about its success, so I believe the results (see the plot from the paper). Maybe also the comparison to Bayesian optimization.

[Results plot from the paper.]

The gist is that the authors focus on configuration evaluation and keep the configuration proposal simple (and theoretically sound) while making minimal assumptions. They do, however, assume that evaluation time is negligible. You could imagine that as the number of configurations goes to infinity while the resource per configuration goes to zero, we would get an optimally consistent algorithm. This may not fully hold for the types of models this method is aimed at, which can sometimes take minutes just to build and compile.

The Hyperband algorithm

The algorithm builds on the SuccessiveHalving algorithm of Jamieson & Talwalkar (2015). SuccessiveHalving works like this: start 100 models, train them for a pre-defined time, throw out the 50 worst, train the remaining 50 for exponentially more time, then 25, and so on until one is left. To know how much time to allocate to each, you fix a finite time budget B and a number of configurations n. The problem is: how do you choose n? Large or small? And, consequently, should the time per model be large or small?
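Here is a minimal Python sketch of that procedure, assuming two hypothetical hooks that are not from the paper's code: `sample_config()` draws a random configuration and `train_and_score(config, r)` trains it for r units of resource and returns a loss (lower is better).

```python
import math

def successive_halving(sample_config, train_and_score, n, B, eta=2):
    """Minimal sketch of SuccessiveHalving (Jamieson & Talwalkar, 2015).
    sample_config() and train_and_score(config, r) are assumed hooks."""
    configs = [sample_config() for _ in range(n)]
    rounds = int(math.ceil(math.log(n, eta)))  # eliminations until ~1 survivor
    for k in range(rounds):
        # Split this round's share of the budget B over the survivors, so the
        # per-config resource grows exponentially as the pool shrinks.
        r = B / (rounds * len(configs))
        scores = [train_and_score(c, r) for c in configs]
        ranked = sorted(range(len(configs)), key=lambda j: scores[j])
        # Keep the best 1/eta fraction (at least one configuration).
        configs = [configs[j] for j in ranked[: max(1, len(configs) // eta)]]
    return configs[0]
```

In practice one would checkpoint and resume training across rounds rather than retrain from scratch as this sketch does.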

This boils down to the fact that we don't know in advance how training develops and how soon different configurations become distinguishable. If it takes a long time to differentiate them (slow convergence, or configurations that perform similarly), we should choose n small so each configuration gets plenty of resource. In the opposite case - fast convergence, or many clearly bad configurations - we should choose n large.

Hyperband solves this “n vs. B/n” problem by considering several values of n for a fixed B. Each value of n gets a minimum resource r, which gets smaller the larger n is. So, in effect, we are doing grid search over n. Specifically,

    for each value of n
        run SuccessiveHalving with that value of n and the corresponding r, using B time

Now Hyperband needs two inputs: R, the maximum resource that can be allocated to a single configuration, and eta, which controls the proportion of configurations discarded in each round within a bracket, i.e. one run of SuccessiveHalving (each round keeps the best 1/eta; the halving above corresponds to eta = 2, discarding half). Specifically, s_max + 1 values of n are considered, where s_max = floor(log_eta(R)).

[Figure from the paper.]
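Putting it together, here is a minimal Python sketch of the full Hyperband loop following the paper's pseudocode. As above, `sample_config()` and `train_and_score(config, r)` are assumed user-supplied hooks, not the paper's actual code.

```python
import math

def hyperband(sample_config, train_and_score, R, eta=3):
    """Minimal sketch of Hyperband. sample_config() draws a random
    configuration; train_and_score(config, r) trains it with r units of
    resource and returns a loss (lower is better)."""
    s_max = int(math.floor(math.log(R, eta)))  # number of brackets minus one
    B = (s_max + 1) * R                        # budget per bracket
    best, best_score = None, float("inf")
    for s in range(s_max, -1, -1):             # one bracket per value of s
        # Aggressive brackets (large s) start many configs on little resource.
        n = int(math.ceil((B / R) * eta**s / (s + 1)))
        r = R * eta**(-s)
        configs = [sample_config() for _ in range(n)]
        for i in range(s + 1):                 # inner SuccessiveHalving loop
            n_i = int(math.floor(n * eta**(-i)))
            r_i = r * eta**i
            scores = [train_and_score(c, r_i) for c in configs]
            for c, sc in zip(configs, scores):
                if sc < best_score:
                    best, best_score = c, sc
            # Keep the best 1/eta of the surviving configurations.
            ranked = sorted(range(len(configs)), key=lambda j: scores[j])
            configs = [configs[j] for j in ranked[: n_i // eta]]
    return best
```

As a worked example of the grid over n: with R = 81 and eta = 3 the formulas give s_max = 4, so five brackets are run, starting with roughly n = 81, 34, 15, 8, 5 configurations at r = 1, 3, 9, 27, 81 units of resource respectively. The s = 4 bracket behaves like pure SuccessiveHalving, while the s = 0 bracket degenerates to plain random search with the full resource R per configuration.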