We provide three different categories of examples:
- Examples for black-box hyperparameter tuning using the 'hypersweeper' package.
- Running a schedule based on a reward heuristic.
- Running a reactive schedule based on the gradient history.
Additionally, we provide instructions on how to evaluate an incumbent on ARLBench:
- Evaluating a HPO result
We use 'hydra' as a command line interface for these experiments. You'll find the corresponding configurations (including some variations on the algorithms and environments) in the 'configs' directory. The "hypersweeper_tuning" and "schedules" notebooks can help you run these examples and inspect their results interactively.
We use the Hypersweeper package to demonstrate how ARLBench can be used for black-box HPO. Since it's hydra-based, we simply set up a script which takes a configuration, runs it, and returns the evaluation reward at the end. First, use pip to install the hypersweeper:

```bash
pip install hypersweeper
```

You can try a single run of ARLBench first, using DQN on CartPole:

```bash
python run_arlbench.py
```

To use random search instead, choose the "random_search" config and use the "--multirun" flag, which signals hydra to engage the sweeper:
```bash
pip install hypersweeper
python run_arlbench.py --config-name=random_search --multirun
```

Finally, we can also use the state-of-the-art SMAC optimizer by changing to the "smac" config:
```bash
python run_arlbench.py --config-name=smac -m
```

You can switch between the environments and algorithms in ARLBench by specifying them on the command line like this:
```bash
python run_arlbench.py --config-name=smac -m environment=cc_cartpole algorithm=ppo search_space=ppo_cc
```

You can see exactly what this command changes by looking at the example configs. In 'configs/algorithm/ppo.yaml', for example, we see the following:
```yaml
# @package _global_
algorithm: ppo
hp_config:
  clip_eps: 0.2
  ent_coef: 0.0
  gae_lambda: 0.95
  gamma: 0.99
  learning_rate: 0.0003
  max_grad_norm: 0.5
  minibatch_size: 64
  n_steps: 128
  normalize_advantage: true
  normalize_observations: false
  update_epochs: 10
  vf_clip_eps: 0.2
  vf_coef: 0.5
nas_config:
  activation: tanh
  hidden_size: 64
```

These are the default arguments for the PPO algorithm in ARLBench. You can also override each of these individually, for example to try a different value for gamma:
```bash
python run_arlbench.py --config-name=smac -m environment=cc_cartpole algorithm=ppo search_space=ppo_cc hp_config.gamma=0.8
```

The search space specification works very similarly via a yaml file; 'configs/search_space/cc_ppo.yaml' contains:
```yaml
seed: 0
hyperparameters:
  hp_config.learning_rate:
    type: uniform_float
    upper: 0.1
    lower: 1.0e-06
    log: true
  hp_config.ent_coef:
    type: uniform_float
    upper: 0.5
    lower: 0.0
    log: false
  hp_config.minibatch_size:
    type: categorical
    choices: [128, 256, 512]
  hp_config.gae_lambda:
    type: uniform_float
    upper: 0.9999
    lower: 0.8
    log: false
  hp_config.clip_eps:
    type: uniform_float
    upper: 0.5
    lower: 0.0
    log: false
  hp_config.vf_clip_eps:
    type: uniform_float
    upper: 0.5
    lower: 0.0
    log: false
  hp_config.normalize_advantage:
    type: categorical
    choices: [True, False]
  hp_config.vf_coef:
    type: uniform_float
    upper: 1.0
    lower: 0.0
    default: 0.5
    log: false
  hp_config.max_grad_norm:
    type: uniform_float
    upper: 1.0
    lower: 0.0
    log: false
```

This config sets a seed for the search space and lists the hyperparameters to configure, along with the values they can take. This way, the full HPO setting is specified using only yaml files, which makes the process easy to follow and simple to document for others.
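To get an intuition for what a sweeper does with such a file, here is a small self-contained sketch (not ARLBench or hypersweeper code) that samples one configuration from a trimmed-down version of this search space:

```python
import math
import random

# Trimmed-down version of the search space above; the dict layout mirrors
# the yaml keys but is purely illustrative.
SEARCH_SPACE = {
    "hp_config.learning_rate": {"type": "uniform_float", "lower": 1e-6, "upper": 0.1, "log": True},
    "hp_config.ent_coef": {"type": "uniform_float", "lower": 0.0, "upper": 0.5, "log": False},
    "hp_config.minibatch_size": {"type": "categorical", "choices": [128, 256, 512]},
}


def sample_configuration(space, rng):
    """Draw one random configuration from the search space specification."""
    config = {}
    for name, spec in space.items():
        if spec["type"] == "categorical":
            config[name] = rng.choice(spec["choices"])
        elif spec.get("log"):
            # Sample uniformly in log space for scale-sensitive parameters
            # such as the learning rate.
            low, high = math.log(spec["lower"]), math.log(spec["upper"])
            config[name] = math.exp(rng.uniform(low, high))
        else:
            config[name] = rng.uniform(spec["lower"], spec["upper"])
    return config


rng = random.Random(0)  # mirrors the 'seed: 0' entry in the config
config = sample_configuration(SEARCH_SPACE, rng)
```

Random search is essentially this sampling step repeated; model-based optimizers like SMAC replace the uniform draws with a guided proposal mechanism.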
We can also use ARLBench to dynamically change the hyperparameter configuration. We provide a simple example of this in 'run_heuristic_schedule.py': as soon as the agent improves beyond a certain reward threshold, we decrease DQN's exploration epsilon a bit. This is likely not the best approach in practice, so feel free to play around with this idea! To see the result, run:
```bash
python run_heuristic_schedule.py
```

Since we now run ARLBench dynamically, we have to think about another configuration option: the settings for dynamic execution. These are configured using the 'autorl' set of keys, our general settings for ARLBench. The default version looks like this:
```yaml
autorl:
  seed: 42
  env_framework: ${environment.framework}
  env_name: ${environment.name}
  env_kwargs: ${environment.kwargs}
  eval_env_kwargs: ${environment.eval_kwargs}
  n_envs: ${environment.n_envs}
  algorithm: ${algorithm}
  cnn_policy: ${environment.cnn_policy}
  nas_config: ${nas_config}
  n_total_timesteps: ${environment.n_total_timesteps}
  checkpoint: []
  checkpoint_name: "default_checkpoint"
  checkpoint_dir: "/tmp"
  state_features: []
  objectives: ["reward_mean"]
  optimize_objectives: "upper"
  n_steps: 1
  n_eval_steps: 100
  n_eval_episodes: 10
```

As you can see, most of the defaults are decided by the environment and algorithm we choose. For dynamic execution, we are interested in the 'n_steps' and 'n_total_timesteps' keys. 'n_steps' decides how many steps are taken in the AutoRL Environment - in other words, how many schedule intervals we'd like to have. The 'n_total_timesteps' key then decides the length of each interval. In the current config, we do a single training interval consisting of the total number of environment steps suggested for our target domain. If we instead want to run a schedule of length 10 with each segment taking 10,000 steps, we can change the configuration like this:
```bash
python run_heuristic_schedule.py autorl.n_steps=10 autorl.n_total_timesteps=10000
```

Lastly, we can also adjust the hyperparameters based on algorithm statistics. In 'run_reactive_schedule.py' we spike the learning rate if we see the gradient norm stagnating. See how it works by running:
```bash
python run_reactive_schedule.py
```

To configure which information ARLBench returns about the RL algorithm's internal state, we can use the 'state_features' key - in this case, we want to add the gradient norm and variance like this:
```bash
python run_reactive_schedule.py "autorl.state_features=['grad_info']"
```

Now we can build a schedule that takes the gradient information into account.
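The decision rule itself can be quite simple. Here is a self-contained sketch of one possible stagnation check - the function name, window size, and tolerance are illustrative choices, not the exact rule used in 'run_reactive_schedule.py':

```python
def spike_learning_rate(grad_norms, current_lr, window=5, tol=1e-3, factor=10.0):
    """Return a new learning rate, spiking it if gradient norms stagnate.

    grad_norms: recent gradient-norm history, most recent value last.
    Stagnation here means the norms over the last `window` steps vary by
    less than `tol` - a deliberately simple heuristic.
    """
    if len(grad_norms) < window:
        # Not enough history yet to judge stagnation.
        return current_lr
    recent = grad_norms[-window:]
    if max(recent) - min(recent) < tol:
        # Gradients have flatlined: spike the learning rate to escape.
        return current_lr * factor
    return current_lr
```

In the reactive schedule, a check like this would run once per AutoRL step on the gradient statistics exposed via 'grad_info', and the returned value would be written back into 'hp_config.learning_rate' for the next interval.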
You can use ARLBench to evaluate your tuning method. We recommend running your method on the proposed subset of environments for each algorithm. Afterwards, you need to store the final hyperparameter configurations for these environments and algorithms. This is how the configuration for DQN on Acrobot-v1 might look:
```yaml
# @package _global_
defaults:
  - override /environment: cc_acrobot
  - override /algorithm: dqn

hpo_method: my_optimizer

hp_config:
  buffer_batch_size: 64
  buffer_size: 100000
  buffer_prio_sampling: false
  initial_epsilon: 0.64
  target_epsilon: 0.112
  gamma: 0.99
  gradient_steps: 1
  learning_rate: 0.0023
  learning_starts: 1032
  use_target_network: true
  target_update_interval: 10
```

You can then set your incumbent configuration for each algorithm/environment combination accordingly.
As soon as you have stored all your incumbents (in this example, in the 'incumbent' directory in 'configs'), you can run the evaluation script:
```bash
python run_arlbench.py --config-name=evaluate -m "hpo_method=<my_optimizer>" "autorl.seed=100-110" "incumbent=glob(*)"
```

This command will evaluate all configurations on the test seeds 100, 101, 102, and so on. Make sure not to use these seeds during the design or tuning of your method, as this would invalidate the evaluation results.
We recommend testing on at least 10 seeds.
The final evaluation results are stored in the evaluation directory for each algorithm and environment.
To run the evaluation only for a single algorithm, e.g. PPO, you can adapt the incumbent argument:
```bash
python run_arlbench.py --config-name=evaluate -m "autorl.seed=100-110" "incumbent=glob(ppo*)"
```

The same can be done for single combinations of environments and algorithms.
When it comes to dynamic HPO methods, you cannot simply return the incumbent for evaluation, since you'll have a schedule of variable length and configuration intervals.
In this case, we recommend using your dynamic tuning setup directly, but make sure to set the seed of the AutoRL Environment to a set of test seeds (100, 101, 102, ...).