
[Tune][Air] MLFlow Callback is incompatible with PB2 #27783

@olipinski

What happened + What you expected to happen

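Logging to MLflow through MLflowLoggerCallback is incompatible with population-based training schedulers such as PB2. PB2 changes hyperparameters at runtime, so when a perturbed trial starts again, Tune tries to log the updated config to the same MLflow run. MLflow does not allow the value of an already-logged parameter to change, so it raises an error. I expected the callback to handle the changed parameters without crashing.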

Traceback (most recent call last):
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\ray\tune\execution\trial_runner.py", line 819, in _wait_and_handle_event
    self._on_pg_ready(next_trial)
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\ray\tune\execution\trial_runner.py", line 909, in _on_pg_ready
    if not _start_trial(next_trial) and next_trial.status != Trial.ERROR:
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\ray\tune\execution\trial_runner.py", line 901, in _start_trial
    self._callbacks.on_trial_start(
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\ray\tune\callback.py", line 317, in on_trial_start
    callback.on_trial_start(**info)
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\ray\tune\logger\logger.py", line 135, in on_trial_start
    self.log_trial_start(trial)
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\ray\air\callbacks\mlflow.py", line 118, in log_trial_start
    self.mlflow_util.log_params(run_id=run_id, params_to_log=config)
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\ray\air\_internal\mlflow.py", line 280, in log_params
    client.log_param(run_id=run_id, key=key, value=value)
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\mlflow\tracking\client.py", line 743, in log_param
    self._tracking_client.log_param(run_id, key, value)
  File "C:\Users\user\Anaconda3\envs\repro\lib\site-packages\mlflow\tracking\_tracking_service\client.py", line 248, in log_param
    raise MlflowException(msg, INVALID_PARAMETER_VALUE)
mlflow.exceptions.MlflowException: Changing param values is not allowed. Param with key='rollout_fragment_length' was already logged with value='590' for run ID='3d0a25a70dcc4a6b9c96374e908b0ad8'. Attempted logging new value '4590'.
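
The exception above is raised because the whole trial config is logged again with log_param after PB2 has perturbed it. One possible mitigation, sketched here under assumptions, is to compare the new config with what the run already holds and pass only unseen keys to log_param. The helper name split_params is hypothetical and is not part of Ray or MLflow. It assumes the already-logged values are available as strings, which is how MLflow stores params (note value='590' in the message above); in practice they could be read from MlflowClient().get_run(run_id).data.params.

```python
from typing import Any, Dict, Tuple


def split_params(
    already_logged: Dict[str, str], config: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a trial config into params that are safe to log and params that changed.

    Hypothetical helper, not part of Ray or MLflow.

    already_logged: params MLflow already holds for the run (values are strings).
    config: the trial config Tune wants to log now.
    """
    safe: Dict[str, Any] = {}
    changed: Dict[str, Any] = {}
    for key, value in config.items():
        if key not in already_logged:
            # Never logged before, so log_param would accept it.
            safe[key] = value
        elif already_logged[key] != str(value):
            # Logged before with a different value: log_param would raise.
            changed[key] = value
        # Same value as before: nothing new to record, so it is skipped.
    return safe, changed


# Example using the values from the traceback above.
logged = {"rollout_fragment_length": "590", "lr": "0.001"}
new_config = {"rollout_fragment_length": 4590, "lr": 0.001, "clip_param": 0.3}
safe, changed = split_params(logged, new_config)
print(safe)     # keys that can go to log_param
print(changed)  # keys that need another channel
```

The keys in changed would need a channel that accepts updates. MLflow metrics and tags can both be written more than once for the same key, so logging perturbed hyperparameters with MlflowClient.log_metric (numeric values only) or MlflowClient.set_tag is one option. Whether the callback should do this is a design decision for the maintainers.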

Versions / Dependencies

Ray 3.0.0
Python 3.9
Windows 10 Enterprise 20H2

Reproduction script

import os
import random
import tempfile
import uuid

from ray.air.callbacks.mlflow import MLflowLoggerCallback
from ray.tune import run, sample_from
from ray.tune.schedulers.pb2 import PB2

if __name__ == "__main__":

    pb2 = PB2(
        time_attr="timesteps_total",
        metric="episode_reward_mean",
        mode="max",
        perturbation_interval=50000,
        hyperparam_bounds={
            "lambda": [0.9, 1.0],
            "clip_param": [0.1, 0.5],
            "lr": [1e-3, 1e-5],
            "train_batch_size": [1000, 60000],
        },
    )

    analysis = run(
        "PPO",
        scheduler=pb2,
        verbose=1,
        num_samples=4,
        stop={"timesteps_total": 1000000},
        config={
            "framework": "torch",
            "env": "CartPole-v0",
            "log_level": "INFO",
            "seed": 0,
            "kl_coeff": 1.0,
            "num_gpus": 0,
            "horizon": 1600,
            "observation_filter": "MeanStdFilter",
            "model": {
                "free_log_std": True,
            },
            "num_sgd_iter": 10,
            "sgd_minibatch_size": 128,
            "lambda": sample_from(lambda spec: random.uniform(0.9, 1.0)),
            "clip_param": sample_from(lambda spec: random.uniform(0.1, 0.5)),
            "lr": sample_from(lambda spec: random.uniform(1e-3, 1e-5)),
            "train_batch_size": sample_from(lambda spec: random.randint(1000, 60000)),
        },
        callbacks=[
            MLflowLoggerCallback(
                experiment_name=str(uuid.uuid4()),
                tracking_uri=f'file:{os.path.join(tempfile.gettempdir(), "mlruns")}',
            )
        ],
    )

Issue Severity

Low: It annoys or frustrates me.

Labels

P2 (Important issue, but not time-critical)
bug (Something that is supposed to be working; but isn't)
docs (An issue or change related to documentation)
tune (Tune-related issues)
windows
