[RLlib] New ConnectorV3 API #05: PPO runs in single-agent mode in this API stack#42272
Conversation
…runner_support_connectors_04_learner_api_changes
    )

    @override(Learner)
    def _preprocess_train_data(
Note: Only called on the new API stack + EnvRunners.
        if not episodes:
            return batch, episodes
        # Make all episodes one ts longer in order to just have a single batch
New way to do GAE:
- Elongate all episodes by one artificial timestep.
- Perform the vf predictions AND the bootstrap value predictions in one single batch (because we now have the extra timestep).
- Use the learner connector to make sure this forward pass is done using the correct (custom?) batch format.
- Remove the extra timesteps from the episodes (and from the computed advantages).
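The steps above can be sketched in plain NumPy. This is a minimal, hypothetical illustration of the "one extra timestep" trick (function name and signature are made up, not RLlib API): because each episode is elongated by one artificial step, the value head produces T+1 predictions in a single forward pass, and the last prediction serves as the bootstrap value for GAE.

```python
import numpy as np

def gae_with_elongated_episodes(rewards, vf_preds_plus_one, gamma=0.99, lambda_=0.95):
    """Hypothetical sketch of GAE over a one-step-elongated episode.

    `rewards` has length T; `vf_preds_plus_one` has length T+1: value
    predictions for all T real steps plus one bootstrap value from the
    artificial extra timestep, all computed in one forward pass.
    """
    T = len(rewards)
    assert len(vf_preds_plus_one) == T + 1
    # TD errors: the slice [1:] naturally supplies the bootstrap value.
    deltas = rewards + gamma * vf_preds_plus_one[1:] - vf_preds_plus_one[:-1]
    advantages = np.zeros(T, dtype=np.float32)
    last = 0.0
    # Standard backward GAE recursion over the T real timesteps only.
    for t in reversed(range(T)):
        last = deltas[t] + gamma * lambda_ * last
        advantages[t] = last
    return advantages
```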
            SampleBatch.VF_PREDS,
            SampleBatch.ACTION_DIST_INPUTS,
        ]
        return self.output_specs_inference()
        """
        # TODO (sven): Make this the only behavior once PPO has been migrated
        # to new API stack (including EnvRunners!).
        if self.config.model_config_dict.get("uses_new_env_runners"):
Temporary hack to make sure the RLModule knows when it still has to compute vf-preds via `forward_exploration` (old and hybrid API stacks).
        # the final results dict in the `self.compile_update_results()` method.
        self._metrics = defaultdict(dict)

    @OverrideToImplementCustomLogic_CallToSuperRecommended
Moved here for better ordering of methods (used to be all the way at the bottom of the class).
        # Build learner connector pipeline used on this Learner worker.
        # TODO (sven): Support multi-agent cases.
        if self.config.uses_new_env_runners and not self.config.is_multi_agent():
For now, the Learner connector is only used on the new API stack with EnvRunners in single-agent mode (without it, PPO on the new stack would not learn).
rllib/utils/minibatch_utils.py
Outdated
    def get_len(b):
        return len(b[SampleBatch.SEQ_LENS])

    n_steps = int(
Bug fix: when slicing a BxT batch, we must slice along the B axis (with the correct slice size!).
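A minimal sketch of the fix, under stated assumptions: `plan_b_axis_slices` is a hypothetical helper (the batch is a plain dict here, standing in for a SampleBatch). Once a batch is padded to shape [B, T], its "length" for minibatching purposes is the number of sequences B (i.e., `len(seq_lens)`), not the flat timestep count, so slice boundaries are computed in rows along the B axis.

```python
import math

def get_len(batch):
    # For a padded [B, T] batch, length = number of sequences (rows),
    # not the flat number of timesteps B * T.
    return len(batch["seq_lens"])

def plan_b_axis_slices(batch, num_minibatches):
    """Hypothetical sketch: compute (start, stop) row slices along the B axis."""
    B = get_len(batch)
    # Rows per minibatch, rounded up so no trailing rows are dropped.
    n_steps = int(math.ceil(B / num_minibatches))
    return [(start, min(start + n_steps, B)) for start in range(0, B, n_steps)]
```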
        return value

    data = tree.map_structure(map_, self)
    infos = self.pop(SampleBatch.INFOS, None)
        # we return the values here and slice them separately
        # TODO(Artur): Clean this hack up.
        return value
    return value[start_padded:stop_padded]
…runner_support_connectors_05_ppo_w_connectorv2s
@sven1977 Could you speak more to why GAE support was dropped for APPO in this release?
EnvRunners support new ConnectorV3 API; PPO runs in single-agent mode in this API stack
This PR:
- Introduces `train_batch_size_per_learner` to better distinguish between the total effective batch size and the batch size per (GPU) Learner worker.
- The EnvRunner no longer has to use `forward_exploration` to perform a value-function pass. This is an essential improvement in code quality, as we now have full separation between the sampling and the learning worlds: the EnvRunner (sampling world) is no longer concerned with what the PPOLearner (learning world) might need and only has to compute actions for the next env step.

Benchmark results:
Learns Pong in ~5min via examples/connectors/connector_v2_frame_stacking.py example script:
Args: `--num-gpus=8 --num-env-runners=95 --framework=torch` on commit: 790a537
Why are these changes needed?
Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I've added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.