[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play. #21356
Conversation
gjoliver
left a comment
I didn't look at the league-building logic too closely yet, and I have a bunch of higher-level comments.
Thanks Sven. The most complicated agent yet :)
```
- p1
- p2
- p3
policy_mapping_fn: ray.rllib.
```
nope, let me fix ...
```
# 4 GPUs + max. 10 policies to train -> 4 shards (0.4 GPU/pol).
# 8 GPUs + max. 3 policies to train -> 3 shards (2.667 GPUs/pol).
# 8 GPUs + max. 2 policies to train -> 2 shards (4 GPUs/pol).
# 2 GPUs + max. 5 policies to train -> 2 shards (0.4 GPUs/pol).
```
Actually, I have a usability question: why would people ever want to specify a different num_shards here?
Why not do the math ourselves and skip the Union[int, str] num_shards parameter altogether?
A user might not want to use all the GPUs on a single machine, for example, or all the CPUs.
Good point. This is fixed now and by default, we determine these values ourselves.
Also, if GPUs are fractional and > 1.0, we floor the value (e.g. 1.333 -> 1.0). Otherwise, it wouldn't make sense.
The user will always want to use all of the configured "config.num_gpus", though, which is what we are looking at here.
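The arithmetic in the quoted example comments can be sketched as follows (a hypothetical helper for illustration only, not the actual RLlib code):

```python
def learner_shard_layout(num_gpus: int, max_num_policies: int):
    """Reproduce the shard math from the example comments above (sketch only)."""
    # Never create more learner shards than there are GPUs or trainable policies.
    num_learner_shards = min(num_gpus, max_num_policies)
    # The configured GPUs are spread evenly over all trainable policies.
    num_gpus_per_policy = num_gpus / max_num_policies
    return num_learner_shards, num_gpus_per_policy
```

For example, `learner_shard_layout(4, 10)` yields 4 shards at 0.4 GPUs per policy, matching the first line of the quoted comments.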
```
max_num_policies:
replay_actor_class:
replay_actor_args:
num_learner_shards:
```
it's probably worth explaining this concept of shards a bit.
IIUC, a shard is a logical group of policies and gpus, right? normally, a shard corresponds to a machine instance?
I need clarification on this as well +1
Good point, will fix the docs and clarify.
```
    self.num_gpus_per_shard = self.num_gpus / self.num_learner_shards
else:
    self.num_learner_shards = num_learner_shards
    self.num_gpus_per_shard = 0
```
If the user specifies num_shards, num_gpus_per_shard is 0, and then num_gpus_per_policy will be 0 too. Then we will not use GPUs for their training?
This has been fixed.
```
    (self.replay_actor_class, self.replay_actor_args, {}, 1),
] + [(
    ray.remote(
        num_cpus=1,
```
hmm, don't we need to update default_resource_req() for these CPUs as well ... ?
Great catch. Fixed.
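Accounting for those extra CPUs presumably means including them in the per-shard bundles of the resource request. A rough sketch of the bundle math, with plain dicts standing in for RLlib's placement-group bundles (the helper name is made up):

```python
def learner_shard_bundles(num_learner_shards: int, num_gpus_per_shard: float):
    # One bundle per learner shard: 1 CPU covers the co-located actor
    # requested via ray.remote(num_cpus=1), plus the shard's GPU share.
    return [
        {"CPU": 1, "GPU": num_gpus_per_shard}
        for _ in range(num_learner_shards)
    ]
```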
```
sample_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers()
    or [self.workers.local_worker()],
```
Can we create a local variable for this argument, so things read a little better? Thanks.
```
@override(Trainer)
def training_iteration(self) -> ResultDict:
    # Trigger asynchronous rollouts on all RolloutWorkers.
```
If I read correctly, the training step here is not really asynchronous (every learner runs its own training loop by itself).
Rather, this is completely parallelized synchronous training: everyone samples some data, waits for everyone to finish, then everyone learns a step, and waits for everyone to finish.
Is this right?
Can you add some high-level comments here explaining how things work?
Thanks.
It is fully asynchronous (at least I hope so :) ):
- Each time the training iteration function is run, we make sure that each worker has at least n sample requests "in flight", so that each worker is basically always sampling in the background (async). We never wait for any worker to complete, but only ever collect what's already done anyway.
- Only those requests that are done are returned here (with an optional timeout), and the driver script can immediately process the next item (which is requesting policy updates, also asynchronous). So the very first time you run this, nothing will be returned (after the timeout), as you just kicked off the background sampling.
I'll update the comment(s).
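The pattern described above can be mimicked in pure Python with concurrent.futures (a thread pool standing in for Ray actors; `sample`, `max_in_flight`, and the worker IDs are made up for illustration):

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

executor = ThreadPoolExecutor(max_workers=4)
in_flight = {}      # future -> worker id
max_in_flight = 2   # the "n sample requests in flight" per worker


def sample(worker_id):
    time.sleep(0.01)  # pretend to collect a rollout
    return f"batch_from_{worker_id}"


def training_iteration(workers):
    # 1) Top up: make sure every worker has `max_in_flight` pending requests,
    #    so sampling always continues in the background.
    for w in workers:
        pending = sum(1 for wid in in_flight.values() if wid == w)
        for _ in range(max_in_flight - pending):
            in_flight[executor.submit(sample, w)] = w
    # 2) Collect only what is already done; never block on stragglers.
    done, _ = wait(list(in_flight), timeout=0.05, return_when=FIRST_COMPLETED)
    for f in done:
        del in_flight[f]
    return [f.result() for f in done]
```

The very first call typically returns few or no batches (the requests were just kicked off); subsequent calls drain whatever finished in the background.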
```
    operations.
    """

# If no evaluation results -> Use hist data gathered for training.
```
I really think we need to define some kind of API for this league-building thing.
It could be as simple as:

```python
class League:
    def build(self, agent: Trainer, result: ResultDict) -> PolicyMapping:
        ...
```

We then separate all this logic into a different file, and plug in this League-building object when we construct the AlphaStar agent.
This would allow people to easily override the league-building strategy without touching this agent.
Yeah, I like this idea, too. Will separate.
Done, created a LeagueBuilder base class, from which an AlphaStarLeagueBuilder subclasses. Class and c'tor kwarg values can be specified in the config. Also works in yaml files, as done in the 4-agent CartPole example.
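Based on this description, the new API presumably looks something like the following sketch (hypothetical; the actual RLlib class and method signatures may differ):

```python
from abc import ABCMeta, abstractmethod


class LeagueBuilder(metaclass=ABCMeta):
    """Base class: builds/updates the league after each training iteration."""

    def __init__(self, trainer, trainer_config):
        self.trainer = trainer
        self.config = trainer_config

    @abstractmethod
    def build_league(self, result):
        """May add policies and/or change the policy mapping based on results."""


class NoLeagueBuilder(LeagueBuilder):
    """Trivial implementation: never changes the league."""

    def build_league(self, result):
        pass
```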
```
# If win rate is good enough -> Snapshot current policy and decide,
# whether to freeze the new snapshot or not.
if win_rate >= self.config["win_rate_threshold_for_new_snapshot"]:
```
I have a feeling these parameters should be per-agent, and not global.
Although I am completely OK with starting with this for now and seeing.
Probably. Then again, tons of things should probably be changed by the user depending on their league-building needs.
```
    policy_id)
self.league_exploiters += 1
# New main-exploiter policy.
elif policy_id.startswith("main_ex"):
```
I am kind of not a fan of allowing people to configure things, but then telling them they can only name their things this way.
If all of this can be encapsulated inside a League-building instance though, that would be a lot better (basically you hardcode things because you are using a specific league-building strategy).
These naming things have been completely moved into the new LeagueBuilder API, so they are entirely in the user's control. The algo itself is policyID agnostic. LeagueBuilder is responsible for a) building the multi-agent dict, according to its c'tor settings, and b) handling the thus-built "config.multiagent.policies" dict.
Yeah, appreciated. This feels much better.
avnishn
left a comment
I refrained from adding comments about the policy builder, as I don't understand the abstraction.
Had some questions and comments for clarification.
```
# Find first empty slot.
for shard in self.shards:
    if shard.max_num_policies > len(shard.policy_actors):
        if shard.has_replay_buffer is False:
```
```
sample_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers() or [self.workers.local_worker()],
    ray_wait_timeout_s=0.01,
```
It feels like there is some downside to manually setting the timeouts for sampling and training. Isn't this going to be dependent on rollout length and training time?
I had a gotcha moment about this too.
I think this timeout is basically saying: give me whatever is ready right now, so I can continue processing it; leave the unfinished remote calls for the next round of iteration.
"is basically saying: give me whatever is ready right now":
Yes, correct.
I think we should try different things here, but probably timeout=0.0 is the best value.
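The "give me whatever is ready right now" semantics of a zero timeout can be demonstrated with concurrent.futures.wait as a stand-in for ray.wait (illustrative only; the task names are made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

executor = ThreadPoolExecutor(max_workers=2)
fast = executor.submit(lambda: "fast result")
slow = executor.submit(lambda: (time.sleep(0.5), "slow result")[1])

time.sleep(0.1)  # give the fast task time to finish

# timeout=0.0: return immediately with whatever is already done;
# the slow request simply stays in flight for a later poll.
done, not_done = wait({fast, slow}, timeout=0.0)
```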
gjoliver
left a comment
A bunch of comments/suggestions, nothing major. I paid more attention to the league-building module this time.
Please take a look at the comments and feel free to merge after you address them.
```
    num_learner_shards = min(cf["num_gpus"], num_policies)
    num_gpus_per_shard = cf["num_gpus"] / num_learner_shards
else:
    num_learner_shards = cf.get("num_replay_buffer_shards", 1)
```
ah ok, maybe we just need to explain the concept of shard really clearly somewhere for now.
also num_replay_buffer_shards is not part of the default config.
```
{
    # Policy learners (and Replay buffer shards).
    "CPU": 1,
    "GPU": num_gpus_per_shard,
```
Any chance you can use super().default_resource_request(config), then add the additional resources?
That way, you get the updates automatically if we modify the default resource req somehow.
```
num_cpus=1,
num_gpus=0.01
if (self.config["num_gpus"] and not self.config["_fake_gpus"]) else
0)(MixInMultiAgentReplayBuffer)
```
Oh, it's actually a long statement ...
Can I suggest we move this out of the function call, create a local variable, and add your reply here as a comment? So it's clear we have either 0 or 0.01 GPUs here.
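The suggested refactor might look like this (hypothetical helper name; a sketch, not the actual RLlib code):

```python
def replay_buffer_num_gpus(config: dict) -> float:
    # Give the replay-buffer actor a tiny GPU share only when real GPUs
    # are configured; with no GPUs or fake GPUs, request none.
    use_real_gpus = bool(config["num_gpus"]) and not config["_fake_gpus"]
    return 0.01 if use_real_gpus else 0.0
```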
```
with self._timers[LEARN_ON_BATCH_TIMER]:
    pol_actors = []
    args = []
    for i, (pid, pol_actor, repl_actor) in enumerate(self.distributed_learners):
```
why bother with enumerate here? doesn't seem like i is used for anything?
```
if self.num_gpus_per_shard == 0:
    self.num_gpus_per_shard = self.num_gpus / self.num_learner_shards

num_policies_per_shard = (
```
Do we actually need to round this value a bit instead?
Not quite sure what a fractional policy really means ...
We do that here afterward:

```python
self.num_policies_per_shard = math.ceil(num_policies_per_shard)
```
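Rounding up guarantees that a fractional quotient still reserves a whole policy slot on every shard (a small sketch of the same arithmetic):

```python
import math

def policies_per_shard(max_num_policies: int, num_learner_shards: int) -> int:
    # E.g. 10 policies over 4 shards -> 2.5 -> 3 slots per shard.
    return math.ceil(max_num_policies / num_learner_shards)
```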
```
# If win rate is good enough -> Snapshot current policy and decide,
# whether to freeze the new snapshot or not.
if win_rate >= self.win_rate_threshold_for_new_snapshot:
    is_main = re.match("^main(_\\d+)?$", policy_id)
```
Do we really need re, or can we simply use policy_id.startswith(...)?
If we use "startswith", then it could be a main_exploiter_\\d as well.
I like re (and Perl!) very much :)
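The difference is easy to see (the pattern is taken from the quoted code above; the example policy IDs are made up):

```python
import re

def is_main(policy_id: str) -> bool:
    # Matches exactly "main" or "main_<digits>", nothing else.
    return bool(re.match("^main(_\\d+)?$", policy_id))
```

A plain `policy_id.startswith("main")` would also match "main_exploiter_1", which is precisely the case the regex excludes.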
```
)

# Update our mapping function accordingly.
def policy_mapping_fn(agent_id, episode, worker, **kwargs):
```
Possible to define this mapping_fn outside of build_league() now? So things are not so nested, and it's clear what input goes into making this decision.
I think this won't work. The function constructed here needs access to some of the closures around it, like probs_match_types and many others. Moving these into the function would require Ray object store transports of large objects. Trying to avoid this here.
```
logger.debug(
    f"Episode {episode.episode_id}: AgentID "
    f"{agent_id} played by {main} ({training})"
)
```
In addition to this msg, I imagine what will be pretty useful for debugging is to track how many matches are played between the different types of agents, and just show it.
It will help us get a sense of whether the league is in the right shape, etc.
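One simple way to track this, as suggested, would be a Counter keyed by agent-type pairs (hypothetical sketch; the agent-type names are made up):

```python
from collections import Counter

match_counts = Counter()

def record_match(type_a: str, type_b: str) -> None:
    # Order-insensitive key, so (main, exploiter) == (exploiter, main).
    match_counts[tuple(sorted((type_a, type_b)))] += 1

record_match("main", "league_exploiter")
record_match("league_exploiter", "main")
record_match("main", "main_exploiter")
```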
Revert "[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play. (ray-project#21356)" This reverts commit 3f03ef8.
AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play.
`agents/alpha_star/AlphaStarTrainer`, performing parallelized multi-agent/multi-GPU training on an arbitrary number of policies. The `build_league` method can be overridden to implement custom league-building logic.

TODO (follow up PR): `StochasticSampling`.

Why are these changes needed?
Related issue number
Checks
I've run `scripts/format.sh` to lint the changes in this PR.