
[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play.#21356

Merged
sven1977 merged 121 commits into ray-project:master from sven1977:decentralized_multi_agent_learning
Feb 3, 2022
Conversation

@sven1977 sven1977 commented Jan 3, 2022

AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play.

  • New algorithm agents/alpha_star / AlphaStarTrainer performing parallelized multi-agent/multi-GPU training on an arbitrary number of policies.
  • Suitable for two-player (self-play) zero-sum games.
  • A build_league method can be overridden to implement custom league-building logic.
  • Currently, the new AlphaStarTrainer is based on APPOPolicy, but this restriction might be relaxed in the future.
  • Adds a simple compilation test case.
  • Adds a small learning test case for 4-agent CartPole for the CI.

TODO (follow up PR):

  • Benchmark on hard task using multi-GPU and add to weekly learning regression tests.
  • Add AlphaStar specific exploration enhancement to this implementation. Currently, it's using APPO's StochasticSampling.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 sven1977 changed the title [WIP RLlib] Decentralized multi-agent + multi-GPU learning. [RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play. Jan 27, 2022
@gjoliver gjoliver left a comment (Member)
I didn't look at the league-building logic too closely yet, and I have a bunch of higher-level comments. Thanks Sven. The most complicated agent yet :)

- p1
- p2
- p3
policy_mapping_fn: ray.rllib.
Member:

intentional?

Contributor Author:

nope, let me fix ...

Contributor Author:

done

# 4 GPUs + max. 10 policies to train -> 4 shards (0.4 GPU/pol).
# 8 GPUs + max. 3 policies to train -> 3 shards (2.667 GPUs/pol).
# 8 GPUs + max. 2 policies to train -> 2 shards (4 GPUs/pol).
# 2 GPUs + max. 5 policies to train -> 2 shards (0.4 GPUs/pol).
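The shard arithmetic in the comments above can be sketched in plain Python (an illustrative reconstruction, not the actual RLlib code; the function name is made up):

```python
def learner_shard_layout(num_gpus: int, max_num_policies: int):
    """Compute (num_learner_shards, gpus_per_policy) as in the examples above.

    One learner shard per GPU, capped at the number of trainable policies;
    the available GPUs are then spread evenly across all policies.
    """
    num_shards = min(num_gpus, max_num_policies)
    gpus_per_policy = num_gpus / max_num_policies
    return num_shards, gpus_per_policy

# 4 GPUs, max. 10 policies -> 4 shards, 0.4 GPUs/policy.
print(learner_shard_layout(4, 10))
```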
Member:

Actually, I have a usability question: why would people ever want to specify a different num_shards here? Why not do the math ourselves and skip the Union[int, str] num_shards parameter altogether?

Member:

A user might not want to use all the GPUs on a single machine, for example, or all the CPUs.

Contributor Author:

Good point. This is fixed now and by default, we determine these values ourselves.

Contributor Author:

Also, if GPUs are fractional and > 1.0, we floor the value (e.g. 1.333 -> 1.0). Otherwise, it wouldn't make sense.
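The flooring rule described above can be sketched as follows (assumed logic; the function name is illustrative):

```python
import math

def normalize_gpus_per_policy(gpus_per_policy: float) -> float:
    """Floor fractional GPU counts above 1.0 (e.g. 1.333 -> 1.0, 2.667 -> 2.0).

    Fractions below 1.0 (e.g. 0.4) are valid fractional-GPU requests and
    are kept as-is.
    """
    if gpus_per_policy > 1.0:
        return float(math.floor(gpus_per_policy))
    return gpus_per_policy
```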

Contributor Author:

The user will always want to use all of the configured "config.num_gpus", though, which is what we are looking at here.

max_num_policies:
replay_actor_class:
replay_actor_args:
num_learner_shards:
Member:

It's probably worth explaining this concept of shards a bit. IIUC, a shard is a logical group of policies and GPUs, right? Normally, a shard corresponds to a machine instance?

Member:

I need clarification on this as well +1

Contributor Author:

Good point, will fix the docs and clarify.

    self.num_gpus_per_shard = self.num_gpus / self.num_learner_shards
else:
    self.num_learner_shards = num_learner_shards
    self.num_gpus_per_shard = 0
Member:

If the user specifies num_shards, num_gpus_per_shard is 0, so num_gpus_per_policy will be 0 too. Then we will not use GPUs for their training?

Contributor Author:

This has been fixed.

(self.replay_actor_class, self.replay_actor_args, {}, 1),
] + [(
ray.remote(
num_cpus=1,
Member:

hmm, don't we need to update default_resource_req() for these CPUs as well ... ?

Contributor Author:

Great catch. Fixed.

sample_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers()
    or [self.workers.local_worker()],
Member:

Can we create a local variable for this argument, so things read a little better? Thanks.

Contributor Author:

done


@override(Trainer)
def training_iteration(self) -> ResultDict:
    # Trigger asynchronous rollouts on all RolloutWorkers.
Member:

If I read this correctly, the training step here is not really asynchronous (every learner runs its own training loop by itself). Rather, this is completely parallelized synchronous training: everyone samples some data, waits for everyone to finish, then everyone learns a step and waits for everyone to finish. Is this right? Can you add some high-level comments here explaining how things work? Thanks.

Contributor Author:

It is fully asynchronous (at least I hope so :) ):

  • Each time the training iteration function is run, we make sure that each worker has at least n sample requests "in flight", so that each worker is basically always sampling in the background (async). We never wait for any worker to complete, but only ever collect what's already done anyway.
  • Only those requests that are done are returned here (with an optional timeout), and the driver script can immediately process the next item (which is requesting policy updates, also asynchronous). So the very first time you run this, there will be nothing returned (after the timeout), as you just kicked off the background sampling.

Contributor Author:

I'll update the comment(s).

operations.
"""

# If no evaluation results -> Use hist data gathered for training.
Member:

I really think we need to define some kind of API for this league-building thing. It could be as simple as:

class League:
    def build(agent: Trainer, result: ResultDict) -> PolicyMapping:
        ...

We then separate all this logic into a different file and plug this League-building object in when we construct the AlphaStar agent. This would allow people to easily override the league-building strategy without touching this agent.

Contributor Author:

Yeah, I like this idea, too. Will separate.

Contributor Author:

Done, created a LeagueBuilder base class, from which an AlphaStarLeagueBuilder subclass derives. Class and c'tor kwarg values can be specified in the config. Also works in yaml files, as done in the 4-agent CartPole example.
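A minimal sketch of what such a split might look like (illustrative only; the actual RLlib class signatures may differ, and WinRateLeagueBuilder with its constructor arguments is invented for this example):

```python
from abc import ABC, abstractmethod

class LeagueBuilder(ABC):
    """Base class: owns the league-building strategy, separate from the algo."""

    def __init__(self, trainer, trainer_config):
        self.trainer = trainer
        self.config = trainer_config

    @abstractmethod
    def build_league(self, result: dict) -> None:
        """Inspect the train `result` and grow/adjust the league."""

class WinRateLeagueBuilder(LeagueBuilder):
    """Snapshot any policy whose win rate crosses a threshold."""

    def __init__(self, trainer, trainer_config, win_rate_threshold=0.8):
        super().__init__(trainer, trainer_config)
        self.win_rate_threshold = win_rate_threshold
        self.snapshots = []

    def build_league(self, result: dict) -> None:
        for policy_id, win_rate in result.get("win_rates", {}).items():
            if win_rate >= self.win_rate_threshold:
                # In RLlib this would add a frozen copy of the policy;
                # here we just record the snapshot.
                self.snapshots.append(policy_id)
```

The design keeps the trainer policy-ID agnostic: any naming scheme lives entirely inside the concrete LeagueBuilder.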


# If win rate is good enough -> Snapshot current policy and decide,
# whether to freeze the new snapshot or not.
if win_rate >= self.config["win_rate_threshold_for_new_snapshot"]:
Member:

I have a feeling these parameters should be per-agent, not global, although I am completely OK with starting with this for now and seeing.

Contributor Author:

Probably. Then again, tons of things should probably be changed by the user depending on their league-building needs.

policy_id)
self.league_exploiters += 1
# New main-exploiter policy.
elif policy_id.startswith("main_ex"):
Member:

I am kind of not a fan of allowing people to configure things, but then telling them they can only name their things this way. If all of this can be encapsulated inside a league-building instance, though, that would be a lot better (basically, you hardcode things because you are using a specific league-building strategy).

Contributor Author:

These naming things have been completely moved into the new LeagueBuilder API, so they are entirely in the user's control. The algo itself is policy-ID agnostic. LeagueBuilder is responsible for a) building the multi-agent dict according to its c'tor settings, and b) handling the thus-built "config.multiagent.policies" dict.

Member:

yeah, appreciated. this feels much better.

@bveeramani (Member)

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
     pip install -I black==21.12b0
  2. Format changed files with Black:
     curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
     chmod +x ./format-changed.sh
     ./format-changed.sh
     rm format-changed.sh
  3. Commit your changes:
     git add --all
     git commit -m "Format Python code with Black"
  4. Merge master into your branch:
     git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@avnishn avnishn left a comment (Member)

I refrained from adding comments about the policy builder, as I don't understand the abstraction.

Had some questions and comments for clarification.

max_num_policies:
replay_actor_class:
replay_actor_args:
num_learner_shards:
Member:

I need clarification on this as well +1

# 4 GPUs + max. 10 policies to train -> 4 shards (0.4 GPU/pol).
# 8 GPUs + max. 3 policies to train -> 3 shards (2.667 GPUs/pol).
# 8 GPUs + max. 2 policies to train -> 2 shards (4 GPUs/pol).
# 2 GPUs + max. 5 policies to train -> 2 shards (0.4 GPUs/pol).
Member:

A user might not want to use all the GPUs on a single machine, for example, or all the CPUs.

# Find first empty slot.
for shard in self.shards:
    if shard.max_num_policies > len(shard.policy_actors):
        if shard.has_replay_buffer is False:
Member:

+1

sample_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers() or [self.workers.local_worker()],
    ray_wait_timeout_s=0.01,
Member:

It feels like there is some downside to manually setting the timeouts for sampling and training -- isn't this going to depend on rollout length and training time?

Member:

I had a gotcha moment about this too. I think this timeout is basically saying: give me whatever is ready right now, so I can continue processing it, and leave the unfinished remote calls for the next round of iteration.

Contributor Author:

"is basically saying: give me whatever is ready right now":
Yes, correct.

I think we should try different things here, but timeout=0.0 is probably the best value.

@gjoliver gjoliver left a comment (Member)

A bunch of comments/suggestions, nothing major. I paid more attention to the league-building module this time. Please take a look at the comments and feel free to merge after you address them.

policy_id)
self.league_exploiters += 1
# New main-exploiter policy.
elif policy_id.startswith("main_ex"):
Member:

yeah, appreciated. this feels much better.

    num_learner_shards = min(cf["num_gpus"], num_policies)
    num_gpus_per_shard = cf["num_gpus"] / num_learner_shards
else:
    num_learner_shards = cf.get("num_replay_buffer_shards", 1)
Member:

Ah OK, maybe we just need to explain the concept of a shard really clearly somewhere for now. Also, num_replay_buffer_shards is not part of the default config.

{
    # Policy learners (and replay buffer shards).
    "CPU": 1,
    "GPU": num_gpus_per_shard,
Member:

Any chance you can use super().default_resource_request(config), then add the additional resources? That way, you get the updates automatically if we modify the default resource req somehow.

num_cpus=1,
num_gpus=0.01
if (self.config["num_gpus"] and not self.config["_fake_gpus"]) else
0)(MixInMultiAgentReplayBuffer)
Member:

Oh, it's actually a long statement ... Can I suggest we move this out of the function call, create a local variable, and add your reply here as a comment? So it's clear we have either 0 or 0.01 GPUs here.

with self._timers[LEARN_ON_BATCH_TIMER]:
pol_actors = []
args = []
for i, (pid, pol_actor, repl_actor) in enumerate(self.distributed_learners):
Member:

why bother with enumerate here? doesn't seem like i is used for anything?

Contributor Author:

fixed

if self.num_gpus_per_shard == 0:
    self.num_gpus_per_shard = self.num_gpus / self.num_learner_shards

num_policies_per_shard = (
Member:

Do we actually need to round this value a bit instead? Not quite sure what a fractional policy really means ...

Contributor Author:

We do that here afterward:

self.num_policies_per_shard = math.ceil(num_policies_per_shard)

# If win rate is good enough -> Snapshot current policy and decide,
# whether to freeze the new snapshot or not.
if win_rate >= self.win_rate_threshold_for_new_snapshot:
    is_main = re.match("^main(_\\d+)?$", policy_id)
Member:

Do we really need re, or can we simply use policy_id.startswith(...)?

Contributor Author:

If we use "startswith", then it could match a main_exploiter_\\d as well.
I like re (and Perl!) very much :)
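A quick illustration of the distinction (the policy IDs are examples):

```python
import re

def is_main(policy_id: str) -> bool:
    # Matches "main" and numbered snapshots like "main_3", but NOT
    # "main_exploiter_3", which a plain policy_id.startswith("main")
    # check would also (wrongly) match.
    return bool(re.match(r"^main(_\d+)?$", policy_id))
```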

)

# Update our mapping function accordingly.
def policy_mapping_fn(agent_id, episode, worker, **kwargs):
Member:

Possible to define this mapping_fn outside of build_league() now? So things are not so nested, and it's clear what input goes into making this decision.

Contributor Author:

I think this won't work. The function constructed here needs access to some of the closures around it, like probs_match_types and many others. Moving these into the function would require Ray object store transports of large objects, which I'm trying to avoid here.
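The closure argument can be sketched as follows (probs_match_types is taken from the comment above; everything else is invented for illustration). A factory returns a policy_mapping_fn that captures league state from its enclosing scope instead of receiving it as serialized arguments:

```python
import random

def make_policy_mapping_fn(main_policies, probs_match_types):
    """Return a mapping fn that closes over the league state passed in."""
    def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
        if agent_id == 0:
            # Agent 0 is always played by the current "main" policy.
            return "main"
        # The opponent is sampled according to the captured match-type probs.
        (opponent,) = random.choices(main_policies, weights=probs_match_types)
        return opponent
    return policy_mapping_fn
```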

logger.debug(
    f"Episode {episode.episode_id}: AgentID "
    f"{agent_id} played by {main} ({training})"
)
Member:

In addition to this msg, I imagine what would be pretty useful for debugging is tracking how many matches are played between the different types of agents, and just showing it. It would help us get a sense of whether the league is in the right shape, etc.

@sven1977 sven1977 merged commit 3f03ef8 into ray-project:master Feb 3, 2022
rkooo567 added a commit to rkooo567/ray that referenced this pull request Feb 4, 2022
