
[RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play.#21356

Merged
sven1977 merged 121 commits into ray-project:master from sven1977:decentralized_multi_agent_learning
Feb 3, 2022
Conversation

@sven1977 sven1977 commented Jan 3, 2022

AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play.

  • New algorithm agents/alpha_star / AlphaStarTrainer performing parallelized multi-agent/multi-GPU training on an arbitrary number of policies.
  • Suitable for two-player (self-play) zero-sum games.
  • A build_league method can be overridden to implement custom league-building logic.
  • Currently, the new AlphaStarTrainer is based on APPOPolicy, but this restriction might be relaxed in the future.
  • Adds a simple compilation test case.
  • Adds a small learning test case for 4-agent CartPole for the CI.

TODO (follow up PR):

  • Benchmark on hard task using multi-GPU and add to weekly learning regression tests.
  • Add AlphaStar specific exploration enhancement to this implementation. Currently, it's using APPO's StochasticSampling.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@sven1977 sven1977 changed the title [WIP RLlib] Decentralized multi-agent + multi-GPU learning. [RLlib] AlphaStar: Parallelized, multi-agent/multi-GPU learning via league-based self-play. Jan 27, 2022
@gjoliver gjoliver left a comment (Member)
I didn't look at the league-building logic too closely yet, and I have a bunch of higher-level comments. Thanks Sven. The most complicated agent yet :)

- p1
- p2
- p3
policy_mapping_fn: ray.rllib.
Member:

intentional?

Contributor Author:

nope, let me fix ...

Contributor Author:

done

# 4 GPUs + max. 10 policies to train -> 4 shards (0.4 GPU/pol).
# 8 GPUs + max. 3 policies to train -> 3 shards (2.667 GPUs/pol).
# 8 GPUs + max. 2 policies to train -> 2 shards (4 GPUs/pol).
# 2 GPUs + max. 5 policies to train -> 2 shards (0.4 GPUs/pol).
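The shard arithmetic in the comments above can be sketched in plain Python (an illustrative reconstruction, not the actual RLlib code; the function name is made up):

```python
def learner_shard_layout(num_gpus: int, max_num_policies: int):
    """Compute (num_learner_shards, gpus_per_policy) as in the examples above.

    One learner shard per GPU, capped at the number of trainable policies;
    the available GPUs are then spread evenly across all policies.
    """
    num_shards = min(num_gpus, max_num_policies)
    gpus_per_policy = num_gpus / max_num_policies
    return num_shards, gpus_per_policy

# 4 GPUs, max. 10 policies -> 4 shards, 0.4 GPUs/policy.
print(learner_shard_layout(4, 10))
```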
Member:

Actually, I have a usability question: why would people ever want to specify a different num_shards here? Why not do the math ourselves and skip the Union[int, str] num_shards parameter altogether?

Member:

A user might not want to use all the GPUs on a single machine, for example, or all the CPUs.

Contributor Author:

Good point. This is fixed now and by default, we determine these values ourselves.

Contributor Author:

Also, if GPUs are fractional and > 1.0, we floor the value (e.g. 1.333 -> 1.0). Otherwise, it wouldn't make sense.
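The flooring rule described above can be sketched as follows (assumed logic; the function name is illustrative):

```python
import math

def normalize_gpus_per_policy(gpus_per_policy: float) -> float:
    """Floor fractional GPU counts above 1.0 (e.g. 1.333 -> 1.0, 2.667 -> 2.0).

    Fractions below 1.0 (e.g. 0.4) are valid fractional-GPU requests and
    are kept as-is.
    """
    if gpus_per_policy > 1.0:
        return float(math.floor(gpus_per_policy))
    return gpus_per_policy
```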

Contributor Author:

The user will always want to use all of the configured "config.num_gpus", though, which is what we are looking at here.

max_num_policies:
replay_actor_class:
replay_actor_args:
num_learner_shards:
Member:

It's probably worth explaining this concept of shards a bit. IIUC, a shard is a logical group of policies and GPUs, right? Normally, a shard corresponds to a machine instance?

Member:

I need clarification on this as well +1

Contributor Author:

Good point, will fix the docs and clarify.

    self.num_gpus_per_shard = self.num_gpus / self.num_learner_shards
else:
    self.num_learner_shards = num_learner_shards
    self.num_gpus_per_shard = 0
Member:

If the user specifies num_shards, num_gpus_per_shard is 0, so num_gpus_per_policy will be 0 too. Then we will not use GPUs for their training?

Contributor Author:

This has been fixed.

(self.replay_actor_class, self.replay_actor_args, {}, 1),
] + [(
ray.remote(
num_cpus=1,
Member:

hmm, don't we need to update default_resource_req() for these CPUs as well ... ?

Contributor Author:

Great catch. Fixed.

sample_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers()
    or [self.workers.local_worker()],
Member:

Can we create a local variable for this argument, so things read a little better? Thanks.

Contributor Author:

done


@override(Trainer)
def training_iteration(self) -> ResultDict:
    # Trigger asynchronous rollouts on all RolloutWorkers.
Member:

If I read this correctly, the training step here is not really asynchronous (every learner runs its own training loop by itself). Rather, this is completely parallelized synchronous training: everyone samples some data, waits for everyone to finish, then everyone learns a step and waits for everyone to finish. Is this right? Can you add some high-level comments here explaining how things work? Thanks.

Contributor Author:

It is fully asynchronous (at least I hope so :) ):

  • Each time the training iteration function is run, we make sure that each worker has at least n sample requests "in flight", so that each worker is basically always sampling in the background (async). We never wait for any worker to complete, but only ever collect what's already done anyway.
  • Only those requests that are done are returned here (with an optional timeout), and the driver script can immediately process the next item (which is requesting policy updates, also asynchronous). So the very first time you run this, there will be nothing returned (after the timeout), as you just kicked off the background sampling.

Contributor Author:

I'll update the comment(s).

operations.
"""

# If no evaluation results -> Use hist data gathered for training.
Member:

I really think we need to define some kind of API for this league-building thing. It could be as simple as:

class League:
    def build(agent: Trainer, result: ResultDict) -> PolicyMapping:
        ...

We then separate all this logic into a different file and plug this League-building object in when we construct the AlphaStar agent. This would allow people to easily override the league-building strategy without touching this agent.

Contributor Author:

Yeah, I like this idea, too. Will separate.

Contributor Author:

Done, created a LeagueBuilder base class, from which an AlphaStarLeagueBuilder subclass derives. Class and c'tor kwarg values can be specified in the config. Also works in yaml files, as done in the 4-agent CartPole example.
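A minimal sketch of what such a split might look like (illustrative only; the actual RLlib class signatures may differ, and WinRateLeagueBuilder with its constructor arguments is invented for this example):

```python
from abc import ABC, abstractmethod

class LeagueBuilder(ABC):
    """Base class: owns the league-building strategy, separate from the algo."""

    def __init__(self, trainer, trainer_config):
        self.trainer = trainer
        self.config = trainer_config

    @abstractmethod
    def build_league(self, result: dict) -> None:
        """Inspect the train `result` and grow/adjust the league."""

class WinRateLeagueBuilder(LeagueBuilder):
    """Snapshot any policy whose win rate crosses a threshold."""

    def __init__(self, trainer, trainer_config, win_rate_threshold=0.8):
        super().__init__(trainer, trainer_config)
        self.win_rate_threshold = win_rate_threshold
        self.snapshots = []

    def build_league(self, result: dict) -> None:
        for policy_id, win_rate in result.get("win_rates", {}).items():
            if win_rate >= self.win_rate_threshold:
                # In RLlib this would add a frozen copy of the policy;
                # here we just record the snapshot.
                self.snapshots.append(policy_id)
```

The design keeps the trainer policy-ID agnostic: any naming scheme lives entirely inside the concrete LeagueBuilder.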


# If win rate is good enough -> Snapshot current policy and decide,
# whether to freeze the new snapshot or not.
if win_rate >= self.config["win_rate_threshold_for_new_snapshot"]:
Member:

I have a feeling these parameters should be per-agent, not global, although I am completely OK with starting with this for now and seeing.

Contributor Author:

Probably. Then again, tons of things should probably be changed by the user depending on their league-building needs.

policy_id)
self.league_exploiters += 1
# New main-exploiter policy.
elif policy_id.startswith("main_ex"):
Member:

I am kind of not a fan of allowing people to configure things, but then telling them they can only name their things this way. If all of this can be encapsulated inside a league-building instance, though, that would be a lot better (basically, you hardcode things because you are using a specific league-building strategy).

Contributor Author:

These naming things have been completely moved into the new LeagueBuilder API, so they are entirely in the user's control. The algo itself is policy-ID agnostic. LeagueBuilder is responsible for a) building the multi-agent dict according to its c'tor settings, and b) handling the thus-built "config.multiagent.policies" dict.

Member:

yeah, appreciated. this feels much better.

@bveeramani (Member)

‼️ ACTION REQUIRED ‼️

We've switched our code formatter from YAPF to Black (see #21311).

To prevent issues with merging your code, here's what you'll need to do:

  1. Install Black:
     pip install -I black==21.12b0
  2. Format changed files with Black:
     curl -o format-changed.sh https://gist.githubusercontent.com/bveeramani/42ef0e9e387b755a8a735b084af976f2/raw/7631276790765d555c423b8db2b679fd957b984a/format-changed.sh
     chmod +x ./format-changed.sh
     ./format-changed.sh
     rm format-changed.sh
  3. Commit your changes:
     git add --all
     git commit -m "Format Python code with Black"
  4. Merge master into your branch:
     git pull upstream master
  5. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated format.sh.

@avnishn avnishn left a comment (Member)

I refrained from adding comments about the policy builder, as I don't understand the abstraction.

Had some questions and comments for clarification.

max_num_policies:
replay_actor_class:
replay_actor_args:
num_learner_shards:
Member:

I need clarification on this as well +1

# 4 GPUs + max. 10 policies to train -> 4 shards (0.4 GPU/pol).
# 8 GPUs + max. 3 policies to train -> 3 shards (2.667 GPUs/pol).
# 8 GPUs + max. 2 policies to train -> 2 shards (4 GPUs/pol).
# 2 GPUs + max. 5 policies to train -> 2 shards (0.4 GPUs/pol).
Member:

A user might not want to use all the GPUs on a single machine, for example, or all the CPUs.

# Find first empty slot.
for shard in self.shards:
    if shard.max_num_policies > len(shard.policy_actors):
        if shard.has_replay_buffer is False:
Member:

+1

sample_results = asynchronous_parallel_requests(
    remote_requests_in_flight=self.remote_requests_in_flight,
    actors=self.workers.remote_workers() or [self.workers.local_worker()],
    ray_wait_timeout_s=0.01,
Member:

It feels like there is some downside to manually setting the timeouts for sampling and training -- isn't this going to depend on rollout length and training time?

Member:

I had a gotcha moment about this too. I think this timeout is basically saying: give me whatever is ready right now, so I can continue processing it, and leave the unfinished remote calls for the next round of iteration.

Contributor Author:

"is basically saying: give me whatever is ready right now":
Yes, correct.

I think we should try different things here, but timeout=0.0 is probably the best value.

@gjoliver gjoliver left a comment (Member)

A bunch of comments/suggestions, nothing major. I paid more attention to the league-building module this time. Please take a look at the comments and feel free to merge after you address them.

policy_id)
self.league_exploiters += 1
# New main-exploiter policy.
elif policy_id.startswith("main_ex"):
Member:

yeah, appreciated. this feels much better.

    num_learner_shards = min(cf["num_gpus"], num_policies)
    num_gpus_per_shard = cf["num_gpus"] / num_learner_shards
else:
    num_learner_shards = cf.get("num_replay_buffer_shards", 1)
Member:

Ah OK, maybe we just need to explain the concept of a shard really clearly somewhere for now. Also, num_replay_buffer_shards is not part of the default config.

{
    # Policy learners (and replay buffer shards).
    "CPU": 1,
    "GPU": num_gpus_per_shard,
Member:

Any chance you can use super().default_resource_request(config), then add the additional resources? That way, you get the updates automatically if we modify the default resource req somehow.

num_cpus=1,
num_gpus=0.01
if (self.config["num_gpus"] and not self.config["_fake_gpus"]) else
0)(MixInMultiAgentReplayBuffer)
Member:

Oh, it's actually a long statement ... Can I suggest we move this out of the function call, create a local variable, and add your reply here as a comment? So it's clear we have either 0 or 0.01 GPUs here.

with self._timers[LEARN_ON_BATCH_TIMER]:
pol_actors = []
args = []
for i, (pid, pol_actor, repl_actor) in enumerate(self.distributed_learners):
Member:

why bother with enumerate here? doesn't seem like i is used for anything?

Contributor Author:

fixed

if self.num_gpus_per_shard == 0:
    self.num_gpus_per_shard = self.num_gpus / self.num_learner_shards

num_policies_per_shard = (
Member:

Do we actually need to round this value a bit instead? Not quite sure what a fractional policy really means ...

Contributor Author:

We do that here afterward:

self.num_policies_per_shard = math.ceil(num_policies_per_shard)

# If win rate is good enough -> Snapshot current policy and decide,
# whether to freeze the new snapshot or not.
if win_rate >= self.win_rate_threshold_for_new_snapshot:
    is_main = re.match("^main(_\\d+)?$", policy_id)
Member:

Do we really need re, or can we simply use policy_id.startswith(...)?

Contributor Author:

If we use "startswith", then it could match a main_exploiter_\\d as well.
I like re (and Perl!) very much :)
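A quick illustration of the distinction (the policy IDs are examples):

```python
import re

def is_main(policy_id: str) -> bool:
    # Matches "main" and numbered snapshots like "main_3", but NOT
    # "main_exploiter_3", which a plain policy_id.startswith("main")
    # check would also (wrongly) match.
    return bool(re.match(r"^main(_\d+)?$", policy_id))
```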

)

# Update our mapping function accordingly.
def policy_mapping_fn(agent_id, episode, worker, **kwargs):
Member:

Possible to define this mapping_fn outside of build_league() now? So things are not so nested, and it's clear what input goes into making this decision.

Contributor Author:

I think this won't work. The function constructed here needs access to some of the closures around it, like probs_match_types and many others. Moving these into the function would require Ray object store transports of large objects, which I'm trying to avoid here.
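The closure argument can be sketched as follows (probs_match_types is taken from the comment above; everything else is invented for illustration). A factory returns a policy_mapping_fn that captures league state from its enclosing scope instead of receiving it as serialized arguments:

```python
import random

def make_policy_mapping_fn(main_policies, probs_match_types):
    """Return a mapping fn that closes over the league state passed in."""
    def policy_mapping_fn(agent_id, episode=None, worker=None, **kwargs):
        if agent_id == 0:
            # Agent 0 is always played by the current "main" policy.
            return "main"
        # The opponent is sampled according to the captured match-type probs.
        (opponent,) = random.choices(main_policies, weights=probs_match_types)
        return opponent
    return policy_mapping_fn
```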

logger.debug(
    f"Episode {episode.episode_id}: AgentID "
    f"{agent_id} played by {main} ({training})"
)
Member:

In addition to this msg, I imagine what would be pretty useful for debugging is tracking how many matches are played between the different types of agents, and just showing it. It would help us get a sense of whether the league is in the right shape, etc.

@sven1977 sven1977 merged commit 3f03ef8 into ray-project:master Feb 3, 2022
rkooo567 added a commit to rkooo567/ray that referenced this pull request Feb 4, 2022
