[RLlib] Preparatory PR for multi-agent, multi-GPU learning agent (alpha-star style) #02 (#21649)
Conversation
gjoliver
left a comment
overall looks great. a few comments, thanks.
buffer = self.replay_buffers[policy_id]
# If buffer empty or no new samples to mix with replayed ones,
# return None.
if len(buffer) == 0 or len(buffer.last_added_batches) == 0:
may be safer to check:
if len(buffer) == 0 and self.replay_ratio > 0.0:
    return None
if len(buffer.last_added_batches) == 0 and self.replay_ratio < 1.0:
    return None
In other words, if we need to have replay samples and the buffer is empty, or we actually need to have newly added samples, but there isn't any, return None.
Not sure, actually.
If len(buffer) == 0, then self.last_added_batches should also be empty:
everything in self.last_added_batches is guaranteed to already be part of the buffer.
What we should check, though, is whether replay_ratio == 1.0 (in that case, we do NOT care about buffer.last_added_batches, because 100% of the returned samples are replayed (older) ones).
I'll fix this.
yeah, that's pretty much what I meant. thanks.
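To make the agreed-upon check concrete, here is a minimal, self-contained sketch. `FakeMixInBuffer` and `sample_or_none` are hypothetical stand-ins for illustration, not the actual RLlib classes; the logic follows the discussion above (empty buffer implies no new batches either, and replay_ratio == 1.0 makes last_added_batches irrelevant):

```python
import random


class FakeMixInBuffer:
    """Hypothetical stand-in for the mixin replay buffer (illustration only)."""

    def __init__(self):
        self.batches = []             # everything ever added (the "buffer")
        self.last_added_batches = []  # new batches since the last sample call

    def add(self, batch):
        # Added batches always land in both containers, so
        # len(buffer) == 0 implies last_added_batches is empty, too.
        self.batches.append(batch)
        self.last_added_batches.append(batch)

    def __len__(self):
        return len(self.batches)

    def replay(self):
        return random.choice(self.batches)


def sample_or_none(buffer, replay_ratio):
    """Return None only when sampling is truly impossible for the ratio."""
    # Empty buffer: nothing at all to return.
    if len(buffer) == 0:
        return None
    # 100% replay: we do NOT care about last_added_batches.
    if replay_ratio == 1.0:
        return buffer.replay()
    # Mixing needed, but no new samples arrived since the last call.
    if len(buffer.last_added_batches) == 0:
        return None
    new = buffer.last_added_batches
    buffer.last_added_batches = []
    return new  # (the actual mixing with replayed batches would follow here)
```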
# replay ratio = old / [old + new]
num_new = len(output_batches)
num_old = 0
while random.random() > num_old / (num_old + num_new):
we are not looking at self.replay_ratio here?
can you explain a bit how this while loop is statistically connected to self.replay_ratio?
sorry, I am still confused. old / [old + new] is the current replay_ratio, right?
The logic is that we stop adding replayed samples once random.random() is less than the current replay_ratio, which by gut feeling should result in an expected ratio of 0.5?
I feel like at the very least, self.replay_ratio should get involved in the math here, something like:
expected_replay_batches = self.replay_ratio * num_new
while random.random() < expected_replay_batches:
    output_batches.append(buffer.replay())
    expected_replay_batches -= 1.0
This has been fixed. There is also a test case now that verifies that different given ratios are being respected.
The randomness is necessary here to allow for "odd" ratios. E.g., if we assume that there is always just one new batch and the replay ratio is, say, 0.33, then only every 2nd returned batch should have 1 additional old batch in it.
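The stochastic-rounding behavior can be checked empirically. The sketch below is not the exact RLlib code, but implements the same idea: convert the ratio (old / [old + new]) into a proportion (old / new) and draw the number of old batches so that it is correct in expectation, which is what permits "odd" ratios like 0.33 with a single new batch:

```python
import random


def num_replay_batches(num_new, replay_ratio):
    """Draw how many old (replayed) batches to mix in so that, in
    expectation, old / (old + new) == replay_ratio."""
    # Convert ratio = old / (old + new) into proportion = old / new.
    replay_proportion = replay_ratio / (1.0 - replay_ratio)
    num_old = 0
    f = num_new * replay_proportion
    # Stochastic rounding: E[num_old] == num_new * replay_proportion.
    # For f <= 1 this is a Bernoulli(f) draw; for larger f, whole old
    # batches are added deterministically plus one fractional draw.
    while random.random() < f:
        num_old += 1
        f -= 1.0
    return num_old
```

Averaged over many calls with one new batch per call and replay_ratio = 1/3, the measured old/(old+new) composition converges to 1/3, matching the "every 2nd batch gets one old batch" intuition above.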
rllib/execution/parallel_requests.py
Outdated
`remote_fn()`, which will be applied to the actor(s) instead.

Args:
    trainer: The Trainer object that we run the sampling for.
pass in remote_requests_in_flight, instead of the entire trainer?
sorry, just double checking, it seems like trainer is still the parameter?
Nope, this had been fixed. Could you take another look?
@ExperimentalAPI
def asynchronous_parallel_requests(
this function looks really familiar. are we not replacing some existing logic with this util func call somewhere?
Yeah, sorry, moved it into a new module for better clarity: this function may not only be used to collect SampleBatches from a RolloutWorker, but works generically on any set (and any types!) of ray remote actors.
I removed the old code (not used yet anywhere anyways) in replay_ops.py.
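The pattern behind such a utility is generic: keep up to n requests in flight per worker and harvest whatever finished within a timeout. The following is a rough, stand-alone analogy using `concurrent.futures` threads instead of ray remote actors; all names (`async_parallel_requests`, `in_flight`, etc.) are hypothetical and this is not the RLlib API:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def async_parallel_requests(executor, in_flight, workers, fn,
                            max_in_flight_per_worker=2, timeout=1.0):
    """Keep up to `max_in_flight_per_worker` pending calls per worker and
    return {worker: [result, ...]} for whatever finished within `timeout`."""
    # Launch new requests for workers that still have capacity.
    for w in workers:
        pending = in_flight.setdefault(w, set())
        while len(pending) < max_in_flight_per_worker:
            pending.add(executor.submit(fn, w))
    # Wait for at least one result (or until the timeout expires).
    all_pending = set().union(*in_flight.values())
    done, _ = wait(all_pending, timeout=timeout, return_when=FIRST_COMPLETED)
    # Harvest finished futures, grouped by worker.
    results = {}
    for w, pending in in_flight.items():
        finished = pending & done
        pending -= finished
        for fut in finished:
            results.setdefault(w, []).append(fut.result())
    return results
```

Callers invoke this repeatedly inside their training loop; slow workers simply deliver their results on a later call, which is what keeps the overall loop asynchronous.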
rllib/policy/policy.py
Outdated
| """ | ||
| # Sample a batch from the given replay actor. | ||
| # For better performance, make sure the replay actor is co-located | ||
| # with this policy (on the same node). |
maybe this comment needs to be updated?
where are we making sure the replay_actor is co-located with the policy?
The driver (execution plan/Trainer.setup) needs to make sure of that. But it's not hard-required, just better for performance as we don't have to send the batch across the wire, then.
I see. Can you update the comment a little bit and say:
"For better performance, the trainer will try to schedule replay actors co-located with this policy"
something like that? thanks.
I'm not sure this comment should be here (we don't know for sure what the Trainer will do; the policy has no influence on its Trainer). We should simply clarify that it would be better if they WERE on the same node, but that it's not a hard requirement, and that the Trainer (which creates both buffer and policy) needs to take care of this, not the policy itself.
I'll fix.
gjoliver
left a comment
couple of minor comments, one math question. thanks.
gjoliver
left a comment
sorry, still have 2 comments.
# Mix buffer's last added batches with older replayed batches.
with self.replay_timer:
    output_batches = self.last_added_batches[policy_id].copy()
    self.last_added_batches[policy_id].clear()
since you clear right after copy, why not:
output_batches = self.last_added_batches[policy_id]
self.last_added_batches[policy_id] = []
?
do you think this is a good idea? this is the only comment I have left.
Great catch! Indeed, it saves the copy. Done. :)
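For illustration, the difference between the two variants on toy data (hypothetical names, not the actual buffer code): the copy-then-clear version walks the list twice, while reassignment just hands the existing list over and installs a fresh empty one.

```python
# Variant 1 (original): copy, then clear -- two passes over the list.
last_added_batches = {"pol1": ["batch_a", "batch_b"]}
output_batches = last_added_batches["pol1"].copy()
last_added_batches["pol1"].clear()

# Variant 2 (suggested): take over the existing list, install a fresh one.
last_added_batches = {"pol1": ["batch_a", "batch_b"]}
output_batches = last_added_batches["pol1"]
last_added_batches["pol1"] = []
```

Both variants leave `output_batches` holding the old contents and the dict entry empty; only the second avoids the O(n) copy.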
On Thu, Jan 27, 2022 at 7:52 AM, Sven Mika commented on this pull request, in rllib/execution/buffers/mixin_replay_buffer.py (#21649):
with self.replay_timer:
    output_batches = self.last_added_batches[policy_id].copy()
    self.last_added_batches[policy_id].clear()

# No replay desired -> Return here.
if self.replay_ratio == 0.0:
    return SampleBatch.concat_samples(output_batches)
# Only replay desired -> Return a (replayed) sample from the
# buffer.
elif self.replay_ratio == 1.0:
    return buffer.replay()

# Replay ratio = old / [old + new]
# Replay proportion: old / new
num_new = len(output_batches)
replay_proportion = self.replay_proportion
I think it's correct. That's how we also did it in the existing mixin
implementation (in execution/replay_ops.py).
Also the new test cases show that this logic is ok. We test for different
replay ratios and measure the average batch composition (old vs new
samples).
oh, I get it, yeah, we are doing the same replay_ratio to replay_proportion
computation above now.
ok, never mind, I missed that part, my bad.
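For reference, the ratio-to-proportion conversion mentioned above follows directly from the two definitions in the snippet (replay ratio = old / [old + new], replay proportion = old / new). A one-liner sketch, assuming replay_ratio < 1.0 (hypothetical helper name):

```python
def to_replay_proportion(replay_ratio):
    """Solve old/(old+new) == r for old/new: proportion = r / (1 - r)."""
    assert 0.0 <= replay_ratio < 1.0
    return replay_ratio / (1.0 - replay_ratio)
```

E.g., a replay ratio of 0.75 (three old batches per new one) corresponds to a proportion of 3.0.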
# By default, use Glorot unform initializer.
if initializer is None:
-   initializer = flax.nn.initializers.xavier_uniform()
+   initializer = nn.initializers.xavier_uniform()
@sven1977 I'm seeing AttributeError: module 'flax' has no attribute 'nn' on 1.11.0 release branch. Is this line cherry-pickable, or does it need to be cherry picked into 1.11.0?
Yes, this line is cherry-pickable. Let's also fix the comment:
Fixed code:
# By default, use Glorot uniform initializer.
if initializer is None:
initializer = nn.initializers.xavier_uniform()
Preparatory PR for multi-agent, multi-GPU learning agent (alpha-star style) #2.

- Adds the `asynchronous_parallel_requests` utility to rllib/execution/parallel_requests.py. Allows sending (up to n in-flight) parallel remote requests to a set of actors and collecting and returning those results that are available (given some timeout).
- Makes the `Policy` class usable as a remote ray actor, adding e.g. `get_host()` and also a (preliminary) `learn_on_batch_from_replay_buffer` method.

Why are these changes needed?
Related issue number
Checks

- I've run scripts/format.sh to lint the changes in this PR.