ir: drain and close deployment notification channels, fix #3237 (#3489)
cthulhu-rider merged 1 commit into master
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #3489      +/-   ##
==========================================
- Coverage   23.43%   23.41%   -0.02%
==========================================
  Files         669      669
  Lines       50312    50334      +22
==========================================
- Hits        11790    11788       -2
- Misses      37601    37625      +24
  Partials      921      921
```
The problem is that deploy code subscribes and never unsubscribes, while it spawns some readers that are bound to the deploy context. So once Deploy() is done they can (and should) exit, but we still haven't performed unsubscription, so there is a race between events and unsubscription that's especially annoying with small block times.

Ideally it's deploy code that should be doing this (it spawns the receivers), but draining here concurrently with deploy threads should be sufficient too.

Signed-off-by: Roman Khimov <roman@nspcc.ru>
4063e33 to 5f38926
Report with 50ms blocks: https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3673-1753819315/index.html
```go
for i := range x.subs {
	_ = x.wsClient.Unsubscribe(x.subs[i])
}
// Unsubscription is done, it's safe to close().
```
Is it? If Unsubscribe prevents further writes to the channel, then there should be no need to drain. But since we are draining, writes are possible, and then it is unsafe to close.
The problem happens when:
- deploy stops and cancels its context
- channel reader routines (notary/block) exit
- we get a new block/notary event and WSClient tries to push it into the channel, but there is no one left to read from it
- we try to unsubscribe, but we can't, because WSClient is stuck trying to deliver the event
That's exactly why readers are spawned in a separate goroutine above. They can run concurrently with the proper deploy readers, since those are not guaranteed to have exited by the time we get here, but that's not a problem: no one cares about these events anymore.
Now that we are 100% sure we have some readers, we unsubscribe, and we will not get any events after that. All events arrive (and get pushed into the channels by WSClient) before we get the unsubscription reply. So it's safe to close() and release the reader routine.
But I'd also ask @AnnaShaleva here.
Think I got it. This is just one way to exit the routine above; we don't actually have to drain all sub channels (they are free after unsub). This explains why it's legitimate to exit upon closing any of the channels.
is it?
It is, Roman is right. By the end of Unsubscribe execution it is guaranteed that no events will be sent to this channel anymore.
Another possible caveat I see here: if the same channel is reused to receive headers/requests from another subscription (one with another ID), then we can't close it, because it's still possible to receive some event from that subscription. But from what I see, your use case doesn't imply multiple subscriptions sharing the same channel; every subscription creates and maintains its own channel. So it's not a problem for this code.
Another caveat that I see: sometimes WSClient may close receiver channels by itself. That happens on missed event handling or on WS reader loop exit (at the end of WSClient's life, in fact). So it's important to ensure that these channels are not yet closed by the moment of the cancelSubs call.
A missed event is the only real possibility (the client can only be closed after unsubscription), but that would lead to deployment failure, and there would be no unsubscription at all. At least it's OK for now, I think.
neofs-node/pkg/innerring/innerring.go, lines 554 to 559 in 0e1b4ce