
ir: drain and close deployment notification channels, fix #3237 (#3489)

Merged
cthulhu-rider merged 1 commit into master from improve-deploy-sync
Jul 30, 2025
Conversation

@roman-khimov
Member

The problem is that the deploy code subscribes but never unsubscribes, while it spawns some readers that are bound to the deploy context. So once Deploy() is done they can (and should) exit, but we still haven't performed the unsubscription, so there is a race between events and unsubscription that's especially annoying with small block times.

Ideally it's the deploy code that should be doing this (it spawns the receivers), but draining here concurrently with the deploy threads should be sufficient too.
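The drain-then-unsubscribe-then-close ordering being discussed can be sketched as follows; `drainUntilClosed` and the channel names are illustrative stand-ins, not the actual neofs-node deploy code:

```go
package main

import (
	"fmt"
	"sync"
)

// drainUntilClosed reads events until the channel is closed and returns
// the count. This is the drainer role: it keeps the channel readable so
// the event source never blocks on delivery after Deploy()'s own
// readers have exited.
func drainUntilClosed(events <-chan int) int {
	n := 0
	for range events {
		n++
	}
	return n
}

func main() {
	events := make(chan int) // unbuffered notification channel

	var wg sync.WaitGroup
	var drained int
	wg.Add(1)
	go func() { // spawn the drainer before unsubscribing
		defer wg.Done()
		drained = drainUntilClosed(events)
	}()

	events <- 1 // late events arriving after Deploy()'s readers exited
	events <- 2

	// Unsubscription would happen here; once it returns no more events
	// can arrive, so close() is safe and releases the drainer.
	close(events)
	wg.Wait()
	fmt.Println("drained", drained, "late events")
}
```

The important part is the order: the drainer must already be running before unsubscription, and close() must come only after unsubscription has completed.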

@codecov

codecov bot commented Jul 29, 2025

Codecov Report

❌ Patch coverage is 0% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 23.41%. Comparing base (6c4da00) to head (5f38926).
⚠️ Report is 10 commits behind head on master.

Files with missing lines Patch % Lines
pkg/innerring/deploy.go 0.00% 25 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3489      +/-   ##
==========================================
- Coverage   23.43%   23.41%   -0.02%     
==========================================
  Files         669      669              
  Lines       50312    50334      +22     
==========================================
- Hits        11790    11788       -2     
- Misses      37601    37625      +24     
  Partials      921      921              


The problem is that deploy code subscribes and never unsubscribes while it
spawns some readers that are bound to deploy context. So once Deploy() is done
they can (and should) exit, but we still haven't performed unsubscription, so
there is a race between events and unsubscription that's especially annoying
with small block times.

Ideally it's deploy code that should be doing this (it spawns receivers), but
draining here concurrently to deploy threads should be sufficient too.

Signed-off-by: Roman Khimov <roman@nspcc.ru>
@roman-khimov roman-khimov force-pushed the improve-deploy-sync branch from 4063e33 to 5f38926 Compare July 29, 2025 18:26
@roman-khimov
Member Author

for i := range x.subs {
	_ = x.wsClient.Unsubscribe(x.subs[i])
}
// Unsubscription is done, it's safe to close().
Contributor

is it? If Unsubscribe prevents further writes to the channel, then there should be no need to drain. But since we are doing it, writes are possible, so it is unsafe to close.

Member Author

The problem happens when:

  • deploy stops and cancels its context
  • channel reader routines (notary/block) exit
  • we get a new block/notary and WSClient tries to push it into the channel, but there is no one to read from it
  • we're trying to unsubscribe, but we can't because WSClient is stuck trying to deliver an event

That's exactly why the readers are spawned into a separate goroutine above. They can run concurrently with the proper deploy readers, since those are not guaranteed to exit by the time we get here, but it's not a problem: no one really cares about these events.

Now that we are 100% sure we have some readers, we unsubscribe, and we will not get any events after that. All events come (and get pushed into the channel by WSClient) before we get the unsubscription reply. So it's safe to close() to release the reader routine.

But I'd also ask @AnnaShaleva here.

Contributor

I think I got it. This is just one way to exit the routine above; we don't actually have to drain all sub channels (they are free after unsub). This explains why it's legitimate to exit upon closing any of the channels.
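A minimal sketch of that exit-on-any-close behavior, assuming hypothetical channel names rather than the actual reader from pkg/innerring/deploy.go:

```go
package main

import "fmt"

// drain reads from two subscription channels and returns as soon as
// either one is closed, so closing any single channel is enough to
// release the goroutine.
func drain(a, b <-chan int) int {
	n := 0
	for {
		select {
		case _, ok := <-a:
			if !ok {
				return n
			}
			n++
		case _, ok := <-b:
			if !ok {
				return n
			}
			n++
		}
	}
}

func main() {
	blockCh := make(chan int, 1)
	notaryCh := make(chan int)
	blockCh <- 1
	close(notaryCh) // closing just one of the channels...
	n := drain(blockCh, notaryCh)
	// ...is enough to make the reader return; n may be 0 or 1 here
	// because select picks among ready cases at random.
	fmt.Println("reader exited,", n, "events seen")
}
```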

Member

is it?

It is, Roman is right. By the end of Unsubscribe execution it is guaranteed that no events will be sent to this channel anymore.

Another possible caveat I see here: if the same channel is reused to receive headers/requests from another subscription (a subscription with another ID), then we can't close it, because it's possible to receive some event from that subscription. But from what I see, your use case doesn't imply multiple subscriptions sharing the same channel; every subscription creates and maintains its own channel. So it's not a problem for this code.
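The shared-channel caveat can be demonstrated with a toy sketch; `sendSafely` is a hypothetical helper, not part of WSClient or the deploy code:

```go
package main

import "fmt"

// sendSafely shows why closing a channel shared between subscriptions
// is dangerous: a later delivery from the still-active subscription
// panics with "send on closed channel", which recover turns into a
// false return here.
func sendSafely(ch chan<- int, v int) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false // send on closed channel
		}
	}()
	ch <- v
	return true
}

func main() {
	shared := make(chan int, 1)
	close(shared) // first subscription cancelled and closed the channel
	// A second subscription still delivering into shared would now panic.
	fmt.Println("second delivery ok:", sendSafely(shared, 1))
	// prints "second delivery ok: false"
}
```

Since every subscription in this code owns its channel, this failure mode cannot occur here.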

Member

Another caveat that I see: sometimes WSClient may close the receiver channels by itself. This happens on missed event handling or on WS reader loop exit (at the end of the WSClient's lifetime, in fact). So it's important to ensure that these channels are not yet closed by the moment of the cancelSubs call.

Member Author
@roman-khimov Jul 30, 2025

A missed event is the only possibility (the client can be closed after unsubscription), but that will lead to a deployment failure, and there won't be any unsubscription at all. At least it's OK for now, I think.

err = deploy.Deploy(ctx, deployPrm)
if err != nil {
	return nil, fmt.Errorf("deploy FS chain: %w", err)
}
fschain.cancelSubs()

@cthulhu-rider cthulhu-rider merged commit e444485 into master Jul 30, 2025
21 of 24 checks passed
@cthulhu-rider cthulhu-rider deleted the improve-deploy-sync branch July 30, 2025 09:30