ir: drain and close deployment notification channels, fix #3237 (#3489)
cthulhu-rider merged 1 commit into master
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##           master    #3489      +/-   ##
==========================================
- Coverage   23.43%   23.41%   -0.02%
==========================================
  Files         669      669
  Lines       50312    50334      +22
==========================================
- Hits        11790    11788       -2
- Misses      37601    37625      +24
  Partials      921      921
```
The problem is that deploy code subscribes and never unsubscribes, while it spawns some readers that are bound to the deploy context. So once Deploy() is done they can (and should) exit, but we still haven't performed unsubscription, so there is a race between events and unsubscription that's especially annoying with small block times.

Ideally it's deploy code that should be doing this (it spawns the receivers), but draining here concurrently with deploy threads should be sufficient too.

Signed-off-by: Roman Khimov <roman@nspcc.ru>
4063e33 to 5f38926
Report with 50ms blocks: https://rest.fs.neo.org/HXSaMJXk2g8C14ht8HSi7BBaiYZ1HeWh2xnWPGQCg4H6/3673-1753819315/index.html
```go
for i := range x.subs {
	_ = x.wsClient.Unsubscribe(x.subs[i])
}
// Unsubscription is done, it's safe to close().
```
Is it? If Unsubscribe prevents further writes to the channel, then there should be no need to drain. But since we are draining, writes are possible, and then it is unsafe to close.
The problem happens when:
- deploy stops and cancels its context
- channel reader routines (notary/block) exit
- we get a new block/notary event and WSClient tries to push it into the channel, but there is no one left to read from it
- we try to unsubscribe, but we can't, because WSClient is stuck trying to deliver the event
That's exactly why readers are spawned in a separate goroutine above. They can run concurrently with the proper deploy readers, since those are not guaranteed to have exited by the time we get here, but that's not a problem: no one cares about these events anymore.
Now that we are 100% sure we have some readers, we unsubscribe, and we will not get any events after that. All events arrive (and get pushed into the channels by WSClient) before we get the unsubscription reply. So it's safe to close() and release the reader routine.
But I'd also ask @AnnaShaleva here.
Think I got it. This is just one way to exit the routine above; we don't actually have to drain all sub channels (they are free after unsub). This explains why it's legitimate to exit upon closing any of the channels.
is it?
It is, Roman is right. By the end of Unsubscribe execution it is guaranteed that no events will be sent to this channel anymore.
Another possible caveat I see here: if the same channel is reused to receive headers/requests from another subscription (one with another ID), then we can't close it, because it's still possible to receive some event from that subscription. But from what I see, your use case doesn't imply multiple subscriptions sharing the same channel; every subscription creates and maintains its own channel. So it's not a problem for this code.
Another caveat that I see: sometimes WSClient may close receiver channels by itself. That happens on missed event handling or on WS reader loop exit (at the end of WSClient's life, in fact). So it's important to ensure that these channels are not yet closed by the moment of the cancelSubs call.
A missed event is the only real possibility (the client can only be closed after unsubscription), but that would lead to deployment failure, and there would be no unsubscription at all. At least it's OK for now, I think.
neofs-node/pkg/innerring/innerring.go, lines 554 to 559 in 0e1b4ce