Fix race condition when reading from discovery by fabian-heib · Pull Request #113 · dell/gobrick

fabian-heib · 2025-08-05T11:56:36Z

Description

Fix race condition when reading from discovery. Will also be sending this to the Dell service request we have.

The race condition locks the thread in the discovery mode, meaning it will be there forever until the pod/binary that runs this is restarted. This in CSI case will also block further request since there is a lock on the request itself to not run concurrently.

In our case it resulted in that some volumes would never mount. This was a example on how to solve it, I might not have enough experience in this codebase to know if this is a acceptable solution.

Checklist:

I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
I have verified that new and existing unit tests pass locally with my changes
I have not allowed coverage numbers to degenerate
I have maintained at least 90% code coverage
I have commented on my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
[x} Backward compatibility is not broken

How Has This Been Tested?

Running it locally with
go test ./... -timeout 10m -race -run ^TestNVME_Connector_ConnectVolume$ -count=100

Without the test i have added i was unable to prove that it ever happens.

AkshaySainiDell · 2025-08-06T12:36:09Z

@fabian-heib Thanks for raising this issue and proposing a fix — the timeout approach seems like a reasonable safeguard against indefinite blocking.

Could you share a bit more context around the problem you encountered?

Was the channel consistently not returning a value, or was it an intermittent issue?
Did it occur under specific conditions (e.g., certain workloads, cluster states, or timing)?
How did you determine that the discovery thread was stuck and that it was causing volume mounts to fail?

Understanding the exact scenario will help us assess whether the timeout is the most appropriate long-term solution, or if a more context-aware mechanism might be better.

For reference, the discoverDevice method is responsible for writing to the channel. Even if no matching device is found, it still sends an empty result after two attempts — so the channel should always be written to unless something unexpected happens.

Appreciate your work on this!

fabian-heib · 2025-08-06T13:44:54Z

Was the channel consistently not returning a value, or was it an intermittent issue?

This was a intermittent issue, this can occur when the scheduled cpu does a lot of context switches and/or other work that could park the current goroutine.

Did it occur under specific conditions (e.g., certain workloads, cluster states, or timing)?

Timing, MCP (MachineConfigPool) trigger a drain and reboot, the moment the node went up and was ready to take workload again it got assigned volumes but the nvme wasn't attached yet.

How did you determine that the discovery thread was stuck and that it was causing volume mounts to fail?

Found in the logs that the same log happend when it stopped and it was that discovery goroutines are still running, tried to stress test the host and continue troubleshooting a bit until we notice a restart of the driver sovled it. Found source code and started digging.

For reference, the discoverDevice method is responsible for writing to the channel. Even if no matching device is found, it still sends an empty result after two attempts — so the channel should always be written to unless something unexpected happens.

This is true but it does it only once and not twice like the code suggest. If this log ever happens and the disk is not attached it will go back to the begging of the loop waiting forever for the channel to return data, that will never happen since it already return it once. I would suggest rewriting this part since it's confusing of having a async discovery only run once but the listing side handles more than one item in the channel, also the listing side is waiting on the async call? I don't see the reason for that, it just add complexity to something that could be simpler.

The are alot of fixes for this kind of concurrency issue.
Instead of defering the waitgroup it could have been placed just above the result being sent into the channel

Currently don't have access to the support ticket number but there is more info in there and logs.

AkshaySainiDell · 2025-08-07T09:21:09Z

Thanks for the detailed response — that really helps connect the dots.

I was thinking we might be able to align the behavior more closely with the context timeout already used in connectMultipathDevice. Instead of a fixed timeout like 10 seconds. That way, we still avoid hanging, but we also keep everything tied to the parent context’s lifecycle.

Also, just curious — were you able to test this change under the same conditions where the issue occurred ? Would be helpful to know if it consistently prevents the hang and lets volume mounts continue smoothly.

fabian-heib · 2025-08-07T10:48:28Z

Thanks for the detailed response — that really helps connect the dots.

I was thinking we might be able to align the behavior more closely with the context timeout already used in connectMultipathDevice. Instead of a fixed timeout like 10 seconds. That way, we still avoid hanging, but we also keep everything tied to the parent context’s lifecycle.

I can create a new branch on how I would have done this to give an example

Also, just curious — were you able to test this change under the same conditions where the issue occurred ? Would be helpful to know if it consistently prevents the hang and lets volume mounts continue smoothly.

We haven't build the csi driver based on this patch to test it out. We could but that would be some work to create that image with this patch. I have only tested with the unit tests nothing more as of now.

fabian-heib · 2025-08-07T13:06:27Z

Here is a small rewrite to skip the goroutines for the discovery since it wasn't needed anyway.

Want to use that instead of checking the context? If thats the case I will update this PR

AkshaySainiDell · 2025-08-08T06:28:29Z

Thanks for the quick follow-up and for offering to rewrite the logic — really appreciate that!

I think aligning the behavior with the discoveryCtx timeout used in connectMultipathDevice would be a great improvement. It keeps things consistent and avoids hardcoded timeouts, while still respecting the parent context’s lifecycle — which feels cleaner and more predictable. Could you please update the current changes to be context-aware?

fabian-heib · 2025-08-11T07:37:54Z

@AkshaySainiDell Hello

The linting error doesn't say whats wrong with it, tried to run it locally but I different result, I didn't get that with make check also the Coverage seems to be broken?

AkshaySainiDell · 2025-08-11T12:49:30Z

Hi @fabian-heib, there's a formatting issue in nvme_test.go. The logs show:

  Error: nvme_test.go:303:1: File is not properly formatted (gofumpt)

We're currently looking into the coverage failure and will share an update once we have more information.

AkshaySainiDell · 2025-08-13T12:33:05Z

@fabian-heib It looks like the failure is due to the absence of our organization-specific Threshold and Exclusion settings in your forked repository.

To help resolve this, here are a couple of suggestions:

Set the following variables in your forked repo:
- CODE_COVERAGE_TARGET=85
- EXCLUDE_DIRECTORIES=./internal|./inttests
Alternatively, if possible, you could consider opening the PR directly in the GoBrick repository instead of from your fork. This would allow the CI to automatically pick up the correct coverage settings and exclusions.

fabian-heib · 2025-08-13T12:59:50Z

@AkshaySainiDell Done

That is a really bad practice and I could change the values to some meaningless values for my repo and affect your pipeline with it, as example lower the coverage to 50%, or even worse inject some code/execution to tamper with other things.
That's not how github works. I can't create a branch in this repo. If I want to contribute I have to create a fork and then do a pull request from that.
I tested also, I as a user don't have permission to randomly create branches in this repo as it should.

rajkumar-palani · 2025-08-13T15:09:10Z

Thanks @AkshaySainiDell, for the comments.
Thanks @fabian-heib, we tried options but we can't change the variables through forked repo. We are checking this with our team, will revert back on this.
@shaynafinocchiaro - "Organization-specific Threshold and Exclusion" settings are not taken into an account for the external PRs. Please check and do the needful.

shaynafinocchiaro · 2025-08-13T16:09:53Z

Hello @fabian-heib, apologies for the trouble. I have made the necessary changes on our end, but sometimes GitHub caches a workflow version. Please push a new commit (can be trivial like adding a new line) or close and re-open your PR to re-fetch the latest workflow. Thank you.

fabian-heib · 2025-08-15T09:08:53Z

I fixed that my commit has a signature and that is verified, sorry for the commit mess I made there for a small while.

rajkumar-palani

Hi @fabian-heib - We are good with the code logic, but we have an internal process to validate the changes across drivers before merging it with main and to be made available in a release candidate. It may take some time but sure it will be done and we will keep you posted on the progress. if possible, can you please attach the logs to ensure the change is working in your environment as expected. Thank You.

fabian-heib · 2025-08-18T14:24:06Z

I don't have access to the logs right now, different laptop. The logs should be available on the dell support case 213377876

rajkumar-palani · 2025-08-19T06:36:25Z

I don't have access to the logs right now, different laptop. The logs should be available on the dell support case 213377876

@fabian-heib - We are not having access to see the support case. please share the logs here, if possible. Thank You.

fabian-heib · 2025-08-19T08:09:23Z

@rajkumar-palani We don't want to share those logs public on github. Please get access to the case for the logs if they are required

harishp8889

lgtm

fabian-heib requested review from alankar-verma, alikdell, atye, bpjain2004, chaganti-rajitha, coulof, francis-nijay, harishp8889, hoppea2, khareRajshree, kumarp20, niranjan-n1, prablr79, samihan-dell, satyakonduri, shaynafinocchiaro, tdawe and xuluna as code owners August 5, 2025 11:56

fabian-heib force-pushed the main branch from 422612b to 932bf5c Compare August 8, 2025 07:24

fabian-heib force-pushed the main branch from 932bf5c to 728b778 Compare August 11, 2025 14:32

rajkumar-palani mentioned this pull request Aug 12, 2025

Updated common actions with default values dell/common-github-actions#178

Merged

5 tasks

fabian-heib force-pushed the main branch from 728b778 to 4a22c35 Compare August 14, 2025 07:30

AkshaySainiDell previously approved these changes Aug 14, 2025

View reviewed changes

fabian-heib force-pushed the main branch from 764480c to ed29f63 Compare August 15, 2025 09:03

fabian-heib dismissed AkshaySainiDell’s stale review via 8963d15 August 15, 2025 09:05

fabian-heib force-pushed the main branch 2 times, most recently from 8963d15 to b776301 Compare August 15, 2025 09:05

Fix race condition when reading from discovery

0e4cb16

fabian-heib force-pushed the main branch from b776301 to 0e4cb16 Compare August 15, 2025 09:07

AkshaySainiDell approved these changes Aug 18, 2025

View reviewed changes

rajkumar-palani approved these changes Aug 18, 2025

View reviewed changes

rajkumar-palani reviewed Aug 18, 2025

View reviewed changes

Merge branch 'main' into main

6a4d946

harishp8889 approved these changes Sep 4, 2025

View reviewed changes

AkshaySainiDell merged commit cfeca78 into dell:main Sep 4, 2025
6 checks passed

Conversation

fabian-heib commented Aug 5, 2025

Description

Checklist:

How Has This Been Tested?

Uh oh!

AkshaySainiDell commented Aug 6, 2025

Uh oh!

fabian-heib commented Aug 6, 2025

Uh oh!

AkshaySainiDell commented Aug 7, 2025

Uh oh!

fabian-heib commented Aug 7, 2025

Uh oh!

fabian-heib commented Aug 7, 2025

Uh oh!

AkshaySainiDell commented Aug 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fabian-heib commented Aug 11, 2025

Uh oh!

AkshaySainiDell commented Aug 11, 2025

Uh oh!

AkshaySainiDell commented Aug 13, 2025

Uh oh!

fabian-heib commented Aug 13, 2025

Uh oh!

rajkumar-palani commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shaynafinocchiaro commented Aug 13, 2025

Uh oh!

fabian-heib commented Aug 15, 2025

Uh oh!

rajkumar-palani left a comment

Choose a reason for hiding this comment

Uh oh!

fabian-heib commented Aug 18, 2025

Uh oh!

rajkumar-palani commented Aug 19, 2025

Uh oh!

fabian-heib commented Aug 19, 2025

Uh oh!

harishp8889 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

AkshaySainiDell commented Aug 8, 2025 •

edited

Loading

rajkumar-palani commented Aug 13, 2025 •

edited

Loading