Skip to content

Fix race condition when reading from discovery#113

Merged
AkshaySainiDell merged 2 commits into
dell:mainfrom
fabian-heib:main
Sep 4, 2025
Merged

Fix race condition when reading from discovery#113
AkshaySainiDell merged 2 commits into
dell:mainfrom
fabian-heib:main

Conversation

@fabian-heib

Copy link
Copy Markdown
Contributor

Description

Fix race condition when reading from discovery. Will also be sending this to the Dell service request we have.

The race condition locks the thread in the discovery mode, meaning it will be there forever until the pod/binary that runs this is restarted. This in CSI case will also block further request since there is a lock on the request itself to not run concurrently.

In our case it resulted in that some volumes would never mount. This was a example on how to solve it, I might not have enough experience in this codebase to know if this is a acceptable solution.

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented on my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • [x} Backward compatibility is not broken

How Has This Been Tested?

Running it locally with
go test ./... -timeout 10m -race -run ^TestNVME_Connector_ConnectVolume$ -count=100

Without the test i have added i was unable to prove that it ever happens.

@AkshaySainiDell

Copy link
Copy Markdown
Contributor

@fabian-heib Thanks for raising this issue and proposing a fix — the timeout approach seems like a reasonable safeguard against indefinite blocking.

Could you share a bit more context around the problem you encountered?

  • Was the channel consistently not returning a value, or was it an intermittent issue?
  • Did it occur under specific conditions (e.g., certain workloads, cluster states, or timing)?
  • How did you determine that the discovery thread was stuck and that it was causing volume mounts to fail?

Understanding the exact scenario will help us assess whether the timeout is the most appropriate long-term solution, or if a more context-aware mechanism might be better.

For reference, the discoverDevice method is responsible for writing to the channel. Even if no matching device is found, it still sends an empty result after two attempts — so the channel should always be written to unless something unexpected happens.

Appreciate your work on this!

@fabian-heib

Copy link
Copy Markdown
Contributor Author
  • Was the channel consistently not returning a value, or was it an intermittent issue?

This was a intermittent issue, this can occur when the scheduled cpu does a lot of context switches and/or other work that could park the current goroutine.

  • Did it occur under specific conditions (e.g., certain workloads, cluster states, or timing)?

Timing, MCP (MachineConfigPool) trigger a drain and reboot, the moment the node went up and was ready to take workload again it got assigned volumes but the nvme wasn't attached yet.

  • How did you determine that the discovery thread was stuck and that it was causing volume mounts to fail?

Found in the logs that the same log happend when it stopped and it was that discovery goroutines are still running, tried to stress test the host and continue troubleshooting a bit until we notice a restart of the driver sovled it. Found source code and started digging.

For reference, the discoverDevice method is responsible for writing to the channel. Even if no matching device is found, it still sends an empty result after two attempts — so the channel should always be written to unless something unexpected happens.

This is true but it does it only once and not twice like the code suggest. If this log ever happens and the disk is not attached it will go back to the begging of the loop waiting forever for the channel to return data, that will never happen since it already return it once. I would suggest rewriting this part since it's confusing of having a async discovery only run once but the listing side handles more than one item in the channel, also the listing side is waiting on the async call? I don't see the reason for that, it just add complexity to something that could be simpler.

The are alot of fixes for this kind of concurrency issue.
Instead of defering the waitgroup it could have been placed just above the result being sent into the channel

Currently don't have access to the support ticket number but there is more info in there and logs.

@AkshaySainiDell

Copy link
Copy Markdown
Contributor

Thanks for the detailed response — that really helps connect the dots.

I was thinking we might be able to align the behavior more closely with the context timeout already used in connectMultipathDevice. Instead of a fixed timeout like 10 seconds. That way, we still avoid hanging, but we also keep everything tied to the parent context’s lifecycle.

Also, just curious — were you able to test this change under the same conditions where the issue occurred ? Would be helpful to know if it consistently prevents the hang and lets volume mounts continue smoothly.

@fabian-heib

Copy link
Copy Markdown
Contributor Author

Thanks for the detailed response — that really helps connect the dots.

I was thinking we might be able to align the behavior more closely with the context timeout already used in connectMultipathDevice. Instead of a fixed timeout like 10 seconds. That way, we still avoid hanging, but we also keep everything tied to the parent context’s lifecycle.

I can create a new branch on how I would have done this to give an example

Also, just curious — were you able to test this change under the same conditions where the issue occurred ? Would be helpful to know if it consistently prevents the hang and lets volume mounts continue smoothly.

We haven't build the csi driver based on this patch to test it out. We could but that would be some work to create that image with this patch. I have only tested with the unit tests nothing more as of now.

@fabian-heib

Copy link
Copy Markdown
Contributor Author

Here is a small rewrite to skip the goroutines for the discovery since it wasn't needed anyway.

Want to use that instead of checking the context? If thats the case I will update this PR

@AkshaySainiDell

AkshaySainiDell commented Aug 8, 2025

Copy link
Copy Markdown
Contributor

Thanks for the quick follow-up and for offering to rewrite the logic — really appreciate that!

I think aligning the behavior with the discoveryCtx timeout used in connectMultipathDevice would be a great improvement. It keeps things consistent and avoids hardcoded timeouts, while still respecting the parent context’s lifecycle — which feels cleaner and more predictable. Could you please update the current changes to be context-aware?

@fabian-heib

Copy link
Copy Markdown
Contributor Author

@AkshaySainiDell Hello

The linting error doesn't say whats wrong with it, tried to run it locally but I different result, I didn't get that with make check also the Coverage seems to be broken?

@AkshaySainiDell

Copy link
Copy Markdown
Contributor

Hi @fabian-heib, there's a formatting issue in nvme_test.go. The logs show:

  Error: nvme_test.go:303:1: File is not properly formatted (gofumpt)

We're currently looking into the coverage failure and will share an update once we have more information.

@AkshaySainiDell

Copy link
Copy Markdown
Contributor

@fabian-heib It looks like the failure is due to the absence of our organization-specific Threshold and Exclusion settings in your forked repository.

To help resolve this, here are a couple of suggestions:

  1. Set the following variables in your forked repo:
    • CODE_COVERAGE_TARGET=85
    • EXCLUDE_DIRECTORIES=./internal|./inttests
  2. Alternatively, if possible, you could consider opening the PR directly in the GoBrick repository instead of from your fork. This would allow the CI to automatically pick up the correct coverage settings and exclusions.

@fabian-heib

Copy link
Copy Markdown
Contributor Author

@AkshaySainiDell Done

  1. That is a really bad practice and I could change the values to some meaningless values for my repo and affect your pipeline with it, as example lower the coverage to 50%, or even worse inject some code/execution to tamper with other things.
  2. That's not how github works. I can't create a branch in this repo. If I want to contribute I have to create a fork and then do a pull request from that.
    I tested also, I as a user don't have permission to randomly create branches in this repo as it should.

@rajkumar-palani

rajkumar-palani commented Aug 13, 2025

Copy link
Copy Markdown
Contributor

Thanks @AkshaySainiDell, for the comments.
Thanks @fabian-heib, we tried options but we can't change the variables through forked repo. We are checking this with our team, will revert back on this.
@shaynafinocchiaro - "Organization-specific Threshold and Exclusion" settings are not taken into an account for the external PRs. Please check and do the needful.

@shaynafinocchiaro

Copy link
Copy Markdown
Collaborator

Hello @fabian-heib, apologies for the trouble. I have made the necessary changes on our end, but sometimes GitHub caches a workflow version. Please push a new commit (can be trivial like adding a new line) or close and re-open your PR to re-fetch the latest workflow. Thank you.

@fabian-heib

Copy link
Copy Markdown
Contributor Author

I fixed that my commit has a signature and that is verified, sorry for the commit mess I made there for a small while.

@rajkumar-palani rajkumar-palani left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fabian-heib - We are good with the code logic, but we have an internal process to validate the changes across drivers before merging it with main and to be made available in a release candidate. It may take some time but sure it will be done and we will keep you posted on the progress. if possible, can you please attach the logs to ensure the change is working in your environment as expected. Thank You.

@fabian-heib

Copy link
Copy Markdown
Contributor Author

I don't have access to the logs right now, different laptop. The logs should be available on the dell support case 213377876

@rajkumar-palani

Copy link
Copy Markdown
Contributor

I don't have access to the logs right now, different laptop. The logs should be available on the dell support case 213377876

@fabian-heib - We are not having access to see the support case. please share the logs here, if possible. Thank You.

@fabian-heib

Copy link
Copy Markdown
Contributor Author

@rajkumar-palani We don't want to share those logs public on github. Please get access to the case for the logs if they are required

@harishp8889 harishp8889 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@AkshaySainiDell AkshaySainiDell merged commit cfeca78 into dell:main Sep 4, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants