Fix race condition when reading from discovery#113
Conversation
|
@fabian-heib Thanks for raising this issue and proposing a fix — the timeout approach seems like a reasonable safeguard against indefinite blocking. Could you share a bit more context around the problem you encountered?
Understanding the exact scenario will help us assess whether the timeout is the most appropriate long-term solution, or if a more context-aware mechanism might be better. For reference, the discoverDevice method is responsible for writing to the channel. Even if no matching device is found, it still sends an empty result after two attempts — so the channel should always be written to unless something unexpected happens. Appreciate your work on this! |
This was a intermittent issue, this can occur when the scheduled cpu does a lot of context switches and/or other work that could park the current goroutine.
Timing, MCP (MachineConfigPool) trigger a drain and reboot, the moment the node went up and was ready to take workload again it got assigned volumes but the nvme wasn't attached yet.
Found in the logs that the same log happend when it stopped and it was that
This is true but it does it only once and not twice like the code suggest. If this log ever happens and the disk is not attached it will go back to the begging of the loop waiting forever for the channel to return data, that will never happen since it already return it once. I would suggest rewriting this part since it's confusing of having a async discovery only run once but the listing side handles more than one item in the channel, also the listing side is waiting on the async call? I don't see the reason for that, it just add complexity to something that could be simpler. The are alot of fixes for this kind of concurrency issue. Currently don't have access to the support ticket number but there is more info in there and logs. |
|
Thanks for the detailed response — that really helps connect the dots. I was thinking we might be able to align the behavior more closely with the context timeout already used in connectMultipathDevice. Instead of a fixed timeout like 10 seconds. That way, we still avoid hanging, but we also keep everything tied to the parent context’s lifecycle. Also, just curious — were you able to test this change under the same conditions where the issue occurred ? Would be helpful to know if it consistently prevents the hang and lets volume mounts continue smoothly. |
I can create a new branch on how I would have done this to give an example
We haven't build the csi driver based on this patch to test it out. We could but that would be some work to create that image with this patch. I have only tested with the unit tests nothing more as of now. |
|
Here is a small rewrite to skip the goroutines for the discovery since it wasn't needed anyway. Want to use that instead of checking the context? If thats the case I will update this PR |
|
Thanks for the quick follow-up and for offering to rewrite the logic — really appreciate that! I think aligning the behavior with the discoveryCtx timeout used in connectMultipathDevice would be a great improvement. It keeps things consistent and avoids hardcoded timeouts, while still respecting the parent context’s lifecycle — which feels cleaner and more predictable. Could you please update the current changes to be context-aware? |
|
@AkshaySainiDell Hello The linting error doesn't say whats wrong with it, tried to run it locally but I different result, I didn't get that with |
|
Hi @fabian-heib, there's a formatting issue in nvme_test.go. The logs show: We're currently looking into the coverage failure and will share an update once we have more information. |
|
@fabian-heib It looks like the failure is due to the absence of our organization-specific Threshold and Exclusion settings in your forked repository. To help resolve this, here are a couple of suggestions:
|
|
@AkshaySainiDell Done
|
|
Thanks @AkshaySainiDell, for the comments. |
|
Hello @fabian-heib, apologies for the trouble. I have made the necessary changes on our end, but sometimes GitHub caches a workflow version. Please push a new commit (can be trivial like adding a new line) or close and re-open your PR to re-fetch the latest workflow. Thank you. |
8963d15 to
b776301
Compare
|
I fixed that my commit has a signature and that is verified, sorry for the commit mess I made there for a small while. |
rajkumar-palani
left a comment
There was a problem hiding this comment.
Hi @fabian-heib - We are good with the code logic, but we have an internal process to validate the changes across drivers before merging it with main and to be made available in a release candidate. It may take some time but sure it will be done and we will keep you posted on the progress. if possible, can you please attach the logs to ensure the change is working in your environment as expected. Thank You.
|
I don't have access to the logs right now, different laptop. The logs should be available on the dell support case 213377876 |
@fabian-heib - We are not having access to see the support case. please share the logs here, if possible. Thank You. |
|
@rajkumar-palani We don't want to share those logs public on github. Please get access to the case for the logs if they are required |
Description
Fix race condition when reading from discovery. Will also be sending this to the Dell service request we have.
The race condition locks the thread in the discovery mode, meaning it will be there forever until the pod/binary that runs this is restarted. This in CSI case will also block further request since there is a lock on the request itself to not run concurrently.
In our case it resulted in that some volumes would never mount. This was a example on how to solve it, I might not have enough experience in this codebase to know if this is a acceptable solution.
Checklist:
How Has This Been Tested?
Running it locally with
go test ./... -timeout 10m -race -run ^TestNVME_Connector_ConnectVolume$ -count=100Without the test i have added i was unable to prove that it ever happens.