-
Notifications
You must be signed in to change notification settings - Fork 136
[linstor] fix: prevent DRBD device race condition in updateDiscGran #1829
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Add physical device path existence check to prevent race condition where lsblk is called before kernel creates DRBD device node after drbdadm adjust. This fixes issue where resources ended up in Unknown state after satellite restart because: - drbdadm adjust completes successfully (brings devices up) - updateDiscGran() immediately tries lsblk - /dev/drbd* doesn't exist yet (kernel hasn't created node) - lsblk fails with "not a block device" exit code 32 - StorageException interrupts DeviceManager cycle - Device remains in incomplete state The patch now checks Files.exists(devicePath) before lsblk, allowing the check to be retried in next cycle when device node is available. Upstream: LINBIT/linstor-server#471 Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
📝 WalkthroughWalkthroughA patch file for LINSTOR Piraeus Server that adds defensive checks before DRBD adjustment operations. It validates device path existence and child VLM device validity before regenerating resources or adjusting device configurations, skipping operations when dependencies are unavailable rather than raising exceptions. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @kvaps, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical race condition within LINSTOR's device management, specifically affecting DRBD resources during adjustment operations. By introducing a preliminary check for the physical existence of device nodes, it prevents Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a race condition where a DRBD device node might not exist immediately after an adjust operation, causing lsblk to fail. The fix is to add a check in updateDiscGran to verify the device file's existence before calling lsblk. This prevents a StorageException and allows the system to retry in the next cycle.
The change is implemented by adding a new patch for DeviceHandlerImpl.java to the skip-adjust-when-device-inaccessible.diff file. The solution is well-targeted and correctly uses Files.exists() to solve the race condition. The code is clean and the accompanying comment clearly explains the reasoning. The fix appears solid and I have no further recommendations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff (1)
38-53: Consider extracting duplicated device accessibility check.This logic is duplicated verbatim in the adjust guard block (lines 70-90). Consider extracting to a helper method for maintainability.
♻️ Suggested helper method
private boolean areChildDevicesAccessible( DrbdRscData<Resource> drbdRscData, AccessContext workerCtx, boolean skipDisk ) throws AccessDeniedException { if (skipDisk || drbdRscData.getAbsResource().isDrbdDiskless(workerCtx)) { return true; } AbsRscLayerObject<Resource> dataChild = drbdRscData.getChildBySuffix(RscLayerSuffixes.SUFFIX_DATA); if (dataChild != null) { for (DrbdVlmData<Resource> drbdVlmData : drbdRscData.getVlmLayerObjects().values()) { VlmProviderObject<Resource> childVlm = dataChild.getVlmProviderObject(drbdVlmData.getVlmNr()); if (childVlm == null || !childVlm.exists() || childVlm.getDevicePath() == null) { return false; } } } return true; }Then use:
boolean canRegenerateResFile = areChildDevicesAccessible(drbdRscData, workerCtx, skipDisk); // ... boolean canAdjust = areChildDevicesAccessible(drbdRscData, workerCtx, skipDisk);
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
🧰 Additional context used
📓 Path-based instructions (1)
packages/system/**
📄 CodeRabbit inference engine (AGENTS.md)
Use Helm Chart umbrella pattern with vendored upstream charts in
charts/directory
Files:
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build
🔇 Additional comments (3)
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff (3)
14-25: LGTM - Defensive check prevents lsblk on non-existent device nodes.The added
Files.exists(Paths.get(devicePath))check correctly guards against the race condition where the kernel hasn't yet created the/dev/drbd*device node afterdrbdadm adjust. The condition ordering is correct (null check first).Note: A small TOCTOU window remains between the existence check and the actual lsblk call, but this is acceptable defensive programming that significantly reduces the race window.
94-109: LGTM - Exception handling correctly scoped.The try-catch and
restoreBackupResFile()are properly scoped to only execute when adjustment is actually attempted. This prevents incorrect backup restoration when adjustment is intentionally skipped.
111-116: Clarify adjustRequired retry semantics in non-deletion scenarios.When
canAdjustis false, the code setssetAdjustRequired(false), which prevents the adjust operation from being attempted in subsequent DeviceManager cycles. This is intentional for the deletion scenario (where LUKS devices are intentionally closed) since the resource won't exist after cleanup.However, for scenarios where devices become temporarily inaccessible during non-deletion operations, there is no shown mechanism to re-enable
adjustRequiredonce it's disabled. In contrast, the sibling code path fordiscGranusesUNINITIALIZED_SIZEto signal a retry is needed in the next cycle. Verify whether there is an upstream mechanism that re-enablesadjustRequiredwhen underlying devices become accessible again, particularly for toggle-disk operations where temporary device unavailability might occur outside of resource deletion.
|
/retest |
3 similar comments
|
/retest |
|
/retest |
|
/retest |
|
Successfully created backport PR for |
…1829) ## Problem After satellite restart with patched linstor-server, some DRBD resources ended up in **Unknown** state. Investigation revealed a race condition between `drbdadm adjust` completion and `updateDiscGran()` lsblk check: 1. `drbdadm adjust` completes successfully (21:53:49.791) - brings devices up 2. `updateDiscGran()` immediately tries to check discard granularity (21:53:49.799) 3. `/dev/drbd*` device node doesn't exist yet - kernel hasn't created it 4. `lsblk` fails with exit code 32: "not a block device" 5. `StorageException` interrupts DeviceManager cycle (21:53:50.811) 6. DRBD device remains in incomplete state → **Unknown** ### Timeline from logs ``` 21:53:49.791 - [DeviceManager] Resource 'pvc-aafbd92a' [DRBD] adjusted ✓ 21:53:49.799 - [lsblk_parser] ERROR: /dev/drbd1169 not a block device 21:53:49.804-50.807 - [lsblk_parser] 10+ retry errors (every ~100ms) 21:53:50.811 - [DeviceManager] ERROR: Error executing lsblk 21:53:50.878 - [DeviceManager] End cycle 9 (WITH ERROR!) ``` ## Solution Update `skip-adjust-when-device-inaccessible.diff` patch to add physical device path existence check before calling `lsblk`: - Check `Files.exists(devicePath)` in `updateDiscGran()` before lsblk - If device doesn't exist yet → skip check, set `discGran = UNINITIALIZED_SIZE` - Next DeviceManager cycle (few seconds later) → device node available → lsblk succeeds This complements the existing patch which checks **child layer** devices (LUKS/Storage) for deletion scenarios, while this fix addresses the **DRBD device itself** during adjust operations. ## Testing Manual `drbdadm up` on affected devices confirmed they were in down state and brought them back to UpToDate, proving the issue was incomplete device initialization. ## Related - Upstream linstor-server PR: LINBIT/linstor-server#471 (comment) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Enhanced robustness for storage operations by adding device accessibility validation. Operations now gracefully skip when device paths are unavailable or invalid, preventing unnecessary failures and improving system resilience during device access issues. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…1829) ## Problem After satellite restart with patched linstor-server, some DRBD resources ended up in **Unknown** state. Investigation revealed a race condition between `drbdadm adjust` completion and `updateDiscGran()` lsblk check: 1. `drbdadm adjust` completes successfully (21:53:49.791) - brings devices up 2. `updateDiscGran()` immediately tries to check discard granularity (21:53:49.799) 3. `/dev/drbd*` device node doesn't exist yet - kernel hasn't created it 4. `lsblk` fails with exit code 32: "not a block device" 5. `StorageException` interrupts DeviceManager cycle (21:53:50.811) 6. DRBD device remains in incomplete state → **Unknown** ### Timeline from logs ``` 21:53:49.791 - [DeviceManager] Resource 'pvc-aafbd92a' [DRBD] adjusted ✓ 21:53:49.799 - [lsblk_parser] ERROR: /dev/drbd1169 not a block device 21:53:49.804-50.807 - [lsblk_parser] 10+ retry errors (every ~100ms) 21:53:50.811 - [DeviceManager] ERROR: Error executing lsblk 21:53:50.878 - [DeviceManager] End cycle 9 (WITH ERROR!) ``` ## Solution Update `skip-adjust-when-device-inaccessible.diff` patch to add physical device path existence check before calling `lsblk`: - Check `Files.exists(devicePath)` in `updateDiscGran()` before lsblk - If device doesn't exist yet → skip check, set `discGran = UNINITIALIZED_SIZE` - Next DeviceManager cycle (few seconds later) → device node available → lsblk succeeds This complements the existing patch which checks **child layer** devices (LUKS/Storage) for deletion scenarios, while this fix addresses the **DRBD device itself** during adjust operations. ## Testing Manual `drbdadm up` on affected devices confirmed they were in down state and brought them back to UpToDate, proving the issue was incomplete device initialization. ## Related - Upstream linstor-server PR: LINBIT/linstor-server#471 (comment) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Enhanced robustness for storage operations by adding device accessibility validation. Operations now gracefully skip when device paths are unavailable or invalid, preventing unnecessary failures and improving system resilience during device access issues. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
…1829) ## Problem After satellite restart with patched linstor-server, some DRBD resources ended up in **Unknown** state. Investigation revealed a race condition between `drbdadm adjust` completion and `updateDiscGran()` lsblk check: 1. `drbdadm adjust` completes successfully (21:53:49.791) - brings devices up 2. `updateDiscGran()` immediately tries to check discard granularity (21:53:49.799) 3. `/dev/drbd*` device node doesn't exist yet - kernel hasn't created it 4. `lsblk` fails with exit code 32: "not a block device" 5. `StorageException` interrupts DeviceManager cycle (21:53:50.811) 6. DRBD device remains in incomplete state → **Unknown** ### Timeline from logs ``` 21:53:49.791 - [DeviceManager] Resource 'pvc-aafbd92a' [DRBD] adjusted ✓ 21:53:49.799 - [lsblk_parser] ERROR: /dev/drbd1169 not a block device 21:53:49.804-50.807 - [lsblk_parser] 10+ retry errors (every ~100ms) 21:53:50.811 - [DeviceManager] ERROR: Error executing lsblk 21:53:50.878 - [DeviceManager] End cycle 9 (WITH ERROR!) ``` ## Solution Update `skip-adjust-when-device-inaccessible.diff` patch to add physical device path existence check before calling `lsblk`: - Check `Files.exists(devicePath)` in `updateDiscGran()` before lsblk - If device doesn't exist yet → skip check, set `discGran = UNINITIALIZED_SIZE` - Next DeviceManager cycle (few seconds later) → device node available → lsblk succeeds This complements the existing patch which checks **child layer** devices (LUKS/Storage) for deletion scenarios, while this fix addresses the **DRBD device itself** during adjust operations. ## Testing Manual `drbdadm up` on affected devices confirmed they were in down state and brought them back to UpToDate, proving the issue was incomplete device initialization. ## Related - Upstream linstor-server PR: LINBIT/linstor-server#471 (comment) <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Enhanced robustness for storage operations by adding device accessibility validation. Operations now gracefully skip when device paths are unavailable or invalid, preventing unnecessary failures and improving system resilience during device access issues. <sub>✏️ Tip: You can customize this high-level summary in your review settings.</sub> <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Problem
After satellite restart with patched linstor-server, some DRBD resources ended up in Unknown state. Investigation revealed a race condition between
drbdadm adjustcompletion andupdateDiscGran()lsblk check:drbdadm adjustcompletes successfully (21:53:49.791) - brings devices upupdateDiscGran()immediately tries to check discard granularity (21:53:49.799)/dev/drbd*device node doesn't exist yet - kernel hasn't created itlsblkfails with exit code 32: "not a block device"StorageExceptioninterrupts DeviceManager cycle (21:53:50.811)Timeline from logs
Solution
Update
skip-adjust-when-device-inaccessible.diffpatch to add physical device path existence check before callinglsblk:Files.exists(devicePath)inupdateDiscGran()before lsblkdiscGran = UNINITIALIZED_SIZEThis complements the existing patch which checks child layer devices (LUKS/Storage) for deletion scenarios, while this fix addresses the DRBD device itself during adjust operations.
Testing
Manual
drbdadm upon affected devices confirmed they were in down state and brought them back to UpToDate, proving the issue was incomplete device initialization.Related
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.