fix(linstor): prevent orphaned DRBD devices during toggle-disk retry #1823
The previous retry logic in toggle-disk removed layer data from controller DB and recreated it. However, removeLayerData() only deletes from the database without calling drbdadm down on the satellite, leaving orphaned DRBD devices in the kernel that occupy ports and block new operations.

This fix changes retry to simply repeat the operation with existing layer data, allowing the satellite to handle it idempotently.

Upstream: LINBIT/linstor-server#475
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Walkthrough
A patch adds retry and cancel flow support to disk toggle operations in CtrlRscToggleDiskApiCallHandler. Previously, DISK_ADD_REQUESTED and DISK_REMOVE_REQUESTED states threw errors; now they support cancel, retry, and cleanup flows with new helper methods for state management and storage layer inspection.
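The flows named in the walkthrough can be pictured as a small decision table. The sketch below is only an illustration: the enum names, the `decide` method, and in particular the mapping (same request as the pending one means retry, the opposite request means cancel, a remove request on a diskless resource that still has a storage layer means cleanup) are assumptions made for this example and are not taken from the patch.

```java
/** Hypothetical decision table for toggle-disk requests; not LINSTOR code. */
class ToggleDiskFlowSketch {
    /** Pending state on the resource (names mirror the flags in the walkthrough). */
    enum Pending { NONE, DISK_ADD_REQUESTED, DISK_REMOVE_REQUESTED }

    /** What the user is asking for now. */
    enum Request { ADD_DISK, REMOVE_DISK }

    /** Which flow the handler would enter (assumed names). */
    enum Flow { START, RETRY, CANCEL, CLEANUP }

    static Flow decide(Pending pending, Request request, boolean disklessWithStorageLayer) {
        switch (pending) {
            case DISK_ADD_REQUESTED:
                // Assumption: repeating the pending request retries it, the opposite cancels it.
                return request == Request.ADD_DISK ? Flow.RETRY : Flow.CANCEL;
            case DISK_REMOVE_REQUESTED:
                return request == Request.REMOVE_DISK ? Flow.RETRY : Flow.CANCEL;
            default:
                // Assumption: a diskless resource with a leftover storage layer gets cleaned up.
                if (request == Request.REMOVE_DISK && disklessWithStorageLayer) {
                    return Flow.CLEANUP;
                }
                return Flow.START;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(Pending.DISK_ADD_REQUESTED, Request.ADD_DISK, false));
        System.out.println(decide(Pending.DISK_ADD_REQUESTED, Request.REMOVE_DISK, false));
        System.out.println(decide(Pending.NONE, Request.REMOVE_DISK, true));
    }
}
```

Before the patch, both pending states would have ended in an error instead of the `RETRY` or `CANCEL` branches.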
Code Review
This pull request effectively resolves a critical bug that led to orphaned DRBD devices during a toggle-disk retry. The change in logic to rely on the satellite's idempotency instead of attempting a partial cleanup on the controller is a solid fix. Additionally, the new functionality to cancel pending operations and to clean up orphaned layers on diskless resources significantly improves the robustness of the toggle-disk operation. The code is well-commented and the changes are logical. I have a couple of suggestions for refactoring that could enhance code maintainability and readability.
Successfully created backport PR for |
…1823)

## Summary

Fix bug in toggle-disk retry logic that left orphaned DRBD devices in kernel.

## Problem

When toggle-disk retry was triggered (e.g., user retries after a failed operation), the code called `removeLayerData()` to clean up and recreate the layer stack. However, `removeLayerData()` only removes data from the controller's database; it does NOT call `drbdadm down` on the satellite.

This caused DRBD devices to remain in the kernel (visible in `drbdsetup` but not managed by LINSTOR), occupying ports and blocking subsequent operations.

## Solution

Changed retry logic to simply repeat the operation with existing layer data intact. The satellite handles this idempotently without creating orphaned resources.

## Upstream

- LINBIT/linstor-server#475 (updated)

```release-note
[linstor] Fix orphaned DRBD devices during toggle-disk retry
```

## Summary by CodeRabbit

* **New Features**
  * Added cancel and retry capabilities for disk addition operations
  * Added cancel and retry capabilities for disk removal operations
  * Improved cleanup handling for diskless resources with orphaned storage layers
Summary
Fix bug in toggle-disk retry logic that left orphaned DRBD devices in kernel.
Problem
When toggle-disk retry was triggered (e.g., user retries after a failed operation), the code called `removeLayerData()` to clean up and recreate the layer stack. However, `removeLayerData()` only removes data from the controller's database; it does NOT call `drbdadm down` on the satellite.
This caused DRBD devices to remain in the kernel (visible in `drbdsetup` but not managed by LINSTOR), occupying ports and blocking subsequent operations.
Solution
Changed retry logic to simply repeat the operation with existing layer data intact. The satellite handles this idempotently without creating orphaned resources.
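To make the difference between the two retry strategies concrete, here is a minimal toy model, assuming the controller DB can be reduced to a map from resource name to an allocated DRBD minor and the satellite to the set of minors configured in the kernel. Apart from `removeLayerData()`, every class, method, and value name below is hypothetical and does not come from LINSTOR.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the retry bug; not LINSTOR code. Names are hypothetical. */
class ToggleDiskRetrySketch {

    /** Stands in for the controller DB: resource name -> allocated DRBD minor. */
    static final class Controller {
        private final Map<String, Integer> layerData = new HashMap<>();
        private int nextMinor = 1000;

        /** Returns the existing minor, or allocates a fresh one if none is stored. */
        int ensureLayerData(String rsc) {
            Integer minor = layerData.get(rsc);
            if (minor == null) {
                minor = nextMinor++;
                layerData.put(rsc, minor);
            }
            return minor;
        }

        /** Deletes the DB record only; nothing is torn down on the satellite. */
        void removeLayerData(String rsc) {
            layerData.remove(rsc);
        }

        boolean knowsMinor(int minor) {
            return layerData.containsValue(minor);
        }
    }

    /** Stands in for the satellite: DRBD minors currently configured in the kernel. */
    static final class Satellite {
        private final Map<Integer, String> kernelDevices = new HashMap<>();

        /** Idempotent: applying the same (resource, minor) twice changes nothing. */
        void apply(String rsc, int minor) {
            kernelDevices.put(minor, rsc);
        }

        /** Devices present in the kernel that the controller no longer tracks. */
        long orphans(Controller ctrl) {
            return kernelDevices.keySet().stream().filter(m -> !ctrl.knowsMinor(m)).count();
        }
    }

    /** Runs one failed attempt plus one retry and counts the orphaned devices. */
    static long orphansAfterRetry(boolean recreateLayerData) {
        Controller ctrl = new Controller();
        Satellite sat = new Satellite();
        String rsc = "pvc-demo";

        // First attempt: the device is configured, then the operation fails later on.
        sat.apply(rsc, ctrl.ensureLayerData(rsc));

        if (recreateLayerData) {
            ctrl.removeLayerData(rsc); // old retry: drop the DB record, allocate anew
        }
        sat.apply(rsc, ctrl.ensureLayerData(rsc)); // new retry: reuse what exists
        return sat.orphans(ctrl);
    }

    public static void main(String[] args) {
        System.out.println("old retry orphans: " + orphansAfterRetry(true));
        System.out.println("new retry orphans: " + orphansAfterRetry(false));
    }
}
```

In this model the old strategy leaves one device in the kernel that the controller has forgotten, while repeating the operation with the stored layer data leaves none.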
Upstream
LINBIT/linstor-server#475 (updated)