kvaps commented Jan 7, 2026

Summary

Fix a bug in the toggle-disk retry logic that left orphaned DRBD devices in the kernel.

Problem

When toggle-disk retry was triggered (e.g., user retries after a failed operation), the code called removeLayerData() to clean up and recreate the layer stack. However, removeLayerData() only removes data from the controller's database — it does NOT call drbdadm down on the satellite.

This caused DRBD devices to remain in the kernel (visible in drbdsetup but not managed by LINSTOR), occupying ports and blocking subsequent operations.

Solution

The retry logic now simply repeats the operation with the existing layer data intact. The satellite handles the repeated operation idempotently, so no orphaned resources are created.
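The difference between the two retry strategies can be illustrated with a toy model. This is only a sketch in Python, not LINSTOR's Java code: the `Controller` and `Satellite` classes, the `drbdNNNN` device names, and the assumption that recreating the layer stack allocates a fresh device are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Satellite:
    """Toy stand-in for a satellite node: tracks DRBD devices present in the kernel."""
    kernel_devices: set = field(default_factory=set)

    def apply(self, layer_data: set) -> None:
        # Idempotent: bring up anything missing; devices already up are left alone.
        self.kernel_devices |= layer_data


@dataclass
class Controller:
    """Toy stand-in for the controller: tracks layer data in its database."""
    db_layer_data: set = field(default_factory=set)
    next_minor: int = 1000

    def ensure_layer_data(self) -> None:
        # Allocate a fresh device only if the database has none (assumption).
        if not self.db_layer_data:
            self.db_layer_data = {f"drbd{self.next_minor}"}
            self.next_minor += 1

    def retry_old(self, sat: Satellite) -> None:
        # Old behaviour: drop the DB records without any "drbdadm down" on the
        # satellite, then recreate the layer stack and push it out again.
        self.db_layer_data = set()
        self.ensure_layer_data()
        sat.apply(self.db_layer_data)

    def retry_new(self, sat: Satellite) -> None:
        # New behaviour: keep the layer data and resend the same operation.
        self.ensure_layer_data()
        sat.apply(self.db_layer_data)


def orphans(ctrl: Controller, sat: Satellite) -> set:
    """Devices that exist in the kernel but are unknown to the controller."""
    return sat.kernel_devices - ctrl.db_layer_data
```

In this model, calling `retry_old()` after an initial attempt leaves the first device in `kernel_devices` with no matching database record, while `retry_new()` leaves `orphans()` empty.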

Upstream

LINBIT/linstor-server#475

[linstor] Fix orphaned DRBD devices during toggle-disk retry

Summary by CodeRabbit

  • New Features
    • Added cancel and retry capabilities for disk addition operations
    • Added cancel and retry capabilities for disk removal operations
    • Improved cleanup handling for diskless resources with orphaned storage layers


The previous retry logic in toggle-disk removed layer data from controller DB
and recreated it. However, removeLayerData() only deletes from the database
without calling drbdadm down on the satellite, leaving orphaned DRBD devices
in the kernel that occupy ports and block new operations.

This fix changes retry to simply repeat the operation with existing layer data,
allowing the satellite to handle it idempotently.

Upstream: LINBIT/linstor-server#475

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
coderabbitai bot commented Jan 7, 2026

Walkthrough

A patch adds retry and cancel flow support to disk toggle operations in CtrlRscToggleDiskApiCallHandler. Previously, DISK_ADD_REQUESTED and DISK_REMOVE_REQUESTED states threw errors; now they support cancel, retry, and cleanup flows with new helper methods for state management and storage layer inspection.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Disk Toggle Handler Enhancement** `packages/system/linstor/images/piraeus-server/patches/allow-toggle-disk-retry.diff` | Added retry and cancel flow support for disk toggle operations. Introduced three private helper methods: unmarkDiskAddRequested(), unmarkDiskRemoveRequested(), and hasNonDisklessStorageLayer() for state management and storage layer inspection. New imports for VlmProviderObject and DeviceProviderKind to enable storage layer validation. Enhanced error handling paths to support cancellation and retry instead of immediate failure. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 With disk states tangled in a knot,
Our patch brings retry to the lot,
Cancel flows smooth, and cleanup so neat,
Storage layers dance—oh what a treat!
A toggle that's wise, no longer we fret,
The finest disk flow you've ever met!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main bugfix: preventing orphaned DRBD devices during toggle-disk retry operations in LINSTOR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |



📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0fb02e6 and 8151e1e.

📒 Files selected for processing (1)
  • packages/system/linstor/images/piraeus-server/patches/allow-toggle-disk-retry.diff
🧰 Additional context used
📓 Path-based instructions (1)
packages/system/**

📄 CodeRabbit inference engine (AGENTS.md)

Use Helm Chart umbrella pattern with vendored upstream charts in charts/ directory

Files:

  • packages/system/linstor/images/piraeus-server/patches/allow-toggle-disk-retry.diff
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (6)
packages/system/linstor/images/piraeus-server/patches/allow-toggle-disk-retry.diff (6)

5-14: LGTM!

The new imports for VlmProviderObject and DeviceProviderKind are correctly added to support the new storage layer inspection logic.


22-75: Core fix looks correct and well-documented.

The retry flow correctly reuses existing layer data instead of calling removeLayerData(), which addresses the root cause of orphaned DRBD devices. The cancel flow properly leverages the existing disk removal mechanism for cleanup.

The inline comments at lines 64-66 clearly document why the previous approach was problematic, which is valuable for future maintainers.


76-107: LGTM!

The asymmetric handling between cancel flows is correct: cancelling a disk-add requires cleanup of potentially provisioned storage, while cancelling a disk-remove simply clears the flag since storage remains intact.
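The asymmetry described here can be sketched as a small state transition. This is a hypothetical Python model, not the Java code from the patch: the `Resource` class, the `cancel_toggle()` function, and its return strings are invented for illustration, and routing a cancelled disk-add through `DISK_REMOVE_REQUESTED` is an assumption based on the review's remark that the cancel flow reuses the existing disk removal mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Hypothetical resource; flag names mirror the states named in the walkthrough."""
    flags: set = field(default_factory=set)


def cancel_toggle(rsc: Resource) -> str:
    """Return the follow-up action a cancel would trigger in this toy model."""
    if "DISK_ADD_REQUESTED" in rsc.flags:
        # Storage may already be provisioned, so hand over to the removal flow
        # (assumed mechanism) instead of just dropping the flag.
        rsc.flags.discard("DISK_ADD_REQUESTED")
        rsc.flags.add("DISK_REMOVE_REQUESTED")
        return "run-disk-removal"
    if "DISK_REMOVE_REQUESTED" in rsc.flags:
        # Storage is still intact; clearing the flag is enough.
        rsc.flags.discard("DISK_REMOVE_REQUESTED")
        return "none"
    return "nothing-to-cancel"
```

Only the disk-add branch schedules further work; the disk-remove branch returns the resource to its previous state by clearing a single flag.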


117-150: Good defensive addition for orphaned storage layer cleanup.

This handles edge cases where a resource is marked diskless but retains orphaned storage layers (e.g., from a previous failed operation). Reusing the existing disk removal flow for cleanup ensures consistent satellite-side behavior.


161-184: LGTM!

The new helper methods follow the established pattern of existing mark* methods in this class, with consistent error handling.


192-225: LGTM!

The recursive layer tree traversal is correct. The method properly checks the current layer for non-diskless storage volumes before recursing into children, with early-exit optimization.
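A traversal of this shape might look like the following Python sketch. It is not the patch's Java implementation of hasNonDisklessStorageLayer(): the `Layer` class, its field names, and the string provider kinds are hypothetical stand-ins for LINSTOR's layer objects, VlmProviderObject, and DeviceProviderKind.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Layer:
    """Hypothetical layer node; field names are illustrative, not LINSTOR's."""
    kind: str  # e.g. "DRBD" or "STORAGE"
    volume_providers: List[str] = field(default_factory=list)  # one kind per volume
    children: List["Layer"] = field(default_factory=list)


def has_non_diskless_storage_layer(layer: Layer) -> bool:
    # Check the current layer first: any storage volume backed by a real provider?
    if layer.kind == "STORAGE" and any(p != "DISKLESS" for p in layer.volume_providers):
        return True
    # Then recurse; any() stops at the first child that returns True.
    return any(has_non_diskless_storage_layer(child) for child in layer.children)
```

For example, a `Layer("DRBD", children=[Layer("STORAGE", ["DISKLESS"])])` tree yields `False`, while replacing one volume's provider with a disk-backed kind makes the same call return `True`.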



@gemini-code-assist

Summary of Changes

Hello @kvaps, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in LINSTOR's toggle-disk retry mechanism that could lead to orphaned DRBD devices remaining active in the kernel. The fix streamlines the retry process by eliminating an erroneous cleanup step that inadvertently caused these unmanaged devices, ensuring that toggle-disk operations are robust and do not leave system-level residue, thereby improving resource management and preventing operational blocks.

Highlights

  • DRBD Device Management: Modified the toggle-disk retry logic to prevent the creation of orphaned DRBD devices in the kernel, which previously occurred when removeLayerData() was called without properly deactivating the DRBD device on the satellite.
  • Retry Mechanism Improvement: The problematic removeLayerData() and subsequent ensureStackDataExists() calls have been removed from the toggle-disk retry path, as they only cleaned the controller's database and not the actual DRBD devices on the satellite.
  • Idempotent Operations: The retry mechanism now simply re-attempts the toggle-disk operation with existing layer data, relying on the satellite's inherent idempotent handling to ensure a clean and correct state without leaving behind orphaned resources.


@kvaps kvaps added the backport Should change be backported on previous release label Jan 7, 2026
@kvaps kvaps marked this pull request as ready for review January 7, 2026 12:30
@kvaps kvaps requested review from lllamnyp and nbykov0 as code owners January 7, 2026 12:30
@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Jan 7, 2026
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively resolves a critical bug that led to orphaned DRBD devices during a toggle-disk retry. The change in logic to rely on the satellite's idempotency instead of attempting a partial cleanup on the controller is a solid fix. Additionally, the new functionality to cancel pending operations and to clean up orphaned layers on diskless resources significantly improves the robustness of the toggle-disk operation. The code is well-commented and the changes are logical. I have a couple of suggestions for refactoring that could enhance code maintainability and readability.

@dosubot dosubot bot added the bug Something isn't working label Jan 7, 2026
@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jan 7, 2026
@kvaps kvaps merged commit b8b330e into main Jan 7, 2026
30 checks passed
@kvaps kvaps deleted the fix/linstor-toggle-disk-retry branch January 7, 2026 14:00
github-actions bot commented Jan 7, 2026

Successfully created backport PR for release-0.39:

kvaps added a commit that referenced this pull request Jan 7, 2026
…uring toggle-disk retry (#1825)

# Description
Backport of #1823 to `release-0.39`.
kvaps added a commit that referenced this pull request Jan 8, 2026
…1823)

## Summary

Fix bug in toggle-disk retry logic that left orphaned DRBD devices in
kernel.

## Problem

When toggle-disk retry was triggered (e.g., user retries after a failed
operation), the code called `removeLayerData()` to clean up and recreate
the layer stack. However, `removeLayerData()` only removes data from the
controller's database — it does NOT call `drbdadm down` on the
satellite.

This caused DRBD devices to remain in the kernel (visible in `drbdsetup`
but not managed by LINSTOR), occupying ports and blocking subsequent
operations.

## Solution

Changed retry logic to simply repeat the operation with existing layer
data intact. The satellite handles this idempotently without creating
orphaned resources.

## Upstream

- LINBIT/linstor-server#475 (updated)

```release-note
[linstor] Fix orphaned DRBD devices during toggle-disk retry
```

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Added cancel and retry capabilities for disk addition operations
  * Added cancel and retry capabilities for disk removal operations
* Improved cleanup handling for diskless resources with orphaned storage
layers

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kvaps added a commit that referenced this pull request Jan 8, 2026
…1823)

kvaps added a commit that referenced this pull request Jan 9, 2026
…1823)
