Skip to content

Conversation

@kvaps
Copy link
Member

@kvaps kvaps commented Jan 12, 2026

What this PR does

This PR updates piraeus-server patches to address several critical production issues with DRBD resources and LUKS encryption:

  1. Add fix-duplicate-tcp-ports.diff - Prevents duplicate TCP ports after toggle-disk operations (upstream PR revert-precommit #476)

  2. Update skip-adjust-when-device-inaccessible.diff - Comprehensive fix for multiple issues:

    • Resources stuck in StandAlone state after node reboot
    • Unknown state race condition during satellite restart
    • Encrypted LUKS resource deletion failures
    • Network reconnect blocked by unavailable child device checks

These patches resolve scenarios where DRBD resources fail to automatically reconnect after node reboots and improve LUKS resource lifecycle management.

Upstream PRs:

Release note

[linstor] Fix DRBD resources stuck in StandAlone state after reboot and encrypted resource deletion issues

Summary by CodeRabbit

  • Bug Fixes
    • Prevents duplicate TCP port conflicts after disk toggle operations
    • Fixes resources stuck in StandAlone or Unknown state after reboot
    • Resolves issues with encrypted resource deletion
    • Improves handling of temporarily inaccessible storage devices

✏️ Tip: You can customize this high-level summary in your review settings.

Update piraeus-server patches to address critical production issues:

- Add fix-duplicate-tcp-ports.diff to prevent duplicate TCP ports
  after toggle-disk operations (upstream PR #476)

- Update skip-adjust-when-device-inaccessible.diff with comprehensive
  fixes for resources stuck in StandAlone after reboot, Unknown state
  race condition, and encrypted LUKS resource deletion (upstream PR #477)

```release-note
[linstor] Fix DRBD resources stuck in StandAlone state after reboot and encrypted resource deletion issues
```

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 12, 2026

📝 Walkthrough

Walkthrough

This PR updates Linstor's piraeus-server patch documentation and adds two targeted fixes: one addressing duplicate TCP port assignments after toggle-disk operations, and another handling inaccessible devices during resource operations by skipping unnecessary regeneration and adjustment steps.

Changes

Cohort / File(s) Summary
Patch Documentation
packages/system/linstor/images/piraeus-server/patches/README.md
Updated patch registry with new entry fix-duplicate-tcp-ports.diff and revised description for skip-adjust-when-device-inaccessible.diff to reflect expanded functionality (Upstream #476, #477)
TCP Port Conflict Fix
packages/system/linstor/images/piraeus-server/patches/fix-duplicate-tcp-ports.diff
Removes redundant ensureStackDataExists() call from resetStoragePools() in CtrlRscLayerDataFactory to prevent creation of DRBD resource data with empty TCP port sets, eliminating port-conflict behavior during toggle-disk, snapshot, and backup/restore operations
Device Accessibility Guards
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
Adds pre-checks for device existence in DeviceHandlerImpl and introduces conditional guards (canRegenerateResFile, canAdjust) in DrbdLayer to skip res-file regeneration and DRBD adjustment when child devices are inaccessible, allowing operations to continue during resource deletion and network reconnect scenarios

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 A patch hops in, fixing ports that collide,
And guards check for devices far and wide,
No duplicate traffic shall roam these lands,
Linstor's resources now firm in our hands! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: updating piraeus-server patches with critical fixes addressing production issues.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @kvaps, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the stability and reliability of LINSTOR's DRBD resource management by updating existing patches and introducing a new one. It addresses critical production issues related to toggle-disk operations, node reboots, and LUKS encrypted volumes. The changes prevent TCP port conflicts, ensure automatic DRBD reconnection, and facilitate proper deletion of encrypted resources, thereby improving overall system resilience and operational consistency.

Highlights

  • Prevent Duplicate TCP Ports: This PR introduces a patch (fix-duplicate-tcp-ports.diff) that resolves an issue where toggle-disk operations, including snapshot creation/restore, could lead to duplicate TCP port assignments for DRBD resources. The root cause was a redundant ensureStackDataExists() call with an empty payload, which, after a TCP port migration, resulted in DrbdRscData being created without proper port assignments. The fix removes this redundant call, ensuring correct port allocation and preventing resources from getting stuck in StandAlone or Connecting states due to "port is also used" errors.
  • Comprehensive Fixes for DRBD Resource and LUKS Issues: The existing skip-adjust-when-device-inaccessible.diff patch has been updated to address several critical production issues:
  1. Resources stuck in StandAlone after node reboot: Previously, DRBD resources could remain in StandAlone state after a node reboot because drbdadm adjust was blocked by checks for inaccessible child devices, even when only network reconnection was needed. The updated fix now only checks child device accessibility when disk access is truly required (e.g., for volume creation, resize, metadata operations), allowing network reconnects to proceed.
  2. Unknown state race condition during satellite restart: A race condition could occur during satellite restart where updateDiscGran() would attempt lsblk on a DRBD device before its device node was fully created, leading to StorageException and resources getting stuck in Unknown state. The fix adds a check for physical device path existence before calling lsblk.
  3. Encrypted LUKS resource deletion failures: Deleting encrypted LUKS resources could fail if the LUKS layer closed its device before DRBD attempted to adjust or regenerate its resource file, causing "Failed to adjust DRBD resource" errors. The fix adds checks to skip regenerateResFile() and drbdUtils.adjust() if child devices are inaccessible during deletion, allowing the process to complete.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates piraeus-server patches to address several critical production issues. The changes include adding a new patch to fix duplicate TCP ports and overhauling another patch to resolve issues with DRBD resources being stuck, race conditions, and encrypted resource deletion failures. The commit messages within the patch files are detailed and clearly explain the rationale behind the changes. The logic appears sound and effectively addresses the described problems. I've included a couple of suggestions to enhance code readability and maintainability by using Java Streams for checking conditions.

@kvaps kvaps added backport Should change be backported on previus release backport-previous labels Jan 12, 2026
@kvaps kvaps marked this pull request as ready for review January 12, 2026 21:29
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jan 12, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff (1)

34-54: Code duplication in child device accessibility checks.

The child device accessibility check logic is duplicated between the canRegenerateResFile block (lines 34-54) and the canAdjust block (lines 66-88). Both iterate over drbdRscData.getVlmLayerObjects().values() and check childVlm == null || !childVlm.exists() || childVlm.getDevicePath() == null.

Consider extracting this into a helper method like areChildDevicesAccessible(DrbdRscData<Resource> drbdRscData) to reduce duplication and improve maintainability.

♻️ Suggested helper method extraction
private boolean areChildDevicesAccessible(DrbdRscData<Resource> drbdRscData) throws AccessDeniedException
{
    AbsRscLayerObject<Resource> dataChild = drbdRscData.getChildBySuffix(RscLayerSuffixes.SUFFIX_DATA);
    if (dataChild != null)
    {
        for (DrbdVlmData<Resource> drbdVlmData : drbdRscData.getVlmLayerObjects().values())
        {
            VlmProviderObject<Resource> childVlm = dataChild.getVlmProviderObject(drbdVlmData.getVlmNr());
            if (childVlm == null || !childVlm.exists() || childVlm.getDevicePath() == null)
            {
                return false;
            }
        }
    }
    return true;
}

Also applies to: 66-113

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8025657 and dc2773b.

📒 Files selected for processing (3)
  • packages/system/linstor/images/piraeus-server/patches/README.md
  • packages/system/linstor/images/piraeus-server/patches/fix-duplicate-tcp-ports.diff
  • packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
🧰 Additional context used
📓 Path-based instructions (2)
packages/system/**

📄 CodeRabbit inference engine (AGENTS.md)

Use Helm Chart umbrella pattern with vendored upstream charts in charts/ directory

Files:

  • packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
  • packages/system/linstor/images/piraeus-server/patches/fix-duplicate-tcp-ports.diff
  • packages/system/linstor/images/piraeus-server/patches/README.md
**/*.md

📄 CodeRabbit inference engine (AGENTS.md)

Git commits should follow [component] Description format and include --signoff flag

Files:

  • packages/system/linstor/images/piraeus-server/patches/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (4)
packages/system/linstor/images/piraeus-server/patches/fix-duplicate-tcp-ports.diff (1)

1-87: Well-documented patch with comprehensive root cause analysis.

The patch is well-structured with:

  • Clear problem description and root cause analysis
  • Detailed impact assessment
  • Minimal, targeted fix (removing redundant ensureStackDataExists() call)
  • Proper testing verification
  • Correct sign-off and attribution

The fix correctly addresses the TCP port duplication issue by ensuring resetStoragePools() only handles storage pool assignments while relying on callers to properly initialize layer data with correct TCP ports.

packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff (2)

163-172: Sound fix for race condition during device adjustment.

The added Files.exists(Paths.get(devicePath)) check correctly handles the race condition where kernel device nodes may not exist immediately after drbdadm adjust. Deferring the check to the next DeviceManager cycle is a reasonable approach.


230-254: Critical fix for StandAlone resources after reboot.

The needsDiskAccess determination logic correctly identifies when disk access is truly required:

  • New volume creation (!drbdVlmData.exists())
  • Metadata operations (checkMetaData() – a state flag indicating metadata needs checking/creation)
  • Resize operations (Volume.Flags.RESIZE, Volume.Flags.DRBD_RESIZE)

This allows network reconnect operations (StandAlone → Connected) to proceed without blocking on child device availability, which is the root cause of resources staying stuck in StandAlone after node reboot.

packages/system/linstor/images/piraeus-server/patches/README.md (1)

11-14: Documentation properly updated.

The new patch entries follow the established format with:

  • Bold patch filename
  • Em-dash separator
  • Concise description
  • Upstream PR reference links

The descriptions accurately reflect the patch purposes documented in the diff files.

@kvaps
Copy link
Member Author

kvaps commented Jan 12, 2026

/retest

@kvaps kvaps merged commit ea6ec3e into main Jan 12, 2026
41 of 46 checks passed
@kvaps kvaps deleted the fix/linstor-piraeus-patches branch January 12, 2026 22:23
@github-actions
Copy link

Successfully created backport PR for release-0.40:

@github-actions
Copy link

Successfully created backport PR for release-0.39:

kvaps added a commit that referenced this pull request Jan 12, 2026
…critical fixes (#1852)

# Description
Backport of #1850 to `release-0.40`.
kvaps added a commit that referenced this pull request Jan 12, 2026
…critical fixes (#1853)

# Description
Backport of #1850 to `release-0.39`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport Should change be backported on previus release backport-previous bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants