Skip to content

Conversation

@kvaps
Copy link
Member

@kvaps kvaps commented Jan 8, 2026

Problem

After satellite restart with patched linstor-server, some DRBD resources ended up in Unknown state. Investigation revealed a race condition between drbdadm adjust completion and updateDiscGran() lsblk check:

  1. drbdadm adjust completes successfully (21:53:49.791) - brings devices up
  2. updateDiscGran() immediately tries to check discard granularity (21:53:49.799)
  3. /dev/drbd* device node doesn't exist yet - kernel hasn't created it
  4. lsblk fails with exit code 32: "not a block device"
  5. StorageException interrupts DeviceManager cycle (21:53:50.811)
  6. DRBD device remains in incomplete state → Unknown

Timeline from logs

21:53:49.791 - [DeviceManager] Resource 'pvc-aafbd92a' [DRBD] adjusted ✓
21:53:49.799 - [lsblk_parser] ERROR: /dev/drbd1169 not a block device
21:53:49.804-50.807 - [lsblk_parser] 10+ retry errors (every ~100ms)
21:53:50.811 - [DeviceManager] ERROR: Error executing lsblk
21:53:50.878 - [DeviceManager] End cycle 9 (WITH ERROR!)

Solution

Update skip-adjust-when-device-inaccessible.diff patch to add physical device path existence check before calling lsblk:

  • Check Files.exists(devicePath) in updateDiscGran() before lsblk
  • If device doesn't exist yet → skip check, set discGran = UNINITIALIZED_SIZE
  • Next DeviceManager cycle (few seconds later) → device node available → lsblk succeeds

This complements the existing patch which checks child layer devices (LUKS/Storage) for deletion scenarios, while this fix addresses the DRBD device itself during adjust operations.

Testing

Manual drbdadm up on affected devices confirmed they were in down state and brought them back to UpToDate, proving the issue was incomplete device initialization.

Related

Summary by CodeRabbit

  • Bug Fixes
    • Enhanced robustness for storage operations by adding device accessibility validation. Operations now gracefully skip when device paths are unavailable or invalid, preventing unnecessary failures and improving system resilience during device access issues.

✏️ Tip: You can customize this high-level summary in your review settings.

Add physical device path existence check to prevent race condition where
lsblk is called before kernel creates DRBD device node after drbdadm adjust.

This fixes issue where resources ended up in Unknown state after satellite
restart because:
- drbdadm adjust completes successfully (brings devices up)
- updateDiscGran() immediately tries lsblk
- /dev/drbd* doesn't exist yet (kernel hasn't created node)
- lsblk fails with "not a block device" exit code 32
- StorageException interrupts DeviceManager cycle
- Device remains in incomplete state

The patch now checks Files.exists(devicePath) before lsblk, allowing
the check to be retried in next cycle when device node is available.

Upstream: LINBIT/linstor-server#471

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 8, 2026

📝 Walkthrough

Walkthrough

A patch file for LINSTOR Piraeus Server that adds defensive checks before DRBD adjustment operations. It validates device path existence and child VLM device validity before regenerating resources or adjusting device configurations, skipping operations when dependencies are unavailable rather than raising exceptions.

Changes

Cohort / File(s) Summary
Device Path Validation
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
Added null-safe existence checks for device paths using Files.exists() in updateDiscGran. Introduced pre-condition validation in DrbdLayer to verify child VLM devices and device paths before resource regeneration and adjustment. Guards adjustment operations with canAdjust flag; skips operations gracefully instead of throwing exceptions. Includes exception handling to restore backup resources only when adjustments proceed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

bug

Suggested reviewers

  • lllamnyp
  • nbykov0

Poem

🐰 A patch hops in to check the ground,
Before adjusting DRBD 'round,
"Does the device exist?" it softly asks,
If not, skip the tricky tasks,
No exceptions thrown, just graceful grace!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Title check ✅ Passed The title accurately describes the main change: adding a null-safe check for device paths in updateDiscGran to prevent DRBD race conditions during device initialization.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @kvaps, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical race condition within LINSTOR's device management, specifically affecting DRBD resources during adjustment operations. By introducing a preliminary check for the physical existence of device nodes, it prevents lsblk from failing prematurely when DRBD devices are temporarily unavailable, thereby ensuring proper initialization and preventing resources from getting stuck in an "Unknown" state.

Highlights

  • Addresses DRBD Race Condition: Fixes an issue where DRBD resources could enter an "Unknown" state due to a race between drbdadm adjust completion and updateDiscGran()'s lsblk check, which occurred because the kernel had not yet created the device node.
  • Introduces Device Path Existence Check: Adds a check using Files.exists(devicePath) in the updateDiscGran() method within DeviceHandlerImpl.java to ensure the physical device node exists before attempting to query it with lsblk.
  • Graceful Handling of Temporary Unavailability: If the device path is not yet present, the lsblk check is skipped, and discGran is set to UNINITIALIZED_SIZE. This allows subsequent DeviceManager cycles to successfully process the device once its node becomes available, preventing errors and incomplete device initialization.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a race condition where a DRBD device node might not exist immediately after an adjust operation, causing lsblk to fail. The fix is to add a check in updateDiscGran to verify the device file's existence before calling lsblk. This prevents a StorageException and allows the system to retry in the next cycle.

The change is implemented by adding a new patch for DeviceHandlerImpl.java to the skip-adjust-when-device-inaccessible.diff file. The solution is well-targeted and correctly uses Files.exists() to solve the race condition. The code is clean and the accompanying comment clearly explains the reasoning. The fix appears solid and I have no further recommendations.

@kvaps kvaps added the backport Should change be backported on previus release label Jan 8, 2026
@kvaps kvaps marked this pull request as ready for review January 8, 2026 12:30
@kvaps kvaps requested review from lllamnyp and nbykov0 as code owners January 8, 2026 12:30
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. bug Something isn't working labels Jan 8, 2026
@kvaps kvaps changed the title fix(linstor): prevent DRBD device race condition in updateDiscGran [linstor] fix: prevent DRBD device race condition in updateDiscGran Jan 8, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff (1)

38-53: Consider extracting duplicated device accessibility check.

This logic is duplicated verbatim in the adjust guard block (lines 70-90). Consider extracting to a helper method for maintainability.

♻️ Suggested helper method
private boolean areChildDevicesAccessible(
    DrbdRscData<Resource> drbdRscData,
    AccessContext workerCtx,
    boolean skipDisk
) throws AccessDeniedException
{
    if (skipDisk || drbdRscData.getAbsResource().isDrbdDiskless(workerCtx))
    {
        return true;
    }
    AbsRscLayerObject<Resource> dataChild = drbdRscData.getChildBySuffix(RscLayerSuffixes.SUFFIX_DATA);
    if (dataChild != null)
    {
        for (DrbdVlmData<Resource> drbdVlmData : drbdRscData.getVlmLayerObjects().values())
        {
            VlmProviderObject<Resource> childVlm = dataChild.getVlmProviderObject(drbdVlmData.getVlmNr());
            if (childVlm == null || !childVlm.exists() || childVlm.getDevicePath() == null)
            {
                return false;
            }
        }
    }
    return true;
}

Then use:

boolean canRegenerateResFile = areChildDevicesAccessible(drbdRscData, workerCtx, skipDisk);
// ...
boolean canAdjust = areChildDevicesAccessible(drbdRscData, workerCtx, skipDisk);
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 67c5265 and a2a0747.

📒 Files selected for processing (1)
  • packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
🧰 Additional context used
📓 Path-based instructions (1)
packages/system/**

📄 CodeRabbit inference engine (AGENTS.md)

Use Helm Chart umbrella pattern with vendored upstream charts in charts/ directory

Files:

  • packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build
🔇 Additional comments (3)
packages/system/linstor/images/piraeus-server/patches/skip-adjust-when-device-inaccessible.diff (3)

14-25: LGTM - Defensive check prevents lsblk on non-existent device nodes.

The added Files.exists(Paths.get(devicePath)) check correctly guards against the race condition where the kernel hasn't yet created the /dev/drbd* device node after drbdadm adjust. The condition ordering is correct (null check first).

Note: A small TOCTOU window remains between the existence check and the actual lsblk call, but this is acceptable defensive programming that significantly reduces the race window.


94-109: LGTM - Exception handling correctly scoped.

The try-catch and restoreBackupResFile() are properly scoped to only execute when adjustment is actually attempted. This prevents incorrect backup restoration when adjustment is intentionally skipped.


111-116: Clarify adjustRequired retry semantics in non-deletion scenarios.

When canAdjust is false, the code sets setAdjustRequired(false), which prevents the adjust operation from being attempted in subsequent DeviceManager cycles. This is intentional for the deletion scenario (where LUKS devices are intentionally closed) since the resource won't exist after cleanup.

However, for scenarios where devices become temporarily inaccessible during non-deletion operations, there is no shown mechanism to re-enable adjustRequired once it's disabled. In contrast, the sibling code path for discGran uses UNINITIALIZED_SIZE to signal a retry is needed in the next cycle. Verify whether there is an upstream mechanism that re-enables adjustRequired when underlying devices become accessible again, particularly for toggle-disk operations where temporary device unavailability might occur outside of resource deletion.

@kvaps
Copy link
Member Author

kvaps commented Jan 8, 2026

/retest

3 similar comments
@kvaps
Copy link
Member Author

kvaps commented Jan 8, 2026

/retest

@kvaps
Copy link
Member Author

kvaps commented Jan 8, 2026

/retest

@kvaps
Copy link
Member Author

kvaps commented Jan 8, 2026

/retest

@kvaps kvaps merged commit 9cab76b into main Jan 8, 2026
82 of 85 checks passed
@kvaps kvaps deleted the fix/linstor-race-condition branch January 8, 2026 20:04
@github-actions
Copy link

github-actions bot commented Jan 8, 2026

Successfully created backport PR for release-0.39:

kvaps added a commit that referenced this pull request Jan 8, 2026
…tion in updateDiscGran (#1836)

# Description
Backport of #1829 to `release-0.39`.
kvaps added a commit that referenced this pull request Jan 8, 2026
…1829)

## Problem

After satellite restart with patched linstor-server, some DRBD resources
ended up in **Unknown** state. Investigation revealed a race condition
between `drbdadm adjust` completion and `updateDiscGran()` lsblk check:

1. `drbdadm adjust` completes successfully (21:53:49.791) - brings
devices up
2. `updateDiscGran()` immediately tries to check discard granularity
(21:53:49.799)
3. `/dev/drbd*` device node doesn't exist yet - kernel hasn't created it
4. `lsblk` fails with exit code 32: "not a block device"  
5. `StorageException` interrupts DeviceManager cycle (21:53:50.811)
6. DRBD device remains in incomplete state → **Unknown**

### Timeline from logs

```
21:53:49.791 - [DeviceManager] Resource 'pvc-aafbd92a' [DRBD] adjusted ✓
21:53:49.799 - [lsblk_parser] ERROR: /dev/drbd1169 not a block device
21:53:49.804-50.807 - [lsblk_parser] 10+ retry errors (every ~100ms)
21:53:50.811 - [DeviceManager] ERROR: Error executing lsblk
21:53:50.878 - [DeviceManager] End cycle 9 (WITH ERROR!)
```

## Solution

Update `skip-adjust-when-device-inaccessible.diff` patch to add physical
device path existence check before calling `lsblk`:

- Check `Files.exists(devicePath)` in `updateDiscGran()` before lsblk
- If device doesn't exist yet → skip check, set `discGran =
UNINITIALIZED_SIZE`
- Next DeviceManager cycle (few seconds later) → device node available →
lsblk succeeds

This complements the existing patch which checks **child layer** devices
(LUKS/Storage) for deletion scenarios, while this fix addresses the
**DRBD device itself** during adjust operations.

## Testing

Manual `drbdadm up` on affected devices confirmed they were in down
state and brought them back to UpToDate, proving the issue was
incomplete device initialization.

## Related

- Upstream linstor-server PR:
LINBIT/linstor-server#471 (comment)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced robustness for storage operations by adding device
accessibility validation. Operations now gracefully skip when device
paths are unavailable or invalid, preventing unnecessary failures and
improving system resilience during device access issues.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kvaps added a commit that referenced this pull request Jan 8, 2026
…1829)

## Problem

After satellite restart with patched linstor-server, some DRBD resources
ended up in **Unknown** state. Investigation revealed a race condition
between `drbdadm adjust` completion and `updateDiscGran()` lsblk check:

1. `drbdadm adjust` completes successfully (21:53:49.791) - brings
devices up
2. `updateDiscGran()` immediately tries to check discard granularity
(21:53:49.799)
3. `/dev/drbd*` device node doesn't exist yet - kernel hasn't created it
4. `lsblk` fails with exit code 32: "not a block device"  
5. `StorageException` interrupts DeviceManager cycle (21:53:50.811)
6. DRBD device remains in incomplete state → **Unknown**

### Timeline from logs

```
21:53:49.791 - [DeviceManager] Resource 'pvc-aafbd92a' [DRBD] adjusted ✓
21:53:49.799 - [lsblk_parser] ERROR: /dev/drbd1169 not a block device
21:53:49.804-50.807 - [lsblk_parser] 10+ retry errors (every ~100ms)
21:53:50.811 - [DeviceManager] ERROR: Error executing lsblk
21:53:50.878 - [DeviceManager] End cycle 9 (WITH ERROR!)
```

## Solution

Update `skip-adjust-when-device-inaccessible.diff` patch to add physical
device path existence check before calling `lsblk`:

- Check `Files.exists(devicePath)` in `updateDiscGran()` before lsblk
- If device doesn't exist yet → skip check, set `discGran =
UNINITIALIZED_SIZE`
- Next DeviceManager cycle (few seconds later) → device node available →
lsblk succeeds

This complements the existing patch which checks **child layer** devices
(LUKS/Storage) for deletion scenarios, while this fix addresses the
**DRBD device itself** during adjust operations.

## Testing

Manual `drbdadm up` on affected devices confirmed they were in down
state and brought them back to UpToDate, proving the issue was
incomplete device initialization.

## Related

- Upstream linstor-server PR:
LINBIT/linstor-server#471 (comment)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced robustness for storage operations by adding device
accessibility validation. Operations now gracefully skip when device
paths are unavailable or invalid, preventing unnecessary failures and
improving system resilience during device access issues.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
kvaps added a commit that referenced this pull request Jan 9, 2026
…1829)

## Problem

After satellite restart with patched linstor-server, some DRBD resources
ended up in **Unknown** state. Investigation revealed a race condition
between `drbdadm adjust` completion and `updateDiscGran()` lsblk check:

1. `drbdadm adjust` completes successfully (21:53:49.791) - brings
devices up
2. `updateDiscGran()` immediately tries to check discard granularity
(21:53:49.799)
3. `/dev/drbd*` device node doesn't exist yet - kernel hasn't created it
4. `lsblk` fails with exit code 32: "not a block device"  
5. `StorageException` interrupts DeviceManager cycle (21:53:50.811)
6. DRBD device remains in incomplete state → **Unknown**

### Timeline from logs

```
21:53:49.791 - [DeviceManager] Resource 'pvc-aafbd92a' [DRBD] adjusted ✓
21:53:49.799 - [lsblk_parser] ERROR: /dev/drbd1169 not a block device
21:53:49.804-50.807 - [lsblk_parser] 10+ retry errors (every ~100ms)
21:53:50.811 - [DeviceManager] ERROR: Error executing lsblk
21:53:50.878 - [DeviceManager] End cycle 9 (WITH ERROR!)
```

## Solution

Update `skip-adjust-when-device-inaccessible.diff` patch to add physical
device path existence check before calling `lsblk`:

- Check `Files.exists(devicePath)` in `updateDiscGran()` before lsblk
- If device doesn't exist yet → skip check, set `discGran =
UNINITIALIZED_SIZE`
- Next DeviceManager cycle (few seconds later) → device node available →
lsblk succeeds

This complements the existing patch which checks **child layer** devices
(LUKS/Storage) for deletion scenarios, while this fix addresses the
**DRBD device itself** during adjust operations.

## Testing

Manual `drbdadm up` on affected devices confirmed they were in down
state and brought them back to UpToDate, proving the issue was
incomplete device initialization.

## Related

- Upstream linstor-server PR:
LINBIT/linstor-server#471 (comment)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Bug Fixes**
* Enhanced robustness for storage operations by adding device
accessibility validation. Operations now gracefully skip when device
paths are unavailable or invalid, preventing unnecessary failures and
improving system resilience during device access issues.

<sub>✏️ Tip: You can customize this high-level summary in your review
settings.</sub>

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport Should change be backported on previus release bug Something isn't working size:S This PR changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants