Support metro volumes in csm-resiliency by lukeatdell · Pull Request #533 · dell/csi-powerstore

lukeatdell · 2025-08-08T17:41:18Z

Description

Adds metro volumes as a supported volume type for csm-resiliency.

Previously, ValidateVolumeHostConnectivity() checked only the local volume for in-progress I/O.
Now, the function checks whether the volumeIDs provided in the request include a remote volume. If one is found, two asynchronous requests are issued (one to the local volume, one to the remote volume) to check whether I/O is in progress. If either request reports in-progress I/O, any pending requests are cancelled, and the response is updated to indicate the status and returned.

GitHub Issues

GitHub Issue: https://github.com/dell/csm/issues/1961

Checklist:

I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
I have verified that new and existing unit tests pass locally with my changes
I have not allowed coverage numbers to degenerate
I have maintained at least 90% code coverage
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
Backward compatibility is not broken

How Has This Been Tested?

Ran continuous metro endpoint failures to force resiliency to repeatedly migrate workloads, while monitoring memory usage for high consumption.
E2E tests introduced as part of PRs:
- Add Preferred-site node failure scenario for metro resiliency karavi-resiliency#320
- Add PowerStore Metro, Multiple Preferred Nodes E2E Test karavi-resiliency#321

karavi-resiliency#320

    Given a kubernetes <kubeConfig>                                                                                                                        # <autogenerated>:1 -> *integration
    And cluster is clean of test pods                                                                                                                      # <autogenerated>:1 -> *integration
    And wait <nodeCleanSecs> to see there are no taints                                                                                                    # <autogenerated>:1 -> *integration
    And label <workers> node as <preferred> site                                                                                                           # <autogenerated>:1 -> *integration
    And <podsPerNode> pods per node with <nVol> volumes and <nDev> devices using <driverType> and <storageClass> in <deploySecs> with <preferred> affinity # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <deploySecs> seconds                                                                                    # <autogenerated>:1 -> *integration
    And all pods are running on <preferred> node                                                                                                           # <autogenerated>:1 -> *integration
    When I fail labeled <preferred> nodes with <failure> failure for <failSecs> seconds                                                                    # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <runSecs> seconds                                                                                       # <autogenerated>:1 -> *integration
    And labeled pods are on a different node                                                                                                               # <autogenerated>:1 -> *integration
    And the taints for the failed nodes are removed within <nodeCleanSecs> seconds                                                                         # <autogenerated>:1 -> *integration
    Then finally cleanup everything                                                                                                                        # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | podsPerNode | nVol  | nDev  | driverType   | storageClass       | workers     | primary | failure         | failSecs | deploySecs | runSecs | nodeCleanSecs | preferred |
      | ""         | "1-1"       | "1-1" | "0-0" | "powerstore" | "powerstore-metro" | "one-third" | "zero"  | "interfacedown" | 240      | 600        | 600     | 600           | "site"    |

1 scenarios (1 passed)
12 steps (12 passed)
7m36.804438307s
INFO[0469] Integration test finished                    
--- PASS: TestPowerStoreShortIntegration (456.85s)
PASS
status 0
ok      podmon/internal/monitor 469.086s

karavi-resiliency#321

    Given a kubernetes <kubeConfig>                                                                                                                        # <autogenerated>:1 -> *integration
    And there are at least <nNodes> worker nodes which are ready                                                                                           # <autogenerated>:1 -> *integration
    And cluster is clean of test pods                                                                                                                      # <autogenerated>:1 -> *integration
    And wait <nodeCleanSecs> to see there are no taints                                                                                                    # <autogenerated>:1 -> *integration
    And label <workers> node as <preferred> site                                                                                                           # <autogenerated>:1 -> *integration
    And <podsPerNode> pods per node with <nVol> volumes and <nDev> devices using <driverType> and <storageClass> in <deploySecs> with <preferred> affinity # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <deploySecs> seconds                                                                                    # <autogenerated>:1 -> *integration
    And all pods are running on <preferred> node                                                                                                           # <autogenerated>:1 -> *integration
    When I fail <workers> nodes with label <preferred> with <failure> failure for <failSecs> seconds                                                       # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <runSecs> seconds                                                                                       # <autogenerated>:1 -> *integration
    And labeled pods are on a different node                                                                                                               # <autogenerated>:1 -> *integration
    And the taints for the failed nodes are removed within <nodeCleanSecs> seconds                                                                         # <autogenerated>:1 -> *integration
    Then finally cleanup everything                                                                                                                        # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | nNodes | podsPerNode | nVol  | nDev  | driverType   | storageClass       | workers    | failure         | failSecs | deploySecs | runSecs | nodeCleanSecs | preferred |
      | ""         | 4      | "1-1"       | "1-1" | "0-0" | "powerstore" | "powerstore-metro" | "one-half" | "interfacedown" | 240      | 600        | 600     | 600           | "site"    |

2 scenarios (2 passed)
25 steps (25 passed)
15m23.708868918s
INFO[0938] Integration test finished                    
--- PASS: TestPowerStoreMetroIntegration (923.75s)
PASS
status 0
ok      podmon/internal/monitor 938.601s

github-actions · 2025-08-13T20:41:23Z

Merging this branch will not change overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
| github.com/dell/csi-powerstore/pkg/controller | 0.00% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
| github.com/dell/csi-powerstore/pkg/controller/csi_extension_server.go | 0.00% (ø) | 0 | 0 | 0 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/dell/csi-powerstore/pkg/controller/controller_test.go
github.com/dell/csi-powerstore/pkg/controller/csi_extension_server_test.go

lukeatdell added 14 commits August 1, 2025 14:12

refactoring tests

a28bbc8

add unit tests for resiliency and metro

9cc1dd2

check both array for metro vol. see comments.

1e32991

- This works: concurrently gets IOs in-progress for both arrays.
- However, when reading results, it reads them sequentially from the channels, so if the first channel blocks, we must wait.

read requests as they're received.

2ccfdb3

- fixes issue from last commit.
- Still hitting an issue with the new test where context doesn't appear to time out.

add unit test for request with many volumes

56972d2

optimizing and debugging concurrent IO request

01dae77

Merge branch 'main' into usr/lukeatdell/metro-resiliency

c77f9e3

refactoring and renaming

ab7add5

adding missed file to previous commit

0cbd054

adding comments to test vars for intellisense

fac6329

refactor after noticing a potential bug

2218c85

- isIOInProgress would return true as soon as a non-nil error was received, leaving any other goroutine unread and blocking.

adding unit tests for new functions

560d840

Merge branch 'main' into usr/lukeatdell/metro-resiliency

18e3d41

adding more comments

2175b3b
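The note on commit 2218c85 above describes an early return that leaves other goroutines blocked on a send that nobody reads. One common way to avoid this is to buffer the result channel to the number of senders. The sketch below illustrates that pattern under stated assumptions: `firstTrue` is a hypothetical helper, not the PR's `isIOInProgress`, and the returned WaitGroup exists only so the example can show that every goroutine exits.

```go
package main

import (
	"fmt"
	"sync"
)

// firstTrue launches one goroutine per check and returns as soon as any
// check reports true. The results channel is buffered to len(checks), so
// goroutines whose results are never read can still send and exit.
// The returned WaitGroup lets the caller confirm that no goroutine is
// left blocked.
func firstTrue(checks []func() bool) (bool, *sync.WaitGroup) {
	results := make(chan bool, len(checks))
	wg := &sync.WaitGroup{}
	for _, check := range checks {
		wg.Add(1)
		go func(c func() bool) {
			defer wg.Done()
			results <- c() // never blocks: capacity equals the number of senders
		}(check)
	}
	for range checks {
		if <-results {
			return true, wg // early return; remaining senders are not stranded
		}
	}
	return false, wg
}

func main() {
	checks := []func() bool{
		func() bool { return true },
		func() bool { return true },
		func() bool { return false },
	}
	found, wg := firstTrue(checks)
	wg.Wait() // with an unbuffered channel, the unread senders would block here
	fmt.Println(found) // prints: true
}
```

An alternative is to keep the channel unbuffered and have each sender `select` on a cancellation signal; the buffered form is shown here because it needs no extra coordination.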

lukeatdell requested review from AkshaySainiDell, abhi16394, adarsh-dell, donatwork and santhoshatdell as code owners August 8, 2025 17:41

linting

24b7cd2

lukeatdell marked this pull request as draft August 8, 2025 18:25

lukeatdell and others added 2 commits August 11, 2025 14:34

Merge branch 'main' into usr/lukeatdell/metro-resiliency

ba05dba

renaming some vars and adding some more comments

360ec7d

falfaroc previously approved these changes Aug 12, 2025

Comment thread pkg/controller/controller_test.go

Comment thread pkg/controller/csi_extension_server.go

PR comments: falfaroc

54ef271

lukeatdell dismissed falfaroc’s stale review via 54ef271 August 12, 2025 14:14

lukeatdell requested a review from falfaroc August 12, 2025 15:17

Merge branch 'main' into usr/lukeatdell/metro-resiliency

d5f1222

kumarp20 reviewed Aug 13, 2025

Comment thread pkg/controller/csi_extension_server.go Outdated

PR comments: kumarp20

b64778b

- removing redundant and bad check for matching arrayIDs
- array connectivity check and volume IO check are now isolated checks.
- adding more unit tests

lukeatdell requested a review from kumarp20 August 13, 2025 20:39

lukeatdell marked this pull request as ready for review August 14, 2025 14:28

falfaroc approved these changes Aug 14, 2025

alikdell approved these changes Aug 14, 2025

santhoshatdell approved these changes Aug 14, 2025

lukeatdell merged commit 15e6b8e into main Aug 14, 2025
6 checks passed

lukeatdell deleted the usr/lukeatdell/metro-resiliency branch August 14, 2025 20:30

lukeatdell merged 20 commits into main from usr/lukeatdell/metro-resiliency
