Support metro volumes in csm-resiliency #533

Merged
lukeatdell merged 20 commits into main from usr/lukeatdell/metro-resiliency
Aug 14, 2025
Conversation

@lukeatdell lukeatdell commented Aug 8, 2025

Contributor

Description

Adds metro volumes as a supported volume type for csm-resiliency.

Previously, ValidateVolumeHostConnectivity() checked only the local volume for in-progress I/O.
Now, the function checks whether the volumeIDs provided in the request include a remote volume. If one is found, it issues two asynchronous requests (one to the local volume, one to the remote volume) to check whether I/O is in progress. If either request reports I/O in progress, any pending request is cancelled, the response is updated to indicate the status, and the response is returned.

GitHub Issues

GitHub Issue #
https://github.com/dell/csm/issues/1961

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • Backward compatibility is not broken

How Has This Been Tested?

320

    Given a kubernetes <kubeConfig>                                                                                                                        # <autogenerated>:1 -> *integration
    And cluster is clean of test pods                                                                                                                      # <autogenerated>:1 -> *integration
    And wait <nodeCleanSecs> to see there are no taints                                                                                                    # <autogenerated>:1 -> *integration
    And label <workers> node as <preferred> site                                                                                                           # <autogenerated>:1 -> *integration
    And <podsPerNode> pods per node with <nVol> volumes and <nDev> devices using <driverType> and <storageClass> in <deploySecs> with <preferred> affinity # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <deploySecs> seconds                                                                                    # <autogenerated>:1 -> *integration
    And all pods are running on <preferred> node                                                                                                           # <autogenerated>:1 -> *integration
    When I fail labeled <preferred> nodes with <failure> failure for <failSecs> seconds                                                                    # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <runSecs> seconds                                                                                       # <autogenerated>:1 -> *integration
    And labeled pods are on a different node                                                                                                               # <autogenerated>:1 -> *integration
    And the taints for the failed nodes are removed within <nodeCleanSecs> seconds                                                                         # <autogenerated>:1 -> *integration
    Then finally cleanup everything                                                                                                                        # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | podsPerNode | nVol  | nDev  | driverType   | storageClass       | workers     | primary | failure         | failSecs | deploySecs | runSecs | nodeCleanSecs | preferred |
      | ""         | "1-1"       | "1-1" | "0-0" | "powerstore" | "powerstore-metro" | "one-third" | "zero"  | "interfacedown" | 240      | 600        | 600     | 600           | "site"    |

1 scenarios (1 passed)
12 steps (12 passed)
7m36.804438307s
INFO[0469] Integration test finished                    
--- PASS: TestPowerStoreShortIntegration (456.85s)
PASS
status 0
ok      podmon/internal/monitor 469.086s

321

    Given a kubernetes <kubeConfig>                                                                                                                        # <autogenerated>:1 -> *integration
    And there are at least <nNodes> worker nodes which are ready                                                                                           # <autogenerated>:1 -> *integration
    And cluster is clean of test pods                                                                                                                      # <autogenerated>:1 -> *integration
    And wait <nodeCleanSecs> to see there are no taints                                                                                                    # <autogenerated>:1 -> *integration
    And label <workers> node as <preferred> site                                                                                                           # <autogenerated>:1 -> *integration
    And <podsPerNode> pods per node with <nVol> volumes and <nDev> devices using <driverType> and <storageClass> in <deploySecs> with <preferred> affinity # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <deploySecs> seconds                                                                                    # <autogenerated>:1 -> *integration
    And all pods are running on <preferred> node                                                                                                           # <autogenerated>:1 -> *integration
    When I fail <workers> nodes with label <preferred> with <failure> failure for <failSecs> seconds                                                       # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <runSecs> seconds                                                                                       # <autogenerated>:1 -> *integration
    And labeled pods are on a different node                                                                                                               # <autogenerated>:1 -> *integration
    And the taints for the failed nodes are removed within <nodeCleanSecs> seconds                                                                         # <autogenerated>:1 -> *integration
    Then finally cleanup everything                                                                                                                        # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | nNodes | podsPerNode | nVol  | nDev  | driverType   | storageClass       | workers    | failure         | failSecs | deploySecs | runSecs | nodeCleanSecs | preferred |
      | ""         | 4      | "1-1"       | "1-1" | "0-0" | "powerstore" | "powerstore-metro" | "one-half" | "interfacedown" | 240      | 600        | 600     | 600           | "site"    |

2 scenarios (2 passed)
25 steps (25 passed)
15m23.708868918s
INFO[0938] Integration test finished                    
--- PASS: TestPowerStoreMetroIntegration (923.75s)
PASS
status 0
ok      podmon/internal/monitor 938.601s

- This works: it concurrently gets the in-progress IOs for both arrays.
- However, results are read sequentially from the channels, so if the first channel blocks, we must wait on it before reading the second.
- Fixes the issue from the last commit.
- Still hitting an issue with the new test where the context doesn't appear to time out.
- isIOInProgress would return true as soon as a non-nil error was received, leaving any other goroutine unread and blocking.
@lukeatdell lukeatdell marked this pull request as draft August 8, 2025 18:25
falfaroc previously approved these changes Aug 12, 2025
Comment thread pkg/controller/controller_test.go
Comment thread pkg/controller/csi_extension_server.go
Comment thread pkg/controller/csi_extension_server.go Outdated
- removing redundant and bad check for matching arrayIDs
- array connectivity check and volume IO check are now isolated checks.
- adding more unit tests
@lukeatdell lukeatdell requested a review from kumarp20 August 13, 2025 20:39
@github-actions

Contributor

Merging this branch will not change overall coverage

| Impacted Packages                             | Coverage Δ | 🤖 |
| github.com/dell/csi-powerstore/pkg/controller | 0.00% (ø)  |    |

Coverage by file

Changed files (no unit tests)

| Changed File                                                          | Coverage Δ | Total | Covered | Missed | 🤖 |
| github.com/dell/csi-powerstore/pkg/controller/csi_extension_server.go | 0.00% (ø)  | 0     | 0       | 0      |    |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/dell/csi-powerstore/pkg/controller/controller_test.go
  • github.com/dell/csi-powerstore/pkg/controller/csi_extension_server_test.go

@lukeatdell lukeatdell marked this pull request as ready for review August 14, 2025 14:28
@lukeatdell lukeatdell merged commit 15e6b8e into main Aug 14, 2025
6 checks passed
@lukeatdell lukeatdell deleted the usr/lukeatdell/metro-resiliency branch August 14, 2025 20:30