Add PowerStore Metro, Multiple Preferred Nodes E2E Test#321

Merged
falfaroc merged 4 commits into main from usr/falfaroc/add-metro-node-failure-test on Aug 13, 2025

@falfaroc falfaroc commented Aug 8, 2025

Contributor

Description

Add a new scenario that verifies PowerStore Metro with Resiliency when there are multiple preferred nodes and the preferred node hosting the application pod goes down. The scenario checks that the pod migrates to another preferred node as expected.
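
The check this scenario performs can be pictured with a minimal Go sketch. It is not the step implementation from this PR: `checkFailover`, the `preferred` set, and the node names are hypothetical and exist only to illustrate the expected outcome (the pod leaves the failed node and lands on another node that is still preferred).

```go
package main

import "fmt"

// checkFailover returns an error unless the pod moved off the failed node
// and onto a different node that is still in the preferred set.
// Hypothetical helper for illustration; not taken from the PR's code.
func checkFailover(preferred map[string]bool, failedNode, newNode string) error {
	if newNode == failedNode {
		return fmt.Errorf("pod is still on failed node %q", failedNode)
	}
	if !preferred[newNode] {
		return fmt.Errorf("pod moved to %q, which is not a preferred node", newNode)
	}
	return nil
}

func main() {
	// Hypothetical node names; two of the workers carry the preferred label.
	preferred := map[string]bool{"worker-1": true, "worker-2": true}

	// Pod lands on the other preferred node.
	fmt.Println(checkFailover(preferred, "worker-1", "worker-2"))
	// Pod lands outside the preferred set.
	fmt.Println(checkFailover(preferred, "worker-1", "worker-3"))
}
```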

GitHub Issues

List the GitHub issues impacted by this PR:

GitHub Issue #
https://github.com/dell/csm/issues/1961

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • Backward compatibility is not broken

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Please also list any relevant details for your test configuration

  • Added and ran the PowerStore Metro scenario with multiple preferred nodes.

Clean Run:

$ make powerstore-metro-integration-test
RESILIENCY_INT_TEST="true" \
RESILIENCY_TEST_CLEANUP="true" \
POLL_K8S="true" \
SCRIPTS_DIR="../../test/sh" \
POWERSTORE_METRO="true" \
go test -timeout 6h -test.v -test.run "^\QTestPowerStoreFirstCheck\E|\QTestPowerStoreMetroIntegration\E"
=== RUN   TestPowerStoreFirstCheck
INFO[0000] RESILIENCY_INT_TEST_STOP_ON_FAILURE = true
Feature: Integration Test
  As a CSM for Resiliency developer
  I want to test CSM for Resiliency in a kubernetes environment
  So that it is known to work on various pod clean up cases and give consistent results
INFO[0000] attempting k8sapi connection
INFO[0000] Using kubeconfig /home/falfaroc/.kube/config
INFO[0000] connected to k8sapi
...

  Scenario Outline: Validate that we have a valid k8s configuration for the PowerStore metro integration tests # features/integration.feature:59
INFO[0000] Driver csi-powerstore.dellemc.com exists on the cluster
...
    Given a kubernetes <kubeConfig>                                                                            # <autogenerated>:1 -> *integration
    And test environmental variables are set                                                                   # <autogenerated>:1 -> *integration
    And these CSI driver <driverNames> are configured on the system                                            # <autogenerated>:1 -> *integration
    And these storageClasses <storageClasses> exist in the cluster                                             # <autogenerated>:1 -> *integration
    And there is a <namespace> in the cluster                                                                  # <autogenerated>:1 -> *integration
    And there are driver pods in <namespace> with this <name> prefix                                           # <autogenerated>:1 -> *integration
    And can logon to nodes and drop test scripts                                                               # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | driverNames                  | namespace    | name         | storageClasses     |
      | ""         | "csi-powerstore.dellemc.com" | "powerstore" | "powerstore" | "powerstore-metro" |

1 scenarios (1 passed)
7 steps (7 passed)
1m32.717367134s
INFO[0092] Integration setup check finished
--- PASS: TestPowerStoreFirstCheck (92.73s)
=== RUN   TestPowerStoreMetroIntegration
INFO[0092] RESILIENCY_INT_TEST_STOP_ON_FAILURE = true
INFO[0092] Starting PowerStore Metro integration test
Feature: Integration Test
  As a CSM for Resiliency developer
  I want to test CSM for Resiliency in a kubernetes environment
  So that it is known to work on various pod clean up cases and give consistent results
INFO[0092] attempting k8sapi connection
...
  Scenario Outline: Preferred site node failover testing using test StatefulSet pods (node interface down)                                                 # features/integration.feature:215
INFO[0093] Removing preferred labels from nodes
INFO[0093] Checking if all the nodes are in 'Ready' state
INFO[0093] Checking if nodes have taints
INFO[0093] Taints were not found on the nodes.
...
NAME: pmtps1
LAST DEPLOYED: Fri Aug  8 14:45:01 2025
NAMESPACE: pmtps1
STATUS: deployed
REVISION: 1
TEST SUITE: None
INFO[0155] Waiting up to 600 seconds for pods to deploy
...
    Given a kubernetes <kubeConfig>                                                                                                                        # <autogenerated>:1 -> *integration
    And cluster is clean of test pods                                                                                                                      # <autogenerated>:1 -> *integration
    And wait <nodeCleanSecs> to see there are no taints                                                                                                    # <autogenerated>:1 -> *integration
    And label <workers> node as <preferred> site                                                                                                           # <autogenerated>:1 -> *integration
    And <podsPerNode> pods per node with <nVol> volumes and <nDev> devices using <driverType> and <storageClass> in <deploySecs> with <preferred> affinity # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <deploySecs> seconds                                                                                    # <autogenerated>:1 -> *integration
    And all pods are running on <preferred> node                                                                                                           # <autogenerated>:1 -> *integration
    When I fail labeled <preferred> nodes with <failure> failure for <failSecs> seconds                                                                    # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <runSecs> seconds                                                                                       # <autogenerated>:1 -> *integration
    And labeled pods are on a different node                                                                                                               # <autogenerated>:1 -> *integration
    And the taints for the failed nodes are removed within <nodeCleanSecs> seconds                                                                         # <autogenerated>:1 -> *integration
    Then finally cleanup everything                                                                                                                        # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | podsPerNode | nVol  | nDev  | driverType   | storageClass       | workers     | primary | failure         | failSecs | deploySecs | runSecs | nodeCleanSecs | preferred |
      | ""         | "1-1"       | "1-1" | "0-0" | "powerstore" | "powerstore-metro" | "one-third" | "zero"  | "interfacedown" | 240      | 600        | 600     | 600           | "site"    |
INFO[0585] attempting k8sapi connection
INFO[0585] Using kubeconfig /home/falfaroc/.kube/config
INFO[0585] connected to k8sapi
...

  Scenario Outline: Preferred site node failover to preferred node (w/ metro, multiple preferred nodes)                                                    # features/integration.feature:234
INFO[0585] Node master-1-up9snjq0d5fyy.domain is a control plane node
INFO[0585] Attempting to clean up everything for driverType 'powerstore'
INFO[0585] Removing preferred labels from nodes
INFO[0585] Checking if all the nodes are in 'Ready' state
INFO[0585] Checking if nodes have taints
INFO[0585] Taints were not found on the nodes.
...
INFO[1076] Removing preferred labels from nodes
    Given a kubernetes <kubeConfig>                                                                                                                        # <autogenerated>:1 -> *integration
    And there are at least <nNodes> worker nodes which are ready                                                                                           # <autogenerated>:1 -> *integration
    And cluster is clean of test pods                                                                                                                      # <autogenerated>:1 -> *integration
    And wait <nodeCleanSecs> to see there are no taints                                                                                                    # <autogenerated>:1 -> *integration
    And label <workers> node as <preferred> site                                                                                                           # <autogenerated>:1 -> *integration
    And <podsPerNode> pods per node with <nVol> volumes and <nDev> devices using <driverType> and <storageClass> in <deploySecs> with <preferred> affinity # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <deploySecs> seconds                                                                                    # <autogenerated>:1 -> *integration
    And all pods are running on <preferred> node                                                                                                           # <autogenerated>:1 -> *integration
    When I fail <workers> nodes with label <preferred> with <failure> failure for <failSecs> seconds                                                       # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <runSecs> seconds                                                                                       # <autogenerated>:1 -> *integration
    And labeled pods are on a different node                                                                                                               # <autogenerated>:1 -> *integration
    And the taints for the failed nodes are removed within <nodeCleanSecs> seconds                                                                         # <autogenerated>:1 -> *integration
    Then finally cleanup everything                                                                                                                        # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | nNodes | podsPerNode | nVol  | nDev  | driverType   | storageClass       | workers    | failure         | failSecs | deploySecs | runSecs | nodeCleanSecs | preferred |
      | ""         | 4      | "1-1"       | "1-1" | "0-0" | "powerstore" | "powerstore-metro" | "one-half" | "interfacedown" | 240      | 600        | 600     | 600           | "site"    |

2 scenarios (2 passed)
25 steps (25 passed)
16m23.829281099s
INFO[1076] Integration test finished
--- PASS: TestPowerStoreMetroIntegration (983.84s)
PASS
status 0
ok      podmon/internal/monitor 1076.600s
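
For readers unfamiliar with the `-test.run` expression in the command above, here is a small sketch of how it selects tests. `go test` treats the value as a regular expression, and `\Q...\E` quotes the enclosed text literally. The pattern is copied from the log; the `selected` helper and the third test name are hypothetical and used only for contrast.

```go
package main

import (
	"fmt"
	"regexp"
)

// runPattern is the -test.run expression copied from the command in the log.
const runPattern = `^\QTestPowerStoreFirstCheck\E|\QTestPowerStoreMetroIntegration\E`

// selected reports whether a top-level test name matches the pattern.
// Hypothetical helper for illustration only.
func selected(name string) bool {
	return regexp.MustCompile(runPattern).MatchString(name)
}

func main() {
	for _, name := range []string{
		"TestPowerStoreFirstCheck",
		"TestPowerStoreMetroIntegration",
		"TestPowerStoreIntegration", // hypothetical name, used only for contrast
	} {
		fmt.Printf("%-32s selected=%v\n", name, selected(name))
	}
}
```

Note that the leading `^` applies only to the first alternative, because `|` has the lowest precedence in the expression.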
  • Verified that this test requires multiple preferred nodes and that the scenario is skipped when too few worker nodes are ready.
  Scenario Outline: Preferred site node failover to preferred node (w/ metro, multiple preferred nodes)                                                    # features/integration.feature:234
INFO[0104] Node master-1-njo3bz3jjysjb is a control plane node
WARN[0104] Skipping this scenario. Expected at least 4 but found 2
    Given a kubernetes <kubeConfig>                                                                                                                        # <autogenerated>:1 -> *integration
    And there are at least <nNodes> worker nodes which are ready                                                                                           # <autogenerated>:1 -> *integration
    And cluster is clean of test pods                                                                                                                      # <autogenerated>:1 -> *integration
    And wait <nodeCleanSecs> to see there are no taints                                                                                                    # <autogenerated>:1 -> *integration
    And label <workers> node as <preferred> site                                                                                                           # <autogenerated>:1 -> *integration
    And <podsPerNode> pods per node with <nVol> volumes and <nDev> devices using <driverType> and <storageClass> in <deploySecs> with <preferred> affinity # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <deploySecs> seconds                                                                                    # <autogenerated>:1 -> *integration
    And all pods are running on <preferred> node                                                                                                           # <autogenerated>:1 -> *integration
    When I fail <workers> nodes with label <preferred> with <failure> failure for <failSecs> seconds                                                       # <autogenerated>:1 -> *integration
    Then validate that all pods are running within <runSecs> seconds                                                                                       # <autogenerated>:1 -> *integration
    And labeled pods are on a different node                                                                                                               # <autogenerated>:1 -> *integration
    And the taints for the failed nodes are removed within <nodeCleanSecs> seconds                                                                         # <autogenerated>:1 -> *integration
    Then finally cleanup everything                                                                                                                        # <autogenerated>:1 -> *integration

    Examples:
      | kubeConfig | nNodes | podsPerNode | nVol  | nDev  | driverType   | storageClass       | workers    | failure         | failSecs | deploySecs | runSecs | nodeCleanSecs | preferred |
      | ""         | 4      | "1-1"       | "1-1" | "0-0" | "powerstore" | "powerstore-metro" | "one-half" | "interfacedown" | 240      | 600        | 600     | 600           | "site"    |

1 scenarios (1 passed)
13 steps (1 passed, 12 skipped)
322.868553ms
INFO[0104] Integration test finished
--- PASS: TestPowerStoreMetroIntegration (0.33s)
PASS
status 0
ok      podmon/internal/monitor 104.399s

@falfaroc falfaroc force-pushed the usr/falfaroc/add-metro-node-failure-test branch from aad054c to b279a18 August 8, 2025 19:08
@falfaroc falfaroc marked this pull request as ready for review August 8, 2025 19:12
lukeatdell previously approved these changes Aug 11, 2025
Comment thread internal/monitor/features/integration.feature Outdated
@xuluna xuluna force-pushed the usr/luna/preferred-site branch from 4b9bf2f to 75571ab August 12, 2025 15:06
@falfaroc

Contributor Author

Setting to draft until dependent PR is merged.

@falfaroc falfaroc marked this pull request as draft August 12, 2025 15:42
Base automatically changed from usr/luna/preferred-site to main August 12, 2025 18:13
@xuluna xuluna dismissed lukeatdell’s stale review August 12, 2025 18:13

The base branch was changed.

@falfaroc falfaroc force-pushed the usr/falfaroc/add-metro-node-failure-test branch from 898929b to 7efef4b August 12, 2025 18:26
@falfaroc falfaroc marked this pull request as ready for review August 12, 2025 18:26
@falfaroc falfaroc requested a review from lukeatdell August 12, 2025 18:28
@falfaroc falfaroc force-pushed the usr/falfaroc/add-metro-node-failure-test branch 3 times, most recently from 49bc15e to a607de4 August 12, 2025 19:19
lukeatdell previously approved these changes Aug 12, 2025
Comment thread internal/monitor/integration_steps_test.go
lukeatdell previously approved these changes Aug 12, 2025
Comment thread test/podmontest/Makefile Outdated
anathoodell previously approved these changes Aug 13, 2025
@github-actions


Merging this branch will not change overall coverage

| Impacted Packages                                  | Coverage Δ | 🤖 |
| github.com/dell/karavi-resiliency/internal/monitor | 0.00% (ø)  |    |
| github.com/dell/karavi-resiliency/test/ssh         | 0.00% (ø)  |    |

Coverage by file

Changed files (no unit tests)

| Changed File                                         | Coverage Δ | Total | Covered | Missed | 🤖 |
| github.com/dell/karavi-resiliency/test/ssh/client.go | 0.00% (ø)  | 0     | 0       | 0      |    |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/dell/karavi-resiliency/internal/monitor/integration_steps_test.go

@falfaroc falfaroc merged commit 6c1fe06 into main Aug 13, 2025
6 checks passed
@falfaroc falfaroc deleted the usr/falfaroc/add-metro-node-failure-test branch August 13, 2025 14:11