Skip to content

Fix: Volumes on lost local storage cannot be removed#7594

Merged
DaanHoogland merged 5 commits intoapache:4.18from
shapeblue:fix-volumes-lost-local-storage-issue
Jun 23, 2023
Merged

Fix: Volumes on lost local storage cannot be removed#7594
DaanHoogland merged 5 commits intoapache:4.18from
shapeblue:fix-volumes-lost-local-storage-issue

Conversation

@nvazquez
Copy link
Copy Markdown
Contributor

@nvazquez nvazquez commented Jun 6, 2023

Description

This PR fixes the following:

  • When destroying a local storage pool during host force deletion: then also destroy the volumes on the pool from DB
  • When expunging a destroyed volume: check if the volume's storage pool exists (not removed). In case its removed, skip the volume removal from storage

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

  • Create a local storage KVM zone
  • Deploy a VM to local storage
  • Force remove the host
  • Delete the VM -> This succeeds but its volumes remain forever in "Destroy" (before the fix) -> After the fix: volumes are destroyed

@nvazquez
Copy link
Copy Markdown
Contributor Author

nvazquez commented Jun 6, 2023

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@nvazquez a [LL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [LL]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6086

@nvazquez
Copy link
Copy Markdown
Contributor Author

nvazquez commented Jun 6, 2023

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@nvazquez a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@codecov
Copy link
Copy Markdown

codecov bot commented Jun 6, 2023

Codecov Report

Merging #7594 (f086bec) into 4.18 (0941d01) will increase coverage by 0.01%.
The diff coverage is 38.88%.

@@             Coverage Diff              @@
##               4.18    #7594      +/-   ##
============================================
+ Coverage     12.98%   12.99%   +0.01%     
- Complexity     8981     9001      +20     
============================================
  Files          2716     2716              
  Lines        256307   256391      +84     
  Branches      39967    39988      +21     
============================================
+ Hits          33280    33317      +37     
- Misses       218864   218904      +40     
- Partials       4163     4170       +7     
Impacted Files Coverage Δ
...main/java/com/cloud/storage/dao/VolumeDaoImpl.java 23.23% <0.00%> (-0.39%) ⬇️
...cloud/storage/resource/VmwareStorageProcessor.java 0.32% <0.00%> (-0.01%) ⬇️
...n/java/com/cloud/resource/ResourceManagerImpl.java 0.00% <0.00%> (ø)
...ain/java/com/cloud/storage/StorageManagerImpl.java 1.59% <65.00%> (+0.70%) ⬆️
...n/java/com/cloud/storage/VolumeApiServiceImpl.java 15.30% <80.00%> (+0.27%) ⬆️

... and 6 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mostly sensible but I found one strangity

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-6683)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 46978 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7594-t6683-kvm-centos7.zip
Smoke tests completed. 108 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

@nvazquez
Copy link
Copy Markdown
Contributor Author

nvazquez commented Jun 7, 2023

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@nvazquez a [LL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [LL]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6105

@nvazquez
Copy link
Copy Markdown
Contributor Author

nvazquez commented Jun 7, 2023

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@nvazquez a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-6703)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 46712 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7594-t6703-kvm-centos7.zip
Smoke tests completed. 107 look OK, 1 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_list_vms_metrics_admin Error 3608.74 test_metrics_api.py
test_list_vms_metrics_history Error 7.42 test_metrics_api.py
test_list_volumes_metrics_history Error 7.38 test_metrics_api.py

final Long poolId = pool.getPoolId();
final StoragePoolVO storagePool = _storagePoolDao.findById(poolId);
if (storagePool.isLocal() && isForceDeleteStorage) {
destroyLocalStoragePoolVolumes(poolId);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be good 👍

Comment on lines +1653 to +1655
InternalIdentity pool = role == DataStoreRole.Primary ?
_storagePoolDao.findById(volume.getPoolId()) :
imageStoreDao.findById(volume.getPoolId());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use switch/case ?

Copy link
Copy Markdown
Contributor

@shwstppr shwstppr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM

Copy link
Copy Markdown
Member

@kiranchavala kiranchavala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nvazquez

Observing the following issue

  1. Create a disk offering with storage type as local
  2. Create a compute offering with storage type as local
  3. Launch a vm with data disk , wait till its in running state
  4. Disable the host
  5. Force delete the host
  6. Found the vm gets into destroyed state and is not expunged
  7. The data disk is still in ready state

Ideally the vm and data disk on the local storage should also get remvoed automatically if the local storage is removed

Currently the end user has to manually expunge the vm and the destory data disk

@nvazquez
Copy link
Copy Markdown
Contributor Author

Thanks @kiranchavala, have pushed a fix for that scenario

@nvazquez
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@nvazquez a [SF] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 6276

Copy link
Copy Markdown
Member

@kiranchavala kiranchavala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Tested the data disk fix and it's working fine

@kiranchavala
Copy link
Copy Markdown
Member

@blueorangutan test

@blueorangutan
Copy link
Copy Markdown

@kiranchavala a [SF] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@yadvr yadvr marked this pull request as ready for review June 22, 2023 09:18
Copy link
Copy Markdown
Member

@yadvr yadvr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM didn't test it

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-6811)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 50678 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr7594-t6811-kvm-centos7.zip
Smoke tests completed. 104 look OK, 4 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File
test_01_migrate_VM_and_root_volume Error 78.71 test_vm_life_cycle.py
test_02_migrate_VM_with_two_data_disks Error 52.38 test_vm_life_cycle.py
test_02_unsecure_vm_migration Error 225.86 test_vm_life_cycle.py
test_03_secured_to_nonsecured_vm_migration Error 140.91 test_vm_life_cycle.py
test_04_nonsecured_to_secured_vm_migration Error 150.16 test_vm_life_cycle.py
test_01_verify_ipv6_vpc Error 601.41 test_vpc_ipv6.py
test_02_redundant_VPC_default_routes Failure 632.33 test_vpc_redundant.py
test_05_rvpc_multi_tiers Failure 497.28 test_vpc_redundant.py
test_05_rvpc_multi_tiers Error 497.30 test_vpc_redundant.py
test_02_VPC_default_routes Failure 172.52 test_vpc_router_nics.py

@DaanHoogland DaanHoogland merged commit c809201 into apache:4.18 Jun 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants