Skip to content

Add a shutdownhook to remove jobs owned by the process#8896

Merged
DaanHoogland merged 2 commits intoapache:4.18from
shapeblue:ghi8590-usageJobsOnRestart
Apr 19, 2024
Merged

Add a shutdownhook to remove jobs owned by the process#8896
DaanHoogland merged 2 commits intoapache:4.18from
shapeblue:ghi8590-usageJobsOnRestart

Conversation

@DaanHoogland
Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland commented Apr 10, 2024

Description

This PR...

Fixes: #8590
by adding a shutdown hook that removes the last planned usage job on exit, if owned by the current process.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Feature/Enhancement Scale or Bug Severity

Feature/Enhancement Scale

  • Major
  • Minor

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

Screenshots (if appropriate):

How Has This Been Tested?

How did you try to break this feature and the system with this change?

  • start the usage server
  • see the job record being created for the running process
  • stop the server
  • check that the record is no longer in the DB

@DaanHoogland
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@andrijapanicsb
Copy link
Copy Markdown
Contributor

If you remove the last record (which, in my head is some kind of protection mechanism, but I might be wrong) - don't we run into risk if the following happens:

  • you start 2 usage servers at the same/similar time (systemctl start)
  • both usage servers try to run the job (while only one should be the "owner of the job" and the other one will notice this and, currently, NOT execute the duplicate job - it will never execute the duplicate job as long as there is that last record which indicates that another/first usage server is the owner of the job)

?

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9212

@DaanHoogland
Copy link
Copy Markdown
Contributor Author

If you remove the last record (which, in my head is some kind of protection mechanism, but I might be wrong) - don't we run into risk if the following happens:

* you start 2 usage servers at the same/similar time (systemctl start)

* both usage servers try to run the job (while only one should be the "owner of the job" and the other one will notice this and, currently, NOT execute the duplicate job - it will never execute the duplicate job as long as there is that last record which indicates that another/first usage server is the owner of the job)

?

Yes, this is correct. The issue is when you stop the Usage server owning the job, it will no longer have an owner and no server will ever execute a job anymore. Therefore the owning process should remove their job on exit.

In the current situation, you have to delete the record manually and the risk is the same. There is just a manual action needed that you need to be aware off.

@DaanHoogland
Copy link
Copy Markdown
Contributor Author

@blueorangutan test alma9 kvm-alma9

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Trillian-Jenkins test job (alma9 mgmt + kvm-alma9) has been kicked to run smoke tests

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian Build Failed (tid-9781)

Co-authored-by: Suresh Kumar Anaparti <sureshkumar.anaparti@gmail.com>
@DaanHoogland
Copy link
Copy Markdown
Contributor Author

@blueorangutan package

@blueorangutan
Copy link
Copy Markdown

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan
Copy link
Copy Markdown

Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 9215

@blueorangutan
Copy link
Copy Markdown

[SF] Trillian test result (tid-9787)
Environment: kvm-alma8 (x2), Advanced Networking with Mgmt server a8
Total time taken: 42344 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr8896-t9787-kvm-alma8.zip
Smoke tests completed. 110 look OK, 0 have errors, 0 did not run
Only failed and skipped tests results shown below:

Test Result Time (s) Test File

Copy link
Copy Markdown
Contributor

@sureshanaparti sureshanaparti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm

Copy link
Copy Markdown
Member

@vishesh92 vishesh92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm
didn't test

@vladimirpetrov
Copy link
Copy Markdown
Contributor

Not sure if this can be avoided but when the usage server is killed (with -9, simulating crash), this machanism no longer works - the old entry stays in the db and a new one (with the new pid) is not created when the server is started again.

@DaanHoogland
Copy link
Copy Markdown
Contributor Author

@vladimirpetrov I don't think it can but I'll have a google. does it lgty otherwise? We can create an improvement later.

Copy link
Copy Markdown
Contributor

@vladimirpetrov vladimirpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM based on manual testing

@DaanHoogland DaanHoogland merged commit 5f8450f into apache:4.18 Apr 19, 2024
@DaanHoogland DaanHoogland deleted the ghi8590-usageJobsOnRestart branch April 19, 2024 07:18
DaanHoogland added a commit that referenced this pull request Apr 19, 2024
* 4.18:
  protect against null-path (#8915)
  UI: Fix missing locale strings for Status widget (#8792)
  Add a shutdownhook to remove jobs owned by the process (#8896)
DaanHoogland added a commit that referenced this pull request Apr 19, 2024
* 4.19:
  protect against null-path (#8915)
  UI: Fix missing locale strings for Status widget (#8792)
  Add a shutdownhook to remove jobs owned by the process (#8896)
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request May 3, 2024
Co-authored-by: Suresh Kumar Anaparti <sureshkumar.anaparti@gmail.com>
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request May 3, 2024
* 4.18:
  protect against null-path (apache#8915)
  UI: Fix missing locale strings for Status widget (apache#8792)
  Add a shutdownhook to remove jobs owned by the process (apache#8896)
dhslove pushed a commit to ablecloud-team/ablestack-cloud that referenced this pull request May 3, 2024
* 4.19:
  protect against null-path (apache#8915)
  UI: Fix missing locale strings for Status widget (apache#8792)
  Add a shutdownhook to remove jobs owned by the process (apache#8896)
@yadvr yadvr modified the milestones: 4.18.3, 4.19.1.0 Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

No open projects
Status: No status

Development

Successfully merging this pull request may close these issues.

7 participants