CheckOnHostCommand: add missing timeout setting #9677
DaanHoogland merged 1 commit into apache:4.19
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 4.19 #9677 +/- ##
============================================
+ Coverage 15.08% 15.11% +0.02%
+ Complexity 11192 11190 -2
============================================
Files 5406 5406
Lines 473215 473214 -1
Branches 61680 58585 -3095
============================================
+ Hits 71386 71521 +135
- Misses 393880 393883 +3
+ Partials 7949 7810 -139
The new CheckOnHostCommand constructor was missing a reasonable timeout value, so it fell back to the wait timeout (1800s). On a Linstor cluster this meant a wait of over 15 minutes until a host was recognized as down. With a timeout of 20s (as in the other constructor), it takes 4-5 minutes for a host to be recognized as down.
5ce9077 to eca66f8
@blueorangutan package
@blueorangutan package
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11163
slavkap left a comment
Code LGTM, but I haven't tested it.
@blueorangutan package
@rohityadavcloud a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11374
@blueorangutan test
@DaanHoogland a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11709)
@DaanHoogland how should we continue with this?
:D I think we can merge, unless we need more testing for this online change. Personally, I think the smoke tests must have hit this change multiple times, ...
For me this is a regression fix. See also the discussion in #10097. Surely it is not intentional for CloudStack to take 15+ minutes to detect a down host?
Description
The new CheckOnHostCommand constructor was missing a reasonable timeout value, so it fell back to the wait timeout (1800s). On a Linstor cluster this meant a wait of over 15 minutes until a host was recognized as down.
With a timeout of 20s (as in the other constructor), it takes 4-5 minutes for a host to be recognized as down.
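The failure mode described above can be illustrated with a minimal, self-contained sketch. The classes below are hypothetical stand-ins and not the actual CloudStack sources: the names `SketchCommand` and `SketchCheckOnHostCommand`, the `String` host name, and the boolean flag are assumptions made for illustration. Only the 1800s fallback and the 20s timeout come from the description. The sketch shows one way to keep every constructor on the same timeout, by chaining to a single constructor that sets it.

```java
// Hypothetical, simplified stand-ins; NOT the real CloudStack classes.
abstract class SketchCommand {
    // Assumed fallback wait in seconds (the 1800s figure from the description).
    static final int FALLBACK_WAIT_SECONDS = 1800;

    private int waitSeconds = FALLBACK_WAIT_SECONDS;

    final int getWait() {
        return waitSeconds;
    }

    final void setWait(int seconds) {
        this.waitSeconds = seconds;
    }
}

class SketchCheckOnHostCommand extends SketchCommand {
    // Timeout in seconds (the 20s figure from the description).
    static final int CHECK_TIMEOUT_SECONDS = 20;

    private final String hostName;
    // Hypothetical extra argument standing in for whatever the newer constructor takes.
    private final boolean extraFlag;

    // Older-style constructor: delegates, so it cannot miss the timeout.
    SketchCheckOnHostCommand(String hostName) {
        this(hostName, false);
    }

    // Newer-style constructor: without the setWait call below, instances
    // built through it would keep the 1800s fallback from SketchCommand.
    SketchCheckOnHostCommand(String hostName, boolean extraFlag) {
        this.hostName = hostName;
        this.extraFlag = extraFlag;
        setWait(CHECK_TIMEOUT_SECONDS);
    }

    String getHostName() {
        return hostName;
    }

    boolean isExtraFlag() {
        return extraFlag;
    }
}
```

In this sketch, `new SketchCheckOnHostCommand("host-1", true).getWait()` returns 20 rather than 1800. Whether the actual fix chains constructors or repeats the `setWait` call is not stated in this conversation, so the chaining here is only one possible arrangement.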
How Has This Been Tested?
Failover tests (force shutdown of a host) in a Linstor cluster.
How did you try to break this feature and the system with this change?