KVM: return null state instead of Disconnected when investigating a host without NFS #10515
@blueorangutan package
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## 4.19 #10515 +/- ##
============================================
- Coverage 15.17% 15.16% -0.01%
+ Complexity 11332 11328 -4
============================================
Files 5414 5414
Lines 474802 474802
Branches 57909 57909
============================================
- Hits 72028 72008 -20
- Misses 394718 394742 +24
+ Partials 8056 8052 -4
Flags with carried forward coverage won't be shown.
@blueorangutan package
@weizhouapache a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12686
@blueorangutan test
@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-12603)
kiranchavala left a comment
LGTM. Verified manually by executing the following steps:
- Create a CloudStack environment with 2 hosts and no NFS primary storage.
- On one of the KVM hosts, configure and enable HA.
- Add a firewall rule that drops outgoing packets to port 8250:
iptables -I OUTPUT -p tcp -m tcp --dport 8250 -j DROP
- Check the management server logs.
Before the fix, CloudStack does not reach the HypervInvestigator, VMwareInvestigator, or PingInvestigator:
2025-03-06 13:36:30,022 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Investigating why host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} has disconnected with event PingTimeout
2025-03-06 13:36:30,023 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) checking if agent (Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}) is alive
2025-03-06 13:36:30,025 DEBUG [c.c.a.t.Request] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Sending { Cmd , MgmtId: 32986892337576, via: 1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Waiting some more time because this is the current command
2025-03-06 13:37:10,041 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Waiting some more time because this is the current command
2025-03-06 13:37:10,042 WARN [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Timed out on Seq 1-8864491441548689460: { Cmd , MgmtId: 32986892337576, via: 1(ref-trl-8094-k-mol8-kiran-chavala-kvm1), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:37:10,047 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Seq 1-8864491441548689460: Cancelling.
2025-03-06 13:37:10,047 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Operation timed out: Commands 8864491441548689460 to Host 1 timed out after 100
2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) SimpleInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:37:10,067 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) XenServerInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:37:10,083 WARN [c.c.h.KVMInvestigator] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent investigation was requested on host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"}, but host does not support investigation because it has no NFS storage. Skipping investigation.
2025-03-06 13:37:10,083 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) KVMInvestigator was able to determine host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} is in Disconnected
2025-03-06 13:37:10,083 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) The agent from host Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} state determined is Disconnected
2025-03-06 13:37:10,083 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-10:ctx-c383007c) (logid:363305d1) Agent is disconnected but the host is still up: Host {"id":1,"name":"ref-trl-8094-k-mol8-kiran-chavala-kvm1","type":"Routing","uuid":"40f96f30-2b3d-47bd-86ab-cea4c4a5dd4f"} state: Enabled
After the fix, CloudStack goes on to run the HypervInvestigator, VMwareInvestigator, and PingInvestigator:
[root@ol8 ~]# cat /var/log/cloudstack/management/management-server.log |grep -i "logid:b39c7f05"
2025-03-06 13:08:59,485 INFO [c.c.a.m.AgentManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Investigating why host Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"} has disconnected with event PingTimeout
2025-03-06 13:08:59,485 DEBUG [c.c.a.m.AgentManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}) is alive
2025-03-06 13:08:59,487 DEBUG [c.c.a.t.Request] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Sending { Cmd , MgmtId: 32987949302884, via: 2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:09:49,487 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Waiting some more time because this is the current command
2025-03-06 13:10:39,487 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Waiting some more time because this is the current command
2025-03-06 13:10:39,488 WARN [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Timed out on Seq 2-5748563449361727501: { Cmd , MgmtId: 32987949302884, via: 2(ref-trl-8087-k-mol8-kiran-chavala-kvm2), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":"50","bypassHostMaintenance":"false"}}] }
2025-03-06 13:10:39,488 DEBUG [c.c.a.m.AgentAttache] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 2-5748563449361727501: Cancelling.
2025-03-06 13:10:39,489 WARN [c.c.a.m.AgentManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Operation timed out: Commands 5748563449361727501 to Host 2 timed out after 100
2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) SimpleInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,491 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) XenServerInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,494 WARN [c.c.h.KVMInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Agent investigation was requested on host Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}, but host does not support investigation because it has no NFS storage. Skipping investigation.
2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) KVMInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) HypervInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,494 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) VMwareInvestigator unable to determine the state of the host. Moving on.
2025-03-06 13:10:39,495 DEBUG [c.c.h.UserVmDomRInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) checking if agent (Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"}) is alive
2025-03-06 13:10:39,496 DEBUG [c.c.h.UserVmDomRInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) sending ping from (Host {"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"}) to agent's host ip address (10.0.35.136)
2025-03-06 13:10:39,497 DEBUG [c.c.a.t.Request] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052: Sending { Cmd , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1, Flags: 100011, [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"10.0.35.136","wait":"20","bypassHostMaintenance":"false"}}] }
2025-03-06 13:10:39,511 DEBUG [c.c.a.t.Request] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) Seq 1-728457239727181052: Received: { Ans: , MgmtId: 32987949302884, via: 1(ol8.localdomain), Ver: v1, Flags: 10, { Answer } }
2025-03-06 13:10:39,512 DEBUG [c.c.h.AbstractInvestigatorImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) host (10.0.35.136) has been successfully pinged, returning that host is up
2025-03-06 13:10:39,512 DEBUG [c.c.h.UserVmDomRInvestigator] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) ping from (Host {"id":1,"name":"ol8.localdomain","type":"Routing","uuid":"c0fd498b-e0ff-433c-a68d-698a982a5f6f"}) to agent's host ip address (10.0.35.136) successful, returning that agent is disconnected
2025-03-06 13:10:39,512 DEBUG [c.c.h.HighAvailabilityManagerImpl] (AgentTaskPool-8:ctx-9220e781) (logid:b39c7f05) PingInvestigator was able to determine host Host {"id":2,"name":"ref-trl-8087-k-mol8-kiran-chavala-kvm2","type":"Routing","uuid":"ec2fdf6c-809d-42b9-96e0-1ff6abde5f89"} is in Disconnected
Great, thanks @kiranchavala for testing!
Description
Currently, when a KVM host has no NFS storage, KVMInvestigator reports it as Disconnected during agent/VM investigation.
Because that answer ends the investigation, the remaining investigators are never run.
This PR makes KVMInvestigator return a null state in that case, so that the remaining investigators are run.
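The effect of returning null instead of Disconnected can be illustrated with a small, self-contained model of an investigator chain. This is only a sketch under stated assumptions, not the code changed by this PR: the class `InvestigatorChainSketch`, the `Host` and `Investigator` types, and the `chain` and `investigate` helpers are all hypothetical, and the chain is reduced to two investigators. The only behaviour taken from the logs above is that the first investigator to return a state ends the investigation, and that a successful ping yields Disconnected.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Simplified, hypothetical model of an investigator chain; not CloudStack's actual classes.
class InvestigatorChainSketch {

    enum Status { UP, DOWN, DISCONNECTED }

    /** Minimal host model holding only the two facts this sketch needs. */
    static final class Host {
        final boolean hasNfs;
        final boolean answersPing;

        Host(boolean hasNfs, boolean answersPing) {
            this.hasNfs = hasNfs;
            this.answersPing = answersPing;
        }
    }

    /** Hypothetical investigator: its check returns null when it cannot determine the state. */
    static final class Investigator {
        final String name;
        final Function<Host, Status> check;

        Investigator(String name, Function<Host, Status> check) {
            this.name = name;
            this.check = check;
        }
    }

    /** Walks the chain in order; the first non-null answer ends the investigation. */
    static String investigate(List<Investigator> chain, Host host) {
        for (Investigator inv : chain) {
            Status status = inv.check.apply(host);
            if (status != null) {
                return inv.name + ":" + status;
            }
        }
        return "undetermined";
    }

    /** Builds a two-step chain whose KVM step behaves as before or after the change. */
    static List<Investigator> chain(boolean kvmReturnsNullWithoutNfs) {
        Investigator kvm = new Investigator("KVMInvestigator", h -> {
            if (!h.hasNfs) {
                // Before: DISCONNECTED ends the chain. After: null lets the chain continue.
                return kvmReturnsNullWithoutNfs ? null : Status.DISCONNECTED;
            }
            return Status.UP; // Placeholder (assumption) standing in for the NFS-based check.
        });
        // Assumption: a failed ping is treated as "cannot determine" in this sketch.
        Investigator ping = new Investigator("PingInvestigator",
                h -> h.answersPing ? Status.DISCONNECTED : null);
        return Arrays.asList(kvm, ping);
    }

    public static void main(String[] args) {
        Host host = new Host(false, true); // No NFS storage, but reachable by ping.
        System.out.println("before: " + investigate(chain(false), host));
        System.out.println("after:  " + investigate(chain(true), host));
    }
}
```

For a host without NFS that still answers ping, the "before" chain stops at KVMInvestigator, while the "after" chain falls through to PingInvestigator, which matches the order of the log lines in the test comment.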
How Has This Been Tested?
Below is an example of the investigation process with this PR
(on the KVM host, I added a firewall rule to drop packets to port 8250 of the management server).

How did you try to break this feature and the system with this change?