Fleet server version: 7.15.0
Host OS: All
Preconditions:
- Executing in 7.15 cloud.
Steps to reproduce:
- Create default 7.15 Elastic deployment in cloud.
- Modify the configuration to limit the number of simultaneous check-ins to something reasonable (there's a separate issue where the limits that existed in 7.14 were dropped in 7.15).
- i.e., set server.limits.checkin_limit.max: 100
- Use horde to put load on the Fleet Server, triggering the configurable circuit breakers. A horde of 1000 should do it.
- Notice that after 10m the Fleet Server is marked "unenrolled", and the system no longer seems to have an active Fleet Server.
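
For readers unfamiliar with the check-in limit used in the steps above, here is a minimal Go sketch of a concurrency cap of that kind. It is an illustration only, not the Fleet Server implementation: `checkinLimiter`, `newCheckinLimiter`, `tryAcquire`, and `release` are hypothetical names, and the only value taken from this report is the limit of 100.

```go
package main

// checkinLimiter is a hypothetical counting semaphore that caps the number
// of check-ins handled at the same time. It is NOT taken from Fleet Server.
type checkinLimiter struct {
	slots chan struct{}
}

// newCheckinLimiter creates a limiter that allows at most max concurrent
// check-ins (for example 100, as in server.limits.checkin_limit.max: 100).
func newCheckinLimiter(max int) *checkinLimiter {
	return &checkinLimiter{slots: make(chan struct{}, max)}
}

// tryAcquire claims a slot without blocking. It returns false when the
// limit is already reached, which is the point where a server would
// reject the check-in instead of queueing it.
func (l *checkinLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot. Call it exactly once per successful tryAcquire;
// calling it without a held slot would block.
func (l *checkinLimiter) release() {
	<-l.slots
}
```

With a horde of 1000 agents and a cap of 100, most check-in attempts at any moment would be rejected under this model. If the Fleet Server's own agent shares that cap, its check-ins can be rejected just like any other agent's.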
Expected behavior:
The Fleet Server should not be automatically unenrolled as a side effect of heavy load.
Analysis:
Code was added to the Fleet Server to automatically unenroll any ephemeral agent (cloud based) that did not check in within a time limit. For the cloud, that time limit is set to 10m. We are hitting that case under load.
One possible fix is to move the communication channel between the agent and the Fleet Server to a different port, so that it does not contend for resources with Agents checking in. This would prevent those Agents from starving communication with the hosted Fleet Server.
At 9m in:

At 10m in:
