Fleet server is unexpectedly unenrolled under load #741

@scunningham

Fleet server version: 7.15.0
Host OS: All

Preconditions:

  1. Executing in 7.15 cloud.

Steps to reproduce:

  1. Create a default 7.15 Elastic deployment in cloud.
  2. Modify the configuration to limit the number of simultaneous check-ins to something reasonable (there's a separate issue where the limits that existed for 7.14 were dropped in 7.15).
     • e.g. set server.limits.checkin_limit.max: 100
  3. Use horde to put load on the Fleet Server, triggering the configurable circuit breakers. A horde of 1000 should do it.
  4. Notice that after 10m the Fleet Server is marked "unenrolled", and the system no longer seems to have an active Fleet Server.

Expected behavior:

The Fleet Server should not be automatically unenrolled as a side effect of heavy load.

Analysis:

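Code was added to the Fleet Server to automatically unenroll any ephemeral (cloud-based) agent that did not check in within a time limit. For the cloud, that time limit is set to 10m. We are hitting that case under load: the hosted Fleet Server's own check-ins appear to be starved by the other agents' check-ins, so it exceeds the limit and is unenrolled.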

One fix is to move the communication channel between the hosted Fleet Server's own agent and the Fleet Server to a different port, to avoid resource contention with the other agents checking in. This would prevent those agents from starving comms to the hosted Fleet Server.
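
The effect of separating the two channels can be illustrated with a small sketch. It models only the slot accounting, not real listeners or ports. The `limiter` type and the `internal` pool are hypothetical and are not Fleet Server's own implementation. The value 100 mirrors the `checkin_limit.max` from the reproduction steps and 1000 mirrors the horde size.

```go
package main

import "fmt"

// limiter is a minimal non-blocking concurrency limiter
// (hypothetical; not Fleet Server's own type).
type limiter struct{ slots chan struct{} }

func newLimiter(max int) *limiter {
	return &limiter{slots: make(chan struct{}, max)}
}

// tryAcquire takes a slot if one is free and reports whether it succeeded.
func (l *limiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a previously acquired slot.
func (l *limiter) release() { <-l.slots }

func main() {
	agents := newLimiter(100) // mirrors server.limits.checkin_limit.max: 100
	internal := newLimiter(1) // hypothetical pool reserved for the hosted Fleet Server

	// A horde of 1000 agents tries to check in at once and holds its slots.
	admitted := 0
	for i := 0; i < 1000; i++ {
		if agents.tryAcquire() {
			admitted++
		}
	}
	fmt.Println("agent check-ins admitted:", admitted)

	// On the shared pool the Fleet Server's own check-in is rejected,
	// while a dedicated pool still has room for it.
	fmt.Println("shared pool admits Fleet Server:", agents.tryAcquire())
	fmt.Println("dedicated pool admits Fleet Server:", internal.tryAcquire())
}
```

With a shared pool the hosted Fleet Server competes with the horde for the same 100 slots. With a dedicated pool (standing in here for a separate port) its check-in is not subject to the agents' limit.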

At 9m in:

[Image: Screen Shot 2021-09-27 at 10 20 19 AM]

At 10m in:

[Image: Screen Shot 2021-09-27 at 10 21 25 AM]
