In #40400 we added the ability for metricsets to report their status using the Elastic Agent control protocol to make errors more visible. One consequence of this has been surfacing previously hidden problems collecting information about PIDs.
It appears that by default we fail to get complete metrics for some PIDs on both Windows and Linux, resulting in the Elastic Agent being permanently unhealthy in the Fleet UI with no quick fix available:
While this functionality is useful and has helped us find bugs, we do not want users to immediately experience unhealthy agents they can't obviously fix. As a stopgap while we fix the underlying problems, let's keep reporting the error message but have the system process metricsets report as healthy. This will surface the error messages in the UI without making the agent unhealthy, which most users treat as something that must be fixed immediately.
For example we see something like the following in our leak detection tests regularly:
```yaml
- id: system/metrics-default
  state:
    message: 'Healthy: communicating with pid ''1556'''
    pid: 0
    state: 2
    units:
      input-system/metrics-default-system/metrics-system-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
        state: DEGRADED
        message: |-
          Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
          error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
          GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
        payload:
          streams:
            system/metrics-system.process.summary-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
              error: |-
                Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
              status: DEGRADED
```
The proposal is that we keep the error messages, but report the state as healthy.
```yaml
- id: system/metrics-default
  state:
    message: 'Healthy: communicating with pid ''1556'''
    pid: 0
    state: 2
    units:
      input-system/metrics-default-system/metrics-system-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
        state: HEALTHY
        message: |-
          Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
          error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
          GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
        payload:
          streams:
            system/metrics-system.process.summary-5f5e65eb-2fd6-41e1-8c29-f24d57e66509:
              error: |-
                Error fetching data for metricset system.process_summary: Not enough privileges to fetch information: Not enough privileges to fetch information: GetInfoForPid: could not get all information for PID 0: error fetching name: OpenProcess failed for pid=0: The parameter is incorrect.
                error fetching status: OpenProcess failed for pid=0: The parameter is incorrect.
                GetInfoForPid: could not get all information for PID 4: error fetching name: GetProcessImageFileName failed for pid=4: GetProcessImageFileName failed: invalid argument
              status: HEALTHY
```
This will keep the information accessible for bug reports but should hopefully reduce the perceived urgency of the problem and the volume of support cases. The control protocol always allows both a state and a message, regardless of what the state is, see here.
There will be a follow-up issue to make it possible to switch between these two behaviors.