[metricbeat] - Allow metricsets to report their status via v2 protocol#40025
Conversation
|
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
|
6c10a9f to
e37ef86
Compare
|
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
faec
left a comment
There was a problem hiding this comment.
This looks good, but I don't see any actual invocations of SetStatusReporter, which seems to be required for the other changes to have an effect. Am I missing something, or is that coming in a later PR?
@Fae, This part was introduced by #39209 and it's done at Lines 157 to 161 in 1a6318d |
faec
left a comment
There was a problem hiding this comment.
Thanks for clarifying, looks good to me!
@cmacknz Let me know how it looks to you now and if you have any more comments regarding this. @faec Thanks for reviewing! |
|
The changes themselves LGTM, but I am not getting the result I expected when I test this manually. What I did:
~$ cat script.sh
while true;
do
echo "Hello!";
sleep 1;
done
~$ sudo ./script.sh
{"log.level":"debug","@timestamp":"2024-07-16T21:04:30.148Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:04:40.143Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:05:50.143Z","message":"Error fetching PID info for 30642, skipping: GetInfoForPid: no such process","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":198,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidIter"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.055Z","message":"Non-fatal error fetching PID metrics for 30356, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30356/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30357, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30357/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30358, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30358/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30649, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30649/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30650, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30650/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30651, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30651/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30652, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30652/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.logger":"processes","log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"debug","@timestamp":"2024-07-16T21:06:10.056Z","message":"Non-fatal error fetching PID metrics for 30670, metrics are valid, but partial: Not enough privileges to fetch information: /io unavailable; if running inside a container, use SYS_PTRACE: error fetching IO metrics: open /proc/30670/io: permission denied","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"system/metrics-default","type":"system/metrics"},"log":{"source":"system/metrics-default"},"log.origin":{"file.line":268,"file.name":"process/process.go","function":"github.com/elastic/elastic-agent-system-metrics/metric/system/process.(*Stats).pidFill"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"processes","ecs.version":"1.6.0"}I've attached diagnostics from the VM. The logs indicating the error is non-fatal might be what is catching us here, I think this has changed since the issue to implement this was created. Perhaps you were initially right hooking into the error in the event, which might catch these since they are non-fatal, but I still think looking at events makes it harder to see when we will go back to healthy. Regardless, this isn't actually catching the degraded state we want. unpriv-metricbeat-diagnostics.zip I also wonder if we could get a test to catch this, assuming I verified this correctly. |
|
@cmacknz I see. I think this is due to the fact that We can make use of |
|
This pull request is now in conflicts. Could you fix it? 🙏 |
|
This pull request is now in conflicts. Could you fix it? 🙏 |
210c929 to
abcb26a
Compare
| err := fetcher.Fetch(reporter.V2()) | ||
| if err != nil { | ||
| reporter.V2().Error(err) | ||
| msw.module.UpdateStatus(status.Degraded, fmt.Sprintf("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.MetricSet.Name(), err)) |
There was a problem hiding this comment.
nit: use %v for the err
| err := fetcher.Fetch(ctx, reporter.V2()) | ||
| if err != nil { | ||
| reporter.V2().Error(err) | ||
| msw.module.UpdateStatus(status.Degraded, fmt.Sprintf("Error fetching data for metricset %s.%s: %s", msw.module.Name(), msw.MetricSet.Name(), err)) |
There was a problem hiding this comment.
nit: use %v for the err
There was a problem hiding this comment.
@pkoutsovasilis I didn't see this.
I'll raise another PR for this.
Sorry from my end!
pkoutsovasilis
left a comment
There was a problem hiding this comment.
Left some nits, but looks good to me
|
@cmacknz Screenshot of error event on Kibana:
|
|
Merging this as of now. |
#40025) * fix: initial commit * tests: add integration test cases * fix: expand testing scenarios * fix: add comments * fix: move integration tests to system/process * cleanup * fix: ci * fix: ci and typos * chore: update changelog * fix: add helper * fix: remove extra space * fix: ci * fix: move integration tests to x-pack * fix: add null check * fix: ci * fix: remove unused code * fix: lint * fix: lint and imports * fix: ci windows * inting for windows * fix lint linux * fix: go imports * fix: switch to the generic way * chore: make error descriptive * fix: move status report after fetch * fix: typo * fix: remove nolint --------- Co-authored-by: Pierre HILBERT <pierre.hilbert@elastic.co> (cherry picked from commit c86330a)
|
|

Proposed commit message
Add
StatusReporterfor letting individual metricsets to report their statuses.Checklist
CHANGELOG.next.asciidocorCHANGELOG-developer.next.asciidoc.Related issues
Use cases
Screenshots
Logs