[Rate limit processor] Add counter metric for dropped events #23330
ycombinator merged 9 commits into elastic:master from ycombinator:ratelimit-throttled-field
Conversation
Pinging @elastic/integrations (Team:Platforms)
This change is unrelated to this PR but since it's minor I thought I would include it in here. Let me know if you'd prefer to have it in its own PR.
I am ok with adding it here, but maybe only the part "In the current implementation, rate-limited events are dropped." I don't think it is needed to document possible future implementations 🙂
💚 Build Succeeded
💚 Flaky test report: Tests succeeded.
I am on the fence about whether to reset this counter each time or let it grow so it's a true counter. I decided to not go with the latter since it would be more susceptible to wraparound issues. But happy to discuss and change!
If we include it in the event, I agree with resetting it after it is reported.
I decided to keep track of the number of events that have been rate-limited (since the last time an event was allowed through; more on that in https://github.com/elastic/beats/pull/23330/files#r550323522). Alternatively we could simply have a boolean to indicate that rate limiting has happened. But I felt like a number might be more informative. Happy to discuss and change, though!
On a related note, it would be nice if we could include this metric as part of the libbeat monitoring infrastructure. For example, the dns processor is doing something like this. If you agree, I can implement this as part of this PR or a separate PR.
Agree with current approach.
As an alternative I was thinking of setting only a tag in the event, and having a metric in the monitoring infra report the number of rate-limited events, but this can be problematic if multiple rate_limit processors are used.
jsoriano left a comment
I have added some thoughts, let me know what you think. Thanks for adding this!
Nit: Don't keep track of this if p.config.MetricField == ""; the atomic operation could add some contention.
I wonder if we should decide what field to use for this. This way users wouldn't need to think what field to use, with the risk of choosing a field that is used by something else. We could also provide a mapping for the field. And we could also keep this always enabled.
(Naming is complicated, I don't have a suggestion for this field 😬 )
As an example of something somewhat similar, the multiline feature of filebeat adds a multiline flag in a known field (log.flags) when multiple lines are merged into a single event, and this cannot be disabled.
I recall having a conversation previously where we thought it might be a good idea to give the user control of the field: #21020 (comment). But your points here about the advantages of having a fixed field are also valid.
Maybe we should take a step back and ask what is the use case for including a field in the event that describes a past state, especially considering that events may be batched and retried by some outputs. I think the idea is to know that rate limiting is happening or not and, if it is, to what extent. So maybe a monitoring counter is the more reliable way of expressing this?
I think the idea is to know that rate limiting is happening or not and, if it is, to what extent.
Yes, I think the same.
So maybe a monitoring counter is the more reliable way of expressing this?
A monitoring counter sounds good, but I wonder if this is enough when multiple rate_limit processors are configured in the same beat, it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.
Having a counter for each rate_limit could be an alternative, but it can complicate things, and I am not sure how each counter could be correlated to each config.
Having the info in the event is not so nice or reliable, but I guess it is easier to identify what kind of events are being throttled.
Maybe for now we can go with a global monitoring counter (one for all rate_limit processors) and wait for feedback to see if this is enough. Having debug logging about the throttled events could also help with tricky cases and complicated configs.
A monitoring counter sounds good, but I wonder if this is enough when multiple rate_limit processors are configured in the same beat, it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.
I looked at how the dns processor solves this issue. It does so by giving each instance of the processor its own ID and then using it for logging and metrics monitoring purposes:
beats/libbeat/processors/dns/dns.go, lines 60 to 65 (commit 11c5367)
So maybe we could play the same game here?
Alternatively we could create a new processors.MonitoredProcessor struct that implements processors.Processor but also implements a MonitoringRegistry() method that provides a monitoring registry instance for any processor that wishes to be monitored. This way we will have a consistent namespace for each processor instance within the libbeat monitoring registry for per-processor-instance metrics.
WDYT?
Having debug logging about the throttled events could also help with tricky cases and complicated configs.
Agreed. We already have this today:
Ok, if we have already other processors doing something like this I am ok with following the same approach 👍
Cool, I'm going to change this PR to remove the metric field on the event and instead set up a monitoring counter for the cumulative total number of events rate limited.
If multiple goroutines are sending events at the same time, all of them could see that p.numRateLimited.Load() > 0, and they would create multiple events with the same count before it is reset to 0. We could use the atomic swap to ensure that the counter is only set in one event.
Current:

```go
if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
	event.PutValue(p.config.MetricField, p.numRateLimited.Load())
	p.numRateLimited.Store(0)
```

Suggested:

```go
if p.config.MetricField != "" {
	if count := p.numRateLimited.Swap(0); count > 0 {
		event.PutValue(p.config.MetricField, count)
```
@jsoriano I've updated this PR significantly per #23330 (comment). Please re-review when you get a chance. Thanks!
jsoriano left a comment
Looks good, added some minor comments.
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
[Rate limit processor] Add counter metric for dropped events (#23330) (#23493)
* [Rate limit processor] Add counter metric for dropped events (#23330)
* Adding throtted_field
* Documenting the field
* Adding note on dropping of events
* Renaming metric field
* Adding CHANGELOG entry
* Converting to monitoring counter metric
* Removing metric_field
* Fixing wrapping
* Removing old entry from CHANGELOG
* Removing extra empty lines from CHANGELOG
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
What does this PR do?
This PR adds a monitoring counter metric, processors.rate_limit.n.dropped, where n is the nth instance (1-based) of a rate_limit processor used in a Beat's configuration. This counter is incremented each time the processor drops an event due to rate limiting.
Why is it important?
This allows users of the processor to understand whether their events are being rate limited and by how much.
Checklist
- I have made corresponding changes to the default configuration files
- I have added tests that prove my fix is effective or that my feature works
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
Create a minimal Filebeat configuration with this processor in it.
Run Filebeat with the above configuration.
Send events to Filebeat via STDIN at a rate faster than one event per minute (the rate limit).
In another window check that the Filebeat Stats API has the monitoring counter implemented by this PR and that it is incrementing as expected.
Related issues