[Rate limit processor] Add counter metric for dropped events #23330
ycombinator merged 9 commits into elastic:master from ycombinator:ratelimit-throttled-field
Conversation
Pinging @elastic/integrations (Team:Platforms)
This change is unrelated to this PR but since it's minor I thought I would include it in here. Let me know if you'd prefer to have it in its own PR.
I am ok with adding it here, but maybe only the part "In the current implementation, rate-limited events are dropped." I don't think it is needed to document possible future implementations 🙂
💚 Build Succeeded
💚 Flaky test report: Tests succeeded.
I am on the fence about whether to reset this counter each time or let it grow so it's a true counter. I decided to not go with the latter since it would be more susceptible to wraparound issues. But happy to discuss and change!
If we include it in the event, I agree with resetting it after it is reported.
I decided to keep track of the number of events that have been rate-limited (since the last time an event was allowed through; more on that in https://github.com/elastic/beats/pull/23330/files#r550323522). Alternatively we could simply have a boolean to indicate that rate limiting has happened. But I felt like a number might be more informative. Happy to discuss and change, though!
On a related note, it would be nice if we could include this metric as part of the libbeat monitoring infrastructure. For example, the dns processor is doing something like this. If you agree, I can implement this as part of this PR or a separate PR.
Agree with current approach.
As an alternative I was thinking of setting only a tag in the event, and having a metric in the monitoring infra report the number of rate-limited events, but this can be problematic if multiple rate_limit processors are used.
jsoriano left a comment
I have added some thoughts, let me know what you think. Thanks for adding this!
Nit: Don't keep track of this if p.config.MetricField == ""; the atomic operation could add some contention.
I wonder if we should decide what field to use for this. This way users wouldn't need to think what field to use, with the risk of choosing a field that is used by something else. We could also provide a mapping for the field. And we could also keep this always enabled.
(Naming is complicated, I don't have a suggestion for this field 😬 )
As an example of something somewhat similar, the multiline feature of filebeat adds a multiline flag in a known field (log.flags) when multiple lines are merged into a single event, and this cannot be disabled.
I recall having a conversation previously where we thought it might be a good idea to give the user control of the field: #21020 (comment). But your points here about the advantages of having a fixed field are also valid.
Maybe we should take a step back and ask what is the use case for including a field in the event that describes a past state, especially considering that events may be batched and retried by some outputs. I think the idea is to know that rate limiting is happening or not and, if it is, to what extent. So maybe a monitoring counter is the more reliable way of expressing this?
I think the idea is to know that rate limiting is happening or not and, if it is, to what extent.
Yes, I think the same.
So maybe a monitoring counter is the more reliable way of expressing this?
A monitoring counter sounds good, but I wonder if this is enough when multiple rate_limit processors are configured in the same beat, it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.
Having a counter for each rate_limit could be an alternative, but it can complicate things, and I am not sure how each counter could be correlated to each config.
Having the info in the event is not so nice or reliable, but I guess it is easier to identify what kind of events are being throttled.
Maybe for now we can go with a global monitoring counter (one for all rate_limit processors) and wait for feedback to see if this is enough. Having debug logging about the throttled events could also help with tricky cases and complicated configs.
A monitoring counter sounds good, but I wonder if this is enough when multiple rate_limit processors are configured in the same beat, it can be difficult to see which one is being triggered. Though maybe this is not such a problem in real deployments.
I looked at how the dns processor solves this issue. It does so by giving each instance of the processor its own ID and then using it for logging and metrics monitoring purposes:
beats/libbeat/processors/dns/dns.go, lines 60 to 65 (commit 11c5367)
So maybe we could play the same game here?
Alternatively we could create a new processors.MonitoredProcessor struct that implements processors.Processor but also implements a MonitoringRegistry() method that provides a monitoring registry instance for any processor that wishes to be monitored. This way we will have a consistent namespace for each processor instance within the libbeat monitoring registry for per-processor-instance metrics.
WDYT?
Having debug logging about the throttled events could also help with tricky cases and complicated configs.
Agreed. We already have this today:
Ok, if we have already other processors doing something like this I am ok with following the same approach 👍
Cool, I'm going to change this PR to remove the metric field on the event and instead set up a monitoring counter for the cumulative total number of events rate limited.
If multiple goroutines are sending events at the same time, all of them could see that p.numRateLimited.Load() > 0, and they would create multiple events with the same count before it is reset to 0. We could use the atomic swap to ensure that the counter is only set in one event.
Current:

```go
if p.config.MetricField != "" && p.numRateLimited.Load() > 0 {
	event.PutValue(p.config.MetricField, p.numRateLimited.Load())
	p.numRateLimited.Store(0)
```

Suggested:

```go
if p.config.MetricField != "" {
	if count := p.numRateLimited.Swap(0); count > 0 {
		event.PutValue(p.config.MetricField, count)
```
@jsoriano I've updated this PR significantly per #23330 (comment). Please re-review when you get a chance. Thanks!
jsoriano left a comment
Looks good, added some minor comments.
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
[Rate limit processor] Add counter metric for dropped events (#23330) (#23493)
* [Rate limit processor] Add counter metric for dropped events (#23330)
* Adding throtted_field
* Documenting the field
* Adding note on dropping of events
* Renaming metric field
* Adding CHANGELOG entry
* Converting to monitoring counter metric
* Removing metric_field
* Fixing wrapping
* Removing old entry from CHANGELOG
* Removing extra empty lines from CHANGELOG
Co-authored-by: Jaime Soriano Pastor <jaime.soriano@elastic.co>
What does this PR do?
This PR adds a monitoring counter metric, processors.rate_limit.n.dropped, where n is the nth instance (1-based) of a rate_limit processor used in a Beat's configuration. This counter is incremented each time the processor drops an event due to rate limiting.
Why is it important?
This allows users of the processor to understand whether their events are being rate limited and by how much.
Checklist
- I have made corresponding changes to the default configuration files
- I have added tests that prove my fix is effective or that my feature works
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
How to test this PR locally
Create a minimal Filebeat configuration with this processor in it.
Run Filebeat with the above configuration.
Send events to Filebeat via STDIN at a rate faster than one event per minute (the rate limit).
In another window check that the Filebeat Stats API has the monitoring counter implemented by this PR and that it is incrementing as expected.
Related issues