Adding a threaded parsing option for statsD input plugin#6922
danielnelson merged 5 commits into influxdata:master
Conversation
plugins/inputs/statsd/statsd.go
MaxParserThreads int `toml:"max_parser_threads"`
Let's not expose this option, since it is so tied to an implementation detail. Let's just pick a good default number and always start this many. How about 3?
I'm not sure I agree with that. It's definitely tied to an implementation detail, but I think there is value in allowing the level of concurrency to be configurable. For instance, if I choose to run telegraf on the same machine as my app, I may want to limit its footprint. But if I set up a dedicated instance to act as a metrics aggregator, being able to spend more CPU to increase throughput (and reduce the number of servers I need to use) could be very useful.
Looking at the existing telegraf.conf, it seems as if there are a few places where this is configurable (but maybe we should rename it?)
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L6124
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L373
https://github.com/influxdata/telegraf/blob/master/etc/telegraf.conf#L4961
Maybe there's some way we could expose a level of concurrency as an option without tying it directly to the number of goroutines?
I'm not confident enough yet that this is the right solution, at the very least we would need to do better performance testing before I'd be willing to add it. I'd also like to discuss this with @reimda before we commit to any new options.
time.Sleep(time.Millisecond)
for len(listener.in) > 0 {
    fmt.Printf("Left in buffer: %v \n", len(listener.in))
    time.Sleep(time.Millisecond)
Sorry about the current state of the tests for this plugin which encourages this, but let's not add any new tests/benchmarks that require sleep.
Here is a bit of testing I did. Until timestamp 07:10 it is running with this branch and 3 goroutines; after this timestamp it is running with the current code (master with #6921). The extra goroutines are reducing the dropped packets, but for some reason there is a slight increase in udp errors at the system level and a moderate decrease in the total number of packets captured. The number of processed packets is about the same. Could we be slowing the read goroutine by having more readers of the channel? If so, why is the system udp_inerrors only slightly decreased?
I'm not sure I'm seeing the decrease in packets captured. From your image, it looks like the green line is the total number of packets received, which remains relatively consistent across both sides of the 7:10 timestamp. It's just the rate that drops, but that makes sense given how the rate is calculated.

I agree that the udp_inerrors seem to be coming in at a higher rate with the goroutine changes, based on your graph. But I'm unsure why. I'll do some looking around as well, but do you have any insight into what types of things cause these errors? (It looks like they come in regularly anyway. Is the error profile the same or different with the different uses?) I'll also try to get something spun up to check this out myself.
@danielnelson I'd love to revisit this and see if we can get a resolution. (Sorry for the long delay, I've been on and off vacation for the last month or so.) Did you get a chance to discuss with @reimda? We are running a forked version with these changes in a pre-production environment and having good results (up to ~120k-130k TPM for a single telegraf instance).
I don't want to add any new options to the plugin interface, but if you update the PR to use a fixed number of reader goroutines, probably 3 or so, then we can merge.
Fair enough. I updated the PR to use 5 (I tend to think the slightly higher number gains us a bit more benefit; we tested up to 1000 without hitting CPU issues), but I can bump it down to 3 if you feel strongly.