beater: Set semaphore, modelindexer chan from mem #9358

marclop merged 18 commits into elastic:main
Conversation
internal/beater/beater.go (outdated)

```go
func eventBufferSize(memLimit float64) int {
	if memLimit > 1 {
		return int(256 * memLimit)
```
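The sizing idea above can be sketched as a runnable function. The 256 base and the >1GB scaling come from the diff; returning the base for sub-1GB instances is an assumption for illustration, since the rest of the function is elided in the diff view.

```go
package main

import "fmt"

// eventBufferSize scales the event channel size with available memory
// (in GB). The 256 base matches the diff above; the sub-1GB branch is
// an illustrative assumption.
func eventBufferSize(memLimit float64) int {
	if memLimit > 1 {
		return int(256 * memLimit)
	}
	return 256
}

func main() {
	for _, gb := range []float64{0.5, 1, 2, 8} {
		fmt.Printf("%vGB -> %d\n", gb, eventBufferSize(gb))
	}
}
```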
Memory usage is still pretty low; we could likely increase from 256 to 512 or even further.
Initial benchmarks

I ran a benchmark matrix with different APM Server instances (1, 2, 8, 15 and 30 gigabytes of RAM) with a constant Elasticsearch deployment of 15 nodes (3 dedicated masters, 12 hot nodes) and 12 shards for all the APM data streams.

[Charts: Peak CPU, Memory, Network]
📚 Go benchmark report

Diff with the report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat
Update the MaxConcurrentDecoders and the modelindexer internal channel based on the total amount of memory that the APM Server can access. It looks at cgroups if available; otherwise, it falls back to the total system memory. Also updates the default semaphore size to `48`, which will be used when the APM Server is 1GB or smaller, and the default modelindexer internal queue size to `64` from `100`. This fixes a couple of issues: 1GB instances could still OOM if enough concurrency was used (> 80 concurrent agents sending data as fast as possible), and APM Server not using the total available memory to its advantage. Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
After merging #9393, all instances had 50 available indexers. Because of how we cycle through them, 1GB and 2GB APM instances were running out of memory. I've modified the PR to also size the number of available indexers based on RAM, so that we can avoid unpleasant OOMs in smaller instances. Taking profiles gave me the clue that I needed after sizing the 1GB instance to 10 available indexers: it turns out that reading the Elasticsearch bulk responses can take quite a bit of memory, which makes sense since we're sending pretty big bulk requests (10K+ events).

apm-server/internal/model/modelindexer/bulk_indexer.go, lines 178 to 180 in 8e27af7

Now a 1GB instance's memory consumption is much better:
This pull request is now in conflicts. Could you fix it @marclop? 🙏
internal/beater/beater.go (outdated)

```go
if limit, err := cgroupMemoryLimit(cgroupReader); err != nil {
	s.logger.Warn(err)
} else {
	memLimit = float64(limit) / 1024 / 1024 / 1024
}
```
Did you say you were going to limit the use of cgroup memory, to allow for other processes in the container?
I capped it to 80% of the cgroup memory limit.
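The limit-detection logic discussed in this thread can be sketched as follows: use the cgroup memory limit when available, capped to 80% to leave headroom for other processes in the container, otherwise fall back to total system memory. Function and parameter names here are illustrative, not the actual apm-server API.

```go
package main

import "fmt"

// memLimitGB returns the effective memory limit in GB: 80% of the cgroup
// limit when one is available, or the total system memory otherwise.
func memLimitGB(cgroupLimit, systemMem uint64, haveCgroup bool) float64 {
	const gb = 1 << 30
	if haveCgroup {
		return 0.8 * float64(cgroupLimit) / gb
	}
	return float64(systemMem) / gb
}

func main() {
	fmt.Println(memLimitGB(1<<30, 4<<30, true)) // 1GB container, cgroup-limited
	fmt.Println(memLimitGB(0, 4<<30, false))    // no cgroup limit: 4GB host
}
```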
internal/beater/beater.go (outdated)

```go
opts.EventBufferSize = int(2048 * memLimit)
if opts.EventBufferSize >= 61440 {
	opts.EventBufferSize = 61440
}
logger.Infof(logMessage,
	"modelindexer.EventBufferSize", opts.EventBufferSize, memLimit,
)
if opts.MaxRequests > 0 {
	return opts
}
// This formula yields the following max requests for APM Server sized:
// 1  2  4  8  15 30
// 10 13 16 22 32 55
maxRequests := int(float64(10) + memLimit*1.5)
if maxRequests > 60 {
	maxRequests = 60
}
```
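Extracted from the diff above, the MaxRequests formula is a base of 10 requests plus 1.5 per GB of memory, capped at 60. A standalone sketch:

```go
package main

import "fmt"

// maxRequests mirrors the formula in the diff: 10 + 1.5 per GB, capped at 60.
func maxRequests(memLimitGB float64) int {
	n := int(10 + memLimitGB*1.5)
	if n > 60 {
		n = 60
	}
	return n
}

func main() {
	for _, gb := range []float64{1, 2, 4, 8, 15, 30, 64} {
		fmt.Printf("%vGB -> %d\n", gb, maxRequests(gb))
	}
}
```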
How did you arrive at these formulae?
If we assumed the max event size (in bytes on the wire) could fit into exactly as many bytes in memory, then isn't a channel size of 2048 for 1GB memory too much? The default max event size is 300KB. 2048 * 300 * 1024 == 629145600, which is (substantially) greater than 1024 * 1024 * 1024 == 1073741824.
What do you think about defining the defaults in terms of:
- some percentage of available memory
- maximum event size * (in-memory model objects overhead factor)
?
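The arithmetic in the question above can be checked directly; the comparison result is what gets settled later in this thread.

```go
package main

import "fmt"

// queueBytes computes the worst-case memory of a full event channel:
// queued events times the max event size in KB, converted to bytes.
func queueBytes(events, maxEventKB int) int {
	return events * maxEventKB * 1024
}

func main() {
	q := queueBytes(2048, 300)   // 2048 events at the 300KB default max size
	limit := 1024 * 1024 * 1024 // 1GB memory limit
	fmt.Println(q, limit, q < limit)
}
```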
I think that is a bit higher than I intended the 1GB instance to hold; however, it would still be within the limit, I think?

1,073,741,824 limit in bytes
0,629,145,600 current buffer max usage in bytes (based on 300KB event size): ~600MB

I think it would be safer to reduce the base queue by 1024 so we end up with up to 300MB of cached events in the queue (and allow some overhead for the in-memory representation).
> How did you arrive at these formulae?

I did some calculations on paper and then adjusted them observing the behavior of benchmarks. The queue didn't have a big impact on throughput, to be honest, but it should allow a bit of data to be cached (in practice, it would be less than 1s at maximal throughput, but that's not what the queue is there for).
Modelindexer bulk indexer size = ~10MB (based on some profiles which indicated that, with 50 available bulk requests, the buffers grew to ~300MB / 50 = ~6MB per available indexer).

So for a 1GB APM instance:
- Max requests: 128 * 10 * 300KB = 375MB
- Available bulk indexers: 11 * 6MB = 66MB
- Modelindexer internal queue: 1024 * 300KB = 300MB
- +25% error margin for other aspects of the server and not consuming all the cgroup memory = 926MB
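The 1GB budget in the bullet points above can be reproduced directly; the per-event size (300KB) and the counts are taken from the list.

```go
package main

import "fmt"

// budgetMB sums the 1GB instance budget from the comment above, in MB.
func budgetMB() float64 {
	const eventKB = 300.0
	decoders := 128 * 10 * eventKB / 1024 // max concurrent decoders: 375MB
	indexers := 11 * 6.0                  // available bulk indexers: 66MB
	queue := 1024 * eventKB / 1024        // modelindexer internal queue: 300MB
	return (decoders + indexers + queue) * 1.25 // +25% error margin
}

func main() {
	fmt.Printf("%.2f MB\n", budgetMB())
}
```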
> I think that is a bit higher than I intended the 1GB instance to hold; however, it would still be within the limit, I think?
> 1,073,741,824 limit in bytes
> 0,629,145,600 current buffer max usage in bytes (based on 300KB event size): ~600MB
🤦♂️ indeed, I was just off by an order of magnitude.
Benchstat

Overall there's a clear improvement in throughput with this PR, I believe as a product of the increased semaphore + autoscaling + tuned buffers. All instances showed improvement; I believe that 30g shows less of an improvement due to the Elasticsearch deployment size (although the deployment is 580g x 3z).

1g

Main ran out of memory, while the PR diff didn't.

2g

$ benchstat -alpha 1 main/2g-30s-30n.txt pr/2g-30s-30n.txt
name old time/op new time/op delta
AgentAll-256 1.91s ±24% 1.61s ±27% -15.64% (p=0.400 n=3+3)
name old events/sec new events/sec delta
AgentAll-256 9.77k ±28% 11.64k ±25% +19.20% (p=0.400 n=3+3)
name old gc_cycles new gc_cycles delta
AgentAll-256 224 ±118% 438 ±55% +95.68% (p=0.200 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-256 278 ±81% 349 ±54% ~ (p=1.000 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-256 1.08G ±14% 0.77G ±27% -28.09% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-256 9.86M ±14% 5.54M ±25% -43.85% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-256 1.18G ±13% 0.89G ±24% -24.53% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-256 41.4 ±26% 6.2 ±59% -84.93% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-256 559MB ± 1% 570MB ± 1% +1.98% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-256 7.90M ± 0% 7.97M ± 1% +0.92% (p=0.200 n=3+3)

4g

$ benchstat -alpha 1 main/4g-30s-30n.txt pr/4g-30s-30n.txt
name old time/op new time/op delta
AgentAll-512 1.25s ± 0% 1.07s ± 1% -14.12% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-512 14.4k ± 0% 16.8k ± 1% +16.44% (p=0.100 n=3+3)
name old gc_cycles new gc_cycles delta
AgentAll-512 159 ± 1% 199 ± 1% +25.10% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-512 211 ±10% 223 ± 7% +6.01% (p=0.400 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-512 1.05G ± 1% 0.98G ± 0% -6.33% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-512 9.41M ± 1% 7.01M ± 1% -25.50% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-512 1.16G ± 2% 1.14G ± 0% -1.76% (p=0.400 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-512 45.4 ± 0% 10.8 ± 1% -76.32% (p=0.100 n=3+3)
name old metrics/sec new metrics/sec delta
AgentAll-512 2.21k ± 0% 2.57k ± 1% +16.40% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-512 579MB ± 0% 566MB ± 0% -2.28% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-512 8.07M ± 0% 7.95M ± 0% -1.50% (p=0.100 n=3+3)

8g

$ benchstat -alpha 1 main/8g-30s-30n.txt pr/8g-30s-30n.txt
name old time/op new time/op delta
AgentAll-1024 610ms ±16% 350ms ± 1% -42.67% (p=0.100 n=3+3)
name old errors/sec new errors/sec delta
AgentAll-1024 599 ±18% 1029 ± 1% +71.83% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-1024 30.0k ±18% 51.5k ± 1% +71.83% (p=0.100 n=3+3)
name old gc_cycles new gc_cycles delta
AgentAll-1024 286 ±14% 278 ±11% -2.68% (p=0.700 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-1024 410 ±28% 561 ± 4% +36.86% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-1024 1.19G ± 5% 2.07G ± 5% +73.42% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-1024 10.5M ± 8% 14.0M ± 5% +33.12% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-1024 1.32G ± 6% 2.31G ± 4% +74.57% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-1024 39.2 ±11% 1.4 ±10% -96.43% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-1024 580MB ± 1% 573MB ± 0% -1.20% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-1024 8.09M ± 0% 8.01M ± 0% -0.90% (p=0.100 n=3+3)

15g

$ benchstat -alpha 1 main/15g-30s-30n.txt pr/15g-30s-30n.txt
name old time/op new time/op delta
AgentAll-1920 371ms ± 0% 323ms ± 0% -12.82% (p=0.100 n=3+3)
name old events/sec new events/sec delta
AgentAll-1920 48.6k ± 0% 55.7k ± 0% +14.59% (p=0.100 n=3+3)
name old gc_cycles new gc_cycles delta
AgentAll-1920 405 ± 1% 148 ± 9% -63.34% (p=0.100 n=3+3)
name old max_goroutines new max_goroutines delta
AgentAll-1920 548 ± 3% 607 ± 5% +10.83% (p=0.100 n=3+3)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-1920 1.21G ± 1% 3.84G ± 4% +216.61% (p=0.100 n=3+3)
name old max_heap_objects new max_heap_objects delta
AgentAll-1920 10.8M ± 2% 26.9M ± 3% +149.36% (p=0.100 n=3+3)
name old max_rss new max_rss delta
AgentAll-1920 1.36G ± 0% 4.29G ± 1% +214.90% (p=0.100 n=3+3)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-1920 31.2 ± 2% 7.6 ± 4% -75.80% (p=0.100 n=3+3)
name old alloc/op new alloc/op delta
AgentAll-1920 567MB ± 0% 568MB ± 0% +0.23% (p=0.100 n=3+3)
name old allocs/op new allocs/op delta
AgentAll-1920 7.98M ± 0% 7.99M ± 0% +0.15% (p=0.700 n=3+3)

30g

$ benchstat -alpha 1 main/30g-30s-30n.txt pr/30g-30s-30n-2048d.txt
name old time/op new time/op delta
AgentAll-3840 181ms ± 1% 173ms ± 3% -4.27% (p=0.008 n=5+5)
name old events/sec new events/sec delta
AgentAll-3840 99.3k ± 0% 103.6k ± 3% +4.34% (p=0.008 n=5+5)
name old gc_cycles new gc_cycles delta
AgentAll-3840 725 ± 0% 313 ±28% -56.76% (p=0.004 n=5+6)
name old max_goroutines new max_goroutines delta
AgentAll-3840 1.02k ± 1% 1.05k ±10% +2.75% (p=0.667 n=5+5)
name old max_heap_alloc new max_heap_alloc delta
AgentAll-3840 1.50G ± 2% 4.09G ±25% +171.76% (p=0.004 n=5+6)
name old max_heap_objects new max_heap_objects delta
AgentAll-3840 13.6M ± 2% 30.6M ±27% +124.10% (p=0.004 n=5+6)
name old max_rss new max_rss delta
AgentAll-3840 1.73G ± 1% 4.64G ±24% +168.33% (p=0.004 n=5+6)
name old mean_available_indexers new mean_available_indexers delta
AgentAll-3840 15.9 ± 1% 15.3 ±38% -3.65% (p=0.758 n=5+6)
name old alloc/op new alloc/op delta
AgentAll-3840 558MB ± 0% 561MB ± 1% +0.46% (p=0.121 n=5+6)
name old allocs/op new allocs/op delta
AgentAll-3840 7.89M ± 0% 7.88M ± 1% -0.14% (p=0.909 n=5+6)
axw
left a comment
The magic numbers (in modelIndexerConfig in particular) make me a bit uncomfortable, but this is a clear improvement so I'm good with merging it and iterating as needed.
Just a few more minor things, otherwise looks good.
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
axw
left a comment
LGTM, thanks! Could you please open an issue to follow up on modelIndexerConfig and maxConcurrentDecoders, to take the max event size and flush size into account?
internal/beater/beater.go

```go
// Before:
func newTracerServer(listener net.Listener, logger *logp.Logger, batchProcessor model.BatchProcessor) (*http.Server, error) {
	cfg := config.DefaultConfig()

// After:
func newTracerServer(cfg *config.Config, listener net.Listener, logger *logp.Logger, batchProcessor model.BatchProcessor) (*http.Server, error) {
```
This was a bit surprising to me, since the modelindexer is passed in. I suppose this is needed for the MaxConcurrentDecoders though, is that right?
This seems like a bit of a problem: it means when self-instrumentation is enabled, there could be 2x the concurrent decoders (1x from external clients, 1x from the server itself). I think this is another reason why the modelindexer should control the rate of input.
(No action required at the moment.)
Absolutely right. I thought about using a smaller arbitrary value instead, but like you say, I think we can address the problem by encapsulating the semaphore in a different manner and having the modelindexer act as the controller.
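The direction discussed here can be sketched like this: the model indexer owns the decoder semaphore, so external clients and self-instrumentation draw from one shared budget instead of each getting their own. A buffered channel stands in for the weighted semaphore, and all names are illustrative rather than the actual apm-server API.

```go
package main

import "fmt"

// indexer owns a single semaphore that gates all concurrent decoders.
type indexer struct {
	sem chan struct{}
}

func newIndexer(maxConcurrentDecoders int) *indexer {
	return &indexer{sem: make(chan struct{}, maxConcurrentDecoders)}
}

// acquire blocks until a decoder slot is free.
func (i *indexer) acquire() { i.sem <- struct{}{} }

// release returns a decoder slot to the shared pool.
func (i *indexer) release() { <-i.sem }

func main() {
	idx := newIndexer(2)
	idx.acquire()
	idx.acquire() // both slots in use; a third acquire would block
	fmt.Println("in use:", len(idx.sem))
	idx.release()
	fmt.Println("in use:", len(idx.sem))
}
```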



Motivation/summary
Update the MaxConcurrentDecoders, the modelindexer internal channel and the MaxRequests based on the total amount of memory that the APM Server can access. It looks at cgroups if available; otherwise, it falls back to the total system memory.

Updates the default semaphore size to 128 from 200, the default modelindexer internal queue size to 1024 from 100, and modelindexer.MaxRequests to 10, down from 50. This fixes a couple of issues: 1GB instances could still OOM if enough concurrency was used (> 80 concurrent agents sending data as fast as possible), and APM Server not using the total available memory to its advantage.
Checklist

- [ ] Update package changelog.yml (only if changes to apmpackage have been made)
- [ ] Documentation has been updated

How to test these changes
Benchmark with different sizes and observe more resource utilization when available.
Related issues
Closes #9182
Closes #9341