beater: Set semaphore, modelindexer chan from mem#9358

Merged
marclop merged 18 commits into elastic:main from marclop:f/dynamic-semaphore-mi-buffer-based-on-memory
Nov 4, 2022

Conversation

@marclop
Contributor

@marclop marclop commented Oct 13, 2022

Motivation/summary

Update the MaxConcurrentDecoders, the modelindexer internal channel, and the MaxRequests based on the total amount of memory that the APM Server can access. It looks at cgroups if available; otherwise, it falls back to the total system memory.

Updates the default semaphore size from 200 to 128, the default modelindexer internal queue size from 100 to 1024, and modelindexer.MaxRequests from 50 down to 10.

This fixes a couple of issues: 1GB instances could still OOM under enough concurrency (> 80 concurrent agents sending data as fast as possible), and APM Server didn't use the total available memory to its advantage.

Checklist

How to test these changes

Benchmark with different sizes and observe more resource utilization when available.

Related issues

Closes #9182
Closes #9341

@marclop marclop added enhancement backport-skip Skip notification from the automated backport with mergify v8.6.0 labels Oct 13, 2022

```go
func eventBufferSize(memLimit float64) int {
	if memLimit > 1 {
		return int(256 * memLimit)
	}
	// ... (snippet truncated in the review quote)
}
```
Contributor Author


Memory usage is still pretty low; we could likely increase from 256 to 512 or even further.

@ghost

ghost commented Oct 13, 2022

💚 Build Succeeded



Build stats

  • Start Time: 2022-11-04T08:18:43.903+0000

  • Duration: 27 min 21 sec

Test stats 🧪

Test Results

  • Failed: 0
  • Passed: 153
  • Skipped: 0
  • Total: 153

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate and publish the docker images.

  • /test windows : Build & tests on Windows.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@marclop
Contributor Author

marclop commented Oct 13, 2022

Initial benchmarks

I ran a benchmark matrix with different APM Server instance sizes (1, 2, 8, 15 and 30 gigabytes of RAM) against a constant Elasticsearch deployment of 15 nodes (3 dedicated masters, 12 hot nodes) and 12 shards for all the APM data streams. Peak events/sec was reached at and above 8g; bigger instances didn't seem to improve performance much.

CPU


Memory


Network


benchstat comparison

WIP.

@marclop marclop force-pushed the f/dynamic-semaphore-mi-buffer-based-on-memory branch from 7d33bb9 to 09c843f Compare October 13, 2022 05:58
@ghost

ghost commented Oct 13, 2022

📚 Go benchmark report

Diff with the main branch

name                                                                                              old time/op    new time/op     delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
ContextReset/X-Real-IP_ipv4-12                                                                       691ns ±54%      945ns ±11%  +36.80%  (p=0.032 n=5+5)
ContextReset/X-Real-IP_ipv6-12                                                                       974ns ±21%      697ns ±33%  -28.40%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
ModelIndexer/NoCompression-12                                                                       1.27µs ± 8%     0.72µs ± 2%  -43.10%  (p=0.008 n=5+5)
ModelIndexer/NoCompressionScaling-12                                                                 785ns ± 3%      694ns ± 4%  -11.59%  (p=0.008 n=5+5)
ModelIndexer/BestSpeedScaling-12                                                                    4.63µs ±13%     3.31µs ±33%  -28.44%  (p=0.016 n=5+5)
ModelIndexer/DefaultCompression-12                                                                  3.05µs ± 4%     3.27µs ± 7%   +7.10%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/errors_rum.ndjson-12                                                               25.0µs ±20%     31.8µs ±18%  +27.19%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-json-event.ndjson-12            3.74µs ± 3%     4.13µs ±19%  +10.32%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/span-links.ndjson-12                    2.70µs ± 7%     2.98µs ± 9%  +10.22%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions-huge_traces.ndjson-12      10.5µs ± 3%     11.7µs ±12%  +11.37%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/optional-timestamps.ndjson-12           1.83µs ± 1%     1.94µs ± 3%   +6.04%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/otel-bridge.ndjson-12                   3.80µs ± 1%     3.97µs ± 1%   +4.54%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12                     6.28µs ± 2%     6.74µs ± 2%   +7.43%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12                    1.55µs ± 1%     1.64µs ± 2%   +5.35%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12                         10.9µs ± 1%     11.5µs ± 1%   +5.94%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions-huge_traces.ndjson-12      5.50µs ± 1%     5.76µs ± 1%   +4.77%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions.ndjson-12                  10.9µs ± 1%     11.5µs ± 0%   +5.50%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12            10.7µs ± 1%     11.4µs ± 1%   +6.34%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12        2.00µs ± 2%     2.10µs ± 0%   +5.03%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12      1.93µs ± 1%     2.02µs ± 1%   +4.44%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12             7.14µs ± 1%     7.52µs ± 2%   +5.39%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12                      6.89µs ± 4%     7.67µs ± 2%  +11.36%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12                    6.47µs ± 2%     7.13µs ± 1%  +10.22%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_rum.ndjson-12                  1.91µs ± 5%     2.04µs ± 1%   +6.95%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_transaction_id.ndjson-12       5.36µs ± 2%     5.62µs ± 0%   +4.76%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12                      13.0µs ± 1%     13.7µs ± 1%   +5.39%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12           803ns ± 2%      833ns ± 1%   +3.70%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event.ndjson-12               3.30µs ± 1%     3.47µs ± 1%   +5.24%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12          1.09µs ± 2%     1.14µs ± 1%   +4.18%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12       1.88µs ± 1%     1.95µs ± 1%   +4.01%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/logs.ndjson-12                        5.48µs ± 4%     5.88µs ± 1%   +7.25%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata.ndjson-12                    1.26µs ± 1%     1.31µs ± 1%   +4.18%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metricsets.ndjson-12                  4.38µs ± 1%     4.56µs ± 1%   +4.20%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal-service.ndjson-12              978ns ± 2%     1018ns ± 1%   +4.12%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal.ndjson-12                     1.96µs ± 2%     2.06µs ± 3%   +4.83%  (p=0.016 n=4+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/optional-timestamps.ndjson-12         1.46µs ± 2%     1.53µs ± 2%   +4.49%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/otel-bridge.ndjson-12                 3.16µs ± 1%     3.34µs ± 2%   +5.82%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12                   5.52µs ± 1%     5.83µs ± 3%   +5.60%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/span-links.ndjson-12                  1.19µs ± 1%     1.25µs ± 4%   +5.00%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12                       9.48µs ± 1%     9.74µs ± 1%   +2.83%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions-huge_traces.ndjson-12    4.60µs ± 2%     4.79µs ± 2%   +4.26%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions.ndjson-12                9.26µs ± 2%     9.72µs ± 1%   +4.91%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans.ndjson-12          9.24µs ± 2%     9.60µs ± 2%   +3.88%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12      1.60µs ± 1%     1.68µs ± 1%   +4.79%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum_2.ndjson-12    1.52µs ± 3%     1.59µs ± 1%   +4.55%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/unknown-span-type.ndjson-12           6.09µs ± 1%     6.41µs ± 1%   +5.21%  (p=0.008 n=5+5)
ReadBatch/transactions_spans_rum_2.ndjson-12                                                        20.9µs ±18%     16.0µs ±27%  -23.47%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
TraceGroups-12                                                                                       150ns ±10%      129ns ±20%  -14.06%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old alloc/op   new alloc/op    delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
ModelIndexer/NoCompression-12                                                                       2.95kB ± 0%     2.20kB ± 2%  -25.57%  (p=0.008 n=5+5)
ModelIndexer/NoCompressionScaling-12                                                                2.95kB ± 0%     2.18kB ± 0%  -26.07%  (p=0.008 n=5+5)
ModelIndexer/DefaultCompression-12                                                                  2.55kB ± 1%     2.58kB ± 1%   +1.26%  (p=0.016 n=5+5)
ModelIndexer/BestCompressionScaling-12                                                              2.59kB ± 1%     2.61kB ± 1%   +0.92%  (p=0.048 n=5+5)
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/invalid-json-event.ndjson-12                                                       4.30kB ± 2%     4.43kB ± 3%   +3.12%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/optional-timestamps.ndjson-12           5.21kB ± 1%     5.15kB ± 1%   -1.20%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/transactions-huge_traces.ndjson-12      16.0kB ± 1%     15.6kB ± 2%   -2.29%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/transactions_spans.ndjson-12            25.1kB ± 1%     25.5kB ± 1%   +1.31%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/optional-timestamps.ndjson-12           5.22kB ± 1%     5.28kB ± 1%   +1.31%  (p=0.032 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/metricsets.ndjson-12                    14.5kB ± 1%     14.4kB ± 0%   -1.13%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_rum.ndjson-12                  8.36kB ± 1%     8.44kB ± 1%   +0.92%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64
ShardedWriteTransactionContended-12                                                                 8.33kB ±28%     6.86kB ± 3%  -17.62%  (p=0.016 n=5+4)

name                                                                                              old allocs/op  new allocs/op   delta
pkg:github.com/elastic/apm-server/internal/agentcfg goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/beater/request goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
ModelIndexer/NoCompression-12                                                                         24.0 ± 0%       21.0 ± 0%  -12.50%  (p=0.008 n=5+5)
ModelIndexer/NoCompressionScaling-12                                                                  24.0 ± 0%       21.0 ± 0%  -12.50%  (p=0.008 n=5+5)
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/internal/publish goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/spanmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/aggregation/txmetrics goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling goos:linux goarch:amd64
pkg:github.com/elastic/apm-server/x-pack/apm-server/sampling/eventstorage goos:linux goarch:amd64

name                                                                                              old speed      new speed       delta
pkg:github.com/elastic/apm-server/internal/model/modelindexer goos:linux goarch:amd64
ModelIndexer/NoCompression-12                                                                     1.96GB/s ± 8%   3.44GB/s ± 2%  +75.45%  (p=0.008 n=5+5)
ModelIndexer/NoCompressionScaling-12                                                              3.16GB/s ± 3%   3.58GB/s ± 4%  +13.12%  (p=0.008 n=5+5)
ModelIndexer/BestSpeedScaling-12                                                                   539MB/s ±15%    778MB/s ±28%  +44.35%  (p=0.016 n=5+5)
ModelIndexer/DefaultCompression-12                                                                 815MB/s ± 4%    761MB/s ± 6%   -6.59%  (p=0.032 n=5+5)
pkg:github.com/elastic/apm-server/internal/processor/stream goos:linux goarch:amd64
BackendProcessor/errors_rum.ndjson-12                                                             76.7MB/s ±17%   60.4MB/s ±19%  -21.35%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel2/invalid-json-event.ndjson-12           157MB/s ± 3%    143MB/s ±17%   -8.59%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/span-links.ndjson-12                   253MB/s ± 7%    230MB/s ±10%   -9.21%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel4/transactions-huge_traces.ndjson-12     301MB/s ± 3%    271MB/s ±11%   -9.83%  (p=0.016 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/optional-timestamps.ndjson-12          561MB/s ± 1%    529MB/s ± 4%   -5.67%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/otel-bridge.ndjson-12                  495MB/s ± 1%    473MB/s ± 1%   -4.35%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/ratelimit.ndjson-12                    671MB/s ± 2%    625MB/s ± 2%   -6.92%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/span-links.ndjson-12                   439MB/s ± 1%    416MB/s ± 2%   -5.09%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/spans.ndjson-12                        739MB/s ± 1%    698MB/s ± 1%   -5.61%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions-huge_traces.ndjson-12     576MB/s ± 1%    550MB/s ± 1%   -4.56%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions.ndjson-12                 520MB/s ± 1%    492MB/s ± 0%   -5.22%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans.ndjson-12           544MB/s ± 1%    511MB/s ± 1%   -5.96%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum.ndjson-12       577MB/s ± 2%    549MB/s ± 0%   -4.80%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/transactions_spans_rum_2.ndjson-12     578MB/s ± 1%    553MB/s ± 1%   -4.25%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel8/unknown-span-type.ndjson-12            463MB/s ± 1%    439MB/s ± 2%   -5.10%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors.ndjson-12                     921MB/s ± 4%    827MB/s ± 2%  -10.22%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_2.ndjson-12                   729MB/s ± 2%    661MB/s ± 1%   -9.27%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_rum.ndjson-12                 994MB/s ± 5%    929MB/s ± 1%   -6.57%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/errors_transaction_id.ndjson-12      713MB/s ± 2%    681MB/s ± 0%   -4.55%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/events.ndjson-12                     572MB/s ± 1%    543MB/s ± 1%   -5.12%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event-type.ndjson-12         487MB/s ± 2%    469MB/s ± 1%   -3.57%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-event.ndjson-12              232MB/s ± 1%    221MB/s ± 1%   -4.97%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-event.ndjson-12         538MB/s ± 2%    516MB/s ± 1%   -4.00%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/invalid-json-metadata.ndjson-12      238MB/s ± 1%    229MB/s ± 1%   -3.84%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/logs.ndjson-12                       637MB/s ± 5%    594MB/s ± 1%   -6.81%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metadata.ndjson-12                   988MB/s ± 1%    948MB/s ± 1%   -4.01%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/metricsets.ndjson-12                 581MB/s ± 1%    558MB/s ± 1%   -4.03%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/minimal-service.ndjson-12            435MB/s ± 2%    418MB/s ± 1%   -3.95%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/optional-timestamps.ndjson-12        703MB/s ± 2%    673MB/s ± 1%   -4.30%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/otel-bridge.ndjson-12                595MB/s ± 1%    563MB/s ± 2%   -5.50%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/ratelimit.ndjson-12                  763MB/s ± 1%    723MB/s ± 3%   -5.28%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/span-links.ndjson-12                 572MB/s ± 1%    545MB/s ± 4%   -4.74%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/spans.ndjson-12                      847MB/s ± 1%    824MB/s ± 1%   -2.75%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions-huge_traces.ndjson-12   689MB/s ± 2%    661MB/s ± 2%   -4.08%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions.ndjson-12               609MB/s ± 2%    581MB/s ± 1%   -4.69%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans.ndjson-12         630MB/s ± 2%    606MB/s ± 2%   -3.73%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum.ndjson-12     723MB/s ± 1%    690MB/s ± 1%   -4.57%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/transactions_spans_rum_2.ndjson-12   734MB/s ± 3%    702MB/s ± 1%   -4.35%  (p=0.008 n=5+5)
BackendProcessorParallel/BenchmarkBackendProcessorParallel200/unknown-span-type.ndjson-12          543MB/s ± 1%    516MB/s ± 1%   -4.95%  (p=0.008 n=5+5)
ReadBatch/transactions_spans_rum_2.ndjson-12                                                      54.2MB/s ±20%   72.7MB/s ±27%  +34.19%  (p=0.032 n=5+5)

report generated with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat

Update the MaxConcurrentDecoders and the modelindexer internal channel
based on the total amount of memory that the APM server can access. It
looks at cgroups if available, otherwise, falls back to the total system
memory.

Also updates the default semaphore size to `48`, which will be used when
APM server is 1GB or smaller and the default modelindexer internal queue
size to `64` from `100`.

This fixes a couple of issues; 1GB instances could still OOM if enough
concurrency was used (> 80 concurrent agents sending data as fast as
possible), and another issue where APM Server doesn't use the total
available memory to its advantage.

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop marclop force-pushed the f/dynamic-semaphore-mi-buffer-based-on-memory branch from 09c843f to 40ef0de Compare October 13, 2022 06:46

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>

@marclop
Contributor Author

marclop commented Nov 1, 2022

After merging #9393, all instances had 50 available indexers. Because of how we cycle through them, 1GB and 2GB APM instances were running out of memory. I've modified the PR to also size the number of available indexers based on RAM, so that we can avoid unpleasant OOMs in smaller instances.

Taking profiles gave me the clue that I needed, after sizing the 1GB instance to 10 available indexers:

heap profile: 377: 74804032 [681456: 883525128] @ heap/1048576
10: 41943040 [10: 41943040] @ 0xaaaab7611238 0xaaaab7610b8c 0xaaaab7611018 0xaaaab83fa898 0xaaaab83fcb04 0xaaaab83fdb68 0xaaaab7956c00 0xaaaab7578d64
#	0xaaaab7611237	bytes.makeSlice+0x57											bytes/buffer.go:229
#	0xaaaab7610b8b	bytes.(*Buffer).grow+0x10b										bytes/buffer.go:142
#	0xaaaab7611017	bytes.(*Buffer).ReadFrom+0x47										bytes/buffer.go:202
#	0xaaaab83fa897	github.com/elastic/apm-server/internal/model/modelindexer.(*bulkIndexer).Flush+0x327			github.com/elastic/apm-server/internal/model/modelindexer/bulk_indexer.go:178
#	0xaaaab83fcb03	github.com/elastic/apm-server/internal/model/modelindexer.(*Indexer).flush+0x253			github.com/elastic/apm-server/internal/model/modelindexer/indexer.go:408
#	0xaaaab83fdb67	github.com/elastic/apm-server/internal/model/modelindexer.(*Indexer).runActiveIndexer.func2+0x37	github.com/elastic/apm-server/internal/model/modelindexer/indexer.go:535
#	0xaaaab7956bff	golang.org/x/sync/errgroup.(*Group).Go.func1+0x5f							golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75

It turns out that reading the Elasticsearch bulk responses can take quite a bit of memory, which makes sense since we're sending pretty big bulk requests (10K+ events).

```go
if _, err := b.respBuf.ReadFrom(res.Body); err != nil {
	return elasticsearch.BulkIndexerResponse{}, err
}
```

Now a 1GB instance memory consumption is much better:

BenchmarkAgentAll-192	      20	6652711227 ns/op	         0 error_responses/sec	        54.11 errors/sec	      2715 events/sec	        50.00 gc_cycles	        86.00 max_goroutines	 509110984 max_heap_alloc	   5071053 max_heap_objects	 608043008 max_rss	         7.048 mean_available_indexers	       420.4 metrics/sec	      1624 spans/sec	       617.3 txs/sec	577225184 B/op	 8139136 allocs/op
BenchmarkAgentAll-192	      24	6072171566 ns/op	         0 error_responses/sec	        59.29 errors/sec	      2976 events/sec	        59.00 gc_cycles	        90.00 max_goroutines	 536567928 max_heap_alloc	   5108225 max_heap_objects	 623251456 max_rss	         6.812 mean_available_indexers	       461.3 metrics/sec	      1779 spans/sec	       676.4 txs/sec	576634447 B/op	 8126064 allocs/op
BenchmarkAgentAll-192	      37	3758763803 ns/op	         0 error_responses/sec	        95.78 errors/sec	      4801 events/sec	        88.00 gc_cycles	       119.0 max_goroutines	 512009848 max_heap_alloc	   4998253 max_heap_objects	 617234432 max_rss	         6.821 mean_available_indexers	       738.8 metrics/sec	      2874 spans/sec	      1093 txs/sec	574661768 B/op	 8139747 allocs/op

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop marclop marked this pull request as ready for review November 2, 2022 07:46
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop marclop requested a review from a team November 2, 2022 07:54
@mergify
Contributor

mergify bot commented Nov 2, 2022

This pull request is now in conflicts. Could you fix it @marclop? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

```sh
git fetch upstream
git checkout -b f/dynamic-semaphore-mi-buffer-based-on-memory upstream/f/dynamic-semaphore-mi-buffer-based-on-memory
git merge upstream/main
git push upstream f/dynamic-semaphore-mi-buffer-based-on-memory
```

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
```go
if limit, err := cgroupMemoryLimit(cgroupReader); err != nil {
	s.logger.Warn(err)
} else {
	memLimit = float64(limit) / 1024 / 1024 / 1024
}
```
Member


Did you say you were going to limit the use of cgroup memory, to allow for other processes in the container?

Contributor Author


I capped it to 80% of the cgroup memory limit

Comment on lines +695 to +711
```go
opts.EventBufferSize = int(2048 * memLimit)
if opts.EventBufferSize >= 61440 {
	opts.EventBufferSize = 61440
}
logger.Infof(logMessage,
	"modelindexer.EventBufferSize", opts.EventBufferSize, memLimit,
)
if opts.MaxRequests > 0 {
	return opts
}
// This formula yields the following max requests for APM Server sized:
//  1  2  4  8 15 30
// 10 13 16 22 32 55
maxRequests := int(float64(10) + memLimit*1.5)
if maxRequests > 60 {
	maxRequests = 60
}
```
Member


How did you arrive at these formulae?

If we assumed the max event size (in bytes on the wire) could fit into exactly as many bytes in memory, then isn't a channel size of 2048 for 1GB of memory too much? The default max event size is 300KB. 2048 * 300 * 1024 == 629145600, which is (substantially) greater than 1024 * 1024 * 1024 == 1073741824.

What do you think about defining the defaults in terms of:

  • some percentage of available memory
  • maximum event size * (in-memory model objects overhead factor)

?

Contributor Author

@marclop marclop Nov 3, 2022


I think that is a bit higher than I intended the 1GB instance to hold; however, it would still be within the limit, I think?

1,073,741,824 limit in bytes
0,629,145,600 current buffer max usage in bytes, based on a 300KB event size (600MB)

I think it would be safer to reduce the base queue by 1024, so we end up with up to 300MB of cached events in the queue (and allow some overhead for the in-memory representation).

How did you arrive at these formulae?

I did some calculations on paper and then adjusted them by observing the behavior of benchmarks. The queue didn't have a big impact on throughput, to be honest, but it should allow a bit of data to be cached (in practice, it would be less than 1s at maximal throughput, but that's not what the queue is there for).

Modelindexer bulk indexer size = ~10MB (based on some profiles which indicated that, with 50 available bulk requests, the buffers grew to ~300MB / 50 = ~6MB per available indexer).

So for a 1GB APM Instance

  • Max requests of 128 * 10 * 300kb = 375MB
  • Available bulk indexers 11 * 6 = 66MB
  • Modelindexer internal queue = 1024 * 300kb = 300MB
  • +25% error for other aspects of the server and not consuming all the cgroup memory = 926MB

Member


> I think that is a bit higher than I intended the 1GB instance to hold; however, it would still be within the limit, I think?
>
> 1,073,741,824 limit in bytes
> 0,629,145,600 current buffer max usage in bytes, based on a 300KB event size (600MB)

🤦‍♂️ Indeed, I was just off by an order of magnitude.

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop
Contributor Author

marclop commented Nov 3, 2022

Benchstat

Overall, there's a clear improvement in throughput with this PR, which I believe is a product of the increased semaphore, autoscaling, and tuned buffers. All instances showed improvement; 30g shows less of an improvement, possibly due to the Elasticsearch deployment size (although the deployment is 580g x 3z).

1g

Main ran out of memory, while PR diff didn't.

2g

$ benchstat -alpha 1 main/2g-30s-30n.txt pr/2g-30s-30n.txt
name          old time/op                  new time/op                  delta
AgentAll-256                   1.91s ±24%                   1.61s ±27%  -15.64%  (p=0.400 n=3+3)
name          old events/sec               new events/sec               delta
AgentAll-256                   9.77k ±28%                  11.64k ±25%  +19.20%  (p=0.400 n=3+3)
name          old gc_cycles                new gc_cycles                delta
AgentAll-256                    224 ±118%                     438 ±55%  +95.68%  (p=0.200 n=3+3)
name          old max_goroutines           new max_goroutines           delta
AgentAll-256                     278 ±81%                     349 ±54%     ~     (p=1.000 n=3+3)
name          old max_heap_alloc           new max_heap_alloc           delta
AgentAll-256                   1.08G ±14%                   0.77G ±27%  -28.09%  (p=0.100 n=3+3)
name          old max_heap_objects         new max_heap_objects         delta
AgentAll-256                   9.86M ±14%                   5.54M ±25%  -43.85%  (p=0.100 n=3+3)
name          old max_rss                  new max_rss                  delta
AgentAll-256                   1.18G ±13%                   0.89G ±24%  -24.53%  (p=0.100 n=3+3)
name          old mean_available_indexers  new mean_available_indexers  delta
AgentAll-256                    41.4 ±26%                     6.2 ±59%  -84.93%  (p=0.100 n=3+3)
name          old alloc/op                 new alloc/op                 delta
AgentAll-256                   559MB ± 1%                   570MB ± 1%   +1.98%  (p=0.100 n=3+3)
name          old allocs/op                new allocs/op                delta
AgentAll-256                   7.90M ± 0%                   7.97M ± 1%   +0.92%  (p=0.200 n=3+3)

4g

$ benchstat -alpha 1 main/4g-30s-30n.txt pr/4g-30s-30n.txt
name          old time/op                  new time/op                  delta
AgentAll-512                   1.25s ± 0%                   1.07s ± 1%  -14.12%  (p=0.100 n=3+3)
name          old events/sec               new events/sec               delta
AgentAll-512                   14.4k ± 0%                   16.8k ± 1%  +16.44%  (p=0.100 n=3+3)
name          old gc_cycles                new gc_cycles                delta
AgentAll-512                     159 ± 1%                     199 ± 1%  +25.10%  (p=0.100 n=3+3)
name          old max_goroutines           new max_goroutines           delta
AgentAll-512                     211 ±10%                     223 ± 7%   +6.01%  (p=0.400 n=3+3)
name          old max_heap_alloc           new max_heap_alloc           delta
AgentAll-512                   1.05G ± 1%                   0.98G ± 0%   -6.33%  (p=0.100 n=3+3)
name          old max_heap_objects         new max_heap_objects         delta
AgentAll-512                   9.41M ± 1%                   7.01M ± 1%  -25.50%  (p=0.100 n=3+3)
name          old max_rss                  new max_rss                  delta
AgentAll-512                   1.16G ± 2%                   1.14G ± 0%   -1.76%  (p=0.400 n=3+3)
name          old mean_available_indexers  new mean_available_indexers  delta
AgentAll-512                    45.4 ± 0%                    10.8 ± 1%  -76.32%  (p=0.100 n=3+3)
name          old metrics/sec              new metrics/sec              delta
AgentAll-512                   2.21k ± 0%                   2.57k ± 1%  +16.40%  (p=0.100 n=3+3)
name          old alloc/op                 new alloc/op                 delta
AgentAll-512                   579MB ± 0%                   566MB ± 0%   -2.28%  (p=0.100 n=3+3)
name          old allocs/op                new allocs/op                delta
AgentAll-512                   8.07M ± 0%                   7.95M ± 0%   -1.50%  (p=0.100 n=3+3)

8g

$ benchstat -alpha 1 main/8g-30s-30n.txt pr/8g-30s-30n.txt
name           old time/op                  new time/op                  delta
AgentAll-1024                   610ms ±16%                   350ms ± 1%  -42.67%  (p=0.100 n=3+3)
name           old errors/sec               new errors/sec               delta
AgentAll-1024                     599 ±18%                    1029 ± 1%  +71.83%  (p=0.100 n=3+3)
name           old events/sec               new events/sec               delta
AgentAll-1024                   30.0k ±18%                   51.5k ± 1%  +71.83%  (p=0.100 n=3+3)
name           old gc_cycles                new gc_cycles                delta
AgentAll-1024                     286 ±14%                     278 ±11%   -2.68%  (p=0.700 n=3+3)
name           old max_goroutines           new max_goroutines           delta
AgentAll-1024                     410 ±28%                     561 ± 4%  +36.86%  (p=0.100 n=3+3)
name           old max_heap_alloc           new max_heap_alloc           delta
AgentAll-1024                   1.19G ± 5%                   2.07G ± 5%  +73.42%  (p=0.100 n=3+3)
name           old max_heap_objects         new max_heap_objects         delta
AgentAll-1024                   10.5M ± 8%                   14.0M ± 5%  +33.12%  (p=0.100 n=3+3)
name           old max_rss                  new max_rss                  delta
AgentAll-1024                   1.32G ± 6%                   2.31G ± 4%  +74.57%  (p=0.100 n=3+3)
name           old mean_available_indexers  new mean_available_indexers  delta
AgentAll-1024                    39.2 ±11%                     1.4 ±10%  -96.43%  (p=0.100 n=3+3)
name           old alloc/op                 new alloc/op                 delta
AgentAll-1024                   580MB ± 1%                   573MB ± 0%   -1.20%  (p=0.100 n=3+3)
name           old allocs/op                new allocs/op                delta
AgentAll-1024                   8.09M ± 0%                   8.01M ± 0%   -0.90%  (p=0.100 n=3+3)

15g

$ benchstat -alpha 1 main/15g-30s-30n.txt pr/15g-30s-30n.txt
name           old time/op                  new time/op                  delta
AgentAll-1920                   371ms ± 0%                   323ms ± 0%   -12.82%  (p=0.100 n=3+3)
name           old events/sec               new events/sec               delta
AgentAll-1920                   48.6k ± 0%                   55.7k ± 0%   +14.59%  (p=0.100 n=3+3)
name           old gc_cycles                new gc_cycles                delta
AgentAll-1920                     405 ± 1%                     148 ± 9%   -63.34%  (p=0.100 n=3+3)
name           old max_goroutines           new max_goroutines           delta
AgentAll-1920                     548 ± 3%                     607 ± 5%   +10.83%  (p=0.100 n=3+3)
name           old max_heap_alloc           new max_heap_alloc           delta
AgentAll-1920                   1.21G ± 1%                   3.84G ± 4%  +216.61%  (p=0.100 n=3+3)
name           old max_heap_objects         new max_heap_objects         delta
AgentAll-1920                   10.8M ± 2%                   26.9M ± 3%  +149.36%  (p=0.100 n=3+3)
name           old max_rss                  new max_rss                  delta
AgentAll-1920                   1.36G ± 0%                   4.29G ± 1%  +214.90%  (p=0.100 n=3+3)
name           old mean_available_indexers  new mean_available_indexers  delta
AgentAll-1920                    31.2 ± 2%                     7.6 ± 4%   -75.80%  (p=0.100 n=3+3)
name           old alloc/op                 new alloc/op                 delta
AgentAll-1920                   567MB ± 0%                   568MB ± 0%    +0.23%  (p=0.100 n=3+3)
name           old allocs/op                new allocs/op                delta
AgentAll-1920                   7.98M ± 0%                   7.99M ± 0%    +0.15%  (p=0.700 n=3+3)

30g

$ benchstat -alpha 1 main/30g-30s-30n.txt pr/30g-30s-30n-2048d.txt
name           old time/op                  new time/op                  delta
AgentAll-3840                   181ms ± 1%                   173ms ± 3%    -4.27%  (p=0.008 n=5+5)
name           old events/sec               new events/sec               delta
AgentAll-3840                   99.3k ± 0%                  103.6k ± 3%    +4.34%  (p=0.008 n=5+5)
name           old gc_cycles                new gc_cycles                delta
AgentAll-3840                     725 ± 0%                     313 ±28%   -56.76%  (p=0.004 n=5+6)
name           old max_goroutines           new max_goroutines           delta
AgentAll-3840                   1.02k ± 1%                   1.05k ±10%    +2.75%  (p=0.667 n=5+5)
name           old max_heap_alloc           new max_heap_alloc           delta
AgentAll-3840                   1.50G ± 2%                   4.09G ±25%  +171.76%  (p=0.004 n=5+6)
name           old max_heap_objects         new max_heap_objects         delta
AgentAll-3840                   13.6M ± 2%                   30.6M ±27%  +124.10%  (p=0.004 n=5+6)
name           old max_rss                  new max_rss                  delta
AgentAll-3840                   1.73G ± 1%                   4.64G ±24%  +168.33%  (p=0.004 n=5+6)
name           old mean_available_indexers  new mean_available_indexers  delta
AgentAll-3840                    15.9 ± 1%                    15.3 ±38%    -3.65%  (p=0.758 n=5+6)
name           old alloc/op                 new alloc/op                 delta
AgentAll-3840                   558MB ± 0%                   561MB ± 1%    +0.46%  (p=0.121 n=5+6)
name           old allocs/op                new allocs/op                delta
AgentAll-3840                   7.89M ± 0%                   7.88M ± 1%    -0.14%  (p=0.909 n=5+6)

@marclop marclop requested a review from axw November 3, 2022 15:52
Member

@axw axw left a comment


The magic numbers (in modelIndexerConfig in particular) make me a bit uncomfortable, but this is a clear improvement so I'm good with merging it and iterating as needed.

Just a few more minor things, otherwise looks good.

Signed-off-by: Marc Lopez Rubio <marc5.12@outlook.com>
@marclop marclop requested a review from axw November 4, 2022 08:19
Member

@axw axw left a comment


LGTM, thanks! Could you please open an issue to follow up on modelIndexerConfig and maxConcurrentDecoders, to take the max event size and flush size into account?


-func newTracerServer(listener net.Listener, logger *logp.Logger, batchProcessor model.BatchProcessor) (*http.Server, error) {
-	cfg := config.DefaultConfig()
+func newTracerServer(cfg *config.Config, listener net.Listener, logger *logp.Logger, batchProcessor model.BatchProcessor) (*http.Server, error) {
Member


This was a bit surprising to me, since the modelindexer is passed in. I suppose this is needed for the MaxConcurrentDecoders though, is that right?

This seems like a bit of a problem: it means when self-instrumentation is enabled, there could be 2x the concurrent decoders (1x from external clients, 1x from the server itself). I think this is another reason why the modelindexer should control the rate of input.

(No action required at the moment.)

Contributor Author

Absolutely right. I thought about using a smaller arbitrary value instead, but like you say I think we can address the problem by encapsulating the semaphore in a different manner and have the model indexer act as the controller.

@marclop marclop merged commit 02740f7 into elastic:main Nov 4, 2022
@marclop marclop deleted the f/dynamic-semaphore-mi-buffer-based-on-memory branch November 4, 2022 10:19
@axw
Copy link
Copy Markdown
Member

axw commented Nov 24, 2022

I think we can test this as part of #9181 and #9341

@axw axw removed the test-plan label Nov 24, 2022
@kruskall kruskall assigned kruskall and unassigned kruskall Dec 6, 2022
Labels

backport-skip Skip notification from the automated backport with mergify enhancement v8.6.0


Development

Successfully merging this pull request may close these issues.

  • 1GB APM Server may still run out of memory
  • Ensure APM Server makes full usage of available CPU resources
