Goal
The test results here will be the subject of an upcoming AWS blog post. However, I need a place to publish all of the raw data, which is quite long. A GitHub issue on moby seems like one obvious choice.
Post: https://aws.amazon.com/blogs/containers/preventing-log-loss-with-non-blocking-mode-in-the-awslogs-container-log-driver/
Users can use this data to decide whether they want to switch to non-blocking mode and which max-buffer-size to configure.
Background
blocking (default): When logs cannot be immediately sent to Amazon CloudWatch, calls from container code to write to stdout/stderr will block. The logging thread in the application will halt, which may prevent the application from functioning and lead to health check failures and task termination. Container startup will fail if the required log group or log stream cannot be created.
non-blocking: When logs cannot be sent to Amazon CloudWatch, they are stored in an in-memory buffer whose size is configured with the max-buffer-size option. When the buffer fills up, logs are lost. Calls from container code to write to stdout/stderr will not block and will return immediately. With Amazon ECS on Amazon EC2, container startup will not fail if the required log group or log stream cannot be created. With Amazon ECS on AWS Fargate, container startup always fails if the log group or log stream cannot be created, regardless of the mode configured.
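For illustration, both settings are ordinary log driver options. A minimal ECS task definition fragment enabling non-blocking mode with a larger buffer might look like the following (the log group, region, and stream prefix values are placeholders):

```json
{
  "logConfiguration": {
    "logDriver": "awslogs",
    "options": {
      "awslogs-group": "/ecs/my-app",
      "awslogs-region": "us-west-2",
      "awslogs-stream-prefix": "app",
      "mode": "non-blocking",
      "max-buffer-size": "25m"
    }
  }
}
```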
The default blocking mode can compromise application availability. When logs cannot be uploaded to CloudWatch, back-pressure in the log driver can freeze the application. Please read the AWS container blog post Choosing container logging options to avoid backpressure for more information and for a guide to testing your application under a logging back-pressure scenario.
The concern with non-blocking mode is the risk of log loss if/when the buffer fills up. If the CloudWatch API is unavailable, log loss is expected; however, in the happy case where CloudWatch PutLogEvents calls succeed, is non-blocking mode safe for applications that log at a high rate? Is the default 1 MB max-buffer-size sufficient to prevent log loss, or would most users need to increase it? We performed tests to quantify this risk.
What did the tests find?
Please be aware that the results discussed in this issue do not represent a guarantee of performance. We are simply sharing the results of tests that we ran.
Here are the key findings:
A max-buffer-size of 4+ MB showed no log loss for container log output rates of 2 MB/s or less.
A max-buffer-size of 25+ MB showed no log loss for container log output rates of 5 MB/s or less.
Above 6 MB/s, the performance of the AWSLogs driver is less predictable and consistent. For example, there was an outlier test failure with a 100 MB buffer at 7 MB/s.
The max-buffer-size setting limits the total byte size of the messages held in a Go slice. It does not directly cap memory usage, because Go is a garbage-collected language. One test suite observed that the actual queue size is usually fairly small, generally less than 500 KB on average; memory usage climbs toward the limit only occasionally, during periods of higher latency or increased log throughput.
The AWSLogs driver can upload consistently at a much higher rate when sending logs to the CloudWatch API in the same region as the test task, due to the lower-latency connection to CloudWatch. Cross-region log upload is less reliable; moreover, it violates the best practice of region isolation and incurs higher network transfer costs.
For cross-region log push, a max-buffer-size of 50 MB is recommended to ensure a low risk of log loss.
The results for Amazon ECS on the EC2 launch type are similar to those for the Fargate launch type.
There is no way to directly detect when log loss is occurring due to the non-blocking mode buffer; no log statement or metric is emitted by the Docker daemon when loss occurs. Please see the proposal on GitHub: Proposal: Metrics for log driver log loss, logs sent, throughput, in Docker Stats or prometheus metrics #45953
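Conceptually, the non-blocking buffer described in these findings behaves like a byte-bounded queue that drops new messages when full. The sketch below is an illustrative model in Python, not the actual driver code (which is Go inside the Docker daemon):

```python
from collections import deque

class ByteBoundedBuffer:
    """Queue bounded by total message bytes; put() never blocks the caller."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.dropped = 0          # messages lost because the buffer was full
        self.queue = deque()

    def put(self, msg: bytes) -> bool:
        """Accept msg if it fits, otherwise drop it and return False."""
        if self.used + len(msg) > self.max_bytes:
            self.dropped += 1     # log message is lost, silently
            return False
        self.queue.append(msg)
        self.used += len(msg)
        return True

    def get(self) -> bytes:
        """Remove and return the oldest message (the uploader side)."""
        msg = self.queue.popleft()
        self.used -= len(msg)
        return msg

buf = ByteBoundedBuffer(max_bytes=10)
buf.put(b"12345")   # accepted, 5 of 10 bytes used
buf.put(b"123456")  # would exceed the limit, so it is dropped
```

The key property the model captures is that the writer is never blocked: when the uploader falls behind, the cost is paid in dropped messages rather than application latency.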
How can I find my log throughput?
Use the IncomingBytes metric in CloudWatch Metrics to track the ingestion rate to your log group(s). Assuming that all containers send at roughly the same rate, divide the log group ingestion rate by the number of containers to get the rate for each individual container. It is recommended to over-estimate the log throughput from each container, since log output may spike occasionally, especially during incidents. If possible, calculate your throughput during a load test or a recent incident, and use the peak log output rate over a time interval of a minute or less to account for bursts in throughput.
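As a sketch of that arithmetic (the function name and numbers are illustrative, not from the test code; 1 KB is taken as 1000 bytes):

```python
def peak_container_rate_kb_s(incoming_bytes_per_minute, container_count):
    """Estimate the peak per-container log rate in KB/s.

    incoming_bytes_per_minute: IncomingBytes sums, one per 1-minute period.
    container_count: number of containers sharing the log group.
    Assumes all containers log at roughly equal rates.
    """
    if not incoming_bytes_per_minute or container_count <= 0:
        raise ValueError("need at least one sample and a positive container count")
    peak_bytes = max(incoming_bytes_per_minute)   # worst one-minute burst
    per_container = peak_bytes / container_count  # split evenly across containers
    return per_container / 60 / 1000              # bytes/minute -> KB/s

# Example: the worst minute ingested 1.2 GB across 10 containers.
rate = peak_container_rate_kb_s([300e6, 1.2e9, 450e6], 10)  # 2000.0 KB/s
```

Using the peak one-minute sum rather than an average is deliberate: the buffer has to survive the burst, not the steady state.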
How were tests run?
The code used for benchmarking can be found on GitHub. EC2 tests were performed on Docker v20.10.25; Fargate tests were performed on platform version 1.4.
Each log loss test run was an Amazon ECS task that sends 1 GB of log data to CloudWatch Logs with the AWSLogs driver, then queries CloudWatch Logs to get back all log events and checks how many were received. Each log message carries a unique ID that is a predictable sequence number.
Several thousand test runs were executed to get statistically meaningful data on log loss.
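The loss check described above can be sketched as follows. This is an illustrative reconstruction, not the actual benchmark code:

```python
def count_lost_messages(sent_count, received_ids):
    """Messages carry sequence IDs 0..sent_count-1; report how many
    never arrived and the resulting loss percentage."""
    missing = set(range(sent_count)) - set(received_ids)
    loss_pct = 100.0 * len(missing) / sent_count
    return len(missing), loss_pct

# Example: 10 messages sent, IDs 4 and 8 never came back from CloudWatch.
lost, pct = count_lost_messages(10, [0, 1, 2, 3, 5, 6, 7, 9])  # (2, 20.0)
```

Because the IDs form a known sequence, loss can be measured exactly rather than inferred from byte counts.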
Full Test Results
Understanding the Data Tables
For each set of dimensions (the test run data aggregated across some set of values, such as log message size), there are three data tables.
In all tables, the horizontal axis is increasing values of max-buffer-size and the vertical axis is increasing values of log output rate from the container.
Summary Data Tables
This table type shows how many test runs were successful out of the total set of test runs for the given throughput and max-buffer-size. Successful means that no logs were lost during the entire test run. This is the strictest way of viewing the data, and also the most relevant: customers expect zero log loss.
The table uses emoji symbols to make it easier to view the data at a glance. They mean:
✅ => no logs were lost in any of the test runs; all were successful.
❕ => grey exclamation mark: exactly one test run showed log loss, and at least 90% of runs were successful, so the case was almost completely successful. (The 90% guard matters because some combinations of throughput and buffer size had only a few test runs; 2/3 successful runs, for example, is a 33% failure rate and should not be given a "green" positive emoji.)
❗️ => red exclamation mark: at least 95% of runs, but not all, were successful.
❌ => red "X" mark: at least 90% but fewer than 95% of runs were successful.
🚨 => very non-trivial loss: fewer than 90% of runs were successful (more than 10% of runs had log loss).
The number before the emoji is {successful runs}/{total runs}. If there was at least one failed test run, the number after the emoji is the percent of logs lost in the worst failed test run.
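The classification rules above can be expressed as a small sketch (illustrative, not part of the test tooling):

```python
def classify_cell(successful, total):
    """Map a {successful}/{total} summary-table cell to its emoji."""
    failed = total - successful
    pct = 100.0 * successful / total
    if failed == 0:
        return "✅"                 # no run lost any logs
    if failed == 1 and pct >= 90.0:
        return "❕"                 # a single failed run, with the 90% guard
    if pct >= 95.0:
        return "❗️"                 # 95%+ of runs succeeded
    if pct >= 90.0:
        return "❌"                 # 90-95% of runs succeeded
    return "🚨"                     # more than 10% of runs lost logs

classify_cell(97, 100)  # three failed runs out of 100
```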
Average Log Loss Data Tables
This table type shows the average % of logs lost across all test runs for the same throughput and max-buffer-size. Because every run sends the same amount of data, this is also the % of logs that would have been lost if all the test runs were one single, longer test.
The table uses emoji symbols to make it easier to view the data at a glance. They mean:
✅ => no logs were lost in any of the test runs; all were successful.
❕ => grey exclamation mark: average log loss was 0.5% or less (99.5% or more of logs were sent successfully).
❗️ => red exclamation mark: average log loss was between 0.5% and 1% (99% or more of logs were sent successfully).
❌ => red "X" mark: average log loss was between 1% and 5% (95% or more of logs were sent successfully).
🚨 => very non-trivial loss: average log loss was greater than 5%.
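Because every test run sends the same 1 GB, averaging the per-run loss percentages is equivalent to pooling all runs into one long test. A quick sketch illustrating that equivalence (the numbers are made up for illustration):

```python
per_run_lost = [0, 50, 10, 0]   # messages lost in each of four runs
per_run_sent = 1000             # every run sends the same number of messages

# Average of the per-run loss percentages:
avg_pct = sum(100.0 * lost / per_run_sent for lost in per_run_lost) / len(per_run_lost)

# Loss percentage if all runs were one long test:
pooled_pct = 100.0 * sum(per_run_lost) / (per_run_sent * len(per_run_lost))

assert avg_pct == pooled_pct    # holds exactly because all run sizes are equal
```

With unequal run sizes the two numbers would diverge; equal-sized runs are what make the table's claim valid.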
Standard Deviation Log Loss Data Tables
This table shows the standard deviation of the % log loss across test runs, i.e. how variable and unpredictable the AWSLogs driver is under different throughputs and buffer sizes in non-blocking mode.
The table uses emoji symbols to make it easier to view the data at a glance. They mean:
✅ => standard deviation in log loss was less than 1%.
❕ => grey exclamation mark: standard deviation in log loss was between 1% and 2%.
❗️ => red exclamation mark: standard deviation in log loss was between 2% and 5%.
❌ => red "X" mark: standard deviation in log loss was between 5% and 10%.
🚨 => very non-trivial variation between test runs: standard deviation of 10% or more.
Log loss test results: in-region
These are the test results for tasks sending to CloudWatch in the same region as the tasks.
Summary of All Test Runs
These are the results aggregated only by max-buffer-size and log throughput; they are not broken down by any additional dimensions such as log message size or launch type.
Summary Data Table
| Rate \ Buffer size | 1 MB | 2 MB | 4 MB | 6 MB | 8 MB | 10 MB | 12 MB | 15 MB | 20 MB | 25 MB | 30 MB | 40 MB | 50 MB | 60 MB | 75 MB | 100 MB | 150 MB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 KB/s | 20/20 ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 250 KB/s | 20/20 ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 300 KB/s | 20/20 ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 400 KB/s | 20/20 ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 500 KB/s | 68/68 ✅ | 63/63 ✅ | 40/40 ✅ | 25/25 ✅ | 27/27 ✅ | -- | 22/22 ✅ | 15/15 ✅ | 38/38 ✅ | 7/7 ✅ | 45/45 ✅ | 39/39 ✅ | 39/39 ✅ | -- | -- | 50/50 ✅ | -- |
| 600 KB/s | 37/37 ✅ | 56/56 ✅ | 39/39 ✅ | 28/28 ✅ | 13/13 ✅ | -- | 16/16 ✅ | 5/5 ✅ | 13/13 ✅ | -- | -- | -- | -- | -- | -- | -- | -- |
| 700 KB/s | 39/39 ✅ | 50/50 ✅ | 39/39 ✅ | 28/28 ✅ | 5/5 ✅ | -- | 14/14 ✅ | 10/10 ✅ | 7/7 ✅ | -- | -- | -- | -- | -- | -- | -- | -- |
| 750 KB/s | 71/71 ✅ | 39/39 ✅ | 40/40 ✅ | 18/18 ✅ | 5/5 ✅ | -- | 34/34 ✅ | 11/11 ✅ | 35/35 ✅ | 8/8 ✅ | 51/51 ✅ | 37/37 ✅ | 42/42 ✅ | -- | -- | 50/50 ✅ | -- |
| 800 KB/s | 78/78 ✅ | 40/40 ✅ | 39/39 ✅ | 33/33 ✅ | 7/7 ✅ | -- | 20/20 ✅ | 8/8 ✅ | 7/7 ✅ | -- | -- | -- | -- | -- | -- | -- | -- |
| 900 KB/s | 74/74 ✅ | 56/56 ✅ | 39/39 ✅ | 28/28 ✅ | 14/14 ✅ | -- | 19/19 ✅ | 11/11 ✅ | -- | -- | 42/42 ✅ | 41/41 ✅ | 41/41 ✅ | -- | -- | 47/47 ✅ | -- |
| 1000 KB/s | 143/143 ✅ | 120/120 ✅ | 84/84 ✅ | 92/92 ✅ | 63/63 ✅ | 40/40 ✅ | 124/124 ✅ | 35/35 ✅ | 24/24 ✅ | 18/18 ✅ | 66/66 ✅ | 41/41 ✅ | 44/44 ✅ | -- | -- | 43/43 ✅ | -- |
| 1100 KB/s | 19/19 ✅ | 50/50 ✅ | 22/22 ✅ | 31/31 ✅ | 6/6 ✅ | -- | 18/18 ✅ | 11/11 ✅ | -- | 7/7 ✅ | 50/50 ✅ | 42/42 ✅ | 48/48 ✅ | -- | -- | 34/34 ✅ | -- |
| 1200 KB/s | 20/20 ✅ | 28/28 ✅ | 21/21 ✅ | 29/29 ✅ | 9/9 ✅ | -- | 28/28 ✅ | 26/26 ✅ | 20/20 ✅ | 8/8 ✅ | 49/49 ✅ | 37/37 ✅ | 45/45 ✅ | -- | -- | 34/34 ✅ | -- |
| 1250 KB/s | -- | 24/24 ✅ | 23/23 ✅ | 30/30 ✅ | 7/7 ✅ | -- | 7/7 ✅ | 14/14 ✅ | -- | 8/8 ✅ | 10/10 ✅ | -- | -- | -- | -- | -- | -- |
| 1500 KB/s | -- | 38/38 ✅ | 31/31 ✅ | 23/23 ✅ | 12/12 ✅ | -- | 22/22 ✅ | 30/30 ✅ | 20/20 ✅ | 2/2 ✅ | 51/51 ✅ | 30/30 ✅ | 46/46 ✅ | -- | -- | 31/31 ✅ | -- |
| 1750 KB/s | -- | 36/36 ✅ | 28/28 ✅ | 21/21 ✅ | 14/14 ✅ | -- | 9/9 ✅ | 11/11 ✅ | -- | 4/4 ✅ | 46/46 ✅ | 30/30 ✅ | 41/41 ✅ | -- | -- | 42/42 ✅ | -- |
| 2000 KB/s | 34/109 🚨 27.34% | 108/108 ✅ | 92/92 ✅ | 94/94 ✅ | 72/72 ✅ | 80/80 ✅ | 110/110 ✅ | 50/50 ✅ | 41/41 ✅ | 34/34 ✅ | 82/82 ✅ | 55/55 ✅ | 58/58 ✅ | -- | -- | 42/42 ✅ | -- |
| 3000 KB/s | 20/71 🚨 79.20% | 28/69 🚨 87.50% | 57/59 ❗️ 54.10% | 72/72 ✅ | 49/49 ✅ | 80/80 ✅ | 102/102 ✅ | 25/25 ✅ | 30/30 ✅ | 32/32 ✅ | 38/38 ✅ | 40/40 ✅ | 40/40 ✅ | -- | -- | -- | -- |
| 4000 KB/s | 11/54 🚨 89.01% | 35/75 🚨 81.62% | 50/50 ✅ | 65/65 ✅ | 53/53 ✅ | 80/80 ✅ | 91/91 ✅ | 28/28 ✅ | 27/27 ✅ | 33/33 ✅ | 39/39 ✅ | 39/39 ✅ | 20/20 ✅ | -- | -- | -- | -- |
| 4500 KB/s | 3/51 🚨 56.90% | 19/55 🚨 76.82% | 23/63 🚨 97.90% | 73/73 ✅ | 49/49 ✅ | 40/40 ✅ | 44/44 ✅ | -- | 19/20 ❕ 29.38% | 20/20 ✅ | 20/20 ✅ | 20/20 ✅ | 20/20 ✅ | -- | -- | -- | -- |
| 5000 KB/s | 8/51 🚨 95.30% | 26/59 🚨 66.80% | 21/62 🚨 82.90% | 66/66 ✅ | 49/49 ✅ | 75/75 ✅ | 85/85 ✅ | 27/27 ✅ | 33/33 ✅ | 32/32 ✅ | 60/60 ✅ | 60/60 ✅ | 40/40 ✅ | -- | 20/20 ✅ | 20/20 ✅ | 20/20 ✅ |
| 6000 KB/s | 14/52 🚨 75.90% | 17/57 🚨 74.20% | 24/65 🚨 94.71% | 59/60 ❕ 9.80% | 43/43 ✅ | 60/60 ✅ | 73/73 ✅ | 13/13 ✅ | 25/25 ✅ | 29/29 ✅ | 59/60 ❕ 0.30% | 60/60 ✅ | 39/39 ✅ | -- | 20/20 ✅ | 20/20 ✅ | -- |
| 7000 KB/s | 0/61 🚨 80.60% | 27/70 🚨 55.30% | 18/64 🚨 94.60% | 31/49 🚨 0.45% | 35/36 ❕ 0.47% | 59/59 ✅ | 77/77 ✅ | 5/5 ✅ | 29/29 ✅ | 29/29 ✅ | 58/58 ✅ | 80/80 ✅ | 49/49 ✅ | -- | 32/32 ✅ | 39/40 ❕ 0.20% | -- |
| 8000 KB/s | 2/57 🚨 76.50% | 16/77 🚨 99.80% | 23/63 🚨 42.90% | 29/53 🚨 40.30% | 50/50 ✅ | 20/20 ✅ | 52/53 ❕ 1.76% | -- | 20/20 ✅ | 20/20 ✅ | 40/40 ✅ | 79/79 ✅ | 47/48 ❕ 0.20% | -- | 37/37 ✅ | 39/40 ❕ 3.25% | -- |
| 9000 KB/s | 0/48 🚨 90.60% | 17/67 🚨 83.40% | 21/62 🚨 97.70% | 19/50 🚨 9.80% | 41/45 ❌ 0.07% | 20/20 ✅ | 50/50 ✅ | -- | 20/20 ✅ | 20/20 ✅ | 40/40 ✅ | 66/66 ✅ | 25/25 ✅ | -- | 38/38 ✅ | 40/40 ✅ | -- |
| 10000 KB/s | 0/53 🚨 90.40% | 20/60 🚨 99.60% | 18/71 🚨 77.60% | 19/51 🚨 92.11% | 43/47 ❌ 0.15% | 97/100 ❗️ 88.10% | 151/154 ❗️ 85.90% | 81/81 ✅ | 83/83 ✅ | 82/82 ✅ | 101/101 ✅ | 122/122 ✅ | 51/52 ❕ 0.03% | -- | 38/39 ❕ 0.10% | 38/38 ✅ | -- |
| 12000 KB/s | 1/56 🚨 54.25% | 20/61 🚨 41.98% | 20/63 🚨 25.02% | 6/48 🚨 8.15% | 50/56 🚨 0.98% | 80/80 ✅ | 149/149 ✅ | 83/83 ✅ | 82/82 ✅ | 83/83 ✅ | 101/101 ✅ | 121/121 ✅ | 48/48 ✅ | -- | 32/32 ✅ | 33/33 ✅ | -- |
| 14000 KB/s | -- | -- | -- | -- | -- | 79/80 ❕ 0.03% | 100/100 ✅ | 82/83 ❕ 0.14% | 66/66 ✅ | 66/66 ✅ | 88/88 ✅ | 90/90 ✅ | 46/46 ✅ | -- | 20/20 ✅ | 20/20 ✅ | -- |
| 15000 KB/s | -- | -- | -- | -- | -- | 70/81 🚨 0.27% | 97/100 ❗️ 0.87% | 118/123 ❗️ 66.39% | 103/103 ✅ | 102/102 ✅ | 126/127 ❕ 0.22% | 136/136 ✅ | 93/93 ✅ | 40/40 ✅ | 28/28 ✅ | 39/39 ✅ | -- |
| 16000 KB/s | -- | -- | -- | -- | -- | 40/80 🚨 5.77% | 89/92 ❗️ 1.33% | 103/103 ✅ | 100/101 ❕ 4.35% | 102/103 ❕ 4.23% | 130/131 ❕ 2.57% | 138/138 ✅ | 97/97 ✅ | 39/39 ✅ | 30/30 ✅ | 39/40 ❕ 0.40% | -- |
| 18000 KB/s | -- | -- | -- | -- | -- | 38/80 🚨 15.73% | 23/83 🚨 4.70% | 102/102 ✅ | 101/101 ✅ | 104/104 ✅ | 124/124 ✅ | 136/136 ✅ | 89/89 ✅ | 40/40 ✅ | 27/27 ✅ | 40/40 ✅ | -- |
| 20000 KB/s | -- | -- | -- | -- | -- | 40/80 🚨 23.06% | 23/83 🚨 13.52% | 52/103 🚨 1.04% | 100/102 ❗️ 1.22% | 103/103 ✅ | 131/131 ✅ | 132/132 ✅ | 92/92 ✅ | 39/39 ✅ | 31/31 ✅ | 40/40 ✅ | -- |
| 25000 KB/s | -- | -- | -- | -- | -- | -- | -- | 20/40 🚨 18.93% | 20/40 🚨 13.20% | 19/40 🚨 33.62% | 20/40 🚨 10.18% | 20/40 🚨 11.78% | 20/40 🚨 9.95% | 20/40 🚨 5.53% | -- | -- | -- |
Average Log Loss Data Table
| Rate \ Buffer size | 1 MB | 2 MB | 4 MB | 6 MB | 8 MB | 10 MB | 12 MB | 15 MB | 20 MB | 25 MB | 30 MB | 40 MB | 50 MB | 60 MB | 75 MB | 100 MB | 150 MB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 KB/s | ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 250 KB/s | ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 300 KB/s | ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 400 KB/s | ✅ | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 500 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 600 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ | -- | -- | -- | -- | -- | -- | -- | -- |
| 700 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ | -- | -- | -- | -- | -- | -- | -- | -- |
| 750 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 800 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ | -- | -- | -- | -- | -- | -- | -- | -- |
| 900 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | -- | -- | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 1000 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 1100 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | -- | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 1200 KB/s | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 1250 KB/s | -- | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | -- | ✅ | ✅ | -- | -- | -- | -- | -- | -- |
| 1500 KB/s | -- | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 1750 KB/s | -- | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | -- | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 2000 KB/s | 🚨 12.20% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | ✅ | -- |
| 3000 KB/s | 🚨 16.12% | 🚨 6.99% | ❗️ 0.94% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | -- | -- |
| 4000 KB/s | 🚨 19.71% | 🚨 10.51% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | -- | -- | -- |
| 4500 KB/s | 🚨 24.53% | 🚨 11.90% | 🚨 7.37% | ✅ | ✅ | ✅ | ✅ | -- | ❌ 1.47% | ✅ | ✅ | ✅ | ✅ | -- | -- | -- | -- |
| 5000 KB/s | 🚨 30.18% | 🚨 11.11% | ❌ 4.24% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | ✅ |
| 6000 KB/s | 🚨 25.45% | 🚨 13.79% | 🚨 6.01% | ❕ 0.16% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❕ 0.01% | ✅ | ✅ | -- | ✅ | ✅ | -- |
| 7000 KB/s | 🚨 30.47% | 🚨 11.67% | 🚨 9.91% | ❕ 0.05% | ❕ 0.01% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ❕ 0.01% | -- |
| 8000 KB/s | 🚨 33.72% | 🚨 14.91% | ❌ 1.66% | ❌ 1.15% | ✅ | ✅ | ❕ 0.03% | -- | ✅ | ✅ | ✅ | ✅ | ❕ 0.00% | -- | ✅ | ❕ 0.08% | -- |
| 9000 KB/s | 🚨 38.75% | 🚨 18.64% | ❌ 4.72% | ❕ 0.46% | ❕ 0.01% | ✅ | ✅ | -- | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | -- |
| 10000 KB/s | 🚨 37.92% | 🚨 19.23% | 🚨 7.35% | ❌ 3.89% | ❕ 0.01% | ❌ 1.10% | ❗️ 0.93% | ✅ | ✅ | ✅ | ✅ | ✅ | ❕ 0.00% | -- | ❕ 0.00% | ✅ | -- |
| 12000 KB/s | 🚨 41.10% | 🚨 20.05% | 🚨 9.91% | ❌ 3.89% | ❕ 0.03% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | -- |
| 14000 KB/s | -- | -- | -- | -- | -- | ❕ 0.00% | ✅ | ❕ 0.00% | ✅ | ✅ | ✅ | ✅ | ✅ | -- | ✅ | ✅ | -- |
| 15000 KB/s | -- | -- | -- | -- | -- | ❕ 0.01% | ❕ 0.02% | ❌ 1.62% | ✅ | ✅ | ❕ 0.00% | ✅ | ✅ | ✅ | ✅ | ✅ | -- |
| 16000 KB/s | -- | -- | -- | -- | -- | ❌ 2.01% | ❕ 0.02% | ✅ | ❕ 0.04% | ❕ 0.04% | ❕ 0.02% | ✅ | ✅ | ✅ | ✅ | ❕ 0.01% | -- |
| 18000 KB/s | -- | -- | -- | -- | -- | 🚨 6.49% | ❌ 1.43% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- |
| 20000 KB/s | -- | -- | -- | -- | -- | 🚨 10.57% | 🚨 7.44% | ❕ 0.10% | ❕ 0.02% | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | -- |
| 25000 KB/s | -- | -- | -- | -- | -- | -- | -- | 🚨 8.85% | ❌ 4.76% | 🚨 5.02% | ❌ 3.81% | ❌ 3.24% | ❌ 2.86% | ❌ 1.70% | -- | -- | -- |
Standard Deviation Log Loss Data Table
| Rate \ Buffer size | 1 MB | 2 MB | 4 MB | 6 MB | 8 MB | 10 MB | 12 MB | 15 MB | 20 MB | 25 MB | 30 MB | 40 MB | 50 MB | 60 MB | 75 MB | 100 MB | 150 MB |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 200 KB/s | ✅ 0.00 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 250 KB/s | ✅ 0.00 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 300 KB/s | ✅ 0.00 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 400 KB/s | ✅ 0.00 | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 500 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 600 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | -- | -- | -- | -- | -- | -- |
| 700 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | -- | -- | -- | -- | -- | -- |
| 750 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 800 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | -- | -- | -- | -- | -- | -- |
| 900 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 1000 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 1100 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 1200 KB/s | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 1250 KB/s | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- | -- | -- | -- | -- | -- |
| 1500 KB/s | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 1750 KB/s | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 2000 KB/s | ❌ 9.15 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | ✅ 0.00 | -- |
| 3000 KB/s | 🚨 18.51 | 🚨 14.17 | ❌ 6.98 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | -- | -- |
| 4000 KB/s | 🚨 21.04 | 🚨 16.02 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | -- | -- |
| 4500 KB/s | 🚨 17.61 | 🚨 16.10 | 🚨 22.29 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ❌ 6.40 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | -- | -- | -- |
| 5000 KB/s | 🚨 22.79 | 🚨 14.40 | 🚨 15.81 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 |
| 6000 KB/s | 🚨 20.68 | 🚨 14.28 | 🚨 19.43 | ❕ 1.25 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.04 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- |
| 7000 KB/s | 🚨 17.19 | 🚨 11.42 | 🚨 23.11 | ✅ 0.09 | ✅ 0.08 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.03 | -- |
| 8000 KB/s | 🚨 17.66 | 🚨 12.94 | ❌ 5.31 | ❌ 5.63 | ✅ 0.00 | ✅ 0.00 | ✅ 0.24 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.03 | -- | ✅ 0.00 | ✅ 0.51 | -- |
| 9000 KB/s | 🚨 18.00 | 🚨 14.68 | 🚨 14.59 | ❕ 1.86 | ✅ 0.02 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- |
| 10000 KB/s | 🚨 18.55 | 🚨 17.95 | 🚨 11.88 | 🚨 16.17 | ✅ 0.03 | ❌ 9.02 | ❌ 7.74 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.02 | ✅ 0.00 | -- |
| 12000 KB/s | 🚨 15.44 | 🚨 17.20 | 🚨 10.99 | ❗️ 3.73 | ✅ 0.13 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- |
| 14000 KB/s | -- | -- | -- | -- | -- | ✅ 0.00 | ✅ 0.00 | ✅ 0.02 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- | ✅ 0.00 | ✅ 0.00 | -- |
| 15000 KB/s | -- | -- | -- | -- | -- | ✅ 0.04 | ✅ 0.11 | 🚨 10.24 | ✅ 0.00 | ✅ 0.00 | ✅ 0.02 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- |
| 16000 KB/s | -- | -- | -- | -- | -- | ❗️ 2.51 | ✅ 0.14 | ✅ 0.00 | ✅ 0.43 | ✅ 0.41 | ✅ 0.22 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.06 | -- |
| 18000 KB/s | -- | -- | -- | -- | -- | ❌ 6.71 | ❕ 1.82 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- |
| 20000 KB/s | -- | -- | -- | -- | -- | 🚨 10.65 | ❗️ 4.96 | ✅ 0.17 | ✅ 0.13 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | ✅ 0.00 | -- |
| 25000 KB/s | -- | -- | -- | -- | -- | -- | -- | ❌ 8.86 | ❗️ 4.91 | ❌ 6.34 | ❗️ 3.95 | ❗️ 3.46 | ❗️ 3.13 | ❗️ 2.02 | -- | -- | -- |
Much more data is in the comments below; I ran into the single-comment length limit on GitHub.