Snapshots to S3 Repository in 5.x cause a much higher object churn/GC than previous versions #26969
Description
Elasticsearch version: OK on 2.3.3, not OK on 5.3.3
Plugins: S3, EC2
JVM: 1.8.0_102
OS Version: Linux ip-10-176-42-66 4.9.32-15.41.amzn1.x86_64 #1 SMP Thu Jun 22 06:20:54 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
heap size [4.6gb], compressed ordinary object pointers [true]
We run a number of Elasticsearch nodes in multiple clusters on EC2 instances. We were previously on version 2.3.3 and recently upgraded to 5.3.3.
We take snapshots every hour to back up data to S3 regularly.
Since upgrading from 2.3.3 to 5.3.3 we see a regular pattern of high GC, both in our monitoring (Dynatrace) and the newly introduced gc-logs from Elasticsearch itself.
It seems the Snapshot operation changed in 5.x and now causes quite a large spike in GC activity on the Elasticsearch server.
We analyzed this with Java Flight Recorder, which shows that almost all of the memory churn is caused by the constructor S3OutputStream(), which appears to allocate a huge byte[]. 957 byte[] allocations account for 94GB of allocated memory, i.e. roughly 100MB per invocation of S3OutputStream(). This matches the debug output from S3Repository:
[2017-10-11T13:46:28,431][DEBUG][o.e.r.s.S3Repository ] [i-052912327e74e66e0] using bucket [ruxit-elasticsearch-testbackups], chunk_size [1gb], server_side_encryption [false], buffer_size [100mb], cannedACL [], storageClass []
So it seems the buffer is re-allocated over and over, which causes quite high object churn and GC spikes. I am not sure whether re-creating this buffer continuously is intentional, so either the re-allocation is wrong, or the buffer-size calculation (see below) is a poor fit for instances with a 4GB heap. (We also see this on 8GB instances, by the way.)
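As a sanity check on the JFR numbers (a standalone re-computation, not the actual plugin code), 957 allocations of a 100MB buffer do indeed add up to roughly the reported 94GB:

```java
public class BufferChurn {
    public static void main(String[] args) {
        long bufferBytes = 100L * 1024 * 1024;   // buffer_size [100mb] from the S3Repository debug log
        long allocations = 957;                  // byte[] allocation count reported by JFR
        double totalGib = (allocations * bufferBytes) / (1024.0 * 1024 * 1024);
        // ~93.5 GiB, matching the ~94GB figure from the flight recording
        System.out.printf("total allocated: %.1f GiB%n", totalGib);
    }
}
```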
This is what JFR reports (allocation screenshot attached):
Steps to reproduce:
- Use a normal EC2 instance, e.g. m4.large
- Set up Elasticsearch 5.3.3 using the default setup and add S3 plugins
- Add some index-data, e.g. in our case 8GB in 12 indices with 3 shards, 0 replicas each
- Set up an S3 snapshot repository
- Trigger a snapshot to S3
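For the last two steps, the repository registration and snapshot trigger look roughly like this (bucket name is a placeholder; other repository settings left at their defaults):

```
PUT _snapshot/s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket"
  }
}

PUT _snapshot/s3_repository/snapshot_1?wait_for_completion=true
```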
=> We see GC log entries stating that more than 500ms were spent on GC within a single second:
Relevant logs:
[2017-10-10T11:10:26,254][INFO ][o.e.m.j.JvmGcMonitorService] [i-052912327e74e66e0] [gc][82970] overhead, spent [258ms] collecting in the last [1s]
[2017-10-10T11:11:02,393][INFO ][o.e.m.j.JvmGcMonitorService] [i-052912327e74e66e0] [gc][83006] overhead, spent [281ms] collecting in the last [1s]
[2017-10-10T11:12:08,671][INFO ][o.e.m.j.JvmGcMonitorService] [i-052912327e74e66e0] [gc][83072] overhead, spent [255ms] collecting in the last [1s]
[2017-10-10T11:13:15,227][INFO ][o.e.s.SnapshotShardsService] [i-052912327e74e66e0] snapshot [s3_repository:snapshot_dost-test-apm-108329_1507633631823/mWeSRcliTSKvN8NYDzApGA] is done
Additional info:
The buffer size is computed as follows; for a 4.6GB heap this works out to 100MB:
Math.max(
    ByteSizeUnit.MB.toBytes(5),       // lower bound: 5MB
    Math.min(
        ByteSizeUnit.MB.toBytes(100), // upper bound: 100MB
        JvmInfo.jvmInfo().getMem().getHeapMax().getBytes() / 20))
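Plugging the reported heap size into that clamp confirms the 100MB value from the debug log (a standalone re-computation using plain longs instead of the actual JvmInfo/ByteSizeUnit calls):

```java
public class BufferSizeCalc {
    static long mb(long n) { return n * 1024 * 1024; }

    // Same computation as above: heap/20, clamped to the range [5MB, 100MB]
    static long bufferSize(long heapMaxBytes) {
        return Math.max(mb(5), Math.min(mb(100), heapMaxBytes / 20));
    }

    public static void main(String[] args) {
        long heap = (long) (4.6 * 1024 * 1024 * 1024);   // heap size [4.6gb] from the node info
        // heap/20 ≈ 235MB, which is capped at the 100MB upper bound
        System.out.println(bufferSize(heap) / mb(1) + " MB");   // prints "100 MB"
    }
}
```

So for any heap of roughly 2GB or more, the formula saturates at the 100MB cap, which is why both our 4GB and 8GB instances end up with 100MB buffers.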

