I see some unexpected memory consumption during the replication process. It is not reproducible on dev-env, but it shows up in other environments, both hardware and virtual machines.
If a node considers an object to have too few replicas, it reads the object into memory to put it onto other container nodes.
I reduced replicator.pool_size down to 1 and set a more aggressive GC target (GOGC=20). However, after some time I still see that some objects stay in memory longer than I expect.
```
heap profile: 63: 803864784 [36195: 73107425384] @ heap/1048576
6: 402702336 [122: 8188280832] @ 0x4d8e0b 0xc4ca2e 0xc520b5 0xc899b8 0xc89e9c 0xc897ae 0xc9a574 0xca48a8 0xc99e65 0xc99ad6 0xc939ef 0xc99a39 0xc9aa3f 0xd0d4c6 0xe12576 0xe112fa 0xe138bb 0xbcc0f7 0x46b921
# 0x4d8e0a os.ReadFile+0xea os/file.go:693
# 0xc4ca2d github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor/fstree.(*FSTree).Get+0xad github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor/fstree/fstree.go:304
# 0xc520b4 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor.(*BlobStor).Get+0x2f4 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/blobstor/get.go:20
# 0xc899b7 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).Get.func1+0xb7 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard/get.go:73
# 0xc89e9b github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).fetchObjectData+0x41b github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard/get.go:127
# 0xc897ad github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard.(*Shard).Get+0x22d github.com/nspcc-dev/neofs-node/pkg/local_object_storage/shard/get.go:86
# 0xc9a573 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).get.func1+0x113 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:84
# 0xca48a7 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).iterateOverSortedShards+0xc7 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/shards.go:225
# 0xc99e64 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).get+0x324 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:78
# 0xc99ad5 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).Get.func1+0x55 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:48
# 0xc939ee github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).execIfNotBlocked+0xce github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/control.go:147
# 0xc99a38 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.(*StorageEngine).Get+0xb8 github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:47
# 0xc9aa3e github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine.Get+0x9e github.com/nspcc-dev/neofs-node/pkg/local_object_storage/engine/get.go:172
# 0xd0d4c5 github.com/nspcc-dev/neofs-node/pkg/services/replicator.(*Replicator).HandleTask+0x105 github.com/nspcc-dev/neofs-node/pkg/services/replicator/process.go:30
# 0xe12575 github.com/nspcc-dev/neofs-node/pkg/services/policer.(*Policer).processNodes+0xfd5 github.com/nspcc-dev/neofs-node/pkg/services/policer/check.go:241
# 0xe112f9 github.com/nspcc-dev/neofs-node/pkg/services/policer.(*Policer).processObject+0xc19 github.com/nspcc-dev/neofs-node/pkg/services/policer/check.go:127
# 0xe138ba github.com/nspcc-dev/neofs-node/pkg/services/policer.(*Policer).shardPolicyWorker.func1+0x17a github.com/nspcc-dev/neofs-node/pkg/services/policer/process.go:65
# 0xbcc0f6 github.com/panjf2000/ants/v2.(*goWorker).run.func1+0x96 github.com/panjf2000/ants/v2@v2.4.0/worker.go:68
```
In this memory profile there are 6 objects of 64 MiB each (the maximum object size in the network) held in memory. This number fluctuates but tends to grow over time (later in the same run I saw 10 objects). Maybe we can do something about that.
Maybe it is just a GC thing (need to try GOMEMLIMIT with a go1.19 build), maybe something else.
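For reference, a text dump like the one above can be produced with the standard runtime/pprof package, independent of any HTTP pprof endpoint the node may or may not expose; a minimal sketch:

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

// dumpHeap writes the "heap" profile; debug=1 yields the same
// human-readable "heap profile: ..." text format quoted above, with
// resolved stacks for each allocation site.
func dumpHeap(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	runtime.GC() // flush pending frees so in-use numbers are current
	return pprof.Lookup("heap").WriteTo(f, 1)
}

func main() {
	if err := dumpHeap("heap.pprof"); err != nil {
		panic(err)
	}
}
```

The in-use columns (the numbers before the brackets) are what matter here: they show live bytes still retained, as opposed to cumulative allocations in the bracketed columns.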
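On the GC angle: since go1.19 the GOMEMLIMIT soft limit can also be set programmatically and combined with a low GOGC. A hedged sketch of both knobs (the values are illustrative, not a tuning recommendation):

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// tuneGC applies the two knobs discussed above: a soft heap limit
// (the programmatic form of GOMEMLIMIT, go1.19+) and an aggressive
// GC target (the programmatic form of GOGC=20). Both setters return
// the previous value, which is handy for restoring defaults.
func tuneGC(limitBytes int64, gogc int) (prevLimit int64, prevGOGC int) {
	prevLimit = debug.SetMemoryLimit(limitBytes)
	prevGOGC = debug.SetGCPercent(gogc)
	return
}

func main() {
	pl, pg := tuneGC(4<<30, 20) // 4 GiB soft limit, GOGC=20
	fmt.Printf("previous settings: GOMEMLIMIT=%d GOGC=%d\n", pl, pg)
}
```

Unlike GOGC alone, the memory limit makes the collector run harder only as the heap approaches the cap, so it may keep the 64 MiB buffers from accumulating without paying GOGC=20's CPU cost all the time.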
To reduce the number of object reads, the node might want to skip replication when it receives access denied (status 2048) errors (related to #1709). This can happen when an eACL restricts system operations.
Any other ideas are appreciated.
/cc @fyrchik