Remove Specific Atomic Write Method from BlobContainer #43329
original-brownbear wants to merge 2 commits into elastic:master from original-brownbear:remove-non-atomic-write
Conversation
original-brownbear
commented
Jun 18, 2019
- Extracted from Remove BlobContainer APIs that Are not Universally Applicable #42886
- Having both `write` and `writeAtomic` makes no sense. On cloud providers all writes are atomic anyway, and on FS and HDFS the added cost of a move and a directory fsync operation is hardly relevant.
- On the other hand, allowing partial writes (though extremely unlikely in practice) simply has the potential of creating more garbage blobs that aren't even deserializable. This change neatly ensures that every blob with a non-temp name has been fully written and fsynced (in the case of NFS + HDFS).
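To make the mechanism under discussion concrete, here is a minimal sketch of the temp-blob-then-rename pattern the description refers to. This is not the Elasticsearch implementation; the class and method names are hypothetical, and it only illustrates the sequence: write to a `temp-` name, fsync the data, atomically move it to the final name, then best-effort fsync the parent directory so readers never observe a partially written blob.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Illustrative sketch only, not the actual BlobContainer code. */
public final class AtomicBlobWrite {

    public static void writeAtomic(Path dir, String blobName, byte[] data) throws IOException {
        final Path tempBlob = dir.resolve("temp-" + blobName);
        try (FileChannel channel = FileChannel.open(tempBlob,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(data));
            channel.force(true); // fsync the blob contents before the rename
        }
        // The atomic rename makes the fully written blob visible under its final name.
        Files.move(tempBlob, dir.resolve(blobName), StandardCopyOption.ATOMIC_MOVE);
        // On POSIX filesystems the rename itself must also be made durable by
        // fsyncing the parent directory; this is not supported on every platform.
        try (FileChannel dirChannel = FileChannel.open(dir, StandardOpenOption.READ)) {
            dirChannel.force(true);
        } catch (IOException e) {
            // best effort: directory fsync may fail e.g. on Windows
        }
    }
}
```

A crash before the `Files.move` leaves only a `temp-` blob behind, which is exactly the "leaking temp- blobs instead of non-temp named blobs" trade-off discussed below.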
Pinging @elastic/es-distributed
I think this now brings the risk of creating more garbage files at the top level, which are not cleaned up.
I don't think that's true. We are currently not cleaning up top-level files anyway; on broken snapshots or deletes we still leak top-level snap-{}.dat blobs etc. All this changes is that we'd be leaking temp- blobs instead of blobs with non-temp names. If anything, I'd say that makes it easier to run cleanup :) The number of blobs definitely does not go up in my opinion. PS: I will open a PR to clean up the top-level situation very shortly anyway :)
You're right.
On HDFS the cost will be higher. Each metadata request can take a few hundred milliseconds to be processed by a NameNode, so this might make snapshotting on HDFS slower. Another problem is that the write-once semantics of blobs will be harder to guarantee. The move operation on FS as well as HDFS does not guarantee the same semantics as CREATE_NEW. The Javadocs for Files.move say that […] This is also why we're doing a best-effort check in […] I'm a bit torn on this. On one hand I think removing the extra method is good. On the other hand, it gives us worse guarantees and slower backups on HDFS.
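The semantics gap mentioned above can be illustrated with a small sketch (hypothetical class and method names, not code from this PR): opening a file with CREATE_NEW gives write-once behaviour by failing when the target already exists, whereas a rename-based write on a POSIX filesystem atomically replaces an existing target, so the write-once guarantee has to be checked separately and only on a best-effort basis.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Illustrative contrast of CREATE_NEW vs rename-based writes. */
public final class WriteOnceSemantics {

    // Write-once: throws FileAlreadyExistsException if the blob already exists.
    public static void writeOnce(Path target, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(target,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data));
        }
    }

    // Rename-based write: on POSIX filesystems an atomic move replaces an
    // existing target, so a pre-existing blob is silently overwritten
    // rather than rejected.
    public static void writeViaMove(Path temp, Path target, byte[] data) throws IOException {
        Files.write(temp, data);
        Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Running both against the same target shows the difference: the second `writeOnce` call fails, while `writeViaMove` quietly replaces the existing blob.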
Yeah, that's true, but in the end how relevant is this for overall performance in relative terms? We're at most doubling the number of metadata requests (not even that, because we already use atomic writes in a bunch of spots) while keeping everything else constant. Especially if we also get #42791 in (parallelizing snapshots at the file rather than the shard level), I think the effect will not be that relevant from the user's perspective. But even without it, I'd be surprised if metadata requests were the bulk of the snapshot cost at the moment, and if they aren't, the effect isn't that big.
See my other PR on this topic: #42886. I don't think we should work with write-once semantics any more, due to the difficulties they introduce when it comes to S3 (and some NFS implementations).
Closing this, as it will become irrelevant as a result of #46250 and follow-ups (since we'll use the cluster state for consistency of the repository contents), and we can simply remove the atomic-write methods once they become unused.