Remove random writes from SST file ingestion#4172
Remove random writes from SST file ingestion#4172riversand963 wants to merge 16 commits intofacebook:masterfrom
Conversation
RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Test plan: ``` $make clean && make -j16 all check $./external_sst_file_test $./external_sst_file_basic_test ```
facebook-github-bot
left a comment
There was a problem hiding this comment.
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@siying Could you PTAL? Thanks |
siying
left a comment
There was a problem hiding this comment.
Is there a way we write a special flag in ExternalSstFilePropertyNames::kGlobalSeqno to indicate that this is not a valid seqno? Or maybe bump up "rocksdb.external_sst_file.version" and remove "rocksdb.external_sst_file.global_seqno" as a whole.
db/version_edit.h
Outdated
| FileDescriptor(uint64_t number, uint32_t path_id, uint64_t _file_size) | ||
| FileDescriptor(uint64_t number, uint32_t path_id, uint64_t _file_size, | ||
| SequenceNumber _smallest_seqno = kMaxSequenceNumber, | ||
| SequenceNumber _largest_seqno = 0) |
There was a problem hiding this comment.
How hard it is to avoid defaults?
table/block_based_table_reader.cc
Outdated
| } | ||
|
|
||
| SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str()); | ||
| SequenceNumber global_seqno = largest_seqno; |
There was a problem hiding this comment.
Can we still read from DecodeFixed64(seqno_pos->second.c_str()) and make sure it is either the same as largest_seqno or it is not modified?
There was a problem hiding this comment.
by 'not modified', do you mean verify the checksum?
|
@riversand963 has updated the pull request. Re-import the pull request |
|
Closing this PR, moving to #4188 to avoid unrelated file changes. |
|
@riversand963 has updated the pull request. Re-import the pull request |
|
@siying can you PTAL? |
|
@riversand963 has updated the pull request. Re-import the pull request |
include/rocksdb/options.h
Outdated
| // All files will be ingested at the bottommost level with seqno=0. | ||
| bool ingest_behind = false; | ||
| // Set to true if you would like to write global_seqno to a given offset in | ||
| // the external SST file. |
There was a problem hiding this comment.
Mention the backward compatibility scenario that this option is tries to address and recommend to set it false to new services or the compatibility is not needed.
table/block_based_table_reader.cc
Outdated
| } | ||
|
|
||
| SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str()); | ||
| assert(global_seqno == 0 || global_seqno == largest_seqno); |
There was a problem hiding this comment.
Rather than assert, I suggest explicitly fail the open and return corrupted
table/table_builder.h
Outdated
| const InternalKeyComparator& _internal_comparator, | ||
| bool _skip_filters = false, bool _immortal = false, | ||
| int _level = -1) | ||
| int _level = -1, SequenceNumber _largest_seqno = 0) |
There was a problem hiding this comment.
Can we remove the defaults? I don't see a scenario we should do it.
There was a problem hiding this comment.
Do you mean defaults of all the parameters?
There was a problem hiding this comment.
I don't care level, but I don't think it makes sense to have a default for largest_seqno. I don't care whether you reorder to two so you can keep the level default, or remove defaults for both of level and largest_seqno.
There was a problem hiding this comment.
Done. Adding an overloaded constructor. Another option is to provide a setter (not implemented).
table/block_based_table_reader.cc
Outdated
| @@ -697,6 +697,8 @@ SequenceNumber GetGlobalSequenceNumber(const TableProperties& table_properties, | |||
| } | |||
|
|
|||
| SequenceNumber global_seqno = DecodeFixed64(seqno_pos->second.c_str()); | |||
There was a problem hiding this comment.
I'm surprised that the existing code has no verification that seqno_pos is valid.
|
@riversand963 has updated the pull request. Re-import the pull request |
facebook-github-bot
left a comment
There was a problem hiding this comment.
@riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@riversand963 has updated the pull request. Re-import the pull request |
|
@riversand963 has updated the pull request. Re-import the pull request |
facebook-github-bot
left a comment
There was a problem hiding this comment.
riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@riversand963 has updated the pull request. Re-import the pull request |
facebook-github-bot
left a comment
There was a problem hiding this comment.
riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@riversand963 has updated the pull request. Re-import the pull request |
facebook-github-bot
left a comment
There was a problem hiding this comment.
riversand963 has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: facebook#4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071
This uses he new flag added in facebook/rocksdb#4172 to avoid the global_seq_no edits and the SST copying they forced us to do to preserve the raft log. Passing this flag has to be gated on a cluster version that is defined well after the upgrade to rocks 5.17 -- using the flag requires that older versions of RocksDB (< 5.16) will never be used to read, as they will still be looking for and using the seq_no in the file. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
This uses he new flag added in facebook/rocksdb#4172 to avoid the global_seq_no edits and the SST copying they forced us to do to preserve the raft log. Passing this flag has to be gated on a cluster version that is defined well after the upgrade to rocks 5.17 -- using the flag requires that older versions of RocksDB (< 5.16) will never be used to read, as they will still be looking for and using the seq_no in the file. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
This uses he new flag added in facebook/rocksdb#4172 to avoid the global_seq_no edits and the SST copying they forced us to do to preserve the raft log. Passing this flag has to be gated on a cluster version that is defined well after the upgrade to rocks 5.17 -- using the flag requires that older versions of RocksDB (< 5.16) will never be used to read, as they will still be looking for and using the seq_no in the file. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE)
34806: util/mon: Downgrade "bytes usage increases" log message r=knz a=bdarnell This message can be logged frequently in certain workloads, and its appearance in the info log is not actionable since there is no way to tie it back to the query that caused it. I see no reason to have these messages enabled by default. Release note: None 34886: storage/engine,libroach: 1x write amplification on ingest sst r=dt a=dt This uses the new flag added in facebook/rocksdb#4172 to avoid needing to make global_seq_no-related edits to SSTs, and thus avoid the SST copying those edits forced us to do to preserve the raft log immutability when hard-linking directly to side load SSTs. This is gated on a cluster version that is defined well after the upgrade to rocks 5.17, since using the flag requires that no older versions of RocksDB (<5.16), will ever be used to read these SSTs, as they will lack the seq_no that the older versions need for correctness. Release note (performance improvement): reduce write-amplification during bulk-loading (IMPORT and RESTORE) 34902: sql: ensure spans get anonymized in plan collection r=knz a=knz Fixes #34889. Release note (bug fix): The logical plans collected for display in the web UI now also hide the details of which key ranges are scanned in table lookups. Co-authored-by: Ben Darnell <ben@bendarnell.com> Co-authored-by: David Taylor <tinystatemachine@gmail.com> Co-authored-by: Raphael 'kena' Poss <knz@cockroachlabs.com>
Summary: It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest. Also syncing after writing global sequence when write_global_seqno=true was removed in #4172. Adding it back. Fixes #5287. Pull Request resolved: #5435 Test Plan: Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false. https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9 More test suggestions are welcome. Differential Revision: D15941675 Pulled By: riversand963 fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
Summary: It it not safe to assume application had sync the SST file before ingest it into DB. Also the directory to put the ingested file needs to be fsync, otherwise the file can be lost. For integrity of RocksDB we need to sync the ingested file and directory before apply the change to manifest. Also syncing after writing global sequence when write_global_seqno=true was removed in facebook#4172. Adding it back. Fixes facebook#5287. Pull Request resolved: facebook#5435 Test Plan: Test ingest file with ldb command and observe fsync/fdatasync in strace output. Tried both move_files=true and move_files=false. https://gist.github.com/yiwu-arbug/650a4023f57979056d83485fa863bef9 More test suggestions are welcome. Differential Revision: D15941675 Pulled By: riversand963 fbshipit-source-id: 389533f3923065a96df2cdde23ff4724a1810d78
RocksDB used to store global_seqno in external SST files written by
SstFileWriter. During file ingestion, RocksDB uses
pwriteto update theglobal_seqno. Since random write is not supported in some non-POSIX compliantfile systems, external SST file ingestion is not supported on these file
systems. To address this limitation, we no longer update
global_seqnoduringfile ingestion. Later RocksDB uses the MANIFEST and other information in table
properties to deduce global seqno for externally-ingested SST files.
Test plan: