persist: columnar representation for unsealed batches #9165
ruchirK merged 3 commits into MaterializeInc:main
Conversation
Force-pushed ebd4a47 to e7ec078
still TODO for me: summarize the impact on the end-to-end test with 1 million 110 byte rows.
it looks like the encoded batch sizes went up. I can see how we'd have more overhead for very small batches with this (e.g. 4 vecs vs 2 per record). let's look into whether there's an inflection point where a sufficiently large batch is smaller in the columnar representation. if not, we should at least understand why before we move forward with using this in the storage format
also, I've sufficiently forgotten the context of my WIP that you shouldn't feel the need to structure commits as things on top of it. once we're confident in the perf worthiness of this and you're ready to get it reviewed for merge, let's structure the PR as "introduce a columnar batch thingy" in one commit and "use it for unsealed" in another
yup -- that was how I was hoping to merge it. Kept it this way in case it was easier for review, but am happy to switch to this structure |
Force-pushed 0693aeb to bd140fa
This change is RFAL! I broke it up into logical commits as requested, and I have the following answers to the questions around encoding size and impact on end-to-end tests, which I also included in the second commit message (under the benchmark results). It seems very likely that the end-to-end tests will show improvement when we move updating listeners to a background task.
Hmm, I'd like to do some more experiments then before we merge this. From a results perspective, I'm highly confident that this is the right direction, but from a process perspective, I want us to do our due diligence before we commit to such a big change in the name of perf. In particular, actually seeing the needle move on the end-to-end test.
Huh. The columnar representation was the same size? That's surprising; I'd expect at some point for them to be smaller, but maybe my mental model here is wrong. I have zero idea how bincode encodes things, but let's walk through a generalized example of what might be happening. For the following:

The row encoding might be

The col encoding might be

Yeah, I can see how bincode might be behind this. The usual places for columnar to pull ahead of row encoding are if the format is self-describing (bincode isn't, AFAIK) or if the encoding doesn't store things like [u8] inline, but instead has a pointer out to the data (e.g. capnp). And it's losing a bit of efficiency on the columnar side by having to encode the length of the 4 vecs separately, even though they're always the same in practice. I'm sufficiently convinced that this behavior is expected.
Let's explore some prototypes of how we can avoid this. On one hand, we could move sending to the listener out of the hot path, which is something we'll likely want to do at some point, but in a way that's just shuffling the problem to some other thread, which will still become a throughput bottleneck for us eventually. What about this: instead of sending a
To be clear, my last suggestion isn't that we should expand the scope of this PR to include that change, just to build a quick prototype of it on top of this PR and see if the needle moves on end-to-end in a way that we expect. |
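To make the prototype concrete, here's a minimal sketch of moving listener sends off the write hot path and onto a background thread. This uses plain `std::sync::mpsc` and `std::thread` with hypothetical names (`Record`, `spawn_listener_thread`), not persist's actual runtime or API — just the shape of the idea.

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical record shape: ((key, val), time, diff).
type Record = ((String, String), u64, i64);

// Spawn a background thread that fans batches out to listeners; the hot path
// only enqueues. Returns the sender plus a handle yielding the delivered count.
fn spawn_listener_thread() -> (mpsc::Sender<Vec<Record>>, thread::JoinHandle<usize>) {
    let (tx, rx) = mpsc::channel::<Vec<Record>>();
    let handle = thread::spawn(move || {
        let mut delivered = 0;
        for batch in rx {
            // Real code would forward each batch to registered listeners here.
            delivered += batch.len();
        }
        delivered
    });
    (tx, handle)
}

fn main() {
    let (tx, handle) = spawn_listener_thread();
    // Hot path: enqueue and return immediately, no listener work inline.
    tx.send(vec![(("k1".into(), "v1".into()), 1, 1)]).unwrap();
    tx.send(vec![(("k2".into(), "v2".into()), 2, 1)]).unwrap();
    drop(tx); // closing the channel lets the background thread exit
    assert_eq!(handle.join().unwrap(), 2);
}
```

As noted above, this only relocates the work: the background thread itself becomes the throughput bottleneck once it can't drain the channel as fast as the writers fill it.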
Thanks for the review Dan! I agree that being more rigorous with the performance analysis would be better.

Re: the size on disk - I actually think you've raised a really interesting question, and it's actually non-obvious why the data in storage are the same size between the row-oriented and the columnar representations; thinking about this has helped me understand what we're actually writing down better. Concretely, we have a 100 byte string + 2 4 byte ints in each row inserted with

That's

BUT, it feels like, given your reasonable description of how

Perhaps one takeaway here is that the length prefix we add in
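A back-of-the-envelope sketch of why the two encodings could land at essentially the same size. This assumes a bincode-style encoding with an 8-byte length prefix per Vec and fixed-width 8-byte ints; the constants and function names are assumptions for illustration, not measurements from this PR.

```rust
// Hypothetical size model: every Vec costs an 8-byte length prefix, ints are
// 8 bytes. Illustration only, not measured behavior.

// Row-oriented: Vec<((Vec<u8>, Vec<u8>), u64, i64)> -- every record carries
// its own key length, key bytes, val length, val bytes, time, and diff.
fn row_encoded_size(n: u64, key_len: u64, val_len: u64) -> u64 {
    8 + n * ((8 + key_len) + (8 + val_len) + 8 + 8)
}

// Columnar: one big key buffer + key offsets, one big val buffer + val
// offsets (the "4 vecs"), plus time and diff columns, each Vec with its own
// length prefix; offsets modeled as one u64 per record.
fn col_encoded_size(n: u64, key_len: u64, val_len: u64) -> u64 {
    (8 + n * key_len) + (8 + n * 8)       // key data + key offsets
        + (8 + n * val_len) + (8 + n * 8) // val data + val offsets
        + (8 + n * 8)                     // times
        + (8 + n * 8)                     // diffs
}

fn main() {
    // 1 million records with a 100-byte key and a 4-byte val, the rough shape
    // of the test data above.
    let row = row_encoded_size(1_000_000, 100, 4);
    let col = col_encoded_size(1_000_000, 100, 4);
    // Per record, the row encoding's two length prefixes cost exactly as much
    // as the columnar encoding's two offsets, so the totals differ only by a
    // constant: the extra per-Vec length prefixes on the columnar side.
    assert_eq!(col - row, 40);
}
```

Under this model the per-record cost is identical in both layouts, which matches the observation that the encoded sizes came out the same regardless of batch size.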
Re: prototype -- I'll do that right now. I also was thinking about breaking the last commit that introduces the columnar representation into two commits, one that introduces the idea of the

My main motivation here is to avoid having to rebase the write path code multiple times between now and when we can do a meta version bump, but I don't have super strong feelings about this. wdyt?
Yeah that sgtm, exactly the sort of thing I want to start practicing in advance of when we do backward compatibility for real. I was thinking yesterday that I could do something similar with that WIP inc meta branch I linked you the other day |
Alright! Took a while to de-noise the end-to-end benchmark -- for future ref: turn off the system table test and correctly set the step interval to 100ms. Going from there, a nice repeatable testing protocol is to wipe mzdata, and then do 5 - 10 inserts of 1 million batches, without

With all of that -- the best case latency before the columnar change is 2.2 seconds, and the best case latency after the columnar change (changing the implementation to send

The 100 ms improvement is roughly what we had expected based on the microbenchmarks / more detailed info from

For me this is pretty conclusive proof that this change does indeed improve performance end-to-end. I'll go ahead and split up the commits so that we can merge everything that threads
Force-pushed bd140fa to 45ee22b
RFAL! This PR now threads
```rust
/// instances of the same ((Key, Val), Time), and some Diffs might be zero, or
/// add up to zero).
#[derive(Clone, Debug, Default, Eq, PartialEq, Serialize, Deserialize)]
pub struct ColumnarRecords {
```
thoughts on manually implementing `Debug` so it's actually useful? i lean toward yes. something like making this one in terms of `Ref` and implementing `Ref` with `f.debug_list`. we can leave `Iter` derived since I think iterators don't usually print all their contents as part of their debug impls
I implemented them both in terms of `debug_list` because I couldn't find a way to tell the formatter "print an instance of this type which also implements Debug", other than the struct builder thingy, which didn't seem right. the complexity of doing that is a tiny bit higher than reusing the `ColumnarRecordsRef` debug impl, but both debug impls seem simple enough. at any rate, this seems like something we can touch up in post
something like `fmt::Debug::fmt(&self.borrow(), f)` should work as the `ColumnarRecords` impl
🤦 ah yeah ofc. ty fixed!
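For reference, a self-contained sketch of the pattern being discussed: the Ref type implemented with `debug_list`, and the owned type delegating via `borrow()`. Field names and the offsets layout here are hypothetical stand-ins, not the PR's actual fields.

```rust
use std::fmt;

// Toy stand-ins for the real types (fields are hypothetical).
struct ColumnarRecords {
    key_data: Vec<u8>,
    key_offsets: Vec<usize>,
}

struct ColumnarRecordsRef<'a> {
    key_data: &'a [u8],
    key_offsets: &'a [usize],
}

impl ColumnarRecords {
    fn borrow(&self) -> ColumnarRecordsRef<'_> {
        ColumnarRecordsRef {
            key_data: &self.key_data,
            key_offsets: &self.key_offsets,
        }
    }
}

impl<'a> ColumnarRecordsRef<'a> {
    // Yield each key as a slice of the shared buffer.
    fn iter(&self) -> impl Iterator<Item = &'a [u8]> + '_ {
        let data = self.key_data;
        self.key_offsets.windows(2).map(move |w| &data[w[0]..w[1]])
    }
}

impl<'a> fmt::Debug for ColumnarRecordsRef<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_list().entries(self.iter()).finish()
    }
}

impl fmt::Debug for ColumnarRecords {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Delegate to the Ref impl, as suggested in the review.
        fmt::Debug::fmt(&self.borrow(), f)
    }
}

fn main() {
    let recs = ColumnarRecords {
        key_data: b"foobar".to_vec(),
        key_offsets: vec![0, 3, 6],
    };
    // Prints the two keys as byte lists rather than four opaque Vecs.
    assert_eq!(format!("{:?}", recs), "[[102, 111, 111], [98, 97, 114]]");
}
```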
…n state

Some changes to the encoding state render the previously stored golden state entirely un-decodable. In those cases, the golden test currently crashes with an assertion failure (it returns an error instead of the expected ok). This commit makes the behavior in that case a little bit more user friendly by printing out the current state and the requests that led to it, as most of the time in this case we just want to update golden_test.json.
Force-pushed 63766c8 to a3408bc
tftr! merging on green!
…olumnar format.

At a very high level, our data is composed of lots of small vectors of bytes for keys and values. This representation turns the storage around 90 degrees to be backed by one large vector of bytes for all the keys + another large vector of bytes for all the values, and separate arrays of offsets into both.

The pros of this representation are that it's faster to encode and more performant to scan through (fewer cache misses). The cons are that it's more complicated to append data, and it's downright difficult to reorder or shuffle data in this representation.

Actually using this representation is left to future commits.
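A minimal sketch of the layout described above. Field and method names (`key_data`, `key_offsets`, `push`, `get`) are hypothetical, not the PR's actual API; the point is the shape: concatenated byte buffers plus offset arrays.

```rust
// Hedged sketch of a columnar batch: all keys end-to-end in one buffer, all
// vals in another, with offset arrays marking record boundaries.
#[derive(Default)]
struct ColumnarRecords {
    key_data: Vec<u8>,       // all keys, concatenated
    key_offsets: Vec<usize>, // key i ends at key_offsets[i]
    val_data: Vec<u8>,       // all vals, concatenated
    val_offsets: Vec<usize>, // val i ends at val_offsets[i]
    timestamps: Vec<u64>,
    diffs: Vec<i64>,
}

impl ColumnarRecords {
    // Appending just extends the buffers; cheap and cache-friendly. Note there
    // is no easy way to reorder records in place, which is the con above.
    fn push(&mut self, key: &[u8], val: &[u8], ts: u64, diff: i64) {
        self.key_data.extend_from_slice(key);
        self.key_offsets.push(self.key_data.len());
        self.val_data.extend_from_slice(val);
        self.val_offsets.push(self.val_data.len());
        self.timestamps.push(ts);
        self.diffs.push(diff);
    }

    // Reconstruct record i by slicing between consecutive offsets.
    fn get(&self, i: usize) -> (&[u8], &[u8], u64, i64) {
        let k_start = if i == 0 { 0 } else { self.key_offsets[i - 1] };
        let v_start = if i == 0 { 0 } else { self.val_offsets[i - 1] };
        (
            &self.key_data[k_start..self.key_offsets[i]],
            &self.val_data[v_start..self.val_offsets[i]],
            self.timestamps[i],
            self.diffs[i],
        )
    }
}

fn main() {
    let mut recs = ColumnarRecords::default();
    recs.push(b"foo", b"a", 1, 1);
    recs.push(b"bar", b"b", 2, -1);
    assert_eq!(recs.get(1), (&b"bar"[..], &b"b"[..], 2, -1));
}
```

A row-oriented scan chases one small heap allocation per key; here a scan walks two contiguous buffers, which is where the "fewer cache misses" claim comes from.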
This commit teaches persist to accept writes as a Vec<ColumnarRecords>, and teaches writers to use the WriteReqBuilder. However, we keep the current row-oriented encoding format and will change the encoding in a later commit.

Unfortunately, this means that this commit will be a slight performance regression, as we do the work to generate inputs in the columnar format, and also have to convert back to the row-oriented format when writing out to storage.
```
indexed_write_drain/mem_sorted/23000000
time:   [257.46 ms 260.45 ms 263.71 ms]
thrpt:  [83.175 MiB/s 84.218 MiB/s 85.195 MiB/s]
change:
time:   [+24.933% +28.471% +31.811%] (p = 0.00 < 0.05)
thrpt:  [-24.134% -22.162% -19.957%]
Performance has regressed.

indexed_write_drain/mem_unsorted/23000000
time:   [260.59 ms 264.15 ms 268.14 ms]
thrpt:  [81.803 MiB/s 83.037 MiB/s 84.174 MiB/s]
change:
time:   [+14.418% +17.150% +19.689%] (p = 0.00 < 0.05)
thrpt:  [-16.450% -14.639% -12.601%]
Performance has regressed.

indexed_write_drain/file_sorted/23000000
time:   [334.63 ms 342.51 ms 351.35 ms]
thrpt:  [62.429 MiB/s 64.040 MiB/s 65.549 MiB/s]
change:
time:   [+12.900% +16.180% +19.538%] (p = 0.00 < 0.05)
thrpt:  [-16.345% -13.927% -11.426%]
Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
1 (10.00%) high mild

indexed_write_drain/file_unsorted/23000000
time:   [345.38 ms 366.15 ms 388.11 ms]
thrpt:  [56.516 MiB/s 59.906 MiB/s 63.508 MiB/s]
change:
time:   [+11.792% +18.735% +26.303%] (p = 0.00 < 0.05)
thrpt:  [-20.825% -15.779% -10.549%]
Performance has regressed.
```
Force-pushed a3408bc to 6d28e42
Motivation
Details in the commit messages.
Tips for reviewer
This grabs the commit from #8551 and then builds on top of that in a separate commit, for ease of review.
Checklist
user-facing behavior changes.