
persist: columnar representation for unsealed batches#9165

Merged
ruchirK merged 3 commits into MaterializeInc:main from ruchirK:persist-columnar
Nov 23, 2021

Conversation

@ruchirK
Contributor

@ruchirK ruchirK commented Nov 17, 2021

Motivation

Details in the commit messages.

Tips for reviewer

This grabs the commit from #8551 and then builds on top of that in a separate commit, for ease of review.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR adds a release note for any
    user-facing behavior changes.

@ruchirK ruchirK requested a review from danhhz November 17, 2021 19:26
@ruchirK
Contributor Author

ruchirK commented Nov 17, 2021

still TODO for me: summarize the impact on the end-to-end test with 1 million 110 byte rows.

@danhhz
Contributor

danhhz commented Nov 17, 2021

It looks like the encoded batch sizes went up. I can see how we'd have more overhead for very small batches with this (e.g. 4 vecs vs 2 per record). Let's look into whether there's an inflection point where a sufficiently large batch is smaller in the columnar representation. If not, we should at least understand why before we move forward with using this in the storage format.

@danhhz
Contributor

danhhz commented Nov 17, 2021

Also, I've sufficiently forgotten the context of my WIP that you shouldn't feel the need to structure commits as things on top of it. Once we're confident in the perf worthiness of this and you're ready to get it reviewed for merge, let's structure the PR as "introduce a columnar batch thingy" in one commit and "use it for unsealed" in another.

@ruchirK
Contributor Author

ruchirK commented Nov 17, 2021

once we're confident in the perf worthiness of this and you're ready to get it reviewed for merge, let's structure the PR as "introduce a columnar batch thingy" in one commit and "use it for unsealed" in another

yup -- that was how I was hoping to merge it. Kept it this way in case it was easier for review, but am happy to switch to this structure

Comment thread src/persist/src/indexed/runtime.rs Outdated
@ruchirK ruchirK force-pushed the persist-columnar branch 2 times, most recently from 0693aeb to bd140fa Compare November 17, 2021 22:10
@ruchirK
Contributor Author

ruchirK commented Nov 17, 2021

This change is RFAL! I broke it up into logical commits as requested, and I have the following answers to the questions about encoding size and impact on end-to-end tests, which I also included in the second commit message (under the benchmark results):

For very small batches this change results in a noticeable increase in the encoded
size but the effect is small, and entirely unnoticeable once data volumes exceed
1,000 bytes (measured with inserts of 1.1KB, 110 KB, and 110 MB, and the sizes of
unsealed batches produced before and after this change were identical).

Unfortunately, this change doesn't register any improvement in the end-to-end test
sending 1 million 110 byte rows via a COPY INTO, because we save on allocations
during `write_atomic`, but end up having to allocate when sending rows to the listener.
However, there are clear positives from this change. Serialization times decrease
from 230 ms to 170 ms (out of a total of ~360-370 ms CPU time spent in drain_pending).
This represents a ~20% improvement that is roughly commensurate with what we observe
in the benchmarks. On the coordinator side, we go from spending ~175 ms in
`write_atomic` to spending about 140 ms in `write_atomic`, which is a smaller than
expected change. Curiously, the trace seems to be dominated by `free_tiny`, which
leads us to suspect that the cost we're now bottlenecked on is the `drop`
on the owned vec that gets passed to `write_atomic`.

It seems very likely that the end-to-end tests will show improvement when we move updating listeners to a background task.

@danhhz
Contributor

danhhz commented Nov 18, 2021

Hmm, I'd like to do some more experiments then before we merge this. From a results perspective, I'm highly confident that this is the right direction, but from a process perspective, I want us to do our due diligence before we commit to such a big change in the name of perf. In particular, actually seeing the needle move on the end-to-end test.

For very small batches this change results in a noticeable increase in the encoded size but the effect is small, and entirely unnoticeable once data volumes exceed 1,000 bytes (measured with inserts of 1.1KB, 110 KB, and 110 MB, and the sizes of unsealed batches produced before and after this change were identical).

Huh. The columnar representation was the same size? That's surprising; I'd expect them to be smaller at some point, but maybe my mental model here is wrong. I have zero idea how bincode encodes things, but let's walk through a generalized example of what might be happening. For the following:

<key1><val1><ts1><diff1>
...
<keyN><valN><tsN><diffN>

The row encoding might be <N_rows: u64> followed by N repetitions of <key_len1: u64><key_bytes1: [u8: key_len1]><val_len1: u64><val_bytes1: [u8: val_len1]><ts1: u64><diff1: i64> for a total of 8 + 8*N + all_key_bytes + 8*N + all_val_bytes + 8*N + 8*N = 8 + 32*N + all_key_val_bytes.

The col encoding might be <N_keys: u64> followed by N repetitions of <key_len1: u64><key_bytes1: [u8: key_len1]> followed by the same for vals, then <N_timestamps: u64><timestamps: [u64; N_timestamps]><N_diffs: u64><diffs: [u64; N_diffs]> for a total of 8 + 8 * N + all_key_bytes + 8 + 8 * N + all_val_bytes + 8 + 8 * N + 8 + 8 * N = 32 + 32 * N + all_key_val_bytes.

Yeah, I can see how bincode might be behind this. The usual places for columnar to pull ahead of a row encoding are when the format is self-describing (bincode isn't, AFAIK) or when the encoding doesn't store things like [u8] inline, but instead has a pointer out to the data (e.g. capnp). And it's losing a bit of efficiency on the columnar side by having to encode the lengths of the 4 vecs separately, even though they're always the same in practice.

I'm sufficiently convinced that this behavior is expected.
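The accounting above can be checked with a quick sketch (a hypothetical model: it assumes every length and count is a fixed 8-byte u64, matching the worked example above rather than any particular bincode configuration):

```rust
// Size estimates for N records carrying the given total key/val payload bytes,
// assuming every length, count, timestamp, and diff is a fixed 8-byte integer.

// <N_rows> + per record: <key_len><key><val_len><val><ts><diff>
fn row_encoded_size(n: u64, key_bytes: u64, val_bytes: u64) -> u64 {
    8 + n * (8 + 8 + 8 + 8) + key_bytes + val_bytes
}

// <N_keys> + key lens + keys, <N_vals> + val lens + vals,
// <N_ts> + timestamps, <N_diffs> + diffs
fn col_encoded_size(n: u64, key_bytes: u64, val_bytes: u64) -> u64 {
    (8 + n * 8 + key_bytes) + (8 + n * 8 + val_bytes) + (8 + n * 8) + (8 + n * 8)
}

fn main() {
    let (n, kb, vb) = (1_000_000u64, 121_000_000u64, 0u64);
    // Under this model the columnar form carries exactly 24 extra bytes,
    // i.e. three more vec-length prefixes, regardless of N.
    assert_eq!(col_encoded_size(n, kb, vb) - row_encoded_size(n, kb, vb), 24);
    println!("row: {}, col: {}", row_encoded_size(n, kb, vb), col_encoded_size(n, kb, vb));
}
```

Under these assumptions the two encodings differ by a constant 24 bytes, consistent with the observation that the sizes are effectively identical at any realistic batch size.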

we save on allocations during write_atomic, but end up having to allocate when sending rows to the listener.

Let's explore some prototypes of how we can avoid this. On one hand, we could move sending to the listener out of the hot path, which is something we'll likely want to do at some point, but in a way it's just shuffling the problem to some other thread, which will still become a throughput bottleneck for us at some point. What about this: instead of sending a ((Vec<u8>, Vec<u8>), usize, i64) out to the listener, we could send a (idx: usize, batch: Arc<ColumnarBatch>) and the listener is responsible for indexing the indicated row out. That would avoid the col->row transformation entirely.
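A rough shape of that suggestion, using a hypothetical minimal stand-in for the batch type (the real ColumnarRecords lives in src/persist/src/indexed/columnar.rs and differs in detail):

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for persist's columnar batch: one big
// byte buffer for all keys plus an offsets array delimiting each key.
struct ColumnarRecords {
    keys: Vec<u8>,
    key_offsets: Vec<usize>, // key i is keys[key_offsets[i]..key_offsets[i + 1]]
}

impl ColumnarRecords {
    fn key(&self, idx: usize) -> &[u8] {
        &self.keys[self.key_offsets[idx]..self.key_offsets[idx + 1]]
    }
}

// Instead of materializing an owned (Vec<u8>, Vec<u8>) per record, the
// listener receives an index plus a shared handle to the whole batch and
// borrows the row it needs, avoiding the col->row allocation entirely.
fn notify_listener(idx: usize, batch: Arc<ColumnarRecords>) -> Vec<u8> {
    // The listener copies only if it must retain the data past the call.
    batch.key(idx).to_vec()
}

fn main() {
    let batch = Arc::new(ColumnarRecords {
        keys: b"foobar".to_vec(),
        key_offsets: vec![0, 3, 6],
    });
    println!("{:?}", notify_listener(1, Arc::clone(&batch)));
}
```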

@danhhz
Contributor

danhhz commented Nov 18, 2021

To be clear, my last suggestion isn't that we should expand the scope of this PR to include that change, just to build a quick prototype of it on top of this PR and see if the needle moves on end-to-end in a way that we expect.

@ruchirK
Contributor Author

ruchirK commented Nov 18, 2021

Thanks for the review Dan! I agree that being more rigorous with the performance analysis would be better.

Re: the size on disk - I think you've raised a really interesting question. It's actually non-obvious why the data in storage are the same size between the row-oriented and columnar representations, and thinking about this has helped me better understand what we're actually writing down.

Concretely, we have a 100 byte string + 2 4-byte ints in each row inserted with COPY FROM. In the Row encoding that turns into 100 bytes for the string + 8 bytes for the two ints + 1 byte for the length of the string (because it's small) + 3 tag bytes (one per column), for a total of 112 bytes. When we call Codec::encode() on this Row we add 8 bytes for the row length + 1 byte of magic, so 121 for the encoded Row.

That's key_bytes. val_bytes is just empty because it's (). That + 2 x 8 bytes (for time and diff) comes out to 137 bytes per ((key, val), time, diff), which has to be constant between the previous representation and the columnar one, because we're not doing any compression on any of these bytes; we're just re-arranging how they are laid out.

BUT, given your reasonable description of how bincode might be doing things -- bincode doesn't know a priori that we're encoding the length in the first 9 bytes of each key, and would have to encode the length of the Vec<u8> provided per key and value in a way that the columnar representation, at first glance, does not. Specifically, I don't think "The col encoding might be <N_keys: u64> followed by N repetitions of <key_len1: u64><key_bytes1: [u8: key_len1]> followed by the same for vals" is correct, because it doesn't make sense to me that the key lengths would get interleaved with each key's bytes. However, the columnar representation pays the same additional cost in the offset arrays (which is something I didn't realize until thinking through this!). So, that's an additional 16 bytes, stored either inline in the row-oriented representation or in separate arrays in columnar, for a total of 153 bytes per row inserted, which is almost exactly what's observed (148 MiB == 148 * 2^20 bytes ~= 155 million bytes).

Perhaps one takeaway here is that the length prefix we add in Codec::encode for Row could/should be removed?
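Spelled out as arithmetic (all numbers taken from the accounting above):

```rust
// Per-row byte accounting for the 1M-row COPY of (100-byte string, two 4-byte ints).
fn per_row_bytes() -> u64 {
    let row = 100 + 2 * 4 + 1 + 3; // payload + string length byte + 3 column tags = 112
    let encoded = row + 8 + 1;     // Codec::encode adds an 8-byte length + 1 magic byte = 121
    let record = encoded + 8 + 8;  // u64 time + i64 diff = 137
    record + 8 + 8                 // key/val lengths, inline or in offset arrays = 153
}

fn main() {
    let total = per_row_bytes() * 1_000_000;
    // ~153 million bytes, close to the observed 148 MiB (~155 million bytes).
    println!("{} bytes/row, {} bytes total", per_row_bytes(), total);
}
```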

@ruchirK
Contributor Author

ruchirK commented Nov 18, 2021

Re: prototype -- I'll do that right now.

I also was thinking about breaking the last commit that introduces the columnar representation into two commits, one that introduces the idea of the WriteReqBuilder and changes writes to happen in terms of ColumnarRecords, but doesn't change the on-disk encoding (this commit would be a small perf regression - because we'd be allocating a large ColumnarRecords object, and then breaking it up to construct a BlobUnsealedBatch), and then another commit that handles actually changing the encoding.

My main motivation here is to avoid having to rebase the write path code multiple times between now and when we can do a meta version bump, but I don't have super strong feelings about this. wdyt?

@danhhz
Contributor

danhhz commented Nov 18, 2021

I also was thinking about breaking the last commit that introduces the columnar representation into two commits, one that introduces the idea of the WriteReqBuilder and changes writes to happen in terms of ColumnarRecords, but doesn't change the on-disk encoding (this commit would be a small perf regression - because we'd be allocating a large ColumnarRecords object, and then breaking it up to construct a BlobUnsealedBatch), and then another commit that handles actually changing the encoding.

My main motivation here is to avoid having to rebase the write path code multiple times between now and when we can do a meta version bump, but I don't have super strong feelings about this. wdyt?

Yeah that sgtm, exactly the sort of thing I want to start practicing in advance of when we do backward compatibility for real. I was thinking yesterday that I could do something similar with that WIP inc meta branch I linked you the other day

@ruchirK
Contributor Author

ruchirK commented Nov 22, 2021

Alright! Took a while to de-noise the end-to-end benchmark -- for future ref

cargo run --release -- -w 1 --log-file=stderr --experimental --persistent-user-tables --disable-persistent-system-tables-test --timestamp-frequency 100ms

turns off the system table test and correctly sets the step interval to 100ms. From there, a nice repeatable testing protocol is to wipe mzdata and then do 5-10 inserts of 1-million-row batches, without Instruments attached (Instruments adds another ~15% of overhead). Also, I commented out drain_unsealed and compact to focus on the write latency.

With all of that -- the best case latency before the columnar change is 2.2 seconds, and the best case latency after the columnar change (changing the implementation to send ColumnarRecords directly to listeners as discussed) is 2.1 seconds. Of that time, ~1.5 seconds is spent decoding the copy text format.

The 100 ms improvement is roughly what we had expected based on the microbenchmarks / more detailed info from Instruments (we go from spending approximately 700 ms in coord + persist to 600 ms in coord + persist, driven by speedups in serialization and small speedups in write_atomic)

For me this is pretty conclusive proof that this change does indeed improve performance end-to-end. I'll go ahead and split up the commits so that we can merge everything that threads ColumnarRecords through, without actually changing the encoding, and then leave the encoding change for our breaking change window.

@ruchirK
Contributor Author

ruchirK commented Nov 23, 2021

RFAL! This PR now threads ColumnarRecords through as much of persist as possible without changing the encoding, and #9239 handles actually changing the encoding.

Contributor

@danhhz danhhz left a comment


LGTM

Comment thread src/persist/src/indexed/columnar.rs Outdated
Comment thread src/persist/src/indexed/columnar.rs Outdated
Comment thread src/persist/src/indexed/columnar.rs Outdated
Comment thread src/persist/src/indexed/columnar.rs
/// instances of the same ((Key, Val), Time), and some Diffs might be zero, or
/// add up to zero).
#[derive(Clone, Debug, Default, Eq, PartialEq, Serialize, Deserialize)]
pub struct ColumnarRecords {
Contributor


thoughts on manually implementing Debug so it's actually useful? i lean toward yes. something like making this one in terms of Ref and implementing Ref with fmt.debug_list. we can leave Iter derived since I think iterators don't usually print all their contents as part of their debug impls

Contributor Author


I implemented them both in terms of debug_list because I couldn't find a way to tell the formatter "print an instance of this type which also implements Debug", other than the struct builder thingy, which didn't seem right. The complexity of doing that is a tiny bit higher than reusing ColumnarRecordsRef's Debug impl, but both Debug impls seem simple enough. At any rate, this seems like something we can touch up in post.

Contributor


something like fmt::Debug::fmt(self.borrow(), f) should work as the ColumnarRecords impl
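For reference, the suggested delegation looks like this sketch (types simplified for illustration; the real ColumnarRecords stores columns of bytes rather than a Vec of rows):

```rust
use std::fmt;

// Simplified stand-ins: an owned batch and a borrowed view of it.
struct ColumnarRecords {
    rows: Vec<(Vec<u8>, u64)>,
}

struct ColumnarRecordsRef<'a> {
    rows: &'a [(Vec<u8>, u64)],
}

impl ColumnarRecords {
    fn borrow(&self) -> ColumnarRecordsRef<'_> {
        ColumnarRecordsRef { rows: &self.rows }
    }
}

// The borrowed form implements Debug once, via debug_list.
impl<'a> fmt::Debug for ColumnarRecordsRef<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_list().entries(self.rows.iter()).finish()
    }
}

// The owned type's Debug just delegates to the borrowed form, as suggested.
impl fmt::Debug for ColumnarRecords {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        fmt::Debug::fmt(&self.borrow(), f)
    }
}

fn main() {
    let cr = ColumnarRecords { rows: vec![(vec![1u8], 7)] };
    println!("{:?}", cr); // prints a list of entries rather than an opaque struct
}
```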

Contributor Author


🤦 ah yeah ofc. ty fixed!

Comment thread src/persist/src/indexed/mod.rs Outdated
…n state

Some changes to the encoding state render the previously stored golden state
entirely un-decodable. In those cases, the golden test currently crashes with
an assertion failure (it returns an error instead of the expected ok). This commit
makes the behavior in that case a little bit more user friendly by printing out
the current state and the requests that led to it, as most of the time in this
case we just want to update golden_test.json.
@ruchirK ruchirK force-pushed the persist-columnar branch 2 times, most recently from 63766c8 to a3408bc Compare November 23, 2021 20:55
@ruchirK
Contributor Author

ruchirK commented Nov 23, 2021

tftr! merging on green!

@ruchirK ruchirK enabled auto-merge November 23, 2021 21:01
@ruchirK ruchirK disabled auto-merge November 23, 2021 21:26
…olumnar format.

At a very high level, our data is composed of lots of small vectors of bytes for
keys and values. This representation turns the storage around 90 degrees to be
backed by one large vector of bytes for all the keys + another large vector of
bytes for all the values, and separate arrays of offsets into both.

The pros of this representation are that it's faster to encode and more performant
to scan through (fewer cache misses). The cons are that it's more complicated to
append data, and it's downright difficult to reorder or shuffle data in this
representation.

Actually using this representation is left to future commits.
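A minimal sketch of the layout described above (field names are illustrative, not the actual ColumnarRecords definition):

```rust
/// One large buffer per column plus offset arrays, instead of a Vec of
/// small (Vec<u8>, Vec<u8>) pairs. Offsets store the end of each entry.
#[derive(Default)]
struct ColumnarSketch {
    key_data: Vec<u8>,
    key_offsets: Vec<usize>,
    val_data: Vec<u8>,
    val_offsets: Vec<usize>,
    timestamps: Vec<u64>,
    diffs: Vec<i64>,
}

impl ColumnarSketch {
    /// Appending is a handful of extends: no per-record Vec allocation.
    fn push(&mut self, key: &[u8], val: &[u8], ts: u64, diff: i64) {
        self.key_data.extend_from_slice(key);
        self.key_offsets.push(self.key_data.len());
        self.val_data.extend_from_slice(val);
        self.val_offsets.push(self.val_data.len());
        self.timestamps.push(ts);
        self.diffs.push(diff);
    }

    /// Reading record i slices into the shared buffers; note that reordering
    /// records would require rewriting the buffers, hence the stated con.
    fn get(&self, i: usize) -> (&[u8], &[u8], u64, i64) {
        let ks = if i == 0 { 0 } else { self.key_offsets[i - 1] };
        let vs = if i == 0 { 0 } else { self.val_offsets[i - 1] };
        (
            &self.key_data[ks..self.key_offsets[i]],
            &self.val_data[vs..self.val_offsets[i]],
            self.timestamps[i],
            self.diffs[i],
        )
    }
}

fn main() {
    let mut c = ColumnarSketch::default();
    c.push(b"ab", b"x", 1, 1);
    c.push(b"cd", b"yz", 2, -1);
    println!("{:?}", c.get(1));
}
```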
This commit teaches persist to accept writes as a Vec<ColumnarRecords>, and teaches
writers to use the WriteReqBuilder. However, we keep the current row oriented
encoding format and will change the encoding in a later commit.

Unfortunately, this means that this commit will be a slight performance regression,
as we do the work to generate inputs in the columnar format, and also have to convert
back to the row oriented format when writing out to storage.

indexed_write_drain/mem_sorted/23000000
                        time:   [257.46 ms 260.45 ms 263.71 ms]
                        thrpt:  [83.175 MiB/s 84.218 MiB/s 85.195 MiB/s]
                 change:
                        time:   [+24.933% +28.471% +31.811%] (p = 0.00 < 0.05)
                        thrpt:  [-24.134% -22.162% -19.957%]
                        Performance has regressed.
indexed_write_drain/mem_unsorted/23000000
                        time:   [260.59 ms 264.15 ms 268.14 ms]
                        thrpt:  [81.803 MiB/s 83.037 MiB/s 84.174 MiB/s]
                 change:
                        time:   [+14.418% +17.150% +19.689%] (p = 0.00 < 0.05)
                        thrpt:  [-16.450% -14.639% -12.601%]
                        Performance has regressed.
indexed_write_drain/file_sorted/23000000
                        time:   [334.63 ms 342.51 ms 351.35 ms]
                        thrpt:  [62.429 MiB/s 64.040 MiB/s 65.549 MiB/s]
                 change:
                        time:   [+12.900% +16.180% +19.538%] (p = 0.00 < 0.05)
                        thrpt:  [-16.345% -13.927% -11.426%]
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
indexed_write_drain/file_unsorted/23000000
                        time:   [345.38 ms 366.15 ms 388.11 ms]
                        thrpt:  [56.516 MiB/s 59.906 MiB/s 63.508 MiB/s]
                 change:
                        time:   [+11.792% +18.735% +26.303%] (p = 0.00 < 0.05)
                        thrpt:  [-20.825% -15.779% -10.549%]
                        Performance has regressed.
@ruchirK ruchirK enabled auto-merge November 23, 2021 21:31
@ruchirK ruchirK merged commit b73f65e into MaterializeInc:main Nov 23, 2021