
persist: columnar representation for unsealed batches#9165

Merged
ruchirK merged 3 commits into MaterializeInc:main from ruchirK:persist-columnar
Nov 23, 2021

Conversation

@ruchirK
Contributor

@ruchirK ruchirK commented Nov 17, 2021

Motivation

Details in the commit messages.

Tips for reviewer

This grabs the commit from #8551 and then builds on top of that in a separate commit, for ease of review.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR adds a release note for any
    user-facing behavior changes.

@ruchirK ruchirK requested a review from danhhz November 17, 2021 19:26
@ruchirK
Contributor Author

ruchirK commented Nov 17, 2021

still TODO for me: summarize the impact on the end-to-end test with 1 million 110 byte rows.

@danhhz
Contributor

danhhz commented Nov 17, 2021

It looks like the encoded batch sizes went up. I can see how we'd have more overhead for very small batches with this (e.g. 4 vecs vs 2 per record). Let's look into whether there's an inflection point where a sufficiently large batch is smaller in the columnar representation. If not, we should at least understand why before we move forward with using this in the storage format.

@danhhz
Contributor

danhhz commented Nov 17, 2021

Also, I've sufficiently forgotten the context of my WIP that you shouldn't feel the need to structure commits as things on top of it. Once we're confident in the perf worthiness of this and you're ready to get it reviewed for merge, let's structure the PR as "introduce a columnar batch thingy" in one commit and "use it for unsealed" in another.

@ruchirK
Contributor Author

ruchirK commented Nov 17, 2021

once we're confident in the perf worthiness of this and you're ready to get it reviewed for merge, let's structure the PR as "introduce a columnar batch thingy" in one commit and "use it for unsealed" in another

yup -- that was how I was hoping to merge it. Kept it this way in case it was easier for review, but am happy to switch to this structure

Comment thread src/persist/src/indexed/runtime.rs Outdated
@ruchirK ruchirK force-pushed the persist-columnar branch 2 times, most recently from 0693aeb to bd140fa Compare November 17, 2021 22:10
@ruchirK
Contributor Author

ruchirK commented Nov 17, 2021

This change is RFAL! I broke it up into logical commits as requested, and I have the following answers to the questions about encoding size and impact on end-to-end tests, which I also included in the second commit message (under the benchmark results):

For very small batches this change results in a noticeable increase in the encoded
size but the effect is small, and entirely unnoticeable once data volumes exceed
1,000 bytes (measured with inserts of 1.1KB, 110 KB, and 110 MB, and the sizes of
unsealed batches produced before and after this change were identical).

Unfortunately, this change doesn't register any improvement in the end-to-end test
sending 1 million 110 byte rows via a COPY INTO, because we save on allocations
during `write_atomic`, but end up having to allocate when sending rows to the listener.
However, there are clear positives from this change. Serialization times decrease
from 230 ms to 170 ms (out of a total of ~360-370 ms CPU time spent in drain_pending).
This represents a ~20% improvement that is roughly commensurate with what we observe
in the benchmarks. On the coordinator side, we go from spending ~175 ms in
`write_atomic` to spending about 140 ms in `write_atomic`, which is a smaller than
expected change. Curiously, the trace seems to be dominated by `free_tiny`, which
leads us to suspect that the cost we're now bottlenecked on is the `drop`
on the owned vec that gets passed to `write_atomic`.

It seems very likely that the end-to-end tests will show improvement when we move updating listeners to a background task.

@danhhz
Contributor

danhhz commented Nov 18, 2021

Hmm, I'd like to do some more experiments then before we merge this. From a results perspective, I'm highly confident that this is the right direction, but from a process perspective, I want us to do our due diligence before we commit to such a big change in the name of perf. In particular, actually seeing the needle move on the end-to-end test.

For very small batches this change results in a noticeable increase in the encoded size but the effect is small, and entirely unnoticeable once data volumes exceed 1,000 bytes (measured with inserts of 1.1KB, 110 KB, and 110 MB, and the sizes of unsealed batches produced before and after this change were identical).

Huh. The columnar representation was the same size? That's surprising; I'd expect them to be smaller at some point, but maybe my mental model here is wrong. I have zero idea how bincode encodes things, but let's walk through a generalized example of what might be happening. For the following:

<key1><val1><ts1><diff1>
...
<keyN><valN><tsN><diffN>

The row encoding might be <N_rows: u64> followed by N repetitions of <key_len1: u64><key_bytes1: [u8: key_len1]><val_len1: u64><val_bytes1: [u8: val_len1]><ts1: u64><diff1: i64> for a total of 8 + 8*N + all_key_bytes + 8*N + all_val_bytes + 8*N + 8*N = 8 + 32*N + all_key_val_bytes.

The col encoding might be <N_keys: u64> followed by N repetitions of <key_len1: u64><key_bytes1: [u8: key_len1]> followed by the same for vals, then <N_timestamps: u64><timestamps: [u64; N_timestamps]><N_diffs: u64><diffs: [u64; N_diffs]> for a total of 8 + 8 * N + all_key_bytes + 8 + 8 * N + all_val_bytes + 8 + 8 * N + 8 + 8 * N = 32 + 32 * N + all_key_val_bytes.

Yeah, I can see how bincode might be behind this. The usual places for columnar to pull ahead of a row encoding are when the format is self-describing (bincode isn't, AFAIK) or when the encoding doesn't store things like [u8] inline, but instead has a pointer out to the data (e.g. capnp). And it's losing a bit of efficiency on the columnar side by having to encode the lengths of the 4 vecs separately, even though they're always the same in practice.

I'm sufficiently convinced that this behavior is expected.
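The accounting above can be checked with a quick sketch (a hypothetical model: it assumes every length and count is a fixed 8-byte u64, matching the worked example above rather than any particular bincode configuration):

```rust
// Size estimates for N records carrying the given total key/val payload bytes,
// assuming every length, count, timestamp, and diff is a fixed 8-byte integer.

// <N_rows> + per record: <key_len><key><val_len><val><ts><diff>
fn row_encoded_size(n: u64, key_bytes: u64, val_bytes: u64) -> u64 {
    8 + n * (8 + 8 + 8 + 8) + key_bytes + val_bytes
}

// <N_keys> + key lens + keys, <N_vals> + val lens + vals,
// <N_ts> + timestamps, <N_diffs> + diffs
fn col_encoded_size(n: u64, key_bytes: u64, val_bytes: u64) -> u64 {
    (8 + n * 8 + key_bytes) + (8 + n * 8 + val_bytes) + (8 + n * 8) + (8 + n * 8)
}

fn main() {
    let (n, kb, vb) = (1_000_000u64, 121_000_000u64, 0u64);
    // Under this model the columnar form carries exactly 24 extra bytes,
    // i.e. three more vec-length prefixes, regardless of N.
    assert_eq!(col_encoded_size(n, kb, vb) - row_encoded_size(n, kb, vb), 24);
    println!("row: {}, col: {}", row_encoded_size(n, kb, vb), col_encoded_size(n, kb, vb));
}
```

Under these assumptions the two encodings differ by a constant 24 bytes, consistent with the observation that the sizes are effectively identical at any realistic batch size.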

we save on allocations during write_atomic, but end up having to allocate when sending rows to the listener.

Let's explore some prototypes of how we can avoid this. On one hand, we could move sending to the listener out of the hot path, which is something we'll likely want to do at some point, but in a way it's just shuffling the problem to some other thread, which will still become a throughput bottleneck for us at some point. What about this: instead of sending a ((Vec<u8>, Vec<u8>), usize, i64) out to the listener, we could send a (idx: usize, batch: Arc<ColumnarBatch>) and the listener is responsible for indexing the indicated row out. That would avoid the col->row transformation entirely.
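A rough shape of that suggestion, using a hypothetical minimal stand-in for the batch type (the real ColumnarRecords lives in src/persist/src/indexed/columnar.rs and differs in detail):

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-in for persist's columnar batch: one big
// byte buffer for all keys plus an offsets array delimiting each key.
struct ColumnarRecords {
    keys: Vec<u8>,
    key_offsets: Vec<usize>, // key i is keys[key_offsets[i]..key_offsets[i + 1]]
}

impl ColumnarRecords {
    fn key(&self, idx: usize) -> &[u8] {
        &self.keys[self.key_offsets[idx]..self.key_offsets[idx + 1]]
    }
}

// Instead of materializing an owned (Vec<u8>, Vec<u8>) per record, the
// listener receives an index plus a shared handle to the whole batch and
// borrows the row it needs, avoiding the col->row allocation entirely.
fn notify_listener(idx: usize, batch: Arc<ColumnarRecords>) -> Vec<u8> {
    // The listener copies only if it must retain the data past the call.
    batch.key(idx).to_vec()
}

fn main() {
    let batch = Arc::new(ColumnarRecords {
        keys: b"foobar".to_vec(),
        key_offsets: vec![0, 3, 6],
    });
    println!("{:?}", notify_listener(1, Arc::clone(&batch)));
}
```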

@danhhz
Contributor

danhhz commented Nov 18, 2021

To be clear, my last suggestion isn't that we should expand the scope of this PR to include that change, just to build a quick prototype of it on top of this PR and see if the needle moves on end-to-end in a way that we expect.

@ruchirK
Contributor Author

ruchirK commented Nov 18, 2021

Thanks for the review Dan! I agree that being more rigorous with the performance analysis would be better.

Re: the size on disk - I think you've raised a really interesting question. It's actually non-obvious why the data in storage are the same size between the row-oriented and columnar representations, and thinking about this has helped me better understand what we're actually writing down.

Concretely, we have a 100 byte string + 2 4-byte ints in each row inserted with COPY FROM. In the Row encoding that turns into 100 bytes for the string + 8 bytes for the two ints + 1 byte for the length of the string (because it's small) + 3 tag bytes (one per column), for a total of 112 bytes. When we call Codec::encode() on this Row we add 8 bytes for the row length + 1 byte of magic, so 121 for the encoded Row.

That's key_bytes. val_bytes is just empty because it's (). That + 2 x 8 bytes (for time and diff) comes out to 137 bytes per ((key, val), time, diff), which has to be constant between the previous representation and the columnar one, because we're not doing any compression on any of these bytes; we're just re-arranging how they are laid out.

BUT, given your reasonable description of how bincode might be doing things -- bincode doesn't know a priori that we're encoding the length in the first 9 bytes of each key, and would have to encode the length of the Vec<u8> provided per key and value in a way that the columnar representation, at first glance, does not. Specifically, I don't think "The col encoding might be <N_keys: u64> followed by N repetitions of <key_len1: u64><key_bytes1: [u8: key_len1]> followed by the same for vals" is correct, because it doesn't make sense to me that the key lengths would get interleaved with each key's bytes. However, the columnar representation pays the same additional cost in the offset arrays (which is something I didn't realize until thinking through this!). So, that's an additional 16 bytes, stored either inline in the row-oriented representation or in separate arrays in columnar, for a total of 153 bytes per row inserted, which is almost exactly what's observed (148 MiB == 148 * 2^20 bytes ~= 155 million bytes).

Perhaps one takeaway here is that the length prefix we add in Codec::encode for Row could/should be removed?
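Spelled out as arithmetic (all numbers taken from the accounting above):

```rust
// Per-row byte accounting for the 1M-row COPY of (100-byte string, two 4-byte ints).
fn per_row_bytes() -> u64 {
    let row = 100 + 2 * 4 + 1 + 3; // payload + string length byte + 3 column tags = 112
    let encoded = row + 8 + 1;     // Codec::encode adds an 8-byte length + 1 magic byte = 121
    let record = encoded + 8 + 8;  // u64 time + i64 diff = 137
    record + 8 + 8                 // key/val lengths, inline or in offset arrays = 153
}

fn main() {
    let total = per_row_bytes() * 1_000_000;
    // ~153 million bytes, close to the observed 148 MiB (~155 million bytes).
    println!("{} bytes/row, {} bytes total", per_row_bytes(), total);
}
```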

@ruchirK
Contributor Author

ruchirK commented Nov 18, 2021

Re: prototype -- I'll do that right now.

I also was thinking about breaking the last commit that introduces the columnar representation into two commits, one that introduces the idea of the WriteReqBuilder and changes writes to happen in terms of ColumnarRecords, but doesn't change the on-disk encoding (this commit would be a small perf regression - because we'd be allocating a large ColumnarRecords object, and then breaking it up to construct a BlobUnsealedBatch), and then another commit that handles actually changing the encoding.

My main motivation here is to avoid having to rebase the write path code multiple times between now and when we can do a meta version bump, but I don't have super strong feelings about this. wdyt?

@danhhz
Contributor

danhhz commented Nov 18, 2021

I also was thinking about breaking the last commit that introduces the columnar representation into two commits, one that introduces the idea of the WriteReqBuilder and changes writes to happen in terms of ColumnarRecords, but doesn't change the on-disk encoding (this commit would be a small perf regression - because we'd be allocating a large ColumnarRecords object, and then breaking it up to construct a BlobUnsealedBatch), and then another commit that handles actually changing the encoding.

My main motivation here is to avoid having to rebase the write path code multiple times between now and when we can do a meta version bump, but I don't have super strong feelings about this. wdyt?

Yeah that sgtm, exactly the sort of thing I want to start practicing in advance of when we do backward compatibility for real. I was thinking yesterday that I could do something similar with that WIP inc meta branch I linked you the other day

@ruchirK
Contributor Author

ruchirK commented Nov 22, 2021

Alright! Took a while to de-noise the end-to-end benchmark -- for future ref

cargo run --release -- -w 1 --log-file=stderr --experimental --persistent-user-tables --disable-persistent-system-tables-test --timestamp-frequency 100ms

turns off the system table test and correctly sets the step interval to 100ms. From there, a nice repeatable testing protocol is to wipe mzdata and then do 5-10 inserts of 1-million-row batches, without Instruments attached (Instruments adds another ~15% of overhead). Also, I commented out drain_unsealed and compact to focus on the write latency.

With all of that -- the best case latency before the columnar change is 2.2 seconds, and the best case latency after the columnar change (changing the implementation to send ColumnarRecords directly to listeners as discussed) is 2.1 seconds. Of that time, ~1.5 seconds is spent decoding the copy text format.

The 100 ms improvement is roughly what we had expected based on the microbenchmarks / more detailed info from Instruments (we go from spending approximately 700 ms in coord + persist to 600 ms in coord + persist, driven by speedups in serialization and small speedups in write_atomic)

For me this is pretty conclusive proof that this change does indeed improve performance end-to-end. I'll go ahead and split up the commits so that we can merge everything that threads ColumnarRecords through, without actually changing the encoding, and then leave the encoding change for our breaking change window.

@ruchirK
Contributor Author

ruchirK commented Nov 23, 2021

RFAL! This PR now threads ColumnarRecords through as much of persist as possible without changing the encoding, and #9239 handles actually changing the encoding.

Contributor

@danhhz danhhz left a comment


LGTM

Comment thread src/persist/src/indexed/columnar.rs Outdated
Comment thread src/persist/src/indexed/columnar.rs Outdated
Comment thread src/persist/src/indexed/columnar.rs Outdated
Comment thread src/persist/src/indexed/columnar.rs
/// instances of the same ((Key, Val), Time), and some Diffs might be zero, or
/// add up to zero).
#[derive(Clone, Debug, Default, Eq, PartialEq, Serialize, Deserialize)]
pub struct ColumnarRecords {
Contributor


thoughts on manually implementing Debug so it's actually useful? i lean toward yes. something like making this one in terms of Ref and implementing Ref with fmt.debug_list. we can leave Iter derived since I think iterators don't usually print all their contents as part of their debug impls

Contributor Author


I implemented them both in terms of debug_list because I couldn't find a way to tell the formatter "print an instance of this type which also implements Debug", other than the struct builder thingy, which didn't seem right. The complexity of doing that is a tiny bit higher than reusing ColumnarRecordsRef's Debug impl, but both Debug impls seem simple enough. At any rate, this seems like something we can touch up in post.

Contributor


something like fmt::Debug::fmt(self.borrow(), f) should work as the ColumnarRecords impl
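For reference, the suggested delegation looks like this sketch (types simplified for illustration; the real ColumnarRecords stores columns of bytes rather than a Vec of rows):

```rust
use std::fmt;

// Simplified stand-ins: an owned batch and a borrowed view of it.
struct ColumnarRecords {
    rows: Vec<(Vec<u8>, u64)>,
}

struct ColumnarRecordsRef<'a> {
    rows: &'a [(Vec<u8>, u64)],
}

impl ColumnarRecords {
    fn borrow(&self) -> ColumnarRecordsRef<'_> {
        ColumnarRecordsRef { rows: &self.rows }
    }
}

// The borrowed form implements Debug once, via debug_list.
impl<'a> fmt::Debug for ColumnarRecordsRef<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.debug_list().entries(self.rows.iter()).finish()
    }
}

// The owned type's Debug just delegates to the borrowed form, as suggested.
impl fmt::Debug for ColumnarRecords {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        fmt::Debug::fmt(&self.borrow(), f)
    }
}

fn main() {
    let cr = ColumnarRecords { rows: vec![(vec![1u8], 7)] };
    println!("{:?}", cr); // prints a list of entries rather than an opaque struct
}
```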

Contributor Author


🤦 ah yeah ofc. ty fixed!

Comment thread src/persist/src/indexed/mod.rs Outdated
…n state

Some changes to the encoding state render the previously stored golden state
entirely un-decodable. In those cases, the golden test currently crashes with
an assertion failure (it returns an error instead of the expected ok). This commit
makes the behavior in that case a little bit more user friendly by printing out
the current state and the requests that led to it, as most of the time in this
case we just want to update golden_test.json.
@ruchirK ruchirK force-pushed the persist-columnar branch 2 times, most recently from 63766c8 to a3408bc Compare November 23, 2021 20:55
@ruchirK
Contributor Author

ruchirK commented Nov 23, 2021

tftr! merging on green!

@ruchirK ruchirK enabled auto-merge November 23, 2021 21:01
@ruchirK ruchirK disabled auto-merge November 23, 2021 21:26
…olumnar format.

At a very high level, our data is composed of lots of small vectors of bytes for
keys and values. This representation turns the storage around 90 degrees to be
backed by one large vector of bytes for all the keys + another large vector of
bytes for all the values, and separate arrays of offsets into both.

The pros of this representation are that it's faster to encode and more performant
to scan through (fewer cache misses). The cons are that it's more complicated to
append data, and it's downright difficult to reorder or shuffle data in this
representation.

Actually using this representation is left to future commits.
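A minimal sketch of the layout described above (field names are illustrative, not the actual ColumnarRecords definition):

```rust
/// One large buffer per column plus offset arrays, instead of a Vec of
/// small (Vec<u8>, Vec<u8>) pairs. Offsets store the end of each entry.
#[derive(Default)]
struct ColumnarSketch {
    key_data: Vec<u8>,
    key_offsets: Vec<usize>,
    val_data: Vec<u8>,
    val_offsets: Vec<usize>,
    timestamps: Vec<u64>,
    diffs: Vec<i64>,
}

impl ColumnarSketch {
    /// Appending is a handful of extends: no per-record Vec allocation.
    fn push(&mut self, key: &[u8], val: &[u8], ts: u64, diff: i64) {
        self.key_data.extend_from_slice(key);
        self.key_offsets.push(self.key_data.len());
        self.val_data.extend_from_slice(val);
        self.val_offsets.push(self.val_data.len());
        self.timestamps.push(ts);
        self.diffs.push(diff);
    }

    /// Reading record i slices into the shared buffers; note that reordering
    /// records would require rewriting the buffers, hence the stated con.
    fn get(&self, i: usize) -> (&[u8], &[u8], u64, i64) {
        let ks = if i == 0 { 0 } else { self.key_offsets[i - 1] };
        let vs = if i == 0 { 0 } else { self.val_offsets[i - 1] };
        (
            &self.key_data[ks..self.key_offsets[i]],
            &self.val_data[vs..self.val_offsets[i]],
            self.timestamps[i],
            self.diffs[i],
        )
    }
}

fn main() {
    let mut c = ColumnarSketch::default();
    c.push(b"ab", b"x", 1, 1);
    c.push(b"cd", b"yz", 2, -1);
    println!("{:?}", c.get(1));
}
```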
This commit teaches persist to accept writes as a Vec<ColumnarRecords>, and teaches
writers to use the WriteReqBuilder. However, we keep the current row oriented
encoding format and will change the encoding in a later commit.

Unfortunately, this means that this commit will be a slight performance regression,
as we do the work to generate inputs in the columnar format, and also have to convert
back to the row oriented format when writing out to storage.

indexed_write_drain/mem_sorted/23000000
                        time:   [257.46 ms 260.45 ms 263.71 ms]
                        thrpt:  [83.175 MiB/s 84.218 MiB/s 85.195 MiB/s]
                 change:
                        time:   [+24.933% +28.471% +31.811%] (p = 0.00 < 0.05)
                        thrpt:  [-24.134% -22.162% -19.957%]
                        Performance has regressed.
indexed_write_drain/mem_unsorted/23000000
                        time:   [260.59 ms 264.15 ms 268.14 ms]
                        thrpt:  [81.803 MiB/s 83.037 MiB/s 84.174 MiB/s]
                 change:
                        time:   [+14.418% +17.150% +19.689%] (p = 0.00 < 0.05)
                        thrpt:  [-16.450% -14.639% -12.601%]
                        Performance has regressed.
indexed_write_drain/file_sorted/23000000
                        time:   [334.63 ms 342.51 ms 351.35 ms]
                        thrpt:  [62.429 MiB/s 64.040 MiB/s 65.549 MiB/s]
                 change:
                        time:   [+12.900% +16.180% +19.538%] (p = 0.00 < 0.05)
                        thrpt:  [-16.345% -13.927% -11.426%]
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
indexed_write_drain/file_unsorted/23000000
                        time:   [345.38 ms 366.15 ms 388.11 ms]
                        thrpt:  [56.516 MiB/s 59.906 MiB/s 63.508 MiB/s]
                 change:
                        time:   [+11.792% +18.735% +26.303%] (p = 0.00 < 0.05)
                        thrpt:  [-20.825% -15.779% -10.549%]
                        Performance has regressed.
@ruchirK ruchirK enabled auto-merge November 23, 2021 21:31
@ruchirK ruchirK merged commit b73f65e into MaterializeInc:main Nov 23, 2021