
[wip] storage,sql: push cfetcher into pebble's mvcc scanner #52863

Closed
jordanlewis wants to merge 4 commits into cockroachdb:master from jordanlewis:mvcc-col

Conversation

@jordanlewis (Member)

This commit prototypes injecting the cfetcher state machine into the
mvcc scanner. It adds a new scan format that requests bytes to be
returned in the arrow columnar batch format. Under the hood, the
MVCCScan implementation gets a handle onto a cfetcher, and injects the
pebble iterator into the cfetcher's "Next KV" interface, so that the
cfetcher, when invoked, directly copies keys and values out of the
iterator. Then, the TableReaderSpec at the sql level is implemented by
deserializing the arrow batch directly into a coldata.batch, without
having to do any extra processing.
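For readers unfamiliar with the shape of this inversion, here is a minimal sketch of the "Next KV" injection described above. All names are hypothetical stand-ins, not the actual cfetcher API: the point is that the iterator is handed to the fetcher, which pulls KVs and writes them straight into columnar storage.

```go
package main

import "fmt"

// nextKVer abstracts "give me the next key/value pair". In the
// prototype, the pebble MVCC iterator is wrapped to satisfy this and
// handed to the fetcher, inverting the usual fetch loop.
// (Hypothetical names, not the real cfetcher interface.)
type nextKVer interface {
	// NextKV returns the next key/value pair, and ok=false once the
	// scan is exhausted.
	NextKV() (key, value []byte, ok bool)
}

// colFetcher consumes KVs and appends values to an in-memory column,
// standing in for decoding directly into a coldata.Batch.
type colFetcher struct {
	src nextKVer
	col [][]byte
}

func (f *colFetcher) run() {
	for {
		_, v, ok := f.src.NextKV()
		if !ok {
			return
		}
		// The real prototype decodes value bytes straight into
		// columnar vectors; here we just collect them.
		f.col = append(f.col, v)
	}
}

// sliceIter is a toy iterator over a fixed set of KVs, playing the
// role of the pebble iterator.
type sliceIter struct {
	kvs [][2][]byte
	i   int
}

func (s *sliceIter) NextKV() (key, value []byte, ok bool) {
	if s.i >= len(s.kvs) {
		return nil, nil, false
	}
	kv := s.kvs[s.i]
	s.i++
	return kv[0], kv[1], true
}

func main() {
	it := &sliceIter{kvs: [][2][]byte{
		{[]byte("k1"), []byte("v1")},
		{[]byte("k2"), []byte("v2")},
	}}
	f := &colFetcher{src: it}
	f.run()
	fmt.Println(len(f.col)) // number of values pulled through NextKV
}
```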

@petermattis (Collaborator) left a comment:


Neat to see that this is possible, though I'm not terribly fond of having KV call back into the SQL code in this manner. Is there any hope of disentangling ColFetcher from SQL to a greater degree?

DistSender splits apart ScanRequests on range boundaries and then joins them back together. See ScanResponse.combine. Hope that combining responses isn't too difficult with the COL_BATCH format.
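Since Arrow batches are self-delimiting, combining per-range COL_BATCH responses could plausibly be list concatenation rather than row merging. A toy sketch of what a combine step might look like (hypothetical types, not the actual ScanResponse.combine):

```go
package main

import "fmt"

// colBatchResponse stands in for a ScanResponse carrying COL_BATCH
// data; all names here are hypothetical. Each entry in batches is one
// serialized Arrow record batch from one range.
type colBatchResponse struct {
	batches [][]byte
}

// combine mirrors the spirit of ScanResponse.combine: fold the
// follow-on range's response into the receiver, preserving key order
// by simple concatenation.
func (r *colBatchResponse) combine(other *colBatchResponse) {
	r.batches = append(r.batches, other.batches...)
}

func main() {
	left := &colBatchResponse{batches: [][]byte{[]byte("batch-r1")}}
	right := &colBatchResponse{batches: [][]byte{[]byte("batch-r2"), []byte("batch-r3")}}
	left.combine(right)
	fmt.Println(len(left.batches)) // combined batch count
}
```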

// This "any type" field is a serialized TableReaderSpec that can be used
// to initialize a columnar scan for MVCC. We use this generic type to resolve
// some painful circular dependency issues.
google.protobuf.Any scan_spec = 6;
Collaborator:

I suspect we'd want to do something more principled here if this is productionized.
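One more principled shape for the workaround that the Any field represents — storage holding an opaque payload that only the SQL layer knows how to interpret — is an injected decode hook registered at init time, which breaks the import cycle without a generic proto type. A stdlib-only sketch under that assumption (all names hypothetical):

```go
package main

import "fmt"

// storage cannot import SQL types, so it holds the scan spec as opaque
// bytes plus a decode hook that the "SQL" layer registers at startup.
// This plays the same dependency-breaking role as the
// google.protobuf.Any field above. (Hypothetical names.)

var decodeScanSpec func(raw []byte) (string, error) // set by the SQL layer

// runScan is the storage-side entry point: it can carry the spec
// around and decode it via the hook, without a compile-time dependency
// on SQL's TableReaderSpec type.
func runScan(rawSpec []byte) (string, error) {
	if decodeScanSpec == nil {
		return "", fmt.Errorf("no scan spec decoder registered")
	}
	return decodeScanSpec(rawSpec)
}

func main() {
	// The SQL layer registers its decoder, closing the loop.
	decodeScanSpec = func(raw []byte) (string, error) {
		return "TableReaderSpec(" + string(raw) + ")", nil
	}
	spec, err := runScan([]byte("t1@primary"))
	if err != nil {
		panic(err)
	}
	fmt.Println(spec)
}
```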

@jordanlewis (Member, Author)

For anyone looking at this later, note that this PR still doesn't reap the performance gains we were expecting. The reason is that, even though storage is writing directly to coldata.Batch and avoiding the extra copy from keys to coldata.Batch later in the pipe, we're forced to convert the batch into arrow format and then into bytes, and then from bytes back to arrow back to coldata.Batch. This is expensive.

There are two directions I want to try in order to fix this. One is making the conversion back and forth from coldata.Batch to arrow batch to bytes more efficient. The other is adding a fast path that is enabled when there is no network hop: the fast path would skip serialization and return an opaque pointer to the coldata.Batch.
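The second direction could be sketched as a branch at response-construction time (hypothetical types; a sketch of the idea, not the actual implementation):

```go
package main

import "fmt"

// batch stands in for coldata.Batch.
type batch struct{ rows int }

// result carries either an opaque in-memory batch (local fast path) or
// serialized Arrow bytes (remote path). Hypothetical names.
type result struct {
	local      *batch // set only when client and server share a process
	serialized []byte
}

// respond picks the path: when there is no network hop, hand back a
// pointer and skip serialization entirely; otherwise fall back to the
// serialized form (stubbed out here).
func respond(b *batch, isLocal bool) result {
	if isLocal {
		return result{local: b}
	}
	return result{serialized: []byte(fmt.Sprintf("arrow:%d", b.rows))}
}

func main() {
	b := &batch{rows: 10}
	r := respond(b, true)
	fmt.Println(r.local == b, r.serialized == nil) // fast path: same pointer, no bytes
}
```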

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@cockroach-teamcity (Member)

This change is Reviewable

@jordanlewis (Member, Author)

@sumeerbhola since this came up earlier, I'm very curious to know what you think of it. It's obviously badly out of date. But I think it's an interesting direction to think about, and we'd have to solve problems of this nature if we ever wanted to make a more sql-aware semicolumnar format in Pebble one day.

@sumeerbhola (Collaborator)

But I think it's an interesting direction to think about, and we'd have to solve problems of this nature if we ever wanted to make a more sql-aware semicolumnar format in Pebble one day.

Glad you reminded me -- looking at this PR has been on my TODO list for a few months now.
This is cool stuff.

I can understand why this takes the approach of mostly hiding what is happening from the storage layer internals, since the only thing changing is the output format, and injecting a different implementation suffices for that. Alternatively, we could place the coldata.Batch writing code in a place where storage can depend on it directly.

The reason I am interested in a more transparent approach is that when we have a block-columnar format in Pebble we will need to know about some aspects of the schema in order to lay things out when writing, and use it at read time to decide what parts of a file to read (due to projections) and what filters to apply (including the case where there are column dictionaries at the file level). I think we need to first do some bottom up work in file format design and prototyping to prove out the benefits (write throughput can suffer if we are not very careful, and making sure compactions are very efficient is important). I don't think we are going to start on it very soon.

So if you can show significant performance improvements here by optimizing for the local path, say using the bit in the AdmissionHeader, it would be worthwhile doing something like this as a first step (though probably with making the storage package fully aware of what it is doing).

@jordanlewis (Member, Author)

Alternatively, we could place the coldata.Batch writing code in a place where storage can depend on it directly.

I think this would be a better approach architecturally, but I believe it will be very far from straightforward. The encode/decode path from KV to SQL (both tree.Datum or coldata.Batch) is extraordinarily complex. Either we would have to import all of TableDescriptor and its associated stuff into storage (this seems like it would make a mess out of our abstraction layers), or we would have to come up with another encoding method that wasn't as SQL specific that SQL could use as a target instead of the current system (I'm not sure what this would look like, and not sure if it would materially save us any complexity).

Here is a relatively complete spec of the SQL encoding: https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/encoding.md. It is missing the per-datatype encoding methods, which are to my knowledge defined only in the code here: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/rowenc/column_type_encoding.go (note that there are three separate encodings per type: key ascending, key descending, and value).

@jordanlewis (Member, Author)

jordanlewis commented Oct 27, 2021

cc @RaduBerinde - I rebased this back into reality since we've been discussing it in context of #71887. I added a session setting, plan_direct_scan, which gates the feature, though it probably needs a better name.

There is still significant work remaining to get this PR into shape, assuming it's ever something that the storage team would be okay with merging, which is still unclear based on the conversation above:

  • It needs to be integrated with the new KV memory limiting and monitoring
  • It needs to be made more efficient: the roundtripping through arrow is too expensive, and performance (for non-projections) is slightly worse, as I indicated above. This is unfortunate because it obscures the true potential of this PR; I believe that with an improved serialization we would see major performance improvements.
  • It could be made hugely more efficient (for non-serverless) if we took advantage of the fact that colocated scans don't have to be serialized at all - we could simply pass Batch pointers. This would be much faster than today's scans.
  • More testing. Probably we should create an alternate logic test mode, or find some other way of ensuring that all possible scans work in both modes.
  • There's a new complication with types that require hydration. Types that need hydration need access to descriptor caches and so on, which won't be easily possible below SQL. I don't know what the solution is here - perhaps for now we just decide to never plan direct scans if there are types that need hydration in the table.

@jordanlewis (Member, Author)

Latest changes:

I implemented a fast-path that skips serializing the scan's ProcessorSpec, by passing an opaque interface{} in the protobuf that has no serialization methods. We could theoretically only use this path for local requests, and serialize the ProcessorSpec normally for remote requests (this part is not yet implemented).

The results are potentially promising: finally, the plan_direct_scan option causes local scans over TPCH data to perform as well as, if not slightly (~5%) better than, scans without the option. It makes sense, since we've managed to remove one copy from the system.

The benchmarking methodology I used to determine that number was far from good - just running on my local machine - I would like to do some more tests for this soon.

Even if we don't end up merging this, I find it encouraging that we can potentially squeeze out some performance wins by eliminating one of the extra copies (instead of writing to KV batches then writing to col batches, just write directly to col batches).

This patch removes the roundtrip serialization to arrow if the KV scan
is occurring in a local request context.
Multi-tenant by default makes the direct bit sort of pointless. I accidentally made a few changes at once in the direct commit, so this commit reverts enough of it to have things work again.

This would all need to be cleaned up to merge, obviously.
jordanlewis added a commit to jordanlewis/cockroach that referenced this pull request Aug 6, 2022
This commit is a hacky experiment that plays around with the idea of a
fixed-length integer encoding for values, to permit decoding to avoid
having to read columns that are not requested by the SQL query.

When the session variable hack_new_encoding is set to true, the database
encodes integer values in the value component with the following
(inefficient) scheme:

32 bits of column id | 32 bits of column type | 64 bits of integer

Then, when the database reads values, it directly indexes into the value
slices based on the pre-existing knowledge that each column occupies 128
bits in the encoded value.

The hypothesis was that this would be more efficient for reading a
single column at the end of a wide table, because ordinarily, the
database has to read every column in a value to get to the one that it
is looking for.

However, in an interactive session, this mode seems to slow things down
by about 30%. The test setup is included in the benchmark.

I think the reason for the slowdown is that the value explodes in size
in this scheme: for 10 small integers, instead of just ~20 bytes (due to
varint encoding), we now have 160 bytes. This is more expensive for
Pebble to decode, not to mention read from disk, but perhaps the block
compression helps enough with the "read from disk" part to make it just
about Pebble decoding / memcopying.
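The scheme above can be sketched as follows (a toy reconstruction from the description, not the experiment's actual code). Each column costs a fixed 16 bytes, so reading column i is a direct index rather than a varint walk over the preceding columns, and the 10-small-ints-to-160-bytes blowup falls straight out of the arithmetic:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const colWidth = 16 // 4 bytes column id + 4 bytes column type + 8 bytes value

// encodeRow lays out each int64 column as
//   32 bits of column id | 32 bits of column type | 64 bits of integer
// matching the experiment's (deliberately inefficient) fixed-width scheme.
func encodeRow(vals []int64) []byte {
	buf := make([]byte, 0, len(vals)*colWidth)
	for i, v := range vals {
		var cell [colWidth]byte
		binary.BigEndian.PutUint32(cell[0:4], uint32(i+1)) // column id
		binary.BigEndian.PutUint32(cell[4:8], 1)           // column type tag: "int"
		binary.BigEndian.PutUint64(cell[8:16], uint64(v))
		buf = append(buf, cell[:]...)
	}
	return buf
}

// decodeCol indexes directly to the requested column, without decoding
// any earlier ones -- the hoped-for win for a single column at the end
// of a wide table.
func decodeCol(buf []byte, colIdx int) int64 {
	off := colIdx * colWidth
	return int64(binary.BigEndian.Uint64(buf[off+8 : off+16]))
}

func main() {
	row := encodeRow([]int64{10, 20, 30, 40})
	fmt.Println(len(row))          // 64 bytes: 4 columns x 16 bytes each
	fmt.Println(decodeCol(row, 3)) // last column, read without scanning the rest
}
```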

I'm interested to see what would happen if this optimization was applied
to cockroachdb#52863, since it would
remove the necessity for Pebble to do any memcopying of the unnecessary
data whatsoever.
@jordanlewis (Member, Author)

This is superseded by @yuzefovich's ongoing work. Woohoo!

@jordanlewis jordanlewis closed this Jan 3, 2023
@jordanlewis jordanlewis deleted the mvcc-col branch January 5, 2023 18:45
