[wip] storage,sql: push cfetcher into pebble's mvcc scanner #52863
jordanlewis wants to merge 4 commits into cockroachdb:master
Conversation
petermattis left a comment
Neat to see that this is possible, though I'm not terribly fond of having KV call back into the SQL code in this manner. Is there any hope of disentangling ColFetcher from SQL to a greater degree?
DistSender splits apart ScanRequests on range boundaries and then joins them back together. See ScanResponse.combine. Hope that combining responses isn't too difficult with the COL_BATCH format.
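To make the combining concern concrete, here is a minimal sketch in Go of what a `ScanResponse.combine`-style merge could look like for a columnar format. `ColBatchResponse` is a hypothetical stand-in, not the real roachpb type; the point is that if each entry is an independently decodable serialized Arrow record batch, combining per-range responses reduces to concatenation in range order.

```go
package main

// ColBatchResponse is a hypothetical stand-in for a ScanResponse carrying the
// COL_BATCH format: each entry in Batches is one self-describing, serialized
// Arrow record batch.
type ColBatchResponse struct {
	Batches [][]byte
}

// combine mirrors the role of ScanResponse.combine: since each serialized
// batch decodes independently, joining the per-range responses that
// DistSender split apart is just appending the batch lists in range order.
func (r *ColBatchResponse) combine(other *ColBatchResponse) {
	r.Batches = append(r.Batches, other.Batches...)
}
```

Under this assumption the combine step never inspects batch contents, which is what would keep it cheap relative to row-oriented response merging.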
pkg/roachpb/api.proto (outdated)

    // This "any type" field is a serialized TableReaderSpec that can be used
    // to initialize a columnar scan for MVCC. We use this generic type to resolve
    // some painful circular dependency issues.
    google.protobuf.Any scan_spec = 6;
I suspect we'd want to do something more principled here if this is productionized.
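For illustration only, here is a hand-rolled stand-in (in Go, using JSON rather than protobuf, so it stays self-contained) for the pattern the `google.protobuf.Any` field enables: the KV layer forwards an opaque envelope without importing the SQL types it contains. All names here (`anyMsg`, `tableReaderSpec`, `packSpec`, `unpackSpec`) are hypothetical, not identifiers from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// anyMsg mimics google.protobuf.Any: a type URL plus an opaque serialized
// payload. KV can carry this through without depending on SQL packages.
type anyMsg struct {
	TypeURL string
	Value   []byte
}

// tableReaderSpec is a toy stand-in for the SQL-level TableReaderSpec.
type tableReaderSpec struct {
	TableID    int   `json:"table_id"`
	NeededCols []int `json:"needed_cols"`
}

// packSpec serializes the spec into the opaque envelope (SQL side).
func packSpec(spec tableReaderSpec) (anyMsg, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return anyMsg{}, err
	}
	return anyMsg{TypeURL: "example/TableReaderSpec", Value: b}, nil
}

// unpackSpec checks the type URL and deserializes (the consuming side).
func unpackSpec(a anyMsg) (tableReaderSpec, error) {
	var spec tableReaderSpec
	if a.TypeURL != "example/TableReaderSpec" {
		return spec, fmt.Errorf("unexpected type %q", a.TypeURL)
	}
	err := json.Unmarshal(a.Value, &spec)
	return spec, err
}
```

The "more principled" alternative alluded to above would presumably replace the untyped envelope with a concrete message once the dependency cycle is broken.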
For anyone looking at this later, note that this PR still doesn't reap the performance gains we were expecting. The reason is that, even though storage writes directly into columnar batches, we still pay for converting them to Arrow bytes and back. There are two directions I want to try in order to fix this. One is making the conversion back and forth from coldata.Batch to Arrow batch to bytes more efficient. The other is making a fast path that gets enabled when there is no network hop. The fast path would skip serialization and return an opaque pointer to the coldata.Batch.
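A minimal sketch of that proposed fast path, in Go. The types and function names here (`colBatch`, `scanResult`, `finishScan`) are hypothetical stand-ins, not the PR's actual identifiers; the sketch only shows the shape of the branch: hand back a pointer when the caller is in-process, serialize otherwise.

```go
package main

// colBatch is a toy stand-in for coldata.Batch.
type colBatch struct {
	rows int
}

// scanResult carries either serialized Arrow bytes (remote path) or an
// opaque pointer to the in-memory batch (local fast path, no network hop).
type scanResult struct {
	serialized [][]byte
	direct     *colBatch
}

// finishScan sketches the fast path: when the request is local, skip
// serialization entirely and return the batch pointer; otherwise fall back
// to the serialize-to-bytes path.
func finishScan(b *colBatch, local bool, serialize func(*colBatch) []byte) scanResult {
	if local {
		return scanResult{direct: b}
	}
	return scanResult{serialized: [][]byte{serialize(b)}}
}
```

The design question this dodges, of course, is how the receiving side knows which representation it got; in a real API that would need to be part of the response contract.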
@sumeerbhola since this came up earlier, I'm very curious to know what you think of it. It's obviously badly out of date, but I think it's an interesting direction to think about, and we'd have to solve problems of this nature if we ever wanted to make a more SQL-aware semicolumnar format in Pebble one day.
Glad you reminded me -- looking at this PR has been on my TODO list for a few months now. I can understand why this takes the approach of mostly hiding what is happening from the storage layer internals, since the only thing changing is the output format, and injecting a different implementation suffices for that. Alternatively, we could place the

The reason I am interested in a more transparent approach is that when we have a block-columnar format in Pebble we will need to know about some aspects of the schema in order to lay things out when writing, and use it at read time to decide what parts of a file to read (due to projections) and what filters to apply (including the case where there are column dictionaries at the file level).

I think we need to first do some bottom-up work in file format design and prototyping to prove out the benefits (write throughput can suffer if we are not very careful, and making sure compactions are very efficient is important). I don't think we are going to start on it very soon. So if you can show significant performance improvements here by optimizing for the local path, say using the bit in the
I think this would be a better approach architecturally, but I believe it will be very far from straightforward. The encode/decode path from KV to SQL (both

Here is a relatively complete spec of the SQL encoding: https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/encoding.md. It is missing the per-datatype encoding methods, which are to my knowledge defined only in the code here: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/rowenc/column_type_encoding.go (note that there are 3 separate encodings per type: key ascending, key descending, and value).
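To illustrate the ascending/descending distinction mentioned above, here is a deliberately simplified Go sketch of order-preserving integer key encodings. This is a fixed-width simplification for illustration; CockroachDB's actual key encodings (in `pkg/util/encoding`) are variable-length and per-type, as the linked docs describe.

```go
package main

import "encoding/binary"

// encodeUint64Ascending appends a big-endian encoding of v, so that
// byte-wise lexicographic order over the encoded keys matches numeric order.
func encodeUint64Ascending(buf []byte, v uint64) []byte {
	var tmp [8]byte
	binary.BigEndian.PutUint64(tmp[:], v)
	return append(buf, tmp[:]...)
}

// encodeUint64Descending inverts the bits before encoding, so byte-wise
// order over encoded keys is the reverse of numeric order - the standard
// trick behind key-descending encodings.
func encodeUint64Descending(buf []byte, v uint64) []byte {
	return encodeUint64Ascending(buf, ^v)
}
```

Value encodings are the third family: they don't need to be order-preserving at all, which is why they can use compact varints while key encodings cannot always do so.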
cc @RaduBerinde - I rebased this back into reality since we've been discussing it in context of #71887. I added a session setting,

There is still significant work remaining to get this PR into shape, assuming it's ever something that the storage team would be okay with merging, which is still unclear based on the conversation above.
9eedc57 to 37cd3f9
Latest changes: I implemented a fast-path that skips serializing the scan's ProcessorSpec, by passing an opaque

The results are potentially promising - finally, the

The benchmarking methodology I used to determine that number was far from good - just running on my local machine - and I would like to do some more tests for this soon. Even if we don't end up merging this, I find it encouraging that we can potentially squeeze out some performance wins by eliminating one of the extra copies (instead of writing to KV batches then writing to col batches, just write directly to col batches).
This commit prototypes injecting the cfetcher state machine into the mvcc scanner. It adds a new scan format that requests bytes to be returned in the arrow columnar batch format. Under the hood, the MVCCScan implementation gets a handle onto a cfetcher, and injects the pebble iterator into the cfetcher's "Next KV" interface, so that the cfetcher, when invoked, directly copies keys and values out of the iterator. Then, the TableReaderSpec at the sql level is implemented by deserializing the arrow batch directly into a coldata.batch, without having to do any extra processing.
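The "Next KV" injection described in this commit can be sketched as follows. This is a toy model, not the PR's code: `nextKVer`, `sliceIter`, and `fetchAll` are hypothetical names, and an in-memory slice stands in for the pebble iterator. The shape it illustrates is the fetcher pulling KVs straight from the iterator instead of from a materialized KV batch.

```go
package main

// nextKVer is a toy version of the "Next KV" hook: the prototype wraps the
// pebble iterator behind such an interface so the cfetcher copies keys and
// values directly out of storage.
type nextKVer interface {
	NextKV() (key, value []byte, ok bool)
}

// sliceIter adapts an in-memory sorted KV list to nextKVer, standing in for
// the pebble iterator adapter.
type sliceIter struct {
	kvs [][2][]byte
	i   int
}

func (s *sliceIter) NextKV() ([]byte, []byte, bool) {
	if s.i >= len(s.kvs) {
		return nil, nil, false
	}
	kv := s.kvs[s.i]
	s.i++
	return kv[0], kv[1], true
}

// fetchAll drives the iterator to exhaustion, as the fetcher state machine
// would while filling columnar batches, and returns the KV count.
func fetchAll(it nextKVer) int {
	n := 0
	for _, _, ok := it.NextKV(); ok; _, _, ok = it.NextKV() {
		n++
	}
	return n
}
```

The real cfetcher does much more per KV (decoding, column routing, batch growth), but the control inversion - storage calling into the fetcher's pull interface - is the architectural point being prototyped.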
This patch removes the roundtrip serialization to arrow if the KV scan is occurring in a local request context.
Multi-tenant by default makes the direct bit sort of pointless. I accidentally did a few changes at once in the direct commit, so this commit reverts it enough to have things work again. This would all need to be cleaned up to merge, obviously.
This commit is a hacky experiment that plays around with the idea of a fixed-length integer encoding for values, to permit decoding to avoid having to read columns that are not requested by the SQL query. When the session variable hack_new_encoding is set to true, the database encodes integer values in the value component with the following (inefficient) scheme: 32 bits of column id | 32 bits of column type | 64 bits of integer

Then, when the database reads values, it directly indexes into the value slices based on the pre-existing knowledge that each column occupies 128 bits in the encoded value. The hypothesis was that this would be more efficient for reading a single column at the end of a wide table, because ordinarily, the database has to read every column in a value to get to the one that it is looking for.

However, in an interactive session, this mode seems to slow things down by about 30%. The test setup is included in the benchmark. I think the reason for the slowdown is that the value explodes in size in this scheme: for 10 small integers, instead of just ~20 bytes (due to varint encoding), we now have 160 bytes. This is more expensive for Pebble to decode, not to mention read from disk, but perhaps the block compression helps enough with the "read from disk" part to make it just about Pebble decoding / memcopying.

I'm interested to see what would happen if this optimization was applied to cockroachdb#52863, since it would remove the necessity for Pebble to do any memcopying of the unnecessary data whatsoever.
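The experiment's layout and the size blow-up it describes can be sketched in a few lines of Go. These names (`encodeFixed`, `decodeCol`) are illustrative, not the commit's actual identifiers; the stride matches the scheme above: 32-bit column id, 32-bit column type, 64-bit integer, 16 bytes per column.

```go
package main

import "encoding/binary"

const colWidth = 16 // 32-bit column id + 32-bit column type + 64-bit integer

// encodeFixed lays every integer column out at a fixed 16-byte stride,
// mirroring the hack_new_encoding scheme.
func encodeFixed(vals []int64) []byte {
	buf := make([]byte, len(vals)*colWidth)
	for i, v := range vals {
		off := i * colWidth
		binary.BigEndian.PutUint32(buf[off:], uint32(i))   // column id
		binary.BigEndian.PutUint32(buf[off+4:], 1)         // column type tag
		binary.BigEndian.PutUint64(buf[off+8:], uint64(v)) // integer payload
	}
	return buf
}

// decodeCol reads column i by direct offset arithmetic, never touching the
// preceding columns - the upside of the fixed stride. The downside is the
// size explosion the commit message measures: 10 small ints cost 160 bytes
// here versus ~20 bytes with varint value encoding.
func decodeCol(buf []byte, i int) int64 {
	return int64(binary.BigEndian.Uint64(buf[i*colWidth+8:]))
}
```

Reading the last column of a 10-column value touches exactly 8 payload bytes at a computed offset, which is the access pattern the hypothesis was betting on.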
This is superseded by @yuzefovich's ongoing work. Woohoo!