
[wip] storage,sql: push cfetcher into pebble's mvcc scanner #52863

Closed
jordanlewis wants to merge 4 commits into cockroachdb:master from jordanlewis:mvcc-col

Conversation

@jordanlewis (Member)

This commit prototypes injecting the cfetcher state machine into the
mvcc scanner. It adds a new scan format that requests bytes to be
returned in the arrow columnar batch format. Under the hood, the
MVCCScan implementation gets a handle onto a cfetcher, and injects the
pebble iterator into the cfetcher's "Next KV" interface, so that the
cfetcher, when invoked, directly copies keys and values out of the
iterator. Then, the TableReaderSpec at the sql level is implemented by
deserializing the arrow batch directly into a coldata.batch, without
having to do any extra processing.
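For readers unfamiliar with the shape of this inversion, here is a minimal sketch of the "Next KV" injection described above. All names are hypothetical stand-ins, not the actual cfetcher API: the point is that the iterator is handed to the fetcher, which pulls KVs and writes them straight into columnar storage.

```go
package main

import "fmt"

// nextKVer abstracts "give me the next key/value pair". In the
// prototype, the pebble MVCC iterator is wrapped to satisfy this and
// handed to the fetcher, inverting the usual fetch loop.
// (Hypothetical names, not the real cfetcher interface.)
type nextKVer interface {
	// NextKV returns the next key/value pair, and ok=false once the
	// scan is exhausted.
	NextKV() (key, value []byte, ok bool)
}

// colFetcher consumes KVs and appends values to an in-memory column,
// standing in for decoding directly into a coldata.Batch.
type colFetcher struct {
	src nextKVer
	col [][]byte
}

func (f *colFetcher) run() {
	for {
		_, v, ok := f.src.NextKV()
		if !ok {
			return
		}
		// The real prototype decodes value bytes straight into
		// columnar vectors; here we just collect them.
		f.col = append(f.col, v)
	}
}

// sliceIter is a toy iterator over a fixed set of KVs, playing the
// role of the pebble iterator.
type sliceIter struct {
	kvs [][2][]byte
	i   int
}

func (s *sliceIter) NextKV() (key, value []byte, ok bool) {
	if s.i >= len(s.kvs) {
		return nil, nil, false
	}
	kv := s.kvs[s.i]
	s.i++
	return kv[0], kv[1], true
}

func main() {
	it := &sliceIter{kvs: [][2][]byte{
		{[]byte("k1"), []byte("v1")},
		{[]byte("k2"), []byte("v2")},
	}}
	f := &colFetcher{src: it}
	f.run()
	fmt.Println(len(f.col)) // number of values pulled through NextKV
}
```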

@petermattis (Collaborator) left a comment:


Neat to see that this is possible, though I'm not terribly fond of having KV call back into the SQL code in this manner. Is there any hope of disentangling ColFetcher from SQL to a greater degree?

DistSender splits apart ScanRequests on range boundaries and then joins them back together. See ScanResponse.combine. Hope that combining responses isn't too difficult with the COL_BATCH format.
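Since Arrow batches are self-delimiting, combining per-range COL_BATCH responses could plausibly be list concatenation rather than row merging. A toy sketch of what a combine step might look like (hypothetical types, not the actual ScanResponse.combine):

```go
package main

import "fmt"

// colBatchResponse stands in for a ScanResponse carrying COL_BATCH
// data; all names here are hypothetical. Each entry in batches is one
// serialized Arrow record batch from one range.
type colBatchResponse struct {
	batches [][]byte
}

// combine mirrors the spirit of ScanResponse.combine: fold the
// follow-on range's response into the receiver, preserving key order
// by simple concatenation.
func (r *colBatchResponse) combine(other *colBatchResponse) {
	r.batches = append(r.batches, other.batches...)
}

func main() {
	left := &colBatchResponse{batches: [][]byte{[]byte("batch-r1")}}
	right := &colBatchResponse{batches: [][]byte{[]byte("batch-r2"), []byte("batch-r3")}}
	left.combine(right)
	fmt.Println(len(left.batches)) // combined batch count
}
```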

// This "any type" field is a serialized TableReaderSpec that can be used
// to initialize a columnar scan for MVCC. We use this generic type to resolve
// some painful circular dependency issues.
google.protobuf.Any scan_spec = 6;
Collaborator:

I suspect we'd want to do something more principled here if this is productionized.
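One more principled shape for the workaround that the Any field represents — storage holding an opaque payload that only the SQL layer knows how to interpret — is an injected decode hook registered at init time, which breaks the import cycle without a generic proto type. A stdlib-only sketch under that assumption (all names hypothetical):

```go
package main

import "fmt"

// storage cannot import SQL types, so it holds the scan spec as opaque
// bytes plus a decode hook that the "SQL" layer registers at startup.
// This plays the same dependency-breaking role as the
// google.protobuf.Any field above. (Hypothetical names.)

var decodeScanSpec func(raw []byte) (string, error) // set by the SQL layer

// runScan is the storage-side entry point: it can carry the spec
// around and decode it via the hook, without a compile-time dependency
// on SQL's TableReaderSpec type.
func runScan(rawSpec []byte) (string, error) {
	if decodeScanSpec == nil {
		return "", fmt.Errorf("no scan spec decoder registered")
	}
	return decodeScanSpec(rawSpec)
}

func main() {
	// The SQL layer registers its decoder, closing the loop.
	decodeScanSpec = func(raw []byte) (string, error) {
		return "TableReaderSpec(" + string(raw) + ")", nil
	}
	spec, err := runScan([]byte("t1@primary"))
	if err != nil {
		panic(err)
	}
	fmt.Println(spec)
}
```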

@jordanlewis (Member, Author)

For anyone looking at this later, note that this PR still doesn't reap the performance gains we were expecting. The reason is that, even though storage is writing directly to coldata.Batch and avoiding the extra copy from keys to coldata.Batch later in the pipe, we're forced to convert the batch into arrow format and then into bytes, and then from bytes back to arrow back to coldata.Batch. This is expensive.

There are two directions I want to try in order to fix this. One is making the conversion back and forth from coldata.Batch to arrow batch to bytes more efficient. The other is adding a fast path that is enabled when there is no network hop: the fast path would skip serialization and return an opaque pointer to the coldata.Batch.
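The second direction could be sketched as a branch at response-construction time (hypothetical types; a sketch of the idea, not the actual implementation):

```go
package main

import "fmt"

// batch stands in for coldata.Batch.
type batch struct{ rows int }

// result carries either an opaque in-memory batch (local fast path) or
// serialized Arrow bytes (remote path). Hypothetical names.
type result struct {
	local      *batch // set only when client and server share a process
	serialized []byte
}

// respond picks the path: when there is no network hop, hand back a
// pointer and skip serialization entirely; otherwise fall back to the
// serialized form (stubbed out here).
func respond(b *batch, isLocal bool) result {
	if isLocal {
		return result{local: b}
	}
	return result{serialized: []byte(fmt.Sprintf("arrow:%d", b.rows))}
}

func main() {
	b := &batch{rows: 10}
	r := respond(b, true)
	fmt.Println(r.local == b, r.serialized == nil) // fast path: same pointer, no bytes
}
```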

@tbg tbg added the X-noremind Bots won't notify about PRs with X-noremind label May 6, 2021
@cockroach-teamcity (Member)

This change is Reviewable

@jordanlewis (Member, Author)

@sumeerbhola since this came up earlier, I'm very curious to know what you think of it. It's obviously badly out of date. But I think it's an interesting direction to think about, and we'd have to solve problems of this nature if we ever wanted to make a more sql-aware semicolumnar format in Pebble one day.

@sumeerbhola (Collaborator)

But I think it's an interesting direction to think about, and we'd have to solve problems of this nature if we ever wanted to make a more sql-aware semicolumnar format in Pebble one day.

Glad you reminded me -- looking at this PR has been on my TODO list for a few months now.
This is cool stuff.

I can understand why this takes the approach of mostly hiding what is happening from the storage layer internals, since the only thing changing is the output format, and injecting a different implementation suffices for that. Alternatively, we could place the coldata.Batch writing code in a place where storage can depend on it directly.

The reason I am interested in a more transparent approach is that when we have a block-columnar format in Pebble we will need to know about some aspects of the schema in order to lay things out when writing, and use it at read time to decide what parts of a file to read (due to projections) and what filters to apply (including the case where there are column dictionaries at the file level). I think we need to first do some bottom up work in file format design and prototyping to prove out the benefits (write throughput can suffer if we are not very careful, and making sure compactions are very efficient is important). I don't think we are going to start on it very soon.

So if you can show significant performance improvements here by optimizing for the local path, say using the bit in the AdmissionHeader, it would be worthwhile doing something like this as a first step (though probably with making the storage package fully aware of what it is doing).

@jordanlewis (Member, Author)

Alternatively, we could place the coldata.Batch writing code in a place where storage can depend on it directly.

I think this would be a better approach architecturally, but I believe it will be very far from straightforward. The encode/decode path from KV to SQL (both tree.Datum or coldata.Batch) is extraordinarily complex. Either we would have to import all of TableDescriptor and its associated stuff into storage (this seems like it would make a mess out of our abstraction layers), or we would have to come up with another encoding method that wasn't as SQL specific that SQL could use as a target instead of the current system (I'm not sure what this would look like, and not sure if it would materially save us any complexity).

Here is a relatively complete spec of the SQL encoding: https://github.com/cockroachdb/cockroach/blob/master/docs/tech-notes/encoding.md. It is missing the per-datatype encoding methods, which are to my knowledge defined only in the code here: https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/rowenc/column_type_encoding.go (note that there are three separate encodings per type: key ascending, key descending, and value).

@jordanlewis (Member, Author)

jordanlewis commented Oct 27, 2021

cc @RaduBerinde - I rebased this back into reality since we've been discussing it in context of #71887. I added a session setting, plan_direct_scan, which gates the feature, though it probably needs a better name.

There is still significant work remaining to get this PR into shape, assuming it's ever something that the storage team would be okay with merging, which is still unclear based on the conversation above:

  • It needs to be integrated with the new KV memory limiting and monitoring
  • It needs to be made more efficient: the roundtripping through arrow is too expensive, and performance (for non-projections) is slightly worse, as I indicated above. This is unfortunate because it obscures the true potential of this PR; I believe that with an improved serialization we would see major performance improvements.
  • It could be made hugely more efficient (for non-serverless) if we took advantage of the fact that colocated scans don't have to be serialized at all - we could simply pass Batch pointers. This would be much faster than today's scans.
  • More testing. Probably we should create an alternate logic test mode, or find some other way of ensuring that all possible scans work in both modes.
  • There's a new complication with types that require hydration. Types that need hydration need access to descriptor caches and so on, which won't be easily possible below SQL. I don't know what the solution is here - perhaps for now we just decide to never plan direct scans if there are types that need hydration in the table.

@jordanlewis (Member, Author)

Latest changes:

I implemented a fast-path that skips serializing the scan's ProcessorSpec, by passing an opaque interface{} in the protobuf that has no serialization methods. We could theoretically only use this path for local requests, and serialize the ProcessorSpec normally for remote requests (this part is not yet implemented).

The results are potentially promising: finally, the plan_direct_scan option causes local scans over TPCH data to perform as well as, if not slightly (~5%) better than, scans without the option. It makes sense, since we've managed to remove one copy from the system.

The benchmarking methodology I used to determine that number was far from good - just running on my local machine - I would like to do some more tests for this soon.

Even if we don't end up merging this, I find it encouraging that we can potentially squeeze out some performance wins by eliminating one of the extra copies (instead of writing to KV batches then writing to col batches, just write directly to col batches).

This patch removes the roundtrip serialization to arrow if the KV scan
is occurring in a local request context.
Multi-tenant by default makes the direct bit sort of pointless. I accidentally made a few changes at once in the direct commit, so this commit reverts enough of it to have things work again.

This would all need to be cleaned up to merge, obviously.
jordanlewis added a commit to jordanlewis/cockroach that referenced this pull request Aug 6, 2022
This commit is a hacky experiment that plays around with the idea of a
fixed-length integer encoding for values, to permit decoding to avoid
having to read columns that are not requested by the SQL query.

When the session variable hack_new_encoding is set to true, the database
encodes integer values in the value component with the following
(inefficient) scheme:

32 bits of column id | 32 bits of column type | 64 bits of integer

Then, when the database reads values, it directly indexes into the value
slices based on the pre-existing knowledge that each column occupies 128
bits in the encoded value.

The hypothesis was that this would be more efficient for reading a
single column at the end of a wide table, because ordinarily, the
database has to read every column in a value to get to the one that it
is looking for.

However, in an interactive session, this mode seems to slow things down
by about 30%. The test setup is included in the benchmark.

I think the reason for the slowdown is that the value explodes in size
in this scheme: for 10 small integers, instead of just ~20 bytes (due to
varint encoding), we now have 160 bytes. This is more expensive for
Pebble to decode, not to mention read from disk, but perhaps the block
compression helps enough with the "read from disk" part to make it just
about Pebble decoding / memcopying.
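The scheme above can be sketched as follows (a toy reconstruction from the description, not the experiment's actual code). Each column costs a fixed 16 bytes, so reading column i is a direct index rather than a varint walk over the preceding columns, and the 10-small-ints-to-160-bytes blowup falls straight out of the arithmetic:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const colWidth = 16 // 4 bytes column id + 4 bytes column type + 8 bytes value

// encodeRow lays out each int64 column as
//   32 bits of column id | 32 bits of column type | 64 bits of integer
// matching the experiment's (deliberately inefficient) fixed-width scheme.
func encodeRow(vals []int64) []byte {
	buf := make([]byte, 0, len(vals)*colWidth)
	for i, v := range vals {
		var cell [colWidth]byte
		binary.BigEndian.PutUint32(cell[0:4], uint32(i+1)) // column id
		binary.BigEndian.PutUint32(cell[4:8], 1)           // column type tag: "int"
		binary.BigEndian.PutUint64(cell[8:16], uint64(v))
		buf = append(buf, cell[:]...)
	}
	return buf
}

// decodeCol indexes directly to the requested column, without decoding
// any earlier ones -- the hoped-for win for a single column at the end
// of a wide table.
func decodeCol(buf []byte, colIdx int) int64 {
	off := colIdx * colWidth
	return int64(binary.BigEndian.Uint64(buf[off+8 : off+16]))
}

func main() {
	row := encodeRow([]int64{10, 20, 30, 40})
	fmt.Println(len(row))          // 64 bytes: 4 columns x 16 bytes each
	fmt.Println(decodeCol(row, 3)) // last column, read without scanning the rest
}
```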

I'm interested to see what would happen if this optimization was applied
to cockroachdb#52863, since it would
remove the necessity for Pebble to do any memcopying of the unnecessary
data whatsoever.
@jordanlewis (Member, Author)

This is superseded by @yuzefovich's ongoing work. Woohoo!

@jordanlewis jordanlewis closed this Jan 3, 2023
@jordanlewis jordanlewis deleted the mvcc-col branch January 5, 2023 18:45
