Skip to content

workload/tpcc: improve indexing to permit better partitioning#36854

Merged
craig[bot] merged 6 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/geoTpcc
May 20, 2019
Merged

workload/tpcc: improve indexing to permit better partitioning#36854
craig[bot] merged 6 commits intocockroachdb:masterfrom
nvb:nvanbenschoten/geoTpcc

Conversation

@nvb
Copy link
Copy Markdown
Contributor

@nvb nvb commented Apr 15, 2019

This PR contains a number of incremental improvements to TPC-C that allow it to be more effectively partitioned. The changes focus on migrating indexes to all be partitionable by warehouse and removing indexes that are only needed for foreign keys when fks are not enabled. This is all a precursor to a follow-up PR: #36855.

I'm assigning @jordanlewis and @danhhz as reviewers because between the two of you most of the changes here have already been agreed upon.

This PR will require me to regenerate all TPC-C fixtures. I intend to do so before it is merged.

@nvb nvb requested review from danhhz and jordanlewis April 15, 2019 23:07
@nvb nvb requested a review from a team as a code owner April 15, 2019 23:07
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @danhhz, @jordanlewis, and @nvanbenschoten)


pkg/workload/tpcc/ddls.go, line 88 at r6 (raw file):

	// HISTORY table.
	tpccHistorySchemaBase = `(
		rowid    uuid    not null default gen_random_uuid(),

While you're going to generate fixtures, I'd love to reconfirm that we think this field should be a UUID. I dug up the PR (#23827) that went from implicit PK to UUID and the reasoning for UUID over unique_rowid was that 1) unique_rowid apparently generated a duplicate (!) and 2) our production recommendation is UUID. I think we could fix (1), (2) is more compelling to me.

I ask because UUIDs make this table the only one in our tpcc Generator that doesn't generate rows in PKs order, which makes our lives harder for the upcoming direct-ingest IMPORT. It's something we'll have to solve for secondary indexes, but "solve" here means something like backpressure and trying to be careful about compactions and it's definitely better if the generated data doesn't overlap in the first place. It may also be possible to switch the UUIDs in the initial data to be deterministically generated from batchIdx and the offset within the batch, but I haven't looked into this yet.

Does anyone have strong opinions here?


pkg/workload/tpcc/tpcc.go, line 390 at r6 (raw file):

		Splits: splits(workload.BatchedTuples{
			NumBatches: numBatches(w.warehouses, numWarehousesPerRange),
			NumTotal:   w.warehouses,

all of these NumTotals on the splits look wrong to me. They're supposed to represent how many total rows are produced by all the batches, but each batch seems to be of size 1. I'm actually planning on getting rid of them entirely, they haven't been useful for anything, so I'd omit them here

nvb added 3 commits May 20, 2019 16:04
This commit improves the primary key of the history table, allowing it
to be partitioned based on warehouse ID.

Release note: None
It's unclear why this index was ever added. It wasn't being used for
anything.

Release note: None
There was no reason to do this and it made --wait=false incompatible
with partitioning.

Release note: None
@nvb nvb force-pushed the nvanbenschoten/geoTpcc branch from 21e1cdd to df6a341 Compare May 20, 2019 20:08
nvb added 3 commits May 20, 2019 16:38
This change migrates workload/tpcc from adding foreign key relations
in a PostLoad step to adding them in the schema itself. While doing so,
it also ensures that indexes that are only used for foreign keys are
made clear and are not created when fks are not in use.

This will require fixture regeneration.

Release note: None
@nvb nvb force-pushed the nvanbenschoten/geoTpcc branch from df6a341 to 92c8992 Compare May 20, 2019 20:38
Copy link
Copy Markdown
Contributor Author

@nvb nvb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easiest way to address Dan's comment: wait until he does himself in other PRs 😃 TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @danhhz and @jordanlewis)


pkg/workload/tpcc/ddls.go, line 88 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

While you're going to generate fixtures, I'd love to reconfirm that we think this field should be a UUID. I dug up the PR (#23827) that went from implicit PK to UUID and the reasoning for UUID over unique_rowid was that 1) unique_rowid apparently generated a duplicate (!) and 2) our production recommendation is UUID. I think we could fix (1), (2) is more compelling to me.

I ask because UUIDs make this table the only one in our tpcc Generator that doesn't generate rows in PKs order, which makes our lives harder for the upcoming direct-ingest IMPORT. It's something we'll have to solve for secondary indexes, but "solve" here means something like backpressure and trying to be careful about compactions and it's definitely better if the generated data doesn't overlap in the first place. It may also be possible to switch the UUIDs in the initial data to be deterministically generated from batchIdx and the offset within the batch, but I haven't looked into this yet.

Does anyone have strong opinions here?

You're way ahead of me: #37515.


pkg/workload/tpcc/tpcc.go, line 390 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

all of these NumTotals on the splits look wrong to me. They're supposed to represent how many total rows are produced by all the batches, but each batch seems to be of size 1. I'm actually planning on getting rid of them entirely, they haven't been useful for anything, so I'd omit them here

Looks like you got rid of them all.

craig bot pushed a commit that referenced this pull request May 20, 2019
36854: workload/tpcc: improve indexing to permit better partitioning r=nvanbenschoten a=nvanbenschoten

This PR contains a number of incremental improvements to TPC-C that allow it to be more effectively partitioned. The changes focus on migrating indexes to all be partitionable by warehouse and removing indexes that are only needed for foreign keys when fks are not enabled. This is all a precursor to a follow-up PR: #36855.

I'm assigning @jordanlewis and @danhhz as reviewers because between the two of you most of the changes here have already been agreed upon.

This PR will require me to regenerate all TPC-C fixtures. I intend to do so before it is merged.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 20, 2019

Build succeeded

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants