On a cluster running TPC-C for a few days, I've noticed that the p99 command commit latency and the p99 log commit latency are both slowly growing. This growth seems to be highly correlated with the range count in the cluster.


Interestingly, TPC-C has a fixed amount of load, so it would appear that the range count itself is the only moving variable here. More ranges with a fixed amount of load would result in less batching of RocksDB writes, because fewer writes would take place in the same Raft groups. However, our RocksDB commit pipeline attempts to transparently batch independent writes together, which should mitigate exactly this kind of issue:
cockroach/pkg/storage/engine/rocksdb.go, lines 1752 to 1753 in 33c7d27:

```go
var leader bool
c.pending, c.groupSize, leader = makeBatchGroup(c.pending, r, c.groupSize, maxBatchGroupSize)
```
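To make the grouping concrete, here's a toy sketch of what a helper like `makeBatchGroup` does: pop pending batches into the current group until the next batch would push the group past a size cap, with the first batch acting as the group leader that performs the combined write. The names, signature, and sizes below are illustrative, not the actual cockroach implementation:

```go
package main

import "fmt"

// batch is a toy stand-in for a RocksDB write batch; only its
// encoded size matters for grouping purposes.
type batch struct {
	repr []byte
}

// makeBatchGroupToy folds pending batches into a group until adding
// the next batch would exceed maxGroupSize. The first batch in the
// returned group is the "leader" that will perform the combined
// RocksDB write on behalf of the whole group.
func makeBatchGroupToy(pending []*batch, groupSize, maxGroupSize int) (group, remaining []*batch, newSize int) {
	for len(pending) > 0 {
		b := pending[0]
		if len(group) > 0 && groupSize+len(b.repr) > maxGroupSize {
			break // next batch would overflow the group; leave it for the next round
		}
		group = append(group, b)
		groupSize += len(b.repr)
		pending = pending[1:]
	}
	return group, pending, groupSize
}

func main() {
	pending := []*batch{
		{repr: make([]byte, 400)},
		{repr: make([]byte, 400)},
		{repr: make([]byte, 400)},
	}
	// With a 1000-byte cap, only the first two 400-byte batches fit.
	group, rest, size := makeBatchGroupToy(pending, 0, 1000)
	fmt.Println(len(group), len(rest), size) // prints "2 1 800"
}
```

The key property this models is that grouping happens across Raft groups: even when each individual range produces small, infrequent writes, independent batches still coalesce into a single RocksDB write.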
I'd like to instrument this pipeline and see if there are any inefficiencies in it. Specifically, I'd like to check whether the pipeline remains full as the number of batches that it attempts to batch together grows. For instance, it may be the case that the write batch merging begins to take longer than the RocksDB writes themselves. This would allow for gaps in the pipeline where the RocksDB `syncLoop` remains idle.