
server: use rocksdb::GetApproximateSizes for table stats#20627

Merged
tbg merged 2 commits into cockroachdb:master from tbg:approximate-sizes
Dec 13, 2017

Conversation

@tbg
Member

@tbg tbg commented Dec 11, 2017

Needs a test. I think I could set up a server with on-disk storage and write to
it until I see a nonzero number from the API endpoint (to check that all the
plumbing was done correctly), but I'll hold off to see if there's something
better that should be done.

Touches #18667.

Release note: None

@tbg tbg requested a review from a team December 11, 2017 20:34
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the approximate-sizes branch from 8b18423 to 664a3a3 Compare December 11, 2017 20:42
@petermattis
Collaborator

This approach looks fine to me. We'll need to do some verification as to the accuracy and performance of GetApproximateSize.


Review status: 0 of 12 files reviewed at latest revision, 1 unresolved discussion, some commit checks pending.


c-deps/libroach/db.cc, line 1746 at r1 (raw file):

DBStatus DBApproximateSize(DBEngine* db, DBKey start, DBKey end, uint64_t *size) {
  std::string start_key(EncodeKey(start));

In C++, it is common to const-all-the-variables. s/std::string/const std::string/g. s/rocksdb::Range/const rocksdb::Range/g. s/uint8_t flags/const uint8_t flags/g.
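As a toy illustration of the const-all-the-variables style being suggested (the function and names below are hypothetical stand-ins, not the actual libroach code):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for libroach's EncodeKey, used only to make the
// example self-contained.
std::string EncodeKey(const std::string& key) { return "#" + key; }

// Every local that is never reassigned is declared const, per the review
// suggestion. The computation itself is a placeholder.
uint64_t KeySpanBytes(const std::string& start, const std::string& end) {
  const std::string start_key(EncodeKey(start));
  const std::string end_key(EncodeKey(end));
  const uint64_t total = start_key.size() + end_key.size();
  return total;
}
```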


Comments from Reviewable

@tbg tbg force-pushed the approximate-sizes branch from 664a3a3 to df6de8b Compare December 11, 2017 20:50
@tbg
Member Author

tbg commented Dec 11, 2017

First trivial test isn't too bad. Ran tpcc -load and approximate size comes back as 102MB,


$ du -sch cockroach-data
148M	cockroach-data

while

          this.size = stats.stats.key_bytes // note that key_bytes is omitted in real code right now
                        .add(stats.stats.val_bytes)
                        .add(stats.stats.sys_bytes)
                        .toNumber();

results in (unfortunately more accurate) 127.3MB.

We definitely need to understand the characteristics of the approximation better, in particular with the data sets we've seen the logical bytes diverge from.


Review status: 0 of 12 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


c-deps/libroach/db.cc, line 1746 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

In C++, it is common to const-all-the-variables. s/std::string/const std::string/g. s/rocksdb::Range/const rocksdb::Range/g. s/uint8_t flags/const uint8_t flags/g.

Thanks, done. I const believe you spelled all of them out.


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 11, 2017

Ran a compaction, now it looks more accurate:

$ du -sch cockroach-data
 89M	cockroach-data

approximation is now 85MB (so very close) and logical (obviously) remains at 127.


Review status: 0 of 12 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


Comments from Reviewable

@petermattis
Collaborator

Nice! Can we verify that GetApproximateSize doesn't perform any disk I/O?


Review status: 0 of 12 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@petermattis
Collaborator

I suppose another bit to verify is if you have two or more tables. You can use kv --table=kv1 ..., kv --table=kv2 and then write different numbers of entries into the tables (using --max-ops).


Review status: 0 of 12 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 11, 2017

kv --table kvsmall --max-ops 100000 --min-block-bytes=32 --max-block-bytes=256
kv --table kvlarge --max-ops 1000000 --min-block-bytes=32 --max-block-bytes=256

163.7 MiB for kvlarge, 17.3 MiB for kvsmall. This is encouraging since one is 10x the other.

317M	cockroach-data

85MB (tpcc) plus 164MB + 18MB = 267MB, so we're undercounting somewhat. After a compaction, the approximate stats remained comparable, but the directory shrank to 279MB, so it's now really close.

It'll be useful to test against heavily compactable datasets. As a sanity check, I dropped the tpcc and kvtest databases, set a 30s TTL via

./cockroach zone set .default --insecure -f -<<EOF
gc:
  ttlseconds: 30
EOF

and will report back when the data has been deleted (at which point I expect the storage dir to remain large). I suspect that the approximate stats in that case will be off as well which is unfortunate, but if we want a cheap check then I'm not sure how RocksDB could do better.


Review status: 0 of 12 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 11, 2017

Had my brain switched off for that last part. It's not unfortunate that the approximate size is (expected to be) similar to the disk size; that's exactly what we're after.


Review status: 0 of 12 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@petermattis
Collaborator

These numbers are looking fairly good.


Review status: 0 of 12 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 11, 2017

Hrm, second part of the experiment didn't go well (mainly because I screwed it up). There's no straightforward way to query the data from the UI after the table is deleted, and by the time I had added some output to debug compact, I got ~3MB both before and after the compaction. It's also possible that the approximation returned 3MB even though on-disk was ~217MB at the time. I didn't check, unfortunately.

@petermattis re: disk i/o, wanna peruse the code below that does the heavy lifting? I think this will do disk i/o in the common case. Which may be OK, but something to watch out for. I also think that eventually we'll need some caching in the UI; it's far from snappy even on my idle single-node cluster.

https://github.com/facebook/rocksdb/blob/master/db/version_set.cc#L3537
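The mechanism under discussion can be sketched with a self-contained toy model (this is an illustration of the idea, not the RocksDB implementation): each sstable keeps an in-memory index mapping keys to file offsets, so the approximate size of a key range is just the difference between the offsets of its endpoints.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy sstable: the index maps a key to the file offset where that key's data
// block starts, so offset lookups need no disk reads once the index is cached.
class ToySSTable {
 public:
  void AddIndexEntry(const std::string& key, uint64_t offset) {
    index_[key] = offset;
  }

  // Offset of the first indexed block at or after `key`; the file size if the
  // key sorts past every index entry.
  uint64_t ApproximateOffsetOf(const std::string& key, uint64_t file_size) const {
    const auto it = index_.lower_bound(key);
    return it == index_.end() ? file_size : it->second;
  }

  // Approximate on-disk size of [start, end): the difference of the offsets.
  uint64_t ApproximateSize(const std::string& start, const std::string& end,
                           uint64_t file_size) const {
    const uint64_t lo = ApproximateOffsetOf(start, file_size);
    const uint64_t hi = ApproximateOffsetOf(end, file_size);
    return hi > lo ? hi - lo : 0;
  }

 private:
  std::map<std::string, uint64_t> index_;
};
```

In this simplified picture the accuracy depends entirely on index granularity and on whether compactions have reclaimed dead data, which matches the behavior observed in the experiments above.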

late addendum: repeated the experiment (dropped a tpcc table). When I switched back to the terminal, the data directory had shrunk to ~55MB from originally ~120MB. I then saw this:

approximate database size before compaction: 30 MiB
approximate database size after compaction: 3.1 MiB

Seems reasonable enough.

Take all these experiments with a grain of salt, though. We really need to test this (and automate the tests) on larger datasets.


Review status: 0 of 12 files reviewed at latest revision, all discussions resolved, all commit checks successful.


Comments from Reviewable

@tbg tbg force-pushed the approximate-sizes branch from df6de8b to e1847c1 Compare December 11, 2017 23:12
@tbg tbg requested a review from a team December 11, 2017 23:12
@tbg tbg requested a review from a team as a code owner December 11, 2017 23:12
@tbg tbg requested review from petermattis and vilterp December 11, 2017 23:12
@tbg
Member Author

tbg commented Dec 11, 2017

@vilterp could you look at the second commit?

@tbg tbg force-pushed the approximate-sizes branch from e1847c1 to e4580fe Compare December 11, 2017 23:45
@vilterp
Contributor

vilterp commented Dec 12, 2017

Reviewed 11 of 12 files at r1, 2 of 2 files at r2, 2 of 4 files at r3.
Review status: 13 of 17 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/ui/src/views/databases/containers/tableDetails/index.tsx, line 121 at r3 (raw file):

                <SummaryHeadlineStat
                  title="Size"
                  /* TODO (tschottdorf): where is this tooltip shown? Can't find it. */

Good catch… Sadly, it appears that the tooltip argument is thrown away by the SummaryHeadlineStat component:

export class SummaryHeadlineStat extends React.Component<SummaryHeadlineStatProps, {}> {
  render() {
    return <div className="summary-headline">
      <div className="summary-headline__value">{computeValue(this.props.value, this.props.format)}</div>
      <div className="summary-headline__title">{this.props.title}</div>
    </div>;
  }
}

Filed #20638


pkg/ui/src/views/databases/containers/tableDetails/index.tsx, line 128 at r3 (raw file):

                  title="Ranges"
                  /* TODO (tschottdorf): where is this tooltip shown? Can't find it. */
                  tooltip="The total count of replicas in this database across all replicas."

Is this really the count of replicas, not ranges? Is the "ranges" column in the table also replicas? If so, either change it to be a count of ranges or also change the title here and the label of the column.

Also, I think you meant "The total count of replicas in this database" without the "across all replicas".


pkg/ui/src/views/databases/data/tableInfo.tsx, line 32 at r3 (raw file):

          this.mvcc_stats = stats.stats;
          this.physical_size = FixLong(stats.approximate_disk_bytes).toNumber();
          if (!this.physical_size) {

the only falsey value here should be zero (since toNumber should always return a number, not null or undefined), so let's say if (this.physical_size === 0) here to be more explicit.

Also why say 1kb and not just 0? If RocksDB always has a certain overhead, why isn't that returned by its stats function or if not, always added here to physical_size?


pkg/ui/src/views/databases/data/tableInfo.tsx, line 33 at r3 (raw file):

          this.physical_size = FixLong(stats.approximate_disk_bytes).toNumber();
          if (!this.physical_size) {
            this.physical_size = Number(1024); // 1kb

Number constructor unnecessary; just 1024 is fine


Comments from Reviewable

@petermattis
Collaborator

I think this will do disk i/o in the common case. Which may be OK, but something to watch out for. I also think that eventually we'll need some caching in the UI; it's far from snappy even on my idle single-node cluster.

I perused the RocksDB code and I'm not seeing I/O in the common case. The one place I see I/O possibly occurring is in retrieving the sstable handle from the cache, but I expect that will almost always be a cache hit. Is there a particular bit of code you expect to be causing I/O there? It looks like most of `VersionSet::ApproximateSize` is operating on in-memory structures. `TableReader::ApproximateOffsetOf` will use the cached index data. Very possible I'm missing something.


Review status: 13 of 17 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 12, 2017

Quite likely you're right. I mostly saw that it was using the table readers and wasn't sure whether that would usually be a cache hit. Do you think I should do some more due diligence here, and if so what?


Review status: 13 of 17 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


Comments from Reviewable

@petermattis
Collaborator

Do you think I should do some more due diligence here, and if so what?

I'm comfortable with the diligence done so far. Doing something additional would be fairly onerous (i.e. adding an Env wrapper to see if any file I/O is being performed). Or perhaps using strace/truss.

@bdarnell
Contributor

If we set the rocksdb option pin_l0_filter_and_index_blocks_in_cache, we should be able to guarantee that it won't go to disk. Without that, we may see cache misses (in my experience the analogous bigtable option was usually a good idea, even though it was expensive in terms of ram)

@petermattis
Collaborator

@bdarnell I believe we specify cache_index_and_filter_blocks = false, which has the effect of pinning all filter and index blocks in memory:

// Indicating if we'd put index/filter blocks to the block cache.
// If not specified, each "table reader" object will pre-load index/filter
// block during table initialization.
bool cache_index_and_filter_blocks = false;

@couchand
Contributor

Thanks @tschottdorf, this does seem like an improvement. I'm not sure about just replacing the value, though. Per the discussion on #20638, maybe we want to surface both?


Review status: 13 of 17 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


pkg/ui/src/views/databases/data/tableInfo.tsx, line 32 at r3 (raw file):

Previously, vilterp (Pete Vilter) wrote…

the only falsey value here should be zero (since toNumber should always return a number, not null or undefined), so let's say if (this.physical_size === 0) here to be more explicit.

Also why say 1kb and not just 0? If RocksDB always has a certain overhead, why isn't that returned by its stats function or if not, always added here to physical_size?

Yes, I'm not sure I agree with the comment.


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 12, 2017

@couchand noted, and by no means do I want this PR to set anything in stone (note that the MVCC stats are still available in their entirety where they were previously used). I still think it's better to switch from (meaningless+undocumented+unintuitive) to (meaningful+undocumented+intuitive) in an ad-hoc way now and to discuss the rest in #20638.

For example, it's not clear to me how to expose both while keeping things simple (I'm not much of a UI expert), and also I know for a fact that there are approximately three people in the company that without reading up can give you an almost correct description about what's in which MVCCStats field. If you want actually correct, the number is surely zero. Thus, to correctly describe the semantics of an MVCCStats-derived quantity, I think we'll need a very large tooltip. Both of these are hopefully out of scope here. 😃

@petermattis SGTM. I'll need a review/LGTM of the non-ui code from either you or @bdarnell.


Review status: 13 of 17 files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


Comments from Reviewable

@couchand
Contributor

Roger, LGTM. Please do comment over there with your thoughts!


Reviewed 4 of 4 files at r4.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks failed.


Comments from Reviewable

@petermattis
Collaborator

:lgtm:


Review status: all files reviewed at latest revision, 6 unresolved discussions, some commit checks failed.


c-deps/libroach/db.cc, line 1748 at r4 (raw file):

  const std::string start_key(EncodeKey(start));
  const std::string end_key(EncodeKey(end));
  const rocksdb::Range r((rocksdb::Slice(start_key)), (rocksdb::Slice(end_key)));

You've got some excessive parentheses here. And I don't think you need to explicitly construct the rocksdb::Slices. This should work:

const rocksdb::Range r(start_key, end_key);

c-deps/libroach/include/libroach.h, line 116 at r4 (raw file):

// Store the approximate on-disk size of the given key range into the
// supplied uint64. Does not take into account the memtables.

Perhaps this function should be named DBApproximateDiskSize and then you could get rid of the mention of memtables from this comment.


Comments from Reviewable

@tbg tbg force-pushed the approximate-sizes branch from e4580fe to dc54687 Compare December 13, 2017 19:44
tbg added 2 commits December 13, 2017 14:57
Release note (admin ui change): the Databases page now reports table sizes that
are better approximations to actual disk space usage.
@tbg tbg force-pushed the approximate-sizes branch from dc54687 to 3ebca00 Compare December 13, 2017 19:57
@tbg
Member Author

tbg commented Dec 13, 2017

Added a test. From what I can tell, the stats in this simple case (writing ~1MB each time and flushing) are spot on.


Review status: 4 of 18 files reviewed at latest revision, 5 unresolved discussions.


c-deps/libroach/db.cc, line 1748 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

You've got some excessive parentheses here. And I don't think you need to explicitly construct the rocksdb::Slices. This should work:

const rocksdb::Range r(start_key, end_key);

Done.


c-deps/libroach/include/libroach.h, line 116 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps this function should be named DBApproximateDiskSize and then you could get rid of the mention of memtables from this comment.

Done.


pkg/ui/src/views/databases/containers/tableDetails/index.tsx, line 128 at r3 (raw file):

Previously, vilterp (Pete Vilter) wrote…

Is this really the count of replicas, not ranges? Is the "ranges" column in the table also replicas? If so, either change it to be a count of ranges or also change the title here and the label of the column.

Also, I think you meant "The total count of replicas in this database" without the "across all replicas".

Good point. I got confused: this number is actually Ranges. Made the change.


pkg/ui/src/views/databases/data/tableInfo.tsx, line 32 at r3 (raw file):

Previously, couchand (Andrew Couch) wrote…

Yes, I'm not sure I agree with the comment.

I don't trust the zeroes I see in the UI (I always assume it's some exception or missing data) and this would've allowed me to distinguish that, but I think two UI team members weigh more than my intuition here.


pkg/ui/src/views/databases/data/tableInfo.tsx, line 33 at r3 (raw file):

Previously, vilterp (Pete Vilter) wrote…

Number constructor unnecessary; just 1024 is fine

Obsolete.


Comments from Reviewable

@petermattis
Collaborator

Review status: 4 of 18 files reviewed at latest revision, 4 unresolved discussions.


c-deps/libroach/db.cc, line 1748 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Done.

👍 We'll make a C++ programmer out of you.


Comments from Reviewable

@couchand
Contributor

Review status: 4 of 18 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/ui/src/views/databases/data/tableInfo.tsx, line 32 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I don't trust the zeroes I see in the UI (I always assume it's some exception or missing data) and this would've allowed me to distinguish that, but I think two UI team members weigh more than my intuition here.

I agree that you shouldn't trust zero (particularly on the size of a database!). But making it something else just hides the issue, right?

In general, we should definitely do a better job of making it clear when there's missing data vs. zero (the time series charts are particularly guilty on this point).


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 13, 2017

Review status: 4 of 18 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/ui/src/views/databases/data/tableInfo.tsx, line 32 at r3 (raw file):

Previously, couchand (Andrew Couch) wrote…

I agree that you shouldn't trust zero (particularly on the size of a database!). But making it something else just hides the issue, right?

In general, we should definitely do a better job of making it clear when there's missing data vs. zero (the time series charts are particularly guilty on this point).

Yep, I had filed #20691 earlier. Assigned you, but that's just for a lack of better assignee. Feel free to reassign.


Comments from Reviewable

@tbg
Member Author

tbg commented Dec 13, 2017

TFTRs!

@tbg tbg merged commit 9547ef0 into cockroachdb:master Dec 13, 2017
@tbg tbg deleted the approximate-sizes branch December 13, 2017 21:50