
builtins: add pg_total_relation_size builtin #59604

Open
jordanlewis wants to merge 2 commits into cockroachdb:master from jordanlewis:pg_table_size

Conversation

@jordanlewis
Member

@jordanlewis jordanlewis commented Jan 29, 2021

Closes #20712.

Depends on #59865.

This builtin function returns the number of bytes in a table. For
CockroachDB, this means summing up the number of bytes in the ranges of
a table.

Notably, we do not multiply the number of bytes in the table's ranges by
the number of replicas - this is left up to the user to do if desired.

Release note (sql change): add the pg_total_relation_size builtin for estimating
the amount of data in a table.

@jordanlewis jordanlewis requested a review from a team January 29, 2021 23:35
@cockroach-teamcity
Member

This change is Reviewable

# crdb_internal.ranges in the logic test context.
# Make sure also that a non-existent table id returns null.
query II
SELECT pg_table_size('size_tester'::regclass), pg_table_size(387438)
Contributor


is there no way to test this works outside of logic tests?

Member Author


I added a test, thanks for the push!

@jordanlewis
Member Author

I rewrote the function to avoid depending on the virtual table, which has to fetch all range, table, and node descriptors, and to avoid fetching the range stats for every range in a separate RPC.

Now, it fetches just the range descriptors it needs, and does one big RPC to get all of their stats at once.

Also, I added a test.
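The batched approach described above can be sketched in Go. Every type and value here is a hypothetical stand-in, not CockroachDB's actual internals; the sketch only illustrates the shape of the change, one stats RPC covering many ranges instead of one RPC per range.

```go
package main

import "fmt"

// rangeDesc is a stand-in for a KV range descriptor; only the ID matters here.
type rangeDesc struct{ id int }

// batchedRangeStats simulates a single RPC that returns byte counts for
// many ranges at once, replacing one stats RPC per range.
func batchedRangeStats(ranges []rangeDesc) map[int]int64 {
	out := make(map[int]int64, len(ranges))
	for _, r := range ranges {
		out[r.id] = int64(r.id) * 100 // fabricated sizes for illustration
	}
	return out
}

// tableSizeBytes fetches only the descriptors for the table's span and
// sums the per-range byte counts from one batched stats call.
func tableSizeBytes(ranges []rangeDesc) int64 {
	var total int64
	for _, n := range batchedRangeStats(ranges) {
		total += n
	}
	return total
}

func main() {
	fmt.Println(tableSizeBytes([]rangeDesc{{1}, {2}, {3}})) // prints 600
}
```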

@jordanlewis
Member Author

Actually, this might be subtly wrong or confusing.

  • pg_indexes_size(regclass) → bigint: Computes the total disk space used by indexes attached to the specified table.
  • pg_table_size(regclass) → bigint: Computes the disk space used by the specified table, excluding indexes (but including its TOAST table if any, free space map, and visibility map).
  • pg_total_relation_size(regclass) → bigint: Computes the total disk space used by the specified table, including all indexes and TOAST data. The result is equivalent to pg_table_size + pg_indexes_size.

So I guess this function as written would be closer to pg_indexes_size. What would CockroachDB's pg_table_size be, since CockroachDB has no concept of heap? Maybe we could just do the table's primary index?

@otan
Contributor

otan commented Feb 1, 2021

So I guess this function as written would be closer to pg_indexes_size.

Am I missing something -- why is it not pg_total_relation_size?

What would CockroachDB's pg_table_size be, since CockroachDB has no concept of heap? Maybe we could just do the table's primary index?

Hmm, yeah sounds about right

@jordanlewis jordanlewis force-pushed the pg_table_size branch 5 times, most recently from c3ebc79 to 91a87c9 Compare February 2, 2021 04:22
@jordanlewis
Member Author

PTAL

@tbg I know you weighed in on #20712 with a comment:

The API call we have right now gives the approximate total replicated size of a table and it needs to fan out across the cluster and also scans from the meta ranges. Hooking this up to pg_table_size is possibly something we could regret.

but given we haven't had anything here in the last 3 years and there are no plans to improve it, are there any remaining objections? I now see that we have yet another way of calculating this via ComputeStatsForKeySpan; should we use that instead of this MVCC stats method?

Contributor

@otan otan left a comment


Reviewed 1 of 2 files at r1, 1 of 3 files at r3, 3 of 3 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis and @otan)


pkg/sql/crdb_internal.go, line 2444 at r4 (raw file):

	end_key,
	end_pretty,
  table_id,

nit: indentation misaligned

@tbg
Member

tbg commented Feb 3, 2021

@jordanlewis I can only warn, not make the call. Anything that fans out to the whole cluster is dangerous when called frequently. Is this going to be called frequently? I don't know. Probably not, but not my call to make. An individual call shouldn't be too bad, since this is only reading from the leaseholders.

Orthogonally, is this the right implementation? It looks like people want this to reflect disk usage, not the number of logical bytes in the cluster, which is what you're returning now. By implementing it as such, we're just going to switch users from asking for this to be implemented to asking why it returns seemingly bogus numbers, especially since the UI perhaps does not return the same thing (I haven't checked what "table size" means there). If there has been a discussion and it was resolved that it should do what this PR proposes, I haven't seen it either here or on the linked issues.

@rafiss
Collaborator

rafiss commented Feb 5, 2021

I don't have much context on this PR, but just inquiring: if the pg_table_size stuff is still under debate, should we cherry-pick off the first commit so #59601 is fixed?

@jordanlewis
Member Author

Yes, I think so. I'll take care of that today.

@jordanlewis jordanlewis changed the title builtins: add pg_table_size builtin builtins: add pg_total_relation_size builtin Feb 5, 2021
@jordanlewis
Member Author

@jordanlewis I can only warn, not make the call. Anything that fans out to the whole cluster is dangerous when called frequently. Is this going to be called frequently? I don't know. Probably not, but not my call to make. An individual call shouldn't be too bad, since this is only reading from the leaseholders.

This function tends to get called by metadata-scraping admin consoles, Prometheus exporters, and other tools that will presumably call it once per table in the whole database. So implementing it this way may not really be the best idea.

Orthogonally, is this the right implementation?

I think, based on the text you wrote below this question, the answer is no. I didn't realize that the MVCC stats didn't reflect the physical disk usage of the range. I'll figure out another way to do this, after I merge the first commit to fix #59601.

@jordanlewis jordanlewis requested a review from a team as a code owner February 5, 2021 22:21
@jordanlewis jordanlewis force-pushed the pg_table_size branch 6 times, most recently from 6e27449 to 602905e Compare February 5, 2021 23:02
@jordanlewis jordanlewis requested review from a team and dt and removed request for a team February 5, 2021 23:02
@jordanlewis jordanlewis force-pushed the pg_table_size branch 3 times, most recently from 1708a26 to e7a7ac6 Compare February 5, 2021 23:17
@jordanlewis
Member Author

I took a look at redoing this today. I ended up having the builtin function call into the admin server's TableStatsRequest API, the same as what the DB Console does. This will prevent any confusion where the numbers don't align - something that I agree would be deeply confusing for everyone.

For now, though, this builtin will still fan out. It's not ideal, but I think it's ultimately better to provide the information in both SQL and the DB Console, even if it's inefficient. It would be good to collect these statistics during, say, table statistics collection, and store them in a new table or in a new way in the existing table statistics system table, but that's going to be a pretty significant chunk of work.

Contributor

@knz knz left a comment


Jordan, I'm sorry but this is not going to fly. Our multi-tenant architecture is going to be the way moving forward, even for dedicated / on-prem users, and the SQL pods don't serve the admin UI / status server.

What you really need here is to access the (s *adminServer) statsForSpan() method from SQL. But this method has no reason to be on adminServer; it doesn't even use adminServer specifics. It could just as well be implemented on (*Server) or even (*sql.Server). If you were to move it to the latter, you could access it safely from a SQL builtin, and the changes to server/admin.go to get there would be minimal: from adminServer to sql.Server is just s.server.sqlServer.

Reviewed 2 of 11 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @jordanlewis, and @otan)

@knz
Contributor

knz commented Feb 8, 2021

(Alternatively, you can move it to a method of kv/client which makes it even easier.)

@knz
Contributor

knz commented Feb 8, 2021

@tbg can you check whether my proposal to integrate statsForSpan in the KV interface above is sound?

@jordanlewis jordanlewis force-pushed the pg_table_size branch 3 times, most recently from 29eed06 to ff432d3 Compare February 12, 2021 22:51
@jordanlewis
Member Author

I embarked on an exciting plumbing adventure and moved the stats-for-span implementation into the SQL server. However, we still need to call into the status server, once per KV node that has a relevant leaseholder, to fetch the span stats underneath... is that acceptable? In the end, I think we actually could have called into a remote adminServer, right? Don't KV nodes in the multitenant world still have an admin and status server all happily ready to serve?

... and crdb_internal.ranges_no_leases

This commit adds schema_name to crdb_internal.ranges and
crdb_internal.ranges_no_leases to ensure that it's possible to
disambiguate between ranges that are contained by a table with the same
name in two different user-defined schemas.

In addition, it also adds the table_id column which allows unambiguous
lookups of ranges for a given table id. This will also enable making a
virtual index on the table_id column later, which should be a nice win
for some introspection commands.

Release note (sql change): add the schema_name and table_id columns to
the crdb_internal.ranges and crdb_internal.ranges_no_leases virtual
tables.
@knz
Contributor

knz commented Feb 15, 2021

Don't KV nodes in the multitenant world still have an admin and status server all happily ready to serve?

That is not the plan, no.

statusServer / adminServer are components that are dedicated to serving the admin UI / DB console exclusively. They are not meant to be available on SQL pods in multi-tenant, and they are certainly not meant to be used as dependency from SQL.

(Which also means that our current implementation of SQL session handling, which is currently routed through these things, is incorrect. See related issue #60584.)

However, we still need to call into the status server, once per KV node that has a relevant leaseholder, to fetch the span stats underneath...

  1. I don't think that's true. The entity that does this fetch should be the distsender, via the client.Txn interface. I don't see why it needs to hang off statusServer. It's not specific to the HTTP APIs / UI console.

  2. in multi-tenant, you'll need to talk to multiple KV nodes (once per node holding relevant leases). I'm not sure our multi-tenant architecture is able to do fan-outs to the entire KV cluster from a single SQL pod.

It seems to me that we need an API like this:

  • SQL code talks to internal KV client (via kv/client),
  • SQL client talks through RPC to just 1 KV node on the KV side,
  • that KV node then performs fan-out RPCs to other KV nodes as needed.
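The three-hop routing in the list above might look roughly like this in Go. Every name here (spanStats, storeStats, gatewayFanOut) is a hypothetical stand-in; the sketch only shows one gateway KV node fanning out to the other nodes holding relevant leases and summing the per-store results.

```go
package main

import (
	"fmt"
	"sync"
)

// spanStats is what each KV store reports for the requested key span.
type spanStats struct{ approxDiskBytes int64 }

// storeStats stands in for the per-node RPC handler; real CockroachDB
// internals differ, this only illustrates the routing shape.
func storeStats(nodeID int) spanStats {
	return spanStats{approxDiskBytes: int64(nodeID) * 1000} // fabricated
}

// gatewayFanOut runs on the single KV node the SQL pod talks to; it fans
// out to the other KV nodes concurrently and sums the results.
func gatewayFanOut(nodeIDs []int) spanStats {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		total spanStats
	)
	for _, id := range nodeIDs {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			s := storeStats(id)
			mu.Lock()
			total.approxDiskBytes += s.approxDiskBytes
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(gatewayFanOut([]int{1, 2, 3}).approxDiskBytes) // prints 6000
}
```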

I think we actually could have called into a remote adminServer, right?

Again, adminServer is a DB-console-only concept. We need something else.

@knz
Contributor

knz commented Feb 16, 2021

This topic was discussed in this week's KV team meeting.

  • @tbg: this doesn’t fit well into the kv/client package, which is focused around executing requests on ranges. The span stats code is intentionally operating on a raw key range as it basically just passes through to the storage layer (note the “approx disk bytes” here). This is fundamentally a per-Store API but “broadcasting” is our only addressing scheme for that.
  • @tbg: note that I’m assuming here that we want to expose the physical disk usage. The logical usage would fit very well into the KV API, you can simply model it as a ranged read request. But I would want to double-check with the storage team whether calling engine.ApproximateDiskBytes(rangeStartKey, rangeEndKey) and summing up over the ranges like that is a) likely to give reasonable results (I think not) or b) expensive (might be).
  • @tbg from a POV of our multi-tenancy usage, I don’t see why tenants should be able to read off their approx physical disk usage. Their billing will likely not depend on it (imo it’s better to base that on logical bytes).
  • @knz the counterpoint to this is that we do not want to build features that provide a different SQL experience on CC dedicated and free/shared-tier clusters. Also, we eventually want to move CC dedicated clusters toward the multi-tenant architecture, so we need to build the SQL service in a way that works with multi-tenant.

So in summary, we have strong uncertainty and a lack of leadership/vision about how to expose KV-level (or storage-level) information through SQL in the multi-tenant deployment.

I am currently working with PM to figure out who owns the architectural vision here. Will report when we know more.

@jordanlewis
Member Author

Just a note of intent - I want to come back to this one now that we've gotten further in our MT architecture.

@brandonrjacobs

brandonrjacobs commented Apr 3, 2024

Any chance this can be revisited, @jordanlewis? We're finding a need for this and wanted to see if there are any updates.



Development

Successfully merging this pull request may close these issues.

sql: expose table on-disk size

7 participants