
Add DB_OBJECT_APPROXIMATE_SIZE and DB_OBJECT_APPROXIMATE_TOTAL_SIZE#4070

Merged
katzyn merged 2 commits into h2database:master from katzyn:disk_space
Jul 12, 2024

Conversation

@katzyn
Contributor

@katzyn katzyn commented Jul 6, 2024

Page.getDiskSpaceUsed() and related methods in tables and indexes now have an approximate parameter. With approximate = true, this method no longer reads every page and therefore works much faster on large tables. The initial implementation for this case isn't perfect and can be improved in the future, but at least it is fast, stable, and accurate enough.

New DB_OBJECT_APPROXIMATE_SIZE and DB_OBJECT_APPROXIMATE_TOTAL_SIZE functions are introduced. They are similar to DB_OBJECT_SIZE and DB_OBJECT_TOTAL_SIZE, but they use this parameter for a faster evaluation.
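The new functions could be called like the existing ones; a hypothetical example, assuming they take the same object type / schema / name arguments as the existing DB_OBJECT_SIZE and DB_OBJECT_TOTAL_SIZE functions, and that a PUBLIC.ORDERS table exists:

```sql
-- Compare the exact and the approximate size of one table
-- (the table name 'ORDERS' is illustrative).
SELECT DB_OBJECT_TOTAL_SIZE('TABLE', 'PUBLIC', 'ORDERS')
           AS EXACT_BYTES,
       DB_OBJECT_APPROXIMATE_TOTAL_SIZE('TABLE', 'PUBLIC', 'ORDERS')
           AS APPROX_BYTES;
```

The approximate variant should return a value in the same ballpark as the exact one, only much faster on large tables; check the documentation of your H2 version for the exact signature.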

DatabaseMetaData.getIndexInfo() with approximate = true now also uses a fast inaccurate version to resolve a problem with H2 Console: #4058 (comment)

@andreitokar
Contributor

I just wonder what the use of this function is at all. If it tries to determine ("precisely" or approximately) the disk space used by a DB object, shouldn't all pages related to it be included, not just the current version of the tree?
I doubt that it can be used for any sort of capacity planning. Also, people would start asking questions like "why is my DB file size so and so, whereas the total size of all objects is 1/x of it?".

On the other hand, the implementation has already become complex enough, so is it really worth it? Maybe we should drop it altogether instead. It looks to me like a perfect solution for a non-existent problem.

@katzyn
Contributor Author

katzyn commented Jul 6, 2024

Such functions are used for data storage optimization. Simple queries can be written to get size estimates for all tables and indexes and to find good candidates for further investigation. Sometimes disk space is mostly consumed by very unexpected tables or indexes. Disk space used by old versions isn't really important here. The current implementation returns very useful results, but it is horribly slow, so it would be nice to have an inaccurate but fast alternative.

Maybe separate semi-accurate and fast versions aren't needed, but I'm not sure that the fast version doesn't have flaws of its own; everything depends on the data distribution.

@katzyn
Contributor Author

katzyn commented Jul 6, 2024

I don't like my implementation of getDiskSpaceUsedApproximation(), but my other ideas gave too large deviations in the results.

@andreitokar
Contributor

Sometimes disk space is mostly used by very unexpected tables or indexes
Not sure what you mean, or why a user would care. Disk space used depends on the update rate as much as (if not more than) on the data itself. Usually number-of-rows * avg-row-size or number-of-rows * avg-index-key-size is a good enough approximation. Some metric of how balanced the B-tree is would be much more useful.
The situation is very similar to Java heap space: the JVM gives you no way to determine an object's size.

@manticore-projects
Contributor

manticore-projects commented Jul 7, 2024

@andreitokar: We at manticore have a very real use case and need for this, because at banks and insurance companies we run very large history tables without a clear retention policy -- naturally you want to keep as much business data as possible, until space constraints bring you back down to reality.

So we regularly check the database for top 10 largest objects (tables + indices) and then start to prune.
We welcome this change.

@katzyn: at least for us, large deviations do not matter; we only need an indicator of where to start pruning. A typical 20%/80% approach (find the 20% of the tables consuming 80% of the space) is more than sufficient.
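A "top 10 largest objects" check like the one described above could be sketched as follows; this is a hypothetical query that assumes the new function takes the same arguments as DB_OBJECT_TOTAL_SIZE and that H2's INFORMATION_SCHEMA.TABLES exposes TABLE_SCHEMA, TABLE_NAME, and TABLE_TYPE columns:

```sql
-- Ten largest tables by approximate total size (data + indexes + LOBs).
SELECT TABLE_SCHEMA, TABLE_NAME,
       DB_OBJECT_APPROXIMATE_TOTAL_SIZE('TABLE', TABLE_SCHEMA, TABLE_NAME)
           AS APPROX_BYTES
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY APPROX_BYTES DESC
FETCH FIRST 10 ROWS ONLY;
```

For the 20%/80% style of pruning, the exact byte counts are irrelevant; only the ranking matters, which is exactly what the fast approximate version provides.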

@katzyn
Contributor Author

katzyn commented Jul 7, 2024

Disk space used depends on an update rate as much (if not more) as data itself

Massive updates have their own overhead, but that is a different problem, and it isn't critical in many cases where H2 is used as a production database. For example, the size of one production database usually floats between 40 GiB after full defragmentation and 75 GiB during normal operation; this behavior is known and acceptable.

Situation 1. The sizes of the database, its backup copies, and especially its SQL dumps have increased significantly, and we know it isn't a garbage collection problem. No table has too many rows compared with our expectations, so what happened? Oh, table X has 30 GiB of data in it, but only 20,000 rows; what is it? Unexpectedly large values were put into a VARBINARY column due to a wrong configuration, and nobody noticed.

Situation 2. We need to decrease the overall size of a poorly designed database in an old, large project, and currently there is no time to change everything, so we need to identify the most important parts first. After that we can check what can be done with them.

These problems aren't database-specific; I have used similar functions in PostgreSQL too.
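A diagnostic in the spirit of Situation 1 could look like this hypothetical query, which assumes INFORMATION_SCHEMA.TABLES exposes a ROW_COUNT_ESTIMATE column in this H2 version and that the new function takes the same arguments as DB_OBJECT_SIZE:

```sql
-- Tables whose average per-row footprint is suspiciously large,
-- e.g. because of unexpectedly large VARBINARY values.
SELECT TABLE_NAME, ROW_COUNT_ESTIMATE,
       DB_OBJECT_APPROXIMATE_SIZE('TABLE', TABLE_SCHEMA, TABLE_NAME)
           / GREATEST(ROW_COUNT_ESTIMATE, 1) AS APPROX_BYTES_PER_ROW
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY APPROX_BYTES_PER_ROW DESC;
```

A table like the 30 GiB / 20,000-row example above would float to the top of this ranking immediately, even with a rough size estimate.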

@manticore-projects
Contributor

Situation 3: History tables (cards added onto the stack periodically) without a clear retention policy (which is mostly a political problem, not a technical one -- but the DBA is still stuck with it).

It really does not depend on the RDBMS, but only on the business logic and operations.
"Show the top x largest objects" is a very welcome tool.

@wburzyns

wburzyns commented Jul 7, 2024

Personally, I'm perfectly fine with the current performance, even if it's slow. It's not information one needs every time a DB is accessed.

My question is: why does the web console need to check table/index sizes while opening a DB? This information is not displayed upfront. When I open a DB via JdbcDataSource.getConnection, it opens instantly, and the web console should behave in the same way.

@katzyn
Contributor Author

katzyn commented Jul 7, 2024

My question is: why does the web console need to check table/index size while opening a DB?

Because JDBC metadata was designed in such a bad way many years ago. Some of its parts look too MySQL-specific, but other database systems are very different from MySQL. H2 Console reads information about indexes from DatabaseMetaData.getIndexInfo(), and this method reads index sizes together with the other information about them.

When I open a DB via JdbcDataSource.getConnection it gets opened instantly and the web console should behave in the same way.

Unfortunately, it can't, because it needs to read various information about the database. But we can speed up the problematic getIndexInfo(): there is no need to return an accurate index size when only an approximate index size and row count were requested.

@katzyn katzyn merged commit b728c42 into h2database:master Jul 12, 2024
@katzyn katzyn deleted the disk_space branch July 12, 2024 11:49