Add support for chunk column statistics tracking (#7020)
nikkhils merged 1 commit into timescale:main
Conversation
svenklemm:
Hmm, I'm not a fan of the dimension wording. Did we have a discussion yet about how we want to name this thing?
(force-pushed 286b7e6 to fc92364)
Codecov Report — Attention: Patch coverage is

```
@@           Coverage Diff            @@
##             main    #7020    +/-  ##
========================================
+ Coverage   80.06%   81.82%   +1.75%
========================================
  Files         190      203      +13
  Lines       37181    38026     +845
  Branches     9450     9877     +427
========================================
+ Hits        29770    31113    +1343
+ Misses       2997     2948      -49
+ Partials     4414     3965     -449
```
@svenklemm no further discussion. I am trying to name it for what it does: track a column's min/max range. The word
In my imagination, the Correct Technical Term for this is "chunk-level column metadata", complementary to the "batch-level column metadata" in native compression and the "Parquet row group metadata" in CNS. Not sure it's a good user-facing term though. The word "range" itself is probably not good, because it's very likely that we're going to add the bloom filters. |
Arguably, "metadata" is too generic and can mean anything at all. Maybe "chunk-level column statistics", but this clashes with the postgres statistics used for planning. For batch-level, we can call it a "sparse index" which conveys the function of excluding the compressed batches very well, but this name doesn't really translate to chunks... |
@svenklemm @akuzm How about
Or maybe, to keep it simple, we could even call the catalog
I did a first pass, can do more later.
I think overall it looks good. I have some nits and suggestions on the table schema. I think it can be simplified (see inline comments).
I also think the code lacks DDL handling. What happens, for example, if one changes the column name, type, or drops the column that has a dimension range? I think drop is handled, but not sure about the others.
I don't see any code handling that or tests that show it works.
src/dimension_range.c (outdated):

```c
	 */
	switch (open->upper_strategy)
	{
		case BTLessEqualStrategyNumber:
```

Looks like there could be better test coverage here.
```sql
   Index Cond: (sensor_id > 9)
(5 rows)

-- For the purposes of dimension ranges, a compressed chunk is considered as a
```
Why don't we create a utility function that one can use to manually update the ranges? It would be useful for testing, but also for users that want to calculate ranges on any chunk, even if not compressed.
Refresh a particular column's range:

```sql
SELECT refresh_dimension_range(chunk, 'sensor_id') FROM show_chunks('mytable') chunk;
```

Refresh all columns on a chunk:

```sql
SELECT refresh_dimension_range(chunk) FROM show_chunks('mytable') chunk;
```
We could provide such a function, but I am not sure about the utility of such a function on regular chunks. Regular chunks have DML in them, and until we find a way to update these ranges efficiently, the value add for them is pretty low.
FINAL decided names: catalog: chunk_column_stats
(force-pushed 88eafbe to fe2c43e)
erimatnor
left a comment
Ok, did another pass.
Overall, the code looks good from the technical standpoint. Just a few nits and naming issues.
I still have some concerns about the schema, especially the semantics around "valid" and storing ranges and "settings" in the same table.
But the biggest issue right now is lack of some tests. See inline comments for some things I am missing. I suspect that adding those tests (and maybe more) will reveal a number of issues.
sql/ddl_api.sql (outdated):

```sql
  hypertable REGCLASS,
  column_name NAME,
  if_not_exists BOOLEAN = FALSE
) RETURNS TABLE(hypertable_id INT, column_name NAME, dropped BOOL)
```
Inconsistent naming: "disabled" vs "dropped". I think the semantics are also different. Disabling means just making the stats "inert", while dropping means removing them. Which one is it?
If dropped=true, does it mean:
1. All column stats are removed.
2. Only the setting is removed, so no new stats are added, but existing stats are still there and "enabled" (used for chunk exclusion).
3. The setting is removed, and stats remain but are no longer used.
4. The setting remains but is "disabled" (e.g., all entries marked "invalid"). In that case, what is it that is dropped?

From the description of the function, I think maybe (2) is what happens. But it is not clear. I guess chunk exclusion on existing stats still happens?
```diff
-) RETURNS TABLE(hypertable_id INT, column_name NAME, dropped BOOL)
+) RETURNS TABLE(hypertable_id INT, column_name NAME, disabled BOOL)
```
@erimatnor from the above list, (1) "All column stats are removed" is what happens.
Again, this goes back to the naming discussion we had. I had suggested add_column_stats/remove_column_stats, but we felt enable/disable is a better naming convention.
```sql
SELECT * from _timescaledb_catalog.chunk_column_stats;
 id | hypertable_id | chunk_id | column_name | column_type |     range_start      |      range_end      | valid
----+---------------+----------+-------------+-------------+----------------------+---------------------+-------
  1 |             1 |        0 | sensor_id   | integer     | -9223372036854775808 | 9223372036854775807 | t
```
Not sure what the difference is, if any, of having an infinite range vs. being invalid. To me, it seems like an entry is (effectively) invalid if it has an infinite range. So, why do we need the valid flag? Or, we could just set the range to NULL, NULL for any invalid entry.
Now there seem to be multiple possible states:
- infinite range and valid (what does this mean?)
- infinite range and invalid
- non-infinite range and valid
- non-infinite range and invalid

I am not sure I understand the meaning of all these states.
If DML happens into a
compressed chunk then as part of marking it as partial, we also mark
the corresponding catalog entries as "invalid". So partial chunks do
not get excluded further. When recompression happens we get the new
min/max ranges from the uncompressed portion and try to reconcile the
ranges in the catalog based on these new values. This is safe to do in
case of INSERTs and UPDATEs. In case of DELETEs, since we are deleting
rows, it's possible that the min/max ranges change, but as of now we
err on the side of caution and retain the earlier values which can be
larger than the actual range.
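The reconciliation described above can be sketched as follows. This is a simplified, hypothetical model — the struct and function names are illustrative, not the actual TimescaleDB code: the stored range is retained when an entry is invalidated, and on recompression it is only ever widened, which is safe for INSERTs and UPDATEs but may over-approximate after DELETEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a chunk_column_stats entry.
 * range_start is inclusive, range_end is exclusive, as in the PR. */
typedef struct ColumnRange
{
	int64_t range_start; /* inclusive */
	int64_t range_end;   /* exclusive */
	bool valid;          /* false once DML made the chunk partial */
} ColumnRange;

/* DML into a compressed chunk marks the entry invalid; the old range
 * is retained so it can be reconciled later. */
static void
mark_partial(ColumnRange *r)
{
	r->valid = false;
}

/* On recompression, widen the retained range with the min/max observed
 * in the uncompressed portion. Safe for INSERT/UPDATE; after DELETEs
 * this may over-approximate, which errs on the side of caution. */
static void
reconcile_on_recompress(ColumnRange *r, int64_t new_min, int64_t new_max)
{
	if (new_min < r->range_start)
		r->range_start = new_min;
	if (new_max + 1 > r->range_end) /* end is exclusive */
		r->range_end = new_max + 1;
	r->valid = true;
}
```

Under this model, retaining the old range on invalidation is what lets recompression of a partial chunk re-use it instead of rescanning the compressed portion.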
```sql
SELECT * from _timescaledb_catalog.chunk_column_stats WHERE chunk_id = :'CHUNK_ID';
 id | hypertable_id | chunk_id | column_name | column_type | range_start | range_end | valid
----+---------------+----------+-------------+-------------+-------------+-----------+-------
  2 |             1 |        1 | sensor_id   | integer     |           1 |         9 | f
```
Instead of setting valid=false, why not just set an infinite range or NULL, NULL and get rid of the valid flag?
As explained above, we will update this entry later during recompression. If we do not retain these ranges, then we have no way of getting the range back post-recompression.
```sql
  column_type REGTYPE NOT NULL,
  range_start bigint NOT NULL,
  range_end bigint NOT NULL,
  valid boolean NOT NULL,
```
What is the valid flag for if we can just set the range to -infinity, +infinity or NULL, NULL? This seems like another unnecessary addition to the schema that creates many combinations of semi-ambiguous states. What is the difference between these states, for example:
- infinite range, valid
- infinite range, invalid
- non-infinite range, valid
- non-infinite range, invalid
See more comments on this in the test file.
Sorry, I might not have been clear. It seems to me that an infinite range has the same effect as "invalid", so why have both?
In other words, an infinite range basically renders the stats useless as it always selects the chunk, which is the same as invalid. Or am I missing something?
@erimatnor I explained this in another response. Adding it here again:
If DML happens into a
compressed chunk then as part of marking it as partial, we also mark
the corresponding catalog entries as "valid = false". So partial chunks do
not get excluded further. When recompression happens we get the new
min/max ranges from the uncompressed portion and try to reconcile the
ranges in the catalog based on these new values. This is safe to do in
case of INSERTs and UPDATEs. In case of DELETEs, since we are deleting
rows, it's possible that the min/max ranges change, but as of now we
err on the side of caution and retain the earlier values which can be
larger than the actual range. So, invalid entries retain the earlier
calculated range, and we get to re-use a portion of that range when
recompression of a partial chunk happens.
(force-pushed 8522e40 to c8bdcbd)
@erimatnor @akuzm @svenklemm @mkindahl Added a bunch of additional tests in the latest force push.
(force-pushed d359512 to f7bfc0a)
mkindahl
left a comment
Generally, it looks good. I think there are a few checks missing in critical places (e.g., a missing check that you actually managed to get a hypertable for a relid), but otherwise there is nothing that I think should block a merge.
I have a few comments for your consideration regarding the user interface, but these should not block a merge.
```sql
-- >, <, = comparators) by storing their textual representation.
--
CREATE TABLE _timescaledb_catalog.chunk_column_stats (
  id serial NOT NULL,
```
Nit: I can't see a reference to this attribute. Is it needed? In the C code, it mostly seems to be passed around, but not really used for anything.
There's a function ts_chunk_column_stats_update_by_id which uses this id field to update an existing entry. Otherwise, every lookup would have to do a (ht_id, chunk_id, col_name) search.
Is that function commonly used? If not, it seems to make more sense to use the composite key as the primary key. It makes sense to have an id column if you have a foreign key to the table, or if the id is needed to make otherwise similar rows unique (say, being able to store multiple rows with (hypertable, chunk, colname)), but that does not seem to be the case here.
```sql
  CONSTRAINT chunk_column_stats_pkey PRIMARY KEY (id),
  CONSTRAINT chunk_column_stats_ht_id_chunk_id_colname_key UNIQUE (hypertable_id, chunk_id, column_name),
```
Nit: Shouldn't the second constraint be the primary key here?
UNIQUE helps us get the same semantics, no? All these columns are NOT NULL as well. The id column is needed for update lookups as mentioned above.
It is essentially the same, but a primary key provides a visual cue to what the "main" key is.
```c
	chunk_column_stats_scan_internal(scankey,
									 2,
									 chunk_column_stats_tuple_delete,
									 &count,
									 0,
									 CHUNK_COLUMN_STATS_HT_ID_CHUNK_ID_COLUMN_NAME_IDX,
									 RowExclusiveLock,
									 CurrentMemoryContext);
```
With a RowExclusiveLock this could potentially fail if concurrent updates are happening. Since this is not intended to be transactional, it might make sense to require a stronger lock (which would make the call wait until it can get the lock and then do the change atomically). It would be simpler to use since you would not have to check that the function succeeds. However, that could also lead to deadlocks if the function is used in a transaction in that case.
DML typically takes RowExclusiveLock, and that's what we are sticking to here. We don't expect a lot of concurrent updates to this catalog. That's one of the reasons why we made a separate catalog here instead of re-using the existing dimension_slice catalog, which is already prone to a lot of contention.
(force-pushed 7d28a88 to 1e2b9e9)
```c
{
	DIMENSION_TYPE_OPEN,
	DIMENSION_TYPE_CLOSED,
	DIMENSION_TYPE_STATS,
```
Why are we treating this as a dimension type? I think the consensus was that we don't want to mix up this feature with dimensions.
@svenklemm This allows us to use the existing dimension restrict info stuff to remove/use these restrictions if they are not needed for the query.
FWIW, I don't think there was ever consensus on not mixing this with dimensions. IMO, there's a clear overlap in functionality, as @nikkhils points out.
erimatnor
left a comment
Approving.
Still have some reservations, however.
- The semantics around invalidation of stats are unclear. I don't understand what the purpose of an invalid entry is, and what the difference is between `-inf/+inf` vs `invalid=true` vs deleting the entry. Are entries marked `invalid=true` or `-inf/+inf` actually used for anything? It seems to me that we don't need the invalidation stuff at all. Why not just have stats when the chunk is compressed and then delete the stats when decompressed? The semantics would be much simpler. If we want stats for non-compressed data in the future, we could add an invalidation framework if needed. Still, I think we either have stats and they are valid, or we don't have stats (and the entry doesn't exist).
- Still lacks isolation/concurrency tests.
- Test coverage seems to be lacking in some places.
- Still think the table schemas can be simplified, but not a blocker.
```c
	case DIMENSION_TYPE_CLOSED:
		p->coordinates[p->num_coords++] = (int64) DatumGetInt32(datum);
		break;
	case DIMENSION_TYPE_STATS:
```
would be nice to have test coverage here
Adding a bunch of tests for additional test coverage
```c
						 &restype);
	int64 value = ts_time_value_to_internal_or_infinite(datum, restype);

	switch (strategy)
```
Seems to be a lack of test coverage here.
Added a bunch of tests.
src/ts_catalog/chunk_column_stats.c (outdated):

```c
	 */
	switch (open->upper_strategy)
	{
		case BTLessEqualStrategyNumber:
```
Another switch that seems to lack test coverage. It would be nice to improve this.
```c
	case BTGreaterStrategyNumber: /* e.g: id >= 9 */
	{
		/* range_end is exclusive, so check accordingly */
		if ((fd->range_end - 1) >= open->upper_bound)
```
Looks like this is the only case being hit in tests.
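For reference, the kind of branch under discussion can be sketched in isolation. This is a simplified, hypothetical model of the exclusion check — the function name and strategy tags are illustrative stand-ins for PostgreSQL's BTGreaterStrategyNumber and friends, not the actual TimescaleDB code: with an exclusive range_end, the largest stored value is range_end - 1, and each strategy needs its own comparison, which is why each case deserves its own test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for PostgreSQL's B-tree strategy numbers. */
typedef enum
{
	STRAT_GT, /* col >  bound */
	STRAT_GE, /* col >= bound */
	STRAT_LT, /* col <  bound */
	STRAT_LE  /* col <= bound */
} Strategy;

/* Return true if a chunk whose column values lie in [range_start,
 * range_end) might contain rows matching "col <strategy> bound".
 * Since range_end is exclusive, the maximum stored value is
 * range_end - 1; compare against that for upper-bound strategies. */
static bool
chunk_may_match(int64_t range_start, int64_t range_end, Strategy s,
				int64_t bound)
{
	int64_t max_value = range_end - 1; /* end is exclusive */

	switch (s)
	{
		case STRAT_GT:
			return max_value > bound;
		case STRAT_GE:
			return max_value >= bound;
		case STRAT_LT:
			return range_start < bound;
		case STRAT_LE:
			return range_start <= bound;
	}
	return true; /* unknown strategy: never exclude */
}
```

For example, under this model a chunk with range [1, 10) (values 1..9) is excluded for `col > 9` but kept for `col >= 9`, so the two upper-bound strategies genuinely need separate test cases.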
```sql
 _timescaledb_internal._hyper_1_1_chunk
(1 row)

-- Entry should be removed for this chunk now
```
My point was that the entry doesn't have to be set it to -inf/+inf because the constraint is still valid after decompression. So, the idea I alluded to is to change it so that it gets invalidated when the actual data changes.
I am not sure we don't have the infra for it (we do DML decompression on the same kind of insert/updates/deletes, for example). But it is OK to not have this for the first version if you decide it is not worth the effort.
Just want to point out, again, though, the confusion around "invalidation" vs setting an infinite range vs deleting. In the test, the range seems to be set to -inf/+inf, but the entry is still marked as "valid". You also mention yourself, in the comment, that the entry is invalidated, but it still has valid=true. IMO, this is confusing.
I am actually not sure what the point is in keeping the entry vs just deleting it. Are invalidated or infinite-range entries actually used for anything? What is the advantage of having the invalid entry in the table vs just adding a new one once the chunk is compressed again?
(force-pushed 614eab2 to c4fde33)
Allow users to specify that ranges (min/max values) be tracked for a specific column using the enable_column_stats() API. We will store such min/max ranges in a new timescaledb catalog table, _timescaledb_catalog.chunk_column_stats. As of now, we support tracking min/max ranges for smallint, int, bigint, serial, bigserial, date, timestamp, and timestamptz data types. Support for other stats, such as bloom filters, will be added in the future.

We add an entry of the form (ht_id, invalid_chunk_id, col, -INF, +INF) into this catalog to indicate that min/max values need to be calculated for this column in a given hypertable for chunks. We also iterate through existing chunks and add -INF, +INF entries for them in the catalog. This allows for selection of these chunks by default, since no min/max values have been calculated for them.

The actual min/max start/end range is calculated later; one of the entry points, for now, is during compression. The range is stored in start (inclusive) and end (exclusive) form. If DML happens into a compressed chunk then, as part of marking it as partial, we also mark the corresponding catalog entries as "invalid", so partial chunks do not get excluded further. When recompression happens, we get the new min/max ranges from the uncompressed portion and try to reconcile the ranges in the catalog based on these new values. This is safe to do in the case of INSERTs and UPDATEs. In the case of DELETEs, since we are deleting rows, it's possible that the min/max ranges change, but as of now we err on the side of caution and retain the earlier values, which can be larger than the actual range.

We can thus store the min/max values for such columns in this catalog table at the per-chunk level. Note that these min/max range values do not participate in partitioning of the data. Such data ranges will be used for chunk pruning if the WHERE clause of an SQL query specifies ranges on such a column. Note that the executor startup-time chunk exclusion logic is also able to use this metadata effectively.

A "DROP COLUMN" on a column with statistics tracking enabled on it ends up removing all relevant entries from the catalog tables. A "decompress_chunk" on a compressed chunk removes its entries from the "chunk_column_stats" catalog table, since it's now available for DML. Also, a new "disable_column_stats" API has been introduced to allow removal of min/max entries from the catalog for a specific column.
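The default -INF/+INF entry and the later min/max calculation described above can be sketched as follows. This is a simplified, hypothetical model — the function and macro names are illustrative, not the actual TimescaleDB code: int64 sentinels stand in for the "not yet calculated" state (matching the -9223372036854775808/9223372036854775807 values visible in the catalog), so a chunk without computed stats is always selected, and once computed the range is stored start-inclusive, end-exclusive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sentinels for "no stats computed yet": with this range every query
 * predicate overlaps, so the chunk is selected by default. */
#define RANGE_NEG_INF INT64_MIN
#define RANGE_POS_INF INT64_MAX

/* Compute the stored [start, end) range for one column of a chunk at
 * compression time: start is the minimum value (inclusive), end is the
 * maximum value plus one (exclusive). Hypothetical, simplified sketch. */
static void
compute_range(const int64_t *values, size_t n, int64_t *start, int64_t *end)
{
	*start = RANGE_NEG_INF;
	*end = RANGE_POS_INF;
	if (n == 0)
		return; /* keep -INF/+INF so the chunk is always selected */

	int64_t min = values[0];
	int64_t max = values[0];
	for (size_t i = 1; i < n; i++)
	{
		if (values[i] < min)
			min = values[i];
		if (values[i] > max)
			max = values[i];
	}
	*start = min;
	*end = max + 1; /* exclusive upper bound */
}
```

Storing the end exclusively keeps the overlap test uniform (a predicate range [a, b) overlaps the stats range [start, end) iff a < end and start < b), at the cost of the `range_end - 1` adjustments seen in the review comments above.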
This release contains performance improvements and bug fixes since the 2.15.3 release. We recommend that you upgrade at the next available opportunity.

**Features**
* timescale#6880: Add support for the array operators used for compressed DML batch filtering.
* timescale#6895: Improve the compressed DML expression pushdown.
* timescale#6897: Add support for replica identity on compressed hypertables.
* timescale#6918: Remove support for PG13.
* timescale#6920: Rework compression activity wal markers.
* timescale#6989: Add support for foreign keys when converting plain tables to hypertables.
* timescale#7020: Add support for the chunk column statistics tracking.
* timescale#7048: Add an index scan for INSERT DML decompression.
* timescale#7075: Reduce decompression on the compressed INSERT.
* timescale#7101: Reduce decompressions for the compressed UPDATE/DELETE.
* timescale#7108: Reduce decompressions for INSERTs with UNIQUE constraints.
* timescale#7116: Use DELETE instead of TRUNCATE after compression.
* timescale#7134: Refactor foreign key handling for compressed hypertables.
* timescale#7161: Fix `mergejoin input data is out of order`.

**Bugfixes**
* timescale#6987: Fix REASSIGN OWNED BY for background jobs.
* timescale#7018: Fix `search_path` quoting in the compression defaults function.
* timescale#7046: Prevent locking for compressed tuples.
* timescale#7055: Fix the `scankey` for `segment by` columns, where the type `constant` is different to `variable`.
* timescale#7064: Fix the bug in the default `order by` calculation in compression.
* timescale#7069: Fix the index column name usage.
* timescale#7074: Fix the bug in the default `segment by` calculation in compression.

**Thanks**
* @jledentu For reporting a problem with mergejoin input order
This release contains significant performance improvements when working with compressed data, extended join support in continuous aggregates, and the ability to define foreign keys from regular tables towards hypertables. We recommend that you upgrade at the next available opportunity.

In TimescaleDB v2.16.0 we:
* Introduce multiple performance-focused optimizations for data manipulation operations (DML) over compressed chunks. Improved upsert performance by more than 100x in some cases and more than 1000x in some update/delete scenarios.
* Add the ability to define chunk-skipping indexes on non-partitioning columns of compressed hypertables. TimescaleDB v2.16.0 extends chunk exclusion to use those skipping (sparse) indexes when queries filter on the relevant columns, and prunes chunks that do not include any relevant data for calculating the query response.
* Offer new options for use cases that require foreign keys defined. You can now add foreign keys from regular tables towards hypertables. We have also removed some really annoying locks in the reverse direction that blocked access to referenced tables while compression was running.
* Extend continuous aggregates to support more types of analytical queries. More types of joins are supported, additional equality operators on join clauses, and support for joins between multiple regular tables.

**Highlighted features in this release**
* Improved query performance through chunk exclusion on compressed hypertables. You can now define chunk-skipping indexes on compressed chunks for any column with one of the following data types: `smallint`, `int`, `bigint`, `serial`, `bigserial`, `date`, `timestamp`, `timestamptz`. After you call `enable_chunk_skipping` on a column, TimescaleDB tracks the min and max values for that column. TimescaleDB uses that information to exclude chunks for queries that filter on that column and would not find any data in those chunks.
* Improved upsert performance on compressed hypertables. By using index scans to verify constraints during inserts on compressed chunks, TimescaleDB speeds up some ON CONFLICT clauses by more than 100x.
* Improved performance of updates, deletes, and inserts on compressed hypertables. By filtering data while accessing the compressed data and before decompressing, TimescaleDB has improved performance for updates and deletes on all types of compressed chunks, as well as inserts into compressed chunks with unique constraints. By signaling constraint violations without decompressing, or decompressing only when matching records are found in the case of updates, deletes, and upserts, TimescaleDB v2.16.0 speeds up those operations more than 1000x in some update/delete scenarios, and 10x for upserts.
* You can add foreign keys from regular tables to hypertables, with support for all types of cascading options. This is useful for hypertables that partition using sequential IDs and need to reference those IDs from other tables.
* Lower locking requirements during compression for hypertables with foreign keys. Advanced foreign key handling removes the need for locking referenced tables when new chunks are compressed. DML is no longer blocked on referenced tables while compression runs on a hypertable.
* Improved support for queries on continuous aggregates. `INNER/LEFT` and `LATERAL` joins are now supported. Plus, you can now join with multiple regular tables, and you can have more than one equality operator on join clauses.

**PostgreSQL 13 support removal announcement**
Following the deprecation announcement for PostgreSQL 13 in TimescaleDB v2.13, PostgreSQL 13 is no longer supported in TimescaleDB v2.16. The currently supported PostgreSQL major versions are 14, 15, and 16.

**Features**
* #6880: Add support for the array operators used for compressed DML batch filtering.
* #6895: Improve the compressed DML expression pushdown.
* #6897: Add support for replica identity on compressed hypertables.
* #6918: Remove support for PG13.
* #6920: Rework compression activity wal markers.
* #6989: Add support for foreign keys when converting plain tables to hypertables.
* #7020: Add support for the chunk column statistics tracking.
* #7048: Add an index scan for INSERT DML decompression.
* #7075: Reduce decompression on the compressed INSERT.
* #7101: Reduce decompressions for the compressed UPDATE/DELETE.
* #7108: Reduce decompressions for INSERTs with UNIQUE constraints.
* #7116: Use DELETE instead of TRUNCATE after compression.
* #7134: Refactor foreign key handling for compressed hypertables.
* #7161: Fix `mergejoin input data is out of order`.

**Bugfixes**
* #6987: Fix REASSIGN OWNED BY for background jobs.
* #7018: Fix `search_path` quoting in the compression defaults function.
* #7046: Prevent locking for compressed tuples.
* #7055: Fix the `scankey` for `segment by` columns, where the type `constant` is different to `variable`.
* #7064: Fix the bug in the default `order by` calculation in compression.
* #7069: Fix the index column name usage.
* #7074: Fix the bug in the default `segment by` calculation in compression.

**Thanks**
* @jledentu For reporting a problem with mergejoin input order