
Add support for chunk column statistics tracking #7020

Merged
nikkhils merged 1 commit into timescale:main from nikkhils:cat_dim_range
Jul 12, 2024

Conversation

@nikkhils
Contributor

@nikkhils nikkhils commented Jun 10, 2024

Allow users to specify that ranges (min/max values) be tracked
for a specific column using the enable_column_stats() API. We
will store such min/max ranges in a new timescaledb catalog table
_timescaledb_catalog.chunk_column_stats. As of now we support tracking
min/max ranges for the smallint, int, bigint, serial, bigserial, date,
timestamp, and timestamptz data types. Support for other statistics,
such as bloom filters, will be added in the future.
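As a sketch of the flow described above (the hypertable name `metrics` is illustrative; the call shape assumes `enable_column_stats()` takes the hypertable and column name, as described):

```sql
-- Track min/max ranges for the "sensor_id" column. This adds the
-- (ht_id, invalid_chunk_id, col, -INF, +INF) marker entry plus
-- -INF/+INF entries for all existing chunks of the hypertable.
SELECT enable_column_stats('metrics', 'sensor_id');

-- Inspect the tracked per-chunk ranges.
SELECT * FROM _timescaledb_catalog.chunk_column_stats;
```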

We add an entry of the form (ht_id, invalid_chunk_id, col, -INF, +INF)
into this catalog to indicate that min/max values need to be calculated
for this column in a given hypertable for chunks. We also iterate
through existing chunks and add -INF, +INF entries for them in the
catalog. This allows for selection of these chunks by default since no
min/max values have been calculated for them.

The actual min/max start/end range is calculated later. One of the
entry points is during compression for now. The range is stored in
start (inclusive) and end (exclusive) form. If DML happens into a
compressed chunk then, as part of marking it as partial, we also mark
the corresponding catalog entries as "invalid", so partial chunks are
no longer excluded. When recompression happens we get the new
min/max ranges from the uncompressed portion and try to reconcile the
ranges in the catalog based on these new values. This is safe to do in
case of INSERTs and UPDATEs. In case of DELETEs, since we are deleting
rows, it's possible that the min/max ranges change, but as of now we
err on the side of caution and retain the earlier values which can be
larger than the actual range.
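The lifecycle above can be sketched with illustrative catalog rows (the values are hypothetical; the range is stored in [start, end) form, start inclusive and end exclusive):

```sql
-- After compressing a chunk whose sensor_id values span 1..9:
--   range_start = 1, range_end = 10, valid = t   -- [1, 10) covers 1..9
--
-- After DML into the compressed chunk (chunk marked partial):
--   range_start = 1, range_end = 10, valid = f   -- chunk no longer excluded
--
-- After recompression reconciles new rows up to sensor_id = 15:
--   range_start = 1, range_end = 16, valid = t   -- ranges may widen; after
--                                                -- DELETEs the old bounds are
--                                                -- retained and may overestimate
```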

We can thus store the min/max values for such columns in this catalog
table at the per-chunk level. Note that these min/max range values do
not participate in partitioning of the data. Such data ranges will be
used for chunk pruning if the WHERE clause of an SQL query specifies
ranges on such a column.
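For example, a range predicate on the tracked column lets the planner skip chunks whose stored range cannot contain matching rows (a sketch; the exact plan output depends on the data and version):

```sql
-- Chunks whose [range_start, range_end) interval for sensor_id lies
-- entirely at or below 100 are pruned and never scanned.
EXPLAIN (COSTS OFF)
SELECT * FROM metrics WHERE sensor_id > 100;
```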

A "DROP COLUMN" on a column with statistics tracking enabled on it
ends up removing all relevant entries from the catalog tables.

A "decompress_chunk" on a compressed chunk removes its entries from the
"chunk_column_stats" catalog table since now it's available for DML.

Also a new "disable_column_stats" API has been introduced to allow
removal of min/max entries from the catalog for a specific column.
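Correspondingly (the hypertable name is illustrative, mirroring the enable call above):

```sql
-- Stop tracking and remove all min/max entries for the column.
SELECT disable_column_stats('metrics', 'sensor_id');
```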

@nikkhils nikkhils self-assigned this Jun 10, 2024
@svenklemm
Member

Hmmm I'm not a fan of the dimension wording, did we have a discussion yet on how we want to name this thing?

@nikkhils nikkhils force-pushed the cat_dim_range branch 4 times, most recently from 286b7e6 to fc92364 Compare June 11, 2024 10:20
@codecov

codecov bot commented Jun 11, 2024

Codecov Report

Attention: Patch coverage is 83.10502% with 111 lines in your changes missing coverage. Please review.

Project coverage is 81.82%. Comparing base (59f50f2) to head (8a3b39a).
Report is 249 commits behind head on main.

Files Patch % Lines
src/ts_catalog/chunk_column_stats.c 81.81% 55 Missing and 41 partials ⚠️
src/dimension.c 0.00% 5 Missing ⚠️
src/hypertable_restrict_info.c 90.19% 2 Missing and 3 partials ⚠️
src/chunk.c 93.54% 0 Missing and 2 partials ⚠️
src/process_utility.c 90.47% 0 Missing and 2 partials ⚠️
src/chunk_adaptive.c 88.88% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7020      +/-   ##
==========================================
+ Coverage   80.06%   81.82%   +1.75%     
==========================================
  Files         190      203      +13     
  Lines       37181    38026     +845     
  Branches     9450     9877     +427     
==========================================
+ Hits        29770    31113    +1343     
+ Misses       2997     2948      -49     
+ Partials     4414     3965     -449     

☔ View full report in Codecov by Sentry.

@nikkhils
Contributor Author

Hmmm I'm not a fan of the dimension wording, did we have a discussion yet on how we want to name this thing?

@svenklemm no further discussion. I am trying to name it for what it does, track a column's min/max range. The word dimension in English means a measurable value.

@akuzm
Member

akuzm commented Jun 11, 2024

Hmmm I'm not a fan of the dimension wording, did we have a discussion yet on how we want to name this thing?

In my imagination, the Correct Technical Term for this is "chunk-level column metadata", complementary to the "batch-level column metadata" in native compression and the "Parquet row group metadata" in CNS. Not sure it's a good user-facing term though.

The word "range" itself is probably not good, because it's very likely that we're going to add the bloom filters.

@akuzm
Member

akuzm commented Jun 11, 2024

Hmmm I'm not a fan of the dimension wording, did we have a discussion yet on how we want to name this thing?

In my imagination, the Correct Technical Term for this is "chunk-level column metadata", complementary to the "batch-level column metadata" in native compression and the "Parquet row group metadata" in CNS. Not sure it's a good user-facing term though.

The word "range" itself is probably not good, because it's very likely that we're going to add the bloom filters.

Arguably, "metadata" is too generic and can mean anything at all. Maybe "chunk-level column statistics", but this clashes with the postgres statistics used for planning.

For batch-level, we can call it a "sparse index" which conveys the function of excluding the compressed batches very well, but this name doesn't really translate to chunks...

@nikkhils
Contributor Author

nikkhils commented Jun 11, 2024

Arguably, "metadata" is too generic and can mean anything at all. Maybe "chunk-level column statistics", but this clashes with the postgres statistics used for planning.

For batch-level, we can call it a "sparse index" which conveys the function of excluding the compressed batches very well, but this name doesn't really translate to chunks...

@svenklemm @akuzm How about chunk_restrictinfo as the name for the catalog instead of the dimension_range? Essentially, we are trying to add metadata which helps in eventual chunk exclusion. But then the API changes to

add_chunk_restrictinfo or remove_chunk_restrictinfo. Not sure if that's too much to expose to a user.

Or maybe, chunk_excludeinfo

To keep it simple, we could even call the catalog chunk_column_stats. That keeps it very generic and user friendly IMO.

Member

@erimatnor erimatnor left a comment

I did a first pass, can do more later.

I think overall it looks good. I have some nits and suggestions on the table schema. I think it can be simplified (see inline comments).

I also think the code lacks DDL handling. What happens, for example, if one changes the column name, type, or drops the column that has a dimension range? I think drop is handled, but not sure about the others.

I don't see any code handling that or tests that show it works.

*/
switch (open->upper_strategy)
{
case BTLessEqualStrategyNumber:
Member

Looks like there could be better test coverage here.

Index Cond: (sensor_id > 9)
(5 rows)

-- For the purposes of dimension ranges, a compressed chunk is considered as a
Member

Why don't we create a utility function that one can use to manually update the ranges? It would be useful for testing, but also for users that want to calculate ranges on any chunk, even if not compressed.

Refresh a particular column's range:

SELECT refresh_dimension_range(:chunk, 'sensor_id') from show_chunks('mytable') chunk;

Refresh all columns on chunk:

SELECT refresh_dimension_range(:chunk) from show_chunks('mytable') chunk;

Contributor Author

We could provide such a function. But I am not sure about the utility of such a function on regular chunks. Regular chunks have DML in them, and until we find a way to update these ranges efficiently, the value add for them is pretty low.

@nikkhils
Contributor Author

Arguably, "metadata" is too generic and can mean anything at all. Maybe "chunk-level column statistics", but this clashes with the postgres statistics used for planning.
For batch-level, we can call it a "sparse index" which conveys the function of excluding the compressed batches very well, but this name doesn't really translate to chunks...

@svenklemm @akuzm How about chunk_restrictinfo as the name for the catalog instead of the dimension_range? Essentially, we are trying to add metadata which helps in eventual chunk exclusion. But then the API changes to

add_chunk_restrictinfo or remove_chunk_restrictinfo. Not sure if that's too much to expose to a user.

Or maybe, chunk_excludeinfo

To keep it simple, we could even call the catalog chunk_column_stats. That keeps it very generic and user friendly IMO.

FINAL decided names:

catalog: chunk_column_stats
APIs: enable_column_stats/disable_column_stats

@nikkhils nikkhils changed the title from "Add support for column dimension range tracking" to "Add support for chunk column statistics tracking" Jun 20, 2024
@nikkhils nikkhils force-pushed the cat_dim_range branch 5 times, most recently from 88eafbe to fe2c43e Compare June 20, 2024 09:27
Member

@erimatnor erimatnor left a comment

Ok, did another pass.

Overall, the code looks good from the technical standpoint. Just a few nits and naming issues.

I still have some concerns about the schema, especially the semantics around "valid" and storing ranges and "settings" in the same table.

But the biggest issue right now is lack of some tests. See inline comments for some things I am missing. I suspect that adding those tests (and maybe more) will reveal a number of issues.

sql/ddl_api.sql Outdated
hypertable REGCLASS,
column_name NAME,
if_not_exists BOOLEAN = FALSE
) RETURNS TABLE(hypertable_id INT, column_name NAME, dropped BOOL)
Member

@erimatnor erimatnor Jun 26, 2024

Inconsistent naming: "disabled" vs "dropped". I think the semantics are also different: disable means just making it "inert", while dropping means removing it. Which one is it?

If dropped=true, does it mean:

  1. All column stats are removed
  2. Only the setting is removed, so no new stats are added, but existing stats are still there and "enabled" (used for chunk exclusion)
  3. Or, the setting is removed, and stats remain but are no longer used?
  4. The setting remains but is "disabled" (e.g., all entries marked "invalid")? In that case, what is it that is dropped?

From the description of the function, I think maybe (2) is what happens. But it is not clear. I guess chunk exclusion on existing stats still happens?

Suggested change
) RETURNS TABLE(hypertable_id INT, column_name NAME, dropped BOOL)
) RETURNS TABLE(hypertable_id INT, column_name NAME, disabled BOOL)

Contributor Author

@erimatnor from the above list,

1. All column stats are removed is what happens.

Again, this goes back to the naming discussion we had. I had suggested add_column_stats/remove_column_stats but we felt enable/disable is a better naming convention.

SELECT * from _timescaledb_catalog.chunk_column_stats;
id | hypertable_id | chunk_id | column_name | column_type | range_start | range_end | valid
----+---------------+----------+-------------+-------------+----------------------+---------------------+-------
1 | 1 | 0 | sensor_id | integer | -9223372036854775808 | 9223372036854775807 | t
Member

Not sure what the difference is, if any, of having an infinite range vs. being invalid.

To me it seems like an entry is invalid (effectively) if it has an infinite range. So, why do we need the valid flag?

Or, we could just set the range to NULL, NULL for any invalid entry.

Now there seems to be multiple "compatible" states:

  • non-infinite range and invalid
  • infinite range and valid

then also

  • infinite range and valid (what does this mean?)
  • non-infinite range and invalid

I am not sure I understand the meaning of all these states.

Contributor Author

@erimatnor

If DML happens into a
compressed chunk then as part of marking it as partial, we also mark
the corresponding catalog entries as "invalid". So partial chunks do
not get excluded further. When recompression happens we get the new
min/max ranges from the uncompressed portion and try to reconcile the
ranges in the catalog based on these new values. This is safe to do in
case of INSERTs and UPDATEs. In case of DELETEs, since we are deleting
rows, it's possible that the min/max ranges change, but as of now we
err on the side of caution and retain the earlier values which can be
larger than the actual range.

SELECT * from _timescaledb_catalog.chunk_column_stats WHERE chunk_id = :'CHUNK_ID';
id | hypertable_id | chunk_id | column_name | column_type | range_start | range_end | valid
----+---------------+----------+-------------+-------------+-------------+-----------+-------
2 | 1 | 1 | sensor_id | integer | 1 | 9 | f
Member

Instead of setting valid=false, why not just set an infinite range or NULL, NULL and get rid of the valid flag?

Contributor Author

as explained above. We will update this entry later during recompression. If we do not retain these ranges then we have no way of getting the range back post recompression.

column_type REGTYPE NOT NULL,
range_start bigint NOT NULL,
range_end bigint NOT NULL,
valid boolean NOT NULL,
Member

@erimatnor erimatnor Jun 27, 2024

What is the valid flag for if we can just set the range to -infinite,infinite or NULL, NULL? This seems like another unnecessary addition to the schema that creates many combinations of semi-ambiguous states.

What is the difference between these states, for example:

  • infinite range, valid
  • infinite range, invalid
  • non-infinite range, valid
  • non-infinite range, invalid

See more comments on this in the test file.

Contributor Author

explained above.

Member

Sorry, I might not have been clear. It seems to me that an infinite range has the same effect as "invalid", so why have both?

In other words, an infinite range basically renders the stats useless as it always selects the chunk, which is the same as invalid. Or am I missing something?

Contributor Author

@erimatnor I explained this in another response. Adding it here again:

If DML happens into a
compressed chunk then as part of marking it as partial, we also mark
the corresponding catalog entries as "valid = false". So partial chunks do
not get excluded further. When recompression happens we get the new
min/max ranges from the uncompressed portion and try to reconcile the
ranges in the catalog based on these new values. This is safe to do in
case of INSERTs and UPDATEs. In case of DELETEs, since we are deleting
rows, it's possible that the min/max ranges change, but as of now we
err on the side of caution and retain the earlier values which can be
larger than the actual range. So, invalid entries retain the earlier calculated range
and we get to re-use a portion of that range when recompression happens of a partial
chunk.

@nikkhils nikkhils added this to the TimescaleDB 2.16.0 milestone Jun 27, 2024
@nikkhils nikkhils force-pushed the cat_dim_range branch 2 times, most recently from 8522e40 to c8bdcbd Compare July 3, 2024 11:51
@nikkhils
Contributor Author

nikkhils commented Jul 3, 2024

@erimatnor @akuzm @svenklemm @mkindahl Added a bunch of additional tests in the latest force push.

@nikkhils nikkhils force-pushed the cat_dim_range branch 4 times, most recently from d359512 to f7bfc0a Compare July 4, 2024 10:06
@nikkhils nikkhils requested a review from mkindahl July 4, 2024 14:14
Contributor

@mkindahl mkindahl left a comment

Generally, it looks good. I think there are a few checks missing in critical places (e.g., a missing check if you actually managed to get a hypertable for a relid), but otherwise there is nothing that I think should block a merge.

I have a few comments for your consideration regarding the user interface, but these should not block a merge.

-- >, <, = comparators) by storing their textual representation.
--
CREATE TABLE _timescaledb_catalog.chunk_column_stats (
id serial NOT NULL,
Contributor

Nit: I can't see a reference to this attribute. Is it needed? In the C code, it mostly seems to be passed around, but not really used for anything.

Contributor Author

There's a function ts_chunk_column_stats_update_by_id which uses this id field to update an existing entry. Otherwise every lookup will have to do a ht_id, chunk_id, col_name search.
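That is, something along these lines (an illustrative SQL sketch of what the C function does, not the actual scan code):

```sql
-- A lookup/update by the surrogate id avoids the three-column
-- (hypertable_id, chunk_id, column_name) search:
UPDATE _timescaledb_catalog.chunk_column_stats
   SET range_start = 1, range_end = 16, valid = true
 WHERE id = 2;
```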

Contributor

Is that function commonly used? If not, it seems to make more sense to use the composite key as the primary key. It makes sense to have an id column if you have a foreign key to the table, or if the id is needed to make otherwise similar rows unique (say, being able to store multiple rows with (hypertable, chunk, colname)), but that does not seem to be the case here.

Comment on lines +239 to +240
CONSTRAINT chunk_column_stats_pkey PRIMARY KEY (id),
CONSTRAINT chunk_column_stats_ht_id_chunk_id_colname_key UNIQUE (hypertable_id, chunk_id, column_name),
Contributor

Nit: Shouldn't the second constraint be the primary key here?

Contributor Author

UNIQUE helps us get the same semantics, no? All these columns are NOT NULL as well. The id column is needed for update lookups as mentioned above.

Contributor

It is essentially the same, but a primary key provides a visual cue to what is the "main" key.

Comment on lines +837 to +925
chunk_column_stats_scan_internal(scankey,
2,
chunk_column_stats_tuple_delete,
&count,
0,
CHUNK_COLUMN_STATS_HT_ID_CHUNK_ID_COLUMN_NAME_IDX,
RowExclusiveLock,
CurrentMemoryContext);
Contributor

With a RowExclusiveLock this could potentially fail if concurrent updates are happening. Since this is not intended to be transactional, it might make sense to require a stronger lock (which would make the call wait until it can get the lock and then do the change atomically). It would be simpler to use since you would not have to check that the function succeeds. However, that could also lead to deadlocks if the function is used in a transaction in that case.

Contributor Author

DML typically takes RowExclusiveLock and that's what we are sticking to here. We don't expect a lot of concurrent updates to this catalog. That's one of the reasons why we made a separate catalog here instead of re-using the existing dimension_slice catalog, which is already prone to a lot of contention.

@nikkhils nikkhils force-pushed the cat_dim_range branch 4 times, most recently from 7d28a88 to 1e2b9e9 Compare July 8, 2024 11:27
{
DIMENSION_TYPE_OPEN,
DIMENSION_TYPE_CLOSED,
DIMENSION_TYPE_STATS,
Member

Why are we treating this as a dimension type? I think the consensus was that we don't want to mix up this feature with dimensions.

Contributor Author

@svenklemm This allows us to use the existing dimension restrict info stuff to remove/use these restrictions if they are not needed for the query.

Member

FWIW, I don't think there was ever consensus on not mixing this with dimensions. IMO, there's a clear overlap in functionality, as @nikkhils points out.

Member

@erimatnor erimatnor left a comment

Approving.

Still have some reservations, however.

  • The semantics around invalidation of stats is unclear. I don't understand what the purpose of an invalid entry is and the difference is between -inf/+inf vs invalid=true vs deleting the entry. Are entries marked invalid=true or -inf/+inf actually used for anything? It seems to me that we don't need the invalidation stuff at all. Why not just have stats when the chunk is compressed and then delete the stats when decompressed? The semantics would be much simpler. If we want stats for non-compressed data in the future we could add an invalidation framework if needed. Still, I think we either have stats and it is valid or we don't have stats (and it doesn't exist).
  • Still lack isolation/concurrency tests.
  • Test coverage seems to be lacking in some places.
  • Still think the table schemas can be simplified, but not a blocker.

column_type REGTYPE NOT NULL,
range_start bigint NOT NULL,
range_end bigint NOT NULL,
valid boolean NOT NULL,
Member

Sorry, I might not have been clear. It seems to me that an infinite range has the same effect as "invalid", so why have both?

In other words, an infinite range basically renders the stats useless as it always selects the chunk, which is the same as invalid. Or am I missing something?

case DIMENSION_TYPE_CLOSED:
p->coordinates[p->num_coords++] = (int64) DatumGetInt32(datum);
break;
case DIMENSION_TYPE_STATS:
Member

would be nice to have test coverage here

Contributor Author

Adding a bunch of tests for additional test coverage

{
DIMENSION_TYPE_OPEN,
DIMENSION_TYPE_CLOSED,
DIMENSION_TYPE_STATS,
Member

FWIW, I don't think there was ever consensus on not mixing this with dimensions. IMO, there's a clear overlap in functionality, as @nikkhils points out.

&restype);
int64 value = ts_time_value_to_internal_or_infinite(datum, restype);

switch (strategy)
Member

Seems to be a lack of test coverage here.

Contributor Author

added a bunch of tests

*/
switch (open->upper_strategy)
{
case BTLessEqualStrategyNumber:
Member

Another switch that seems to lack test coverage. It would be nice to improve this.

case BTGreaterStrategyNumber: /* e.g: id >= 9 */
{
/* range_end is exclusive, so check accordingly */
if ((fd->range_end - 1) >= open->upper_bound)
Member

Looks like this is the only case being hit in tests.

_timescaledb_internal._hyper_1_1_chunk
(1 row)

-- Entry should be removed for this chunk now
Member

My point was that the entry doesn't have to be set to -inf/+inf because the constraint is still valid after decompression. So, the idea I alluded to is to change it so that it gets invalidated when the actual data changes.

I am not sure we don't have the infra for it (we do DML decompression on the same kinds of inserts/updates/deletes, for example). But it is OK to not have this for the first version if you decide it is not worth the effort.

Just want to point out, again, though, the confusion around "invalidation" vs setting an infinite range vs deleting. In the test, the range seems to be set to -inf/+inf, but the entry is still marked as "valid". You also mention yourself, in the comment, that the entry is invalidated, but it still has valid=true. IMO, this is confusing.

I am actually not sure what the point is in keeping the entry vs just deleting it. Are invalidated or infinite-range entries actually used for anything? What is the advantage of having the invalid entry in the table vs just adding a new one once the chunk is compressed again?

@nikkhils nikkhils force-pushed the cat_dim_range branch 4 times, most recently from 614eab2 to c4fde33 Compare July 12, 2024 08:20
Allow users to specify that ranges (min/max values) be tracked
for a specific column using the enable_column_stats() API. We
will store such min/max ranges in a new timescaledb catalog table
_timescaledb_catalog.chunk_column_stats. As of now we support tracking
min/max ranges for the smallint, int, bigint, serial, bigserial, date,
timestamp, and timestamptz data types. Support for other statistics,
such as bloom filters, will be added in the future.

We add an entry of the form (ht_id, invalid_chunk_id, col, -INF, +INF)
into this catalog to indicate that min/max values need to be calculated
for this column in a given hypertable for chunks. We also iterate
through existing chunks and add -INF, +INF entries for them in the
catalog. This allows for selection of these chunks by default since no
min/max values have been calculated for them.

The actual min/max start/end range is calculated later. One of the
entry points is during compression for now. The range is stored in
start (inclusive) and end (exclusive) form. If DML happens into a
compressed chunk then, as part of marking it as partial, we also mark
the corresponding catalog entries as "invalid", so partial chunks are
no longer excluded. When recompression happens we get the new
min/max ranges from the uncompressed portion and try to reconcile the
ranges in the catalog based on these new values. This is safe to do in
case of INSERTs and UPDATEs. In case of DELETEs, since we are deleting
rows, it's possible that the min/max ranges change, but as of now we
err on the side of caution and retain the earlier values which can be
larger than the actual range.

We can thus store the min/max values for such columns in this catalog
table at the per-chunk level. Note that these min/max range values do
not participate in partitioning of the data. Such data ranges will be
used for chunk pruning if the WHERE clause of an SQL query specifies
ranges on such a column.

Note that the executor's startup-time chunk exclusion logic is also able
to use this metadata effectively.

A "DROP COLUMN" on a column with statistics tracking enabled on it
ends up removing all relevant entries from the catalog tables.

A "decompress_chunk" on a compressed chunk removes its entries from the
"chunk_column_stats" catalog table since now it's available for DML.

Also a new "disable_column_stats" API has been introduced to allow
removal of min/max entries from the catalog for a specific column.
@nikkhils nikkhils merged commit 50bca31 into timescale:main Jul 12, 2024
@nikkhils nikkhils deleted the cat_dim_range branch July 12, 2024 09:13
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this pull request Jul 18, 2024
This release contains performance improvements and bug fixes since
the 2.15.3 release. We recommend that you upgrade at the next
available opportunity.

**Features**
* timescale#6880: Add support for the array operators used for compressed DML batch filtering.
* timescale#6895: Improve the compressed DML expression pushdown.
* timescale#6897: Add support for replica identity on compressed hypertables.
* timescale#6918: Remove support for PG13.
* timescale#6920: Rework compression activity wal markers.
* timescale#6989: Add support for foreign keys when converting plain tables to hypertables.
* timescale#7020: Add support for the chunk column statistics tracking.
* timescale#7048: Add an index scan for INSERT DML decompression.
* timescale#7075: Reduce decompression on the compressed INSERT.
* timescale#7101: Reduce decompressions for the compressed UPDATE/DELETE.
* timescale#7108: Reduce decompressions for INSERTs with `UNIQUE` constraints.

**Bugfixes**
* timescale#7018: Fix `search_path` quoting in the compression defaults function.
* timescale#7046: Prevent locking for compressed tuples.
* timescale#7055: Fix the `scankey` for `segment by` columns, where the type `constant` is different from `variable`.
* timescale#7064: Fix the bug in the default `order by` calculation in compression.
* timescale#7069: Fix the index column name usage.
* timescale#7074: Fix the bug in the default `segment by` calculation in compression.

**Thanks**
@pallavisontakke mentioned this pull request Jul 18, 2024
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this pull request Jul 25, 2024
@pallavisontakke mentioned this pull request Jul 25, 2024
pallavisontakke added a commit to pallavisontakke/timescaledb that referenced this pull request Jul 31, 2024
@pallavisontakke mentioned this pull request Jul 31, 2024
svenklemm added a commit that referenced this pull request Jul 31, 2024
This release contains significant performance improvements when working with compressed data, extended join
support in continuous aggregates, and the ability to define foreign keys from regular tables towards hypertables.
We recommend that you upgrade at the next available opportunity.

In TimescaleDB v2.16.0 we:

* Introduce multiple performance focused optimizations for data manipulation operations (DML) over compressed chunks.

  Improved upsert performance by more than 100x in some cases and more than 1000x in some update/delete scenarios.

* Add the ability to define chunk skipping indexes on non-partitioning columns of compressed hypertables.

  TimescaleDB v2.16.0 extends chunk exclusion to use those skipping (sparse) indexes when queries filter on the relevant columns,
  and prune chunks that do not include any relevant data for calculating the query response.

* Offer new options for use cases that require foreign keys to be defined.

  You can now add foreign keys from regular tables towards hypertables. We have also removed
  some really annoying locks in the reverse direction that blocked access to referenced tables
  while compression was running.

* Extend Continuous Aggregates to support more types of analytical queries.

  More types of joins are supported, additional equality operators on join clauses, and
  support for joins between multiple regular tables.

**Highlighted features in this release**

* Improved query performance through chunk exclusion on compressed hypertables.

  You can now define chunk skipping indexes on compressed chunks for any column with one of the following
  data types: `smallint`, `int`, `bigint`, `serial`, `bigserial`, `date`, `timestamp`, `timestamptz`.

  After you call `enable_chunk_skipping` on a column, TimescaleDB tracks the min and max values for
  that column. TimescaleDB uses that information to exclude chunks for queries that filter on that
  column, and would not find any data in those chunks.
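
  As a hedged usage sketch, assuming `enable_chunk_skipping` takes the hypertable and column name (the table and column names are illustrative):

```sql
-- Track per-chunk min/max for a column with a supported data type.
SELECT enable_chunk_skipping('orders', 'order_id');

-- Queries filtering on that column can now exclude chunks whose tracked
-- range cannot match the predicate:
SELECT * FROM orders WHERE order_id BETWEEN 100000 AND 200000;
```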

* Improved upsert performance on compressed hypertables.

  By using index scans to verify constraints during inserts on compressed chunks, TimescaleDB speeds
  up some ON CONFLICT clauses by more than 100x.
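
  For example, a standard PostgreSQL upsert of this shape now benefits from index-scan based constraint checks on compressed chunks (the table, columns, and the unique constraint on `(ts, device_id)` are illustrative assumptions):

```sql
INSERT INTO conditions (ts, device_id, value)
VALUES ('2024-07-01 00:00+00', 7, 23.5)
ON CONFLICT (ts, device_id)
DO UPDATE SET value = EXCLUDED.value;
```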

* Improved performance of updates, deletes, and inserts on compressed hypertables.

  By filtering data while accessing the compressed data and before decompressing, TimescaleDB has
  improved performance for updates and deletes on all types of compressed chunks, as well as inserts
  into compressed chunks with unique constraints.

  By signaling constraint violations without decompressing, or decompressing only when matching
  records are found in the case of updates, deletes and upserts, TimescaleDB v2.16.0 speeds
  up those operations more than 1000x in some update/delete scenarios, and 10x for upserts.

* You can add foreign keys from regular tables to hypertables, with support for all types of cascading options.
  This is useful for hypertables that partition using sequential IDs, and need to reference those IDs from other tables.
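
  An illustrative sketch (the names and the integer partitioning interval are assumptions):

```sql
-- Hypertable partitioned by a sequential ID.
CREATE TABLE events (
    event_id bigint NOT NULL,
    payload  jsonb,
    PRIMARY KEY (event_id)
);
SELECT create_hypertable('events', 'event_id', chunk_time_interval => 1000000);

-- A regular table can now reference the hypertable, with cascading options.
CREATE TABLE annotations (
    event_id bigint REFERENCES events (event_id) ON DELETE CASCADE,
    note     text
);
```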

* Lower locking requirements during compression for hypertables with foreign keys.

  Advanced foreign key handling removes the need for locking referenced tables when new chunks are compressed.
  DML is no longer blocked on referenced tables while compression runs on a hypertable.

* Improved support for queries on Continuous Aggregates.

  `INNER/LEFT` and `LATERAL` joins are now supported. Plus, you can now join with multiple regular tables,
  and you can have more than one equality operator on join clauses.
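
  For instance, a continuous aggregate joining a hypertable with a regular table might look like this (the names are illustrative, and the exact supported join syntax may differ):

```sql
CREATE MATERIALIZED VIEW daily_device_avg
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', c.ts) AS day,
       d.name,
       avg(c.value) AS avg_value
FROM conditions c
JOIN devices d ON d.device_id = c.device_id
GROUP BY day, d.name;
```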

**PostgreSQL 13 support removal announcement**

Following the deprecation announcement for PostgreSQL 13 in TimescaleDB v2.13,
PostgreSQL 13 is no longer supported in TimescaleDB v2.16.

The currently supported PostgreSQL major versions are 14, 15, and 16.

**Features**
* #6880: Add support for the array operators used for compressed DML batch filtering.
* #6895: Improve the compressed DML expression pushdown.
* #6897: Add support for replica identity on compressed hypertables.
* #6918: Remove support for PG13.
* #6920: Rework compression activity wal markers.
* #6989: Add support for foreign keys when converting plain tables to hypertables.
* #7020: Add support for the chunk column statistics tracking.
* #7048: Add an index scan for INSERT DML decompression.
* #7075: Reduce decompression on the compressed INSERT.
* #7101: Reduce decompressions for the compressed UPDATE/DELETE.
* #7108: Reduce decompressions for INSERTs with `UNIQUE` constraints.
* #7116: Use DELETE instead of TRUNCATE after compression.
* #7134: Refactor foreign key handling for compressed hypertables.
* #7161: Fix `mergejoin input data is out of order`.

**Bugfixes**
* #6987: Fix `REASSIGN OWNED BY` for background jobs.
* #7018: Fix `search_path` quoting in the compression defaults function.
* #7046: Prevent locking for compressed tuples.
* #7055: Fix the `scankey` for `segment by` columns, where the type `constant` is different from `variable`.
* #7064: Fix the bug in the default `order by` calculation in compression.
* #7069: Fix the index column name usage.
* #7074: Fix the bug in the default `segment by` calculation in compression.

**Thanks**
* @jledentu For reporting a problem with mergejoin input order