Add index scan to INSERT DML decompression by antekresic · Pull Request #7048 · timescale/timescaledb

antekresic · 2024-06-19T13:23:44Z

In order to verify constraints, we have to decompress batches that could contain duplicates of the tuples we are inserting. To find such batches, we currently use heap scans, which can be very expensive if the compressed chunk contains a lot of tuples. An index scan makes much more sense in this scenario and should give significant performance benefits.

Additionally, we don't want to create the decompressor until we know we actually need to decompress a batch, so we initialize it lazily once a matching batch is found.
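The lazy initialization described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this PR: the `Decompressor` struct, `decompressor_init`, and `decompress_matching_batches` are hypothetical names, and the `batch_matches` array stands in for whatever the scan reports about each candidate batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the decompressor state (not TimescaleDB's struct). */
typedef struct Decompressor
{
	bool initialized;       /* has the expensive setup run yet? */
	int  init_count;        /* how many times the setup ran */
	int  batches_processed; /* how many batches were decompressed */
} Decompressor;

/* Expensive setup that we want to skip when no batch needs decompression. */
void
decompressor_init(Decompressor *d)
{
	d->initialized = true;
	d->init_count++;
}

/*
 * Walk the candidate batches. batch_matches[i] is true when batch i could
 * contain a duplicate of the tuple being inserted. The decompressor is set
 * up only when the first such batch is found.
 * Returns the number of batches that were decompressed.
 */
int
decompress_matching_batches(Decompressor *d, const bool *batch_matches, size_t nbatches)
{
	int decompressed = 0;

	for (size_t i = 0; i < nbatches; i++)
	{
		if (!batch_matches[i])
			continue;

		/* Lazy initialization: pay the setup cost on first use only. */
		if (!d->initialized)
			decompressor_init(d);

		d->batches_processed++;
		decompressed++;
	}

	return decompressed;
}
```

With this shape, an insert that finds no candidate batches never pays the setup cost, and an insert that finds several pays it once.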

codecov · 2024-06-20T07:52:09Z

Codecov Report

Attention: Patch coverage is 82.89474% with 39 lines in your changes missing coverage. Please review.

Project coverage is 81.86%. Comparing base (59f50f2) to head (0685126).
Report is 222 commits behind head on main.

Files	Patch %	Lines
tsl/src/compression/compression.c	83.03%	16 Missing and 22 partials ⚠️
src/nodes/chunk_dispatch/chunk_insert_state.c	75.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7048      +/-   ##
==========================================
+ Coverage   80.06%   81.86%   +1.79%     
==========================================
  Files         190      200      +10     
  Lines       37181    37297     +116     
  Branches     9450     9724     +274     
==========================================
+ Hits        29770    30533     +763     
+ Misses       2997     2861     -136     
+ Partials     4414     3903     -511



akuzm · 2024-06-20T13:22:16Z

tsl/src/compression/compression.c

+			if (index_rel->rd_index->indnatts - 1 == i)
+			{
+				if (strcmp(attname, COMPRESSION_COLUMN_METADATA_SEQUENCE_NUM_NAME) == 0)
+					matches = true;
+				break;
+			}


If some leading prefix of the index columns consists of segmentby columns, we can use the index for lookups by these segmentbys, no matter what index columns follow, right? Maybe we can simplify/generalize this condition accordingly.

Hmm, ideally we want the index with the most columns matching the unique constraints (bonus points for considering selectivity), and we could cache this selection for the chunk. But in the common case we only create one index on a compressed chunk, so maybe this is overkill.

I think the general approach is even simpler than what we have now: count the length of the prefix that consists of segmentbys, then choose the index with the longest such prefix. Or we could indeed account for selectivity in the same way; that would be a nice addition given that we have proper statistics for the segmentby columns.

This will be removed with my upcoming work for removing sequence numbers so I'll leave it for now.
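For reference, the prefix-counting idea suggested in this thread could look something like the sketch below. It is only an illustration under simplifying assumptions: indexes are modeled as plain arrays of column names rather than PostgreSQL relcache entries, and `IndexDesc`, `segmentby_prefix_len`, and `choose_index` are hypothetical names that do not exist in the codebase. Selectivity is not considered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical simplified index description: column names in index order. */
typedef struct IndexDesc
{
	const char *const *columns;
	size_t ncolumns;
} IndexDesc;

/* Is the given column one of the segmentby columns? */
bool
is_segmentby(const char *col, const char *const *segmentby, size_t nsegmentby)
{
	for (size_t i = 0; i < nsegmentby; i++)
	{
		if (strcmp(col, segmentby[i]) == 0)
			return true;
	}
	return false;
}

/* Count how many leading index columns are segmentby columns. */
size_t
segmentby_prefix_len(const IndexDesc *idx, const char *const *segmentby, size_t nsegmentby)
{
	size_t len = 0;

	while (len < idx->ncolumns && is_segmentby(idx->columns[len], segmentby, nsegmentby))
		len++;

	return len;
}

/*
 * Return the position of the index with the longest segmentby prefix,
 * or -1 if no index starts with a segmentby column. Ties go to the
 * first index found.
 */
int
choose_index(const IndexDesc *indexes, size_t nindexes, const char *const *segmentby,
			 size_t nsegmentby)
{
	int    best = -1;
	size_t best_len = 0;

	for (size_t i = 0; i < nindexes; i++)
	{
		size_t len = segmentby_prefix_len(&indexes[i], segmentby, nsegmentby);

		if (len > best_len)
		{
			best_len = len;
			best = (int) i;
		}
	}

	return best;
}
```

Under this scheme a single-column index on a segmentby column would also qualify, which is the case raised in the next review comment.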

akuzm · 2024-06-20T13:22:55Z

tsl/src/compression/compression.c

+		/* Must have at least two attributes. */
+		if (index_rel->rd_index->indnatts < 2)
+		{
+			index_close(index_rel, AccessShareLock);
+			continue;
+		}


Should we filter this out specifically? There might be a user-created index on a particular segmentby column they need.

Again, leaving this for the sequence number removal PR.



This release contains performance improvements and bug fixes since the 2.15.3 release. We recommend that you upgrade at the next available opportunity.

**Features**
* timescale#6880: Add support for the array operators used for compressed DML batch filtering.
* timescale#6895: Improve the compressed DML expression pushdown.
* timescale#6897: Add support for replica identity on compressed hypertables.
* timescale#6918: Remove support for PG13.
* timescale#6920: Rework compression activity wal markers.
* timescale#6989: Add support for foreign keys when converting plain tables to hypertables.
* timescale#7020: Add support for the chunk column statistics tracking.
* timescale#7048: Add an index scan for INSERT DML decompression.
* timescale#7075: Reduce decompression on the compressed INSERT.
* timescale#7101: Reduce decompressions for the compressed UPDATE/DELETE.
* timescale#7108: Reduce decompressions for INSERTs with UNIQUE constraints
* timescale#7116: Use DELETE instead of TRUNCATE after compression
* timescale#7134: Refactor foreign key handling for compressed hypertables
* timescale#7161: Fix `mergejoin input data is out of order`

**Bugfixes**
* timescale#6987: Fix REASSIGN OWNED BY for background jobs
* timescale#7018: Fix `search_path` quoting in the compression defaults function.
* timescale#7046: Prevent locking for compressed tuples.
* timescale#7055: Fix the `scankey` for `segment by` columns, where the type `constant` is different to `variable`.
* timescale#7064: Fix the bug in the default `order by` calculation in compression.
* timescale#7069: Fix the index column name usage.
* timescale#7074: Fix the bug in the default `segment by` calculation in compression.

**Thanks**
* @jledentu For reporting a problem with mergejoin input order


This release contains significant performance improvements when working with compressed data, extended join support in continuous aggregates, and the ability to define foreign keys from regular tables towards hypertables. We recommend that you upgrade at the next available opportunity.

In TimescaleDB v2.16.0 we:

* Introduce multiple performance-focused optimizations for data manipulation operations (DML) over compressed chunks. Improved upsert performance by more than 100x in some cases, and by more than 1000x in some update/delete scenarios.
* Add the ability to define chunk skipping indexes on non-partitioning columns of compressed hypertables. TimescaleDB v2.16.0 extends chunk exclusion to use those skipping (sparse) indexes when queries filter on the relevant columns, and prunes chunks that do not include any relevant data for calculating the query response.
* Offer new options for use cases that require foreign keys. You can now add foreign keys from regular tables towards hypertables. We have also removed some really annoying locks in the reverse direction that blocked access to referenced tables while compression was running.
* Extend Continuous Aggregates to support more types of analytical queries. More types of joins are supported, as are additional equality operators on join clauses and joins between multiple regular tables.

**Highlighted features in this release**

* Improved query performance through chunk exclusion on compressed hypertables. You can now define chunk skipping indexes on compressed chunks for any column with one of the following data types: `smallint`, `int`, `bigint`, `serial`, `bigserial`, `date`, `timestamp`, `timestamptz`. After you call `enable_chunk_skipping` on a column, TimescaleDB tracks the min and max values for that column. TimescaleDB uses that information to exclude chunks in which a query that filters on that column would not find any data.
* Improved upsert performance on compressed hypertables. By using index scans to verify constraints during inserts on compressed chunks, TimescaleDB speeds up some ON CONFLICT clauses by more than 100x.
* Improved performance of updates, deletes, and inserts on compressed hypertables. By filtering data while accessing the compressed data and before decompressing, TimescaleDB has improved performance for updates and deletes on all types of compressed chunks, as well as inserts into compressed chunks with unique constraints. By signaling constraint violations without decompressing, or decompressing only when matching records are found in the case of updates, deletes and upserts, TimescaleDB v2.16.0 speeds up those operations more than 1000x in some update/delete scenarios, and 10x for upserts.
* You can add foreign keys from regular tables to hypertables, with support for all types of cascading options. This is useful for hypertables that partition using sequential IDs, and need to reference those IDs from other tables.
* Lower locking requirements during compression for hypertables with foreign keys. Advanced foreign key handling removes the need for locking referenced tables when new chunks are compressed. DML is no longer blocked on referenced tables while compression runs on a hypertable.
* Improved support for queries on Continuous Aggregates. `INNER/LEFT` and `LATERAL` joins are now supported. Plus, you can now join with multiple regular tables, and you can have more than one equality operator on join clauses.

**PostgreSQL 13 support removal announcement**

Following the deprecation announcement for PostgreSQL 13 in TimescaleDB v2.13, PostgreSQL 13 is no longer supported in TimescaleDB v2.16. The currently supported PostgreSQL major versions are 14, 15 and 16.

**Features**
* #6880: Add support for the array operators used for compressed DML batch filtering.
* #6895: Improve the compressed DML expression pushdown.
* #6897: Add support for replica identity on compressed hypertables.
* #6918: Remove support for PG13.
* #6920: Rework compression activity wal markers.
* #6989: Add support for foreign keys when converting plain tables to hypertables.
* #7020: Add support for the chunk column statistics tracking.
* #7048: Add an index scan for INSERT DML decompression.
* #7075: Reduce decompression on the compressed INSERT.
* #7101: Reduce decompressions for the compressed UPDATE/DELETE.
* #7108: Reduce decompressions for INSERTs with UNIQUE constraints
* #7116: Use DELETE instead of TRUNCATE after compression
* #7134: Refactor foreign key handling for compressed hypertables
* #7161: Fix `mergejoin input data is out of order`

**Bugfixes**
* #6987: Fix REASSIGN OWNED BY for background jobs
* #7018: Fix `search_path` quoting in the compression defaults function.
* #7046: Prevent locking for compressed tuples.
* #7055: Fix the `scankey` for `segment by` columns, where the type `constant` is different to `variable`.
* #7064: Fix the bug in the default `order by` calculation in compression.
* #7069: Fix the index column name usage.
* #7074: Fix the bug in the default `segment by` calculation in compression.

**Thanks**
* @jledentu For reporting a problem with mergejoin input order

antekresic self-assigned this Jun 19, 2024

antekresic added the enhancement label Jun 19, 2024

antekresic added this to the TimescaleDB 2.16.0 milestone Jun 19, 2024

antekresic force-pushed the insert-index-scan branch 3 times, most recently from 94407e8 to a42cbc4 Compare June 20, 2024 07:42

antekresic force-pushed the insert-index-scan branch from a42cbc4 to 4242fb0 Compare June 20, 2024 07:59

antekresic marked this pull request as ready for review June 20, 2024 07:59

antekresic requested a review from svenklemm June 20, 2024 07:59