Fix skip indexes on data read with patch parts by alexey-milovidov · Pull Request #99543 · ClickHouse/ClickHouse

alexey-milovidov · 2026-03-16T06:22:02Z

Continuation of #88702 - rebased on current master and fixed tests.

The changes:

Extract canUseIndex/canUseMergedIndex as static methods on MergeTreeDataSelectExecutor so they can be reused in multiple places.
Add canUseIndex checks in MergeTreeSkipIndexReader::read to skip indexes that depend on columns being updated on the fly by lightweight updates.
Add canUseIndex check in collectTextIndexReadInfos to prevent direct reading from text index when the indexed column is updated.
Fix MergeTreeReadersChain to only read and apply patches at the step that actually has columns needing patches, not always at step 0.
Fix addPatchPartsColumns to find the correct first step with patches instead of assuming it is always the first prewhere step.

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed usage of skip indexes on data read (with enabled setting use_skip_indexes_on_data_read) and existing patch parts created by lightweight updates.

Continuation of #88702 - rebased on current master and fixed tests. The changes: - Extract `canUseIndex`/`canUseMergedIndex` as static methods on `MergeTreeDataSelectExecutor` so they can be reused in multiple places. - Add `canUseIndex` checks in `MergeTreeSkipIndexReader::read` to skip indexes that depend on columns being updated on the fly by lightweight updates. - Add `canUseIndex` check in `collectTextIndexReadInfos` to prevent direct reading from text index when the indexed column is updated. - Fix `MergeTreeReadersChain` to only read and apply patches at the step that actually has columns needing patches, not always at step 0. - Fix `addPatchPartsColumns` to find the correct first step with patches instead of assuming it is always the first prewhere step. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-03-16T06:22:44Z

Workflow [PR], commit [2e72237]

Summary: ❌

job_name	test_name	status	info
Integration tests (amd_llvm_coverage, 2/5)		failure
	test_overcommit_tracker/test.py::test_user_overcommit	FAIL	cidb
Performance Comparison (arm_release, master_head, 2/6)		failure
	Check Results	failure

AI Review

Summary

This PR fixes skip-index usage with lightweight updates and patch parts by adding canUseIndex checks in both the main skip-index path and direct text-index optimization, and by adjusting reader-chain behavior around patch steps. I found two significant issues: a compilation-time regression from adding a heavy include in a high-fan-out header, and insufficient regression coverage after deleting an existing non-text skip-index test. Verdict: request changes until these are addressed.

Findings

⚠️ Majors
- [src/Storages/MergeTree/MergeTreeDataSelectExecutor.h:4] Common/LoggingFormatStringHelpers.h was added to a high-fan-out header. This increases transitive template-heavy compile cost across many translation units.
  - Suggested fix: forward-declare PreformattedMessage (or add a lightweight forward header) in the header and include LoggingFormatStringHelpers.h only in MergeTreeDataSelectExecutor.cpp.
- [tests/queries/0_stateless/03100_lwu_48_text_index.sql:27] The PR deletes 03100_lwu_36_json_skip_indexes.* and adds only text-index coverage, but also changes generic skip-index gating in MergeTreeSkipIndexReader::read. This leaves ordinary skip-index paths under patch parts under-tested.
  - Suggested fix: keep/add a non-text skip-index regression test (for example bloom/json-path skip index with lightweight updates + use_skip_indexes_on_data_read) so both generic and text-index paths are covered.

Tests

⚠️ Restore or replace the removed non-text skip-index regression coverage; current tests mostly validate direct text-index read behavior and do not fully cover the generic skip-index path changed in this PR.

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	⚠️	MergeTree read/index paths changed; two review issues found (compile-time impact and test coverage gap).
No test removal	⚠️	Existing non-text regression test was removed without equivalent replacement coverage.
Experimental gate	➖
No magic constants	✅
Backward compatibility	✅
`SettingsChangesHistory.cpp`	➖
PR metadata quality	✅
Safe rollout	⚠️	Missing generic skip-index regression coverage increases rollout risk.
Compilation time	⚠️	Heavy transitive include added in high-fan-out header.

Final Verdict

Status: ⚠️ Request changes
Minimum required actions:
- Remove the heavy transitive include from MergeTreeDataSelectExecutor.h (use forward declaration/lightweight include split).
- Add/restore non-text skip-index regression coverage for lightweight-update patch-part reads.

…tion The previous commit tried to defer `readPatches` to the step that has patched columns. This caused "Cannot clone block with columns because block has N columns, but M columns given" when the patches were at step i > 0: `readPatches` was called with the full `getSampleBlock` header (all accumulated columns) but `read_result.columns` only had columns from earlier steps. Revert `MergeTreeReadersChain`, `MergeTreeBlockReadUtils`, and `MergeTreeReadersChain.h` to their pre-patch state. The skip index filtering logic (canUseIndex, canUseMergedIndex, text index checks) is preserved — those changes are independent and correct. https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=99543&sha=00993298544fc0a4d873f9398b88b15f8a2e293b&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The `03100_lwu_36_json_skip_indexes` test cannot reproduce the bug because `supportsSkipIndexesOnDataRead` returns false when `hasPatchParts()` is true, and removing this guard exposes a `_part_offset` column error. Remove this test for now since the `MergeTreeSkipIndexReader::read` path is unreachable with patch parts. The `03100_lwu_48_text_index` test is redesigned to actually trigger the bug in `optimizeDirectReadFromTextIndex`. The previous test only removed matches via UPDATE (making the stale text index conservative but still correct). The new test adds matches via UPDATE so that the stale text index misses rows that should be returned: `hasToken` with `splitByNonAlpha` tokenizer uses `Exact` direct read mode, which replaces the original WHERE with a virtual column. On master without the `canUseIndex` check, the stale index returns 1 instead of 11 for count(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ip-indexes-patches-v2

CLAassistant · 2026-03-24T14:25:04Z

All committers have signed the CLA.

…s-v2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-03-28T23:07:35Z

src/Storages/MergeTree/MergeTreeDataSelectExecutor.h

 #pragma once

+#include <expected>
+#include <Common/LoggingFormatStringHelpers.h>


⚠️ This adds Common/LoggingFormatStringHelpers.h into MergeTreeDataSelectExecutor.h, which is a high-fan-out header used across MergeTree read paths. LoggingFormatStringHelpers.h is template-heavy (fmt, compile-time checks), so pulling it into this header multiplies rebuild cost across many translation units.

Please avoid the transitive include here: forward-declare PreformattedMessage (or add a lightweight fwd header) in the .h, and keep the full LoggingFormatStringHelpers.h include in .cpp where canUseIndex is defined.

clickhouse-gh · 2026-03-28T23:07:36Z

tests/queries/0_stateless/03100_lwu_48_text_index.sql

+
+SELECT count() FROM t_lwu_text_index WHERE hasToken(str, '777');
+
+-- Update str to ADD '777' token to rows in a different text index mark.


⚠️ The PR removes 03100_lwu_36_json_skip_indexes.* and adds only a text-index regression test. But this change also modifies the generic skip-index filtering path in MergeTreeSkipIndexReader::read (canUseIndex gating), not only text-index direct read.

Please keep/add an equivalent non-text skip-index coverage case (e.g. bloom_filter/JSON path with patch parts and use_skip_indexes_on_data_read) so we don't lose regression protection for ordinary skip indexes.

clickhouse-gh · 2026-03-31T19:16:28Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.10%	84.10%	+0.00%
Functions	90.90%	90.90%	+0.00%
Branches	76.70%	76.70%	+0.00%

Changed lines: 94.12% (80/85) · Uncovered code

Full report · Diff report

alexey-milovidov mentioned this pull request Mar 16, 2026

Fix skip indexes on data read with patch parts #88702

Closed

clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 16, 2026

alexey-milovidov and others added 5 commits March 18, 2026 12:07

Remove unused canUseMergedIndex method

c78571c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge branch 'master' into skip-indexes-patches-v2

869cc90

Merge remote-tracking branch 'origin/skip-indexes-patches-v2' into sk…

48396fa

…ip-indexes-patches-v2

alexey-milovidov force-pushed the skip-indexes-patches-v2 branch from e652ab4 to 48396fa Compare March 26, 2026 23:47

Merge remote-tracking branch 'origin/master' into skip-indexes-patche…

19888c0

…s-v2 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

clickhouse-gh bot reviewed Mar 28, 2026

View reviewed changes

Merge branch 'master' into skip-indexes-patches-v2

2e72237

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix skip indexes on data read with patch parts#99543

Fix skip indexes on data read with patch parts#99543
alexey-milovidov wants to merge 8 commits intomasterfrom
skip-indexes-patches-v2

alexey-milovidov commented Mar 16, 2026

Uh oh!

clickhouse-gh bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

CLAassistant commented Mar 24, 2026 •

edited

Loading

Uh oh!

clickhouse-gh bot Mar 28, 2026

Uh oh!

clickhouse-gh bot Mar 28, 2026

Uh oh!

clickhouse-gh bot commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants


		SELECT count() FROM t_lwu_text_index WHERE hasToken(str, '777');

		-- Update str to ADD '777' token to rows in a different text index mark.

Conversation

alexey-milovidov commented Mar 16, 2026

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Uh oh!

clickhouse-gh bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Findings

Tests

ClickHouse Rules

Final Verdict

Uh oh!

CLAassistant commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clickhouse-gh bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh bot commented Mar 31, 2026

LLVM Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clickhouse-gh bot commented Mar 16, 2026 •

edited

Loading

CLAassistant commented Mar 24, 2026 •

edited

Loading