Skip to content

Fix skip indexes on data read with patch parts#99543

Open
alexey-milovidov wants to merge 8 commits intomasterfrom
skip-indexes-patches-v2
Open

Fix skip indexes on data read with patch parts#99543
alexey-milovidov wants to merge 8 commits intomasterfrom
skip-indexes-patches-v2

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

Continuation of #88702 - rebased on current master and fixed tests.

The changes:

  • Extract canUseIndex/canUseMergedIndex as static methods on MergeTreeDataSelectExecutor so they can be reused in multiple places.
  • Add canUseIndex checks in MergeTreeSkipIndexReader::read to skip indexes that depend on columns being updated on the fly by lightweight updates.
  • Add canUseIndex check in collectTextIndexReadInfos to prevent direct reading from text index when the indexed column is updated.
  • Fix MergeTreeReadersChain to only read and apply patches at the step that actually has columns needing patches, not always at step 0.
  • Fix addPatchPartsColumns to find the correct first step with patches instead of assuming it is always the first prewhere step.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed usage of skip indexes on data read (with enabled setting use_skip_indexes_on_data_read) and existing patch parts created by lightweight updates.

Continuation of #88702 - rebased
on current master and fixed tests.

The changes:
- Extract `canUseIndex`/`canUseMergedIndex` as static methods on
  `MergeTreeDataSelectExecutor` so they can be reused in multiple places.
- Add `canUseIndex` checks in `MergeTreeSkipIndexReader::read` to skip
  indexes that depend on columns being updated on the fly by lightweight
  updates.
- Add `canUseIndex` check in `collectTextIndexReadInfos` to prevent
  direct reading from text index when the indexed column is updated.
- Fix `MergeTreeReadersChain` to only read and apply patches at the step
  that actually has columns needing patches, not always at step 0.
- Fix `addPatchPartsColumns` to find the correct first step with patches
  instead of assuming it is always the first prewhere step.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 16, 2026

Workflow [PR], commit [2e72237]

Summary:

job_name test_name status info comment
Integration tests (amd_llvm_coverage, 2/5) failure
test_overcommit_tracker/test.py::test_user_overcommit FAIL cidb
Performance Comparison (arm_release, master_head, 2/6) failure
Check Results failure

AI Review

Summary

This PR fixes skip-index usage with lightweight updates and patch parts by adding canUseIndex checks in both the main skip-index path and direct text-index optimization, and by adjusting reader-chain behavior around patch steps. I found two significant issues: a compilation-time regression from adding a heavy include in a high-fan-out header, and insufficient regression coverage after deleting an existing non-text skip-index test. Verdict: request changes until these are addressed.

Findings
  • ⚠️ Majors
    • [src/Storages/MergeTree/MergeTreeDataSelectExecutor.h:4] Common/LoggingFormatStringHelpers.h was added to a high-fan-out header. This increases transitive template-heavy compile cost across many translation units.
      • Suggested fix: forward-declare PreformattedMessage (or add a lightweight forward header) in the header and include LoggingFormatStringHelpers.h only in MergeTreeDataSelectExecutor.cpp.
    • [tests/queries/0_stateless/03100_lwu_48_text_index.sql:27] The PR deletes 03100_lwu_36_json_skip_indexes.* and adds only text-index coverage, but also changes generic skip-index gating in MergeTreeSkipIndexReader::read. This leaves ordinary skip-index paths under patch parts under-tested.
      • Suggested fix: keep/add a non-text skip-index regression test (for example bloom/json-path skip index with lightweight updates + use_skip_indexes_on_data_read) so both generic and text-index paths are covered.
Tests
  • ⚠️ Restore or replace the removed non-text skip-index regression coverage; current tests mostly validate direct text-index read behavior and do not fully cover the generic skip-index path changed in this PR.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny ⚠️ MergeTree read/index paths changed; two review issues found (compile-time impact and test coverage gap).
No test removal ⚠️ Existing non-text regression test was removed without equivalent replacement coverage.
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout ⚠️ Missing generic skip-index regression coverage increases rollout risk.
Compilation time ⚠️ Heavy transitive include added in high-fan-out header.
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Remove the heavy transitive include from MergeTreeDataSelectExecutor.h (use forward declaration/lightweight include split).
    • Add/restore non-text skip-index regression coverage for lightweight-update patch-part reads.

@clickhouse-gh clickhouse-gh bot added the pr-bugfix Pull request with bugfix, not backported by default label Mar 16, 2026
alexey-milovidov and others added 5 commits March 18, 2026 12:07
…tion

The previous commit tried to defer `readPatches` to the step that has
patched columns. This caused "Cannot clone block with columns because
block has N columns, but M columns given" when the patches were at step
i > 0: `readPatches` was called with the full `getSampleBlock` header
(all accumulated columns) but `read_result.columns` only had columns
from earlier steps.

Revert `MergeTreeReadersChain`, `MergeTreeBlockReadUtils`, and
`MergeTreeReadersChain.h` to their pre-patch state. The skip index
filtering logic (canUseIndex, canUseMergedIndex, text index checks)
is preserved — those changes are independent and correct.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=99543&sha=00993298544fc0a4d873f9398b88b15f8a2e293b&name_0=PR&name_1=Fast%20test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `03100_lwu_36_json_skip_indexes` test cannot reproduce the bug because
`supportsSkipIndexesOnDataRead` returns false when `hasPatchParts()` is true,
and removing this guard exposes a `_part_offset` column error. Remove this
test for now since the `MergeTreeSkipIndexReader::read` path is unreachable
with patch parts.

The `03100_lwu_48_text_index` test is redesigned to actually trigger the bug
in `optimizeDirectReadFromTextIndex`. The previous test only removed matches
via UPDATE (making the stale text index conservative but still correct).
The new test adds matches via UPDATE so that the stale text index misses rows
that should be returned: `hasToken` with `splitByNonAlpha` tokenizer uses
`Exact` direct read mode, which replaces the original WHERE with a virtual
column. On master without the `canUseIndex` check, the stale index returns
1 instead of 11 for count().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 24, 2026

CLA assistant check
All committers have signed the CLA.

@alexey-milovidov alexey-milovidov force-pushed the skip-indexes-patches-v2 branch from e652ab4 to 48396fa Compare March 26, 2026 23:47
…s-v2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
#pragma once

#include <expected>
#include <Common/LoggingFormatStringHelpers.h>
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ This adds Common/LoggingFormatStringHelpers.h into MergeTreeDataSelectExecutor.h, which is a high-fan-out header used across MergeTree read paths. LoggingFormatStringHelpers.h is template-heavy (fmt, compile-time checks), so pulling it into this header multiplies rebuild cost across many translation units.

Please avoid the transitive include here: forward-declare PreformattedMessage (or add a lightweight fwd header) in the .h, and keep the full LoggingFormatStringHelpers.h include in .cpp where canUseIndex is defined.


SELECT count() FROM t_lwu_text_index WHERE hasToken(str, '777');

-- Update str to ADD '777' token to rows in a different text index mark.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ The PR removes 03100_lwu_36_json_skip_indexes.* and adds only a text-index regression test. But this change also modifies the generic skip-index filtering path in MergeTreeSkipIndexReader::read (canUseIndex gating), not only text-index direct read.

Please keep/add an equivalent non-text skip-index coverage case (e.g. bloom_filter/JSON path with patch parts and use_skip_indexes_on_data_read) so we don't lose regression protection for ordinary skip indexes.

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 31, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.70% 76.70% +0.00%

Changed lines: 94.12% (80/85) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants