Skip to content

Native Common Virtuals for Hive#101069

Merged
Michicosun merged 4 commits intomasterfrom
native_common_virtuals/Hive
Apr 1, 2026
Merged

Native Common Virtuals for Hive#101069
Michicosun merged 4 commits intomasterfrom
native_common_virtuals/Hive

Conversation

@Michicosun
Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Part of #100766

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 29, 2026

Workflow [PR], commit [74881e4]

Summary:

job_name test_name status info comment
Integration tests (amd_llvm_coverage, 2/5) failure
test_overcommit_tracker/test.py::test_user_overcommit FAIL cidb
Stress test (arm_release) failure
Logical error: Shard number is greater than shard count: shard_num=A shard_count=B cluster=C (STID: 5066-457d) FAIL cidb
Stress test (arm_msan) failure
MemorySanitizer: use-of-uninitialized-value (STID: 2410-488a) FAIL cidb

AI Review

Summary

This PR adds support for native common virtuals in StorageHive by wiring requested virtual headers through the read pipeline and materializing _path, _file, and _table in StorageHiveSource. Code-level changes look safe and coherent for the scope, but PR metadata is inconsistent with the user-visible behavior change.

Findings

💡 Nits

  • [src/Storages/Hive/StorageHive.cpp:437] PR template metadata is likely mismatched: this PR adds a new user-visible virtual column (_table) for Hive reads, so Changelog category: Not for changelog is not appropriate.
    • Suggested fix: choose a changelog category aligned with user-visible SQL behavior and provide a concrete changelog entry, e.g. Hive table engine now supports _table virtual column in SELECT queries.
ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality ⚠️ Changelog category appears mismatched for a user-visible behavior change (_table virtual column in Hive).
Safe rollout
Compilation time
Final Verdict
  • Status: ⚠️ Request changes
  • Minimum required actions:
    • Update PR template metadata: pick an appropriate Changelog category and add a specific user-facing Changelog entry for Hive _table virtual column support.

@clickhouse-gh clickhouse-gh bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Mar 29, 2026
{
auto all_columns = metadata->getColumns().get(options);
const auto & common_virtual_columns = IStorage::getCommonVirtuals();
const auto common_virtual_columns = storage.getCommonVirtuals();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StorageSnapshot is documented as a snapshot that "fixes set columns" for a query, but here common virtuals are fetched from storage.getCommonVirtuals() on every call instead of being captured at snapshot creation.

Since setVirtuals updates both virtuals and common_virtuals, a concurrent metadata/virtuals update can make one StorageSnapshot observe old virtual_columns + new common_virtuals (or vice versa), causing inconsistent column visibility (e.g. transient loss/duplication of _table).

Please store common virtuals inside StorageSnapshot (similar to virtual_columns) and use that stored pointer in all StorageSnapshot methods.

if (std::dynamic_pointer_cast<StorageDistributed>(storage))
return false;

if (storage->getVirtualsPtr()->has("_table"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extractRequiredNonTableColumnsFromStorage uses storage->getVirtualsPtr()->has("_table") (current storage state) instead of storage_snapshot->virtual_columns (the captured snapshot used for planning/reading).

If virtual columns change concurrently, this branch can decide to keep _table in storage_column_names while the StorageSnapshot used for the read still doesn't expose _table, causing inconsistent planning/header behavior.

Please base this check on storage_snapshot->virtual_columns->has("_table") (same snapshot object used for the rest of the function). The same fix should be applied in MutationsInterpreter::extractRequiredNonTableColumnsFromStorage.

if (std::dynamic_pointer_cast<StorageDistributed>(storage))
return false;

if (storage->getVirtualsPtr()->has("_table"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same snapshot-consistency issue as in planner path: this check reads storage->getVirtualsPtr() (live mutable state) while the function otherwise relies on storage_snapshot.

If virtual columns are updated concurrently, _table handling can diverge from what storage_snapshot actually contains, causing inconsistent mutation read plan/header behavior.

Please use storage_snapshot->virtual_columns->has("_table") here as well.

VirtualColumnsDescription virtuals_desc;
virtuals_desc.addEphemeral("_path", std::make_shared<DataTypeLowCardinality>(std::make_shared<DataTypeString>()), "");
virtuals_desc.addEphemeral("_file", std::make_shared<DataTypeLowCardinality>(std::make_shared<DataTypeString>()), "");
virtuals_desc.addEphemeral("_table", std::make_shared<DataTypeLowCardinality>(std::make_shared<DataTypeString>()), "");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces a new user-visible virtual column _table for StorageHive (via setVirtuals and value materialization in getResultChunk). Given that behavior change, the PR template's Changelog category being Not for changelog looks mismatched. Please switch to an appropriate category and add a concrete user-facing Changelog entry describing _table support for Hive reads.

@yariks5s yariks5s self-assigned this Mar 31, 2026
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 31, 2026

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 90.90% 90.90% +0.00%
Branches 76.70% 76.70% +0.00%

Changed lines: 0.00% (0/111) · Uncovered code

Full report · Diff report

@Michicosun Michicosun added this pull request to the merge queue Apr 1, 2026
Merged via the queue into master with commit c05b627 Apr 1, 2026
160 of 164 checks passed
@Michicosun Michicosun deleted the native_common_virtuals/Hive branch April 1, 2026 10:35
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants