Support of dynamic subcolumns (JSON data type) #23932

Merged: alesapin merged 101 commits into ClickHouse:master from CurtizJ:dynamic-columns on Mar 17, 2022
CurtizJ (Member) commented May 7, 2021

I hereby agree to the terms of the CLA available at: https://yandex.ru/legal/cla/?lang=en

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
New data type Object(<schema_format>), which supports storing semi-structured data (for now, JSON only). Data is written to such types as strings. All paths are then extracted according to the format of the semi-structured data and written as separate columns in the most optimal types that can store all their values. Those columns can be queried by names that match paths in the source data, e.g. data.key1.key2, or with the cast operator: data.key1.key2::Int64.
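
To illustrate what "the most optimal types that can store all their values" could mean for one extracted path, here is a minimal sketch. It is not the code from this PR: `narrowestSignedType` is a hypothetical helper, it only considers signed integers, and it returns a type name as a string purely for demonstration.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical helper (an assumption for illustration, not a ClickHouse API):
// returns the name of the narrowest signed integer type that can hold
// every value collected for a single path.
std::string narrowestSignedType(const std::vector<std::int64_t> & values)
{
    // Track the observed range; an empty input keeps the range [0, 0].
    std::int64_t lo = 0;
    std::int64_t hi = 0;
    for (std::int64_t v : values)
    {
        if (v < lo)
            lo = v;
        if (v > hi)
            hi = v;
    }

    auto fits = [&](std::int64_t min, std::int64_t max) { return lo >= min && hi <= max; };

    if (fits(std::numeric_limits<std::int8_t>::min(), std::numeric_limits<std::int8_t>::max()))
        return "Int8";
    if (fits(std::numeric_limits<std::int16_t>::min(), std::numeric_limits<std::int16_t>::max()))
        return "Int16";
    if (fits(std::numeric_limits<std::int32_t>::min(), std::numeric_limits<std::int32_t>::max()))
        return "Int32";
    return "Int64";
}
```

Under these assumptions, a path whose values are 1, 2, 3 would map to Int8, and adding the value 300 would widen the column to Int16.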

Detailed description / Documentation draft:
Resolves #23516.

robot-clickhouse added the doc-alert and pr-feature (Pull request with new product feature) labels on May 7, 2021
CurtizJ changed the title from "Dynamic columns" to "Support of dynamic subcolumns" on May 7, 2021
CurtizJ (Member, Author) commented May 7, 2021

This PR depends on #22535 and contains its changes. To see only the changes related to the implementation of dynamic subcolumns, without ColumnSparse, use the comparison link CurtizJ/ClickHouse@sparse-serialization...CurtizJ:dynamic-columns.

CurtizJ changed the title from "Support of dynamic subcolumns" to "Support of dynamic subcolumns (JSON data type)" on May 7, 2021
CurtizJ marked this pull request as ready for review on March 1, 2022, 17:22
Comment on lines +110 to +111
/// Virtual columns must be appended after ordinary, because user can
/// override them.
Member

What do you mean by overriding here?

Member (Author)

Users can define physical columns in the table definition with the same names as virtual columns.

I wanted to give the following example, but it doesn't actually work :)

create table kek (_part UInt32) ENGINE = MergeTree ORDER BY tuple();

insert into kek values (1);

select _part from kek;

Received exception from server (version 22.3.1):
Code: 352. DB::Exception: Received from localhost:9000. DB::Exception: Block structure mismatch in (columns with identical name must have identical structure) stream: different types:
_part UInt32 UInt32(size = 0)
_part String String(size = 0). (AMBIGUOUS_COLUMN_NAME)
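
The intent of the code comment under review can be sketched as follows: if a lookup by name returns the first match, then appending virtual columns after ordinary ones lets a user-defined column take precedence over a virtual column with the same name. This is only a sketch of that ordering idea. `NameAndType`, `buildColumnList`, and `findColumn` are hypothetical names, not the ClickHouse implementation, and as the exception above shows, the real server did not accept this particular case at the time.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical column descriptor used only in this sketch.
struct NameAndType
{
    std::string name;
    std::string type;
};

// Builds one list with ordinary columns first and virtual columns after them.
std::vector<NameAndType> buildColumnList(
    const std::vector<NameAndType> & ordinary,
    const std::vector<NameAndType> & virtuals)
{
    std::vector<NameAndType> result = ordinary;
    result.insert(result.end(), virtuals.begin(), virtuals.end());
    return result;
}

// Returns the first column with the given name, so an ordinary column
// placed earlier in the list shadows a virtual column with the same name.
std::optional<NameAndType> findColumn(const std::vector<NameAndType> & columns, const std::string & name)
{
    for (const auto & column : columns)
        if (column.name == name)
            return column;
    return std::nullopt;
}
```

With an ordinary `_part UInt32` and a virtual `_part String`, `findColumn` in this sketch would return the `UInt32` definition.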

"Not enough name parts for path {}. Expected at least {}, got {}",
paths[i].getPath(), pos + 1, num_parts);

size_t array_dimensions = kind == Node::NESTED ? 1 : parts[pos].anonymous_array_level;
Member

And this "anonymous_array_level" is also very, very confusing :) Why "anonymous"? I read the comment for it in PathInData, but I still have no clue why it is anonymous.

Member (Author)

"anonymous" means that it doesn't have a key related to it. Maybe "unnamed" instead of "anonymous" is better. Just "array_level" is not correct, because this field doesn't represent the number of dimensions of the array as a whole.

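One possible reading of this explanation, as a minimal sketch: array levels that sit between two keys have no key of their own, so they are counted separately from the single level contributed by a nested key. `PathPart` and both functions below are hypothetical, the `is_nested` flag stands in for the `kind == Node::NESTED` check in the reviewed line, and the example values are an assumption about how such a field would be filled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical mirror of one path part; field names follow the discussion.
struct PathPart
{
    std::string key;
    bool is_nested = false;                  // the key points to an array of objects
    std::uint8_t anonymous_array_level = 0;  // array levels before this key that have no key of their own
};

// Loosely mirrors the expression under review: a nested part contributes
// exactly one array dimension, any other part contributes its unnamed levels.
std::size_t arrayDimensions(const PathPart & part)
{
    return part.is_nested ? std::size_t{1} : std::size_t{part.anonymous_array_level};
}

// Sums the per-part dimensions to show that neither field alone
// describes the array as a whole.
std::size_t totalArrayDimensions(const std::vector<PathPart> & parts)
{
    std::size_t total = 0;
    for (const auto & part : parts)
        total += arrayDimensions(part);
    return total;
}
```

Assuming a document such as {"k1": [[[{"k2": 1}]]]}, the path k1.k2 could be described as k1 (nested, 0 unnamed levels) followed by k2 (not nested, 2 unnamed levels), which adds up to the three array levels in the source.
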
alesapin (Member) commented

Going to fix tidy in master, timeout -- OK.

alesapin merged commit 457fa0d into ClickHouse:master on Mar 17, 2022
azat added a commit to azat/ClickHouse that referenced this pull request Jun 8, 2022
Before this patch SELECT queries hold parts even if they were not
required by select (had been eliminated by partition pruning).

This defers removing parts if you have long running queries.

This had been introduced in ClickHouse#23932, with introduction of
StorageSnapshotPtr.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
/// Snapshot of storage that fixes set columns that can be read in query.
/// There are 3 sources of columns: regular columns from metadata,
/// dynamic columns from object Types, virtual columns.
struct StorageSnapshot

Hi @CurtizJ, could you kindly explain the difference between StorageInMemoryMetadata and StorageSnapshot? E.g., when should one use StorageInMemoryMetadata::getColumns and when StorageSnapshot::getColumns? Thank you!

Labels

pr-feature Pull request with new product feature

Development

Successfully merging this pull request may close these issues.

Support of dynamic subcolumns in tables.

8 participants