Add setting to query Iceberg tables as of a specific timestamp#71072
Add setting to query Iceberg tables as of a specific timestamp#7107220 commits merged intoClickHouse:masterfrom
Conversation
|
This is an automated comment for commit f0e6bdf with description of existing statuses. It's updated for the latest CI running ❌ Click here to open a full report in a separate page
|
7b7d1aa to
89d9d50
Compare
911f4a8 to
7ba8b02
Compare
src/Core/Settings.cpp
Outdated
| Enabling this setting can lead to incorrect result as in case of evolved schema all data files will be read using the same schema. | ||
| ::: | ||
| )", 0) \ | ||
| DECLARE(Int64, iceberg_query_at_timestamp_ms, 0, R"( |
There was a problem hiding this comment.
IMO it would be better to create a new setting field type (add class SettingFieldDateTime into SettingField.h ). With the new SettingFieldDateTime you have complete control over parsed of setting value from string, see SettingFieldNumber for some inspiration.
This line would be then changed to:
| DECLARE(Int64, iceberg_query_at_timestamp_ms, 0, R"( | |
| DECLARE(DateTime, iceberg_query_at_timestamp, 0, R"( |
There was a problem hiding this comment.
Makes sense, thanks!
There was a problem hiding this comment.
I checked this out and I'd like to do this but might need some more guidance on navigating the metaprogramming going on here. 🫠
I'll try again next week.
|
Probably worth adding the setting to this documentation page: https://clickhouse.com/docs/en/sql-reference/table-functions/iceberg |
| break; | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Could you, please, add chassert (CH version of assert) that schema_id == current_schema_id = metadata_object->getValue("current-schema-id")), please? Or you can throw a LOGICAL_ERROR in case it is not
There was a problem hiding this comment.
I added this, but it's breaking other tests that explicitly look for other errors:
This is because we already have code (that still works as expected on my branch) ensuring the schema_id is the same and that no schema migrations have happened:
If this makes sense to you, I'll remove my extra chassert? Thanks!
| "SELECT number, toString(number + 1) FROM numbers(200)" | ||
| ) | ||
|
|
||
| assert ( |
There was a problem hiding this comment.
Firstly, please, create a new test (you can do it directly in this file). Secondly, we need better testing here. You can insert several blocks with some lag and verify the table state between these insertions
| { | ||
| const auto * iceberg_metadata = dynamic_cast<const IcebergMetadata *>(&other); | ||
| return iceberg_metadata && getVersion() == iceberg_metadata->getVersion(); | ||
| return iceberg_metadata && getManifestListFile() == iceberg_metadata->getManifestListFile(); |
There was a problem hiding this comment.
I do not understand this change, tbh. I think that we shouldn't say that two metadata objects are equal if they have the same manifest lists but different table schema or partition schema
There was a problem hiding this comment.
I can use a tuple or something and also compare the schema and partition schema.
The problem with the code as it existed is that it only compared the metadata version, but the top-level metadata contains a history of many snapshots, schemas, etc. And so if you try to query an older version (via timestamp) it would do this comparison, decide the older snapshot you're trying to query is the same as the current snapshot, and then noop.
There was a problem hiding this comment.
Using a tuple of metadata_version, schema_id, snapshot_id now, although in some sense I think I could just use snapshot_id since it implies the rest. 🤷
342f403 to
7749a8e
Compare
7749a8e to
d5e1798
Compare
|
Hi @bretthoerner! I noticed some integration tests need attention in your PR. Would you mind taking a look at them? You can find the details by clicking Click here in this box. On the failed Integration Tests page, you'll see all the failing tests. No need to worry about the filesystem cache test - I'll look into that one myself. I noticed there's a sanitizer issue in one of the failed tests. You can run tests with sanitizer enabled using: It would be really helpful if you could run the integration tests locally as well. Here's our guide on how to do that. Please don't hesitate to reach out if you have any questions - I'm happy to help! |
…pshot when comparing IcebergMetadata equality
|
@divanik Thanks, I missed the ASAN, I'll check it out now. Some of failures are known though and I called them out here, I'm curious for your thoughts. (Basically, the new assert is duplicating existing checks we have.) |
d5e1798 to
1e6f59f
Compare
|
The current test failures look like an OOM/CI issue? At least, they don't seem related to Iceberg at all. Is it possible to rerun just those tests somehow? |
|
Hi @bretthoerner, |
|
Dear @divanik, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
…oerner/iceberg-query-timestamp
…oerner/iceberg-query-timestamp
…oerner/iceberg-query-timestamp
|
Hi. I did a separate PR because I read the docs and haven't understood how I can reveal info about schema evolution from snapshot-log so I preferred iteration through metadata (and all the logic is another one because of this fact). But this work was very useful, I reused some functions and integration tests from this PR, so thank you for your contribution. |
a487db3
Open questions:
Having this be a setting is probably not the correct solution, but if it's going to be part of the SQL syntax then I think I'd need some ideas for where it could fit in before I spin my wheels there. And it might affect much more than just the Iceberg backend...
If a setting is OK, I think it should probably change to a string that I parse as a DateTime? If so, I'm trying to figure out where exactly to do that translation, so I can weave the necessary timezone information down (I'm assuming I'd want to use something like
parseDateTime64BestEffort?).Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Add setting to query Iceberg tables as of a specific timestamp
Documentation entry for user-facing changes
CI Settings (Only check the boxes if you know what you are doing):