Description
Company or project name
No response
Describe what's wrong
I have an Iceberg configuration where:
- The metadata files are stored in an object store
- The data files are stored in a local filesystem
Reading such a table from ClickHouse through a DataLakeCatalog database fails.
Does it reproduce on the most recent release?
Yes
How to reproduce
Step 1
Use an Iceberg client (e.g. pyiceberg) to create a table whose metadata files are stored on S3 and whose data files are on the local filesystem:
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog
from pyiceberg.io.pyarrow import _pyarrow_to_schema_without_ids

# Connect to the REST catalog; metadata files are written to S3
catalog = load_catalog(
    **{
        'type': 'rest',
        'uri': '...',
        's3.endpoint': '...',
        's3.path-style-access': '...',
        's3.access-key-id': '...',
        's3.secret-access-key': '...',
        'client.region': '...',
    },
)
catalog.create_namespace('test_ns')

# Add local file to table
file = "/absolute/path/to/data.parquet"
parquet_schema = pq.read_table(file).schema
iceberg_schema = _pyarrow_to_schema_without_ids(parquet_schema)
table = catalog.create_table_if_not_exists('test_ns.table1', iceberg_schema)
# Register the existing local file under a file:// URI
table.add_files(["file://" + file])
Step 2
Try to access the data from ClickHouse:
-- Create Iceberg Database
CREATE DATABASE iceberg_catalog
ENGINE = DataLakeCatalog('...')
SETTINGS catalog_type = '...',
         warehouse = '...',
         aws_access_key_id = '...',
         aws_secret_access_key = '...',
         region = '...';
-- List tables (this should work fine)
SHOW TABLES FROM iceberg_catalog;
-- Query the Iceberg table from the Catalog (this fails)
SELECT * FROM iceberg_catalog.`test_ns.table1`;
Expected behavior
ClickHouse should be able to read tables created with PyIceberg, including tables whose data files are on the local filesystem.
Error message and/or stacktrace
Error 1
Code: 36. DB::Exception: Received from localhost:9000. DB::Exception: Expected to find 'test_ns/table1' in data path: 'file:///absolute/path/to/data.parquet'. (BAD_ARGUMENTS)
This error is unexpected, as the Iceberg specification does not require data files to be located under the directory <namespace>/<table_name>.
Error 2
If the data file is under the directory <namespace>/<table_name>, the above error will not occur, but another one will be observed:
Code: 499. DB::Exception: Received from localhost:9000. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404: while reading test_ns/table1/data.parquet: While executing IcebergS3(iceberg_catalog.`test_ns.table1`)Source. (S3_ERROR)
This error is unexpected: ClickHouse tries to read the data file from the object store, although the file is stored on the local filesystem.
Additional context
The two errors are closely related and could be resolved together. Both come from the getProperFilePathFromMetadataInfo function, which makes strong assumptions about the location of the data files:
std::string getProperFilePathFromMetadataInfo(std::string_view data_path, std::string_view common_path, std::string_view table_location)
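One possible direction, shown here only as an illustrative Python sketch and not as ClickHouse's actual logic, is to decide where a data file lives from the URI scheme of its path rather than from the table location. The function name classify_data_path and the backend labels "s3" and "local" are hypothetical and exist only for this example.

```python
from typing import Tuple
from urllib.parse import urlsplit


def classify_data_path(data_path: str) -> Tuple[str, str]:
    """Return a (backend, path) pair for a data file path taken from Iceberg metadata.

    Hypothetical helper: the backend labels are illustrative and do not
    correspond to ClickHouse identifiers.
    """
    parts = urlsplit(data_path)
    scheme = parts.scheme.lower()

    if scheme in ("s3", "s3a", "s3n"):
        # Object store: the bucket is the URI authority, the key is the path.
        return "s3", parts.netloc + "/" + parts.path.lstrip("/")

    if scheme == "file":
        # Local filesystem: file:///abs/path -> /abs/path
        return "local", parts.path

    if scheme == "":
        # No scheme at all: treat the value as a plain filesystem path.
        return "local", data_path

    raise ValueError("unsupported scheme in data path: " + repr(data_path))


if __name__ == "__main__":
    print(classify_data_path("file:///absolute/path/to/data.parquet"))
    print(classify_data_path("s3://bucket/test_ns/table1/data.parquet"))
```

With this kind of dispatch, the path from Error 1 would be routed to the local filesystem without any check for 'test_ns/table1' in the path, and the read in Error 2 would not be sent to S3.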