
Reading tables from the Iceberg Catalog does not work when files are split across multiple storage backends #84609

@FredKhayat

Description


Company or project name

No response

Describe what's wrong

I have an Iceberg configuration where:

  1. The metadata files are stored in an object store
  2. The data files are stored in a local filesystem

If we try to read the content of a table using ClickHouse and a DataLakeCatalog, the read will fail.

Does it reproduce on the most recent release?

Yes

How to reproduce

Step 1

Use an Iceberg client (e.g. pyiceberg) to create a table where the metadata files are stored on S3 and the data files on the local filesystem:

from pyiceberg.catalog import load_catalog
import pyarrow.parquet as pq
from pyiceberg.io.pyarrow import _pyarrow_to_schema_without_ids

catalog = load_catalog(
    **{
        'type': 'rest',
        'uri': '...',
        's3.endpoint': '...',
        's3.path-style-access': '...',
        's3.access-key-id': '...',
        's3.secret-access-key': '...',
        'client.region': '...',
    },
)

catalog.create_namespace('test_ns')

# Add local file to table
file = "/absolute/path/to/data.parquet"
parquet_schema = pq.read_table(file).schema
iceberg_schema = _pyarrow_to_schema_without_ids(parquet_schema)

table = catalog.create_table_if_not_exists('test_ns.table1', iceberg_schema)
table.add_files(["file://" + file])

Step 2

Try to access the data from ClickHouse:

-- Create Iceberg Database
CREATE DATABASE iceberg_catalog
ENGINE = DataLakeCatalog('...')
SETTINGS catalog_type = '...', warehouse = '...',
    aws_access_key_id = '...', aws_secret_access_key = '...', region = '...';

-- List tables (this should work fine)
SHOW TABLES FROM iceberg_catalog;

-- Query the Iceberg table from the Catalog (this fails)
SELECT * FROM iceberg_catalog.`test_ns.table1`;

Expected behavior

ClickHouse should be able to read tables created with PyIceberg, regardless of which storage backend holds the data files.

Error message and/or stacktrace

Error 1

Code: 36. DB::Exception: Received from localhost:9000. DB::Exception: Expected to find 'test_ns/table1' in data path: 'file:///absolute/path/to/data.parquet'. (BAD_ARGUMENTS)

This error is unexpected: the Iceberg specification does not require data files to be located under the directory <namespace>/<table_name>.
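For context, the Iceberg spec records each data file in the manifest as a full URI with a filesystem scheme, independent of the table location. A minimal sketch (the dict shape here is illustrative, not pyiceberg's internal representation):

```python
from urllib.parse import urlparse

# Simplified manifest entry (hypothetical shape) as recorded after add_files():
# per the Iceberg spec, file_path is a full URI with an FS scheme, and nothing
# requires it to sit under '<namespace>/<table_name>/'.
entry = {"file_path": "file:///absolute/path/to/data.parquet", "file_format": "PARQUET"}

scheme = urlparse(entry["file_path"]).scheme
print(scheme)  # "file" — a reader should dispatch on this scheme rather than
               # assume the table's storage backend
```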

Error 2

If the data file is under the directory <namespace>/<table_name>, the above error will not occur, but another one will be observed:

Code: 499. DB::Exception: Received from localhost:9000. DB::Exception: Failed to get object info: No response body.. HTTP response code: 404: while reading test_ns/table1/data.parquet: While executing IcebergS3(iceberg_catalog.`test_ns.table1`)Source. (S3_ERROR)

This error is unexpected: ClickHouse tries to read the data file from the object store, even though the file is stored on the local filesystem.

Additional context

The two errors are closely related and could be resolved in one go. Both stem from the getProperFilePathFromMetadataInfo function, which makes strong assumptions about where the data files are located:

std::string getProperFilePathFromMetadataInfo(std::string_view data_path, std::string_view common_path, std::string_view table_location)
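A hypothetical Python reconstruction of those assumptions (not the actual C++ code; the bucket name is made up) shows how both errors arise: the function requires the '<namespace>/<table>' fragment in every data path (Error 1), then rebases the remainder onto the table's object-store location, discarding the original file:// scheme (Error 2).

```python
def proper_file_path_from_metadata(data_path: str, common_path: str,
                                   table_location: str) -> str:
    """Sketch of the assumed behavior: find the '<namespace>/<table>' fragment
    and rebase everything after it onto the table location."""
    pos = data_path.find(common_path)
    if pos == -1:
        # Error 1: any data file outside '<namespace>/<table>' is rejected.
        raise ValueError(
            f"Expected to find '{common_path}' in data path: '{data_path}'")
    relative = data_path[pos + len(common_path):].lstrip("/")
    # Error 2: the result always points at the table's storage backend,
    # so the 'file://' scheme of the original path is lost.
    return table_location.rstrip("/") + "/" + relative

# The local data file from the reproduction above:
try:
    proper_file_path_from_metadata(
        "file:///absolute/path/to/data.parquet",
        "test_ns/table1", "s3://bucket/test_ns/table1")
except ValueError as e:
    print(e)  # mirrors Error 1

# Even if the local file sits under test_ns/table1/, the path is rebased
# onto the S3 table location, producing the 404 of Error 2:
print(proper_file_path_from_metadata(
    "file:///data/test_ns/table1/data.parquet",
    "test_ns/table1", "s3://bucket/test_ns/table1"))
# → s3://bucket/test_ns/table1/data.parquet
```

A scheme-aware fix would keep the URI as-is when the data path already carries its own filesystem scheme, instead of unconditionally rebasing it.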

Labels

potential bug (To be reviewed by developers and confirmed/rejected.)