Skip to content

Timezone info lost when reading DateTime columns from Glue DataLakeCatalog #81654

@alsugiliazova

Description

@alsugiliazova

Company or project name

No response

Describe what's wrong

Timezone info lost when reading DateTime columns from Glue DataLakeCatalog.

Example of reading same data with rest and glue catalog:

Glue:

SELECT * FROM database_59caa3c5_46be_11f0_a37e_159f99dc2bbb.\`namespace_59caa3c4_46be_11f0_a37e_159f99dc2bbb.table_59caa3c3_46be_11f0_a37e_159f99dc2bbb\` ORDER BY tuple(*) FORMAT TabSeparated
2024-01-01 15:00:00.000000	2024-01-01 15:00:00.000000
SHOW CREATE TABLE database_59caa3c5_46be_11f0_a37e_159f99dc2bbb.\`namespace_59caa3c4_46be_11f0_a37e_159f99dc2bbb.table_59caa3c3_46be_11f0_a37e_159f99dc2bbb\`
CREATE TABLE database_59caa3c5_46be_11f0_a37e_159f99dc2bbb.`namespace_59caa3c4_46be_11f0_a37e_159f99dc2bbb.table_59caa3c3_46be_11f0_a37e_159f99dc2bbb`\n(\n    `timestamp` Nullable(DateTime64(6)),\n    `timestamptz` Nullable(DateTime64(6))\n)\nENGINE = Iceberg(\'s3://glue***')

Rest:

SELECT * FROM database_92115b01_46be_11f0_a37e_159f99dc2bbb.\`namespace_92115b00_46be_11f0_a37e_159f99dc2bbb.table_92115aff_46be_11f0_a37e_159f99dc2bbb\` ORDER BY tuple(*) FORMAT TabSeparated
2024-01-01 15:00:00.000000	2024-01-01 12:00:00.000000
SHOW CREATE TABLE database_92115b01_46be_11f0_a37e_159f99dc2bbb.\`namespace_92115b00_46be_11f0_a37e_159f99dc2bbb.table_92115aff_46be_11f0_a37e_159f99dc2bbb\`
CREATE TABLE database_92115b01_46be_11f0_a37e_159f99dc2bbb.`namespace_92115b00_46be_11f0_a37e_159f99dc2bbb.table_92115aff_46be_11f0_a37e_159f99dc2bbb`\n(\n    `timestamp` Nullable(DateTime64(6)),\n    `timestamptz` Nullable(DateTime64(6, \'UTC\'))\n)\nENGINE = Iceberg(\'http://minio:9000/warehouse/data/\', \'admin\', \'[HIDDEN]\')

So I see that timezone is there Nullable(DateTime64(6, 'UTC'))

Does it reproduce on the most recent release?

Yes

How to reproduce

  1. Create iceberg table with TimestamptzType column. Insert data with timezone.

I use PyIceberg:

schema = Schema(
    NestedField(
        field_id=1, name="timestamp", field_type=TimestampType(), required=False
    ),
    NestedField(
        field_id=2,
        name="timestamptz",
        field_type=TimestamptzType(),
        required=False,
    ),
)
data = [
    {
        "timestamp": datetime(2024, 1, 1, hour=12, minute=0, second=0, microsecond=0),
        "timestamptz": datetime(
            2024,
            1,
            1,
            hour=12,
            minute=0,
            second=0,
            microsecond=0,
            tzinfo=pytz.timezone("UTC"),
        ),
    }
]
df = pa.Table.from_pylist(data)
table.append(df)
  1. Create database DataLakeCatalog with type=glue
  2. Select data from table

Expected behavior

No response

Error message and/or stacktrace

No response

Additional context

I use aws glue.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions