## Bug Report

### Summary
When ingesting timezone-aware Arrow timestamps (e.g., timestamp[us, tz=UTC]) via the Snowflake ADBC driver, tables are automatically created with TIMESTAMP_LTZ columns instead of TIMESTAMP_TZ columns. This loses the explicit timezone information present in the Arrow data and causes temporal misalignment issues in multi-source pipelines.
### Related Issues
This is related to #2843, but specifically addresses timezone-aware timestamps where Arrow explicitly provides timezone metadata.
### What happened?
The Snowflake ADBC driver maps timezone-aware Arrow timestamps to TIMESTAMP_LTZ during table creation:
```
Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_LTZ
```
This is semantically incorrect because:
- `TIMESTAMP_LTZ` stores UTC values but interprets them relative to the session timezone
- `TIMESTAMP_TZ` stores UTC values with explicit timezone metadata preserved
When Arrow data explicitly specifies a timezone, that information should be preserved in Snowflake's TIMESTAMP_TZ type.
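For reference, the timezone is carried on the Arrow *type* itself, not per value; a minimal pyarrow check (standalone, not driver code):

```python
import pyarrow as pa

# The timezone is part of the Arrow type metadata.
aware = pa.timestamp("us", tz="UTC")
print(aware)     # timestamp[us, tz=UTC]
print(aware.tz)  # 'UTC'

naive = pa.timestamp("us")
print(naive.tz)  # None -- nothing to preserve for naive timestamps
```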
### What did you expect to happen?
Timezone-aware Arrow timestamps should map to TIMESTAMP_TZ:
```
Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_TZ
```
This preserves the explicit timezone information from the source data.
### Minimal Reproducible Example
```python
import polars as pol
from datetime import datetime, timezone
import adbc_driver_snowflake.dbapi

# Create DataFrame with explicit UTC timestamps
df = pol.DataFrame({
    "id": [1, 2, 3],
    "timestamp": [
        datetime(2024, 12, 3, 10, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 12, 3, 11, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 12, 3, 12, 0, 0, tzinfo=timezone.utc),
    ]
}).with_columns(
    pol.col("timestamp").dt.replace_time_zone("UTC").dt.cast_time_unit("us")
)

# Verify Arrow schema has timezone info
print(df.to_arrow().schema)
# timestamp: timestamp[us, tz=UTC]

# Push to Snowflake using ADBC
conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
df.write_database(
    table_name="test_table",
    connection=conn,
    engine="adbc",
    if_table_exists="replace"
)
conn.close()
```

Actual result in Snowflake:
```sql
SHOW COLUMNS IN test_table;
-- timestamp column type: TIMESTAMP_LTZ(6)
```

Expected result:

```sql
-- timestamp column type: TIMESTAMP_TZ(6)
```
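For completeness, the same check can be scripted over ADBC; a minimal sketch, assuming `test_table` was just created by the example above:

```python
import adbc_driver_snowflake.dbapi

# Re-open a connection and inspect the created column types.
conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
cur = conn.cursor()
cur.execute("SHOW COLUMNS IN TABLE test_table")
for row in cur.fetchall():
    print(row)  # the data_type field currently reports TIMESTAMP_LTZ
cur.close()
conn.close()
```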
### Real-World Impact
This causes critical issues in multi-source data pipelines:
**Scenario:** Merging flight trajectory data from two sources (ACARS and SWIM)
- Source A: Data manually created with a `TIMESTAMP_TZ` schema → timestamps always displayed as UTC
- Source B: Data ingested via ADBC → `TIMESTAMP_LTZ` created → timestamps displayed in the session timezone
Result: When users in non-UTC timezones query and merge these datasets, the timestamps appear misaligned:
```sql
-- User's Snowflake session timezone: America/Los_Angeles (UTC-8)

-- Query Source A (TIMESTAMP_TZ)
SELECT timestamp FROM source_a WHERE id = 1;
-- Returns: 2024-12-03 10:00:00 +0000 (explicit UTC)

-- Query Source B (TIMESTAMP_LTZ)
SELECT timestamp FROM source_b WHERE id = 1;
-- Returns: 2024-12-03 02:00:00 -0800 (converted to session TZ)

-- Merge the sources
SELECT a.timestamp AS ts_a, b.timestamp AS ts_b
FROM source_a a
JOIN source_b b ON a.id = b.id;
-- ts_a: 2024-12-03 10:00:00 +0000
-- ts_b: 2024-12-03 02:00:00 -0800
-- These appear 8 hours apart even though they represent the same instant!
```

This breaks temporal joins, time-series analysis, and data quality validation.
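Pinning the session timezone to UTC is a partial mitigation, but it only fixes display for that session; the column is still `TIMESTAMP_LTZ`, so every other session is affected again. A sketch:

```python
import adbc_driver_snowflake.dbapi

conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
cur = conn.cursor()
# Render TIMESTAMP_LTZ values in UTC for this session only;
# the underlying type information is still lost.
cur.execute("ALTER SESSION SET TIMEZONE = 'UTC'")
```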
### The Semantic Issue
When Arrow explicitly provides timezone metadata, the driver should preserve that information:
| Arrow Type | Current Mapping | Correct Mapping | Reasoning |
|---|---|---|---|
| `timestamp[us]` (no tz) | `TIMESTAMP_LTZ` | `TIMESTAMP_NTZ` or `TIMESTAMP_LTZ` | Ambiguous; either makes sense |
| `timestamp[us, tz=UTC]` | `TIMESTAMP_LTZ` ❌ | `TIMESTAMP_TZ` ✅ | Timezone is explicitly specified |
The key difference:
- `TIMESTAMP_LTZ`: "This is a UTC instant; interpret it in the user's local timezone"
- `TIMESTAMP_TZ`: "This is a UTC instant, and the timezone (UTC) is part of the data"
When Arrow says `tz=UTC`, that timezone should be stored in Snowflake.
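The difference is observable directly in Snowflake; a sketch against a hypothetical temp table (ADBC is used here only for consistency with the rest of this report):

```python
import adbc_driver_snowflake.dbapi

conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
cur = conn.cursor()
cur.execute(
    "CREATE OR REPLACE TEMP TABLE tz_demo (ltz TIMESTAMP_LTZ, tz TIMESTAMP_TZ)"
)
# Store the same instant in both columns.
cur.execute(
    "INSERT INTO tz_demo SELECT '2024-12-03 10:00:00 +0000', "
    "'2024-12-03 10:00:00 +0000'"
)
cur.execute("ALTER SESSION SET TIMEZONE = 'America/Los_Angeles'")
# ltz is reinterpreted in the session zone; tz keeps its stored offset.
cur.execute("SELECT ltz, tz FROM tz_demo")
print(cur.fetchall())
cur.close()
```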
### Current Workaround
Pre-create tables with explicit schema before ingestion:
```python
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE test_table (
        id NUMBER(38, 0),
        timestamp TIMESTAMP_TZ
    )
""")
cursor.close()

# Then append data
df.write_database(
    table_name="test_table",
    connection=conn,
    engine="adbc",
    if_table_exists="append"
)
```

This works but requires boilerplate schema management and defeats the purpose of schema inference from Arrow types.
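The boilerplate can at least be factored out by deriving the DDL from the Arrow schema. A hypothetical sketch (`ddl_from_arrow` is not a library function, and only the types used in this report are handled):

```python
import pyarrow as pa

def ddl_from_arrow(name: str, schema: pa.Schema) -> str:
    # Hypothetical helper: build CREATE TABLE DDL from an Arrow schema,
    # mapping timezone-aware timestamps to TIMESTAMP_TZ.
    def col_type(t: pa.DataType) -> str:
        if pa.types.is_integer(t):
            return "NUMBER(38, 0)"
        if pa.types.is_timestamp(t):
            return "TIMESTAMP_TZ" if t.tz is not None else "TIMESTAMP_LTZ"
        raise NotImplementedError(f"unhandled Arrow type: {t}")
    cols = ", ".join(f"{f.name} {col_type(f.type)}" for f in schema)
    return f"CREATE TABLE {name} ({cols})"

# ddl_from_arrow("test_table", df.to_arrow().schema)
# -> 'CREATE TABLE test_table (id NUMBER(38, 0), timestamp TIMESTAMP_TZ)'
```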
### Proposed Solution

**Option 1 (preferred): Change the default mapping for timezone-aware timestamps**
```
timestamp[us, tz=<any>] → TIMESTAMP_TZ
timestamp[us] (no tz)   → TIMESTAMP_LTZ  (current behavior, unchanged)
```
**Option 2: Add a connection/statement option**
```python
# Allow users to specify timestamp type preference
conn = adbc_driver_snowflake.dbapi.connect(
    uri="<snowflake_uri>",
    db_kwargs={"adbc.snowflake.ingest.timestamp_tz_for_aware": "true"}
)
```

**Option 3: Follow Arrow metadata conventions**
If Arrow timestamp has timezone metadata, use TIMESTAMP_TZ. This aligns with Arrow's philosophy of preserving metadata.
### Environment

- ADBC Driver: `adbc-driver-snowflake` (latest from PyPI)
- Arrow: `pyarrow` (via Polars)
- Language: Python 3.11+
- Framework: Polars DataFrames
- Snowflake: Standard configuration
### Additional Context
The Snowflake ADBC driver documentation states:
> `timestamp_ltz`, `timestamp_ntz`, `timestamp_tz`: `timestamp[ns]`
> Local time zone will be used, except for `timestamp_ntz` which is not an instant.
This describes the read path (Snowflake → Arrow) but doesn't specify the write path (Arrow → Snowflake) behavior for timezone-aware timestamps. The write path mapping should be symmetric and preserve timezone information when it exists in the source.
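One way to observe the read path end to end; a hedged sketch assuming a hypothetical table with one column of each flavor:

```python
import adbc_driver_snowflake.dbapi

conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
cur = conn.cursor()
cur.execute("SELECT ts_ltz, ts_ntz, ts_tz FROM ts_flavors")
# Inspect which Arrow types each Snowflake flavor round-trips to.
print(cur.fetch_arrow_table().schema)
cur.close()
```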
### Acknowledgment
I understand #2843 concluded that explicit schema creation is the recommended approach. However, I believe timezone-aware Arrow data represents a distinct case where the "correct" mapping is unambiguous: explicit timezone in Arrow should map to explicit timezone in Snowflake (TIMESTAMP_TZ), not session-dependent interpretation (TIMESTAMP_LTZ).
Even if the default behavior cannot change for backward compatibility, an opt-in option would greatly improve the developer experience for timezone-aware data pipelines.