Snowflake driver should map timezone-aware Arrow timestamps to TIMESTAMP_TZ, not TIMESTAMP_LTZ #3766

@enrico-stauss

Bug Report

Summary

When ingesting timezone-aware Arrow timestamps (e.g., timestamp[us, tz=UTC]) via the Snowflake ADBC driver, tables are automatically created with TIMESTAMP_LTZ columns instead of TIMESTAMP_TZ columns. This loses the explicit timezone information present in the Arrow data and causes temporal misalignment issues in multi-source pipelines.

Related Issues

This is related to #2843, but specifically addresses timezone-aware timestamps where Arrow explicitly provides timezone metadata.

What happened?

The Snowflake ADBC driver maps timezone-aware Arrow timestamps to TIMESTAMP_LTZ during table creation:

Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_LTZ

This is semantically incorrect because:

  • TIMESTAMP_LTZ stores a UTC instant and interprets it relative to the current session timezone
  • TIMESTAMP_TZ stores a UTC instant together with its original time zone offset

When Arrow data explicitly specifies a timezone, that information should be preserved in Snowflake's TIMESTAMP_TZ type.
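
The difference is easy to observe from a session whose timezone is not UTC. A minimal sketch over the ADBC DBAPI connection (assuming an open connection `conn`, permission to alter the session timezone, and default timestamp output formats; the literal values are illustrative):

# Sketch: render the same instant as TIMESTAMP_LTZ and TIMESTAMP_TZ.
# Assumes `conn` is an open adbc_driver_snowflake.dbapi connection.
cur = conn.cursor()
cur.execute("ALTER SESSION SET TIMEZONE = 'America/Los_Angeles'")
cur.execute("""
    SELECT
        TO_VARCHAR('2024-12-03 10:00:00 +0000'::TIMESTAMP_LTZ) AS ltz_display,
        TO_VARCHAR('2024-12-03 10:00:00 +0000'::TIMESTAMP_TZ)  AS tz_display
""")
print(cur.fetchone())
# With default output formats, the LTZ value is rendered in the session
# timezone (-0800) while the TZ value keeps its stored +0000 offset.
cur.close()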

What did you expect to happen?

Timezone-aware Arrow timestamps should map to TIMESTAMP_TZ:

Arrow: timestamp[us, tz=UTC] → Snowflake: TIMESTAMP_TZ

This preserves the explicit timezone information from the source data.

Minimal Reproducible Example

import polars as pol
from datetime import datetime, timezone
import adbc_driver_snowflake.dbapi

# Create DataFrame with explicit UTC timestamps
df = pol.DataFrame({
    "id": [1, 2, 3],
    "timestamp": [
        datetime(2024, 12, 3, 10, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 12, 3, 11, 0, 0, tzinfo=timezone.utc),
        datetime(2024, 12, 3, 12, 0, 0, tzinfo=timezone.utc),
    ]
}).with_columns(
    pol.col("timestamp").dt.replace_time_zone("UTC").dt.cast_time_unit("us")
)

# Verify Arrow schema has timezone info
print(df.to_arrow().schema)
# timestamp: timestamp[us, tz=UTC]

# Push to Snowflake using ADBC
conn = adbc_driver_snowflake.dbapi.connect("<snowflake_uri>")
df.write_database(
    table_name="test_table",
    connection=conn,
    engine="adbc",
    if_table_exists="replace"
)
conn.close()

Actual result in Snowflake:

SHOW COLUMNS IN test_table;
-- timestamp column type: TIMESTAMP_LTZ(6)

Expected result:

-- timestamp column type: TIMESTAMP_TZ(6)
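
The same check can be made from Python over the open connection (a sketch; it assumes the driver executes DESCRIBE TABLE like any other statement and that name and type are the first two fields of each row):

# Sketch: inspect the column types the driver created for test_table.
cur = conn.cursor()
cur.execute("DESCRIBE TABLE test_table")
for name, col_type, *_rest in cur.fetchall():
    print(name, col_type)
# The "timestamp" column currently comes back as a TIMESTAMP_LTZ type;
# the expectation is TIMESTAMP_TZ.
cur.close()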

Real-World Impact

This causes critical issues in multi-source data pipelines:

Scenario: Merging flight trajectory data from two sources (ACARS and SWIM)

  • Source A: Data manually created with TIMESTAMP_TZ schema → timestamps always displayed as UTC
  • Source B: Data ingested via ADBC → TIMESTAMP_LTZ created → timestamps displayed in session timezone

Result: When users in non-UTC timezones query and merge these datasets, the timestamps appear misaligned:

-- User's Snowflake session timezone: America/Los_Angeles (UTC-8)

-- Query Source A (TIMESTAMP_TZ)
SELECT timestamp FROM source_a WHERE id = 1;
-- Returns: 2024-12-03 10:00:00 +0000 (explicit UTC)

-- Query Source B (TIMESTAMP_LTZ)
SELECT timestamp FROM source_b WHERE id = 1;
-- Returns: 2024-12-03 02:00:00 -0800 (converted to session TZ)

-- Merge the sources
SELECT a.timestamp AS ts_a, b.timestamp AS ts_b
FROM source_a a
JOIN source_b b ON a.id = b.id;
-- ts_a: 2024-12-03 10:00:00 +0000
-- ts_b: 2024-12-03 02:00:00 -0800
-- These appear 8 hours apart even though they represent the same instant!

This breaks temporal joins, time-series analysis, and data quality validation.

The Semantic Issue

When Arrow explicitly provides timezone metadata, the driver should preserve that information:

Arrow Type              Current Mapping   Correct Mapping                  Reasoning
timestamp[us] (no tz)   TIMESTAMP_LTZ     TIMESTAMP_NTZ or TIMESTAMP_LTZ   Ambiguous, either makes sense
timestamp[us, tz=UTC]   TIMESTAMP_LTZ     TIMESTAMP_TZ                     Timezone is explicitly specified

The key difference:

  • TIMESTAMP_LTZ: "This is a UTC instant, interpret it in the user's local timezone"
  • TIMESTAMP_TZ: "This is a UTC instant and the timezone (UTC) is part of the data"

When Arrow says tz=UTC, that timezone should be stored in Snowflake.

Current Workaround

Pre-create tables with explicit schema before ingestion:

# Assumes an open ADBC connection `conn` and the DataFrame `df` from above
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE test_table (
        id NUMBER(38, 0),
        timestamp TIMESTAMP_TZ
    )
""")
cursor.close()

# Then append data
df.write_database(
    table_name="test_table",
    connection=conn,
    engine="adbc",
    if_table_exists="append"
)

This works but requires boilerplate schema management and defeats the purpose of schema inference from Arrow types.
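
The boilerplate can at least be generated from the Arrow schema itself. The helper below is only a sketch of that idea (the name `create_table_from_arrow` and its type mapping are mine, not part of the driver, and it only covers the types used in this example):

import pyarrow as pa

def create_table_from_arrow(conn, table_name: str, schema: pa.Schema) -> None:
    """Create a table whose timezone-aware timestamp columns use TIMESTAMP_TZ."""
    def to_snowflake(field: pa.Field) -> str:
        t = field.type
        if pa.types.is_timestamp(t):
            # Timezone-aware Arrow timestamps -> TIMESTAMP_TZ, naive -> TIMESTAMP_NTZ.
            return "TIMESTAMP_TZ" if t.tz is not None else "TIMESTAMP_NTZ"
        if pa.types.is_integer(t):
            return "NUMBER(38, 0)"
        raise NotImplementedError(f"no Snowflake mapping defined for {t}")

    columns = ", ".join(f"{f.name} {to_snowflake(f)}" for f in schema)
    cur = conn.cursor()
    cur.execute(f"CREATE OR REPLACE TABLE {table_name} ({columns})")
    cur.close()

# Usage with the example above:
# create_table_from_arrow(conn, "test_table", df.to_arrow().schema)
# df.write_database(table_name="test_table", connection=conn,
#                   engine="adbc", if_table_exists="append")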

Proposed Solution

Option 1 (Preferred): Change default mapping for timezone-aware timestamps

timestamp[us, tz=<any>] → TIMESTAMP_TZ
timestamp[us] (no tz)   → TIMESTAMP_LTZ (current behavior, unchanged)

Option 2: Add a connection/statement option

# Allow users to specify timestamp type preference
conn = adbc_driver_snowflake.dbapi.connect(
    uri="<snowflake_uri>",
    db_kwargs={"adbc.snowflake.ingest.timestamp_tz_for_aware": "true"}
)

Option 3: Follow Arrow metadata conventions

If the Arrow timestamp has timezone metadata, use TIMESTAMP_TZ. This aligns with Arrow's philosophy of preserving metadata.
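
For reference, the timezone metadata these options key on is directly available on the Arrow type; a quick illustration with pyarrow:

import pyarrow as pa

aware = pa.field("timestamp", pa.timestamp("us", tz="UTC"))
print(pa.types.is_timestamp(aware.type))  # True
print(aware.type.tz)                      # "UTC"  -> explicit timezone, map to TIMESTAMP_TZ

naive = pa.field("ts", pa.timestamp("us"))
print(naive.type.tz)                      # None   -> no timezone, keep current mapping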

Environment

  • ADBC Driver: adbc-driver-snowflake (latest from PyPI)
  • Arrow: pyarrow (via Polars)
  • Language: Python 3.11+
  • Framework: Polars DataFrames
  • Snowflake: Standard configuration

Additional Context

The Snowflake ADBC driver documentation states:

timestamp_ltz, timestamp_ntz, timestamp_tz: timestamp[ns]
Local time zone will be used, except for timestamp_ntz which is not an instant.

This describes the read path (Snowflake → Arrow) but doesn't specify the write path (Arrow → Snowflake) behavior for timezone-aware timestamps. The write path mapping should be symmetric and preserve timezone information when it exists in the source.
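
For completeness, the read-path mapping of a given driver version can be checked by fetching a small query as Arrow and inspecting the schema (a sketch over the same DBAPI connection):

# Sketch: see what Arrow types the driver returns for each Snowflake timestamp type.
cur = conn.cursor()
cur.execute("""
    SELECT
        '2024-12-03 10:00:00'::TIMESTAMP_NTZ       AS ts_ntz,
        '2024-12-03 10:00:00 +0000'::TIMESTAMP_LTZ AS ts_ltz,
        '2024-12-03 10:00:00 +0000'::TIMESTAMP_TZ  AS ts_tz
""")
print(cur.fetch_arrow_table().schema)
cur.close()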

Acknowledgment

I understand #2843 concluded that explicit schema creation is the recommended approach. However, I believe timezone-aware Arrow data represents a distinct case where the "correct" mapping is unambiguous: explicit timezone in Arrow should map to explicit timezone in Snowflake (TIMESTAMP_TZ), not session-dependent interpretation (TIMESTAMP_LTZ).

Even if the default behavior cannot change for backward compatibility, an opt-in option would greatly improve the developer experience for timezone-aware data pipelines.
