[Java API] Rough edges when partitioning by time types #11899

@ahmedabu98

Description

Apache Iceberg version

1.7.1 (latest release)

Query engine

Other

Please describe the bug 🐞

We've been developing an Iceberg connector at Apache Beam using the Java API, and I noticed some rough edges around partitioning by time types (i.e. year, month, day or hour).

See the following code:

org.apache.iceberg.Schema schema =
    new org.apache.iceberg.Schema(
        Types.NestedField.required(1, "year", Types.TimestampType.withoutZone()),
        Types.NestedField.required(2, "day", Types.TimestampType.withoutZone()));
PartitionSpec spec = PartitionSpec.builderFor(schema)
        .year("year")
        .day("day").build();
Table table = catalog.createTable(TableIdentifier.parse("db.table"), schema, spec);
PartitionKey pk = new PartitionKey(spec, schema);

LocalDateTime val = LocalDateTime.parse("2024-10-08T13:18:20.053");
Record rec = GenericRecord.create(schema).copy(
        ImmutableMap.of(
                "year", val, 
                "day", val));
pk.partition(rec);

I'm applying a simple partition to my original record and would expect it to work normally, but the last line fails with the following error:

java.lang.IllegalStateException: Not an instance of java.lang.Long: 2024-10-08T13:18:20.053
	at org.apache.iceberg.data.GenericRecord.get(GenericRecord.java:123)
	at org.apache.iceberg.Accessors$PositionAccessor.get(Accessors.java:71)
	at org.apache.iceberg.Accessors$PositionAccessor.get(Accessors.java:58)
	at org.apache.iceberg.StructTransform.wrap(StructTransform.java:78)
	at org.apache.iceberg.PartitionKey.wrap(PartitionKey.java:30)
	at org.apache.iceberg.PartitionKey.partition(PartitionKey.java:64)
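The failure happens because Iceberg's generic internal representation of a `timestamp` value is a `long` of microseconds since the Unix epoch, and the partition transform's accessor expects that representation rather than a `java.time.LocalDateTime`. A minimal JDK-only sketch of that conversion (the class and method names here are mine, not part of the Iceberg API):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;

class TimestampMicros {
  // Convert a LocalDateTime to microseconds since 1970-01-01T00:00:00 (UTC),
  // which is how Iceberg represents timestamp-without-zone values internally.
  static long microsFromTimestamp(LocalDateTime ts) {
    LocalDateTime epoch = LocalDateTime.ofEpochSecond(0, 0, ZoneOffset.UTC);
    return ChronoUnit.MICROS.between(epoch, ts);
  }
}
```

Passing values in this `long` form (rather than `LocalDateTime`) is what the accessor's `Not an instance of java.lang.Long` message is asking for.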

We've been able to work around it with this logic, replicated below:

Work-around
private Record getPartitionableRecord(
    Record record, PartitionSpec spec, org.apache.iceberg.Schema schema) {
  if (spec.isUnpartitioned()) {
    return record;
  }
  Record output = GenericRecord.create(schema);
  for (PartitionField partitionField : spec.fields()) {
    Transform<?, ?> transform = partitionField.transform();
    Types.NestedField field = schema.findField(partitionField.sourceId());
    String name = field.name();
    Object value = record.getField(name);
    // Round-trip through a Literal so that string-parseable values (e.g. a
    // LocalDateTime's toString()) are converted to the internal representation
    // the partition transform expects (long micros for timestamps).
    @Nullable Literal<Object> literal = Literal.of(value.toString()).to(field.type());
    if (literal == null || transform.isVoid() || transform.isIdentity()) {
      // Keep the original value when no conversion is possible or needed.
      output.setField(name, value);
    } else {
      output.setField(name, literal.value());
    }
  }
  return output;
}

With that helper, the call becomes:

Record partitionableRec = getPartitionableRecord(rec, spec, schema);
pk.partition(partitionableRec);
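For context, once the source value is in its internal form, the `year` and `day` transforms produce ordinals relative to the Unix epoch (years since 1970 and days since 1970-01-01, per the Iceberg table spec). A JDK-only sketch of what those ordinals look like for the timestamp in the repro (again, these helper names are mine):

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

class PartitionOrdinals {
  // Iceberg's year transform: number of years from 1970 to the timestamp's year.
  static int yearOrdinal(LocalDateTime ts) {
    return ts.getYear() - 1970;
  }

  // Iceberg's day transform: number of days from 1970-01-01 to the timestamp's date.
  static int dayOrdinal(LocalDateTime ts) {
    return (int) ChronoUnit.DAYS.between(LocalDate.EPOCH, ts.toLocalDate());
  }
}
```

So for `2024-10-08T13:18:20.053`, the year partition value is 54 and the day partition value is 20004 (rendered in partition paths as `2024` and `2024-10-08`).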

This feels a little hacky and I would expect the Iceberg API to handle this by itself. Let me know if I'm missing something!

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
