Closed
Description
When Delta writes out a partitioned table, both the partition column name and its value are URI-encoded in the directory path. For example:
sdf = spark.createDataFrame([[1, 10], [2, 20]], schema=['x%20x', 'y'])
sdf.write.format("delta").partitionBy(["x%20x"]).save("partition_table")
The code above generates the following directory structure:
├─x%2520x=1
├─x%2520x=2
└─_delta_log
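To show where `x%2520x` comes from: percent-encoding the literal name `x%20x` escapes the `%` itself as `%25`. Below is a minimal sketch using Python's `urllib.parse`. It only illustrates the escaping and is not Delta's actual path-escaping code; the helper name `hive_segment` is hypothetical.

```python
from urllib.parse import quote, unquote


def hive_segment(name, value):
    # Hypothetical helper: percent-encode both sides of a
    # hive-style "name=value" directory segment.
    return f"{quote(str(name), safe='')}={quote(str(value), safe='')}"


segment = hive_segment("x%20x", 1)
print(segment)  # x%2520x=1

# Decoding the name part once gives back the original column name.
print(unquote(segment.split("=", 1)[0]))  # x%20x
```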
When we read this table with PyArrow, the partition column name comes back as x%2520x instead of x%20x:
In [3]: ds = dataset(paths, format="parquet", partitioning=partitioning(flavor="hive"))
In [4]: ds.to_table().to_pandas()
Out[4]:
    y  x%2520x
0  20        2
1  10        1
It seems that only the partition value is URI-decoded here, not the column name:
arrow/cpp/src/arrow/dataset/partition.cc
Lines 593 to 596 in e5f3e04
    case SegmentEncoding::Uri: {
      auto raw_value = util::string_view(segment).substr(name_end + 1);
      ARROW_ASSIGN_OR_RAISE(value, SafeUriUnescape(raw_value));
      break;
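The behaviour I would expect can be sketched in Python: split the segment at the first `=` and unescape both halves, not only the value. This is an illustration of the idea, not the Arrow implementation; `parse_hive_segment` is a hypothetical name, and `urllib.parse.unquote` stands in for `SafeUriUnescape`.

```python
from urllib.parse import unquote


def parse_hive_segment(segment):
    # Split "name=value" at the first '='; segments without '=' are not
    # hive-style partition segments, so return None for them.
    name, sep, raw_value = segment.partition("=")
    if not sep:
        return None
    # Unescape the name as well as the value (one decoding pass each).
    return unquote(name), unquote(raw_value)


print(parse_hive_segment("x%2520x=1"))  # ('x%20x', '1')
```

With this, the directory `x%2520x=1` from the example above maps back to the original column name `x%20x`.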
More context from this delta-rs issue: delta-io/delta-rs#495
Could anyone take a look please? Thanks.