[CPP] Arrow does not decode partition column name from directory path

When Delta writes out a partitioned table, both the partition column name and its value are URI-encoded in the directory path. For example:

```
sdf = spark.createDataFrame([[1, 10], [2, 20]], schema=['x%20x', 'y'])
sdf.write.format("delta").partitionBy(["x%20x"]).save("partition_table")
```

The code above generates the following directory structure:

```
├─x%2520x=1
├─x%2520x=2
└─_delta_log
```
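To illustrate where the `%2520` comes from, here is a minimal sketch using Python's `urllib.parse` as a stand-in for the escaping Delta applies (an assumption: Delta's own escaping routine may differ in which characters it encodes, but `%` is encoded as `%25` either way):

```python
from urllib.parse import quote, unquote

column_name = "x%20x"  # the literal column name from the example above

# Percent-encoding escapes the "%" itself as "%25", which yields the directory prefix.
encoded = quote(column_name, safe="")
print(encoded)  # x%2520x

# A single round of decoding recovers the original column name.
decoded = unquote(encoded)
print(decoded)  # x%20x
```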

When we read this table with PyArrow, the column name is not decoded, so it comes back as `x%2520x` instead of `x%20x`:

```
In [3]: ds = dataset(paths, format="parquet", partitioning=partitioning(flavor="hive"))

In [4]: ds.to_table().to_pandas()
Out[4]:
    y  x%2520x
0  20        2
1  10        1
```
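Until this is fixed, one possible workaround is to decode the partition column names after reading. The helper below is a hypothetical sketch, not part of PyArrow; it only decodes the names listed as partition columns, so that ordinary columns containing a literal `%` are left alone:

```python
from urllib.parse import unquote

def decode_partition_names(column_names, partition_columns):
    """Return column names with the listed partition columns URI-decoded."""
    return [unquote(n) if n in partition_columns else n for n in column_names]

names = decode_partition_names(["y", "x%2520x"], {"x%2520x"})
print(names)  # ['y', 'x%20x']
```

The resulting list could then be passed to `Table.rename_columns` on the table returned by `ds.to_table()`.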

It seems that only the value is decoded here, not the column name: https://github.com/apache/arrow/blob/e5f3e04b4b80c9b9c53f1f0f71f39d9f8308dced/cpp/src/arrow/dataset/partition.cc#L593-L596
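The behaviour I would expect can be sketched in Python as follows. This is only an illustration of the idea, not Arrow's implementation; `parse_hive_segment` is a hypothetical name, and `urllib.parse.unquote` stands in for the C++ unescape routine:

```python
from urllib.parse import unquote

def parse_hive_segment(segment):
    """Split a 'key=value' path segment and URI-decode both halves.

    Returns None for segments that are not hive-style partitions.
    """
    # Split on the first raw "=" before decoding, so an encoded "=" (%3D)
    # inside the key or value cannot be mistaken for the separator.
    name, sep, value = segment.partition("=")
    if not sep:
        return None
    return unquote(name), unquote(value)

print(parse_hive_segment("x%2520x=1"))   # ('x%20x', '1')
print(parse_hive_segment("_delta_log"))  # None
```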

More context from this delta-rs issue: https://github.com/delta-io/delta-rs/issues/495

Could anyone take a look please? Thanks.

	case SegmentEncoding::Uri: {
	auto raw_value = util::string_view(segment).substr(name_end + 1);
	ARROW_ASSIGN_OR_RAISE(value, SafeUriUnescape(raw_value));
	break;

[CPP] Arrow does not decode partition column name from directory path #11718
