[C++][FS][Azure] Consider how and if to fix disallowed characters in file metadata #40057

@Tom-Newton

Describe the enhancement requested

Child of #18014

Manually modifying #40021 to run the Python tests against a real blob storage account caused a failure when attempting to write the metadata {'Content-Type': 'x-pyarrow/test'}. Looking at the generic C++ tests, it looks like we will run into the same issue there:

auto metadata = KeyValueMetadata::Make({"Content-Type", "Content-Language"},
                                       {"x-arrow/filesystem-test", "fr_FR"});

E   OSError: CommitBlockList failed for 'https://tomtestflat.blob.core.windows.net/63217166-c9f8-11ee-989a-71cec6336ac8/open-output-stream-metadata'. Committing is required to flush an output/append stream. Azure Error: [InvalidMetadata] 400 The metadata specified is invalid. It has characters that are not permitted.
E   The metadata specified is invalid. It has characters that are not permitted.
E   RequestId:ac61e6da-e01e-0009-2305-5e07de000000
E   Time:2024-02-12T22:45:15.5130135Z
E   Request ID: ac61e6da-e01e-0009-2305-5e07de000000

It turns out real Azure storage doesn't allow hyphens (-) in metadata keys. This behaviour is the same on flat and hierarchical namespace storage accounts, but Azurite accepts such keys without error.

Apparently, metadata keys (names) must conform to the naming rules for C# identifiers: https://learn.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#metadata-names
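To make the rule concrete, here is a minimal sketch of a pre-flight check that could reject bad keys before a request is sent. `IsValidAzureMetadataKey` is a hypothetical helper, not an existing Arrow or Azure SDK function, and it assumes a conservative ASCII-only reading of the C# identifier rules (the real rules also admit some Unicode characters, which this sketch rejects).

```cpp
#include <string>

// Hypothetical helpers, not part of Arrow or the Azure SDK.
inline bool IsAsciiLetter(char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

inline bool IsAsciiDigit(char c) { return c >= '0' && c <= '9'; }

// Conservative ASCII-only approximation of the C# identifier rules
// (an assumption of this sketch): the first character must be a letter
// or underscore; the rest must be letters, digits, or underscores.
bool IsValidAzureMetadataKey(const std::string& key) {
  if (key.empty()) return false;
  if (!IsAsciiLetter(key[0]) && key[0] != '_') return false;
  for (std::string::size_type i = 1; i < key.size(); ++i) {
    const char c = key[i];
    if (!IsAsciiLetter(c) && !IsAsciiDigit(c) && c != '_') return false;
  }
  return true;
}
```

Under this check, both "Content-Type" and "Content-Language" from the generic tests are rejected because of the hyphen, while a key such as "ContentType" passes.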

We need to decide what we want to do here. I think we either have to accept these limitations, or we need to encode the metadata keys before writing them to Azure and decode them when reading them back. A quick Google search turned up these options for potential encodings: https://stackoverflow.com/questions/32037525/encode-to-alphanumeric-in-javascript#:~:text=To%20encode%20to%20an%20alphanumeric,in%20a%20shorter%20encoded%20string.

The downside of encoding the metadata is that other Azure clients won't know to decode it.
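As one illustration of the encoding option, here is a rough sketch of a reversible escape scheme. It is not a proposal for the final design: `EncodeMetadataKey` and `DecodeMetadataKey` are hypothetical names, and the scheme itself (an underscore followed by two hex digits for every byte that is not an ASCII letter or digit, including the underscore itself and a leading digit) is an assumption chosen so that the output contains only letters, digits, and underscores.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

namespace {

bool IsAsciiAlnum(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
         (c >= '0' && c <= '9');
}

// Returns the value of a hex digit, or -1 if c is not one.
int HexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  return -1;
}

}  // namespace

// Hypothetical scheme: bytes that are not ASCII letters or digits become
// "_XX" (uppercase hex). '_' is always escaped so decoding is unambiguous,
// and a leading digit is escaped so the result never starts with a digit.
std::string EncodeMetadataKey(const std::string& key) {
  static const char kHex[] = "0123456789ABCDEF";
  std::string out;
  for (std::size_t i = 0; i < key.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(key[i]);
    const bool leading_digit = (i == 0 && c >= '0' && c <= '9');
    if (IsAsciiAlnum(c) && !leading_digit) {
      out.push_back(static_cast<char>(c));
    } else {
      out.push_back('_');
      out.push_back(kHex[c >> 4]);
      out.push_back(kHex[c & 0x0F]);
    }
  }
  return out;
}

// Inverse of EncodeMetadataKey; throws on malformed escapes.
std::string DecodeMetadataKey(const std::string& encoded) {
  std::string out;
  for (std::size_t i = 0; i < encoded.size(); ++i) {
    if (encoded[i] != '_') {
      out.push_back(encoded[i]);
      continue;
    }
    if (i + 2 >= encoded.size()) {
      throw std::invalid_argument("truncated escape in metadata key");
    }
    const int hi = HexValue(encoded[i + 1]);
    const int lo = HexValue(encoded[i + 2]);
    if (hi < 0 || lo < 0) {
      throw std::invalid_argument("invalid escape in metadata key");
    }
    out.push_back(static_cast<char>((hi << 4) | lo));
    i += 2;  // skip the two hex digits just consumed
  }
  return out;
}
```

With this scheme "Content-Type" would be stored as "Content_2DType", which also shows the downside concretely: a client that does not know the scheme would see the escaped name rather than the original key.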

Component(s)

C++
