Description
Describe the enhancement requested
Child of #18014
Manually modifying #40021 to run the Python tests against a real blob storage account caused a failure when attempting to write the metadata {'Content-Type': 'x-pyarrow/test'}. Looking at the generic C++ tests, it looks like we will run into the same issue there:
arrow/cpp/src/arrow/filesystem/test_util.cc
Lines 912 to 913 in c23a097
auto metadata = KeyValueMetadata::Make({"Content-Type", "Content-Language"},
                                       {"x-arrow/filesystem-test", "fr_FR"});
E OSError: CommitBlockList failed for 'https://tomtestflat.blob.core.windows.net/63217166-c9f8-11ee-989a-71cec6336ac8/open-output-stream-metadata'. Committing is required to flush an output/append stream. Azure Error: [InvalidMetadata] 400 The metadata specified is invalid. It has characters that are not permitted.
E The metadata specified is invalid. It has characters that are not permitted.
E RequestId:ac61e6da-e01e-0009-2305-5e07de000000
E Time:2024-02-12T22:45:15.5130135Z
E Request ID: ac61e6da-e01e-0009-2305-5e07de000000
It turns out real Azure storage doesn't allow hyphens (-) in metadata keys. This behaviour is the same on flat and hierarchical namespace storage accounts, but Azurite accepts such keys without error.
Apparently the keys (names) of metadata must conform to the naming rules for C# identifiers: https://learn.microsoft.com/en-us/rest/api/storageservices/naming-and-referencing-containers--blobs--and-metadata#metadata-names
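To make the constraint concrete, here is a rough sketch of a check that could reject bad keys before the request is sent. `IsValidMetadataKey` is a hypothetical helper, not something in Arrow or the Azure SDK, and it only approximates the C# identifier rules with their ASCII subset (letters, digits and underscore, not starting with a digit).

```cpp
#include <string_view>

// Hypothetical helper, not part of Arrow or the Azure SDK.
// Assumption: an ASCII-only approximation of the C# identifier rules.
// The key must be non-empty, start with a letter or underscore, and
// contain only letters, digits and underscores.
bool IsValidMetadataKey(std::string_view key) {
  auto is_alpha = [](char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
  };
  auto is_digit = [](char c) { return c >= '0' && c <= '9'; };

  if (key.empty()) return false;
  if (!is_alpha(key[0]) && key[0] != '_') return false;
  for (char c : key) {
    if (!is_alpha(c) && !is_digit(c) && c != '_') return false;
  }
  return true;
}
```

With this check, "Content-Type" and "Content-Language" from the generic test are both rejected, while "Content_Type" would pass.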
We need to decide what we want to do here. I think we either have to accept these limitations, or we need to encode the metadata keys before writing them to Azure and decode them when reading them back. A quick Google search came up with these options for potential encodings: https://stackoverflow.com/questions/32037525/encode-to-alphanumeric-in-javascript#:~:text=To%20encode%20to%20an%20alphanumeric,in%20a%20shorter%20encoded%20string.
The downside of encoding the metadata would be that other Azure clients won't know to decode.
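For discussion, here is a minimal sketch of what a reversible encoding could look like. It is not a proposal for the final scheme: the `_xHH` escape format and the names `EncodeMetadataKey` and `DecodeMetadataKey` are made up for illustration. Every byte that is not an ASCII letter or digit (including `_` itself, so decoding stays unambiguous) and any leading digit is written as `_x` followed by two hex digits.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical scheme: escape every byte that is not an ASCII letter or
// digit as "_xHH". '_' is escaped too, so a '_' in the encoded key always
// starts an escape. A leading digit is escaped because identifiers
// cannot start with one.
std::string EncodeMetadataKey(std::string_view key) {
  static constexpr char kHex[] = "0123456789ABCDEF";
  std::string out;
  for (std::size_t i = 0; i < key.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(key[i]);
    const bool is_alpha = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    const bool is_digit = (c >= '0' && c <= '9');
    if (is_alpha || (is_digit && i > 0)) {
      out.push_back(static_cast<char>(c));
    } else {
      out += "_x";
      out.push_back(kHex[c >> 4]);
      out.push_back(kHex[c & 0x0F]);
    }
  }
  return out;
}

// Returns std::nullopt if the input contains a '_' that is not a valid
// "_xHH" escape, i.e. the key was not produced by EncodeMetadataKey.
std::optional<std::string> DecodeMetadataKey(std::string_view encoded) {
  auto hex_value = [](char c) -> int {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
  };
  std::string out;
  for (std::size_t i = 0; i < encoded.size(); ++i) {
    if (encoded[i] != '_') {
      out.push_back(encoded[i]);
      continue;
    }
    // Need 'x' plus two hex digits after the underscore.
    if (i + 3 >= encoded.size() || encoded[i + 1] != 'x') return std::nullopt;
    const int hi = hex_value(encoded[i + 2]);
    const int lo = hex_value(encoded[i + 3]);
    if (hi < 0 || lo < 0) return std::nullopt;
    out.push_back(static_cast<char>(hi * 16 + lo));
    i += 3;
  }
  return out;
}
```

Under this scheme "Content-Type" would be stored as "Content_x2DType". A key written by another client, such as "Content_Type", fails to decode, so a reader could fall back to returning it unchanged. It also shows the downside concretely: other Azure clients would see "Content_x2DType" rather than the original key.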
Component(s)
C++