[C++] Don't allow the inclusion of passwords (storage account keys) in Azure ABFS URLs #43197
Description
Describe the enhancement requested
Outline
The Azure Blob File System (ABFS) support in Apache Arrow, implemented in the C++ API by #18014 and integrated into the Python API by #39968, currently allows embedding a storage account key as a password in an ABFS URL.
https://github.com/apache/arrow/blob/r-16.1.0/cpp/src/arrow/filesystem/azurefs.h#L138-L144
However, I strongly recommend stopping this practice for two reasons.
Security
An access key of a storage account is practically a "root password," giving full access to the data in the storage account.
Microsoft repeatedly emphasises this point in various places in the documentation and encourages the protection of account keys in a secure place like Azure Key Vault.
Protect your access keys
Storage account access keys provide full access to the storage account data and the ability to generate SAS tokens. Always be careful to protect your access keys. Use Azure Key Vault to manage and rotate your keys securely.
https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string#protect-your-access-keys
Because a URL is usually not considered confidential information, embedding a storage account key in an ABFS URL may lead to unexpected exposure of the key.
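To illustrate the exposure risk, here is a minimal Python sketch (the account name and key below are fake placeholders, not real credentials) showing that a key embedded in a URL is just part of the URL string, so anything that logs, prints, or reports the URL leaks the key verbatim:

```python
from urllib.parse import urlsplit

# Hypothetical ABFS URL with a storage account key embedded as the password.
# "SECRET_ACCOUNT_KEY" is a fake placeholder, not a real credential.
url = "abfss://myaccount:SECRET_ACCOUNT_KEY@myaccount.dfs.core.windows.net/container/file"

parts = urlsplit(url)
# Standard URL parsing exposes the password as part of the userinfo...
assert parts.password == "SECRET_ACCOUNT_KEY"

# ...so any code path that records the raw URL (logs, error messages,
# progress output) leaks the key verbatim.
log_line = f"opening {url}"
assert "SECRET_ACCOUNT_KEY" in log_line
```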
Interoperability with other file system implementations
For historical reasons, the syntax of the Azure Blob File System (ABFS) URL is inconsistent between different file system implementations.
Original implementation in Apache Hadoop's hadoop-azure package link
- abfs[s]://<container>@<account>.dfs.core.windows.net/path/to/file
This syntax is widely used, particularly by Apache Spark.
Python adlfs for fsspec link
- Hadoop-compatible URL syntax, and
- az://<container>/path/to/file
- abfs[s]://<container>/path/to/file
Rust object_store::azure link
- Hadoop-compatible URL syntax,
- adlfs-compatible URL syntax, and
- azure://<container>/path/to/file
- https://<account>.blob.core.windows.net/<container>/path/to/file
- https://<account>.dfs.core.windows.net/<container>/path/to/file
DuckDB azure extension link
- abfss://<container>/path/to/file
- abfss://<account>.dfs.core.windows.net/<container>/path/to/file
Apache Arrow link
- Hadoop-compatible URL syntax, and
- abfs[s]://[:<password>@]<account>.blob.core.windows.net/<container>/path/to/file
- abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net/path/to/file
- abfs[s]://[<account>[:<password>]@]<host[.domain]>[:<port>]/<container>/path/to/file
- abfs[s]://[<account>[:<password>]@]<container>[/path]
This syntactic inconsistency already causes problems in applications that use different frameworks, including the additional overhead of translating ABFS URLs between syntaxes. It may also lead to unexpected behaviours when different file system implementations misinterpret the same URL.
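The misinterpretation risk can be demonstrated with plain URL parsing. In this sketch (all names are hypothetical placeholders), the same URL components carry different meanings depending on which syntax an implementation assumes:

```python
from urllib.parse import urlsplit

# Hypothetical ABFS URL; "mydata" and "example" are placeholder names.
url = "abfs://mydata@example.dfs.core.windows.net/path/to/file"
parts = urlsplit(url)

# Under the Hadoop-style syntax
# abfs[s]://<container>@<account>.dfs.core.windows.net/..., the userinfo is
# the container and the first host label is the account:
hadoop_container = parts.username              # "mydata"
hadoop_account = parts.hostname.split(".")[0]  # "example"

# Under a generic syntax of the form abfs[s]://<account>@<container>/...,
# the very same components would be read the other way around:
generic_account = parts.username               # "mydata"
generic_container = parts.hostname             # "example.dfs.core.windows.net"

# One URL, two incompatible readings of the same components.
assert hadoop_container == generic_account == "mydata"
assert hadoop_account != generic_container
```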
I believe a new file system implementation should respect the existing syntax of a URL scheme and SHOULD NOT invent new ones. As far as I understand, no other ABFS file system implementation allows embedding storage account keys in ABFS URLs.
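One way to enforce this recommendation is to reject any ABFS URL that carries a userinfo password before doing any further parsing. Below is a minimal Python sketch of such a guard; the function name and error message are illustrative only, not Arrow's actual API:

```python
from urllib.parse import urlsplit

def check_no_password(url: str) -> None:
    """Reject ABFS URLs that embed a password (e.g. a storage account key).

    Illustrative helper, not part of any real file system API.
    """
    parts = urlsplit(url)
    if parts.password is not None:
        raise ValueError(
            "a password (storage account key) must not be embedded in an "
            "ABFS URL; pass credentials out of band instead"
        )

# A password-free Hadoop-style URL passes the check...
check_no_password("abfss://container@account.dfs.core.windows.net/path/to/file")

# ...while a URL embedding a (fake) key is rejected.
rejected = False
try:
    check_no_password("abfss://container:FAKE_KEY@account.dfs.core.windows.net/path")
except ValueError:
    rejected = True
assert rejected
```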
Component(s)
C++, Python