[C++] Don't allow the inclusion of passwords (storage account keys) in Azure ABFS URLs #43197

@sugibuchi

Describe the enhancement requested

Outline

The Azure Blob File System (ABFS) support in Apache Arrow, implemented in the C++ API by #18014 and integrated into the Python API by #39968, currently allows embedding a storage account key as a password in an ABFS URL.

https://github.com/apache/arrow/blob/r-16.1.0/cpp/src/arrow/filesystem/azurefs.h#L138-L144

However, I strongly recommend stopping this practice for two reasons.

Security

An access key of a storage account is practically a "root password," giving full access to the data in the storage account.

Microsoft repeatedly emphasises this point in its documentation and recommends keeping account keys in a secure place such as Azure Key Vault.

Protect your access keys

Storage account access keys provide full access to the storage account data and the ability to generate SAS tokens. Always be careful to protect your access keys. Use Azure Key Vault to manage and rotate your keys securely.
https://learn.microsoft.com/en-us/azure/storage/common/storage-configure-connection-string#protect-your-access-keys

ABFS URLs are usually not treated as confidential information, so embedding a storage account key in one may lead to unexpected exposure of the key, for example when the URL is written to logs or error messages.
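To illustrate the risk, here is a minimal Python sketch of how an application could detect and redact an embedded key before an ABFS URL is logged. It uses only the standard library's `urllib.parse`; the function names `has_embedded_password` and `redact_password` are hypothetical and not part of Arrow, and the key in the usage example is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def has_embedded_password(url: str) -> bool:
    """Return True if the URL carries a password in its userinfo part."""
    return urlsplit(url).password is not None


def redact_password(url: str) -> str:
    """Return the URL with any embedded password replaced by '***'.

    Hypothetical helper for illustration only; it does not validate
    that the URL is a well-formed ABFS URL.
    """
    parts = urlsplit(url)
    if parts.password is None:
        return url
    # netloc looks like "<user>:<password>@<host>[:port]"
    userinfo, _, hostport = parts.netloc.rpartition("@")
    user = userinfo.partition(":")[0]
    return urlunsplit(parts._replace(netloc=f"{user}:***@{hostport}"))


if __name__ == "__main__":
    # "FAKEKEY" is a placeholder, not a real storage account key.
    url = "abfs://container:FAKEKEY@account.dfs.core.windows.net/path/to/file"
    print(has_embedded_password(url))
    print(redact_password(url))
```

Redaction of this kind only mitigates the problem after the fact. Any code path that handles the raw URL string before redaction can still leak the key.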

Interoperability with other file system implementations

For historical reasons, the syntax of the Azure Blob File System (ABFS) URL is inconsistent between different file system implementations.

Original implementation in Apache Hadoop's hadoop-azure package

  • abfs[s]://<container>@<account>.dfs.core.windows.net/path/to/file

This syntax is widely used, particularly by Apache Spark.

Python adlfs for fsspec

  • Hadoop-compatible URL syntax, and
  • az://<container>/path/to/file
  • abfs[s]://<container>/path/to/file

Rust object_store::azure

  • Hadoop-compatible URL syntax
  • adlfs-compatible URL syntax, and
  • azure://<container>/path/to/file
  • https://<account>.blob.core.windows.net/<container>/path/to/file
  • https://<account>.dfs.core.windows.net/<container>/path/to/file

DuckDB azure extension

  • abfss://<container>/path/to/file
  • abfss://<account>.dfs.core.windows.net/<container>/path/to/file

Apache Arrow

  • Hadoop-compatible URL syntax, and
  • abfs[s]://[:<password>@]<account>.blob.core.windows.net/<container>/path/to/file
  • abfs[s]://<container>[:<password>]@<account>.dfs.core.windows.net/path/to/file
  • abfs[s]://[<account[:<password>]@]<host[.domain]>[<:port>]/<container>/path/to/file
  • abfs[s]://[<account[:<password>]@]<container>[/path]

This inconsistency already causes problems in applications that use several frameworks, including the additional overhead of translating ABFS URLs between the different syntaxes. It may also lead to unexpected behaviours when different file system implementations interpret the same URL differently.
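As an example of the translation overhead, here is a minimal Python sketch that converts a Hadoop-style ABFS URL into the `az://<container>/path` form listed above for adlfs. The names `AbfsLocation`, `parse_hadoop_abfs_url`, and `to_adlfs_url` are hypothetical, the sketch assumes the public `dfs.core.windows.net` endpoint, and it rejects URLs that embed a password rather than carrying the key along.

```python
from typing import NamedTuple
from urllib.parse import urlsplit

# Assumption: only the public Azure cloud endpoint is handled here.
DFS_SUFFIX = ".dfs.core.windows.net"


class AbfsLocation(NamedTuple):
    account: str
    container: str
    path: str


def parse_hadoop_abfs_url(url: str) -> AbfsLocation:
    """Parse abfs[s]://<container>@<account>.dfs.core.windows.net/path."""
    parts = urlsplit(url)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URL: {parts.scheme!r}")
    if parts.password is not None:
        raise ValueError("credentials must not be embedded in ABFS URLs")
    container = parts.username
    host = parts.hostname or ""
    if not container or not host.endswith(DFS_SUFFIX):
        raise ValueError("expected <container>@<account>" + DFS_SUFFIX)
    account = host[: -len(DFS_SUFFIX)]
    return AbfsLocation(account, container, parts.path.lstrip("/"))


def to_adlfs_url(loc: AbfsLocation) -> str:
    """Render the location in the az://<container>/path form."""
    return f"az://{loc.container}/{loc.path}"


if __name__ == "__main__":
    loc = parse_hadoop_abfs_url(
        "abfss://data@myaccount.dfs.core.windows.net/path/to/file"
    )
    print(loc)
    print(to_adlfs_url(loc))
```

Note that the `az://` form drops the account name, so the caller has to pass it to the target file system by some other means. Every additional syntax adds another conversion of this kind, each with its own loss of information.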

I believe a new file system implementation should respect the existing syntax of a URL scheme and SHOULD NOT invent new ones. As far as I understand, no other ABFS file system implementation allows embedding storage account keys in ABFS URLs.

Component(s)

C++, Python
