[Python/C++] S3FileSystem slow to deserialize due to AWS rule engine JSON parsing #40279

@fjetter

Describe the bug, including details regarding any error messages, version, and platform.

Deserializing a pickled S3FileSystem instance is surprisingly slow:

import pickle

import boto3
from pyarrow.fs import S3FileSystem

# Going through boto3 is not strictly necessary, but passing all keys and tokens explicitly avoids one HTTP request during init
session = boto3.session.Session()
credentials = session.get_credentials()

fs = S3FileSystem(
    secret_key=credentials.secret_key,
    access_key=credentials.access_key,
    region="us-east-2",
    session_token=credentials.token,
)
# Note: the same slowdown shows up with a plain S3FileSystem(), but that issues one HTTP request, and I want to isolate the slow JSON parser (see below)
%timeit pickle.loads(pickle.dumps(fs))

This takes 1.01 ms ± 153 µs per loop on my machine.
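For anyone reproducing this outside IPython, here is a minimal sketch that times serialization and deserialization separately using only the standard library. `time_pickle` is a hypothetical helper name, not part of pyarrow, and the dictionary in the usage section is only a stand-in so the script runs without AWS credentials.

```python
import pickle
import timeit


def time_pickle(obj, number=1000, repeat=5):
    """Time pickle.dumps and pickle.loads separately for obj.

    Returns the best per-call time in seconds for each direction.
    """
    # Serialize once up front so that loads is timed in isolation.
    payload = pickle.dumps(obj)
    dumps_best = min(
        timeit.repeat(lambda: pickle.dumps(obj), number=number, repeat=repeat)
    )
    loads_best = min(
        timeit.repeat(lambda: pickle.loads(payload), number=number, repeat=repeat)
    )
    return {"dumps": dumps_best / number, "loads": loads_best / number}


if __name__ == "__main__":
    # Stand-in object (assumption); pass the S3FileSystem instance `fs`
    # from the report here to measure the actual case.
    timings = time_pickle({"region": "us-east-2"})
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1e6:.1f} µs per call")
```

Splitting the two directions should show whether the cost sits in `pickle.loads`, which is where the filesystem object gets reconstructed.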

A py-spy profile shows that most of the time is spent in internal JSON parsing, which appears to come from the AWS rule engine. Is there a way to avoid this?

[Image: py-spy profile]

Component(s)

Python
