Add order option to encoders#627

Merged
jcrist merged 1 commit into main from sort-keys on Jan 16, 2024

@jcrist jcrist commented Jan 7, 2024

Add order kwarg to encoders

This adds an order kwarg to all encoders for configuring how unordered
collections/objects are encoded. Options are:

  • None: the default. All objects are encoded in the most efficient
    manner corresponding to their in-memory representation.
  • 'deterministic': Unordered collections (sets, dicts) are sorted
    before encoding. This ensures a consistent output between runs, which
    may be useful when comparing/hashing the encoded binary
    representation.
  • 'sorted': same as 'deterministic', but all object-like objects
    will have their fields encoded in alphabetical order by name. This is
    more expensive than 'deterministic', but may be useful for making
    the output more human readable.

The 'deterministic' output has been heavily optimized. Given the work
required to accomplish this feature, I wouldn't expect we can speed up
this operation much more. The 'sorted' option has not been fully
optimized (the assumption being that human-readable output is rarely
performance-sensitive). If needed, there are some rather simple
optimizations we can add here to speed this up further.

In general, msgspec.json.encode(obj, order="deterministic") should be
as fast or faster than orjson.dumps(obj, option=orjson.OPT_SORT_KEYS).
For common small object sizes we average a ~20% speedup over orjson
for key sorting.

In [1]: import msgspec, orjson, random

In [2]: enc = msgspec.json.Encoder(order="deterministic")

In [3]: keys = [f'field_{i}' for i in range(6)]

In [4]: random.shuffle(keys)

In [5]: msg = dict(zip(keys, range(len(keys))))

In [6]: %timeit enc.encode(msg)
305 ns ± 2.99 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [7]: %timeit orjson.dumps(msg, option=orjson.OPT_SORT_KEYS)
377 ns ± 2.04 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

Fixes #609.

@jcrist jcrist changed the title from "[WIP] Add sort_keys option to encoders" to "Add sort_keys option to encoders" Jan 15, 2024
@jcrist jcrist changed the title from "Add sort_keys option to encoders" to "Add order option to encoders" Jan 15, 2024
@jcrist jcrist merged commit 38c3330 into main Jan 16, 2024
@jcrist jcrist deleted the sort-keys branch January 16, 2024 00:30
Linked issue (may be closed by merging this pull request): encode sort_keys argument