-
Notifications
You must be signed in to change notification settings - Fork 943
Description
I was trying to optimize the performance for a high-throughput use case (proposed by @janosh) where one might need to load/save large quantities of Structure as JSON, and I'm currently starting with the saving phase of the workflow.
A quick line profiling reveals two significant bottlenecks:
as_dictmethod ofStructure: I did some preliminary profiling, and looks like this is owing to (unnecessarily) repeated calculation ofLatticeproperties likelengths/anglesduring serialization (and I believe proper caching could resolve this with correct cache voiding mechanism upon internal_matrixchange). But this would be a separate PR so won't cover too much here.- Topic of this thread: slow
json.dumpto file
Total time: 65.3412 s
File: show_structure_json_bottleneck.py
Function: save_to_file at line 38
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 @profile
39 def save_to_file(tmpdir):
40 1001 921.1 0.9 0.0 for i, struct in enumerate(STRUCTURES):
41 1000 32742869.8 32742.9 50.1 d = struct.as_dict()
42 1000 9134.1 9.1 0.0 path = os.path.join(tmpdir, f"struct_{i}.json")
43 2000 77023.5 38.5 0.1 with open(path, "w") as f:
44 1000 32511267.6 32511.3 49.8 json.dump(d, f)
And if we replace builtin json with the faster drop-in replacement orjson (I'm not aware of any caveat so far), we got (tested on WSL2 Ubuntu 22.04):
- almost 10x speed up converting Structure dict to string
- more than 10x speed up for saving Structure dict to file
- a little speed up loading JSON back
== Dump to String ==
json: 0.9821 s
orjson: 0.1173 s
== Save to File ==
json_plain: 3.2918 s, 105549.10 KB
orjson_plain: 0.2284 s, 97192.09 KB
json_gz: 8.0444 s, 32786.10 KB
orjson_gz: 3.3784 s, 32217.06 KB
== Load from File ==
json_plain: 3.1989 s
orjson_plain: 2.6992 s
json_gz: 3.4490 s
orjson_gz: 2.9699 s
Script: compare_structure_json.txt
As the speed up is quite obvious with almost no code change (also considering JSON is heavily used around the code base, so not only Structure would benefit from this), perhaps we could consider adding orjson as the default JSON handler? (I'm afraid adding it as an optional dependency would incur too much code change to detect whether orjson is available)
But also I'm a bit unsure about the migration process as current json.dumps takes kwargs (e.g. #4295) and this might cause breakage:
pymatgen/src/pymatgen/core/structure.py
Line 2963 in 1e36ee5
| json_str = json.dumps(self.as_dict(), **kwargs) |
monty JSON and lack of custom JSON encoder support
orjson is already an optional dependency for monty so maybe it's also good for us to double check if orjson is fully used from monty's side as quite some JSON-related workloads are implicitly handled by monty.
Also as orjson doesn't directly support custom encoder/decoder (e.g. MontyEncoder), some more thoughts might be needed to this end