[JIT] Make new zip serialization for torch save/load significantly (~70%) faster #38379
voznesenskym wants to merge 55 commits into master
Conversation
Not really ready for review
💊 CI failures summary and remediations, as of commit 2338a31 (more details on the Dr. CI page):
ci.pytorch.org: 2 failed
> #include <c10/util/llvmMathExtras.h>
> namespace detail {
I think this PR is moving in the right direction, and thanks for working on it. :-) Not sure if you are aware of the similar optimization done internally for the FB torch::save/load. According to https://fburl.com/46eitcmk, for the internal optimization, we used folly's crc function, which was 40x faster than the vanilla implementation. Could you compare your implementation with it? cc: @dzhulgakov @jjlilley
Oh this is folly's implementation. I just hacked it around a little to make it work folly-free ;)
Interesting - good that somebody's looking at this!
Thus far, this seems to be the software version, though? IIRC, the vast majority of potential gains are from the hw impl, vs a better sw impl, e.g. in running folly/hash/test/ChecksumTest.cpp:
```
crc32_hardware_512KB_block    27.80us
crc32_software_512KB_block     1.48ms
```
My vague (from 6 months ago) recollection was that the existing mz_crc32() table-based software impl in minizip wasn't that bad, vs folly's. It probably lagged (and your benchmarks seem to bear this out), but kind of faded in the background compared with 1480/27.8 = ~54x perf gap above between hardware and software.
fwiw, I played around (months ago) with hacking in folly's impl in D19635487, though I ended up spending too much time fighting with the OSS CMakeList issues and ran out of time I could spend.
So it might be worth focusing on hw (and btw, crc32c shouldn't be needed for zip, alas Intel decided to make its standalone crc sse function do crc32c instead of crc32), if the target is zip here.
But in any case, completely agree that in non-fb OSS, the CRC performance is quite slow, and there's a lot of benefit in OSS from fixing this.
After this, from the traces, my memory is that some other areas that might be worth considering are:
- for large payloads (mostly those that don't fit in L2), consider that torch::save() requires a couple extra passes over the memory than the wire serializer - one pass to crc everything (churning the cache rather than doing incrementally), and another to copy into a flat buffer, rather than refcnting an IOBuf.
- for small payloads, when running torch::load(), the cost of parsing/lexing the accompanying mandatory mini .py program that torch::save() adds is non-trivial.
Hey! Yes, a subsequent PR will add the hardware version to this! I should probably actually add it here so the switch from miniz to custom crc is cleaner. Good callout.
> My vague (from 6 months ago) recollection was that the existing mz_crc32() table-based software impl in minizip wasn't that bad, vs folly's. It probably lagged (and your benchmarks seem to bear this out), but kind of faded in the background compared with 1480/27.8 = ~54x perf gap above between hardware and software.
The software impl of mz_crc32() is 1.5-5x slower than folly's (at least in this benchmark). Not massive, but not shippable either.
> So it might be worth focusing on hw (and btw, crc32c shouldn't be needed for zip, alas Intel decided to make its standalone crc sse function do crc32c instead of crc32), if the target is zip here.
Agreed across the board. This is the first step towards moving from JIT "new" serialization disabled by default to enabled by default. Once we have that, we can have full py/C++ interop.
For your two points above about cache churn on large payloads and the parser+lexer overhead, these are extremely interesting. I think a good next step after this initial PR will be to write deeper benchmarks that enable us to find all sorts of issues like this, and consider how we want to do torch::load() and torch::save() when the payloads are at extremes (versus uniformly the same as I believe we do now).
> @@ -0,0 +1,1099 @@
> // Boost CRC library crc.hpp header file -----------------------------------//
Maybe this should go in the relevant third_party dir? (i.e. fbcode/caffe2/third_party)
Also, if this is mostly for boosting the sw crc speed, the complexity of this dependency might not be warranted (given that we probably only want to run the sw version on a few of the last/first unaligned bytes on the buffer). I suspect that if adding these is green-lit, we might need to do things like including the LICENSE file (see line 4 below), etc.
Yeah, will move to third_party!
I think there may be cases when it runs only the SW version? I do not know yet. @suo thoughts?
Either way, I plan to have both HW and SW support here.
facebook-github-bot
left a comment
@voznesenskym is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@voznesenskym merged this pull request in fce01a9.
…70%) faster (pytorch#38379)

Summary:
Before:

```
2020-05-11 18:31:41 INFO Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
    "Big Tensors Save": {
        "mean": 17.8048762,
        "median": 17.458917
    },
    "Big Tensors Load": {
        "mean": 3.2556887,
        "median": 2.9668495000000004
    },
    "Small Tensors Save": {
        "mean": 4.0381357,
        "median": 3.9440125
    },
    "Small Tensors Load": {
        "mean": 5.8792499,
        "median": 5.603067
    },
    "benchmark_run_at": "2020-05-12T01:31:41"
}
```

After:

```
Use zipfile serialization: True
2020-05-12 20:15:32 INFO Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
    "Big Tensors Save": {
        "mean": 4.7534657,
        "median": 4.646732
    },
    "Big Tensors Load": {
        "mean": 3.6001919,
        "median": 3.493285
    },
    "Small Tensors Save": {
        "mean": 4.1066924,
        "median": 4.1219255
    },
    "Small Tensors Load": {
        "mean": 6.3902358,
        "median": 6.36977
    },
    "benchmark_run_at": "2020-05-13T03:15:32"
}
```

Pull Request resolved: pytorch#38379
Differential Revision: D21779494
Pulled By: voznesenskym
fbshipit-source-id: 694d65029a5b817424d454bd331e285df828c67a