[JIT] Make new zip serialization for torch save/load significantly (~70%) faster #38379
voznesenskym wants to merge 55 commits into master
Conversation
Not really ready for review
💊 CI failures summary and remediations, as of commit 2338a31 (more details on the Dr. CI page):
ci.pytorch.org: 2 failed
> #include <c10/util/llvmMathExtras.h>
> namespace detail {
I think this PR is moving in the right direction, and thanks for working on it. :-) Not sure if you are aware of the similar optimization done internally for the FB torch::save/load. According to https://fburl.com/46eitcmk, for the internal optimization, we used folly's crc function, which was 40x faster than the vanilla implementation. Could you compare your implementation with it? cc: @dzhulgakov @jjlilley
Oh this is folly's implementation. I just hacked it around a little to make it work folly-free ;)
Interesting - good that somebody's looking at this!
Thus far, this seems to be the software version, though? IIRC, the vast majority of potential gains are from the hw impl, vs a better sw impl, e.g. in running folly/hash/test/ChecksumTest.cpp:
```
crc32_hardware_512KB_block    27.80us
crc32_software_512KB_block     1.48ms
```
My vague (from 6 months ago) recollection was that the existing mz_crc32() table-based software impl in minizip wasn't that bad, vs folly's. It probably lagged (and your benchmarks seem to bear this out), but kind of faded in the background compared with 1480/27.8 = ~54x perf gap above between hardware and software.
fwiw, I played around (months ago) with hacking in folly's impl in D19635487, though I ended up spending too much time fighting with the OSS CMakeList issues and ran out of time I could spend.
So it might be worth focusing on hw (and btw, crc32c shouldn't be needed for zip, alas Intel decided to make its standalone crc sse function do crc32c instead of crc32), if the target is zip here.
But in any case, completely agree that in non-fb OSS, the CRC performance is quite slow, and there's a lot of benefit in OSS from fixing this.
After this, from the traces, my memory is that some other areas that might be worth considering are:
- for large payloads (mostly those that don't fit in L2), consider that torch::save() requires a couple extra passes over the memory than the wire serializer - one pass to crc everything (churning the cache rather than doing incrementally), and another to copy into a flat buffer, rather than refcnting an IOBuf.
- for small payloads, when running torch::load(), the cost of parsing/lexing the accompanying mandatory mini .py program that torch::save() adds is non-trivial.
Hey! Yes, a subsequent PR will add the hardware version to this! I should probably actually add it here so the switch from miniz to custom crc is cleaner. Good callout.
> My vague (from 6 months ago) recollection was that the existing mz_crc32() table-based software impl in minizip wasn't that bad, vs folly's. It probably lagged (and your benchmarks seem to bear this out), but kind of faded in the background compared with 1480/27.8 = ~54x perf gap above between hardware and software.
The software impl of mz_crc32() is 1.5-5x slower than folly's (at least in this benchmark). Not massive, but not shippable either.
> So it might be worth focusing on hw (and btw, crc32c shouldn't be needed for zip, alas Intel decided to make its standalone crc sse function do crc32c instead of crc32), if the target is zip here.
Agreed across the board. This is the first step towards moving from JIT "new" serialization disabled by default to enabled by default. Once we have that, we can have full py/C++ interop.
For your two points above about cache churn on large payloads and the parser+lexer overhead, these are extremely interesting. I think a good next step after this initial PR will be to write deeper benchmarks that enable us to find all sorts of issues like this, and consider how we want to do torch::load() and torch::save() when the payloads are at extremes (versus uniformly the same as I believe we do now).
> @@ -0,0 +1,1099 @@
> // Boost CRC library crc.hpp header file -----------------------------------//
Maybe this should go in the relevant third_party dir? (i.e. fbcode/caffe2/third_party)
Also, if this is mostly for boosting the sw crc speed, the complexity of this dependency might not be warranted (given that we probably only want to run the sw version on a few of the last/first unaligned bytes on the buffer). I suspect that if adding these is green-lit, we might need to do things like including the LICENSE file (see line 4 below), etc.
Yeah, will move to third_party!
I think there may be cases when it runs only the SW version? I do not know yet. @suo thoughts?
Either way, I plan to have both HW and SW support here.
facebook-github-bot
left a comment
@voznesenskym is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@voznesenskym merged this pull request in fce01a9.
…70%) faster (pytorch#38379)

Summary:
Before:

```
2020-05-11 18:31:41 INFO Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
    "Big Tensors Save": {
        "mean": 17.8048762,
        "median": 17.458917
    },
    "Big Tensors Load": {
        "mean": 3.2556887,
        "median": 2.9668495000000004
    },
    "Small Tensors Save": {
        "mean": 4.0381357,
        "median": 3.9440125
    },
    "Small Tensors Load": {
        "mean": 5.8792499,
        "median": 5.603067
    },
    "benchmark_run_at": "2020-05-12T01:31:41"
}
```

After:

```
Use zipfile serialization: True
2020-05-12 20:15:32 INFO Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
    "Big Tensors Save": {
        "mean": 4.7534657,
        "median": 4.646732
    },
    "Big Tensors Load": {
        "mean": 3.6001919,
        "median": 3.493285
    },
    "Small Tensors Save": {
        "mean": 4.1066924,
        "median": 4.1219255
    },
    "Small Tensors Load": {
        "mean": 6.3902358,
        "median": 6.36977
    },
    "benchmark_run_at": "2020-05-13T03:15:32"
}
```

Pull Request resolved: pytorch#38379
Differential Revision: D21779494
Pulled By: voznesenskym
fbshipit-source-id: 694d65029a5b817424d454bd331e285df828c67a