
[JIT] Make new zip serialization for torch save/load significantly (~70%) faster #38379

Closed
voznesenskym wants to merge 55 commits into master from voznesenskym/serial

Conversation

@voznesenskym
Collaborator

Before:

```
2020-05-11 18:31:41 INFO     Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
  "Big Tensors Save": {
    "mean": 17.8048762,
    "median": 17.458917
  },
  "Big Tensors Load": {
    "mean": 3.2556887,
    "median": 2.9668495000000004
  },
  "Small Tensors Save": {
    "mean": 4.0381357,
    "median": 3.9440125
  },
  "Small Tensors Load": {
    "mean": 5.8792499,
    "median": 5.603067
  },
  "benchmark_run_at": "2020-05-12T01:31:41"
}
```

After:

```
Use zipfile serialization: True
2020-05-12 20:15:32 INFO     Benchmarking 'basic', best of 10 runs (with 1 warmup runs)
{
  "Big Tensors Save": {
    "mean": 4.7534657,
    "median": 4.646732
  },
  "Big Tensors Load": {
    "mean": 3.6001919,
    "median": 3.493285
  },
  "Small Tensors Save": {
    "mean": 4.1066924,
    "median": 4.1219255
  },
  "Small Tensors Load": {
    "mean": 6.3902358,
    "median": 6.36977
  },
  "benchmark_run_at": "2020-05-13T03:15:32"
}
```
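For context, a minimal sketch of what such a timing loop might look like against the C++ torch::save/torch::load API; the actual 'basic' benchmark harness is not included in this PR, so the tensor shapes, counts, and file names below are assumptions for illustration. (On the Python side, the equivalent switch is torch.save(obj, f, _use_new_zipfile_serialization=True), which matches the "Use zipfile serialization: True" line in the log.)

```cpp
// Hypothetical timing sketch for the "Big Tensors Save/Load" numbers above.
// The real 'basic' benchmark harness is not part of this PR, so shapes,
// counts, and paths here are illustrative assumptions only.
#include <torch/torch.h>
#include <chrono>
#include <iostream>
#include <vector>

static double seconds_since(std::chrono::steady_clock::time_point t0) {
  return std::chrono::duration<double>(
             std::chrono::steady_clock::now() - t0).count();
}

int main() {
  // "Big tensors": a few large buffers (the case this PR speeds up ~70%).
  std::vector<torch::Tensor> big;
  for (int i = 0; i < 10; ++i) {
    big.push_back(torch::randn({1000, 1000, 10}));  // ~40MB each
  }

  auto t0 = std::chrono::steady_clock::now();
  torch::save(big, "big_tensors.pt");  // exercises the zip serialization path
  std::cout << "Big Tensors Save: " << seconds_since(t0) << "s\n";

  std::vector<torch::Tensor> loaded;
  t0 = std::chrono::steady_clock::now();
  torch::load(loaded, "big_tensors.pt");
  std::cout << "Big Tensors Load: " << seconds_since(t0) << "s\n";
  return 0;
}
```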

@voznesenskym voznesenskym requested a review from apaszke as a code owner May 13, 2020 03:15
@facebook-github-bot facebook-github-bot added the oncall: jit label May 13, 2020
@voznesenskym voznesenskym changed the title from "Make JIT serialization as fast as non JIT serialization" to "[WIP] [RFC] [JIT] Make JIT serialization as fast as non JIT serialization" May 13, 2020
@voznesenskym
Collaborator Author

Not really ready for review

@dr-ci

dr-ci Bot commented May 13, 2020

💊 CI failures summary and remediations

As of commit 2338a31 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

ci.pytorch.org: 2 failed


This comment was automatically generated by Dr. CI (expand for details). Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 140 times.

Comment thread: caffe2/serialize/crc.cc (Outdated)
#include <c10/util/llvmMathExtras.h>


namespace detail {
Member

@houseroad houseroad May 13, 2020


I think this PR is moving in the right direction, and thanks for working on it. :-) Not sure if you are aware of the similar optimization done internally for the FB torch::save/load. According to https://fburl.com/46eitcmk, for the internal optimization, we used folly's crc function, which was 40x faster than the vanilla implementation. Could you compare your implementation with it? cc: @dzhulgakov @jjlilley

Collaborator Author


Oh this is folly's implementation. I just hacked it around a little to make it work folly-free ;)


@jjlilley jjlilley May 13, 2020


Interesting - good that somebody's looking at this!

Thus far, this seems to be the software version, though? IIRC, the vast majority of the potential gains come from the hw impl rather than from a better sw impl, e.g. when running folly/hash/test/ChecksumTest.cpp:

crc32_hardware_512KB_block 27.80us
crc32_software_512KB_block 1.48ms

My vague (from 6 months ago) recollection was that the existing mz_crc32() table-based software impl in minizip wasn't that bad vs folly's. It probably lagged (and your benchmarks seem to bear this out), but it kind of faded into the background compared with the 1480/27.8 ≈ 53x perf gap above between hardware and software.

fwiw, I played around (months ago) with hacking in folly's impl in D19635487, though I ended up spending too much time fighting the OSS CMakeLists issues and ran out of time.

So it might be worth focusing on hw (and btw, crc32c shouldn't be needed for zip, alas Intel decided to make its standalone crc sse function do crc32c instead of crc32), if the target is zip here.

But in any case, completely agree that in non-fb OSS, the CRC performance is quite slow, and there's a lot of benefit in OSS from fixing this.

After this, from the traces, my memory is that some other areas that might be worth considering are:

  • for large payloads (mostly those that don't fit in L2), consider that torch::save() requires a couple of extra passes over the memory compared with the wire serializer - one pass to crc everything (churning the cache rather than doing it incrementally), and another to copy into a flat buffer rather than refcounting an IOBuf.
  • for small payloads, when running torch::load(), the cost of parsing/lexing the accompanying mandatory mini .py program that torch::save() adds is far from trivial.
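
To make the hardware-vs-software gap above concrete, here is a minimal sketch of the SSE4.2 hardware path (compile with -msse4.2). As noted above, the Intel instruction computes CRC-32C (Castagnoli), not the CRC-32 the zip format requires, so this illustrates the hardware technique rather than being a drop-in mz_crc32() replacement:

```cpp
// Illustrative SSE4.2 CRC-32C sketch (compile with -msse4.2). As noted in
// the comment above, this computes CRC-32C, *not* the zip CRC-32, so it
// demonstrates the hardware path, not a drop-in mz_crc32() replacement.
#include <nmmintrin.h>
#include <cstdint>
#include <cstddef>
#include <cstring>

uint32_t crc32c_hw(const void* data, size_t len, uint32_t crc = 0) {
  const uint8_t* p = static_cast<const uint8_t*>(data);
  crc = ~crc;  // standard pre-inversion
  // Main loop: fold 8 bytes per CRC32 instruction.
  while (len >= 8) {
    uint64_t chunk;
    std::memcpy(&chunk, p, 8);
    crc = static_cast<uint32_t>(_mm_crc32_u64(crc, chunk));
    p += 8;
    len -= 8;
  }
  // Tail: any remaining unaligned bytes, one at a time.
  while (len--) {
    crc = _mm_crc32_u8(crc, *p++);
  }
  return ~crc;  // standard post-inversion
}
```

Because the running crc value is threaded through the call, the routine can be fed chunk-by-chunk, which is also what would let torch::save() checksum incrementally instead of taking a separate full pass over the payload (the cache-churn point in the first bullet above).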

Collaborator Author


Hey! Yes, a subsequent PR will add the hardware version to this! I should probably actually add it here so the switch from miniz to custom crc is cleaner. Good callout.

> My vague (from 6 months ago) recollection was that the existing mz_crc32() table-based software impl in minizip wasn't that bad vs folly's. It probably lagged (and your benchmarks seem to bear this out), but it kind of faded into the background compared with the 1480/27.8 ≈ 53x perf gap above between hardware and software.

The software impl of mz_crc32() is 1.5-5x slower than folly's (at least in this benchmark). Not massive, but not shippable either.

> So it might be worth focusing on hw (and btw, crc32c shouldn't be needed for zip, alas Intel decided to make its standalone crc sse function do crc32c instead of crc32), if the target is zip here.

Agreed across the board. This is the first step towards moving the JIT "new" serialization from disabled by default to enabled by default. Once we have that, we can have full py/C++ interop.

Your two points above about cache churn on large payloads and the parser/lexer overhead are extremely interesting. I think a good next step after this initial PR will be to write deeper benchmarks that let us find all sorts of issues like this, and to consider how we want to do torch::load() and torch::save() when the payloads are at the extremes (versus handling them uniformly, as I believe we do now).
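
For comparison with the hardware sketch above, the table-driven approach that mz_crc32() takes looks roughly like the following, using the reflected zip/zlib polynomial 0xEDB88320 (a generic sketch, not minizip's actual code):

```cpp
// Classic table-driven software CRC-32 (the zip/zlib polynomial 0xEDB88320),
// roughly the approach mz_crc32() takes; shown for comparison with the
// hardware sketch above.
#include <cstdint>
#include <cstddef>
#include <array>

static std::array<uint32_t, 256> make_crc32_table() {
  std::array<uint32_t, 256> table{};
  for (uint32_t i = 0; i < 256; ++i) {
    uint32_t c = i;
    for (int k = 0; k < 8; ++k) {
      c = (c & 1) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
    }
    table[i] = c;
  }
  return table;
}

uint32_t crc32_sw(const void* data, size_t len, uint32_t crc = 0) {
  static const auto table = make_crc32_table();
  const uint8_t* p = static_cast<const uint8_t*>(data);
  crc = ~crc;
  // One table lookup per input byte, with a serial dependency through crc --
  // this is why the software path lags the 8-bytes-per-instruction hw loop.
  while (len--) {
    crc = table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
  }
  return ~crc;
}
```

Processing one byte per table lookup, with a serial dependency through crc, is what caps this path well below the hardware loop above.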

Comment thread: caffe2/serialize/boost_crc.h (Outdated)
@@ -0,0 +1,1099 @@
// Boost CRC library crc.hpp header file -----------------------------------//


Maybe this should go in the relevant third_party dir? (i.e. fbcode/caffe2/third_party)

Also, if this is mostly for boosting the sw crc speed, the complexity of this dependency might not be warranted (given that we probably only want to run the sw version on a few of the last/first unaligned bytes on the buffer). I suspect that if adding these is green-lit, we might need to do things like including the LICENSE file (see line 4 below), etc.

Collaborator Author


Yeah, will move to third_party!

I think there may be cases where it runs only the SW version? I do not know yet. @suo thoughts?

Collaborator Author


Either way, I plan to have both HW and SW support here.

@voznesenskym voznesenskym changed the title from "[WIP] [RFC] [JIT] Make JIT serialization as fast as non JIT serialization" to "[WIP] [RFC] [JIT] Make new zip serialization for torch save/load significantly (~70%) faster" May 21, 2020
@voznesenskym voznesenskym changed the title from "[WIP] [RFC] [JIT] Make new zip serialization for torch save/load significantly (~70%) faster" to "[JIT] Make new zip serialization for torch save/load significantly (~70%) faster" May 28, 2020
Contributor

@facebook-github-bot facebook-github-bot left a comment


@voznesenskym is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@facebook-github-bot
Contributor

@voznesenskym merged this pull request in fce01a9.

@facebook-github-bot facebook-github-bot deleted the voznesenskym/serial branch July 13, 2020 17:58
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…70%) faster (pytorch#38379)

Summary: before/after benchmark results as in the PR description above.
Pull Request resolved: pytorch#38379

Differential Revision: D21779494

Pulled By: voznesenskym

fbshipit-source-id: 694d65029a5b817424d454bd331e285df828c67a

Labels

Merged · oncall: jit
