Support >2G model export | torchlib(feat)#1003
Conversation
```python
    onnx_model, check_type=True, strict_mode=False, data_prop=True
)
onnx.checker.check_model(onnx_model, full_check=True)
if not cache_model_to_disk:
```
In the follow-up PR we will remove the checks altogether, per a discussion with Aaron: we should not check it here.
```python
    Returns:
        The estimated size of the tensor in bytes.
    """
    return tensor.numel() * tensor.element_size()
```
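As a quick sanity check of the formula above, here is a hedged numpy-based sketch of the same estimate (numpy's `size`/`itemsize` mirror torch's `numel()`/`element_size()`; the `estimate_array_size` name is mine, not the PR's):

```python
import numpy as np


def estimate_array_size(array: np.ndarray) -> int:
    # Same arithmetic as tensor.numel() * tensor.element_size(),
    # expressed for a numpy array.
    return array.size * array.itemsize


# 1024 * 256 float32 elements at 4 bytes each -> 1 MiB
a = np.zeros((1024, 256), dtype=np.float32)
```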
Codecov Report
```diff
@@            Coverage Diff             @@
##             main    #1003      +/-   ##
==========================================
- Coverage   77.23%   77.20%   -0.03%
==========================================
  Files         112      112
  Lines       14009    14022      +13
  Branches     1447     1450       +3
==========================================
+ Hits        10820    10826       +6
- Misses       2828     2833       +5
- Partials      361      363       +2
```
Test Results: 18 files ±0, 18 suites ±0, 1h 7m 49s ⏱️ +2m 49s. Results for commit db58031; comparison against base commit b7d2939.
```python
    **export_kwargs,
)
onnx_model = onnx.load_from_string(proto)
onnx.load_external_data_for_model(onnx_model, temp_dir)
```
So the trick is that the 2 GB limitation only applies when serializing, not to the in-memory ModelProto.

Makes me think `_export_onnx` should return a ModelProto instead of the serialized string, but that is not supported by pybind. We could probably create a C++ pybind API that returns the ModelProto and the initializers as separate serialized strings, then deserialize and combine them on the Python side. Only the Python API is exposed and used here, which gives us the ModelProto directly.
The benefit is that we skip the size check and avoid writing to disk. What do you think?
Do we return a list of initializers?
Taking this further, we don't even need to pass initializers into `_export_onnx`. We can serialize them ourselves outside of it with onnx. That way we don't need to change the PyTorch C++ implementation.
Tested with:

```python
def _add_initializers(model_proto: onnx.ModelProto, initializers: Mapping[str, torch.Tensor]):
    tensor_protos = []
    for name, tensor in initializers.items():
        tensor_numpy = tensor.detach().numpy()
        tensor_proto = onnx.helper.make_tensor(
            name=name,
            data_type=onnx.helper.np_dtype_to_tensor_dtype(tensor_numpy.dtype),
            dims=tensor_numpy.shape,
            vals=tensor_numpy,
        )
        tensor_protos.append(tensor_proto)
    model_proto.graph.initializer.extend(tensor_protos)
```

But `onnx.helper.make_tensor` is very slow.
I think returning a list of TensorProtos could work. But for now, keeping the compatibility seems nice to me.
*compatibility with torch 2.0
Because we will move away from TorchScript eventually.
Let's try

```python
make_tensor(
    ...,
    torch_tensor.data_ptr().to_bytes(
        torch_tensor.element_size() * torch_tensor.numel(), byteorder=sys.byteorder
    ),
    raw=True,
)
```

and if that doesn't work or is still slow, then we go back to the initial solution.
Should be

```python
import ctypes

torch_tensor = torch.tensor([2, 3])
num_bytes = torch_tensor.element_size() * torch_tensor.numel()
raw_data = bytes((ctypes.c_ubyte * num_bytes).from_address(torch_tensor.data_ptr()))
tensor_proto = make_tensor(..., vals=raw_data, raw=True)
```

I was misled by someone doing `data_ptr().to_bytes()`, but what that really does is convert the pointer integer itself to bytes... lol

I'm not sure it is worth it, and it looks hacky, but this should resemble what `_export_onnx` is doing on the C++ side. If that is still slow, then there is nothing more we can do.
```python
onnx.checker.check_model(onnx_model, full_check=True)
if not cache_model_to_disk:
    # Only check the model if it is in memory.
    # Otherwise the checker and shape_inference will fail because
```
For shape inference, can we still load the shapes and element types from the model file (not the initializer files) and then run infer_shapes?
We could, but we also don’t need to because PyTorch supplies all the shape info.
A drawback: due to onnx/onnx#5487, we don't have much inner-node shape info left now that modules are functions.
```python
    _estimate_tensor_size(tensor) for tensor in self.initializers.values()
)

# Treat models > 1GB as large models so that we have ample room
```
Hmm, maybe increase it to 1.8 GB? I have never seen a model > 100 MB without initializers.
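The thresholding under discussion can be sketched as follows. The names and the exact cutoff are hypothetical; only protobuf's 2 GB serialization cap is a fixed fact.

```python
# Hypothetical names; the real check lives in the PR's code.
_LARGE_MODEL_THRESHOLD = 1 << 30  # 1 GiB, leaving headroom under protobuf's 2 GiB cap


def is_large_model(initializer_sizes) -> bool:
    # Treat the model as "large" when the initializers alone approach the
    # serialization limit; the graph proto itself is assumed comparatively small.
    return sum(initializer_sizes) > _LARGE_MODEL_THRESHOLD


small = [4 * 1024 * 1024] * 10                  # ten 4 MiB tensors
big = [800 * 1024 * 1024, 400 * 1024 * 1024]    # ~1.2 GiB total
```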
```python
onnx_model = onnx.load_from_string(proto)
cache_model_to_disk = include_initializers and large_model

if cache_model_to_disk:
```
Whether or not to store initializers should be controlled by a user flag. Suppose I export a 1 GB model on a remote machine and want to visualize it locally: I really don't want to download its initializers over home internet. If this flag can be turned on, I will be able to download just the structure of the model and debug faster.
I think once the user gets the model proto, they can do whatever they want (e.g. remove all the data)? A user has full control once they get the dynamo export output as an object.
Furthermore, `include_initializers` is already an argument.
Agree with @wschin's goal and @justinchuby's explanation. Something to consider for `ExportOutput.save` or `ExportOutputSerializer`.
Would be nice to mention the perf impact / comparison too.

Done
Support >2G model export by caching the model to disk when necessary.

Tested locally with `test_save_initializer_to_files_for_large_model`.

Fixes #493

cc @wschin