enable parallel prefill #4068
Changes include:
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
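A minimal sketch of what single-step (parallel) prefill means for the runner, assuming a model whose forward takes `(tokens, input_pos)` like the llama2 example. The function names here are illustrative, not the actual ExecuTorch runner API:

```python
import torch

def prefill_one_step(model, prompt_tokens: torch.Tensor) -> torch.Tensor:
    # Single-step prefill: the whole prompt goes through in one forward call,
    # so the sequence dimension is dynamic and the positional tables
    # (freqs_cis) must be indexed by the starting position.
    input_pos = torch.tensor([0], dtype=torch.long)
    logits = model(prompt_tokens.unsqueeze(0), input_pos)  # [1, seq_len, vocab]
    return logits[:, -1, :].argmax(dim=-1)  # first generated token

def prefill_token_by_token(model, prompt_tokens: torch.Tensor) -> torch.Tensor:
    # Previous behavior, for comparison: one forward call per prompt token,
    # advancing input_pos each step.
    logits = None
    for pos, tok in enumerate(prompt_tokens):
        input_pos = torch.tensor([pos], dtype=torch.long)
        logits = model(tok.view(1, 1), input_pos)
    return logits[:, -1, :].argmax(dim=-1)
```

Both paths produce the same first generated token; the single-step version replaces `seq_len` sequential forward calls with one wide call, which is what makes the prompt-processing phase parallel.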
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4068
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 8110c11 with merge base 38046ba. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D58874164
This pull request has been merged in 59e706a.
Hi @kimishpatel, thank you for the excellent work. I tried the patch today using the following command: `python -m examples.models.llama2.export_llama --checkpoint ./stories110M/stories110M.pt -p ./stories110M/params.json -o ptes -n stories110M_parallel_prefill -kv -X`. However, I encountered an error when attempting to generate the PTE with dynamic shapes enabled. I have attached the traceback below. Could you please help me identify what went wrong?

Best regards

Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/executorch/examples/models/llama2/export_llama.py", line 31, in <module>
main() # pragma: no cover
File "/workspace/executorch/examples/models/llama2/export_llama.py", line 27, in main
export_llama(modelname, args)
File "/workspace/executorch/examples/models/llama2/export_llama_lib.py", line 316, in export_llama
builder = _export_llama(modelname, args)
File "/workspace/executorch/examples/models/llama2/export_llama_lib.py", line 480, in _export_llama
backend = builder_exported_to_edge.to_backend(partitioners)
File "/workspace/executorch/examples/models/llama2/builder.py", line 360, in to_backend
self.edge_manager = self.edge_manager.to_backend(partitioner)
File "/workspace/executorch/exir/program/_program.py", line 1166, in to_backend
new_edge_programs[name] = to_backend(program, partitioner)
File "/usr/lib/python3.10/functools.py", line 889, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
File "/workspace/executorch/exir/backend/backend_api.py", line 384, in _
tagged_graph_module = _partition_and_lower(
File "/workspace/executorch/exir/backend/backend_api.py", line 316, in _partition_and_lower
res = ExportPass()(tagged_graph_module)
File "/workspace/executorch/third-party/pytorch/torch/fx/passes/infra/pass_base.py", line 40, in __call__
res = self.call(graph_module)
File "/workspace/executorch/exir/pass_base.py", line 571, in call
result = self.call_submodule(graph_module, tuple(inputs))
File "/workspace/executorch/exir/pass_base.py", line 657, in call_submodule
res = super().call_submodule(graph_module, inputs)
File "/workspace/executorch/exir/pass_base.py", line 534, in call_submodule
interpreter.run(*inputs_data)
File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 145, in run
self.env[node] = self.run_node(node)
File "/workspace/executorch/exir/pass_base.py", line 374, in run_node
return super().run_node(n)
File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 202, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "/workspace/executorch/exir/pass_base.py", line 606, in call_function
return self.callback.call_operator(
File "/workspace/executorch/exir/pass_base.py", line 465, in call_operator
return self._fx("call_function", op, args, kwargs, meta)
File "/workspace/executorch/exir/pass_base.py", line 396, in _fx
res_data = getattr(self.interpreter, kind)(target, args_data, kwargs_data)
File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 274, in call_function
return target(*args, **kwargs)
File "/workspace/executorch/exir/dialects/edge/_ops.py", line 333, in __call__
return self._op(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_ops.py", line 594, in __call__
return self_._op(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 896, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 1241, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 974, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 1458, in _dispatch_impl
r = func(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_ops.py", line 594, in __call__
return self_._op(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_decomp/decompositions.py", line 756, in slice_forward
elif end_val > sizes[dim]:
File "/workspace/executorch/third-party/pytorch/torch/__init__.py", line 374, in __bool__
return self.node.bool_()
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/sym_node.py", line 432, in bool_
return self.guard_bool("", 0)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/sym_node.py", line 374, in guard_bool
r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/recording.py", line 255, in wrapper
return event.run(self)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/recording.py", line 156, in run
return self.f(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4138, in evaluate_expr
raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression u375 + 4 > 128 (unhinted: s0 + u375 > 128). (Size-like symbols: none)
Potential framework code culprit (scroll up for full backtrace):
File "/workspace/executorch/third-party/pytorch/torch/_decomp/decompositions.py", line 756, in slice_forward
elif end_val > sizes[dim]:
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u375"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
While executing %aten_slice_copy_tensor : [num_users=12] = call_function[target=executorch.exir.dialects.edge._ops.aten.slice_copy.Tensor](args = (%arg111_1, 0, %_local_scalar_dense, %add_2), kwargs = {})
Original traceback:
File "/workspace/executorch/examples/models/llama2/llama_transformer.py", line 522, in forward
    freqs_cos = self.freqs_cos.narrow(0, input_pos_item, seqlen)
I will try to repro this.
Thank you for your response. I updated to the latest commit, and the issue has been resolved. |
Awesome |
This PR is a prototype to showcase the minimal changes required to lower Gemma-2b to ExecuTorch with a static KV cache and run it directly in [llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama2) without a single line of code change in the ExecuTorch runtime. By standardizing the contract between HuggingFace modeling and the ExecuTorch runtime, any LLM in HuggingFace could use llama runner as a universal runtime for a given backend.

Instructions to run the demo:

To run the demo, you need to clone huggingface/transformers and patch [PR#31706](huggingface/transformers#31706) on top, which contains the minimal changes required on the modeling side. Patch this PR to your ExecuTorch repo; from there you can:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m examples.models.export_hf_model -hfm "google/gemma-2b" --export # The model is exported with static dims and a static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m examples.models.llama2.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o <your_out_dir>/tokenizer.bin
```
3. Build and run the lowered model with llama runner by following [step 4](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate) of that guide.

NOTE: This prototype is to demonstrate the feasibility of exporting and running a native HF model in ExecuTorch by reusing llama runner. It does NOT come with performance yet. It is an ongoing effort along this path to enable 1) delegations, e.g. XNNPACK, 2) custom SDPA, and 3) parallel prefill, recently enabled in #4068.