enable parallel prefill #4068
Changes include:
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
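A minimal sketch of what single-step (parallel) prefill means for the runner, assuming a model whose forward takes `(tokens, input_pos)` like the llama2 example. The function names here are illustrative, not the actual ExecuTorch runner API:

```python
import torch

def prefill_one_step(model, prompt_tokens: torch.Tensor) -> torch.Tensor:
    # Single-step prefill: the whole prompt goes through in one forward call,
    # so the sequence dimension is dynamic and the positional tables
    # (freqs_cis) must be indexed by the starting position.
    input_pos = torch.tensor([0], dtype=torch.long)
    logits = model(prompt_tokens.unsqueeze(0), input_pos)  # [1, seq_len, vocab]
    return logits[:, -1, :].argmax(dim=-1)  # first generated token

def prefill_token_by_token(model, prompt_tokens: torch.Tensor) -> torch.Tensor:
    # Previous behavior, for comparison: one forward call per prompt token,
    # advancing input_pos each step.
    logits = None
    for pos, tok in enumerate(prompt_tokens):
        input_pos = torch.tensor([pos], dtype=torch.long)
        logits = model(tok.view(1, 1), input_pos)
    return logits[:, -1, :].argmax(dim=-1)
```

Both paths produce the same first generated token; the single-step version replaces `seq_len` sequential forward calls with one wide call, which is what makes the prompt-processing phase parallel.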
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4068
Note: Links to docs will display an error until the docs builds have been completed.
✅ No failures as of commit 8110c11 with merge base 38046ba. This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D58874164
This pull request has been merged in 59e706a.
Hi @kimishpatel, thank you for the excellent work. I tried the patch today using the following command: `python -m examples.models.llama2.export_llama --checkpoint ./stories110M/stories110M.pt -p ./stories110M/params.json -o ptes -n stories110M_parallel_prefill -kv -X`. However, I encountered an error when attempting to generate the PTE with dynamic shapes enabled. I have attached the traceback below. Could you please help me identify what went wrong?

Best regards

Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/workspace/executorch/examples/models/llama2/export_llama.py", line 31, in <module>
main() # pragma: no cover
File "/workspace/executorch/examples/models/llama2/export_llama.py", line 27, in main
export_llama(modelname, args)
File "/workspace/executorch/examples/models/llama2/export_llama_lib.py", line 316, in export_llama
builder = _export_llama(modelname, args)
File "/workspace/executorch/examples/models/llama2/export_llama_lib.py", line 480, in _export_llama
backend = builder_exported_to_edge.to_backend(partitioners)
File "/workspace/executorch/examples/models/llama2/builder.py", line 360, in to_backend
self.edge_manager = self.edge_manager.to_backend(partitioner)
File "/workspace/executorch/exir/program/_program.py", line 1166, in to_backend
new_edge_programs[name] = to_backend(program, partitioner)
File "/usr/lib/python3.10/functools.py", line 889, in wrapper
return dispatch(args[0].__class__)(*args, **kw)
File "/workspace/executorch/exir/backend/backend_api.py", line 384, in _
tagged_graph_module = _partition_and_lower(
File "/workspace/executorch/exir/backend/backend_api.py", line 316, in _partition_and_lower
res = ExportPass()(tagged_graph_module)
File "/workspace/executorch/third-party/pytorch/torch/fx/passes/infra/pass_base.py", line 40, in __call__
res = self.call(graph_module)
File "/workspace/executorch/exir/pass_base.py", line 571, in call
result = self.call_submodule(graph_module, tuple(inputs))
File "/workspace/executorch/exir/pass_base.py", line 657, in call_submodule
res = super().call_submodule(graph_module, inputs)
File "/workspace/executorch/exir/pass_base.py", line 534, in call_submodule
interpreter.run(*inputs_data)
File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 145, in run
self.env[node] = self.run_node(node)
File "/workspace/executorch/exir/pass_base.py", line 374, in run_node
return super().run_node(n)
File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 202, in run_node
return getattr(self, n.op)(n.target, args, kwargs)
File "/workspace/executorch/exir/pass_base.py", line 606, in call_function
return self.callback.call_operator(
File "/workspace/executorch/exir/pass_base.py", line 465, in call_operator
return self._fx("call_function", op, args, kwargs, meta)
File "/workspace/executorch/exir/pass_base.py", line 396, in _fx
res_data = getattr(self.interpreter, kind)(target, args_data, kwargs_data)
File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 274, in call_function
return target(*args, **kwargs)
File "/workspace/executorch/exir/dialects/edge/_ops.py", line 333, in __call__
return self._op(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_ops.py", line 594, in __call__
return self_._op(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 896, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 1241, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 974, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 1458, in _dispatch_impl
r = func(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_ops.py", line 594, in __call__
return self_._op(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/_decomp/decompositions.py", line 756, in slice_forward
elif end_val > sizes[dim]:
File "/workspace/executorch/third-party/pytorch/torch/__init__.py", line 374, in __bool__
return self.node.bool_()
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/sym_node.py", line 432, in bool_
return self.guard_bool("", 0)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/sym_node.py", line 374, in guard_bool
r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/recording.py", line 255, in wrapper
return event.run(self)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/recording.py", line 156, in run
return self.f(*args, **kwargs)
File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4138, in evaluate_expr
raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression u375 + 4 > 128 (unhinted: s0 + u375 > 128). (Size-like symbols: none)
Potential framework code culprit (scroll up for full backtrace):
File "/workspace/executorch/third-party/pytorch/torch/_decomp/decompositions.py", line 756, in slice_forward
elif end_val > sizes[dim]:
For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u375"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing
While executing %aten_slice_copy_tensor : [num_users=12] = call_function[target=executorch.exir.dialects.edge._ops.aten.slice_copy.Tensor](args = (%arg111_1, 0, %_local_scalar_dense, %add_2), kwargs = {})
Original traceback:
File "/workspace/executorch/examples/models/llama2/llama_transformer.py", line 522, in forward
    freqs_cos = self.freqs_cos.narrow(0, input_pos_item, seqlen)
I will try to repro this.
Thank you for your response. I updated to the latest commit, and the issue has been resolved. |
Awesome |
This PR is a prototype to showcase the minimal changes required to lower Gemma-2b to ExecuTorch with a static KV cache and run it directly in [llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama2) without a single line of code change in the ExecuTorch runtime. By standardizing the contract between HuggingFace modeling and the ExecuTorch runtime, any LLM in HuggingFace could use llama runner as a universal runtime for a given backend.

Instructions to run the demo:

To run the demo, you need to clone huggingface/transformers and patch [PR#31706](huggingface/transformers#31706) on top, which contains the minimal changes required on the modeling side. Patch this PR to your ExecuTorch repo; from there you can:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m examples.models.export_hf_model -hfm "google/gemma-2b" --export # The model is exported with static dims and a static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m examples.models.llama2.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o <your_out_dir>/tokenizer.bin
```
3. Build and run the lowered model with llama runner by following [step 4](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate) of that guide.

NOTE: This prototype is to demonstrate the feasibility of exporting and running a native HF model in ExecuTorch by reusing llama runner. It does NOT come with performance yet. It is an ongoing effort along this path to enable 1) delegations, e.g. XNNPACK, 2) custom SDPA, and 3) parallel prefill, recently enabled in #4068.