
enable parallel prefill#4068

Closed
kimishpatel wants to merge 9 commits into gh/kimishpatel/59/base from gh/kimishpatel/59/head

Conversation

@kimishpatel
Contributor

@kimishpatel kimishpatel commented Jun 25, 2024

Stack from ghstack (oldest at bottom):

Changes include:

  • Model export changes to allow indexing into freqs_cis
  • XNNPACK delegate change to partition with dynamic shapes
  • Runner change to do prefill in a single step

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
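For context, here is a minimal, framework-free sketch of the runner-side change. The names (`toy_step`, `prefill_sequential`, `prefill_parallel`) are illustrative only, not the actual ExecuTorch runner code, which is C++ and operates on tensors: sequential prefill calls the model once per prompt token, while parallel prefill feeds the whole prompt in one call with the same effect on the KV cache.

```python
def toy_step(tokens, start_pos, cache):
    # Stand-in for one transformer forward pass: record each token's
    # KV entry at its absolute position, return a fake "last logits".
    for i, tok in enumerate(tokens):
        cache[start_pos + i] = tok
    return tokens[-1]

def prefill_sequential(step, prompt, cache):
    # One forward call per prompt token (seq_len == 1 each time).
    out = None
    for pos, tok in enumerate(prompt):
        out = step([tok], pos, cache)
    return out

def prefill_parallel(step, prompt, cache):
    # A single forward call over the whole prompt
    # (seq_len == len(prompt)), which is what this PR enables.
    return step(prompt, 0, cache)

c1, c2 = {}, {}
prompt = [11, 22, 33, 44]
a = prefill_sequential(toy_step, prompt, c1)
b = prefill_parallel(toy_step, prompt, c2)
# Both leave the cache in the same state and yield the same last output.
```

The parallel form does the same work in one step, which is why the delegate must be partitioned with a dynamic sequence-length dimension.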
@pytorch-bot

pytorch-bot bot commented Jun 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/4068

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8110c11 with merge base 38046ba:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 25, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 25, 2024
Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)

ghstack-source-id: 231571360
Pull Request resolved: #4068
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 25, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231635343
@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 26, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231659551
@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 26, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231743004
@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 26, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231767668
@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 26, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231773574
@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 26, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231782888
@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 26, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231785246
@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58874164

kimishpatel added a commit that referenced this pull request Jun 27, 2024
Pull Request resolved: #4068

Changes include
- Model export changes to allow indexing into freqs_cis
- XNNPACK delegate change to partition with dynamic shapes
- Runner change to do prefill in a single step

ghstack-source-id: 231807015

// unrelated failures
@bypass-github-export-checks

@exported-using-ghexport

Differential Revision: [D58874164](https://our.internmc.facebook.com/intern/diff/D58874164/)
@facebook-github-bot
Contributor

This pull request has been merged in 59e706a.

@DzAvril

DzAvril commented Jul 5, 2024

Hi @kimishpatel,

Thank you for the excellent work. I tried the patch today using the following command:

python -m examples.models.llama2.export_llama --checkpoint ./stories110M/stories110M.pt -p ./stories110M/params.json -o ptes -n stories110M_parallel_prefill -kv -X

However, I encountered an error when attempting to generate the PTE with dynamic shape enabled. I have attached the traceback below. Could you please help me identify what went wrong?

Best regards

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/executorch/examples/models/llama2/export_llama.py", line 31, in <module>
    main()  # pragma: no cover
  File "/workspace/executorch/examples/models/llama2/export_llama.py", line 27, in main
    export_llama(modelname, args)
  File "/workspace/executorch/examples/models/llama2/export_llama_lib.py", line 316, in export_llama
    builder = _export_llama(modelname, args)
  File "/workspace/executorch/examples/models/llama2/export_llama_lib.py", line 480, in _export_llama
    backend = builder_exported_to_edge.to_backend(partitioners)
  File "/workspace/executorch/examples/models/llama2/builder.py", line 360, in to_backend
    self.edge_manager = self.edge_manager.to_backend(partitioner)
  File "/workspace/executorch/exir/program/_program.py", line 1166, in to_backend
    new_edge_programs[name] = to_backend(program, partitioner)
  File "/usr/lib/python3.10/functools.py", line 889, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/workspace/executorch/exir/backend/backend_api.py", line 384, in _
    tagged_graph_module = _partition_and_lower(
  File "/workspace/executorch/exir/backend/backend_api.py", line 316, in _partition_and_lower
    res = ExportPass()(tagged_graph_module)
  File "/workspace/executorch/third-party/pytorch/torch/fx/passes/infra/pass_base.py", line 40, in __call__
    res = self.call(graph_module)
  File "/workspace/executorch/exir/pass_base.py", line 571, in call
    result = self.call_submodule(graph_module, tuple(inputs))
  File "/workspace/executorch/exir/pass_base.py", line 657, in call_submodule
    res = super().call_submodule(graph_module, inputs)
  File "/workspace/executorch/exir/pass_base.py", line 534, in call_submodule
    interpreter.run(*inputs_data)
  File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 145, in run
    self.env[node] = self.run_node(node)
  File "/workspace/executorch/exir/pass_base.py", line 374, in run_node
    return super().run_node(n)
  File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 202, in run_node
    return getattr(self, n.op)(n.target, args, kwargs)
  File "/workspace/executorch/exir/pass_base.py", line 606, in call_function
    return self.callback.call_operator(
  File "/workspace/executorch/exir/pass_base.py", line 465, in call_operator
    return self._fx("call_function", op, args, kwargs, meta)
  File "/workspace/executorch/exir/pass_base.py", line 396, in _fx
    res_data = getattr(self.interpreter, kind)(target, args_data, kwargs_data)
  File "/workspace/executorch/third-party/pytorch/torch/fx/interpreter.py", line 274, in call_function
    return target(*args, **kwargs)
  File "/workspace/executorch/exir/dialects/edge/_ops.py", line 333, in __call__
    return self._op(*args, **kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/_ops.py", line 594, in __call__
    return self_._op(*args, **kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 896, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 1241, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 974, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/_subclasses/fake_tensor.py", line 1458, in _dispatch_impl
    r = func(*args, **kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/_ops.py", line 594, in __call__
    return self_._op(*args, **kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/_decomp/decompositions.py", line 756, in slice_forward
    elif end_val > sizes[dim]:
  File "/workspace/executorch/third-party/pytorch/torch/__init__.py", line 374, in __bool__
    return self.node.bool_()
  File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/sym_node.py", line 432, in bool_
    return self.guard_bool("", 0)
  File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/sym_node.py", line 374, in guard_bool
    r = self.shape_env.evaluate_expr(self.expr, self.hint, fx_node=self.fx_node)
  File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/recording.py", line 255, in wrapper
    return event.run(self)
  File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/recording.py", line 156, in run
    return self.f(*args, **kwargs)
  File "/workspace/executorch/third-party/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4138, in evaluate_expr
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression u375 + 4 > 128 (unhinted: s0 + u375 > 128).  (Size-like symbols: none)

Potential framework code culprit (scroll up for full backtrace):
  File "/workspace/executorch/third-party/pytorch/torch/_decomp/decompositions.py", line 756, in slice_forward
    elif end_val > sizes[dim]:

For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u375"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing


While executing %aten_slice_copy_tensor : [num_users=12] = call_function[target=executorch.exir.dialects.edge._ops.aten.slice_copy.Tensor](args = (%arg111_1, 0, %_local_scalar_dense, %add_2), kwargs = {})
Original traceback:
  File "/workspace/executorch/examples/models/llama2/llama_transformer.py", line 522, in forward
    freqs_cos = self.freqs_cos.narrow(0, input_pos_item, seqlen)
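The failing line slices the precomputed rotary-embedding table by a data-dependent start position. As a framework-free sketch of that pattern (plain lists standing in for tensors; `precompute_freqs` and `narrow_rows` are illustrative names, not the actual `llama_transformer.py` code):

```python
import math

def precompute_freqs(head_dim, max_seq_len, theta=10000.0):
    # One row per absolute position; each row holds the cos/sin values
    # for the rotary frequencies, analogous to the freqs_cos / freqs_sin
    # buffers in the model.
    inv_freq = [theta ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]
    cos = [[math.cos(pos * f) for f in inv_freq] for pos in range(max_seq_len)]
    sin = [[math.sin(pos * f) for f in inv_freq] for pos in range(max_seq_len)]
    return cos, sin

def narrow_rows(table, start, length):
    # List analogue of Tensor.narrow(0, start, length): take rows
    # [start, start + length) along dim 0.
    return table[start:start + length]

cos, sin = precompute_freqs(head_dim=8, max_seq_len=128)
window = narrow_rows(cos, start=4, length=16)  # rows for positions 4..19
```

In the traced model, `start` comes from a runtime tensor (`input_pos_item`), so the exporter cannot prove `start + seqlen <= max_seq_len` and raises the data-dependent guard seen above.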

@kimishpatel
Contributor Author

I will try to repro this

@DzAvril

DzAvril commented Jul 8, 2024

> I will try to repro this

Thank you for your response. I updated to the latest commit, and the issue has been resolved.

@kimishpatel
Contributor Author

> I will try to repro this
>
> Thank you for your response. I updated to the latest commit, and the issue has been resolved.

Awesome

guangy10 added a commit that referenced this pull request Jul 12, 2024
…ecuTorch"

This PR is a prototype to showcase the minimal changes required to lower Gemma-2b to ExecuTorch with a static KV cache and run it directly in [llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama2) without a single line of code change in the ExecuTorch runtime.

By standardizing the contract between HuggingFace modeling and the ExecuTorch runtime, any LLM on HuggingFace could use llama runner as a universal runtime for a given backend.

Instructions to run the demo:

To run the demo, clone huggingface/transformers and patch [PR#31706](huggingface/transformers#31706) on top, which contains the minimal changes required on the modeling side. Apply this PR to your ExecuTorch repo; from there you can:

1. Run export_hf_model.py to lower gemma-2b to ExecuTorch:
```
python -m examples.models.export_hf_model -hfm "google/gemma-2b" --export  # The model is exported with static dims and a static KV cache
```
2. Run tokenizer.py to generate the binary format for the ExecuTorch runtime:
```
python -m examples.models.llama2.tokenizer.tokenizer -t <path_to_downloaded_gemma_checkpoint_dir>/tokenizer.model -o <your_out_dir>/tokenizer.bin
```
3. Build and run the lowered model with llama runner by following [step 4 of this guide](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#step-4-run-on-your-computer-to-validate)

NOTE: This prototype demonstrates the feasibility of exporting and running a native HF model in ExecuTorch by reusing llama runner. It does NOT come with performance optimizations yet. Ongoing work along this path will enable 1) delegation, e.g. XNNPACK, 2) custom SDPA, and 3) the parallel prefill recently enabled in #4068.




[ghstack-poisoned]
guangy10 added a commit that referenced this pull request Jul 12, 2024

Labels

CLA Signed · fb-exported · Merged

Projects

None yet

4 participants