add accept num, emit num metric for ChainSpeculativeSampling by LiuXiaoxuanPKU · Pull Request #450 · flashinfer-ai/flashinfer

LiuXiaoxuanPKU · 2024-08-16T06:31:36Z

No description provided.

yzh119

Thanks for your contribution @LiuXiaoxuanPKU ! I added a few comments and suggestions.

yzh119 · 2024-08-16T06:54:37Z

This file has to be updated:

Lines 70 to 72 in 338b2f5

    torch::Tensor chain_speculative_sampling(torch::Tensor draft_probs, torch::Tensor draft_token_ids,
                                             torch::Tensor uniform_samples, torch::Tensor target_probs,
                                             bool deterministic);

if you change the function signature.
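
For readers following the thread, a rough reference sketch of the two metrics named in the PR title may help. The thread does not define them, so the definitions here are assumptions: `accepted_num` counts every draft token that passes the acceptance test on its own (even after an earlier rejection), and `emitted_num` counts only the accepted prefix, excluding the final resampled or bonus token. The names `chain_speculative_sampling_ref` and `sample_from` are hypothetical, the code handles a single request with plain Python lists, and drawing the final token with the last uniform sample is a choice made for this sketch. It is not the CUDA kernel from this PR.

```python
def sample_from(dist, u):
    """Inverse-CDF draw: return the first index whose cumulative mass exceeds u."""
    cumulative = 0.0
    for index, prob in enumerate(dist):
        cumulative += prob
        if u < cumulative:
            return index
    return len(dist) - 1  # guard against floating-point round-off


def chain_speculative_sampling_ref(draft_probs, draft_token_ids,
                                   uniform_samples, target_probs):
    """Single-request reference (hypothetical helper, not the FlashInfer kernel).

    draft_probs:     [num_spec][vocab] draft-model probabilities
    draft_token_ids: [num_spec]        tokens proposed by the draft model
    uniform_samples: [num_spec + 1]    pre-drawn uniforms in [0, 1)
    target_probs:    [num_spec + 1][vocab] target-model probabilities
    Returns (output_token_ids, accepted_num, emitted_num); unused slots hold -1.
    """
    num_spec = len(draft_token_ids)
    output = [-1] * (num_spec + 1)
    accepted_num = 0   # assumption: counted independently per position
    emitted_num = 0    # assumption: length of the accepted prefix only
    first_rejection = None

    for i, token in enumerate(draft_token_ids):
        q = draft_probs[i][token]
        p = target_probs[i][token]
        # Accept with probability min(1, p / q), written without a division.
        if uniform_samples[i] * q < p:
            accepted_num += 1
            if first_rejection is None:
                output[i] = token
                emitted_num += 1
        elif first_rejection is None:
            first_rejection = i

    if first_rejection is None:
        # Every draft token was accepted: draw a bonus token from the target.
        pos = num_spec
        dist = target_probs[pos]
    else:
        # Resample at the first rejected position from normalized max(p - q, 0).
        pos = first_rejection
        residual = [max(p - q, 0.0)
                    for p, q in zip(target_probs[pos], draft_probs[pos])]
        total = sum(residual)
        dist = [r / total for r in residual] if total > 0 else target_probs[pos]

    # Sketch-level choice: the last uniform sample drives the final draw.
    output[pos] = sample_from(dist, uniform_samples[num_spec])
    return output, accepted_num, emitted_num


# First draft token is rejected, the second would have passed on its own.
result = chain_speculative_sampling_ref(
    draft_probs=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    draft_token_ids=[0, 1],
    uniform_samples=[0.5, 0.5, 0.5],
    target_probs=[[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
)
print(result)  # ([1, -1, -1], 1, 0)
```

Under these assumed definitions the two counters diverge as soon as a rejection is followed by a token that would have passed, which is why a caller might want both as separate outputs.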

LiuXiaoxuanPKU · 2024-08-17T05:10:38Z

@yzh119 Thanks for the review. I just fixed the comments, feel free to take another pass. Thanks!

yzh119

LGTM, thank you!

@LiuXiaoxuanPKU

🤖 I have created a release *beep* *boop*

---

## [0.1.6](v0.1.5...v0.1.6) (2024-08-27)

### SM75 Support

Starting from [0.1.6](v0.1.5...v0.1.6), our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).

### API Changes

#### `plan`/`run`

Starting from [0.1.6](v0.1.5...v0.1.6), the `begin_forward`/`forward`/`end_forward` APIs are replaced with the new `plan`/`run` API.

- `forward` is renamed to `run`, which is more precise and consistent with the naming convention of cutlass's python API.
- `begin_forward` is renamed to `plan`, which is consistent with the naming convention of nvmath API.
- `end_forward` is deprecated and has no effect after this PR.

There is one slight difference between the old `forward` and the new `run` API:

- All extra arguments such as `causal` and `logits_soft_cap` are provided in the `plan` (previously `begin_forward`) API and cached until the next `plan` call; only the query and KV-Cache tensors need to be passed to the `run` API.

The old `begin_forward`/`forward`/`end_forward` APIs are still functional, but we will gradually deprecate them in future releases.

Check [#466](#466) for more details.

#### `MultiLevelCascadeAttentionWrapper`

Starting from [0.1.6](v0.1.5...v0.1.6), we introduce a new `MultiLevelCascadeAttentionWrapper` API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.

See the [documentation](https://docs.flashinfer.ai/api/python/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) and [tutorial](https://docs.flashinfer.ai/tutorials/kv_layout.html#multi-level-cascade-inference-data-layout) for API usage and a layout explanation.

The old `BatchDecodeWithSharedPrefixPagedKVCacheWrapper` and `BatchPrefillWithSharedPrefixPagedKVCacheWrapper` will be deprecated in future releases.

### Features

* sm75 support ([#448](#448), [#449](#449))
* add `MultiLevelCascadeAttentionWrapper` API ([#462](#462)) ([1e37989](1e37989))
* add accept num, emit num metric for ChainSpeculativeSampling ([#450](#450)) ([fa38b5e](fa38b5e))
* support bmm fp8 ([#469](#469)) ([f1c0b68](f1c0b68))

### Refactor

* refactor: replace `begin_forward`/`forward`/`end_forward` with `plan`/`run` [#466](#466)

### Misc

* misc: improve error handling of sampling kernels ([#456](#456)) ([0dce178](0dce178))

### Performance Improvements

* slight optimization on f16->f8 fragment layout swizzling ([#453](#453)) ([0d61871](0d61871))
* slight optimization on fragment layout swizzle ([#458](#458)) ([7c397cb](7c397cb))
* use persistent kernel for merging attention states ([#459](#459)) ([be6bf5b](be6bf5b))

### Acknowledgement

We thank [@LiuXiaoxuanPKU](https://github.com/LiuXiaoxuanPKU) for enhancing the speculative sampling operator, [@merrymercy](https://github.com/merrymercy) for the API change suggestion and [@zhyncs](https://github.com/zhyncs) for integrating the fp8 BMM cublas implementation.

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
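
The `plan`/`run` split described in the release notes can be illustrated with a toy wrapper. This is only a sketch of the calling pattern: `ToyAttentionWrapper` is a hypothetical class, its argument list is reduced to the two options the notes mention, and it does not reproduce FlashInfer's actual wrapper signatures or compute any attention.

```python
class ToyAttentionWrapper:
    """Hypothetical stand-in that mimics only the plan/run calling pattern."""

    def __init__(self):
        self._planned_args = None

    def plan(self, causal=False, logits_soft_cap=0.0):
        # Extra arguments are captured once and cached until the next plan().
        self._planned_args = {"causal": causal,
                              "logits_soft_cap": logits_soft_cap}

    def run(self, q, kv_cache):
        # run() only needs the query and KV-cache; options come from the plan.
        if self._planned_args is None:
            raise RuntimeError("call plan() before run()")
        return {"q": q, "kv_cache": kv_cache, **self._planned_args}

    # Legacy names kept as thin aliases, mirroring the deprecation path
    # described in the notes (assumed shape, for illustration only).
    def begin_forward(self, **kwargs):
        self.plan(**kwargs)

    def forward(self, q, kv_cache):
        return self.run(q, kv_cache)

    def end_forward(self):
        pass  # no effect, as the notes say of the deprecated call


wrapper = ToyAttentionWrapper()
wrapper.plan(causal=True, logits_soft_cap=30.0)
first = wrapper.run("q_step0", "kv")   # placeholders stand in for tensors
second = wrapper.run("q_step1", "kv")  # reuses the cached plan
```

Caching the options in `plan` keeps the per-step `run` call small, which matters when `run` is invoked once per layer or per decoding step.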

add accept num, emit num

299d45b

zhyncs requested review from yzh119 and zhyncs August 16, 2024 06:33

zhyncs self-assigned this Aug 16, 2024

yzh119 reviewed Aug 16, 2024


Comment thread include/flashinfer/sampling.cuh Outdated

Comment thread python/csrc/sampling.cu Outdated

Comment thread python/csrc/sampling.cu Outdated

Comment thread include/flashinfer/sampling.cuh Outdated

LiuXiaoxuanPKU added 2 commits August 16, 2024 22:05

fix comments

032d0dc

minor

fbd5e4e

yzh119 approved these changes Aug 17, 2024


yzh119 merged commit fa38b5e into flashinfer-ai:main Aug 17, 2024

github-actions Bot mentioned this pull request Aug 17, 2024

chore(main): release 0.1.6 #447

Merged

LiuXiaoxuanPKU mentioned this pull request Aug 18, 2024

[SpecDecode][Kernel] Use Flashinfer for Rejection Sampling in Speculative Decoding vllm-project/vllm#7244

Merged

github-actions Bot mentioned this pull request Dec 25, 2024

chore(main): release 0.3.0 #698

Closed


yzh119 merged 3 commits into flashinfer-ai:main from LiuXiaoxuanPKU:rej-sample-param


3 participants
