🤖 I have created a release beep boop
0.1.6 (2024-08-27)
SM75 Support
Starting from 0.1.6, our pre-built wheels include experimental support for sm75 (Turing architecture GPUs such as Tesla T4, Quadro RTX 6000 and RTX 2080).
API Changes
plan/run
Starting from 0.1.6, the begin_forward/forward/end_forward APIs are replaced with the new plan/run API.
forward is renamed to run, which is more precise and consistent with the naming convention of cutlass's python API.
begin_forward is renamed to plan, which is consistent with the naming convention of the nvmath API.
end_forward is deprecated and has no effect after this PR.
There is a slight difference between the old forward and the new run API:
causal and logits_soft_cap are now provided in the plan (previously begin_forward) API and cached until the next plan call, so only the query and KV-Cache tensors need to be passed to the run API.
The old begin_forward/forward/end_forward APIs are still functional, but we will gradually deprecate them in future releases.
Check #466 for more details.
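The call pattern described above can be sketched with a toy, pure-Python stand-in. ToyAttentionWrapper is a hypothetical class written only for illustration; it is not FlashInfer's wrapper, and the real plan/run calls take additional arguments (index pointers, head counts, page size and so on) that are not shown here. The sketch assumes a single head and as many query rows as KV rows, and only demonstrates that causal and logits_soft_cap are fixed in plan and reused by every run until the next plan.

```python
import math


class ToyAttentionWrapper:
    """Hypothetical stand-in that mimics the plan/run call pattern only."""

    def __init__(self):
        self._plan = None  # settings cached by the most recent plan() call

    def plan(self, causal=False, logits_soft_cap=0.0):
        # Settings are fixed here and reused by run() until plan() is called again.
        self._plan = {"causal": causal, "logits_soft_cap": logits_soft_cap}

    def run(self, q, k, v):
        # Only tensors are passed; masking and soft-cap come from the cached plan.
        if self._plan is None:
            raise RuntimeError("plan() must be called before run()")
        causal = self._plan["causal"]
        cap = self._plan["logits_soft_cap"]
        scale = 1.0 / math.sqrt(len(q[0]))
        offset = len(k) - len(q)  # query rows align with the end of the KV rows
        out = []
        for i, qi in enumerate(q):
            logits = []
            for j, kj in enumerate(k):
                if causal and j > i + offset:
                    continue  # future positions are masked out
                s = scale * sum(a * b for a, b in zip(qi, kj))
                if cap > 0:
                    s = cap * math.tanh(s / cap)  # soft-cap the logit
                logits.append((j, s))
            m = max(s for _, s in logits)
            weights = [(j, math.exp(s - m)) for j, s in logits]
            total = sum(w for _, w in weights)
            out.append(
                [sum(w * v[j][d] for j, w in weights) / total for d in range(len(v[0]))]
            )
        return out


wrapper = ToyAttentionWrapper()
wrapper.plan(causal=True, logits_soft_cap=30.0)  # configure once

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

out_causal = wrapper.run(q, k, v)  # uses the cached causal / soft-cap settings

wrapper.plan(causal=False)  # a new plan replaces the cached settings
out_full = wrapper.run(q, k, v)
```

With the causal plan, the first query row can only see the first KV row, so its output equals that row's value; after re-planning without the mask, the same run call mixes both rows.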
MultiLevelCascadeAttentionWrapper
Starting from 0.1.6, we introduce a new MultiLevelCascadeAttentionWrapper API for cascade inference, which supports multi-level cascade inference where all levels' KV-Cache can be managed in a unified Paged KV-Cache.
See documentation and tutorial on API usage and layout explanation.
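As a rough illustration of the idea, the sketch below keeps one shared page pool and lets each level of a request point at page indices in that pool, then combines the per-level attention results by their log-sum-exp. Everything here (the pages list, levels_for_request, attend, merge, cascade_attention) is hypothetical and written for this note; it is not the MultiLevelCascadeAttentionWrapper API or its actual layout, and the merge step is one standard way to combine softmax attention over disjoint KV sets, assumed here rather than taken from the release notes.

```python
import math


def attend(q, keys, values):
    """Single-query softmax attention; returns (output, log-sum-exp of logits)."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(s - m) for s in logits]
    total = sum(weights)
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]
    return out, m + math.log(total)


def merge(state_a, state_b):
    """Combine two partial attention states computed over disjoint KV sets."""
    (oa, lse_a), (ob, lse_b) = state_a, state_b
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    out = [(wa * x + wb * y) / (wa + wb) for x, y in zip(oa, ob)]
    return out, m + math.log(wa + wb)


# One shared page pool (page size 2 in this toy); every level indexes into it.
pages = [
    {"k": [[1.0, 0.0], [0.0, 1.0]], "v": [[1.0, 0.0], [0.0, 1.0]]},    # shared prefix
    {"k": [[1.0, 1.0], [0.5, 0.5]], "v": [[2.0, 2.0], [3.0, 1.0]]},    # request A suffix
    {"k": [[-1.0, 0.0], [0.0, -1.0]], "v": [[4.0, 0.0], [0.0, 4.0]]},  # request B suffix
]

# Per request: level 0 lists the shared pages, level 1 the request's own pages.
levels_for_request = {"A": [[0], [1]], "B": [[0], [2]]}


def gather(page_ids):
    """Collect the keys and values stored in the given pages, in order."""
    keys = [k for p in page_ids for k in pages[p]["k"]]
    values = [v for p in page_ids for v in pages[p]["v"]]
    return keys, values


def cascade_attention(q, level_page_ids):
    """Attend to each level separately, then merge the partial states."""
    state = None
    for page_ids in level_page_ids:
        partial = attend(q, *gather(page_ids))
        state = partial if state is None else merge(state, partial)
    return state[0]


q = [0.3, -0.2]
cascade_out = cascade_attention(q, levels_for_request["A"])
flat_out, _ = attend(q, *gather([0, 1]))  # same KV, attended in one pass
```

In this toy, cascade_out and flat_out agree up to floating-point error, which is why the shared-prefix pages can be attended to once per level rather than once per request.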
The old BatchDecodeWithSharedPrefixPagedKVCacheWrapper and BatchPrefillWithSharedPrefixPagedKVCacheWrapper will be deprecated in future releases.
Features
MultiLevelCascadeAttentionWrapper API (#462) (1e37989)
Refactor
Replace begin_forward/forward/end_forward with plan/run #466
Misc
Performance Improvements
Acknowledgement
We thank @LiuXiaoxuanPKU for enhancing the speculative sampling operator, @merrymercy for API change suggestions, and @zhyncs for integrating the fp8 BMM cublas implementation.
This PR was generated with Release Please. See documentation.