[feature request] quantized::linear_dynamic on CUDA/eager, and other quantized and low-level int8 operators (matmul, gemm etc) on CUDA + integrate LLM.int8 + integrate ZeroQuant? #69364

Description

@vadimkantorov

It would be useful to be able to run a quantized transformer model, exported to TorchScript, on CUDA, even if some quantized operators are implemented by dequantizing to float32, computing in float32, and requantizing (a sort of temporary "polyfill"). E.g. quantized::linear_dynamic (also requested at https://discuss.pytorch.org/t/does-dynamic-quantization-support-gpu/119231).
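For reference, the arithmetic such a polyfill would reproduce is straightforward. Below is a minimal NumPy sketch of what dynamic quantized linear computes (per-tensor symmetric int8 quantization of activations chosen at runtime, pre-quantized int8 weights, int32 accumulation, float32 output); the function names are mine, not PyTorch API:

```python
import numpy as np

def dynamic_quantized_linear(x, w_int8, w_scale, bias):
    """Sketch of a dynamic-quantized linear layer: quantize activations
    at runtime, do an int32 matmul, dequantize the result to float32."""
    # Dynamically pick a symmetric per-tensor scale for the activations.
    x_scale = np.abs(x).max() / 127.0
    x_int8 = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
    # int8 x int8 with int32 accumulation, then dequantize.
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    return acc.astype(np.float32) * (x_scale * w_scale) + bias

# Compare against the float32 reference path.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.standard_normal((3, 8)).astype(np.float32)
b = rng.standard_normal(3).astype(np.float32)

# Weights are quantized ahead of time (per-tensor, symmetric).
w_scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)

out_q = dynamic_quantized_linear(x, w_int8, w_scale, b)
out_fp = x @ w.T + b
print(np.max(np.abs(out_q - out_fp)))  # small quantization error
```

A CUDA "polyfill" would be exactly this, minus the int8 matmul: dequantize the weight, run the float32 `linear`, and return float32.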

In the TorchScript context, is it possible to manually patch torch.ops.quantized.linear_dynamic to support CUDA, perhaps via manual op registration / dispatch? Failing that, is there a way to transform a TorchScript model to do without quantization entirely (via some FX transforms)?
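On the op-registration question, `torch.library` does allow attaching a Python kernel to an existing op under the CUDA dispatch key. A hedged sketch of a dequantize-and-fall-back-to-float32 implementation follows; whether this plays well with the packed-params custom class inside already-scripted models is exactly what is being asked here:

```python
import torch

# Open the existing "quantized" namespace for adding implementations.
lib = torch.library.Library("quantized", "IMPL")

def linear_dynamic_cuda(x, packed_params, reduce_range=False):
    # linear_unpack recovers the quantized weight and the bias from the
    # packed-params custom class; dequantize and run the fp32 kernel.
    w_q, b = torch.ops.quantized.linear_unpack(packed_params)
    return torch.nn.functional.linear(x, w_q.dequantize(), b)

# Route quantized::linear_dynamic to the fallback on CUDA tensors.
lib.impl("linear_dynamic", linear_dynamic_cuda, "CUDA")
registered = True
```

Registration itself does not require a CUDA device, so a model could carry this patch and only pay the dequantize cost when actually run on GPU.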

This may be relevant to HuggingFace as well, especially for inference with a pretrained, frozen, quantized BERT as part of a larger model.

Thanks!

cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel @ptrblck @ngimel

Metadata

Labels

module: cuda (Related to torch.cuda, and CUDA support in general)
oncall: quantization (Quantization support in PyTorch)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
