[feature request] quantized::linear_dynamic on CUDA/eager, and other quantized and low-level int8 operators (matmul, gemm etc) on CUDA + integrate LLM.int8 + integrate ZeroQuant? #69364

Description

@vadimkantorov

It would be useful to be able to run a quantized transformer model, exported to TorchScript, on CUDA, even if some quantized operators are implemented by dequantizing to float32, computing in float32, and requantizing (a sort of temporary "polyfill"). E.g. quantized::linear_dynamic (also requested at https://discuss.pytorch.org/t/does-dynamic-quantization-support-gpu/119231).
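For reference, the arithmetic such a polyfill would reproduce is straightforward. Below is a minimal NumPy sketch of what dynamic quantized linear computes (per-tensor symmetric int8 quantization of activations chosen at runtime, pre-quantized int8 weights, int32 accumulation, float32 output); the function names are mine, not PyTorch API:

```python
import numpy as np

def dynamic_quantized_linear(x, w_int8, w_scale, bias):
    """Sketch of a dynamic-quantized linear layer: quantize activations
    at runtime, do an int32 matmul, dequantize the result to float32."""
    # Dynamically pick a symmetric per-tensor scale for the activations.
    x_scale = np.abs(x).max() / 127.0
    x_int8 = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
    # int8 x int8 with int32 accumulation, then dequantize.
    acc = x_int8.astype(np.int32) @ w_int8.astype(np.int32).T
    return acc.astype(np.float32) * (x_scale * w_scale) + bias

# Compare against the float32 reference path.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.standard_normal((3, 8)).astype(np.float32)
b = rng.standard_normal(3).astype(np.float32)

# Weights are quantized ahead of time (per-tensor, symmetric).
w_scale = np.abs(w).max() / 127.0
w_int8 = np.clip(np.round(w / w_scale), -127, 127).astype(np.int8)

out_q = dynamic_quantized_linear(x, w_int8, w_scale, b)
out_fp = x @ w.T + b
print(np.max(np.abs(out_q - out_fp)))  # small quantization error
```

A CUDA "polyfill" would be exactly this, minus the int8 matmul: dequantize the weight, run the float32 `linear`, and return float32.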

In the TorchScript context, is it possible to manually patch torch.ops.quantized.linear_dynamic to support CUDA, perhaps via manual op registration / dispatch? Failing that, is there a way to transform a TorchScript model to do without quantization entirely (via some FX transforms)?
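On the op-registration question, `torch.library` does allow attaching a Python kernel to an existing op under the CUDA dispatch key. A hedged sketch of a dequantize-and-fall-back-to-float32 implementation follows; whether this plays well with the packed-params custom class inside already-scripted models is exactly what is being asked here:

```python
import torch

# Open the existing "quantized" namespace for adding implementations.
lib = torch.library.Library("quantized", "IMPL")

def linear_dynamic_cuda(x, packed_params, reduce_range=False):
    # linear_unpack recovers the quantized weight and the bias from the
    # packed-params custom class; dequantize and run the fp32 kernel.
    w_q, b = torch.ops.quantized.linear_unpack(packed_params)
    return torch.nn.functional.linear(x, w_q.dequantize(), b)

# Route quantized::linear_dynamic to the fallback on CUDA tensors.
lib.impl("linear_dynamic", linear_dynamic_cuda, "CUDA")
registered = True
```

Registration itself does not require a CUDA device, so a model could carry this patch and only pay the dequantize cost when actually run on GPU.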

This may be relevant to HuggingFace as well, especially for inference with a pretrained, frozen, quantized BERT as part of a larger model.

Thanks!

cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel @ptrblck @ngimel

Metadata

Labels

module: cuda (Related to torch.cuda, and CUDA support in general)
oncall: quantization (Quantization support in PyTorch)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
