It would be useful to be able to run a quantized transformer model exported to TorchScript on CUDA, even if some quantized operators are implemented by dequantizing to float32, computing in float32, and requantizing (a sort of temporary "polyfill"). E.g. quantized::linear_dynamic (also requested at https://discuss.pytorch.org/t/does-dynamic-quantization-support-gpu/119231)
In a TorchScript context, is it possible to manually patch torch.ops.quantized.linear_dynamic to support CUDA, maybe via some manual op registration / dispatch? Otherwise, is there a way to convert a TorchScript model back to an unquantized one (via some FX transforms?)?
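To make the idea concrete, here is a minimal sketch of the kind of polyfill I have in mind, using torch.library to register a Python float32 fallback as the CUDA implementation of quantized::linear_dynamic. This is untested against a real CUDA build and makes two assumptions worth verifying: that torch.library accepts a Python impl for an existing quantized op with a torchbind packed-params argument, and that torch.ops.quantized.linear_unpack recovers the weight/bias from the prepacked object:

```python
import torch
import torch.nn.functional as F

def linear_dynamic_float_fallback(X, W_prepack, reduce_range=False):
    # Unpack the prepacked int8 weight (linear_unpack is the inverse of
    # linear_prepack), dequantize it, and run the layer entirely in
    # float32 on X's device -- i.e. dequant -> float compute "polyfill".
    w_q, bias = torch.ops.quantized.linear_unpack(W_prepack)
    weight = w_q.dequantize().to(X.device)
    if bias is not None:
        bias = bias.to(X.device)
    return F.linear(X, weight, bias)

# Register the fallback for the CUDA dispatch key, so a scripted model
# calling torch.ops.quantized.linear_dynamic on CUDA tensors dispatches
# here instead of raising a missing-kernel error.
_lib = torch.library.Library("quantized", "IMPL")  # extend existing namespace
_lib.impl("linear_dynamic", linear_dynamic_float_fallback, "CUDA")
```

On CPU the fallback can be sanity-checked against the real kernel (outputs differ slightly because the real op also quantizes the activations dynamically), but whether the CUDA registration path works end-to-end for a torchbind argument is exactly the question above.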
This may be relevant to huggingface as well, especially for inference with a pretrained, frozen, quantized BERT as part of a larger model.
Thanks!
cc @jerryzh168 @jianyuh @raghuramank100 @jamesr66a @vkuzo @jgong5 @Xia-Weiwen @leslie-fang-intel @ptrblck @ngimel