Roberta FP16 gives wrong inference results #2466
Description
A Roberta (bert4keras) model produces wrong inference results with FP16 (FP32 is correct).
The TensorFlow SavedModel was converted to ONNX with tf2onnx, and the resulting ONNX model contains two constants that fall outside the FP16 range:
- Infinity (float32), the Min op's input[1] in the layernorm structure
- -999999995904 (float32), the Mul op's input[0] in the self-attention structure
I found that the layernorm structure is what causes the wrong FP16 inference results.
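Why these two constants are a problem can be seen directly from the float16 number range (a minimal NumPy check, independent of TensorRT):

```python
import numpy as np

# float16 cannot represent magnitudes beyond 65504, so both constants
# found in the exported graph become infinite when cast to FP16,
# corrupting the layernorm / attention-mask arithmetic.
print(np.finfo(np.float16).max)                 # 65504.0
print(np.float16(np.float32(-999999995904.0)))  # -inf (overflow)
print(np.float16(np.float32(np.inf)))           # inf
```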
Environment
TensorRT Version: 8.4.2.4
NVIDIA GPU: A30
NVIDIA Driver Version: 510.68.02
CUDA Version: 11.7
CUDNN Version:
Operating System: Ubuntu 20.04.2 LTS
Python Version (if applicable): 3.8
Tensorflow Version (if applicable): 1.15
Container version: nvcr.io/nvidia/tensorrt:22.08-py3
Relevant Files
Steps To Reproduce
Method 1: Use the Clip op to replace Min and Max
- Modify the ONNX model: replace every Min-->Max pair in the layernorm structures with a single Clip
- When building the FP16 engine from ONNX, pin the Clip and Mul layers to FP32:
network.get_layer(i).precision = trt.DataType.FLOAT
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
- In this case, FP16 inference is correct. However, compared with FP32:
  - With two encoders in the model, FP16 improves throughput.
  - With six encoders, FP16 is a net loss: throughput decreases and latency increases.
Method 2: Set all ops in layernorm to FP32 precision (the red box in the image above)
- When building the FP16 engine from ONNX, set the layernorm ops (from GlobalAveragePool to Add) and the Mul to FP32 precision
- In this case, FP16 inference is still wrong; the result is no different from running the model in FP16 directly
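Both methods rely on pinning selected layers to FP32 during the engine build. The selection logic can be sketched as a name-based filter; the layer names and keyword patterns below are hypothetical and must be adapted to the names TensorRT assigns in your network:

```python
def fp32_layer_indices(layer_names, keywords=("Clip", "layernorm", "Mul")):
    """Return indices of layers to pin to FP32.

    The keyword patterns are hypothetical examples; inspect the
    actual layer names in your parsed network to choose them.
    """
    return [i for i, name in enumerate(layer_names)
            if any(k.lower() in name.lower() for k in keywords)]

# In the real build script (TensorRT API, not executed here):
# for i in fp32_layer_indices([network.get_layer(j).name
#                              for j in range(network.num_layers)]):
#     network.get_layer(i).precision = trt.DataType.FLOAT
# config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

names = ["embedding", "encoder0/layernorm/Clip",
         "encoder0/attn/Mul", "encoder0/ffn/Add"]
print(fp32_layer_indices(names))  # [1, 2]
```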
How can FP16 be used to improve inference performance while preserving the model's accuracy?
