Bug Description
Performance comparison of Torch-TRT against ONNX-TRT:
In fp16:
- Skipping constant folding of embedding layers does not affect engine size, latency, or precision
- Disabling the linear decomposition and adding a linear converter reduces latency by ~15%
- opt_level=3 and opt_level=5 yield almost the same latency
- ONNX-TRT takes much longer to compile
- Torch-TRT is ~9% slower than ONNX-TRT in inference latency
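For context, latency gaps like the ~9% above are usually reported as the median of many timed runs after warm-up. A minimal sketch of such a harness (illustrative only, not the benchmark actually used; `measure_latency` is a hypothetical helper, and the lambda is a stand-in for the compiled Torch-TRT or ONNX-TRT inference call):

```python
import statistics
import time

def measure_latency(fn, warmup=10, iters=100):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from timing
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)

# Stand-in CPU workload; a real run would time the engine's forward call.
latency_ms = measure_latency(lambda: sum(range(10_000)))
print(f"median latency: {latency_ms:.3f} ms")
```

Note that for GPU inference the device must be synchronized (e.g. `torch.cuda.synchronize()`) before reading the timer, since CUDA kernel launches are asynchronous.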