
DNN ONNX backend - Slower model when weights are close to 0, trained with Adam Optimizer (Pytorch) #20985

@harristmv

Description

System information (version)
  • OpenCV => 4.5.4 (also present in 4.5.3), Python
  • Operating System / Platform => Ubuntu 18.04 64 bit / x86_64 arch
  • Compiler => ❔
Detailed description

Hello, I am having an issue with a model trained in PyTorch, which I export to ONNX to make it compatible with the OpenCV DNN module. Specifically, it is an SSD detector exported via the method implemented here

What exactly the issue is: I was training my model with the SGD + Momentum optimizer and exporting successfully to OpenCV. By switching to the Adam optimizer, I gained some accuracy improvements.

However, after exporting to ONNX, the Adam-trained model becomes ~4x slower on the OpenCV side (roughly 36 FPS for Momentum vs. 8 FPS for Adam).

Some remarks:

  • Both are tested on CPU without OpenCL acceleration.
  • Both are tested with the Python & C++ OpenCV APIs.
  • The slowdown persists in older OpenCV versions (4.5.3, and I think I hit the same issue in the past with 4.4.0).

About the versions & onnx export:

  • PyTorch 1.9.0
  • ONNX 1.10.2
  • Opset version 12
  • Extra: operator_export_type=torch.onnx.OperatorExportTypes.ONNX_ATEN_FALLBACK # This is necessary with newer versions of PyTorch & ONNX so that the DetectionOutput node is exported in a symbolic way, allowing the OpenCV DNN module to pick it up and create its own C++ layer version.

What I have observed so far:

  • The Adam-trained model is only slow if it actually detects an object. Feeding it a random blob just for inference benchmarking doesn't reproduce the slowdown.
  • After exporting, results don't diverge far from the PyTorch version (e.g. I get 73.6% mAP with OpenCV ONNX vs. 73.94% with PyTorch on a custom dataset), so inference itself is correct.
  • The most important thing I have observed so far: the Adam-trained model has weights really close to 0. While the datatypes & keys of the Momentum and Adam models are identical for all their weights, the values are completely different (probably due to the optimization process). I suspect the OpenCV DNN module doesn't handle these extremely small weights correctly.

An example comparison of the last weight matrix is below. Note this is an incomplete printout of the whole matrix, but the pattern persists everywhere:

  • Adam version:
         [[-2.3086e-41]],
         [[-2.4849e-41]],
         [[-2.5120e-41]]],
        [[[-2.3274e-41]],
         [[-2.1921e-41]],
         [[ 9.2205e-42]],
         ...,
         [[-2.5421e-41]],
         [[-2.5495e-41]],
         [[-1.3273e-41]]]], device='cuda:0') torch.Size([20, 64, 1, 1])
  • Momentum version:
         [[-0.0039]],
         [[ 0.0484]],
         [[ 0.0145]]],
        [[[ 0.0438]],
         [[ 0.0131]],
         [[-0.0101]],
         ...,
         [[ 0.0012]],
         [[-0.0526]],
         [[-0.0229]]]], device='cuda:0') torch.Size([20, 64, 1, 1])
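For context: magnitudes around 1e-41 are below float32's smallest normal value (~1.18e-38), i.e. they are subnormal (denormal) numbers, and x86 CPUs typically process subnormal arithmetic far more slowly than normal floats, which would match the observed slowdown. A minimal sketch (my own illustration, not part of the reproducer; the helper name is made up) that counts such weights and flushes them to zero before export:

```python
import numpy as np

def flush_subnormals(w: np.ndarray) -> np.ndarray:
    """Zero out float32 subnormal values (nonzero, but below the smallest normal)."""
    tiny = np.finfo(np.float32).tiny  # ~1.1754944e-38, smallest normal float32
    subnormal = (w != 0) & (np.abs(w) < tiny)
    print(f"{subnormal.sum()} of {w.size} weights are subnormal")
    out = w.copy()
    out[subnormal] = 0.0
    return out

# Magnitudes like those printed above: the first two are subnormal (Adam),
# the last two are normal (Momentum).
w = np.array([-2.3086e-41, 9.2205e-42, -0.0039, 0.0484], dtype=np.float32)
w_flushed = flush_subnormals(w)
```

Zeroing weights of magnitude 1e-41 should be numerically harmless for the detector while avoiding subnormal arithmetic at inference time.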
Issue submission checklist
  • I report the issue, it's not a question
  • I checked the problem with documentation, FAQ, open issues,
    forum.opencv.org, Stack Overflow, etc and have not found solution
  • I updated to latest OpenCV version and the issue is still there
  • There is reproducer code and related data files: videos, images, onnx, etc

If a reproducer model needs to be shared, let me know and I will reproduce the slowdown with dummy models.

Thank you in advance
