
Incorrect result for converted FP16 model with Conv Op when run on arm64 Linux with onnxruntime >= 1.15.0 #18992

@jasonkit

Description


Describe the issue

An ONNX model exported from PyTorch with nn.Conv2d and then converted to FP16 does not give correct results during inference.

This issue is not observed with the original exported FP32 ONNX model.
This issue is also not observed on onnxruntime 1.13 or 1.14; I first observed it on onnxruntime >= 1.15.0.
This issue is only observed on arm64 Linux (specifically, in a Docker container running on M1 macOS).
It works fine on macOS with an M1 CPU, or on Linux with an Intel CPU.
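For anyone trying to reproduce: the failure only shows up when Python reports an arm64 Linux environment, which can be checked with the standard library:

```python
import platform

# The incorrect results are only observed when this prints
# ("Linux", "aarch64"); x86_64 Linux and native arm64 macOS are fine.
print(platform.system(), platform.machine())
```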

To reproduce

On arm64 Linux (or using the python:3.10-bullseye docker image),
run the following code with onnxruntime >= 1.15.0:

import torch
from torch import nn

import onnx
from onnxconverter_common import float16
import onnxruntime as ort
import numpy as np


class ModelUnderTest(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = nn.Conv2d(1, 1, 1)
        nn.init.constant_(self.model.weight.data, 0.5)
        if self.model.bias is not None:
            # It works fine for this test case if bias is initialised to 0
            nn.init.constant_(self.model.bias.data, 0.5)

    def forward(self, x):
        return self.model(x)


if __name__ == "__main__":
    m = ModelUnderTest()
    x = torch.ones(1, 1, 1)
    torch.onnx.export(m, x, "m1.onnx", export_params=True)

    model = onnx.load("m1.onnx")
    m_16 = float16.convert_float_to_float16(
        model,
        keep_io_types=True,
        # It works fine if we block Conv Op
        # op_block_list=float16.DEFAULT_OP_BLOCK_LIST + ["Conv"],
    )
    onnx.save(m_16, "m1_fp16.onnx")

    # ---

    session_option = ort.SessionOptions()
    session_option.log_severity_level = 3
    session_option.enable_cpu_mem_arena = False
    session_option.enable_mem_pattern = False
    session_option.enable_mem_reuse = False

    x = np.ones((1, 1, 1))
    session_fp32 = ort.InferenceSession("m1.onnx", session_option)
    y1 = session_fp32.run(None, {"input": x.astype(np.float32)})[0]
    print("fp32 output")
    print(y1)
    session_fp16 = ort.InferenceSession("m1_fp16.onnx", session_option)
    y2 = session_fp16.run(None, {"input": x.astype(np.float32)})[0]
    print("fp16 output")
    print(y2)

    y_diff = y1 - y2
    y_diff_2 = y_diff * y_diff
    print("SSD")
    print(np.sum(y_diff_2))

It prints

fp32 output
[[[1.]]]
fp16 output
[[[0.5]]]
SSD
0.25

However, the expected output should be

fp32 output
[[[1.]]]
fp16 output
[[[1.]]]
SSD
0.0
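Note that the expected result is exactly representable in half precision: the convolution computes 0.5 * 1.0 + 0.5 = 1.0 per element, and the observed output of 0.5 is exactly what dropping the bias term would produce (consistent with the test case passing when the bias is initialised to 0). A minimal NumPy check (no onnxruntime needed) confirms that FP16 arithmetic itself is exact here, so this is not a rounding artifact:

```python
import numpy as np

# Conv2d(1, 1, 1) with weight=0.5 and bias=0.5 on an all-ones input
# computes 0.5 * 1.0 + 0.5 per element. All of these values are exactly
# representable in float16, so a correct fp16 kernel must return 1.0.
w = np.float16(0.5)
b = np.float16(0.5)
x = np.float16(1.0)
y = w * x + b
print(y)  # 1.0 -- the failing arm64 build returns 0.5 instead
```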

It gives the correct output after downgrading onnxruntime to 1.14.1.

Urgency

This seems to be a regression in onnxruntime, as it works before 1.15.0.
I can work around the issue by adding Conv to op_block_list when converting the model to FP16.
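For reference, the workaround is the commented-out line in the repro script above, i.e. blocking Conv from FP16 conversion so it stays in FP32 (this sketch assumes m1.onnx exists as produced by the repro; the output filename is arbitrary):

```python
import onnx
from onnxconverter_common import float16

model = onnx.load("m1.onnx")
m_16 = float16.convert_float_to_float16(
    model,
    keep_io_types=True,
    # Keeping Conv in FP32 avoids the wrong results on arm64 Linux
    op_block_list=float16.DEFAULT_OP_BLOCK_LIST + ["Conv"],
)
onnx.save(m_16, "m1_fp16_conv_blocked.onnx")
```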

Platform

Linux

OS Version

Debian Bullseye

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

>= 1.15.0

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

Default CPU

Execution Provider Library Version

No response


Labels

ep:ArmNN (issues related to Arm NN execution provider)
