Skip to content

Conversation

@fdwr
Copy link
Contributor

@fdwr fdwr commented Nov 9, 2019

Description: ORT silently substitutes v10 operators when higher version operators (say v13 or v42..) are declared as opset dependencies by the model. This is problematic because it will appear to the user that the model was successfully executed when in reality the code has no idea if the results are correct because it can't predict the future - the semantics between v10 and v13 could change in the specification. It's better to fail than to return the incorrect results. Applications can always register custom schema and newer opset versions to fill in missing gaps, in cases where ORT is behind.

Note this branch of ORT is an older intermediate snapshot which only supported opset 10. Some 11 ops were implemented (BitShift, CumSum, Reduce*, DynamicQuantizeLinear, Pad, Resize, Slice), but none of the slice operators.

Motivation and Context

  • Avoids silently returning the incorrect data.
  • Note this targeted fix specifically applies only to the Vibranium branch (which only fully supported up to opset 10). Master should probably have a different (more complete fix) after more discussion.

#2371

@fdwr fdwr requested a review from a team as a code owner November 9, 2019 03:02
constexpr const char* kBrainSliceExecutionProvider = "BrainSliceExecutionProvider";
constexpr const char* kTensorrtExecutionProvider = "TensorrtExecutionProvider";

constexpr int kMaximumAiOnnxVersionSupported = 10;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it is 10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is branch rel-1905, which is a much older snapshot of ORT where 11 support was only partial (IIRC only a few ops out of them all), which was before the ONNX 1.6 specification was even published and finalized. So the little bit of 11 support contained in this branch is an uncertain state relative to the spec. We want to say cleanly that we support the published version of ONNX 1.4 (opset 10) in Vibranium.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now you need to find a way to make the tests pass :-(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I found a way to make them pass (at least onnxruntime_test_all.exe runs locally without errors) by applying the same rules to the other GetLatestOpsetVersions overload, which is what the Model class indirectly used to determine its default opset when no specific opset was passed to the constructor (it was using 11 by default). Also updated to testdata .onnx files which were set to opset 11 unnecessarily (moving them back down to 10).

Note that in master, I since saw that Scott McKay added explicit versions in a number of places so that we continue to still get testing on older op versions, and so that may not apply quite the same there.

Copy link
Contributor Author

@fdwr fdwr Nov 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alas, I see that onnxruntime_test_all.exe doesn't run all the test_models, and there are still a number of opset 11 models in the CI tests complaining about failing to an 11 kernel, which is many to update O_o. I don't know if I'll make it in time before the bug bar rises to pacify the tests. I'll at least undo the latest change, but I'm considering abandoning this PR and keeping the logic exclusively in WinML for now.

2019-11-14 14:24:55.967545903 [E:onnxruntime:Default, runner.cc:42 RunTestCase] Test raw failed:Load model from test_models/test_abs_uint32/raw/model.onnx failed:This is an invalid model. Error in Node: : No Op registered for Abs with domain_version of 11

https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=96833

@snnn
Copy link
Contributor

snnn commented Nov 9, 2019

Are you saying:
for example, the latest onnxruntime release supports opset 11, there is a new model with opset 12, because onnxruntime 1.0 doesn't have the opset 12 schema, so it assumes the op didn't get changed betwenn 11 and 12 and onnxruntime will execute the model? If that is true, please add a test case to prove your change actually fixed the error.

@fdwr
Copy link
Contributor Author

fdwr commented Nov 10, 2019

Are you saying:
for example, the latest onnxruntime release supports opset 11, there is a new model with opset 12, because onnxruntime 1.0 doesn't have the opset 12 schema, so it assumes the op didn't get changed betwenn 11 and 12 and onnxruntime will execute the model?

Correct, I loaded and executed an Add-13 model without any warning/error.

ir_version: 3
producer_name: "OnnxConformanceTest"
graph {
  node {
    input: "A"
    input: "B"
    output: "C"
    op_type: "Add"
    domain: ""
  }
  name: "Add"
  input {
    name: "A"
    type {
      tensor_type {
        elem_type: FLOAT
        shape {
          dim {
            dim_value: 5
          }
        }
      }
    }
  }
  input {
    name: "B"
    type {
      tensor_type {
        elem_type: FLOAT
        shape {
          dim {
            dim_value: 5
          }
        }
      }
    }
  }
  output {
    name: "C"
    type {
      tensor_type {
        elem_type: FLOAT
        shape {
          dim {
            dim_value: 5
          }
        }
      }
    }
  }
}
opset_import {
  version: 13
}

@fdwr
Copy link
Contributor Author

fdwr commented Nov 10, 2019

Hmm, CI tests: "No Op registered for Identity with domain_version of 11"
There is no Identity-11, and so the test framework must be generically stamping opset 11 into all the models it creates, but I'm unsure where this is occuring... This branch of ORT doesn't really support opset 11 anyway (just partially).

constexpr const char* kBrainSliceExecutionProvider = "BrainSliceExecutionProvider";
constexpr const char* kTensorrtExecutionProvider = "TensorrtExecutionProvider";

constexpr int kMaximumAiOnnxVersionSupported = 10;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we be using
DomainToVersionRange from ONNX
see: https://github.com/onnx/onnx/blob/master/onnx/defs/schema.h#L687
rather than hardcoding it in ORT?

Copy link
Contributor Author

@fdwr fdwr Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out. I like the idea of using it. The issue here is that ONNX and ORT have different versions (note this fix targets the rel-1905 Vibranium branch, and so this might not be applicable to the master branch) as we took an older snapshot of ORT that only supported opset 10, but its ONNX submodule has 11 since ONNX 1.6 had already started (but wasn't finalized and published yet). The ORT opset 11 support in this branch was preliminary and partial (before the ONNX 1.6 spec was published, meaning ops could still change behavior). So we don't really want 11.

A pending question is what this should ultimately look like in master. I opened a GitHub issue: #2371

@pranavsharma
Copy link
Contributor

Changes look ok. Can you add a test case?

@fdwr
Copy link
Contributor Author

fdwr commented Nov 14, 2019

Changes look ok. Can you add a test case?

@pranavsharma : Test case added.

Ack, there are many other model tests that fail. I think I'll have to give up on this change to make the window and revert to just having this check only in WinML, even though it's not as complete there.

registries.push_front(registry);
}

static int GetActualLastSupportedOpSet(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this relate to the "DomainToVersionMap" in graph and model ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I fear my latest iteration last night (in an attempt to pacify the test code which calls this function to indirectly figure out what opset it should use for dynamically generated models by default) will break app registered schemas :/, and we do want to support that.

@fdwr
Copy link
Contributor Author

fdwr commented Nov 15, 2019

We'll make a modified change targeting master (incorporating George Wu's feedback) with the same policy (and keep this specific change local for Vibranium only).

@fdwr fdwr closed this Nov 15, 2019
@fdwr fdwr deleted the user/dwayner/ClampVersion2 branch May 13, 2020 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants