-
Notifications
You must be signed in to change notification settings - Fork 3.7k
Description
Describe the bug
If a model declares a dependency on opset 12, ORT 1.0 loads the model but silently substitutes v10/v11 operators, even though ORT can't be certain that behavior hasn't changed between versions. This is a dangerous bug because the model appears to have executed correctly, but a subtle spec change between op N and op N+1 yields the incorrect results (getting back a 5% of cancer vs 60% cancer is a big deal). It would be better to fail the model if earlier the model claims it needs opset 12 when opset 12 isn't actually supported.
ORT already rejects future models sometimes, such as when it sees new attributes that it's unaware of or when input/output counts are different than expected, but it doesn't handle silent behavior changes in the specification which have the same input/output count and attribute names (only a version increment), which aren't as uncommon as you'd think. e.g.:
onnx/onnx#2291 Updated docs for strides and dilations attributes
onnx/onnx#2281 Supporting negative axes for all existing onnx ops
onnx/onnx#2260 Support for negative indices in 'Gather', 'GatherElements', 'ScatterElements', 'OneHot'
All the preceding cases extended/clarified behavior in a way that, except for the version increment, would be otherwise undetectable. For example, Gather-11 is identical to Gather-1 in every input/output/attribute except that 11 accepts negative indices now. Gather-1 implementations may have clamped rogue index bounds before and now silently return the wrong result, or they may have issued an error on the CPU, but this is problematic for GPU execution providers where the execution is too late and too deeply buried to return a meaningful error from inside the shader.
There have been other near-miss cases too, such as with Resize-11 (onnx/onnx#2057) which would have otherwise been imperceptibly identical to Resize-10 if not for a change in input count (given all attributes are optional anyway), in which the default was changed from "asymmetric" to "half_pixel" (which is a good change mind you, but still a behavior change between versions). That's why operators are versioned so we can detect these breaking changes, but if ORT ignores the declared version dependency, then it will return the wrong results in such cases.
Urgency
Manganese.
System information
- OS Platform and Distribution: Windows 10
- ONNX Runtime installed from (source or binary): source
- ONNX Runtime version: 1.0
- Python version: NA
- Visual Studio version (if applicable): 2017
- GCC/Compiler version (if compiling from source): NA
- CUDA/cuDNN version: NA
- GPU model and memory: NA (but AMD and 32GB)
To Reproduce
Load the model below:
ir_version: 3
producer_name: "OnnxConformanceTest"
graph {
node {
input: "A"
input: "B"
output: "C"
op_type: "Add"
domain: ""
}
name: "Add"
input {
name: "A"
type {
tensor_type {
elem_type: FLOAT
shape {
dim {
dim_value: 5
}
}
}
}
}
input {
name: "B"
type {
tensor_type {
elem_type: FLOAT
shape {
dim {
dim_value: 5
}
}
}
}
}
output {
name: "C"
type {
tensor_type {
elem_type: FLOAT
shape {
dim {
dim_value: 5
}
}
}
}
}
}
opset_import {
version: 13
}
Expected behavior
If a model declares a dependency on a given opset, and that dependency isn't satisfied, ORT should reject the model with an informative error.
Fixing this won't gimp all future compatibility. It is still possible to load newer ONNX models with older versions of ONNX runtime if a future application registers the newer schema (and also supplies kernel implementations for any missing operators).