Fix training artifacts for 2GB+ models and MSELoss #22414
Because new training `Blocks` are created from a global base model and `onnx.save` destroys any external data, any loss block (e.g. `MSELoss`) that builds more than one sub-`Block` fails validation due to missing external data. Saving a deep copy of the global model circumvents this. Fixes microsoft#22411
byt3n33dl3
left a comment
blocks kinda (@microsoft-github-policy-service agree company="Microsoft")
@microsoft-github-policy-service agree company="RWS"
/azp run Big Models, Linux Android Emulator QNN CI Pipeline, Linux CPU CI Pipeline, Linux CPU Minimal Build E2E CI Pipeline, Linux GPU CI Pipeline, Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline, Linux QNN CI Pipeline, MacOS CI Pipeline, ONNX Runtime Web CI Pipeline, Windows ARM64 QNN CI Pipeline,
/azp run Windows CPU CI Pipeline, Windows GPU CUDA CI Pipeline, Windows GPU DML CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows GPU TensorRT CI Pipeline, Windows x64 QNN CI Pipeline, onnxruntime-binary-size-checks-ci-pipeline, orttraining-linux-ci-pipeline, orttraining-linux-gpu-ci-pipeline
Azure Pipelines successfully started running 6 pipeline(s).
Azure Pipelines successfully started running 5 pipeline(s).
Azure Pipelines successfully started running 9 pipeline(s).
+1
Tks @snnn and @baijumeswani
I think this will be included in the upcoming 1.20 release.
tks @baijumeswani
Description
`generate_artifacts` fails when creating training artifacts for a model using external data and `MSELoss`.
The use of a global base model when creating new training `Blocks` and `onnx.save` destroying any external data means any loss block (e.g. `MSELoss`) that builds more than one sub-`Block` will fail validation due to missing external data and raise an exception.
Fix
Saving using a deep copy of the global model circumvents this at the cost of holding 2x the model size in memory.
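The failure mode and the fix can be illustrated with a small toy model. This is only a sketch using the Python standard library: `ToyModel`, `destructive_save`, `is_valid`, and `build_block` are hypothetical stand-ins invented for illustration, not `onnx` or `onnxruntime` APIs, and the real code paths may differ in detail. It assumes, as described above, that saving strips the weights from the in-memory model and that the temporary external file is removed afterwards.

```python
import copy
import os
import tempfile


class ToyModel:
    """Hypothetical stand-in for a model proto (not an onnx type)."""

    def __init__(self, weights: bytes):
        self.weights = weights      # in-memory tensor bytes
        self.external_path = None   # set once the weights are moved to disk


def destructive_save(model: ToyModel, path: str) -> None:
    """Write the weights to `path` and strip them from `model` (mutates it)."""
    with open(path, "wb") as f:
        f.write(model.weights)
    model.weights = b""
    model.external_path = path


def is_valid(model: ToyModel) -> bool:
    """Usable if the weights are in memory or the external file still exists."""
    if model.weights:
        return True
    return model.external_path is not None and os.path.exists(model.external_path)


def build_block(base: ToyModel, workdir: str, name: str, use_copy: bool) -> bool:
    """Save for validation, delete the temp file, report whether `base` survives."""
    path = os.path.join(workdir, name)
    target = copy.deepcopy(base) if use_copy else base
    destructive_save(target, path)
    os.remove(path)  # assumption: temporary external data is cleaned up afterwards
    return is_valid(base)


with tempfile.TemporaryDirectory() as workdir:
    # Saving the shared model directly leaves it without weights or a backing file.
    shared = ToyModel(b"\x00" * 16)
    ok_without_copy = build_block(shared, workdir, "a.bin", use_copy=False)

    # Saving a deep copy leaves the shared model intact for the next sub-block.
    shared = ToyModel(b"\x00" * 16)
    ok_with_copy = build_block(shared, workdir, "b.bin", use_copy=True)
```

In this sketch the deep copy is what keeps the shared model usable for a second sub-block, and it is also where the 2x memory cost comes from: both the original and the copy hold the full weights until the copy is released.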
Other Implementations
An alternative approach using less memory would load the on-disk external data before it is deleted in `Block::__del__` and insert the appropriate fields into the global `ModelProto`.
This seems a bit brittle due to the coupling to the specific way external data is destructively accessed in `onnx.save`. If there exists a non-modifying save in the `onnx` repo it would be ideal to use that in `Block::__call__` instead.
Motivation and Context
Fixes `generate_artifacts` bug reported in #22411