Skip to content

dnn: fix unaligned memory access crash on armv7#20725

Merged
alalek merged 3 commits intoopencv:3.4from
mologie:fix-dnn-tf-on-arm
Oct 6, 2021
Merged

dnn: fix unaligned memory access crash on armv7#20725
alalek merged 3 commits intoopencv:3.4from
mologie:fix-dnn-tf-on-arm

Conversation

@mologie
Copy link
Copy Markdown
Contributor

@mologie mologie commented Sep 20, 2021

Hello,

The getTensorContent function would return a Mat pointing to some member of a Protobuf-encoded message. Protobuf does not make any alignment guarantees, which results in a crash on armv7 when loading models while bit 2 is set in /proc/cpu/alignment (or the relevant kernel feature for alignment compatibility is disabled/unavailable). Any read attempt from the previously unaligned data member would send SIGBUS.

As workaround, this commit makes an aligned copy via existing clone functionality in getTensorContent. The unsafe copy=false option is removed. Unfortunately, a rather crude hack in PReLUSubgraph in fact writes to the Protobuf message. I only fix the alignment issue in this commit, and add getTensorContentRefUnaligned to cover the write case with a safe memcpy. A FIXME marks the issue.

Affected Environment

This issue is reliably reproducible by loading any floating-point model in an optimized(!) build of OpenCV 4.5.X, including master, compiled by clang for armv7 as shipped with the Android NDK 21. The issue does not happen with all phones -- some have kernel emulation enabled to fix up the unaligned read automatically, others have support for unaligned reads on CPU level. The most reliable environment to find this issue is with old Android 6 phones, e.g. Samsung Galaxy S5 Neo.

Caveats

This patch makes loading models just slightly slower in theory, because an aligned copy is made in getTensorContent. It is possible to optimize this branch for dims!=4 when both the input and output data type is a 32-bit float.

However, I could not find a test case where this made a difference beyond a few microseconds, hence I avoided the complexity of implementing a fast branch for that specific case.

The copy cannot safely be avoided otherwise, because unaligned access is undefined behavior in C even on platforms that support it.

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

  • I agree to contribute to the project under Apache 2 License. -- My employer @authenticvision allows open-source contributions under the Apache 2.0 license; this PR upstreams a patch I developed.
  • To the best of my knowledge, the proposed patch is not based on a code under GPL or other license that is incompatible with OpenCV
  • The PR is proposed to proper branch
  • There is reference to original bug report and related work I could not find any previous bug report similar to this issue, only unrelated alignment crash issues. Bug described in PR instead.
  • There is accuracy test, performance test and test data in opencv_extra repository Existing tests cover the issue and find the bug on affected Linux arm platforms with the kernel settings described in the commit message.
  • The feature is well documented and sample code can be built with the project CMake Not a feature.

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution!

}

Mat getTensorContent(const tensorflow::TensorProto &tensor, bool copy)
Mat getTensorContentRefUnaligned(const tensorflow::TensorProto& tensor)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mat's data pointer must be aligned on base type requirements.
Otherwise it is impossible to use it in OpenCV algorithms on some platforms.

Mat getTensorContentRefUnaligned

So this function is wrong by design. Mat should not contain "unaligned" pointer.


I believe, it is better to fix existed implementation.
Check pointer alignment and reallocate Mat for such case.
Parameter copy should be renamed to forceCopy due to behavior changes.

Copy link
Copy Markdown
Contributor Author

@mologie mologie Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code's structure does not allow realigning the offending Mat, because PReLUSubgraph's finalize method depends on having write access to the misaligned data block.

My motivation with the PR was focused on resolving the crash without changing semantics.

I do not sufficiently understand the remainder of the implementation at this point to reengineer its complete structure and keep the functionality of PReLUSubgraph by propagating an aligned copy of the tensor content. Instead, I have taken special care that the previously broken code now behaves correctly and as probably intended by the author by accessing the offending Mat via memcpy only.

To address your last sentence regarding renaming copy => forceCopy: You cannot safely keep the existing behavior, it goes against what you said in the beginning where a Mat must be aligned. No copy equals not aligned, since raw Protobuf data goes into that function.


Please let me know how to proceed here. I could have it return a void* instead, which would technically fulfill your requirement of not having an unaligned Mat as return value?

Is a minimal patch sufficient as long as it is an improvement in general, or is a rewrite/restructure required?

mologie and others added 2 commits September 27, 2021 13:21
The getTensorContent function would return a Mat pointing to some
member of a Protobuf-encoded message. Protobuf does not make any
alignment guarantees, which results in a crash on armv7 when loading
models while bit 2 is set in /proc/cpu/alignment (or the relevant
kernel feature for alignment compatibility is disabled). Any read
attempt from the previously unaligned data member would send SIGBUS.

As workaround, this commit makes an aligned copy via existing clone
functionality in getTensorContent. The unsafe copy=false option is
removed. Unfortunately, a rather crude hack in PReLUSubgraph in fact
writes(!) to the Protobuf message. We limit ourselves to fixing the
alignment issues in this commit, and add getTensorContentRefUnaligned
to cover the write case with a safe memcpy. A FIXME marks the issue.
@alalek
Copy link
Copy Markdown
Member

alalek commented Sep 27, 2021

Rebased to 3.4 branch and reduced amount of .clone() calls. @mologie Please verify on your configuration.

@mologie
Copy link
Copy Markdown
Contributor Author

mologie commented Oct 1, 2021

@alalek missed your email, sorry! Thank you for looking into this, I will attempt to verify on Monday. Our application right now builds against OpenCV 4 but the API should be compatible enough to give this a try.

@mologie
Copy link
Copy Markdown
Contributor Author

mologie commented Oct 5, 2021

Just a quick status update: I am making progress, OpenCV 3 now builds for affected Android platforms, integration into affected component is pending.

(3 hours later:) I can now reproduce the issue with OpenCV 3 🎉 Will proceed with validating the patch.

Mat scales = scalesRef.clone() * -1;
Mat scalesRef = getTensorContentRef_(inputNodes[1]->attr().at("value").tensor());
// FIXME: This breaks the const guarantees of tensor() by writing to scalesRef and is likely to perform an
// unaligned write (hence the separate multiplication above to have this work on ARM).
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment's reference to multiplication (in brackets) is outdated with this commit

@alalek
Copy link
Copy Markdown
Member

alalek commented Oct 5, 2021

BTW, you can merge / cherry-pick these commits on the master(4.x) branch for validation.

@mologie
Copy link
Copy Markdown
Contributor Author

mologie commented Oct 6, 2021

Good news, I can confirm that the patch with your changes resolves the issue! Note that my validation did not cover the PReLUSubgraph bits, but these seem to be covered by regular tests.

BTW, you can merge / cherry-pick these commits on the master(4.x) branch for validation.

This gave me a nice excuse to restore compatibility with both OpenCV versions ;-). I prefer to test the exact state submitted here, even if it takes longer.

Copy link
Copy Markdown
Member

@alalek alalek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 👍

@alalek alalek merged commit a3d7811 into opencv:3.4 Oct 6, 2021
@mologie
Copy link
Copy Markdown
Contributor Author

mologie commented Oct 6, 2021

Thank you too for reviewing and improving performance here 👍

@alalek alalek mentioned this pull request Oct 7, 2021
@alalek alalek mentioned this pull request Oct 15, 2021
@alalek alalek mentioned this pull request Aug 5, 2022
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug category: dnn platform: android platform: arm ARM boards related issues: RPi, NVIDIA TK/TX, etc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants