`SIGBUS` crash in `pointSetBoundingRect()` on ARMv7-A (armeabi-v7a) due to unaligned NEON load

### System Information

OpenCV version: 3.4+
Platform: Android armeabi-v7a, presumably all 32-bit armv7-a boards with NEON enabled
Compiler: clang ver 14.0.6 in Android NDK 25b, generation of crashing instruction changes randomly with clang version

### Detailed description

We have hit this crash in production here at ZenID.

Happens on 32-bit ARM (armeabi-v7a ABI) phones, including natively 64-bit phones forced to run 32-bit executable.

It may happen when feeding data to `cv::boundingRect()` that isn't aligned to 64 bit, depending on your luck with compiler optimization.

In our case we had

```
struct Foo {
...
std::array<cv::Point, 4> Outline; // alignof() == 4
...
}
...
Foo bar;
cv::boundingRect(bar.Outline);
```

It crashes on https://github.com/opencv/opencv/blob/843c09bc/modules/imgproc/src/shapedescr.cpp#L899

`vx_load_low()` uses `vld1_u64(uint64_t*)` intrinsic under the hood, which may translate into either `vldr {d0}, [r0]` or `vld1.32 {d0}, [r0:64]`, depending on your luck with optimizer.

`vldr` is a generic 64 bit load wich requires alignment to 32 bit, `vld1.32` would normally only require 32 bit alignment too (and supports unaligned load?), but clang adds an alignment hint (reasonably since it's being passed int64_t*) and that results in a crash.

For us it compiled to `vldr` with NDK 21d (clang ver 9.0.8) but then it became `vld1.32` with NDK 25b (clang ver 14.0.6) and started crashing.

Godbolt repro of this behavior (play with the alignment attribute on `ua64` and see what happens):
 - [clang 14.0.0 `vld1.32 {d16}, [r0:64]`](https://godbolt.org/z/YxcGd31G7) - crash
 - [clang 9.0.1 `vldr {d16}, [r0]`](https://godbolt.org/z/hazMx7jYK) - no crash
also see alignment requirements from ARM reference manual at the end

---------------------------------------------

There are three ways I see of fixing it:

1. #### Use `int32_t*` in `pointSetBoundingRect()`.

I don't really see why it needs to operate on `int64_t*` when input depth is asserted to be 32 bits and it casts all the 64 bit vector loads to 32 bit anyway. Maybe there's some x86_64 SSE performance reason to do it?

This fixes the crash:
```
v_int32 ptXY = v_reinterpret_as_s32(v_expand_low(v_reinterpret_as_u32(vx_load_low((int32_t*)(pts + i)))));
```

However, somewhere in the codebase more unaligned NEON loads are sure to be lurking, hidden by `vldr` instructions and it may randomly explode on us in the future when the optimizer changes slightly.

2. #### Make v_load/v_store intrinsics accept unaligned data.

Draft PR: #25267

This results in `vld1.8 {d0}, [r0]` instruction and solves all cases of load/store on unaligned address.
```
#if defined(__clang__) && defined(__arm__)
#define OPENCV_HAL_IMPL_NEON_LOAD_LOW_OP(_Tpvec, _Tp, suffix) \
inline _Tpvec v_load_low(const _Tp* ptr) \
{ \
typedef _Tp CV_DECL_ALIGNED(1) unaligned_ptr; \
unaligned_ptr* uptr = (unaligned_ptr*)ptr; \
return _Tpvec(vcombine_##suffix(vld1_##suffix(uptr), vdup_n_##suffix((_Tp)0))); \
}
// and same for other load/store
```
([in `intrin_neon.hpp`](https://github.com/opencv/opencv/blob/843c09bc/modules/core/include/opencv2/core/hal/intrin_neon.hpp#L1269))

This is the solution we picked since it means we don't have to sift through the entire opencv codebase looking for potential unaligned access.

3. #### Allow alignment to lane type but no higher than 32 bit in load/store

Draft PR: #25266

This is less fool-proof but also less invasive. In effect it ensures that `vx_load_low()` remains as `vldr`, which happens half the time anyway. I expect that opencv has a few more places where 64 bit NEON load aligned to 32 bit can happen, and very few (or none) places where you end up with ≤32 NEON load that is misaligned.

On the assembly level that means:
 - `vld1.32 {d0}, [r0:64] => `vldr {d0}, [r0]` (clang already does this half the time, non-determinisically!)
 - `vld1.64 {d0}, [r0] => `vld1.32 {d0}, [r0]`
 - `vld1.64 {d0, d1}, [r0]` => `vld1.32 {d0, d1}, [r0]`
 - `vst1.64 {d0, d1}, [r0]` => `vst1.32 {d0, d1}, [r0]`

So basically do this, only for 64bit load/store:
```
inline _Tpvec v_load_low(const _Tp* ptr) \
{ \
typedef _Tp CV_DECL_ALIGNED(4) aligned32_ptr; \
aligned32_ptr* uptr = (aligned32_ptr*)ptr; \
return _Tpvec(vcombine_##suffix(vld1_##suffix(uptr), vdup_n_##suffix((_Tp)0))); \
}
```
and do the same for `v_load()`, `v_store()`, etc.

-----------------------------------------

I've looked at prior art: PR #16152 attempted something like my option 3 but making the 64-bit load/store completely unaligned. I think it doesn't make sense to go this far and only for 64-bit. That PR was closed with #16139 + #16463 used instead.

On the face of it, it makes sense to fix unaligned access on the algo side. In practice, that work is driven by bug reports and `vldr` vs `vld1.32` thing is particularly nasty, because *you don't know* if it will compile into the crashing instruction, so it's hard to test for it even with full coverage, and crashes may stay lurking for years before suddenly appearing with a new compiler version.

---------------------------------------

Alignment requirements for the relevant instructions for your reference:
![alignment-armv7-a](https://github.com/opencv/opencv/assets/2241548/e1565d24-5929-4c6e-af1e-d2c8d7e211df)

### Steps to reproduce

Feed 64-bit misaligned (and 32-bit aligned) array of ints or floats to `cv::boundingRect()`, using clang ver 14.0.6 or any other version that happens to produce `vld1.32 {d0}, [r0:64]` instruction.

### Issue submission checklist

- [X] I report the issue, it's not a question
- [X] I checked the problem with documentation, FAQ, open issues, forum.opencv.org, Stack Overflow, etc and have not found any solution
- [X] I updated to the latest OpenCV version and the issue is still there
- [X] There is reproducer code and related data files (videos, images, onnx, etc)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`SIGBUS` crash in `pointSetBoundingRect()` on ARMv7-A (armeabi-v7a) due to unaligned NEON load #25265

System Information

Detailed description

Use `int32_t*` in `pointSetBoundingRect()`.

Make v_load/v_store intrinsics accept unaligned data.

Allow alignment to lane type but no higher than 32 bit in load/store

Steps to reproduce

Issue submission checklist

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

SIGBUS crash in pointSetBoundingRect() on ARMv7-A (armeabi-v7a) due to unaligned NEON load #25265

Description

System Information

Detailed description

Use int32_t* in pointSetBoundingRect().

Make v_load/v_store intrinsics accept unaligned data.

Allow alignment to lane type but no higher than 32 bit in load/store

Steps to reproduce

Issue submission checklist

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`SIGBUS` crash in `pointSetBoundingRect()` on ARMv7-A (armeabi-v7a) due to unaligned NEON load #25265

Use `int32_t*` in `pointSetBoundingRect()`.