Conversation
| int *D1 = dst[k+1]; | ||
|
|
||
| for( dx = 0; dx < len0; dx += cn ) | ||
| for( dx = 0; (xofs[dx] + cn) < smax; dx += cn ) |
There was a problem hiding this comment.
- cn
As we reading 4 bytes (v_load_expand_q()), so I believe we should count these + 4 somewhere.
(in meaning I would like to see " + 4" / " - 4" in the code directly)
There was a problem hiding this comment.
This loop is predicated on single units of the input matrix (cn), not vector unit size. Would you prefer a variable similar to int step = (vuint32x4::nlanes/cn)*cn; to replace the direct usage of cn? The 8u3 case can easily port to larger vector sizes (the 75% efficiency may show more).
There was a problem hiding this comment.
This note is not about the "step".
It is about the limit, like xofs[dx] + cn + 4 < some_buffer_limit
8u3 case can easily port to larger vector sizes
Make sense, there are several algorithm implementations which process 4 pixels (4*3 = 12 bytes) per iteration. Image resolutions are usually divided by 4 (640x480, 1280x720, etc), so they don't require heavy tail processing.
There was a problem hiding this comment.
Any unit not a multiple of cn doesn't make sense here. Adding + 4 only guarantees 1 more operation which can safely vectorize does not get vectorized. The only thing the vector loop must satisfy is that it never touches the final cn bytes of src or dst. It strides by 3B.
There was a problem hiding this comment.
v_load_expand_q() reads 4 bytes, so we must check that all these 4 bytes are fit into the buffer range limits.
There was a problem hiding this comment.
We must check that all 4 bytes valid.
Through explicit check (with 4 bytes, not through "obfuscated" conditions). Make it clear for review.
We already have that mistake, so let avoid similar mistakes in the future.
There was a problem hiding this comment.
Checking a non-modulo 3 value against such a loop counter is at best accidental correctness, and at worst subtly wrong. (e.g failing to account for src[] offset as was the original bug, which was a different bug altogether). The loop touches 1 + 1/3 8c3 values per iteration, but increments by 1.
There was a problem hiding this comment.
As far as I understand fourth value will be OOB on writing when dx >= (dwidth - cn) and will be OOB on reading when (xofs[dx] + cn) >= (swidth -cn).
So the condition could be (xofs[dx] + 2*cn) < smax or (xofs[dx] + cn + 4) <= smax
There was a problem hiding this comment.
That isn't wrong, but will overlook safe vector ops. It ignores how smax is selected. smax >= (swidth - cn) can only be met if dx >= dmax. This can never be true since xofs[] is by design an array of increasingly large offsets.
There was a problem hiding this comment.
Sorry for mistake in the previous message. I've been confused by the change of smax meaning.
So due to the way smax is defined we could state that smax < swidth - cn therefore (xofs[dx] + cn) < smax check ensures that xofs[dx] + cn < swidth - cn or that xofs[dx] + cn <= swidth - cn - 1 equal to xofs[dx] + cn <= swidth - 4
Both statements ensure safe reading of 4 values starting from xofs[dx] + cn however the latter clearly states that it safe for reading of strictly 4 values while the former rather looks like safe reading of a pixel triplet.
Direct mention of 4 values(or even better of v_int32x4::nlanes) is important to avoid introduction of OOB mistakes when the code will be migrated to wide universal intrinsics.
Actually, we can do this in constant time. xofs always contains same or increasing offset values. We can instead find the most extreme value used and never attempt to load it. Similarly, we can note for all dx >= 0 and dx < (dwidth - cn) where xofs[dx] + cn < xofs[dwidth-cn] implies dx < (dwidth - cn). Thus, we can use this to control our loop termination optimally. This fixes opencv#16137 with little or no performance impact. I have also added a debug check as a sanity check.
|
@terfendail @alalek Any comments on latest version? |
terfendail
left a comment
There was a problem hiding this comment.
There still is a readability issue that could affect porting to wide universal intrinsics. However this version is better and cleaner than previous so I think it could be merged.
I wasn't impressed with my workaround for the OOB fix. Let's fix it with little or no regressing of performance.
This fixes #16137 with little or no performance impact. I have
also added a debug check as a sanity check.