Skip to content

Improve resize 8u3#16146

Merged
opencv-pushbot merged 1 commit intoopencv:3.4from
pmur:reg_16137x2
Jan 22, 2020
Merged

Improve resize 8u3#16146
opencv-pushbot merged 1 commit intoopencv:3.4from
pmur:reg_16137x2

Conversation

@pmur
Copy link
Copy Markdown
Contributor

@pmur pmur commented Dec 12, 2019

I wasn't impressed with my workaround for the OOB fix. Let's fix it with little or no regressing of performance.

This fixes #16137 with little or no performance impact. I have
also added a debug check as a sanity check.

int *D1 = dst[k+1];

for( dx = 0; dx < len0; dx += cn )
for( dx = 0; (xofs[dx] + cn) < smax; dx += cn )
Copy link
Copy Markdown
Member

@alalek alalek Dec 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • cn

As we reading 4 bytes (v_load_expand_q()), so I believe we should count these + 4 somewhere.

(in meaning I would like to see " + 4" / " - 4" in the code directly)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop is predicated on single units of the input matrix (cn), not vector unit size. Would you prefer a variable similar to int step = (vuint32x4::nlanes/cn)*cn; to replace the direct usage of cn? The 8u3 case can easily port to larger vector sizes (the 75% efficiency may show more).

Copy link
Copy Markdown
Member

@alalek alalek Dec 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This note is not about the "step".
It is about the limit, like xofs[dx] + cn + 4 < some_buffer_limit

8u3 case can easily port to larger vector sizes

Make sense, there are several algorithm implementations which process 4 pixels (4*3 = 12 bytes) per iteration. Image resolutions are usually divided by 4 (640x480, 1280x720, etc), so they don't require heavy tail processing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any unit not a multiple of cn doesn't make sense here. Adding + 4 only guarantees 1 more operation which can safely vectorize does not get vectorized. The only thing the vector loop must satisfy is that it never touches the final cn bytes of src or dst. It strides by 3B.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v_load_expand_q() reads 4 bytes, so we must check that all these 4 bytes are fit into the buffer range limits.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must check that all 4 bytes valid.
Through explicit check (with 4 bytes, not through "obfuscated" conditions). Make it clear for review.
We already have that mistake, so let avoid similar mistakes in the future.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking a non-modulo 3 value against such a loop counter is at best accidental correctness, and at worst subtly wrong. (e.g failing to account for src[] offset as was the original bug, which was a different bug altogether). The loop touches 1 + 1/3 8c3 values per iteration, but increments by 1.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand fourth value will be OOB on writing when dx >= (dwidth - cn) and will be OOB on reading when (xofs[dx] + cn) >= (swidth -cn).
So the condition could be (xofs[dx] + 2*cn) < smax or (xofs[dx] + cn + 4) <= smax

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That isn't wrong, but will overlook safe vector ops. It ignores how smax is selected. smax >= (swidth - cn) can only be met if dx >= dmax. This can never be true since xofs[] is by design an array of increasingly large offsets.

Copy link
Copy Markdown
Contributor

@terfendail terfendail Jan 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for mistake in the previous message. I've been confused by the change of smax meaning.
So due to the way smax is defined we could state that smax < swidth - cn therefore (xofs[dx] + cn) < smax check ensures that xofs[dx] + cn < swidth - cn or that xofs[dx] + cn <= swidth - cn - 1 equal to xofs[dx] + cn <= swidth - 4
Both statements ensure safe reading of 4 values starting from xofs[dx] + cn however the latter clearly states that it safe for reading of strictly 4 values while the former rather looks like safe reading of a pixel triplet.
Direct mention of 4 values(or even better of v_int32x4::nlanes) is important to avoid introduction of OOB mistakes when the code will be migrated to wide universal intrinsics.

Actually, we can do this in constant time. xofs always
contains same or increasing offset values. We can instead
find the most extreme value used and never attempt to load it.

Similarly, we can note for all dx >= 0 and dx < (dwidth - cn)
where xofs[dx] + cn < xofs[dwidth-cn] implies dx < (dwidth - cn).

Thus, we can use this to control our loop termination optimally.

This fixes opencv#16137 with little or no performance impact. I have
also added a debug check as a sanity check.
@asmorkalov
Copy link
Copy Markdown
Contributor

@terfendail @alalek Any comments on latest version?

Copy link
Copy Markdown
Contributor

@terfendail terfendail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There still is a readability issue that could affect porting to wide universal intrinsics. However this version is better and cleaner than previous so I think it could be merged.

opencv-pushbot pushed a commit that referenced this pull request Jan 22, 2020
@opencv-pushbot opencv-pushbot merged commit c1cdb24 into opencv:3.4 Jan 22, 2020
@alalek alalek mentioned this pull request Jan 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants