Skip to content

Resize Bilinear interpolation - Tensor support#32

Merged
r-abishek merged 35 commits intor-abishek:ar/resize_tensorfrom
fiona-gladwin:fg/tensor_resize
Dec 11, 2021
Merged

Resize Bilinear interpolation - Tensor support#32
r-abishek merged 35 commits intor-abishek:ar/resize_tensorfrom
fiona-gladwin:fg/tensor_resize

Conversation

@fiona-gladwin
Copy link
Copy Markdown

Add U8 resize bilinear interpolation support.

@r-abishek r-abishek added the enhancement New feature or request label Dec 1, 2021
@r-abishek r-abishek added this to the sow6ms5 milestone Dec 1, 2021
Copy link
Copy Markdown
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look into these changes


srcDescPtr->w = ((srcDescPtr->w / 8) * 8) + 8;
dstDescPtr->w = ((dstDescPtr->w / 8) * 8) + 8;
// srcDescPtr->w = ((srcDescPtr->w / 8) * 8) + 8;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we leave it uncommented so as to ensure that it works with a random size too?


char temp[1000];
strcpy(temp, dst);
strcpy(temp, "dst_ten_pkd_");
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

for (; vectorLoopCount < alignedLength; vectorLoopCount+=4)
{
p0 = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3);
p0 = _mm_mul_ps(p0, pWRatio);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of the setr_ps inside the loop, see if you can do a px0 = _mm_setr_epi32(0, 1, 2, 3); and pxFour = _mm_set1_epi32(4); outside this loop. Check if replacing this line to px0 = _mm_add_epi32(px0, pxFour); p0 = _mm_castsi128_ps(px0); works

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try this?

}
}

inline RppStatus rpp_resize_load4_u8pkd3_to_f32pln3(Rpp8u* srcPtrTopRow, Rpp8u* srcPtrBottomRow, Rpp32u* loc, __m128* p)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This kind of double row load could be useful to any functionality using bilinear interpolation. Probably call it bilinear load instead of having a tie to resize.

px[1] = _mm_loadu_si128((__m128i *)(srcPtrTopRow + loc[1])); /* Load the 1st RGBR pixel from TopRow*/
px[2] = _mm_loadu_si128((__m128i *)(srcPtrTopRow + loc[2])); /* Load the 1st RGBR pixel from TopRow*/
px[3] = _mm_loadu_si128((__m128i *)(srcPtrTopRow + loc[3])); /* Load the 1st RGBR pixel from TopRow*/
px[0] = _mm_unpacklo_epi8(px[0], px[2]); // R1 R3 G1 G3 B1 B3 R2 R3 G2 G3 B B
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few things on the comments:
Follow the same /* comment style like other simd helper functions.
Please add spacing after /* and before */.
Also add a minimum of 4 spaces after the semicolon and before you start your comment.
Please see if you can use some similar commenting style from the previous simd helpers for uniformity in all our helpers.

p[9] = _mm_cvtepi32_ps(_mm_shuffle_epi8(px[2], maskp4));
p[10] = _mm_cvtepi32_ps(_mm_shuffle_epi8(px[3], maskp1));
p[11] = _mm_cvtepi32_ps(_mm_shuffle_epi8(px[3], maskp2));
return RPP_SUCCESS;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give one line space before all the return RPP_SUCCESS; lines

__m128 pChannel = _mm_set1_ps((float) srcDescPtr->c);
__m128 p0, p2, p4, p5, p6, p7, pColFloor;
__m128i pxColFloor;
__m128 pRow[16], pPixels[4];
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you are using all 16 and 4 allocated?

fiona-gladwin and others added 15 commits November 30, 2021 23:19
* Erode initial commit

* Optimize erode for hip

* Add hip tensor test suite support for erode

* Dilate initial commit

* Optimize dilate for hip

* Add hip tensor test suite support for dilate

* Add hip perf tests for erode and dilate

* Add conditions for borders

* Modify ROI override comment for better clarity

* Modify comment style
Modify unit tests and performance tests
Add ImagePatch struct
Introduce dstImageSizes param for resize
Fix the pkd version wrt changes
Add F32 resize HOST calls in test suite
…Cm#81)

* Tensor color_twist initial commit for U8

* Merge branch 'rr/color_twist_new' of https://github.com/rrawther/rpp into ar/opt_color_twist_host

* Remove two mul instructions

* Add cast to ps

* Add AVX2 version for U8PKD3 color_twist

* Fix all variants under U8

* Common formatting change

* Add support for f32

* Add f16 support

* Add i8 support

* Minor build fix

* Add tensor color_twist unittests

* Add tensor color_twist performancetests

* Fix codacy issue

* Fix codacy issue

* Remove commented code

* Change alignedLength computation in tensor_augmentations

* Modify comment style

* Fix for BatchPD PLN3 color_twist

* Increase static allocation

* Fix for squiggly lines
Resize F32 host tensor
Add I8 and F16 resize calls in test suite
* Remove OCL_COMPILE and reformat

* Minor #ifdef change
Copy link
Copy Markdown
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at restructuring and this second round of changes

// *param[in] srcDesc source tensor descriptor
// *param[out] dstPtr destination tensor memory
// *param[in] dstDesc destination tensor descriptor
// *param[in] roiTensorSrc ROI data for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y))
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please describe the dstImgSizes param here with language similar to the alpha for brightness or other parm in this file.

if ((srcDescPtr->dataType == RpptDataType::U8) && (dstDescPtr->dataType == RpptDataType::U8))
{
resize_u8_u8_host_tensor(static_cast<Rpp8u*>(srcPtr) + srcDescPtr->offsetInBytes,
srcDescPtr,
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor formatting change to maintain tabbing such that all params start at the character after "(". Add one space for L869-L876. F16/F32/I8 are fine.

dstPtrChannel = dstPtrImage;
Rpp32f srcLocationRow, srcLocationColumn, pixel;
Rpp32s srcLocationRowFloor, srcLocationColumnFloor;
Rpp32u alignedLength = (dstImgSize[batchCount].width / 4) * 4;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try using Rpp32u alignedLength = dstImgSize[batchCount].width & ~3; here

for (; vectorLoopCount < alignedLength; vectorLoopCount+=4)
{
p0 = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3);
p0 = _mm_mul_ps(p0, pWRatio);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try this?


for (; vectorLoopCount < alignedLength; vectorLoopCount+=4)
{
p0 = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try modularizing L5368-L5382 inside rpp_cpu_common.hpp under compute_resize_column_loc() and reuse it for L5499-L5512, L5613-L5626 and if possible for all other bit depths.

Rpp32f weightedWidth = srcLocationColumn - srcLocationColumnFloor;
Rpp32f weightedWidth1 = 1 - weightedWidth;
srcLocationColumnFloor = (srcLocationColumnFloor > widthLimit) ? widthLimit : srcLocationColumnFloor;
*dstPtrRow++ = (Rpp32f)(((*(srcPtrTopRowR + srcLocationColumnFloor)) * weightedHeight1 * weightedWidth1)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can possibly do the m1/m2/m3/m4 here too?

inline RppStatus rpp_bilinear_load4_f32pkd3_to_f32pln3(Rpp32f* srcPtrTopRow, Rpp32f* srcPtrBottomRow, Rpp32u* loc, __m128* p)
{
__m128 pTemp[8];
pTemp[0] = _mm_loadu_ps((float *)(srcPtrTopRow + loc[0]));
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this typecast?

inline RppStatus rpp_bilinear_load4_f32pln_to_f32pln(Rpp32f* srcPtrTopRow, Rpp32f* srcPtrBottomRow, Rpp32u* loc, __m128* p)
{
__m128 pTemp[8];
pTemp[0] = _mm_loadu_ps((float *)(srcPtrTopRow + loc[0]));
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - why this typecast?

p[9] = _mm_unpacklo_ps(pTemp[4], pTemp[5]);
p[10] = _mm_unpackhi_ps(pTemp[4], pTemp[5]);
p[11] = _mm_unpacklo_ps(pTemp[6], pTemp[7]);
return RPP_SUCCESS;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One line space before all return RPP_SUCCESS;

return RPP_SUCCESS;
}

inline RppStatus rpp_store4_f32pln_to_u8pln(Rpp8u* dstPtr, __m128 p)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just for pln1 to pln1? Then change to say f32pln1_to_u8pln1

fiona-gladwin and others added 5 commits December 9, 2021 00:50
* Add crop for hip

* Add support for crop

* Add test suite for crop

* Fix #ifdefs

* Remove unused variable

* Remove unused variable

* Revert hip headers

* Remove unnecessary hip includes

* Add two versions of code - slower removal pending

* Remove usage of buffer_copy helpers
Copy link
Copy Markdown
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of changes. @fiona-gladwin please take a look.

return product;
}

inline void compute_resize_src_loc(Rpp32s dstLocation, Rpp32f scale, Rpp32u limit, Rpp32s &srcLoc, Rpp32f &weight, bool hasRGBChannels=false)
Copy link
Copy Markdown
Owner

@r-abishek r-abishek Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this compute_resize_src_loc() as the first function under the compute section at L2094 in this file

for(int i = 0; i < dstImgSize[batchCount].height; i++)
{
compute_resize_src_loc(i, hRatio, heightLimit, srcLocationRowFloor, weightedHeight);
weightedHeight1 = 1 - weightedHeight;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the weightedHeight1 alone computed outside?

{
pDstLoc = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3);
compute_resize_src_loc_sse(pDstLoc, pWRatio, pWidthLimit, srcLocCF, pWeightedWidth, true);
pWeightedWidth1 = _mm_sub_ps(pOne, pWeightedWidth);
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the pWeightedWidth1 alone computed outside?

for (; vectorLoopCount < dstImgSize[batchCount].width; vectorLoopCount++)
{
compute_resize_src_loc(vectorLoopCount, wRatio, widthLimit, srcLocationColumnFloor, weightedWidth, true);
weightedWidth1 = 1 - weightedWidth;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compute weightedWidth1 inside too

{
compute_resize_src_loc(vectorLoopCount, wRatio, widthLimit, srcLocationColumnFloor, weightedWidth, true);
weightedWidth1 = 1 - weightedWidth;
Rpp32f m1 = weightedHeight1 * weightedWidth1;
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could have a templated compute_bilinear_interpolation() just like the compute_bilinear_interpolation_sse() for L5392-L5407 and reuse everywhere

return RPP_SUCCESS;
}

inline void compute_resize_src_loc_sse(__m128 &dstLoc, __m128 &scale, __m128 &limit, Rpp32u *srcLoc, __m128 &weight, bool hasRGBChannels = false)
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets move all these compute functions to rpp_cpu_common under the compute section. Use the rpp_cpu_simd only for generalized load/store routine pairs.

Copy link
Copy Markdown
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks much better now

@r-abishek r-abishek merged commit b120d67 into r-abishek:ar/resize_tensor Dec 11, 2021
@fiona-gladwin fiona-gladwin deleted the fg/tensor_resize branch December 13, 2021 06:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants