Resize Bilinear interpolation - Tensor support#32
Resize Bilinear interpolation - Tensor support#32r-abishek merged 35 commits intor-abishek:ar/resize_tensorfrom
Conversation
…nto fg/tensor_resize
r-abishek
left a comment
There was a problem hiding this comment.
Please look into these changes
|
|
||
| srcDescPtr->w = ((srcDescPtr->w / 8) * 8) + 8; | ||
| dstDescPtr->w = ((dstDescPtr->w / 8) * 8) + 8; | ||
| // srcDescPtr->w = ((srcDescPtr->w / 8) * 8) + 8; |
There was a problem hiding this comment.
Can we leave it uncommented so as to ensure that it works with a random size too?
|
|
||
| char temp[1000]; | ||
| strcpy(temp, dst); | ||
| strcpy(temp, "dst_ten_pkd_"); |
| for (; vectorLoopCount < alignedLength; vectorLoopCount+=4) | ||
| { | ||
| p0 = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3); | ||
| p0 = _mm_mul_ps(p0, pWRatio); |
There was a problem hiding this comment.
Instead of the setr_ps inside the loop, see if you can do a px0 = _mm_setr_epi32(0, 1, 2, 3); and pxFour = _mm_set1_epi32(4); outside this loop. Check if replacing this line to px0 = _mm_add_epi32(px0, pxFour); p0 = _mm_castsi128_ps(px0); works
src/include/cpu/rpp_cpu_simd.hpp
Outdated
| } | ||
| } | ||
|
|
||
| inline RppStatus rpp_resize_load4_u8pkd3_to_f32pln3(Rpp8u* srcPtrTopRow, Rpp8u* srcPtrBottomRow, Rpp32u* loc, __m128* p) |
There was a problem hiding this comment.
This kind of double row load could be useful to any functionality using bilinear interpolation. Probably call it bilinear load instead of having a tie to resize.
src/include/cpu/rpp_cpu_simd.hpp
Outdated
| px[1] = _mm_loadu_si128((__m128i *)(srcPtrTopRow + loc[1])); /* Load the 1st RGBR pixel from TopRow*/ | ||
| px[2] = _mm_loadu_si128((__m128i *)(srcPtrTopRow + loc[2])); /* Load the 1st RGBR pixel from TopRow*/ | ||
| px[3] = _mm_loadu_si128((__m128i *)(srcPtrTopRow + loc[3])); /* Load the 1st RGBR pixel from TopRow*/ | ||
| px[0] = _mm_unpacklo_epi8(px[0], px[2]); // R1 R3 G1 G3 B1 B3 R2 R3 G2 G3 B B |
There was a problem hiding this comment.
Few things on the comments:
Follow the same /* comment style like other simd helper functions.
Please add spacing after /* and before */.
Also add a minimum of 4 spaces after the semicolon and before you start your comment.
Please see if you can use some similar commenting style from the previous simd helpers for uniformity in all our helpers.
| p[9] = _mm_cvtepi32_ps(_mm_shuffle_epi8(px[2], maskp4)); | ||
| p[10] = _mm_cvtepi32_ps(_mm_shuffle_epi8(px[3], maskp1)); | ||
| p[11] = _mm_cvtepi32_ps(_mm_shuffle_epi8(px[3], maskp2)); | ||
| return RPP_SUCCESS; |
There was a problem hiding this comment.
Give one line space before all the return RPP_SUCCESS; lines
| __m128 pChannel = _mm_set1_ps((float) srcDescPtr->c); | ||
| __m128 p0, p2, p4, p5, p6, p7, pColFloor; | ||
| __m128i pxColFloor; | ||
| __m128 pRow[16], pPixels[4]; |
There was a problem hiding this comment.
Not sure if you are using all 16 and 4 allocated?
Modify test suite
* Erode initial commit * Optimize erode for hip * Add hip tensor test suite support for erode * Dilate initial commit * Optimize dilate for hip * Add hip tensor test suite support for dilate * Add hip perf tests for erode and dilate * Add conditions for borders * Modify ROI override comment for better clarity * Modify comment style
Modify unit tests and performance tests Add ImagePatch struct Introduce dstImageSizes param for resize Fix the pkd version wrt changes
Add F32 resize HOST calls in test suite
…Cm#81) * Tensor color_twist initial commit for U8 * Merge branch 'rr/color_twist_new' of https://github.com/rrawther/rpp into ar/opt_color_twist_host * Remove two mul instructions * Add cast to ps * Add AVX2 version for U8PKD3 color_twist * Fix all variants under U8 * Common formatting change * Add support for f32 * Add f16 support * Add i8 support * Minor build fix * Add tensor color_twist unittests * Add tensor color_twist performancetests * Fix codacy issue * Fix codacy issue * Remove commented code * Change alignedLength computation in tensor_augmentations * Modify comment style * Fix for BatchPD PLN3 color_twist * Increase static allocation * Fix for squiggly lines
Resize F32 host tensor
Add I8 and F16 resize calls in test suite
* Remove OCL_COMPILE and reformat * Minor #ifdef change
r-abishek
left a comment
There was a problem hiding this comment.
Please take a look at restructuring and this second round of changes
| // *param[in] srcDesc source tensor descriptor | ||
| // *param[out] dstPtr destination tensor memory | ||
| // *param[in] dstDesc destination tensor descriptor | ||
| // *param[in] roiTensorSrc ROI data for each image in source tensor (2D tensor of size batchSize * 4, in either format - XYWH(xy.x, xy.y, roiWidth, roiHeight) or LTRB(lt.x, lt.y, rb.x, rb.y)) |
There was a problem hiding this comment.
Please describe the dstImgSizes param here with language similar to the alpha for brightness or other parm in this file.
| if ((srcDescPtr->dataType == RpptDataType::U8) && (dstDescPtr->dataType == RpptDataType::U8)) | ||
| { | ||
| resize_u8_u8_host_tensor(static_cast<Rpp8u*>(srcPtr) + srcDescPtr->offsetInBytes, | ||
| srcDescPtr, |
There was a problem hiding this comment.
Minor formatting change to maintain tabbing such that all params start at the character after "(". Add one space for L869-L876. F16/F32/I8 are fine.
| dstPtrChannel = dstPtrImage; | ||
| Rpp32f srcLocationRow, srcLocationColumn, pixel; | ||
| Rpp32s srcLocationRowFloor, srcLocationColumnFloor; | ||
| Rpp32u alignedLength = (dstImgSize[batchCount].width / 4) * 4; |
There was a problem hiding this comment.
Can you try using Rpp32u alignedLength = dstImgSize[batchCount].width & ~3; here
| for (; vectorLoopCount < alignedLength; vectorLoopCount+=4) | ||
| { | ||
| p0 = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3); | ||
| p0 = _mm_mul_ps(p0, pWRatio); |
|
|
||
| for (; vectorLoopCount < alignedLength; vectorLoopCount+=4) | ||
| { | ||
| p0 = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3); |
There was a problem hiding this comment.
Please try modularizing L5368-L5382 inside rpp_cpu_common.hpp under compute_resize_column_loc() and reuse it for L5499-L5512, L5613-L5626 and if possible for all other bit depths.
| Rpp32f weightedWidth = srcLocationColumn - srcLocationColumnFloor; | ||
| Rpp32f weightedWidth1 = 1 - weightedWidth; | ||
| srcLocationColumnFloor = (srcLocationColumnFloor > widthLimit) ? widthLimit : srcLocationColumnFloor; | ||
| *dstPtrRow++ = (Rpp32f)(((*(srcPtrTopRowR + srcLocationColumnFloor)) * weightedHeight1 * weightedWidth1) |
There was a problem hiding this comment.
You can possibly do the m1/m2/m3/m4 here too?
src/include/cpu/rpp_cpu_simd.hpp
Outdated
| inline RppStatus rpp_bilinear_load4_f32pkd3_to_f32pln3(Rpp32f* srcPtrTopRow, Rpp32f* srcPtrBottomRow, Rpp32u* loc, __m128* p) | ||
| { | ||
| __m128 pTemp[8]; | ||
| pTemp[0] = _mm_loadu_ps((float *)(srcPtrTopRow + loc[0])); |
src/include/cpu/rpp_cpu_simd.hpp
Outdated
| inline RppStatus rpp_bilinear_load4_f32pln_to_f32pln(Rpp32f* srcPtrTopRow, Rpp32f* srcPtrBottomRow, Rpp32u* loc, __m128* p) | ||
| { | ||
| __m128 pTemp[8]; | ||
| pTemp[0] = _mm_loadu_ps((float *)(srcPtrTopRow + loc[0])); |
| p[9] = _mm_unpacklo_ps(pTemp[4], pTemp[5]); | ||
| p[10] = _mm_unpackhi_ps(pTemp[4], pTemp[5]); | ||
| p[11] = _mm_unpacklo_ps(pTemp[6], pTemp[7]); | ||
| return RPP_SUCCESS; |
There was a problem hiding this comment.
One line space before all return RPP_SUCCESS;
src/include/cpu/rpp_cpu_simd.hpp
Outdated
| return RPP_SUCCESS; | ||
| } | ||
|
|
||
| inline RppStatus rpp_store4_f32pln_to_u8pln(Rpp8u* dstPtr, __m128 p) |
There was a problem hiding this comment.
Is this just for pln1 to pln1? Then change to say f32pln1_to_u8pln1
* Add crop for hip * Add support for crop * Add test suite for crop * Fix #ifdefs * Remove unused variable * Remove unused variable * Revert hip headers * Remove unnecessary hip includes * Add two versions of code - slower removal pending * Remove usage of buffer_copy helpers
r-abishek
left a comment
There was a problem hiding this comment.
Another round of changes. @fiona-gladwin please take a look.
src/include/cpu/rpp_cpu_common.hpp
Outdated
| return product; | ||
| } | ||
|
|
||
| inline void compute_resize_src_loc(Rpp32s dstLocation, Rpp32f scale, Rpp32u limit, Rpp32s &srcLoc, Rpp32f &weight, bool hasRGBChannels=false) |
There was a problem hiding this comment.
Move this compute_resize_src_loc() as the first function under the compute section at L2094 in this file
| for(int i = 0; i < dstImgSize[batchCount].height; i++) | ||
| { | ||
| compute_resize_src_loc(i, hRatio, heightLimit, srcLocationRowFloor, weightedHeight); | ||
| weightedHeight1 = 1 - weightedHeight; |
There was a problem hiding this comment.
Why is the weightedHeight1 alone computed outside?
| { | ||
| pDstLoc = _mm_setr_ps(vectorLoopCount, vectorLoopCount + 1, vectorLoopCount + 2, vectorLoopCount + 3); | ||
| compute_resize_src_loc_sse(pDstLoc, pWRatio, pWidthLimit, srcLocCF, pWeightedWidth, true); | ||
| pWeightedWidth1 = _mm_sub_ps(pOne, pWeightedWidth); |
There was a problem hiding this comment.
Why is the pWeightedWidth1 alone computed outside?
| for (; vectorLoopCount < dstImgSize[batchCount].width; vectorLoopCount++) | ||
| { | ||
| compute_resize_src_loc(vectorLoopCount, wRatio, widthLimit, srcLocationColumnFloor, weightedWidth, true); | ||
| weightedWidth1 = 1 - weightedWidth; |
There was a problem hiding this comment.
Compute weightedWidth1 inside too
| { | ||
| compute_resize_src_loc(vectorLoopCount, wRatio, widthLimit, srcLocationColumnFloor, weightedWidth, true); | ||
| weightedWidth1 = 1 - weightedWidth; | ||
| Rpp32f m1 = weightedHeight1 * weightedWidth1; |
There was a problem hiding this comment.
You could have a templated compute_bilinear_interpolation() just like the compute_bilinear_interpolation_sse() for L5392-L5407 and reuse everywhere
src/include/cpu/rpp_cpu_simd.hpp
Outdated
| return RPP_SUCCESS; | ||
| } | ||
|
|
||
| inline void compute_resize_src_loc_sse(__m128 &dstLoc, __m128 &scale, __m128 &limit, Rpp32u *srcLoc, __m128 &weight, bool hasRGBChannels = false) |
There was a problem hiding this comment.
Lets move all these compute functions to rpp_cpu_common under the compute section. Use the rpp_cpu_simd only for generalized load/store routine pairs.
Restructure codes
…te-Libraries/rpp into fg/tensor_resize
Add U8 resize bilinear interpolation support.