Box Filter Nearest Neighbors Padding by Srihari-mcw · Pull Request #455 · r-abishek/rpp

Srihari-mcw · 2025-07-01T01:47:52Z

This PR adds nearest-neighbor padding to the box filter on both the HOST and HIP backends.
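For readers unfamiliar with the padding mode, here is a minimal scalar sketch of a 1-D box filter with nearest-neighbor (replicate) padding. It is not the RPP implementation; the function name `boxFilterReplicate1D` and the use of `std::vector<float>` are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical scalar reference, not RPP code: 1-D box filter in which taps that
// fall outside the row reuse the nearest edge sample instead of reading zeros.
std::vector<float> boxFilterReplicate1D(const std::vector<float> &src, int kernelSize)
{
    const int n = static_cast<int>(src.size());
    const int pad = kernelSize / 2;              // taps on each side of the centre
    std::vector<float> dst(src.size());
    for (int x = 0; x < n; x++)
    {
        float sum = 0.0f;
        for (int k = -pad; k <= pad; k++)
        {
            // Clamp the tap position into [0, n - 1] (nearest-neighbor padding)
            const int clampedX = std::max(0, std::min(x + k, n - 1));
            sum += src[clampedX];
        }
        dst[x] = sum / static_cast<float>(kernelSize);
    }
    return dst;
}
```

With a 3-tap kernel, the first output averages the first sample twice and the second sample once, so the edges keep a full-weight average rather than being darkened by implicit zeros.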

Srihari-mcw · 2025-08-05T00:13:12Z

@r-abishek, the PR is ready for your review

HazarathKumarM · 2025-08-11T12:46:21Z

@r-abishek
This PR is ready for review. The HIP code has undergone major function restructuring.

r-abishek

@Srihari-mcw @HazarathKumarM Added some key comments. Please discuss the feasibility and let me know.

r-abishek · 2025-08-12T23:56:21Z

src/include/common/cpu/rpp_cpu_filter.hpp

 // load function for 3x3 kernel size
-inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, Rpp8u **srcPtrTemp, Rpp32s rowKernelLoopLimit)
+template<typename T>
+inline void rpp_load_box_filter_char_3x3_host(__m256i *pxRow, T **srcPtrTemp, Rpp32s rowKernelLoopLimit, Rpp32s padIndex)


If these 4 functions are templated, then ideally the separator comment on L1017 above should indicate that it's for both U8/I8?

r-abishek · 2025-08-13T00:16:02Z

src/modules/tensor/cpu/kernel/box_filter.cpp

+// unpack and sign extend higher half of 9 256 bit registers and add (used for 9x9 kernel size I8 variants)
+inline void unpackhi_signext_and_add_9x9_host(__m256i *pxRow, __m256i *pxDst)
+{
+    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[0], avx_px0), 8), 8);


Can we have just 2 templated functions, one for unpackhi and one for unpacklo?
Try looping over the rows for different kernel sizes, as below, and check whether we can get away with it on performance.
Try a #pragma unroll if the compiler doesn't unroll the loop automatically.

template <Rpp32s K>    // K = kernel size, i.e. the number of rows in pxRow to add
inline void unpackhi_signext_and_add_host(__m256i *pxRow, __m256i *pxDst)
{
    // sign extend each 8 bit value to 16 bits with a shift-left / arithmetic-shift-right pair
    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[0], avx_px0), 8), 8);
    for (int i = 1; i < K; i++)
        pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpackhi_epi8(pxRow[i], avx_px0), 8), 8));
}

template <Rpp32s K>
inline void unpacklo_signext_and_add_host(__m256i *pxRow, __m256i *pxDst)
{
    pxDst[0] = _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpacklo_epi8(pxRow[0], avx_px0), 8), 8);
    for (int i = 1; i < K; i++)
        pxDst[0] = _mm256_add_epi16(pxDst[0], _mm256_srai_epi16(_mm256_slli_epi16(_mm256_unpacklo_epi8(pxRow[i], avx_px0), 8), 8));
}
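To make the shift pair in the suggestion concrete, the following is a portable scalar sketch of what happens in a single 16-bit lane, plus a row loop templated on the row count. It models the intrinsics rather than calling them, and the names `signExtendLane` and `signExtendAndAddLanes` are hypothetical, not part of RPP.

```cpp
#include <cstdint>

// Scalar model of one 16-bit lane after unpacking a byte against zero:
// the source byte sits in the low 8 bits and the high 8 bits are zero.
inline int16_t signExtendLane(uint8_t rawByte)
{
    const uint16_t lane = rawByte;                                // zero-interleaved lane
    const uint16_t shifted = static_cast<uint16_t>(lane << 8);    // models the 16-bit left shift by 8
    // Models the arithmetic right shift by 8, which copies the sign bit downwards.
    // Assumes two's complement conversion and arithmetic shift, as on mainstream compilers.
    return static_cast<int16_t>(static_cast<int16_t>(shifted) >> 8);
}

// Adds K rows' worth of sign-extended lanes; K is a compile-time row count,
// so the compiler is free to unroll the loop.
template <int K>
inline int16_t signExtendAndAddLanes(const uint8_t *rowBytes)
{
    int16_t sum = signExtendLane(rowBytes[0]);
    for (int i = 1; i < K; i++)
        sum = static_cast<int16_t>(sum + signExtendLane(rowBytes[i]));
    return sum;
}
```

Because `K` is a non-type template parameter, one definition can serve every kernel size, which is the code-size reduction the comment is asking about. Whether the vectorized version keeps its performance would still need to be measured.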

r-abishek · 2025-08-13T00:29:23Z

src/modules/tensor/cpu/kernel/box_filter.cpp

+                        if constexpr (std::is_same<T, Rpp8s>::value)
+                        {
+                            unpacklo_signext_and_add_9x9_host(pxRow, &pxRowHalf[0]);
+                            unpackhi_signext_and_add_9x9_host(pxRow, &pxRowHalf[1]);


Seems like you always have unpacklo immediately followed by unpackhi.

Why not change to:

if constexpr (std::is_same<T, Rpp8s>::value)
    unpack_signext_and_add_host<kernelSize>(pxRow, pxRowHalf);  // executes unpacklo and unpackhi (kernelSize is templated)
else
    unpack_and_add_host<kernelSize>(pxRow, pxRowHalf);          // executes unpacklo and unpackhi (kernelSize is templated)

I understand unpack_signext_and_add_host is new, so this should work there. The older unpack_and_add_9x9_host() might have dependencies elsewhere, but we can check whether the code can be shrunk further here.
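A scalar sketch of the combined call might look like the following. It is an illustration under assumptions: `kHalf`, `RawRow`, `HalfSums`, `widenByte` and `unpackAndAddBothHalves` are hypothetical names, and the model treats "lo" and "hi" as the two contiguous halves of a row, ignoring the per-128-bit-lane interleave of the real AVX2 unpack instructions.

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

constexpr int kHalf = 4;                          // illustrative half-row width
using RawRow = std::array<uint8_t, 2 * kHalf>;    // raw bytes, as a register would hold them
using HalfSums = std::array<int16_t, kHalf>;      // widened 16-bit accumulators

// Widens one raw byte to 16 bits: sign extension for I8 data, zero extension for U8 data.
template <typename T>
inline int16_t widenByte(uint8_t raw)
{
    if constexpr (std::is_same<T, int8_t>::value)
        return static_cast<int16_t>(static_cast<int8_t>(raw));
    else
        return static_cast<int16_t>(raw);
}

// One call produces both halves, so the caller no longer pairs a "lo" and a "hi" helper.
template <typename T, int K>
inline void unpackAndAddBothHalves(const std::array<RawRow, K> &rows, HalfSums &sumLo, HalfSums &sumHi)
{
    sumLo.fill(0);
    sumHi.fill(0);
    for (int i = 0; i < K; i++)
        for (int x = 0; x < kHalf; x++)
        {
            sumLo[x] = static_cast<int16_t>(sumLo[x] + widenByte<T>(rows[i][x]));
            sumHi[x] = static_cast<int16_t>(sumHi[x] + widenByte<T>(rows[i][kHalf + x]));
        }
}
```

Here the signed/unsigned choice is pushed down into `widenByte`, so the call site needs no branch at all. That is one possible way to shrink the code further, not necessarily the one the PR adopted.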

r-abishek · 2025-08-13T00:50:16Z

src/modules/tensor/hip/kernel/box_filter.cpp

+            int clampedX = max(roiTensorPtrSrc[id_z].xywhROI.xy.x,
+                                min(id_x_i + i, roiTensorPtrSrc[id_z].xywhROI.xy.x + roiTensorPtrSrc[id_z].xywhROI.roiWidth - 1));
+            int clampedY = max(roiTensorPtrSrc[id_z].xywhROI.xy.y,
+                                min(id_y_i, roiTensorPtrSrc[id_z].xywhROI.xy.y + roiTensorPtrSrc[id_z].xywhROI.roiHeight - 1));


clampedY is completely loop independent.
For clampedX, the whole section "roiTensorPtrSrc[id_z].xywhROI.xy.x + roiTensorPtrSrc[id_z].xywhROI.roiWidth - 1" is loop independent.
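The hoisting described above can be sketched on the host side as follows. The `Roi` struct is a hypothetical stand-in for the `xywhROI` fields, and `clampedRowIndices` is an illustrative helper, not a function from the PR.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the ROI fields used in the kernel.
struct Roi { int x, y, roiWidth, roiHeight; };

// Returns the clamped linear index for `count` consecutive columns starting at idX.
std::vector<int> clampedRowIndices(const Roi &roi, int rowStride, int idX, int idY, int count)
{
    // Loop-invariant values, computed once before the loop
    const int xMin = roi.x;
    const int xMax = roi.x + roi.roiWidth - 1;
    const int clampedY = std::max(roi.y, std::min(idY, roi.y + roi.roiHeight - 1));
    const int rowBase = clampedY * rowStride;

    std::vector<int> indices(count);
    for (int i = 0; i < count; i++)
    {
        // Only the column clamp depends on i
        const int clampedX = std::max(xMin, std::min(idX + i, xMax));
        indices[i] = rowBase + clampedX;
    }
    return indices;
}
```

A compiler may already hoist some of this, but in a GPU kernel the ROI fields live behind a pointer into global memory, so reading them once into locals makes the intent explicit.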

r-abishek · 2025-08-13T01:19:19Z

src/modules/tensor/hip/kernel/box_filter.cpp

+
+            tempBuffer[rgbOffset] = srcPtr[clampedIdx];         // R
+            tempBuffer[rgbOffset + 1] = srcPtr[clampedIdx + 1]; // G
+            tempBuffer[rgbOffset + 2] = srcPtr[clampedIdx + 2]; // B


srcPtr is global memory.
We should probably not be accessing it 3 * 8 times inside the loop.

Can't we read a whole row of n elements in one shot, then assign into tempBuffer?
Basically:

- There will be a minimum possible value of clampedIdx across all runs of that loop.
- There will be a maximum possible value of clampedIdx across all runs of that loop.
- Find those and read the whole row.

For the last point, we may need a way to map our scalar and vector types at compile time. (Basically float should mean d_float24, uchar should mean d_uchar24 and so on)
If there is a way to define a type vec24 - which would always be the 24-element version of whatever template type T comes in, then this would perhaps work?
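A host-side sketch of the min/max idea, assuming a single-channel row and hypothetical names (`loadClampedWindow`, `xMin`, `xMax`): the span between the smallest and largest clamped column is read once, and the padded window is then filled from that local copy. Unlike a fixed 24-element vector load, the read here is exactly as long as the in-range span, so it does not model the fixed-width variant.

```cpp
#include <algorithm>
#include <vector>

// Builds a window of `count` columns starting at startX, with columns clamped into [xMin, xMax].
// Assumes 0 <= xMin <= xMax < row.size() and count >= 1.
std::vector<float> loadClampedWindow(const std::vector<float> &row, int startX, int count, int xMin, int xMax)
{
    const int first = std::max(xMin, std::min(startX, xMax));              // smallest clamped column
    const int last = std::max(xMin, std::min(startX + count - 1, xMax));   // largest clamped column

    // One contiguous read covering every column the loop can touch
    const std::vector<float> span(row.begin() + first, row.begin() + last + 1);

    std::vector<float> window(count);
    for (int i = 0; i < count; i++)
    {
        const int clampedX = std::max(xMin, std::min(startX + i, xMax));
        window[i] = span[clampedX - first];    // served from the local copy, not from `row`
    }
    return window;
}
```

In a kernel, `span` would correspond to registers or shared memory and `row` to global memory. Whether a variable-length read like this is practical on the GPU is a separate question from the fixed 24-element version discussed in the thread.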

Copying all the elements at once won't be possible, as it leads to out-of-bounds memory access in some test cases. However, we eliminated the temp buffer and now copy directly to shared memory, which also gave us some improvement.

r-abishek · 2025-08-13T01:38:36Z

src/modules/tensor/hip/kernel/box_filter.cpp

+            int clampedIdx = (id_z * srcStridesNH.x) + (clampedY * srcStridesNH.y) + (clampedX * 3);
+
+            tempBuffer[rgbOffset] = srcPtr[clampedIdx];         // R
+            tempBuffer[rgbOffset + 1] = srcPtr[clampedIdx + 1]; // G


@Srihari-mcw @HazarathKumarM
For the previous comment, is something like this possible? Please discuss.

template<typename T> struct vec24_of;
template<> struct vec24_of<uchar> { using type = d_uchar24; };   // 24 uchar will always be d_uchar24
template<> struct vec24_of<schar> { using type = d_schar24; };   // 24 schar will always be d_schar24
template<> struct vec24_of<half>  { using type = d_half24; };    // 24 half will always be d_half24
template<> struct vec24_of<float> { using type = d_float24; };   // 24 float will always be d_float24
template<typename T> using vec24_t = typename vec24_of<T>::type; // Just to say that vec24_t is the 24 element type of T and is dependent on T

// Then while calling or using it inside a kernel
template<typename T>
__global__ void box_filter_5x5_pkd_hip_tensor(T *srcPtr, uint2 srcStridesNH, T *dstPtr, uint2 dstStridesNH, uint padLength, uint2 tileSize, RpptROIPtr roiTensorPtrSrc)
{
    // use minimum value of clampedIndex for the whole loop and take all elements needed after that into tempBuffer_24 - assuming a max of 24 elements needed?
    vec24_t<T> tempBuffer_24 = *(vec24_t<T>*)&srcPtr[clampedIndex];
}
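The same trait pattern can be tried out in plain host C++. In this sketch `Uchar24` and `Float24` are hypothetical stand-ins for the HIP-side `d_uchar24` and `d_float24` types, and `load24` copies element by element rather than casting the pointer, which sidesteps the alignment and aliasing questions a reinterpreting load would raise.

```cpp
#include <array>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for the 24-element vector types
struct Uchar24 { std::array<uint8_t, 24> data; };
struct Float24 { std::array<float, 24> data; };

// Primary template is declared but not defined, so an unsupported T fails at compile time
template <typename T> struct Vec24Of;
template <> struct Vec24Of<uint8_t> { using type = Uchar24; };
template <> struct Vec24Of<float> { using type = Float24; };
template <typename T> using Vec24 = typename Vec24Of<T>::type;

// Generic code names the 24-element type through the trait instead of spelling it out
template <typename T>
Vec24<T> load24(const T *src)
{
    Vec24<T> out;
    for (int i = 0; i < 24; i++)
        out.data[i] = src[i];    // caller must guarantee 24 readable elements
    return out;
}

static_assert(std::is_same<Vec24<float>, Float24>::value, "float maps to Float24");
static_assert(std::is_same<Vec24<uint8_t>, Uchar24>::value, "uint8_t maps to Uchar24");
```

The requirement that 24 elements be readable from `src` is the same constraint that makes a fixed-width load risky near the end of a buffer.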

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Srihari-mcw · 2025-10-12T17:50:30Z

External PR issued and merged

Srihari-mcw added 30 commits June 16, 2025 16:20

Updates for box filter to test

e2f34fb

Box Filter Newer Commit with fixes

a1218c3

Further updates to match PKD3 and PLN3 output

5bec7e1

Fixes for RPP Box filter border replicate - Fix accuracy

c7e925b

Rename variables

ff0582d

Restore AVX and Update AVX code for box filter

a6b4afd

Box Filter float updates

4c34bd7

Updates to float version

92f338c

Compilation fixes

db2f2ba

HIP Updates for Box Filter kernelSize = 3

09ee499

HIP Updates for PLN Code

1c774d6

HIP Updates for kernelSize = 5

35aa810

HIP Updates for kernelSize = 7

edd3dc3

HIP Updates for kernelSize = 9

f6e48c9

Remove additional else

dc7ef37

Fix issues with alignedLength for kernelSize 3 float variants PKD3 to PLN3

27dfedc

Add golden outputs

17d597e

Add additional borderType parameter

46ae0ea

Separate float implementation for 3x3 box filter planar

32cb4be

Rename function

6abfcc7

Introduce functions to calculate in float type itself

0fbc9fe

Updates for PKD variants

03fdbe6

Compilation fixes

22d8f5a

Updates for PKD3 to PLN3

52a7cef

Fix accuracy issues

4d75b0d

PLN3 to PKD3 updates for box filter

7f456d7

Float shared variables

747f4b3

Overload box filter for various kernelSizes

6fb913f

Template the PKD3 and PLN3 implementations

6e77dbb

Template the rest of the implementations

c2c42ff

HazarathKumarM reviewed Aug 4, 2025

src/include/common/cpu/rpp_cpu_filter.hpp (outdated, resolved)

Update comment

9db2985

Srihari-mcw added 2 commits August 5, 2025 08:41

Add declarations for I8 functions with rounding

5a11e1a

Whitespace and type updates

f407f34

Srihari-mcw force-pushed the box_filter_padding_updates branch from e01e45d to f407f34 on August 5, 2025 03:57

Srihari-mcw and others added 2 commits August 6, 2025 19:24

Updates for performance - U8/I8

9b96354

Templated the box filter compute functions for all kernel sizes and optimized the I8 variants

80711b0

Update comments

15b9212

r-abishek requested changes Aug 13, 2025

r-abishek requested a review from Copilot August 13, 2025 01:42

Copilot AI reviewed Aug 13, 2025

Srihari-mcw and others added 14 commits August 13, 2025 16:04

Make initial changes to template unpack function

fd80cdd

Fixes for box filter compilation

5d7d0d4

modified padding load logic

cf910d8

Merge branch 'develop' into box_filter_padding_updates

eebd365

Merge branch 'develop' into box_filter_padding_updates

3710940

Merge branch 'develop' into box_filter_padding_updates

2f14374

Merge branch 'develop' into box_filter_padding_updates

6babc28

Merge branch 'develop' into box_filter_padding_updates

f19c3fd

Update the round function used

2d47a76

Merge branch 'develop' into box_filter_padding_updates

c89f6f5

Rename verticalDirection and horizontalDirection to padVertical and padHorizontal

dc863ef

Merge branch 'develop' into box_filter_padding_updates

0de01f8

Fix accuracy issues

51c3cfe

Merge branch 'develop' into box_filter_padding_updates

d4d1bb8

Srihari-mcw closed this Oct 12, 2025

6 participants
