Erode - rebased with latest changes by HazarathKumarM · Pull Request #549 · r-abishek/rpp

HazarathKumarM · 2025-12-10T17:07:35Z

Rebased version of #334

removed commented code

* Add F32 QA Golden outputs
* modify Doxygen comments
* modify range check functions
* RPP F32 QA : Review Comments Resolution (r-abishek#431)
* Modified SIMD print functions to use union
* remove redundant unions in print functions
* removed pixel checks
* remove pixel check in threshold
* resolve review comments

---------

Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: HazarathKumarM <119284987+HazarathKumarM@users.noreply.github.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>

Srihari-mcw · 2025-12-21T15:42:08Z

utilities/test_suite/HIP/Tensor_image_hip.cpp

                if (roiTypeSrc == RpptRoiType::LTRB)
                    convert_roi(roiTensorPtrDst, RpptRoiType::XYWH, dstDescPtr->n);
-
+    


Whitespace added?

Srihari-mcw · 2025-12-21T16:23:32Z

src/modules/tensor/hip/kernel/erode.cpp

+            minVal = 1.0f;
+        #pragma unroll
+        for (int j = 0; j < filterSize; j++)
+            minVal = fminf(minVal, (float)srcPtr[k + j]);


Do a static cast

Srihari-mcw · 2025-12-21T16:40:39Z

src/modules/tensor/hip/kernel/erode.cpp

-    src_f = rpp_hip_unpack3(src_ui4.w);
-    dst_f8->f1[7] = fminf(src_f, dst_f8->f1[7]);
+    #pragma unroll
+    for (int k = 0; k < 8; k++)


Could we consider something like the approach below, or is the current approach fine? The approach below reduces the complexity to O(3n).

#include <algorithm>
#include <iostream>
using namespace std;

int main()
{
    // Input array
    int src[] = {3, 5, 6, 4, 7, 8, 9, 6, 2, 4};
    int n = sizeof(src) / sizeof(src[0]);

    // Window size
    const int k = 8;

    // Auxiliary arrays
    int leftMin[10];
    int rightMin[10];

    // Build leftMin (prefix minimum per block)
    for (int i = 0; i < n; i++)
    {
        if (i % k == 0)
            leftMin[i] = src[i];
        else
            leftMin[i] = min(leftMin[i - 1], src[i]);
    }

    // Build rightMin (suffix minimum per block)
    for (int i = n - 1; i >= 0; i--)
    {
        if (i % k == k - 1 || i == n - 1)
            rightMin[i] = src[i];
        else
            rightMin[i] = min(rightMin[i + 1], src[i]);
    }

    // Compute sliding window minimums
    cout << "Output: ";
    for (int i = 0; i + k - 1 < n; i++)
    {
        int windowMin = min(rightMin[i], leftMin[i + k - 1]);
        cout << windowMin << " ";
    }
    cout << endl;
    return 0;
}

It is basically a sliding-window approach similar to the box filter, but it requires additional computation, as in the code above (leftMin, rightMin).

AI reply for this comment: The current implementation benefits from coalesced memory access and shared memory usage. The proposed approach would require more complex memory access patterns.

Srihari-mcw · 2025-12-21T16:48:40Z

src/modules/tensor/hip/kernel/erode.cpp

+        // Nearest-neighbor padding
+        for (int i = 0; i < 8; i++)
+        {
+            int clampedX = roiBeginX + max(0, min(id_x_i + i, (roiWidth - 1)));            int clampedIdx = (id_z * srcStridesNH.x) + (clampedY * srcStridesNH.y) + (clampedX * 3);


Check Formatting

Srihari-mcw · 2025-12-21T16:48:57Z

src/modules/tensor/hip/kernel/erode.cpp

+            src_smem[hipThreadIdx_y_channel.x][hipThreadIdx_x8 + i] = srcPtr[clampedIdx];         // R
+            src_smem[hipThreadIdx_y_channel.y][hipThreadIdx_x8 + i] = srcPtr[clampedIdx + 1]; // G


Check comment spacing

Srihari-mcw · 2025-12-21T16:55:29Z

src/modules/tensor/hip/kernel/erode.cpp

+    for (int k = 0; k < 8; k++)
+    {
+        float minVal;
+        if constexpr (std::is_same_v<T, Rpp8u>)


Is a minVal setting really needed, we can just set it to srcPtr[k]? rather than hardcoded expressions based on type
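A minimal sketch of the suggestion above, assuming the window is never empty: seeding the running minimum from the first element removes the need for a per-type starting value. `window_min` is a hypothetical helper for illustration, not a function from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): minimum of `count` consecutive
// elements, seeded from the first element instead of a type-specific maximum
// such as 255 for U8 or 1.0f for F32.
template <typename T>
T window_min(const T *src, std::size_t count)
{
    assert(count > 0);              // seeding from src[0] needs a non-empty window
    T minVal = src[0];
    for (std::size_t j = 1; j < count; j++)
        minVal = std::min(minVal, src[j]);
    return minVal;
}
```

The same template then serves U8, I8 and F32 inputs without an `if constexpr` branch per type.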

Srihari-mcw · 2025-12-21T16:59:24Z

src/modules/tensor/hip/kernel/erode.cpp

    else
-        *(uint2 *)&src_smem[hipThreadIdx_y][hipThreadIdx_x8] = borderVal;
+    {
+        // Nearest-neighbor padding


Add a comment somewhere maybe that erode and dilate are independent of the type of padding in general?

we are doing NN padding itself
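One way to read this remark, sketched in 1D under the assumption of a rectangular window: with nearest-neighbor padding, every clamped tap lands on a pixel that is already inside the valid part of the window, so the minimum is the same as if out-of-range taps were simply skipped. Both function names are hypothetical, and the sketch says nothing about paddings that introduce new values (for example a constant border).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Erode (window minimum) with nearest-neighbor padding: out-of-range taps
// are clamped to the nearest valid index.
inline std::vector<int> erode_1d_clamped(const std::vector<int> &src, int radius)
{
    assert(!src.empty() && radius >= 0);
    const int n = static_cast<int>(src.size());
    std::vector<int> dst(src.size());
    for (int i = 0; i < n; i++)
    {
        int minVal = src[i];
        for (int d = -radius; d <= radius; d++)
            minVal = std::min(minVal, src[std::max(0, std::min(i + d, n - 1))]);
        dst[i] = minVal;
    }
    return dst;
}

// Erode that ignores out-of-range taps entirely (no padding at all).
inline std::vector<int> erode_1d_valid_only(const std::vector<int> &src, int radius)
{
    assert(!src.empty() && radius >= 0);
    const int n = static_cast<int>(src.size());
    std::vector<int> dst(src.size());
    for (int i = 0; i < n; i++)
    {
        const int lo = std::max(0, i - radius);
        const int hi = std::min(n - 1, i + radius);
        int minVal = src[lo];
        for (int j = lo + 1; j <= hi; j++)
            minVal = std::min(minVal, src[j]);
        dst[i] = minVal;
    }
    return dst;
}
```

The two functions agree for every radius, which is the property a code comment in the kernel could state.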

Srihari-mcw · 2025-12-21T17:05:29Z

src/modules/tensor/hip/kernel/erode.cpp

+        erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 4][hipThreadIdx_x8], &sum_f8);
+        erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 5][hipThreadIdx_x8], &sum_f8);
+        erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 6][hipThreadIdx_x8], &sum_f8);
+        if constexpr (std::is_same<T, Rpp8s>::value)


If this is needed, it should be added uniformly in all places. I doubt it is required, given that it's a min/max operation.

RoundToNearest

Srihari-mcw · 2025-12-21T17:35:59Z

src/modules/tensor/cpu/kernel/erode.cpp

+        }
+        increment_row_ptrs(srcPtrTemp, kernelSize, 1);
+    }
+    // reset source to initial position


Add an empty line before the comment

Srihari-mcw · 2025-12-21T17:43:25Z

src/modules/tensor/cpu/kernel/erode.cpp

+// -------------------- Set 0 erode compute functions --------------------
+
+// unpack lower half of 3 256 bit registers and add (used for 3x3 kernel size U8/I8 variants)
+inline void unpacklo_and_min_3x3_host(__m256i *pxRow, __m256i *pxDst)


Even these functions can be templated?

With 5x5, 7x7 and 9x9
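A rough sketch of what such templating could look like, using a plain array as a stand-in for one SIMD register so that it compiles anywhere. `Row`, `kLanes` and `min_rows_NxN` are illustrative names assumed here, not the PR's actual helpers.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kLanes = 8;                       // illustrative lane count
using Row = std::array<std::int16_t, kLanes>;   // stand-in for one vector register

// Hypothetical single template replacing separate 3x3, 5x5, 7x7 and 9x9
// helpers: lane-wise minimum across KernelSize loaded rows.
template <int KernelSize>
inline void min_rows_NxN(const Row *pxRow, Row *pxDst)
{
    static_assert(KernelSize >= 1 && KernelSize % 2 == 1, "kernel size must be odd");
    assert(pxRow != nullptr && pxDst != nullptr);
    *pxDst = pxRow[0];
    for (int k = 1; k < KernelSize; k++)        // trip count is known at compile time
        for (int lane = 0; lane < kLanes; lane++)
            (*pxDst)[lane] = std::min((*pxDst)[lane], pxRow[k][lane]);
}
```

Call sites would then pick the size through the template argument, for example `min_rows_NxN<3>(rows, &out)` or `min_rows_NxN<9>(rows, &out)`.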

Srihari-mcw · 2025-12-21T17:44:50Z

src/modules/tensor/cpu/kernel/erode.cpp

+}
+
+// add 3 256 bit registers (used for 3x3 kernel size F32/F16 variants)
+inline void min_rows_3x3(__m256 *pRow, __m256 *pDst)


Maybe these too

Srihari-mcw · 2025-12-21T17:48:54Z

src/include/common/cpu/rpp_cpu_filter.hpp

+    constexpr int PreloadRows = (KernelSize + 1) / 2;
+
+    // Load initial rows
+    for (int k = 0; k < PreloadRows; ++k)


Use #pragma unroll

Srihari-mcw · 2025-12-21T17:49:37Z

src/include/common/cpu/rpp_cpu_filter.hpp

+    using Info = MorphLoadInfo<T>;
+    using Vec  = typename Info::VecType;
+
+    constexpr int PreloadRows = (KernelSize + 1) / 2;


preLoadRows maybe the name? Following other camelCase conventions
Similarly kernelSize for KernelSize

Srihari-mcw · 2025-12-21T17:52:31Z

src/include/common/cpu/rpp_cpu_filter.hpp

+template <int KernelSize, typename T, typename PadPolicy>
+inline void rpp_morphological_load_NxN(typename MorphLoadInfo<T>::VecType *pxRow, T **srcPtrTemp, Rpp32s rowKernelLoopLimit)
+{
+    using Info = MorphLoadInfo<T>;


Maybe some other name instead of Info and MorphLoadInfo? imo

Srihari-mcw · 2025-12-21T18:03:30Z

src/modules/tensor/cpu/kernel/erode.cpp

+                            rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit);
+
+                            // unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and add
+                            unpacklo_and_min_3x3_host(pxRow, &pxRowHalf[0]);


The current version works fine, but should we explore using epi8 instructions instead of epi16, since we only do a min/max? Please share your thoughts @HazarathKumarM

requires exploration
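As a starting point for that exploration, here is a small sketch for the U8 case only: because min cannot overflow, a byte-wise minimum should give the same result as widening to 16 bit first. It uses `_mm256_min_epu8` when AVX2 is enabled at compile time and a scalar loop otherwise. The helper names are hypothetical and this is not how the PR's kernels are structured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Hypothetical helper: lane-wise min of two 32-byte U8 rows without widening.
inline void min_u8x32(const std::uint8_t *a, const std::uint8_t *b, std::uint8_t *dst)
{
    assert(a != nullptr && b != nullptr && dst != nullptr);
#if defined(__AVX2__)
    const __m256i pxA = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(a));
    const __m256i pxB = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i *>(dst), _mm256_min_epu8(pxA, pxB));
#else
    for (int i = 0; i < 32; i++)    // scalar fallback when AVX2 is not enabled
        dst[i] = std::min(a[i], b[i]);
#endif
}

// Reference path: widen each byte to 16 bit, take the min, narrow back.
inline void min_u8x32_widened(const std::uint8_t *a, const std::uint8_t *b, std::uint8_t *dst)
{
    for (int i = 0; i < 32; i++)
    {
        const std::int16_t wa = a[i];
        const std::int16_t wb = b[i];
        dst[i] = static_cast<std::uint8_t>(std::min(wa, wb));
    }
}
```

Whether the 8-bit route is actually faster in the full kernel, where the column-direction minimum also needs shuffles, would still have to be measured.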

Srihari-mcw · 2025-12-21T18:05:38Z

src/include/common/cpu/rpp_cpu_filter.hpp

+{
+    using VecType = __m256i;
+
+    static inline VecType load(void *ptr) { return _mm256_add_epi8(avx_pxConvertI8, _mm256_loadu_si256((__m256i*)ptr)); }


Please remove the usage of add_epi8 with 128; we should also check the usage of unpack and pack, similar to the box filter:

https://github.com/ROCm/rpp/blob/ddae1036b280fd5325833005ec7defdb2fa077c7/src/modules/tensor/cpu/kernel/box_filter.cpp#L344
https://github.com/ROCm/rpp/blob/ddae1036b280fd5325833005ec7defdb2fa077c7/src/modules/tensor/cpu/kernel/box_filter.cpp#L364
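A scalar sketch of why the 128 offset may be avoidable for a min operation, assuming `avx_pxConvertI8` holds 128 in every byte (this is an assumption, the constant's definition is not shown here): biasing I8 values into the U8 range preserves their order, so an unsigned min on biased values matches a signed min on the raw values. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Bias route: shift I8 into the U8 range, take an unsigned min, shift back.
inline std::int8_t min_i8_biased(std::int8_t a, std::int8_t b)
{
    const std::uint8_t ua = static_cast<std::uint8_t>(static_cast<int>(a) + 128);
    const std::uint8_t ub = static_cast<std::uint8_t>(static_cast<int>(b) + 128);
    const std::uint8_t um = std::min(ua, ub);
    return static_cast<std::int8_t>(static_cast<int>(um) - 128);
}

// Direct route: signed comparison on the raw I8 values, no bias needed.
inline std::int8_t min_i8_direct(std::int8_t a, std::int8_t b)
{
    return std::min(a, b);
}
```

Since both routes agree for every input pair, a signed comparison (or a sign-extending unpack, as in the linked box filter code) could stand in for the add-128 load.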

Srihari-mcw · 2025-12-21T18:11:21Z

src/modules/tensor/cpu/kernel/erode.cpp

+                        __m256i pxRow[3], pxRowHalf[2], pxResult;
+                        rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit);
+
+                        // unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and add


Check comment at all places - It should be min instead of add

Srihari-mcw · 2025-12-21T18:40:32Z

src/modules/tensor/cpu/kernel/erode.cpp

+
+// -------------------- Set 0 erode compute functions --------------------
+
+// unpack lower half of 3 256 bit registers and add (used for 3x3 kernel size U8/I8 variants)


min instead of add

Srihari-mcw · 2025-12-24T02:29:09Z

src/include/common/cpu/rpp_cpu_filter.hpp

+        pxRow[k] = Info::load(srcPtrTemp[k]);
+
+    // Load valid remaining rows
+    for (int k = PreloadRows; k < rowKernelLoopLimit; ++k)


using Loader = MorphVecLoader<T>;
using Vec = typename Loader::VecType;

pxRow[k] = Loader::load(srcPtrTemp[k]);

HazarathKumarM · 2025-12-26T08:32:59Z

src/include/common/cpu/rpp_cpu_filter.hpp

+struct MorphVecLoader<Rpp8u>
+{
+    using VecType = __m256i;
+


remove empty line

HazarathKumarM · 2025-12-26T08:33:13Z

src/include/common/cpu/rpp_cpu_filter.hpp

+struct MorphVecLoader<Rpp8s>
+{
+    using VecType = __m256i;
+


remove empty line; also applicable to other similar functions

HazarathKumarM · 2025-12-26T08:34:52Z

src/include/common/cpu/rpp_cpu_filter.hpp

+    using Loader = MorphVecLoader<T>;
+    using Vec  = typename Loader::VecType;
+
+    constexpr int preLoadRows = (KernelSize + 1) / 2;


change KernelSize to kernelSize

HazarathKumarM · 2025-12-26T08:35:09Z

src/include/common/cpu/rpp_cpu_filter.hpp

+    static inline __m256  pad_float() { return avx_p0; }
+};
+
+template <int KernelSize, typename T, typename PadPolicy>


change PadPolicy to padPolicy

HazarathKumarM · 2025-12-26T08:42:02Z

src/modules/tensor/cpu/kernel/erode.cpp

+                            rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit);
+
+                            // unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and min
+                            if constexpr (std::is_same<T, Rpp8s>::value)


remove {} for single line condition, please change for all instances


r-abishek

lgtm

Copilot

Pull request overview

This PR adds HOST backend support for the erode morphological operation, completing the implementation that was previously only available on the HIP backend.

Changes:

Added erode to the HOST-supported operations list in the test suite
Implemented the HOST version of rppt_erode_host with support for U8, F16, F32, and I8 data types
Added SIMD-optimized helper functions for erode operations supporting kernel sizes 3x3, 5x5, 7x7, and 9x9

Reviewed changes

Copilot reviewed 7 out of 17 changed files in this pull request and generated no comments.


File	Description
utilities/test_suite/common.py	Enabled HOST backend support for erode operation
utilities/test_suite/HOST/runImageTests.py	Added erode to kernel size iteration logic in unit and performance tests
utilities/test_suite/HOST/Tensor_image_host.cpp	Implemented erode test case with kernel size parameter handling
src/modules/tensor/rppt_tensor_morphological_operations.cpp	Implemented rppt_erode_host function with validation and data type support, plus added validation checks to GPU version
src/include/tensor/host_tensor_executors.hpp	Added function declarations for erode char and float host tensor executors
src/include/common/cpu/rpp_cpu_filter.hpp	Added SIMD-optimized helper functions for erode operations including blend/shuffle/min operations and morphological vector loaders
api/rppt_tensor_morphological_operations.h	Added documentation for rppt_erode_host and corrected parameter description for GPU version



sampath1117 and others added 7 commits September 5, 2024 11:14

added initial api support for erode

142b006

added support for U8 and I8 bitdepths for 3, 5, 7, 9 kernel sizes

8642d83

added F16 and F32 bitdepth support

874aad6

added generic kernel support

854e5db

added golden outputs

4c572c7

removed commented code

Merge remote-tracking branch 'develop' into sr/opt_erode_host

1e9f45a

fix build errors

f3b721b

HazarathKumarM mentioned this pull request Dec 10, 2025

RPP Tensor Support - Erode on HOST #334

Closed

HazarathKumarM added 5 commits December 11, 2025 04:15

Fix build and test_suite errors

333484f

revert padding changes

d831c5f

updated erode HIP kernel with latest changes

57152c4

Add F32 QA

ae95005

minor formatting fixes

ce7aa2f

HazarathKumarM added 3 commits December 19, 2025 01:09

minor comment fix

77f84cc

resolve copilot comments

9821642

resolve review comments

a36f0fe

Srihari-mcw reviewed Dec 21, 2025


HazarathKumarM and others added 6 commits December 22, 2025 05:12

resolved review comments

0105c2f

Add unpack templating changes and fix segmentation issue

99786e5

Fix PKD to PKD kernel 9 for Pack and Unpack changes.

1b6c4db

Add and template signext function

67fb3b1

Fix min Comments

1e99ef7

Fix one min Comments

d81db53

Srihari-mcw reviewed Dec 24, 2025


mukeshj0606 added 3 commits December 24, 2025 00:15

Add unroll and rename of preLoadRows

055bfd2

Fix remane of Loader and MorphVecLoader

1782665

Add empty line before comment

79653c0

HazarathKumarM commented Dec 26, 2025


Fix remove empty line, rename of kernelSze & padPolicy and remove {} for single line condition

e09f82c

r-abishek approved these changes Jan 21, 2026


r-abishek changed the base branch from develop to ar/opt_erode January 21, 2026 03:53

r-abishek assigned HazarathKumarM Jan 21, 2026

r-abishek requested a review from Copilot January 21, 2026 03:53

r-abishek added the enhancement New feature or request label Jan 21, 2026

Copilot AI reviewed Jan 21, 2026


r-abishek merged commit 8c88da6 into r-abishek:ar/opt_erode Jan 27, 2026
