Erode - rebased with latest changes#549
Conversation
removed commented code
* Add F32 QA Golden outputs * modify Doxygen comments * modify range check functions * RPP F32 QA : Review Comments Resolution (r-abishek#431) * Modified SIMD print functions to use union * remove redundant unions in print functions * removed pixel checks * remove pixel check in threshold * resolve review comments --------- Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: HazarathKumarM <119284987+HazarathKumarM@users.noreply.github.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
| if (roiTypeSrc == RpptRoiType::LTRB) | ||
| convert_roi(roiTensorPtrDst, RpptRoiType::XYWH, dstDescPtr->n); | ||
|
|
||
| minVal = 1.0f; | ||
| #pragma unroll | ||
| for (int j = 0; j < filterSize; j++) | ||
| minVal = fminf(minVal, (float)srcPtr[k + j]); |
| src_f = rpp_hip_unpack3(src_ui4.w); | ||
| dst_f8->f1[7] = fminf(src_f, dst_f8->f1[7]); | ||
| #pragma unroll | ||
| for (int k = 0; k < 8; k++) |
There was a problem hiding this comment.
Can we think of something like this or is the current approach fine - The below approach reduces complexity to O(3n)
#include <iostream>
#include <climits>
using namespace std;
int main() {
// Input array
int src[] = {3, 5, 6, 4, 7, 8, 9, 6, 2, 4};
int n = sizeof(src) / sizeof(src[0]);
// Window size
const int k = 8;
// Auxiliary arrays
int leftMin[10];
int rightMin[10];
// Build leftMin (prefix minimum per block)
for (int i = 0; i < n; i++) {
if (i % k == 0)
leftMin[i] = src[i];
else
leftMin[i] = min(leftMin[i - 1], src[i]);
}
// Build rightMin (suffix minimum per block)
for (int i = n - 1; i >= 0; i--) {
if (i % k == k - 1 || i == n - 1)
rightMin[i] = src[i];
else
rightMin[i] = min(rightMin[i + 1], src[i]);
}
// Compute sliding window minimums
cout << "Output: ";
for (int i = 0; i + k - 1 < n; i++) {
int windowMin = min(rightMin[i], leftMin[i + k - 1]);
cout << windowMin << " ";
}
cout << endl;
return 0;
}
There was a problem hiding this comment.
Basically a sliding window approach similar to box filter - But it requires addl computation like the above code - leftMin, rightMin
There was a problem hiding this comment.
AI reply for this comment : The current implementation benefits from coalesced memory access and shared memory usage. The proposed approach would require more complex memory access patterns.
| // Nearest-neighbor padding | ||
| for (int i = 0; i < 8; i++) | ||
| { | ||
| int clampedX = roiBeginX + max(0, min(id_x_i + i, (roiWidth - 1))); int clampedIdx = (id_z * srcStridesNH.x) + (clampedY * srcStridesNH.y) + (clampedX * 3); |
| src_smem[hipThreadIdx_y_channel.x][hipThreadIdx_x8 + i] = srcPtr[clampedIdx]; // R | ||
| src_smem[hipThreadIdx_y_channel.y][hipThreadIdx_x8 + i] = srcPtr[clampedIdx + 1]; // G |
There was a problem hiding this comment.
Check comment spacing
| for (int k = 0; k < 8; k++) | ||
| { | ||
| float minVal; | ||
| if constexpr (std::is_same_v<T, Rpp8u>) |
There was a problem hiding this comment.
Is a minVal setting really needed, we can just set it to srcPtr[k]? rather than hardcoded expressions based on type
| else | ||
| *(uint2 *)&src_smem[hipThreadIdx_y][hipThreadIdx_x8] = borderVal; | ||
| { | ||
| // Nearest-neighbor padding |
There was a problem hiding this comment.
Add a comment somewhere maybe that erode and dilate are independent of the type of padding in general?
There was a problem hiding this comment.
we are doing NN padding itself
| erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 4][hipThreadIdx_x8], &sum_f8); | ||
| erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 5][hipThreadIdx_x8], &sum_f8); | ||
| erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 6][hipThreadIdx_x8], &sum_f8); | ||
| if constexpr (std::is_same<T, Rpp8s>::value) |
There was a problem hiding this comment.
If this should be added, then it should be added uniformly at all places - I doubt the requirement of this given its a min/max operation
| } | ||
| increment_row_ptrs(srcPtrTemp, kernelSize, 1); | ||
| } | ||
| // reset source to initial position |
| // -------------------- Set 0 erode compute functions -------------------- | ||
|
|
||
| // unpack lower half of 3 256 bit registers and add (used for 3x3 kernel size U8/I8 variants) | ||
| inline void unpacklo_and_min_3x3_host(__m256i *pxRow, __m256i *pxDst) |
There was a problem hiding this comment.
Even these functions can be templated?
There was a problem hiding this comment.
With 5x5, 7x7 and 9x9
| } | ||
|
|
||
| // add 3 256 bit registers (used for 3x3 kernel size F32/F16 variants) | ||
| inline void min_rows_3x3(__m256 *pRow, __m256 *pDst) |
| constexpr int PreloadRows = (KernelSize + 1) / 2; | ||
|
|
||
| // Load initial rows | ||
| for (int k = 0; k < PreloadRows; ++k) |
| using Info = MorphLoadInfo<T>; | ||
| using Vec = typename Info::VecType; | ||
|
|
||
| constexpr int PreloadRows = (KernelSize + 1) / 2; |
There was a problem hiding this comment.
preLoadRows maybe the name? Following other camelCase conventions
Similarly kernelSize for KernelSize
| template <int KernelSize, typename T, typename PadPolicy> | ||
| inline void rpp_morphological_load_NxN(typename MorphLoadInfo<T>::VecType *pxRow, T **srcPtrTemp, Rpp32s rowKernelLoopLimit) | ||
| { | ||
| using Info = MorphLoadInfo<T>; |
There was a problem hiding this comment.
Maybe some other name instead of Info and MorphLoadInfo? imo
| rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit); | ||
|
|
||
| // unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and add | ||
| unpacklo_and_min_3x3_host(pxRow, &pxRowHalf[0]); |
There was a problem hiding this comment.
The current version works fine, but maybe should we explore the usage of epi8 instructions instead of epi16 because we only do a min/max? Pls share ur thoughts @HazarathKumarM
There was a problem hiding this comment.
requires exploration
| { | ||
| using VecType = __m256i; | ||
|
|
||
| static inline VecType load(void *ptr) { return _mm256_add_epi8(avx_pxConvertI8, _mm256_loadu_si256((__m256i*)ptr)); } |
There was a problem hiding this comment.
Pls remove the usage of add_epi8 with 128 and we should check the usage of unpack and pack similar to box filter
https://github.com/ROCm/rpp/blob/ddae1036b280fd5325833005ec7defdb2fa077c7/src/modules/tensor/cpu/kernel/box_filter.cpp#L344
https://github.com/ROCm/rpp/blob/ddae1036b280fd5325833005ec7defdb2fa077c7/src/modules/tensor/cpu/kernel/box_filter.cpp#L364
| __m256i pxRow[3], pxRowHalf[2], pxResult; | ||
| rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit); | ||
|
|
||
| // unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and add |
There was a problem hiding this comment.
Check comment at all places - It should be min instead of add
|
|
||
| // -------------------- Set 0 erode compute functions -------------------- | ||
|
|
||
| // unpack lower half of 3 256 bit registers and add (used for 3x3 kernel size U8/I8 variants) |
| pxRow[k] = Info::load(srcPtrTemp[k]); | ||
|
|
||
| // Load valid remaining rows | ||
| for (int k = PreloadRows; k < rowKernelLoopLimit; ++k) |
There was a problem hiding this comment.
using Loader = MorphVecLoader;
using Vec = typename Loader::VecType;
pxRow[k] = Loader::load(srcPtrTemp[k]);
| struct MorphVecLoader<Rpp8u> | ||
| { | ||
| using VecType = __m256i; | ||
|
|
There was a problem hiding this comment.
remove empty line
| struct MorphVecLoader<Rpp8s> | ||
| { | ||
| using VecType = __m256i; | ||
|
|
There was a problem hiding this comment.
remove empty line, applicable to other functions similar funcs also
| using Loader = MorphVecLoader<T>; | ||
| using Vec = typename Loader::VecType; | ||
|
|
||
| constexpr int preLoadRows = (KernelSize + 1) / 2; |
There was a problem hiding this comment.
change KernelSize to kernelSize
| static inline __m256 pad_float() { return avx_p0; } | ||
| }; | ||
|
|
||
| template <int KernelSize, typename T, typename PadPolicy> |
There was a problem hiding this comment.
change PadPolicy to padPolicy
| rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit); | ||
|
|
||
| // unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and min | ||
| if constexpr (std::is_same<T, Rpp8s>::value) |
There was a problem hiding this comment.
remove {} for single line condition, please change for all instances
…for single line condition
There was a problem hiding this comment.
Pull request overview
This PR adds HOST backend support for the erode morphological operation, completing the implementation that was previously only available on the HIP backend.
Changes:
- Added erode to the HOST-supported operations list in the test suite
- Implemented the HOST version of rppt_erode_host with support for U8, F16, F32, and I8 data types
- Added SIMD-optimized helper functions for erode operations supporting kernel sizes 3x3, 5x5, 7x7, and 9x9
Reviewed changes
Copilot reviewed 7 out of 17 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| utilities/test_suite/common.py | Enabled HOST backend support for erode operation |
| utilities/test_suite/HOST/runImageTests.py | Added erode to kernel size iteration logic in unit and performance tests |
| utilities/test_suite/HOST/Tensor_image_host.cpp | Implemented erode test case with kernel size parameter handling |
| src/modules/tensor/rppt_tensor_morphological_operations.cpp | Implemented rppt_erode_host function with validation and data type support, plus added validation checks to GPU version |
| src/include/tensor/host_tensor_executors.hpp | Added function declarations for erode char and float host tensor executors |
| src/include/common/cpu/rpp_cpu_filter.hpp | Added SIMD-optimized helper functions for erode operations including blend/shuffle/min operations and morphological vector loaders |
| api/rppt_tensor_morphological_operations.h | Added documentation for rppt_erode_host and corrected parameter description for GPU version |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
* Add F32 QA Golden outputs * modify Doxygen comments * modify range check functions * RPP F32 QA : Review Comments Resolution (r-abishek#431) * Modified SIMD print functions to use union * remove redundant unions in print functions * removed pixel checks * remove pixel check in threshold * resolve review comments --------- Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com> Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com> Co-authored-by: HazarathKumarM <119284987+HazarathKumarM@users.noreply.github.com> Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
Rebased version of #334