Skip to content

Erode - rebased with latest changes#549

Merged
r-abishek merged 25 commits intor-abishek:ar/opt_erodefrom
HazarathKumarM:erode_rebased
Jan 27, 2026
Merged

Erode - rebased with latest changes#549
r-abishek merged 25 commits intor-abishek:ar/opt_erodefrom
HazarathKumarM:erode_rebased

Conversation

@HazarathKumarM
Copy link
Copy Markdown
Collaborator

@HazarathKumarM HazarathKumarM commented Dec 10, 2025

Rebased version of #334

ManasaDattaT pushed a commit to ManasaDattaT/rpp that referenced this pull request Dec 19, 2025
* Add F32 QA Golden outputs

* modify Doxygen comments

* modify range check functions

* RPP F32 QA : Review Comments Resolution (r-abishek#431)

* Modified SIMD print functions to use union

* remove redundant unions in print functions

* removed pixel checks

* remove pixel check in threshold

* resolve review comments

---------

Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: HazarathKumarM <119284987+HazarathKumarM@users.noreply.github.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
if (roiTypeSrc == RpptRoiType::LTRB)
convert_roi(roiTensorPtrDst, RpptRoiType::XYWH, dstDescPtr->n);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whitespace added?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

minVal = 1.0f;
#pragma unroll
for (int j = 0; j < filterSize; j++)
minVal = fminf(minVal, (float)srcPtr[k + j]);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do a static cast

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

src_f = rpp_hip_unpack3(src_ui4.w);
dst_f8->f1[7] = fminf(src_f, dst_f8->f1[7]);
#pragma unroll
for (int k = 0; k < 8; k++)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we think of something like this or is the current approach fine - The below approach reduces complexity to O(3n)

#include <iostream>
#include <climits>
using namespace std;

int main() {
    // Input array
    int src[] = {3, 5, 6, 4, 7, 8, 9, 6, 2, 4};
    int n = sizeof(src) / sizeof(src[0]);

    // Window size
    const int k = 8;

    // Auxiliary arrays
    int leftMin[10];
    int rightMin[10];

    // Build leftMin (prefix minimum per block)
    for (int i = 0; i < n; i++) {
        if (i % k == 0)
            leftMin[i] = src[i];
        else
            leftMin[i] = min(leftMin[i - 1], src[i]);
    }

    // Build rightMin (suffix minimum per block)
    for (int i = n - 1; i >= 0; i--) {
        if (i % k == k - 1 || i == n - 1)
            rightMin[i] = src[i];
        else
            rightMin[i] = min(rightMin[i + 1], src[i]);
    }

    // Compute sliding window minimums
    cout << "Output: ";
    for (int i = 0; i + k - 1 < n; i++) {
        int windowMin = min(rightMin[i], leftMin[i + k - 1]);
        cout << windowMin << " ";
    }

    cout << endl;
    return 0;
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically a sliding window approach similar to box filter - But it requires addl computation like the above code - leftMin, rightMin

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI reply for this comment : The current implementation benefits from coalesced memory access and shared memory usage. The proposed approach would require more complex memory access patterns.

// Nearest-neighbor padding
for (int i = 0; i < 8; i++)
{
int clampedX = roiBeginX + max(0, min(id_x_i + i, (roiWidth - 1))); int clampedIdx = (id_z * srcStridesNH.x) + (clampedY * srcStridesNH.y) + (clampedX * 3);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check Formatting

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines +107 to +108
src_smem[hipThreadIdx_y_channel.x][hipThreadIdx_x8 + i] = srcPtr[clampedIdx]; // R
src_smem[hipThreadIdx_y_channel.y][hipThreadIdx_x8 + i] = srcPtr[clampedIdx + 1]; // G
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check comment spacing

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

for (int k = 0; k < 8; k++)
{
float minVal;
if constexpr (std::is_same_v<T, Rpp8u>)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a minVal setting really needed, we can just set it to srcPtr[k]? rather than hardcoded expressions based on type

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

else
*(uint2 *)&src_smem[hipThreadIdx_y][hipThreadIdx_x8] = borderVal;
{
// Nearest-neighbor padding
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment somewhere maybe that erode and dilate are independent of the type of padding in general?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are doing NN padding itself

erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 4][hipThreadIdx_x8], &sum_f8);
erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 5][hipThreadIdx_x8], &sum_f8);
erode_row_hip_compute<7>(&src_smem[hipThreadIdx_y + 6][hipThreadIdx_x8], &sum_f8);
if constexpr (std::is_same<T, Rpp8s>::value)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this should be added, then it should be added uniformly at all places - I doubt the requirement of this given its a min/max operation

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RoundToNearest

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}
increment_row_ptrs(srcPtrTemp, kernelSize, 1);
}
// reset source to initial position
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a empty line

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before the comment

// -------------------- Set 0 erode compute functions --------------------

// unpack lower half of 3 256 bit registers and add (used for 3x3 kernel size U8/I8 variants)
inline void unpacklo_and_min_3x3_host(__m256i *pxRow, __m256i *pxDst)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even these functions can be templated?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With 5x5, 7x7 and 9x9

}

// add 3 256 bit registers (used for 3x3 kernel size F32/F16 variants)
inline void min_rows_3x3(__m256 *pRow, __m256 *pDst)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe these too

constexpr int PreloadRows = (KernelSize + 1) / 2;

// Load initial rows
for (int k = 0; k < PreloadRows; ++k)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use #pragma unroll

using Info = MorphLoadInfo<T>;
using Vec = typename Info::VecType;

constexpr int PreloadRows = (KernelSize + 1) / 2;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

preLoadRows maybe the name? Following other camelCase conventions
Similarly kernelSize for KernelSize

template <int KernelSize, typename T, typename PadPolicy>
inline void rpp_morphological_load_NxN(typename MorphLoadInfo<T>::VecType *pxRow, T **srcPtrTemp, Rpp32s rowKernelLoopLimit)
{
using Info = MorphLoadInfo<T>;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some other name instead of Info and MorphLoadInfo? imo

rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit);

// unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and add
unpacklo_and_min_3x3_host(pxRow, &pxRowHalf[0]);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current version works fine, but maybe should we explore the usage of epi8 instructions instead of epi16 because we only do a min/max? Pls share ur thoughts @HazarathKumarM

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requires exploration

{
using VecType = __m256i;

static inline VecType load(void *ptr) { return _mm256_add_epi8(avx_pxConvertI8, _mm256_loadu_si256((__m256i*)ptr)); }
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__m256i pxRow[3], pxRowHalf[2], pxResult;
rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit);

// unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and add
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check comment at all places - It should be min instead of add


// -------------------- Set 0 erode compute functions --------------------

// unpack lower half of 3 256 bit registers and add (used for 3x3 kernel size U8/I8 variants)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min instead of add

pxRow[k] = Info::load(srcPtrTemp[k]);

// Load valid remaining rows
for (int k = PreloadRows; k < rowKernelLoopLimit; ++k)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using Loader = MorphVecLoader;
using Vec = typename Loader::VecType;

pxRow[k] = Loader::load(srcPtrTemp[k]);

struct MorphVecLoader<Rpp8u>
{
using VecType = __m256i;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove empty line

struct MorphVecLoader<Rpp8s>
{
using VecType = __m256i;

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove empty line, applicable to other functions similar funcs also

using Loader = MorphVecLoader<T>;
using Vec = typename Loader::VecType;

constexpr int preLoadRows = (KernelSize + 1) / 2;
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change KernelSize to kernelSize

static inline __m256 pad_float() { return avx_p0; }
};

template <int KernelSize, typename T, typename PadPolicy>
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change PadPolicy to padPolicy

rpp_morphological_load_NxN<3, T, MorphPad_Erode>(pxRow, srcPtrTemp, rowKernelLoopLimit);

// unpack lower half and higher half of each of 3 loaded row values from 8 bit to 16 bit and min
if constexpr (std::is_same<T, Rpp8s>::value)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove {} for single line condition, please change for all instances

Copy link
Copy Markdown
Owner

@r-abishek r-abishek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@r-abishek r-abishek changed the base branch from develop to ar/opt_erode January 21, 2026 03:53
@r-abishek r-abishek requested a review from Copilot January 21, 2026 03:53
@r-abishek r-abishek added the enhancement New feature or request label Jan 21, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds HOST backend support for the erode morphological operation, completing the implementation that was previously only available on the HIP backend.

Changes:

  • Added erode to the HOST-supported operations list in the test suite
  • Implemented the HOST version of rppt_erode_host with support for U8, F16, F32, and I8 data types
  • Added SIMD-optimized helper functions for erode operations supporting kernel sizes 3x3, 5x5, 7x7, and 9x9

Reviewed changes

Copilot reviewed 7 out of 17 changed files in this pull request and generated no comments.

Show a summary per file
File Description
utilities/test_suite/common.py Enabled HOST backend support for erode operation
utilities/test_suite/HOST/runImageTests.py Added erode to kernel size iteration logic in unit and performance tests
utilities/test_suite/HOST/Tensor_image_host.cpp Implemented erode test case with kernel size parameter handling
src/modules/tensor/rppt_tensor_morphological_operations.cpp Implemented rppt_erode_host function with validation and data type support, plus added validation checks to GPU version
src/include/tensor/host_tensor_executors.hpp Added function declarations for erode char and float host tensor executors
src/include/common/cpu/rpp_cpu_filter.hpp Added SIMD-optimized helper functions for erode operations including blend/shuffle/min operations and morphological vector loaders
api/rppt_tensor_morphological_operations.h Added documentation for rppt_erode_host and corrected parameter description for GPU version

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

RooseweltMcW pushed a commit to HazarathKumarM/rpp that referenced this pull request Jan 21, 2026
* Add F32 QA Golden outputs

* modify Doxygen comments

* modify range check functions

* RPP F32 QA : Review Comments Resolution (r-abishek#431)

* Modified SIMD print functions to use union

* remove redundant unions in print functions

* removed pixel checks

* remove pixel check in threshold

* resolve review comments

---------

Co-authored-by: HazarathKumarM <hazarathkumar@multicorewareinc.com>
Co-authored-by: Kiriti Gowda <kiritigowda@gmail.com>
Co-authored-by: HazarathKumarM <119284987+HazarathKumarM@users.noreply.github.com>
Co-authored-by: Lakshmi Kumar <lakshmi.kumar@amd.com>
@r-abishek r-abishek merged commit 8c88da6 into r-abishek:ar/opt_erode Jan 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants