Histogram Equalize kernel Implementation - HOST / HIP #533
Histogram Equalize kernel Implementation - HOST / HIP #533r-abishek merged 23 commits intor-abishek:ar/histogram_equalizefrom
Conversation
89ac0c2 to
93696af
Compare
r-abishek
left a comment
There was a problem hiding this comment.
Please address minor comment.
This funciton only has U8.
Please confirm U8 QA passing for this?
|
|
||
| if ((srcDescPtr->dataType == RpptDataType::U8) && (dstDescPtr->dataType == RpptDataType::U8)) | ||
| { | ||
| histogram_equalize_u8_u8_host_tensor(static_cast<Rpp8u*>(srcPtr) + srcDescPtr->offsetInBytes, |
There was a problem hiding this comment.
Seems like this function is only for U8?
Pls add reejction condition for an F32/I8/F16 input then.
There was a problem hiding this comment.
Check for src and dst data types.
Perhaps add an appropriate RPP_ERROR for invalid source data type and invalid destination data type
There was a problem hiding this comment.
Pull request overview
This pull request implements histogram equalization functionality for both HOST (CPU) and HIP (GPU) backends in the RPP library. Histogram equalization is a color augmentation technique that enhances image contrast by redistributing pixel intensity values.
Changes:
- Adds histogram_equalize augmentation with ID 97 for both HOST and HIP backends
- Implements color space conversion (RGB ↔ YCbCr) and histogram equalization on the Y channel for RGB images
- Adds test suite integration for the new augmentation
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| utilities/test_suite/rpp_test_suite_image.h | Adds HISTOGRAM_EQUALIZE enum and augmentation map entry |
| utilities/test_suite/common.py | Adds histogram_equalize to Python test configuration and color_augmentations group |
| utilities/test_suite/HOST/Tensor_image_host.cpp | Integrates histogram_equalize test case for HOST backend |
| utilities/test_suite/HIP/Tensor_image_hip.cpp | Integrates histogram_equalize test case for HIP backend |
| src/modules/tensor/rppt_tensor_color_augmentations.cpp | Implements rppt_histogram_equalize API entry point with validation |
| src/modules/tensor/hip/kernel/histogram_equalize.cpp | Implements HIP kernels for color conversion, histogram collection, LUT building, and application |
| src/modules/tensor/cpu/kernel/histogram_equalize.cpp | Implements HOST CPU functions with AVX2 SIMD optimizations for histogram equalization |
| src/include/tensor/host_tensor_executors.hpp | Declares HOST executor function |
| src/include/tensor/hip_tensor_executors.hpp | Declares HIP executor function |
| api/rppt_tensor_color_augmentations.h | Adds public API documentation for rppt_histogram_equalize |
| docs/data/doxygenOutputs/color_augmentations_histogram_equalize_img150x150.png | Adds sample output image for documentation |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| __device__ constexpr float coeffGCb = -0.344136f; | ||
| __device__ constexpr float coeffGCr = -0.714136f; |
There was a problem hiding this comment.
The YCbCr to RGB conversion coefficients have a sign mismatch between HOST and HIP implementations. The HIP version uses negative values (coeffGCb = -0.344136f, coeffGCr = -0.714136f) on lines 42-43, while the HOST version uses positive values (coeffGCb = 0.344136f, coeffGCr = 0.714136f) in cpu/kernel/histogram_equalize.cpp lines 43-44. The standard YCbCr to RGB conversion formula is: G = Y - 0.344136 * Cb - 0.714136 * Cr. Since the HIP implementation already negates coeffGCb and coeffGCr when defining them, but then uses them in addition operations (lines 100-101), this results in incorrect calculations. Either the sign should be removed from the coefficient definitions, or the operations should use subtraction instead of addition.
| Rpp8u *scratchBase = reinterpret_cast<Rpp8u *>(handle.GetInitHandle()->mem.mcpu.scratchBufferHost); | ||
| Rpp8u *yBuf = scratchBase + batchCount * (pixels * 3); | ||
| Rpp8u *cbBuf = yBuf + pixels; | ||
| Rpp8u *crBuf = cbBuf + pixels; | ||
| Rpp8u *dstYBuf = yBuf; |
There was a problem hiding this comment.
Potential buffer overflow in HOST implementation. The code allocates scratch buffers for each batch element without validating total size against available scratch buffer. Line 299 computes: yBuf = scratchBase + batchCount * (pixels * 3), where pixels = roiWidth * roiHeight. For parallel execution with OpenMP (line 278), multiple threads may simultaneously access the scratch buffer with different batchCount values. If the total requirement (batchSize * max_pixels * 3) exceeds the available scratchBufferHost size, this will cause memory corruption. There's no bounds checking to ensure the scratch buffer is large enough.
|
|
||
| // -------------------- histogram_equalize -------------------- | ||
|
|
||
| RppStatus hip_exec_histogram_equalize_tensor(Rpp8u *srcPtr1, |
There was a problem hiding this comment.
Inconsistent parameter naming. The first parameter is named srcPtr1 in the declaration, but it should be srcPtr for consistency with all other function declarations in this file and with the actual implementation.
| RppStatus hip_exec_histogram_equalize_tensor(Rpp8u *srcPtr1, | |
| RppStatus hip_exec_histogram_equalize_tensor(Rpp8u *srcPtr, |
| cbPtr = cbBuf; | ||
| crPtr = crBuf; | ||
| #if __AVX2__ | ||
| alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement; |
There was a problem hiding this comment.
Potential edge case issue with alignedLength calculation. The formula ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement will produce negative or zero values when roiWidth is less than vectorIncrement (48). For small ROIs (e.g., roiWidth = 16), this results in alignedLength = ((16 / 48) - 1) * 48 = (0 - 1) * 48 = -48. The loop condition vectorLoopCount < alignedLength would then be false initially, which is correct behavior, but using a potentially negative value is poor practice. Consider using: alignedLength = (roi.xywhROI.roiWidth / vectorIncrement) * vectorIncrement to avoid negative values, or add a check to ensure alignedLength is non-negative.
| alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement; | |
| if (roi.xywhROI.roiWidth >= vectorIncrement) | |
| alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement; | |
| else | |
| alignedLength = 0; |
| cbPtr = cbBuf; | ||
| crPtr = crBuf; | ||
| #if __AVX2__ | ||
| alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement; |
There was a problem hiding this comment.
Potential edge case issue with alignedLength calculation. The formula ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement will produce negative or zero values when roiWidth is less than vectorIncrement (48). For small ROIs (e.g., roiWidth = 16), this results in alignedLength = ((16 / 48) - 1) * 48 = (0 - 1) * 48 = -48. The loop condition vectorLoopCount < alignedLength would then be false initially, which is correct behavior, but using a potentially negative value is poor practice. Consider using: alignedLength = (roi.xywhROI.roiWidth / vectorIncrement) * vectorIncrement to avoid negative values, or add a check to ensure alignedLength is non-negative.
| alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement; | |
| alignedLength = ((roi.xywhROI.roiWidth / vectorIncrement) - 1) * vectorIncrement; | |
| if (alignedLength < 0) | |
| alignedLength = 0; |
| unsigned int *d_hist = reinterpret_cast<unsigned int*>(handle.GetInitHandle()->mem.mgpu.scratchBufferHip.floatmem); | ||
| unsigned char *d_lut = reinterpret_cast<unsigned char*>(d_hist + batchSize * HISTOGRAM_BINS); | ||
|
|
||
| if(srcDescPtr->c == 3) | ||
| { | ||
| const size_t planeSize = static_cast<size_t>(srcDescPtr->w) * srcDescPtr->h * srcDescPtr->n; | ||
| Rpp8u *yuvBuf = reinterpret_cast<Rpp8u *>(d_lut + batchSize * HISTOGRAM_BINS); | ||
|
|
||
| Rpp8u *yBuf = yuvBuf; | ||
| Rpp8u *cbBuf = yuvBuf + planeSize; | ||
| Rpp8u *crBuf = yuvBuf + (planeSize * 2); |
There was a problem hiding this comment.
Potential buffer overflow issue. The scratch buffer allocation assumes sequential memory layout without validation. The code allocates:
- d_hist: batchSize * 256 * sizeof(unsigned int) = batchSize * 1024 bytes
- d_lut: batchSize * 256 * sizeof(unsigned char) = batchSize * 256 bytes
- yuvBuf: 3 * planeSize = 3 * w * h * n bytes (for 3-channel images)
The total requirement is batchSize * 1280 + 3 * w * h * n bytes. For large images (e.g., 3840x2160), this equals batchSize * 1280 + 3 * 3840 * 2160 * n = 24,883,200 * n + 1280 * n bytes per batch element. The scratch buffer is 8,294,400 * sizeof(float) = 33,177,600 bytes total. This may be insufficient for large batches or images, yet there's no validation to ensure the required memory fits within the available scratch buffer, risking memory corruption.
|
|
||
| // -------------------- histogram_equalize -------------------- | ||
|
|
||
| RppStatus histogram_equalize_u8_u8_host_tensor(Rpp8u *srcPtr1, |
There was a problem hiding this comment.
Inconsistent parameter naming. The first parameter is named srcPtr1 in the declaration, but it should be srcPtr for consistency with all other function declarations in this file and with the actual implementation in histogram_equalize.cpp line 265 which uses srcPtr.
| RppStatus histogram_equalize_u8_u8_host_tensor(Rpp8u *srcPtr1, | |
| RppStatus histogram_equalize_u8_u8_host_tensor(Rpp8u *srcPtr, |
| collect_hist_pln_tensor_host(srcPtr, hist, roiWidth, roiHeight, srcDescPtr->strides.hStride); | ||
| build_lut_from_hist_host(hist, lutBatch, pixels); | ||
| apply_lut_tensor(srcPtr, dstPtr, roiWidth, roiHeight, lutBatch, srcDescPtr->strides.hStride, dstDescPtr->strides.hStride); | ||
| } |
There was a problem hiding this comment.
Missing error handling for unsupported layout combinations. The implementation only handles specific layout combinations (NCHW->NCHW, NCHW->NHWC, NHWC->NCHW, NHWC->NHWC for 3-channel, and NCHW->NCHW for 1-channel). If the code reaches a combination not explicitly handled (though unlikely given earlier validation), it will silently skip processing and return RPP_SUCCESS without producing any output or error. Consider adding an else clause to return an appropriate error status for consistency with error handling in other parts of the codebase.
| } | |
| } | |
| else | |
| { | |
| // Unsupported channel/layout combination | |
| return RPP_ERROR; | |
| } |
There was a problem hiding this comment.
Not relevant , we already checking the condition and also it is not possible to return inside openMP loop
| testCaseName = "histogram_equalize"; | ||
|
|
||
| startWallTime = omp_get_wtime(); | ||
| startCpuTime = clock(); |
There was a problem hiding this comment.
The startCpuTime = clock() statement is missing in this case block. All other test cases in this file follow the pattern of setting both startWallTime and startCpuTime before calling the function. This line should be added after line 1855 (startWallTime = omp_get_wtime()) for consistency with the rest of the codebase.
There was a problem hiding this comment.
Already present, can be ignored.
|
@r-abishek all comments are reoslved now |
74fbd82 to
226d386
Compare
Histogram Equalize kernel implementation for both HOST and HIP backend