[io] Fix: GID auto-detection, AINIC support, and enhance stability#113
[io] Fix: GID auto-detection, AINIC support, and enhance stability#113
Conversation
There was a problem hiding this comment.
Pull Request Overview
This PR addresses critical stability and compatibility issues in the RDMA transport layer, particularly for AMD AINIC environments. The changes implement GID auto-detection, remove SRQ usage in favor of per-QP receive queues, add CPU memory support, and fix memory management issues.
Key changes:
- GID index auto-detection mechanism with IPv4 prioritization for complex network configurations
- Migration from Shared Receive Queue (SRQ) to per-QP receive queues for improved stability
- CPU memory topology support with round-robin device selection
- Memory safety fixes (proper C++ object deletion and mutex protection)
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
src/application/transport/rdma/providers/ibverbs/ibverbs.cpp |
Implements GID auto-detection logic, fixes maxSge attribute usage, uses port's active MTU, and adds proper error checking for ibv_modify_qp calls |
src/io/rdma/backend_impl.cpp |
Replaces SRQ with per-QP notification contexts, adds CPU topology support, fixes memory management (delete vs free), and improves status message thread safety |
src/io/rdma/backend_impl.hpp |
Updates notification context structure from device-level SRQ to per-QP contexts |
src/io/rdma/common.cpp |
Adds badWr parameter to ibv_post_send calls and improves error messages |
include/mori/application/transport/rdma/rdma.hpp |
Changes gidIdx type to int32_t to support -1 as auto-detect sentinel, adds gidIdx field to EthernetEndpointHandle |
include/mori/io/common.hpp |
Strengthens memory ordering for atomic operations and adds mutex protection for message updates |
src/application/context/context.cpp |
Updates default gidIdx to -1 for auto-detection |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
534f2fa to
434a3f5
Compare
f9609bb to
ca90aca
Compare
ca90aca to
0152ad9
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
src/io/rdma/backend_impl.cpp:1
- This INFO-level log is in a hot path and will execute for every batch posted. Consider moving to TRACE level or protecting with a compile-time flag to reduce overhead in production environments.
// Copyright © Advanced Micro Devices, Inc. All rights reserved.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
3203681 to
d667d38
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
aa02a68 to
0152ad9
Compare
284ead9 to
dcd116c
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
) * Fix GID Selection: Resolve issues where incorrect GID indices caused connection failures in environments with different network configurations * Enhance CPU Support: Enable RDMA device search and usage when memory is located on the CPU * Improve Stability: Fix segmentation faults related to object lifecycle management * Optimize for AINIC: Fine-tune Queue Pair (QP) parameters and notification mechanisms for better compatibility with AMD AINIC hardware * Refine Notification Mechanism: Move from Shared Receive Queues (SRQ) to per-Queue Pair (QP) posting for improved compatibility and stability
Motivation
This PR improves the robustness and flexibility of the RDMA transport layer and fixes several critical bugs encountered in AMD AINIC network environments and edge cases.
Key Goals
Technical Details
RDMA / IBVerbs Provider
Auto GID Detection (
src/application/transport/rdma/providers/ibverbs/ibverbs.cpp):gidIdxwhen configured to-1Correct Attribute Usage:
maxMsgsNum,maxCqeNum,maxMsgSgeparamssgid_indexduring connection establishment using the detected local GID indexError Handling:
SYSCALL_RETURN_ZEROchecks toibv_modify_qpand other critical callsbadWrparameter toibv_post_sendinstead ofnullptr, which was causing post send error on AINIC driver(ionic_datapath.c):RDMA Backend (
src/io/rdma/backend_impl.cpp)Flexible Notification Mechanism (
include/mori/io/backend.hpp):enableNotificationconfiguration option toRdmaBackendConfigRdmaNotifyTransfercalls and related RECV operationsCPU Topology Search:
MemoryLocationType::CPUinRdmaManager::SearchMemory Safety:
free()withdeletefor C++ objects (msg->meta,msg) to ensure proper destructor callsTransferStatus::SetMessageTest Plan
-O3related).Test Result
Submission Checklist