Skip to content

[io] Fix: GID auto-detection, AINIC support, and enhance stability#113

Merged
maning00 merged 9 commits intomainfrom
feature-io-sgl
Nov 27, 2025
Merged

[io] Fix: GID auto-detection, AINIC support, and enhance stability#113
maning00 merged 9 commits intomainfrom
feature-io-sgl

Conversation

@maning00
Copy link
Copy Markdown
Contributor

@maning00 maning00 commented Nov 21, 2025

Motivation

This PR improves the robustness and flexibility of the RDMA transport layer and fixes several critical bugs encountered in AMD AINIC network environments and edge cases.

Key Goals

  • Fix GID Selection: Resolve issues where incorrect GID indices caused connection failures in environments with different network configurations
  • Enhance CPU Support: Enable RDMA device search and usage when memory is located on the CPU
  • Improve Stability: Fix segmentation faults related to object lifecycle management
  • Optimize for AINIC: Fine-tune Queue Pair (QP) parameters and notification mechanisms for better compatibility with AMD AINIC hardware
  • Remove SRQ: Refine notification mechanism, Move from Shared Receive Queues (SRQ) to per-Queue Pair (QP) posting for improved compatibility (for AINIC) and stability

Technical Details

RDMA / IBVerbs Provider

Auto GID Detection (src/application/transport/rdma/providers/ibverbs/ibverbs.cpp):

  • Implemented auto-detection mechanism for gidIdx when configured to -1
  • Iterates through the GID table and selects the best index (prioritizing: IPv4 > Global IPv6 > Link Local)

Correct Attribute Usage:

  • Fixed maxMsgsNum, maxCqeNum, maxMsgSge params
  • Updated QP initialization attributes
  • Correctly sets sgid_index during connection establishment using the detected local GID index

Error Handling:

  • Added SYSCALL_RETURN_ZERO checks to ibv_modify_qp and other critical calls
  • Added badWr parameter to ibv_post_send instead of nullptr, which was causing post send error on AINIC driver(ionic_datapath.c):
static int ionic_post_send_common(struct ionic_ibdev *dev,
				  struct ionic_vcq *vcq,
				  struct ionic_cq *cq,
				  struct ionic_qp *qp,
				  struct ib_send_wr *wr,
				  struct ib_send_wr **bad)
{
	unsigned long irqflags;
	bool notify = false;
	int spend, rc = 0;
	u16 old_prod;

	ionic_lat_trace(dev->lats, application);
	ionic_stat_incr(dev->stats, post_send);

	if (!bad)
		return -EINVAL;

	if (!qp->has_sq) {
		*bad = wr;
		return -EINVAL;
	}
...

RDMA Backend (src/io/rdma/backend_impl.cpp)

Flexible Notification Mechanism (include/mori/io/backend.hpp):

  • Added enableNotification configuration option to RdmaBackendConfig
  • Allows disabling transfer completion notifications when not needed for compatibility or performance reasons
  • When disabled, skips RdmaNotifyTransfer calls and related RECV operations

CPU Topology Search:

  • Added support for MemoryLocationType::CPU in RdmaManager::Search
  • Implemented round-robin strategy for device selection to prevent failures
  • Note: Current implementation uses simple round-robin; more sophisticated selection logic planned for future work

Memory Safety:

  • Replaced free() with delete for C++ objects (msg->meta, msg) to ensure proper destructor calls
  • Added mutex protection for status messages in TransferStatus::SetMessage

Test Plan

  • RDMA Connectivity: Verified connection establishment in environments requiring specific GID indices.
  • CPU Memory: Tested RDMA operations with CPU-resident memory.
  • Stability: Ran existing benchmarks and workloads to verify the fix for segmentation faults (-O3 related).

Test Result

  • Pass: GID auto-detection correctly identifies IPv4 GIDs
  • Pass: Configurable notification mechanism works correctly
  • Pass: CPU memory operations work correctly
  • Pass: No segmentation faults in tests

Submission Checklist

@maning00 maning00 requested a review from Copilot November 21, 2025 10:29
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR addresses critical stability and compatibility issues in the RDMA transport layer, particularly for AMD AINIC environments. The changes implement GID auto-detection, remove SRQ usage in favor of per-QP receive queues, add CPU memory support, and fix memory management issues.

Key changes:

  • GID index auto-detection mechanism with IPv4 prioritization for complex network configurations
  • Migration from Shared Receive Queue (SRQ) to per-QP receive queues for improved stability
  • CPU memory topology support with round-robin device selection
  • Memory safety fixes (proper C++ object deletion and mutex protection)

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
src/application/transport/rdma/providers/ibverbs/ibverbs.cpp Implements GID auto-detection logic, fixes maxSge attribute usage, uses port's active MTU, and adds proper error checking for ibv_modify_qp calls
src/io/rdma/backend_impl.cpp Replaces SRQ with per-QP notification contexts, adds CPU topology support, fixes memory management (delete vs free), and improves status message thread safety
src/io/rdma/backend_impl.hpp Updates notification context structure from device-level SRQ to per-QP contexts
src/io/rdma/common.cpp Adds badWr parameter to ibv_post_send calls and improves error messages
include/mori/application/transport/rdma/rdma.hpp Changes gidIdx type to int32_t to support -1 as auto-detect sentinel, adds gidIdx field to EthernetEndpointHandle
include/mori/io/common.hpp Strengthens memory ordering for atomic operations and adds mutex protection for message updates
src/application/context/context.cpp Updates default gidIdx to -1 for auto-detection

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/io/rdma/backend_impl.cpp Outdated
@maning00 maning00 force-pushed the feature-io-sgl branch 7 times, most recently from 534f2fa to 434a3f5 Compare November 25, 2025 04:10
@maning00 maning00 force-pushed the feature-io-sgl branch 3 times, most recently from f9609bb to ca90aca Compare November 25, 2025 10:23
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

src/io/rdma/backend_impl.cpp:1

  • This INFO-level log is in a hot path and will execute for every batch posted. Consider moving to TRACE level or protecting with a compile-time flag to reduce overhead in production environments.
// Copyright © Advanced Micro Devices, Inc. All rights reserved.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread include/mori/application/transport/rdma/rdma.hpp Outdated
@maning00 maning00 force-pushed the feature-io-sgl branch 7 times, most recently from 3203681 to d667d38 Compare November 26, 2025 12:55
@maning00 maning00 requested a review from Copilot November 26, 2025 13:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread src/io/rdma/backend_impl.cpp
@maning00 maning00 changed the title [io] Fix: GID auto-detection, remove SRQ usage, and enhance stability [io] Fix: GID auto-detection, AINIC support, and enhance stability Nov 26, 2025
@maning00 maning00 force-pushed the feature-io-sgl branch 3 times, most recently from aa02a68 to 0152ad9 Compare November 27, 2025 03:55
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread src/io/rdma/backend_impl.cpp Outdated
Comment thread include/mori/io/common.hpp Outdated
@maning00 maning00 merged commit 87724b7 into main Nov 27, 2025
TianDi101 pushed a commit that referenced this pull request Jan 12, 2026
)

* Fix GID Selection: Resolve issues where incorrect GID indices caused connection failures in environments with different network configurations
* Enhance CPU Support: Enable RDMA device search and usage when memory is located on the CPU
* Improve Stability: Fix segmentation faults related to object lifecycle management
* Optimize for AINIC: Fine-tune Queue Pair (QP) parameters and notification mechanisms for better compatibility with AMD AINIC hardware
* Refine Notification Mechanism: Move from Shared Receive Queues (SRQ) to per-Queue Pair (QP) posting for improved compatibility and stability
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants