rpc : proper handling of data pointers to CPU buffers #21030

Merged
rgerganov merged 1 commit into ggml-org:master from rgerganov:rpc-post-rce
Mar 27, 2026

@rgerganov
Member

The compute graph may contain tensors pointing to CPU buffers. In these cases the buffer address is serialized as 0 and sent over the wire. However, the data pointer is serialized as-is, which prevents proper validation on the server side. This patch fixes this by serializing the data pointer as 0 for non-RPC buffers and validating it properly on the server side.

closes: #21006
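
For readers skimming the thread, the idea in the description can be sketched roughly as follows. This is an illustrative model only, not the code from the patch: `Buffer`, `Tensor`, `WireTensor`, `serialize_tensor`, and `validate_tensor` are hypothetical names and simplified types invented for this example, and the real ggml-rpc structures carry many more fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-ins for illustration; not the real ggml types.
struct Buffer {
    bool     is_rpc;      // true if the buffer is owned by the RPC server
    uint64_t remote_ptr;  // handle the server uses to identify the buffer
    uint64_t base;        // start address of the buffer on the server
    uint64_t size;        // buffer size in bytes
};

struct Tensor {
    const Buffer * buffer; // may point to a CPU (non-RPC) buffer
    uint64_t       data;   // data pointer, as an integer
    uint64_t       nbytes; // size of the tensor data
};

// What goes over the wire (assumed layout).
struct WireTensor {
    uint64_t buffer;
    uint64_t data;
    uint64_t nbytes;
};

// Client side: for non-RPC buffers, send both buffer and data as 0 so that
// no client-local address reaches the server.
WireTensor serialize_tensor(const Tensor & t) {
    WireTensor w{};
    w.nbytes = t.nbytes;
    if (t.buffer != nullptr && t.buffer->is_rpc) {
        w.buffer = t.buffer->remote_ptr;
        w.data   = t.data;
    }
    return w; // buffer and data stay 0 for CPU buffers
}

// Server side: a null buffer is acceptable only with a null data pointer;
// otherwise the data range must lie inside the referenced buffer.
bool validate_tensor(const WireTensor & w, const Buffer * known) {
    if (w.buffer == 0) {
        return w.data == 0;
    }
    if (known == nullptr || known->remote_ptr != w.buffer) {
        return false;
    }
    if (w.data < known->base) {
        return false;
    }
    const uint64_t offset = w.data - known->base;
    if (offset > known->size) {
        return false;
    }
    return w.nbytes <= known->size - offset; // written to avoid overflow
}
```

Under this model, a client that still sends a raw CPU address alongside a zero buffer is rejected by the server, which is why both ends need to agree on the new convention.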

@rgerganov rgerganov requested a review from a team as a code owner March 26, 2026 15:06
@rgerganov rgerganov self-assigned this Mar 26, 2026
@rgerganov rgerganov requested a review from ggerganov March 26, 2026 15:06
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 26, 2026
Member

@ggerganov ggerganov left a comment

Merge?

@ggerganov
Member

@las7 Maybe also take a look as a quick follow-up on the previous patch.

@las7
Contributor

las7 commented Mar 26, 2026

Taking a look

@stew675

stew675 commented Mar 26, 2026

I am the original issue reporter. I just applied @rgerganov's patch to a fresh pull of the latest source code and rebuilt. I can confirm that this change fixes the issue.

@las7
Contributor

las7 commented Mar 26, 2026

LGTM

@Xcc313r4n7

Still failing on cross-architecture NVIDIA setup (RTX 5090 Blackwell + RTX 4090 Ada):

RPC server side:
[create_node] invalid data ptr
[graph_compute] failed to create graph node 5 (id=94071254079184)

Client side:
ggml-rpc.cpp:669: Remote RPC server crashed or returned malformed response
recv failed (bytes_recv=0, size_to_recv=8)

Note: Error changed from "invalid tensor: null buffer" to "invalid data ptr" — the PR fixed one check but there appears to be a second issue in cross-architecture setups.

Environment: RTX 5090 (compute 12.0) + RTX 4090 (compute 8.9)

@rgerganov
Member Author

You need to apply the patch and rebuild both client and server sides. I suspect you only patched rpc-server.

@Xcc313r4n7

Sorry about that, confirmed working

@rgerganov rgerganov requested a review from CISC March 27, 2026 08:21
@rgerganov
Member Author

@ggml-org/maintainers fyi, this is good to go but needs a second approval

@rgerganov rgerganov merged commit ba38f3b into ggml-org:master Mar 27, 2026
43 of 45 checks passed
Labels

ggml changes relating to the ggml tensor library for machine learning

Development

Successfully merging this pull request may close these issues.

Eval bug: PR20908 breaks rpc-server functionality when balancing split a model across multiple machines.

6 participants