
CANN: Support MUL_MAT_ID in ACL graph#19228

Merged
hipudding merged 1 commit into ggml-org:master from hipudding:mul_mat_id
Feb 10, 2026

Conversation

@hipudding
Contributor

@hipudding hipudding commented Jan 31, 2026

Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture-of-Experts (MoE) architectures on the CANN backend.

Key features:

  • Support Q4_0 and Q8_0 quantized weight formats
  • Use IndexSelect to dynamically route expert-specific weights based on indices
  • Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
  • Handle automatic F16 type conversion for hardware compatibility
  • Support both per-expert and broadcast input modes

Implementation details:

  • Extract expert weights and scales using CANN IndexSelect operation
  • Process each batch and expert combination independently
  • Create proper tensor views with correct stride for matmul operations
  • Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
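The Q4_0 and Q8_0 formats referenced above are block-quantized. As a rough, self-contained sketch of what the Q8_0 side involves: ggml's Q8_0 stores 32 int8 quants per block with one per-block scale (F16 on disk; a plain float is used here), and each dequantized value is scale times quant.

```cpp
// Minimal sketch of Q8_0-style block dequantization: value = scale * quant.
// The struct layout is illustrative; the real format stores the scale as F16.
#include <cstdint>
#include <vector>

struct BlockQ8 {
    float  d;        // per-block scale (F16 in the on-disk format)
    int8_t qs[32];   // 32 quantized weights
};

std::vector<float> dequantize_q8(const std::vector<BlockQ8> &blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * 32);
    for (const BlockQ8 &b : blocks)
        for (int i = 0; i < 32; ++i)
            out.push_back(b.d * b.qs[i]);
    return out;
}
```

On this backend the dequantization is not done explicitly; WeightQuantBatchMatmulV2 consumes the quantized weights and scales directly, which is what makes the routed matmuls efficient.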


@hipudding hipudding added the Ascend NPU issues specific to Ascend NPUs label Jan 31, 2026
@hipudding hipudding self-assigned this Jan 31, 2026
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 31, 2026
@hipudding hipudding marked this pull request as ready for review February 3, 2026 02:32
@hipudding hipudding requested a review from noemotiovon February 3, 2026 06:07
@hipudding
Contributor Author

Morning @ggerganov, could you please review this PR? Thanks.

Collaborator

@noemotiovon noemotiovon left a comment


LGTM. The current implementation no longer relies on device-to-host copies and can use the ACL graph.

@hipudding hipudding merged commit 52e38fa into ggml-org:master Feb 10, 2026
78 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026

Labels

Ascend NPU issues specific to Ascend NPUs ggml changes relating to the ggml tensor library for machine learning


3 participants