
CANN: Support MUL_MAT_ID in ACL graph#19228

Merged
hipudding merged 1 commit into ggml-org:master from hipudding:mul_mat_id
Feb 10, 2026

Conversation

@hipudding
Contributor

@hipudding hipudding commented Jan 31, 2026

Implement the ggml_cann_mul_mat_id_quant function to support quantized matrix
multiplication for Mixture-of-Experts (MoE) architectures on the CANN backend.

Key features:

  • Support Q4_0 and Q8_0 quantized weight formats
  • Use IndexSelect to dynamically route expert-specific weights based on indices
  • Leverage WeightQuantBatchMatmulV2 for efficient quantized computation
  • Handle automatic F16 type conversion for hardware compatibility
  • Support both per-expert and broadcast input modes

Implementation details:

  • Extract expert weights and scales using CANN IndexSelect operation
  • Process each batch and expert combination independently
  • Create proper tensor views with correct stride for matmul operations
  • Automatic input/output type casting to/from F16 as needed

Testing: All test cases passed for supported types (F32, F16, Q4_0, Q8_0).
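The Q4_0 and Q8_0 formats referenced above are block-quantized. As a rough, self-contained sketch of what the Q8_0 side involves: ggml's Q8_0 stores 32 int8 quants per block with one per-block scale (F16 on disk; a plain float is used here), and each dequantized value is scale times quant.

```cpp
// Minimal sketch of Q8_0-style block dequantization: value = scale * quant.
// The struct layout is illustrative; the real format stores the scale as F16.
#include <cstdint>
#include <vector>

struct BlockQ8 {
    float  d;        // per-block scale (F16 in the on-disk format)
    int8_t qs[32];   // 32 quantized weights
};

std::vector<float> dequantize_q8(const std::vector<BlockQ8> &blocks) {
    std::vector<float> out;
    out.reserve(blocks.size() * 32);
    for (const BlockQ8 &b : blocks)
        for (int i = 0; i < 32; ++i)
            out.push_back(b.d * b.qs[i]);
    return out;
}
```

On this backend the dequantization is not done explicitly; WeightQuantBatchMatmulV2 consumes the quantized weights and scales directly, which is what makes the routed matmuls efficient.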


@hipudding hipudding added the Ascend NPU issues specific to Ascend NPUs label Jan 31, 2026
@hipudding hipudding self-assigned this Jan 31, 2026
@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Jan 31, 2026
@hipudding hipudding marked this pull request as ready for review February 3, 2026 02:32
@hipudding hipudding requested a review from noemotiovon February 3, 2026 06:07
@hipudding
Contributor Author

Morning @ggerganov, could you please review this PR? Thanks.

Collaborator

@noemotiovon noemotiovon left a comment


LGTM. The current implementation no longer relies on device-to-host copies and can use the ACL graph.

@hipudding hipudding merged commit 52e38fa into ggml-org:master Feb 10, 2026
78 checks passed
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026

Labels

Ascend NPU issues specific to Ascend NPUs ggml changes relating to the ggml tensor library for machine learning


3 participants