
common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up #20416

Merged
CISC merged 1 commit into ggml-org:master from ddh0:fix-fused-cpu-moe on Mar 11, 2026

common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up#20416
CISC merged 1 commit intoggml-org:masterfrom
ddh0:fix-fused-cpu-moe

Conversation

@ddh0 (Contributor) commented on Mar 11, 2026:

Changed the regex that matches conditional experts from:

const char * const LLM_FFN_EXPS_REGEX = "\\.ffn_(up|down|gate)_(ch|)exps";

to:

const char * const LLM_FFN_EXPS_REGEX = "\\.ffn_(up|down|gate|gate_up)_(ch|)exps";

I think this should fix #20414.

@ddh0 ddh0 requested a review from ggerganov as a code owner March 11, 2026 17:38
@Nekotekina (Contributor):

Fixes #20431

@CISC CISC merged commit 4a748b8 into ggml-org:master Mar 11, 2026
65 of 75 checks passed
@ddh0 ddh0 deleted the fix-fused-cpu-moe branch March 11, 2026 23:31
ProgenyAlpha pushed a commit to ProgenyAlpha/llama.cpp that referenced this pull request Mar 12, 2026
tekintian added a commit to tekintian/llama.cpp that referenced this pull request Mar 12, 2026
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  convert : better mtp check and fix return [no ci] (ggml-org#20419)
  vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  New conversations now auto-select the first loaded model (ggml-org#20403)
  ggml-virtgpu: Fix some build commands (ggml-org#20341)
  metal : avoid divisions in bin kernel (ggml-org#20426)
  ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  opencl: use larger workgroup size for get_rows (ggml-org#20316)
  opencl: add cumsum op (ggml-org#18981)
  hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  model : add support for Phi4ForCausalLMV (ggml-org#20168)
  graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  llama : enable chunked fused GDN path (ggml-org#20340)
  llama : whitespace cleanup (ggml-org#20422)
  ggml : add NVFP4 quantization type support (ggml-org#19769)
  ...


Development

Successfully merging this pull request may close these issues.

Misc. bug: --n-cpu-moe offloading breaks with fused gate + up tensors

3 participants