NVFP4 + MTP Conversions

Methodology repo for Incarnas's NVFP4 GGUF conversions with MTP (Multi-Token Prediction) head retention for self-speculative decoding via llama.cpp on Blackwell-class hardware.

This repo carries the methodology toolchain: the technical report, the local converter patches required on top of mainline llama.cpp, the bench scripts + config + raw bench output cited from the report, and the plots derived from that data. Per-release GGUF model artifacts live on HuggingFace under Incarnas and link back here for methodology + bench provenance.

Scope

The work covered by this methodology:

NVFP4-quantized GGUF conversion from NVIDIA ModelOpt source weights (the modelopt quant-method path; not the compressed-tensors flavor)
Retention of the MTP head for self-speculative decoding via llama.cpp's --spec-type draft-mtp path
Blackwell-native NVFP4 runtime via BLACKWELL_NATIVE_FP4=1
Hybrid Mamba-MoE architecture handling (Qwen3.5-MoE family today; broader as the family grows)
Bench validation under chat-completions discipline per the chat-vs-raw methodology project

What's here

.
├── paper/                  -- the technical report (markdown source + pandoc-built PDF)
├── patches/                -- local converter patches required on top of mainline llama.cpp
├── scripts/                -- conversion + bench wrappers referenced from the report
├── configs/                -- llama-server / lm-eval / converter configs referenced from the report
├── benchmark_results/      -- raw bench output the report cites (JSON / CSV)
└── plots/                  -- source for figures referenced in the report

The technical report at paper/REPORT.md (PDF: paper/REPORT.pdf) is the canonical methodology document. Every released GGUF model card links back to it for methodology + bench reproducibility.

Releases under this methodology

Release	HuggingFace	Status	Notes
`Qwen3.5-122B-A10B-NVFP4-MTP-GGUF`	Incarnas/Qwen3.5-122B-A10B-NVFP4-MTP-GGUF	v1.0 (2026-05-16)	First release. AR + MTP (`--spec-type draft-mtp --spec-draft-n-max 2`). +46% long-decode throughput vs AR on Blackwell Pro 96 GiB. Requires `patches/qwen35_mtp_bid_fix.patch`

Patches

Local patches required on top of mainline llama.cpp for the current methodology release set. Each patch ships with a sibling .md carrying the rationale, upstream PR status, and the drop trigger that retires the patch.

Patch	Applies to	Upstream status
`patches/qwen35_mtp_bid_fix.patch`	llama.cpp `0253fb21f` (build `b9187`) -- pre-merge of fix	Merged as ggml-org/llama.cpp#23237 in master commit `1867a0c69` on 2026-05-18; patch retires when next methodology release builds off a post-fix mainline commit -- see `qwen35_mtp_bid_fix.md`

Bench environment

Reference rig for all numbers in the report:

GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (96 GiB GDDR7, 300 W TDP)
CPU: AMD Ryzen Threadripper 7960X (24 cores / 48 threads)
RAM: 128 GiB DDR5 ECC RDIMM @ 6400 MHz
OS: Linux 7.0.0-15-generic with NVIDIA driver supporting BLACKWELL_NATIVE_FP4
llama.cpp: locally-built from documented commits per release

Bench scripts are simple bash + python over urllib.request against the llama-server HTTP endpoint; no vendor harness dependencies.

License

MIT. See LICENSE.

Citation

@techreport{incarnas_2026_nvfp4_mtp_conversions_qwen35_122b,
  author       = {Incarnas},
  title        = {{Qwen3.5-122B-A10B-NVFP4-MTP-GGUF Technical Report}},
  institution  = {Independent},
  year         = {2026},
  month        = {May},
  note         = {v1.0; companion to https://huggingface.co/Incarnas/Qwen3.5-122B-A10B-NVFP4-MTP-GGUF},
  url          = {https://github.com/bit-incarnas/nvfp4-mtp-conversions/blob/main/paper/REPORT.pdf}
}

Attribution

llama.cpp by Georgi Gerganov and contributors: https://github.com/ggml-org/llama.cpp
NVIDIA ModelOpt: NVFP4 source-weight toolchain
Base models: per-release model card; first release derives from Qwen/Qwen3.5-122B-A10B via txn545/Qwen3.5-122B-A10B-NVFP4
Sibling methodology: bit-incarnas/chat-vs-raw-methodology -- bench-mode discipline applied to every release here

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
benchmark_results		benchmark_results
paper		paper
patches		patches
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NVFP4 + MTP Conversions

Scope

What's here

Releases under this methodology

Patches

Bench environment

License

Citation

Attribution

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

NVFP4 + MTP Conversions

Scope

What's here

Releases under this methodology

Patches

Bench environment

License

Citation

Attribution

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Packages