[GGML] Current NVFP4 support has risk of functional incorrectness due to unclear separation of concerns #22042

ORippler · 2026-04-17T12:22:52Z

ORippler
Apr 17, 2026
Collaborator

NVFP4 is a derived tensor

NVFP4 is a two-step quantization scheme consisting of:

1. a per-tensor F32 scale
2. blocks of length N (typically N=16), where each block consists of N F4 values and one F8 scale.

To dequantize NVFP4 -> FP32, one has to do FP32_activations = F4 * F8 * F32. We have to "derive" dequantized values from both the blocks and the F32 scale, hence the term derived tensor.
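
To make the formula concrete, here is a minimal sketch of the two-level dequantization (not the GGML implementation). The helper names are hypothetical, and the per-block F8 scale is assumed to have been decoded to float already; the value table is the standard set of E2M1 magnitudes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Magnitudes of the eight E2M1 (FP4) codes; bit 3 of the nibble is the sign.
constexpr float kE2M1Magnitudes[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Hypothetical helper: decode one 4-bit E2M1 code to float.
inline float decode_e2m1(uint8_t nibble) {
    assert(nibble < 16);
    const float magnitude = kE2M1Magnitudes[nibble & 0x7];
    return (nibble & 0x8) ? -magnitude : magnitude;
}

// Hypothetical helper: dequantize one block of 16 values as F4 * F8 * F32.
// block_scale is the per-block F8 scale, assumed to be decoded to float already.
inline std::array<float, 16> dequantize_nvfp4_block_sketch(
        const std::array<uint8_t, 16> & nibbles, float block_scale, float tensor_scale) {
    std::array<float, 16> out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = decode_e2m1(nibbles[i]) * block_scale * tensor_scale;
    }
    return out;
}
```

Calling this with tensor_scale = 1.0f reproduces what a block-only dequantizer would return, which is why the result is off by exactly the per-tensor factor whenever F32 != 1.0.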

What does this mean for GGML?

From the above, it follows that current struct block_nvfp4 does not fully represent the quantized tensor (i.e. we cannot dequantize without knowing about F32).
Consequently, the current quantize_nvfp4/dequantize_row_nvfp4 functions inside GGML are incorrect when F32 != 1.0 (and F32 = 1.0 incurs a significant quality loss).
To nevertheless handle NVFP4 correctly, the responsibility of dequantizing NVFP4 is currently shared between GGML and its caller (often llama.cpp). The caller has to ensure that the F32 scale is multiplied in after NVFP4 tensors are consumed. A compute graph for a MulMat has to look roughly like this for NVFP4:

graph LR
A[NVFP4 weights] --> B{GGML_OP_MUL_MAT}
C[FP32 activations] --> B
B --> D{GGML_OP_MUL}
E[F32 scale] --> D
D --> F{non-linearity, <br> e.g. SwiGLU}

whereas it looks like this for non-derived quant formats:

graph LR
A[Q8_0 weights] --> B{GGML_OP_MUL_MAT}
C[FP32 activations] --> B
B --> F{non-linearity, <br> e.g. SwiGLU}

Afaik, this behavior is currently undocumented in GGML and is, in my eyes, error-prone: swapping GGML_OP_MUL and the non-linearity may produce functionally incorrect results.
While GGML is co-developed with llama.cpp, it also serves other consumers such as stablediffusion.cpp or whisper.cpp.
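
A tiny numeric sketch of why the ordering matters, assuming SiLU as the gating non-linearity and treating the matmul output as a single scalar. The function names are made up for illustration:

```cpp
#include <cassert>
#include <cmath>

// SiLU, the gating non-linearity used inside SwiGLU.
inline float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// Intended order for NVFP4: apply the per-tensor scale to the matmul output first.
inline float scale_then_activate(float matmul_out, float tensor_scale) {
    assert(tensor_scale > 0.0f);
    return silu(matmul_out * tensor_scale);
}

// Swapped order: the scale is applied after the non-linearity.
inline float activate_then_scale(float matmul_out, float tensor_scale) {
    assert(tensor_scale > 0.0f);
    return silu(matmul_out) * tensor_scale;
}
```

Because SiLU is not linear, the two orders only agree when the scale is 1.0; for any other scale the swapped graph silently produces different activations.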

Leveraging Tensor Cores to accelerate NVFP4

To leverage NVFP4-HW-accelerators, one has to quantize incoming activations to NVFP4 via F4 = input / (F32 * F8), where F32 is an optional input parameter yielded during quantization (cf. ModelOpt golden reference section below):

In weight-only quantization recipes, F32 is absent. Here, the reference would estimate F32 based on incoming FP32 activations, which comes at a perf cost as it typically involves reducing the whole activation-tensor.
Post-training Static Quantization/Quantization Aware Training recipes estimate F32 offline as part of their training/calibration steps.

ModelOpt as NVFP4 quant/dequant golden reference

NVIDIA's Model-Optimizer lib can be taken as the golden reference. FP32 <-> NVFP4 conversion is handled by NVFP4QTensor, specifically its quantize/dequantize class methods.

For NVFP4QTensor.quantize, the path is as follows:
1. If a per-tensor F32 scale is given, use it, else default to amax(input) / (FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX) of the incoming tensor https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py#L250-L251 as a sensible heuristic.
2. For each block:
  - If per-block F8 scales are given, use them, else default to amax(block) / (F32 * FLOAT4_E2M1_MAX) of the incoming tensor https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py#L284-L28 (nobody currently provides/exports per-block F8 scales).
  - Derive per-nibble FP4 value via F4 = input / (F32 * F8)
For NVFP4QTensor.dequantize, the class simply does dequantize = F4 * F8 * F32 https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py#L364-L375
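
The quantize path above can be sketched roughly as follows. This is an illustration of the steps as described here, not ModelOpt's code: the function names are hypothetical, rounding of the block scale to FP8 is omitted, and ties are resolved towards the smaller magnitude rather than with a proper round-to-nearest-even.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

constexpr float kFloat8E4M3Max = 448.0f;  // largest finite E4M3 value
constexpr float kFloat4E2M1Max = 6.0f;    // largest E2M1 value

inline float amax_sketch(const float * x, std::size_t n) {
    float m = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        m = std::max(m, std::fabs(x[i]));
    }
    return m;
}

// Step 1: default per-tensor F32 scale when none is supplied.
inline float default_tensor_scale(const std::vector<float> & tensor) {
    return amax_sketch(tensor.data(), tensor.size()) / (kFloat8E4M3Max * kFloat4E2M1Max);
}

// Step 2a: default per-block scale (kept in float here; the real format stores it as F8).
inline float default_block_scale(const float * block, std::size_t block_size, float tensor_scale) {
    assert(tensor_scale > 0.0f);
    return amax_sketch(block, block_size) / (tensor_scale * kFloat4E2M1Max);
}

// Step 2b: F4 = input / (F32 * F8), snapped to the nearest E2M1 magnitude with the sign kept.
inline float quantize_to_e2m1_value(float input, float block_scale, float tensor_scale) {
    static const float grid[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};
    if (block_scale == 0.0f) {
        return 0.0f;  // all-zero block
    }
    const float scaled = input / (tensor_scale * block_scale);
    float best = grid[0];
    for (float g : grid) {
        if (std::fabs(std::fabs(scaled) - g) < std::fabs(std::fabs(scaled) - best)) {
            best = g;
        }
    }
    return std::copysign(best, scaled);
}
```

With these defaults the largest value of the tensor maps to FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX in scaled units, and the largest value of each block maps to FLOAT4_E2M1_MAX, so neither scale level saturates.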

Ways to resolve this

Discussed with @JohannesGaessler that we could expand GGML_OP_MUL_MAT along the following lines

graph LR
A[NVFP4 weights] --> B{GGML_OP_MUL_MAT}
C[FP32 activations] --> B
D[F32 scale weights] --> B
E[F32 scale activations] -.-> B
B --> F{non-linearity, <br> e.g. SwiGLU}

where we would ensure that F32 scales have to be passed for weights of derived-tensors. We can't enforce F32 scale activations unfortunately as they are optional (cf. weight-only quantization vs. static quantization/QAT recipes above).

Think of alternatives on how to better represent derived tensors in GGML: The above approach would resolve inference paths, but quantize_nvfp4/dequantize_row_nvfp4 references would still be broken as they don't consume/produce F32 scales according to the golden reference (and other backends will look here rather than at modelopt if they are to add support imo). I think the imatrix path is something one could lean on (as this imatrix maps to quant_weights, which we could (maybe) reuse to load/store F32 scales).

@ggml-org/maintainers Pointers/thoughts on the above are welcome. CC @ggerganov

Side notes/Trivia

We would typically call NVFP4QTensor.quantize with block_size of 16 to get to the standard outlined here.
NVIDIA's FP8 format is also a derived tensor, and will thus share everything outlined here for NVFP4.
On NVIDIA GPUs, F4 * F8 is handled by tensor cores, and F32 of weights & activations are handled in the GEMM epilogues (see PTX doc, where F32 is absent - scale A and B of Fig 42 refer to F8).
Not sure if we support calibration & NVFP4 quantization in llama.cpp, but if so, one should include amax-estimation/some kind of heuristic for intermediate activations.

pwilkin · 2026-04-17T12:43:23Z

pwilkin
Apr 17, 2026
Collaborator

I know I'm very much on the sidelines of this, but from my perspective passing scales to MUL_MAT is good as it would enable other quantization schemes with trained scales, something we cannot do in the current regime (hence in my #19941 PR I actually extended the signature to include scales).

0 replies

michaelw9999 · 2026-04-17T15:02:01Z

michaelw9999
Apr 17, 2026

@ORippler Here is part of the implementation of input scale integration I was almost ready to share as POC, using the in_s tensors that we are loading already, just integrating them into build_lora_mm:


ggml_tensor * build_lora_mm(
              ggml_tensor * w,
              ggml_tensor * cur,
              ggml_tensor * w_s = nullptr,
              ggml_tensor * w_in_s = nullptr) const;
 ...
ggml_tensor * build_lora_mm_id(
              ggml_tensor * w,   // ggml_tensor * as
              ggml_tensor * cur, // ggml_tensor * b
              ggml_tensor * ids,
              ggml_tensor * w_in_s = nullptr) const;
...
+            register_input_scale(layer.wq,              layer.wq_in_s);
+            register_input_scale(layer.wk,              layer.wk_in_s);
+            register_input_scale(layer.wv,              layer.wv_in_s);
+            register_input_scale(layer.wo,              layer.wo_in_s);
+            register_input_scale(layer.wqkv,            layer.wqkv_in_s);
+            register_input_scale(layer.wqkv_gate,       layer.wqkv_gate_in_s);
+            register_input_scale(layer.ffn_gate,        layer.ffn_gate_in_s);
+            register_input_scale(layer.ffn_up,          layer.ffn_up_in_s);
+            register_input_scale(layer.ffn_down,        layer.ffn_down_in_s);
+            register_input_scale(layer.ffn_gate_exps,   layer.ffn_gate_exps_in_s);
+            register_input_scale(layer.ffn_up_exps,     layer.ffn_up_exps_in_s);
+            register_input_scale(layer.ffn_down_exps,   layer.ffn_down_exps_in_s);
+            register_input_scale(layer.ffn_gate_shexp,  layer.ffn_gate_shexp_in_s);
+            register_input_scale(layer.ffn_up_shexp,    layer.ffn_up_shexp_in_s);
+            register_input_scale(layer.ffn_down_shexp,  layer.ffn_down_shexp_in_s);
+            register_input_scale(layer.ssm_in,          layer.ssm_in_in_s);
+            register_input_scale(layer.ssm_out,         layer.ssm_out_in_s);
+            register_input_scale(layer.ssm_alpha,       layer.ssm_alpha_in_s);
+            register_input_scale(layer.ssm_beta,        layer.ssm_beta_in_s);

.....
 static __global__ void quantize_mmq_nvfp4(
         const float * __restrict__ x, const int32_t * __restrict__ ids, void * __restrict__ vy,
+        const float * __restrict__ input_scale, const int32_t * __restrict__ expert_bounds, const int64_t input_scale_ne,
         const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
         const int64_t ne0, const int64_t ne1, const int64_t ne2) {
 #if defined(BLACKWELL_MMA_AVAILABLE)
@@ -101,11 +125,13 @@ static __global__ void quantize_mmq_nvfp4(
     float vals_raw[QK_NVFP4_SUB];
     float amax_raw = 0.0f;
     const int64_t base_idx = i3 * s03 + i2 * s02 + i01 * s01;
+    const int input_scale_idx = nvfp4_input_scale_index(expert_bounds, input_scale_ne, i1);
+    const float inv_input_scale = input_scale ? 1.0f / input_scale[input_scale_idx] : 1.0f;
 #pragma unroll
     for (int k = 0; k < QK_NVFP4_SUB; k++) {
         const int64_t i00 = i0_base + k;
         if (i00 < ne00) {
-            const float v = x[base_idx + i00];
+            const float v = x[base_idx + i00] * inv_input_scale;
             vals_raw[k] = v;
             amax_raw = fmaxf(amax_raw, fabsf(v));
         } else {
@@ -113,7 +139,7 @@ static __global__ void quantize_mmq_nvfp4(
......
const ggml_tensor * input_scale = nullptr;
if (use_native_fp4 && src0->type == GGML_TYPE_NVFP4) {
    memcpy(&input_scale, (const char *) dst->op_params + 2*sizeof(int32_t), sizeof(input_scale));
}

const float * input_scale_scale  = input_scale ? (const float *) input_scale->data : nullptr;
const int64_t input_scale_ne = input_scale ? ggml_nelements(input_scale) : 0;
...
if constexpr (type == GGML_TYPE_NVFP4) {
#pragma unroll
    for (int i = 0; i < mmq_x*mmq_y / (nwarps*warp_size); ++i) {
        sum[i] *= activation_scale;
    }
}
....
+void quantize_mmq_nvfp4_cuda_input_scale(
+        const float * x, const int32_t * ids, void * vy,
+        const float * input_scale, const int32_t * expert_bounds, const int64_t input_scale_ne,
+        const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
+        const int64_t ne0, const int64_t ne1, const int64_t ne2, const int64_t ne3, cudaStream_t stream) {
+    GGML_ASSERT(ne00 % QK_NVFP4 == 0);
+    GGML_ASSERT(ne0 > 0);
+
+    constexpr int nvfp4_block_size = 128;
+    const int64_t block_num_y = (ne0 + QK_NVFP4_SUB * nvfp4_block_size - 1) / (QK_NVFP4_SUB * nvfp4_block_size);
+    const dim3 block_size(nvfp4_block_size, 1, 1);
+    const dim3 num_blocks(ne1, block_num_y, ne2 * ne3);
+    quantize_mmq_nvfp4<<<num_blocks, block_size, 0, stream>>>(
+        x, ids, vy, input_scale, expert_bounds, input_scale_ne, ne00, s01, s02, s03, ne0, ne1, ne2);
+}

We could optionally hard enforce the presence of both input/weight scales and not allow them to be 1.0f.
This isn't nicely integrated or clean by any means, but it could be done in a more cohesive way, and not be restricted to NVFP4.

Take a look at: https://github.com/ggml-org/llama.cpp/pull/20845/changes
A similar API could be used for both the weight scale and input scale together to properly enforce the NVFP4 rules, even without modifying GGML_OP_MUL_MAT
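
For readers skimming the kernel excerpt above, the input-scale handling boils down to the following pattern, shown here as a plain C++ sketch with the FP4 quantization step left out. The function name is hypothetical, and it assumes the factor applied to the accumulated sum is the same input scale that the activations were divided by:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Activations are divided by input_scale before (omitted) quantization and
// the accumulated sum is multiplied by input_scale again in the epilogue.
inline float dot_with_input_scale(const std::vector<float> & x,
                                  const std::vector<float> & w,
                                  float input_scale) {
    assert(x.size() == w.size());
    assert(input_scale > 0.0f);
    const float inv_input_scale = 1.0f / input_scale;
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += (x[i] * inv_input_scale) * w[i];  // pre-scale, as done before quantizing
    }
    return sum * input_scale;                    // undo the pre-scale after the GEMM
}
```

Mathematically the scale cancels; its purpose is only to move the activations into a range that the FP4/FP8 grid represents well before they are quantized.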

As far as alternative derivation of input scale using imatrix, this is what I used in my offline NVFP4 quantizer:

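#include <algorithm> // std::clamp (C++17)
#include <cmath>     // std::isfinite, std::sqrt
#include <cstddef>   // size_t
#include <cstdint>   // int64_t

// Derive a per-tensor input scale from one imatrix row: sqrt of the mean of the
// finite, positive entries, clamped to [1/32, 32]; falls back to 1.0f otherwise.
static float llama_nvfp4_input_scale_from_imatrix(
    const float * imatrix,
    int64_t n_per_row) {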
    if (imatrix == nullptr || n_per_row <= 0) {
        return 1.0f;
    }

    double sum = 0.0;
    size_t count = 0;

    for (int64_t i = 0; i < n_per_row; i++) {
        const float v = imatrix[i];
        if (!std::isfinite(v) || v <= 0.0f) {
            continue;
        }
        sum += (double) v;
        count++;
    }

    if (count == 0 || sum <= 0.0) {
        return 1.0f;
    }

    const double rms = std::sqrt(sum / (double) count);
    if (!(rms > 0.0) || !std::isfinite(rms)) {
        return 1.0f;
    }

    const double file_input_scale = std::clamp(rms, 1.0 / 32.0, 32.0);
    return (float) file_input_scale;
}

0 replies

JohannesGaessler · 2026-04-17T16:06:29Z

JohannesGaessler
Apr 17, 2026
Collaborator

One thing that we could maybe do is add a function like this

    GGML_API struct ggml_tensor * ggml_mul_mat_ext(
            struct ggml_context * ctx,
            struct ggml_tensor  * a,
            struct ggml_tensor  * b,
            struct ggml_tensor  * scale_weight,
            struct ggml_tensor  * scale_activations);

which automatically constructs the ggml graph in the correct way by creating multiple tensors, similarly to how some convolution ops are handled using GGML_OP_IM2COL. ggml_mul_mat would internally call ggml_mul_mat_ext but without the optional tensors. Trying to call ggml_mul_mat with a->type == GGML_TYPE_NVFP4 results in an error. I think that way we don't need to modify the backend implementations for GGML_OP_MUL_MAT after all.
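
A toy model of that idea, purely for illustration. Everything prefixed with Toy/toy_ is hypothetical and stands in for ggml types and functions rather than using them, and the activation scale is accepted but not modelled because the thread does not pin down how it would be consumed:

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <vector>

enum class ToyType { F32, Q8_0, NVFP4 };
enum class ToyOp   { NONE, MUL_MAT, MUL };

struct ToyTensor {
    ToyType     type;
    ToyOp       op;
    ToyTensor * src0;
    ToyTensor * src1;
};

// Owns all nodes, loosely mimicking a graph-building context.
struct ToyContext {
    std::vector<std::unique_ptr<ToyTensor>> nodes;

    ToyTensor * make(ToyType type, ToyOp op = ToyOp::NONE,
                     ToyTensor * a = nullptr, ToyTensor * b = nullptr) {
        std::unique_ptr<ToyTensor> node(new ToyTensor{type, op, a, b});
        nodes.push_back(std::move(node));
        return nodes.back().get();
    }
};

// Builds MUL_MAT and, if a weight scale is given, wraps it in a MUL node.
// NVFP4 weights without a scale are rejected at graph-construction time.
inline ToyTensor * toy_mul_mat_ext(ToyContext & ctx, ToyTensor * a, ToyTensor * b,
                                   ToyTensor * scale_weight, ToyTensor * scale_activations) {
    assert(a != nullptr && b != nullptr);
    (void) scale_activations;  // not modelled in this sketch
    if (a->type == ToyType::NVFP4 && scale_weight == nullptr) {
        throw std::invalid_argument("NVFP4 weights require a per-tensor scale");
    }
    ToyTensor * cur = ctx.make(ToyType::F32, ToyOp::MUL_MAT, a, b);
    if (scale_weight != nullptr) {
        cur = ctx.make(ToyType::F32, ToyOp::MUL, cur, scale_weight);
    }
    return cur;
}

// The plain entry point forwards without optional tensors, so NVFP4 fails loudly.
inline ToyTensor * toy_mul_mat(ToyContext & ctx, ToyTensor * a, ToyTensor * b) {
    return toy_mul_mat_ext(ctx, a, b, nullptr, nullptr);
}
```

The point of the design is that the MUL node is created by the library next to the MUL_MAT node, so a consumer cannot accidentally place a non-linearity between the two.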

28 replies

am17an Apr 23, 2026
Collaborator

What I also thought of was reserving the initial bytes of a derived tensor for these things.

I thought of this and ultimately it's going to be messy to do this (I might be wrong though). The super-block approach could be 128 (5.6% waste) vs 256 (2.8% waste). I'm okay with either because it's ultimately a 4-bit quant which has outsized performance on Blackwell, so I don't think anyone would pick q4_1 over this just because it takes up 5% more memory while it has 1.5x the PP. In comparison with other frameworks, ggml still wins out quite comfortably in the memory department simply by not using Python.

Most models' hidden dims should be divisible by 128 (there are notable exceptions for which we would have to pad the last block), but overall it looks like a decent solution to me.
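
The waste percentages quoted above can be reproduced with a short calculation, assuming one 4-byte F32 scale is added per super-block and the overhead is measured against the existing payload (4-bit values plus one 1-byte F8 scale per 16 values). The helper names are hypothetical:

```cpp
#include <cassert>
#include <cstddef>

// Bytes of one super-block without a per-tensor scale:
// 4-bit values plus one 1-byte F8 scale per 16 values.
inline std::size_t nvfp4_payload_bytes(std::size_t n_values) {
    assert(n_values > 0 && n_values % 16 == 0);
    return n_values / 2 + n_values / 16;
}

// Relative overhead of storing one extra 4-byte F32 scale per super-block.
inline double f32_scale_overhead(std::size_t n_values) {
    return 4.0 / static_cast<double>(nvfp4_payload_bytes(n_values));
}

// Zero-padded values needed so that a row is a multiple of the super-block size.
inline std::size_t padding_values(std::size_t row_len, std::size_t super_block) {
    assert(super_block > 0);
    return (super_block - row_len % super_block) % super_block;
}
```

For 128 values this gives 4 / 72, about 5.6%, and for 256 values 4 / 144, about 2.8%; padding_values shows how many values a row that is not a multiple of the super-block size would have to be padded by.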

am17an Apr 23, 2026
Collaborator

@michaelw9999 you should not implement your ideas which we haven't agreed on and ask people to review it, writing code is easier than reviewing it. If you want your voice heard you should write simply what you are proposing and take part in this discussion.

CISC Apr 23, 2026
Collaborator

Honestly not sold on changing the GGUF format, just don't see any upsides to that, the scales are stored separately in the original format, why should we change that and store it more inefficiently?

If we want to be strict on the quantize round-trip we can change the internal format and repack as suggested by @michaelw9999.

ORippler Apr 23, 2026
Collaborator Author

I thought of this and ultimately it's going to be messy to do this (I might be wrong though).

Yeah these proposals all tend to hit the "GGML is first and foremost designed around block-scaled quants" fact, simply at different angles. The core question to ask is "do we want to increase GGML's complexity by increasing its support surface beyond block-scaled quants" vs. "do we shoehorn all future quants into a block-scaled format". FP8 for example has per-tensor scale and no block-scales at all (and we would love to add FP8 support to GGML in the future 😃). Shoehorning this into a block-scaled quant format makes it face the ne[0] % block_size == 0 restriction unnecessarily.

am17an Apr 23, 2026
Collaborator

I don't think repack is a good option as it increases complexity, the space on disk is hardly the problem. Basically concur on the above statement about block-scaled quants, as long as nvfp4 is one of those types that can be just another quant type, we should do that because it is lowest maintenance burden. Also for activation quantization it does not use extra space, since we will get a scale per super block.

michaelw9999 · 2026-04-18T17:21:38Z

michaelw9999
Apr 18, 2026

I did a full implementation doing this now for both scales, but as a generic API. Solving the last bugs and will post a link to see what you all think.


0 replies

michaelw9999 · 2026-04-20T14:11:51Z

michaelw9999
Apr 20, 2026

Here is a fully functional and working POC/WIP implementation.
I made a generic API to handle derived tensors and attach them to GGML_MUL_MAT/ID, and I also set up NVFP4 to use it properly. The input scale goes directly into quantize_nvfp4. It brought the Qwen3.5 4B ppl down from ~11.65 to 11.599.

Here are just some snippets and a portion of the code; I also made a graph below. Let me know what you think of this.

API to make derived tensors:

    // Create a derived_tensor to attach as aux tensors to a matmul op.
    GGML_API struct ggml_derived_tensor ggml_create_derived_tensor(
                struct ggml_tensor * tensor,
                enum ggml_derived_tensor_type type,
                enum ggml_derived_tensor_flags flags);
    // compatible with ops GGML_OP_MUL_MAT or GGML_OP_MUL_MAT_ID
    GGML_API void ggml_mul_mat_add_derived_tensor(
                struct ggml_tensor * t,
                struct ggml_derived_tensor derived);

    GGML_API struct ggml_tensor * ggml_get_derived_tensor(
            const struct ggml_tensor * t,
            enum ggml_derived_tensor_type type);

    GGML_API enum ggml_derived_tensor_flags ggml_get_derived_tensor_flags(
            const struct ggml_tensor * t,
            enum ggml_derived_tensor_type type);

We can make multiple types here and add it to the list for whatever needs to directly attach to GGML_MUL_MAT:

    enum ggml_derived_tensor_type {
        GGML_NVFP4_TENSOR_SCALE  =   0, // ... this is the global fp32 tensor scale for NVFP4
        GGML_NVFP4_INPUT_SCALE   =   1, // ... this is the matching fp32 activation scale per tensor scale
 // can add more here for anything else that should attach directly to MUL_MAT
        GGML_DERIVED_TENSOR_COUNT,
    };

    enum ggml_derived_tensor_flags {  
        GGML_DERIVED_TENSOR_FLAG_OPTIONAL = 0,  // This would determine if this tensor is required or optional
        GGML_DERIVED_TENSOR_FLAG_REQUIRED = 1  
    };

    struct ggml_derived_tensor {      // The derived tensors here for now belong to mul_mat or mul_mat_id
        struct ggml_tensor * tensor;
        enum ggml_derived_tensor_type type;
        enum ggml_derived_tensor_flags flags;
    };

In build_lora it will link up the weight scale and input scale tensors:

    if (derived_tensors) {
        GGML_ASSERT(w_s == nullptr || w_s->type == GGML_TYPE_F32);
        GGML_ASSERT(w_in_s == nullptr || w_in_s->type == GGML_TYPE_F32);

        if (w_s != nullptr) {
            ggml_mul_mat_add_derived_tensor(res, ggml_create_derived_tensor(
                    w_s,
                    GGML_NVFP4_TENSOR_SCALE,
                    GGML_DERIVED_TENSOR_FLAG_OPTIONAL));
        }

        if (w_in_s != nullptr) {
            ggml_mul_mat_add_derived_tensor(res, ggml_create_derived_tensor(
                    w_in_s,
                    GGML_NVFP4_INPUT_SCALE,
                    GGML_DERIVED_TENSOR_FLAG_OPTIONAL));
        }
    }

Basic mermaid:

flowchart TB

    T0["GGML_NVFP4_TENSOR_SCALE"] --> A0["ggml_mul_mat_add_derived_tensor"]
    T1["GGML_NVFP4_INPUT_SCALE"] --> A0

    G0["build_lora_mm"] --> A0
    G1["build_lora_mm_id"] --> A0

    A0 --> M0["GGML_OP_MUL_MAT"]
    A0 --> M1["GGML_OP_MUL_MAT_ID"]

    M0 --> A1["ggml_get_derived_tensor"]
    M1 --> A1

    A1 --> C0["ggml_cuda_mul_mat"]

    C0 --> Q0["quantize_mmq_fp4_cuda<false>"]
    C0 --> Q1["quantize_mmq_fp4_cuda<true>"]

    Q0 --> N0["quantize_mmq_nvfp4<false>"]
    Q1 --> N1["quantize_mmq_nvfp4<true>"]

    M1 --> I0["ids_expert"]
    I0 --> N1

1 reply

michaelw9999 Apr 21, 2026

Updated the repo link to 64664e6 . This is now the correct link to try and it is fully working. I messed up the previous commit trying to base with the unmerged NVFP4 PR with freshly rebased llama.cpp and it did not have all the right files or latest local branch. This is now clean and freshly rebased against llama.cpp alone, and it includes (entire NVFP4 PR + derived tensors API POC with NVFP4 to mulmat + input scale wiring for a few arches). One commit now and builds properly, working really well and very fast.

vishalpandya1990 · 2026-04-21T15:46:33Z

vishalpandya1990
Apr 21, 2026

Is it possible to leverage unused slots in the src[] array of the ggml_tensor for hooking up per-tensor scales of activations and weights? (like it's used in ggml_flash_attn_ext_add_sinks for passing sinks)

1 reply

michaelw9999 Apr 21, 2026

Yes, I did it that way in bdd31fd#diff-3dfb056f1ce5ee9bb5ab48336c48127b2ae765dd68716b75a13eee5cecdd6a93R1229 using src[3] and it worked. It is not as elegant, but it was small and did not need much code, which was nice. The catch is that the slot is not defined or reserved in any way, so another backend that uses it (or needs to) could conflict.

vishalpandya1990 · 2026-04-22T07:50:55Z

vishalpandya1990
Apr 22, 2026

I was reviewing the options discussed thus far, and wanted to summarize.

Broadly, there are the following design options: (A) extend the ggml_mul_mat signature to accommodate per-tensor scales of weights and activations, (B) create a new ggml operator (ext version) that internally builds the graph using ggml_mul_mat + ggml_mul nodes, (C) pass the scales in ggml_tensor using existing unused fields (e.g. src[], op_params[] etc.) for the NVFP4 + CUDA + Blackwell specific path, (D) add per-tensor scales in the NVFP4 block (only the weight's per-tensor scale?).

The POC implementation mentioned above mainly comes under option-C. It uses op_params[] for passing per-tensor scales of weights and activations, and creates some new typed APIs for setting and fetching those scales. With new APIs specifically for scale handling, this approach is cleaner and may be more strongly typed. I feel the src[] field in ggml_tensor is also a candidate for this - GGML already has precedent for attaching auxiliary tensors post-creation via src[] (e.g. sinks in attention), and scales can be viewed as one of the operands of the eventual multiply operations. In either case, dequantize_row_nvfp4 is not fixed, and a separate ggml_mul would still be required (fallback).

I think a hybrid approach (say, option-E) is also worth considering here: (1) store the per-tensor scale of weights in the nvfp4 block, and (2) pass the input-scale in a ggml_tensor field (e.g. in an unused src[] / padding / extra slot). This hybrid approach makes weights handling / the nvfp4-block self-contained, fixes dequantize_row_nvfp4, eliminates the post-GEMM ggml_mul node for the weight's per-tensor scale, and potentially removes the risk of GGML consumers forgetting to apply the weight scale. We can avoid placing a ggml_mul for input_scale since that will anyway be unused/irrelevant for paths/backends which don't support FP4 GEMM. On the downside, this approach can potentially break existing gguf-nvfp4 checkpoints (if any), but given that NVFP4 support in ggml/llama.cpp is still being finalized, this may not be a major issue.

Any Thoughts?

5 replies

am17an Apr 22, 2026
Collaborator

In Option-E, passing an input scale is never a problem because we only quantize activations inside a backend, so backends need to take care of it when quantizing to NVFP4, and the only such case would be Blackwell.

vishalpandya1990 Apr 22, 2026

Yes, so the non-CUDA or non-Blackwell paths would ignore the input-scale coming in ggml_tensor and CUDA + Blackwell path can update kernels to make use of that input-scale for quantize and dequantize. In addition, CPU backend (dequantize_row_nvfp4) might need to be updated for applying per-tensor scale of weights - besides this, probably no other backend should require change in this approach (E)?.

am17an Apr 22, 2026
Collaborator

Activations coming into backends via ggml will never be nvfp4, they are usually f32. So it does not make sense for input tensor to have a scale when the tensor itself gets quantized inside the backend.

ORippler Apr 22, 2026
Collaborator Author

I am against option E. The biggest technological differentiator of GGML over other libs such as PyTorch/MLX is in my eyes the absolute optimization for memory footprint (as this is the most constrained resource in local AI inference). Anything that goes against this is something I feel weakens GGML's profile and is something I have been holding back on (e.g. padding during repack).

ORippler Apr 22, 2026
Collaborator Author

I mainly favor C atm., and was ~~thinking of converting the 8 char padding[8] bytes (that are currently unused) to store an optional pointer to another ggml_tensor (which may also be a nullptr).~~ We would then have to create a set of quants that are "derived", where we have to ensure this "scales" pointer is not null at run-time.

Edit: Going to try a PoC where derived-tensors have weights and activation scales as their src[] inputs (meaning they start to have an in-degree != 0, and we assume they are guaranteed to have an in-degree == 0 before)

michaelw9999 · 2026-04-22T11:16:40Z

michaelw9999
Apr 22, 2026

I have yet another option. Finishing testing and compiling now. You might like this one 😎


0 replies

michaelw9999 · 2026-04-22T22:19:50Z

michaelw9999
Apr 22, 2026

@everyone
Here is the other option, I am correcting one small bug I found last night but will post the fully working POC demo later today (California time). I'll tell how this one works:

I was waiting for the first PR to merge, to keep it small and easy to review, before I posted the ready to go AoSoA repack. This option is a modification of that; it was easy to change the block since it's repacking the block anyway. It already was much faster than the PR, with room for more tuning. So I modified it to put the weight scale in the block, still maintaining 16B AoSoA. This means no need to change the GGUF or anything on disk, and we don't need to introduce a new API or anything else except the replacement vecdot/quantizer and the repack functions.
Repack is done on model loading for CUDA with BLACKWELL_MMA_AVAILABLE. It works properly on CPU for offloading, that is not a problem.
This latest repack AoSoA version pre-arranges the bytes into tiles; it's ready to go for e2m1.e2m1 MMA instead of a traditional llama layout. It is a huge performance advantage; the kernel does direct gmem -> smem -> regs without much overhead, no transpose necessary, no load_generic, no load_ldmatrix. It's been a while since I profiled but I recall the total kernel smem usage went down massively in vecdot_nvfp4_mma. Regardless of these side benefits, I put the tensor scale into a header as follows:

struct  __align__(16) block_nvfp4_blackwell_tensor {
  float    weight_scale; // per tensor
  uint8_t  pad[12];      // pads the header to 16 bytes
  block_nvfp4_blackwell tiles[]; // same AoSoA as in previous fully working ready to go version
};

The tile is defined as:

struct __align__(16) block_nvfp4_blackwell_frag {
  uint32_t regs[32][4];
  uint32_t scales_u32[32];
};
struct __align__(16) block_nvfp4_blackwell {
  block_nvfp4_blackwell_frag tiles[4];
};

So the the whole new block is as such: [16 byte header][tile 0][tile 1 ...n...] :

bytes 0..3       : weight_scale
bytes 4..15      : padding

bytes 16..527    : tile0.tiles[0].regs[0..31][0..3]
bytes 528..655   : tile0.tiles[0].scales_u32[0..31]

bytes 656..1167  : tile0.tiles[1].regs[0..31][0..3]
bytes 1168..1295 : tile0.tiles[1].scales_u32[0..31]

bytes 1296..1807 : tile0.tiles[2].regs[0..31][0..3]
bytes 1808..1935 : tile0.tiles[2].scales_u32[0..31]

bytes 1936..2447 : tile0.tiles[3].regs[0..31][0..3]
bytes 2448..2575 : tile0.tiles[3].scales_u32[0..31]
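
The byte ranges above can be cross-checked on the host with a few static_asserts. This is only a sketch: the struct names are hypothetical mirrors of the ones in this post, alignas replaces __align__ so it compiles as plain C++, and the flexible array member is left out so the header is just its 16 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Host-side mirrors of the structs above (names are illustrative only).
struct alignas(16) frag_layout_sketch {
    uint32_t regs[32][4];     // 512 bytes
    uint32_t scales_u32[32];  // 128 bytes
};
struct alignas(16) tile_layout_sketch {
    frag_layout_sketch tiles[4];
};
struct alignas(16) header_layout_sketch {
    float   weight_scale;
    uint8_t pad[12];
};

static_assert(sizeof(header_layout_sketch) == 16,   "header is 16 bytes");
static_assert(sizeof(frag_layout_sketch)   == 640,  "512 B regs + 128 B scales");
static_assert(sizeof(tile_layout_sketch)   == 2560, "four fragments per tile");
static_assert(offsetof(frag_layout_sketch, scales_u32) == 512, "scales follow regs");

// Byte offset of a fragment's regs, counted from the start of the header.
inline std::size_t regs_byte_offset(std::size_t tile, std::size_t frag) {
    assert(frag < 4);
    return sizeof(header_layout_sketch)
         + tile * sizeof(tile_layout_sketch)
         + frag * sizeof(frag_layout_sketch);
}

// Byte offset of a fragment's scales, counted from the start of the header.
inline std::size_t scales_byte_offset(std::size_t tile, std::size_t frag) {
    return regs_byte_offset(tile, frag) + offsetof(frag_layout_sketch, scales_u32);
}
```

For example, regs_byte_offset(0, 1) is 656 and scales_byte_offset(0, 3) is 2448, matching the listing, and the second tile would start at byte 2576.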

On its own, the block does not address how the tensor weight scale would be integrated, but now it's carried through in src0 and readily available, which is much less complicated than the previous way. I'll share how it's done soon!

0 replies

ORippler · 2026-04-24T12:58:06Z

ORippler
Apr 24, 2026
Collaborator Author

Dabbling around with how one could try to better represent derived tensors in ggml here to get a clearer understanding of the implications and if it could be worth it. Will open a draft PR should it reach a state I'm actually happy with.

0 replies

michaelw9999 · 2026-04-27T07:34:49Z

michaelw9999
Apr 27, 2026

Here is a (updated/rebased) working WIP/POC based off the existing PR #22196 and @ORippler 's POC.
This lowers ppl, moving Qwen3.5 4B 11.65 to 11.55.

This uses the AoSoA repack work I've already done, but puts the weight and activation scales in the new block. I was hoping to "kill two birds with one stone" with this. I know it's a lot of code and respect everyone's time, and the intent is for the best quality and fastest NVFP4 for everyone. If there is anything worth using I will make small separate PRs later. I tried to keep code isolated and gated to Blackwell as much as possible.

No need for a derived tensor API, or aux ggml_mul_mat... functions, or changes to llama graph.
The direct to MMA load skips mul_mat_q_process_tile and load_tiles() as previously, leaving those alone, and just uses a small tile loader. The activation comes directly from the quantizer and goes to a new nvfp4 only vecdot, where the scales are applied directly.
All of the paths I explored led to similar issues, and this resolves all of them: retaining correct tensor scales when doing CPU offloading or tensor copy, bringing them in directly to the quantizer for FP4 activation without making a mess, dealing with where to place expert scales and channel indexing, and then not slowing down with the overhead, which was not exactly easy.

Implementation:


Tensor Scales placed as follows:

weight->src[0] = weight_scale;
weight->src[1] = input_scale;
weight->op_params[0] = cached weight scale; // for CUDA only ** when _s scales are nullptr **
weight->op_params[1] = cached input scale;  // optional, weight scale required

To not break the current generic NVFP4 implementations and with non-CUDA backends, *_s pointers as currently in place are set nullptr. Taking them out breaks generic NVFP4 or the other backends. Fixing all of those too adds more work and complexity now. So this blocks any involvement at ggml_mul as in the build_lora_mm design. Input scales will not need to go through there anymore. MMQ/MMVQ use the scales directly from the weight tensor.

New vec_dot_nvfp4_q8_1_bw for MMVQ uses the new block layout and weight scale without needing to convert to the old block and uses the scales directly. MXFP4 stays the same. I experimented using MXFP4 repack with this same layout and direct load to PTX - it works too. NVFP4 x NVFP4 for all of MMVQ is still much slower than NVFP4xQ8, I've been working on it for some time but not there yet.
The new AoSoA layout is:

struct  __align__(16) block_nvfp4_blackwell_tensor {
    float   weight_scale;
    float   input_scale;
    const float * weight_scales; // For MOE per expert
    const float * input_scales;
    block_nvfp4_blackwell tiles[];
};

struct  __align__(16) block_nvfp4_blackwell_frag {
    uint32_t regs[32][4];
    uint32_t scales_u32[32];
};

struct  __align__(16) block_nvfp4_blackwell {
    block_nvfp4_blackwell_frag tiles[4];
};

No need to change the GGUF or mess up anything already quantized. This is only for CUDA and Blackwell.
I put static asserts to require the weight scale, but left the input scale optional.

Speedup:


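| Model            | Test  | Baseline |       PR | Repack | Speedup (PR, RP) |
| ---------------- | ----- | -------: | -------: | -----: | ---------------- |
| Qwen3.5-4B       | pp512 |    14831 | 17101.30 |  20228 | 1.18x, 1.36x     |
| Qwen3.5-4B       | tg128 |   218.25 |   210.36 | 221.54 | 0.96x, 1.02x     |
| Nemotron Cascade | pp512 |     8707 |    12385 |  13402 | 1.42x, 1.54x     |
| Nemotron Cascade | tg128 |      232 |      235 |    257 | 1.01x, 1.11x     |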

Ppl improvement:

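| Model | Base Ppl | Mean Kld | PR Ppl | PR Kld | RP Ppl | RP Kld |
| ----- | -------: | -------: | -----: | -----: | -----: | -----: |
| Qwen3.5-4B | 11.400200 | 0.053265 | 11.658689 | 0.092041 | 11.557739 | 0.091930 |

Ppl unchanged with -ngl X or -dev none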

3 replies

michaelw9999 Apr 28, 2026

Restored the MoE speedup and cleaned up a bit here. Moved the expert scales into the block instead of empty padding. Cascade-2 is now faster than baseline, with 12,967 pp and 245 tg.

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| nemotron_h_moe 31B.A3.5B NVFP4 |  18.01 GiB |    31.58 B | CUDA       |  99 |           pp512 |     13402.70 ± 99.04 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  18.01 GiB |    31.58 B | CUDA       |  99 |           tg128 |        257.10 ± 1.20 |

stevelikesrhino Apr 29, 2026

From my own testing, the original PoC caused quite a regression in tg speed for 30B-class dense models. Changing VDR_NVFP4_Q8_1_MMVQ from 4 to 2 recovers most of it.

In /ggml/src/ggml-cuda/vecdotq.cuh

-#define VDR_NVFP4_Q8_1_MMVQ 4
+#define VDR_NVFP4_Q8_1_MMVQ 2
 #define VDR_NVFP4_Q8_1_MMQ  8
 #define VDR_NVFP4_NVFP4_MMQ 4
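For readers unfamiliar with the knob: as far as I understand the MMVQ kernels, VDR ("vec dot ratio") is the number of contiguous 32-bit integers of a quantized block that each thread processes per vec-dot call, so lowering it spreads one block over more threads. The sketch below only illustrates that arithmetic. The helper names are hypothetical, the formulas are my assumption about how the kernel indexes its work, and the `qi` value of 8 used in the usage note is made up rather than the actual NVFP4 trait.

```cpp
#include <cassert>

// Hypothetical helper: threads cooperating on one quantized block when each
// thread consumes `vdr` of the block's `qi` 32-bit integers.
inline int threads_per_block(int qi, int vdr) {
    assert(vdr > 0 && qi % vdr == 0);
    return qi / vdr;
}

// Hypothetical helper: quantized blocks covered per loop iteration by a CUDA
// block of nwarps * warp_size threads.
inline int blocks_per_iter(int qi, int vdr, int nwarps, int warp_size) {
    assert(qi > 0 && vdr > 0);
    return vdr * nwarps * warp_size / qi;
}
```

With the made-up `qi = 8`, going from VDR 4 to VDR 2 doubles `threads_per_block` from 2 to 4 and halves `blocks_per_iter` for a 4-warp block from 64 to 32. This does not by itself explain why 2 is faster here; that remains an empirical result.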

michaelw9999 Apr 29, 2026

Outstanding, thanks @stevelikesrhino! That is definitely the better parameter. I got the speedup, and it works on all the models, even the MoE. Much appreciated :) Rebased commit here.

from:
| qwen35 27B BF16                |  17.50 GiB |    26.90 B | CUDA       |  99 |           tg128 |         56.95 ± 0.14 |
to:
| qwen35 27B BF16                |  17.50 GiB |    26.90 B | CUDA       |  99 |           tg128 |         60.85 ± 0.10 |

$ ./llama-bench -m /home/mw/Qwen3.5-4B-NVFP4.gguf 
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
  Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen35 4B BF16                 |   3.06 GiB |     4.21 B | CUDA       |  99 |           pp512 |     20106.50 ± 49.61 |
| qwen35 4B BF16                 |   3.06 GiB |     4.21 B | CUDA       |  99 |           tg128 |        206.23 ± 0.69 |

| qwen35 4B BF16                 |   3.06 GiB |     4.21 B | CUDA       |  99 |           pp512 |    20228.69 ± 81.76 |
| qwen35 4B BF16                 |   3.06 GiB |     4.21 B | CUDA       |  99 |           tg128 |        221.54 ± 0.72 |

On Nemotron:

from:
| nemotron_h_moe 31B.A3.5B NVFP4 |  18.01 GiB |    31.58 B | CUDA       |  99 |           pp512 |    13233.49 ± 107.65 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  18.01 GiB |    31.58 B | CUDA       |  99 |           tg128 |        241.21 ± 0.98 |
to:
| nemotron_h_moe 31B.A3.5B NVFP4 |  18.01 GiB |    31.58 B | CUDA       |  99 |           pp512 |     13402.70 ± 99.04 |
| nemotron_h_moe 31B.A3.5B NVFP4 |  18.01 GiB |    31.58 B | CUDA       |  99 |           tg128 |        257.10 ± 1.20 |

ORippler · 2026-05-12T20:24:41Z

ORippler
May 12, 2026
Collaborator Author

Dabbling around with how one could try to better represent derived tensors in ggml here to get a clearer understanding of the implications and if it could be worth it. Will open a draft PR should it reach a state I'm actually happy with.

Having spent more time and thought on this I think the best path to move forward is along the lines of what was initially brainstormed with Johannes:

Generally, extend GGML ops that do GEMMs to accept scales as (optional) inputs. A logical starting point here is MUL_MAT/MUL_MAT_ID, as those sparked the discussion initially.
After feasibility is shown for MUL_MAT/MUL_MAT_ID, add support for derived/scaled tensors to other GEMM-heavy ops such as FA (and the associated KV cache in llama.cpp). This would require applying the same changes to SET/GET_ROWS, FLASH_ATTN and CAST, among others. We are looking to add FP8 support to GGML soon, and we have been using FP8 for KV-cache quantization in ModelOpt.
If desired: assert that numerical, non-GEMM ops (e.g. GGML_OP_ADD) are illegal to call on scaled/derived tensors, and guide callers to explicitly dequantize via ggml_cast first.

Why make the distinction between GEMM and non-GEMM ops? Because math in FP8/FP4 can only be done in Tensor Cores on NVIDIA GPUs, which in turn can only accelerate GEMMs. I presume the same holds for CDNA4 and other accelerators, but have not explicitly verified it.

I believe the above (I) aligns closely with the support-surface of HW-acceleration, (II) is a software design also employed in other inferencing solutions (e.g. vLLM), (III) keeps changes to ggml to a minimum, while (IV) increasing safety.
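A toy sketch of the proposed split, to make the idea concrete. It is not GGML code: `scaled_tensor`, `mul_mat_vec`, `dequantize` and `add` are hypothetical stand-ins, and a plain float vector takes the place of the block-quantized payload. The GEMM-like op takes the scale as an optional input and applies it itself, while the non-GEMM op rejects derived tensors until they are explicitly dequantized.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical derived tensor: values are only meaningful once multiplied by
// the per-tensor scale. has_scale == false models a non-derived format.
struct scaled_tensor {
    std::vector<float> values;
    bool               has_scale;
    float              scale;
};

// GEMM-like op: applies the optional scale itself, so the caller cannot
// forget it or place it after a non-linearity.
inline std::vector<float> mul_mat_vec(const scaled_tensor & w, std::size_t rows,
                                      std::size_t cols, const std::vector<float> & x) {
    assert(w.values.size() == rows * cols && x.size() == cols);
    const float s = w.has_scale ? w.scale : 1.0f;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            acc += w.values[r * cols + c] * x[c];
        }
        y[r] = acc * s;  // scale applied once per output element
    }
    return y;
}

// Explicit dequantization, standing in for the ggml_cast route.
inline std::vector<float> dequantize(const scaled_tensor & t) {
    std::vector<float> out(t.values);
    const float s = t.has_scale ? t.scale : 1.0f;
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] *= s;
    }
    return out;
}

// Non-GEMM op: refuses derived tensors instead of silently ignoring the scale.
inline std::vector<float> add(const scaled_tensor & a, const std::vector<float> & b) {
    if (a.has_scale) {
        throw std::invalid_argument("add: dequantize the scaled tensor first");
    }
    assert(a.values.size() == b.size());
    std::vector<float> out(a.values);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] += b[i];
    }
    return out;
}
```

Calling `add` on a scaled tensor fails loudly, whereas `add` on the result of `dequantize` wrapped as an unscaled tensor works, which mirrors the "dequantize via ggml_cast first" guidance.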

0 replies
