Replies: 12 comments 38 replies
-
I know I'm very much on the sidelines of this, but from my perspective passing scales to MUL_MAT is good as it would enable other quantization schemes with trained scales, something we cannot do in the current regime (hence in my #19941 PR I actually extended the signature to include scales).
-
@ORippler Here is part of the implementation of input scale integration that I was almost ready to share as a POC, using the …

We could optionally hard-enforce the presence of both input and weight scales and not allow them to be 1.0f. Take a look at: https://github.com/ggml-org/llama.cpp/pull/20845/changes

As for an alternative derivation of the input scale using imatrix, this is what I used in my offline NVFP4 quantizer:
-
One thing that we could maybe do is add a function like this:

    GGML_API struct ggml_tensor * ggml_mul_mat_ext(
            struct ggml_context * ctx,
            struct ggml_tensor  * a,
            struct ggml_tensor  * b,
            struct ggml_tensor  * scale_weight,
            struct ggml_tensor  * scale_activations);

which automatically constructs the ggml graph in the correct way by creating multiple tensors, similarly to how some convolution ops are handled using …
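A rough, self-contained illustration of the helper idea above: one call that expands into a matrix multiplication followed by separate scale steps. This is not ggml code. All `toy_*` names are hypothetical, plain float arrays stand in for tensors, quantization is ignored entirely, and treating a NULL scale as "no scale" is an assumption made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* y[r] = sum_c w[r*cols + c] * x[c]; stands in for a GGML_OP_MUL_MAT node. */
static void toy_mul_mat(const float *w, const float *x, float *y,
                        size_t rows, size_t cols) {
    assert(w != NULL && x != NULL && y != NULL);
    for (size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            acc += w[r * cols + c] * x[c];
        }
        y[r] = acc;
    }
}

/* Multiplies every element by s; stands in for a follow-up scale node. */
static void toy_scale(float *y, size_t n, float s) {
    for (size_t i = 0; i < n; ++i) {
        y[i] *= s;
    }
}

/* Hypothetical "ext" helper: the caller makes one call, the helper chains
 * the individual steps in a fixed order. A NULL scale is skipped. */
static void toy_mul_mat_ext(const float *w, const float *x, float *y,
                            size_t rows, size_t cols,
                            const float *scale_weight,
                            const float *scale_activations) {
    toy_mul_mat(w, x, y, rows, cols);
    if (scale_weight != NULL) {
        toy_scale(y, rows, *scale_weight);
    }
    if (scale_activations != NULL) {
        toy_scale(y, rows, *scale_activations);
    }
}
```

The point of such a helper is that the order of the steps is fixed in one place instead of being left to every graph builder.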
-
I have now done a full implementation of this for both scales, but as a generic API. I am solving the last bugs and will post a link to see what you all think.
-
Here is a fully functional, working POC/WIP implementation. Below are some snippets, a portion of the code, and a graph I put together. Let me know what you think of this.

API to make derived tensors:

We can define multiple types here and add them to the list for whatever needs to attach directly to GGML_MUL_MAT:

In …

Basic mermaid:

flowchart TB
T0["GGML_NVFP4_TENSOR_SCALE"] --> A0["ggml_mul_mat_add_derived_tensor"]
T1["GGML_NVFP4_INPUT_SCALE"] --> A0
G0["build_lora_mm"] --> A0
G1["build_lora_mm_id"] --> A0
A0 --> M0["GGML_OP_MUL_MAT"]
A0 --> M1["GGML_OP_MUL_MAT_ID"]
M0 --> A1["ggml_get_derived_tensor"]
M1 --> A1
A1 --> C0["ggml_cuda_mul_mat"]
C0 --> Q0["quantize_mmq_fp4_cuda<false>"]
C0 --> Q1["quantize_mmq_fp4_cuda<true>"]
Q0 --> N0["quantize_mmq_nvfp4<false>"]
Q1 --> N1["quantize_mmq_nvfp4<true>"]
M1 --> I0["ids_expert"]
I0 --> N1
-
Is it possible to leverage unused slots in …
-
I was reviewing the options discussed thus far and wanted to summarize. Broadly, there are the following design options: (A) extend …

The POC implementation mentioned above mainly falls under option C. It uses …

I think a hybrid approach (say, option E) is also worth considering here: (1) store the per-tensor scale of the weights in the nvfp4 block, and (2) pass the input scale in a ggml_tensor field (e.g. in an unused src[] / padding / extra slot). This hybrid approach makes weight handling and the nvfp4 block self-contained, fixes …

Any thoughts?
-
I have yet another option. I am finishing testing and compiling now. You might like this one 😎

On Wed, Apr 22, 2026, 3:13 AM Oliver Simons wrote:
I mainly favor C atm., and was thinking of converting the 8 padding-bytes
(that are currently unused) to store a pointer to another ggml_tensor
(which may also be a nullptr). We would then have to create a set of quants
that are "derived", where we have to ensure this "scales" pointer is not
null at call-time
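The padding-pointer idea quoted above could look roughly like the following sketch. It does not use the real `ggml_tensor` layout. The struct, the enum, and every `toy_*` function are hypothetical stand-ins, and requiring the scales tensor to be F32 is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_type { TOY_TYPE_F32, TOY_TYPE_Q8_0, TOY_TYPE_NVFP4 };

struct toy_tensor {
    enum toy_type       type;
    void              * data;
    struct toy_tensor * scales; /* hypothetical: takes the place of padding, may be NULL */
};

/* "Derived" types cannot be dequantized from their blocks alone. */
static bool toy_type_is_derived(enum toy_type t) {
    return t == TOY_TYPE_NVFP4;
}

/* Call-time check: a derived weight tensor must carry a non-NULL scales
 * tensor; all other types are accepted with or without one. */
static bool toy_mul_mat_weight_ok(const struct toy_tensor *w) {
    assert(w != NULL);
    if (!toy_type_is_derived(w->type)) {
        return true;
    }
    return w->scales != NULL && w->scales->type == TOY_TYPE_F32;
}
```

With a check like this, an NVFP4 weight without scales is rejected before any kernel runs, while existing quant types are unaffected.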
-
@everyone I was waiting for the first PR to merge, to keep it small and easy to review, before posting the ready-to-go AoSoA repack. This option is a modification of that; it was easy to change the block since it repacks the block anyway. It was already much faster than the PR, with room for more tuning. So I modified it to put the weight scale in the block, still maintaining 16B AoSoA. This means there is no need to change the GGUF or anything on disk, and we don't need to introduce a new API or anything else except the replacement vecdot/quantizer and the repack functions.

The tile is defined as:

So the whole new block is as such:

On its own, the block does not address how the tensor weight scale would be integrated, but the scale is now carried through in src0, readily available, and much less complicated than the previous way. I'll share how it's done soon!
-
Dabbling around with how one could try to better represent derived tensors in ggml here to get a clearer understanding of the implications and if it could be worth it. Will open a draft PR should it reach a state I'm actually happy with.
-
Here is an (updated/rebased) working WIP/POC based on the existing PR #22196 and @ORippler's POC. This uses the AoSoA repack work I've already done, but puts the weight and activation scales in the new block. I was hoping to "kill two birds with one stone" with this. I know it's a lot of code and I respect everyone's time; the intent is the best quality and fastest NVFP4 for everyone. If there is anything worth using, I will make small separate PRs later. I tried to keep the code isolated and gated to Blackwell as much as possible. No need for a derived tensor API, or aux …

Implementation:

Tensor scales are placed as follows:

    weight->src[0] = weight_scale;
    weight->src[1] = input_scale;
    weight->op_params[0] = cached weight scale; // for CUDA only ** when _s scales are nullptr **
    weight->op_params[1] = cached input scale;  // optional, weight scale required

To not break the current generic NVFP4 implementations and non-CUDA backends, …

New …

No need to change the GGUF or mess up anything already quantized. This is only for CUDA and Blackwell.

Speedup:

Ppl improvement:
-
Having spent more time and thought on this, I think the best path forward is along the lines of what was initially brainstormed with Johannes:

Why make the distinction between GEMM and non-GEMM ops? Because math in FP8/FP4 can only be done in Tensor Cores on NVIDIA GPUs, which in turn can only accelerate GEMMs. I presume the same holds for CDNA4/other accelerators, but have not explicitly verified it. I believe the above (I) aligns closely with the support surface of HW acceleration, (II) is a software design also employed in other inferencing solutions (e.g. vLLM), (III) keeps changes to ggml to a minimum, while (IV) increasing safety.
-
NVFP4 is a derived tensor

NVFP4 is a two-step quantization scheme: F4 values are scaled by a per-block F8 scale, and the result is scaled once more by an F32 scale.

To dequantize NVFP4 -> FP32, one has to do FP32_activations = F4 * F8 * F32. We have to "derive" dequantized values from both the blocks and the F32 scale, hence the term derived tensor.

What does this mean for GGML?

- struct block_nvfp4 does not fully represent the quantized tensor (i.e. we cannot dequantize without knowing about F32).
- The quantize_nvfp4/dequantize_row_nvfp4 functions inside GGML are incorrect when F32 != 1.0 (and F32 = 1.0 incurs a significant quality loss).
- The compute graph for NVFP4 weights looks like this:

graph LR
    A[NVFP4 weights] --> B{GGML_OP_MUL_MAT}
    C[FP32 activations] --> B
    B --> D{GGML_OP_MUL}
    E[F32 scale] --> D
    D --> F{non-linearity, <br> e.g. SwiGLU}

whereas it looks like this for non-derived quant formats:

graph LR
    A[Q8_0 weights] --> B{GGML_OP_MUL_MAT}
    C[FP32 activations] --> B
    B --> F{non-linearity, <br> e.g. SwiGLU}

Afaik, this behavior is currently undocumented in GGML and is, in my eyes, error prone: swapping GGML_OP_MUL and the non-linearity may produce functionally incorrect results. While GGML is co-developed with llama.cpp, it also serves other consumers such as stablediffusion.cpp or whisper.cpp.
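The ordering hazard can be seen with a few lines of arithmetic. The sketch below uses a hard-swish function purely as a libm-free stand-in for the non-linearity (the post mentions SwiGLU). The `toy_*` names and the sample values are illustrative assumptions, not ggml code.

```c
#include <assert.h>

static float toy_clampf(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hard-swish: x * clamp(x + 3, 0, 6) / 6. Used here only as a simple
 * non-linearity that does not commute with scaling. */
static float toy_hard_swish(float x) {
    return x * toy_clampf(x + 3.0f, 0.0f, 6.0f) / 6.0f;
}

/* Order used by the NVFP4 graph: MUL_MAT -> MUL (F32 scale) -> non-linearity. */
static float toy_scale_then_act(float y, float f32_scale) {
    assert(f32_scale > 0.0f);
    return toy_hard_swish(y * f32_scale);
}

/* Swapped order: MUL_MAT -> non-linearity -> MUL (F32 scale). */
static float toy_act_then_scale(float y, float f32_scale) {
    assert(f32_scale > 0.0f);
    return toy_hard_swish(y) * f32_scale;
}
```

For a MUL_MAT output of 2.0 and a scale of 0.5, the first order gives roughly 0.667 and the swapped order roughly 0.833. The two orders agree when the scale is exactly 1.0, which is the case in which such a mistake would go unnoticed.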
Leveraging Tensor Cores to accelerate NVFP4

To leverage NVFP4-HW-accelerators, one has to quantize incoming activations to NVFP4 via F4 = input / (F32 * F8), where F32 is an optional input parameter yielded during quantization (cf. ModelOpt golden reference section below).

ModelOpt as NVFP4 quant/dequant golden reference

NVIDIA's Model-Optimizer lib can be taken as the golden reference. FP32 <-> NVFP4 conversion is handled by NVFP4QTensor, specifically its quantize/dequantize class methods.

- In NVFP4QTensor.quantize, the path is as follows:
  - F32 is derived as amax(input) / (FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX) of the incoming tensor https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py#L250-L251 as a sensible heuristic.
  - F8 is derived per block as amax(block) / (F32 * FLOAT4_E2M1_MAX) of the incoming tensor https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py#L284-L28 (nobody currently provides/exports per-block F8 scales).
  - F4 = input / (F32 * F8)
- In NVFP4QTensor.dequantize, the class simply does dequantize = F4 * F8 * F32 https://github.com/NVIDIA/Model-Optimizer/blob/main/modelopt/torch/quantization/qtensor/nvfp4_tensor.py#L364-L375

Ways to resolve this

- Pass the F32 scales to GGML_OP_MUL_MAT along the following lines

graph LR
    A[NVFP4 weights] --> B{GGML_OP_MUL_MAT}
    C[FP32 activations] --> B
    D[F32 scale weights] --> B
    E[F32 scale activations] -.-> B
    B --> F{non-linearity, <br> e.g. SwiGLU}

where we would ensure that F32 scales have to be passed for weights of derived tensors. Unfortunately, we can't enforce F32 scale activations as they are optional (cf. weight-only quantization vs. …).

- The quantize_nvfp4/dequantize_row_nvfp4 references would still be broken as they don't consume/produce F32 scales according to the golden reference (and other backends will look here rather than at modelopt if they are to add support imo). I think the imatrix path is something one could lean on (as this imatrix maps to quant_weights, which we could (maybe) reuse to load/store F32 scales).

@ggml-org/maintainers Pointers/thoughts on the above are welcome. CC @ggerganov
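To make the ModelOpt path described above concrete, here is a minimal sketch of a reference quantize/dequantize pair that produces and consumes the F32 scale. It is neither the ggml nor the ModelOpt implementation, and all `toy_*` names are hypothetical. Two simplifying assumptions are made: the per-block F8 scale is kept as a plain float rather than rounded to E4M3, and an all-zero input falls back to F32 = 1.0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK    16
#define TOY_E2M1_MAX 6.0f
#define TOY_E4M3_MAX 448.0f

/* The eight non-negative magnitudes representable in E2M1. */
static const float toy_e2m1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

static float toy_absf(float v) { return v < 0.0f ? -v : v; }

static float toy_amax(const float *x, size_t n) {
    float m = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        if (toy_absf(x[i]) > m) m = toy_absf(x[i]);
    }
    return m;
}

/* Nearest E2M1 code: bit 3 is the sign, bits 0..2 index toy_e2m1. */
static uint8_t toy_to_e2m1(float v) {
    float   a    = toy_absf(v);
    uint8_t best = 0;
    for (uint8_t i = 1; i < 8; ++i) {
        if (toy_absf(a - toy_e2m1[i]) < toy_absf(a - toy_e2m1[best])) best = i;
    }
    return (uint8_t)((v < 0.0f ? 8 : 0) | best);
}

static float toy_from_e2m1(uint8_t code) {
    float m = toy_e2m1[code & 7];
    return (code & 8) ? -m : m;
}

/* F32 = amax(input) / (FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX). */
static float toy_global_scale(const float *x, size_t n) {
    float a = toy_amax(x, n);
    return a > 0.0f ? a / (TOY_E4M3_MAX * TOY_E2M1_MAX) : 1.0f; /* fallback is an assumption */
}

/* Quantizes one block of TOY_BLOCK values given F32.
 * Returns F8 = amax(block) / (F32 * FLOAT4_E2M1_MAX), kept as float here. */
static float toy_quantize_block(const float *x, float f32, uint8_t *codes) {
    assert(f32 > 0.0f);
    float f8 = toy_amax(x, TOY_BLOCK) / (f32 * TOY_E2M1_MAX);
    for (size_t i = 0; i < TOY_BLOCK; ++i) {
        codes[i] = f8 > 0.0f ? toy_to_e2m1(x[i] / (f32 * f8)) : 0; /* F4 = input / (F32 * F8) */
    }
    return f8;
}

/* dequantize = F4 * F8 * F32 */
static void toy_dequantize_block(const uint8_t *codes, float f8, float f32, float *out) {
    for (size_t i = 0; i < TOY_BLOCK; ++i) {
        out[i] = toy_from_e2m1(codes[i]) * f8 * f32;
    }
}
```

Note that F32 appears in both directions: a dequantize routine that only sees the codes and F8 cannot recover the original values unless F32 happens to be 1.0.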
Side notes/Trivia

- Use NVFP4QTensor.quantize with a block_size of 16 to get to the standard outlined here.
- F4 * F8 is handled by tensor cores, and the F32 scales of weights & activations are handled in the GEMM epilogues (see PTX doc, where F32 is absent - scale A and B of Fig 42 refer to F8).
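The split between the tensor-core part and the epilogue can be sketched on the CPU as a single dot product. This is only an arithmetic illustration, not a kernel. The `toy_*` names are hypothetical, and both the decoded F4 values and the per-block F8 scales are held as plain floats.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK 16

/* Dot product of one weight row with one activation column.
 * wq / aq:   decoded F4 values, nblocks * TOY_BLOCK entries each
 * w_s / a_s: one block scale (F8, held as float) per block
 * f32_w / f32_a: F32 scales of weights and activations */
static float toy_nvfp4_dot(const float *wq, const float *w_s,
                           const float *aq, const float *a_s,
                           size_t nblocks, float f32_w, float f32_a) {
    assert(wq != NULL && w_s != NULL && aq != NULL && a_s != NULL);
    float acc = 0.0f;
    for (size_t b = 0; b < nblocks; ++b) {
        float block = 0.0f;
        for (size_t i = 0; i < TOY_BLOCK; ++i) {
            block += wq[b * TOY_BLOCK + i] * aq[b * TOY_BLOCK + i];
        }
        /* Block-scaled accumulation: the F4 * F8 part. */
        acc += block * w_s[b] * a_s[b];
    }
    /* Epilogue: both F32 scales are applied once to the accumulated result. */
    return acc * f32_w * f32_a;
}
```

Because the F32 scales are constant over the whole reduction, applying them once at the end is mathematically the same as scaling every element first.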