ggml : add NVFP4 quantization type support #19769
Force-pushed from 9cd0f58 to 86dd3fc
As is clearly laid out in the llama.cpp contributing guidelines:
I would really love NVFP4 support and I appreciate the work done here, but as @JohannesGaessler has already mentioned, the ratio of maintainer-needed work to verified information is way too high with this PR. Please:
It would be great if nvfp4 could be stored in larger blocks that are at least a multiple of 4B (16B would be better).
I agree that memory alignment is relevant; as long as the tensor dimensions are multiples of e.g. 256, it should be feasible to permute the data upon load though (except for maybe CPU+GPU hybrid inference, where the overhead could be relevant).
Btw, @pwilkin these are not really necessary for NVFP4 - adding support for this data type would not depend on the outcome of these. They are good for sanity checks, but other than that do not matter much. The main use case of NVFP4 is to load models that are already trained in that format - not to quantize models with it. Regarding the alignment - I guess we can make blocks of 256, which would result in an alignment of 16 bytes. Though we risk not being able to load tensors with a dimension that is not a multiple of 256. There was the same dilemma for MXFP4, and gpt-oss unfortunately has shapes that are only divisible by 64 but not 256.
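As a quick back-of-the-envelope check on the "blocks of 256 → 16-byte alignment" arithmetic, here is a hypothetical super-block layout; the struct name and field order are illustrative only, not ggml's actual definitions:

```c
#include <stdint.h>

// Hypothetical 256-element NVFP4 super-block (not ggml's actual layout):
// 16 sub-blocks of 16 elements, one UE4M3 scale byte per sub-block,
// two 4-bit (E2M1) elements packed per byte.
typedef struct {
    uint8_t scales[16];      // 16 sub-block scales      -> 16 bytes
    uint8_t qs[256 / 2];     // 256 x 4-bit values       -> 128 bytes
} block_nvfp4_256;           // 144 bytes total, a multiple of 16

// A bare 16-element block is 1 + 8 = 9 bytes, which only guarantees 1-byte
// alignment; grouping into 256-element super-blocks is what buys the 16-byte
// alignment, at the cost of requiring dimensions divisible by 256.
// (The 64-element layout adopted later in the thread is 4 + 32 = 36 bytes,
//  hence the 4-byte alignment discussed further down.)
_Static_assert(sizeof(block_nvfp4_256) % 16 == 0, "16-byte alignable");
```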
NVFP4 also has a separate per-tensor float scale which this PR doesn't take into account, unless I'm wrong. Also, this whole PR is pretty much AI generated from what I can see. I had plans to add nvfp4 support after mxfp4, but another developer had promised to do it and since has not delivered, so I will also create a PR for nvfp4 support in the meantime.
@ggerganov I know, but I meant it exactly as a sanity check.
Yeah, I'm pretty frustrated, as I was also thinking about working on it and was hoping this PR goes somewhere, but it seems to be going nowhere so far :/
It's taken into account. And regarding AI: as mentioned in the PR, I leaned on AI and followed the principles and patterns applied in the MXFP4 PR. I'll remove the half-baked backend implementations and stick with NEON + generic CPU implementation for now. Again, this is a WIP which proves the concept and implements a lot of the boilerplate. I'll also increase the block size to 64.
It is not. Please see the f32 scale as presented here: https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/. As a reminder: you are supposed to know the content of the PR even if the PR is written with AI help. See the contributing guidelines.
Addressed these comments. Here are results for Qwen3-4B
Okay, not sure if that works, but if it does then it's great, since it simplifies the implementation quite a bit. The current state of your PR is not OK though; I see random changes in the CUDA and Vulkan code. Can you fix it?
Force-pushed from 5f8f21b to ffab58b
Thanks, I noticed that as well. The problem was a one-time thing from the shelving commit targeting an older master. The PR should be clean now.
@ORippler ping
Sorry for the delayed response, I was busy with #20391. I'd love for us to have 16-byte alignment via AoSoA (we already have an AoSoA (the first array of is simply the pointer to
How is this divisibility problem handled for other formats such as
TLDR: I'd love to get 16-byte alignment, but I know I am obviously late to the party (this PR has already been open for 3 weeks and has gone a long way). Since we can repack for the CUDA backend, I am fine if it's merged as is (though Vulkan and other IHVs that benefit from this alignment would miss out, as it would be a backend-specific implementation). Would still appreciate an answer to my points so I can learn and apply them during repacking (should repacking turn out to be a prerequisite for perf).
Can't seem to dismiss my review due to missing rights, but do consider it dismissed as stale.
I guess the mapping of models to quants is indeed currently sparse? Bummer.
That's exactly it - you cannot use this quant for that tensor then, which obviously is unacceptable for NVFP4. For other formats, another one that fits is chosen instead.
In theory, repacking in the backend should solve the problem. I guess the repack implementation could be shared by multiple backends to avoid duplicated work for the repacking. I guess we could hold off on merging this until we prototype it and make sure there aren't any surprises?
Ouch. :)
No worries, we don't have other alternatives either way, so if the repack does not work out we'll have to live with the 4-byte alignment.
Well, come to think of it, can we not have two NVFP4 quants? One with 16-byte alignment, and this one to fall back on if that won't fit?
Sounds like too much redundancy and extra complexity for not much benefit.
True, let's hope repacking pans out.
4-byte alignment is already quite good. Each CUDA thread reading 4 bytes in a warp leads to a 128-byte transaction, which is ideal.
* WIP: add NVFP4 quantization support
* tests
* improve NVFP4 dot product implementation performance and fix bad super call
* typo
* Use nvfp4 kvalues
* vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table
* vulkan and perf fixes
* wip
* Fix metal
* fix vulkan
* Rename threshold & fix wrong scale
* Fix MOE
* Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD)

  Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently.

  Reverted files:
  - ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh
  - ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp
  - ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/*
  - ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c

  Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained.
* Fix arch-fallback.h: add NVFP4 generic fallback for all platforms

  After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions.
* quantize: add NVFP4 as a quantization type option
* Fix ggml_fp32_to_ue4m3: handle subnormal values

  Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range.

  Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32.

  Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15).
* Restore ARM NEON NVFP4 dot product implementation

  Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products.

  tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup
* Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq

  - Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop
  - Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32
  - Accumulate with vfmaq_f32 into float32x4_t vector accumulators

  tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed)
* ARM NEON NVFP4: rearrange q8 to match nibble layout

  Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles.

  Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions.
* CPU only backend 64 super-block layout
* cleanup
* Remove unused LUT
* int
* exclude NVFP4 from unsupported ops in metal build
* remove quantization for now
* store scales as native UE4M3, preserve original model bits when possible
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* correct comment
* format
* reduce duplication and cleanup
* Address comments
* move detection to prepare_tensors
* Use math instead of const
* Move
* fix comment
* Shelf quantize tests
* Rebase and move check
* cleanup
* lint
* Update gguf-py/gguf/scripts/gguf_convert_endian.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Use fallback quant config
* Simplify

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* organize
* Refactor
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update convert_hf_to_gguf.py

  Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* add quantize_nvfp4 (required for test_quants.py)
* fix return type

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
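To make the `ggml_fp32_to_ue4m3` subnormal fix described in the commit above concrete, here is a hedged sketch of the encode path it describes; the UE4M3 convention (4 exponent bits with bias 7, 3 mantissa bits, no sign) and the rounding details are assumptions of the sketch, not the actual ggml code:

```c
#include <math.h>
#include <stdint.h>

// UE4M3 convention assumed here:
//   normal    (exp > 0):  (1 + man/8) * 2^(exp - 7), smallest normal = 2^-6 ~= 0.0156
//   subnormal (exp == 0):  man * 2^-9,                covering ~0.002 .. 0.0137
// Typical NVFP4 block scales (amax/6.0 ~ 0.001..0.01) fall below 2^-6, so
// clamping exp <= 0 to zero wiped them out; rounding into the subnormal
// range, as the commit describes, fixes that.
static uint8_t fp32_to_ue4m3_sketch(float v) {
    if (!(v > 0.0f)) {
        return 0;
    }
    if (v < 0x1p-6f) {                          // subnormal range: exponent field = 0
        int man = (int) lroundf(v * 0x1p9f);    // round v / 2^-9 to nearest
        if (man > 7) man = 7;
        return (uint8_t) man;                   // bits: 0 0000 mmm, man = 1..7 for typical scales
    }
    int e;
    const float m = frexpf(v, &e);              // v = m * 2^e, m in [0.5, 1)
    int exp = e - 1 + 7;                        // biased exponent for a mantissa in [1, 2)
    if (exp > 15) exp = 15;                     // crude saturation; real code clamps to the max finite value
    int man = (int) lroundf((m * 2.0f - 1.0f) * 8.0f);
    if (man > 7) man = 7;                       // keep the sketch simple: no carry into the exponent
    return (uint8_t) ((exp << 3) | man);        // bits: 0 eeee mmm
}
```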
* 'master' of github.com:ggml-org/llama.cpp: (33 commits)
  - convert : better mtp check and fix return [no ci] (ggml-org#20419)
  - vulkan: fix SSM_CONV PP scaling with large ubatch sizes (ggml-org#20379)
  - New conversations now auto-select the first loaded model (ggml-org#20403)
  - ggml-virtgpu: Fix some build commands (ggml-org#20341)
  - metal : avoid divisions in bin kernel (ggml-org#20426)
  - ci: Setup self-hosted CI for Intel Linux Vulkan backend (ggml-org#20154)
  - vulkan: fix l2_norm epsilon handling (ggml-org#20350)
  - vulkan: fix OOB check in flash_attn_mask_opt (ggml-org#20296)
  - vulkan: Fix ErrorOutOfHostMemory on Intel GPU when loading large models with --no-mmap (ggml-org#20059)
  - opencl: use larger workgroup size for get_rows (ggml-org#20316)
  - opencl: add cumsum op (ggml-org#18981)
  - hip: compile debug builds with -O2 on hip to avoid a compiler bug (ggml-org#20392)
  - common/parser: add GigaChatV3/3.1 models support (ggml-org#19931)
  - model : add support for Phi4ForCausalLMV (ggml-org#20168)
  - graph : add optional scale parameter to build_lora_mm [no ci] (ggml-org#20427)
  - common : fix --n-cpu-moe, --cpu-moe for models with fused gate + up (ggml-org#20416)
  - ggml-webgpu: Add supports for `GGML_OP_REPEAT` (ggml-org#20230)
  - llama : enable chunked fused GDN path (ggml-org#20340)
  - llama : whitespace cleanup (ggml-org#20422)
  - ggml : add NVFP4 quantization type support (ggml-org#19769)
  - ...
For synchronous data copies I agree; for asynchronous copies, chunks of 16 bytes work better in my experience.
I've got the current version working with CUDA, converting to packed SoA (without 4/6 or any fancy stuff), but it's not as fast as it should be (about 13,000 tk/s on Qwen3-4B). Should I post it anywhere, or do we have a thread to discuss follow-up NVFP4 tasks? I'm also having issues converting models and have fixes for the py script. Hope I can contribute something. Thanks
@michaelw9999 I think individual PRs. Small, isolated ones. If improvements are incremental, they should rather be separate PRs IMO. For example, one with basic CUDA support, one for 4/6 and maybe some fancy stuff, etc.
The CUDA code should have the following pieces for basic support: NVFP4 dequantization + cuBLAS, MMVQ support, MMQ support via dp4a, MMQ support via tensor cores. For new contributors, please submit these only as individual and self-contained PRs; for more experienced contributors I think it's fine to do multiple things at once. Fancy stuff should come after that, with evidence that it is an improvement.
Thanks very much for the NVFP4 work!! I found two very interesting NVFP4 models on huggingface:
I tried to convert them to gguf, but both failed.
I was just wondering if these are the kind of models that are intended to work with the NVFP4 support I have seen going into llama.cpp in the last days. If yes, I think I might have a go at trying to figure out why they fail. Not sure I will be able to find out how to fix it, but I'm eager to get my new expensive GPU to run at its best...
Hello, I am getting the error "Quant method is not yet supported: 'modelopt'" when trying to convert NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 (https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4/) to .gguf. Error log: #20411 (comment)
Seems they have per-tensor
Most likely stating the obvious: for the MMVQ and MMQ dp4a paths, it makes sense to do computations in BF16/FP16, as throughput is equal for FP and ALU in CUDA cores and we can save the I2F conversion via fp4 intrinsics (on the hardware that supports those, of course). Just wanted to point this out, as the CPU path in this PR does ALU followed by I2F.
4 bytes is the minimum we need to be able to issue LDGSTS via
Regarding MMVQ: currently the activations are unconditionally converted to q8_1; if we intend to use floating-point math, we will need to extend this. More generally, if we add a path using floating-point math, it may make sense to use it for small matrices to remove the overhead from quantizing the activations. This table doesn't seem to list the throughput of
I'm not super experienced with the ggml/gguf internals, so feedback is very welcome. Note on AI usage: Claude Opus 4.6 was used for navigating the codebase, debugging, and writing parts of the code. All changes have been reviewed and tested manually. Open to reworking anything that doesn't meet the project's standards.
This adds support for NVIDIA's NVFP4 quantization format (FP4 E2M1 weights, UE4M3 per-block scale, 16 elements per block). This is the format produced by NVIDIA ModelOpt's NVFP4 algo. The main difference from MXFP4 is the scale encoding (UE4M3 vs E8M0).
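For readers less familiar with the format, here is a minimal sketch of what a 16-element NVFP4 block and its dequantization look like under the description above. All names are hypothetical (not the identifiers this PR adds), and the nibble order and UE4M3 decode convention are assumptions of the sketch; ModelOpt's separate per-tensor fp32 scale, discussed in the review thread, is not shown.

```c
#include <math.h>
#include <stdint.h>

#define QK_NVFP4 16

// Hypothetical block layout: 16 FP4 (E2M1) weights plus one UE4M3 scale byte.
typedef struct {
    uint8_t d;                  // per-block scale, UE4M3 encoded
    uint8_t qs[QK_NVFP4 / 2];   // 16 x 4-bit E2M1 values, two per byte
} block_nvfp4_sketch;           // 9 bytes per 16 weights = 4.5 bits per weight

// E2M1 represents {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit (bit 3).
static const float kvalues_nvfp4[16] = {
     0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
    -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// UE4M3 decode: 4 exponent bits (bias 7), 3 mantissa bits, no sign;
// exp == 0 is the subnormal range man * 2^-9.
static float ue4m3_to_fp32_sketch(uint8_t x) {
    const int exp = (x >> 3) & 0x0F;
    const int man =  x       & 0x07;
    return exp == 0 ? (float) man * 0x1p-9f
                    : (1.0f + (float) man / 8.0f) * ldexpf(1.0f, exp - 7);
}

// Dequantize one block: weight = E2M1 value * block scale.
// The low-nibble-first element order is an assumption of this sketch.
static void dequantize_block_nvfp4_sketch(const block_nvfp4_sketch * b, float * y) {
    const float d = ue4m3_to_fp32_sketch(b->d);
    for (int i = 0; i < QK_NVFP4 / 2; ++i) {
        y[2*i + 0] = kvalues_nvfp4[b->qs[i] & 0x0F] * d;  // low nibble
        y[2*i + 1] = kvalues_nvfp4[b->qs[i] >>   4] * d;  // high nibble
    }
}
```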
What's in here:
Tested with models from https://huggingface.co/NVFP4 on an Apple M5 MacBook (CPU, NEON). Ran llama-bench and a basic server smoke test. Would appreciate help with that if someone has a good baseline to compare against.
Here is a Qwen3-4B model to test with.