Unified delta net handling for Qwen3Next and Kimi Linear models#18792
pwilkin wants to merge 8 commits into ggml-org:master from
Conversation
Just a note: while working on #18683, I have been wondering whether g should be pre-broadcast to [S_k, H_v, n_tokens, n_seqs] before entering this function (to make it the same shape as q and k). A broadcast should be fast and shouldn't hurt performance much.
Probably we can play around with that idea, or you can reshape it to [1, n_tokens, H_k, n_seqs] as I suggest in the comments below.
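For intuition, the pre-broadcast can be pictured in numpy (note that ggml lists dims fastest-first while numpy lists them slowest-first; the sizes here are illustrative, not from the actual graph):

```python
import numpy as np

# Illustrative sizes only. ggml [H_v, n_tokens, n_seqs] is numpy shape
# (n_seqs, n_tokens, H_v).
S_k, H_v, n_tokens, n_seqs = 4, 2, 3, 1
g = np.random.rand(n_seqs, n_tokens, H_v)

# Pre-broadcast g to ggml shape [S_k, H_v, n_tokens, n_seqs], i.e. the same
# shape as q and k -- in numpy this is a view, so it costs almost nothing:
g_b = np.broadcast_to(g[..., None], (n_seqs, n_tokens, H_v, S_k))
```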
I think the file name should be graph-context-delta.cpp to match the graph-context-mamba.cpp naming
```cpp
    g = ggml_cont_4d(ctx0, ggml_permute(ctx0, g, 0, 2, 1, 3), S_k, n_tokens, H_k, n_seqs);
} else {
    // GDA: g [H_v, n_tokens, n_seqs] -> [n_tokens, 1, H_k, n_seqs]
    g = ggml_cont_4d(ctx0, ggml_permute(ctx0, g, 2, 0, 3, 1), n_tokens, 1, H_k, n_seqs);
```
I think if g is reshaped to [1, n_tokens, H_k, n_seqs], then a large part of the logic below can be reused between KDA and GDA (see comments below)
src/models/delta.cpp
Outdated
```cpp
    g = ggml_pad(ctx0, g, 0, pad, 0, 0);
} else {
    // GDA: g shape [n_tokens, 1, H_k, n_seqs] -> pad along dim 0
    g = ggml_pad(ctx0, g, pad, 0, 0, 0);
```
First off, I think this branch can be removed if g has shape [1, n_tokens, ...], so we pad along dim 1.
```cpp
beta = ggml_reshape_4d(ctx0, beta, 1, chunk_size, n_chunks, H_k * n_seqs);

// Reshape g for chunks
ggml_tensor * g_cumsum;
```
Suggested change:

```cpp
ggml_tensor * g_cumsum;
ggml_tensor * g_cumsum_t;
```
Since we need both versions, it can be a good idea to get the transposed version right here.
For the GDA branch, a transpose will be a simple reshape as the first dim is [1, n_tokens], so no need for ggml_cont
In other words, given a tensor A with shape: [n, 1, ...], then A.view(1, n, ...) == A^T
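That identity can be checked quickly in numpy (writing ggml's fastest-first shape [n, 1, rest] as numpy's slowest-first (rest, 1, n); sizes are arbitrary):

```python
import numpy as np

rest, n = 4, 5
# ggml shape [n, 1, rest] == numpy shape (rest, 1, n)
A = np.random.rand(rest, 1, n)

# Viewing the same buffer as ggml [1, n, rest] (numpy (rest, n, 1)) gives the
# transpose of the first two ggml dims with no data movement, i.e. no ggml_cont:
A_view = A.reshape(rest, n, 1)
A_t = np.swapaxes(A, -1, -2)
```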
src/models/delta.cpp
Outdated
```cpp
// Cumsum along chunk_size dimension (ne[1])
// GGML cumsum operates on ne[0], so we need to transpose, cumsum, transpose back
g = ggml_cont(ctx0, ggml_transpose(ctx0, g)); // [chunk_size, S_k, n_chunks, H_k * n_seqs]
g_cumsum = ggml_cumsum(ctx0, g);
```
Just a quick note-to-self: we should probably support a column-wise version of ggml_cumsum, which would eliminate some transposes in the future. Another idea: support non-contiguous tensors in ggml_cumsum.
I think it should work like cumsum in PyTorch, where you can specify which dimension to cumsum over.
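For reference, here is what the axis-parametrized cumsum looks like in numpy/PyTorch terms (a sketch of the proposed behavior, not the actual ggml API):

```python
import numpy as np

g = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])

# Cumulative sum along a chosen axis, like torch.cumsum(g, dim=1). An
# axis-aware ggml_cumsum would make the transpose/cont/transpose dance
# in the snippet above unnecessary.
cs = np.cumsum(g, axis=1)
```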
src/models/delta.cpp
Outdated
```cpp
// GDA: Use decay mask approach (g broadcasts over K dimension)
// g_cumsum [chunk_size, 1, n_chunks, H_v * n_seqs]
ggml_tensor * gcs_i = g_cumsum;
ggml_tensor * gcs_j = ggml_reshape_4d(ctx0, g_cumsum, 1, chunk_size, n_chunks, H_v * n_seqs);
```
this gcs_j should be equivalent to g_cumsum_t (or just g_cumsum, depending on which shape of g you consider to be the transposed version)
then g_exp_pos = ggml_exp(ctx0, g_cumsum_t) can be computed directly here
src/models/delta.cpp
Outdated
```cpp
if (is_kda) {
    // KDA: Reuse g_exp_pos computed earlier
    gexp = g_exp_pos;
} else {
    // GDA: g_cumsum [chunk_size, 1, n_chunks, H_k * n_seqs]
    ggml_tensor * g_cumsum_t = ggml_cont(ctx0, ggml_transpose(ctx0, g_cumsum));
    gexp = ggml_exp(ctx0, g_cumsum_t);
}
```
this can be removed when you apply my last trick above
```cpp
    ggml_tensor * g_diff = ggml_sub(ctx0, g_last_broadcast, g_cumsum);
    g_diff_exp = ggml_exp(ctx0, g_diff);
} else {
    // GDA: g_cumsum [chunk_size, 1, n_chunks, H_k * n_seqs]
```
I'm not 100% sure, but seems like this can be removed too, as we now have both g_cumsum and g_cumsum_t that you can play with
```cpp
} else {
    // GDA: g_last_exp [1, 1, n_chunks, H_k * n_seqs]
    // Broadcasts over both K and V dimensions
    gexp_last_chunk = ggml_reshape_4d(ctx0, gexp_last_chunk,
```
we can avoid this branching if g_last_exp is already broadcast
Thanks for your refactoring effort. I think my kda_autoregressive is better implemented, as I used mul_mat to replace sum_rows. If we refactor, the new function should be based on kda_autoregressive.
@ymcki indeed your version is better :) there's another ~4% performance gain on autoregressive passes in Qwen3Next.
This code in the chunking function will cause overflow without clamping. You either have to clamp, or you have to use my mul_mat trick for an exact solution. My mul_mat trick:
@ymcki aight, migrated the KDA branch to use the decay mask as well.
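The decay-mask idea in miniature, as a numpy sketch (not the ggml implementation): per-token log-decays are non-positive, so exponentiating the *difference* of cumulative sums stays in (0, 1], whereas dividing two exponentials of large-magnitude cumsums can overflow or underflow:

```python
import numpy as np

chunk_size = 4
rng = np.random.default_rng(0)
g = -rng.random(chunk_size)   # per-token log-decay, <= 0
gc = np.cumsum(g)

# Decay mask D[i, j] = exp(gc[i] - gc[j]) for i >= j, else 0.
# For i >= j the exponent is a partial sum of non-positive g's, so exp() <= 1.
idx = np.arange(chunk_size)
D = np.exp(gc[:, None] - gc[None, :]) * (idx[:, None] >= idx[None, :])
```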
I think my Kimi Linear PR is almost done, so I can start working on refactoring now. Do we want to do the refactoring along with block matrix multiplication? The idea is that since we don't care about the upper triangle in Akk and Aqk, we can take bigger blocks and divide them into chunk-size-64 blocks. For example, if we handle n_seq_tokens > 192, we can pad to 256 and then break it down into a 4x4 grid of 64x64 blocks. Then we only need to do mul_mat on 10/16 blocks and apply diag_mask only on the diagonal blocks, i.e. 4/16 blocks. If we only refactor, then maybe only Kimi will be a few % faster. If we include block mul_mat, then both Qwen3Next and Kimi will see significant gains.
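The block-triangular idea can be sketched in numpy (sizes from the comment above: pad to 256, 64x64 blocks, 10 of 16 blocks computed, causal mask only on the 4 diagonal blocks; this is an illustration, not the ggml code):

```python
import numpy as np

T, B = 256, 64                 # padded sequence length, block size
nb = T // B                    # 4x4 grid of blocks
q = np.random.rand(T, 8)
k = np.random.rand(T, 8)

A = np.zeros((T, T))
computed = 0
for bi in range(nb):           # block row
    for bj in range(bi + 1):   # only lower-triangular blocks: 10 of 16
        blk = q[bi*B:(bi+1)*B] @ k[bj*B:(bj+1)*B].T
        if bi == bj:           # only diagonal blocks need the causal mask
            blk = np.tril(blk)
        A[bi*B:(bi+1)*B, bj*B:(bj+1)*B] = blk
        computed += 1
```

The off-diagonal lower blocks are entirely below the main diagonal, so they need no masking at all; that is where the mul_mat savings come from.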
@ymcki Sure, we can try; it sounds like a good idea at least in theory. Let's see what we can get out of it in practice.
Implemented a version that breaks the 64x64 Akk/Aqk chunks into 4x4 16x16 blocks. About 16% pp gain. I will try to see if I can implement a version that starts with 256x256 Akk/Aqk shards and then breaks them into 4x4 64x64 chunks. If I fail to implement this new version, I think this 16% gain version is still pretty good.

Original code: pp 725 t/s, tg 34 t/s

```
./build/bin/llama-bench -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct-jp-imatrix.Q2_K.gguf -n 32 -d 8192 -b 64,128,256,512,1024,2048,4096,8192,16384 -ngl 100
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
```

| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 64 | pp512 @ d8192 | 511.71 ± 1.51 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 64 | tg32 @ d8192 | 34.07 ± 0.16 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 128 | pp512 @ d8192 | 515.66 ± 1.31 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 128 | tg32 @ d8192 | 34.03 ± 0.21 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 256 | pp512 @ d8192 | 638.60 ± 1.90 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 256 | tg32 @ d8192 | 33.95 ± 0.19 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 512 | pp512 @ d8192 | 729.91 ± 7.39 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 512 | tg32 @ d8192 | 34.06 ± 0.11 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 1024 | pp512 @ d8192 | 726.15 ± 6.85 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 1024 | tg32 @ d8192 | 33.98 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 2048 | pp512 @ d8192 | 725.90 ± 7.98 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 2048 | tg32 @ d8192 | 33.85 ± 0.28 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 4096 | pp512 @ d8192 | 725.57 ± 6.77 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 4096 | tg32 @ d8192 | 34.01 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 8192 | pp512 @ d8192 | 722.58 ± 7.37 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 8192 | tg32 @ d8192 | 34.01 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 16384 | pp512 @ d8192 | 720.68 ± 6.48 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 16384 | tg32 @ d8192 | 33.97 ± 0.26 |

build: e87ac9b (7816)

Breaks 64x64 Akk/Aqk chunks into 4x4 16x16 blocks: pp 840 t/s, tg 34 t/s

```
./build/bin/llama-bench -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct-jp-imatrix.Q2_K.gguf -n 32 -d 8192 -b 64,128,256,512,1024,2048,4096,8192,16384 -ngl 100
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
```

| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 64 | pp512 @ d8192 | 509.68 ± 3.69 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 64 | tg32 @ d8192 | 34.08 ± 0.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 128 | pp512 @ d8192 | 538.52 ± 0.83 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 128 | tg32 @ d8192 | 34.05 ± 0.12 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 256 | pp512 @ d8192 | 682.39 ± 1.46 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 256 | tg32 @ d8192 | 34.07 ± 0.16 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 512 | pp512 @ d8192 | 844.40 ± 14.26 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 512 | tg32 @ d8192 | 33.89 ± 0.16 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 1024 | pp512 @ d8192 | 842.77 ± 13.82 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 1024 | tg32 @ d8192 | 33.76 ± 0.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 2048 | pp512 @ d8192 | 841.18 ± 15.00 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 2048 | tg32 @ d8192 | 33.76 ± 0.16 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 4096 | pp512 @ d8192 | 841.19 ± 14.22 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 4096 | tg32 @ d8192 | 33.65 ± 0.36 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 8192 | pp512 @ d8192 | 838.83 ± 14.66 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 8192 | tg32 @ d8192 | 33.76 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 16384 | pp512 @ d8192 | 838.24 ± 12.62 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 16384 | tg32 @ d8192 | 33.79 ± 0.12 |

build: e87ac9b (7816)
Just discovered that this 4x4 16x16 blocks version reduces the CUDA0 compute buffer size from 2397.39MB to 1432.55MB, so that I can increase context from 96k to 160k running IQ3_M on my 3090.
Somehow managed to move ggml_solve_tri inside the loop computing Akk and Aqk. This further improves pp to 860 t/s, i.e. an 18.6% gain. However, the CUDA0 compute buffer running IQ2_M @ 400k context increases from 1432.55MB to 1512.52MB. Let me see if I can optimize it further. I think optimization is probably better focused on memory saving than on speed; running more context is way more important to users than a few % increase in pp speed. Can llama-bench also display CUDA0 compute buffer usage?

solve_tri inside Akk loop: pp 860 t/s, tg 32.6 t/s

```
./build/bin/llama-bench -m ~/Kimi-Linear-48B-A3B-Instruct-GGUF/Kimi-Linear-48B-A3B-Instruct-jp-imatrix.Q2_K.gguf -n 32 -d 8192 -b 64,128,256,512,1024,2048,4096,8192,16384 -ngl 100
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
```

| model | size | params | backend | ngl | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 64 | pp512 @ d8192 | 413.93 ± 9.46 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 64 | tg32 @ d8192 | 32.42 ± 0.14 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 128 | pp512 @ d8192 | 679.76 ± 1.83 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 128 | tg32 @ d8192 | 32.36 ± 0.25 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 256 | pp512 @ d8192 | 796.51 ± 6.17 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 256 | tg32 @ d8192 | 32.27 ± 0.71 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 512 | pp512 @ d8192 | 863.52 ± 17.90 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 512 | tg32 @ d8192 | 32.51 ± 0.24 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 1024 | pp512 @ d8192 | 863.01 ± 17.10 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 1024 | tg32 @ d8192 | 32.68 ± 0.10 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 2048 | pp512 @ d8192 | 860.94 ± 16.71 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 2048 | tg32 @ d8192 | 32.65 ± 0.16 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 4096 | pp512 @ d8192 | 859.43 ± 17.37 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 4096 | tg32 @ d8192 | 32.68 ± 0.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 8192 | pp512 @ d8192 | 859.30 ± 17.15 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 8192 | tg32 @ d8192 | 32.60 ± 0.17 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 16384 | pp512 @ d8192 | 858.63 ± 17.41 |
| kimi-linear 48B.A3B Q2_K - Medium | 16.78 GiB | 49.12 B | CUDA | 100 | 16384 | tg32 @ d8192 | 32.58 ± 0.21 |

build: e87ac9b (7816)
@ymcki can you upload it somewhere so I can take a look?
Dear all, I have opened a PR to pwilkin's repo that modifies his code to work with Kimi Linear. @pwilkin, please test it with Qwen3Next, as I don't have the resources to test it properly. This code is based on slightly earlier llama.cpp code that doesn't break my Kimi Linear MLA code. I think we can work on this version first and see what optimizations can be done.

Let me try. Any tips or things I should watch out for?

Just run it with the new Qwen3Next GGUFs and see if they work. If you like, you can also run Kimi Linear. If they both work to your satisfaction, then it should be OK. In this unified version, they should share quite a lot of code.
Since my PR was merged, I also uploaded a 4x4 16x16 block computation of Akk and Aqk to my repo. I believe this is also how the Kimi Linear team did it at:

Based on my test, it achieved about a 20% speed-up in pp and 25% VRAM saving. I presume if @pwilkin can do the same for Qwen3Next, it should see similar pp and VRAM gains.

They also managed to put the solve_tri code inside the loop:

I also have an implementation of this, but it is not satisfactory enough, so it is not published for now. At first glance, this optimization doesn't bring as much gain while taking more VRAM, so we probably don't miss too much for now. I will now update my delta_net repo to sync with the latest code and then send another PR to pwilkin's repo.
Done updating to the latest code. Oops, I previously submitted the PR to the wrong place. Now it should be OK.
```cpp
// Equivalence to previous version:
// Previous: kv_mem = sum_k(state * k) using elementwise mult + sum_rows
// Current:  k_state = state_t @ k_t using matrix multiplication
// These are equivalent because: sum_k(A * B) = A @ B when dimensions align
ggml_tensor * state_t = ggml_cont(ctx0, ggml_transpose(ctx0, state));
ggml_tensor * k_t     = ggml_reshape_4d(ctx0, k, S_k, 1, H_k, n_seqs);
ggml_tensor * k_state = ggml_mul_mat(ctx0, state_t, k_t);
```
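The equivalence claimed in the comments above can be checked in numpy with illustrative per-head shapes (the actual ggml layout differs, but the contraction is the same):

```python
import numpy as np

S_k, S_v = 6, 5
rng = np.random.default_rng(1)
state = rng.random((S_v, S_k))   # one head's recurrent state (illustrative layout)
k = rng.random(S_k)

# Previous: elementwise multiply, then sum over the key dimension (sum_rows).
kv_mem_sum = (state * k).sum(axis=-1)
# Current: the same contraction expressed as a matrix-vector product (mul_mat).
kv_mem_mm = state @ k
```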
I think my version of the autoregressive part in #19375 is better - it keeps the ggml_sum_rows variant. The idea is to transpose the state at the start, and then the rest of the operations line up.
But in my tests, the mul_mat variant did increase pp by about 10%. Let me check both versions and I'll tell you.
The AR path should not affect the pp.
Ah, this is in the autoregressive path... sorry, wrong fragment then.
Anyway, I'll redownload the IQ2_S quant and check :>
@ggerganov fixed it, methinks; PPL is back to normal. Ran a long context check:
Someone at r/LocalLlama reported a crash when running Vulkan (-fit on) with the main branch code. Can someone take a look and see what's going on?

```
llama.cpp\ggml\src\ggml-backend.cpp:809: pre-allocated tensor (cache_k_l15) in a buffer (Vulkan1) that cannot run the operation (NONE)
```
Compiled a Vulkan llama.cpp for my 3090. I can't replicate the reported crash. However, while the main branch works, my block implementation generates gibberish on Vulkan, so I will look into it and see what's going on.
Replaced ggml_acc with ggml_set and now my block implementation works on CPU, CUDA and Vulkan. Probably something is wrong with the acc implementation in Vulkan?
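For anyone skimming, the semantic difference between the two ops in numpy terms: ggml_set overwrites a region of the destination while ggml_acc adds into it, so they only coincide when the target region starts at zero or is written exactly once (presumably why the swap is safe here):

```python
import numpy as np

dst = np.ones(4)
src = np.array([10.0, 20.0])

set_res = dst.copy()
set_res[1:3] = src        # ggml_set-style: overwrite the view
acc_res = dst.copy()
acc_res[1:3] += src       # ggml_acc-style: accumulate into the view
```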
Try to add tests to |
Added this test case to emulate my code. It does indeed fail in Vulkan but not in CUDA in some cases. What's next?

CUDA test:

```
./build/bin/test-backend-ops test -o ACC
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Testing 2 devices
Backend 1/2: CUDA0
  ACC(type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1]): OK
```

Vulkan test:

```
./build/bin/test-backend-ops test -o ACC -b Vulkan1
ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 3050 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
ggml_vulkan: 1 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
Testing 3 devices
Backend 1/3: Vulkan0
  ACC(type=f32,ne_a=[256,17,1,1],ne_b=[256,16,1,1]): OK
Failing tests:
```
Make a separate PR to master that adds these tests so that we can fix the backends that are currently failing. |
PR submitted with a fix |
This branch still produces wrong results compared to |
Ah, sorry, forgot to push the fix on this branch. Should be OK now. |
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
I noticed that about 1 in 4 responses are repetitive, incomplete, and sometimes contain Chinese characters when I run llama-server in parallel. However, with llama-completion it is mostly perfect. Seems fixed with this PR.
Closing as obsoleted. |
Refactoring in preparation for #18755
Tested on CUDA - no performance regressions compared to @ngxson's optimized version.
AI Usage: yes. Opus 4.5.