Comparing changes
base repository: mudler/parakeet.cpp
base: v0.1.2
head repository: mudler/parakeet.cpp
compare: v0.2.0
- 11 commits
- 61 files changed
- 5 contributors
Commits on Jun 6, 2026
-
feat: banded local (Longformer) attention - fix O(T^2) long-audio OOM (#9)

* feat(attn): banded local (Longformer) attention - O(T*window), NeMo-faithful

Adds NeMo rel_pos_local_attn (RelPositionMultiHeadAttentionLongformer) as a memory-bounded banded attention. This is the kernel for fixing the O(T^2) attention blowup that OOM'd long-audio offline transcription on unified-memory GPUs (a ~20-min clip allocated ~100GB and took the node down).

- RelPosAttention::build_graph_local / forward_local: banded attention via pad-and-shift, peak memory O(T*window) instead of O(T*T). Each query attends only to keys in [t-att_left, t+att_right]; the positional term (q_v . p^T over the 2W+1 local pos) is added 1:1 to the banded content scores, exactly as NeMo combines them. Verified against NeMo's own sliding_chunks_matmul_qk/pv (col->key t-w+c to 1e-6) and a deterministic band reference (1.4e-3).
- local_rel_pos_encoding: NeMo LocalAttRelPositionalEncoding (positions +att_left..-att_right), bit-identical to the centre rows of the full table.
- pk::last_graph_alloc_bytes(): gallocr high-water accessor for the memory test.
- gen_nemo_baseline.py --att-context-size (local-attention baseline); and gen_band_ref.py for the deterministic band reference.
  NOTE: NeMo's longformer is non-deterministic on short clips (sliding_chunks_matmul_pv reads uninitialized memory at boundaries via F.pad value=-1 + as_strided; two identical forward() calls differ by >1e3), so kernel parity must use the deterministic reference; end-to-end NeMo quality is anchored by long-audio WER.
- Tests: test_relpos_attention_local (parity 1.4e-3) and test_relpos_attention_local_memory (alloc grows ~linearly, ratio 1.98).

Not yet wired into the offline encoder path; that is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* feat(encoder): use banded local attention for long audio (fixes O(T^2) OOM)

Wires the banded rel_pos_local_attn kernel into the offline encoder so long audio no longer allocates O(T^2) attention, which OOM'd unified-memory GPUs (a ~17-min clip drove ~100GB and took the node down).

- ConformerLayer::build_graph gains optional att_left/att_right; when set it routes self-attention to RelPosAttention::build_graph_local with a LOCAL positional encoding, else keeps full attention unchanged.
- Encoder::forward picks the window via local_attn_window(Tp): env PARAKEET_ATT_CONTEXT=W forces NeMo rel_pos_local_attn [W,W]; otherwise audio longer than ~11 min (>8192 encoder frames) auto-switches to W=128. Short audio keeps full attention (NeMo-exact; the encoder parity test is unchanged).
- backend.cpp: bump kGraphSize 16384->65536, because the pad-and-shift kernel adds O(window) graph-node descriptors per layer.

Verified end-to-end on a 16.6-min clip with tdt-0.6b-v3 (CPU, 16 threads):

  full attention:  151 s, 55.4 GB peak RSS
  banded (W=16):    41 s,  9.1 GB peak RSS (coherent transcript)

~6x less memory and ~3.7x faster; the full-attention path is what hit ~100GB and OOM'd. Short-clip transcripts: W=128 == full byte-for-byte; W=16 essentially identical.

Note: pad-and-shift creates O(window) nodes and an O(window^2) incremental concat. That is fine for small windows but slow for W=128 on CPU; an efficient chunk-matmul construction (like NeMo's sliding_chunks) is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* fix(encoder): cap local-attention window to the node budget (no small-model regression)

Bumping kGraphSize 16384->65536 to fit a W=128 banded graph regressed small models ~+22% (tdt_ctc-110m): the per-compute metadata context and graph hash-set scale with kGraphSize. Revert to 16384 and instead cap the local-attention window at W=32: the pad-and-shift kernel adds ~6*(2W+1) graph nodes/layer, and W<=32 fits every shipped model's encoder within the budget. PARAKEET_ATT_CONTEXT is clamped to 32.

Regression bench (librispeech, 100 files, CPU, back-to-back):

  tdt_ctc-110m: master 19.5s vs banded 19.4s (within noise), 0/100 text diffs
  tdt-0.6b-v3:  0/100 text diffs

Long-audio fix intact: 16.6-min clip + tdt-0.6b-v3 auto-uses W=32 -> 48s, 9.4 GB peak RSS (vs full attention 151s / 55.4 GB). Lifting the window cap to NeMo's [128,128] needs the efficient chunk-matmul construction (O(1) graph nodes); that is a follow-up.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* feat(attn): banded local attention in the batched encoder (B>1)

Mirror the B=1 banded path (build_graph_local) into the fused batched encoder so long-audio batches also use NeMo rel_pos_local_attn (O(T*window)) instead of full O(T^2) attention.

RelPosAttention::build_graph_batched_local builds the 4D ([dk,T,H,B]) pad-and-shift band: K/V padded on the time axis, per-window-column views, sum_rows content scores + mul_mat positional scores (shared pos broadcast over B), a per-item band mask [P,T,1,B] keyed on each item's valid_len, soft_max over the window, then the context gather and head merge. Conformer build_graph_batched and the batched encoder forward route to it when att_left/att_right >= 0, with the shared LOCAL positional encoding.

Verified on dgx (tdt_ctc-110m): the new test_encoder_batch_local exercises the path at the production window (W=32 = kMaxLocalWindow). item0 (the full clip) is bit-exact beside its shorter padded neighbour (no cross-item leak), and the padded item1 matches its standalone run within 5e-2/5e-2, the same tolerance the full-attention batch test uses. Tighter-than-production windows only amplify float noise on near-zero activations of the padded clip (item0 stays exact, mean|d| ~1e-2); this is not pad leakage.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* feat(attn): chunk-matmul banded local attention (O(1) graph nodes)

The pad-and-shift banded path (build_graph_local) is correct but emits O(window) graph nodes per layer (a P-iteration view+mul+sum_rows+concat loop), which is why the window was capped at W=32. build_graph_local_chunked computes the exact same NeMo rel_pos_local_attn output with a fixed, O(1) number of nodes regardless of window, lifting the cap toward NeMo's full [128,128].

Construction: time is tiled into chunks of C frames; each chunk carries its own C+P-1 keys/values (the P-1 halo overlaps the neighbour), so a query attends only within its chunk. K/V are gathered as OVERLAPPING strided chunk views, which ggml's view-bounds check (ggml.c: data_size = dense product of ne, ignoring nb) rejects unless the source is OVER-padded to (C+P-1)*G frames; with that pad the view is legal and a single batched ggml_mul_mat produces the per-chunk q.k blocks [C+P-1, C, G, H]. A diagonal "skew" view (nb1 walking C+P on a [C+P-1,...] tensor, which passes the bounds check since P <= C+P-1) extracts the [P,T] band. The PV side inverse-skews the softmaxed band back to a [C+P-1, C] banded matrix (pad ne0 by C, skew-view, mask the lower off-band), then one batched matmul against the transposed V chunks gathers the context.

Verified against the trusted pad-and-shift path (forward_local, itself 1.4e-3 vs a deterministic brute-force band reference): new test test_relpos_attention_local_chunked runs synthetic x/pos through the real layer-0 weights for T up to 333 and W up to 128 (chunk < W and chunk == W), matching forward_local to <1e-3 (max|d| ~6e-4). Existing pad-and-shift path and all encoder/conformer regressions unchanged. Encoder wiring (raise the cap and route long audio to this kernel) follows.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* feat(encoder): route local attention to chunk-matmul, lift window cap to 128

Wire the O(1)-node chunk-matmul kernel into the encoder and raise the local window cap from 32 to NeMo's full 128. Both conformer attention paths now use it:

  build_graph (B=1)         -> build_graph_local_chunked
  build_graph_batched (B>1) -> build_graph_batched_local_chunked

The batched wrapper runs the 4D chunk kernel once per item and stacks the [D,T] outputs into [D,T,B] (the chunk graph is already 4D, so it can't also carry a batch dim); that is O(B) nodes, still O(1) in the window, and B is small.

local_attn_window's cap (kMaxLocalWindow) goes 32 -> 128: the pad-and-shift path emitted ~6*(2W+1) nodes/layer (hence the 32 cap to fit kGraphSize), but the chunk-matmul path is window-independent in node count, so long audio now runs at NeMo's full [128,128] window. The pad-and-shift build_graph_local / build_graph_batched_local are kept as the verification oracle for test_relpos_attention_local{,_chunked}.

Verified on dgx: full ctest green (51/51). test_encoder_batch_local passes at every forced window W=8..128 (now through the chunked path). e2e on a 16.6-min clip (tdt-0.6b-v3, CPU/16t), auto-local W=128: 36.8s / 9.8GB peak RSS, coherent transcript. That is faster than the W=32 pad-and-shift capstone (41-48s / 9.1GB) at a 4x wider, NeMo-faithful window, and ~5.6x under the full-attention path that OOM'd the node (151s / 55GB).

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

* docs(bench): add long-audio banded-attention section

Document the banded local attention (rel_pos_local_attn) memory/speed win that the chunk-matmul kernel enables. 16.6-min clip, tdt-0.6b-v3, GB10 CPU/16t: global O(T^2) attention 148.3s / 54.0GB vs banded W=128 36.9s / 9.4GB (~4x faster, ~5.7x less peak RAM) at NeMo's full window, with the chunk-matmul making W=128 as cheap as W=32. Notes that short clips stay on the global path and are unchanged.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8

---------

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
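
The band restriction this commit describes (each query t attends only to keys in [t-att_left, t+att_right]) can be illustrated with a small pure-Python reference. This is a sketch under stated assumptions, not the repository's ggml kernel: it uses a single head, omits the relative-positional term, and the function names are hypothetical. It compares a masked O(T*T) computation against one that only materialises in-band scores.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def full_attention_masked(q, k, v, left, right):
    """O(T*T) reference: score every (query, key) pair, mask keys outside the band."""
    T, d = len(q), len(q[0])
    out = []
    for t in range(T):
        scores = [
            dot(q[t], k[s]) / math.sqrt(d) if t - left <= s <= t + right else float("-inf")
            for s in range(T)
        ]
        w = softmax(scores)  # exp(-inf) is 0.0, so masked keys get zero weight
        out.append([sum(w[s] * v[s][j] for s in range(T)) for j in range(d)])
    return out

def banded_attention(q, k, v, left, right):
    """O(T*window): only the (at most left+right+1) in-band scores exist per query."""
    T, d = len(q), len(q[0])
    out, n_scores = [], 0
    for t in range(T):
        lo, hi = max(0, t - left), min(T - 1, t + right)
        scores = [dot(q[t], k[s]) / math.sqrt(d) for s in range(lo, hi + 1)]
        n_scores += len(scores)
        w = softmax(scores)
        out.append([sum(w[i] * v[lo + i][j] for i in range(len(w))) for j in range(d)])
    return out, n_scores

# Toy sizes, chosen only for illustration.
random.seed(0)
T, d, W = 40, 4, 3
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]

ref = full_attention_masked(q, k, v, W, W)
band, n_scores = banded_attention(q, k, v, W, W)
max_diff = max(abs(a - b) for ra, rb in zip(ref, band) for a, b in zip(ra, rb))
print("max |diff|:", max_diff, "band scores:", n_scores, "full scores:", T * T)
```

The score count of the banded version is bounded by T*(2W+1), which is the linear-in-T growth that the memory test in this commit checks for.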
Commit 8436005
-
Add nemotron-3.5-asr-streaming-0.6b (multilingual prompt-conditioned streaming) (#10)

* feat(convert): emit prompt-conditioning KVs + use_bias; handle nested att_context presets

Emit parakeet.prompt.{present,num_prompts,dictionary.keys,dictionary.values,default_lang} and parakeet.encoder.use_bias for prompt-conditioned multilingual checkpoints (nvidia/nemotron-3.5-asr-streaming-0.6b). Handle the nested att_context_size preset list ([[56,3],[56,0],...]) by taking the first preset as the default and recording all presets in parakeet.encoder.att_context_presets.

Also refine detect_arch: a bare aux_ctc config block is no longer enough to mark a model hybrid. The nemotron prompt RNNT carries an unconfigured aux_ctc stub (num_classes=-1, empty vocabulary) but has no ctc decoder and zero ctc_decoder.* weights (NeMo initializes it RNNT-only), so require an actual model.ctc_decoder before classifying as hybrid. This makes the model convert as arch=rnnt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(loader): read prompt-conditioning config + encoder.use_bias

Add PromptCfg (present, num_prompts, default_lang, dict_keys/vals, lang_to_index) and use_bias to ParakeetConfig, and read the new KVs in ModelLoader::load via a new kv_str_arr helper. present=false / use_bias=true defaults keep every existing model byte-identical.

Extend test_model_loader with a PARAKEET_TEST_GGUF_NEMOTRON block asserting the resolved prompt dictionary (de=9, auto=101, unknown=-1) and use_bias=false; it skips silently when the fixture env var is unset.

The encoder attention/FFN linear bias loads were already optional (clone_weight_opt + ml.tensor guards across relpos_attention/conformer/streaming_encoder), and every subsampling bias is present in this checkpoint, so the use_bias=false model loads and its encoder graph builds with no further changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: make nemotron loader assertions reachable without PARAKEET_TEST_GGUF

The nemotron prompt-config block was unreachable when PARAKEET_TEST_GGUF was unset, because main() returned 77 before it. Guard the base-model checks behind PARAKEET_TEST_GGUF and run the nemotron block whenever PARAKEET_TEST_GGUF_NEMOTRON is set. Only skip (return 77) when neither env var is present.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(baseline): dump prompt_kernel_out + per-language RNNT reference for nemotron

Decode the prompt-conditioned encoder output directly via the model's RNNT decoding object: the prompt model's transcribe dataloader resolves the prompt index from per-cut language metadata, which a bare wav fixture lacks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat: PromptKernel post-encoder conditioning unit + isolated NeMo parity test

Concat the constant language one-hot onto the encoder output, then Linear->ReLU->Linear (prompt_kernel.0/2) on the persistent backend via run_graph. Parity vs NeMo prompt_kernel_out: max|d|=1.9e-6.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(model): apply PromptKernel + resolve target_lang in offline decode

resolve_prompt_index maps a locale to its prompt index (empty -> default_lang), and the offline + batch decode paths project the encoder output through the PromptKernel when prompt.present. Threaded target_lang through the transcribe entry points (default empty); non-prompt models take the no-op path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: offline nemotron end-to-end NeMo parity (multi-language)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(streaming): apply PromptKernel per chunk; target_lang on session

Resolve a language prompt index in the StreamingSession constructor (new target_lang param, default empty -> model default_lang) and apply the prompt_kernel projection to each chunk's encoder frames before the RNN-T decode. The one-hot is constant over time, so per-chunk application is exact and equals the offline forward's single application. Non-prompt models take the no-op path (prompt_.present()==false) and stay byte-identical.

run_stream_over_pcm gains a trailing target_lang param (default empty) so a language can route through one entry point; the session already owns its resolved index, so the driver leaves it unused for now (Phase 4 wires it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: streaming nemotron end-to-end NeMo parity

Extend dump_prompt_baseline to emit baseline.stream_text: run NeMo's cache-aware streaming encoder, apply m.prompt_kernel to the concatenated streamed output for the target_lang, and RNN-T greedy decode it (specials stripped). Add tests/test_streaming_nemotron.cpp: drive a prompt-aware StreamingSession over the clip and assert sess.text() == baseline.stream_text.

Parity gate (lang=en, speech.wav): got == ref EXACTLY:

  "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>"

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(capi): target_lang variants for transcribe + stream (ABI bump)

Add parakeet_capi_transcribe_path_lang, parakeet_capi_transcribe_pcm_lang and parakeet_capi_stream_begin_lang for multilingual prompt-conditioned (nemotron) models. target_lang is a locale string; NULL or "" selects the model default and non-prompt models ignore it. An unknown locale on a prompt model is caught at the boundary, returning NULL with the message set on the ctx last error. The original non-lang entry points delegate to the new ones with the model default, preserving behavior. ABI version bumped to 3.

test_capi gains a PARAKEET_TEST_GGUF_NEMOTRON-guarded block asserting a known lang transcribes (non-NULL) and an unknown lang returns NULL with a non-empty last_error; the two model blocks are now independent and skip cleanly (77) when neither env var is set.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(cli): --lang flag for multilingual prompt models

transcribe gains --lang <locale> to select the language prompt for multilingual (nemotron) prompt models; empty -> the model default and non-prompt models ignore it. The plain offline path routes through the C-API parakeet_capi_transcribe_path_lang when --lang is set (so an unknown locale is a clean error), and keeps the existing free-function path otherwise so behavior for every other model is unchanged. --timestamps threads lang into transcribe_path_with_timestamps; --stream threads it into the StreamingSession ctor (what stream_begin_lang forwards), keeping the rich per-word/EOU output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test: e2e NeMo-vs-parakeet.cpp comparison harness (per-language, offline+stream)

Refactor gen_nemo_baseline.dump_prompt_baseline into an importable compute_prompt_reference helper and reuse it from the new e2e driver, which runs the built parakeet-cli per (clip, lang, mode) and asserts WER 0 vs NeMo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: document nemotron multilingual streaming support + prompt KVs

- README: add the prompt-conditioned multilingual streaming model (nvidia/nemotron-3.5-asr-streaming-0.6b, 40+ locales, --lang, WER 0 offline + streaming).
- conversion.md: document the parakeet.prompt.* KV schema, encoder.use_bias, att_context_presets, and the prompt_kernel tensors (stay F32).
- parity.md: add the nemotron coverage row + e2e cross-check note.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(publish): add nemotron-3.5-asr-streaming-0.6b (5 variants, OpenMDW-1.1 card)

Add the model to ALL_MODELS and KNOWN_WER (f16/q8_0/q6_k/q5_k/q4_k, all WER 0.0 offline vs NeMo with recorded sizes). Add a per-id LICENSES map (default CC-BY-4.0) so the generated card states OpenMDW-1.1 for this entry, wired into both the per-model and the collection cards (frontmatter license/license_name/license_link, License section, per-model rows). Quant allowlist unchanged: the prompt_kernel, LSTM prediction net, and featurizer tensors stay F32.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* bench: nemotron-3.5-asr CPU benchmark vs NeMo (WER 0, 2.4x f32 / 2.5x q8_0)

Benchmarks the prompt-conditioned nemotron-3.5-asr-streaming-0.6b port on CPU against NeMo (PyTorch CPU), following the existing parakeet.cpp methodology: load once, warm up once, time transcribe only, median of N passes, RTFx = audio_sec / proc_sec.

- Adds scripts/bench_nemotron.py. ours runs parakeet-cli bench with the en language prompt; NeMo runs the same prompt forward (preprocessor, encoder, PromptKernel, RNN-T greedy) reusing gen_nemo_baseline.resolve_prompt_lang. Optionally times the cache-aware streaming path too.
- Adds --lang to the CLI bench subcommand so the prompt-conditioned timing path selects the same language prompt as transcribe (passed to transcribe_pcm).
- Adds build_nemotron_section to gen_benchmark_md.py, fed by the new benchmarks/results/nemotron/bench.json, so the section is reproducible.

Results on AMD Ryzen 9 9950X3D (20 cores, CPU-only, 8 threads), speech.wav (7.43 s), lang en, median of 7 passes:

  NeMo                   RTFx 12.2
  parakeet.cpp f32       RTFx 29.4   2.40x   agreement WER 0.0000%
  parakeet.cpp q8_0      RTFx 30.8   2.52x   agreement WER 0.0000%
  streaming f32 compute  RTFx 3.80   (latency-oriented)

Transcripts are byte-identical to NeMo on the timed runs, so the speed numbers compare equal work. Full suite green (ctest 48/48 non-nemotron, 2/2 nemotron).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(streaming): reject unknown target_lang for prompt models (match offline + capi contract)

The streaming StreamingSession ctor silently fell back to the default language on an unknown locale, contradicting the parakeet_capi_stream_begin_lang header contract (NULL on an unknown locale) and diverging from the offline Model::resolve_prompt_index path, which throws. A typo like --stream --lang xx produced wrong-language output with no error.

Factor the throwing resolution into PromptCfg::resolve_index_or_throw and use it from both the offline path and the StreamingSession ctor so both reject typos identically. The empty-lang default and the non-prompt no-op are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
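
The PromptKernel step in this commit (concatenate a constant language one-hot onto each encoder frame, then Linear->ReLU->Linear) and the strict language resolution can be sketched in plain Python. Everything here is illustrative: the layer sizes, the random weights, the choice of "auto" as the default language, and the assumption that the output width equals the encoder width are hypothetical; only the de=9 and auto=101 indices come from the loader test above. The sketch also shows why applying the kernel per streaming chunk matches a single offline application: the one-hot is the same for every frame.

```python
import random

def linear(x, weight, bias):
    """y = W x + b for one frame; weight is a list of rows, one per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def prompt_kernel(frames, prompt_index, num_prompts, w0, b0, w2, b2):
    """Concat the language one-hot onto every frame, then Linear -> ReLU -> Linear."""
    one_hot = [1.0 if i == prompt_index else 0.0 for i in range(num_prompts)]
    out = []
    for frame in frames:
        hidden = [max(0.0, y) for y in linear(frame + one_hot, w0, b0)]
        out.append(linear(hidden, w2, b2))
    return out

def resolve_index_or_throw(lang_to_index, default_lang, target_lang):
    """Empty target_lang selects the default; an unknown locale is an error, not a fallback."""
    lang = target_lang if target_lang else default_lang
    if lang not in lang_to_index:
        raise ValueError(f"unknown target_lang: {lang!r}")
    return lang_to_index[lang]

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical toy sizes and weights; the real tensors live in the GGUF file.
random.seed(0)
D, HIDDEN, NUM_PROMPTS = 3, 5, 128
LANG_TO_INDEX = {"de": 9, "auto": 101}
w0, b0 = rand_matrix(HIDDEN, D + NUM_PROMPTS), [0.1] * HIDDEN
w2, b2 = rand_matrix(D, HIDDEN), [0.0] * D
frames = rand_matrix(6, D)

idx = resolve_index_or_throw(LANG_TO_INDEX, "auto", "de")
offline = prompt_kernel(frames, idx, NUM_PROMPTS, w0, b0, w2, b2)

# Streaming: apply the same kernel to 2-frame chunks and concatenate the results.
streamed = []
for start in range(0, len(frames), 2):
    streamed += prompt_kernel(frames[start:start + 2], idx, NUM_PROMPTS, w0, b0, w2, b2)

print("per-chunk equals offline:", streamed == offline)
```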
Commit 271e70b
-
Batch mode for nemotron: batched causal subsampling + batched target_lang C-API (#11)

* feat(capi): batched target_lang variants (transcribe_pcm_batch_json_lang / _batch_lang)

Add language-aware batched C-API entry points so a request-coalesced batch can select one language prompt for the whole batch on multilingual (nemotron) models:

  char* parakeet_capi_transcribe_pcm_batch_json_lang(...)
  int   parakeet_capi_transcribe_pcm_batch_lang(...)

The existing non-lang batch functions now delegate to these with nullptr (model default), mirroring the Phase 4 single-clip pattern, so no logic is duplicated. target_lang threads into the C++ batch methods that already accept it; NULL/"" means the model default, non-prompt models ignore it, and an unknown locale is caught by the existing try/catch (NULL / nonzero + last_error).

ABI stays v3 (unreleased on this branch); the v3 comment now lists the two new symbols.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(subsampling): support batched causal subsampling (byte-identical to per-item); enable batched nemotron

The build_graph_batched causal branch already applied the leading ggml_pad_ext uniformly across the batch and masked each item's trailing pad time frames per stage via the all_paddings=3 valid-length recurrence, so it reproduces the standalone causal boundary per item. The B>1 guard assert was a conservative leftover; remove it so the multilingual streaming nemotron model can run real batches.

Validated byte-identical: a clip transcribed inside a B>1 batch (uniform, mixed-length, reversed order, and a non-empty truncated item as the padded/masked clip) equals the same clip transcribed standalone, plus batched timestamps text parity. Re-enable the positive valid-language (de) 2-clip batch JSON assertion in test_capi_batch_json. No change to the non-causal path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Commit 50dfc24
-
docs: add a supported-models table with links (#12)

Lists all 11 published models (the 10 Parakeet checkpoints plus the new multilingual streaming nemotron-3.5-asr-streaming-0.6b) with their type, size, notes, and a link to each NVIDIA source, plus a pointer to the GGUF collection repo and docs/parity.md.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Commit 86bc69e
-
docs(bench): add nemotron GPU numbers (GB10) to BENCHMARK.md (#14)

parakeet.cpp vs NeMo on the NVIDIA GB10, same clip and methodology as the CPU table:

  NeMo (PyTorch GPU)  RTFx 91.8
  parakeet.cpp f32    RTFx 106.5  (1.16x)
  parakeet.cpp q8_0   RTFx 119.8  (1.30x)

Transcripts are byte-identical (WER 0). The margin is smaller than on CPU because nemotron is RNN-T and NeMo's CUDA-graph greedy decode is fast there. NeMo now runs natively on the GB10 via torch 2.11 plus cu128 (no nvcr container).

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
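
The benchmark methodology referenced here defines RTFx as audio_sec / proc_sec over the median of N timed passes. A small Python sketch of that arithmetic follows; the helper names and the per-pass timings are hypothetical and serve only to show the calculation, while the RTFx figures passed to `speedup` are the ones reported in this commit.

```python
import statistics

def rtfx(audio_sec, pass_times_sec):
    """Real-time factor: seconds of audio processed per second of compute (median pass)."""
    return audio_sec / statistics.median(pass_times_sec)

def speedup(ours_rtfx, baseline_rtfx):
    """How many times faster one engine is than the baseline on the same clip."""
    return ours_rtfx / baseline_rtfx

# Hypothetical timings (seconds) for 7 passes over the 7.43 s clip.
example_passes = [0.071, 0.069, 0.070, 0.072, 0.070, 0.068, 0.070]
example_rtfx = rtfx(7.43, example_passes)

# Ratios from the reported GB10 numbers (NeMo 91.8, f32 106.5, q8_0 119.8).
f32_ratio = speedup(106.5, 91.8)
q8_ratio = speedup(119.8, 91.8)
print(f"example RTFx {example_rtfx:.1f}, f32 {f32_ratio:.3f}x, q8_0 {q8_ratio:.3f}x")
```

Using the median rather than the mean keeps a single slow pass (for example a cold cache) from skewing the reported figure.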
Commit 96c3177
-
fix: reset the streaming decoder on <EOU>/<EOB> so transcription continues (#13) (#15)

The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as ordinary vocab tokens to mark end of utterance. The cache-aware streaming decode carried the RNN-T decoder state across chunks but never reset it, so once <EOU> was emitted the prediction net stayed conditioned on it and the joint scored blank on every following frame: the stream went silent after the first utterance (issue #13).

This matched NeMo's plain rnnt_decoder_predictions_tensor (which does the same), but that is not how the model is meant to run. NeMo's reference streaming driver for this model (examples/voice_agent/.../nemo/streaming_asr.py NemoStreamingASRService.transcribe) calls reset_state() whenever <EOU>/<EOB> appears in a chunk, so the next utterance decodes from a fresh decoder state. StreamingSession::feed_mel_chunk now does the same: after a chunk emits <EOU>/<EOB> it resets the RNN-T decoder state (LSTM h/c to zero, last token back to SOS) for the next chunk.

Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state also drops the encoder cache, but that was verified byte-identical on the transcript (decoder-only reset == full reset_state on the diffusion 60s/2-EOU and 180s/5-EOU clips), so the validated streaming-encoder path is left untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip, and the offline path is unchanged (it matches NeMo offline on single utterances).

Adds a gated regression test (test_streaming_eou_reset) plus a NeMo reset-on-EOU baseline generator (gen_stream_reset_baseline.py) that builds a two-utterance clip so an <EOU> fires mid-stream; the test asserts our streamed transcript matches NeMo's reset reference exactly and that the second utterance is recovered. Confirmed it fails with the reset disabled.

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
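
The failure mode and the fix can be mimicked with a toy session in Python. This is a deliberately simplified stand-in, not the real RNN-T decode: the class, its fields, and the rule "emit nothing while the last token is <EOU>/<EOB>" are hypothetical devices that imitate the joint scoring blank on every frame once the prediction net is conditioned on the end-of-utterance token.

```python
EOU, EOB, SOS = "<EOU>", "<EOB>", "<SOS>"

class ToyStreamingSession:
    """Toy model of a streaming decode whose decoder state persists across chunks."""

    def __init__(self, reset_on_eou=True):
        self.reset_on_eou = reset_on_eou
        self.enc_frame = 0          # absolute frame counter; never reset
        self.emitted = []
        self._reset_decoder()

    def _reset_decoder(self):
        # Mirrors the described reset: LSTM h/c to zero, last token back to SOS.
        self.h = [0.0, 0.0]
        self.c = [0.0, 0.0]
        self.last_token = SOS

    def _decode(self, candidates):
        out = []
        for tok in candidates:
            if self.last_token in (EOU, EOB):
                continue            # toy stand-in: conditioned on EOU, only blanks come out
            out.append(tok)
            self.last_token = tok
            self.h = [x + 1.0 for x in self.h]   # pretend the state evolves per token
        return out

    def feed_chunk(self, candidates, n_frames):
        out = self._decode(candidates)
        self.enc_frame += n_frames  # keeps running so timestamps stay absolute
        self.emitted.extend(out)
        if self.reset_on_eou and any(t in (EOU, EOB) for t in out):
            self._reset_decoder()   # decoder only; an encoder cache would be left alone
        return out

chunks = [(["hello", "world", EOU], 10), (["second", "utterance", EOU], 12)]

broken = ToyStreamingSession(reset_on_eou=False)
fixed = ToyStreamingSession(reset_on_eou=True)
for tokens, n_frames in chunks:
    broken.feed_chunk(tokens, n_frames)
    fixed.feed_chunk(tokens, n_frames)

print("without reset:", broken.emitted)
print("with reset:   ", fixed.emitted)
```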
Commit abd0087
Commits on Jun 7, 2026
-
feat(capi): ABI v4 segment-timestamp support (frame_sec + streaming JSON) (#16)

Add the data LocalAI needs to build NeMo-faithful segment timestamps:

- Offline JSON (transcribe_*_json) now carries "frame_sec", the encoder frame stride in seconds, so a consumer can convert NeMo's frame-unit segment_gap_threshold into the seconds gap between words.
- New streaming JSON entry points parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json return {text, eou, frame_sec, words} by surfacing the streaming session's existing drain_words() per-word start/end/conf alongside the newly-finalized text and EOU flag.

Bumps PARAKEET_CAPI_ABI_VERSION to 4. All existing entry points are unchanged; the new symbols are additive (consumers probe for them).

tests/test_capi_stream_json.cpp drives the new streaming JSON path on the EOU model (skips with 77 when PARAKEET_TEST_GGUF_EOU is unset, like the sibling streaming tests).

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
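
A consumer-side sketch of how frame_sec could be used: convert a frame-unit gap threshold to seconds, then start a new segment whenever the silence between consecutive words exceeds it. The payload below is hypothetical. The top-level keys (text, eou, frame_sec, words) follow the commit message, but the per-word key names ("word", "start", "end", "conf"), the 0.08 s stride, the threshold value, and the "strictly greater than" split rule are assumptions for illustration.

```python
import json

def segment_words(payload_json, segment_gap_threshold_frames):
    """Group words into segments, splitting where the inter-word gap exceeds the threshold."""
    payload = json.loads(payload_json)
    gap_sec = segment_gap_threshold_frames * payload["frame_sec"]  # frames -> seconds
    segments = []
    for w in payload["words"]:
        if segments and w["start"] - segments[-1]["end"] <= gap_sec:
            segments[-1]["text"] += " " + w["word"]
            segments[-1]["end"] = w["end"]
        else:
            segments.append({"text": w["word"], "start": w["start"], "end": w["end"]})
    return segments

# Hypothetical payload shaped like the described {text, eou, frame_sec, words} object.
example = json.dumps({
    "text": "hello world again",
    "eou": False,
    "frame_sec": 0.08,
    "words": [
        {"word": "hello", "start": 0.00, "end": 0.40, "conf": 0.98},
        {"word": "world", "start": 0.48, "end": 0.90, "conf": 0.97},
        {"word": "again", "start": 2.50, "end": 2.90, "conf": 0.95},
    ],
})

segments = segment_words(example, segment_gap_threshold_frames=10)  # 10 frames = 0.8 s
for seg in segments:
    print(seg)
```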
Commit ce02d29
-
fix: define M_PI for MSVC builds (#18)

M_PI is not declared by <cmath> on MSVC unless _USE_MATH_DEFINES is set before the header. Define it (plus an #ifndef M_PI fallback) in fft.cpp and mel_gpu.cpp, and add _USE_MATH_DEFINES as a PUBLIC MSVC compile definition on the parakeet target so the test executables that also use M_PI build too. Non-MSVC builds are unaffected.

Closes #6

Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Commit eb09678
-
fix: tile subsampling for long audio to avoid ggml 2^31 tensor overflow on GPU (#19)

* feat(subsampling): add subsample_len spatial-length helper

* feat(subsampling): tiled long-audio path (parity vs forward)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(subsampling): cache valid_out_len in forward_tiled; document tiling test invariant

* feat(encoder): forward_batch_tiled from pre-subsampled features

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(model): tile subsampling for long audio above safe threshold

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(model): tile single-clip transcribe for long audio (CLI/path C-API)

* fix(ggml-cuda): grid-stride pad kernel for dims > 65535 (long-audio attention)

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Commit 96b81bb
-
fix: select integrated GPUs and allow PARAKEET_DEVICE to name a device (#20)

Backend device selection only accepted GGML_BACKEND_DEVICE_TYPE_GPU, so integrated GPUs (Ryzen APUs and similar, reported as GGML_BACKEND_DEVICE_TYPE_IGPU) were skipped and the engine fell back to CPU on those machines. The auto-pick now matches both discrete and integrated GPU devices.

PARAKEET_DEVICE also gains a third form: besides "cpu" (force CPU) and being unset (auto-pick the first GPU/IGPU), it can now name a specific registry device such as "CUDA0" or "Vulkan1" (case-insensitive). An unknown name logs and falls back to CPU instead of failing. use_sched is now derived from the chosen device type so any non-CPU device still offloads unsupported ops to CPU.

Adds a regression test covering the env-var fallback paths (cpu, unknown name, case-insensitive CPU), which run on a CPU-only build, and documents the new behavior in the README.

Closes #17

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
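
The three PARAKEET_DEVICE forms described in this commit amount to a small decision procedure, sketched below in Python. The function name, the (name, type) device tuples, the "CPU" registry name, and treating an empty string like an unset variable are assumptions for illustration; the actual selection is done in C++ against the ggml backend registry.

```python
import sys

def pick_device(env_value, devices):
    """Return (device_name, use_sched) for a PARAKEET_DEVICE value.

    devices: list of (name, type) pairs with type in {"CPU", "GPU", "IGPU"}.
    """
    by_type = lambda types: next((d for d in devices if d[1] in types), None)
    cpu = by_type({"CPU"})

    if not env_value:                                   # unset: first discrete or integrated GPU
        chosen = by_type({"GPU", "IGPU"}) or cpu
    elif env_value.lower() == "cpu":                    # force CPU, case-insensitive
        chosen = cpu
    else:                                               # named registry device, case-insensitive
        chosen = next((d for d in devices if d[0].lower() == env_value.lower()), None)
        if chosen is None:
            print(f"unknown PARAKEET_DEVICE {env_value!r}; falling back to CPU", file=sys.stderr)
            chosen = cpu

    name, dev_type = chosen
    return name, dev_type != "CPU"                      # scheduler only for non-CPU devices

# Hypothetical registry contents.
registry = [("CPU", "CPU"), ("Vulkan0", "IGPU"), ("CUDA0", "GPU")]

print(pick_device(None, registry))
print(pick_device("cuda0", registry))
print(pick_device("Metal7", registry))
```

Falling back to CPU on an unknown name trades strictness for availability: a typo degrades performance but does not stop transcription.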
Commit e270af7
Commits on Jun 11, 2026
-
ci: pre-built release binaries for linux, macos and windows (#22)

* ci: pre-built release binaries for linux, macos and windows (#21)

Adds a release workflow that builds self-contained parakeet-cli bundles for every v* tag: linux x64 (cpu, vulkan, cuda) and arm64 (cpu), macos arm64 (metal) and x64 (cpu), windows x64 (cpu, vulkan, cuda) plus a separate cudart runtime zip. Assets attach to the GitHub release for the tag, creating a draft release when none exists yet.

Fixes #21

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: point the README at the pre-built release bundles

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ci: capture the usage banner before grepping in the smoke tests

parakeet-cli exits 2 when invoked bare; under the runner's bash -e -o pipefail that exit code fails the pipeline even though grep matched.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ci: drop the temporary branch trigger used for matrix validation

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ci: let ggml pick the CUDA architectures, like llama.cpp releases

Dropping the hand-rolled CMAKE_CUDA_ARCHITECTURES lists lets ggml's curated non-native default apply: PTX for the datacenter generations (75, 80, 90), real code for the common consumer cards (86, 89, 120a), and 121a-real for GB10 on CUDA 13. Smaller binaries, faster builds, and the list stays current with submodule bumps. Temporarily re-adds the branch trigger to validate the CUDA builds.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Commit 9db92be
You can try running this command locally to see the comparison on your machine:
git diff v0.1.2...v0.2.0