Comparing changes

…#9) * feat(attn): banded local (Longformer) attention — O(T*window), NeMo-faithful Adds NeMo rel_pos_local_attn (RelPositionMultiHeadAttentionLongformer) as a memory-bounded banded attention. This is the kernel for fixing the O(T^2) attention blowup that OOM'd long-audio offline transcription on unified-memory GPUs (a ~20-min clip allocated ~100GB and took the node down). - RelPosAttention::build_graph_local / forward_local: banded attention via pad-and-shift, peak memory O(T*window) instead of O(T*T). Each query attends only to keys in [t-att_left, t+att_right]; the positional term (q_v . p^T over the 2W+1 local pos) is added 1:1 to the banded content scores, exactly as NeMo combines them. Verified against NeMo's own sliding_chunks_matmul_qk/pv (col->key t-w+c to 1e-6) and a deterministic band reference (1.4e-3). - local_rel_pos_encoding: NeMo LocalAttRelPositionalEncoding (positions +att_left..-att_right), bit-identical to the centre rows of the full table. - pk::last_graph_alloc_bytes(): gallocr high-water accessor for the memory test. - gen_nemo_baseline.py --att-context-size (local-attention baseline); and gen_band_ref.py for the deterministic band reference. NOTE: NeMo's longformer is non-deterministic on short clips (sliding_chunks_matmul_pv reads uninitialized memory at boundaries via F.pad value=-1 + as_strided — two identical forward() calls differ by >1e3), so kernel parity must use the deterministic reference; end-to-end NeMo quality is anchored by long-audio WER. - Tests: test_relpos_attention_local (parity 1.4e-3) and test_relpos_attention_local_memory (alloc grows ~linearly, ratio 1.98). Not yet wired into the offline encoder path — follow-up. 
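The banded idea described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the ggml kernel: it keeps only the content scores (the relative-position term that the commit adds 1:1 is omitted), and the names `banded_attention` and `score_entries` are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a short list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def banded_attention(q, k, v, att_left, att_right):
    """q, k, v are lists of T vectors. Query t attends only to keys in
    [t - att_left, t + att_right], clipped to the valid range."""
    T, dk = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dk)
    out = []
    for t in range(T):
        lo = max(0, t - att_left)
        hi = min(T - 1, t + att_right)
        # Content scores only; the positional term is left out of this sketch.
        scores = [scale * sum(a * b for a, b in zip(q[t], k[s]))
                  for s in range(lo, hi + 1)]
        w = softmax(scores)
        out.append([sum(w[i] * v[lo + i][d] for i in range(len(w)))
                    for d in range(len(v[0]))])
    return out

def score_entries(T, att_left, att_right):
    # Upper bound on stored scores: T rows of (left + right + 1) columns,
    # versus T * T for full attention.
    return T * (att_left + att_right + 1)
```

With a window at least as wide as the sequence the result matches full attention, and `score_entries` doubles when T doubles, which is the roughly linear growth the memory test checks for.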
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 * feat(encoder): use banded local attention for long audio (fixes O(T^2) OOM) Wires the banded rel_pos_local_attn kernel into the offline encoder so long audio no longer allocates O(T^2) attention, which OOM'd unified-memory GPUs (a ~17-min clip drove ~100GB and took the node down). - ConformerLayer::build_graph gains optional att_left/att_right; when set it routes self-attention to RelPosAttention::build_graph_local with a LOCAL positional encoding, else keeps full attention unchanged. - Encoder::forward picks the window via local_attn_window(Tp): env PARAKEET_ATT_CONTEXT=W forces NeMo rel_pos_local_attn [W,W]; otherwise audio longer than ~11 min (>8192 encoder frames) auto-switches to W=128. Short audio keeps full attention (NeMo-exact; the encoder parity test is unchanged). - backend.cpp: bump kGraphSize 16384->65536 — the pad-and-shift kernel adds O(window) graph-node descriptors per layer. Verified end-to-end on a 16.6-min clip with tdt-0.6b-v3 (CPU, 16 threads): full attention: 151 s, 55.4 GB peak RSS banded (W=16): 41 s, 9.1 GB peak RSS (coherent transcript) ~6x less memory and ~3.7x faster; the full-attention path is what hit ~100GB and OOM'd. Short-clip transcripts: W=128 == full byte-for-byte; W=16 essentially identical. Note: pad-and-shift creates O(window) nodes and an O(window^2) incremental concat — fine for small windows but slow for W=128 on CPU; an efficient chunk-matmul construction (like NeMo's sliding_chunks) is a follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 * fix(encoder): cap local-attention window to the node budget (no small-model regression) Bumping kGraphSize 16384->65536 to fit a W=128 banded graph regressed small models ~+22% (tdt_ctc-110m): the per-compute metadata context and graph hash-set scale with kGraphSize. 
Revert to 16384 and instead cap the local-attention window at W=32 — the pad-and-shift kernel adds ~6*(2W+1) graph nodes/layer, and W<=32 fits every shipped model's encoder within the budget. PARAKEET_ATT_CONTEXT is clamped to 32. Regression bench (librispeech, 100 files, CPU, back-to-back): tdt_ctc-110m: master 19.5s vs banded 19.4s (within noise), 0/100 text diffs tdt-0.6b-v3: 0/100 text diffs Long-audio fix intact: 16.6-min clip + tdt-0.6b-v3 auto-uses W=32 -> 48s, 9.4 GB peak RSS (vs full attention 151s / 55.4 GB). Lifting the window cap to NeMo's [128,128] needs the efficient chunk-matmul construction (O(1) graph nodes) — follow-up. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 * feat(attn): banded local attention in the batched encoder (B>1) Mirror the B=1 banded path (build_graph_local) into the fused batched encoder so long-audio batches also use NeMo rel_pos_local_attn (O(T*window)) instead of full O(T^2) attention. RelPosAttention::build_graph_batched_local builds the 4D ([dk,T,H,B]) pad-and-shift band: K/V padded on the time axis, per-window-column views, sum_rows content scores + mul_mat positional scores (shared pos broadcast over B), a per-item band mask [P,T,1,B] keyed on each item's valid_len, soft_max over the window, then the context gather and head merge. Conformer build_graph_batched and the batched encoder forward route to it when att_left/att_right >= 0, with the shared LOCAL positional encoding. Verified on dgx (tdt_ctc-110m): the new test_encoder_batch_local exercises the path at the production window (W=32 = kMaxLocalWindow). item0 (the full clip) is bit-exact beside its shorter padded neighbour (no cross-item leak), and the padded item1 matches its standalone run within 5e-2/5e-2 - the same tolerance the full-attention batch test uses. Tighter-than-production windows only amplify float noise on near-zero activations of the padded clip (item0 stays exact, mean|d| ~1e-2); not pad leakage. 
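The node-budget reasoning behind the W=32 cap can be checked with simple arithmetic. The sketch below uses the ~6*(2W+1) nodes/layer estimate from the commit message; the layer count of 24 is an assumption chosen for illustration only, and the estimate covers just the extra nodes from the pad-and-shift kernel, not the rest of the encoder graph.

```python
K_GRAPH_SIZE = 16384  # node budget named in the commit message

def pad_and_shift_nodes(window, n_layers):
    # ~6 nodes per window column, (2W + 1) columns, per layer.
    return 6 * (2 * window + 1) * n_layers

# n_layers=24 is a hypothetical value, not taken from the source.
small = pad_and_shift_nodes(32, 24)
large = pad_and_shift_nodes(128, 24)
print(small, small < K_GRAPH_SIZE)
print(large, large < K_GRAPH_SIZE)
```

Under that assumption the W=32 graph stays well inside the 16384 budget while W=128 does not, which is consistent with the earlier bump to 65536 and with the later move to a window-independent node count.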
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 * feat(attn): chunk-matmul banded local attention (O(1) graph nodes) The pad-and-shift banded path (build_graph_local) is correct but emits O(window) graph nodes per layer (a P-iteration view+mul+sum_rows+concat loop), which is why the window was capped at W=32. build_graph_local_chunked computes the exact same NeMo rel_pos_local_attn output with a fixed, O(1) number of nodes regardless of window, lifting the cap toward NeMo's full [128,128]. Construction: time is tiled into chunks of C frames; each chunk carries its own C+P-1 keys/values (the P-1 halo overlaps the neighbour), so a query attends only within its chunk. K/V are gathered as OVERLAPPING strided chunk views - which ggml's view-bounds check (ggml.c: data_size = dense product of ne, ignoring nb) rejects unless the source is OVER-padded to (C+P-1)*G frames; with that pad the view is legal and a single batched ggml_mul_mat produces the per-chunk q.k blocks [C+P-1, C, G, H]. A diagonal "skew" view (nb1 walking C+P on a [C+P-1,...] tensor, which passes the bounds check since P <= C+P-1) extracts the [P,T] band. The PV side inverse-skews the softmaxed band back to a [C+P-1, C] banded matrix (pad ne0 by C, skew-view, mask the lower off-band), then one batched matmul against the transposed V chunks gathers the context. Verified against the trusted pad-and-shift path (forward_local, itself 1.4e-3 vs a deterministic brute-force band reference): new test test_relpos_attention_local_chunked runs synthetic x/pos through the real layer-0 weights for T up to 333 and W up to 128 (chunk < W and chunk == W), matching forward_local to <1e-3 (max|d| ~6e-4). Existing pad-and-shift path and all encoder/conformer regressions unchanged. Encoder wiring (raise the cap and route long audio to this kernel) follows. 
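The index arithmetic behind the chunked construction can be verified independently of ggml. The sketch below assumes K is padded by W frames on the left so that padded index = key index + W, and checks that band column c of the query at in-chunk offset i lands on column i + c of that chunk's C+P-1 key block. All function names are hypothetical.

```python
def chunk_key_range(g, C, W):
    """Padded-key range [start, stop) carried by chunk g (P = 2W + 1)."""
    P = 2 * W + 1
    start = g * C
    return start, start + C + P - 1

def overpadded_frames(T, C, W):
    # Source length needed so overlapping chunk views pass a dense bounds check.
    P = 2 * W + 1
    G = -(-T // C)  # ceil(T / C) chunks
    return (C + P - 1) * G

def skew_is_consistent(T, C, W):
    """True if every (query, band column) pair maps into its own chunk block."""
    P = 2 * W + 1
    for t in range(T):
        g, i = divmod(t, C)
        start, stop = chunk_key_range(g, C, W)
        for c in range(P):
            padded_key = (t - W + c) + W  # band column c is key t - W + c
            col = i + c                   # diagonal "skew" inside the block
            if start + col != padded_key or not (start <= padded_key < stop):
                return False
    return True
```

For example, `skew_is_consistent(333, 64, 128)` exercises the chunk < W case mentioned in the commit, and `overpadded_frames(333, 64, 128)` gives the padded length for 6 chunks of 320 keys each.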
Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 * feat(encoder): route local attention to chunk-matmul, lift window cap to 128 Wire the O(1)-node chunk-matmul kernel into the encoder and raise the local window cap from 32 to NeMo's full 128. Both conformer attention paths now use it: build_graph (B=1) -> build_graph_local_chunked, build_graph_batched (B>1) -> build_graph_batched_local_chunked. The batched wrapper runs the 4D chunk kernel once per item and stacks the [D,T] outputs into [D,T,B] (the chunk graph is already 4D, so it can't also carry a batch dim); that is O(B) nodes, still O(1) in the window, and B is small. local_attn_window's cap (kMaxLocalWindow) goes 32 -> 128: the pad-and-shift path emitted ~6*(2W+1) nodes/layer (hence the 32 cap to fit kGraphSize), but the chunk-matmul path is window-independent in node count, so long audio now runs at NeMo's full [128,128] window. The pad-and-shift build_graph_local / build_graph_batched_local are kept as the verification oracle for test_relpos_attention_local{,_chunked}. Verified on dgx: full ctest green (51/51). test_encoder_batch_local passes at every forced window W=8..128 (now through the chunked path). e2e on a 16.6-min clip (tdt-0.6b-v3, CPU/16t), auto-local W=128: 36.8s / 9.8GB peak RSS, coherent transcript - faster than the W=32 pad-and-shift capstone (41-48s / 9.1GB) at a 4x wider, NeMo-faithful window, and ~5.6x under the full-attention path that OOM'd the node (151s / 55GB). Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 * docs(bench): add long-audio banded-attention section Document the banded local attention (rel_pos_local_attn) memory/speed win that the chunk-matmul kernel enables. 16.6-min clip, tdt-0.6b-v3, GB10 CPU/16t: global O(T^2) attention 148.3s / 54.0GB vs banded W=128 36.9s / 9.4GB (~4x faster, ~5.7x less peak RAM) at NeMo's full window, with the chunk-matmul making W=128 as cheap as W=32. 
Notes that short clips stay on the global path and are unchanged. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
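Taken together, the commits in this PR describe a small decision rule for the attention window. A minimal sketch of that rule follows, assuming the final state of the branch (cap of 128, auto-switch above 8192 encoder frames). Returning `None` for "full attention" and the handling of the environment variable are assumptions; the source does not describe how invalid values are parsed.

```python
import os

K_MAX_LOCAL_WINDOW = 128       # cap after the chunk-matmul kernel landed
AUTO_LOCAL_THRESHOLD = 8192    # encoder frames, roughly 11 minutes of audio

def local_attn_window(n_frames, env=None):
    """Return the local window W, or None to keep full attention."""
    env = os.environ if env is None else env
    forced = env.get("PARAKEET_ATT_CONTEXT")
    if forced:
        # Forced window [W, W], clamped to the cap (assumed to apply at 128).
        return min(int(forced), K_MAX_LOCAL_WINDOW)
    if n_frames > AUTO_LOCAL_THRESHOLD:
        return K_MAX_LOCAL_WINDOW
    return None  # short audio stays on the full-attention path
```

Passing a plain dict as `env` keeps the function testable without touching the process environment.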

…streaming) (#10) * feat(convert): emit prompt-conditioning KVs + use_bias; handle nested att_context presets Emit parakeet.prompt.{present,num_prompts,dictionary.keys,dictionary.values, default_lang} and parakeet.encoder.use_bias for prompt-conditioned multilingual checkpoints (nvidia/nemotron-3.5-asr-streaming-0.6b). Handle the nested att_context_size preset list ([[56,3],[56,0],...]) by taking the first preset as the default and recording all presets in parakeet.encoder.att_context_presets. Also refine detect_arch: a bare aux_ctc config block is no longer enough to mark a model hybrid. The nemotron prompt RNNT carries an unconfigured aux_ctc stub (num_classes=-1, empty vocabulary) but has no ctc decoder and zero ctc_decoder.* weights (NeMo initializes it RNNT-only), so require an actual model.ctc_decoder before classifying as hybrid. This makes the model convert as arch=rnnt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(loader): read prompt-conditioning config + encoder.use_bias Add PromptCfg (present, num_prompts, default_lang, dict_keys/vals, lang_to_index) and use_bias to ParakeetConfig, and read the new KVs in ModelLoader::load via a new kv_str_arr helper. present=false / use_bias=true defaults keep every existing model byte-identical. Extend test_model_loader with a PARAKEET_TEST_GGUF_NEMOTRON block asserting the resolved prompt dictionary (de=9, auto=101, unknown=-1) and use_bias=false; it skips silently when the fixture env var is unset. The encoder attention/FFN linear bias loads were already optional (clone_weight_opt + ml.tensor guards across relpos_attention/conformer/streaming_encoder), and every subsampling bias is present in this checkpoint, so the use_bias=false model loads and its encoder graph builds with no further changes. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: make nemotron loader assertions reachable without PARAKEET_TEST_GGUF The nemotron prompt-config block was unreachable when PARAKEET_TEST_GGUF was unset, because main() returned 77 before it. Guard the base-model checks behind PARAKEET_TEST_GGUF and run the nemotron block whenever PARAKEET_TEST_GGUF_NEMOTRON is set. Only skip (return 77) when neither env var is present. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(baseline): dump prompt_kernel_out + per-language RNNT reference for nemotron Decode the prompt-conditioned encoder output directly via the model's RNNT decoding object: the prompt model's transcribe dataloader resolves the prompt index from per-cut language metadata, which a bare wav fixture lacks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat: PromptKernel post-encoder conditioning unit + isolated NeMo parity test Concat the constant language one-hot onto the encoder output, then Linear->ReLU ->Linear (prompt_kernel.0/2) on the persistent backend via run_graph. Parity vs NeMo prompt_kernel_out: max|d|=1.9e-6. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(model): apply PromptKernel + resolve target_lang in offline decode resolve_prompt_index maps a locale to its prompt index (empty -> default_lang), and the offline + batch decode paths project the encoder output through the PromptKernel when prompt.present. Threaded target_lang through the transcribe entry points (default empty); non-prompt models take the no-op path. 
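The PromptKernel step (concatenate a constant language one-hot onto each encoder frame, then Linear, ReLU, Linear) can be sketched as below. The weights are toy values, the one-hot is assumed to be appended after the features, and the function names are hypothetical. Because each frame is processed independently, applying the kernel chunk by chunk gives the same frames as applying it once to the whole sequence, which is the property the streaming path relies on.

```python
def linear(x, W, b):
    # y = W x + b with W given as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def prompt_kernel(frames, prompt_index, num_prompts, W0, b0, W2, b2):
    """Apply Linear -> ReLU -> Linear to [frame ; one_hot(prompt_index)]."""
    one_hot = [0.0] * num_prompts
    one_hot[prompt_index] = 1.0
    out = []
    for f in frames:
        h = relu(linear(f + one_hot, W0, b0))
        out.append(linear(h, W2, b2))
    return out

# Toy sizes: 2 features, 3 prompts, hidden size 2 (all hypothetical).
W0 = [[0.5, -1.0, 0.1, 0.2, 0.3], [1.0, 0.25, -0.4, 0.0, 0.6]]
b0 = [0.0, 0.1]
W2 = [[1.0, -0.5], [0.25, 2.0]]
b2 = [0.05, -0.05]
frames = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.5]]

whole = prompt_kernel(frames, 1, 3, W0, b0, W2, b2)
chunked = (prompt_kernel(frames[:2], 1, 3, W0, b0, W2, b2)
           + prompt_kernel(frames[2:], 1, 3, W0, b0, W2, b2))
```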
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: offline nemotron end-to-end NeMo parity (multi-language) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(streaming): apply PromptKernel per chunk; target_lang on session Resolve a language prompt index in the StreamingSession constructor (new target_lang param, default empty -> model default_lang) and apply the prompt_kernel projection to each chunk's encoder frames before the RNN-T decode. The one-hot is constant over time, so per-chunk application is exact and equals the offline forward's single application. Non-prompt models take the no-op path (prompt_.present()==false) and stay byte-identical. run_stream_over_pcm gains a trailing target_lang param (default empty) so a language can route through one entry point; the session already owns its resolved index, so the driver leaves it unused for now (Phase 4 wires it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: streaming nemotron end-to-end NeMo parity Extend dump_prompt_baseline to emit baseline.stream_text: run NeMo's cache-aware streaming encoder, apply m.prompt_kernel to the concatenated streamed output for the target_lang, and RNN-T greedy decode it (specials stripped). Add tests/test_streaming_nemotron.cpp: drive a prompt-aware StreamingSession over the clip and assert sess.text() == baseline.stream_text. Parity gate (lang=en, speech.wav): got == ref EXACTLY: "Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. <en-US> It is certainly very like the old portrait. <en-US>" Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(capi): target_lang variants for transcribe + stream (ABI bump) Add parakeet_capi_transcribe_path_lang, parakeet_capi_transcribe_pcm_lang and parakeet_capi_stream_begin_lang for multilingual prompt-conditioned (nemotron) models. 
target_lang is a locale string; NULL or "" selects the model default and non-prompt models ignore it. An unknown locale on a prompt model is caught at the boundary, returning NULL with the message set on the ctx last error. The original non-lang entry points delegate to the new ones with the model default, preserving behavior. ABI version bumped to 3. test_capi gains a PARAKEET_TEST_GGUF_NEMOTRON-guarded block asserting a known lang transcribes (non-NULL) and an unknown lang returns NULL with a non-empty last_error; the two model blocks are now independent and skip cleanly (77) when neither env var is set. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(cli): --lang flag for multilingual prompt models transcribe gains --lang <locale> to select the language prompt for multilingual (nemotron) prompt models; empty -> the model default and non-prompt models ignore it. The plain offline path routes through the C-API parakeet_capi_transcribe_path_lang when --lang is set (so an unknown locale is a clean error), and keeps the existing free-function path otherwise so behavior for every other model is unchanged. --timestamps threads lang into transcribe_path_with_timestamps; --stream threads it into the StreamingSession ctor (what stream_begin_lang forwards), keeping the rich per-word/EOU output. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * test: e2e NeMo-vs-parakeet.cpp comparison harness (per-language, offline+stream) Refactor gen_nemo_baseline.dump_prompt_baseline into an importable compute_prompt_reference helper and reuse it from the new e2e driver, which runs the built parakeet-cli per (clip, lang, mode) and asserts WER 0 vs NeMo. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: document nemotron multilingual streaming support + prompt KVs README: add the prompt-conditioned multilingual streaming model (nvidia/nemotron-3.5-asr-streaming-0.6b, 40+ locales, --lang, WER 0 offline + streaming). conversion.md: document the parakeet.prompt.* KV schema, encoder.use_bias, att_context_presets, and the prompt_kernel tensors (stay F32). parity.md: add the nemotron coverage row + e2e cross-check note. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * feat(publish): add nemotron-3.5-asr-streaming-0.6b (5 variants, OpenMDW-1.1 card) Add the model to ALL_MODELS and KNOWN_WER (f16/q8_0/q6_k/q5_k/q4_k, all WER 0.0 offline vs NeMo with recorded sizes). Add a per-id LICENSES map (default CC-BY-4.0) so the generated card states OpenMDW-1.1 for this entry, wired into both the per-model and the collection cards (frontmatter license/license_name/ license_link, License section, per-model rows). Quant allowlist unchanged: the prompt_kernel, LSTM prediction net, and featurizer tensors stay F32. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * bench: nemotron-3.5-asr CPU benchmark vs NeMo (WER 0, 2.4x f32 / 2.5x q8_0) Benchmarks the prompt-conditioned nemotron-3.5-asr-streaming-0.6b port on CPU against NeMo (PyTorch CPU), following the existing parakeet.cpp methodology: load once, warm up once, time transcribe only, median of N passes, RTFx = audio_sec / proc_sec. - Adds scripts/bench_nemotron.py. ours runs parakeet-cli bench with the en language prompt; NeMo runs the same prompt forward (preprocessor, encoder, PromptKernel, RNN-T greedy) reusing gen_nemo_baseline.resolve_prompt_lang. Optionally times the cache-aware streaming path too. - Adds --lang to the CLI bench subcommand so the prompt-conditioned timing path selects the same language prompt as transcribe (passed to transcribe_pcm). 
- Adds build_nemotron_section to gen_benchmark_md.py, fed by the new benchmarks/results/nemotron/bench.json, so the section is reproducible. Results on AMD Ryzen 9 9950X3D (20 cores, CPU-only, 8 threads), speech.wav (7.43 s), lang en, median of 7 passes: NeMo RTFx 12.2 parakeet.cpp f32 RTFx 29.4 2.40x agreement WER 0.0000% parakeet.cpp q8_0 RTFx 30.8 2.52x agreement WER 0.0000% streaming f32 compute RTFx 3.80 (latency-oriented) Transcripts are byte-identical to NeMo on the timed runs, so the speed numbers compare equal work. Full suite green (ctest 48/48 non-nemotron, 2/2 nemotron). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(streaming): reject unknown target_lang for prompt models (match offline + capi contract) The streaming StreamingSession ctor silently fell back to the default language on an unknown locale, contradicting the parakeet_capi_stream_begin_lang header contract (NULL on an unknown locale) and diverging from the offline Model::resolve_prompt_index path, which throws. A typo like --stream --lang xx produced wrong-language output with no error. Factor the throwing resolution into PromptCfg::resolve_index_or_throw and use it from both the offline path and the StreamingSession ctor so both reject typos identically. The empty-lang default and the non-prompt no-op are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
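The language-resolution contract that the last commit unifies (empty selects the model default, an unknown locale is rejected, non-prompt models take a no-op path) can be sketched as follows. This is a Python illustration of the behavior described, not the C++ `PromptCfg::resolve_index_or_throw` itself; the `de=9` and `auto=101` entries come from the loader test, while using `auto` as the default language and returning `None` for the no-op path are assumptions.

```python
class PromptCfg:
    def __init__(self, present, default_lang, lang_to_index):
        self.present = present
        self.default_lang = default_lang
        self.lang_to_index = dict(lang_to_index)

    def resolve_index_or_throw(self, target_lang):
        if not self.present:
            return None  # non-prompt model: target_lang is ignored
        lang = target_lang if target_lang else self.default_lang
        if lang not in self.lang_to_index:
            # Reject typos instead of silently falling back to the default.
            raise ValueError(f"unknown target_lang: {lang!r}")
        return self.lang_to_index[lang]

# default_lang="auto" is a hypothetical choice for this example.
cfg = PromptCfg(True, "auto", {"de": 9, "auto": 101})
```

Sharing one throwing resolver between the offline and streaming paths is what makes `--stream --lang xx` fail the same way as the offline call.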

…lang C-API (#11)

* feat(capi): batched target_lang variants (transcribe_pcm_batch_json_lang / _batch_lang)

  Add language-aware batched C-API entry points so a request-coalesced batch can select one language prompt for the whole batch on multilingual (nemotron) models:

      char* parakeet_capi_transcribe_pcm_batch_json_lang(...)
      int parakeet_capi_transcribe_pcm_batch_lang(...)

  The existing non-lang batch functions now delegate to these with nullptr (model default), mirroring the Phase 4 single-clip pattern, so no logic is duplicated. target_lang threads into the C++ batch methods that already accept it; NULL/"" means the model default, non-prompt models ignore it, and an unknown locale is caught by the existing try/catch (NULL / nonzero + last_error). ABI stays v3 (unreleased on this branch); the v3 comment now lists the two new symbols.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(subsampling): support batched causal subsampling (byte-identical to per-item); enable batched nemotron

  The build_graph_batched causal branch already applied the leading ggml_pad_ext uniformly across the batch and masked each item's trailing pad time frames per stage via the all_paddings=3 valid-length recurrence, so it reproduces the standalone causal boundary per item. The B>1 guard assert was a conservative leftover; remove it so the multilingual streaming nemotron model can run real batches.

  Validated byte-identical: a clip transcribed inside a B>1 batch (uniform, mixed-length, reversed order, and a non-empty truncated item as the padded/masked clip) equals the same clip transcribed standalone, plus batched timestamps text parity. Re-enable the positive valid-language (de) 2-clip batch JSON assertion in test_capi_batch_json. No change to the non-causal path.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lists all 11 published models (the 10 Parakeet checkpoints plus the new multilingual streaming nemotron-3.5-asr-streaming-0.6b) with their type, size, notes, and a link to each NVIDIA source, plus a pointer to the GGUF collection repo and docs/parity.md. Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

parakeet.cpp vs NeMo on the NVIDIA GB10, same clip and methodology as the CPU table: NeMo (PyTorch GPU) RTFx 91.8, parakeet.cpp f32 106.5 (1.16x), q8_0 119.8 (1.30x), transcripts byte-identical (WER 0). The margin is smaller than on CPU because nemotron is RNN-T and NeMo's CUDA-graph greedy decode is fast there. NeMo now runs natively on the GB10 via torch 2.11 plus cu128 (no nvcr container). Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
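The benchmark metric used in both tables is RTFx = audio_sec / proc_sec, taken over the median of several timed passes. A minimal sketch of that calculation is below; the timing values are made up for illustration and the helper names are hypothetical.

```python
import statistics

def rtfx(audio_sec, proc_secs):
    """Real-time factor: seconds of audio processed per second of compute,
    using the median of the timed passes."""
    return audio_sec / statistics.median(proc_secs)

def speedup(ours_rtfx, ref_rtfx):
    return ours_rtfx / ref_rtfx

# Hypothetical pass timings for a 7.43 s clip.
example = rtfx(7.43, [0.25, 0.26, 0.24])
```

Dividing the published RTFx values reproduces the reported ratios up to rounding in the last digit, since the tables were presumably computed from unrounded timings.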

…inues (#13) (#15) The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as ordinary vocab tokens to mark end of utterance. The cache-aware streaming decode carried the RNN-T decoder state across chunks but never reset it, so once <EOU> was emitted the prediction net stayed conditioned on it and the joint scored blank on every following frame: the stream went silent after the first utterance (issue #13). This matched NeMo's plain rnnt_decoder_predictions_tensor (which does the same), but that is not how the model is meant to run. NeMo's reference streaming driver for this model (examples/voice_agent/.../nemo/streaming_asr.py NemoStreamingASRService.transcribe) calls reset_state() whenever <EOU>/<EOB> appears in a chunk, so the next utterance decodes from a fresh decoder state. StreamingSession::feed_mel_chunk now does the same: after a chunk emits <EOU>/<EOB> it resets the RNN-T decoder state (LSTM h/c to zero, last token back to SOS) for the next chunk. Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state also drops the encoder cache, but that was verified byte-identical on the transcript (decoder-only reset == full reset_state on the diffusion 60s/2-EOU and 180s/5-EOU clips), so the validated streaming-encoder path is left untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip, and the offline path is unchanged (it matches NeMo offline on single utterances). Adds a gated regression test (test_streaming_eou_reset) plus a NeMo reset-on-EOU baseline generator (gen_stream_reset_baseline.py) that builds a two-utterance clip so an <EOU> fires mid-stream; the test asserts our streamed transcript matches NeMo's reset reference exactly and that the second utterance is recovered. Confirmed it fails with the reset disabled. Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
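The reset-on-EOU behavior can be illustrated with a toy decoder state. This is a sketch under stated assumptions: `ToyDecoderState` stands in for the RNN-T prediction-net state (there is no real LSTM here), the `<SOS>` string is a placeholder, and the tokens for each chunk are supplied by hand rather than decoded.

```python
SOS = "<SOS>"                      # placeholder start-of-sequence token
EOU_TOKENS = {"<EOU>", "<EOB>"}

class ToyDecoderState:
    def __init__(self, hidden=4):
        self.hidden = hidden
        self.reset()

    def reset(self):
        # LSTM h/c back to zero, last token back to SOS.
        self.h = [0.0] * self.hidden
        self.c = [0.0] * self.hidden
        self.last_token = SOS

    def step(self, token):
        # Stand-in for a prediction-net update conditioned on the token.
        self.h = [x + 1.0 for x in self.h]
        self.c = [x + 0.5 for x in self.c]
        self.last_token = token

def feed_chunk(state, chunk_tokens):
    """Consume one chunk's tokens; reset the decoder if the chunk emitted
    an end-of-utterance marker. Returns whether a reset happened."""
    saw_eou = False
    for tok in chunk_tokens:
        state.step(tok)
        saw_eou = saw_eou or tok in EOU_TOKENS
    if saw_eou:
        state.reset()  # decoder only; the encoder cache is left alone
    return saw_eou
```

Without the reset, `last_token` would stay at `<EOU>` for the next chunk, which is the conditioning that kept the joint scoring blank.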

…SON) (#16) Add the data LocalAI needs to build NeMo-faithful segment timestamps: - Offline JSON (transcribe_*_json) now carries "frame_sec", the encoder frame stride in seconds, so a consumer can convert NeMo's frame-unit segment_gap_threshold into the seconds gap between words. - New streaming JSON entry points parakeet_capi_stream_feed_json / parakeet_capi_stream_finalize_json return {text, eou, frame_sec, words} by surfacing the streaming session's existing drain_words() per-word start/end/conf alongside the newly-finalized text and EOU flag. Bumps PARAKEET_CAPI_ABI_VERSION to 4. All existing entry points are unchanged; the new symbols are additive (consumers probe for them). tests/test_capi_stream_json.cpp drives the new streaming JSON path on the EOU model (skips with 77 when PARAKEET_TEST_GGUF_EOU is unset, like the sibling streaming tests). Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
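A consumer can use `frame_sec` to turn a frame-unit gap threshold into seconds and group words into segments. The sketch below assumes each word carries `start` and `end` in seconds, as the per-word start/end/conf description suggests; the `word` key, the strict `>` comparison, and the example values of `frame_sec` and the threshold are assumptions.

```python
def split_segments(words, frame_sec, segment_gap_threshold):
    """Group words into segments, starting a new one when the silence
    between consecutive words exceeds the threshold (given in frames)."""
    gap_sec = segment_gap_threshold * frame_sec
    segments, current = [], []
    for w in words:
        if current and w["start"] - current[-1]["end"] > gap_sec:
            segments.append(current)
            current = []
        current.append(w)
    if current:
        segments.append(current)
    return segments

# Hypothetical payload: frame_sec=0.08 and a 10-frame threshold (0.8 s).
words = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "there", "start": 0.6, "end": 1.0},
    {"word": "again", "start": 2.0, "end": 2.4},
]
segments = split_segments(words, 0.08, 10)
```

Here the 1.0 s pause before the third word exceeds 0.8 s, so it opens a second segment.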

M_PI is not declared by <cmath> on MSVC unless _USE_MATH_DEFINES is set before the header. Define it (plus an #ifndef M_PI fallback) in fft.cpp and mel_gpu.cpp, and add _USE_MATH_DEFINES as a PUBLIC MSVC compile definition on the parakeet target so the test executables that also use M_PI build too. Non-MSVC builds are unaffected. Closes #6 Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ow on GPU (#19)

* feat(subsampling): add subsample_len spatial-length helper

* feat(subsampling): tiled long-audio path (parity vs forward)

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(subsampling): cache valid_out_len in forward_tiled; document tiling test invariant

* feat(encoder): forward_batch_tiled from pre-subsampled features

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(model): tile subsampling for long audio above safe threshold

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(model): tile single-clip transcribe for long audio (CLI/path C-API)

* fix(ggml-cuda): grid-stride pad kernel for dims > 65535 (long-audio attention)

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

#20) Backend device selection only accepted GGML_BACKEND_DEVICE_TYPE_GPU, so integrated GPUs (Ryzen APUs and similar, reported as GGML_BACKEND_DEVICE_TYPE_IGPU) were skipped and the engine fell back to CPU on those machines. The auto-pick now matches both discrete and integrated GPU devices. PARAKEET_DEVICE also gains a third form: besides "cpu" (force CPU) and being unset (auto-pick the first GPU/IGPU), it can now name a specific registry device such as "CUDA0" or "Vulkan1" (case-insensitive). An unknown name logs and falls back to CPU instead of failing. use_sched is now derived from the chosen device type so any non-CPU device still offloads unsupported ops to CPU. Adds a regression test covering the env-var fallback paths (cpu, unknown name, case-insensitive CPU), which run on a CPU-only build, and documents the new behavior in the README. Closes #17 Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
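The three forms of `PARAKEET_DEVICE` described above reduce to a short selection rule. The sketch below models the device registry as a list of (name, type) pairs; that representation, the function name, and treating an empty value like an unset one are assumptions for illustration.

```python
def pick_device(devices, env_value):
    """devices: list of (name, type) with type in {"CPU", "GPU", "IGPU"}.
    Returns (chosen_name, use_sched)."""
    chosen = None
    if not env_value:
        # Unset: auto-pick the first discrete or integrated GPU.
        chosen = next((d for d in devices if d[1] in ("GPU", "IGPU")), None)
    elif env_value.lower() != "cpu":
        # Named device, matched case-insensitively.
        chosen = next((d for d in devices
                       if d[0].lower() == env_value.lower()), None)
        if chosen is None:
            print(f"unknown device {env_value!r}, falling back to CPU")
    if chosen is None:
        chosen = ("CPU", "CPU")
    # Any non-CPU device keeps the scheduler so unsupported ops run on CPU.
    return chosen[0], chosen[1] != "CPU"

devs = [("CPU", "CPU"), ("Vulkan0", "IGPU"), ("CUDA0", "GPU")]
```

With this registry, an unset variable picks the integrated `Vulkan0` device, which is the case the old GPU-only check skipped.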

* ci: pre-built release binaries for linux, macos and windows (#21)

  Adds a release workflow that builds self-contained parakeet-cli bundles for every v* tag: linux x64 (cpu, vulkan, cuda) and arm64 (cpu), macos arm64 (metal) and x64 (cpu), windows x64 (cpu, vulkan, cuda) plus a separate cudart runtime zip. Assets attach to the GitHub release for the tag, creating a draft release when none exists yet.

  Fixes #21

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs: point the README at the pre-built release bundles

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ci: capture the usage banner before grepping in the smoke tests

  parakeet-cli exits 2 when invoked bare; under the runner's bash -e -o pipefail that exit code fails the pipeline even though grep matched.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ci: drop the temporary branch trigger used for matrix validation

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* ci: let ggml pick the CUDA architectures, like llama.cpp releases

  Dropping the hand-rolled CMAKE_CUDA_ARCHITECTURES lists lets ggml's curated non-native default apply: PTX for the datacenter generations (75, 80, 90), real code for the common consumer cards (86, 89, 120a), and 121a-real for GB10 on CUDA 13. Smaller binaries, faster builds, and the list stays current with submodule bumps. Temporarily re-adds the branch trigger to validate the CUDA builds.

  Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>


Commits on Jun 6, 2026

Commits on Jun 7, 2026

Commits on Jun 11, 2026
