
Comparing changes

base repository: mudler/parakeet.cpp
base: v0.1.2
head repository: mudler/parakeet.cpp
compare: v0.2.0
  • 11 commits
  • 61 files changed
  • 5 contributors

Commits on Jun 6, 2026

  1. feat: banded local (Longformer) attention - fix O(T^2) long-audio OOM (#9)
    
    * feat(attn): banded local (Longformer) attention — O(T*window), NeMo-faithful
    
    Adds NeMo rel_pos_local_attn (RelPositionMultiHeadAttentionLongformer) as a
    memory-bounded banded attention. This is the kernel for fixing the O(T^2)
    attention blowup that OOM'd long-audio offline transcription on unified-memory
    GPUs (a ~20-min clip allocated ~100GB and took the node down).
    
    - RelPosAttention::build_graph_local / forward_local: banded attention via
      pad-and-shift, peak memory O(T*window) instead of O(T*T). Each query attends
      only to keys in [t-att_left, t+att_right]; the positional term (q_v . p^T over
      the 2W+1 local pos) is added 1:1 to the banded content scores, exactly as NeMo
      combines them. Verified against NeMo's own sliding_chunks_matmul_qk/pv
      (col->key t-w+c to 1e-6) and a deterministic band reference (1.4e-3).
    - local_rel_pos_encoding: NeMo LocalAttRelPositionalEncoding (positions
      +att_left..-att_right), bit-identical to the centre rows of the full table.
    - pk::last_graph_alloc_bytes(): gallocr high-water accessor for the memory test.
    - gen_nemo_baseline.py --att-context-size (local-attention baseline); and
      gen_band_ref.py for the deterministic band reference. NOTE: NeMo's longformer
      is non-deterministic on short clips (sliding_chunks_matmul_pv reads
      uninitialized memory at boundaries via F.pad value=-1 + as_strided — two
      identical forward() calls differ by >1e3), so kernel parity must use the
      deterministic reference; end-to-end NeMo quality is anchored by long-audio WER.
    - Tests: test_relpos_attention_local (parity 1.4e-3) and
      test_relpos_attention_local_memory (alloc grows ~linearly, ratio 1.98).
    
    Not yet wired into the offline encoder path — follow-up.
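The band rule described above (each query attends only to keys in [t-att_left, t+att_right]) can be illustrated with a brute-force reference. This is a minimal sketch under simplifying assumptions: scalar queries, keys and values, no positional term, and a hypothetical helper name `banded_attention`. It is not the repository's `forward_local`, only a small instance of the same windowing idea.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Brute-force banded softmax attention over scalar q/k/v (hypothetical helper).
// Query t only sees keys in [t - att_left, t + att_right], clipped to [0, T).
// Work and scratch memory are O(T * window) rather than O(T * T).
std::vector<float> banded_attention(const std::vector<float>& q,
                                    const std::vector<float>& k,
                                    const std::vector<float>& v,
                                    int att_left, int att_right) {
    assert(q.size() == k.size() && k.size() == v.size());
    assert(att_left >= 0 && att_right >= 0);
    const int T = static_cast<int>(q.size());
    std::vector<float> out(q.size(), 0.0f);
    std::vector<float> w;  // scores for one query: at most att_left + att_right + 1 entries
    for (int t = 0; t < T; ++t) {
        const int lo = std::max(0, t - att_left);
        const int hi = std::min(T - 1, t + att_right);
        w.assign(hi - lo + 1, 0.0f);
        float max_s = -INFINITY;
        for (int j = lo; j <= hi; ++j) {
            w[j - lo] = q[t] * k[j];
            max_s = std::max(max_s, w[j - lo]);
        }
        // Numerically stable softmax restricted to the band.
        float denom = 0.0f;
        for (float& s : w) {
            s = std::exp(s - max_s);
            denom += s;
        }
        for (int j = lo; j <= hi; ++j) {
            out[t] += (w[j - lo] / denom) * v[j];
        }
    }
    return out;
}
```

With att_left = att_right = 0 every query sees only its own key, so the output equals `v`; widening the window mixes in neighbouring values while the scratch buffer stays bounded by the window size.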
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Assisted-by: Claude:claude-opus-4-8
    
    * feat(encoder): use banded local attention for long audio (fixes O(T^2) OOM)
    
    Wires the banded rel_pos_local_attn kernel into the offline encoder so long
    audio no longer allocates O(T^2) attention, which OOM'd unified-memory GPUs (a
    ~17-min clip drove ~100GB and took the node down).
    
    - ConformerLayer::build_graph gains optional att_left/att_right; when set it
      routes self-attention to RelPosAttention::build_graph_local with a LOCAL
      positional encoding, else keeps full attention unchanged.
    - Encoder::forward picks the window via local_attn_window(Tp): env
      PARAKEET_ATT_CONTEXT=W forces NeMo rel_pos_local_attn [W,W]; otherwise audio
      longer than ~11 min (>8192 encoder frames) auto-switches to W=128. Short audio
      keeps full attention (NeMo-exact; the encoder parity test is unchanged).
    - backend.cpp: bump kGraphSize 16384->65536 — the pad-and-shift kernel adds
      O(window) graph-node descriptors per layer.
    
    Verified end-to-end on a 16.6-min clip with tdt-0.6b-v3 (CPU, 16 threads):
      full attention:  151 s, 55.4 GB peak RSS
      banded (W=16):    41 s,  9.1 GB peak RSS  (coherent transcript)
    ~6x less memory and ~3.7x faster; the full-attention path is what hit ~100GB and
    OOM'd. Short-clip transcripts: W=128 == full byte-for-byte; W=16 essentially
    identical.
    
    Note: pad-and-shift creates O(window) nodes and an O(window^2) incremental
    concat — fine for small windows but slow for W=128 on CPU; an efficient
    chunk-matmul construction (like NeMo's sliding_chunks) is a follow-up.
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Assisted-by: Claude:claude-opus-4-8
    
    * fix(encoder): cap local-attention window to the node budget (no small-model regression)
    
    Bumping kGraphSize 16384->65536 to fit a W=128 banded graph regressed small
    models ~+22% (tdt_ctc-110m): the per-compute metadata context and graph hash-set
    scale with kGraphSize. Revert to 16384 and instead cap the local-attention
    window at W=32 — the pad-and-shift kernel adds ~6*(2W+1) graph nodes/layer, and
    W<=32 fits every shipped model's encoder within the budget. PARAKEET_ATT_CONTEXT
    is clamped to 32.
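The selection rule described in these commits (an environment override, an automatic switch above 8192 encoder frames, and the cap) could look roughly like the sketch below. The function name `pick_local_window`, the -1 sentinel for full attention, and the handling of non-positive or unparsable values are assumptions for illustration; only the 8192-frame threshold, the cap of 32 and the name `kMaxLocalWindow` come from the commit text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

constexpr int kFullAttention = -1;          // assumed sentinel: keep global attention
constexpr int kAutoThresholdFrames = 8192;  // about 11 minutes of audio per the commit text
constexpr int kMaxLocalWindow = 32;         // cap introduced by this commit

// env_value is the PARAKEET_ATT_CONTEXT string, or nullptr when the variable is unset.
int pick_local_window(int encoder_frames, const char* env_value) {
    assert(encoder_frames >= 0);
    if (env_value != nullptr && *env_value != '\0') {
        const int forced = std::atoi(env_value);
        // Assumption: values that parse to <= 0 are ignored rather than rejected.
        if (forced > 0) return std::min(forced, kMaxLocalWindow);
    }
    if (encoder_frames > kAutoThresholdFrames) return kMaxLocalWindow;
    return kFullAttention;  // short audio stays on the NeMo-exact full path
}
```

A caller would pass `std::getenv("PARAKEET_ATT_CONTEXT")` as the second argument; a forced window of 128 is clamped to 32 under this cap.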
    
    Regression bench (librispeech, 100 files, CPU, back-to-back):
      tdt_ctc-110m: master 19.5s vs banded 19.4s (within noise), 0/100 text diffs
      tdt-0.6b-v3:  0/100 text diffs
    Long-audio fix intact: 16.6-min clip + tdt-0.6b-v3 auto-uses W=32 -> 48s,
    9.4 GB peak RSS (vs full attention 151s / 55.4 GB).
    
    Lifting the window cap to NeMo's [128,128] needs the efficient chunk-matmul
    construction (O(1) graph nodes) — follow-up.
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Assisted-by: Claude:claude-opus-4-8
    
    * feat(attn): banded local attention in the batched encoder (B>1)
    
    Mirror the B=1 banded path (build_graph_local) into the fused batched
    encoder so long-audio batches also use NeMo rel_pos_local_attn
    (O(T*window)) instead of full O(T^2) attention.
    
    RelPosAttention::build_graph_batched_local builds the 4D ([dk,T,H,B])
    pad-and-shift band: K/V padded on the time axis, per-window-column views,
    sum_rows content scores + mul_mat positional scores (shared pos broadcast
    over B), a per-item band mask [P,T,1,B] keyed on each item's valid_len,
    soft_max over the window, then the context gather and head merge. Conformer
    build_graph_batched and the batched encoder forward route to it when
    att_left/att_right >= 0, with the shared LOCAL positional encoding.
    
    Verified on dgx (tdt_ctc-110m): the new test_encoder_batch_local exercises
    the path at the production window (W=32 = kMaxLocalWindow). item0 (the full
    clip) is bit-exact beside its shorter padded neighbour (no cross-item leak),
    and the padded item1 matches its standalone run within 5e-2/5e-2 - the same
    tolerance the full-attention batch test uses. Tighter-than-production windows
    only amplify float noise on near-zero activations of the padded clip (item0
    stays exact, mean|d| ~1e-2); not pad leakage.
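As a rough illustration of the per-item band mask, the sketch below builds a [P,T] mask for one batch item from its valid_len, using the column-to-key mapping key = t - W + c quoted earlier. The helper name `band_mask` and the choice to mask only keys (not padded queries) are assumptions; the real kernel may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns a flat mask with P = 2W + 1 entries per query, index t * P + c.
// Entry is 1 when band column c of query t maps to a real key of this item,
// 0 when it falls before the clip start or into padding (key >= valid_len).
std::vector<unsigned char> band_mask(int T, int W, int valid_len) {
    assert(T >= 0 && W >= 0 && valid_len >= 0 && valid_len <= T);
    const int P = 2 * W + 1;
    std::vector<unsigned char> mask(static_cast<std::size_t>(T) * P, 0);
    for (int t = 0; t < T; ++t) {
        for (int c = 0; c < P; ++c) {
            const int key = t - W + c;  // column c of query t looks at this key
            if (key >= 0 && key < valid_len) {
                mask[static_cast<std::size_t>(t) * P + c] = 1;
            }
        }
    }
    return mask;
}
```

Because the mask depends only on each item's own valid_len, a shorter padded neighbour cannot expose its padding to a longer item, which is the no-cross-item-leak property the test checks.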
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Assisted-by: Claude:claude-opus-4-8
    
    * feat(attn): chunk-matmul banded local attention (O(1) graph nodes)
    
    The pad-and-shift banded path (build_graph_local) is correct but emits
    O(window) graph nodes per layer (a P-iteration view+mul+sum_rows+concat
    loop), which is why the window was capped at W=32. build_graph_local_chunked
    computes the exact same NeMo rel_pos_local_attn output with a fixed, O(1)
    number of nodes regardless of window, lifting the cap toward NeMo's full
    [128,128].
    
    Construction: time is tiled into chunks of C frames; each chunk carries its
    own C+P-1 keys/values (the P-1 halo overlaps the neighbour), so a query
    attends only within its chunk. K/V are gathered as OVERLAPPING strided chunk
    views - which ggml's view-bounds check (ggml.c: data_size = dense product of
    ne, ignoring nb) rejects unless the source is OVER-padded to (C+P-1)*G frames;
    with that pad the view is legal and a single batched ggml_mul_mat produces the
    per-chunk q.k blocks [C+P-1, C, G, H]. A diagonal "skew" view (nb1 walking C+P
    on a [C+P-1,...] tensor, which passes the bounds check since P <= C+P-1)
    extracts the [P,T] band. The PV side inverse-skews the softmaxed band back to a
    [C+P-1, C] banded matrix (pad ne0 by C, skew-view, mask the lower off-band),
    then one batched matmul against the transposed V chunks gathers the context.
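The diagonal skew view can be pictured on a plain flat buffer. In the sketch below, one chunk's score block has C query rows of R = C+P-1 key columns; band column c of query i is local key i + c, which sits at flat offset i*R + i + c = i*(C+P) + c. Reading the buffer with a row stride of C+P therefore walks the diagonal band. The helper name `skew_extract_band` and the row-major layout are assumptions standing in for the ggml view, not ggml API calls.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// block: C rows (queries) by R = C + P - 1 columns (local keys), row-major.
// Returns the [C x P] band, where band[i * P + c] = block[i * R + (i + c)].
std::vector<float> skew_extract_band(const std::vector<float>& block,
                                     std::size_t C, std::size_t P) {
    assert(C > 0 && P > 0);
    const std::size_t R = C + P - 1;
    assert(block.size() == C * R);
    const std::size_t skew_stride = R + 1;  // equals C + P: one row plus one column
    std::vector<float> band(C * P);
    for (std::size_t i = 0; i < C; ++i) {
        for (std::size_t c = 0; c < P; ++c) {
            // Largest offset is (C-1)*(C+P) + P-1 = C*R - 1, so the read stays in bounds.
            band[i * P + c] = block[i * skew_stride + c];
        }
    }
    return band;
}
```

The point of the trick is that the band comes out of a single strided read instead of a per-column loop, which is why the node count no longer depends on the window.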
    
    Verified against the trusted pad-and-shift path (forward_local, itself 1.4e-3
    vs a deterministic brute-force band reference): new test
    test_relpos_attention_local_chunked runs synthetic x/pos through the real
    layer-0 weights for T up to 333 and W up to 128 (chunk < W and chunk == W),
    matching forward_local to <1e-3 (max|d| ~6e-4). Existing pad-and-shift path
    and all encoder/conformer regressions unchanged. Encoder wiring (raise the cap
    and route long audio to this kernel) follows.
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Assisted-by: Claude:claude-opus-4-8
    
    * feat(encoder): route local attention to chunk-matmul, lift window cap to 128
    
    Wire the O(1)-node chunk-matmul kernel into the encoder and raise the local
    window cap from 32 to NeMo's full 128. Both conformer attention paths now use
    it: build_graph (B=1) -> build_graph_local_chunked, build_graph_batched (B>1)
    -> build_graph_batched_local_chunked. The batched wrapper runs the 4D chunk
    kernel once per item and stacks the [D,T] outputs into [D,T,B] (the chunk graph
    is already 4D, so it can't also carry a batch dim); that is O(B) nodes, still
    O(1) in the window, and B is small.
    
    local_attn_window's cap (kMaxLocalWindow) goes 32 -> 128: the pad-and-shift
    path emitted ~6*(2W+1) nodes/layer (hence the 32 cap to fit kGraphSize), but the
    chunk-matmul path is window-independent in node count, so long audio now runs at
    NeMo's full [128,128] window. The pad-and-shift build_graph_local /
    build_graph_batched_local are kept as the verification oracle for
    test_relpos_attention_local{,_chunked}.
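The node-budget argument is simple arithmetic and can be checked with a small sketch. The ~6*(2W+1) estimate is taken from the commit text; the helper names, the `other_nodes` parameter and any layer count passed in are hypothetical inputs for illustration, not measured values.

```cpp
#include <cassert>

// Approximate graph nodes the pad-and-shift kernel adds per layer (from the commit text).
inline long pad_and_shift_nodes_per_layer(long W) {
    assert(W >= 0);
    return 6 * (2 * W + 1);
}

// True when `layers` banded layers plus `other_nodes` unrelated nodes fit in `budget`.
inline bool fits_budget(long W, long layers, long other_nodes, long budget) {
    assert(layers >= 0 && other_nodes >= 0 && budget >= 0);
    return layers * pad_and_shift_nodes_per_layer(W) + other_nodes <= budget;
}
```

For example, W=32 gives about 390 nodes per layer and W=128 about 1542. With an assumed 24-layer encoder that is roughly 9360 versus 37008 nodes, which is consistent with W=32 fitting a 16384-node graph while W=128 needed the larger 65536 budget under pad-and-shift.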
    
    Verified on dgx: full ctest green (51/51). test_encoder_batch_local passes at
    every forced window W=8..128 (now through the chunked path). e2e on a 16.6-min
    clip (tdt-0.6b-v3, CPU/16t), auto-local W=128: 36.8s / 9.8GB peak RSS, coherent
    transcript - faster than the W=32 pad-and-shift capstone (41-48s / 9.1GB) at a
    4x wider, NeMo-faithful window, and ~5.6x under the full-attention path that
    OOM'd the node (151s / 55GB).
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Assisted-by: Claude:claude-opus-4-8
    
    * docs(bench): add long-audio banded-attention section
    
    Document the banded local attention (rel_pos_local_attn) memory/speed win that
    the chunk-matmul kernel enables. 16.6-min clip, tdt-0.6b-v3, GB10 CPU/16t:
    global O(T^2) attention 148.3s / 54.0GB vs banded W=128 36.9s / 9.4GB (~4x
    faster, ~5.7x less peak RAM) at NeMo's full window, with the chunk-matmul making
    W=128 as cheap as W=32. Notes that short clips stay on the global path and are
    unchanged.
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Assisted-by: Claude:claude-opus-4-8
    
    ---------
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    localai-bot and mudler authored Jun 6, 2026
    Commit 8436005
  2. Add nemotron-3.5-asr-streaming-0.6b (multilingual prompt-conditioned streaming) (#10)
    
    * feat(convert): emit prompt-conditioning KVs + use_bias; handle nested att_context presets
    
    Emit parakeet.prompt.{present,num_prompts,dictionary.keys,dictionary.values,
    default_lang} and parakeet.encoder.use_bias for prompt-conditioned multilingual
    checkpoints (nvidia/nemotron-3.5-asr-streaming-0.6b). Handle the nested
    att_context_size preset list ([[56,3],[56,0],...]) by taking the first preset as
    the default and recording all presets in parakeet.encoder.att_context_presets.
    
    Also refine detect_arch: a bare aux_ctc config block is no longer enough to mark
    a model hybrid. The nemotron prompt RNNT carries an unconfigured aux_ctc stub
    (num_classes=-1, empty vocabulary) but has no ctc decoder and zero ctc_decoder.*
    weights (NeMo initializes it RNNT-only), so require an actual model.ctc_decoder
    before classifying as hybrid. This makes the model convert as arch=rnnt.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(loader): read prompt-conditioning config + encoder.use_bias
    
    Add PromptCfg (present, num_prompts, default_lang, dict_keys/vals, lang_to_index)
    and use_bias to ParakeetConfig, and read the new KVs in ModelLoader::load via a
    new kv_str_arr helper. present=false / use_bias=true defaults keep every existing
    model byte-identical. Extend test_model_loader with a PARAKEET_TEST_GGUF_NEMOTRON
    block asserting the resolved prompt dictionary (de=9, auto=101, unknown=-1) and
    use_bias=false; it skips silently when the fixture env var is unset.
    
    The encoder attention/FFN linear bias loads were already optional (clone_weight_opt
    + ml.tensor guards across relpos_attention/conformer/streaming_encoder), and every
    subsampling bias is present in this checkpoint, so the use_bias=false model loads
    and its encoder graph builds with no further changes.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test: make nemotron loader assertions reachable without PARAKEET_TEST_GGUF
    
    The nemotron prompt-config block was unreachable when PARAKEET_TEST_GGUF
    was unset, because main() returned 77 before it. Guard the base-model
    checks behind PARAKEET_TEST_GGUF and run the nemotron block whenever
    PARAKEET_TEST_GGUF_NEMOTRON is set. Only skip (return 77) when neither
    env var is present.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(baseline): dump prompt_kernel_out + per-language RNNT reference for nemotron
    
    Decode the prompt-conditioned encoder output directly via the model's RNNT
    decoding object: the prompt model's transcribe dataloader resolves the prompt
    index from per-cut language metadata, which a bare wav fixture lacks.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat: PromptKernel post-encoder conditioning unit + isolated NeMo parity test
    
    Concat the constant language one-hot onto the encoder output, then Linear->ReLU
    ->Linear (prompt_kernel.0/2) on the persistent backend via run_graph. Parity vs
    NeMo prompt_kernel_out: max|d|=1.9e-6.
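The conditioning step described here (append a language one-hot to an encoder frame, then Linear, ReLU, Linear) can be sketched on plain vectors. The `Linear` struct, the function name `prompt_kernel_frame` and the row-major weight layout are assumptions for illustration; the real unit runs as a ggml graph on the backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Dense layer y = W x + b with W stored row-major (out_dim rows of in_dim columns).
struct Linear {
    std::size_t in_dim;
    std::size_t out_dim;
    std::vector<float> weight;
    std::vector<float> bias;

    std::vector<float> apply(const std::vector<float>& x) const {
        assert(x.size() == in_dim);
        assert(weight.size() == in_dim * out_dim && bias.size() == out_dim);
        std::vector<float> y(bias);
        for (std::size_t o = 0; o < out_dim; ++o) {
            for (std::size_t i = 0; i < in_dim; ++i) {
                y[o] += weight[o * in_dim + i] * x[i];
            }
        }
        return y;
    }
};

// Concatenate the one-hot for prompt_index onto one encoder frame, then Linear -> ReLU -> Linear.
std::vector<float> prompt_kernel_frame(const std::vector<float>& enc_frame,
                                       std::size_t prompt_index,
                                       std::size_t num_prompts,
                                       const Linear& fc0, const Linear& fc2) {
    assert(prompt_index < num_prompts);
    std::vector<float> x(enc_frame);
    x.resize(enc_frame.size() + num_prompts, 0.0f);
    x[enc_frame.size() + prompt_index] = 1.0f;
    std::vector<float> h = fc0.apply(x);
    for (float& a : h) a = std::max(a, 0.0f);
    return fc2.apply(h);
}
```

Each frame is processed independently and the one-hot does not change over time, so applying the function frame by frame (or chunk by chunk) gives the same result as applying it to the whole sequence at once.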
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(model): apply PromptKernel + resolve target_lang in offline decode
    
    resolve_prompt_index maps a locale to its prompt index (empty -> default_lang),
    and the offline + batch decode paths project the encoder output through the
    PromptKernel when prompt.present. Threaded target_lang through the transcribe
    entry points (default empty); non-prompt models take the no-op path.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test: offline nemotron end-to-end NeMo parity (multi-language)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(streaming): apply PromptKernel per chunk; target_lang on session
    
    Resolve a language prompt index in the StreamingSession constructor (new
    target_lang param, default empty -> model default_lang) and apply the
    prompt_kernel projection to each chunk's encoder frames before the RNN-T
    decode. The one-hot is constant over time, so per-chunk application is exact
    and equals the offline forward's single application. Non-prompt models take
    the no-op path (prompt_.present()==false) and stay byte-identical.
    
    run_stream_over_pcm gains a trailing target_lang param (default empty) so a
    language can route through one entry point; the session already owns its
    resolved index, so the driver leaves it unused for now (Phase 4 wires it).
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test: streaming nemotron end-to-end NeMo parity
    
    Extend dump_prompt_baseline to emit baseline.stream_text: run NeMo's
    cache-aware streaming encoder, apply m.prompt_kernel to the concatenated
    streamed output for the target_lang, and RNN-T greedy decode it (specials
    stripped). Add tests/test_streaming_nemotron.cpp: drive a prompt-aware
    StreamingSession over the clip and assert sess.text() == baseline.stream_text.
    
    Parity gate (lang=en, speech.wav): got == ref EXACTLY:
      "Well, I don't wish to see it any more, observed Phoebe, turning away her
       eyes. <en-US> It is certainly very like the old portrait. <en-US>"
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(capi): target_lang variants for transcribe + stream (ABI bump)
    
    Add parakeet_capi_transcribe_path_lang, parakeet_capi_transcribe_pcm_lang and
    parakeet_capi_stream_begin_lang for multilingual prompt-conditioned (nemotron)
    models. target_lang is a locale string; NULL or "" selects the model default
    and non-prompt models ignore it. An unknown locale on a prompt model is caught
    at the boundary, returning NULL with the message set on the ctx last error. The
    original non-lang entry points delegate to the new ones with the model default,
    preserving behavior. ABI version bumped to 3.
    
    test_capi gains a PARAKEET_TEST_GGUF_NEMOTRON-guarded block asserting a known
    lang transcribes (non-NULL) and an unknown lang returns NULL with a non-empty
    last_error; the two model blocks are now independent and skip cleanly (77) when
    neither env var is set.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(cli): --lang flag for multilingual prompt models
    
    transcribe gains --lang <locale> to select the language prompt for multilingual
    (nemotron) prompt models; empty -> the model default and non-prompt models
    ignore it. The plain offline path routes through the C-API
    parakeet_capi_transcribe_path_lang when --lang is set (so an unknown locale is a
    clean error), and keeps the existing free-function path otherwise so behavior
    for every other model is unchanged. --timestamps threads lang into
    transcribe_path_with_timestamps; --stream threads it into the StreamingSession
    ctor (what stream_begin_lang forwards), keeping the rich per-word/EOU output.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test: e2e NeMo-vs-parakeet.cpp comparison harness (per-language, offline+stream)
    
    Refactor gen_nemo_baseline.dump_prompt_baseline into an importable
    compute_prompt_reference helper and reuse it from the new e2e driver, which
    runs the built parakeet-cli per (clip, lang, mode) and asserts WER 0 vs NeMo.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * docs: document nemotron multilingual streaming support + prompt KVs
    
    README: add the prompt-conditioned multilingual streaming model
    (nvidia/nemotron-3.5-asr-streaming-0.6b, 40+ locales, --lang, WER 0 offline +
    streaming). conversion.md: document the parakeet.prompt.* KV schema,
    encoder.use_bias, att_context_presets, and the prompt_kernel tensors (stay F32).
    parity.md: add the nemotron coverage row + e2e cross-check note.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(publish): add nemotron-3.5-asr-streaming-0.6b (5 variants, OpenMDW-1.1 card)
    
    Add the model to ALL_MODELS and KNOWN_WER (f16/q8_0/q6_k/q5_k/q4_k, all WER 0.0
    offline vs NeMo with recorded sizes). Add a per-id LICENSES map (default
    CC-BY-4.0) so the generated card states OpenMDW-1.1 for this entry, wired into
    both the per-model and the collection cards (frontmatter license/license_name/
    license_link, License section, per-model rows). Quant allowlist unchanged: the
    prompt_kernel, LSTM prediction net, and featurizer tensors stay F32.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * bench: nemotron-3.5-asr CPU benchmark vs NeMo (WER 0, 2.4x f32 / 2.5x q8_0)
    
    Benchmarks the prompt-conditioned nemotron-3.5-asr-streaming-0.6b port on CPU
    against NeMo (PyTorch CPU), following the existing parakeet.cpp methodology:
    load once, warm up once, time transcribe only, median of N passes, RTFx =
    audio_sec / proc_sec.
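The metric itself is small enough to write down. The sketch below computes the median of N timed passes and RTFx = audio_sec / proc_sec as defined above; the helper names `median_of` and `rtfx` are illustrative and not taken from the benchmark scripts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the timed passes (takes a copy so the caller's order is preserved).
double median_of(std::vector<double> xs) {
    assert(!xs.empty());
    std::sort(xs.begin(), xs.end());
    const std::size_t n = xs.size();
    return (n % 2 == 1) ? xs[n / 2] : 0.5 * (xs[n / 2 - 1] + xs[n / 2]);
}

// Real-time factor: seconds of audio processed per second of compute.
double rtfx(double audio_sec, double proc_sec) {
    assert(proc_sec > 0.0);
    return audio_sec / proc_sec;
}
```

A speedup over a baseline is then the ratio of the two RTFx values measured on the same clip.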
    
    - Adds scripts/bench_nemotron.py. ours runs parakeet-cli bench with the en
      language prompt; NeMo runs the same prompt forward (preprocessor, encoder,
      PromptKernel, RNN-T greedy) reusing gen_nemo_baseline.resolve_prompt_lang.
      Optionally times the cache-aware streaming path too.
    - Adds --lang to the CLI bench subcommand so the prompt-conditioned timing path
      selects the same language prompt as transcribe (passed to transcribe_pcm).
    - Adds build_nemotron_section to gen_benchmark_md.py, fed by the new
      benchmarks/results/nemotron/bench.json, so the section is reproducible.
    
    Results on AMD Ryzen 9 9950X3D (16 cores, CPU-only, 8 threads), speech.wav
    (7.43 s), lang en, median of 7 passes:
    
      NeMo            RTFx 12.2
      parakeet.cpp f32  RTFx 29.4  2.40x  agreement WER 0.0000%
      parakeet.cpp q8_0 RTFx 30.8  2.52x  agreement WER 0.0000%
      streaming f32     compute RTFx 3.80 (latency-oriented)
    
    Transcripts are byte-identical to NeMo on the timed runs, so the speed numbers
    compare equal work. Full suite green (ctest 48/48 non-nemotron, 2/2 nemotron).
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * fix(streaming): reject unknown target_lang for prompt models (match offline + capi contract)
    
    The streaming StreamingSession ctor silently fell back to the default
    language on an unknown locale, contradicting the parakeet_capi_stream_begin_lang
    header contract (NULL on an unknown locale) and diverging from the offline
    Model::resolve_prompt_index path, which throws. A typo like --stream --lang xx
    produced wrong-language output with no error.
    
    Factor the throwing resolution into PromptCfg::resolve_index_or_throw and use
    it from both the offline path and the StreamingSession ctor so both reject
    typos identically. The empty-lang default and the non-prompt no-op are
    unchanged.
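A shared resolver with these semantics might look like the sketch below. `PromptCfg`, `lang_to_index`, `default_lang` and `resolve_index_or_throw` are names from the commit messages, but the member layout, the -1 sentinel for non-prompt models and the exception type are assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

struct PromptCfg {
    bool present = false;                      // false for non-prompt models
    std::string default_lang;                  // used when target_lang is empty
    std::map<std::string, int> lang_to_index;  // locale -> prompt index

    // Plain lookup: -1 when the locale is not in the dictionary.
    int lookup(const std::string& lang) const {
        const auto it = lang_to_index.find(lang);
        return it == lang_to_index.end() ? -1 : it->second;
    }

    // Shared by the offline and streaming paths so both reject typos the same way.
    int resolve_index_or_throw(const std::string& target_lang) const {
        if (!present) return -1;  // assumed sentinel for the non-prompt no-op path
        assert(!lang_to_index.empty());
        const std::string& lang = target_lang.empty() ? default_lang : target_lang;
        const int idx = lookup(lang);
        if (idx < 0) {
            throw std::invalid_argument("unknown target_lang: " + lang);
        }
        return idx;
    }
};
```

With a dictionary containing de=9 and auto=101 (the values asserted in the loader test), resolving "de" yields 9, while a typo such as "xx" throws instead of silently falling back to the default language.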
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    ---------
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 6, 2026
    Commit 271e70b
  3. Batch mode for nemotron: batched causal subsampling + batched target_lang C-API (#11)
    
    * feat(capi): batched target_lang variants (transcribe_pcm_batch_json_lang / _batch_lang)
    
    Add language-aware batched C-API entry points so a request-coalesced batch
    can select one language prompt for the whole batch on multilingual
    (nemotron) models:
    
      char* parakeet_capi_transcribe_pcm_batch_json_lang(...)
      int   parakeet_capi_transcribe_pcm_batch_lang(...)
    
    The existing non-lang batch functions now delegate to these with nullptr
    (model default), mirroring the Phase 4 single-clip pattern, so no logic is
    duplicated. target_lang threads into the C++ batch methods that already
    accept it; NULL/"" means the model default, non-prompt models ignore it,
    and an unknown locale is caught by the existing try/catch (NULL / nonzero +
    last_error). ABI stays v3 (unreleased on this branch); the v3 comment now
    lists the two new symbols.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(subsampling): support batched causal subsampling (byte-identical to per-item); enable batched nemotron
    
    The build_graph_batched causal branch already applied the leading
    ggml_pad_ext uniformly across the batch and masked each item's trailing
    pad time frames per stage via the all_paddings=3 valid-length
    recurrence, so it reproduces the standalone causal boundary per item.
    The B>1 guard assert was a conservative leftover; remove it so the
    multilingual streaming nemotron model can run real batches.
    
    Validated byte-identical: a clip transcribed inside a B>1 batch (uniform,
    mixed-length, reversed order, and a non-empty truncated item as the
    padded/masked clip) equals the same clip transcribed standalone, plus
    batched timestamps text parity. Re-enable the positive valid-language
    (de) 2-clip batch JSON assertion in test_capi_batch_json. No change to
    the non-causal path.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    ---------
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 6, 2026
    Commit 50dfc24
  4. docs: add a supported-models table with links (#12)

    Lists all 11 published models (the 10 Parakeet checkpoints plus the new
    multilingual streaming nemotron-3.5-asr-streaming-0.6b) with their type, size,
    notes, and a link to each NVIDIA source, plus a pointer to the GGUF collection
    repo and docs/parity.md.
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 6, 2026
    Commit 86bc69e
  5. docs(bench): add nemotron GPU numbers (GB10) to BENCHMARK.md (#14)

    parakeet.cpp vs NeMo on the NVIDIA GB10, same clip and methodology as the CPU
    table: NeMo (PyTorch GPU) RTFx 91.8, parakeet.cpp f32 106.5 (1.16x), q8_0 119.8
    (1.30x), transcripts byte-identical (WER 0). The margin is smaller than on CPU
    because nemotron is RNN-T and NeMo's CUDA-graph greedy decode is fast there.
    NeMo now runs natively on the GB10 via torch 2.11 plus cu128 (no nvcr container).
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 6, 2026
    Commit 96c3177
  6. fix: reset the streaming decoder on <EOU>/<EOB> so transcription continues (#13) (#15)
    
    The realtime EOU model (parakeet_realtime_eou_120m-v1) emits <EOU> / <EOB> as
    ordinary vocab tokens to mark end of utterance. The cache-aware streaming decode
    carried the RNN-T decoder state across chunks but never reset it, so once <EOU>
    was emitted the prediction net stayed conditioned on it and the joint scored
    blank on every following frame: the stream went silent after the first
    utterance (issue #13). This matched NeMo's plain rnnt_decoder_predictions_tensor
    (which does the same), but that is not how the model is meant to run.
    
    NeMo's reference streaming driver for this model
    (examples/voice_agent/.../nemo/streaming_asr.py NemoStreamingASRService.transcribe)
    calls reset_state() whenever <EOU>/<EOB> appears in a chunk, so the next
    utterance decodes from a fresh decoder state. StreamingSession::feed_mel_chunk
    now does the same: after a chunk emits <EOU>/<EOB> it resets the RNN-T decoder
    state (LSTM h/c to zero, last token back to SOS) for the next chunk.
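The reset rule can be sketched independently of the real session class. Below, `DecoderState`, `reset_decoder` and `reset_if_end_marker` are hypothetical names, and the token ids are passed in as parameters because the actual vocabulary ids are not given here; the sketch only mirrors the described behaviour (zero the LSTM h/c, set the last token back to SOS).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Minimal stand-in for the RNN-T prediction-net state carried across chunks.
struct DecoderState {
    std::vector<float> h;  // LSTM hidden state
    std::vector<float> c;  // LSTM cell state
    int last_token;        // token the prediction net is conditioned on
};

// Fresh state: zero h/c and condition on SOS again.
void reset_decoder(DecoderState& s, int sos_token) {
    assert(s.h.size() == s.c.size());
    std::fill(s.h.begin(), s.h.end(), 0.0f);
    std::fill(s.c.begin(), s.c.end(), 0.0f);
    s.last_token = sos_token;
}

// Call after decoding a chunk. Resets only the decoder (the encoder cache is untouched)
// when the chunk emitted an end-of-utterance or end-of-backchannel marker.
bool reset_if_end_marker(DecoderState& s, const std::vector<int>& chunk_tokens,
                         int eou_token, int eob_token, int sos_token) {
    const bool hit = std::any_of(chunk_tokens.begin(), chunk_tokens.end(),
                                 [&](int t) { return t == eou_token || t == eob_token; });
    if (hit) reset_decoder(s, sos_token);
    return hit;
}
```

The reset takes effect for the next chunk, so tokens already emitted in the current chunk are kept.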
    
    Only the decoder is reset, not the StreamingEncoder cache. NeMo's reset_state
    also drops the encoder cache, but that was verified byte-identical on the
    transcript (decoder-only reset == full reset_state on the diffusion 60s/2-EOU
    and 180s/5-EOU clips), so the validated streaming-encoder path is left
    untouched. enc_frame_ keeps running so EOU timestamps stay absolute in the clip,
    and the offline path is unchanged (it matches NeMo offline on single utterances).
    
    Adds a gated regression test (test_streaming_eou_reset) plus a NeMo reset-on-EOU
    baseline generator (gen_stream_reset_baseline.py) that builds a two-utterance
    clip so an <EOU> fires mid-stream; the test asserts our streamed transcript
    matches NeMo's reset reference exactly and that the second utterance is
    recovered. Confirmed it fails with the reset disabled.
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 6, 2026
    Commit abd0087

Commits on Jun 7, 2026

  1. feat(capi): ABI v4 segment-timestamp support (frame_sec + streaming JSON) (#16)
    
    Add the data LocalAI needs to build NeMo-faithful segment timestamps:
    
    - Offline JSON (transcribe_*_json) now carries "frame_sec", the encoder
      frame stride in seconds, so a consumer can convert NeMo's frame-unit
      segment_gap_threshold into the seconds gap between words.
    
    - New streaming JSON entry points parakeet_capi_stream_feed_json /
      parakeet_capi_stream_finalize_json return {text, eou, frame_sec, words}
      by surfacing the streaming session's existing drain_words() per-word
      start/end/conf alongside the newly-finalized text and EOU flag.
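On the consumer side, "frame_sec" is what turns a frame-unit gap threshold into seconds. The sketch below groups per-word timestamps into segments under that conversion. The `Word` struct, the function name `segment_starts` and the use of >= for the comparison are assumptions; a real consumer should follow the exact rule of the NeMo version it mirrors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-word timing in seconds, as carried by the "words" array.
struct Word {
    double start;
    double end;
};

// Returns the index of the first word of each segment. A new segment starts when the
// silence since the previous word reaches gap_threshold_frames * frame_sec seconds.
std::vector<std::size_t> segment_starts(const std::vector<Word>& words,
                                        double frame_sec,
                                        double gap_threshold_frames) {
    assert(frame_sec > 0.0 && gap_threshold_frames >= 0.0);
    const double gap_sec = gap_threshold_frames * frame_sec;  // frames -> seconds
    std::vector<std::size_t> starts;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i == 0 || words[i].start - words[i - 1].end >= gap_sec) {
            starts.push_back(i);
        }
    }
    return starts;
}
```

For instance, with an illustrative frame_sec of 0.08 and a threshold of 10 frames, any pause of at least 0.8 s between words opens a new segment.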
    
    Bumps PARAKEET_CAPI_ABI_VERSION to 4. All existing entry points are
    unchanged; the new symbols are additive (consumers probe for them).
    
    tests/test_capi_stream_json.cpp drives the new streaming JSON path on the
    EOU model (skips with 77 when PARAKEET_TEST_GGUF_EOU is unset, like the
    sibling streaming tests).
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    
    Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    localai-bot and mudler authored Jun 7, 2026
    Commit ce02d29
  2. fix: define M_PI for MSVC builds (#18)

    M_PI is not declared by <cmath> on MSVC unless _USE_MATH_DEFINES is set
    before the header. Define it (plus an #ifndef M_PI fallback) in fft.cpp and
    mel_gpu.cpp, and add _USE_MATH_DEFINES as a PUBLIC MSVC compile definition on
    the parakeet target so the test executables that also use M_PI build too.
    Non-MSVC builds are unaffected.
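The pattern described above is the usual one for portable use of M_PI. A minimal sketch follows; the `hann` function is only an illustrative consumer of the constant and is not taken from fft.cpp or mel_gpu.cpp.

```cpp
// _USE_MATH_DEFINES must be defined before <cmath> is first included for MSVC to
// declare M_PI. The guard avoids a redefinition warning when the build system
// already passes it as a compile definition.
#ifndef _USE_MATH_DEFINES
#define _USE_MATH_DEFINES
#endif
#include <cmath>

#include <cassert>

// Fallback for toolchains where <cmath> was already included without the define.
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// Periodic Hann window coefficient n of N (illustrative use of M_PI).
double hann(int n, int N) {
    assert(N > 0);
    return 0.5 - 0.5 * std::cos(2.0 * M_PI * n / N);
}
```

On GCC and Clang the define is harmless, which matches the note that non-MSVC builds are unaffected.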
    
    Closes #6
    
    Co-authored-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 7, 2026
    Commit eb09678
  3. fix: tile subsampling for long audio to avoid ggml 2^31 tensor overflow on GPU (#19)
    
    * feat(subsampling): add subsample_len spatial-length helper
    
    * feat(subsampling): tiled long-audio path (parity vs forward)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * refactor(subsampling): cache valid_out_len in forward_tiled; document tiling test invariant
    
    * feat(encoder): forward_batch_tiled from pre-subsampled features
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(model): tile subsampling for long audio above safe threshold
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(model): tile single-clip transcribe for long audio (CLI/path C-API)
    
    * fix(ggml-cuda): grid-stride pad kernel for dims > 65535 (long-audio attention)
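To make the length bookkeeping behind these commits concrete, here is a rough sketch of a spatial-length helper and a 32-bit overflow check. Everything in it is an assumption for illustration: the real subsample_len may use different convolution parameters, `subsample_len_sketch` and `exceeds_int32_elements` are hypothetical names, and the kernel 3 / stride 2 / padding 1 / three-stage configuration is only a typical 8x subsampling setup, not something the commit message states.

```cpp
#include <cstdint>

// Output length of one strided convolution stage (standard formula).
inline int64_t conv_out_len(int64_t in_len, int64_t kernel, int64_t stride, int64_t pad) {
    return (in_len + 2 * pad - kernel) / stride + 1;
}

// Hypothetical stand-in for a subsample_len helper: length along the time
// axis after `stages` identical stages (ASSUMED kernel=3, stride=2, pad=1).
inline int64_t subsample_len_sketch(int64_t frames, int stages = 3) {
    int64_t len = frames;
    for (int i = 0; i < stages; ++i) {
        len = conv_out_len(len, /*kernel=*/3, /*stride=*/2, /*pad=*/1);
    }
    return len;
}

// True when a 3-D tensor has more elements than a signed 32-bit index can
// address (2^31 - 1), which is the kind of limit that motivates tiling.
inline bool exceeds_int32_elements(int64_t d0, int64_t d1, int64_t d2) {
    return d0 * d1 * d2 > INT32_MAX;
}
```

For example, with illustrative dimensions of 256 channels, 120000 frames and 80 features, the intermediate tensor has 2,457,600,000 elements, which is above the 32-bit limit; splitting the time axis into tiles keeps each intermediate tensor below it.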
    
    ---------
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 7, 2026
    96b81bb
  4. fix: select integrated GPUs and allow PARAKEET_DEVICE to name a device (#20)
    
    Backend device selection only accepted GGML_BACKEND_DEVICE_TYPE_GPU, so
    integrated GPUs (Ryzen APUs and similar, reported as
    GGML_BACKEND_DEVICE_TYPE_IGPU) were skipped and the engine fell back to
    CPU on those machines.
    
    The auto-pick now matches both discrete and integrated GPU devices.
    PARAKEET_DEVICE also gains a third form: besides "cpu" (force CPU) and
    being unset (auto-pick the first GPU/IGPU), it can now name a specific
    registry device such as "CUDA0" or "Vulkan1" (case-insensitive). An unknown
    name logs a message and falls back to CPU instead of failing. use_sched is
    now derived from the chosen device type so any non-CPU device still
    offloads unsupported ops to CPU.
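The selection rules described above can be sketched without ggml as follows. This is a simplified model, not the project's code: `Device`, `DeviceType`, `Choice` and `pick_device` are hypothetical names, the registry is a plain vector, and an unset variable is modelled by passing `nullptr`.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

enum class DeviceType { CPU, GPU, IGPU };

struct Device {
    std::string name;
    DeviceType  type;
};

struct Choice {
    std::string name;
    bool        use_sched;  // true for any non-CPU device
};

inline std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// env == nullptr models PARAKEET_DEVICE being unset.
inline Choice pick_device(const char * env, const std::vector<Device> & registry) {
    const Choice cpu{"CPU", false};
    if (env == nullptr) {
        // Auto-pick: first discrete or integrated GPU in registry order.
        for (const Device & d : registry) {
            if (d.type == DeviceType::GPU || d.type == DeviceType::IGPU) {
                return {d.name, true};
            }
        }
        return cpu;
    }
    const std::string want = to_lower(env);
    if (want == "cpu") {
        return cpu;  // force CPU, case-insensitively
    }
    for (const Device & d : registry) {
        if (to_lower(d.name) == want) {
            return {d.name, d.type != DeviceType::CPU};
        }
    }
    return cpu;  // unknown name: fall back to CPU (the real code also logs)
}
```

With a registry of `CPU`, `Vulkan0` (integrated) and `CUDA0` (discrete), an unset variable picks `Vulkan0`, `"cuda0"` picks `CUDA0`, and an unrecognized name yields the CPU choice with `use_sched` set to false.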
    
    Adds a regression test covering the env-var fallback paths (cpu, unknown
    name, case-insensitive CPU), which run on a CPU-only build, and documents
    the new behavior in the README.
    
    Closes #17
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    3 people authored Jun 7, 2026
    e270af7

Commits on Jun 11, 2026

  1. ci: pre-built release binaries for linux, macos and windows (#22)

    * ci: pre-built release binaries for linux, macos and windows (#21)
    
    Adds a release workflow that builds self-contained parakeet-cli bundles
    for every v* tag: linux x64 (cpu, vulkan, cuda) and arm64 (cpu), macos
    arm64 (metal) and x64 (cpu), windows x64 (cpu, vulkan, cuda) plus a
    separate cudart runtime zip. Assets attach to the GitHub release for
    the tag, creating a draft release when none exists yet.
    
    Fixes #21
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    
    * docs: point the README at the pre-built release bundles
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    
    * ci: capture the usage banner before grepping in the smoke tests
    
    parakeet-cli exits 2 when invoked bare; under the runner's bash -e -o
    pipefail that exit code fails the pipeline even though grep matched.
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    
    * ci: drop the temporary branch trigger used for matrix validation
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    
    * ci: let ggml pick the CUDA architectures, like llama.cpp releases
    
    Dropping the hand-rolled CMAKE_CUDA_ARCHITECTURES lists lets ggml's
    curated non-native default apply: PTX for the datacenter generations
    (75, 80, 90), real code for the common consumer cards (86, 89, 120a),
    and 121a-real for GB10 on CUDA 13. Smaller binaries, faster builds,
    and the list stays current with submodule bumps.
    
    Temporarily re-adds the branch trigger to validate the CUDA builds.
    
    Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
    
    ---------
    
    Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
    Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
    3 people authored Jun 11, 2026
    9db92be