Comparing changes

base repository: mudler/parakeet.cpp
base: v0.1.1
head repository: mudler/parakeet.cpp
compare: v0.1.2
  • 7 commits
  • 71 files changed
  • 3 contributors

Commits on May 30, 2026

  1. docs: add parakeet.cpp vs NeMo / whisper.cpp duel videos

    Short side-by-side "processing race" clips for the README and benchmark doc:
    parakeet.cpp vs NeMo (PyTorch) on GPU (byte-for-byte identical output,
    parakeet.cpp finishes first), and vs whisper.cpp turbo on GPU and CPU (same
    accuracy, about 12x and 27x faster). GIF inline plus the source MP4s under
    benchmarks/media/. The demo block is emitted by gen_benchmark_md.py so it
    survives a benchmark regen.
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    mudler committed May 30, 2026
    26333bf View commit details
  2. docs: add parakeet.cpp vs NeMo (PyTorch) on CPU duel

    Same processing-race clip on CPU against NeMo's own PyTorch runtime
    (tdt-0.6b-v2, 23s LibriSpeech clip, 8 threads): parakeet.cpp 426 ms vs NeMo
    661 ms, about 1.5x faster, transcripts byte-for-byte identical. NeMo measured
    in the nvcr.io/nvidia/nemo container, CPU mode. Linked from the README demo
    section and added to the benchmark doc's demo table.
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    mudler committed May 30, 2026
    5ed9789 View commit details
  3. docs: embed the NeMo-CPU duel inline in the benchmark doc

    Show the parakeet.cpp vs NeMo (PyTorch) CPU race as an inline GIF like the GPU
    one, instead of a table link. The whisper.cpp matchups stay as links.
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    mudler committed May 30, 2026
    60bd1bb View commit details

Commits on May 31, 2026

  1. docs: trim dead terminal lead-in from the duel clips

    The recorder showed about 1s of empty terminal before the TUI drew its first
    frame; the clips now start on the first rendered frame. Regenerated the GIFs and
    MP4s under benchmarks/media/.
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    mudler committed May 31, 2026
    cb45f68 View commit details

Commits on Jun 1, 2026

  1. Batched encoder: run N clips through one fused ggml graph (#1)

    * feat(encoder): add MelBatch + forward_batch (B=1 per-item loop)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(model): transcribe_pcm_batch + extract decode_enc_out helper
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * fix(model): guard invalid sample_rate in transcribe_pcm_batch (parity with single-clip)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(capi): parakeet_capi_transcribe_pcm_batch
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * fix(capi): null out[] on entry so error paths leave a clean, uniform contract
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test: batch API B=1 equivalence smoke
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(subsampling): batched build_graph_batched; build_graph adapts B=1
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * fix(subsampling): release-surviving guard for batched-causal; dedupe valid_out_len
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test(subsampling): batched-vs-standalone per-item equivalence
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * fix(subsampling): per-stage per-item trailing-pad input masking for batched conv (matches NeMo MaskedConvSequential)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(conformer): batch axis through build_conv_module (B=1 unchanged)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(attention): batched build_graph_batched with 4D rel-shift
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test(attention): batched rel-shift equivalence + padding invariance
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(conformer): build_graph_batched ([D,T,B]); build_graph adapts B=1
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test(conformer): batched-vs-standalone per-item equivalence + padding invariance
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * fix(conformer): per-item depthwise conv for B>1 (ggml 1D im2col requires ne[3]==1); B=1 byte-identical
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(encoder): fused single-graph batched forward_batch
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * test(encoder): fused batched equivalence + padding invariance
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * bench: add bench-batch CLI subcommand for batched-encoder throughput
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * docs(encoder): correct forward_batch comment (fused graph, not internal loop)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(model): transcribe_pcm_batch_with_timestamps + extract decode_enc_out_with_timestamps
    
    Adds batched timestamped transcription (N 16 kHz clips -> N Transcriptions)
    built on the fused batched encoder, plus the public
    transcribe_pcm_batch_with_timestamps entry point. The per-item decode tail of
    transcribe_16k_with_timestamps is extracted verbatim into a file-scope
    decode_enc_out_with_timestamps helper (behavior-preserving) and reused by both
    the single-clip and batched paths.
    
    Also fixes a pre-existing batched-encoder bug surfaced by the new equivalence
    test: forward_batch emitted each enc_outs[b] at the padded width Tp, but the
    decoders index enc_out[c*Tout + t] with Tout = valid_Tout[b]. For a padded
    (shorter) item that stride mismatch misaligned every row after the first,
    corrupting the decode (e.g. an 18-word clip collapsed to 3 garbage words). This
    affected the text-only transcribe_pcm_batch path too. forward_batch now
    compacts each enc_outs[b] to its own valid_Tout[b] columns so the row stride
    matches and no pad-derived frames reach the decoder; test_encoder_batch's
    slice is updated to the compacted per-item width.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
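
The stride mismatch described in this fix can be illustrated with a minimal sketch. It assumes a channel-major buffer in which element (c, t) lives at `c * width + t`. The helper name `compact_rows` and its parameters are hypothetical and are not taken from the project's `forward_batch` code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the project's API): copy the first `valid_t`
// columns of each of `channels` rows out of a buffer whose row stride is
// `padded_t`. The result has row stride `valid_t`, so
// out[c * valid_t + t] == padded[c * padded_t + t] for every valid (c, t).
std::vector<float> compact_rows(const std::vector<float>& padded,
                                std::size_t channels,
                                std::size_t padded_t,
                                std::size_t valid_t) {
    assert(valid_t <= padded_t);
    assert(padded.size() == channels * padded_t);
    std::vector<float> out(channels * valid_t);
    for (std::size_t c = 0; c < channels; ++c) {
        for (std::size_t t = 0; t < valid_t; ++t) {
            out[c * valid_t + t] = padded[c * padded_t + t];
        }
    }
    return out;
}
```

With two channels, a padded width of 4 and a valid width of 2, indexing the padded buffer as `padded[c * 2 + t]` reads pad values from the second row onward, while the compacted buffer returns the intended elements. This mirrors the misalignment of "every row after the first" named in the commit message.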
    
    * refactor(model): share build_mel_batch across batch paths; restore CTC span rationale
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(capi): parakeet_capi_transcribe_pcm_batch_json (batched timestamps, ABI bump)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * docs(capi): state sum(n_samples) precondition for batch_json as caller-must-uphold
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * refactor(decode): share argmax/max_prob_conf in decode_common.hpp
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(prediction): step_batch (batched LSTM, [H,N])
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * docs(prediction): explain the 4H gate-slice stride in step_batch
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
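
One way to picture a 4H gate-slice stride is the sketch below. It is an illustration under stated assumptions, not the project's `step_batch`: the gate pre-activations for N items are assumed to be stored as one column of 4H values per item, in the PyTorch gate order (input, forget, cell candidate, output), and the function name `lstm_cell_batch` is hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

inline float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// Hypothetical sketch: apply the LSTM cell update to N items at once.
// gates: [4H, N], item n starts at offset n * 4H (assumed layout);
//        within an item, gate k occupies the slice [k * H, (k + 1) * H).
// c, h:  [H, N], item n starts at offset n * H; both are updated in place.
void lstm_cell_batch(const std::vector<float>& gates,
                     std::vector<float>& c,
                     std::vector<float>& h,
                     std::size_t H, std::size_t N) {
    assert(gates.size() == 4 * H * N);
    assert(c.size() == H * N && h.size() == H * N);
    for (std::size_t n = 0; n < N; ++n) {
        const float* g = gates.data() + n * 4 * H;  // per-item stride is 4H
        for (std::size_t j = 0; j < H; ++j) {
            const float i_gate = sigmoid(g[0 * H + j]);
            const float f_gate = sigmoid(g[1 * H + j]);
            const float g_cand = std::tanh(g[2 * H + j]);
            const float o_gate = sigmoid(g[3 * H + j]);
            float& cell = c[n * H + j];
            cell = f_gate * cell + i_gate * g_cand;
            h[n * H + j] = o_gate * std::tanh(cell);
        }
    }
}
```

The point of the sketch is only the indexing: the state buffers advance by H per item while the gate buffer advances by 4H, so mixing the two strides would read another item's gates.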
    
    * feat(joint): step_logits_batch (batched joint, [V_plus,N])
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(decode): transducer_greedy_batch (batched RNNT+TDT greedy, bit-exact parity)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * refactor(decode): extract commit_state lambda in transducer_greedy_batch; document max_symbols assumption
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * feat(model): batched transducer decode in transcribe_pcm_batch[_with_timestamps]
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * bench: add bench-decode (batched vs serial transducer decode timing)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * perf(decode): cache prediction-net g across rounds in transducer_greedy_batch (bit-exact; recovers B=1, fewer LSTM calls)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * refactor(model): extract batch_enc_to_row_major helper (dedup batched-decode transpose)
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * docs(decode): correct header comment to describe the g_valid cache
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * refactor(subsampling): restore v1 2-D scalar build_graph for B=1
    
    Replace the delegating Subsampling::build_graph (which forwarded to
    build_graph_batched at B=1) with the verbatim v1 2-D body from 60bd1bb.
    The single-clip/forward path now runs the lean 2-D graph again: faster and
    bit-exact with v1 (the batch test's item0, the full clip, is max|d|=0).
    
    build_graph_batched is unchanged (forward_batch / B>1 still use it).
    
    test_subsampling_batch: the short, zero-padded item1 now compares the
    batched builder against the 2-D scalar path. They match to ~1e-3 on interior
    frames; only the single trailing valid frame diverges (its downsampled
    receptive field straddles the clip boundary, where per-stage masking and the
    conv zero-edge round differently by design). Compare interior frames at a
    modest 5e-3 and skip that last boundary frame. item0 stays exact at 1e-3.
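
The comparison rule described here (check interior frames against a tolerance, skip the trailing boundary frame) could be sketched as follows. The frame-major layout and the name `interior_frames_close` are assumptions made for illustration and do not come from `test_subsampling_batch`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: `a` and `b` each hold `frames` frames of `dim` values,
// stored frame after frame (assumed layout). Returns true when every value in
// the first (frames - skip_tail) frames agrees within `tol`; the last
// `skip_tail` frames are not compared at all.
bool interior_frames_close(const std::vector<float>& a,
                           const std::vector<float>& b,
                           std::size_t dim, std::size_t frames,
                           std::size_t skip_tail, float tol) {
    assert(a.size() == dim * frames && b.size() == dim * frames);
    assert(skip_tail <= frames);
    const std::size_t checked = frames - skip_tail;
    for (std::size_t i = 0; i < checked * dim; ++i) {
        if (std::fabs(a[i] - b[i]) > tol) return false;
    }
    return true;
}
```

Calling it with `skip_tail = 1` and `tol = 5e-3f` corresponds to the check described for item1, and `skip_tail = 0` with `tol = 1e-3f` to the stricter check described for item0.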
    
    * refactor(conformer,attention): restore v1 2-D scalar builders for B=1
    
    Replace the delegating scalar builders (which forwarded to the batched
    builders at B=1) with the verbatim v1 2-D bodies from 60bd1bb:
      - ConformerLayer::build_graph (full 2-D conformer layer)
      - RelPosAttention::build_graph (2-D/3D rel-pos attention)
      - build_conv_module gains a scalar (int valid_len) overload alongside the
        batched (int B, const std::vector<int>&) one; distinguished by signature.
    
    The scalar callers (build_graph, forward_with_conv localization,
    conv_module_forward) now route to the 2-D conv module. The batched builders
    (build_graph_batched, the batched build_conv_module) are untouched; forward_batch
    / B>1 still use them.
    
    Net effect: the single-clip / forward path runs the lean 2-D graph again
    (faster) and is bit-exact with v1. test_conformer, test_conformer_batch,
    test_relpos_attention_batch, test_encoder, test_encoder_batch all pass; the
    24-layer 2-D-vs-batched accumulation stays within test_encoder_batch's existing
    5e-2 tolerance, so no tolerance change was needed there.
    
    * feat(bench): bench-decode --json + BENCHMARK.md batched-decode section
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * docs(bench): add batched-decode throughput tables (CPU + GPU)
    
    Record serial-vs-batched decode speedups for the transducer models at
    B=1,4,8,16, captured via bench-decode --json. CPU (this 20-core host,
    q5_k) reaches ~3-5x at B=16; GPU (dgx GB10, f16) reaches ~10-12x. CTC
    models are excluded (no autoregressive decode). gen_benchmark_md.py
    renders the new 'Batched decode throughput' section from
    benchmarks/results/decode_batch/{cpu,gpu}/.
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    * docs(readme): add Batching section (CLI, C-API, when to use)
    
    Document the opt-in batched-decode path: the bench-decode/bench-batch
    CLI commands, the C++ transcribe_16k_batch and C-API
    transcribe_pcm_batch[_json] entry points, the decode-batching win (GPU
    ~10-12x at B=16, CPU ~3-5x; encoder and CTC excluded), bit-exactness vs
    single-clip, and a pointer to the BENCHMARK.md tables. Notes that
    LocalAI exposes it via batch_max_size (off by default).
    
    Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    
    ---------
    
    Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
    mudler and claude authored Jun 1, 2026
    8a7c482 View commit details
  2. Apple Metal support, with a GPU fast path that does not regress CUDA (#4)

    parakeet.cpp aborted on Apple Metal (ggml's Metal backend has no CONV_2D_DW
    kernel, which the Conformer subsampling emits). Make it run on Metal and keep
    every backend fast:
    
    - backend: GPU devices use the persistent gallocr fast path. A per-graph
      ggml_backend_supports_op scan only routes to ggml_backend_sched (CPU fallback)
      when the active GPU backend genuinely lacks a kernel for some op. CUDA covers
      every op, so it stays on gallocr (verified parity with master on the GB10; the
      earlier blanket-sched approach regressed CUDA 7-23%). The CPU path is unchanged.
    - ggml: native Metal CONV_2D_DW kernel (patch 0002) and leading-side PAD support
      (patch 0003), the two ops the encoder needed. With these the whole encoder,
      down to the log-mel front end, runs on Metal.
    - bench: scripts/bench_metal_dw.sh measures steady-state RTFx via parakeet-cli
      bench (warm up once, time inference only).
    - ci: run the closed-loop end-to-end transcript assertion on pull requests, not
      just manual dispatch.
    - docs: Apple Metal section in README and BENCHMARK; AGENTS.md gains a
      performance-invariant note (keep gallocr) and the reference transcript.
    
    Metal (M4, q4_k) is about 3-5x over CPU on the larger models. CPU and CUDA are
    within noise of master.
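
The routing rule from the backend bullet above can be sketched in isolation. This is a simplified illustration, not the project's code: the real check is described as a per-graph `ggml_backend_supports_op` scan, whereas here the graph is a plain vector of hypothetical `Op` records and the support query is an injected callback, so the example stays self-contained.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for the two execution paths named in the commit.
enum class Route { PersistentAllocator, SchedulerWithCpuFallback };

// Hypothetical stand-in for a graph node; real code would walk ggml nodes.
struct Op { std::string name; };

// Stay on the fast path unless the GPU backend lacks a kernel for at least
// one op in this graph; only then route through the scheduler.
Route choose_route(const std::vector<Op>& graph,
                   const std::function<bool(const Op&)>& gpu_supports) {
    for (const Op& op : graph) {
        if (!gpu_supports(op)) return Route::SchedulerWithCpuFallback;
    }
    return Route::PersistentAllocator;
}
```

Under this rule a backend that covers every op never pays for the scheduler, which matches the stated reason CUDA stays on gallocr, while a backend missing a single kernel (for example `CONV_2D_DW` before the Metal patch) falls back for that graph only.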
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    mudler authored Jun 1, 2026
    9edf17c View commit details

Commits on Jun 2, 2026

  1. ci: build and publish the parakeet-cli container image to ghcr (#7)

    * ci: build and publish the parakeet-cli container image to ghcr
    
    Add a multi-stage Dockerfile and a docker workflow that builds the
    parakeet-cli image and pushes it to ghcr.io/<owner>/parakeet.cpp-cli.
    
    - Dockerfile: fat build stage compiles parakeet-cli plus the ggml backends,
      a slim runtime stage carries only the binary and the ggml .so files. One
      Dockerfile, CPU and CUDA variants selected via BUILD_BASE / RUNTIME_BASE /
      CMAKE_EXTRA_ARGS build args. GGML_NATIVE=OFF so the image is portable
      across x86-64 hosts. The ggml submodule is re-inited as a throwaway git
      repo in the build stage so the CMake-driven patch step (git apply) works
      regardless of how the submodule arrived in the context.
    - docker.yml: matrix over cpu/cuda, builds on every push/PR (build-only gate
      on PRs), pushes to ghcr on master + tags + dispatch. Tags via
      metadata-action: latest / sha / vX.Y.Z, with a -cuda suffix for the CUDA
      variant. Uses GITHUB_TOKEN, gha build cache.
    - .dockerignore keeps the context small (excludes .git, build dirs, models,
      benchmark media) while keeping the ggml source.
    - README: Docker section with CPU and CUDA run examples.
    
    Verified the CPU image end to end: builds at 127 MB, parakeet-cli runs, and
    transcribing tests/fixtures/speech.wav with a mounted q5_k 110m model yields
    the exact NeMo reference transcript.
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    
    * ci(docker): build multi-arch (amd64 + arm64) images for both variants
    
    Publish each variant (cpu, cuda) as a multi-arch manifest covering
    linux/amd64 and linux/arm64. The arm64 CUDA image runs natively on Grace /
    GB10-class hosts.
    
    Every arch is built natively, no QEMU: amd64 on ubuntu-24.04, arm64 on the
    ubuntu-24.04-arm hosted runner (free for public repos). Emulated nvcc builds
    would be far too slow. The per-arch images are pushed by digest and a merge
    job stitches them into one manifest per variant, tagged via metadata-action.
    
    Verified the arm64 CPU image builds and runs (aarch64) under emulation
    locally, and confirmed the ubuntu and nvidia/cuda base images all ship arm64.
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    
    * ci(docker): build CUDA images on CUDA 13 for Blackwell / GB10 (DGX Spark)
    
    CUDA 12.6 tops out at sm_90, so the CUDA images would not run on GB10 /
    Grace-Blackwell. The vendored ggml's CUDA CMake adds 120a-real at CUDA >= 12.8
    and 121a-real (GB10 / DGX Spark / Thor) at CUDA >= 12.9, all under our
    GGML_NATIVE=OFF default. Bumping both arches to nvidia/cuda:13.0.1 therefore
    compiles Turing through Blackwell with no manual arch list: amd64 picks up
    Hopper / Ada / RTX 50, arm64 picks up GH200 (sm_90 PTX) and GB10 (sm_121).
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    
    * ci(docker): fix CUDA link (GGML_CUDA_NO_VMM), trim arm64 archs, CPU-only PR gate
    
    The CUDA builds failed at link: libggml-cuda.so had undefined references to
    the CUDA driver API (cuMemCreate, cuMemMap, cuDeviceGet, ...). Those come from
    ggml's VMM memory pool, which links libcuda -- a lib a GPU-less build
    container does not have. Build with -DGGML_CUDA_NO_VMM=ON: every cuMem* call
    is under #if defined(GGML_USE_VMM), which this flag disables, so the symbols
    and the libcuda link dependency both go away. Verified locally: the amd64
    CUDA image now links clean, ships libggml-cuda.so, and resolves libcudart /
    libcublas from the CUDA 13 runtime base.
    
    Also cut build time, which had blown out to 43 min on the arm64 CUDA job:
    - arm64 CUDA targets only Grace GPUs now (CUDA_ARCHS=90;121-real -> GH200 +
      GB10/Spark) instead of ggml's full 7-arch list. Added a dedicated quoted
      CUDA_ARCHS build-arg so the ';' list separator survives the shell (the
      unquoted CMAKE_EXTRA_ARGS would split it as a command separator).
    - pull_request now builds the CPU variant only (fast Dockerfile gate) via a
      dynamic matrix from a setup job. CUDA builds only on push / tag / dispatch,
      which also publish. Use workflow_dispatch to exercise CUDA before merging.
    
    Assisted-by: Claude:claude-opus-4-8 [Claude Code]
    mudler authored Jun 2, 2026
    b11fe5b View commit details