Comparing changes

Short side-by-side "processing race" clips for the README and benchmark doc: parakeet.cpp vs NeMo (PyTorch) on GPU (byte-for-byte identical output, parakeet.cpp finishes first), and vs whisper.cpp turbo on GPU and CPU (same accuracy, about 12x and 27x faster). GIF inline plus the source MP4s under benchmarks/media/. The demo block is emitted by gen_benchmark_md.py so it survives a benchmark regen. Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Same processing-race clip on CPU against NeMo's own PyTorch runtime (tdt-0.6b-v2, 23s LibriSpeech clip, 8 threads): parakeet.cpp 426 ms vs NeMo 661 ms, about 1.5x faster, transcripts byte-for-byte identical. NeMo measured in the nvcr.io/nvidia/nemo container, CPU mode. Linked from the README demo section and added to the benchmark doc's demo table. Assisted-by: Claude:claude-opus-4-8 [Claude Code]
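The "about 1.5x" figure follows directly from the two timings quoted above. A minimal sketch of that arithmetic, with hypothetical helper names (`speedup`, `rtfx`) that are not taken from the repository, and the clip length taken as the rounded 23 s from the message:

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate time must be positive")
    return baseline_ms / candidate_ms


def rtfx(audio_seconds: float, elapsed_ms: float) -> float:
    """Real-time factor: seconds of audio processed per second of wall time."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed time must be positive")
    return audio_seconds / (elapsed_ms / 1000.0)


# Timings quoted in the commit message (CPU, 8 threads).
nemo_ms = 661.0
parakeet_ms = 426.0

ratio = speedup(nemo_ms, parakeet_ms)
print(f"parakeet.cpp is {ratio:.2f}x faster than NeMo")  # 1.55x

# Assumes the clip is exactly 23 s; the message only gives the rounded length.
print(f"parakeet.cpp RTFx: {rtfx(23.0, parakeet_ms):.1f}")
```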

Show the parakeet.cpp vs NeMo (PyTorch) CPU race as an inline GIF like the GPU one, instead of a table link. The whisper.cpp matchups stay as links. Assisted-by: Claude:claude-opus-4-8 [Claude Code]

The recorder showed about 1s of empty terminal before the TUI drew its first frame; the clips now start on the first rendered frame. Regenerated the GIFs and MP4s under benchmarks/media/. Assisted-by: Claude:claude-opus-4-8 [Claude Code]

* feat(encoder): add MelBatch + forward_batch (B=1 per-item loop)
* feat(model): transcribe_pcm_batch + extract decode_enc_out helper
* fix(model): guard invalid sample_rate in transcribe_pcm_batch (parity with single-clip)
* feat(capi): parakeet_capi_transcribe_pcm_batch
* fix(capi): null out[] on entry so error paths leave a clean, uniform contract
* test: batch API B=1 equivalence smoke
* feat(subsampling): batched build_graph_batched; build_graph adapts B=1
* fix(subsampling): release-surviving guard for batched-causal; dedupe valid_out_len
* test(subsampling): batched-vs-standalone per-item equivalence
* fix(subsampling): per-stage per-item trailing-pad input masking for batched conv (matches NeMo MaskedConvSequential)
* feat(conformer): batch axis through build_conv_module (B=1 unchanged)
* feat(attention): batched build_graph_batched with 4D rel-shift
* test(attention): batched rel-shift equivalence + padding invariance
* feat(conformer): build_graph_batched ([D,T,B]); build_graph adapts B=1
* test(conformer): batched-vs-standalone per-item equivalence + padding invariance
* fix(conformer): per-item depthwise conv for B>1 (ggml 1D im2col requires ne[3]==1); B=1 byte-identical
* feat(encoder): fused single-graph batched forward_batch
* test(encoder): fused batched equivalence + padding invariance
* bench: add bench-batch CLI subcommand for batched-encoder throughput
* docs(encoder): correct forward_batch comment (fused graph, not internal loop)
* feat(model): transcribe_pcm_batch_with_timestamps + extract decode_enc_out_with_timestamps

  Adds batched timestamped transcription (N 16 kHz clips -> N Transcriptions) built on the fused batched encoder, plus the public transcribe_pcm_batch_with_timestamps entry point. The per-item decode tail of transcribe_16k_with_timestamps is extracted verbatim into a file-scope decode_enc_out_with_timestamps helper (behavior-preserving) and reused by both the single-clip and batched paths.

  Also fixes a pre-existing batched-encoder bug surfaced by the new equivalence test: forward_batch emitted each enc_outs[b] at the padded width Tp, but the decoders index enc_out[c*Tout + t] with Tout = valid_Tout[b]. For a padded (shorter) item that stride mismatch misaligned every row after the first, corrupting the decode (e.g. an 18-word clip collapsed to 3 garbage words). This affected the text-only transcribe_pcm_batch path too. forward_batch now compacts each enc_outs[b] to its own valid_Tout[b] columns so the row stride matches and no pad-derived frames reach the decoder; test_encoder_batch's slice is updated to the compacted per-item width.
* refactor(model): share build_mel_batch across batch paths; restore CTC span rationale
* feat(capi): parakeet_capi_transcribe_pcm_batch_json (batched timestamps, ABI bump)
* docs(capi): state sum(n_samples) precondition for batch_json as caller-must-uphold
* refactor(decode): share argmax/max_prob_conf in decode_common.hpp
* feat(prediction): step_batch (batched LSTM, [H,N])
* docs(prediction): explain the 4H gate-slice stride in step_batch
* feat(joint): step_logits_batch (batched joint, [V_plus,N])
* feat(decode): transducer_greedy_batch (batched RNNT+TDT greedy, bit-exact parity)
* refactor(decode): extract commit_state lambda in transducer_greedy_batch; document max_symbols assumption
* feat(model): batched transducer decode in transcribe_pcm_batch[_with_timestamps]
* bench: add bench-decode (batched vs serial transducer decode timing)
* perf(decode): cache prediction-net g across rounds in transducer_greedy_batch (bit-exact; recovers B=1, fewer LSTM calls)
* refactor(model): extract batch_enc_to_row_major helper (dedup batched-decode transpose)
* docs(decode): correct header comment to describe the g_valid cache
* refactor(subsampling): restore v1 2-D scalar build_graph for B=1

  Replace the delegating Subsampling::build_graph (which forwarded to build_graph_batched at B=1) with the verbatim v1 2-D body from 60bd1bb. The single-clip/forward path now runs the lean 2-D graph again: faster and bit-exact with v1 (the batch test's item0, the full clip, is max|d|=0). build_graph_batched is unchanged (forward_batch / B>1 still use it).

  test_subsampling_batch: the short, zero-padded item1 now compares the batched builder against the 2-D scalar path. They match to ~1e-3 on interior frames; only the single trailing valid frame diverges (its downsampled receptive field straddles the clip boundary, where per-stage masking and the conv zero-edge round differently by design). Compare interior frames at a modest 5e-3 and skip that last boundary frame. item0 stays exact at 1e-3.
* refactor(conformer,attention): restore v1 2-D scalar builders for B=1

  Replace the delegating scalar builders (which forwarded to the batched builders at B=1) with the verbatim v1 2-D bodies from 60bd1bb:
  - ConformerLayer::build_graph (full 2-D conformer layer)
  - RelPosAttention::build_graph (2-D/3D rel-pos attention)
  - build_conv_module gains a scalar (int valid_len) overload alongside the batched (int B, const std::vector<int>&) one; distinguished by signature. The scalar callers (build_graph, forward_with_conv localization, conv_module_forward) now route to the 2-D conv module.

  The batched builders (build_graph_batched, the batched build_conv_module) are untouched; forward_batch / B>1 still use them.

  Net effect: the single-clip / forward path runs the lean 2-D graph again (faster) and is bit-exact with v1. test_conformer, test_conformer_batch, test_relpos_attention_batch, test_encoder, test_encoder_batch all pass; the 24-layer 2-D-vs-batched accumulation stays within test_encoder_batch's existing 5e-2 tolerance, so no tolerance change was needed there.
* feat(bench): bench-decode --json + BENCHMARK.md batched-decode section
* docs(bench): add batched-decode throughput tables (CPU + GPU)

  Record serial-vs-batched decode speedups for the transducer models at B=1,4,8,16, captured via bench-decode --json. CPU (this 20-core host, q5_k) reaches ~3-5x at B=16; GPU (dgx GB10, f16) reaches ~10-12x. CTC models are excluded (no autoregressive decode). gen_benchmark_md.py renders the new 'Batched decode throughput' section from benchmarks/results/decode_batch/{cpu,gpu}/.
* docs(readme): add Batching section (CLI, C-API, when to use)

  Document the opt-in batched-decode path: the bench-decode/bench-batch CLI commands, the C++ transcribe_16k_batch and C-API transcribe_pcm_batch[_json] entry points, the decode-batching win (GPU ~10-12x at B=16, CPU ~3-5x; encoder and CTC excluded), bit-exactness vs single-clip, and a pointer to the BENCHMARK.md tables. Notes that LocalAI exposes it via batch_max_size (off by default).

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
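The stride mismatch described in the timestamps commit (a buffer written at padded width Tp but read as enc_out[c*Tout + t]) is easy to reproduce on a toy buffer. The following is an illustrative Python sketch only; the real code is C++, and `compact_rows` and `read` are hypothetical names, not functions from the repository:

```python
from typing import List


def compact_rows(padded: List[float], n_rows: int,
                 padded_width: int, valid_width: int) -> List[float]:
    """Keep the first valid_width columns of each row in a tightly packed buffer."""
    if not 0 <= valid_width <= padded_width:
        raise ValueError("valid_width must be within [0, padded_width]")
    if len(padded) != n_rows * padded_width:
        raise ValueError("buffer size does not match n_rows * padded_width")
    out: List[float] = []
    for c in range(n_rows):
        start = c * padded_width          # row c begins here in the padded buffer
        out.extend(padded[start:start + valid_width])
    return out


def read(buf: List[float], stride: int, c: int, t: int) -> float:
    """Index the way the commit message says the decoders do: buf[c*stride + t]."""
    return buf[c * stride + t]


# Two rows padded to Tp=4 frames, of which only Tout=3 are valid (-1 marks padding).
D, Tp, Tout = 2, 4, 3
padded = [10, 11, 12, -1,
          20, 21, 22, -1]

# Bug: reading the padded buffer with the valid stride misaligns row 1.
wrong = [read(padded, Tout, 1, t) for t in range(Tout)]

# Fix: compact first, so the row stride equals Tout and no padding is read.
compact = compact_rows(padded, D, Tp, Tout)
right = [read(compact, Tout, 1, t) for t in range(Tout)]

print(wrong)  # [-1, 20, 21]
print(right)  # [20, 21, 22]
```

Row 0 reads correctly in both cases, which matches the message's remark that only rows after the first were misaligned.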

parakeet.cpp aborted on Apple Metal (ggml's Metal backend has no CONV_2D_DW kernel, which the Conformer subsampling emits). Make it run on Metal and keep every backend fast:

- backend: GPU devices use the persistent gallocr fast path. A per-graph ggml_backend_supports_op scan only routes to ggml_backend_sched (CPU fallback) when the active GPU backend genuinely lacks a kernel for some op. CUDA covers every op, so it stays on gallocr (verified parity with master on the GB10; the earlier blanket-sched approach regressed CUDA 7-23%). The CPU path is unchanged.
- ggml: native Metal CONV_2D_DW kernel (patch 0002) and leading-side PAD support (patch 0003), the two ops the encoder needed. With these the whole encoder, down to the log-mel front end, runs on Metal.
- bench: scripts/bench_metal_dw.sh measures steady-state RTFx via parakeet-cli bench (warm up once, time inference only).
- ci: run the closed-loop end-to-end transcript assertion on pull requests, not just manual dispatch.
- docs: Apple Metal section in README and BENCHMARK; AGENTS.md gains a performance-invariant note (keep gallocr) and the reference transcript.

Metal (M4, q4_k) is about 3-5x over CPU on the larger models. CPU and CUDA are within noise of master.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
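The backend routing rule in this message (stay on gallocr unless some op in the graph has no kernel on the active GPU backend) can be sketched as a simple scan. This is a Python illustration of the decision only, not the C++ implementation; `pick_executor`, the op names in the lists, and the per-backend support sets are assumptions made up for the example:

```python
from typing import Callable, Iterable


def pick_executor(graph_ops: Iterable[str],
                  supports_op: Callable[[str], bool]) -> str:
    """Return 'gallocr' if every op is supported, else 'sched' (CPU fallback)."""
    for op in graph_ops:
        if not supports_op(op):
            return "sched"    # at least one op needs the scheduler's fallback
    return "gallocr"          # fast path: the backend covers the whole graph


# Hypothetical op list for a subsampling graph.
graph = ["PAD", "CONV_2D_DW", "MUL_MAT", "SOFT_MAX"]

# Hypothetical support sets; the real answer comes from ggml_backend_supports_op.
metal_unpatched = {"PAD", "MUL_MAT", "SOFT_MAX"}
metal_patched = metal_unpatched | {"CONV_2D_DW"}
cuda = {"PAD", "CONV_2D_DW", "MUL_MAT", "SOFT_MAX"}

print(pick_executor(graph, metal_unpatched.__contains__))  # sched
print(pick_executor(graph, metal_patched.__contains__))    # gallocr
print(pick_executor(graph, cuda.__contains__))             # gallocr
```

The point of scanning per graph rather than routing every GPU run through the scheduler is the one the message gives: the blanket approach cost CUDA 7-23%, while the scan leaves fully covered backends on the fast path.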

* ci: build and publish the parakeet-cli container image to ghcr

  Add a multi-stage Dockerfile and a docker workflow that builds the parakeet-cli image and pushes it to ghcr.io/<owner>/parakeet.cpp-cli.

  - Dockerfile: fat build stage compiles parakeet-cli plus the ggml backends, a slim runtime stage carries only the binary and the ggml .so files. One Dockerfile, CPU and CUDA variants selected via BUILD_BASE / RUNTIME_BASE / CMAKE_EXTRA_ARGS build args. GGML_NATIVE=OFF so the image is portable across x86-64 hosts. The ggml submodule is re-inited as a throwaway git repo in the build stage so the CMake-driven patch step (git apply) works regardless of how the submodule arrived in the context.
  - docker.yml: matrix over cpu/cuda, builds on every push/PR (build-only gate on PRs), pushes to ghcr on master + tags + dispatch. Tags via metadata-action: latest / sha / vX.Y.Z, with a -cuda suffix for the CUDA variant. Uses GITHUB_TOKEN, gha build cache.
  - .dockerignore keeps the context small (excludes .git, build dirs, models, benchmark media) while keeping the ggml source.
  - README: Docker section with CPU and CUDA run examples.

  Verified the CPU image end to end: builds at 127 MB, parakeet-cli runs, and transcribing tests/fixtures/speech.wav with a mounted q5_k 110m model yields the exact NeMo reference transcript.

  Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* ci(docker): build multi-arch (amd64 + arm64) images for both variants

  Publish each variant (cpu, cuda) as a multi-arch manifest covering linux/amd64 and linux/arm64. The arm64 CUDA image runs natively on Grace / GB10-class hosts.

  Every arch is built natively, no QEMU: amd64 on ubuntu-24.04, arm64 on the ubuntu-24.04-arm hosted runner (free for public repos). Emulated nvcc builds would be far too slow. The per-arch images are pushed by digest and a merge job stitches them into one manifest per variant, tagged via metadata-action.

  Verified the arm64 CPU image builds and runs (aarch64) under emulation locally, and confirmed the ubuntu and nvidia/cuda base images all ship arm64.

  Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* ci(docker): build CUDA images on CUDA 13 for Blackwell / GB10 (DGX Spark)

  CUDA 12.6 tops out at sm_90, so the CUDA images would not run on GB10 / Grace-Blackwell. The vendored ggml's CUDA CMake adds 120a-real at CUDA >= 12.8 and 121a-real (GB10 / DGX Spark / Thor) at CUDA >= 12.9, all under our GGML_NATIVE=OFF default. Bumping both arches to nvidia/cuda:13.0.1 therefore compiles Turing through Blackwell with no manual arch list: amd64 picks up Hopper / Ada / RTX 50, arm64 picks up GH200 (sm_90 PTX) and GB10 (sm_121).

  Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* ci(docker): fix CUDA link (GGML_CUDA_NO_VMM), trim arm64 archs, CPU-only PR gate

  The CUDA builds failed at link: libggml-cuda.so had undefined references to the CUDA driver API (cuMemCreate, cuMemMap, cuDeviceGet, ...). Those come from ggml's VMM memory pool, which links libcuda -- a lib a GPU-less build container does not have. Build with -DGGML_CUDA_NO_VMM=ON: every cuMem* call is under #if defined(GGML_USE_VMM), which this flag disables, so the symbols and the libcuda link dependency both go away. Verified locally: the amd64 CUDA image now links clean, ships libggml-cuda.so, and resolves libcudart / libcublas from the CUDA 13 runtime base.

  Also cut build time, which had blown out to 43 min on the arm64 CUDA job:
  - arm64 CUDA targets only Grace GPUs now (CUDA_ARCHS=90;121-real -> GH200 + GB10/Spark) instead of ggml's full 7-arch list. Added a dedicated quoted CUDA_ARCHS build-arg so the ';' list separator survives the shell (the unquoted CMAKE_EXTRA_ARGS would split it as a command separator).
  - pull_request now builds the CPU variant only (fast Dockerfile gate) via a dynamic matrix from a setup job. CUDA builds only on push / tag / dispatch, which also publish. Use workflow_dispatch to exercise CUDA before merging.

  Assisted-by: Claude:claude-opus-4-8 [Claude Code]
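The dynamic matrix from the last commit (CPU only on pull requests, CPU and CUDA on push, tag and dispatch) comes down to a small decision a setup job could emit as JSON. A hedged Python sketch of that rule; the actual workflow is YAML, and `build_variants`, `matrix_json`, and `image_tag` are hypothetical names, with the tag format assumed from the "-cuda suffix" wording in the first commit:

```python
import json
from typing import List


def build_variants(event_name: str) -> List[str]:
    """Pull requests build the CPU variant only; other events build both."""
    if event_name == "pull_request":
        return ["cpu"]
    return ["cpu", "cuda"]


def matrix_json(event_name: str) -> str:
    """Serialize the variant list in the shape a job matrix could consume."""
    return json.dumps({"variant": build_variants(event_name)})


def image_tag(base_tag: str, variant: str) -> str:
    """Append '-cuda' for the CUDA variant; the CPU variant keeps the bare tag."""
    return f"{base_tag}-cuda" if variant == "cuda" else base_tag


print(matrix_json("pull_request"))        # {"variant": ["cpu"]}
print(matrix_json("workflow_dispatch"))   # {"variant": ["cpu", "cuda"]}
print(image_tag("latest", "cuda"))        # latest-cuda
```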


Commits on May 30, 2026

Commits on May 31, 2026

Commits on Jun 1, 2026

Commits on Jun 2, 2026
