lloyal.node API Reference - v2.0.5


    lloyal.node


    Native backend for the lloyal inference platform.

    Prebuilt llama.cpp binaries for 13 platform/GPU combinations, exposing a SessionContext that powers the @lloyal-labs/sdk inference primitives (Branch, BranchStore, Session, Rerank) and @lloyal-labs/lloyal-agents multi-agent framework. Built on liblloyal, a header-only C++20 inference kernel for llama.cpp.

    All SDK and agent exports are re-exported from this package for convenience — import { Branch, runAgents } from "@lloyal-labs/lloyal.node" works out of the box.

    npm install @lloyal-labs/lloyal.node
    

    Prebuilt binaries for 13 platform/GPU combinations. GPU selection at runtime, not install time.

    Platform   Arch    Acceleration
    macOS      arm64   Metal
    macOS      x64     CPU
    Linux      x64     CPU / CUDA / Vulkan
    Linux      arm64   CPU / CUDA / Vulkan
    Windows    x64     CPU / CUDA / Vulkan
    Windows    arm64   CPU / Vulkan
    import { createContext } from "@lloyal-labs/lloyal.node";
    import { Branch, BranchStore } from "@lloyal-labs/sdk";

    const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });
    const store = new BranchStore(ctx);

    const root = Branch.create(ctx, 0, { temperature: 0.8 });
    await root.prefill(await ctx.tokenize("Explain quantum entanglement"));

    // Fork and generate — all branches in lockstep, 1 GPU call per step
    const branches = await Promise.all([root.fork(), root.fork(), root.fork()]);
    for (;;) {
      const live = branches.filter((b) => !b.disposed);
      if (!live.length) break;
      const produced = live.map((b) => ({ b, ...b.produce() }));
      for (const p of produced.filter((p) => p.isStop)) await p.b.prune();
      const items = produced
        .filter((p) => !p.isStop)
        .map((p) => {
          p.b.accept(p.token);
          return [p.b, p.token];
        });
      await store.commit(items);
    }

    Or for single-branch generation, Branch is an async iterable:

    for await (const { token, text } of branch) {
      process.stdout.write(text);
    }
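The iteration contract can be sketched with a plain stub. The StubBranch class below is illustrative, not the real Branch implementation; it only mirrors the shape the loop above consumes, yielding one { token, text } pair per step:

```typescript
// Stub with the same async-iterable shape as a Branch: each iteration
// yields one { token, text } pair until generation ends.
class StubBranch implements AsyncIterable<{ token: number; text: string }> {
  constructor(private pieces: [number, string][]) {}

  async *[Symbol.asyncIterator]() {
    for (const [token, text] of this.pieces) {
      yield { token, text };
    }
  }
}

// Same consumption pattern as above, collecting instead of streaming
async function collect(branch: AsyncIterable<{ token: number; text: string }>) {
  let out = "";
  for await (const { text } of branch) out += text;
  return out;
}

const stub = new StubBranch([[1, "Hello"], [2, ", "], [3, "world"]]);
```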

    See @lloyal-labs/sdk for the full Branch API, continuous tree batching, KV tenancy, and topology documentation.

    createContext returns a SessionContext — the native interface to llama.cpp. You can use it directly without the SDK's Branch/BranchStore layer:

    import { createContext } from "@lloyal-labs/lloyal.node";

    const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });

    // Chat templates — model-agnostic formatting + tool calling
    const { prompt, grammar, format } = await ctx.formatChat(messages, {
      addGenerationPrompt: true,
      tools: [{ type: "function", function: { name: "search", parameters: schema } }],
    });
    const { content, toolCalls } = await ctx.parseChatOutput(output, format);

    // Branch primitives — what the SDK's Branch class wraps
    const handle = ctx._branchCreate(0, samplerParams);
    await ctx._branchPrefill(handle, tokens);
    const token = ctx._branchSample(handle);
    const text = ctx.tokenToText(token);
    const isStop = ctx.isStopToken(token);
    ctx._branchAccept(handle, token);
    const logits = ctx._branchGetLogits(handle); // Float32Array(vocabSize)
    const entropy = ctx._branchModelEntropy(handle);
    const child = ctx._branchFork(handle);

    // Store primitives — what the SDK's BranchStore wraps
    await ctx._storeCommit([handle1, handle2], [tok1, tok2]); // N branches, 1 GPU call
    await ctx._storePrefill([handle], [tokens]);
    await ctx._storeRetainOnly(winner);
    const available = ctx._storeAvailable();

    // KV cache — snapshot, copy, persist
    await ctx.kvSeqCopy(0, 1); // share prefix across sequences
    await ctx.kvCacheSave(); // snapshot for rollback
    await ctx.kvCacheLoad(); // restore checkpoint
    await ctx.kvCacheWriteFile("cache.bin"); // persist to disk

    // Embeddings
    const embeddings = await ctx.encode("query text");
    const dim = ctx.getEmbeddingDimension();

    // Grammar + tokenizer
    const schemaGrammar = await ctx.jsonSchemaToGrammar(schema);
    const helloTokens = await ctx.tokenize("Hello world");
    const sep = await ctx.getTurnSeparator();
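The parallel-array shape of the batched store commit can be illustrated with an in-memory stub. The StubStore class below is hypothetical and only mimics the call shape: one call advances every branch by one token, which is what makes each generation step a single GPU dispatch in the native layer.

```typescript
// Hypothetical in-memory stand-in for the store's batched commit:
// handles[i] accepts tokens[i], all in one call.
class StubStore {
  private seqs = new Map<number, number[]>();
  commitCalls = 0;

  track(handle: number): void {
    this.seqs.set(handle, []);
  }

  commit(handles: number[], tokens: number[]): void {
    this.commitCalls++; // one call per generation step, regardless of branch count
    handles.forEach((h, i) => this.seqs.get(h)!.push(tokens[i]));
  }

  tokensOf(handle: number): number[] {
    return this.seqs.get(handle) ?? [];
  }
}

const stubStore = new StubStore();
[1, 2, 3].forEach((h) => stubStore.track(h));

// Three generation steps for three branches: three commits total, not nine
for (let step = 0; step < 3; step++) {
  stubStore.commit([1, 2, 3], [10 + step, 20 + step, 30 + step]);
}
```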

    Native-only (not in SDK):

    • createContext(options) — load a GGUF model, return a SessionContext
    • loadBinary(options?) — explicit GPU variant selection with automatic fallback
    • Prebuilt binaries for 13 platform/GPU combinations

    Re-exported from @lloyal-labs/sdk:

    • Branch, BranchStore, Session, Rerank
    • Per-token metrics: modelEntropy(), modelSurprisal(), samplingPerplexity
    • Chat formatting: formatChat(), parseChatOutput()
    • Grammar: jsonSchemaToGrammar(), setGrammar()

    Re-exported from @lloyal-labs/lloyal-agents:

    • runAgents, useAgentPool, generate, diverge, createToolkit
    • Structured concurrency DAG via Effection generators
    • In-loop orchestration: agents as branches of a single running process
    GPU variant selection:

    import { loadBinary, createContext } from "@lloyal-labs/lloyal.node";

    // Automatic — uses Metal on macOS, CPU elsewhere
    const ctx = await createContext({ modelPath: "./model.gguf" });

    // Explicit CUDA
    const binding = loadBinary({ gpuVariant: "cuda" });
    const cudaCtx = await binding.createContext({ modelPath: "./model.gguf" });
    // Falls back to CPU with a warning if the CUDA runtime is not available
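As a hedged sketch of runtime variant selection: the resolveVariant helper and the LLOYAL_GPU environment variable below are hypothetical, not part of the package API; only loadBinary({ gpuVariant }) comes from the package.

```typescript
type GpuVariant = "cpu" | "cuda" | "vulkan" | "metal";

// Hypothetical helper: honor a requested variant only when this platform
// ships a prebuilt binary for it; every platform has a CPU build.
function resolveVariant(
  requested: string | undefined,
  available: GpuVariant[],
): GpuVariant {
  if (requested && (available as string[]).includes(requested)) {
    return requested as GpuVariant;
  }
  return "cpu";
}

// e.g. loadBinary({ gpuVariant: resolveVariant(process.env.LLOYAL_GPU, ["cuda", "vulkan", "cpu"]) })
```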

    Each example lives under examples/ and runs with npx tsx:

    Example    Pattern
    entropy/   modelEntropy() mid-generation as control signal
    chat/      Interactive streaming chat
    embed/     Text embeddings extraction

    npx tsx examples/best-of-n/best-of-n.ts
    npx tsx examples/chat/chat.ts ./model.gguf

    Integration tests run real inference across architectures:

    Architecture   Test Model      Template
    Llama          Llama 3.2 1B    llama3
    Phi            Phi 3.5 Mini    phi3
    Qwen           Qwen 3 1.7B     chatml
    Gemma          Gemma 3 1B      gemma
    SmolLM         SmolLM2 1.7B    chatml
    Ministral      Ministral 3B    mistral

    See distribution.md for details.

    Package                      Description
    @lloyal-labs/sdk             Backend-agnostic inference primitives (Branch, BranchStore, Session, Rerank)
    @lloyal-labs/lloyal-agents   Multi-agent framework — in-loop orchestration via structured concurrency
    liblloyal                    Header-only C++20 inference kernel for llama.cpp
    lloyal.node                  This package — native backend + prebuilt binaries
    nitro-llama                  React Native backend via Nitro Modules
    tsampler                     Reference sampler implementation

    See CONTRIBUTING.md for development setup and release process.

    Apache 2.0 — See LICENSE for details.