Transducing Language Models

Official code for the paper Transducing Language Models (ICLR 2026).

A language model defines a distribution over strings, but downstream tasks often need a different output format — words instead of byte-pair tokens, characters instead of subwords, amino acids instead of DNA. This library takes a language model composed with a finite-state transducer (FST) and gives it the standard autoregressive interface (next-symbol distributions and prefix probabilities) over the transformed output, so it drops into any system built for ordinary language models.

The idea

A source language model $p_{\mathcal{X}}$ defines a distribution over strings in $\mathcal{X}^*$, and $f \colon \mathcal{X}^* \to \mathcal{Y}^*$ is a deterministic string-to-string map (lowercasing, token→byte, DNA→amino-acid, …) encoded as an FST. Applying $f$ to samples from $p_{\mathcal{X}}$ induces a new transduced language model $p_{\mathcal{Y}}$ over the output strings. Its mass on a target string $\boldsymbol{y}$ is the total source mass of every string that maps to it:

$$p_{\mathcal{Y}}(\boldsymbol{y}) \;=\; \Pr_{X \sim p_{\mathcal{X}}}\!\big[\,\boldsymbol{y} = f(X)\,\big] \;=\; \sum_{\boldsymbol{x}\,\in\, f^{-1}(\boldsymbol{y})} p_{\mathcal{X}}(\boldsymbol{x})$$
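
As a minimal sketch of the sum above, assume a toy source distribution with finite support and take $f$ to be lowercasing. The strings, probabilities, and the names `source_pmf` and `pushforward` are made up for illustration and are not part of this library:

```python
from collections import defaultdict

# Hypothetical toy source distribution p_X with finite support (sums to 1).
source_pmf = {"AB": 0.1, "Ab": 0.2, "aB": 0.3, "ab": 0.15, "ba": 0.25}

def f(x: str) -> str:
    """The deterministic string-to-string map; here, lowercasing."""
    return x.lower()

def pushforward(pmf, f):
    """p_Y(y) = sum of p_X(x) over every x in the preimage f^{-1}(y)."""
    out = defaultdict(float)
    for x, p in pmf.items():
        out[f(x)] += p
    return dict(out)

p_Y = pushforward(source_pmf, f)
print(p_Y)  # all mass lands on the two lowercase strings "ab" and "ba"
```

With finite support the preimage can be enumerated directly. A real language model has infinite support, which is why the decomposition described next is needed.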

Sampling from $p_{\mathcal{Y}}$ is trivial (sample $\boldsymbol{x}$, apply $f$), but scoring and conditioning require summing over that preimage. The key quantity for the autoregressive interface is the prefix probability $\overrightarrow{p_{\mathcal{Y}}}$, which sums the source model over the precover $\mathcal{P}(\boldsymbol{y})$ — the source strings whose image starts with $\boldsymbol{y}$:

$$\overrightarrow{p_{\mathcal{Y}}}(\boldsymbol{y}) \;=\; \Pr_{X \sim p_{\mathcal{X}}}\!\big[\,\boldsymbol{y} \preceq f(X)\,\big] \;=\; \sum_{\boldsymbol{x}\,\in\, \mathcal{P}(\boldsymbol{y})} p_{\mathcal{X}}(\boldsymbol{x}), \qquad \mathcal{P}(\boldsymbol{y}) = \{\, \boldsymbol{x} : \boldsymbol{y} \preceq f(\boldsymbol{x}) \,\}$$

That sum is generally infinite. The library computes it in finite time by decomposing the precover $\mathcal{P}(\boldsymbol{y})$ into a quotient $\mathcal{Q}$ (cylinders of "universal" continuations, summable in closed form) and a remainder $\mathcal{R}$ (Algorithm 1 in the paper).

A worked example: lowercasing

Take the transducer $f$ that lowercases every character — a single state that is both initial and accepting, with one self-loop per input symbol:

  ──▶ (( q0 ))  ⟲     for every input character c, an arc  c : lowercase(c)
                      e.g.  A:a   B:b   …   a:a   b:b   …

For the target string $\boldsymbol{y} = \texttt{ab}$, the precover is every source string that lowercases to something starting with $\texttt{ab}$. Written with the paper's basis (cylinder) notation $\langle\,\cdot\,\rangle$:

$$\mathcal{P}(\texttt{ab}) \;=\; \big\langle\, \{\texttt{AB},\ \texttt{Ab},\ \texttt{aB},\ \texttt{ab}\} \,\big\rangle$$

So the transduced prefix probability is a finite sum of four source prefix probabilities — these four strings form the quotient $\mathcal{Q}$ of the decomposition (the remainder $\mathcal{R}$ is empty here):

$$\overrightarrow{p_{\mathcal{Y}}}(\texttt{ab}) \;=\; \overrightarrow{p_{\mathcal{X}}}(\texttt{AB}) + \overrightarrow{p_{\mathcal{X}}}(\texttt{Ab}) + \overrightarrow{p_{\mathcal{X}}}(\texttt{aB}) + \overrightarrow{p_{\mathcal{X}}}(\texttt{ab})$$

Computing that decomposition — and the next-symbol distribution it yields — is exactly what TransducedLM does.
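
The four-term sum above can be checked by hand on a toy source model. The sketch below assumes a memoryless character-level source over {a, b, A, B} with an end-of-string symbol, so a source prefix probability is just a product of symbol probabilities. All names and numbers are hypothetical; this is a brute-force illustration, not the library's algorithm:

```python
from itertools import product
from math import prod

# Hypothetical memoryless source model over {a, b, A, B} plus end-of-string.
CHAR_P = {"a": 0.10, "b": 0.10, "A": 0.15, "B": 0.05}
EOS_P = 0.60  # together with CHAR_P, the probabilities sum to 1

def source_prefix_prob(x: str) -> float:
    """Prefix probability of x; for a memoryless model, a product of symbol probabilities."""
    return prod(CHAR_P[c] for c in x)

def lowercase_basis(y: str) -> list[str]:
    """All case variants of y: the shortest source strings whose image starts with y."""
    return ["".join(cs) for cs in product(*[(c.lower(), c.upper()) for c in y])]

def transduced_prefix_prob(y: str) -> float:
    """Finite sum of source prefix probabilities over the basis."""
    return sum(source_prefix_prob(x) for x in lowercase_basis(y))

def brute_force_prefix_prob(y: str, max_len: int) -> float:
    """Truncated sum of complete-string probabilities over the precover of y."""
    total = 0.0
    for n in range(max_len + 1):
        for cs in product(CHAR_P, repeat=n):
            x = "".join(cs)
            if x.lower().startswith(y):
                total += source_prefix_prob(x) * EOS_P
    return total

exact = transduced_prefix_prob("ab")
approx = brute_force_prefix_prob("ab", max_len=8)
print(lowercase_basis("ab"), exact, approx)
```

The truncated enumeration approaches the four-term sum from below as `max_len` grows, since it only omits the mass of longer strings in the precover.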

Using the interface

TransducedLM(vfst, lm, config) takes a VectorizedFST (the transducer), a source LM scorer, and a Config. The example below lifts GPT-2 (a token-level model) to a byte-level model with the hf_realpha transducer, then queries the autoregressive interface of the transduced model:

import asyncio
from genlm.backend import load_model_by_name
from transduced_lm import Config, TransducedLM
from transduced_lm.benchmark.transducer import load_transducer

# Source model: GPT-2 over byte-pair tokens.
llm = load_model_by_name("gpt2", backend="hf")

# Compose it with a transducer f. "hf_realpha" maps GPT-2 tokens → bytes,
# turning the token-level model into a character/byte-level language model.
setup = load_transducer("hf_realpha", llm=llm, model_name="gpt2")
tlm = TransducedLM(setup.vfst, setup.lm, Config(prune_threshold=1e-3))

async def main():
    # Encode a target (byte) prefix as output-symbol ids.
    ctx = tuple(setup.out_sym_to_id[str(b)] for b in b"Hello")

    # Next-symbol distribution of the transduced model p_Y: log p_Y(byte | "Hello")
    dist = await tlm.logp_next(ctx)
    best = max(dist, key=dist.get)
    print(f"most likely next byte: {chr(best)!r}  (logp={dist[best]:.3f})")

    # The precover decomposition behind that number (quotient + remainder beams):
    remainder, quotient = await tlm.decompose(ctx)

asyncio.run(main())

To use your own transformation you need a VectorizedFST. Build the transducer either directly in pynini (symbols are the string forms of integer ids; epsilon is label 0), or with the bundled FST class in transduced_lm.benchmark.ptb.fst and convert it to pynini with transduction_fst_to_pynini from transduced_lm.benchmark.ptb.fst_converter. Then wrap it in VectorizedFST, call compute_universal_states(), and pass it to TransducedLM with any source LM exposing async logp_next_for(ctx) -> ndarray. The three transducers from the paper are constructed in src/transduced_lm/benchmark/fst_loaders.py and ptb_fst_builder.py.
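
The label convention above (integer ids, with 0 reserved for epsilon) can be sketched without any FST library. The block below is a hypothetical plain-Python stand-in for the one-state lowercasing transducer; `sym_to_id`, `arcs`, and `apply_arcs` are illustrative names and are not part of this package or of pynini. It only shows the shape of the data and assumes nothing about how VectorizedFST stores it:

```python
import string

EPSILON = 0  # label 0 is reserved for epsilon, so real symbols start at 1

# Hypothetical symbol table: each character gets a positive integer id.
sym_to_id = {c: i for i, c in enumerate(string.ascii_letters, start=1)}
id_to_sym = {i: c for c, i in sym_to_id.items()}

# One state (0) that is both initial and accepting, with one self-loop per
# input symbol: (source_state, input_label, output_label, next_state).
arcs = [(0, sym_to_id[c], sym_to_id[c.lower()], 0) for c in string.ascii_letters]

def apply_arcs(text: str) -> str:
    """Run the one-state transducer by following the matching arc for each symbol."""
    by_input = {ilabel: olabel for _, ilabel, olabel, _ in arcs}
    return "".join(id_to_sym[by_input[sym_to_id[c]]] for c in text)

print(apply_arcs("HeLLo"))
```

In a pynini build, each tuple would correspond to one arc on the single state; the exact construction used for the paper's transducers is in the loader files named above.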

Installation

Requires Python ≥ 3.10 (tested on 3.12). A fresh environment is recommended:

conda create -n tlm python=3.12 && conda activate tlm   # or: python -m venv
pip install -e .

This installs pinned dependency versions matching the paper experiments (pynini, genlm-bytes, genlm-backend, torch, transformers, …). For GPU-accelerated inference with vLLM (used for all paper experiments):

pip install -e ".[vllm]"

The experiment scripts require a CUDA GPU.

To verify the install end-to-end (package import, vLLM, Hugging Face model download and load, and dataset download), run the setup smoke test. It loads each transducer and model and scores a few symbols, using the same CLI as the experiment scripts below:

bash scripts/smoke_test.sh   # gpt2-large + vesteinn/gpt2-dna; no HF login needed

Transducers

The three transformations studied in the paper:

hf_realpha — tokens → bytes (turns a subword LM into a character-level model).
ptb_ported — tokens → words (applies Penn Treebank tokenization as a transduction).
hf_dna2aa — DNA → amino acids.
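
To make the DNA → amino acids map concrete, here is a minimal sketch of codon translation using the standard genetic code. It assumes one amino acid is emitted per complete codon and that a trailing partial codon emits nothing; how hf_dna2aa treats stop codons, reading frames, and model tokens is not shown here, and `translate` and `codons_for` are hypothetical helper names:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna: str) -> str:
    """Map each complete codon to one amino acid; a trailing partial codon emits nothing."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

def codons_for(aa: str) -> list[str]:
    """All codons that translate to the given amino acid (its one-step preimage)."""
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)

print(translate("ATGTTAG"))  # the trailing "G" is an incomplete codon
print(codons_for("L"))       # leucine has several synonymous codons
```

Because several codons encode the same amino acid, many DNA strings share one amino-acid image, which is the many-to-one structure the preimage sum has to account for.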

The pretrained DNA model is on the Hugging Face Hub at vesteinn/gpt2-dna and downloads automatically when you pass --model vesteinn/gpt2-dna. Llama models require huggingface-cli login for gated access.

Reproducing the paper experiments

Single quick runs:

bash scripts/run_realpha.sh     # tokens → bytes,  GPT-2 Large, one paragraph
bash scripts/run_ptb.sh         # tokens → words (Penn Treebank)
bash scripts/run_dna2aa.sh      # DNA → amino acids

Full pipeline (benchmarks → CSVs → LaTeX tables → figures):

bash scripts/experiments/run_all.sh --quick   # fast smoke test
bash scripts/experiments/run_all.sh           # full reproduction

See scripts/experiments/README.md for the mapping from scripts to paper tables/figures and per-experiment parameters. The scripts/experiments/paper_runs/ directory holds the exact SLURM (sbatch) scripts used for the paper, including the Phi-4 runs. Scripts use the active Python environment; set CONDA_ENV=<name> to have them activate a named conda environment automatically.

Citation

@inproceedings{snbjarnarson2026transducing,
  title     = {Transducing Language Models},
  author    = {V\'esteinn Sn{\ae}bjarnarson and Samuel Kiegeland and Tianyu Liu and Reda Boumasmoud and Ryan Cotterell and Tim Vieira},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=qOyF214xmg}
}
