rycolab/transducing-language-models

Transducing Language Models

Official code for the paper Transducing Language Models (ICLR 2026).

A language model defines a distribution over strings, but downstream tasks often need a different output format — words instead of byte-pair tokens, characters instead of subwords, amino acids instead of DNA. This library takes a language model composed with a finite-state transducer (FST) and gives it the standard autoregressive interface (next-symbol distributions and prefix probabilities) over the transformed output, so it drops into any system built for ordinary language models.

The idea

A source language model $p_{\mathcal{X}}$ defines a distribution over strings in $\mathcal{X}^*$, and $f \colon \mathcal{X}^* \to \mathcal{Y}^*$ is a deterministic string-to-string map (lowercasing, token→byte, DNA→amino-acid, …) encoded as an FST. Applying $f$ to samples from $p_{\mathcal{X}}$ induces a new transduced language model $p_{\mathcal{Y}}$ over the output strings. Its mass on a target string $\boldsymbol{y}$ is the total source mass of every string that maps to it:

$$p_{\mathcal{Y}}(\boldsymbol{y}) \;=\; \Pr_{X \sim p_{\mathcal{X}}}\!\big[\,\boldsymbol{y} = f(X)\,\big] \;=\; \sum_{\boldsymbol{x}\,\in\, f^{-1}(\boldsymbol{y})} p_{\mathcal{X}}(\boldsymbol{x})$$
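The pushforward sum can be checked by hand on a finite toy distribution. The sketch below is not library code: the source distribution `p_x` and its probabilities are made up for illustration, and `str.lower` stands in for the transducer $f$.

```python
from collections import defaultdict

# Hypothetical source distribution p_X over a handful of strings (illustration only).
p_x = {"AB": 0.1, "Ab": 0.2, "aB": 0.3, "ab": 0.15, "b": 0.25}

def pushforward(p_source, f):
    """Return p_Y, where p_Y(y) sums p_source(x) over every x with f(x) == y."""
    p_y = defaultdict(float)
    for x, mass in p_source.items():
        p_y[f(x)] += mass
    return dict(p_y)

# str.lower plays the role of the deterministic map f.
p_y = pushforward(p_x, str.lower)
print(p_y)
```

Four distinct source strings collapse onto the single target string `ab`, so its mass is the sum of theirs.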

Sampling from $p_{\mathcal{Y}}$ is trivial (sample $\boldsymbol{x}$, apply $f$), but scoring and conditioning require summing over that preimage. The key quantity for the autoregressive interface is the prefix probability $\overrightarrow{p_{\mathcal{Y}}}$, which sums the source model over the precover $\mathcal{P}(\boldsymbol{y})$ — the source strings whose image starts with $\boldsymbol{y}$:

$$\overrightarrow{p_{\mathcal{Y}}}(\boldsymbol{y}) \;=\; \Pr_{X \sim p_{\mathcal{X}}}\!\big[\,\boldsymbol{y} \preceq f(X)\,\big] \;=\; \sum_{\boldsymbol{x}\,\in\, \mathcal{P}(\boldsymbol{y})} p_{\mathcal{X}}(\boldsymbol{x}), \qquad \mathcal{P}(\boldsymbol{y}) = \{\, \boldsymbol{x} : \boldsymbol{y} \preceq f(\boldsymbol{x}) \,\}$$

That sum generally ranges over infinitely many source strings, so it cannot be enumerated term by term. The library computes it in finite time by decomposing the precover $\mathcal{P}(\boldsymbol{y})$ into a quotient $\mathcal{Q}$ (cylinders of "universal" continuations, each summable in closed form) and a remainder $\mathcal{R}$ (Algorithm 1 in the paper).
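Why a cylinder is summable in closed form can be seen on a toy source model. The sketch below assumes a hypothetical memoryless model over {a, b} that stops with probability 0.2 at every step; none of these names come from the library. The total mass of all strings that extend a prefix equals the model's prefix probability for it, so the infinite sum over the cylinder collapses to one number.

```python
from itertools import product
from math import prod

# Hypothetical memoryless source model: at each step emit 'a', 'b', or stop (EOS).
P_SYM = {"a": 0.5, "b": 0.3}
P_EOS = 0.2

def prefix_prob(x):
    """Probability that a sampled string starts with x."""
    return prod(P_SYM[c] for c in x)

def string_prob(x):
    """Probability of sampling exactly x, that is, x followed by EOS."""
    return prefix_prob(x) * P_EOS

def truncated_cylinder_mass(x, max_extra):
    """Brute-force sum of string_prob over x + z for every z with len(z) <= max_extra."""
    total = 0.0
    for k in range(max_extra + 1):
        for z in product(P_SYM, repeat=k):
            total += string_prob(x + "".join(z))
    return total

# The truncated sums approach the closed form from below as the cutoff grows.
closed_form = prefix_prob("ab")
for n in (2, 6, 12):
    print(n, truncated_cylinder_mass("ab", n), closed_form)
```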

A worked example: lowercasing

Take the transducer $f$ that lowercases every character — a single state that is both initial and accepting, with one self-loop per input symbol:

  ──▶ (( q0 ))  ⟲     for every input character c, an arc  c : lowercase(c)
                      e.g.  A:a   B:b   …   a:a   b:b   …

For the target string $\boldsymbol{y} = \texttt{ab}$, the precover is every source string that lowercases to something starting with ab. Written with the paper's basis (cylinder) notation $\langle\,\cdot\,\rangle$:

$$\mathcal{P}(\texttt{ab}) \;=\; \big\langle\, \{\texttt{AB},\ \texttt{Ab},\ \texttt{aB},\ \texttt{ab}\} \,\big\rangle$$

So the transduced prefix probability is a finite sum of four source prefix probabilities — these four strings form the quotient $\mathcal{Q}$ of the decomposition (the remainder $\mathcal{R}$ is empty here):

$$\overrightarrow{p_{\mathcal{Y}}}(\texttt{ab}) \;=\; \overrightarrow{p_{\mathcal{X}}}(\texttt{AB}) + \overrightarrow{p_{\mathcal{X}}}(\texttt{Ab}) + \overrightarrow{p_{\mathcal{X}}}(\texttt{aB}) + \overrightarrow{p_{\mathcal{X}}}(\texttt{ab})$$

Computing that decomposition — and the next-symbol distribution it yields — is exactly what TransducedLM does.
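The four-term sum can be reproduced with a toy source model. The sketch below is a hand-rolled illustration, not a call into TransducedLM: `P_SYM`, `P_EOS`, and the helper names are hypothetical, and the memoryless source model is chosen only because its prefix probabilities have a simple product form. It enumerates the casings of a target prefix and derives one next-symbol probability as the ratio of two prefix probabilities.

```python
from itertools import product
from math import prod

# Hypothetical memoryless source model over {A, B, a, b} with an EOS event.
P_SYM = {"A": 0.1, "B": 0.2, "a": 0.3, "b": 0.2}
P_EOS = 0.2  # symbol probabilities plus EOS sum to 1

def source_prefix_prob(x):
    """Source prefix probability: product of per-symbol probabilities."""
    return prod(P_SYM[c] for c in x)

def casings(y):
    """All source strings over the alphabet that lowercase to exactly y."""
    options = [[c for c in P_SYM if c.lower() == t] for t in y]
    return ["".join(choice) for choice in product(*options)]

def transduced_prefix_prob(y):
    """Prefix probability of y under the lowercased model: sum over the quotient."""
    return sum(source_prefix_prob(x) for x in casings(y))

quotient = casings("ab")
p_ab = transduced_prefix_prob("ab")
p_a = transduced_prefix_prob("a")
print(sorted(quotient))
print("prefix prob of 'ab':", p_ab)
print("p(next = 'b' | 'a'):", p_ab / p_a)
```

For lowercasing the quotient is easy to enumerate by hand. For transducers with more states the decomposition is less obvious, which is the case the library automates.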

Using the interface

TransducedLM(vfst, lm, config) takes a VectorizedFST (the transducer), a source LM scorer, and a Config. The example below lifts GPT-2 (a token-level model) to a byte-level model with the hf_realpha transducer, then queries the autoregressive interface of the transduced model:

import asyncio
from genlm.backend import load_model_by_name
from transduced_lm import Config, TransducedLM
from transduced_lm.benchmark.transducer import load_transducer

# Source model: GPT-2 over byte-pair tokens.
llm = load_model_by_name("gpt2", backend="hf")

# Compose it with a transducer f. "hf_realpha" maps GPT-2 tokens → bytes,
# turning the token-level model into a character/byte-level language model.
setup = load_transducer("hf_realpha", llm=llm, model_name="gpt2")
tlm = TransducedLM(setup.vfst, setup.lm, Config(prune_threshold=1e-3))

async def main():
    # Encode a target (byte) prefix as output-symbol ids.
    ctx = tuple(setup.out_sym_to_id[str(b)] for b in b"Hello")

    # Next-symbol distribution of the transduced model p_Y: log p_Y(byte | "Hello")
    dist = await tlm.logp_next(ctx)
    best = max(dist, key=dist.get)
    print(f"most likely next byte: {chr(best)!r}  (logp={dist[best]:.3f})")

    # The precover decomposition behind that number (quotient + remainder beams):
    remainder, quotient = await tlm.decompose(ctx)

asyncio.run(main())

To use your own transformation you need a VectorizedFST. Build the transducer either directly in pynini (symbols are the string forms of integer ids; epsilon is label 0), or with the bundled FST class in transduced_lm.benchmark.ptb.fst and convert it to pynini with transduction_fst_to_pynini from transduced_lm.benchmark.ptb.fst_converter. Then wrap it in VectorizedFST, call compute_universal_states(), and pass it to TransducedLM with any source LM exposing async logp_next_for(ctx) -> ndarray. The three transducers from the paper are constructed in src/transduced_lm/benchmark/fst_loaders.py and ptb_fst_builder.py.
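The source-LM requirement is small enough to sketch. The class below is a stand-in, not part of the library: it assumes only the `async logp_next_for(ctx) -> ndarray` method named above, and that the returned array holds one log-probability per source symbol id. The class name, the `vocab_size` argument, and the uniform distribution are placeholders.

```python
import asyncio
import numpy as np

class UniformSourceLM:
    """Hypothetical source LM that ignores its context and scores every symbol equally."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    async def logp_next_for(self, ctx):
        # One log-probability per symbol id; ctx is accepted but unused here.
        return np.full(self.vocab_size, -np.log(self.vocab_size))

lm = UniformSourceLM(vocab_size=4)
logps = asyncio.run(lm.logp_next_for(()))
print(logps.shape, float(np.exp(logps).sum()))
```

Whether the real scorers also reserve an entry for end-of-string is not stated here, so check the bundled loaders before relying on this array shape.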

Installation

Requires Python ≥ 3.10 (tested on 3.12). A fresh environment is recommended:

conda create -n tlm python=3.12 && conda activate tlm   # or: python -m venv
pip install -e .

This installs pinned dependency versions matching the paper experiments (pynini, genlm-bytes, genlm-backend, torch, transformers, …). For GPU-accelerated inference with vLLM (used for all paper experiments):

pip install -e ".[vllm]"

The experiment scripts require a CUDA GPU.

To verify the install end-to-end — package import, vLLM, HuggingFace model download/load, and dataset download — run the setup smoke test (loads each transducer + model and scores a few symbols, using the same CLI as the experiment scripts below):

bash scripts/smoke_test.sh   # gpt2-large + vesteinn/gpt2-dna; no HF login needed

Transducers

The three transformations studied in the paper:

  • hf_realpha — tokens → bytes (turns a subword LM into a character-level model).
  • ptb_ported — tokens → words (applies Penn Treebank tokenization as a transduction).
  • hf_dna2aa — DNA → amino acids.
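To make the DNA case concrete, the sketch below translates nucleotides codon by codon with a small excerpt of the standard genetic code. It is an illustration only: the table is deliberately partial, the function name is hypothetical, and the real hf_dna2aa transducer may handle symbols, stop codons, and incomplete codons differently.

```python
# Excerpt of the standard genetic code (DNA codon -> one-letter amino acid).
CODON_TO_AA = {
    "ATG": "M",
    "TGG": "W",
    "TTT": "F", "TTC": "F",
    "AAA": "K", "AAG": "K",
    "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
}

def translate(dna):
    """Map complete codons to amino acids; a trailing partial codon emits nothing."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, usable, 3))

# Two different DNA strings, the second with one leftover nucleotide.
print(translate("ATGTTTGGA"))
print(translate("ATGTTCGGGA"))
```

Both inputs translate to the same amino-acid string, which is why scoring an amino-acid sequence means summing over many DNA strings.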

The pretrained DNA model is on the Hugging Face Hub at vesteinn/gpt2-dna and downloads automatically when you pass --model vesteinn/gpt2-dna. Llama models require huggingface-cli login for gated access.

Reproducing the paper experiments

Single quick runs:

bash scripts/run_realpha.sh     # tokens → bytes,  GPT-2 Large, one paragraph
bash scripts/run_ptb.sh         # tokens → words (Penn Treebank)
bash scripts/run_dna2aa.sh      # DNA → amino acids

Full pipeline (benchmarks → CSVs → LaTeX tables → figures):

bash scripts/experiments/run_all.sh --quick   # fast smoke test
bash scripts/experiments/run_all.sh           # full reproduction

See scripts/experiments/README.md for the mapping from scripts to paper tables/figures and per-experiment parameters. The scripts/experiments/paper_runs/ directory holds the exact SLURM (sbatch) scripts used for the paper, including the Phi-4 runs. Scripts use the active Python environment; set CONDA_ENV=<name> to have them activate a named conda environment automatically.

Citation

@inproceedings{snbjarnarson2026transducing,
  title     = {Transducing Language Models},
  author    = {V\'esteinn Sn{\ae}bjarnarson and Samuel Kiegeland and Tianyu Liu and Reda Boumasmoud and Ryan Cotterell and Tim Vieira},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=qOyF214xmg}
}
