Struktur

All-in-one tool for structured data extraction using LLMs. Feed it documents, get back validated JSON. Handles parsing files, chunking, retries, merging, and deduplication — you just define the schema and choose a strategy.

Quickstart | Documentation



struktur extract --input ./invoice.pdf --fields "number, vendor, total:number"
{ 
  "number": "1042", 
  "vendor": "Acme Corp", 
  "total": 2400 
}
...or with more complex schemas:
curl https://example.com/long-rental-contract.docx \
  | struktur extract --stdin \
      --strategy parallel \
      --schema ./contract-schema.json \
      --model openai/gpt-4o-mini
{ 
  "tenant": "Jane Doe", 
  "rent": 1500, 
  "term_months": 12, 
  "start_date": "2026-05-01" 
}


Install

npm install -g @struktur/cli
# or
bun add -g @struktur/cli


CLI quickstart


1. Set your LLM API key

  • Works with env variables or Struktur's built-in secure credential manager.
  • Supports many providers out of the box (OpenAI, Anthropic, Mistral, OpenRouter, OpenCode Zen, ...).
export OPENAI_API_KEY=sk-...
# or store it securely:
echo "sk-..." | struktur config providers add openai --token-stdin

# Set a default model (so you can skip --model every time)
struktur config models use openai/gpt-4o-mini

2. Extract some data!

  • Use the extract command with --input for files/URLs or --stdin for pipes.
  • Define simple schemas with --fields or use --schema for full JSON Schema support.
  • Automatically prepares documents before extraction — no need to manually convert PDFs to text or images.
# From a PDF — parsed and extracted automatically
struktur extract --input ./contract.pdf \
  --fields "parties:array{string}, effective_date, governing_law"

3. Configure strategies, models, and more (Optional)

  • Struktur uses the Agent strategy by default — it autonomously explores documents and extracts data.
  • Set aliases for your favorite models (e.g. fast or quality) or change your default model.
  • Add custom parsers for unsupported file types, or use your own command-line tools for parsing.
  • For multi-provider LLM gateways like OpenRouter, append a hash suffix to pick the upstream provider (e.g. #groq or #cerebras for faster inference).
# Agent is the default - it decides how to extract
struktur extract --input ./document.pdf --schema ./schema.json

# Use a different strategy for specific cases
struktur extract --strategy simple --input ./small-file.txt --fields "title, content"

# Create a model alias
struktur config models alias set fast openrouter/meta-llama/llama-3.1-8b-instruct#groq

# Choose a default model for all extractions
struktur config models use fast

# Add parsers for more file types (supports NPM packages or custom CLI commands)

# 1. Using an npm package
struktur config parsers add --mime application/vnd.ms-excel --npm @my-custom/excel-parser

# 2. Using a CLI command (FILE_PATH is a placeholder)
struktur config parsers add --mime text/html --file-command "my-html-parser FILE_PATH"

# 3. Using a CLI command that reads from stdin
struktur config parsers add --mime text/calendar --stdin-command "my-ical-parser"

Full CLI reference



SDK quickstart

import { extract, agent, urlToArtifact } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";
import type { JSONSchemaType } from "ajv";

type Invoice = { number: string; vendor: string; total: number };

const schema: JSONSchemaType<Invoice> = {
  type: "object",
  properties: {
    number: { type: "string" },
    vendor: { type: "string" },
    total: { type: "number" },
  },
  required: ["number", "vendor", "total"],
  additionalProperties: false,
};

const artifact = await urlToArtifact("https://example.com/invoice.pdf");

const result = await extract({
  artifacts: [artifact],
  schema,
  strategy: agent({
    provider: "openai",
    modelId: "gpt-4o-mini",
  }),
});

console.log(result.data.number); // fully typed
console.log(result.usage.totalTokens);

For quick extractions without writing a full JSON Schema, use the fields shorthand:

const result = await extract({
  artifacts,
  fields: "invoice_number, vendor, total:number, due_date",
  strategy: agent({ provider: "openai", modelId: "gpt-4o-mini" }),
});

Full SDK reference



How it works

Struktur converts input files into Artifacts — normalized JSON with text and media slices. The Agent then autonomously explores the document, deciding how to extract data: reading files, searching for patterns, and building the output incrementally.

flowchart LR
    A[Input] --> B[Parse]
    B --> C[Artifacts]
    C --> D[Agent Strategy]
    D --> E[Validated JSON]
    
    subgraph Agent [Agent Strategy]
        direction TB
        A1[Explore] --> A2[Read/Grep/Find]
        A2 --> A3[Extract Data]
        A3 --> A4[Validate]
        A4 -->|Need more info| A1
    end
    
    D --> Agent --> E

Key stages:

  • Parse: Convert files (PDF, text, images) into Artifact JSON
  • Agent: Autonomously explore and extract using tools (read, grep, bash, find)
  • Validate: Check against schema, retry on errors
  • Output: Return validated JSON

The agent decides how to approach extraction based on your schema and the document content. It may read the entire document at once for small inputs, or navigate through sections systematically for large documents. Every LLM response is validated against your schema. If validation fails, the errors are sent back to the model automatically. Most extractions converge in one or two attempts.
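That validate-and-retry loop can be sketched in a few lines. This is illustrative, not Struktur's implementation: `callModel` stands in for the real LLM call, and the validator only checks required top-level keys rather than full JSON Schema.

```typescript
type ValidationResult = { ok: boolean; errors: string[] };

// Stand-in validator: checks required top-level keys only.
function validateRequired(
  data: Record<string, unknown>,
  required: string[],
): ValidationResult {
  const errors = required
    .filter((key) => data[key] === undefined)
    .map((key) => `missing required field: ${key}`);
  return { ok: errors.length === 0, errors };
}

// Retry loop: validation errors are fed back to the model on the next attempt.
function extractWithRetries(
  callModel: (feedback: string[]) => Record<string, unknown>,
  required: string[],
  maxAttempts = 3,
): Record<string, unknown> {
  let feedback: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = callModel(feedback);
    const result = validateRequired(candidate, required);
    if (result.ok) return candidate;
    feedback = result.errors;
  }
  throw new Error(`no valid output after ${maxAttempts} attempts`);
}
```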

Extraction pipeline explained



Parsing

Struktur has a built-in parsing layer that converts files into Artifacts before extraction. You can use it standalone via the parse command, or it runs automatically when you pass --input to extract.

Built-in formats

| Format | MIME type | Notes |
| --- | --- | --- |
| PDF | application/pdf | Per-page text + embedded images |
| Plain text | text/plain | Split into paragraph blocks |
| Markdown | text/markdown | Treated as text |
| HTML | text/html | Treated as text |
| Images | image/png, image/jpeg, etc. | Passed through as image artifacts |
| Artifact JSON | application/json | Hydrated directly if valid SerializedArtifact[] |
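The SerializedArtifact format itself is defined by the SDK. Purely to illustrate "normalized JSON with text and media slices", a hypothetical shape might look like this; every field name here is an assumption, not the real type:

```typescript
// Hypothetical artifact shape. Consult the SDK's SerializedArtifact type
// for the real structure; these field names are illustrative assumptions.
type TextSlice = { kind: "text"; content: string; page?: number };
type ImageSlice = { kind: "image"; mimeType: string; data: string }; // base64 payload
type ArtifactSketch = {
  source: string;
  slices: Array<TextSlice | ImageSlice>;
};

const example: ArtifactSketch = {
  source: "report.pdf",
  slices: [
    { kind: "text", content: "Q3 revenue grew 12%.", page: 1 },
    { kind: "image", mimeType: "image/png", data: "iVBORw0KGgo=" },
  ],
};
```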

struktur parse

Convert any file or stdin to Artifact JSON. Useful for inspecting what Struktur sees before running extraction, or for building pipelines that cache parsed artifacts.

# Parse a PDF to Artifact JSON
struktur parse --input ./report.pdf

# Parse and save for later
struktur parse --input ./report.pdf --output ./report-artifact.json

# Skip image extraction from PDFs
struktur parse --input ./report.pdf --no-images

# Override MIME detection
struktur parse --input ./data.bin --mime application/pdf

# Pipe through stdin
cat ./report.pdf | struktur parse --stdin --mime application/pdf

Custom parsers

Register external parsers for any MIME type — they handle the conversion and output SerializedArtifact[] JSON.

npm package parser:

struktur config parsers add --mime application/vnd.ms-excel --npm @myorg/excel-parser

The package must export at least one of parseStream(stream, mimeType) or parseFile(path, mimeType), each returning Promise<Artifact[]>. Optionally export detectFileType(header: Uint8Array): boolean for magic-byte detection.
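A minimal package satisfying that contract might look like the following sketch. The Artifact shape below is an assumed placeholder; real parsers should use the SDK's types.

```typescript
import { readFile } from "node:fs/promises";

// Placeholder shape; real parsers should return the SDK's Artifact type.
type Artifact = { source: string; slices: Array<{ kind: "text"; content: string }> };

// Optional magic-byte detection: legacy .xls files start with the OLE2 signature.
export function detectFileType(header: Uint8Array): boolean {
  const ole2 = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
  return ole2.every((byte, i) => header[i] === byte);
}

export async function parseFile(path: string, mimeType: string): Promise<Artifact[]> {
  const buffer = await readFile(path);
  // A real implementation would parse the spreadsheet here; this stub
  // just wraps the raw bytes as a single text slice.
  return [{ source: path, slices: [{ kind: "text", content: buffer.toString("utf8") }] }];
}
```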

Command-line parsers:

# Command receives a file path via FILE_PATH placeholder
struktur config parsers add \
  --mime application/vnd.openxmlformats-officedocument.wordprocessingml.document \
  --file-command "markitdown FILE_PATH"

# Command reads from stdin, outputs SerializedArtifact[] JSON to stdout
struktur config parsers add \
  --mime text/html \
  --stdin-command "my-html-parser"

Inline parsers (SDK):

For code-only parsers in the SDK, use InlineParserDef:

import { parse } from "@struktur/sdk";
import type { InlineParserDef } from "@struktur/sdk";

const excelParser: InlineParserDef = {
  type: "inline",
  handler: async (buffer) => {
    // parse and return Artifact
  },
};

const artifacts = await parse(
  { kind: "file", path: "report.xlsx" },
  { parserConfig: { "application/vnd.ms-excel": excelParser } }
);

Per-invocation override (skips stored config):

struktur extract --input ./file.docx --parser @myorg/docx-parser --fields "title, summary"

Manage configured parsers:

struktur config parsers list
struktur config parsers get --mime application/vnd.ms-excel
struktur config parsers remove --mime application/vnd.ms-excel

MIME detection

MIME type is detected automatically in this order:

  1. --mime flag — always wins if provided
  2. Magic bytes — %PDF, PNG header, JPEG/GIF/WebP markers, Office ZIP signatures
  3. npm parser detectFileType — called with the first 512 bytes if the parser exports it
  4. File extension — .pdf, .txt, .md, .html, .json, .csv, .xml, .yaml, .docx, .xlsx, .pptx, and more

For stdin input with no --mime, detection falls back to text/plain.
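Step 2, magic-byte sniffing, can be sketched for a few of the signatures listed above. This is illustrative only, not Struktur's actual detection code:

```typescript
// Returns a MIME type when the header matches a known signature, else null
// so detection can fall through to the later steps.
function sniffMime(header: Uint8Array): string | null {
  const startsWith = (bytes: number[]) => bytes.every((b, i) => header[i] === b);
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "application/pdf"; // "%PDF"
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "image/png";
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "application/zip"; // Office ZIP container
  return null;
}
```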



Strategies

Struktur uses an Agent by default — it autonomously explores documents and extracts data using a virtual filesystem. The agent decides when to read files, search for patterns, or execute commands based on your schema and the document content.

Security: The agent runs fully sandboxed in the same process — no custom VM needed. It uses an emulated shell with only read/grep/glob utilities. No external HTTP calls or command execution.

For specific use cases, you can also use other strategies:

| Strategy | When to use |
| --- | --- |
| agent (default) | Autonomous exploration — best for most documents |
| simple | Small input, fits in one context window |
| parallel | Large input, order doesn't matter, scalar fields |
| sequential | Large input, context carries across chunks |
| parallelAutoMerge | Large input with arrays — parallel + dedup |
| sequentialAutoMerge | Large input with arrays — sequential + dedup |
| doublePass | Quality matters, two-pass refinement |
| doublePassAutoMerge | Quality + arrays + dedup |
# Agent is the default - no --strategy needed
struktur extract --input ./document.pdf --schema ./schema.json

# Use a specific strategy when needed
struktur extract --input ./document.pdf --schema ./schema.json --strategy simple
import { extract, agent } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: agent({
    provider: "openai",
    modelId: "gpt-4.1-mini",
    maxSteps: 50,
  }),
});
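The AutoMerge variants add a merge-and-dedup step on top of chunked extraction. A naive sketch of that step (structural equality only; the real strategies may match items more cleverly):

```typescript
// Merge array results from several chunks, dropping structurally
// identical items. JSON.stringify serves as a naive identity key.
function mergeArrays<T>(chunkResults: T[][]): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const items of chunkResults) {
    for (const item of items) {
      const key = JSON.stringify(item);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(item);
      }
    }
  }
  return merged;
}
```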

Extraction Strategies



Fields shorthand

Skip the JSON Schema boilerplate for flat extractions:

"title, price:number, status:enum{draft|live}, tags:array"

Supported types: string (default), number, integer, boolean, enum{a|b}, array (defaults to array{string}), array{type}.

For optional fields, nested objects, or TypeScript inference on result.data, use a full JSONSchemaType<T> schema instead.
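To show what the shorthand expands to, here is a simplified parser. It handles the scalar types and array{type} only; enum and other edge cases are omitted, and it is not Struktur's actual implementation:

```typescript
// Simplified expansion of the fields shorthand into a JSON Schema object.
function fieldsToSchema(fields: string): {
  type: "object";
  properties: Record<string, any>;
  required: string[];
} {
  const properties: Record<string, any> = {};
  for (const field of fields.split(",").map((f) => f.trim())) {
    const [name, type = "string"] = field.split(":");
    const arrayMatch = /^array(?:\{(\w+)\})?$/.exec(type);
    properties[name] = arrayMatch
      ? { type: "array", items: { type: arrayMatch[1] ?? "string" } } // array defaults to array{string}
      : { type };
  }
  return { type: "object", properties, required: Object.keys(properties) };
}
```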

Fields Shorthand



Configuration

All persistent settings live under struktur config.

Providers

# Add a provider token
struktur config providers add openai --token sk-...
echo "sk-..." | struktur config providers add anthropic --token-stdin

# List all providers and their status
struktur config providers list

# Remove a token
struktur config providers remove openai

Models

# Set a default model
struktur config models use openai/gpt-4o-mini

# List available models for a provider
struktur config models list --provider openai

# Aliases
struktur config models alias set fast openai/gpt-4o-mini
struktur config models alias set smart anthropic/claude-opus-4
struktur config models use fast

Parsers

struktur config parsers list
struktur config parsers add --mime application/vnd.ms-excel --npm @myorg/excel-parser
struktur config parsers remove --mime application/vnd.ms-excel


Documentation

Full documentation at struktur.sh
