All-in-one tool for structured data extraction using LLMs. Feed it documents, get back validated JSON. Handles parsing files, chunking, retries, merging, and deduplication — you just define the schema and choose a strategy.
Quickstart | Documentation
```sh
struktur extract --input ./invoice.pdf --fields "number, vendor, total:number"
```

```json
{
  "number": "1042",
  "vendor": "Acme Corp",
  "total": 2400
}
```

```sh
curl https://example.com/long-rental-contract.docx \
  | struktur extract --stdin \
      --strategy parallel \
      --schema ./contract-schema.json \
      --model openai/gpt-4o-mini
```

```json
{
  "tenant": "Jane Doe",
  "rent": 1500,
  "term_months": 12,
  "start_date": "2026-05-01"
}
```

```sh
npm install -g @struktur/cli
# or
bun add -g @struktur/cli
```

- Works with env variables or Struktur's built-in secure credential manager.
- Supports many providers out of the box (OpenAI, Anthropic, Mistral, OpenRouter, OpenCode Zen, ...)
```sh
export OPENAI_API_KEY=sk-...
# or store it securely:
echo "sk-..." | struktur config providers add openai --token-stdin

# Set a default model (so you can skip --model every time)
struktur config models use openai/gpt-4o-mini
```

- Use the `extract` command with `--input` for files/URLs or `--stdin` for pipes.
- Define simple schemas with `--fields` or use `--schema` for full JSON Schema support.
- Automatically prepares documents before extraction — no need to manually convert PDFs to text or images.
```sh
# From a PDF — parsed and extracted automatically
struktur extract --input ./contract.pdf \
  --fields "parties:array{string}, effective_date, governing_law"
```

- Struktur uses the Agent strategy by default — it autonomously explores documents and extracts data.
- Set aliases for your favorite models (e.g. `fast` or `quality`) or change your default model.
- Add custom parsers for unsupported file types, or use your own command-line tools for parsing.
- For multi-provider LLM gateways like OpenRouter, use a hashtag to specify which provider you want to use (e.g. `#groq` or `#cerebras` for faster inference).
```sh
# Agent is the default - it decides how to extract
struktur extract --input ./document.pdf --schema ./schema.json

# Use a different strategy for specific cases
struktur extract --strategy simple --input ./small-file.txt --fields "title, content"

# Create a model alias
struktur config models alias set fast openrouter/meta-llama/llama-3.1-8b-instruct#groq

# Choose a default model for all extractions
struktur config models use fast

# Add parsers for more file types (supports NPM packages or custom CLI commands)
# 1. Using an npm package
struktur config parsers add --mime application/vnd.ms-excel --npm @my-custom/excel-parser
# 2. Using a CLI command (FILE_PATH is a placeholder)
struktur config parsers add --mime text/html --file-command "my-html-parser FILE_PATH"
# 3. Using a CLI command that reads from stdin
struktur config parsers add --mime text/calendar --stdin-command "my-ical-parser"
```

```ts
import { extract, agent, urlToArtifact } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";
import type { JSONSchemaType } from "ajv";

type Invoice = { number: string; vendor: string; total: number };

const schema: JSONSchemaType<Invoice> = {
  type: "object",
  properties: {
    number: { type: "string" },
    vendor: { type: "string" },
    total: { type: "number" },
  },
  required: ["number", "vendor", "total"],
  additionalProperties: false,
};

const artifact = await urlToArtifact("https://example.com/invoice.pdf");

const result = await extract({
  artifacts: [artifact],
  schema,
  strategy: agent({
    provider: "openai",
    modelId: "gpt-4o-mini",
  }),
});

console.log(result.data.number); // fully typed
console.log(result.usage.totalTokens);
```

For quick extractions without writing a full JSON Schema, use the fields shorthand:
```ts
const result = await extract({
  artifacts,
  fields: "invoice_number, vendor, total:number, due_date",
  strategy: agent({ provider: "openai", modelId: "gpt-4o-mini" }),
});
```

Struktur converts input files into Artifacts — normalized JSON with text and media slices. The Agent then autonomously explores the document, deciding how to extract data: reading files, searching for patterns, and building the output incrementally.
```mermaid
flowchart LR
    A[Input] --> B[Parse]
    B --> C[Artifacts]
    C --> Agent
    subgraph Agent [Agent Strategy]
        direction TB
        A1[Explore] --> A2[Read/Grep/Find]
        A2 --> A3[Extract Data]
        A3 --> A4[Validate]
        A4 -->|Need more info| A1
    end
    Agent --> E[Validated JSON]
```
Key stages:
- Parse: Convert files (PDF, text, images) into Artifact JSON
- Agent: Autonomously explore and extract using tools (read, grep, bash, find)
- Validate: Check against schema, retry on errors
- Output: Return validated JSON
The agent decides how to approach extraction based on your schema and the document content. It may read the entire document at once for small inputs, or navigate through sections systematically for large documents. Every LLM response is validated against your schema. If validation fails, the errors are sent back to the model automatically. Most extractions converge in one or two attempts.
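The validate-and-retry behaviour can be sketched as follows — a toy sketch, with a stubbed `callModel` and a hard-coded check standing in for real JSON Schema validation:

```ts
// Toy sketch of the validate-and-retry loop. `callModel` stands in for the
// real LLM call; `validate` stands in for JSON Schema validation.
function validate(data: any): string[] {
  const errors: string[] = [];
  if (typeof data?.total !== "number") errors.push("total must be a number");
  return errors;
}

function extractWithRetry(
  callModel: (feedback: string[]) => any,
  maxAttempts = 3
): any {
  let feedback: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const data = callModel(feedback);
    const errors = validate(data);
    if (errors.length === 0) return data; // converged
    feedback = errors; // validation errors go back to the model
  }
  throw new Error("validation did not converge");
}

// Stub model: emits a string total first, fixes it once it sees feedback.
const stub = (feedback: string[]) =>
  feedback.length === 0 ? { total: "2400" } : { total: 2400 };

console.log(extractWithRetry(stub)); // → { total: 2400 } on the second attempt
```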
→ Extraction pipeline explained
Struktur has a built-in parsing layer that converts files into Artifacts before extraction. You can use it standalone via the `parse` command, or it runs automatically when you pass `--input` to `extract`.
| Format | MIME type | Notes |
|---|---|---|
| PDF | `application/pdf` | Per-page text + embedded images |
| Plain text | `text/plain` | Split into paragraph blocks |
| Markdown | `text/markdown` | Treated as text |
| HTML | `text/html` | Treated as text |
| Images | `image/png`, `image/jpeg`, etc. | Passed through as image artifacts |
| Artifact JSON | `application/json` | Hydrated directly if valid `SerializedArtifact[]` |
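The exact `SerializedArtifact` layout is defined by the SDK; purely as an illustration of "normalized JSON with text and media slices", an artifact might look roughly like this (every field name below is an assumption, not the SDK's actual contract):

```ts
// Illustrative only — these field names are assumptions, not the SDK's types.
type TextSlice = { kind: "text"; page?: number; content: string };
type ImageSlice = { kind: "image"; mimeType: string; data: string }; // base64
type IllustrativeArtifact = {
  source: string;
  mimeType: string;
  slices: Array<TextSlice | ImageSlice>;
};

const artifact: IllustrativeArtifact = {
  source: "report.pdf",
  mimeType: "application/pdf",
  slices: [
    { kind: "text", page: 1, content: "Quarterly results..." },
    { kind: "image", mimeType: "image/png", data: "aW1hZ2UtYnl0ZXM=" },
  ],
};

console.log(artifact.slices.length); // → 2
```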
Convert any file or stdin to Artifact JSON. Useful for inspecting what Struktur sees before running extraction, or for building pipelines that cache parsed artifacts.
```sh
# Parse a PDF to Artifact JSON
struktur parse --input ./report.pdf

# Parse and save for later
struktur parse --input ./report.pdf --output ./report-artifact.json

# Skip image extraction from PDFs
struktur parse --input ./report.pdf --no-images

# Override MIME detection
struktur parse --input ./data.bin --mime application/pdf

# Pipe through stdin
cat ./report.pdf | struktur parse --stdin --mime application/pdf
```

Register external parsers for any MIME type — they handle the conversion and output `SerializedArtifact[]` JSON.
npm package parser:
```sh
struktur config parsers add --mime application/vnd.ms-excel --npm @myorg/excel-parser
```

The package must export at least one of `parseStream(stream, mimeType)` or `parseFile(path, mimeType)`, each returning `Promise<Artifact[]>`. Optionally export `detectFileType(header: Uint8Array): boolean` for magic-byte detection.
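A minimal parser package satisfying that contract might look like this sketch (the `Artifact` type here is a stand-in for the SDK's real type; the magic bytes shown are the OLE2 signature used by legacy `.xls` files):

```ts
// Sketch of a parser package for legacy Excel files. `Artifact` is a
// stand-in type; a real parser would import the SDK's definition.
type Artifact = { mimeType: string; text: string };

// Legacy .xls files are OLE2 compound documents: magic bytes D0 CF 11 E0.
export function detectFileType(header: Uint8Array): boolean {
  const magic = [0xd0, 0xcf, 0x11, 0xe0];
  return magic.every((b, i) => header[i] === b);
}

export async function parseFile(
  path: string,
  mimeType: string
): Promise<Artifact[]> {
  // A real implementation would read and convert the spreadsheet here;
  // this placeholder just records what it was asked to parse.
  return [{ mimeType: "text/plain", text: `parsed ${path} (${mimeType})` }];
}
```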
Command-line parsers:
```sh
# Command receives a file path via FILE_PATH placeholder
struktur config parsers add \
  --mime application/vnd.openxmlformats-officedocument.wordprocessingml.document \
  --file-command "markitdown FILE_PATH"

# Command reads from stdin, outputs SerializedArtifact[] JSON to stdout
struktur config parsers add \
  --mime text/html \
  --stdin-command "my-html-parser"
```

Inline parsers (SDK):
For code-only parsers in the SDK, use `InlineParserDef`:
```ts
import { parse } from "@struktur/sdk";
import type { InlineParserDef } from "@struktur/sdk";

const excelParser: InlineParserDef = {
  type: "inline",
  handler: async (buffer) => {
    // parse and return Artifact
  },
};

const artifacts = await parse(
  { kind: "file", path: "report.xlsx" },
  { parserConfig: { "application/vnd.ms-excel": excelParser } }
);
```

Per-invocation override (skips stored config):
```sh
struktur extract --input ./file.docx --parser @myorg/docx-parser --fields "title, summary"
```

Manage configured parsers:
```sh
struktur config parsers list
struktur config parsers get --mime application/vnd.ms-excel
struktur config parsers remove --mime application/vnd.ms-excel
```

MIME type is detected automatically in this order:
1. `--mime` flag — always wins if provided
2. Magic bytes — `%PDF`, PNG header, JPEG/GIF/WebP markers, Office ZIP signatures
3. npm parser `detectFileType` — called with the first 512 bytes if the parser exports it
4. File extension — `.pdf`, `.txt`, `.md`, `.html`, `.json`, `.csv`, `.xml`, `.yaml`, `.docx`, `.xlsx`, `.pptx`, and more

For stdin with no `--mime`, falls back to `text/plain`.
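The magic-byte step can be pictured with a simplified detector like this sketch — not the real implementation, just the idea of sniffing the first few bytes:

```ts
// Simplified magic-byte sniffing: inspect the first bytes of a file.
function sniffMime(header: Uint8Array): string | null {
  const ascii = String.fromCharCode(...Array.from(header.slice(0, 4)));
  if (ascii === "%PDF") return "application/pdf";
  if (header[0] === 0x89 && ascii.slice(1) === "PNG") return "image/png";
  if (header[0] === 0xff && header[1] === 0xd8) return "image/jpeg";
  if (ascii === "PK\x03\x04") return "application/zip"; // Office files are ZIPs
  return null; // fall through to later detection steps
}

console.log(sniffMime(new Uint8Array([0x25, 0x50, 0x44, 0x46]))); // → "application/pdf"
```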
Struktur uses an Agent by default — it autonomously explores documents and extracts data using a virtual filesystem. The agent decides when to read files, search for patterns, or execute commands based on your schema and the document content.
Security: The agent runs fully sandboxed in the same process — no custom VM needed. It uses an emulated shell with only read/grep/glob utilities. No external HTTP calls or command execution.
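The sandboxed `grep` idea — searching parsed text in memory rather than shelling out to a real command — can be pictured with a toy like this:

```ts
// Toy version of a sandboxed grep: search in-memory document lines only.
// No real shell, no filesystem access, no network.
function grep(
  lines: string[],
  pattern: RegExp
): Array<{ line: number; text: string }> {
  return lines
    .map((text, i) => ({ line: i + 1, text }))
    .filter(({ text }) => pattern.test(text));
}

const doc = ["Tenant: Jane Doe", "Rent: $1,500/month", "Term: 12 months"];
console.log(grep(doc, /rent/i)); // → [{ line: 2, text: "Rent: $1,500/month" }]
```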
For specific use cases, you can also use other strategies:
| Strategy | When to use |
|---|---|
| `agent` (default) | Autonomous exploration — best for most documents |
| `simple` | Small input, fits in one context window |
| `parallel` | Large input, order doesn't matter, scalar fields |
| `sequential` | Large input, context carries across chunks |
| `parallelAutoMerge` | Large input with arrays — parallel + dedup |
| `sequentialAutoMerge` | Large input with arrays — sequential + dedup |
| `doublePass` | Quality matters, two-pass refinement |
| `doublePassAutoMerge` | Quality + arrays + dedup |
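The AutoMerge variants combine per-chunk results and deduplicate array items. Conceptually the merge step looks something like this sketch (the real strategies' merge logic is not shown here; this only illustrates the idea):

```ts
// Sketch of the AutoMerge idea: concatenate array results from each chunk,
// then drop structural duplicates via a serialized key.
function mergeArrays<T>(chunkResults: T[][]): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const items of chunkResults) {
    for (const item of items) {
      const key = JSON.stringify(item);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(item);
      }
    }
  }
  return merged;
}

const parties = mergeArrays([
  ["Acme Corp", "Jane Doe"],
  ["Jane Doe", "Globex LLC"], // overlap across chunks
]);
console.log(parties); // → ["Acme Corp", "Jane Doe", "Globex LLC"]
```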
```sh
# Agent is the default - no --strategy needed
struktur extract --input ./document.pdf --schema ./schema.json

# Use a specific strategy when needed
struktur extract --input ./document.pdf --schema ./schema.json --strategy simple
```

```ts
import { extract, agent } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: agent({
    provider: "openai",
    modelId: "gpt-4.1-mini",
    maxSteps: 50,
  }),
});
```

Skip the JSON Schema boilerplate for flat extractions:
```
"title, price:number, status:enum{draft|live}, tags:array"
```

Supported types: `string` (default), `number`, `integer`, `boolean`, `enum{a|b}`, `array` (defaults to `array{string}`), `array{type}`.

For optional fields, nested objects, or TypeScript inference on `result.data`, use a full `JSONSchemaType<T>` schema instead.
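As an illustration of how the shorthand corresponds to JSON Schema, here is a hypothetical translator covering just the types listed above (this is not the SDK's actual parser):

```ts
// Hypothetical sketch: map the fields shorthand onto a JSON Schema object.
// Covers only the simple cases documented above.
function fieldsToSchema(fields: string) {
  const properties: Record<string, any> = {};
  for (const part of fields.split(",").map((s) => s.trim())) {
    const [name, type = "string"] = part.split(":");
    const enumMatch = type.match(/^enum\{(.+)\}$/);
    const arrayMatch = type.match(/^array(?:\{(.+)\})?$/);
    if (enumMatch) {
      properties[name] = { enum: enumMatch[1].split("|") };
    } else if (arrayMatch) {
      // `array` defaults to array{string}
      properties[name] = { type: "array", items: { type: arrayMatch[1] ?? "string" } };
    } else {
      properties[name] = { type };
    }
  }
  return { type: "object", properties, required: Object.keys(properties) };
}

const schema = fieldsToSchema("title, price:number, status:enum{draft|live}, tags:array");
console.log(schema.properties.status); // → { enum: ["draft", "live"] }
```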
All persistent settings live under `struktur config`.
```sh
# Add a provider token
struktur config providers add openai --token sk-...
echo "sk-..." | struktur config providers add anthropic --token-stdin

# List all providers and their status
struktur config providers list

# Remove a token
struktur config providers remove openai
```

```sh
# Set a default model
struktur config models use openai/gpt-4o-mini

# List available models for a provider
struktur config models list --provider openai

# Aliases
struktur config models alias set fast openai/gpt-4o-mini
struktur config models alias set smart anthropic/claude-opus-4
struktur config models use fast
```

```sh
struktur config parsers list
struktur config parsers add --mime application/vnd.ms-excel --npm @myorg/excel-parser
struktur config parsers remove --mime application/vnd.ms-excel
```

Full documentation at struktur.sh