Struktur

All-in-one tool for structured data extraction using LLMs. Feed it documents, get back validated JSON. Handles parsing files, chunking, retries, merging, and deduplication — you just define the schema and choose a strategy.

Quickstart | Documentation



struktur extract --input ./invoice.pdf --fields "number, vendor, total:number"
{ 
  "number": "1042", 
  "vendor": "Acme Corp", 
  "total": 2400 
}
...or with more complex schemas:
curl https://example.com/long-rental-contract.docx \
  | struktur extract --stdin \
      --strategy parallel \
      --schema ./contract-schema.json \
      --model openai/gpt-4o-mini
{ 
  "tenant": "Jane Doe", 
  "rent": 1500, 
  "term_months": 12, 
  "start_date": "2026-05-01" 
}


Install

npm install -g @struktur/cli
# or
bun add -g @struktur/cli


CLI quickstart


1. Set your LLM API key

  • Works with env variables or Struktur's built-in secure credential manager.
  • Supports many providers out of the box (OpenAI, Anthropic, Mistral, OpenRouter, OpenCode Zen, ...).
export OPENAI_API_KEY=sk-...
# or store it securely:
echo "sk-..." | struktur config providers add openai --token-stdin

# Set a default model (so you can skip --model every time)
struktur config models use openai/gpt-4o-mini

2. Extract some data!

  • Use the extract command with --input for files/URLs or --stdin for pipes.
  • Define simple schemas with --fields or use --schema for full JSON Schema support.
  • Automatically prepares documents before extraction — no need to manually convert PDFs to text or images.
# From a PDF — parsed and extracted automatically
struktur extract --input ./contract.pdf \
  --fields "parties:array{string}, effective_date, governing_law"

3. Configure strategies, models, and more (Optional)

  • Struktur uses the Agent strategy by default — it autonomously explores documents and extracts data.
  • Set aliases for your favorite models (e.g. fast or quality) or change your default model.
  • Add custom parsers for unsupported file types, or use your own command-line tools for parsing.
  • For multi-provider LLM gateways like OpenRouter, append a hash suffix to pick the upstream provider (e.g. #groq or #cerebras for faster inference).
# Agent is the default - it decides how to extract
struktur extract --input ./document.pdf --schema ./schema.json

# Use a different strategy for specific cases
struktur extract --strategy simple --input ./small-file.txt --fields "title, content"

# Create a model alias
struktur config models alias set fast openrouter/meta-llama/llama-3.1-8b-instruct#groq

# Choose a default model for all extractions
struktur config models use fast

# Add parsers for more file types (supports NPM packages or custom CLI commands)

# 1. Using an npm package
struktur config parsers add --mime application/vnd.ms-excel --npm @my-custom/excel-parser

# 2. Using a CLI command (FILE_PATH is a placeholder)
struktur config parsers add --mime text/html --file-command "my-html-parser FILE_PATH"

# 3. Using a CLI command that reads from stdin
struktur config parsers add --mime text/calendar --stdin-command "my-ical-parser"

Full CLI reference



SDK quickstart

import { extract, agent, urlToArtifact } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";
import type { JSONSchemaType } from "ajv";

type Invoice = { number: string; vendor: string; total: number };

const schema: JSONSchemaType<Invoice> = {
  type: "object",
  properties: {
    number: { type: "string" },
    vendor: { type: "string" },
    total: { type: "number" },
  },
  required: ["number", "vendor", "total"],
  additionalProperties: false,
};

const artifact = await urlToArtifact("https://example.com/invoice.pdf");

const result = await extract({
  artifacts: [artifact],
  schema,
  strategy: agent({
    provider: "openai",
    modelId: "gpt-4o-mini",
  }),
});

console.log(result.data.number); // fully typed
console.log(result.usage.totalTokens);

For quick extractions without writing a full JSON Schema, use the fields shorthand:

const result = await extract({
  artifacts,
  fields: "invoice_number, vendor, total:number, due_date",
  strategy: agent({ provider: "openai", modelId: "gpt-4o-mini" }),
});

Full SDK reference



How it works

Struktur converts input files into Artifacts — normalized JSON with text and media slices. The Agent then autonomously explores the document, deciding how to extract data: reading files, searching for patterns, and building the output incrementally.

flowchart LR
    A[Input] --> B[Parse]
    B --> C[Artifacts]
    C --> D[Agent Strategy]
    D --> E[Validated JSON]
    
    subgraph Agent [Agent Strategy]
        direction TB
        A1[Explore] --> A2[Read/Grep/Find]
        A2 --> A3[Extract Data]
        A3 --> A4[Validate]
        A4 -->|Need more info| A1
    end
    
    D --> Agent --> E

Key stages:

  • Parse: Convert files (PDF, text, images) into Artifact JSON
  • Agent: Autonomously explore and extract using tools (read, grep, bash, find)
  • Validate: Check against schema, retry on errors
  • Output: Return validated JSON

The agent decides how to approach extraction based on your schema and the document content. It may read the entire document at once for small inputs, or navigate through sections systematically for large documents. Every LLM response is validated against your schema. If validation fails, the errors are sent back to the model automatically. Most extractions converge in one or two attempts.
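That validate-and-retry loop can be sketched in a few lines. This is illustrative, not Struktur's implementation: `callModel` stands in for the real LLM call, and the validator only checks required top-level keys rather than full JSON Schema.

```typescript
type ValidationResult = { ok: boolean; errors: string[] };

// Stand-in validator: checks required top-level keys only.
function validateRequired(
  data: Record<string, unknown>,
  required: string[],
): ValidationResult {
  const errors = required
    .filter((key) => data[key] === undefined)
    .map((key) => `missing required field: ${key}`);
  return { ok: errors.length === 0, errors };
}

// Retry loop: validation errors are fed back to the model on the next attempt.
function extractWithRetries(
  callModel: (feedback: string[]) => Record<string, unknown>,
  required: string[],
  maxAttempts = 3,
): Record<string, unknown> {
  let feedback: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = callModel(feedback);
    const result = validateRequired(candidate, required);
    if (result.ok) return candidate;
    feedback = result.errors;
  }
  throw new Error(`no valid output after ${maxAttempts} attempts`);
}
```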

Extraction pipeline explained



Parsing

Struktur has a built-in parsing layer that converts files into Artifacts before extraction. You can use it standalone via the parse command, or it runs automatically when you pass --input to extract.

Built-in formats

| Format | MIME type | Notes |
| --- | --- | --- |
| PDF | application/pdf | Per-page text + embedded images |
| Plain text | text/plain | Split into paragraph blocks |
| Markdown | text/markdown | Treated as text |
| HTML | text/html | Treated as text |
| Images | image/png, image/jpeg, etc. | Passed through as image artifacts |
| Artifact JSON | application/json | Hydrated directly if valid SerializedArtifact[] |
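The SerializedArtifact format itself is defined by the SDK. Purely to illustrate "normalized JSON with text and media slices", a hypothetical shape might look like this; every field name here is an assumption, not the real type:

```typescript
// Hypothetical artifact shape. Consult the SDK's SerializedArtifact type
// for the real structure; these field names are illustrative assumptions.
type TextSlice = { kind: "text"; content: string; page?: number };
type ImageSlice = { kind: "image"; mimeType: string; data: string }; // base64 payload
type ArtifactSketch = {
  source: string;
  slices: Array<TextSlice | ImageSlice>;
};

const example: ArtifactSketch = {
  source: "report.pdf",
  slices: [
    { kind: "text", content: "Q3 revenue grew 12%.", page: 1 },
    { kind: "image", mimeType: "image/png", data: "iVBORw0KGgo=" },
  ],
};
```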

struktur parse

Convert any file or stdin to Artifact JSON. Useful for inspecting what Struktur sees before running extraction, or for building pipelines that cache parsed artifacts.

# Parse a PDF to Artifact JSON
struktur parse --input ./report.pdf

# Parse and save for later
struktur parse --input ./report.pdf --output ./report-artifact.json

# Skip image extraction from PDFs
struktur parse --input ./report.pdf --no-images

# Override MIME detection
struktur parse --input ./data.bin --mime application/pdf

# Pipe through stdin
cat ./report.pdf | struktur parse --stdin --mime application/pdf

Custom parsers

Register external parsers for any MIME type — they handle the conversion and output SerializedArtifact[] JSON.

npm package parser:

struktur config parsers add --mime application/vnd.ms-excel --npm @myorg/excel-parser

The package must export at least one of parseStream(stream, mimeType) or parseFile(path, mimeType), each returning Promise<Artifact[]>. Optionally export detectFileType(header: Uint8Array): boolean for magic-byte detection.
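A minimal package satisfying that contract might look like the following sketch. The Artifact shape below is an assumed placeholder; real parsers should use the SDK's types.

```typescript
import { readFile } from "node:fs/promises";

// Placeholder shape; real parsers should return the SDK's Artifact type.
type Artifact = { source: string; slices: Array<{ kind: "text"; content: string }> };

// Optional magic-byte detection: legacy .xls files start with the OLE2 signature.
export function detectFileType(header: Uint8Array): boolean {
  const ole2 = [0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1];
  return ole2.every((byte, i) => header[i] === byte);
}

export async function parseFile(path: string, mimeType: string): Promise<Artifact[]> {
  const buffer = await readFile(path);
  // A real implementation would parse the spreadsheet here; this stub
  // just wraps the raw bytes as a single text slice.
  return [{ source: path, slices: [{ kind: "text", content: buffer.toString("utf8") }] }];
}
```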

Command-line parsers:

# Command receives a file path via FILE_PATH placeholder
struktur config parsers add \
  --mime application/vnd.openxmlformats-officedocument.wordprocessingml.document \
  --file-command "markitdown FILE_PATH"

# Command reads from stdin, outputs SerializedArtifact[] JSON to stdout
struktur config parsers add \
  --mime text/html \
  --stdin-command "my-html-parser"

Inline parsers (SDK):

For code-only parsers in the SDK, use InlineParserDef:

import { parse } from "@struktur/sdk";
import type { InlineParserDef } from "@struktur/sdk";

const excelParser: InlineParserDef = {
  type: "inline",
  handler: async (buffer) => {
    // parse and return Artifact
  },
};

const artifacts = await parse(
  { kind: "file", path: "report.xlsx" },
  { parserConfig: { "application/vnd.ms-excel": excelParser } }
);

Per-invocation override (skips stored config):

struktur extract --input ./file.docx --parser @myorg/docx-parser --fields "title, summary"

Manage configured parsers:

struktur config parsers list
struktur config parsers get --mime application/vnd.ms-excel
struktur config parsers remove --mime application/vnd.ms-excel

MIME detection

MIME type is detected automatically in this order:

  1. --mime flag — always wins if provided
  2. Magic bytes — %PDF, PNG header, JPEG/GIF/WebP markers, Office ZIP signatures
  3. npm parser detectFileType — called with the first 512 bytes if the parser exports it
  4. File extension — .pdf, .txt, .md, .html, .json, .csv, .xml, .yaml, .docx, .xlsx, .pptx, and more

For stdin input with no --mime, detection falls back to text/plain.
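Step 2, magic-byte sniffing, can be sketched for a few of the signatures listed above. This is illustrative only, not Struktur's actual detection code:

```typescript
// Returns a MIME type when the header matches a known signature, else null
// so detection can fall through to the later steps.
function sniffMime(header: Uint8Array): string | null {
  const startsWith = (bytes: number[]) => bytes.every((b, i) => header[i] === b);
  if (startsWith([0x25, 0x50, 0x44, 0x46])) return "application/pdf"; // "%PDF"
  if (startsWith([0x89, 0x50, 0x4e, 0x47])) return "image/png";
  if (startsWith([0xff, 0xd8, 0xff])) return "image/jpeg";
  if (startsWith([0x50, 0x4b, 0x03, 0x04])) return "application/zip"; // Office ZIP container
  return null;
}
```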



Strategies

Struktur uses an Agent by default — it autonomously explores documents and extracts data using a virtual filesystem. The agent decides when to read files, search for patterns, or execute commands based on your schema and the document content.

Security: The agent runs fully sandboxed in the same process — no custom VM needed. It uses an emulated shell with only read/grep/glob utilities. No external HTTP calls or command execution.

For specific use cases, you can also use other strategies:

| Strategy | When to use |
| --- | --- |
| agent (default) | Autonomous exploration — best for most documents |
| simple | Small input, fits in one context window |
| parallel | Large input, order doesn't matter, scalar fields |
| sequential | Large input, context carries across chunks |
| parallelAutoMerge | Large input with arrays — parallel + dedup |
| sequentialAutoMerge | Large input with arrays — sequential + dedup |
| doublePass | Quality matters, two-pass refinement |
| doublePassAutoMerge | Quality + arrays + dedup |
# Agent is the default - no --strategy needed
struktur extract --input ./document.pdf --schema ./schema.json

# Use a specific strategy when needed
struktur extract --input ./document.pdf --schema ./schema.json --strategy simple
import { extract, agent } from "@struktur/sdk";
import { openai } from "@ai-sdk/openai";

const result = await extract({
  artifacts,
  schema,
  strategy: agent({
    provider: "openai",
    modelId: "gpt-4.1-mini",
    maxSteps: 50,
  }),
});
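The AutoMerge variants add a merge-and-dedup step on top of chunked extraction. A naive sketch of that step (structural equality only; the real strategies may match items more cleverly):

```typescript
// Merge array results from several chunks, dropping structurally
// identical items. JSON.stringify serves as a naive identity key.
function mergeArrays<T>(chunkResults: T[][]): T[] {
  const seen = new Set<string>();
  const merged: T[] = [];
  for (const items of chunkResults) {
    for (const item of items) {
      const key = JSON.stringify(item);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(item);
      }
    }
  }
  return merged;
}
```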

Extraction Strategies



Fields shorthand

Skip the JSON Schema boilerplate for flat extractions:

"title, price:number, status:enum{draft|live}, tags:array"

Supported types: string (default), number, integer, boolean, enum{a|b}, array (defaults to array{string}), array{type}.

For optional fields, nested objects, or TypeScript inference on result.data, use a full JSONSchemaType<T> schema instead.
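To show what the shorthand expands to, here is a simplified parser. It handles the scalar types and array{type} only; enum and other edge cases are omitted, and it is not Struktur's actual implementation:

```typescript
// Simplified expansion of the fields shorthand into a JSON Schema object.
function fieldsToSchema(fields: string): {
  type: "object";
  properties: Record<string, any>;
  required: string[];
} {
  const properties: Record<string, any> = {};
  for (const field of fields.split(",").map((f) => f.trim())) {
    const [name, type = "string"] = field.split(":");
    const arrayMatch = /^array(?:\{(\w+)\})?$/.exec(type);
    properties[name] = arrayMatch
      ? { type: "array", items: { type: arrayMatch[1] ?? "string" } } // array defaults to array{string}
      : { type };
  }
  return { type: "object", properties, required: Object.keys(properties) };
}
```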

Fields Shorthand



Configuration

All persistent settings live under struktur config.

Providers

# Add a provider token
struktur config providers add openai --token sk-...
echo "sk-..." | struktur config providers add anthropic --token-stdin

# List all providers and their status
struktur config providers list

# Remove a token
struktur config providers remove openai

Models

# Set a default model
struktur config models use openai/gpt-4o-mini

# List available models for a provider
struktur config models list --provider openai

# Aliases
struktur config models alias set fast openai/gpt-4o-mini
struktur config models alias set smart anthropic/claude-opus-4
struktur config models use fast

Parsers

struktur config parsers list
struktur config parsers add --mime application/vnd.ms-excel --npm @myorg/excel-parser
struktur config parsers remove --mime application/vnd.ms-excel


Documentation

Full documentation at struktur.sh
