Extract, repair, parse, and validate structured outputs from messy LLM text, consistently across C++, Python, and TypeScript.
Parsing you can trust: it turns ambiguity into clarity, with fewer surprises and more signal.
A powerful cross-language toolkit for extracting JSON-like payloads from mixed text, applying controlled repairs (JSON-ish normalization), parsing, and validating with a pragmatic schema subset.
It also includes validators for Markdown, key-value (.env-ish) text, and SQL, plus incremental (streaming) parsers.
The same core behavior is shared across implementations (C++ core + bindings), so you can reuse schemas and expect consistent error fields.
Typical inputs include:
- A JSON object/array embedded inside other text
- A fenced code block (for example, one starting with ```` ```json ````)
- Trailing commas
- Smart quotes
- Occasional Python literals (`True`, `False`, `None`)
Repairs are explicit and configurable via `RepairConfig`, and the extended APIs (`*_ex`) return:
- `value`: the parsed value
- `fixed`: the repaired text
- `metadata`: which repairs actually happened
Supported repairs (opt-in via `RepairConfig`) include:
- `fixSmartQuotes`: normalize curly quotes to ASCII quotes
- `stripJsonComments`: remove `// ...` and `/* ... */` comments
- `replacePythonLiterals`: translate Python-style literals (`True`/`False`/`None`) to JSON (`true`/`false`/`null`)
- `convertKvObjectToJson`: accept loose `key=value` blobs and convert them to JSON objects
- `quoteUnquotedKeys`: allow `{a: 1}`-style objects by quoting keys
- `dropTrailingCommas`: remove trailing commas in objects/arrays
- `allowSingleQuotes`: allow single-quoted strings/keys when parsing
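As a rough sketch of what two of these repairs amount to (illustrative only; this is not the library's implementation, and a real repair pass must also avoid touching commas or quotes inside string literals):

```python
import json
import re

def normalize_jsonish(text: str) -> str:
    """Naive sketch of fixSmartQuotes + dropTrailingCommas (not the library's code)."""
    # fixSmartQuotes: map curly quotes to their ASCII equivalents
    for curly, ascii_q in (("\u201c", '"'), ("\u201d", '"'), ("\u2018", "'"), ("\u2019", "'")):
        text = text.replace(curly, ascii_q)
    # dropTrailingCommas: remove a comma that directly precedes } or ]
    return re.sub(r",\s*([}\]])", r"\1", text)

messy = "{\u201ctitle\u201d: \u201cPlan\u201d, \"steps\": [1, 2,],}"
print(json.loads(normalize_jsonish(messy)))  # {'title': 'Plan', 'steps': [1, 2]}
```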
Duplicate keys inside objects are handled via `duplicateKeyPolicy`:
- `firstWins` (default): keep the first occurrence (backwards compatible)
- `lastWins`: overwrite with the last occurrence
- `error`: reject and raise a parse error with a specific key path (for example `$.a`)
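The three policies can be pictured with plain `json.loads` and an `object_pairs_hook` (a conceptual sketch of the semantics, not the library's parser; `loads_with_policy` is a hypothetical helper):

```python
import json

def loads_with_policy(text: str, policy: str = "firstWins"):
    """Illustrate firstWins / lastWins / error duplicate-key handling."""
    def hook(pairs):
        out = {}
        for key, value in pairs:
            if key in out:
                if policy == "firstWins":
                    continue  # keep the first occurrence
                if policy == "error":
                    raise ValueError(f"duplicate key at $.{key}")
            out[key] = value  # first insert, or lastWins overwrite
        return out
    return json.loads(text, object_pairs_hook=hook)

print(loads_with_policy('{"a": 1, "a": 2}', "firstWins"))  # {'a': 1}
print(loads_with_policy('{"a": 1, "a": 2}', "lastWins"))   # {'a': 2}
```

Unlike the library, this sketch reports only a flat `$.key` path rather than the full key path.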
Parse YAML from LLM output with automatic repairs:
- Extract YAML from `yaml`/`yml` fenced blocks
- Extract multiple YAML documents (separated by `---`)
- Fix tabs → spaces and normalize indentation
- Allow inline JSON objects/arrays within YAML
- Validate against JSON Schema (same as JSON)
Repairs are configurable via `YamlRepairConfig`:
- `fixTabs`: convert tabs to spaces
- `normalizeIndentation`: ensure consistent spacing
- `fixUnquotedValues`: handle unquoted special characters
- `allowInlineJson`: permit JSON-style syntax within YAML
- `quoteAmbiguousStrings`: auto-quote strings that could be numbers/booleans
APIs mirror the JSON-ish pattern:
- C++: `loads_yamlish`, `loads_yamlish_ex`, `parse_and_validate_yaml`, `dumps_yaml`
- Python: `loads_yamlish`, `parse_and_validate_yaml`, `dumps_yaml`
- TypeScript: `loadsYamlish`, `parseAndValidateYaml`, `dumpsYaml`
Parse TOML from LLM output with automatic repairs:
- Extract TOML from `toml` fenced blocks
- Support standard tables (`[section]`) and arrays of tables (`[[items]]`)
- Handle dotted keys (`a.b.c = value`)
- Parse inline tables and inline arrays
- Support for all TOML value types (strings, numbers, booleans, dates)
- Convert single quotes to double quotes
- Normalize whitespace (tabs to spaces)
- Validate against JSON Schema (same as JSON)
Repairs are configurable via `TomlRepairConfig`:
- `fixUnquotedStrings`: handle unquoted string values
- `allowSingleQuotes`: convert single quotes to double quotes
- `normalizeWhitespace`: convert tabs to spaces
- `fixTableNames`: auto-fix table names with special characters
- `allowMultilineInlineTables`: permit multiline inline tables
APIs mirror the JSON-ish pattern:
- C++: `loads_tomlish`, `loads_tomlish_ex`, `parse_and_validate_toml`, `dumps_toml`
- Python: `loads_tomlish`, `parse_and_validate_toml`, `dumps_toml`
- TypeScript: `loadsTomlish`, `parseAndValidateToml`, `dumpsToml`
Parse XML and HTML from LLM output with automatic repairs:
- Extract XML/HTML from `xml` or `html` fenced blocks
- Parse well-formed XML and lenient HTML (auto-close tags, unquoted attributes)
- Support for elements, text, comments, CDATA, processing instructions, and doctypes
- Query nodes with CSS-like selectors (`query_xml`)
- Convert XML to a JSON representation (`xml_to_json`)
- Extract text content from node trees (`xml_text_content`)
- Validate against JSON Schema (same as JSON)
Repairs are configurable via `XmlRepairConfig`:
- `html_mode`: enable HTML-specific parsing (void elements, optional closing tags)
- `fix_unquoted_attributes`: handle `<div class=foo>`-style attributes
- `auto_close_tags`: automatically close unclosed tags
- `normalize_whitespace`: normalize whitespace in text nodes
- `lowercase_names`: convert tag/attribute names to lowercase
- `decode_entities`: decode HTML entities (`&amp;` → `&`)
APIs mirror the JSON-ish pattern:
- C++: `loads_xml`, `loads_xml_ex`, `loads_html`, `loads_html_ex`, `xml_to_json`, `dumps_xml`, `dumps_html`, `query_xml`, `validate_xml`, `parse_and_validate_xml`
- Python: `loads_xml`, `loads_xml_ex`, `loads_html`, `loads_html_ex`, `xml_to_json`, `dumps_xml`, `dumps_html`, `query_xml`, `validate_xml`, `parse_and_validate_xml`
- TypeScript: `loadsXml`, `loadsXmlEx`, `loadsHtml`, `loadsHtmlEx`, `xmlToJson`, `dumpsXml`, `dumpsHtml`, `queryXml`, `validateXml`, `parseAndValidateXml`
If your input contains multiple JSON blobs (for example, several `json` fences plus inline `{...}` / `[...]`), you can extract/parse all blocks instead of only the first one.
Semantics:
- Candidates are returned in source order (left-to-right in the original text).
- Validation errors are rooted at `$[i]` to indicate which block failed (for example `$[0].a`).
- `*_all_ex` returns per-block `fixed[]` and `metadata[]` aligned with `values[]`.
APIs:
- C++: `extract_json_candidates`, `loads_jsonish_all`, `loads_jsonish_all_ex`, `parse_and_validate_all`, `parse_and_validate_all_ex`
- Python: `extract_json_candidates`, `loads_jsonish_all`, `loads_jsonish_all_ex`, `parse_and_validate_json_all`, `parse_and_validate_json_all_ex`
- TypeScript: `extractJsonCandidates`, `loadsJsonishAll`, `loadsJsonishAllEx`, `parseAndValidateJsonAll`, `parseAndValidateJsonAllEx`
Validation is performed against a pragmatic schema subset (see Schema support below).
Errors carry consistent fields across languages:
- `kind`: `schema` | `type` | `limit` | `parse`
- `path`: JSONPath-ish location (for example `$.steps[0].id`)
- `jsonPointer` (Python/TypeScript): JSON Pointer (for example `/steps/0/id`)
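The two location formats encode the same path; for simple paths (no escaping of `~`, `/`, or quoted keys) the mapping can be sketched with a small hypothetical converter:

```python
import re

def jsonpath_to_pointer(path: str) -> str:
    """Convert a simple JSONPath-ish location like '$.steps[0].id' to '/steps/0/id'."""
    # Each token is either .name or [index]; collect them in order.
    tokens = re.findall(r"\.([^.\[\]]+)|\[(\d+)\]", path)
    return "/" + "/".join(name or index for name, index in tokens)

print(jsonpath_to_pointer("$.steps[0].id"))  # /steps/0/id
```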
In addition to JSON schema validation:
- SQL: parse and validate against an allowlist-style policy (statements/tables/keywords/placeholders/limits)
- Markdown: validate document structure (headings/sections/bullets/code fences/tables/task lists)
- Key-value: validate `.env`-ish text (required keys, allowed extras, regex patterns, enum lists)
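To make the key-value checks concrete, here is a standalone sketch of what validating `.env`-ish text involves (hypothetical helper; the library's actual option names and error shape may differ):

```python
import re

def validate_kv(text, required, allowed_extra=(), patterns=None):
    """Return a list of error strings for .env-ish input (illustrative only)."""
    patterns = patterns or {}
    seen = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and non key=value lines
        key, _, value = line.partition("=")
        seen[key.strip()] = value.strip()
    errors = [f"missing required key: {k}" for k in required if k not in seen]
    errors += [f"unexpected key: {k}" for k in seen
               if k not in required and k not in allowed_extra]
    errors += [f"bad value for {k}" for k, rx in patterns.items()
               if k in seen and not re.fullmatch(rx, seen[k])]
    return errors

env_text = "HOST=localhost\nPORT=8080\n# comment\nDEBUG=true"
print(validate_kv(env_text, required=["HOST", "PORT"],
                  allowed_extra=["DEBUG"], patterns={"PORT": r"\d+"}))  # []
```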
For LLM streaming output, the library provides incremental parsers/collectors:
- JSON: emit-first parser and emit-all collectors, with optional limits (`maxBufferBytes`, `maxItems`)
- SQL: incremental parsing/validation for streaming SQL output

Core methods are consistent across languages: `append(...)`, `poll()`, plus `finish()`/`close()` and `location()`.
Automatically infer JSON Schema from example values:
- Infer type, format, and constraints from one or more sample values
- Detect common string formats: `date-time`, `date`, `time`, `email`, `uri`, `uuid`, `ipv4`, `hostname`
- Merge schemas from multiple examples (intersection of required fields, union of properties)
- Detect enums from repeated string values
- Configurable via `SchemaInferenceConfig`
Options:
- `include_examples`: include sample values in an `examples` array
- `max_examples`: maximum number of examples to collect (default: 5)
- `include_default`: include the first seen value as `default`
- `infer_formats`: detect string formats (email, uri, date-time, etc.)
- `infer_patterns`: generate regex patterns from string values
- `infer_numeric_ranges`: add `minimum`/`maximum` from observed values
- `infer_string_lengths`: add `minLength`/`maxLength` from observed values
- `infer_array_lengths`: add `minItems`/`maxItems` from observed arrays
- `required_by_default`: mark all properties as required (default: true)
- `strict_additional_properties`: set `additionalProperties: false` (default: true)
- `prefer_integer`: use `integer` type for whole numbers (default: true)
- `allow_any_of`: use `anyOf` for mixed types (default: true)
- `detect_enums`: detect enum values from repeated strings (default: false)
- `max_enum_values`: max unique values to consider as an enum (default: 10)
APIs:
- C++: `infer_schema`, `infer_schema_from_values`, `merge_schemas`
- Python: `infer_schema`, `infer_schema_from_values`, `merge_schemas`
- TypeScript: `inferSchema`, `inferSchemaFromValues`, `mergeSchemas`
Works with YAML and TOML too: schema inference operates on parsed values, so you can infer schemas from YAML/TOML by parsing first:
# YAML → Schema
yaml_value = loads_yamlish("name: Alice\nage: 30")
schema = infer_schema(yaml_value)
# TOML → Schema
toml_value = loads_tomlish('[user]\nname = "Alice"')
schema = infer_schema(toml_value)

Note: XML/HTML have a different structure (attributes, mixed content) and cannot be directly mapped to JSON Schema.
When validation fails, you can ask the library to produce:
- a `repaired_value` (best-effort, low-risk auto-fixes)
- structured repair `suggestions`
- remaining `unfixable_errors` (if any)
APIs:
- C++: `validate_with_repair`, `parse_and_repair`
- Python: `validate_with_repair`, `parse_and_repair`
- TypeScript: `validateWithRepair`, `parseAndRepair`
Common config options (`ValidationRepairConfig`) include:
- `coerce_types`: coerce basic types when safe (e.g. `"123"` → `123`)
- `use_defaults`: fill missing properties from the schema's `default` (when available)
- `clamp_numbers`: clamp numbers into `[minimum, maximum]`
- `truncate_strings`/`truncate_arrays`: respect `maxLength`/`maxItems`
- `remove_extra_properties`: drop extra object keys when `additionalProperties: false`
- `fix_enums`: choose the closest enum string (best-effort)
- `fix_formats`: small format cleanups (best-effort)
- `max_suggestions`: cap how many suggestions to generate
Python:
from llm_structured import validate_with_repair
schema = {
"type": "object",
"required": ["age", "name"],
"additionalProperties": False,
"properties": {
"age": {"type": "integer", "minimum": 0, "maximum": 120},
"name": {"type": "string", "minLength": 1},
},
}
value = {"age": "200", "name": "Alice", "extra": 1}
r = validate_with_repair(value, schema, {
"coerce_types": True,
"clamp_numbers": True,
"remove_extra_properties": True,
"max_suggestions": 20,
})
print(r["valid"], r["fully_repaired"], r["repaired_value"])

TypeScript:
import { validateWithRepair } from "./src/index";
const schema = {
type: "object",
required: ["age", "name"],
additionalProperties: false,
properties: {
age: { type: "integer", minimum: 0, maximum: 120 },
name: { type: "string", minLength: 1 },
},
} as const;
const r = validateWithRepair({ age: "200", name: "Alice", extra: 1 } as any, schema, {
coerceTypes: true,
clampNumbers: true,
removeExtraProperties: true,
maxSuggestions: 20,
});
console.log(r.valid, r.fullyRepaired, r.repairedValue);

Build tool schemas for major providers and parse tool calls back from responses.
What you get:
- Schema builders to convert a JSON Schema into each provider's tool/function declaration shape.
- Tool-call parsers that extract arguments and run validation + repair using `validate_with_repair`.
- Convenience helpers that scan common response envelopes and return a list of parsed tool calls.
Notes:
- OpenAI tool call arguments are often a string; the library will apply JSON-ish repairs before parsing.
- Gemini uses a different schema dialect; the library performs a best-effort conversion from JSON Schema.
Python example:
from llm_structured import (
build_openai_function_tool,
parse_openai_tool_calls_from_response,
)
schema = {
"type": "object",
"additionalProperties": False,
"required": ["id"],
"properties": {"id": {"type": "integer"}},
}
tool = build_openai_function_tool("get_user", "Get a user", schema)["tool"]
print(tool)
response = {
"choices": [
{
"message": {
"tool_calls": [
{
"id": "call_1",
"type": "function",
"function": {"name": "get_user", "arguments": "{'id': '123',}"},
}
]
}
}
]
}
calls = parse_openai_tool_calls_from_response(
response,
{"get_user": schema},
validation_repair={"coerce_types": True},
parse_repair={"allowSingleQuotes": True, "dropTrailingCommas": True},
)
assert calls[0]["ok"] is True
assert calls[0]["validation"]["repaired_value"] == {"id": 123}

TypeScript example:
import {
buildOpenaiFunctionTool,
parseOpenaiToolCallsFromResponse,
type JsonSchema,
} from "./src/index";
const schema: JsonSchema = {
type: "object",
additionalProperties: false,
required: ["id"],
properties: { id: { type: "integer" } },
};
const tool = buildOpenaiFunctionTool("get_user", "Get a user", schema).tool;
console.log(tool);
const response = {
choices: [
{
message: {
tool_calls: [
{
id: "call_1",
type: "function",
function: { name: "get_user", arguments: "{'id':'123'}" },
},
],
},
},
],
};
const calls = parseOpenaiToolCallsFromResponse(
response as any,
{ get_user: schema },
{ coerceTypes: true },
{ allowSingleQuotes: true }
);
console.log(calls[0].validation.repairedValue);

- C++17 core library shared by the Python (pybind11) and TypeScript (Node-API) bindings
- Consistent error shape (`kind`, `path`, `jsonPointer`) and repair metadata across languages
- C++ CLI for quick validation from the terminal (see the CLI section below)
The commands below assume you run them from the repository root.
Build/install the editable package:
python -m pip install -r python/requirements-build.txt
python -m pip install -e python

Run examples:
python python/examples/example_structured_output.py
python python/examples/example_sql_validate.py

Run tests:

python -m unittest discover -s python/test -p "test_*.py" -v

Set-Location typescript
npm install
npm test

If PowerShell blocks npm because of the script execution policy, use one of these options:
- Run `npm.cmd` explicitly (PowerShell can invoke it without `npm.ps1`).
- Or run the commands from `cmd.exe` (for example: `cmd /c npm install`).
cmake -S cpp -B cpp/build
cmake --build cpp/build -j
ctest --test-dir cpp/build -C Debug -V

Artifacts:
- `llm_structured_cli`
- `llm_structured_tests`
This example uses the same schema and the same input across all three languages:
- The input is fenced and contains a trailing comma.
- The schema expects an object with `title` and an array of `steps`.
Input text:
Here is the payload:
```json
{"title":"Plan","steps":[{"id":1,"text":"Write docs"}],}
```
#include "llm_structured.hpp"
int main() {
llm_structured::Json schema = llm_structured::loads_jsonish(R"JSON(
{
"type": "object",
"required": ["title", "steps"],
"additionalProperties": false,
"properties": {
"title": {"type": "string", "minLength": 1},
"steps": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"required": ["id", "text"],
"additionalProperties": false,
"properties": {
"id": {"type": "integer", "minimum": 1},
"text": {"type": "string", "minLength": 1}
}
}
}
}
}
)JSON");
try {
const std::string fence_open = std::string("`") + "``json\n";
const std::string fence_close = std::string("`") + "``\n";
auto obj = llm_structured::parse_and_validate(
"Here is the payload:\n\n" +
fence_open +
"{\"title\":\"Plan\",\"steps\":[{\"id\":1,\"text\":\"Write docs\"}],}\n" +
fence_close,
schema);
(void)obj;
} catch (const llm_structured::ValidationError& e) {
// e.kind: schema | type | limit | parse
// e.path: $.steps[0].id
throw;
}
}

from llm_structured import ValidationError, parse_and_validate
schema = {
"type": "object",
"required": ["title", "steps"],
"additionalProperties": False,
"properties": {
"title": {"type": "string", "minLength": 1},
"steps": {
"type": "array",
"minItems": 1,
"items": {
"type": "object",
"required": ["id", "text"],
"additionalProperties": False,
"properties": {
"id": {"type": "integer", "minimum": 1},
"text": {"type": "string", "minLength": 1},
},
},
},
},
}
fence_open = "`" + "``json\n"
fence_close = "`" + "``\n"
text = "Here is the payload:\n\n" + fence_open + "{\"title\":\"Plan\",\"steps\":[{\"id\":1,\"text\":\"Write docs\"}],}\n" + fence_close
try:
    obj = parse_and_validate(text, schema)
    print(obj)
except ValidationError as e:
    # e.kind / e.path / e.jsonPointer
    raise

import { parseAndValidateJson, type ValidationError } from "./src/index";
const schema = {
type: "object",
required: ["title", "steps"],
additionalProperties: false,
properties: {
title: { type: "string", minLength: 1 },
steps: {
type: "array",
minItems: 1,
items: {
type: "object",
required: ["id", "text"],
additionalProperties: false,
properties: {
id: { type: "integer", minimum: 1 },
text: { type: "string", minLength: 1 },
},
},
},
},
} as const;
const fenceOpen = "`" + "``json\n";
const fenceClose = "`" + "``\n";
const text = "Here is the payload:\n\n" + fenceOpen + "{\"title\":\"Plan\",\"steps\":[{\"id\":1,\"text\":\"Write docs\"}],}\n" + fenceClose;
try {
const obj = parseAndValidateJson(text, schema);
console.log(obj);
} catch (e) {
const err = e as ValidationError;
throw err;
}

Input text containing multiple JSON payloads:
prefix
```json
{"a": 1}
```
middle {"b": 2} tail
```json
[1, 2]
```
#include <cassert>
#include <string>
#include "llm_structured.hpp"
int main() {
// Build a text that contains multiple JSON payloads.
const std::string fence_open = std::string("`") + "``json\n";
const std::string fence_close = std::string("`") + "``\n";
const std::string text =
"prefix\n" +
fence_open +
"{\"a\": 1}\n" +
fence_close +
"middle {\"b\": 2} tail\n" +
fence_open +
"[1, 2]\n" +
fence_close;
// 1) Extract all candidates (in source order).
auto cands = llm_structured::extract_json_candidates(text);
assert(cands.size() == 3);
// 2) Parse all values.
auto values = llm_structured::loads_jsonish_all(text);
assert(values.size() == 3);
// 3) Validate all blocks. Errors are rooted at $[i].
llm_structured::Json schema = llm_structured::loads_jsonish(R"JSON(
{"type":"object","required":["a"],"properties":{"a":{"type":"integer"}}}
)JSON");
try {
(void)llm_structured::parse_and_validate_all("{\"a\":\"x\"} {\"a\":2}", schema);
assert(false && "expected ValidationError");
} catch (const llm_structured::ValidationError& e) {
// e.path is like $[0].a
(void)e;
}
return 0;
}

from llm_structured import extract_json_candidates, loads_jsonish_all, parse_and_validate_json_all
text = (
"prefix\n"
"`" "``json\n{\"a\": 1}\n`" "``\n"
"middle {\"b\": 2} tail\n"
"`" "``json\n[1, 2]\n`" "``\n"
)
print(extract_json_candidates(text))
print(loads_jsonish_all(text))
schema = {"type": "object", "required": ["a"], "properties": {"a": {"type": "integer"}}}
# Raises ValidationError with path rooted at $[i] if any block fails.
parse_and_validate_json_all('{"a":"x"} {"a":2}', schema)

import { extractJsonCandidates, loadsJsonishAll, parseAndValidateJsonAll, type ValidationError } from "./src/index";
const text =
"prefix\n" +
"`" + "``json\n{\"a\": 1}\n" + "`" + "``\n" +
"middle {\"b\": 2} tail\n" +
"`" + "``json\n[1, 2]\n" + "`" + "``\n";
console.log(extractJsonCandidates(text));
console.log(loadsJsonishAll(text));
try {
parseAndValidateJsonAll('{"a":"x"} {"a":2}', {
type: "object",
required: ["a"],
properties: { a: { type: "integer" } },
});
} catch (e) {
const err = e as ValidationError;
// err.path is like $[0].a
throw err;
}

The library also supports YAML parsing with automatic repairs for common LLM output issues:
- Extracts YAML from `yaml` fenced blocks
- Fixes tabs and mixed indentation
- Allows inline JSON within YAML
- Validates against JSON Schema
Input YAML:
```yaml
users:
- name: Alice
age: 30
- name: Bob
age: 25
```

#include "llm_structured.hpp"
int main() {
llm_structured::Json schema = llm_structured::loads_jsonish(R"JSON(
{
"type": "object",
"required": ["users"],
"properties": {
"users": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "age"],
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0}
}
}
}
}
}
)JSON");
const std::string fence_open = std::string("`") + "``yaml\n";
const std::string fence_close = std::string("`") + "``\n";
const std::string yaml_text =
fence_open +
"users:\n"
" - name: Alice\n"
" age: 30\n"
" - name: Bob\n"
" age: 25\n" +
fence_close;
try {
auto obj = llm_structured::parse_and_validate_yaml(yaml_text, schema);
// Serialize back to YAML
std::string yaml_out = llm_structured::dumps_yaml(obj);
(void)yaml_out;
} catch (const llm_structured::ValidationError& e) {
// e.kind / e.path
throw;
}
}

from llm_structured import parse_and_validate_yaml, dumps_yaml, ValidationError
schema = {
"type": "object",
"required": ["users"],
"properties": {
"users": {
"type": "array",
"items": {
"type": "object",
"required": ["name", "age"],
"properties": {
"name": {"type": "string"},
"age": {"type": "integer", "minimum": 0},
},
},
},
},
}
yaml_text = """
```yaml
users:
  - name: Alice
    age: 30
  - name: Bob
    age: 25
```
"""

try:
    obj = parse_and_validate_yaml(yaml_text, schema)
    print(obj)
    # Serialize back to YAML
    yaml_out = dumps_yaml(obj)
    print(yaml_out)
except ValidationError as e:
    # e.kind / e.path / e.jsonPointer
    raise
### TypeScript (YAML)
```ts
import { parseAndValidateYaml, dumpsYaml, type ValidationError } from "./src/index";
const schema = {
type: "object",
required: ["users"],
properties: {
users: {
type: "array",
items: {
type: "object",
required: ["name", "age"],
properties: {
name: { type: "string" },
age: { type: "integer", minimum: 0 },
},
},
},
},
} as const;
const yamlText = `
\`\`\`yaml
users:
- name: Alice
age: 30
- name: Bob
age: 25
\`\`\`
`;
try {
const obj = parseAndValidateYaml(yamlText, schema);
console.log(obj);
// Serialize back to YAML
const yamlOut = dumpsYaml(obj);
console.log(yamlOut);
} catch (e) {
const err = e as ValidationError;
throw err;
}
```

The library supports XML and HTML parsing with automatic repairs:
- Extracts from `xml` or `html` fenced blocks
- Handles malformed HTML (unclosed tags, unquoted attributes)
- Query nodes with CSS-like selectors
- Validates against JSON Schema
Input XML:
```xml
<config>
<server host="localhost" port="8080"/>
<database>
<connection>postgresql://localhost/db</connection>
</database>
</config>
```

#include "llm_structured.hpp"
int main() {
const std::string fence_open = std::string("`") + "``xml\n";
const std::string fence_close = std::string("`") + "``\n";
const std::string xml_text =
fence_open +
"<config>\n"
" <server host=\"localhost\" port=\"8080\"/>\n"
" <database>\n"
" <connection>postgresql://localhost/db</connection>\n"
" </database>\n"
"</config>\n" +
fence_close;
// Parse XML
auto root = llm_structured::loads_xml(xml_text);
// Query nodes by tag name
auto servers = llm_structured::query_xml(root, "server");
// Get attribute
std::string host = llm_structured::xml_get_attribute(servers[0], "host");
// Get text content
std::string text = llm_structured::xml_text_content(root);
// Convert to JSON representation
auto json = llm_structured::xml_to_json(root);
// Serialize back to XML
std::string xml_out = llm_structured::dumps_xml(root);
return 0;
}

from llm_structured import (
loads_xml, loads_html, query_xml, xml_get_attribute,
xml_text_content, xml_to_json, dumps_xml, dumps_html
)
xml_text = """
```xml
<config>
<server host="localhost" port="8080"/>
<database>
<connection>postgresql://localhost/db</connection>
</database>
</config>"""
result = loads_xml(xml_text)
if result["ok"]:
    root = result["root"]

    # Query nodes by tag name
    servers = query_xml(root, "server")
    print(f"Found {len(servers)} server(s)")

    # Get attribute
    host = xml_get_attribute(servers[0], "host")
    print(f"Host: {host}")

    # Get all text content
    text = xml_text_content(root)
    print(f"Text: {text}")

    # Convert to JSON representation
    json_repr = xml_to_json(xml_text)
    print(f"JSON: {json_repr}")

    # Serialize back to XML
    xml_out = dumps_xml(root)
    print(xml_out)

# Parse HTML with lenient mode
html_text = '<div class=container><p>Hello <b>World</b></div>'
html_result = loads_html(html_text)
if html_result["ok"]:
    print(dumps_html(html_result["root"]))
### TypeScript (XML)
```ts
import {
loadsXml, loadsHtml, queryXml, xmlGetAttribute,
xmlTextContent, xmlToJson, dumpsXml, dumpsHtml
} from "./src/index";
const xmlText = `
\`\`\`xml
<config>
<server host="localhost" port="8080"/>
<database>
<connection>postgresql://localhost/db</connection>
</database>
</config>
\`\`\`
`;
// Parse XML
const result = loadsXml(xmlText);
if (result.ok) {
const root = result.root;
// Query nodes by tag name
const servers = queryXml(root, "server");
console.log(`Found ${servers.length} server(s)`);
// Get attribute
const host = xmlGetAttribute(servers[0], "host");
console.log(`Host: ${host}`);
// Get all text content
const text = xmlTextContent(root);
console.log(`Text: ${text}`);
// Convert to JSON representation
const jsonRepr = xmlToJson(xmlText);
console.log("JSON:", jsonRepr);
// Serialize back to XML
const xmlOut = dumpsXml(root);
console.log(xmlOut);
}
// Parse HTML with lenient mode
const htmlText = '<div class=container><p>Hello <b>World</b></div>';
const htmlResult = loadsHtml(htmlText);
if (htmlResult.ok) {
const htmlOut = dumpsHtml(htmlResult.root);
console.log(htmlOut);
}
```

This example enforces a conservative policy:
- Only `SELECT` is allowed
- Comments and semicolons are forbidden
- A `LIMIT` is required and must not exceed a maximum
- Only a known set of tables can appear
SQL input:
SELECT u.id FROM users u WHERE u.id = ? LIMIT 1

#include "llm_structured.hpp"
int main() {
llm_structured::Json schema = llm_structured::loads_jsonish(R"JSON(
{
"allowedStatements": ["select"],
"forbidComments": true,
"forbidSemicolon": true,
"requireFrom": true,
"requireWhere": true,
"requireLimit": true,
"maxLimit": 100,
"forbidUnion": true,
"forbidSubqueries": true,
"allowedTables": ["users"],
"placeholderStyle": "either"
}
)JSON");
auto out = llm_structured::parse_and_validate_sql(
"SELECT u.id FROM users u WHERE u.id = ? LIMIT 1",
schema);
(void)out;
}

from llm_structured import ValidationError, parse_and_validate_sql
schema = {
"allowedStatements": ["select"],
"forbidComments": True,
"forbidSemicolon": True,
"requireFrom": True,
"requireWhere": True,
"requireLimit": True,
"maxLimit": 100,
"forbidUnion": True,
"forbidSubqueries": True,
"allowedTables": ["users"],
"placeholderStyle": "either",
}
try:
    out = parse_and_validate_sql("SELECT u.id FROM users u WHERE u.id = ? LIMIT 1", schema)
    print(out)
except ValidationError as e:
    raise

import { parseAndValidateSql, type ValidationError } from "./src/index";
const schema = {
allowedStatements: ["select"],
forbidComments: true,
forbidSemicolon: true,
requireFrom: true,
requireWhere: true,
requireLimit: true,
maxLimit: 100,
forbidUnion: true,
forbidSubqueries: true,
allowedTables: ["users"],
placeholderStyle: "either",
} as const;
try {
const out = parseAndValidateSql("SELECT u.id FROM users u WHERE u.id = ? LIMIT 1", schema);
console.log(out);
} catch (e) {
const err = e as ValidationError;
throw err;
}

Automatically infer JSON Schema from example values.
from llm_structured import infer_schema, infer_schema_from_values, merge_schemas
import json
# Infer schema from a single value
value = {
"name": "Alice",
"age": 30,
"email": "alice@example.com",
"created_at": "2024-01-15T10:30:00Z"
}
schema = infer_schema(value)
print(json.dumps(schema, indent=2))
# Output:
# {
# "type": "object",
# "properties": {
# "name": {"type": "string"},
# "age": {"type": "integer"},
# "email": {"type": "string", "format": "email"},
# "created_at": {"type": "string", "format": "date-time"}
# },
# "required": ["name", "age", "email", "created_at"],
# "additionalProperties": false
# }
# Infer schema from multiple values (merges properties)
values = [
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25, "city": "NYC"},
{"name": "Carol", "age": 35}
]
schema = infer_schema_from_values(values)
# Properties are merged; required = intersection of all examples
# Custom config
config = {
"infer_formats": True,
"infer_numeric_ranges": True,
"include_examples": True,
"max_examples": 3,
"detect_enums": True,
"max_enum_values": 10
}
schema = infer_schema(value, config)
# Merge two schemas
schema1 = {"type": "object", "properties": {"a": {"type": "string"}}}
schema2 = {"type": "object", "properties": {"b": {"type": "number"}}}
merged = merge_schemas(schema1, schema2)

import { inferSchema, inferSchemaFromValues, mergeSchemas } from "./src/index";
// Infer schema from a single value
const value = {
name: "Alice",
age: 30,
email: "alice@example.com",
createdAt: "2024-01-15T10:30:00Z"
};
const schema = inferSchema(value);
console.log(JSON.stringify(schema, null, 2));
// Infer from multiple values
const values = [
{ name: "Alice", age: 30 },
{ name: "Bob", age: 25, city: "NYC" }
];
const mergedSchema = inferSchemaFromValues(values);
// Custom config
const config = {
inferFormats: true,
inferNumericRanges: true,
includeExamples: true,
detectEnums: true
};
const schemaWithConfig = inferSchema(value, config);
// Merge schemas
const schema1 = { type: "object", properties: { a: { type: "string" } } };
const schema2 = { type: "object", properties: { b: { type: "number" } } };
const merged = mergeSchemas(schema1, schema2);

After building the C++ targets:
echo '```json\n{"a":1,}\n```' | build/llm_structured_cpp/llm_structured_cli json

With a schema file:
build/llm_structured_cpp/llm_structured_cli json --schema schema.json < input.txt

echo 'SELECT u.id FROM users u WHERE u.id = ? LIMIT 1' | build/llm_structured_cpp/llm_structured_cli sql --schema sql_schema.json

The streaming APIs are designed for chunked input:
- C++: `JsonStreamParser`, `JsonStreamCollector`, `JsonStreamBatchCollector`, `JsonStreamValidatedBatchCollector`, `SqlStreamParser`
- Python: `JsonStreamParser`, `JsonStreamCollector`, `JsonStreamBatchCollector`, `JsonStreamValidatedBatchCollector`, `SqlStreamParser`
- TypeScript: `new JsonStreamParser(...)`, `new JsonStreamCollector(...)`, `new JsonStreamBatchCollector(...)`, `new JsonStreamValidatedBatchCollector(...)`, `new SqlStreamParser(...)`
Core semantics:
- `append(chunk)`: feed more text
- `poll()`: returns either a value, an error, or "not ready yet"
- `finish()`/`close()`: signal that no more input will arrive (parser vs collector)
- `location()`: best-effort position within the current internal buffer
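To picture the `append`/`poll`/`finish` contract, here is a toy collector that emits each complete top-level `{...}` object as it arrives (a conceptual sketch with deliberately simplified semantics: it tracks brace depth only and ignores braces inside strings, while the library's stream parsers also handle strings, escapes, limits, and error states):

```python
import json

class ToyJsonStreamCollector:
    """Minimal append()/poll()/finish() sketch (not the library's implementation)."""

    def __init__(self):
        self._buf = ""
        self._ready = []

    def append(self, chunk: str) -> None:
        """Feed more text; any completed top-level objects become pollable."""
        self._buf += chunk
        depth, start, consumed = 0, None, 0
        for i, ch in enumerate(self._buf):
            if ch == "{":
                if depth == 0:
                    start = i
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0 and start is not None:
                    self._ready.append(json.loads(self._buf[start:i + 1]))
                    start = None
                    consumed = i + 1
        self._buf = self._buf[consumed:]  # keep only the unfinished tail

    def poll(self):
        """Return the next parsed value, or None if not ready yet."""
        return self._ready.pop(0) if self._ready else None

    def finish(self) -> None:
        """Signal end of input; leftover partial JSON is an error."""
        if self._buf.strip():
            raise ValueError("incomplete JSON at end of stream")

c = ToyJsonStreamCollector()
for chunk in ['{"a"', ': 1} {"b"', ': 2}']:
    c.append(chunk)
c.finish()
print(c.poll(), c.poll())  # {'a': 1} {'b': 2}
```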
The JSON validator intentionally supports a useful subset of JSON Schema. Common keywords include:
- `type`, `enum`, `const`
- `properties`, `required`, `additionalProperties`, `propertyNames`
- `items`, `minItems`, `maxItems`, `contains`, `minContains`, `maxContains`
- `minLength`, `maxLength`, `pattern`, `format`
- `minimum`, `maximum`, `multipleOf`
- `allOf`, `anyOf`, `oneOf`, `if`/`then`/`else`, `dependentRequired`
- C++ builds require a C++17 toolchain and CMake.
- Python/TypeScript native builds require a working C/C++ build toolchain (for example, a Visual Studio build environment on Windows).
- If you see an import error for the Python native module, rebuild/install the editable package from `python/`.
This project is under active development — issues, suggestions, and code contributions are welcome.
- Open an issue for bugs, feature requests, or design discussion.
- Small PRs are easiest to review: keep changes focused and include/adjust tests.
- If you add or change public APIs, please update the docs (if applicable) and keep C++/Python/TypeScript behavior consistent.