
Add FracturedJson formatting support for DOM serialization#2580

Merged
lemire merged 16 commits into master from francisco/fractured_json_support on Jan 20, 2026


@FranciscoThiesen FranciscoThiesen commented Jan 7, 2026

Summary

Implements FracturedJson formatting as requested in issue #2576. FracturedJson produces human-readable yet compact JSON output by intelligently choosing between different layout strategies based on content complexity, length, and structure similarity.

Key Features

  • Four layout modes: inline, compact multiline, table, and expanded
  • Structure analysis: Pre-pass to compute metrics before formatting for optimal layout decisions
  • Table formatting: Arrays of similar objects are formatted with column alignment
  • Highly configurable: Options for line length, indentation, padding, table detection thresholds, etc.
  • Builder API integration: Works seamlessly with static reflection for direct struct formatting
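The choice between the four layout modes can be pictured as a small decision function over precomputed metrics. The following is only a sketch under stated assumptions: `metrics_sketch`, `options_sketch`, `choose_layout`, all field names, and every enumerator except `layout_mode::INLINE` are hypothetical and are not taken from the PR's headers, and the ordering of the checks is a guess at a reasonable policy.

```cpp
#include <cstddef>

// Hypothetical names: only layout_mode::INLINE appears in the PR text.
enum class layout_mode { INLINE, COMPACT_MULTILINE, TABLE, EXPANDED };

// Hypothetical result of the analysis pre-pass for one container.
struct metrics_sketch {
  std::size_t inline_length;  // characters needed to write the container on one line
  bool is_array;              // true for arrays, false for objects
  bool scalar_children;       // every child is a string, number, bool, or null
  bool similar_objects;       // array elements are objects with similar keys
};

// Hypothetical subset of the formatting options.
struct options_sketch {
  std::size_t max_line_length = 100;
  bool enable_table = true;
  bool enable_compact = true;
};

// Pick the most compact layout that the metrics and options allow.
inline layout_mode choose_layout(const metrics_sketch& m, const options_sketch& o) {
  if (m.inline_length <= o.max_line_length) return layout_mode::INLINE;
  if (o.enable_table && m.is_array && m.similar_objects) return layout_mode::TABLE;
  if (o.enable_compact && m.is_array && m.scalar_children) return layout_mode::COMPACT_MULTILINE;
  return layout_mode::EXPANDED;
}
```

Keeping the decision separate from the string building is what makes a pre-pass useful: the formatter only needs to read the chosen mode.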

Example Output

Inline mode (simple containers):

{ "id": 1, "name": "Alice", "active": true }

Table mode (uniform arrays of objects):

[
    { "id": 1, "name": "Alice", "score": 95 },
    { "id": 2, "name": "Bob"  , "score": 87 },
    { "id": 3, "name": "Carol", "score": 92 }
]
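The column alignment in the table example above can be sketched in two passes: compute the widest rendered value per column, then pad each value before its separator. This is an illustration only, not the PR's implementation. `row_sketch` and `format_table` are hypothetical names, every row is assumed to carry the same non-empty key sequence, values are assumed to be already rendered as JSON text, and widths are counted in bytes rather than display columns.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// One row: (key, already-rendered JSON value) pairs in column order.
using row_sketch = std::vector<std::pair<std::string, std::string>>;

// Assumes all rows share the same non-empty list of keys (hypothetical precondition).
inline std::string format_table(const std::vector<row_sketch>& rows, std::size_t indent = 4) {
  if (rows.empty()) return "[]";
  const std::size_t columns = rows.front().size();

  // Pass 1: widest rendered value in each column.
  std::vector<std::size_t> widths(columns, 0);
  for (const row_sketch& row : rows)
    for (std::size_t c = 0; c < columns; ++c)
      widths[c] = (std::max)(widths[c], row[c].second.size());

  // Pass 2: emit each row, padding values so separators line up.
  std::string out = "[\n";
  for (std::size_t r = 0; r < rows.size(); ++r) {
    out.append(indent, ' ');
    out += "{ ";
    for (std::size_t c = 0; c < columns; ++c) {
      const std::string& key = rows[r][c].first;
      const std::string& value = rows[r][c].second;
      out += '"' + key + "\": " + value;
      out.append(widths[c] - value.size(), ' ');
      out += (c + 1 < columns) ? ", " : " }";
    }
    out += (r + 1 < rows.size()) ? ",\n" : "\n";
  }
  out += "]";
  return out;
}
```

Fed the three rows from the example, the padding after `"Bob"` comes out as two spaces, which is where the aligned commas come from.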

Compact multiline (arrays of simple elements):

[
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
    10, 11, 12, 13, 14, 15, 16, 17, 18, 19
]
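A compact multiline layout like the one above amounts to wrapping a flat list after a fixed number of items. The sketch below assumes an items-per-line limit is the only wrapping criterion; `format_compact` and its parameters are hypothetical names, and the actual formatter may also take line length into account.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Writes already-rendered scalar items, wrapping after `per_line` items.
inline std::string format_compact(const std::vector<std::string>& items,
                                  std::size_t per_line = 10,
                                  std::size_t indent = 4) {
  if (items.empty()) return "[]";
  if (per_line == 0) per_line = 1;  // avoid division by zero below

  std::string out = "[\n";
  for (std::size_t i = 0; i < items.size(); ++i) {
    if (i % per_line == 0) out.append(indent, ' ');  // start of a new line
    out += items[i];
    const bool last = (i + 1 == items.size());
    const bool end_of_line = ((i + 1) % per_line == 0);
    if (!last) out += end_of_line ? ",\n" : ", ";
  }
  out += "\n]";
  return out;
}
```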

API

DOM API

// Format a DOM element
dom::parser parser;
element doc = parser.parse(json_string);
std::cout << fractured_json(doc) << std::endl;

// With custom options
fractured_json_options opts;
opts.indent_spaces = 2;
std::cout << fractured_json(doc, opts) << std::endl;

// Format any JSON string
auto formatted = fractured_json_string(minified_json);

Builder/Reflection API (new!)

// Direct struct serialization with formatting
struct User { int id; std::string name; bool active; };
User user{1, "Alice", true};

// Minified (existing)
auto minified = to_json_string(user);
// {"id":1,"name":"Alice","active":true}

// Formatted (new)
auto formatted = to_fractured_json_string(user);
// { "id": 1, "name": "Alice", "active": true }

// Partial extraction with formatting
auto partial = extract_fractured_json<"id", "name">(user);
// { "id": 1, "name": "Alice" }

New Files

File Description
dom/fractured_json.h Public API with fractured_json_options struct
dom/fractured_json-inl.h Full implementation (~1080 lines)
internal/json_structure_analyzer.h Structure analysis for layout decisions
internal/fractured_formatter.h Formatter class using CRTP pattern
generic/builder/fractured_json_builder.h Builder/reflection API integration
tests/fractured_json_tests.cpp DOM tests (27 tests)
tests/builder/static_reflection_fractured_json_tests.cpp Builder tests (7 tests)

Test plan

  • All 27 DOM tests pass covering:
    • Core functionality (roundtrip, inline, expanded, compact, table modes)
    • Edge cases (unicode, boundary numbers, deep nesting, special chars)
    • All configurable options
  • 7 builder/reflection integration tests (requires SIMDJSON_STATIC_REFLECTION)
  • Builds successfully with standard cmake configuration
  • Re-parsing formatted output produces identical results (roundtrip verification)
  • Rebased on latest master with builder directory changes

Resolves #2576


lemire commented Jan 7, 2026

Ah Ah.

This would be step 1. Later we would want it to work with the new builder component.


lemire commented Jan 7, 2026

I recently moved the builder API into a separate directory (for clarity).


lemire commented Jan 7, 2026

Implements FracturedJson formatting as requested in issue #2576.
FracturedJson produces human-readable yet compact JSON output by
intelligently choosing between different layout strategies based on
content complexity, length, and structure similarity.

Key features:
- Four layout modes: inline, compact multiline, table, and expanded
- Structure analysis pass to compute metrics before formatting
- Table formatting for arrays of similar objects with column alignment
- Configurable options for line length, indentation, padding, etc.

New files:
- fractured_json.h: Public API with fractured_json_options struct
- fractured_json-inl.h: Implementation (~1000 lines)
- json_structure_analyzer.h: Structure analysis for layout decisions
- fractured_formatter.h: Formatter class using CRTP pattern

Usage:
  dom::parser parser;
  element doc = parser.parse(json_string);
  std::cout << fractured_json(doc) << std::endl;

  // Or with custom options:
  fractured_json_options opts;
  opts.indent_spaces = 2;
  std::cout << fractured_json(doc, opts) << std::endl;

  // Or format any JSON string (useful with reflection API):
  auto formatted = fractured_json_string(minified_json);

Resolves #2576

Adds 27 test cases covering all aspects of the FracturedJson formatter:

Core functionality tests (13):
- Roundtrip parsing verification
- Inline formatting for simple arrays and objects
- Expanded formatting for complex nested structures
- Compact multiline arrays with configurable items per line
- Table formatting for uniform arrays of objects
- Empty container handling
- All scalar types (string, int, uint, double, bool, null)
- String escaping (quotes, backslashes, control characters)
- Custom indentation options
- Deep nesting (10+ levels)
- Mixed type arrays

Edge case tests (11):
- Unicode strings (Chinese, emoji, Arabic, Russian, accented chars)
- Boundary numbers (INT64_MIN/MAX, UINT64_MAX, DBL_MIN/MAX)
- Nested arrays (arrays of arrays)
- Empty string values
- Keys with special characters (spaces, quotes, colons, etc.)
- Non-uniform arrays (should not trigger table mode)
- Very long strings (500+ chars)
- Large arrays (100 elements)
- Reflection API workflow simulation
- Control characters (tab, newline, CR, null)
- Single element containers

Option tests (3):
- Disable compact multiline mode
- Disable table format mode
- Disable all padding options

Extends FracturedJson to work seamlessly with the builder API, enabling
formatted output directly from C++ structs using static reflection.

New functions:
- to_fractured_json_string(obj, opts) - serialize struct to formatted JSON
- to_fractured_json(obj, output, opts) - same with output parameter
- extract_fractured_json<fields...>(obj, opts) - format only specific fields

These functions combine the builder's reflection-based serialization with
FracturedJson formatting in a single convenient call:

  struct User { int id; std::string name; bool active; };
  User user{1, "Alice", true};

  // Minified output (existing):
  auto minified = to_json_string(user);
  // {"id":1,"name":"Alice","active":true}

  // Formatted output (new):
  auto formatted = to_fractured_json_string(user);
  // { "id": 1, "name": "Alice", "active": true }

  // Partial extraction with formatting:
  auto partial = extract_fractured_json<"id", "name">(user);
  // { "id": 1, "name": "Alice" }

New files:
- generic/builder/fractured_json_builder.h - builder integration
- tests/builder/static_reflection_fractured_json_tests.cpp - 7 tests

FranciscoThiesen force-pushed the francisco/fractured_json_support branch from 946e0d0 to a147fce on January 7, 2026 23:31

FranciscoThiesen commented

@lemire I've added the builder/reflection API integration as requested. The PR now includes:

  1. to_fractured_json_string(obj, opts) - Serializes any reflectable struct directly to formatted JSON
  2. to_fractured_json(obj, output, opts) - Same with output parameter
  3. extract_fractured_json<fields...>(obj, opts) - Extract and format only specific fields

The implementation combines the builder's to_json_string() with fractured_json_string() for a seamless experience:

struct User { int id; std::string name; bool active; };
User user{1, "Alice", true};

// One-liner for formatted output from any struct
auto formatted = to_fractured_json_string(user);
// { "id": 1, "name": "Alice", "active": true }

I've also added 7 tests for the builder integration in tests/builder/static_reflection_fractured_json_tests.cpp (requires SIMDJSON_STATIC_REFLECTION to run).

The PR has been rebased on the latest master which includes the builder directory reorganization (#2578).


lemire commented Jan 8, 2026

Wow.

- Fix undefined behavior when negating INT64_MIN in estimate_number_length()
  and measure_value_length() by returning 20 (the exact length of the
  string representation) directly
- Actually use table_similarity_threshold in check_array_uniformity() by
  calling compute_object_similarity() to compare objects against the first
  object in the array
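The INT64_MIN fix above can be illustrated with a standalone length estimator. Negating `INT64_MIN` overflows a signed 64-bit integer, which is undefined behavior, so the sketch returns the known length of "-9223372036854775808" (20 characters) before negating anything. `estimate_number_length_sketch` is a hypothetical reimplementation for illustration, not the function from the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Number of characters in the decimal representation of `value`.
inline std::size_t estimate_number_length_sketch(std::int64_t value) {
  // -INT64_MIN is not representable, so handle it before negating.
  if (value == std::numeric_limits<std::int64_t>::min()) return 20;

  std::size_t length = 0;
  if (value < 0) {
    length = 1;      // room for the minus sign
    value = -value;  // safe: value > INT64_MIN here
  }
  do {
    ++length;
    value /= 10;
  } while (value != 0);
  return length;
}
```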

FranciscoThiesen force-pushed the francisco/fractured_json_support branch from 4c96aaf to e5fb747 on January 8, 2026 05:06

Initialize all member variables in member initialization lists to
satisfy GCC's -Werror=effc++ flag:
- element_metrics::common_keys - add {} default initializer
- structure_analyzer - add default constructor with member init list
- fractured_formatter - add column_widths_{} to constructor
- fractured_string_builder - add analyzer_{} to constructor

The class has a pointer member (current_opts_) which triggers
-Werror=effc++ requiring explicit copy/move operations. Delete
copy operations (class shouldn't be copied due to cache) and
default move operations.

Windows.h defines max/min macros that interfere with std::max/std::min.
Wrapping in parentheses as (std::max)(...) prevents macro expansion.

GCC 15 on MINGW64 gives a false positive warning in parser_moving_parser()
when the std::vector<std::string> goes out of scope. Suppress this
specific warning with a pragma for GCC builds.
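The parenthesized-call workaround from the Windows.h commit above can be reproduced without Windows. The sketch simulates the function-like `max`/`min` macros locally, which is an assumption made purely for demonstration; `clamp_width` is a hypothetical helper. A function-like macro is only expanded when its name is directly followed by `(`, so `(std::max)(a, b)` leaves the name untouched.

```cpp
#include <algorithm>

// Simulate the function-like macros that Windows.h defines (demonstration only).
#define max(a, b) (((a) > (b)) ? (a) : (b))
#define min(a, b) (((a) < (b)) ? (a) : (b))

// Writing std::max(width, lo) here would be rewritten by the preprocessor and
// fail to compile. In (std::max)(...), the token after `max` is `)`, so the
// macro is not invoked and the real function template is called.
inline int clamp_width(int width, int lo, int hi) {
  return (std::min)((std::max)(width, lo), hi);
}

// Keep the simulated macros from leaking into later code.
#undef max
#undef min
```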

FranciscoThiesen commented

Code review

Found 1 issue:

  1. Cache key mechanism uses temporary object addresses which will always cause cache misses

The caching mechanism in structure_analyzer uses reinterpret_cast<size_t>(&elem) to generate cache keys for element metrics. However, when iterating through arrays/objects with range-based for loops (e.g., for (dom::element child : arr)), each iteration creates a new temporary object with a different address. This means:

  • During analysis phase: metrics are cached using addresses of analysis-loop temporaries
  • During format phase: lookups use addresses of different formatting-loop temporaries
  • Result: cache lookups always fail, falling back to default element_metrics{}

This breaks the two-phase design where phase 1 analyzes structure and phase 2 uses those metrics for layout decisions.

    }
    case dom::element_type::INT64: {
      int64_t val;
      if (elem.get_int64().get(val) == SUCCESS) {
        metrics.complexity = 0;
        metrics.estimated_inline_len = estimate_number_length(val);
        metrics.child_count = 0;
        metrics.can_inline = true;
        metrics.recommended_layout = layout_mode::INLINE;
      }
      break;
    }
    case dom::element_type::UINT64: {
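A reduced illustration of why the address of a by-value loop copy makes a poor cache key. This is not the simdjson code: `handle_sketch` is a hypothetical stand-in for a lightweight wrapper such as `dom::element`, and the cache here is filled from references so that the mismatch is deterministic. When both phases use loop copies, as described in the review, the addresses are whatever stack slots the compiler picks, so a lookup cannot be relied on to find the right entry.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical lightweight wrapper: copying it copies only the pointer.
struct handle_sketch {
  const int* node;  // points at the underlying document node
};

// Fills a cache keyed on handle addresses, then probes it with by-value
// loop copies and returns how many probes missed.
inline std::size_t count_address_key_misses(const std::vector<handle_sketch>& elems) {
  std::unordered_map<std::uintptr_t, std::size_t> cache;
  for (const handle_sketch& h : elems) {
    cache[reinterpret_cast<std::uintptr_t>(&h)] = 1;  // key: address inside the vector
  }

  std::size_t misses = 0;
  for (handle_sketch h : elems) {  // h is a new object on every iteration
    // &h is the address of the copy, which differs from every vector element.
    if (cache.find(reinterpret_cast<std::uintptr_t>(&h)) == cache.end()) ++misses;
  }
  return misses;
}
```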

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

The cache was using element addresses as keys, but dom::element objects
are lightweight wrappers that get copied during iteration, causing
different addresses between analysis and formatting phases. This resulted
in cache misses and fallback to empty metrics.

Solution: Store child metrics in the element_metrics struct and pass
them through recursive calls, eliminating the need for address-based
caching entirely.

Changes:
- Add children vector to element_metrics for hierarchical metrics
- Remove metrics_cache_ and related get_metrics/has_metrics methods
- Update all format functions to accept and pass child metrics
- Add public analyze_array/analyze_object overloads for standalone use
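The approach described in this commit, storing child metrics inside the parent and passing them down during formatting, can be sketched on a toy document model. Everything below is hypothetical (`value_sketch`, `metrics_node_sketch`, `analyze_sketch`, `format_sketch`) and handles only arrays of scalars and arrays; it is meant to show the shape of the two-phase recursion, not the PR's actual types.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy document: a scalar when `items` is empty, otherwise an array.
struct value_sketch {
  std::string scalar;               // rendered text of a scalar
  std::vector<value_sketch> items;  // children of an array
};

// Metrics tree that mirrors the document tree, so no lookup key is needed.
struct metrics_node_sketch {
  std::size_t inline_length = 0;
  std::vector<metrics_node_sketch> children;
};

// Phase 1: compute the one-line length of every node, bottom up.
inline metrics_node_sketch analyze_sketch(const value_sketch& v) {
  metrics_node_sketch m;
  if (v.items.empty()) {
    m.inline_length = v.scalar.size();
    return m;
  }
  m.inline_length = 2;  // "[" and "]"
  for (std::size_t i = 0; i < v.items.size(); ++i) {
    m.children.push_back(analyze_sketch(v.items[i]));
    m.inline_length += m.children.back().inline_length;
    if (i + 1 < v.items.size()) m.inline_length += 2;  // ", "
  }
  return m;
}

// Phase 2: format, handing each child its own metrics by position.
inline std::string format_sketch(const value_sketch& v, const metrics_node_sketch& m,
                                 std::size_t max_width, std::size_t depth = 0) {
  if (v.items.empty()) return v.scalar;
  const bool one_line = m.inline_length <= max_width;
  std::string out = "[";
  for (std::size_t i = 0; i < v.items.size(); ++i) {
    if (!one_line) {
      out += '\n';
      out.append(2 * (depth + 1), ' ');
    }
    out += format_sketch(v.items[i], m.children[i], max_width, depth + 1);
    if (i + 1 < v.items.size()) out += one_line ? ", " : ",";
  }
  if (!one_line) {
    out += '\n';
    out.append(2 * depth, ' ');
  }
  out += ']';
  return out;
}
```

Because child metrics are matched to children by index, the result does not depend on where any wrapper object happens to live in memory.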

FranciscoThiesen commented

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Add entries for node_modules, package-lock.json, Rust target
directories, local ablation artifacts, and generated documentation
files.

Extract common scalar type handling (STRING, INT64, UINT64, DOUBLE,
BOOL, NULL_VALUE) into a dedicated analyze_scalar method. Each scalar
type shares the same initialization pattern for complexity, child_count,
can_inline, and recommended_layout.

Also simplify boolean formatting in format_scalar to use ternary operator.

Reformat cramped is_amalgamator condition to multi-line for readability.

Fix duplicate error message text in _included_filename_root and use
correct variable name (relative_root instead of root).

Extract repeated newline counting loop into a reusable static helper
function, used by inline_array_test, inline_object_test, and
expanded_test.
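A newline-counting helper of the kind mentioned in the last commit message might look like the following. The name `count_newlines` is hypothetical, and the sketch uses `std::count` where the commit describes a hand-written loop; the behavior is the same.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Counts '\n' characters, e.g. to check that inline output stays on one line.
inline std::size_t count_newlines(std::string_view text) {
  return static_cast<std::size_t>(std::count(text.begin(), text.end(), '\n'));
}
```

A test can then assert that inline output contains zero newlines and that expanded output contains at least one.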

lemire commented Jan 9, 2026

@FranciscoThiesen Fantastic. I am currently travelling and this is a major PR so I want to wait to be back before reviewing it. Won't be long. On my todo.

lemire merged commit fc57c09 into master on Jan 20, 2026
156 checks passed
lemire deleted the francisco/fractured_json_support branch on January 20, 2026 15:32

lemire commented Jan 20, 2026

Merged.
