SalvaJSON is a powerful Python library designed to seamlessly repair and process corrupted or malformed JSON data. It intelligently fixes common syntax errors, making it an essential tool for developers working with unreliable JSON sources, such as LLM outputs, APIs, or user-generated content. Beyond repair, SalvaJSON offers high-performance JSON serialization and deserialization.
SalvaJSON ("salvage JSON") is your go-to Python toolkit for handling imperfect JSON. It excels at:
- Repairing Corrupted JSON: Automatically fixes many common JSON syntax issues like missing quotes, trailing commas, incorrect quoting, and comments.
- High-Performance Parsing: Provides fast and efficient JSON loading.
- Flexible Serialization: Offers robust Python object to JSON string conversion.
It leverages the lenient jsonic JavaScript parser via the PythonMonkey bridge for its powerful repair capabilities and the high-speed orjson library for standard JSON operations.
SalvaJSON is designed for:
- Python Developers integrating with external APIs that might return slightly malformed JSON.
- Data Scientists and Analysts cleaning JSON data from various, sometimes unreliable, sources.
- AI/ML Engineers working with Large Language Models (LLMs) that can produce JSON-like output with minor syntax errors.
- Anyone who needs to reliably parse JSON data that doesn't strictly adhere to the JSON specification.
- Resilience: Makes your applications more robust by gracefully handling common JSON errors instead of crashing.
- Simplicity: Offers a straightforward API (salvaj, loads, dumps) that is easy to learn and use.
- Performance: Utilizes orjson for fast parsing and serialization of valid JSON.
- Versatility: Provides both a Python library and a command-line interface (CLI) for flexible usage.
- Fixes Many Common Errors:
  - Missing or mismatched quotes around keys and string values.
  - Use of single quotes instead of double quotes.
  - Trailing commas in objects and arrays.
  - Missing commas between elements or key-value pairs.
  - JavaScript-style comments (//, /* */).
  - Unquoted keys or string values where unambiguous.
To install SalvaJSON, ensure you have Python 3.10 or higher. Then, run the following command:

    pip install salvajson

This will also install its dependencies, including pythonmonkey (for JavaScript interoperability) and orjson (for fast JSON processing).
SalvaJSON can be used as a Python library in your projects or as a command-line tool.
The library provides three main functions:
- salvaj(json_str: str) -> str: This is the core repair function. It takes a potentially corrupted JSON string, uses the jsonic parser (via JavaScript) to fix it, and returns a valid JSON string.

    from salvajson import salvaj

    corrupted_json = """{
        name: "John Doe", // Unquoted key, comment present
        age: 30,
        'city': 'New York', // Single quotes for key and value
        interests: ["coding" "reading",], // Missing comma, trailing comma
    }"""

    try:
        fixed_json_string = salvaj(corrupted_json)
        print(f"Fixed JSON string: {fixed_json_string}")
        # Output: Fixed JSON string: {"name":"John Doe","age":30,"city":"New York","interests":["coding","reading"]}
    except Exception as e:
        print(f"Failed to salvage JSON: {e}")
- loads(s: bytes | str, **kw) -> Any: This function parses a JSON string (or bytes) into a Python object. It first attempts to parse using the fast orjson library. If that fails due to malformed JSON, it automatically falls back to using salvaj() to repair the string and then parses the result.

    from salvajson import loads

    # Example with valid JSON
    valid_json = '{"id": 1, "status": "active"}'
    data = loads(valid_json)
    print(f"Parsed valid JSON: {data}")
    # Output: Parsed valid JSON: {'id': 1, 'status': 'active'}

    # Example with corrupted JSON (automatically fixed by salvaj):
    # single quotes, unquoted key, trailing comma
    corrupted_json_for_loads = "{'item': 'gadget', price: 49.99,}"
    data_from_corrupted = loads(corrupted_json_for_loads)
    print(f"Parsed corrupted JSON: {data_from_corrupted}")
    # Output: Parsed corrupted JSON: {'item': 'gadget', 'price': 49.99}
- dumps(obj, *, indent=None, sort_keys=False, **kw) -> str: This function serializes a Python object into a JSON string using orjson for high performance. It supports common parameters like indent for pretty-printing and sort_keys for ordering dictionary keys.

    from salvajson import dumps

    python_object = {'name': 'Jane Doe', 'age': 28, 'hobbies': ['skiing', 'music']}

    # Standard serialization
    json_string = dumps(python_object)
    print(f"Serialized JSON: {json_string}")
    # Output: Serialized JSON: {"name":"Jane Doe","age":28,"hobbies":["skiing","music"]}

    # Pretty-printed serialization with sorted keys
    pretty_json_string = dumps(python_object, indent=2, sort_keys=True)
    print(f"Pretty JSON:\n{pretty_json_string}")
    # Output:
    # Pretty JSON:
    # {
    #   "age": 28,
    #   "hobbies": [
    #     "skiing",
    #     "music"
    #   ],
    #   "name": "Jane Doe"
    # }
SalvaJSON also provides a simple CLI to fix JSON files directly from your terminal. The CLI uses the salvaj function to process the input.
- Process a JSON file and print the fixed JSON to standard output:

    python -m salvajson /path/to/your/corrupted_file.json

- Process a JSON file and save the fixed output to a new file:

    python -m salvajson input.json > output_fixed.json

  If input.json contains, for example:

    {name: "Test", value: 123, // A comment
    }

  Then output_fixed.json will contain:

    {"name":"Test","value":123}
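For scripted use, the CLI call shown above can also be driven from Python with the standard subprocess module. This is a minimal sketch: build_command and fix_file are hypothetical helper names, and it assumes salvajson is installed for the same interpreter and that the CLI writes the repaired JSON to standard output as described.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def build_command(input_path: str | Path) -> list[str]:
    """Build the documented CLI call: python -m salvajson <file>."""
    return [sys.executable, "-m", "salvajson", str(input_path)]


def fix_file(input_path: str | Path, output_path: str | Path) -> None:
    """Run the CLI and save its standard output (hypothetical helper)."""
    result = subprocess.run(
        build_command(input_path),
        capture_output=True,  # collect stdout instead of using shell redirection
        text=True,
        check=True,  # raise CalledProcessError if the CLI exits with an error
    )
    Path(output_path).write_text(result.stdout, encoding="utf-8")


# Usage (requires salvajson to be installed):
# fix_file("input.json", "output_fixed.json")
```

Capturing standard output here plays the role of the shell's > redirection, so the same call works on platforms where shell syntax differs.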
SalvaJSON cleverly combines Python and JavaScript technologies to achieve its robustness and performance:
- PythonMonkey Bridge: At its core, SalvaJSON uses PythonMonkey. This library embeds a JavaScript engine (SpiderMonkey, the engine used in Firefox) within the Python process. This allows Python code to execute JavaScript code and exchange data seamlessly.
- jsonic for Lenient Parsing: The actual JSON repair magic is handled by jsonic, a mature and lenient JavaScript JSON parser. When salvaj(json_str) is called:
  - The input json_str (a Python string) is passed to the embedded JavaScript environment.
  - A small JavaScript wrapper (js_src/salvajson.src.js, bundled into src/salvajson/salvajson.js) calls Jsonic(json_str).
  - jsonic parses the string, correcting common syntax errors according to its lenient rules.
  - The result from jsonic (a JavaScript object or array) is then stringified using JSON.stringify() in JavaScript to ensure it's a valid JSON string.
  - This valid JSON string is returned to the Python environment.
  - If jsonic cannot parse the input even with its lenient rules, it throws an error in JavaScript, which PythonMonkey propagates as a pythonmonkey.SpiderMonkeyError in Python.
- orjson for Performance: For standard JSON operations (loads and dumps), SalvaJSON uses orjson, a high-performance Python JSON library known for its speed.
  - dumps(obj, ...): Directly uses orjson.dumps() for fast Python object to JSON string serialization.
  - loads(s, ...): First attempts to parse the input string/bytes using orjson.loads(). If this succeeds (meaning the JSON is already valid), the result is returned quickly.
- Fallback Mechanism in loads: If the initial orjson.loads() attempt fails (e.g., due to an orjson.JSONDecodeError), it indicates malformed JSON. The loads function then automatically:
  - Takes the input string (decoding from bytes if necessary).
  - Calls salvaj() on this string to get a repaired JSON string.
  - Finally, calls orjson.loads() again on this repaired string.
This layered approach ensures that valid JSON is processed with maximum speed, while corrupted JSON gets a chance to be repaired and then parsed.
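The fallback flow can be sketched with the standard library alone. In this illustration json.loads stands in for orjson.loads, and a deliberately naive strip_trailing_commas stands in for salvaj(); both substitutions and the helper names are assumptions made for the example, not SalvaJSON's actual code.

```python
from __future__ import annotations

import json
import re
from typing import Any, Callable


def strip_trailing_commas(text: str) -> str:
    """Toy stand-in for salvaj(): drop a comma right before '}' or ']'.

    Naive on purpose: it does not look inside string literals, so it would
    also alter a string value that happens to contain ',}' or ',]'.
    """
    return re.sub(r",\s*([}\]])", r"\1", text)


def loads_with_fallback(s: bytes | str, repair: Callable[[str], str]) -> Any:
    """Parse strictly first; only on failure repair the text and parse again."""
    try:
        return json.loads(s)  # fast path: the input is already valid JSON
    except json.JSONDecodeError:
        # Slow path: decode bytes if necessary, repair, then parse strictly.
        text = s.decode("utf-8") if isinstance(s, (bytes, bytearray)) else s
        return json.loads(repair(text))


print(loads_with_fallback('{"a": 1}', strip_trailing_commas))
print(loads_with_fallback(b'{"a": [1, 2,],}', strip_trailing_commas))
```

Keeping the repair step behind the exception handler means valid input never pays the cost of the slower, lenient path.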
The project is organized as follows:
salvajson/
├── .github/workflows/          # GitHub Actions CI/CD workflows (ci.yml, update-js.yml)
├── src/
│   └── salvajson/
│       ├── __init__.py         # Main package interface, exports salvaj, loads, dumps
│       ├── _version.py         # Version info (managed by hatch-vcs & python-semantic-release)
│       ├── salvajson.py        # Core Python logic for salvaj, loads, dumps
│       └── salvajson.js        # Bundled JavaScript code (generated by esbuild from js_src/)
├── js_src/                     # JavaScript source code and build tools
│   ├── package.json            # npm package definition, lists JS dependencies (jsonic)
│   ├── package-lock.json       # npm lockfile for reproducible JS builds
│   ├── salvajson.src.js        # Main JavaScript source file wrapping jsonic
│   └── build.esbuild.js        # esbuild configuration for bundling salvajson.src.js
├── tests/                      # Pytest tests
│   └── test_salvajson.py       # Unit tests for salvajson functionalities
├── .pre-commit-config.yaml     # Configuration for pre-commit hooks
├── build.py                    # Hatchling build hook for bundling JS during package build
├── commitlint.config.js        # Configuration for commitlint (Conventional Commits)
├── pyproject.toml              # Python project configuration (PEP 621, Hatch, Ruff, Mypy, etc.)
├── README.md                   # This file
├── CHANGELOG.md                # Changelog (managed by python-semantic-release)
└── LICENSE                     # Apache 2.0 License
SalvaJSON provides convenient development scripts for easy setup and management:
- Clone the Repository:

    git clone https://github.com/twardoch/salvajson.git
    cd salvajson

- Set up Development Environment:

    ./dev.sh setup

  This will:
  - Create a virtual environment
  - Install Python dependencies
  - Install JavaScript dependencies
  - Set up pre-commit hooks

- Alternative Manual Setup: If you prefer manual setup:

    python -m venv .venv
    source .venv/bin/activate  # On Windows: .venv\Scripts\activate
    pip install -e ".[dev,test]"
    cd js_src && npm ci && cd ..
    pre-commit install
    pre-commit install --hook-type commit-msg
# Run tests
./dev.sh test
# Build the project
./dev.sh build --clean
# Run linting only
./dev.sh lint
# Show coverage report
./dev.sh coverage
# Create a release (dry run)
./dev.sh release --dry-run
For detailed development instructions, see DEVELOPMENT.md.
SalvaJSON uses Hatchling (the build backend of the Hatch project) to build the package, configured in pyproject.toml.
- JavaScript Bundling: A custom Hatchling build hook defined in build.py is responsible for bundling the JavaScript code.
  - It uses npm run build within the js_src directory (which in turn uses esbuild as configured in js_src/build.esbuild.js).
  - This bundles js_src/salvajson.src.js and its dependencies (like jsonic) into a single file: src/salvajson/salvajson.js.
  - This bundle is then included in the Python package.
- Building the Python Package: To build the wheel and source distribution:

    hatch build

  The distributable files will be placed in the dist/ directory.
- Manual JS Rebuild (for development): If you modify files in js_src/ after the initial editable install, the JavaScript bundle src/salvajson/salvajson.js needs to be rebuilt. You can do this by:
  - Running python build.py directly.
  - Or, reinstalling the editable package:

        pip install -e ".[dev,test]"
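When working from an editable install, one way to tell whether a rebuild is due is to compare file modification times. The helper below is a hypothetical sketch (bundle_is_stale is not part of the project); it assumes the js_src/ and src/salvajson/salvajson.js locations described above and ignores anything under node_modules.

```python
from pathlib import Path


def bundle_is_stale(js_src: Path, bundle: Path) -> bool:
    """Return True if the bundle is missing or older than any JS source file."""
    if not bundle.exists():
        return True
    bundle_mtime = bundle.stat().st_mtime
    sources = (
        p for p in js_src.rglob("*")
        if p.is_file() and "node_modules" not in p.parts  # skip installed deps
    )
    return any(p.stat().st_mtime > bundle_mtime for p in sources)


# Usage from the repository root (paths as listed in the project layout):
# if bundle_is_stale(Path("js_src"), Path("src/salvajson/salvajson.js")):
#     print("Bundle is out of date; rerun: python build.py")
```

Modification times are only a heuristic, so a fresh checkout or a touched file can report a stale bundle even when the content is unchanged.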
We welcome contributions! Please adhere to the following guidelines:
- Code Style & Quality:
  - Pre-commit: Always run pre-commit run --all-files before committing. This tool automatically formats code and checks for issues using:
    - Ruff: For linting and formatting Python code.
    - Prettier: For formatting JavaScript, JSON, and Markdown.
    - Mypy: For static type checking in Python.
    - Bandit: For finding common security issues in Python code.
  - Configurations for these tools are in pyproject.toml and .pre-commit-config.yaml.
- Commit Messages:
  - Follow the Conventional Commits specification (e.g., feat: ..., fix: ..., docs: ..., chore: ...).
  - commitlint (via pre-commit hook) enforces this.
- Testing:
  - Write pytest tests for any new features or bug fixes.
  - Ensure all tests pass by running pytest in the project root.
  - Aim for high test coverage. Current settings require >=80% coverage (pytest.ini).
- Branching and Pull Requests:
  - Work on feature branches.
  - Submit Pull Requests (PRs) to the main branch for review.
  - Ensure your PR passes all CI checks.
- Versioning and Releasing:
  - The project uses python-semantic-release for automated versioning, changelog generation, and tagging.
  - The version is derived from Git tags (e.g., v1.2.3) managed by hatch-vcs during build time.
  - CHANGELOG.md is automatically updated based on Conventional Commit messages when a release is made.
  - Release Process (for Maintainers):
    - Ensure main branch is up-to-date and all desired commits are present.
    - Run semantic-release publish. This command:
      - Determines the next semantic version based on commit history.
      - Updates CHANGELOG.md.
      - Commits these changes with a chore(release): ... message.
      - Creates and pushes a new Git tag (e.g., v0.2.1).
    - Pushing this tag triggers the "PyPI Publishing" job in the ci.yml GitHub Actions workflow. This job builds the package and uploads it to PyPI, then creates a GitHub Release.
- Automated Workflows (GitHub Actions):
  - ci.yml:
    - Triggered on pushes, pull requests, and manual dispatch.
    - Sets up Python and Node.js environments.
    - Installs dependencies (Python and JS).
    - Runs pre-commit checks.
    - Runs pytest.
    - Builds the package.
    - On new version tags, publishes to PyPI and creates a GitHub Release.
  - update-js.yml:
    - Runs weekly or manually.
    - Updates JavaScript dependencies in js_src/package-lock.json using npm update.
    - Runs npm audit --audit-level=critical.
    - Rebuilds the JS bundle (src/salvajson/salvajson.js).
    - Creates a Pull Request with these updates for review.
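The commit-message and versioning rules described in the guidelines above can be sketched in a few lines of Python. This is a simplified illustration, not the logic of commitlint or python-semantic-release: the type list, the regular expression, and the helper names (is_conventional, bump_level, next_version) are assumptions made for the example, and the real tools' behavior depends on their configuration.

```python
import re

# Assumed type list (a set commonly used with Conventional Commits);
# the project's commitlint.config.js may allow a different set.
ALLOWED_TYPES = {"build", "chore", "ci", "docs", "feat", "fix",
                 "perf", "refactor", "revert", "style", "test"}

# type, optional "(scope)", optional "!", then ": " and a description
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^()\s]+)\))?(?P<bang>!)?: (?P<desc>\S.*)$"
)


def is_conventional(message: str) -> bool:
    """Check only the first line of a commit message against the pattern."""
    first_line = message.splitlines()[0] if message else ""
    match = HEADER.match(first_line)
    return match is not None and match["type"] in ALLOWED_TYPES


def bump_level(message: str) -> int:
    """Return 3 (major), 2 (minor), 1 (patch) or 0 (no release)."""
    if not is_conventional(message):
        return 0
    match = HEADER.match(message.splitlines()[0])
    if match["bang"] or "BREAKING CHANGE:" in message:
        return 3
    if match["type"] == "feat":
        return 2
    if match["type"] in {"fix", "perf"}:
        return 1
    return 0


def next_version(current: str, messages: list[str]) -> str:
    """Apply the highest bump found in the commits since the last tag."""
    major, minor, patch = (int(part) for part in current.lstrip("v").split("."))
    level = max((bump_level(m) for m in messages), default=0)
    if level == 3:
        return f"{major + 1}.0.0"
    if level == 2:
        return f"{major}.{minor + 1}.0"
    if level == 1:
        return f"{major}.{minor}.{patch + 1}"
    return f"{major}.{minor}.{patch}"


print(next_version("v0.2.0", ["docs: tidy README", "fix: handle bytes input"]))
```

Under these assumed rules a single fix commit on top of v0.2.0 yields 0.2.1, while any feat commit would yield 0.3.0 instead.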
- Python 3.10+: The core language.
- PythonMonkey: Embeds a JavaScript engine, enabling the use of JavaScript libraries like jsonic.
- jsonic: A lenient JavaScript JSON parser used for repairing malformed JSON.
- orjson: A fast Python JSON library used for serializing and deserializing valid JSON.
- Fire: For creating the command-line interface (__main__.py).
- Hatch / Hatchling: Modern Python build system and backend.
- esbuild: Extremely fast JavaScript bundler, used to package jsonic and the wrapper script.
- npm / Node.js: For managing JavaScript dependencies and running build scripts for the JS part.
- Pre-commit Framework: Manages Git hooks for code quality checks.
- Ruff: Linter and formatter for Python.
- Mypy: Static type checker for Python.
- Prettier: Code formatter for JS, JSON, MD.
- Bandit: Security linter for Python.
- commitlint: Linter for commit messages (Conventional Commits).
- Pytest: Testing framework for Python.
- python-semantic-release: For automating versioning, changelog generation, and release tagging.
- GitHub Actions: For CI/CD and other automated workflows.
SalvaJSON is licensed under the Apache License 2.0. See the LICENSE file for details.
- Adam Twardoch (@twardoch): Original author and maintainer.
- Anthropic Claude: Contributed to development and documentation.
- Dependencies: This project relies on the excellent work of the maintainers of:
  - jsonic
  - PythonMonkey
  - orjson
  - And all other tools and libraries listed under "Key Technologies".
This project also serves as a demonstration of effective Python-JavaScript interoperability, modern Python packaging, and CI/CD best practices.