LangArena is a collection of 50+ diverse benchmarks designed for a realistic, apples-to-apples comparison of programming language performance. The goal is not to find the ultimate winner in micro-optimizations, but to evaluate how well each language's compiler or runtime optimizes clean and readable code.
The suite started with my original implementation in Crystal. AI tools assisted in translating it to other languages. Throughout this process, I reviewed and edited the implementations for semantic correctness and logical consistency to keep them idiomatic and the benchmarking fair. Not all algorithms could be implemented identically across all languages, simply because the languages are too different (this is particularly true for the base64 and JSON tests). However, I made every effort to make the implementations as similar as possible to each other.

Handling Library Differences: To address performance differences stemming from varying standard library implementations, I created a special tab in the results, Runtime Score. This metric normalizes execution times (seconds) into a 0–100 scoring system, where 50 represents the average performance across all languages. The overall Runtime Score is calculated as the average across all benchmarks.

This approach reduces the impact of outliers and ensures a fair overall assessment: a language that excels in most tasks but struggles with one particular library implementation (like JSON parsing) isn't severely penalized. It reflects the real-world scenario where developers use a mix of algorithms and libraries.
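
As an illustration of how such a normalization can behave, here is a minimal sketch in Python. The README does not state the actual formula, so the mapping `100 * average / (average + seconds)` is purely an assumption chosen because it gives exactly 50 at the average and stays within 0–100. The function names and the sample timings are hypothetical.

```python
from statistics import mean


def runtime_score(seconds: float, average_seconds: float) -> float:
    """Map a runtime onto 0-100 so that the average runtime scores exactly 50.

    ASSUMPTION: this mapping is illustrative only; the suite's real formula
    is not specified in this README.
    """
    if seconds <= 0 or average_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return 100.0 * average_seconds / (average_seconds + seconds)


def overall_scores(times: dict[str, dict[str, float]]) -> dict[str, float]:
    """times[benchmark][language] -> seconds; returns language -> mean score."""
    per_language: dict[str, list[float]] = {}
    for by_language in times.values():
        # Average runtime of this benchmark across all languages.
        avg = mean(by_language.values())
        for language, seconds in by_language.items():
            per_language.setdefault(language, []).append(runtime_score(seconds, avg))
    # Overall score is the plain average of the per-benchmark scores.
    return {language: mean(scores) for language, scores in per_language.items()}


# Hypothetical timings in seconds, made up for illustration.
sample_times = {
    "json_parse": {"lang_a": 1.0, "lang_b": 1.0, "lang_c": 10.0},
    "matmul": {"lang_a": 2.0, "lang_b": 4.0, "lang_c": 3.0},
}
scores = overall_scores(sample_times)
```

Because each per-benchmark score is bounded, one very slow result (here `lang_c` on `json_parse`) lowers the overall score by a limited amount rather than dominating it, as it would in a sum of raw seconds.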
Sources: Benchmark ideas were taken from:
- The Computer Language Benchmarks Game
- My own collections: benchmarks, jit-benchmarks, crystal-benchmarks-game, crystal-metric
- Crystal code samples

Philosophy:
- Clean Code: Benchmarks are written in a clear, idiomatic style that prioritizes readability and maintainability.
- Algorithmic Consistency: The same core algorithm is implemented across all languages for each task to ensure a fair comparison.
- Standard vs Unsafe Modes: All benchmarks use standard production compiler flags for each language (safe mode by default). However, we also provide a separate "Hacking" section that compares performance with specialized unsafe flags (like disabling bounds checks, removing runtime checks, or other language-specific optimizations that trade safety for speed). This shows what's possible when you prioritize performance over safety guarantees.
- Testing Language "Muscle": We measure the cost of abstractions. Can a language take clean, idiomatic code and optimize it to efficient machine code? Languages that can (like Rust, Java) prove their compilers are powerful. Languages that can't show the honest price of their abstractions. Benchmarks like matrix multiplication use naive implementations intentionally. We're not measuring how fast a language can call a C library (like BLAS via numpy), but how efficiently it handles fundamental computational patterns — because one day you'll have to write that loop yourself.
- Pull Requests Welcome: While consistency is key, improvements that maintain the philosophy and fix suboptimal implementations are encouraged.
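
To make the "naive implementation" idea from the list above concrete, the following is a minimal sketch of a textbook triple-loop matrix multiplication. It is written in Python only for brevity and is not taken from the suite; the function name `matmul_naive` is hypothetical.

```python
def matmul_naive(a, b):
    """Multiply two matrices (lists of rows) with the textbook triple loop."""
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            # Dot product of row i of a with column j of b.
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


product = matmul_naive([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
# product == [[19.0, 22.0], [43.0, 50.0]]
```

No blocking, vectorization, or library call is involved, so the measured time depends on how well the compiler or runtime handles plain loops and indexing.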
Each benchmark's execution time is measured in isolation, with data preparation excluded from timing. The suite includes a separate warmup phase for JIT-based languages (C#, Java, Julia, etc.) to allow compilation and optimization before measurements begin. This ensures fair comparisons by measuring steady-state performance where applicable, while still capturing cold-start characteristics for AOT-compiled languages. All benchmarks produce verifiable checksums to ensure algorithmic correctness across implementations.
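
The measurement flow described above (prepare, warm up, time, verify) might be sketched roughly as follows. This is not the suite's harness: `prepare`, `workload`, and `run_benchmark` are hypothetical names, and CRC32 over a byte buffer stands in for a real benchmark body. In CPython the warmup loop has no JIT to trigger; it is there only to show where that phase sits relative to the timer.

```python
import time
import zlib


def prepare(size: int) -> bytes:
    """Build the input data; this runs before the timer starts."""
    return bytes(i % 251 for i in range(size))


def workload(data: bytes) -> int:
    """Hypothetical benchmark body: a CRC32 over the input, used as the checksum."""
    return zlib.crc32(data)


def run_benchmark(size: int, warmup_runs: int = 3):
    data = prepare(size)              # data preparation, excluded from timing
    for _ in range(warmup_runs):      # warmup phase, also excluded from timing
        workload(data)
    start = time.perf_counter()
    checksum = workload(data)         # only this call is timed
    elapsed = time.perf_counter() - start
    return elapsed, checksum


elapsed, checksum = run_benchmark(1_000_000)
# The checksum can be compared against a reference value to verify correctness.
expected = zlib.crc32(prepare(1_000_000))
```

Comparing `checksum` with a known reference is what lets different implementations of the same benchmark be checked against each other.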
The benchmarks cover common practical tasks:
- JSON Processing: Parsing and generation
- Data Encoding: Base64 encoding/decoding
- Text Processing: Regex matching, string manipulation
- Cryptography & Hashing: SHA-256, CRC32
- Sorting Algorithms: Quick sort, merge sort
- Graph Algorithms: BFS, DFS, Dijkstra, A* pathfinding
- Mathematical Computations: Matrix multiplication, prime calculation, spectral norm
- Simulations: N-body, Game of Life, neural network
- Classic Benchmarks: Binary trees, Fannkuch-redux, Mandelbrot (from The Computer Language Benchmarks Game)
The suite currently focuses on compiled and high-performance managed languages:
C, C++, Crystal, Rust, Go, Swift, C#, Java, Kotlin, TypeScript, Zig, D, V, Julia, Nim, F#, Dart, Python, Odin, Scala.
Languages like Ruby or PHP are intentionally excluded to maintain a focused comparison within a similar performance bracket.
This suite is also a practical tool for:
- Compiler Tracking: Monitor performance regressions/improvements across compiler versions.
- New Language Evaluation: Get a standardized "score" to position a new language against established ones.
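
For the compiler-tracking use case, a comparison between two runs could look something like the sketch below. The layout of the suite's result files is not described here, so the inputs are assumed to be plain `{benchmark: seconds}` mappings; the function name, the 5% threshold, and the sample numbers are all hypothetical.

```python
def find_regressions(old, new, threshold=0.05):
    """Return {benchmark: relative_change} for benchmarks slower by more than threshold."""
    regressions = {}
    for name, old_seconds in old.items():
        new_seconds = new.get(name)
        if new_seconds is None or old_seconds <= 0:
            continue  # benchmark missing in the new run, or unusable baseline
        change = (new_seconds - old_seconds) / old_seconds
        if change > threshold:
            regressions[name] = change
    return regressions


# Hypothetical timings (seconds) for two compiler versions.
old_run = {"nbody": 2.00, "mandelbrot": 1.50, "binary_trees": 3.00}
new_run = {"nbody": 2.02, "mandelbrot": 1.80, "binary_trees": 2.70}
slower = find_regressions(old_run, new_run)
```

With these made-up numbers only `mandelbrot` is flagged: it is about 20% slower, while `nbody` changes by 1% and `binary_trees` improves.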
Test machine: AMD Ryzen 7 3800X 8-Core Processor, 78GB (x86_64-linux-gnu)
To run a single language locally (Rust shown as an example):

    cd rust
    ./test [BenchName]
    ./run [BenchName]
To run in Docker: this requires docker-compose-plugin v2. To check whether it is installed, run `docker compose version`; the reported version should be v2.xxx. If it is missing, install it.

    docker compose build rust
    docker compose run rust

Then, inside the container:

    ./test [BenchName]
    ./run [BenchName]
To build the Docker images and run the benchmarks:

    sh build-docker.sh
    ruby benchmarks.rb
To generate the results page from a results file:

    cd docs
    ruby gen.rb ../results/2026-02-02-x86_64-linux-gnu.js
    open index.html