This document introduces Kanzi C++, a modern lossless data compression library and command-line tool. It explains the project's purpose, key features, high-level architecture, and core design principles. Detailed coverage of individual subsystems is left to the dedicated subsystem pages.
Sources: README.md 1-17
Kanzi is a modern, modular, portable, and efficient lossless data compressor written in C++. Unlike traditional compressors limited to a single compression paradigm (e.g., LZ-based compression in gzip, zstd), Kanzi combines multiple algorithms and techniques to support a broader range of compression ratios and adapt to diverse data types.
Key characteristics:
| Characteristic | Description |
|---|---|
| Modern | Implements state-of-the-art algorithms (BWT, ROLZ, CM, TPAQ) and fully utilizes multi-core CPUs |
| Modular | Entropy codecs and transforms can be selected and combined at runtime |
| Portable | Supports Windows, Linux, macOS, BSD, and Android with multiple compilers (Visual Studio, GCC, Clang) |
| Efficient | Optimized to balance compression ratio and speed for practical use |
| Expandable | Clean, interface-driven design with no external dependencies |
Kanzi is a data compressor, not an archiver. It provides optional checksums for data integrity but does not include features like cross-file deduplication or data recovery. However, it produces a seekable bitstream, allowing one or more consecutive blocks to be decompressed independently.
Sources: README.md 3-15, README.md 36-53
Kanzi supports multiple compression approaches that can be combined at runtime: chains of transforms followed by an entropy coding stage.
Sources: src/app/Kanzi.cpp 134-149, src/app/BlockCompressor.cpp 521-577
Data is divided into independently compressible blocks. Block sizes vary by compression level (levels 0-9), from 4 MB at the lower levels up to 32 MB at level 9.
Sources: src/app/BlockCompressor.cpp 38-40, src/app/BlockCompressor.cpp 120-138
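To make the block model concrete, the sketch below splits an input buffer into pieces of at most one block size; each piece could then be handed to an independent encoding task. It is an illustration of the concept only (the names are not Kanzi APIs); the real division happens inside CompressedOutputStream::write().

```cpp
#include <cstddef>
#include <vector>

// Illustrative only: partition `data` into independent blocks of at most
// `blockSize` bytes, the way Kanzi conceptually divides its input.
struct Block { const unsigned char* ptr; size_t length; };

static std::vector<Block> splitIntoBlocks(const unsigned char* data,
                                          size_t size,
                                          size_t blockSize = 4 * 1024 * 1024) {
    std::vector<Block> blocks;
    for (size_t offset = 0; offset < size; offset += blockSize) {
        const size_t len = (size - offset < blockSize) ? (size - offset) : blockSize;
        blocks.push_back(Block{ data + offset, len });
    }
    return blocks; // each block can be compressed (and later decompressed) on its own
}
```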
Kanzi processes multiple blocks in parallel across threads for significant performance gains.
The system supports up to 64 concurrent jobs, with thread pool management to avoid deadlocks.
Sources: src/app/Kanzi.cpp 59-63, src/app/BlockCompressor.cpp 316-317
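The sketch below illustrates the parallelism model under the same per-block split: blocks are dispatched in waves whose width is capped by a job count, with Kanzi's documented maximum of 64. It uses plain std::async rather than Kanzi's ThreadPool and BoundedConcurrentQueue, purely to show the idea; compressBlock is a hypothetical stand-in for the real encoding task.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Placeholder stand-in for Kanzi's per-block encoding task: here it just copies the block.
static std::vector<unsigned char> compressBlock(const unsigned char* ptr, size_t len) {
    return std::vector<unsigned char>(ptr, ptr + len);
}

// Illustration: process blocks in waves of at most `jobs` concurrent tasks.
static void compressAllBlocks(const std::vector<std::pair<const unsigned char*, size_t>>& blocks,
                              size_t jobs = 4) {
    jobs = std::min<size_t>(std::max<size_t>(jobs, 1), 64); // Kanzi supports up to 64 jobs
    for (size_t i = 0; i < blocks.size(); i += jobs) {
        std::vector<std::future<std::vector<unsigned char>>> wave;
        const size_t end = std::min(blocks.size(), i + jobs);
        for (size_t b = i; b < end; ++b)
            wave.push_back(std::async(std::launch::async, compressBlock,
                                      blocks[b].first, blocks[b].second));
        for (auto& f : wave)
            f.get(); // in the real code, results are written to the bitstream in block order
    }
}
```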
The following diagram shows the major subsystems and their relationships:
Diagram: Kanzi System Architecture - Major Components
Sources: src/app/Kanzi.cpp 1-56, src/app/BlockCompressor.hpp 120-164, src/app/BlockDecompressor.hpp 113-145, src/io/CompressedOutputStream.hpp 125-212, src/io/CompressedInputStream.hpp 168-280
The following table maps conceptual components to their primary code entities:
| Component | Primary Classes/Files | Responsibility |
|---|---|---|
| CLI Tool | main() in Kanzi.cpp 1-1732 | Argument parsing, mode selection (compress/decompress/info) |
| File Orchestration | BlockCompressor (BlockCompressor.hpp 120-164), BlockDecompressor (BlockDecompressor.hpp 113-145) | Multi-file management, task creation, concurrency coordination |
| Stream Processing | CompressedOutputStream (CompressedOutputStream.hpp 125-212), CompressedInputStream (CompressedInputStream.hpp 168-280) | Block division, header management, task submission |
| Encoding/Decoding Tasks | EncodingTask<T> (CompressedOutputStream.hpp 103-123), DecodingTask<T> (CompressedInputStream.hpp 144-166) | Per-block transform + entropy coding execution |
| Transform Chain | TransformSequence<byte> (transform/TransformSequence.hpp), TransformFactory<byte> (transform/TransformFactory.hpp) | Runtime transform selection and chaining |
| Entropy Coding | EntropyEncoder/EntropyDecoder interfaces, EntropyEncoderFactory/EntropyDecoderFactory (entropy/) | Statistical compression codec selection |
| Bit I/O | DefaultOutputBitStream (bitstream/DefaultOutputBitStream.hpp), DefaultInputBitStream (bitstream/DefaultInputBitStream.hpp) | Buffered bit-level I/O with 3-tier buffering |
| Configuration | Context (Context.hpp) | Key-value parameter store flowing through pipeline |
| Concurrency | ThreadPool, BoundedConcurrentQueue<T> (concurrent.hpp) | Thread management and work queue |
| Utilities | Global (Global.hpp), IOUtil (io/IOUtil.hpp) | Math, entropy calculation, file enumeration |
Sources: src/app/Kanzi.cpp 373-1050, src/app/BlockCompressor.cpp 43-161, src/app/BlockDecompressor.cpp 37-70, src/io/CompressedOutputStream.cpp 43-245, src/io/CompressedInputStream.cpp 40-240
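The Context entry above is a key-value parameter store that travels through the whole pipeline (block size, transform and entropy selections, job count, checksum settings). The sketch below shows the general shape of such a store; the class and method names are illustrative and are not taken from Context.hpp.

```cpp
#include <map>
#include <string>

// Illustrative key-value parameter store in the spirit of Kanzi's Context.
// Pipeline stages read the settings they need (block size, codec names, jobs...).
class ParamContext {
public:
    void putString(const std::string& key, const std::string& value) { _map[key] = value; }

    std::string getString(const std::string& key, const std::string& defValue = "") const {
        auto it = _map.find(key);
        return (it == _map.end()) ? defValue : it->second;
    }

    void putInt(const std::string& key, long value) { _map[key] = std::to_string(value); }

    long getInt(const std::string& key, long defValue = 0) const {
        auto it = _map.find(key);
        return (it == _map.end()) ? defValue : std::stol(it->second);
    }

private:
    std::map<std::string, std::string> _map;
};

// Example: the CLI layer fills the store, downstream stages read from it.
//   ctx.putInt("blockSize", 4 * 1024 * 1024);
//   ctx.putString("entropy", "HUFFMAN");
```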
Diagram: High-Level Compression Pipeline
Sources: src/app/BlockCompressor.cpp 169-496, src/io/CompressedOutputStream.cpp 361-389, src/io/CompressedOutputStream.cpp 476-537, src/io/CompressedOutputStream.cpp 277-342
1. createFileList() (io/IOUtil.hpp 97-211) recursively discovers files based on configuration
2. BlockCompressor::compress() creates one FileCompressTask<T> per file (BlockCompressor.cpp 322-404)
3. Global::computeJobsPerTask() assigns jobs to tasks (BlockCompressor.cpp 365)
4. CompressedOutputStream::write() divides data into blocks (CompressedOutputStream.cpp 361-389)
5. writeHeader() writes bitstream metadata (type=0x4B414E5A, version, codecs) (CompressedOutputStream.cpp 277-342)
6. EncodingTask::run() applies transforms then entropy coding (CompressedOutputStream.cpp 612-1252)
7. DefaultOutputBitStream writes compressed bits with 3-tier buffering (bitstream/DefaultOutputBitStream.cpp)

Sources: src/app/BlockCompressor.cpp 169-496, src/io/CompressedOutputStream.cpp 361-537
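Steps 6 and 7 are where the actual compression happens: each block runs through the configured transform chain and the result is entropy coded. The sketch below captures that per-block shape with hypothetical ITransform and IEntropyEncoder interfaces; it is a simplified stand-in for EncodingTask::run(), TransformSequence<byte>, and the entropy codecs, not their actual code.

```cpp
#include <memory>
#include <vector>

// Hypothetical interfaces standing in for Kanzi's transform and entropy stages.
struct ITransform {
    virtual ~ITransform() = default;
    // Forward transform: returns the transformed bytes for `src`.
    virtual std::vector<unsigned char> forward(const std::vector<unsigned char>& src) = 0;
};

struct IEntropyEncoder {
    virtual ~IEntropyEncoder() = default;
    // Entropy-codes `src` and returns the compressed bytes.
    virtual std::vector<unsigned char> encode(const std::vector<unsigned char>& src) = 0;
};

// Per-block encode: apply each transform in order, then the entropy stage.
// Conceptually mirrors what EncodingTask::run() does for one block.
static std::vector<unsigned char>
encodeBlock(std::vector<unsigned char> block,
            const std::vector<std::unique_ptr<ITransform>>& transforms,
            IEntropyEncoder& entropy) {
    for (const auto& t : transforms)
        block = t->forward(block); // output of one transform feeds the next
    return entropy.encode(block);
}
```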
Decompression follows a symmetric pipeline:
1. createFileList() discovers the input files (as in compression)
2. readHeader() (CompressedInputStream.cpp 509-676) reads bitstream metadata
3. submitBlock() creates DecodingTask<T> instances (CompressedInputStream.cpp 272-328)
4. DecodingTask::run() applies entropy decoding then inverse transforms (CompressedInputStream.cpp 775-1423)
5. read() (CompressedInputStream.cpp 422-506) returns decompressed data to the caller

Sources: src/app/BlockDecompressor.cpp 78-399, src/io/CompressedInputStream.cpp 272-506, src/io/CompressedInputStream.cpp 509-676
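Mirroring the encode sketch above, decoding a block conceptually runs the entropy decoder first and then applies the inverse transforms in reverse chain order. Again the interfaces are hypothetical stand-ins, not Kanzi's DecodingTask or transform classes.

```cpp
#include <memory>
#include <vector>

// Hypothetical inverse-stage interfaces (counterparts of the encode sketch above).
struct IInverseTransform {
    virtual ~IInverseTransform() = default;
    virtual std::vector<unsigned char> inverse(const std::vector<unsigned char>& src) = 0;
};

struct IEntropyDecoder {
    virtual ~IEntropyDecoder() = default;
    virtual std::vector<unsigned char> decode(const std::vector<unsigned char>& src) = 0;
};

// Per-block decode: entropy decoding first, then inverse transforms applied
// in reverse order of the encode-side chain (conceptually what DecodingTask::run() does).
static std::vector<unsigned char>
decodeBlock(const std::vector<unsigned char>& compressed,
            const std::vector<std::unique_ptr<IInverseTransform>>& reversedChain,
            IEntropyDecoder& entropy) {
    std::vector<unsigned char> data = entropy.decode(compressed);
    for (const auto& t : reversedChain)
        data = t->inverse(data);
    return data;
}
```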
Kanzi provides 10 compression levels (0-9) that select predefined transform and entropy codec combinations:
| Level | Transform | Entropy | Block Size | Typical Use |
|---|---|---|---|---|
| 0 | NONE | NONE | 4 MB | Store (no compression) |
| 1 | LZX | NONE | 4 MB | Fast compression |
| 2 | DNA+LZ | HUFFMAN | 4 MB | General purpose |
| 3 | TEXT+UTF+PACK+MM+LZX | HUFFMAN | 4 MB | Default level |
| 4 | TEXT+UTF+EXE+PACK+MM+ROLZ | NONE | 4 MB | High compression |
| 5 | TEXT+UTF+BWT+RANK+ZRLT | ANS0 | 4 MB | BWT-based |
| 6 | TEXT+UTF+BWT+SRT+ZRLT | FPAQ | 8 MB | Better ratio |
| 7 | LZP+TEXT+UTF+BWT+LZP | CM | 16 MB | Strong compression |
| 8 | EXE+RLT+TEXT+UTF+DNA | TPAQ | 16 MB | Maximum ratio |
| 9 | EXE+RLT+TEXT+UTF+DNA | TPAQX | 32 MB | Best compression |
Users can also specify custom transform and entropy codec combinations using the -t and -e options.
Sources: src/app/Kanzi.cpp 130-143, src/app/BlockCompressor.cpp 521-577
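Read as data, the table above is simply a set of presets: each level fixes a transform string, an entropy codec, and a block size. The sketch below restates the table in that form; it paraphrases the table rather than reproducing the initialization code in BlockCompressor.cpp.

```cpp
#include <cstddef>

// Level presets as listed in the table above (illustrative restatement).
struct LevelPreset {
    const char* transform;
    const char* entropy;
    size_t blockSize; // bytes
};

static const size_t MB = 1024 * 1024;

static const LevelPreset LEVELS[10] = {
    { "NONE",                      "NONE",    4 * MB },  // 0: store
    { "LZX",                       "NONE",    4 * MB },  // 1
    { "DNA+LZ",                    "HUFFMAN", 4 * MB },  // 2
    { "TEXT+UTF+PACK+MM+LZX",      "HUFFMAN", 4 * MB },  // 3: default level
    { "TEXT+UTF+EXE+PACK+MM+ROLZ", "NONE",    4 * MB },  // 4
    { "TEXT+UTF+BWT+RANK+ZRLT",    "ANS0",    4 * MB },  // 5
    { "TEXT+UTF+BWT+SRT+ZRLT",     "FPAQ",    8 * MB },  // 6
    { "LZP+TEXT+UTF+BWT+LZP",      "CM",     16 * MB },  // 7
    { "EXE+RLT+TEXT+UTF+DNA",      "TPAQ",   16 * MB },  // 8
    { "EXE+RLT+TEXT+UTF+DNA",      "TPAQX",  32 * MB },  // 9
};
```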
The codebase is organized into logical directories:
```
src/
├── app/                    # CLI application and file orchestration
│   ├── Kanzi.cpp           # Main entry point, argument parsing
│   ├── BlockCompressor.cpp/hpp
│   ├── BlockDecompressor.cpp/hpp
│   └── InfoPrinter.cpp/hpp
├── io/                     # Stream I/O and compressed streams
│   ├── CompressedOutputStream.cpp/hpp
│   ├── CompressedInputStream.cpp/hpp
│   └── IOUtil.hpp          # File enumeration utilities
├── bitstream/              # Bit-level I/O
│   ├── DefaultOutputBitStream.cpp/hpp
│   └── DefaultInputBitStream.cpp/hpp
├── transform/              # Data transformation algorithms
│   ├── BWT.cpp/hpp
│   ├── LZCodec.cpp/hpp
│   ├── ROLZCodec.cpp/hpp
│   ├── TextCodec.cpp/hpp
│   └── TransformFactory.hpp
├── entropy/                # Entropy coding algorithms
│   ├── ANSRangeEncoder.cpp/hpp
│   ├── HuffmanEncoder.cpp/hpp
│   ├── FPAQEncoder.cpp/hpp
│   └── EntropyEncoderFactory.hpp
├── util/                   # Utilities (XXHash, Clock, etc.)
└── types.hpp, Global.hpp, Context.hpp, concurrent.hpp
```
Sources: README.md 133-179, src/io/IOUtil.hpp 1-354
The bitstream format is versioned and includes header checksums to detect incompatibility.
Sources: src/app/Kanzi.cpp 53, src/io/CompressedOutputStream.cpp 31-32, README.md 133-162
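As a concrete illustration of that check, the sketch below validates the 4-byte bitstream type 0x4B414E5A ("KANZ") and compares the stream's version field against the highest version the decoder understands. The actual field layout and widths are defined in CompressedOutputStream.cpp and read back in CompressedInputStream::readHeader(); the standalone function here is an assumption for illustration only.

```cpp
#include <cstdint>
#include <stdexcept>

// Bitstream type value written at the start of a Kanzi stream ("KANZ").
static const uint32_t BITSTREAM_TYPE = 0x4B414E5A;

// Hypothetical check: the real reader works through DefaultInputBitStream
// and the header layout implemented in CompressedInputStream::readHeader().
void checkHeader(uint32_t streamType, uint32_t version, uint32_t maxSupportedVersion) {
    if (streamType != BITSTREAM_TYPE)
        throw std::runtime_error("Invalid stream type: not a Kanzi bitstream");

    if (version > maxSupportedVersion)
        throw std::runtime_error("Unsupported bitstream version");
}
```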