⚠️ Development Status: This library is currently under development. The API may change without notice.
archive_r is a libarchive-based library for processing many archive formats. It streams entry data directly from the source to recursively read nested archives without extracting to temporary files or loading large in-memory buffers.
- Nested Archive Support: Recursively processes archives within archives (including split archives when specified)
- Password-Protected Archives: Reads encrypted archives (multiple passphrases supported)
- Iterator API: Follows C++ Standard Library idioms
- Multi-Language Bindings: Available from Python and Ruby
- OS:
- Linux: x86_64, aarch64 (glibc 2.28+, manylinux_2_28)
- macOS: x86_64, arm64 (Universal2, macOS 11.0+)
- Windows: x64 (Windows 10/11, Server 2019+)
- Compiler: C++17 or later (GCC 7+, Clang 5+, MSVC 2019+, etc.)
- Dependencies:
- libarchive 3.x (required)
cd archive_r
./build.shBuild artifacts will be generated under archive_r/build/.
To build with language bindings:
# Include Python bindings
./build.sh --with-python
# Include Ruby bindings
./build.sh --with-ruby
# Include both
./build.sh --with-python --with-ruby
# Full rebuild for Python-only CI workflows (skips Ruby binding steps)
./build.sh --rebuild-all --python-only- Python workflows (standalone builds, packaging automation, tests, and usage examples) now live in
bindings/python/README.md. - Ruby workflows remain documented in
bindings/ruby/README.md.
#include "archive_r/traverser.h"
#include <iostream>
#include <string>
#include <vector>
using namespace archive_r;
// Stream search within entry content (buffer boundary aware)
// entry.read(buffer.data(), buffer.size()) returns the number of bytes read (0 for EOF, -1 for error)
bool search_in_entry(Entry& entry, const std::string& keyword) {
std::string overlap; // Preserve tail from previous read
std::vector<char> buffer(8192);
while (true) {
const ssize_t bytes_read = entry.read(buffer.data(), buffer.size());
if (bytes_read <= 0) break; // EOF or error
std::string chunk(buffer.begin(), buffer.begin() + bytes_read);
std::string search_text = overlap + chunk;
if (search_text.find(keyword) != std::string::npos) {
return true;
}
// Preserve tail for next iteration (keyword length - 1)
if (chunk.size() >= keyword.size() - 1) {
overlap = chunk.substr(chunk.size() - (keyword.size() - 1));
} else {
overlap = chunk;
}
}
return false;
}
int main() {
TraverserOptions options;
options.formats = {
"7zip", "ar", "cab", "cpio", "empty", "iso9660",
"lha", "rar", "tar", "warc", "xar", "zip"
}; // Exclude libarchive's mtree/raw pseudo formats
// Wrap single filesystem path into PathHierarchy helper before traversal.
Traverser traverser({make_single_path("test.zip")}, options);
for (auto it = traverser.begin(); it != traverser.end(); ++it) {
Entry& entry = *it;
const std::string full_path = entry.path();
std::cout << "Path: " << full_path
<< " (depth=" << entry.depth() << ")\n";
// Search text file content
if (entry.is_file() && full_path.ends_with(".txt")) {
if (search_in_entry(entry, "search_keyword")) {
std::cout << " Found keyword in: " << full_path << "\n";
}
}
}
return 0;
}ℹ️ Entry Path Representation (C++)
entry.path()returns a path string including the top-level archive name (e.g.,outer/archive.zip/dir/subdir/file.txt).entry.name()returns the last element ofpath_hierarchy()(e.g.,"dir/subdir/file.txt").entry.path_hierarchy()returns aPathHierarchy(a sequence ofPathEntrysteps). In the common case, each step is a single string (conceptually like{"outer/archive.zip", "dir/subdir/file.txt"}), but it can also represent multi-volume grouping.
For Python and Ruby usage guides (installation, API references, practical samples), see the dedicated binding documents:
- Python:
bindings/python/README.md - Ruby:
bindings/ruby/README.md
PathHierarchy is the core abstraction representing a path through nested or multi-volume archives.
For convenience, archive_r also provides Traverser(const std::string& path, ...) for the common single-root case. PathHierarchy remains the underlying representation returned by Entry::path_hierarchy() and is useful when you need to explicitly express multi-volume roots.
archive_r models each traversal step as a sequence of path entries, where each entry can be:
- Single-volume entry: A regular file or directory (e.g.,
"archive.tar","dir/file.txt") - Multi-volume entry: A split archive group (e.g.,
{"vol.part1", "vol.part2", "vol.part3"})
This design enables archive_r to represent complex archive structures uniformly, supporting operations like path comparison, ordering, and display.
A PathEntry is a value type that can hold two forms:
// include/archive_r/path_hierarchy.h
class PathEntry {
public:
struct Parts {
std::vector<std::string> values;
enum class Ordering { Natural, Given } ordering = Ordering::Natural;
};
static PathEntry single(std::string entry);
static PathEntry multi_volume(std::vector<std::string> entries,
Parts::Ordering ordering = Parts::Ordering::Natural);
bool is_single() const;
bool is_multi_volume() const;
const std::string& single_value() const;
const Parts& multi_volume_parts() const;
};- Single (
std::string): Represents a single path component (e.g.,"archive.zip","dir/file.txt") - Multi-volume (
Parts): Holds a list of volume paths plus an ordering flag:Naturalordering: Sorted by natural numeric ordering (e.g.,["vol.part1", "vol.part10", "vol.part2"]→["vol.part1", "vol.part2", "vol.part10"])Givenordering: Preserves the order specified by the user
using PathHierarchy = std::vector<PathEntry>;A PathHierarchy is a sequence of PathEntry elements representing the full path from the root to a target entry. For example:
{"archive.tar", "dir/subdir/file.txt"}— regular nested path{{"vol.part1", "vol.part2"}, "inner.zip", "data.csv"}— multi-volume archive containing nested archive with CSV file
PathHierarchy defines strict ordering rules to enable consistent path comparison:
- Type-based ordering:
Single < Multi-volume - Within-type ordering:
- Single: Lexicographic string comparison
- Multi-volume: First by ordering mode (
Natural < Given), then lexicographic comparison of part lists
- Hierarchy comparison: Compare entries level-by-level until a difference is found
This ordering ensures that archive paths can be sorted, deduplicated, and indexed consistently across all archive types.
archive_r provides convenience builders for common cases:
// Create a single-entry hierarchy from a filesystem path
PathHierarchy single_path = make_single_path("archive.tar.gz");
// Result: {PathEntry("archive.tar.gz")}
// Create a multi-volume hierarchy from a list of parts
PathHierarchy multi_volume;
append_multi_volume(multi_volume,
{"archive.part1", "archive.part2", "archive.part3"},
PathEntry::Parts::Ordering::Natural); // or Ordering::Given
// Result: {PathEntry(Parts{{"archive.part1", "archive.part2", "archive.part3"}, Natural})}When constructing a Traverser, wrap top-level paths using these helpers:
// Single archive
Traverser tr1({make_single_path("archive.tar")});
// Multiple archives
Traverser tr2({
make_single_path("first.zip"),
make_single_path("second.tar.gz")
});
// Multi-volume archive
PathHierarchy mv_root;
append_multi_volume(mv_root, {"vol.part1", "vol.part2"}, PathEntry::Parts::Ordering::Natural);
Traverser tr3({mv_root});The Entry class exposes PathHierarchy through several methods:
entry.path_hierarchy()— Returns the fullPathHierarchyfor the current entryentry.path()— Flattens the hierarchy into a single string (e.g.,"archive.tar/dir/file.txt")entry.name()— Returns the last component of the hierarchy (e.g.,"file.txt")
For custom display formats or deep path analysis, use path_hierarchy() directly:
PathHierarchy hier = entry.path_hierarchy();
for (const PathEntry& step : hier) {
if (step.is_single()) {
std::cout << "Single: " << step.single_value() << "\n";
continue;
}
if (step.is_multi_volume()) {
const auto& parts = step.multi_volume_parts();
std::cout << "Multi-volume (" << parts.values.size() << " parts)\n";
continue;
}
}By default, all files are attempted to be expanded as archives. If expansion fails or the format is unsupported, the error is ignored and the file is treated as a regular file.
🔧 Default descent configuration
- C++: set
TraverserOptions.descend_archives(defaulttrue) before constructing the traverser.- Python: pass
descend_archives=True/Falsetoarchive_r.Traverser.- Ruby: provide the
descend_archives:keyword toArchive_r.traverse/Archive_r::Traverser.new. This controls the initial value reported byentry.descent_enabled()for every entry.
To suppress automatic expansion for specific entries, call set_descent(false):
// C++ example
for (Entry& entry : traverser) {
// Don't attempt to expand Office files (internally ZIP but expansion unnecessary)
std::string path = entry.path();
if (path.ends_with(".docx") || path.ends_with(".xlsx") || path.ends_with(".pptx")) {
entry.set_descent(false);
}
}For Python and Ruby examples, see the respective binding documentation:
⚠️ Reading entry content temporarily disables descent
- Calling
Entry::read(or the binding equivalents) automatically flipsentry.descent_enabled()toFalseso the partially consumed payload will not be re-opened implicitly.- Call
entry.set_descent(True)if you still want to descend into the entry after streaming its data.
Metadata that cannot be retrieved via Entry's fixed API (size(), is_file(), etc.) can be obtained using metadata() or find_metadata().
Specify the metadata keys to capture in advance using TraverserOptions (C++) or the metadata_keys argument in the bindings:
// C++ example
TraverserOptions options;
options.metadata_keys = {"uid", "gid", "mtime"};
// Convert filesystem root into PathHierarchy prior to traversal.
Traverser traverser({make_single_path("test.tar")}, options);
for (Entry& entry : traverser) {
if (auto* uid = entry.find_metadata("uid")) {
std::cout << "UID: " << std::get<int64_t>(*uid) << "\n";
}
}For Python and Ruby examples, see the respective binding documentation:
By default, all formats supported by libarchive are enabled. To enable only specific formats, specify TraverserOptions.formats (C++) or pass the formats keyword argument:
// C++ example
TraverserOptions options;
options.formats = {"zip", "tar"}; // Enable only ZIP and TAR
// Each provided root path must be expressed as a PathHierarchy.
Traverser traverser({make_single_path("test.zip")}, options);For Python and Ruby examples, see the respective binding documentation:
When processing split archive files (.zip.001, .zip.002, ...), use set_multi_volume_group() to register them as the same group.
After the parent archive traversal completes, each group is automatically merged and expanded:
// C++ example
for (Entry& entry : traverser) {
std::string path = entry.path();
if (path.find(".part") != std::string::npos) {
// Extract base name from extension (e.g., "archive.zip.part001" → "archive.zip")
// Implement actual extraction logic based on your extension conventions
size_t pos = path.rfind(".part");
std::string base_name = path.substr(0, pos);
entry.set_multi_volume_group(base_name);
}
}For Python and Ruby examples, see the respective binding documentation:
archive_r supports multi-threaded usage with the following constraints:
- Thread-safe: Each thread can create and use its own
Traverserinstance independently. - Not thread-safe: A single
Traverserinstance must not be shared across threads.
// ✓ SAFE: Each thread has its own Traverser
std::thread t1([]{
Traverser tr({make_single_path("archive.tar.gz")});
for(Entry& e : tr) { /* process */ }
});
std::thread t2([]{
Traverser tr({make_single_path("archive.tar.gz")});
for(Entry& e : tr) { /* process */ }
});
// ✗ UNSAFE: Sharing a single Traverser instance
Traverser shared_tr({make_single_path("archive.tar.gz")});
std::thread t1([&]{ for(Entry& e : shared_tr) { /* process */ } }); // Race condition!
std::thread t2([&]{ for(Entry& e : shared_tr) { /* process */ } }); // Race condition!Internal components (ArchiveStackOrchestrator, Entry, etc.) inherit the same constraint.
archive_r reports recoverable data errors (corrupted archives, I/O failures) via callbacks. Faults do not stop traversal; you can decide how to react in your callback implementation.
| Situation | Reporting mechanism | Notes |
|---|---|---|
Invalid Traverser arguments (e.g., empty paths / empty PathHierarchy) |
Exception (std::invalid_argument) |
Thrown during construction |
| Directory traversal errors | Exception (std::filesystem::filesystem_error) |
Not converted to faults (current behavior) |
| Recoverable archive/data errors during traversal | Fault callback (EntryFault) |
Traversal continues |
| Entry content read failure | Entry::read() returns -1 and dispatches an EntryFault |
See Entry header docs for details |
- Call
set_descent()/set_multi_volume_group()on theEntry&inside the traversal loop (before advancing). Copies do not retain traverser-managed control state. - After a successful
read()(including EOF),descentis disabled until you explicitly re-enable it withset_descent(true).
Use the library-wide register_fault_callback helper (or the binding-level archive_r.on_fault / Archive_r.on_fault) to receive fault notifications:
#include "archive_r/entry_fault.h"
register_fault_callback([](const EntryFault& fault) {
std::cerr << "Warning at " << hierarchy_display(fault.hierarchy)
<< ": " << fault.message << std::endl;
// Traversal continues to next entry
});
Traverser traverser({make_single_path("archive.tar.gz")});
for (Entry& entry : traverser) {
// Process valid entries; faults are reported via callback
}
// Reset when you no longer need the callback
register_fault_callback({});EntryFault structure:
hierarchy: Path where the fault occurredmessage: Human-readable descriptionerrno_value: Optional errno from system calls
This design allows processing valid entries even when some are corrupted.
cd archive_r
./run_tests.sh # core tests
./bindings/ruby/run_binding_tests.sh
./bindings/python/run_binding_tests.shCore tests run via run_tests.sh; Ruby/Python binding suites live in the dedicated scripts under bindings/ (these scripts are also called from CI).
archive_r is distributed under the MIT License. See the LICENSE file for details.
This project depends on the following third-party libraries:
- libarchive: New BSD License (required at runtime)
- pybind11: BSD-style License (required only for building Python bindings)
- rake: MIT License (required only for building Ruby bindings)
- minitest: MIT License (required only for testing Ruby bindings)
archive_r/
├── include/ # C++ header files
├── src/ # C++ implementation
├── bindings/ # Python/Ruby bindings
│ ├── python/
│ └── ruby/
├── test/ # Test code
├── examples/ # Example code
├── docs/ # Documentation
└── build.sh # Build script
# Build core library only
./build.sh
# Rebuild core library (clean then build)
./build.sh --rebuild
# Rebuild all (core + bindings)
./build.sh --rebuild-all
# Rebuild all artifacts but skip Ruby binding (equivalent to Python-only CI)
./build.sh --rebuild-all --python-only
# Clean core library only
./build.sh --clean
# Clean all (core + bindings)
./build.sh --clean-all
# Build with bindings
./build.sh --with-python --with-rubyci.ymlruns on Ubuntu 24.04 for every push/PR tomain, executes./build.sh --rebuild-all, then runs./run_tests.shand the Ruby binding tests (bindings/ruby/run_binding_tests.sh). Python is verified via the wheel-install check performed during./build.sh --package-python.build-wheels.ymlproduces manylinux_2_28 wheels for CPython 3.9–3.12 inside Docker, relying on./build.sh --rebuild-all --python-onlybefore repairing wheels withauditwheel.release.ymlties everything together: it re-runs the full build, downloads the wheel/SDist artifacts, creates a GitHub Release, and publishes Python packages to PyPI (RubyGems publishing remains optional and requires a token when enabled).
Contributions to the project are welcome. Please submit bug reports and feature requests to GitHub Issues.