add streaming uniwig alongside current batch parallel implementation #236
Merged
Conversation
Member (Author):
@donaldcampbelljr take a look at this. I'm proposing we replace the batch implementation with this streaming implementation; I wanted to leave them side by side at first to do the benchmarking.
Member:
Could you offer more clarity on how this may affect downstream tools that use these file types for input? I'm struggling to visualize how the output file is affected and am concerned it might break some use cases. I didn't see any output test files that demonstrate the different output options for quick comparison/grokking of how this is working.
donaldcampbelljr approved these changes on Mar 4, 2026.
Adds a streaming mode to uniwig (`--streaming`) that computes coverage counts with O(smooth_size) memory instead of O(chromosome_size). It processes BED input line-by-line using a sliding VecDeque window, supports all three count types (start/end/core), outputs WIG or bedGraph, and handles gzip transparently. It also works with stdin/stdout for piping.

A key addition is sparse output: the existing batch mode always writes every position from 1 to chrom_size, even where counts are zero. Streaming defaults to emitting only non-zero positions, which dramatically reduces output size (1.2 GB vs 5.8 GB on 10M records) and is the main reason it's faster despite using a single thread.
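The sliding-window idea can be sketched roughly as follows. This is not the gtars code, just a minimal illustration under assumptions: the function name, signature, and the exact gap-filling policy are hypothetical, and the real implementation reads BED records line-by-line rather than from a pre-sorted slice. The point it demonstrates is that memory stays proportional to the window width, not the chromosome length.

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of streaming "start" coverage for one chromosome.
/// Each start contributes a count of 1 to every position within
/// `smooth_size` bases of it, so a position's count is the number of
/// starts currently inside the sliding window; only the window contents
/// are held in memory (O(smooth_size), not O(chromosome_size)).
///
/// `dense` mimics the gap-handling knob described above:
///   -1 = fully dense (emit every position, zeros included),
///    0 = fully sparse (non-zero positions only),
///    N > 0 = fill zero gaps of at most N bases with explicit zeros.
fn streaming_start_coverage(
    starts: &[u32], // sorted start coordinates for one chromosome
    smooth_size: u32,
    dense: i64,
    chrom_size: u32,
) -> Vec<(u32, u32)> {
    let mut out = Vec::new();
    let mut window: VecDeque<u32> = VecDeque::new();
    let mut next = 0usize; // index of the next start not yet admitted
    let mut zero_run: Vec<u32> = Vec::new(); // pending zero-count positions

    for pos in 1..=chrom_size {
        // Admit starts whose smoothed interval now covers `pos`.
        while next < starts.len() && starts[next] <= pos + smooth_size {
            window.push_back(starts[next]);
            next += 1;
        }
        // Retire starts whose smoothed interval has fallen behind `pos`.
        while window.front().map_or(false, |&s| s + smooth_size < pos) {
            window.pop_front();
        }
        let count = window.len() as u32;
        if count == 0 {
            zero_run.push(pos);
        } else {
            // Flush pending zeros only when the dense policy allows it.
            if dense == -1 || (dense > 0 && zero_run.len() as i64 <= dense) {
                out.extend(zero_run.iter().map(|&p| (p, 0)));
            }
            zero_run.clear();
            out.push((pos, count));
        }
    }
    // Fully dense mode also emits trailing zeros at the chromosome end.
    if dense == -1 {
        out.extend(zero_run.iter().map(|&p| (p, 0)));
    }
    out
}
```

In sparse mode the output only contains the (position, count) pairs around each feature, which is what produces the large output-size reduction reported below.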
The `--dense` flag controls gap handling: `0` = fully sparse (non-zero positions only), `-1` = fully dense (matching batch behavior), and any positive `N` fills gaps of at most `N` bases wide with explicit zeros. The default of 100 is essentially free.

This also fixes a bug in the batch path where all three count types were always computed regardless of the `-u` flag, which was inflating batch runtimes by ~2.5x.

Usage
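The body of the Usage section did not survive the page extraction, so the following is an illustrative sketch assembled only from flags mentioned elsewhere in this PR (`--streaming`, `--dense`, `--smoothsize`, `--stepsize`, `-u`, `-y`, and stdin/stdout piping); the binary name and exact flag spellings are assumptions, not copied from the original section.

```shell
# Hypothetical invocations; flag names come from the PR description,
# but the exact CLI shape is assumed.

# Streaming mode, sparse WIG output, reading BED from stdin:
cat input.bed | uniwig --streaming --smoothsize 25 --stepsize 1 -u start -y wig > out.wig

# Fully dense output, matching the batch implementation's output:
cat input.bed | uniwig --streaming --dense=-1 -u start -y wig > out.wig
```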
Benchmarks
Synthetic BED data: 10M records, 24 chromosomes, full hg38 chromosome sizes, run with `--smoothsize 25 --stepsize 1 -u start -y wig`.

Sparse streaming is faster (28 s vs 43 s), uses 2000x less memory (4 MB vs 8 GB), and produces 5x smaller output (1.2 GB vs 5.8 GB), all on a single thread versus batch's 6 cores. Comparing dense output to dense output, per-core throughput is roughly equivalent.