Last year I was debugging a telemetry pipeline where “small non-negative integers, most of the time” described half the fields: retry counts, tiny deltas, short run lengths, and the occasional spike. The bug wasn’t in the math—it was in how we encoded values before handing them to a downstream compressor. We had chosen a variable-length code that was correct, but we’d implemented the “prefix” part inconsistently across services, and a single off-by-one created a silent drift.

When I want an encoding that’s almost impossible to misunderstand at a glance, I reach for unary coding. It’s not glamorous, and it’s rarely the final format you store on disk, but it’s an excellent building block: simple to implement, easy to test, and foundational for families of codes like Golomb and Rice.

If you’ve ever wondered how to represent a natural number with nothing but a run of identical bits—and how to do it cleanly in Python—you’re in the right place. I’ll show you how unary encoding and decoding work, how to make the implementation robust (not just “works on happy-path”), where unary fits into real compression systems, and how to think about performance when your “bits” are actually Python objects.

## Unary Codes: The Fast Mental Model
Unary coding (also called thermometer code) represents a natural number n as n ones followed by a zero terminator.

- 0 becomes 0
- 1 becomes 10
- 5 becomes 111110
- 6 becomes 1111110

I like to picture it like a row of LEDs: you turn on n LEDs (the ones), then you place a “stop” marker (the zero) so the decoder knows where the number ends.

A few properties fall out immediately:

- It’s prefix-free when you include the terminating 0. No valid unary code is the prefix of another, because every code ends at the first 0.
- It’s self-synchronizing in a narrow sense: if you start reading at the beginning of a codeword, you can always find its end.
- It’s efficient only for small n.
The length is n + 1 bits, so large values are expensive.

Unary coding often appears as the quotient part of Golomb/Rice codes: you encode a quotient in unary, and the remainder in fixed-width binary. That hybrid is where unary stops being “toy-ish” and becomes genuinely useful.

One extra mental model I keep handy: unary is basically “how many times did something happen?” written directly as a run length. That’s why it shows up naturally in contexts like run-length encoding, counting small events, and coding quotients.

## Encoding Natural Numbers in Python (Clean, Runnable, and Explicit)
When I implement unary encoding in Python, I pick an API that makes incorrect usage hard. Specifically:

- Accept only non-negative integers.
- Return a string of `'0'`/`'1'` for clarity when teaching or debugging.
- Provide a variant that writes to a byte-oriented buffer when you care about space.

Here’s a simple, readable encoder:

    from __future__ import annotations


    def unary_encode(n: int) -> str:
        """Encode a non-negative integer n into unary code as a '0'/'1' string."""
        if not isinstance(n, int):
            raise TypeError(f"n must be int, got {type(n).__name__}")
        if n < 0:
            raise ValueError("n must be non-negative")

        # n ones followed by a terminating zero
        return "1" * n + "0"


    if __name__ == "__main__":
        for value in [0, 1, 5, 8]:
            print(value, "->", unary_encode(value))

This version is “boringly correct,” which is exactly what I want for a primitive.

If you’re coming from an approach that builds a list of integers and joins it, it works, but it allocates more than necessary. A string multiplication is direct and readable.

Here’s a practical comparison I use when reviewing code:

| Traditional Python approach | String multiplication |
| --- | --- |
| Append `"1"` in a loop, append `"0"`, then `"".join(...)` | `"1" * n + "0"` |
| More moving parts | Direct and readable |
| List + intermediate strings | One result string |

If you need the code as bytes (still not bit-packed, but friendlier to I/O), you can do this:

    def unary_encode_bytes(n: int) -> bytes:
        """Unary encode into ASCII bytes, e.g. b'1110'."""
        return unary_encode(n).encode("ascii")

That’s still not a bit-level representation, but it’s sometimes useful when interfacing with systems that already treat these as textual bitstrings.

### A Note on Conventions (The Source of Many Off-By-One Bugs)
Before I go further, I always write down (and test) the exact unary convention. There are a few common ones:

- 1-run terminated by 0 (the one I’m using): n becomes 1...10 with n ones then a zero.
- 0-run terminated by 1 (same idea flipped): n becomes 0...01 with n zeros then a one.
- “Shifted” unary: some codebases encode n as n+1 ones then 0 (so 0 becomes 10). This is sometimes done to avoid an empty run in variations that omit terminators, or to reserve a codeword for “missing.”

None of these are “wrong.” The bug happens when two components silently disagree. If you do nothing else, lock your convention into tests with known vectors.

## Decoding and Validating Unary Bitstrings
Decoding unary is the inverse: count the ones until you hit the terminating zero. The catch is validation. In real code, you should decide what to do with malformed inputs:

- No terminating 0 (e.g. "111")
- Contains characters other than 0/1 (e.g. "11a0")
- Contains additional data after the terminator (e.g.
"1110xxxx")\n\nFor teaching, a simple “count ones” loop is fine, but for production I prefer a decoder that:\n\n- Confirms there is exactly one terminator for the codeword being decoded (or returns the remainder explicitly).\n- Fails loudly on invalid characters.\n- Works well with streaming: decode one number and return how many symbols were consumed.\n\nHere’s a robust decoder for a bitstring held in a Python str:\n\n from future import annotations\n\n\n def unarydecode(code: str, , start: int = 0) -> tuple[int, int]:\n """Decode one unary-coded integer from code[start:].\n\n Returns (value, nextindex) where nextindex is the position after the terminator.\n\n Raises ValueError for malformed encodings.\n """\n if not isinstance(code, str):\n raise TypeError(f"code must be str, got {type(code).name}")\n if start len(code):\n raise ValueError("start out of range")\n\n countones = 0\n i = start\n\n while i < len(code):\n ch = code[i]\n if ch == "1":\n countones += 1\n i += 1\n continue\n if ch == "0":\n # Terminator found\n return countones, i + 1\n raise ValueError(f"invalid character {ch!r} at index {i}")\n\n raise ValueError("unterminated unary code (missing ‘0‘)")\n\n\n if name == "main":\n encoded = "111111110"\n value, nexti = unarydecode(encoded)\n print("decoded:", value, "consumed:", nexti)\n\nNow you can decode a stream of concatenated unary numbers reliably:\n\n def unarydecodemany(stream: str) -> list[int]:\n values: list[int] = []\n i = 0\n while i < len(stream):\n value, i = unarydecode(stream, start=i)\n values.append(value)\n return values\n\n\n if name == "main":\n stream = unaryencode(2) + unaryencode(0) + unaryencode(4)\n print(stream)\n print(unarydecodemany(stream))\n\nFor performance on large strings, a small trick is to search for the terminator and validate the prefix:\n\n def unarydecodefastish(code: str, , start: int = 0) -> tuple[int, int]:\n """Decode using string search; still validates the run."""\n end = code.find("0", start)\n if end 
== -1:\n raise ValueError("unterminated unary code")\n\n run = code[start:end]\n # Validate: the run must be all ‘1‘\n if run and run.strip("1") != "":\n raise ValueError("invalid unary run (contains non-‘1‘)")\n\n return (end – start), end + 1\n\nThis can be faster because find is implemented in C, but the validation step matters if you don’t fully trust input.\n\n### Decoding With Safety Limits (Preventing “Decode Until RAM Dies”)\nOne thing I’ve learned the hard way: variable-length codes need explicit safety limits when they parse untrusted or partially trusted input. Unary is especially vulnerable because a malicious stream can be “just a lot of ones,” forcing the decoder to do linear work and potentially buffer data.\n\nEven if your stream is trusted, truncated frames happen: partial reads, cut files, bad offsets, etc. It’s worth building a decoder that can cap the maximum acceptable value.\n\nHere’s a variant that enforces a maximum run length:\n\n def unarydecodelimited(code: str, , start: int = 0, maxvalue: int = 1000000) -> tuple[int, int]:\n if maxvalue < 0:\n raise ValueError("maxvalue must be non-negative")\n\n countones = 0\n i = start\n while i < len(code):\n ch = code[i]\n if ch == "1":\n countones += 1\n if countones > maxvalue:\n raise ValueError("unary value exceeds limit")\n i += 1\n continue\n if ch == "0":\n return countones, i + 1\n raise ValueError(f"invalid character {ch!r} at index {i}")\n\n raise ValueError("unterminated unary code")\n\nI don’t always turn this on in internal pipelines, but I do turn it on anywhere bytes come from “outside my process.”\n\n## Unary as a Building Block: Golomb and Rice Codes (Why You’ll See It in Real Systems)\nUnary alone is rarely the best final encoding. Its real power shows up inside other codes.\n\nHere’s the typical pattern:\n\n1. Choose a parameter m (or k for Rice codes where m = 2k).\n2. Split your number n into quotient and remainder:\n – q = n // m\n – r = n % m\n3. 
Encode q in unary (q ones then zero).\n4. Encode r in fixed-width binary (for Rice, exactly k bits).\n\nWhy is that useful? Because many real-world distributions are heavily skewed toward small values. Unary handles the small quotient cheaply. The remainder gives you bounded precision without the “n+1 bits” blow-up for every increment.\n\nI’m not going to implement full Golomb coding here (there are careful details around truncated binary for non-power-of-two m), but I will show Rice coding because it’s clean and highlights unary’s role.\n\n from future import annotations\n\n\n def riceencode(n: int, k: int) -> str:\n """Encode non-negative n using Rice code with parameter k (m = 2k)."""\n if n < 0:\n raise ValueError("n must be non-negative")\n if k < 0:\n raise ValueError("k must be non-negative")\n\n m = 1 << k\n q, r = divmod(n, m)\n\n prefix = unaryencode(q)\n suffix = format(r, f"0{k}b") if k > 0 else ""\n return prefix + suffix\n\n\n def ricedecode(code: str, k: int, , start: int = 0) -> tuple[int, int]:\n """Decode one Rice-coded integer; returns (value, nextindex)."""\n q, i = unarydecode(code, start=start)\n\n if k == 0:\n return q, i\n\n if i + k > len(code):\n raise ValueError("truncated Rice remainder")\n\n suffix = code[i:i + k]\n if suffix.strip("01") != "":\n raise ValueError("invalid bits in remainder")\n\n r = int(suffix, 2)\n return (q << k) + r, i + k\n\n\n if name == "main":\n value = 37\n k = 3 # m = 8\n encoded = riceencode(value, k)\n decoded, consumed = ricedecode(encoded, k)\n print("value:", value)\n print("encoded:", encoded)\n print("decoded:", decoded, "consumed:", consumed)\n\nWhen you see unary in compression code “in the wild,” it’s usually this: a short unary prefix plus a fixed-width tail.\n\n### Picking k in Rice Coding (A Practical Rule of Thumb)\nIf you’ve never tuned Rice parameters, here’s how I think about it in practice:\n\n- Larger k means larger remainder (fixed width) but smaller quotient (unary).\n- Smaller k means 
smaller remainder but potentially huge quotient runs.\n\nIf your numbers tend to be small (say 0–7), k=3 can work well because many values have q=0, and unary for q=0 is just 0. If your values tend to be even smaller (0–1), k=0 or k=1 may win.\n\nWhen I have data, I pick k by measuring average code length over a representative sample. When I don’t have data, I start with the rough scale of typical values (how many bits to represent most values) and adjust from there.\n\n## Performance and Memory: When “Bits” in Python Aren’t Bits\nUnary coding is about bits, but Python strings are bytes (and Unicode code points) with non-trivial overhead. If you store unary codes as str, you are doing it for clarity, testing, or interoperability—not for compact storage.\n\nHere’s how I think about the trade:\n\n- str bitstrings are great for:\n – teaching and debugging\n – writing unit tests with expected literals\n – quick prototypes\n- str bitstrings are not great for:\n – real compression ratios\n – high-throughput pipelines\n\nIf you truly care about size, you want bit-packing. Python doesn’t ship a “bitstring” in the standard library, so you pick one of these strategies:\n\n1. Pack bits into int (good for small sequences)\n2. Pack bits into bytearray with manual bit operations (good for control)\n3. Use a third-party bit array library (good ergonomics, extra dependency)\n\nEven if you stay with str, you should avoid accidental quadratic behavior. For example, repeated string concatenation in a loop can get expensive because each + may allocate a new string.\n\nI keep these rules in my head:\n\n- Building a unary string: "1" n + "0" is usually fine.\n- Decoding from a string: a loop is fine; find is often faster for long runs.\n- Packing: only bother when you measure and you see memory pressure or I/O becomes the bottleneck.\n\nIn a typical Python service, unary coding in str form is often “fast enough” for configuration-scale data, feature flags, small logs, or teaching demos. 
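
To see the gap between “string bits” and real bits concretely, here’s a quick check. The exact `sys.getsizeof` number is a CPython implementation detail, so treat it as indicative rather than guaranteed:

```python
import sys

# Unary codeword for n = 1000, held as text: 1001 characters.
code = "1" * 1000 + "0"

chars = len(code)                 # 1001 "bits" stored as characters
str_bytes = sys.getsizeof(code)   # object header + roughly 1 byte per ASCII char on CPython
packed_bytes = (chars + 7) // 8   # the same codeword bit-packed: 126 bytes

print(chars, str_bytes, packed_bytes)
```

On CPython the str typically weighs in around a kilobyte of heap, versus 126 bytes for the packed form (plus the bytes object’s own overhead)—roughly an order of magnitude.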
For bulk telemetry, you’ll likely want a binary format.

Here’s a minimal bit writer/reader pair that packs unary codes into bytes. This is not meant to be the final word; it’s a clear foundation you can extend.

    from __future__ import annotations


    class BitWriter:
        def __init__(self) -> None:
            self.buf = bytearray()
            self.current = 0
            self.bits_filled = 0  # number of bits already written into current

        def write_bit(self, bit: int) -> None:
            if bit not in (0, 1):
                raise ValueError("bit must be 0 or 1")

            self.current = (self.current << 1) | bit
            self.bits_filled += 1
            if self.bits_filled == 8:
                self.buf.append(self.current)
                self.current = 0
                self.bits_filled = 0

        def write_bits(self, value: int, width: int) -> None:
            # Most-significant bit first
            for shift in range(width - 1, -1, -1):
                self.write_bit((value >> shift) & 1)

        def write_unary(self, n: int) -> None:
            if n < 0:
                raise ValueError("n must be non-negative")
            for _ in range(n):
                self.write_bit(1)
            self.write_bit(0)

        def finish(self) -> bytes:
            # Zero-pad the last partial byte
            while self.bits_filled != 0:
                self.write_bit(0)
            return bytes(self.buf)


    class BitReader:
        def __init__(self, data: bytes) -> None:
            self.data = data
            self.pos = 0  # bit position from the start of data

        def read_bit(self) -> int:
            byte_index, bit_index = divmod(self.pos, 8)
            if byte_index >= len(self.data):
                raise ValueError("out of bits")
            self.pos += 1
            return (self.data[byte_index] >> (7 - bit_index)) & 1

        def read_bits(self, width: int) -> int:
            value = 0
            for _ in range(width):
                value = (value << 1) | self.read_bit()
            return value

        def read_unary(self, *, max_value: int | None = None) -> int:
            count = 0
            while self.read_bit() == 1:
                count += 1
                if max_value is not None and count > max_value:
                    raise ValueError("unary value exceeds limit")
            return count

A packed stream also needs framing: once codes are bit-packed, the decoder has to know where to stop, because the zero padding at the end can look like data. Two simple strategies:

1. Count-framed: prefix the stream with a count N, then decode exactly N unary integers.
2. Bit-length-framed: prefix the stream with a bit length L, then decode until you’ve consumed L bits.

Count-framing is simpler if you know the number of values. Bit-length-framing is better if you’re embedding into a larger byte stream and want to skip quickly.

Here’s a simple “count-framed” binary format using standard Python building blocks. I’m not packing the header in a fancy way; I’m prioritizing clarity:

    import struct


    def pack_unary_values(values: list[int]) -> bytes:
        """Pack a list of non-negative ints as: u32 count + bitstream bytes."""
        for v in values:
            if not isinstance(v, int):
                raise TypeError("all values must be int")
            if v < 0:
                raise ValueError("all values must be non-negative")

        w = BitWriter()
        for v in values:
            w.write_unary(v)
        payload = w.finish()

        header = struct.pack(">I", len(values))
        return header + payload


    def unpack_unary_values(data: bytes, *, max_value: int | None = None) -> list[int]:
        if len(data) < 4:
            raise ValueError("truncated header")
        (count,) = struct.unpack(">I", data[:4])
        r = BitReader(data[4:])
        out: list[int] = []
        for _ in range(count):
            out.append(r.read_unary(max_value=max_value))
        return out

This gets you a “real” binary format with the crucial property that the decoder knows how many values to read (and therefore won’t accidentally treat padding as extra numbers).

If you care about storing or transmitting large lists, you’ll probably want a more compact header (a variable-length integer for the count, for example). But even then, I like starting with something boring, correct, and inspectable.

## Common Pitfalls and Edge Cases (The Stuff That Breaks Pipelines)
Unary coding is simple, which makes mistakes feel embarrassing—and also common. These are the issues I watch for in reviews:

1. Off-by-one meaning
   - Some teams encode n as n ones then 0.
   - Others encode n as n+1 ones then 0.
   - You need one convention, documented in tests.
If you’re encoding counts where zero is frequent, 0 -> 0 is natural.

2. Assuming inputs are always valid
   - If unary-decoding is reading untrusted data, validate characters and terminator presence.
   - If data is trusted but may be truncated (network boundaries), surface “truncated” errors explicitly.

3. Forgetting framing when bit-packed
   - A raw bitstream with unary codes needs a boundary: a count, a total bit length, or a higher-level container.
   - Padding bits can look like valid unary for 0 if you’re not careful.

4. Mixing up “string bits” vs real bits
   - A str containing "1110" is 4 characters, not 4 bits of storage.
   - For memory-sensitive systems, use packed bytes.

5. Ignoring negative numbers or non-integers
   - Unary is defined for natural numbers (non-negative integers). Decide early: do you reject negatives, or map signed integers via zigzag encoding?

I’ll expand on (5) because it’s a classic “we’ll never need negatives” assumption that always becomes false at the worst possible time.

## Signed Integers: Zigzag Mapping Before Unary (When You Need It)
Unary itself only covers non-negative integers. If you need to encode signed integers (like small deltas that can be negative), I recommend a mapping step first. The most common mapping is zigzag encoding, which interleaves positive and negative values so that small-magnitude signed integers map to small non-negative integers:

- 0 -> 0
- -1 -> 1
- +1 -> 2
- -2 -> 3
- +2 -> 4

In code, I usually implement zigzag like this (it works for Python’s unbounded ints too):

    def zigzag_encode(x: int) -> int:
        """Map signed int -> non-negative int, favoring small magnitudes."""
        if not isinstance(x, int):
            raise TypeError("x must be int")
        # For x >= 0: 2x; for x < 0: -2x - 1
        return (x << 1) if x >= 0 else ((-x << 1) - 1)


    def zigzag_decode(u: int) -> int:
        """Inverse mapping: non-negative int -> signed int."""
        if not isinstance(u, int):
            raise TypeError("u must be int")
        if u < 0:
            raise ValueError("u must be non-negative")
        return (u >> 1) if (u & 1) == 0 else -((u >> 1) + 1)

Once you have that, you can build “signed unary” as a composition: encode(x) = unary_encode(zigzag_encode(x)) and decode the reverse way.

Is it always a good idea? No. Unary grows linearly, so if your signed values can spike (say a delta of 2000), unary becomes expensive fast. Zigzag helps only if your values are genuinely concentrated around 0 and spikes are rare or handled separately.

### Example: Encoding Signed Deltas
I often end up encoding deltas (differences) rather than raw values, because deltas are usually smaller and more compressible.
Here’s a clear, testable version using string unary for readability:

    def deltas(values: list[int]) -> list[int]:
        if not values:
            return []
        out = [values[0]]
        for i in range(1, len(values)):
            out.append(values[i] - values[i - 1])
        return out


    def undeltas(deltas: list[int]) -> list[int]:
        if not deltas:
            return []
        out = [deltas[0]]
        for i in range(1, len(deltas)):
            out.append(out[-1] + deltas[i])
        return out


    def encode_signed_deltas(values: list[int]) -> str:
        parts: list[str] = []
        for d in deltas(values):
            parts.append(unary_encode(zigzag_encode(d)))
        return "".join(parts)


    def decode_signed_deltas(stream: str) -> list[int]:
        ds: list[int] = []
        i = 0
        while i < len(stream):
            u, i = unary_decode(stream, start=i)
            ds.append(zigzag_decode(u))
        return undeltas(ds)

That pattern—delta + zigzag + variable-length coding—is a workhorse in compression and telemetry. Unary isn’t always the right variable-length code, but the structure is extremely common.

## Where Unary Shines (And Where It Doesn’t)
Unary has a very specific sweet spot: values that are frequently small, plus a decoding context that benefits from simplicity and predictability. Here are the scenarios where I actually like using it.

### Great Fits
- Quotients in Rice/Golomb codes: unary is doing exactly what it’s good at—representing a small quotient with a simple prefix.
- Run-length encoding (RLE) of short runs: if you mostly see runs of length 0–3 with occasional longer runs, unary can be a reasonable “length code,” especially inside a larger format.
- Sparse signals: sometimes you store distances between events (how many zeros until the next one). Those distances are natural numbers and can be small if events are common. Unary can model that directly.
- Debuggability-first formats: internal debug payloads, test fixtures, “golden files” where being able to eyeball the encoding matters more than saving bytes.

### Bad Fits
- Uniformly distributed integers over a wide range: unary will be huge compared to fixed-width binary.
- Anything with frequent large spikes, unless you have an escape hatch (see below).
- High-throughput storage formats where space efficiency is the primary goal: you’ll almost always prefer something like Rice, Elias codes, or byte-oriented varints.

### A Practical Escape Hatch: Unary With a Stop-Code
If I need to keep unary’s simplicity but avoid pathological blow-ups, I’ll sometimes add an escape mechanism. The idea: unary handles small values; a special pattern signals “large value follows in fixed-width” (or varint).

For example (conceptual):
- Encode n in unary if n < T
- Otherwise encode unary T (a run of T ones then 0) followed by a larger integer encoding for n - T

This keeps the common case tiny but prevents “ten million ones” from ever appearing. It does complicate the decoder, so I only do it when I have evidence I need it.

## A Deeper Implementation: Bit-Packed Rice Coding End-to-End
Earlier I showed Rice coding with string bits because it’s easy to see. If I’m actually shipping this, I want the packed bit version. Unary becomes a prefix in a packed stream, and the remainder is a fixed number of bits.

Here’s a compact, readable implementation using the BitWriter/BitReader from above. It’s not optimized for extreme speed, but it is correct and testable.

    def rice_write(writer: BitWriter, n: int, k: int) -> None:
        if n < 0:
            raise ValueError("n must be non-negative")
        if k < 0:
            raise ValueError("k must be non-negative")

        m = 1 << k
        q, r = divmod(n, m)
        writer.write_unary(q)
        writer.write_bits(r, k)


    def rice_read(reader: BitReader, k: int, *, max_q: int | None = None) -> int:
        q = reader.read_unary(max_value=max_q)
        if k == 0:
            return q
        r = reader.read_bits(k)
        return (q << k) + r

Wrapping it in the same count-framed idea gives a complete container: a header with a u32 count and a u8 k, then N Rice-coded values.

    import struct


    def pack_rice(values: list[int], k: int) -> bytes:
        if not (0 <= k <= 31):
            raise ValueError("k out of a reasonable range for this demo")
        for v in values:
            if v < 0:
                raise ValueError("values must be non-negative")

        w = BitWriter()
        for v in values:
            rice_write(w, v, k)
        payload = w.finish()

        # header: u32 count, u8 k
        header = struct.pack(">IB", len(values), k)
        return header + payload


    def unpack_rice(data: bytes, *, max_q: int | None = None) -> tuple[list[int], int]:
        if len(data) < 5:
            raise ValueError("truncated header")
        count, k = struct.unpack(">IB", data[:5])
        r = BitReader(data[5:])

        out: list[int] = []
        for _ in range(count):
            out.append(rice_read(r, k, max_q=max_q))
        return out, k

That’s already enough to build a real experiment: run it on your sample distributions, compute average bits per value, and compare it to other encodings.

## Testing: How I Prove My Encoder/Decoder Won’t Drift
Unary is “simple,” which is exactly why I’m strict about testing it. I don’t want a subtle mismatch to linger for months.

Here’s the testing approach I use:

- Known vectors: hard-code small values and expected encodings (human-auditable).
- Round-trip tests: encode then decode equals original.
- Fuzz tests: lots of random values, including extremes.
- Malformed input tests: ensure failures are loud and specific.

You can do all of this with the standard library.
Here’s a unittest suite that covers the essentials without external dependencies:

    import random
    import unittest


    class TestUnary(unittest.TestCase):
        def test_known_vectors(self) -> None:
            self.assertEqual(unary_encode(0), "0")
            self.assertEqual(unary_encode(1), "10")
            self.assertEqual(unary_encode(2), "110")
            self.assertEqual(unary_encode(5), "111110")

        def test_round_trip_small(self) -> None:
            for n in range(0, 1000):
                code = unary_encode(n)
                back, i = unary_decode(code)
                self.assertEqual(back, n)
                self.assertEqual(i, len(code))

        def test_round_trip_random(self) -> None:
            rng = random.Random(0)
            for _ in range(10000):
                n = rng.randrange(0, 50000)
                code = unary_encode(n)
                back, i = unary_decode(code)
                self.assertEqual(back, n)
                self.assertEqual(i, len(code))

        def test_decode_many(self) -> None:
            values = [0, 1, 5, 2, 0, 9]
            stream = "".join(unary_encode(v) for v in values)
            self.assertEqual(unary_decode_many(stream), values)

        def test_malformed_missing_terminator(self) -> None:
            with self.assertRaises(ValueError):
                unary_decode("111")

        def test_malformed_invalid_char(self) -> None:
            with self.assertRaises(ValueError):
                unary_decode("11a0")


    if __name__ == "__main__":
        unittest.main()

If you’re writing a library rather than a one-off script, I recommend also testing boundaries like very large n (within reason) and ensuring your “limits” behave as intended.

### A Quick Performance Sanity Check (Without Obsessing)
I’m careful not to over-benchmark early, but I do like one quick sanity check when I change an implementation detail. In Python, timeit is a great lightweight tool:

    import timeit


    def bench() -> None:
        print(timeit.timeit("unary_encode(1000)", number=50000, globals=globals()))
        s = unary_encode(1000)
        print(timeit.timeit("unary_decode(s)", number=50000, globals={"unary_decode": unary_decode, "s": s}))

This doesn’t give you a universal truth (hardware, Python version, workload all matter), but it catches obvious regressions like accidentally turning a linear operation into something worse.

## Alternatives: What to Use When Unary Isn’t Enough
Unary is a tool, not a religion. When it’s not a fit, I reach for one of these patterns instead—still in the same “variable-length integer encoding” family, but better suited to certain distributions or constraints.

### Byte-Oriented Varints
If you want something that’s easy to implement, compact for moderately small integers, and efficient for streaming over bytes (not bits), byte-oriented varints are a strong default. They trade a little overhead (you move in 7-bit chunks, for example) for a much simpler I/O story.

Unary is bit-oriented and shines inside bit-level codes; varints are byte-oriented and shine in network protocols, storage records, and “lots of integers” data structures.

### Elias Gamma / Delta (Bit-Level Universal Codes)
If you want a pure bit-level code for positive integers with better asymptotic behavior than unary, Elias gamma and Elias delta codes are classic options. They’re still prefix-free, still self-delimiting, and they grow roughly like O(log n) rather than O(n).

Unary tends to win only when values are extremely small and extremely frequent. Once values spread out, universal codes can be a better fit.

### Huffman / Arithmetic Coding
If you have a known distribution and you want near-optimal average code length, entropy coding (Huffman or arithmetic/range coding) is where you end up. Unary doesn’t compete here; it’s often a subcomponent or a baseline.

My rule of thumb: I start with Rice/Golomb for geometric-ish distributions of non-negative integers, and only move to heavier machinery when measurements justify it.

## A Practical “How I’d Use This” Checklist
When I’m about to use unary (or any variable-length code) in a real Python system, I run through this checklist:

- Define the exact convention (n ones then 0, or something else). Put it in a docstring and tests.
- Decide on your representation: str for clarity/tests, packed bits for size/throughput.
- Add framing if you’re producing a standalone stream (count or bit length).
- Add safety limits if input is untrusted or could be corrupted (max value/run length).
- Write round-trip tests and malformed-input tests before optimizing anything.
- If values can be negative, add zigzag (or another mapping) explicitly; don’t hand-wave it.
- Measure average bits/value on representative data before committing to the format.

Unary coding is one of those primitives I keep coming back to because it’s almost impossible to get lost in. When the goal is correctness, interoperability, and easy audits—especially when you’re building a larger code like Rice—unary is a surprisingly powerful foundation.

And when it’s not the right tool, it still teaches a useful lesson: don’t underestimate the value of a code you can explain (and test) in one screen of Python.
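
As a parting sketch, if you want to experiment with the Elias gamma comparison from the alternatives section, here’s a minimal version in the same string-bits style (the names are mine, and note the flipped convention: gamma traditionally uses a run of zeros, with the leading 1 of the binary part acting as the terminator):

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code for a positive integer n, as a '0'/'1' string."""
    if not isinstance(n, int) or n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]                  # binary form, always starts with '1'
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(code: str, *, start: int = 0) -> tuple[int, int]:
    """Decode one gamma-coded integer from code[start:]; returns (value, next_index)."""
    zeros = 0
    i = start
    while i < len(code) and code[i] == "0":
        zeros += 1
        i += 1
    end = i + zeros + 1                  # the zero run tells us the binary part's width
    if i >= len(code) or end > len(code):
        raise ValueError("truncated Elias gamma code")
    return int(code[i:end], 2), end


# 1 -> "1", 2 -> "010", 5 -> "00101": length grows like 2*floor(log2 n) + 1,
# versus n + 1 for unary -- the same prefix idea, much better asymptotics.
```

Same shape as the unary decoder (value plus next index), so it slots into the streaming and framing patterns above unchanged.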


