The bug always shows up at the boundary.
A few months ago I was wiring a Python service into a third-party signing flow. Locally, everything looked fine: I printed the payload, I logged the headers, I even copy-pasted the text into a debugger. Then the remote API started rejecting requests with a vague “invalid signature” response. The reason was painfully small: I was hashing a Python str in one place and hashing bytes in another. The payload looked identical when printed, but the byte sequences were not.
If you work with HTTP, files, cryptography, message queues, image/audio data, or anything that touches the network, you will convert text to bytes constantly. The good news: Python makes this easy. The bad news: it is also easy to do it almost right.
I am going to show you the main ways I convert a string into bytes, when I choose each one, what can go wrong (especially with non-ASCII text), and how to keep your code predictable.
Strings vs bytes: two kinds of “data”
In Python 3, str and bytes are intentionally different types.
- str is text: a sequence of Unicode characters.
- bytes is binary data: a sequence of integers in the range 0..255.
A simple analogy I use when teaching this: a str is the idea of the message (characters), while bytes is the shipping container (octets) you send over a wire or write to disk.
That gap between idea and container is bridged by an encoding.
- Encoding: convert str -> bytes (example: UTF-8)
- Decoding: convert bytes -> str
A quick sanity check you can run any time:
message_text = 'Hello'
message_bytes = message_text.encode('utf-8')
print(type(message_text), message_text)
print(type(message_bytes), message_bytes)
If your mental model is only one sentence, make it this: text is not bytes until you encode it.
The default move: str.encode() (what I reach for first)
When I have a real piece of text (human language, JSON, CSV, headers, log lines), I almost always start with encode().
greeting = 'Hello, World!'
packet = greeting.encode('utf-8')
print(packet)
Output:
b'Hello, World!'
Why I like encode():
- It reads like what you mean: “encode this text”
- It keeps the encoding decision close to the data
- It has an errors= parameter that matters in production
Choosing an encoding: default to UTF-8, be explicit at boundaries
UTF-8 is the practical default in modern systems, and Python uses it pervasively. Still, I recommend you pass it explicitly when you are crossing a boundary (file, socket, DB driver, crypto, external service). It makes future debugging easier.
username = 'café'
raw = username.encode('utf-8')
print(raw)
If you encode the same string as ASCII, you will get an error because ASCII cannot represent “é”:
username = 'café'
try:
    raw = username.encode('ascii')
except UnicodeEncodeError as exc:
    print('encode failed:', exc)
The errors= parameter: pick a policy, do not wing it
By default, encode() uses errors='strict', which raises an exception when a character cannot be represented.
In services, I treat this as a policy decision:
- strict: best for correctness; fail fast
- replace: keep going, but you may lose information
- ignore: drops characters silently; I avoid this unless I am intentionally filtering
- backslashreplace: keeps a visible escape form, useful for logs
Example:
label = 'café'
print(label.encode('ascii', errors='replace'))
print(label.encode('ascii', errors='backslashreplace'))
Expected output (shape):
b'caf?'
b'caf\\xe9'
If you are dealing with OS-level bytes (filenames, environment variables) on Unix, errors='surrogateescape' is sometimes the right tool, but I treat it as a specialized interop escape hatch rather than a general solution.
Round-tripping: verify the path both ways
Whenever you are unsure, do a round-trip in a quick REPL check:
original = 'Payment received: $19.99'
wire = original.encode('utf-8')
restored = wire.decode('utf-8')
assert restored == original
Traditional vs modern practice (what I recommend in 2026)
A lot of Python bugs come from implicit conversions and invisible defaults. Here is the pattern shift I push on teams:
Older habit -> modern practice:
- Pass str and hope the library converts -> call .encode('utf-8') explicitly
- Mix binary/text file modes -> str in text mode, or explicitly encode for binary mode
- errors='ignore' -> strict for correctness, replace or backslashreplace for logging
- Print strings -> print repr(...) and inspect actual bytes

The bytes() constructor: useful, but know the sharp edges
You can also convert a string to bytes using the bytes() constructor by providing the encoding.
payload_text = 'Hello, World!'
payload_bytes = bytes(payload_text, 'utf-8')
print(payload_bytes)
This is functionally close to payload_text.encode('utf-8'). When do I pick bytes()?
- When I am writing APIs that accept str or bytes and I want to normalize inputs
- When I want the conversion to read as "construct bytes from this"
Here is a pattern I use in libraries to accept either type without surprising behavior:
def ensure_bytes(value, *, encoding='utf-8'):
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError(f'Expected str or bytes, got {type(value).__name__}')

api_token = 'p9F2k3…'
print(ensure_bytes(api_token))
print(ensure_bytes(b'already-bytes'))
Two bytes() gotchas I see in code reviews
1) bytes(10) does not encode the string “10”
It creates ten zero bytes:
print(bytes(10))
2) bytes(b‘data‘) makes a copy
Sometimes that copy is fine. If you are handling large buffers and you want a view instead, look at memoryview.
chunk = b'A' * 1024
copy = bytes(chunk)
view = memoryview(chunk)
print(len(copy), len(view))
For string-to-bytes conversions, the main point is: bytes(text, encoding) is correct and explicit, but in most application code I still prefer text.encode(encoding) because it is harder to misuse.
bytearray(): when you need bytes you can edit
A bytes object is immutable. That is a feature for safety and hashability, but it is annoying when you need to patch or build binary payloads.
bytearray is the mutable cousin: same 0..255 elements, but you can change them.
header_text = 'HELLO'
buffer = bytearray(header_text, 'utf-8')
buffer[0] = ord('h')
print(buffer)
print(bytes(buffer))
Why mutability matters in real work:
- Building a custom binary protocol where you fill in a length field later
- Editing a message header without allocating a new bytes object each time
- Incrementally assembling output in tight loops
A practical example: prefix a message with a 4-byte big-endian length. I often build the frame in a bytearray and then convert to bytes once.
import struct
message_text = 'status=ok;user=alice'
message_bytes = message_text.encode('utf-8')
frame = bytearray()
frame += struct.pack('>I', len(message_bytes))
frame += message_bytes
wire = bytes(frame)
print(wire)
If you only need an immutable result, do not keep bytearray around longer than necessary. Convert to bytes at the boundary, especially if you plan to store it as a key in a dict, cache it, or pass it into code that expects immutability.
Manual ASCII encoding with ord(): valid for narrow cases
Sometimes you will see code that maps characters to integers with ord() and then constructs bytes.
word = 'Hello'
raw = bytes([ord(ch) for ch in word])
print(raw)
For pure ASCII text, this works because ASCII code points fit in 0..127.
Where it breaks: the moment you step outside that range.
word = 'café'
try:
    raw = bytes([ord(ch) for ch in word])
    print(raw)
except ValueError as exc:
    print('failed:', exc)
You will typically get a ValueError because ord('é') is greater than 255... actually ord('é') is 233, which fits in a byte; the failure appears with characters above 255 (for example, ord('€') is 8364), because bytes([...]) refuses integers outside 0..255.
So when do I use manual ord() mapping?
- When I am working with a protocol that is explicitly ASCII-only
- When I am teaching the concept of encodings at the byte level
- When I want to generate specific byte values and text is not really the goal
If your data is "text that humans typed", this approach is usually the wrong tool. Use encode('utf-8') and move on.
The boundary patterns: files, HTTP, sockets, hashing, and subprocess
If you want fewer encoding bugs, get strict about boundaries: keep text as str inside your application, and convert to bytes at the edges.
Files: text mode vs binary mode
When you open a file in text mode ('r' or 'w'), Python will encode/decode for you using an encoding. When you open in binary mode ('rb' or 'wb'), you must deal with bytes yourself.
Text mode (good when you are working with text data):
report_line = 'user=alice, total=$19.99\n'
with open('report.txt', 'w', encoding='utf-8') as f:
    f.write(report_line)
Binary mode (good when you need exact bytes, such as a custom format):
report_line = 'user=alice, total=$19.99\n'
with open('report.bin', 'wb') as f:
    f.write(report_line.encode('utf-8'))
My rule: if the file is conceptually text, pick text mode with an explicit encoding. If the file is conceptually binary, pick binary mode and encode explicitly.
HTTP and JSON: bytes on the wire, text in the model
Most HTTP client libraries accept both str and bytes in some places, which can hide problems.
- JSON libraries typically want str objects (because JSON is a text format).
- Signing, hashing, compression, and encryption functions typically want bytes.
Example: JSON string in, bytes out for hashing.
import json
import hashlib
payload_obj = {'event': 'payment.succeeded', 'amount': 1999}
payload_text = json.dumps(payload_obj, separators=(',', ':'), ensure_ascii=False)
payload_bytes = payload_text.encode('utf-8')
digest_hex = hashlib.sha256(payload_bytes).hexdigest()
print(payload_text)
print(digest_hex)
Note the ensure_ascii=False choice: it keeps Unicode characters readable in the JSON text, but the important part is that you encode with UTF-8 before hashing.
Sockets and asyncio: always send bytes
Sockets send bytes. If you have text, encode it.
import asyncio
async def send_line(host, port, line_text):
    reader, writer = await asyncio.open_connection(host, port)
    writer.write((line_text + '\n').encode('utf-8'))
    await writer.drain()
    writer.close()
    await writer.wait_closed()
If you find yourself calling .encode() repeatedly inside a loop, consider encoding once and reusing the bytes.
Subprocess: choose text or bytes intentionally
In subprocess.run, you can keep data as bytes or ask Python to decode it.
import subprocess
result = subprocess.run(
    ['python', '--version'],
    capture_output=True,
    text=True,
    encoding='utf-8',
)
print(result.stdout.strip() or result.stderr.strip())
If you leave text=False (the default), you will receive bytes and can decode them yourself.
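A minimal sketch of that bytes-mode path (using sys.executable so the example runs regardless of how Python is installed):

```python
import subprocess
import sys

# text=False is the default: stdout and stderr arrive as raw bytes.
result = subprocess.run(
    [sys.executable, '--version'],
    capture_output=True,
)
# Decode at the boundary, with an explicit encoding and error policy.
version_text = (result.stdout or result.stderr).decode('utf-8', errors='replace')
print(version_text.strip())
```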
Unicode edge cases that surprise even experienced devs
The classic trap is assuming “same text” means “same bytes”. Unicode makes that false.
Normalization: visually identical strings can encode differently
Some characters can be represented in multiple ways (a composed character vs a base letter plus a combining mark). They can look identical on screen and still be different sequences of code points.
If you are generating stable identifiers (hash keys, signatures, filenames) from user-visible text, I often normalize first.
import unicodedata
label_a = 'café'
label_b = unicodedata.normalize('NFD', label_a)
print(label_a == label_b)
norm_a = unicodedata.normalize('NFC', label_a)
norm_b = unicodedata.normalize('NFC', label_b)
print(norm_a == norm_b)
print(norm_a.encode('utf-8'))
print(norm_b.encode('utf-8'))
My guidance:
- For display: keep the original user input.
- For comparisons and stable byte output: normalize (often NFC), then encode.
Newlines and platform differences
If your bytes are used for signing or hashing, normalize newlines ('\n' vs '\r\n') before encoding. Otherwise, a payload generated on Windows can hash differently than the same text generated on Linux or macOS.
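A minimal sketch of that normalization (canonical_text_bytes is a hypothetical helper name, not a library function):

```python
import hashlib

def canonical_text_bytes(text: str) -> bytes:
    # Hypothetical helper: fold CRLF and lone CR to LF before encoding,
    # so the same logical text produces identical bytes on every platform.
    normalized = text.replace('\r\n', '\n').replace('\r', '\n')
    return normalized.encode('utf-8')

windows_payload = 'line1\r\nline2\r\n'
unix_payload = 'line1\nline2\n'
assert (hashlib.sha256(canonical_text_bytes(windows_payload)).hexdigest()
        == hashlib.sha256(canonical_text_bytes(unix_payload)).hexdigest())
```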
Do not trust implicit encodings
Relying on defaults is how bugs hide.
- Always set encoding='utf-8' when opening text files you control.
- Always pass an encoding when converting str to bytes for storage or transmission.
In 2026, I also let tooling help: type checkers (Pyright, mypy) and linters can catch a lot of accidental str/bytes mixing early, especially if your functions annotate the boundary types clearly.
Performance and memory notes (the practical version)
Encoding is linear work: Python has to walk the string and produce bytes. For typical web payloads, it is fast enough that you should not contort your code. Still, there are a few patterns that keep latency predictable.
Encode once per payload, not once per fragment
If you build a message as text, build it fully as str, then encode one time.
Better:
parts = ['user=alice', 'status=ok', 'amount=1999']
line_text = ';'.join(parts)
line_bytes = line_text.encode('utf-8')
Worse (unnecessary repeated encoding work):
parts = ['user=alice', 'status=ok', 'amount=1999']
line_bytes = b';'.join(p.encode('utf-8') for p in parts)
That “worse” version is not always terrible, but it makes it easier to accidentally encode with different settings per part.
Prefer bytearray for incremental binary assembly
If you are appending many small chunks, a bytearray can reduce intermediate allocations. Then convert to bytes once at the end.
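A tiny sketch of the append-then-freeze pattern:

```python
buf = bytearray()
for i in range(5):
    # += mutates the same buffer in place; no throwaway bytes objects.
    buf += f'chunk{i};'.encode('utf-8')

wire = bytes(buf)  # freeze once at the end
print(wire)  # prints b'chunk0;chunk1;chunk2;chunk3;chunk4;'
```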
Be careful with very large strings
If you are encoding multi-megabyte strings, you will see noticeable time and memory use. In services, this often shows up as:
- uploading large CSV/JSON
- batch exports
- log aggregation
In those cases, I try to avoid holding everything in memory at once. For example, write incrementally to a file or stream chunks through the pipeline.
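For example, a sketch that streams rows to a file instead of joining one giant str first (write_rows and the file path are my own illustration; the text-mode handle encodes each chunk as it is written):

```python
import os
import tempfile

def write_rows(path, rows):
    # Hypothetical helper: the text-mode file object encodes each row
    # incrementally, so the full payload is never held in memory at once.
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        for row in rows:
            f.write(row + '\n')

path = os.path.join(tempfile.gettempdir(), 'export.csv')
write_rows(path, (f'user={i}' for i in range(3)))
with open(path, 'rb') as f:
    print(f.read())  # prints b'user=0\nuser=1\nuser=2\n'
```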
I do not micro-benchmark every encoding call, but as a rough feel: encoding a few KB is usually sub-millisecond; encoding many MB can take tens of milliseconds and allocate similarly sized buffers. That is enough to matter in tight loops.
Common mistakes I see (and how I prevent them)
This is the section I wish someone had handed me early. Almost every “mysterious” encoding bug I’ve debugged falls into one of these buckets.
1) Hashing str instead of bytes
Most crypto and hashing APIs in Python accept bytes-like objects. If you accidentally hand them str, you’ll get a TypeError quickly. The subtler bug is when you hash different byte representations in different places.
I prevent this by making the boundary explicit:
import hashlib
def sha256_hex_text(text: str, *, encoding: str = 'utf-8') -> str:
    return hashlib.sha256(text.encode(encoding)).hexdigest()

print(sha256_hex_text('café'))
If you see code like hashlib.sha256(str(obj).encode()).hexdigest(), I treat it as a smell. It might work “locally”, but it makes the byte representation dependent on a stringification decision (and an implicit encoding) that can change.
2) Forgetting that JSON text must be identical for signatures
When signing JSON, two documents can be semantically equal but textually different.
- Key order can differ.
- Whitespace can differ.
- Escaping choices can differ (ensure_ascii=True vs ensure_ascii=False).
If a remote service signs the exact bytes on the wire, you must sign those exact bytes too. That often means:
- Use canonical JSON formatting (stable separators, stable key order).
- Encode as UTF-8.
- Normalize newlines if your payload has embedded text.
Example pattern I use:
import json
def canonical_json_bytes(obj) -> bytes:
    text = json.dumps(
        obj,
        ensure_ascii=False,
        separators=(',', ':'),
        sort_keys=True,
    )
    return text.encode('utf-8')
3) Mixing text-mode and binary-mode file handling
I regularly see code that opens a file in binary mode and then tries to .write() a str.
Bad:
with open('out.bin', 'wb') as f:
    f.write('hello')  # TypeError: a bytes-like object is required, not 'str'
Better (conceptually text):
with open('out.txt', 'w', encoding='utf-8', newline='\n') as f:
    f.write('hello\n')
Better (conceptually bytes):
with open('out.bin', 'wb') as f:
    f.write('hello'.encode('utf-8'))
I like choosing one world per file: either it’s text and stays str until the file handle, or it’s bytes and stays bytes from the start.
4) Relying on the platform default encoding
If you do not pass an encoding to open(...) in text mode, Python uses a default that depends on your system configuration.
I treat “encoding unspecified” as “bug waiting to happen.” For files your application owns, I almost always do:
with open('data.txt', 'r', encoding='utf-8') as f:
    content = f.read()
If you have to interact with a legacy system that uses something else, encode/decode explicitly and label it (variable name, function name, docs).
5) Assuming len(text) equals len(text.encode(...))
A Unicode character is not "one byte." len('é') is 1, but len('é'.encode('utf-8')) is 2.
If a protocol wants a byte length field, compute the length from the bytes:
text = 'naïve café'
data = text.encode(‘utf-8‘)
print(len(text), len(data))
6) Confusing “bytes” with “printable bytes”
I see base64 and hex mixed up with raw bytes all the time.
- Raw bytes are what crypto and sockets want.
- Base64 and hex are text encodings of bytes meant for logging, JSON, URLs, or storage layers that expect printable characters.
If you take a base64 string and treat it like raw bytes, you will not get the original data.
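A quick way to see the difference: hashing the base64 text and hashing the decoded bytes produce different digests (the key material here is made up for illustration).

```python
import base64
import hashlib

b64_text = base64.b64encode(b'raw-key-material').decode('ascii')

# Wrong: this hashes the printable base64 characters, not the key bytes.
wrong = hashlib.sha256(b64_text.encode('utf-8')).hexdigest()
# Right: recover the raw bytes first, then hash those.
right = hashlib.sha256(base64.b64decode(b64_text)).hexdigest()
print(wrong == right)  # prints False
```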
7) Logging bytes incorrectly and hiding the real bug
print(payload) often lies by omission. I prefer repr(...) when I’m debugging boundaries.
Example:
payload_text = 'line1\nline2'
payload_bytes = payload_text.encode('utf-8')
print(payload_text)
print(repr(payload_text))
print(payload_bytes)
print(repr(payload_bytes))
The repr output makes invisible characters visible.
8) Overusing errors=‘ignore‘
Silently dropping characters is almost never what you want. When I see errors='ignore', I ask:
- Are we intentionally filtering, or are we hiding corruption?
- Would replace be safer?
- Would backslashreplace preserve evidence for later?
For example, for logs I’ll sometimes do:
raw = user_input.encode('ascii', errors='backslashreplace')
That keeps the log line printable but still leaves a trace of what couldn’t be encoded.
A practical conversion toolbox (recipes I actually use)
This section is about “what do I do in production code,” not “what are all possible methods.”
Recipe: accept str or bytes, normalize once
I showed ensure_bytes earlier; here is the matching ensure_str I use when I want text.
def ensure_str(value, *, encoding='utf-8', errors='strict'):
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value).decode(encoding, errors=errors)
    raise TypeError(f'Expected str or bytes-like, got {type(value).__name__}')
The key design choice is that I normalize once, close to the boundary, then keep internal logic in one type.
Recipe: build wire data with bytearray, freeze as bytes
If I’m constructing a payload in stages, I like this pattern:
import struct
def frame_utf8_message(text: str) -> bytes:
    body = text.encode('utf-8')
    buf = bytearray()
    buf += struct.pack('>I', len(body))
    buf += body
    return bytes(buf)
Even if you never build binary protocols, this pattern teaches a useful habit: measure lengths in bytes, not characters.
Recipe: prepend a UTF-8 BOM only if you must
Some Windows-centric tools expect UTF-8 with a BOM (byte order mark). Most modern systems do not want it.
- If a consumer explicitly requires it, you can add it.
- Otherwise, avoid it because it can confuse parsers.
text = 'name,café\n'
utf8 = text.encode('utf-8')
utf8_bom = b'\xef\xbb\xbf' + utf8
Recipe: treat “bytes from OS” as a special category
When dealing with raw OS bytes (especially on Unix), surrogateescape is the tool that keeps round-tripping possible.
This is advanced interop territory, but the mental model is simple: sometimes the OS gives you byte sequences that are “not valid UTF-8,” yet you still need to pass them around without losing information.
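A sketch of the round-trip, using a filename-like byte string that is not valid UTF-8:

```python
# A Latin-1-era filename: the lone 0xe9 byte is not valid UTF-8.
raw_name = b'caf\xe9.txt'

# surrogateescape smuggles the undecodable byte through as a surrogate...
as_text = raw_name.decode('utf-8', errors='surrogateescape')
# ...and encoding with the same handler restores the original bytes exactly.
round_tripped = as_text.encode('utf-8', errors='surrogateescape')
assert round_tripped == raw_name
```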
Debugging bytes problems quickly (my checklist)
When a signature fails, a checksum differs, or a service complains about “invalid encoding,” I go through the same few moves.
1) Identify the boundary: where does text become bytes?
I look for:
- .encode(...) and .decode(...) calls
- open(..., 'b') vs text mode
- crypto functions (hashlib, hmac, cryptography)
- network writes (socket.send, writer.write)
Then I make the boundary explicit and deterministic.
2) Print repr and the first few bytes
I don’t just print the payload; I print its representation and sometimes its integer values.
data = 'café\n'.encode('utf-8')
print(repr(data))
print(list(data[:10]))
Seeing [99, 97, 102, 195, 169, 10] is often enough to spot “this is UTF-8” vs “this is something else.”
3) Compare bytes, not strings
When two services disagree, I want to compare the exact bytes.
If I have two candidates:
a = payload_a.encode('utf-8')
b = payload_b.encode('utf-8')
I’ll compare lengths and find the first mismatch.
def first_diff(a: bytes, b: bytes):
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i, x, y
    if len(a) != len(b):
        return min(len(a), len(b)), None, None
    return None
This is the fastest way I know to locate “the bug is at byte 1432 because one side has \r\n.”
4) Check normalization and newlines for signatures
If the payload contains human text, I verify:
- Unicode normalization (NFC is usually my choice for canonicalization)
- Newline normalization (\n)
- JSON canonicalization choices
5) Verify the same encoding is used everywhere
“UTF-8 in one place, Latin-1 in another” is a common root cause.
If a library call accepts either str or bytes, I don’t let it decide. I pass bytes explicitly.
Working with base64 and hex (bytes that must travel as text)
Sometimes you cannot ship raw bytes because the surrounding channel only supports text: JSON fields, URLs, config files, environment variables, or manual copy/paste.
Two common solutions are hex and base64.
Hex: simple, readable, larger
Hex encodes each byte as two hexadecimal characters.
- Pros: easy to eyeball, easy to debug.
- Cons: 2x size overhead.
import binascii
raw = 'café'.encode('utf-8')
as_hex = binascii.hexlify(raw).decode('ascii')
back = binascii.unhexlify(as_hex)
print(raw)
print(as_hex)
print(back)
Base64: compact, common in APIs
Base64 encodes bytes into a limited ASCII alphabet.
- Pros: smaller overhead than hex (roughly 4/3), common in web APIs.
- Cons: less human-readable.
import base64
raw = 'café'.encode('utf-8')
b64_text = base64.b64encode(raw).decode('ascii')
back = base64.b64decode(b64_text)
print(b64_text)
print(back)
A rule I repeat to myself: base64 is not encryption. It’s just a transport encoding.
When not to use base64
If you are hashing or signing, do it on the raw bytes, not the base64 string (unless a protocol explicitly tells you otherwise). It’s a common mistake to sign the printable representation instead of the underlying data.
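A sketch with hmac, assuming an out-of-band shared secret: the signature is computed over the raw payload bytes, and base64 appears only when the signature itself has to travel as text.

```python
import base64
import hashlib
import hmac

key = b'shared-secret'  # assumption: a shared key agreed out of band
payload = '{"amount":1999}'.encode('utf-8')

# Sign the raw payload bytes, not any printable re-encoding of them.
signature = hmac.new(key, payload, hashlib.sha256).digest()

# Base64 is only for transporting the signature through text channels.
signature_b64 = base64.b64encode(signature).decode('ascii')
print(signature_b64)
```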
When you don’t know the encoding
In ideal systems, you always know the encoding. In real systems, you sometimes receive mystery bytes.
My approach is pragmatic:
1) Check whether the protocol specifies the encoding (HTTP headers, file format docs, DB driver docs).
2) If it’s “probably UTF-8,” try UTF-8 with errors='strict' and treat failure as signal.
3) If you must salvage, decide on a policy (lossy vs preserving evidence).
Example: attempt UTF-8, fall back to a reversible-ish strategy
For diagnostic tools, I sometimes do something like:
def decode_best_effort(data: bytes) -> str:
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('utf-8', errors='backslashreplace')
This is not about correctness; it’s about making logs actionable.
Latin-1 as a “byte-preserving” decode trick (with caution)
There’s a trick: latin-1 maps byte values 0..255 directly to the first 256 Unicode code points. That means you can decode arbitrary bytes without errors:
text = data.decode('latin-1')
But I treat this as an escape hatch for tooling, not for business logic. The resulting str is not “real text”; it’s just a reversible view.
If you do use it, document it loudly and keep it at the edges.
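A sketch of why the trick is safe to reverse:

```python
# Arbitrary bytes, including ones that are invalid as UTF-8.
mystery = b'\xff\xfe\x00data'

# latin-1 maps every byte value to a code point, so decoding never fails...
view = mystery.decode('latin-1')
# ...and encoding back recovers the exact original bytes.
assert view.encode('latin-1') == mystery
print(repr(view))
```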
Type hints and boundaries (how I make bugs rarer)
I’ve found that a little typing discipline reduces str/bytes bugs dramatically.
Annotate boundary functions
If a function returns wire-ready payloads, I make it return bytes, not str.
def build_request_body(...) -> bytes:
    ...
If a function parses a payload into text, I make it return str.
This sounds obvious, but it forces you to answer the question: “what type is this data really?”
Don’t accept “either” everywhere
Accepting str | bytes everywhere sounds flexible, but it spreads boundary logic across your codebase.
I prefer:
- One normalization step at the edge
- Stable internal types
- Explicit conversions when leaving the system
Write tiny tests for round-trip behavior
If you’ve ever been bitten by encoding changes, this test is worth its weight:
def test_round_trip_utf8():
    original = 'naïve café — 東京'
    wire = original.encode('utf-8')
    restored = wire.decode('utf-8')
    assert restored == original
If you sign payloads, add a fixture with non-ASCII characters and newlines. Those are the cases that reveal hidden assumptions.
Production considerations: where encoding bugs hide
Encoding problems are rarely “Python can’t do it.” They’re usually about assumptions.
Logging and observability
When debugging, I like logs that:
- record the encoding explicitly
- record a safe representation of bytes (hex or base64)
- avoid leaking secrets
Example approach:
- For request bodies: log length + a hash (not the raw content)
- For debug environments: log a truncated hex prefix
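A sketch of that kind of log line (body_fingerprint, its field names, and the format are my own illustration):

```python
import hashlib

def body_fingerprint(body: bytes, *, prefix_len: int = 8) -> str:
    # Hypothetical helper: a log-safe summary of a request body.
    # Length plus a short digest identifies the payload without leaking it.
    digest = hashlib.sha256(body).hexdigest()
    return f'len={len(body)} sha256={digest[:16]} hex_prefix={body[:prefix_len].hex()}'

print(body_fingerprint('{"event":"payment"}'.encode('utf-8')))
```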
Security and secrets
Be careful about converting secrets to text casually.
- Many secrets are bytes by nature (random token bytes).
- If you must show them, prefer base64.
- Avoid decoding random bytes as UTF-8 “just to print them.”
Databases and drivers
Most database drivers accept Python str and handle encoding for you (often UTF-8). That’s fine—until you need stable byte output (signatures, caches, external APIs).
My rule is the same: if a downstream consumer cares about bytes, I control the encoding and do it explicitly.
Checklist: predictable str -> bytes handling
When I’m reviewing code or designing a boundary, I run through this list.
- Keep internal text as str.
- Convert to bytes only at boundaries (file/socket/crypto/storage).
- When encoding: default to utf-8 and pass it explicitly.
- Choose an errors= policy intentionally; avoid ignore.
- For signatures/hashes: canonicalize text first (JSON formatting, newlines, normalization), then encode.
- Measure sizes in bytes, not characters.
- When debugging: print repr, lengths, and hex/base64 previews.
- Use typing to make boundaries obvious (-> bytes for wire data).
Quick reference: conversions at a glance
Here’s the condensed “I just need the right call” list:
- Text to bytes: text.encode('utf-8')
- Bytes to text: data.decode('utf-8')
- Accept str or bytes at a boundary: ensure_bytes(...)
- Mutable byte buffer from text: bytearray(text, 'utf-8')
- Bytes that must travel as text: hex/base64
If there’s one theme running through all of this, it’s that “string to bytes” is not just a syntax problem. It’s an interface contract. Be explicit about encoding at boundaries, and your code stops surprising you.


