I spend a lot of time cleaning and reshaping text: log lines coming from multiple services, pasted addresses from user forms, chat transcripts that need redaction, and messy CSV exports that are almost-but-not-quite structured. The common thread is that the data is text, and the fastest way to make it usable is usually a rewrite step: find a pattern, replace it with something better.
In Python, that rewrite step is often one function call: re.sub(). It looks simple, but it scales from basic find-and-replace all the way up to rules like, "Replace dates only when they appear in headers", or "Normalize phone numbers but keep the country code if present", or "Redact tokens while preserving the token prefix so debugging stays possible".
If you already know regular expressions for matching, substitution is where your regex skills start paying real dividends in production code. Here, I walk through the mental model of re.sub(), how replacement strings actually work (including the parts that surprise people), when to swap in a callable replacement, and how to keep patterns readable and safe. I also include runnable snippets you can paste into a file and run immediately.
The Mental Model: re.sub() as Search + Rewrite
At its core, re.sub() does two jobs in one pass:
1) It searches your input string for non-overlapping matches of a pattern.
2) It rewrites each match as a replacement.
The signature matters because every parameter has sharp edges:
re.sub(pattern, repl, string, count=0, flags=0)
What I keep in my head:
- pattern: a regex pattern (string) or a compiled pattern from re.compile().
- repl: either a replacement string or a function that receives a match object.
- string: the input text.
- count: maximum number of substitutions. 0 means "no limit" (replace all matches).
- flags: modifiers like case-insensitive matching.
A simple example (replace all occurrences):
import re
sentence = 'Thank you very very much.'
result = re.sub(r'very', 'so', sentence)
print(result)
Expected output:
Thank you so so much.
That "replace all" behavior is the default. If you only want the first match, set count=1:
import re
sentence = 'Thank you very very much.'
result = re.sub(r'very', 'so', sentence, count=1)
print(result)
Output:
Thank you so very much.
Two practical notes I rely on in real projects:
- re.sub() returns a new string. It does not modify the original.
- Matches do not overlap. If your pattern could match in overlapping ways, only the leftmost, then the next leftmost after the previous match ends, will be replaced.
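A quick demo of the non-overlapping behavior (plus re.subn(), the sibling function that also reports how many replacements it made):

```python
import re

s = 'aaaa'
# 'aa' matches at positions 0-1 and 2-3; the overlapping candidates
# at positions 1 and 2 are skipped because each search resumes where
# the previous match ended.
print(re.sub(r'aa', 'b', s))   # bb

# re.subn() performs the same substitution but also returns the count.
print(re.subn(r'aa', 'b', s))  # ('bb', 2)
```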
Replacement Strings Are Not Just Text
A lot of bugs in substitution code come from misunderstanding what the replacement string is allowed to contain.
In a replacement string, backslashes and group references have meaning:
- \1, \2, … refer to numbered capturing groups.
- \g<name> refers to a named capturing group.
- \g<1> is also valid and is the safest way to disambiguate group numbers in longer replacements.
I strongly prefer \g<name> in production code because it stays readable when patterns change.
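One more replacement-template feature worth knowing: \g<0> refers to the entire match, so you can wrap or decorate matches without defining any capturing groups. A minimal sketch:

```python
import re

text = 'Errors: 404 and 500'
# \g<0> stands for the whole matched substring, so each number is
# bracketed without a single capturing group in the pattern.
print(re.sub(r'\d+', r'[\g<0>]', text))  # Errors: [404] and [500]
```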
Example: Normalize "Last, First" to "First Last"
import re
name = 'Doe, Jane'
pattern = r'^(?P<last>[A-Za-z-]+),\s*(?P<first>[A-Za-z-]+)$'
result = re.sub(pattern, r'\g<first> \g<last>', name)
print(result)
Output:
Jane Doe
Why the raw string for the replacement (r'...')? Because otherwise Python itself interprets backslashes before re.sub() ever sees them. Raw strings prevent accidental escapes.
Common pitfall: Backslashes in replacements
If you write a Windows path replacement, you can accidentally create escapes like \n (newline) or \t (tab) inside normal Python strings.
I avoid that by:
- Using raw strings for replacements when I include backslashes.
- Or doubling backslashes in normal strings.
For example, replacing / with \ (Windows separators) while keeping the rest of the string:
import re
posix_path = '/var/log/app/server.log'
windowsish = re.sub(r'/', r'\\', posix_path)
print(windowsish)
Output:
\var\log\app\server.log
Note the replacement r'\\': in a replacement template, \\ stands for one literal backslash, so the raw string has to contain two backslash characters. The printed result has a single backslash per separator.
re.escape() is your friend when the pattern is user-provided
If you are substituting literal user text (not a regex), escape it.
import re
user_literal = '1.2.3'
text = 'Current versions: 1.2.3 and 1x2x3'
safe_pattern = re.escape(user_literal)
result = re.sub(safe_pattern, 'X.Y.Z', text)
print(result)
Output:
Current versions: X.Y.Z and 1x2x3
Without re.escape(), the dots in the pattern would match any character, so 1.2.3 would also rewrite 1x2x3, which is almost never what you want.
Shaping Matches: Character Classes, Boundaries, and Lookarounds
Substitution quality is mostly about match quality. I see three recurring categories:
- Character classes: define sets like digits, letters, whitespace.
- Boundaries: match only whole words or token edges.
- Lookarounds: match something based on context without consuming it.
Replace a character set (classic sanitization)
Replace all lowercase letters with 0:
import re
sentence = '22 April is celebrated as Earth Day.'
print(re.sub(r'[a-z]', '0', sentence))
Output:
22 A0000 00 0000000000 00 E0000 D00.
If you also want uppercase, you can expand the set:
import re
sentence = '22 April is celebrated as Earth Day.'
print(re.sub(r'[A-Za-z]', '0', sentence))
Or you can use a flag, which I prefer when the whole pattern should be case-insensitive:
import re
sentence = '22 April is celebrated as Earth Day.'
print(re.sub(r'[a-z]', '0', sentence, flags=re.IGNORECASE))
Replace only whole words with \b
If you substitute cat in concatenate, you probably did not mean to.
import re
text = 'A cat can concatenate strings.'
print(re.sub(r'\bcat\b', 'dog', text))
Output:
A dog can concatenate strings.
\b is a word boundary: it matches the transition between word characters and non-word characters.
Remove repeated whitespace without touching newlines
I often normalize runs of spaces and tabs to a single space, but keep line breaks intact.
import re
text = 'Name:\t\tJane Doe\nRole:\tSenior Engineer\n'
result = re.sub(r'[ \t]+', ' ', text)
print(result)
Output:
Name: Jane Doe
Role: Senior Engineer
This pattern avoids \s+ because \s includes newlines and would collapse your line structure.
Replace only when a prefix exists (lookbehind)
Imagine log lines with token=... and you want to redact the token value but keep token=.
import re
line = 'request_id=9f1 token=abcd1234efgh5678 user=jane'
redacted = re.sub(r'(?<=token=)[A-Za-z0-9]+', '[REDACTED]', line)
print(redacted)
Output:
request_id=9f1 token=[REDACTED] user=jane
A few constraints to remember:
- Python lookbehinds must be fixed-width (no +, *, or variable-length alternations inside the lookbehind).
- If you need variable-width context, you can often restructure to capture the prefix and reuse it.
The capture-based alternative:
import re
line = 'request_id=9f1 token=abcd1234efgh5678 user=jane'
redacted = re.sub(r'(token=)[A-Za-z0-9]+', r'\1[REDACTED]', line)
print(redacted)
Replace everything except the part you want to keep (negative lookahead)
A practical example: you want to prefix every non-comment line in a config with - .
import re
config = '# generated\nport=8080\n# do not edit\nmode=prod\n'
result = re.sub(r'^(?!#)(.+)$', r'- \1', config, flags=re.MULTILINE)
print(result)
Output:
# generated
- port=8080
# do not edit
- mode=prod
Key detail: re.MULTILINE makes ^ and $ work per line.
When Replacement Needs Logic: Pass a Function
A string replacement is great when the output is constant or purely based on captured groups.
When the rewrite depends on the matched text in a more complex way (validation, formatting, conditional behavior), pass a callable as repl. That callable receives a re.Match and must return the replacement string.
Example: Normalize phone numbers with a tiny ruleset
Say your input contains numbers like 415.555.2671, (415) 555-2671, or 415 555 2671, and you want +1-415-555-2671.
import re
def normalize_us_phone(match: re.Match) -> str:
    digits = re.sub(r'\D', '', match.group(0))
    # Accept 10 digits or 11 digits starting with 1.
    if len(digits) == 10:
        country = '1'
        area = digits[0:3]
        prefix = digits[3:6]
        line = digits[6:10]
        return f'+{country}-{area}-{prefix}-{line}'
    if len(digits) == 11 and digits.startswith('1'):
        country = '1'
        area = digits[1:4]
        prefix = digits[4:7]
        line = digits[7:11]
        return f'+{country}-{area}-{prefix}-{line}'
    # If it does not match expected formats, keep the original.
    return match.group(0)
text = 'Call (415) 555-2671 or 415.555.2671. Office: +1 (212) 555 0000.'
pattern = r'(?:\+?1\s*)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}'
result = re.sub(pattern, normalize_us_phone, text)
print(result)
Output:
Call +1-415-555-2671 or +1-415-555-2671. Office: +1-212-555-0000.
Why I like the function approach:
- I can decide to keep the original match on unexpected input.
- I can add small validation without turning the regex into a monster.
Example: Mask all but the last 4 digits of account-like numbers
import re
def mask_account(match: re.Match) -> str:
    value = match.group('acct')
    return '*' * (len(value) - 4) + value[-4:]
text = 'acct=123456789012 acct=9876543210987654'
result = re.sub(r'acct=(?P<acct>\d{12,19})', lambda m: 'acct=' + mask_account(m), text)
print(result)
Output:
acct=********9012 acct=************7654
Here I keep acct= visible for debugging while protecting the sensitive part.
Flags You Will Actually Reach For
I treat flags as part of the readability story. They make intent explicit and often let you simplify the pattern.
re.IGNORECASE (re.I)
Use it when casing should not matter:
import re
text = 'Error: Disk Full. ERROR: Retry failed.'
result = re.sub(r'error', 'warning', text, flags=re.IGNORECASE)
print(result)
Output:
warning: Disk Full. warning: Retry failed.
re.MULTILINE (re.M)
Use it when you want ^ and $ to work per line.
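A tiny sketch of the per-line behavior, quoting each line of a block by substituting at every line start:

```python
import re

notes = 'first\nsecond\nthird'
# ^ is a zero-width match; with re.MULTILINE it fires at the start of
# every line, not just the start of the whole string.
print(re.sub(r'^', '> ', notes, flags=re.MULTILINE))
# > first
# > second
# > third
```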
re.DOTALL (re.S)
Use it when . should match newlines.
A small but realistic example: collapse a block between markers, even across lines.
import re
text = 'start\nsecret line 1\nsecret line 2\nend\n'
result = re.sub(r'start.*end', 'start\n[REDACTED]\nend', text, flags=re.DOTALL)
print(result)
Output:
start
[REDACTED]
end
I avoid DOTALL unless I truly need it, because it can make patterns match far more than expected.
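Here is that overmatching failure mode in miniature: with two start/end blocks, a greedy .* under DOTALL spans from the first start to the last end, while the non-greedy .*? keeps the blocks separate.

```python
import re

text = 'start\nA\nend\nkeep me\nstart\nB\nend'

# Greedy: one giant match swallows 'keep me' along with both blocks.
print(re.sub(r'start.*end', '[X]', text, flags=re.DOTALL))
# [X]

# Non-greedy: each block is matched and replaced on its own.
print(re.sub(r'start.*?end', '[X]', text, flags=re.DOTALL))
# [X]
# keep me
# [X]
```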
re.VERBOSE (re.X)
This is the biggest maintainability win for non-trivial patterns. It lets you space out the pattern and comment it.
Example: parse an ISO-like timestamp and rewrite it as YYYY/MM/DD HH:MM.
import re
timestamp = '2026-02-03T19:42:11Z'
pattern = re.compile(r'''
    ^
    (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
    T
    (?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})
    Z
    $
''', re.VERBOSE)
result = pattern.sub(r'\g<year>/\g<month>/\g<day> \g<hour>:\g<minute>', timestamp)
print(result)
Output:
2026/02/03 19:42
With re.VERBOSE, whitespace in the pattern is ignored unless escaped or inside a character class. That is usually what you want, but it does mean you must write literal spaces as an escaped space (a backslash followed by a space), as \x20, or inside a character class like [ ].
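A short sketch of that rule, using a character class for the literal space (the group names first and second are just for illustration):

```python
import re

pattern = re.compile(r'''
    (?P<first>\w+)
    [ ]             # literal space; bare whitespace here would be ignored
    (?P<second>\w+)
''', re.VERBOSE)

# Swap the two words around the space.
print(pattern.sub(r'\g<second> \g<first>', 'hello world'))  # world hello
```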
Practical Substitution Recipes I Keep Reusing
I do not write regex substitutions for fun. I write them because they solve recurring engineering problems. Here are patterns I reach for constantly.
Recipe: Redact emails but keep the domain
This keeps enough context for debugging (which domain is involved) while protecting the user.
import re
def redact_email(match: re.Match) -> str:
    local = match.group('local')
    domain = match.group('domain')
    if len(local) <= 2:
        masked = '*' * len(local)
    else:
        masked = local[0] + '*' * (len(local) - 2) + local[-1]
    return masked + '@' + domain
text = 'Contact: [email protected] or [email protected]'
pattern = r'(?P<local>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})'
result = re.sub(pattern, redact_email, text)
print(result)
Output:
Contact: j******[email protected] or **@corp.internal
Recipe: Normalize multiple date separators
Convert 2026/2/3, 2026-02-03, and 2026.02.03 to 2026-02-03.
import re
text = 'Deploys: 2026/2/3, 2026-02-03, 2026.02.03'
pattern = r'\b(?P<y>\d{4})[./-](?P<m>\d{1,2})[./-](?P<d>\d{1,2})\b'
def fix_date(match: re.Match) -> str:
y = match.group(‘y‘)
m = int(match.group(‘m‘))
d = int(match.group(‘d‘))
return f‘{y}-{m:02d}-{d:02d}‘
result = re.sub(pattern, fix_date, text)
print(result)
Output:
Deploys: 2026-02-03, 2026-02-03, 2026-02-03
Recipe: Remove trailing whitespace without touching indentation
This is a quiet quality-of-life improvement when cleaning generated text.
import re
text = 'line one \n indented line\t\t\nline three\n'
cleaned = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
print(cleaned)
Output:
line one
indented line
line three
When Not to Use Regex Substitution (And What I Use Instead)
Regex is great when:
- The text has a clear pattern.
- You need contextual matching (boundaries, alternations, optional segments).
- You need to rewrite many matches in one pass.
Regex is a bad fit when:
- The replacement is purely literal and you want maximal clarity.
- You are parsing a nested grammar (HTML, JSON, programming languages) where edge cases explode.
- Input size is large and patterns are complex enough to risk slowdowns.
Here is the decision table I use when I am choosing an approach:
- Regex for a purely literal replacement: use str.replace(); it is clearer and often faster.
- Chained replace() calls: use str.translate() with a translation table.
- Regex everywhere for structured formats: use json, csv, email, urllib.parse, or a dedicated library.
- Regex \s+ for inline whitespace: use [ \t]+ plus tests.
- Regex directly over user-provided literals: use re.escape() + regex, or plain replace() if possible.
An example where str.translate() is cleaner than regex:
# Remove common punctuation characters quickly.
text = 'Hello, world! (v2.0)'
table = str.maketrans('', '', ',!()')
print(text.translate(table))
Output:
Hello world v2.0
I still use re.sub() all the time, but I do not force it into problems it is not good at.
Performance and Safety: Keep Substitutions Fast and Predictable
Most substitutions in application code are fast enough that you do not need to measure. But the failure modes are nasty when they happen: a pattern that backtracks badly can take seconds on the wrong input.
What typically keeps re.sub() fast
- Prefer specific patterns over overly generic ones.
- Avoid nested quantifiers like (.*)+ or (.+)*.
- Anchor when you can (^, $, \b) to reduce scanning.
- Compile patterns you reuse.
Compiling is mainly about avoiding repeated parse overhead and making intent clear:
import re
RE_MULTISPACE = re.compile(r'[ \t]{2,}')
def normalize_spaces(text: str) -> str:
    return RE_MULTISPACE.sub(' ', text)
If you do this in a loop over thousands of lines, precompiling is often a noticeable speed-up. In typical backend services, that can shave a few milliseconds per request in hot paths, or remove noisy variance.
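If you want to check that claim on your own machine, here is a small timeit sketch; treat it as a measurement recipe rather than a benchmark result, since the numbers vary by interpreter and hardware.

```python
import re
import timeit

LINE = 'a\t\tb  c\t d'
RE_MULTISPACE = re.compile(r'[ \t]{2,}')

# Module-level re.sub() looks the pattern up in an internal cache on
# every call; the compiled object skips that per-call overhead.
uncompiled = timeit.timeit(lambda: re.sub(r'[ \t]{2,}', ' ', LINE), number=20_000)
compiled = timeit.timeit(lambda: RE_MULTISPACE.sub(' ', LINE), number=20_000)

print(f'uncompiled: {uncompiled:.4f}s  compiled: {compiled:.4f}s')
```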
Guarding against catastrophic backtracking
If you accept arbitrary user input, treat regex patterns as part of your attack surface. A pattern like ^(a+)+$ can behave horribly on inputs like 'a' * 50_000 + 'X'.
Practical mitigations I use:
- Keep patterns simple and anchored.
- Prefer negated character classes over . when possible. Example: use [^\n]* instead of .* for single-line matches.
- Put a max length on inputs you run regex over when you are processing untrusted data.
- Add tests for worst-case-ish inputs when a pattern is complex.
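The max-length mitigation sketched out, with a hypothetical MAX_SCAN_LEN you would tune per use case; the point is the guard, not the number:

```python
import re

MAX_SCAN_LEN = 10_000  # hypothetical cap; choose one that fits your data
RE_TOKEN = re.compile(r'(?<=token=)[A-Za-z0-9]+')

def redact_untrusted(line: str) -> str:
    # Refuse to run the regex over pathologically long input;
    # falling through unchanged is safer than an unbounded scan.
    if len(line) > MAX_SCAN_LEN:
        return line
    return RE_TOKEN.sub('[REDACTED]', line)

print(redact_untrusted('token=abcd1234'))  # token=[REDACTED]
```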
Substitution correctness beats micro-speed
I have seen teams spend time chasing tiny gains while shipping a substitution that silently corrupts data. My rule is: lock correctness first with tests, then measure if you still think you have a hot path.
Maintainable Substitution Code in 2026: Types, Tests, and Tooling
I want substitution code to be boring to maintain. That means three things: patterns that can be read later, replacements that are explicit, and tests that catch drift.
Prefer named groups and re.VERBOSE for anything non-trivial
If your pattern is longer than a tweet, I switch to:
- re.compile(...)
- named groups
- re.VERBOSE
It turns "mystery punctuation" into something I can review like normal code.
Type hints make callable replacements easier to reason about
You do not need heavy typing to benefit. Even a simple signature helps:
import re
def rewrite(match: re.Match) -> str:
return match.group(0)
When you work in a codebase with static checks (mypy, pyright), this prevents a class of bugs where you accidentally return None or an int.
Add tests for edge cases you know will happen
Here is a small pytest-style set of tests for a whitespace normalizer and a date normalizer. You can adapt the structure even if you are not using pytest.
import re
RE_SPACES = re.compile(r'[ \t]+')
RE_DATE = re.compile(r'\b(?P<y>\d{4})[./-](?P<m>\d{1,2})[./-](?P<d>\d{1,2})\b')
def normalize_inline_whitespace(text: str) -> str:
    return RE_SPACES.sub(' ', text)
def normalize_dates(text: str) -> str:
    def repl(m: re.Match) -> str:
        return f"{m.group('y')}-{int(m.group('m')):02d}-{int(m.group('d')):02d}"
    return RE_DATE.sub(repl, text)
def test_normalize_inline_whitespace_keeps_newlines() -> None:
    text = 'a\t\tb\n c\t d\n'
    assert normalize_inline_whitespace(text) == 'a b\n c d\n'
def test_normalize_dates_multiple_separators() -> None:
    text = 'x 2026/2/3 y 2026-02-03 z 2026.02.03'
    assert normalize_dates(text) == 'x 2026-02-03 y 2026-02-03 z 2026-02-03'
Note: I used a double-quoted f-string in that snippet because it contains single quotes inside; in your codebase, pick one consistent quoting style.
AI-assisted workflows are fine, but you still own the regex
In 2026, it is normal to ask an assistant to draft a regex. I do it too. But I never paste a pattern into production without:
- adding a few representative tests,
- checking worst-case-ish inputs,
- and making sure the replacement behavior is exactly what I want.
Regex is compact code. Compact code can hide big mistakes.
Key Takeaways and Next Steps
If you remember only a few things about pattern substitution in Python, make them these:
- re.sub() replaces all non-overlapping matches by default; set count when you want only the first few.
- Treat the replacement string as its own mini-language: backreferences (\1, \g<name>) and backslashes matter, so raw strings are a safe default.
- Spend your effort on match correctness. Word boundaries, anchoring, and explicit character classes prevent accidental rewrites.
- When the rewrite needs rules, pass a function as repl. It keeps complex logic out of the regex and makes validation straightforward.
- Make patterns readable before they become urgent. re.VERBOSE plus named groups is the difference between "I can review this" and "nobody touch it".
- Keep an eye on safety. Avoid patterns that can backtrack catastrophically, and be cautious with untrusted inputs.
A practical next step: pick one real text-wrangling annoyance in your codebase (log redaction, whitespace cleanup, date normalization, identifier formatting). Implement it with re.sub() plus a few tests that cover both the happy path and the ‘someone will paste weird input‘ path. Once you have that in place, you will start seeing substitution not as a trick, but as a reliable, testable rewrite tool you can use everywhere text shows up.


