I spend a lot of time cleaning and reshaping text: log lines coming from multiple services, pasted addresses from user forms, chat transcripts that need redaction, and messy CSV exports that are almost-but-not-quite structured. The common thread is that the data is text, and the fastest way to make it usable is usually a rewrite step: find a pattern, replace it with something better.
In Python, that rewrite step is often one function call: re.sub(). It looks simple, but it scales from basic find-and-replace all the way up to rules like, "Replace dates only when they appear in headers", or "Normalize phone numbers but keep the country code if present", or "Redact tokens while preserving the token prefix so debugging stays possible".
If you already know regular expressions for matching, substitution is where your regex skills start paying real dividends in production code. Here, I walk through the mental model of re.sub(), how replacement strings actually work (including the parts that surprise people), when to swap in a callable replacement, and how to keep patterns readable and safe. I also include runnable snippets you can paste into a file and run immediately.
The Mental Model: re.sub() as Search + Rewrite
At its core, re.sub() does two jobs in one pass:
1) It searches your input string for non-overlapping matches of a pattern.
2) It rewrites each match as a replacement.
The signature matters because every parameter has sharp edges:
re.sub(pattern, repl, string, count=0, flags=0)
What I keep in my head:
- pattern: a regex pattern (string) or a compiled pattern from re.compile().
- repl: either a replacement string or a function that receives a match object.
- string: the input text.
- count: maximum number of substitutions. 0 means "no limit" (replace all matches).
- flags: modifiers like case-insensitive matching.
A simple example (replace all occurrences):
import re
sentence = 'Thank you very very much.'
result = re.sub(r'very', 'so', sentence)
print(result)
Expected output:
Thank you so so much.
That "replace all" behavior is the default. If you only want the first match, set count=1:
import re
sentence = 'Thank you very very much.'
result = re.sub(r'very', 'so', sentence, count=1)
print(result)
Output:
Thank you so very much.
Two practical notes I rely on in real projects:
- re.sub() returns a new string. It does not modify the original.
- Matches do not overlap. If your pattern could match in overlapping ways, only the leftmost, then the next leftmost after the previous match ends, will be replaced.
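A quick demo of the non-overlapping behavior (plus re.subn(), the sibling function that also reports how many replacements it made):

```python
import re

s = 'aaaa'
# 'aa' matches at positions 0-1 and 2-3; the overlapping candidates
# at positions 1 and 2 are skipped because each search resumes where
# the previous match ended.
print(re.sub(r'aa', 'b', s))   # bb

# re.subn() performs the same substitution but also returns the count.
print(re.subn(r'aa', 'b', s))  # ('bb', 2)
```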
Replacement Strings Are Not Just Text
A lot of bugs in substitution code come from misunderstanding what the replacement string is allowed to contain.
In a replacement string, backslashes and group references have meaning:
- \1, \2, … refer to numbered capturing groups.
- \g<name> refers to a named capturing group.
- \g<1> is also valid and is the safest way to disambiguate group numbers in longer replacements.
I strongly prefer \g<name> in production code because it stays readable when patterns change.
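One more replacement-template feature worth knowing: \g<0> refers to the entire match, so you can wrap or decorate matches without defining any capturing groups. A minimal sketch:

```python
import re

text = 'Errors: 404 and 500'
# \g<0> stands for the whole matched substring, so each number is
# bracketed without a single capturing group in the pattern.
print(re.sub(r'\d+', r'[\g<0>]', text))  # Errors: [404] and [500]
```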
Example: Normalize "Last, First" to "First Last"
import re
name = 'Doe, Jane'
pattern = r'^(?P<last>[A-Za-z-]+),\s*(?P<first>[A-Za-z-]+)$'
result = re.sub(pattern, r'\g<first> \g<last>', name)
print(result)
Output:
Jane Doe
Why the raw string for the replacement (r'...')? Because otherwise Python itself interprets backslashes before re.sub() ever sees them. Raw strings prevent accidental escapes.
Common pitfall: Backslashes in replacements
If you write a Windows path replacement, you can accidentally create escapes like \n (newline) or \t (tab) inside normal Python strings.
I avoid that by:
- Using raw strings for replacements when I include backslashes.
- Or doubling backslashes in normal strings.
For example, replacing / with \ (Windows separators) while keeping the rest of the string:
import re
posix_path = '/var/log/app/server.log'
windowsish = re.sub(r'/', r'\\', posix_path)
print(windowsish)
Output:
\var\log\app\server.log
Note the replacement r'\\': in a replacement template, \\ stands for one literal backslash, so the raw string has to contain two backslash characters. The printed result has a single backslash per separator.
re.escape() is your friend when the pattern is user-provided
If you are substituting literal user text (not a regex), escape it.
import re
user_literal = '1.2.3'
text = 'Current versions: 1.2.3 and 1x2x3'
safe_pattern = re.escape(user_literal)
result = re.sub(safe_pattern, 'X.Y.Z', text)
print(result)
Output:
Current versions: X.Y.Z and 1x2x3
Without re.escape(), the dots in the pattern would match any character, so 1.2.3 would also rewrite 1x2x3, which is almost never what you want.
Shaping Matches: Character Classes, Boundaries, and Lookarounds
Substitution quality is mostly about match quality. I see three recurring categories:
- Character classes: define sets like digits, letters, whitespace.
- Boundaries: match only whole words or token edges.
- Lookarounds: match something based on context without consuming it.
Replace a character set (classic sanitization)
Replace all lowercase letters with 0:
import re
sentence = '22 April is celebrated as Earth Day.'
print(re.sub(r'[a-z]', '0', sentence))
Output:
22 A0000 00 0000000000 00 E0000 D00.
If you also want uppercase, you can expand the set:
import re
sentence = '22 April is celebrated as Earth Day.'
print(re.sub(r'[A-Za-z]', '0', sentence))
Or you can use a flag, which I prefer when the whole pattern should be case-insensitive:
import re
sentence = '22 April is celebrated as Earth Day.'
print(re.sub(r'[a-z]', '0', sentence, flags=re.IGNORECASE))
Replace only whole words with \b
If you substitute cat in concatenate, you probably did not mean to.
import re
text = 'A cat can concatenate strings.'
print(re.sub(r'\bcat\b', 'dog', text))
Output:
A dog can concatenate strings.
\b is a word boundary: it matches the transition between word characters and non-word characters.
Remove repeated whitespace without touching newlines
I often normalize runs of spaces and tabs to a single space, but keep line breaks intact.
import re
text = 'Name:\t\tJane Doe\nRole:\tSenior Engineer\n'
result = re.sub(r'[ \t]+', ' ', text)
print(result)
Output:
Name: Jane Doe
Role: Senior Engineer
This pattern avoids \s+ because \s includes newlines and would collapse your line structure.
Replace only when a prefix exists (lookbehind)
Imagine log lines with token=... and you want to redact the token value but keep token=.
import re
line = 'request_id=9f1 token=abcd1234efgh5678 user=jane'
redacted = re.sub(r'(?<=token=)[A-Za-z0-9]+', '[REDACTED]', line)
print(redacted)
Output:
request_id=9f1 token=[REDACTED] user=jane
A few constraints to remember:
- Python lookbehinds must be fixed-width (no +, *, or variable-length alternations inside the lookbehind).
- If you need variable-width context, you can often restructure to capture the prefix and reuse it.
The capture-based alternative:
import re
line = 'request_id=9f1 token=abcd1234efgh5678 user=jane'
redacted = re.sub(r'(token=)[A-Za-z0-9]+', r'\1[REDACTED]', line)
print(redacted)
Replace everything except the part you want to keep (negative lookahead)
A practical example: you want to prefix every non-comment line in a config with - .
import re
config = '# generated\nport=8080\n# do not edit\nmode=prod\n'
result = re.sub(r'^(?!#)(.+)$', r'- \1', config, flags=re.MULTILINE)
print(result)
Output:
# generated
- port=8080
# do not edit
- mode=prod
Key detail: re.MULTILINE makes ^ and $ work per line.
When Replacement Needs Logic: Pass a Function
A string replacement is great when the output is constant or purely based on captured groups.
When the rewrite depends on the matched text in a more complex way (validation, formatting, conditional behavior), pass a callable as repl. That callable receives a re.Match and must return the replacement string.
Example: Normalize phone numbers with a tiny ruleset
Say your input contains numbers like 415.555.2671, (415) 555-2671, or 415 555 2671, and you want +1-415-555-2671.
import re
def normalize_us_phone(match: re.Match) -> str:
    digits = re.sub(r'\D', '', match.group(0))
    # Accept 10 digits or 11 digits starting with 1.
    if len(digits) == 10:
        country = '1'
        area = digits[0:3]
        prefix = digits[3:6]
        line = digits[6:10]
        return f'+{country}-{area}-{prefix}-{line}'
    if len(digits) == 11 and digits.startswith('1'):
        country = '1'
        area = digits[1:4]
        prefix = digits[4:7]
        line = digits[7:11]
        return f'+{country}-{area}-{prefix}-{line}'
    # If it does not match expected formats, keep the original.
    return match.group(0)
text = 'Call (415) 555-2671 or 415.555.2671. Office: +1 (212) 555 0000.'
pattern = r'(?:\+?1\s*)?(?:\(\d{3}\)|\d{3})[ .-]?\d{3}[ .-]?\d{4}'
result = re.sub(pattern, normalize_us_phone, text)
print(result)
Output:
Call +1-415-555-2671 or +1-415-555-2671. Office: +1-212-555-0000.
Why I like the function approach:
- I can decide to keep the original match on unexpected input.
- I can add small validation without turning the regex into a monster.
Example: Mask all but the last 4 digits of account-like numbers
import re
def mask_account(match: re.Match) -> str:
    value = match.group('acct')
    return '*' * (len(value) - 4) + value[-4:]
text = 'acct=123456789012 acct=9876543210987654'
result = re.sub(r'acct=(?P<acct>\d{12,19})', lambda m: 'acct=' + mask_account(m), text)
print(result)
Output:
acct=********9012 acct=************7654
Here I keep acct= visible for debugging while protecting the sensitive part.
Flags You Will Actually Reach For
I treat flags as part of the readability story. They make intent explicit and often let you simplify the pattern.
re.IGNORECASE (re.I)
Use it when casing should not matter:
import re
text = 'Error: Disk Full. ERROR: Retry failed.'
result = re.sub(r'error', 'warning', text, flags=re.IGNORECASE)
print(result)
Output:
warning: Disk Full. warning: Retry failed.
re.MULTILINE (re.M)
Use it when you want ^ and $ to work per line.
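A tiny sketch of the per-line behavior, quoting each line of a block by substituting at every line start:

```python
import re

notes = 'first\nsecond\nthird'
# ^ is a zero-width match; with re.MULTILINE it fires at the start of
# every line, not just the start of the whole string.
print(re.sub(r'^', '> ', notes, flags=re.MULTILINE))
# > first
# > second
# > third
```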
re.DOTALL (re.S)
Use it when . should match newlines.
A small but realistic example: collapse a block between markers, even across lines.
import re
text = 'start\nsecret line 1\nsecret line 2\nend\n'
result = re.sub(r'start.*end', 'start\n[REDACTED]\nend', text, flags=re.DOTALL)
print(result)
Output:
start
[REDACTED]
end
I avoid DOTALL unless I truly need it, because it can make patterns match far more than expected.
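Here is that overmatching failure mode in miniature: with two start/end blocks, a greedy .* under DOTALL spans from the first start to the last end, while the non-greedy .*? keeps the blocks separate.

```python
import re

text = 'start\nA\nend\nkeep me\nstart\nB\nend'

# Greedy: one giant match swallows 'keep me' along with both blocks.
print(re.sub(r'start.*end', '[X]', text, flags=re.DOTALL))
# [X]

# Non-greedy: each block is matched and replaced on its own.
print(re.sub(r'start.*?end', '[X]', text, flags=re.DOTALL))
# [X]
# keep me
# [X]
```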
re.VERBOSE (re.X)
This is the biggest maintainability win for non-trivial patterns. It lets you space out the pattern and comment it.
Example: parse an ISO-like timestamp and rewrite it as YYYY/MM/DD HH:MM.
import re
timestamp = '2026-02-03T19:42:11Z'
pattern = re.compile(r'''
    ^
    (?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})
    T
    (?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})
    Z
    $
''', re.VERBOSE)
result = pattern.sub(r'\g<year>/\g<month>/\g<day> \g<hour>:\g<minute>', timestamp)
print(result)
Output:
2026/02/03 19:42
With re.VERBOSE, whitespace in the pattern is ignored unless escaped or inside a character class. That is usually what you want, but it does mean you must write literal spaces as an escaped space (a backslash followed by a space), as \x20, or inside a character class like [ ].
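A short sketch of that rule, using a character class for the literal space (the group names first and second are just for illustration):

```python
import re

pattern = re.compile(r'''
    (?P<first>\w+)
    [ ]             # literal space; bare whitespace here would be ignored
    (?P<second>\w+)
''', re.VERBOSE)

# Swap the two words around the space.
print(pattern.sub(r'\g<second> \g<first>', 'hello world'))  # world hello
```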
Practical Substitution Recipes I Keep Reusing
I do not write regex substitutions for fun. I write them because they solve recurring engineering problems. Here are patterns I reach for constantly.
Recipe: Redact emails but keep the domain
This keeps enough context for debugging (which domain is involved) while protecting the user.
import re
def redact_email(match: re.Match) -> str:
    local = match.group('local')
    domain = match.group('domain')
    if len(local) <= 2:
        masked = '*' * len(local)
    else:
        masked = local[0] + '*' * (len(local) - 2) + local[-1]
    return masked + '@' + domain
text = 'Contact: [email protected] or [email protected]'
pattern = r'(?P<local>[A-Za-z0-9._%+-]+)@(?P<domain>[A-Za-z0-9.-]+\.[A-Za-z]{2,})'
result = re.sub(pattern, redact_email, text)
print(result)
Output:
Contact: j******[email protected] or **@corp.internal
Recipe: Normalize multiple date separators
Convert 2026/2/3, 2026-02-03, and 2026.02.03 to 2026-02-03.
import re
text = 'Deploys: 2026/2/3, 2026-02-03, 2026.02.03'
pattern = r'\b(?P<y>\d{4})[./-](?P<m>\d{1,2})[./-](?P<d>\d{1,2})\b'
def fix_date(match: re.Match) -> str:
y = match.group(‘y‘)
m = int(match.group(‘m‘))
d = int(match.group(‘d‘))
return f‘{y}-{m:02d}-{d:02d}‘
result = re.sub(pattern, fix_date, text)
print(result)
Output:
Deploys: 2026-02-03, 2026-02-03, 2026-02-03
Recipe: Remove trailing whitespace without touching indentation
This is a quiet quality-of-life improvement when cleaning generated text.
import re
text = 'line one \n indented line\t\t\nline three\n'
cleaned = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
print(cleaned)
Output:
line one
indented line
line three
When Not to Use Regex Substitution (And What I Use Instead)
Regex is great when:
- The text has a clear pattern.
- You need contextual matching (boundaries, alternations, optional segments).
- You need to rewrite many matches in one pass.
Regex is a bad fit when:
- The replacement is purely literal and you want maximal clarity.
- You are parsing a nested grammar (HTML, JSON, programming languages) where edge cases explode.
- Input size is large and patterns are complex enough to risk slowdowns.
Here is the decision table I use when I am choosing an approach:
- Regex for a purely literal replacement: use str.replace(); it is clearer and often faster.
- Chained replace() calls: use str.translate() with a translation table.
- Regex everywhere for structured formats: use json, csv, email, urllib.parse, or a dedicated library.
- Regex \s+ for inline whitespace: use [ \t]+ plus tests.
- Regex directly over user-provided literals: use re.escape() + regex, or plain replace() if possible.
An example where str.translate() is cleaner than regex:
# Remove common punctuation characters quickly.
text = 'Hello, world! (v2.0)'
table = str.maketrans('', '', ',!()')
print(text.translate(table))
Output:
Hello world v2.0
I still use re.sub() all the time, but I do not force it into problems it is not good at.
Performance and Safety: Keep Substitutions Fast and Predictable
Most substitutions in application code are fast enough that you do not need to measure. But the failure modes are nasty when they happen: a pattern that backtracks badly can take seconds on the wrong input.
What typically keeps re.sub() fast
- Prefer specific patterns over overly generic ones.
- Avoid nested quantifiers like (.*)+ or (.+)*.
- Anchor when you can (^, $, \b) to reduce scanning.
- Compile patterns you reuse.
Compiling is mainly about avoiding repeated parse overhead and making intent clear:
import re
RE_MULTISPACE = re.compile(r'[ \t]{2,}')
def normalize_spaces(text: str) -> str:
    return RE_MULTISPACE.sub(' ', text)
If you do this in a loop over thousands of lines, precompiling is often a noticeable speed-up. In typical backend services, that can shave a few milliseconds per request in hot paths, or remove noisy variance.
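If you want to check that claim on your own machine, here is a small timeit sketch; treat it as a measurement recipe rather than a benchmark result, since the numbers vary by interpreter and hardware.

```python
import re
import timeit

LINE = 'a\t\tb  c\t d'
RE_MULTISPACE = re.compile(r'[ \t]{2,}')

# Module-level re.sub() looks the pattern up in an internal cache on
# every call; the compiled object skips that per-call overhead.
uncompiled = timeit.timeit(lambda: re.sub(r'[ \t]{2,}', ' ', LINE), number=20_000)
compiled = timeit.timeit(lambda: RE_MULTISPACE.sub(' ', LINE), number=20_000)

print(f'uncompiled: {uncompiled:.4f}s  compiled: {compiled:.4f}s')
```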
Guarding against catastrophic backtracking
If you accept arbitrary user input, treat regex patterns as part of your attack surface. A pattern like ^(a+)+$ can behave horribly on inputs like 'a' * 50_000 + 'X'.
Practical mitigations I use:
- Keep patterns simple and anchored.
- Prefer negated character classes over . when possible. Example: use [^\n]* instead of .* for single-line matches.
- Put a max length on inputs you run regex over when you are processing untrusted data.
- Add tests for worst-case-ish inputs when a pattern is complex.
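The max-length mitigation sketched out, with a hypothetical MAX_SCAN_LEN you would tune per use case; the point is the guard, not the number:

```python
import re

MAX_SCAN_LEN = 10_000  # hypothetical cap; choose one that fits your data
RE_TOKEN = re.compile(r'(?<=token=)[A-Za-z0-9]+')

def redact_untrusted(line: str) -> str:
    # Refuse to run the regex over pathologically long input;
    # falling through unchanged is safer than an unbounded scan.
    if len(line) > MAX_SCAN_LEN:
        return line
    return RE_TOKEN.sub('[REDACTED]', line)

print(redact_untrusted('token=abcd1234'))  # token=[REDACTED]
```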
Substitution correctness beats micro-speed
I have seen teams spend time chasing tiny gains while shipping a substitution that silently corrupts data. My rule is: lock correctness first with tests, then measure if you still think you have a hot path.
Maintainable Substitution Code in 2026: Types, Tests, and Tooling
I want substitution code to be boring to maintain. That means three things: patterns that can be read later, replacements that are explicit, and tests that catch drift.
Prefer named groups and re.VERBOSE for anything non-trivial
If your pattern is longer than a tweet, I switch to:
- re.compile(...)
- named groups
- re.VERBOSE
It turns "mystery punctuation" into something I can review like normal code.
Type hints make callable replacements easier to reason about
You do not need heavy typing to benefit. Even a simple signature helps:
import re
def rewrite(match: re.Match) -> str:
return match.group(0)
When you work in a codebase with static checks (mypy, pyright), this prevents a class of bugs where you accidentally return None or an int.
Add tests for edge cases you know will happen
Here is a small pytest-style set of tests for a whitespace normalizer and a date normalizer. You can adapt the structure even if you are not using pytest.
import re
RE_SPACES = re.compile(r'[ \t]+')
RE_DATE = re.compile(r'\b(?P<y>\d{4})[./-](?P<m>\d{1,2})[./-](?P<d>\d{1,2})\b')
def normalize_inline_whitespace(text: str) -> str:
    return RE_SPACES.sub(' ', text)
def normalize_dates(text: str) -> str:
    def repl(m: re.Match) -> str:
        return f"{m.group('y')}-{int(m.group('m')):02d}-{int(m.group('d')):02d}"
    return RE_DATE.sub(repl, text)
def test_normalize_inline_whitespace_keeps_newlines() -> None:
    text = 'a\t\tb\n c\t d\n'
    assert normalize_inline_whitespace(text) == 'a b\n c d\n'
def test_normalize_dates_multiple_separators() -> None:
    text = 'x 2026/2/3 y 2026-02-03 z 2026.02.03'
    assert normalize_dates(text) == 'x 2026-02-03 y 2026-02-03 z 2026-02-03'
Note: I used a double-quoted f-string in that snippet because it contains single quotes inside; in your codebase, pick one consistent quoting style.
AI-assisted workflows are fine, but you still own the regex
In 2026, it is normal to ask an assistant to draft a regex. I do it too. But I never paste a pattern into production without:
- adding a few representative tests,
- checking worst-case-ish inputs,
- and making sure the replacement behavior is exactly what I want.
Regex is compact code. Compact code can hide big mistakes.
Key Takeaways and Next Steps
If you remember only a few things about pattern substitution in Python, make them these:
- re.sub() replaces all non-overlapping matches by default; set count when you want only the first few.
- Treat the replacement string as its own mini-language: backreferences (\1, \g<name>) and backslashes matter, so raw strings are a safe default.
- Spend your effort on match correctness. Word boundaries, anchoring, and explicit character classes prevent accidental rewrites.
- When the rewrite needs rules, pass a function as repl. It keeps complex logic out of the regex and makes validation straightforward.
- Make patterns readable before they become urgent. re.VERBOSE plus named groups is the difference between "I can review this" and "nobody touch it".
- Keep an eye on safety. Avoid patterns that can backtrack catastrophically, and be cautious with untrusted inputs.
A practical next step: pick one real text-wrangling annoyance in your codebase (log redaction, whitespace cleanup, date normalization, identifier formatting). Implement it with re.sub() plus a few tests that cover both the happy path and the ‘someone will paste weird input‘ path. Once you have that in place, you will start seeing substitution not as a trick, but as a reliable, testable rewrite tool you can use everywhere text shows up.


