I keep running into the same small-but-annoying problem in real systems: a blob of text contains quoted values, and I need the parts inside the quotes—fast, correctly, and without “mystery behavior” when the input gets messy.
Maybe it’s a log line like: status="ok" trace_id="9f3c…" user="Ava". Maybe it’s a config snippet, a prompt template, or a CLI output where quoted phrases are the only stable boundary you can trust. The task sounds trivial until you hit escaped quotes (\"), unmatched quotation marks, or “smart quotes” pasted from a document.
In this post, I’ll show you the approaches I actually reach for in Python: quick slicing tricks for controlled inputs, regular expressions for compact extraction, and a small state-machine parser for when correctness matters. I’ll also cover how to choose the right method, common mistakes I see in reviews, and how to test your extractor so it stays correct as the input evolves.
Start With a Clear Contract: What Counts as “Between Quotes”?
Before you touch code, decide what you mean by “between quotation marks.” In my experience, most bugs here come from an unstated contract.
Here are the questions I ask (and I recommend you answer explicitly):
- Which quote characters: only double quotes (")? single quotes (') too? typographic quotes (“ ”)?
- Are escaped quotes allowed inside a quoted segment (for example: "She said \"hi\"")?
- Do you want only complete pairs, or should you accept partial/unbalanced input?
- Are quoted values allowed to span multiple lines?
- Is the input a “language” (CSV, JSON, shell-like tokens), or is it just text with quotes sprinkled in?
A simple example with a simple contract (double quotes only, no escaping rules) looks like this:
message = 'status="ok" trace_id="9f3c" user="Ava"'
If your contract is “collect the three values inside paired double quotes,” you want:
["ok", "9f3c", "Ava"]

But if your contract also allows escapes, this input should behave sensibly:
message = 'note="She said \\"hi\\"" id="42"'
Now you probably want:
['She said \\"hi\\"', '42']

Notice the subtlety: you might want the raw escaped form, or you might want to unescape it to actual quotes. That’s not “free”: you decide it.
A few more contract patterns I commonly use in practice:
- “Best-effort extraction”: return everything from complete quote pairs; ignore unmatched quotes; never raise.
- “Strict mode”: raise an error if there’s a dangling opening quote or a dangling escape.
- “Spans, not strings”: return start/end indices so the caller can slice, highlight, redact, or replace without re-parsing.
- “Normalize first”: convert typographic quotes to ASCII quotes before extraction.
That contract is your compass. Once it’s written down (in code comments, a docstring, or tests), the implementation becomes much easier to evaluate.
The Quick Win: Split + Slice When the Input Is Controlled
When I control the string format (or it’s produced by code I trust), I’ll often start with the simplest thing that can’t surprise me.
If the input is guaranteed to:
- use only double quotes (")
- contain no escaped double quotes inside quoted text
- have properly paired quotes
…then split plus slicing is clean and fast.
Example:
text = 'some values are "alpha" "bravo" "charlie" in a sentence'
parts = text.split('"')
# parts becomes:
# ['some values are ', 'alpha', ' ', 'bravo', ' ', 'charlie', ' in a sentence']
quoted = parts[1::2]
print(quoted)
Output:
['alpha', 'bravo', 'charlie']

Why this works: every quote flips you from “outside” to “inside” and back. The inside segments land at odd indexes (1, 3, 5, …).
When I recommend this approach:
- You own the producer of the string (your own logger/formatter).
- You have a clear guarantee that no embedded " can appear.
- You want something a junior teammate can read in five seconds.
When I actively avoid it:
- The text can come from users, other systems, or copied documents.
- Escaped quotes or unmatched quotes can appear.
- You need to support both single and double quotes.
Performance notes: split is O(n) time in the length of the string and allocates a list of substrings. For short to medium strings, this is typically “too fast to care.” For very large inputs (hundreds of KB+), the extra allocations can matter.
A small but useful tweak: if you only need the first quoted value, don’t split the whole string. Use find twice and slice:
text = 'status="ok" trace_id="9f3c"'
start = text.find('"')
if start != -1:
    end = text.find('"', start + 1)
    if end != -1:
        first = text[start + 1:end]
    else:
        first = None
else:
    first = None
That’s still only safe under the “controlled input” contract, but it avoids creating a list.
Regular Expressions: Compact Extraction, With Sharp Edges
Regex is popular here because it compresses the “find quoted text” idea into one expression. I use regex for this when:
- the quoting rule is simple, and
- I’m extracting many matches from a single string, and
- I can write tests for the edge cases I care about.
A simple regex for basic double quotes
If you only care about "..." where ... contains no ", this is the classic pattern:
import re
text = 'status="ok" trace_id="9f3c" user="Ava"'
values = re.findall('"([^"]*)"', text)
print(values)
Output:
['ok', '9f3c', 'Ava']

What it does:
- " matches a literal double quote.
- ([^"]*) captures zero or more non-quote characters.
- the final " closes the match.
A small improvement I often make: compile the pattern once if I’m running it in a loop (e.g., parsing a million log lines):
import re
QUOTED = re.compile('"([^"]*)"')

def extract(text: str) -> list[str]:
    return QUOTED.findall(text)
A safer regex that supports escaped quotes
The “simple” pattern breaks as soon as the quoted text may contain escaped quotes. For example:
text = 'note="She said \\"hi\\"" id="42"'
A more realistic pattern is:
import re
text = 'note="She said \\"hi\\"" id="42"'
pattern = r'"((?:\\.|[^"\\])*)"'
values = re.findall(pattern, text)
print(values)
What it means:
- (?:\\.|[^"\\])* matches repeated tokens where each token is either:
  - an escape sequence like \" or \n (matched by \\.), or
  - a non-quote, non-backslash character (matched by [^"\\])
This returns the captured content without the outer quotes.
If you want to unescape common sequences, you can post-process. One pragmatic option (when the escaping follows Python-style backslash escapes) is decoding with unicode_escape, but be careful: it can interpret sequences you didn’t intend.
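Here is what that caution looks like in practice. This sketch decodes a string with unicode_escape; note that it interprets every Python-style escape, not just the ones you had in mind:

```python
import codecs

raw = 'She said \\"hi\\" on line\\none'
decoded = codecs.decode(raw, 'unicode_escape')
print(decoded)
# \" became a plain quote as hoped, but \n also became a real newline,
# which may or may not be what you wanted.
```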
A more explicit approach:
def unescape_backslash_quotes(s: str) -> str:
    return s.replace('\\"', '"').replace('\\\\', '\\')
Use only the unescapes you actually expect.
Using finditer to get positions (spans)
When I’m building tooling (redaction, highlighting, replacements), I don’t just want the string; I want where it came from.
Regex gives you spans for free via finditer:
import re
pattern = re.compile(r'"((?:\\.|[^"\\])*)"')
text = 'status="ok" note="She said \\"hi\\""'
for m in pattern.finditer(text):
    inner_value = m.group(1)
    outer_span = m.span(0)  # includes quotes
    inner_span = m.span(1)  # just the captured group
    print(inner_value, outer_span, inner_span)
Spans matter in real systems because they let you do safe transformations without “reconstructing” the string from tokens.
When regex becomes fragile
I stop using regex when:
- quotes can nest (even informally),
- the input can be malformed and I need graceful recovery,
- I need to support multiple quote types with different escaping rules,
- performance becomes unpredictable due to backtracking.
Regex can be correct here, but correctness often ends up living in a single dense string literal that few teammates want to maintain.
A good heuristic: if the regex needs more than one comment block to explain it, I strongly consider a small parser instead.
The “Production Default” I Trust: A Small State-Machine Parser
When input quality is variable, I prefer a tiny parser. It’s not fancy; it’s just explicit. The main reason: it’s easy to reason about, and it fails in ways you can choose.
Here’s a complete, runnable extractor that:
- supports a chosen quote character (default ")
- optionally supports backslash escaping
- lets you decide what to do with unmatched quotes
from __future__ import annotations

from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractOptions:
    quote: str = '"'
    allow_escapes: bool = True
    strict: bool = False    # if True, raise on unmatched quote
    unescape: bool = False  # if True, convert \" -> " inside results

class UnmatchedQuoteError(ValueError):
    pass

def extract_between_quotes(text: str, options: ExtractOptions = ExtractOptions()) -> list[str]:
    if len(options.quote) != 1:
        raise ValueError("quote must be a single character")
    results: list[str] = []
    buf: list[str] = []
    in_quotes = False
    escape_next = False
    for ch in text:
        if not in_quotes:
            if ch == options.quote:
                in_quotes = True
                buf.clear()
            continue
        # in_quotes == True
        if options.allow_escapes and escape_next:
            # Keep escape sequences as-is for now; unescape later if requested.
            buf.append('\\')
            buf.append(ch)
            escape_next = False
            continue
        if options.allow_escapes and ch == '\\':
            escape_next = True
            continue
        if ch == options.quote:
            value = ''.join(buf)
            if options.unescape and options.allow_escapes:
                value = value.replace('\\"', '"').replace('\\\\', '\\')
            results.append(value)
            in_quotes = False
            escape_next = False
            continue
        buf.append(ch)
    if in_quotes:
        if options.strict:
            raise UnmatchedQuoteError("unmatched quote in input")
        # If not strict, drop the dangling fragment (or change this policy if you prefer).
    return results
Try it:
text = 'note="She said \\"hi\\"" id="42"'
print(extract_between_quotes(text))
print(extract_between_quotes(text, ExtractOptions(unescape=True)))
Typical output:
['She said \\"hi\\"', '42']
['She said "hi"', '42']

Why I like this:
- The behavior is obvious.
- The “policy” knobs (strict, unescape) are explicit.
- It’s linear time O(n) and doesn’t have regex backtracking risk.
- It’s easy to extend (support single quotes too, or treat doubled quotes as escape, etc.).
If you need both ' and " in one pass, I usually call the function twice, or I write a variant that recognizes a set of quote characters and tracks which one opened the segment.
Extending the parser: support multiple quote types
If your contract says “extract from either single or double quotes,” you can model it as: “opening quote decides the closing quote.” That avoids weird mixing like opening with ' and closing with ".
Here’s a minimal variant that supports both " and ':
from dataclasses import dataclass

@dataclass(frozen=True)
class MultiQuoteOptions:
    quotes: tuple[str, ...] = ('"', "'")
    allow_escapes: bool = True
    strict: bool = False

def extract_between_any_quotes(text: str, options: MultiQuoteOptions = MultiQuoteOptions()) -> list[str]:
    results: list[str] = []
    buf: list[str] = []
    in_quotes = False
    escape_next = False
    active_quote: str | None = None
    for ch in text:
        if not in_quotes:
            if ch in options.quotes:
                in_quotes = True
                active_quote = ch
                buf.clear()
            continue
        # inside quotes
        if options.allow_escapes and escape_next:
            buf.append(ch)
            escape_next = False
            continue
        if options.allow_escapes and ch == '\\':
            escape_next = True
            continue
        if ch == active_quote:
            results.append(''.join(buf))
            in_quotes = False
            active_quote = None
            escape_next = False
            continue
        buf.append(ch)
    if in_quotes and options.strict:
        raise UnmatchedQuoteError(f"unmatched quote {active_quote!r} in input")
    return results
This variant chooses a different escape policy: it stores escaped characters without keeping the backslash in the buffer. That’s not “more correct” or “less correct”—it’s a contract choice.
The important part is that the rules are visible. When someone on your team asks “What happens with \' inside single quotes?” you can answer by reading the code.
A generator version for large inputs
If you’re processing huge inputs (multi-megabyte logs, streamed text, or very long prompt histories), allocating a full list might be unnecessary. A generator yields values as soon as they’re complete.
A generator version also composes nicely with pipelines (filtering, mapping, writing to a file).
Conceptually:
def iter_between_quotes(text: str, quote='"'):
    # yield each extracted value
    ...
Then:
for value in iter_between_quotes(big_text):
    process(value)
I won’t fully duplicate the implementation here, but the state machine is the same—just yield where the list version appends.
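For a concrete feel, here is a minimal sketch under the simple no-escape contract; the full options-aware version follows the same shape, with yield where the list version appends:

```python
def iter_between_quotes(text: str, quote: str = '"'):
    # Minimal sketch: no escape handling; yields values from complete pairs only.
    buf = []
    in_quotes = False
    for ch in text:
        if not in_quotes:
            if ch == quote:
                in_quotes = True
                buf = []
            continue
        if ch == quote:
            yield ''.join(buf)
            in_quotes = False
            continue
        buf.append(ch)

print(list(iter_between_quotes('a="x" b="y"')))
# → ['x', 'y']
```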
When You Should Not Hand-Roll It: Use Parsers for Real Formats
Sometimes “text with quotes” is actually a structured format. In those cases, I strongly recommend using the parser for that format instead of extracting quoted substrings as a workaround.
Here are the common ones I see:
CSV-like data
CSV rules include embedded quotes and separators. If you try to extract quoted values with your own code, you will break on legitimate data.
Use the standard library csv module.
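A quick illustration of why hand-rolled extraction breaks on CSV: fields escape quotes by doubling them, and csv.reader handles that for you:

```python
import csv
import io

# CSV escapes a quote inside a quoted field by doubling it ("").
line = 'id,message,level\n1,"He said ""hi""",info\n'
rows = list(csv.reader(io.StringIO(line)))
print(rows[1])
# → ['1', 'He said "hi"', 'info']
```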
JSON fragments
If the input is JSON (or close to it), treat it as JSON. Extracting strings between quotes from JSON is not parsing JSON.
Shell-like command lines
For shell tokenization rules (quotes, escapes, whitespace), use shlex:
import shlex
command = 'deploy --message "release candidate" --tag v2.4.1'
tokens = shlex.split(command)
print(tokens)
Output:
['deploy', '--message', 'release candidate', '--tag', 'v2.4.1']

Notice: you don’t even need “extract between quotes” anymore. You get parsed tokens with quotes removed.
Python literal strings embedded in text
If the source is literally a Python string literal (including escapes), ast.literal_eval can safely interpret it (within limits). That said, it won’t magically find multiple literals inside an arbitrary sentence—you’d still need to locate them.
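For example, if you have already located the text of one literal, ast.literal_eval interprets its escapes safely, with no arbitrary code execution:

```python
import ast

# The text of a single Python string literal, escapes and all.
literal = '"She said \\"hi\\""'
value = ast.literal_eval(literal)
print(value)
# → She said "hi"
```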
The broader point: if the input follows a known grammar, parse the grammar.
Choosing the Right Approach (Traditional vs Modern Workflows)
In 2026, the “right approach” is less about fancy tricks and more about maintainability and tests. I’m opinionated here: pick the simplest method that matches your contract, then lock it down with tests.
Here’s how I decide.
- Controlled input with a narrow contract: split('"')[1::2].
- Simple extraction, no escapes: re.findall('"([^"]*)"', text).
- Escapes or multiple quote types: a small state-machine parser rather than a complicated regex.
- A real format (CSV, JSON, shell-like): a dedicated parser (csv, json, shlex) rather than manual quote extraction.

Tooling I typically pair with this work:

- pytest for unit tests (fast feedback)
- hypothesis for property-based tests when inputs are unpredictable
- ruff for linting and consistency
- mypy or pyright for type checking when this logic becomes shared infrastructure
- AI-assisted code review for generating edge-case test cases, then I validate them with real runs
The key is that extraction code is “stringly-typed glue.” Glue fails silently unless you test it.
Practical Scenarios (What I Actually Use This For)
When you’re extracting quoted strings, the “why” tends to fall into a few buckets. Thinking in scenarios helps you pick the right contract.
1) Parsing logfmt-like lines (key="value")
A very common pattern is key/value logs where values are quoted:
line = 'status="ok" user="Ava" trace_id="9f3c"'
If you only need values, extraction is enough.
But if you need a dictionary of keys to values, don’t stop at “between quotes.” Use the quotes to protect values, then parse keys around them. One approach:
- Find quoted spans.
- Look left of each span for the preceding
key=. - Store
key -> extracted_value.
That’s exactly where span-based outputs become valuable.
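A compact variant captures the key and the value in one pattern; parse_logfmt_line is a hypothetical helper name, and it assumes keys are word characters immediately before the quoted value:

```python
import re

# Key (word characters), '=', then a quoted value that may contain backslash escapes.
PAIR = re.compile(r'(\w+)="((?:\\.|[^"\\])*)"')

def parse_logfmt_line(line: str) -> dict[str, str]:
    # Values are kept in raw (still-escaped) form; unescape separately if needed.
    return {m.group(1): m.group(2) for m in PAIR.finditer(line)}

print(parse_logfmt_line('status="ok" user="Ava" trace_id="9f3c"'))
# → {'status': 'ok', 'user': 'Ava', 'trace_id': '9f3c'}
```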
2) Redacting secrets inside quotes
If your system logs include token="..." or authorization="Bearer ...", you may need to redact values.
Redaction is where naive “extract and re-join” approaches can corrupt the original string (spacing, punctuation, ordering). Spans shine here because you can replace exactly the substring range.
A policy I use:
- Replace only the inside of quotes, keep the quotes.
- Keep the same length (or a fixed marker) if downstream tools rely on offsets.
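A span-based redactor following that policy might look like this sketch (redact_quoted is a hypothetical name; it uses the escaped-quote pattern from earlier and replaces only the inside of each pair):

```python
import re

QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"')

def redact_quoted(text: str, marker: str = "***") -> str:
    # Replace only the inside of each quoted span; the quotes themselves survive.
    out = []
    last = 0
    for m in QUOTED.finditer(text):
        start, end = m.span(1)  # span of the captured content, excluding quotes
        out.append(text[last:start])
        out.append(marker)
        last = end
    out.append(text[last:])
    return ''.join(out)

print(redact_quoted('token="abc123" user="Ava"'))
# → token="***" user="***"
```

Because the replacement is driven by spans, everything outside the quotes (spacing, punctuation, ordering) is untouched.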
3) Extracting “quoted phrases” for search or NLP
Sometimes you’re building a small search query language where quotes mean “treat this phrase as one token.”
Example input:
query = 'error "connection reset" region=us-east'
At that point, you’ve drifted into shell-like rules (tokens, whitespace, quotes). In many cases, shlex.split is a better fit than hand-rolled quote extraction.
4) Handling copy-pasted text from documents
Docs and chat apps often inject typographic quotes (“ ”). If you’re receiving text from a UI, normalize first. A quote extractor that only understands ASCII " is going to look “broken” to users who pasted from a rich editor.
Edge Cases and Mistakes I See in Code Reviews
These are the pitfalls that show up again and again.
1) Confusing “quoted phrases” with “strings”
A quoted substring in an arbitrary sentence is not a string literal in a programming language. If you treat it like one, you’ll unescape too much or too little.
2) Ignoring escaped quotes
If your input can include \", the naive regex and split approaches will misparse. You’ll either truncate early or produce extra fragments.
3) Assuming quotes are always balanced
Logs get truncated. Users paste partial snippets. Network payloads cut off. Decide whether you want strict failure or best-effort extraction.
If you want strict behavior, make it loud:
values = extract_between_quotes(text, ExtractOptions(strict=True))
4) Forgetting Unicode “smart quotes”
Docs and chat tools often replace " with “ ”. If you need to support that, normalize first.
Example normalization:
def normalize_quotes(text: str) -> str:
    return (text
            .replace('“', '"')
            .replace('”', '"')
            .replace('‘', "'")
            .replace('’', "'"))
Then run your extractor on normalized text.
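Putting normalization and extraction together (normalize_quotes is repeated here so the snippet runs standalone):

```python
import re

def normalize_quotes(text: str) -> str:
    # Map typographic quotes to their ASCII equivalents.
    return (text
            .replace('“', '"')
            .replace('”', '"')
            .replace('‘', "'")
            .replace('’', "'"))

QUOTED = re.compile('"([^"]*)"')

pasted = 'status=“ok” user=“Ava”'   # as it often arrives from a rich editor
print(QUOTED.findall(normalize_quotes(pasted)))
# → ['ok', 'Ava']
```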
5) Accidentally supporting more than you intended
This happens a lot with overly-permissive regex patterns. Someone writes a pattern that “works” for typical inputs, then it silently accepts malformed ones and produces surprising results.
If you choose best-effort extraction, that’s fine—but encode it in tests and docs so it’s a deliberate behavior, not an accident.
6) Using eval for unescaping or parsing
I still see this occasionally: someone extracts quoted text and calls eval to interpret escapes.
Don’t. It’s unsafe.
If you truly need to interpret escape sequences, prefer explicit logic (like .replace('\\"', '"')) or a safe parser for the intended format.
7) Not deciding what to do with doubled quotes
Some formats escape quotes by doubling them (for example, "" inside a quoted field in CSV-like data). Backslash escapes and doubled-quote escapes are different contracts.
If you see doubled quotes in your input, it’s a strong signal you should be using a dedicated parser (often csv).
Testing Your Extractor So It Stays Correct
If this logic matters, I write tests that encode the contract. Even three or four tests prevent most regressions.
Here are test cases I like, written as plain Python you can adapt to pytest:
def assert_eq(actual, expected):
    if actual != expected:
        raise AssertionError(f"Expected {expected}, got {actual}")

def run_tests():
    assert_eq(
        extract_between_quotes('status="ok" id="42"'),
        ['ok', '42'],
    )
    # No quotes
    assert_eq(extract_between_quotes('status=ok id=42'), [])
    # Empty quoted value
    assert_eq(extract_between_quotes('a="" b="x"'), ['', 'x'])
    # Escaped quote inside
    text = 'note="She said \\"hi\\""'
    assert_eq(
        extract_between_quotes(text, ExtractOptions(unescape=True)),
        ['She said "hi"'],
    )
    # Unmatched quote: non-strict keeps complete pairs and drops the dangling fragment
    assert_eq(extract_between_quotes('a="ok b="x"'), ['ok b='])
    # Unmatched quote: strict raises
    try:
        extract_between_quotes('a="ok b="x"', ExtractOptions(strict=True))
        raise AssertionError("Expected UnmatchedQuoteError")
    except UnmatchedQuoteError:
        pass

if __name__ == '__main__':
    run_tests()
    print('All tests passed')
If the input is unpredictable (user input, scraped text, mixed encodings), property-based testing can pay off. The idea is simple: generate random strings with quotes inserted, then assert invariants like “every returned value is exactly what was inside a matched pair.”
Here’s a sketch of the kind of properties I like (in plain language):
- If I build a string by concatenating outside chunks and quoted chunks, extraction should return exactly the quoted chunks.
- If I remove all quotes from the string, extraction should return an empty list.
- If I run extraction twice on the same input, I should get the same output (no hidden state).
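The first property can be checked without any framework by generating random inputs yourself; this sketch (check_round_trip and random_chunk are hypothetical names) builds strings from quote-free chunks and asserts the round trip, using the simple split extractor since the generated chunks contain no quotes or backslashes:

```python
import random
import string

def extract(text: str) -> list[str]:
    # Valid under this generator's contract: chunks never contain quotes.
    return text.split('"')[1::2]

def random_chunk(rng: random.Random) -> str:
    alphabet = string.ascii_letters + ' '
    return ''.join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))

def check_round_trip(seed: int) -> None:
    rng = random.Random(seed)
    quoted = [random_chunk(rng) for _ in range(rng.randint(0, 5))]
    text = random_chunk(rng)
    for q in quoted:
        text += '"' + q + '"' + random_chunk(rng)
    # Invariant: extraction returns exactly the quoted chunks, in order.
    assert extract(text) == quoted

for seed in range(200):
    check_round_trip(seed)
print("round-trip property held for 200 random inputs")
```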
I also like one practical “golden file” test: store a handful of real production lines that previously caused bugs (or that represent important formats), and assert the extractor output is stable. These tests pay for themselves the first time someone “simplifies” the code and accidentally breaks your parsing.
Performance and Reliability Notes (What Actually Matters)
People often worry about performance here prematurely. In practice:
- For short strings (a few KB), all methods are usually in the sub-millisecond to low-millisecond range on a modern laptop.
- The real cost is debugging misparses in production.
Still, performance can matter in pipelines. Here’s how I think about it:
- split is often fastest for the narrow contract, but it allocates a list and every substring.
- re.findall is convenient and fast enough, but you should compile the pattern if it’s hot.
- A state machine is O(n) and predictable. It’s rarely the fastest in microbenchmarks, but it’s often the most stable under weird inputs.
If you suspect this is hot code, don’t guess. Use timeit or a microbenchmark on representative data. The winner changes based on:
- number of matches per line
- typical string length
- frequency of escape sequences
- whether you post-process/unescape
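A minimal timeit sketch on synthetic data, for illustration; absolute numbers vary by machine and by the factors above, so compare ratios on your own representative inputs:

```python
import re
import timeit

text = 'status="ok" user="Ava" trace_id="9f3c" ' * 250
QUOTED = re.compile('"([^"]*)"')

split_time = timeit.timeit(lambda: text.split('"')[1::2], number=2000)
regex_time = timeit.timeit(lambda: QUOTED.findall(text), number=2000)
print(f"split: {split_time:.3f}s  regex: {regex_time:.3f}s")
```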
Reliability-wise, I heavily favor approaches with clear failure modes:
- Prefer raising on unmatched quotes in data pipelines where malformed inputs should be quarantined.
- Prefer best-effort extraction in user-facing features where partial results are better than errors.
A Simple Checklist Before You Ship
This is the checklist I run mentally before I merge quote-extraction code:
- Have I defined which quotes I accept (", ', typographic quotes)?
- Do I support escape sequences? If yes, which ones, and do I unescape them?
- What happens on malformed input (dangling quote, dangling backslash)?
- Do I need values, or do I need spans/positions?
- Am I accidentally parsing a real format (CSV/JSON/shell) that already has a parser?
- Do I have at least a handful of tests, including one or two nasty cases?
If you align your implementation with a clear contract and back it up with tests, “extract string from between quotations” stops being a recurring annoyance and becomes a solved, reliable utility.


