I keep running into the same small-but-annoying problem in real systems: a blob of text contains quoted values, and I need the parts inside the quotes—fast, correctly, and without “mystery behavior” when the input gets messy.
Maybe it’s a log line like: status="ok" trace_id="9f3c…" user="Ava". Maybe it’s a config snippet, a prompt template, or a CLI output where quoted phrases are the only stable boundary you can trust. The task sounds trivial until you hit escaped quotes (\"), unmatched quotation marks, or “smart quotes” pasted from a document.
In this post, I’ll show you the approaches I actually reach for in Python: quick slicing tricks for controlled inputs, regular expressions for compact extraction, and a small state-machine parser for when correctness matters. I’ll also cover how to choose the right method, common mistakes I see in reviews, and how to test your extractor so it stays correct as the input evolves.
Start With a Clear Contract: What Counts as “Between Quotes”?
Before you touch code, decide what you mean by “between quotation marks.” In my experience, most bugs here come from an unstated contract.
Here are the questions I ask (and I recommend you answer explicitly):
- Which quote characters: only double quotes (")? single quotes (') too? typographic quotes (“ ”)?
- Are escaped quotes allowed inside a quoted segment (for example: "She said \"hi\"")?
- Do you want only complete pairs, or should you accept partial/unbalanced input?
- Are quoted values allowed to span multiple lines?
- Is the input a “language” (CSV, JSON, shell-like tokens), or is it just text with quotes sprinkled in?
A simple example with a simple contract (double quotes only, no escaping rules) looks like this:
message = 'status="ok" trace_id="9f3c" user="Ava"'
If your contract is “collect the three values inside paired double quotes,” you want:
["ok", "9f3c", "Ava"]

But if your contract also allows escapes, this input should behave sensibly:
message = 'note="She said \\"hi\\"" id="42"'
Now you probably want:
['She said \\"hi\\"', '42']

Notice the subtlety: you might want the raw escaped form, or you might want to unescape it to actual quotes. That’s not “free”: you decide it.
A few more contract patterns I commonly use in practice:
- “Best-effort extraction”: return everything from complete quote pairs; ignore unmatched quotes; never raise.
- “Strict mode”: raise an error if there’s a dangling opening quote or a dangling escape.
- “Spans, not strings”: return start/end indices so the caller can slice, highlight, redact, or replace without re-parsing.
- “Normalize first”: convert typographic quotes to ASCII quotes before extraction.
That contract is your compass. Once it’s written down (in code comments, a docstring, or tests), the implementation becomes much easier to evaluate.
The Quick Win: Split + Slice When the Input Is Controlled
When I control the string format (or it’s produced by code I trust), I’ll often start with the simplest thing that can’t surprise me.
If the input is guaranteed to:
- use only double quotes (")
- contain no escaped double quotes inside quoted text
- have properly paired quotes
…then split plus slicing is clean and fast.
Example:
text = 'some values are "alpha" "bravo" "charlie" in a sentence'
parts = text.split('"')
# parts becomes:
# ['some values are ', 'alpha', ' ', 'bravo', ' ', 'charlie', ' in a sentence']
quoted = parts[1::2]
print(quoted)
Output:
['alpha', 'bravo', 'charlie']

Why this works: every quote flips you from “outside” to “inside” and back. The inside segments land at odd indexes (1, 3, 5, …).
When I recommend this approach:
- You own the producer of the string (your own logger/formatter).
- You have a clear guarantee that no embedded " can appear.
- You want something a junior teammate can read in five seconds.
When I actively avoid it:
- The text can come from users, other systems, or copied documents.
- Escaped quotes or unmatched quotes can appear.
- You need to support both single and double quotes.
Performance notes: split is O(n) time in the length of the string and allocates a list of substrings. For short to medium strings, this is typically “too fast to care.” For very large inputs (hundreds of KB+), the extra allocations can matter.
A small but useful tweak: if you only need the first quoted value, don’t split the whole string. Use find twice and slice:
text = 'status="ok" trace_id="9f3c"'
start = text.find('"')
if start != -1:
    end = text.find('"', start + 1)
    if end != -1:
        first = text[start + 1:end]
    else:
        first = None
else:
    first = None
That’s still only safe under the “controlled input” contract, but it avoids creating a list.
Regular Expressions: Compact Extraction, With Sharp Edges
Regex is popular here because it compresses the “find quoted text” idea into one expression. I use regex for this when:
- the quoting rule is simple, and
- I’m extracting many matches from a single string, and
- I can write tests for the edge cases I care about.
A simple regex for basic double quotes
If you only care about "..." where ... contains no ", this is the classic pattern:
import re
text = 'status="ok" trace_id="9f3c" user="Ava"'
values = re.findall('"([^"]*)"', text)
print(values)
Output:
['ok', '9f3c', 'Ava']

What it does:
- " matches a literal double quote.
- ([^"]*) captures zero or more non-quote characters.
- the final " closes the match.
A small improvement I often make: compile the pattern once if I’m running it in a loop (e.g., parsing a million log lines):
import re
QUOTED = re.compile('"([^"]*)"')

def extract(text: str) -> list[str]:
    return QUOTED.findall(text)
A safer regex that supports escaped quotes
The “simple” pattern breaks as soon as the quoted text may contain escaped quotes. For example:
text = 'note="She said \\"hi\\"" id="42"'
A more realistic pattern is:
import re
text = 'note="She said \\"hi\\"" id="42"'
pattern = r'"((?:\\.|[^"\\])*)"'
values = re.findall(pattern, text)
print(values)
What it means:
- (?:\\.|[^"\\])* matches repeated tokens where each token is either:
  - an escape sequence like \" or \n (matched by \\.), or
  - a non-quote, non-backslash character (matched by [^"\\])
This returns the captured content without the outer quotes.
If you want to unescape common sequences, you can post-process. One pragmatic option (when the escaping follows Python-style backslash escapes) is decoding with unicode_escape, but be careful: it can interpret sequences you didn’t intend.
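Here is what that caution looks like in practice. This sketch decodes a string with unicode_escape; note that it interprets every Python-style escape, not just the ones you had in mind:

```python
import codecs

raw = 'She said \\"hi\\" on line\\none'
decoded = codecs.decode(raw, 'unicode_escape')
print(decoded)
# \" became a plain quote as hoped, but \n also became a real newline,
# which may or may not be what you wanted.
```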
A more explicit approach:
def unescape_backslash_quotes(s: str) -> str:
    return s.replace('\\"', '"').replace('\\\\', '\\')
Use only the unescapes you actually expect.
Using finditer to get positions (spans)
When I’m building tooling (redaction, highlighting, replacements), I don’t just want the string; I want where it came from.
Regex gives you spans for free via finditer:
import re
pattern = re.compile(r'"((?:\\.|[^"\\])*)"')
text = 'status="ok" note="She said \\"hi\\""'
for m in pattern.finditer(text):
    inner_value = m.group(1)
    outer_span = m.span(0)  # includes quotes
    inner_span = m.span(1)  # just the captured group
    print(inner_value, outer_span, inner_span)
Spans matter in real systems because they let you do safe transformations without “reconstructing” the string from tokens.
When regex becomes fragile
I stop using regex when:
- quotes can nest (even informally),
- the input can be malformed and I need graceful recovery,
- I need to support multiple quote types with different escaping rules,
- performance becomes unpredictable due to backtracking.
Regex can be correct here, but correctness often ends up living in a single dense string literal that few teammates want to maintain.
A good heuristic: if the regex needs more than one comment block to explain it, I strongly consider a small parser instead.
The “Production Default” I Trust: A Small State-Machine Parser
When input quality is variable, I prefer a tiny parser. It’s not fancy; it’s just explicit. The main reason: it’s easy to reason about, and it fails in ways you can choose.
Here’s a complete, runnable extractor that:
- supports a chosen quote character (default ")
- optionally supports backslash escaping
- lets you decide what to do with unmatched quotes
from __future__ import annotations

from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractOptions:
    quote: str = '"'
    allow_escapes: bool = True
    strict: bool = False    # if True, raise on unmatched quote
    unescape: bool = False  # if True, convert \" -> " inside results

class UnmatchedQuoteError(ValueError):
    pass

def extract_between_quotes(text: str, options: ExtractOptions = ExtractOptions()) -> list[str]:
    if len(options.quote) != 1:
        raise ValueError("quote must be a single character")
    results: list[str] = []
    buf: list[str] = []
    in_quotes = False
    escape_next = False
    for ch in text:
        if not in_quotes:
            if ch == options.quote:
                in_quotes = True
                buf.clear()
            continue
        # in_quotes == True
        if options.allow_escapes and escape_next:
            # Keep escape sequences as-is for now; unescape later if requested.
            buf.append('\\')
            buf.append(ch)
            escape_next = False
            continue
        if options.allow_escapes and ch == '\\':
            escape_next = True
            continue
        if ch == options.quote:
            value = ''.join(buf)
            if options.unescape and options.allow_escapes:
                value = value.replace('\\"', '"').replace('\\\\', '\\')
            results.append(value)
            in_quotes = False
            escape_next = False
            continue
        buf.append(ch)
    if in_quotes:
        if options.strict:
            raise UnmatchedQuoteError("unmatched quote in input")
        # If not strict, drop the dangling fragment (or change this policy if you prefer).
    return results
Try it:
text = 'note="She said \\"hi\\"" id="42"'
print(extract_between_quotes(text))
print(extract_between_quotes(text, ExtractOptions(unescape=True)))
Typical output:
['She said \\"hi\\"', '42']
['She said "hi"', '42']

Why I like this:
- The behavior is obvious.
- The “policy” knobs (strict, unescape) are explicit.
- It’s linear time O(n) and doesn’t have regex backtracking risk.
- It’s easy to extend (support single quotes too, or treat doubled quotes as escape, etc.).
If you need both ' and " in one pass, I usually call the function twice, or I write a variant that recognizes a set of quote characters and tracks which one opened the segment.
Extending the parser: support multiple quote types
If your contract says “extract from either single or double quotes,” you can model it as: “opening quote decides the closing quote.” That avoids weird mixing like opening with ' and closing with ".
Here’s a minimal variant that supports both " and ':
from dataclasses import dataclass

@dataclass(frozen=True)
class MultiQuoteOptions:
    quotes: tuple[str, ...] = ('"', "'")
    allow_escapes: bool = True
    strict: bool = False

def extract_between_any_quotes(text: str, options: MultiQuoteOptions = MultiQuoteOptions()) -> list[str]:
    results: list[str] = []
    buf: list[str] = []
    in_quotes = False
    escape_next = False
    active_quote: str | None = None
    for ch in text:
        if not in_quotes:
            if ch in options.quotes:
                in_quotes = True
                active_quote = ch
                buf.clear()
            continue
        # inside quotes
        if options.allow_escapes and escape_next:
            buf.append(ch)
            escape_next = False
            continue
        if options.allow_escapes and ch == '\\':
            escape_next = True
            continue
        if ch == active_quote:
            results.append(''.join(buf))
            in_quotes = False
            active_quote = None
            escape_next = False
            continue
        buf.append(ch)
    if in_quotes and options.strict:
        raise UnmatchedQuoteError(f"unmatched quote {active_quote!r} in input")
    return results
This variant chooses a different escape policy: it stores escaped characters without keeping the backslash in the buffer. That’s not “more correct” or “less correct”—it’s a contract choice.
The important part is that the rules are visible. When someone on your team asks “What happens with \' inside single quotes?” you can answer by reading the code.
A generator version for large inputs
If you’re processing huge inputs (multi-megabyte logs, streamed text, or very long prompt histories), allocating a full list might be unnecessary. A generator yields values as soon as they’re complete.
A generator version also composes nicely with pipelines (filtering, mapping, writing to a file).
Conceptually:
def iter_between_quotes(text: str, quote='"'):
    # yield each extracted value
    ...
Then:
for value in iter_between_quotes(big_text):
    process(value)
I won’t fully duplicate the implementation here, but the state machine is the same—just yield where the list version appends.
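For a concrete feel, here is a minimal sketch under the simple no-escape contract; the full options-aware version follows the same shape, with yield where the list version appends:

```python
def iter_between_quotes(text: str, quote: str = '"'):
    # Minimal sketch: no escape handling; yields values from complete pairs only.
    buf = []
    in_quotes = False
    for ch in text:
        if not in_quotes:
            if ch == quote:
                in_quotes = True
                buf = []
            continue
        if ch == quote:
            yield ''.join(buf)
            in_quotes = False
            continue
        buf.append(ch)

print(list(iter_between_quotes('a="x" b="y"')))
# → ['x', 'y']
```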
When You Should Not Hand-Roll It: Use Parsers for Real Formats
Sometimes “text with quotes” is actually a structured format. In those cases, I strongly recommend using the parser for that format instead of extracting quoted substrings as a workaround.
Here are the common ones I see:
CSV-like data
CSV rules include embedded quotes and separators. If you try to extract quoted values with your own code, you will break on legitimate data.
Use the standard library csv module.
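A quick illustration of why hand-rolled extraction breaks on CSV: fields escape quotes by doubling them, and csv.reader handles that for you:

```python
import csv
import io

# CSV escapes a quote inside a quoted field by doubling it ("").
line = 'id,message,level\n1,"He said ""hi""",info\n'
rows = list(csv.reader(io.StringIO(line)))
print(rows[1])
# → ['1', 'He said "hi"', 'info']
```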
JSON fragments
If the input is JSON (or close to it), treat it as JSON. Extracting strings between quotes from JSON is not parsing JSON.
Shell-like command lines
For shell tokenization rules (quotes, escapes, whitespace), use shlex:
import shlex
command = 'deploy --message "release candidate" --tag v2.4.1'
tokens = shlex.split(command)
print(tokens)
Output:
['deploy', '--message', 'release candidate', '--tag', 'v2.4.1']

Notice: you don’t even need “extract between quotes” anymore. You get parsed tokens with quotes removed.
Python literal strings embedded in text
If the source is literally a Python string literal (including escapes), ast.literal_eval can safely interpret it (within limits). That said, it won’t magically find multiple literals inside an arbitrary sentence—you’d still need to locate them.
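For example, if you have already located the text of one literal, ast.literal_eval interprets its escapes safely, with no arbitrary code execution:

```python
import ast

# The text of a single Python string literal, escapes and all.
literal = '"She said \\"hi\\""'
value = ast.literal_eval(literal)
print(value)
# → She said "hi"
```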
The broader point: if the input follows a known grammar, parse the grammar.
Choosing the Right Approach (Traditional vs Modern Workflows)
In 2026, the “right approach” is less about fancy tricks and more about maintainability and tests. I’m opinionated here: pick the simplest method that matches your contract, then lock it down with tests.
Here’s how I decide.
- Controlled input with a narrow contract: split('"')[1::2].
- Simple extraction, no escapes: re.findall('"([^"]*)"', text).
- Escapes or multiple quote types: a small state-machine parser rather than a complicated regex.
- A real format (CSV, JSON, shell-like): a dedicated parser (csv, json, shlex) rather than manual quote extraction.

Tooling I typically pair with this work:

- pytest for unit tests (fast feedback)
- hypothesis for property-based tests when inputs are unpredictable
- ruff for linting and consistency
- mypy or pyright for type checking when this logic becomes shared infrastructure
- AI-assisted code review for generating edge-case test cases, then I validate them with real runs
The key is that extraction code is “stringly-typed glue.” Glue fails silently unless you test it.
Practical Scenarios (What I Actually Use This For)
When you’re extracting quoted strings, the “why” tends to fall into a few buckets. Thinking in scenarios helps you pick the right contract.
1) Parsing logfmt-like lines (key="value")
A very common pattern is key/value logs where values are quoted:
line = 'status="ok" user="Ava" trace_id="9f3c"'
If you only need values, extraction is enough.
But if you need a dictionary of keys to values, don’t stop at “between quotes.” Use the quotes to protect values, then parse keys around them. One approach:
- Find quoted spans.
- Look left of each span for the preceding
key=. - Store
key -> extracted_value.
That’s exactly where span-based outputs become valuable.
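A compact variant captures the key and the value in one pattern; parse_logfmt_line is a hypothetical helper name, and it assumes keys are word characters immediately before the quoted value:

```python
import re

# Key (word characters), '=', then a quoted value that may contain backslash escapes.
PAIR = re.compile(r'(\w+)="((?:\\.|[^"\\])*)"')

def parse_logfmt_line(line: str) -> dict[str, str]:
    # Values are kept in raw (still-escaped) form; unescape separately if needed.
    return {m.group(1): m.group(2) for m in PAIR.finditer(line)}

print(parse_logfmt_line('status="ok" user="Ava" trace_id="9f3c"'))
# → {'status': 'ok', 'user': 'Ava', 'trace_id': '9f3c'}
```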
2) Redacting secrets inside quotes
If your system logs include token="..." or authorization="Bearer ...", you may need to redact values.
Redaction is where naive “extract and re-join” approaches can corrupt the original string (spacing, punctuation, ordering). Spans shine here because you can replace exactly the substring range.
A policy I use:
- Replace only the inside of quotes, keep the quotes.
- Keep the same length (or a fixed marker) if downstream tools rely on offsets.
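A span-based redactor following that policy might look like this sketch (redact_quoted is a hypothetical name; it uses the escaped-quote pattern from earlier and replaces only the inside of each pair):

```python
import re

QUOTED = re.compile(r'"((?:\\.|[^"\\])*)"')

def redact_quoted(text: str, marker: str = "***") -> str:
    # Replace only the inside of each quoted span; the quotes themselves survive.
    out = []
    last = 0
    for m in QUOTED.finditer(text):
        start, end = m.span(1)  # span of the captured content, excluding quotes
        out.append(text[last:start])
        out.append(marker)
        last = end
    out.append(text[last:])
    return ''.join(out)

print(redact_quoted('token="abc123" user="Ava"'))
# → token="***" user="***"
```

Because the replacement is driven by spans, everything outside the quotes (spacing, punctuation, ordering) is untouched.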
3) Extracting “quoted phrases” for search or NLP
Sometimes you’re building a small search query language where quotes mean “treat this phrase as one token.”
Example input:
query = 'error "connection reset" region=us-east'
At that point, you’ve drifted into shell-like rules (tokens, whitespace, quotes). In many cases, shlex.split is a better fit than hand-rolled quote extraction.
4) Handling copy-pasted text from documents
Docs and chat apps often inject typographic quotes (“ ”). If you’re receiving text from a UI, normalize first. A quote extractor that only understands ASCII " is going to look “broken” to users who pasted from a rich editor.
Edge Cases and Mistakes I See in Code Reviews
These are the pitfalls that show up again and again.
1) Confusing “quoted phrases” with “strings”
A quoted substring in an arbitrary sentence is not a string literal in a programming language. If you treat it like one, you’ll unescape too much or too little.
2) Ignoring escaped quotes
If your input can include \", the naive regex and split approaches will misparse. You’ll either truncate early or produce extra fragments.
3) Assuming quotes are always balanced
Logs get truncated. Users paste partial snippets. Network payloads cut off. Decide whether you want strict failure or best-effort extraction.
If you want strict behavior, make it loud:
values = extract_between_quotes(text, ExtractOptions(strict=True))
4) Forgetting Unicode “smart quotes”
Docs and chat tools often replace " with “ ”. If you need to support that, normalize first.
Example normalization:
def normalize_quotes(text: str) -> str:
    return (text
            .replace('“', '"')
            .replace('”', '"')
            .replace('‘', "'")
            .replace('’', "'"))
Then run your extractor on normalized text.
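Putting normalization and extraction together (normalize_quotes is repeated here so the snippet runs standalone):

```python
import re

def normalize_quotes(text: str) -> str:
    # Map typographic quotes to their ASCII equivalents.
    return (text
            .replace('“', '"')
            .replace('”', '"')
            .replace('‘', "'")
            .replace('’', "'"))

QUOTED = re.compile('"([^"]*)"')

pasted = 'status=“ok” user=“Ava”'   # as it often arrives from a rich editor
print(QUOTED.findall(normalize_quotes(pasted)))
# → ['ok', 'Ava']
```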
5) Accidentally supporting more than you intended
This happens a lot with overly-permissive regex patterns. Someone writes a pattern that “works” for typical inputs, then it silently accepts malformed ones and produces surprising results.
If you choose best-effort extraction, that’s fine—but encode it in tests and docs so it’s a deliberate behavior, not an accident.
6) Using eval for unescaping or parsing
I still see this occasionally: someone extracts quoted text and calls eval to interpret escapes.
Don’t. It’s unsafe.
If you truly need to interpret escape sequences, prefer explicit logic (like .replace('\\"', '"')) or a safe parser for the intended format.
7) Not deciding what to do with doubled quotes
Some formats escape quotes by doubling them (for example, "" inside a quoted field in CSV-like data). Backslash escapes and doubled-quote escapes are different contracts.
If you see doubled quotes in your input, it’s a strong signal you should be using a dedicated parser (often csv).
Testing Your Extractor So It Stays Correct
If this logic matters, I write tests that encode the contract. Even three or four tests prevent most regressions.
Here are test cases I like, written as plain Python you can adapt to pytest:
def assert_eq(actual, expected):
    if actual != expected:
        raise AssertionError(f"Expected {expected}, got {actual}")

def run_tests():
    assert_eq(
        extract_between_quotes('status="ok" id="42"'),
        ['ok', '42'],
    )
    # No quotes
    assert_eq(extract_between_quotes('status=ok id=42'), [])
    # Empty quoted value
    assert_eq(extract_between_quotes('a="" b="x"'), ['', 'x'])
    # Escaped quote inside
    text = 'note="She said \\"hi\\""'
    assert_eq(
        extract_between_quotes(text, ExtractOptions(unescape=True)),
        ['She said "hi"'],
    )
    # Unmatched quote: non-strict keeps complete pairs and drops the dangling fragment
    assert_eq(extract_between_quotes('a="ok b="x"'), ['ok b='])
    # Unmatched quote: strict raises
    try:
        extract_between_quotes('a="ok b="x"', ExtractOptions(strict=True))
        raise AssertionError("Expected UnmatchedQuoteError")
    except UnmatchedQuoteError:
        pass

if __name__ == '__main__':
    run_tests()
    print('All tests passed')
If the input is unpredictable (user input, scraped text, mixed encodings), property-based testing can pay off. The idea is simple: generate random strings with quotes inserted, then assert invariants like “every returned value is exactly what was inside a matched pair.”
Here’s a sketch of the kind of properties I like (in plain language):
- If I build a string by concatenating outside chunks and quoted chunks, extraction should return exactly the quoted chunks.
- If I remove all quotes from the string, extraction should return an empty list.
- If I run extraction twice on the same input, I should get the same output (no hidden state).
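The first property can be checked without any framework by generating random inputs yourself; this sketch (check_round_trip and random_chunk are hypothetical names) builds strings from quote-free chunks and asserts the round trip, using the simple split extractor since the generated chunks contain no quotes or backslashes:

```python
import random
import string

def extract(text: str) -> list[str]:
    # Valid under this generator's contract: chunks never contain quotes.
    return text.split('"')[1::2]

def random_chunk(rng: random.Random) -> str:
    alphabet = string.ascii_letters + ' '
    return ''.join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))

def check_round_trip(seed: int) -> None:
    rng = random.Random(seed)
    quoted = [random_chunk(rng) for _ in range(rng.randint(0, 5))]
    text = random_chunk(rng)
    for q in quoted:
        text += '"' + q + '"' + random_chunk(rng)
    # Invariant: extraction returns exactly the quoted chunks, in order.
    assert extract(text) == quoted

for seed in range(200):
    check_round_trip(seed)
print("round-trip property held for 200 random inputs")
```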
I also like one practical “golden file” test: store a handful of real production lines that previously caused bugs (or that represent important formats), and assert the extractor output is stable. These tests pay for themselves the first time someone “simplifies” the code and accidentally breaks your parsing.
Performance and Reliability Notes (What Actually Matters)
People often worry about performance here prematurely. In practice:
- For short strings (a few KB), all methods are usually in the sub-millisecond to low-millisecond range on a modern laptop.
- The real cost is debugging misparses in production.
Still, performance can matter in pipelines. Here’s how I think about it:
- split is often fastest for the narrow contract, but it allocates a list and every substring.
- re.findall is convenient and fast enough, but you should compile the pattern if it’s hot.
- A state machine is O(n) and predictable. It’s rarely the fastest in microbenchmarks, but it’s often the most stable under weird inputs.
If you suspect this is hot code, don’t guess. Use timeit or a microbenchmark on representative data. The winner changes based on:
- number of matches per line
- typical string length
- frequency of escape sequences
- whether you post-process/unescape
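A minimal timeit sketch on synthetic data, for illustration; absolute numbers vary by machine and by the factors above, so compare ratios on your own representative inputs:

```python
import re
import timeit

text = 'status="ok" user="Ava" trace_id="9f3c" ' * 250
QUOTED = re.compile('"([^"]*)"')

split_time = timeit.timeit(lambda: text.split('"')[1::2], number=2000)
regex_time = timeit.timeit(lambda: QUOTED.findall(text), number=2000)
print(f"split: {split_time:.3f}s  regex: {regex_time:.3f}s")
```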
Reliability-wise, I heavily favor approaches with clear failure modes:
- Prefer raising on unmatched quotes in data pipelines where malformed inputs should be quarantined.
- Prefer best-effort extraction in user-facing features where partial results are better than errors.
A Simple Checklist Before You Ship
This is the checklist I run mentally before I merge quote-extraction code:
- Have I defined which quotes I accept (", ', typographic quotes)?
- Do I support escape sequences? If yes, which ones, and do I unescape them?
- What happens on malformed input (dangling quote, dangling backslash)?
- Do I need values, or do I need spans/positions?
- Am I accidentally parsing a real format (CSV/JSON/shell) that already has a parser?
- Do I have at least a handful of tests, including one or two nasty cases?
If you align your implementation with a clear contract and back it up with tests, “extract string from between quotations” stops being a recurring annoyance and becomes a solved, reliable utility.


