Extract String From Between Quotation Marks in Python (Practical Guide)

I run into quoted substrings constantly: log lines like level="warn" user="alice", config snippets, templated messages, and the occasional “human wrote this by hand” text file. The task sounds simple—grab whatever is between quotation marks—but the details matter. Are quotes always double quotes (")? Can values contain escaped quotes (\")? Are there unbalanced quotes? Do you want empty quoted values like ""? Do you want to support single quotes too?

In this post I’ll show you several practical ways to extract strings between quotation marks in Python, starting with the fastest “good enough” one-liners and working up to a small parser that behaves predictably on messy inputs. You’ll learn when regex is the right tool, when it quietly lies to you, and how to write extraction code you can ship without babysitting. I’ll also include runnable examples, edge-case tests, and a decision table so you can pick an approach quickly.

Define the target: which quotes, which rules?

Before I touch code, I pin down the contract. When someone says “extract string from between quotations,” they might mean any of these:

  • Extract between double quotes: "value" → value
  • Extract between single quotes: 'value' → value
  • Extract between either, without mixing: "a" and 'b'
  • Allow empty values: "" → "" (empty string)
  • Handle escaping inside: "he said \"hi\"" → he said "hi"
  • Treat the input as a programming-language literal (Python/JSON) rather than plain text

You should decide this up front because it changes the best implementation.

If your input is a plain text format you control (for example, you generate the logs yourself), I recommend you keep the rules simple:

  • Only double quotes delimit values
  • Inside a quoted value, a double quote must be escaped as \" (backslash + quote)
  • Backslash escapes only matter inside quotes

That gives you a stable target you can implement without surprises.

Two more “contract” choices I find important in real projects:

  • Strict vs lenient: In strict mode, malformed input raises an error. In lenient mode, you return “best effort” matches and maybe also return warnings.
  • Extract all vs extract one: Sometimes you want every quoted segment; sometimes you want just the first one; sometimes you want the quoted value for a specific key.

If you don’t write these down, you’ll end up debating behavior in code review and then rediscovering the debate six months later.

Quick decision guide (what I recommend you use)

Here’s the short version I follow in production code.

  • If quoted segments never contain escaped quotes: use re.findall or split + slicing.
  • If quoted segments can contain escaped quotes: skip naive regex; use a tiny parser (state machine).
  • If your input is actually CSV-ish: use csv.
  • If your input is shell-like tokens: use shlex.
  • If your input is JSON or Python literals: parse it as JSON/Python, don’t scrape it.

A small table helps make this concrete.

Traditional vs Modern approach selection (same Python tools, different mindset):

Problem shape                           | Traditional quick pick           | Modern, safer pick
----------------------------------------|----------------------------------|---------------------------------------------
Simple text with "value" and no escapes | re.findall or split('"')[1::2]   | Same, plus validation for unbalanced quotes
Text with escaped quotes inside values  | "Try a fancier regex"            | Write a state machine that handles \"
CSV / spreadsheet-export text           | Regex                            | csv.reader with proper dialect
Shell-like command strings              | Regex                            | shlex.split then post-process
JSON / Python literals                  | Regex                            | json.loads / ast.literal_eval then traverse

I’m calling this “modern” because reliability is the default expectation. You can still write small code, but you should choose parsers over pattern-matching when a format already has one.

Method 1: Regex findall() for clean, well-formed double quotes

When input is consistent and values don’t contain escaped quotes, re.findall() is a clean solution.

Runnable example:

import re

text = 'Some values: user="alice" role="admin" feature="beta".'

# Capture anything that's not a double quote, between double quotes
values = re.findall(r'"([^"]*)"', text)
print(values)

Expected output:

['alice', 'admin', 'beta']

Why this works:

  • " matches a literal double quote.
  • ([^"]*) captures zero or more non-quote characters.
  • The closing " ends the match.

What I watch out for:

  • Greedy patterns: "(.*)" will often swallow too much if there are multiple quoted segments.
  • Newlines: dot . doesn’t match newlines unless you enable flags; [^"]* is often simpler and more predictable for this use.
  • Escaped quotes: this method does not understand \". It will stop early.

A concrete failure case:

import re

text = 'message="She said \\"hello\\" to me" user="alice"'
print(re.findall(r'"([^"]*)"', text))

Output:

['She said \\', ' to me', 'alice']

That’s not a “small bug”; it’s a different problem. If escapes are possible, switch methods.

One more practical note: a regex like r'"([^"]*)"' treats a quote as a hard stop. That’s exactly what you want when quotes are never allowed inside values. But it also means you should avoid using it on “unknown” text, because it can return plausible-but-wrong results.

Performance notes:

  • For typical short lines (hundreds of characters), this runs fast (usually sub-millisecond).
  • Complexity is O(n) on the input length for this pattern.

Method 2: split() + slicing for maximum simplicity

If the string is guaranteed to be well-formed and you only care about double quotes, splitting is remarkably effective.

Runnable example:

text = 'Some values: user="alice" role="admin" feature="beta".'

# Split on the quote character; odd indices are inside quotes
values = text.split('"')[1::2]
print(values)

Expected output:

['alice', 'admin', 'beta']

Why it works:

  • split(‘"‘) creates chunks around every quote.
  • The inside-quote segments land at indices 1, 3, 5, … as long as quotes are balanced.

What can go wrong:

  • Unbalanced quotes: you’ll silently get partial results.
  • Escaped quotes: split doesn’t care about context; it will split on the escaped quote too.

If you still want to use split but you care about unbalanced quotes, add a quick check:

def extract_between_quotes_split(text: str) -> list[str]:
    if text.count('"') % 2 != 0:
        raise ValueError('Unbalanced double quotes in input')
    return text.split('"')[1::2]

This doesn’t handle escapes, but it prevents quiet corruption.

A small improvement I sometimes make: if the input is very large and you only need the first quoted substring, split is wasteful because it splits the entire string. In that case, use find/index (I show a “first match only” method later).

Performance notes:

  • This is typically very fast for short strings.
  • Complexity is O(n), and it allocates a list of parts, so memory can spike if the input is huge.

Method 3: Token scanning with startswith()/endswith() (narrow but readable)

Sometimes your text is already whitespace-tokenized in a way that makes this easy. For example, you may have values that appear as standalone tokens like "alpha" "beta".

Runnable example:

text = 'I saw "alpha" and "beta" and "gamma" today.'

tokens = text.split()
values = []
for token in tokens:
    if token.startswith('"') and token.endswith('"') and len(token) >= 2:
        values.append(token[1:-1])
print(values)

Expected output:

['alpha', 'beta', 'gamma']

Where this breaks:

  • Punctuation: "alpha", ends with a comma, not a quote.
  • Spaces inside quotes: "alpha beta" spans two tokens.
  • Escaped quotes: still not handled.

I treat this as a “narrow tool” for cases where I control the formatting and values are single tokens.

If you want a slightly more robust “token-ish” variant, strip common trailing punctuation before checking:

import string

# Strip punctuation from the token edges, but keep the quote characters
# themselves, since the quote check still needs to see them.
PUNCT = string.punctuation.replace('"', '').replace("'", '')

def strip_edge_punct(token: str) -> str:
    return token.strip(PUNCT)

That’s still not a real parser, but it can be enough for human-friendly sentences.

When your data is structured: prefer a real parser (csv, shlex, JSON)

A lot of “quoted substring” problems are actually “I’m parsing a known format with quotes.” If that’s you, you’ll get fewer bugs by parsing the format directly.

CSV-style quoting: csv module

If your input is comma-separated (or tab-separated) with quoted fields, use csv. It handles embedded commas, quotes, and escapes according to CSV rules.

Runnable example:

import csv
from io import StringIO

line = 'alice,"Senior Engineer, Platform","She said ""hello"" yesterday"'
reader = csv.reader(StringIO(line))
row = next(reader)
print(row)

Expected output:

['alice', 'Senior Engineer, Platform', 'She said "hello" yesterday']

Notice how CSV escapes quotes by doubling them (""). That’s a different escape convention than backslash.
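To see the doubling convention from the writing side, you can round-trip a value through csv.writer; this is a small sketch using only the standard library:

```python
import csv
from io import StringIO

buf = StringIO()
writer = csv.writer(buf)
# A field containing a quote gets wrapped in quotes, and the inner
# quote is doubled (the csv module's default doublequote behavior).
writer.writerow(['alice', 'She said "hello"'])
print(buf.getvalue().strip())  # alice,"She said ""hello"""
```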

If you’re dealing with TSV or pipes, you can set the delimiter:

reader = csv.reader(StringIO(line), delimiter='\t')

If you’re dealing with “almost CSV,” I still try csv first because it fails in obvious ways, and the fixes tend to be about the dialect rather than rewriting parsing logic.

Shell-like quoting: shlex

If the string behaves like a command line with quotes and backslashes, shlex is usually the right tool.

Runnable example:

import shlex

command = 'deploy --user "alice" --message "ship it" --tag "release-2026-02"'
tokens = shlex.split(command)
print(tokens)

Expected output:

['deploy', '--user', 'alice', '--message', 'ship it', '--tag', 'release-2026-02']

Then you can extract arguments following certain flags instead of scraping everything between quotes.

A real-world pattern: when I want --message only, I don’t “extract between quotes”; I parse tokens and interpret flags:

def get_flag_value(tokens: list[str], flag: str) -> str | None:
    try:
        i = tokens.index(flag)
    except ValueError:
        return None
    if i + 1 >= len(tokens):
        return None
    return tokens[i + 1]

JSON and Python literals: don’t scrape, parse

If your input is something like { "user": "alice" } and you want values, parse JSON:

import json

payload = '{"user":"alice","role":"admin"}'
obj = json.loads(payload)
print(obj['user'], obj['role'])

If it’s a Python literal (and you trust the source), ast.literal_eval can help. If you don’t trust the source, do not eval it.

One important mental shift: if your goal is “extract the values,” but the input is actually structured, then “string between quotes” is just an implementation detail of that structure. Parse the structure and you won’t care about quoting rules at all.
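For nested JSON, “extract every string value” becomes a small traversal over the parsed structure rather than a quoting problem. Here is a sketch; collect_strings is my own helper name, not part of the json API:

```python
import json

def collect_strings(obj) -> list[str]:
    # Recursively gather every string value from a parsed JSON structure.
    if isinstance(obj, str):
        return [obj]
    if isinstance(obj, dict):
        return [s for v in obj.values() for s in collect_strings(v)]
    if isinstance(obj, list):
        return [s for item in obj for s in collect_strings(item)]
    return []  # numbers, booleans, None carry no string content

data = json.loads('{"user": "alice", "tags": ["a", "b"], "count": 3}')
print(collect_strings(data))  # ['alice', 'a', 'b']
```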

Method 4: Regex that tolerates escaped quotes (use sparingly)

Sometimes you’re in a place where regex is still reasonable:

  • You know escapes are possible.
  • You want a quick extractor.
  • The input size is small.
  • You’re willing to accept that regex readability drops quickly.

A commonly used pattern for backslash-escaped quotes inside double quotes is:

  • Start at a "
  • Then repeat either:
      – any non-quote, non-backslash character, or
      – a backslash followed by any character (an escaped character)
  • Until the closing "

In code:

import re

pattern = re.compile(r'"((?:[^"\\]|\\.)*)"')

text = 'user="alice" message="She said \\"hello\\"" empty=""'
matches = pattern.findall(text)
print(matches)

Expected output:

['alice', 'She said \\"hello\\"', '']

Notice the escape behavior: this regex returns the raw content including backslashes. That can be what you want (preserve original text), or not what you want (you want the unescaped value).

If you want to unescape \" and \\ in a controlled way, you can post-process:

def unescape_backslashes(s: str) -> str:
    # Minimal unescape for \" and \\.
    # If you need full escape semantics, use the state machine method instead.
    out = []
    i = 0
    while i < len(s):
        ch = s[i]
        if ch == '\\' and i + 1 < len(s):
            out.append(s[i + 1])
            i += 2
        else:
            out.append(ch)
            i += 1
    return ''.join(out)

values = [unescape_backslashes(m) for m in matches]
print(values)

Now you get:

['alice', 'She said "hello"', '']

Why I still prefer a parser when escapes exist:

  • The regex is harder to audit.
  • Post-processing unescape rules can drift from the regex matching rules.
  • Debugging “one weird input” is easier with explicit state.

If you’re writing production code and you control the format, I generally stop at Method 1/2 (no escapes) or jump straight to the state machine (escapes).

Method 5: Extract the quoted value for a specific key (logs/config)

A very common case is not “give me all quoted substrings,” but “give me the value for this field.” Example:

level="warn" user="alice" action="login" ip="203.0.113.5"

If you blindly extract everything, you lose the mapping. Better is to parse key="value" pairs.

Simple key extraction (no escapes)

If you know values do not contain escaped quotes, a small regex keyed by name is compact:

import re

def get_quoted_field(text: str, key: str) -> str | None:
    # Example: key='user' -> match user="..."
    pattern = re.compile(rf'\b{re.escape(key)}="([^"]*)"')
    m = pattern.search(text)
    return None if m is None else m.group(1)

line = 'level="warn" user="alice" action="login"'
print(get_quoted_field(line, 'user'))

Expected output:

alice

Key extraction with escapes (parser-based)

If values may contain escaped quotes, I prefer parsing all key/value pairs in one pass:

from __future__ import annotations

def parse_kv_quoted(text: str) -> dict[str, str]:
    out: dict[str, str] = {}
    i = 0
    n = len(text)

    def skip_spaces(j: int) -> int:
        while j < n and text[j].isspace():
            j += 1
        return j

    while i < n:
        i = skip_spaces(i)
        if i >= n:
            break
        # Parse key up to '=' or whitespace
        start = i
        while i < n and (text[i].isalnum() or text[i] in ('_', '-', '.')):
            i += 1
        key = text[start:i]
        if not key:
            # Skip unknown junk
            i += 1
            continue
        i = skip_spaces(i)
        if i >= n or text[i] != '=':
            continue
        i += 1
        i = skip_spaces(i)
        # Only accept quoted values
        if i >= n or text[i] != '"':
            continue
        i += 1
        # Parse quoted value with backslash escapes
        buf: list[str] = []
        escape = False
        while i < n:
            ch = text[i]
            i += 1
            if escape:
                buf.append(ch)
                escape = False
                continue
            if ch == '\\':
                escape = True
                continue
            if ch == '"':
                out[key] = ''.join(buf)
                break
            buf.append(ch)
        else:
            raise ValueError('Unbalanced double quotes while parsing field: ' + key)
        if escape:
            raise ValueError('Dangling escape while parsing field: ' + key)
    return out

line = 'user="alice" message="She said \\"hello\\"" empty=""'
print(parse_kv_quoted(line))

Expected output:

{'user': 'alice', 'message': 'She said "hello"', 'empty': ''}

I like having this in my toolbox because it turns a “string scraping” task into something closer to “parsing a simple record format,” which is easier to reason about.

The reliable method: a small state machine that supports escaped quotes

When escaped quotes are on the table, I stop trying to be clever with regex. Yes, you can write complex patterns, but they get fragile quickly, and they’re painful to review.

A tiny parser is short, readable, and does the right thing.

Contract for this parser:

  • Double quotes delimit values
  • Backslash escapes inside quotes (so \" becomes a literal quote)
  • Returns all quoted segments
  • Raises on unbalanced quotes

Runnable example:

from __future__ import annotations

def extract_quoted(text: str) -> list[str]:
    values: list[str] = []
    buf: list[str] = []
    in_quotes = False
    escape = False
    for ch in text:
        if not in_quotes:
            if ch == '"':
                in_quotes = True
                buf.clear()
            continue
        # We are inside quotes
        if escape:
            # Keep the escaped character as-is (\" becomes ")
            buf.append(ch)
            escape = False
            continue
        if ch == '\\':
            escape = True
            continue
        if ch == '"':
            # End of quoted segment
            values.append(''.join(buf))
            in_quotes = False
            continue
        buf.append(ch)
    if escape:
        # Trailing backslash inside quotes is usually malformed
        raise ValueError('Dangling escape at end of input')
    if in_quotes:
        raise ValueError('Unbalanced double quotes in input')
    return values

if __name__ == '__main__':
    text = 'user="alice" message="She said \\"hello\\" to me" empty=""'
    print(extract_quoted(text))

Expected output:

['alice', 'She said "hello" to me', '']

Why I like this approach:

  • It’s O(n) and single-pass.
  • It handles escapes in a way you can explain to a teammate in 30 seconds.
  • It fails loudly on malformed input, which is what you want when data quality matters.

Supporting both single and double quotes (without mixing)

If you want to support both ' and ", I recommend tracking the active quote character. The key rule I enforce is: once a quoted segment starts with one quote type, it must end with the same type.

def extract_quoted_any(text: str) -> list[str]:
    values: list[str] = []
    buf: list[str] = []
    quote: str | None = None
    escape = False
    for ch in text:
        if quote is None:
            if ch in ('"', "'"):
                quote = ch
                buf.clear()
            continue
        if escape:
            buf.append(ch)
            escape = False
            continue
        if ch == '\\':
            escape = True
            continue
        if ch == quote:
            values.append(''.join(buf))
            quote = None
            continue
        buf.append(ch)
    if escape:
        raise ValueError('Dangling escape at end of input')
    if quote is not None:
        raise ValueError('Unbalanced quotes in input')
    return values

I keep this “either quote” parser around for user-authored text, but for machine-generated logs/config I usually standardize on double quotes only.

Strict vs lenient mode (what to do with broken input)

Strict parsing (raising errors) is great when:

  • You control the producer.
  • A parsing error should stop a pipeline.
  • Silent corruption would be worse than missing data.

Lenient parsing is better when:

  • You ingest third-party logs.
  • You need partial recovery.
  • You want to keep the pipeline flowing while flagging anomalies.

A simple lenient variant is: return what you found, and ignore an unclosed quote at EOF.

def extract_quoted_lenient(text: str) -> tuple[list[str], bool]:
    values: list[str] = []
    buf: list[str] = []
    in_quotes = False
    escape = False
    for ch in text:
        if not in_quotes:
            if ch == '"':
                in_quotes = True
                buf.clear()
            continue
        if escape:
            buf.append(ch)
            escape = False
            continue
        if ch == '\\':
            escape = True
            continue
        if ch == '"':
            values.append(''.join(buf))
            in_quotes = False
            continue
        buf.append(ch)
    # If we ended inside quotes or after a backslash, call it malformed
    malformed = in_quotes or escape
    return values, malformed

I like returning a (values, malformed) tuple because it forces the caller to decide whether “best effort” is acceptable.

Edge cases you should decide on (and test)

Here are the cases that bite teams later because nobody wrote them down.

1) Empty quoted segments

  • Input: note=""
  • Do you want [''] or do you want to drop empties?

2) Adjacent quoted segments

  • Input: "a""b"
  • Is that ['a', 'b'] or is it invalid for your format?

3) Quotes mixed with punctuation

  • Input: name="alice", role="admin";
  • Split/token methods often break here; regex and the state machine handle it.

4) Multiline text

  • Input:

message="Line one\nLine two"

  • If your input contains literal newlines inside quoted segments (not \n), your parser needs to accept them; the state machine does, because it iterates characters.

5) Smart quotes

  • Input: “alice” (curly quotes)
  • Decide whether you normalize them first. If this input is user-authored, I often normalize by replacing smart quotes with " before parsing.

A simple normalizer:

def normalize_smart_quotes(text: str) -> str:
    return (
        text.replace('“', '"')
        .replace('”', '"')
        .replace('‘', "'")
        .replace('’', "'")
    )

If you normalize, document it clearly so you don’t surprise downstream systems.

6) Backslashes outside of quotes

  • Input: path=C:\Temp\file.txt user="alice"
  • If your contract says “escapes matter only inside quotes,” then backslashes outside quotes should be treated as ordinary characters. The state machine I showed does exactly that.

7) Unicode and combining characters

  • Most of the time, iterating Python strings “just works” because Python iterates by Unicode code point.
  • If you’re processing text where quotes might be represented by different Unicode characters (like corner quotes « »), normalize first or extend your quote set intentionally.

Common mistakes I see (and how I avoid them)

  • Using "(.*)" and being shocked by the result. That pattern is greedy and will often match from the first quote to the last quote.
  • Assuming escapes don’t exist because you haven’t seen them yet. If humans can type the input, you will eventually see them.
  • Forgetting validation. If malformed input is possible, you want ValueError (or a domain error) instead of a silent partial list.
  • Parsing a structured format as plain text. If it’s CSV, JSON, shell tokens, or an actual programming-language literal, parsing it directly is simpler and more correct.
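The greedy-pattern pitfall is easy to demonstrate side by side:

```python
import re

text = 'a="x" b="y"'
# Greedy: spans from the first quote to the last quote in the line
print(re.findall(r'"(.*)"', text))      # ['x" b="y']
# Non-quote character class: stops at each closing quote
print(re.findall(r'"([^"]*)"', text))   # ['x', 'y']
```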

Two more subtle pitfalls:

  • Mixing quote types without a rule: If your input can contain both ' and ", decide whether you treat them as equivalent delimiters or distinct. “Equivalent” sounds convenient, but it can produce surprising matches in text like: He wrote "it's fine".
  • Over-decoding escapes: I’ve seen code that tries to interpret \n, \t, \uXXXX, etc. That can be correct for some formats (like JSON strings) and wildly incorrect for others (like ad-hoc logs). Only decode escapes that your format actually defines.

Testing your extractor (what I do)

I rarely ship string-parsing code without at least a handful of tests. Parsing bugs are cheap to create and expensive to debug.

Here’s a minimal self-contained test harness you can run with plain Python:

def run_tests() -> None:
    assert extract_quoted('a="b"') == ['b']
    assert extract_quoted('a="" b="c"') == ['', 'c']
    assert extract_quoted('msg="She said \\"hi\\""') == ['She said "hi"']
    try:
        extract_quoted('a="b')
    except ValueError:
        pass
    else:
        raise AssertionError('Expected ValueError for unbalanced quotes')
    try:
        extract_quoted('a="b\\')
    except ValueError:
        pass
    else:
        raise AssertionError('Expected ValueError for dangling escape')

if __name__ == '__main__':
    run_tests()
    print('OK')

If you already use a test runner, I recommend pytest plus property-based tests (Hypothesis) for this kind of logic. Property tests are excellent for “any random string with balanced quotes should round-trip” style checks.

A practical property I like:

  • If you take a list of values, encode them into a format like k="...", then parse, you should get the original values back.

Even without Hypothesis, you can simulate the idea with a few randomized cases, as long as you keep it deterministic in CI.
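Here is what that round-trip idea can look like with deterministic randomized cases and no extra dependencies. The encoder and the value alphabet are my own simplifications: values contain no quotes or backslashes, so the simple regex from Method 1 is the correct decoder:

```python
import random
import re

random.seed(42)  # keep the randomized cases deterministic in CI

ALPHABET = 'abc xyz123'  # deliberately excludes quotes and backslashes

def random_value() -> str:
    return ''.join(random.choice(ALPHABET) for _ in range(random.randint(0, 8)))

for _ in range(100):
    values = [random_value() for _ in range(random.randint(0, 5))]
    encoded = ' '.join(f'k="{v}"' for v in values)
    # Property: encode then decode returns the original values, in order
    assert re.findall(r'"([^"]*)"', encoded) == values

print('round-trip OK')
```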

Method 6: Extract only the first quoted substring (fast path)

Sometimes you don’t need “all matches.” You need the first quoted substring after some marker. In that case, avoid building lists and scanning more than necessary.

No escapes, first match

def first_quoted_simple(text: str) -> str | None:
    start = text.find('"')
    if start == -1:
        return None
    end = text.find('"', start + 1)
    if end == -1:
        raise ValueError('Unbalanced double quotes in input')
    return text[start + 1:end]

This is tiny and fast, but it has the same limitation: escaped quotes will break it.

Escapes, first match (state machine)

def first_quoted(text: str) -> str | None:
    in_quotes = False
    escape = False
    buf: list[str] = []
    for ch in text:
        if not in_quotes:
            if ch == '"':
                in_quotes = True
                buf.clear()
            continue
        if escape:
            buf.append(ch)
            escape = False
            continue
        if ch == '\\':
            escape = True
            continue
        if ch == '"':
            return ''.join(buf)
        buf.append(ch)
    if escape:
        raise ValueError('Dangling escape at end of input')
    if in_quotes:
        raise ValueError('Unbalanced double quotes in input')
    return None

I use this when I’m parsing “prefix + quoted payload” formats (like msg="..." where I only care about the message).

Working with large inputs (files, streams, and memory)

Everything so far assumes you already have the string. In practice, I often extract quoted content from:

  • multi-GB log files
  • streaming input (stdin)
  • lists of records from a database

Two practical rules:

  • Process line-by-line if you can. Don’t read() a huge file into memory just to run a regex.
  • Prefer “yield” for streaming extraction. You can keep memory flat and handle backpressure naturally.

Here’s a generator version of the state machine that yields each match as soon as it closes:

from __future__ import annotations

from collections.abc import Iterator

def iter_quoted(text: str) -> Iterator[str]:
    buf: list[str] = []
    in_quotes = False
    escape = False
    for ch in text:
        if not in_quotes:
            if ch == '"':
                in_quotes = True
                buf.clear()
            continue
        if escape:
            buf.append(ch)
            escape = False
            continue
        if ch == '\\':
            escape = True
            continue
        if ch == '"':
            yield ''.join(buf)
            in_quotes = False
            continue
        buf.append(ch)
    # Note: because this is a generator, these errors surface only
    # when the caller consumes it to the end.
    if escape:
        raise ValueError('Dangling escape at end of input')
    if in_quotes:
        raise ValueError('Unbalanced double quotes in input')

And a line-by-line file sketch:

def extract_from_file(path: str) -> list[str]:
    out: list[str] = []
    with open(path, 'r', encoding='utf-8', errors='replace') as f:
        for line in f:
            out.extend(extract_quoted(line))
    return out

If quoted strings can span lines, then “line-by-line” is not correct. In that case you need either:

  • a parser that runs across the whole stream (tracking whether you’re currently inside quotes), or
  • an upstream guarantee that quoted values do not contain literal newlines.

This is another contract point: decide whether quoted values may contain literal newlines.
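If you do decide quoted values may span lines, one option is to run the same state machine across the whole stream so its state survives line boundaries. A minimal sketch under the same double-quote/backslash contract as above (extract_quoted_stream is my name for it, not something shown earlier):

```python
from collections.abc import Iterable

def extract_quoted_stream(lines: Iterable[str]) -> list[str]:
    # Same state machine as the single-string version, but in_quotes/escape
    # persist across lines, so values may contain literal newlines.
    values: list[str] = []
    buf: list[str] = []
    in_quotes = False
    escape = False
    for line in lines:
        for ch in line:
            if not in_quotes:
                if ch == '"':
                    in_quotes = True
                    buf.clear()
                continue
            if escape:
                buf.append(ch)
                escape = False
                continue
            if ch == '\\':
                escape = True
                continue
            if ch == '"':
                values.append(''.join(buf))
                in_quotes = False
                continue
            buf.append(ch)
    if escape or in_quotes:
        raise ValueError('Malformed quoting at end of stream')
    return values

# Works on any iterable of lines, including an open file object:
print(extract_quoted_stream(['a="line one\n', 'line two" b="c"\n']))
```

Because it accepts any iterable of strings, memory stays flat even for large files: you never hold more than the current line plus the current quoted value.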

Performance considerations (what actually matters)

People love asking “which is fastest?” but the more useful question is “which is fast enough and correct for my inputs?”

Here’s the rough performance intuition I rely on:

  • split is extremely fast for simple cases, but allocates a lot.
  • Simple regex ("([^"]*)") is fast and convenient for clean inputs.
  • The state machine is consistently fast, predictable, and avoids catastrophic regex behavior.
  • Complex regex for escapes can be okay, but readability and correctness usually dominate.

If you want a practical approach to performance without obsessing:

  • Pick the correct method first.
  • If you process millions of lines per minute, prefer the state machine or split with validation.
  • Avoid doing multiple passes (like extracting with regex and then re-parsing each match).

Also, measure the whole pipeline. I’ve “optimized” a parser only to discover the real bottleneck was file I/O or JSON decoding downstream.
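If you do want numbers for your own data, a small timeit harness is enough. The sample line is a made-up stand-in; absolute timings vary by machine, so treat the printout as relative guidance only:

```python
import re
import timeit

line = 'user="alice" role="admin" feature="beta" ' * 4
pattern = re.compile(r'"([^"]*)"')

# Sanity check first: both approaches agree on clean, escape-free input
assert line.split('"')[1::2] == pattern.findall(line)

t_split = timeit.timeit(lambda: line.split('"')[1::2], number=100_000)
t_regex = timeit.timeit(lambda: pattern.findall(line), number=100_000)
print(f'split: {t_split:.3f}s  regex: {t_regex:.3f}s')
```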

Choosing the best approach (clear recommendations)

If you want a rule you can apply quickly:

  • Use split('"')[1::2] when your input is guaranteed balanced, quotes never appear inside values, and you want the smallest code.
  • Use re.findall(r'"([^"]*)"', text) for the same constraints, especially if you want a familiar one-liner.
  • Use the state machine when you can’t guarantee clean inputs or you need to support escaped quotes. This is the version I recommend for logs, config, and user-authored text.
  • Use csv, shlex, or json when you’re actually parsing those formats.

The fastest wrong answer is still wrong. If you’re not sure whether escapes exist, pick the state machine: it’s short enough that you won’t regret it, and it stays correct as your inputs evolve.

Key takeaways and what to do next

If you only remember one thing, make it this: extracting text between quotation marks is easy until it isn’t. The moment escapes, malformed input, or structured formats show up, a one-liner can start returning believable garbage.

My default workflow looks like this:

  • Decide the contract: quote type(s), escape rules, strict vs lenient.
  • If it’s a known format (CSV/JSON/shell tokens), parse the format.
  • If it’s plain text but escapes are possible, use a state machine.
  • Add tests for the edge cases you care about (empty values, escaped quotes, unbalanced quotes).

If you tell me what your inputs actually look like (a sample line or two, plus whether escapes are allowed), I can recommend the smallest implementation that’s still correct—and adjust the parser to match your exact rules.
