I’ve seen this happen on almost every Python team I’ve worked with: a helper function starts simple, everyone understands it, and the docstring example is accurate on day one. Three months later, the implementation changes, the docs stay frozen, and now your example lies to the next developer who reads it. That mismatch is small at first, but it multiplies quickly across utilities, data parsers, and API wrappers.
doctest is one of my favorite ways to stop that drift. It turns examples in your docstrings into executable tests. You get living documentation and a test signal from the same lines of code. If the example says `convert_celsius_to_fahrenheit(0)` returns 32.0, Python can verify that claim automatically.
If you’re building libraries, backend services, automation scripts, or data tooling, this matters. You want docs that teach and tests that catch regressions. doctest gives you both with almost no setup cost. I’ll walk you through how it works, how I use it in modern Python workflows (including 2026 AI-assisted development), where it shines, where it hurts, and how to avoid the mistakes that cause flaky or misleading checks.
What doctest actually does under the hood
At a high level, doctest scans docstrings for interactive Python session patterns:
- Input lines begin with `>>>`
- Continuation lines begin with `...`
- Expected output appears on the following lines
Then it runs those snippets and compares actual output with expected output.
I like to explain it as a built-in truth detector for documentation examples. If your docstring claims a behavior, doctest asks Python to prove it.
Here’s the mental model I use:
- doctest finds examples in function, class, module, or text docs.
- It executes the input exactly as shown.
- It captures printed output and expression results.
- It compares the result to what your docstring says should happen.
- It reports pass/fail with a diff-like message.
That means doctest is strongest when your examples are deterministic and short. It is not meant to replace every kind of test. It is excellent for behavior that can be shown in a few lines and verified by textual output.
A detail that helps in real projects: each doctest block runs in a shared namespace for that docstring, not your full application runtime. That is useful because examples can build on previous lines, but it can also hide accidental coupling. If example B silently depends on state created in example A, reordering examples can break tests. I try to make each example self-contained unless sequence is part of what I’m teaching.
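To make that concrete, here's a minimal sketch with a hypothetical `tally` helper whose examples deliberately build on each other. The later lines reuse `counts` bound by the first, so reordering the examples breaks the doctest:

```python
def tally(items):
    """
    Count occurrences of each item.

    These examples share one namespace: the second and third lines
    reuse `counts` from the first, so their order matters.

    >>> counts = tally(["a", "b", "a"])
    >>> counts["a"]
    2
    >>> counts["b"]
    1
    """
    result = {}
    for item in items:
        # dict.get with a default avoids a KeyError the first time we see an item
        result[item] = result.get(item, 0) + 1
    return result
```

If each lookup were meant to stand alone, I would repeat the `tally(...)` call in every example instead of sharing the variable.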
Another under-the-hood point: doctest compares text, not Python objects. That is both its superpower and its weakness. It catches user-visible output drift immediately, but formatting changes can fail even when the underlying behavior is correct. Later sections cover practical ways to reduce that brittleness.
Why I choose doctest for specific problems
I don’t use doctest everywhere. I use it where it gives the highest signal for the lowest effort.
Where it works really well
- Small pure functions (`format_currency`, `slugify_title`, `is_valid_postal_code`)
- Parsing and normalization helpers
- Data transformation logic where input/output is easy to show
- Public library APIs where docs and behavior must stay aligned
- Teaching-oriented codebases where examples are part of onboarding
- Utility functions that people copy from docs into production code
Where I usually avoid it
- Heavy async workflows with multiple external systems
- Non-deterministic behavior (timestamps, random values, process IDs)
- Complex object comparisons that need custom assertions
- Large integration scenarios with fixtures and environment setup
- Anything where setup noise would make examples hard to read
Why this still matters in 2026
AI coding assistants can produce lots of examples quickly, but generated examples are not always correct. I treat AI-written doc examples as drafts. doctest gives me a fast verification loop so those examples cannot silently rot.
In practice, this has saved teams from shipping stale docs after refactors. The pattern is common: assistant suggests a valid example for v1 of an API, then a teammate changes defaults or return type in v2. Without executable docs, old examples survive forever. With doctest in CI, drift becomes a hard failure instead of a hidden trap.
If you’re deciding between writing a quick doc example or a quick test, doctest lets you do both at once. That speed is the main reason I keep it in my toolkit.
Your first runnable example (and how to run it)
Let’s start with a simple factorial example and make it fully runnable.
```python
# math_tools.py
from doctest import testmod

def factorial(n: int) -> int:
    """
    Return factorial for non-negative integers.

    >>> factorial(0)
    1
    >>> factorial(3)
    6
    >>> factorial(5)
    120
    """
    if n < 0:
        raise ValueError("n must be >= 0")
    if n <= 1:
        return 1
    return n * factorial(n - 1)

if __name__ == "__main__":
    testmod(verbose=True)
```
Run it:
```
python math_tools.py
```
What you’ll see:
- Each example is tried in order
- Expected output is compared
- A summary shows total tests and failures
I recommend verbose=True while learning or debugging. For CI, I usually keep output quieter unless a failure appears.
Testing a module without editing `__main__`
You can also run doctests from the command line:
```
python -m doctest -v math_tools.py
```
This is my preferred style when I don’t want execution logic in the module body.
Testing doc files, not just docstrings
doctest can also validate examples in plain text docs (as long as examples are written in doctest format).
```
python -m doctest -v docs/usage_examples.txt
```
This is useful for SDK-style repositories where usage guides are critical and should not break quietly.
I also use this for migration guides. If you maintain old and new API styles side by side, docs often become the most fragile part of the release. Running doctests against guide files gives you a safety net that normal unit tests rarely provide.
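As an illustration of the mechanics, this sketch writes a tiny doctest-formatted guide to a temporary file and verifies it with `doctest.testfile`, the same machinery `python -m doctest` uses. The guide content is invented for the demo:

```python
import doctest
import os
import tempfile

# A doctest-formatted guide: examples use the same REPL format as docstrings.
GUIDE = """\
Working with slugs
==================

>>> "Hello World".lower().replace(" ", "-")
'hello-world'
"""

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "usage_examples.txt")
    with open(path, "w") as f:
        f.write(GUIDE)
    # module_relative=False lets testfile accept an absolute path
    results = doctest.testfile(path, module_relative=False)

print(results)  # TestResults(failed=0, attempted=1)
```

In a real repository you would skip the temporary file and point `testfile` (or the command line) at the checked-in guide.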
When logic goes wrong: reading failures quickly
Now let’s intentionally break factorial logic to see failure reporting.
```python
# broken_math_tools.py
from doctest import testmod

def factorial(n: int) -> int:
    """
    >>> factorial(3)
    6
    >>> factorial(5)
    120
    """
    if n <= 1:
        return 1
    # Bug: multiplication was accidentally removed
    return factorial(n - 1)

if __name__ == "__main__":
    testmod(verbose=True)
```
Typical failure output includes:
- Which file and line contains the failing example
- The exact example that failed
- Expected output
- Actual output
I treat this as a fast regression alarm. If a doc example fails, one of two things is true:
- Your implementation changed and docs are stale.
- Your docs are right and implementation is wrong.
Both are worth fixing before release.
My debugging workflow for doctest failures
When a doctest fails, I usually do this:
- Re-run with `-v` for context.
- Copy the failing `>>>` lines into a real REPL.
- Confirm whether docs or code should change.
- Fix one side only if behavior is intentional.
- Re-run doctest and your unit test suite.
This keeps examples trustworthy and avoids accidental behavior changes hidden behind updated docs.
One extra habit that helps: when the failure is from string formatting, I print repr(actual_value) in a temporary unit test. That makes hidden whitespace and newline differences obvious. Many painful doctest failures are just formatting surprises.
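Here's what that habit looks like in practice, with a hypothetical `render_row` helper that has a sneaky trailing space:

```python
def render_row(name, value):
    # Hypothetical helper with a subtle bug: a trailing space after the value.
    return f"{name}: {value} "

actual = render_row("total", 42)
# The plain string hides the problem; repr() makes it visible.
print(actual)        # looks fine at a glance: total: 42
print(repr(actual))  # 'total: 42 '  <- the trailing space is now obvious
```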
Advanced patterns you’ll actually need
Many teams stop at simple arithmetic examples. In production code, you need a few extra techniques to make doctest dependable.
1) Testing exceptions
```python
def divide(a: float, b: float) -> float:
    """
    >>> divide(6, 2)
    3.0
    >>> divide(1, 0)
    Traceback (most recent call last):
        ...
    ZeroDivisionError: division by zero
    """
    return a / b
```
Use the traceback format exactly as shown, and include `...` for the stack-trace lines that vary.
2) Handling floating-point output
Float rendering can produce tiny differences. I recommend rounding in examples when business logic allows it.
```python
def monthly_interest(balance: float, annual_rate: float) -> float:
    """
    >>> round(monthly_interest(1000, 0.06), 4)
    5.0
    """
    return balance * (annual_rate / 12)
```
If exact float text is unstable, don’t test the raw float string.
3) Using option flags
Some doctest flags reduce brittle comparisons:
- `ELLIPSIS` lets `...` match variable text.
- `NORMALIZE_WHITESPACE` ignores spacing differences.
- `IGNORE_EXCEPTION_DETAIL` compares exception type without exact message text.
Example:
```python
import doctest
from datetime import datetime, timezone

def render_user(user_id: int) -> str:
    """
    >>> render_user(42)  # doctest: +ELLIPSIS
    'User(id=42, created_at=...)'
    """
    return f"User(id={user_id}, created_at={datetime.now(timezone.utc).isoformat()})"

if __name__ == "__main__":
    doctest.testmod(optionflags=doctest.ELLIPSIS)
```
4) Skipping environment-specific examples
If an example needs a local dependency or OS-specific behavior, mark it:
```python
def read_system_secret() -> str:
    """
    >>> read_system_secret()  # doctest: +SKIP
    'prod-secret-value'
    """
    raise NotImplementedError
```
I use +SKIP sparingly. Too many skips reduce trust in the suite.
5) Controlling global state with setUp and tearDown
For module-level doctest execution, you can provide setup/cleanup helpers when examples need controlled context.
```python
# test_doctest_runner.py
import doctest

import my_module

def setup(test):
    test.globs["seed_value"] = 123

def teardown(test):
    test.globs.clear()

def load_tests(loader, tests, ignore):
    tests.addTests(doctest.DocTestSuite(my_module, setUp=setup, tearDown=teardown))
    return tests
```
This approach works well when integrating doctest with unittest runners.
6) Handling dictionaries and sets safely
Unordered data structures can produce unstable text representations. Python’s dict order is insertion-ordered now, but tests still break if creation order changes. For sets, order is inherently unstable.
I usually write examples like this:
```
>>> sorted(normalize_tags(['B', 'a', 'b']))
['a', 'b']
```
or this:
```
>>> result = make_lookup(['x', 'y'])
>>> sorted(result.items())
[('x', 1), ('y', 1)]
```
7) Multiline output examples
For multiline strings, newline formatting can be painful. I prefer one of two patterns:
- Compare with `print()` in the doctest so the formatting is explicit.
- Compare with `repr()` when invisible characters matter.
This makes failures easier to diagnose and keeps examples educational.
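A sketch of the first pattern, using a hypothetical `receipt_lines` helper: printing inside the example keeps the multiline layout literal instead of burying it in escaped newlines:

```python
def receipt_lines(items):
    """
    Render a tiny receipt, one line per item plus a total.

    >>> print(receipt_lines([("tea", 3), ("scone", 4)]))
    tea: 3
    scone: 4
    total: 7
    """
    lines = [f"{name}: {price}" for name, price in items]
    lines.append(f"total: {sum(price for _, price in items)}")
    return "\n".join(lines)
```

The `repr()` variant of the same check would evaluate `receipt_lines([("tea", 3)])` directly and expect `'tea: 3\ntotal: 3'`, which makes every invisible character explicit.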
How I write docstrings that teach and test at the same time
A good doctest is not just technically valid. It should be easy for a human to read and still robust as a test.
Here are the style rules I give teams:
- Start with the happy path first so readers get quick orientation.
- Add one edge case that reflects real production mistakes.
- Keep setup lines minimal; move heavy setup to regular tests.
- Use domain terms, not toy names (`invoice`, `sku`, `timezone`, not `foo`, `bar`).
- Show one assertion per idea.
- Avoid examples that require hidden context.
I also keep docstrings short enough that someone scanning code can understand function behavior in under 30 seconds. If a docstring needs 40 lines to explain setup, that usually means the example belongs in documentation files or integration tests, not inside the function.
A practical template I use:
- One-sentence behavior summary.
- Two happy-path examples.
- One failure or edge example.
- Optional notes section for caveats.
That gives good learning value without turning docstrings into mini test suites.
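Here's that template applied to a hypothetical `apply_discount` helper (the names and the 10% gold rate are invented for illustration):

```python
def apply_discount(invoice_total: float, customer_tier: str) -> float:
    """
    Apply a tier-based discount to an invoice total.

    >>> round(apply_discount(100.0, "gold"), 2)
    90.0
    >>> round(apply_discount(100.0, "standard"), 2)
    100.0
    >>> apply_discount(-5.0, "gold")
    Traceback (most recent call last):
        ...
    ValueError: invoice_total must be >= 0

    Note: unknown tiers fall back to no discount.
    """
    if invoice_total < 0:
        raise ValueError("invoice_total must be >= 0")
    rates = {"gold": 0.10}  # hypothetical discount table
    return invoice_total * (1 - rates.get(customer_tier, 0.0))
```

One behavior summary, two happy paths, one error case, one caveat: readable in seconds, and every claim is executable.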
Real project structure: combining doctest, pytest, and CI
In modern Python projects, I rarely run doctest alone. I combine it with pytest and regular unit tests.
My recommended test split
- doctest: contract examples and usage snippets
- pytest unit tests: branch-heavy logic and edge cases
- integration tests: network/database/service boundaries
doctest should be your first line of clarity, not your only line of defense.
Running doctests through pytest
pytest can collect doctests from docstrings and text files:
```
pytest --doctest-modules
pytest --doctest-glob='*.txt'
```
Typical pyproject.toml setup:
```toml
[tool.pytest.ini_options]
addopts = "-q --doctest-modules"
testpaths = ["src", "tests", "docs"]
```
This gives you one test command for everything, which I strongly recommend for team consistency.
CI pipeline pattern I use
A practical pipeline for Python libraries and services:
- Run static checks (type checks/lint).
- Run `pytest` including doctests.
- Build packages and docs only if tests pass.
- Publish only from protected branches.
The key idea: if docs examples fail, docs don’t ship.
Traditional vs modern workflow
Older workflow:
- Examples written manually, rarely verified
- Separate scripts per test type
- Trust by convention
- Manual bug reports from readers
- Docs drift over time

Modern workflow:
- One `pytest` command with doctest collection, so examples are verified on every run
The modern flow is simple: examples are executable artifacts, not decorative text.
Performance considerations in larger repositories
Teams often ask me whether doctest slows CI too much. In most codebases I’ve seen, the impact is acceptable when scoped well.
What I usually observe:
- Small modules with a few doctests add very little runtime.
- Large repositories with docs-heavy modules can add noticeable overhead.
- Full-doc runs are slower mainly because import time and setup dominate, not the individual comparisons.
Performance patterns that work:
- Run changed-module doctests locally in pre-commit or pre-push hooks.
- Run full doctest collection in CI on merge requests.
- Keep nightly jobs for full docs + slow integration tests.
- Avoid expensive imports in module top-level code used by doctests.
A common hidden cost is import side effects. If importing a module opens network connections, reads large files, or performs expensive configuration, doctest execution suffers. I push expensive setup behind function calls and keep module import cheap.
Limits of doctest (and how I work around them)
doctest is useful, but it has boundaries. Knowing those boundaries helps you avoid fragile tests.
1) Text-based comparison can be brittle
Because output is compared as text, formatting noise can fail tests.
Workaround:
- Normalize output when possible.
- Use flags like `NORMALIZE_WHITESPACE`.
- Keep examples focused on stable values.
2) Not ideal for complex setup
If a test needs many fixtures, services, and mocks, docstrings become unreadable.
Workaround:
- Keep docstring examples short and educational.
- Move heavy behavior checks to `pytest`.
3) Weak assertion language compared to test frameworks
doctest doesn’t give the full assertion power of pytest or unittest.
Workaround:
- Use doctest for input/output contracts.
- Use unit tests for internal invariants and branch coverage.
4) Harder with non-deterministic output
Random values, time-based output, and unordered structures can fail unpredictably.
Workaround:
- Inject deterministic seeds.
- Sort collections before display.
- Use ellipsis matching where reasonable.
5) Performance at scale
Large doc-heavy repositories can add noticeable time.
Workaround:
- Run doctest incrementally in local loops.
- Keep nightly/full runs for all docs.
- Prioritize critical modules in pre-merge checks.
I still find the trade-off favorable because early doc drift detection saves debugging time later.
Common mistakes I see teams make
These are the mistakes that cause most doctest frustration.
Mistake 1: Treating doctest as complete test coverage
If you rely only on doctest, you’ll miss edge cases and internal state checks.
Fix:
- Pair doctest with a proper unit/integration test suite.
Mistake 2: Writing unrealistic examples
Examples like tiny toy inputs don’t reflect production usage.
Fix:
- Use realistic values and domain terms (`invoice_total`, `sku_code`, `customer_tier`).
Mistake 3: Testing unstable representations
Examples that depend on memory addresses or timestamp text will fail often.
Fix:
- Compare stable values or apply matching flags.
Mistake 4: Ignoring failure output details
Some teams rerun tests without reading expected vs actual carefully.
Fix:
- Treat mismatch output as the source of truth and resolve intentionally.
Mistake 5: Updating expected output without intent
I’ve seen developers fix doctests by changing expected output to match a bug.
Fix:
- Decide behavior first, then update code or docs, not both blindly.
Mistake 6: Not running doctest in CI
If doctest runs only on local machines, drift still happens.
Fix:
- Make doctest part of mandatory CI checks.
Mistake 7: Overusing ellipsis
ELLIPSIS is useful, but overuse can hide real regressions.
Fix:
- Apply ellipsis to the unstable fragment only, not entire outputs.
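For example, with a hypothetical `audit_line` helper whose output embeds a timestamp, I pin every stable fragment and let `...` absorb only the part that varies. The block also runs its own doctests to show both forms pass:

```python
import doctest
from datetime import datetime, timezone

def audit_line(user: str) -> str:
    """
    Pin the stable parts; `...` absorbs only the timestamp:

    >>> audit_line("ada")  # doctest: +ELLIPSIS
    'login user=ada at=...'

    Matching the whole output also passes, but would keep passing
    even if the user were wrong, so avoid this form:

    >>> audit_line("ada")  # doctest: +ELLIPSIS
    '...'
    """
    return f"login user={user} at={datetime.now(timezone.utc).isoformat()}"

# Verify the examples above with the standard doctest machinery.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner()
for test in finder.find(audit_line, name="audit_line", module=False,
                        globs={"audit_line": audit_line}):
    runner.run(test)
print(runner.failures)  # 0
```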
Mistake 8: Cramming too many scenarios into one docstring
Long doctest blocks are hard to maintain and hard to debug.
Fix:
- Keep each function docstring focused on a few representative examples.
- Move the full matrix of edge cases to unit tests.
Edge-case cookbook I use in real codebases
When teams adopt doctest, they usually hit the same edge cases. Here is how I handle each one.
Time and dates
Problem: local timezone and formatting differences.
Approach:
- Convert to UTC in examples.
- Use fixed input datetimes.
- Compare formatted date-only outputs when time precision is irrelevant.
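A sketch of all three habits in one hypothetical helper: fixed aware input, conversion to UTC, and a date-only output:

```python
from datetime import datetime, timezone

def day_bucket(ts: datetime) -> str:
    """
    Bucket an aware datetime by its UTC calendar day.

    A fixed input keeps the example deterministic, and the date-only
    output sidesteps timezone rendering and sub-second precision:

    >>> day_bucket(datetime(2024, 3, 9, 23, 30, tzinfo=timezone.utc))
    '2024-03-09'
    """
    return ts.astimezone(timezone.utc).date().isoformat()
```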
Locale-sensitive formatting
Problem: number and currency output differs by locale.
Approach:
- Explicitly set locale in setup code when possible.
- Prefer locale-independent examples for doctest.
- Keep locale matrix tests in dedicated unit tests.
Randomized operations
Problem: output changes every run.
Approach:
- Seed randomness (`random.seed(0)`) in examples.
- Test shape and invariants instead of exact sequence where needed.
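A sketch of both ideas together, with a hypothetical `sample_skus` helper:

```python
import random

def sample_skus(skus, k):
    """
    Pick k distinct SKUs at random.

    Seeding first makes reruns reproducible; asserting invariants
    (length, membership) avoids pinning one exact sequence:

    >>> random.seed(0)
    >>> picked = sample_skus(["a", "b", "c", "d"], 2)
    >>> len(picked)
    2
    >>> all(sku in ["a", "b", "c", "d"] for sku in picked)
    True
    """
    return random.sample(skus, k)
```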
Platform differences
Problem: path separators and newline conventions differ.
Approach:
- Normalize with helper functions before comparison.
- Avoid hardcoding platform-specific path strings in doctests.
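For instance, building the displayed path with pathlib and comparing its POSIX form keeps a hypothetical `config_path` example identical on Windows and Unix:

```python
from pathlib import PurePath

def config_path(env: str) -> str:
    """
    Build the relative path to an environment's config file.

    >>> config_path("staging")
    'configs/staging/app.toml'
    """
    # as_posix() yields forward slashes even when PurePath resolves
    # to a Windows flavour on the running platform
    return PurePath("configs", env, "app.toml").as_posix()
```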
External APIs
Problem: network and third-party changes make doctests flaky.
Approach:
- Keep API calls out of doctests.
- Use local deterministic stubs in unit or integration tests.
If an example needs a real external dependency to make sense, I put it in documentation with a note and skip it from doctest collection.
Alternative approaches and when I pick them
doctest is valuable, but not always best. I choose based on objective.
The tools I pick from:
- doctest
- pytest unit tests
- integration tests
- property-based tests
- schema/contract tests
My rule of thumb:
- If I need to teach usage and verify it, I use doctest.
- If I need to prove correctness over many branches and inputs, I use unit or property-based tests.
- If I need infrastructure confidence, I use integration or end-to-end tests.
The strongest teams combine these, not replace one with another.
AI-assisted doctest workflow that actually works
Most teams now use AI for drafts. I do too. But I apply a strict loop so quality stays high.
- I ask the assistant for 3 to 5 realistic examples per public function.
- I paste only the best 1 to 2 examples into docstrings.
- I run doctest immediately.
- I rewrite any vague or flaky output.
- I add unit tests for uncovered branches.
What I never do: copy AI-generated examples directly into docs without execution.
Good prompt pattern I use with assistants:
- Ask for deterministic input/output examples.
- Ask for one error case.
- Ask to avoid current-time and random output.
- Ask to use domain terms from my codebase.
Then doctest becomes the verifier. This is where AI and doctest complement each other really well: AI helps speed, doctest enforces truth.
FAQ
1) Should I use doctest or pytest?
Use both. I recommend doctest for executable examples and API contracts, and pytest for deeper logic checks, fixtures, and complex assertions.
2) Can doctest test private helper functions?
Yes, if they have docstrings with examples and are discoverable by your test run strategy. I still focus most doctests on public behavior to keep docs useful.
3) Is doctest good for beginners?
Yes. It teaches input/output thinking and documentation clarity at the same time. It’s one of the most beginner-friendly testing tools in Python.
4) Can I run doctests from Markdown files?
You can run doctests from text-like files that contain doctest-formatted prompts. Many teams keep dedicated .txt docs for this purpose, or use pipelines that extract doctest examples from markdown content.
5) How do I test async code with doctest?
You can, but it gets awkward quickly. For most async workflows, I recommend pytest with async support and keep doctest focused on synchronous wrappers or small deterministic examples.
6) What’s the best team policy for doctest?
My policy is simple: every public utility function should have at least one executable example, and CI must run doctests. That gives a consistent baseline of trust in docs.
7) Can I mix doctest with type hints and static typing?
Absolutely. The combination is strong: type hints define interface expectations, while doctests prove behavior with concrete examples.
8) Do doctests replace API reference docs?
No. They make examples reliable, but they are not a full documentation strategy. I still write clear parameter docs, return semantics, and constraints.
9) How many doctests per function is enough?
Usually 1 to 3. One happy path, one edge case, and one error case is often the sweet spot.
10) Should I doctest everything in a mature legacy codebase?
No. Start with high-traffic public functions and modules that are often changed. Expand gradually.
What to do next in your own codebase
If you want fast gains this week, I’d start with three high-traffic utility modules and add 2 to 3 doctest examples per public function. Keep examples realistic, deterministic, and short enough to read in one screen. Then run them in CI through pytest --doctest-modules.
Here’s the rollout plan I use:
- Pick modules with frequent usage and high support burden.
- Add minimal doctests to the public functions only.
- Enable doctest collection in local `pytest` config.
- Fix failures before adding more examples.
- Add one CI gate so doctests must pass on every merge.
- Review failing doctests during code review like any other test regression.
I also suggest adding a short team guideline:
- Every new public helper gets at least one executable example.
- Every behavior-changing refactor updates both unit tests and doctests.
- Skips require a comment explaining why and when to remove.
That policy is lightweight, but it changes behavior quickly. Docs stop being an afterthought and become part of your test surface.
If you do just one thing, do this: make your examples executable and run them on every pull request. That single step dramatically reduces stale docs, onboarding confusion, and regression bugs caused by misunderstood utility behavior.
doctest is not flashy, but it is one of the highest-leverage tools in Python for keeping code and documentation in sync. In my experience, the teams that adopt it thoughtfully move faster because they argue less about intended behavior. The examples in the docs become the contract, and the contract is checked automatically.
That’s exactly the kind of boring reliability I want in production engineering.


