PyFormatChecker — Clang AST plugin for CPython format-string auditing

A Clang plugin that parses every call to CPython's custom format functions (PyErr_Format, PyUnicode_FromFormat, etc.), type-checks each argument against its format specifier, and reports mismatches. Both standard C specs (%s, %d, %lu, …) and CPython-specific specs (%R, %S, %A, %U, %T, %N, %V) are understood.

Requirements

| Tool | Version |
|---|---|
| clang / clang++ | 21 |
| cmake | ≥ 3.15 |
| Python | ≥ 3.11 (for run_checker.py) |
| compile_commands.json | generated by bear -- make or cmake |

LLVM 21's CMake package files are expected under /usr/lib/llvm-21/lib/cmake/. Adjust the HINTS paths in CMakeLists.txt if your installation differs.

Build

# From the CPython root
cd Tools/py-format-checker
mkdir -p build && cd build
cmake ..
make -j$(nproc)

This produces build/PyFormatChecker.so.

If you have not yet generated compile_commands.json for the CPython tree:

# In the CPython root (after ./configure)
sudo apt install bear     # Bear ships through system package managers, not pip
bear -- make -j$(nproc)

Usage

# From the CPython root
python3 Tools/py-format-checker/run_checker.py [output_file] [--jobs N]

run_checker.py replays every entry in compile_commands.json through clang-21 with the plugin loaded, in parallel. Results are written to Tools/py-format-checker/reports/py_format_report.txt by default, with the target triple appended to the file name when --target is given, or to the path you supply as the first argument.
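As a rough illustration of the replay step, the sketch below turns one compile_commands.json entry into a plugin-enabled Clang command line. It is not the code in run_checker.py: the helper name build_plugin_command, the use of -fsyntax-only and -fplugin=, and the choice to drop -c and -o are all assumptions made for the example. Whether -fplugin= alone is enough to run the check depends on how the plugin registers its action.

```python
import shlex


def build_plugin_command(entry, plugin_path, clang="clang-21", target=None):
    """Turn one compilation-database entry into a plugin-enabled command.

    Hypothetical helper for illustration; run_checker.py may differ.
    """
    # A database entry stores either "arguments" (a list) or
    # "command" (a single shell-quoted string).
    if "arguments" in entry:
        args = list(entry["arguments"])
    else:
        args = shlex.split(entry["command"])

    cmd = [clang]
    skip_next = False
    for arg in args[1:]:  # args[0] is the original compiler
        if skip_next:
            skip_next = False
            continue
        if arg == "-o":  # drop "-o <file>": no object file is written
            skip_next = True
            continue
        if arg == "-c":
            continue
        cmd.append(arg)

    # Assumed flags: parse only, and load the plugin shared object.
    cmd += ["-fsyntax-only", f"-fplugin={plugin_path}"]
    if target:
        cmd.append(f"--target={target}")
    return cmd
```

Each resulting list could then be handed to subprocess.run with cwd set to the entry's "directory" field, one worker per entry.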

Options

| Flag | Default | Description |
|---|---|---|
| output_file | reports/py_format_report[_<target>].txt | Output file (auto-named by target) |
| --jobs N | cpu_count | Parallel workers |
| --plugin PATH | build/PyFormatChecker.so | Override plugin path |
| --db PATH | compile_commands.json | Override compilation database |
| --target TRIPLE | host | Clang target triple for type-size checks |
| --verbose | | Print all call-sites (sets PY_FMT_ERROR_ONLY=0) |

Environment variables

| Variable | Default | Description |
|---|---|---|
| PY_FMT_ERROR_ONLY | 1 | Set to 0 to print all call-sites, not just those with at least one mismatch |
| PY_FMT_INTEGRAL_CHECK_MODE | standard | Integer width/sign checking. off: accept any integer; standard: bit-width must match (C99, signedness ignored); full: both bit-width and signedness must match |
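The three PY_FMT_INTEGRAL_CHECK_MODE settings can be modelled as a small predicate. This is only a sketch of the rules as described above, not the plugin's C++ code; the function name and its parameters are made up for the example.

```python
def integers_compatible(mode, got_bits, got_signed, want_bits, want_signed):
    """Model of the integer check modes described in this README.

    Hypothetical helper: the real check lives in the C++ plugin.
    """
    if mode == "off":
        return True  # any integer argument is accepted
    if mode == "standard":
        return got_bits == want_bits  # width only, signedness ignored
    if mode == "full":
        return got_bits == want_bits and got_signed == want_signed
    raise ValueError(f"unknown mode: {mode!r}")
```

Under this model, a 32-bit unsigned int passed to %d is accepted in standard mode but rejected in full mode, while a 64-bit long passed to %d is rejected in both and accepted only when the mode is off.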

Possible per-argument statuses

| Status | Meaning |
|---|---|
| ok | Type matches the spec |
| MISMATCH got=X want=<sentinel> | Wrong type |
| MISSING_ARG want=<sentinel> | Fewer arguments than format specs |
| SURPLUS N arg(s) | More arguments than format specs |
| UNKNOWN_SPEC | Unrecognized/unsupported format spec (e.g. %y) |

<sentinel> is either a standard C type (e.g. long, const char *) or a special placeholder like <PyObject*> or <any-int> that the plugin understands and checks for. See the source for the full list of supported sentinels and their checks.
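To make the statuses concrete, here is a minimal sketch of how a list of expected sentinels and a list of actual argument types could be folded into per-argument statuses. It uses plain string equality as a stand-in for the plugin's real type checks, and the function name check_call is hypothetical; the exact layout of lines in the report may differ.

```python
def check_call(wanted, got):
    """Compare expected sentinels with actual argument types.

    wanted: one sentinel per argument the format string consumes.
    got:    the C type names of the arguments actually passed.
    String equality stands in for the plugin's real compatibility rules.
    """
    statuses = []
    for i, want in enumerate(wanted):
        if i >= len(got):
            statuses.append(f"MISSING_ARG want={want}")
        elif got[i] == want:
            statuses.append("ok")
        else:
            statuses.append(f"MISMATCH got={got[i]} want={want}")
    surplus = len(got) - len(wanted)
    if surplus > 0:
        statuses.append(f"SURPLUS {surplus} arg(s)")
    return statuses
```

For a format string such as "%d %s" called with a single int, this yields ok followed by MISSING_ARG want=const char *.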

hint suggestion

When a mismatch involves an integer conversion spec (%d, %i, %u, %o, %x, %X) the plugin emits a hint: line showing the corrected format string with the proper length modifier (e.g. %d → %zd for a Py_ssize_t argument). In full mode the conversion character is also corrected for signedness (e.g. %d → %u for an unsigned type).
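The rewrite behind a hint can be sketched for a single spec as follows. The type table, the regular expression, and the name suggest_spec are assumptions for illustration; the plugin derives the modifier from Clang's type information rather than from type names, and it rewrites the whole format string rather than one spec.

```python
import re

# Hypothetical table: C type name -> (length modifier, is_unsigned).
LENGTH_FOR_TYPE = {
    "int": ("", False),
    "unsigned int": ("", True),
    "long": ("l", False),
    "unsigned long": ("l", True),
    "long long": ("ll", False),
    "unsigned long long": ("ll", True),
    "Py_ssize_t": ("z", False),
    "size_t": ("z", True),
}

# flags, width, precision, length modifier, integer conversion character
_SPEC = re.compile(
    r"%(?P<flags>[-+ #0]*)(?P<width>\d+|\*)?(?P<prec>\.(?:\d+|\*))?"
    r"(?P<length>hh|h|ll|l|z|j|t)?(?P<conv>[diuoxX])"
)


def suggest_spec(spec, arg_type, full=False):
    """Return a corrected integer spec for arg_type, or None if not applicable."""
    m = _SPEC.fullmatch(spec)
    if m is None or arg_type not in LENGTH_FOR_TYPE:
        return None
    length, unsigned = LENGTH_FOR_TYPE[arg_type]
    conv = m.group("conv")
    if full:
        # Also fix signedness of the decimal conversions.
        if unsigned and conv in ("d", "i"):
            conv = "u"
        elif not unsigned and conv == "u":
            conv = "d"
    return ("%" + m.group("flags") + (m.group("width") or "")
            + (m.group("prec") or "") + length + conv)
```

With this sketch, suggest_spec("%d", "Py_ssize_t") gives "%zd", and suggest_spec("%d", "unsigned int", full=True) gives "%u", matching the two examples above.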

Cross-checking 32-bit targets

Passing --target changes how Clang resolves type sizes without requiring the plugin itself to be cross-compiled. This is useful for catching bugs (such as using %ld for an off_t that is 64-bit even on 32-bit Linux when _FILE_OFFSET_BITS=64 is set) that are invisible on a 64-bit host.
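The off_t example comes down to a width comparison that differs per target. The sketch below hard-codes the relevant bit widths for two Linux targets, with off_t assuming _FILE_OFFSET_BITS=64; the table and function name are illustrative only, since the plugin gets these sizes from Clang.

```python
# Bit widths on two Linux targets; off_t assumes _FILE_OFFSET_BITS=64.
WIDTHS = {
    "x86_64-linux-gnu": {"long": 64, "off_t": 64},
    "i686-linux-gnu": {"long": 32, "off_t": 64},
}


def ld_matches_off_t(target):
    """True if an off_t argument has the width that %ld (a long) expects."""
    widths = WIDTHS[target]
    return widths["off_t"] == widths["long"]
```

On the 64-bit host the widths agree, so the call looks fine; on i686 they do not, which is the mismatch that only a --target run reports.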

Install 32-bit headers (Debian/Ubuntu)

sudo apt install gcc-multilib libc6-dev-i386

Without these, Clang cannot resolve system typedefs (int64_t, size_t, …) for the i686 target and the checker will report spurious <dependent type> mismatches instead of real type names.

Run

python3 Tools/py-format-checker/run_checker.py \
    --target=i686-linux-gnu --jobs "$(nproc)"
# Output: Tools/py-format-checker/reports/py_format_report_i686-linux-gnu.txt

Compare 64-bit and 32-bit results:

python3 Tools/py-format-checker/run_checker.py --jobs "$(nproc)"
# Output: Tools/py-format-checker/reports/py_format_report.txt  (host/64-bit)

diff Tools/py-format-checker/reports/py_format_report.txt \
     Tools/py-format-checker/reports/py_format_report_i686-linux-gnu.txt

Supported format functions

The following functions are recognised. Static/file-local helpers include a filename filter so calls from unrelated translation units are not matched.

| Function | Format arg | File constraint |
|---|---|---|
| _PyErr_FormatNote | 0 | |
| PyUnicode_FromFormat | 0 | |
| PySys_FormatStdout / PySys_FormatStderr | 0 | |
| PyErr_Format | 1 | |
| _PyErr_FormatFromCause | 1 | |
| _Py_FatalErrorFormat | 1 | |
| PyUnicodeWriter_Format | 1 | |
| PyBytesWriter_Format | 1 | |
| _PyXIData_FormatNotShareableError | 1 | |
| _abiinfo_raise | 1 | modsupport.c |
| _PyTokenizer_syntaxerror | 1 | |
| _PyErr_Format | 2 | |
| _PyErr_FormatFromCauseTstate | 2 | |
| PyErr_WarnFormat | 2 | |
| PyErr_ResourceWarning | 2 | |
| _PyCompile_Error / _PyCompile_Warn | 2 | |
| _PyTokenizer_parser_warn | 2 | |
| task_set_error_soon | 3 | _asynciomodule.c |
| format_notshareableerror | 3 | crossinterp |
| _PyTokenizer_syntaxerror_known_range | 3 | |
| PyErr_WarnExplicitFormat | 5 | |
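The lookup that this table implies, a function name plus an optional filename substring, can be modelled in a few lines. This is a Python sketch of the idea only; the dictionary holds a handful of rows from the table, and format_arg_index is a hypothetical name, not a function in the plugin.

```python
# name -> (0-based index of the format argument, filename substring or None)
FORMAT_FUNCS = {
    "PyUnicode_FromFormat": (0, None),
    "PyErr_Format": (1, None),
    "_abiinfo_raise": (1, "modsupport.c"),
    "task_set_error_soon": (3, "_asynciomodule.c"),
    "format_notshareableerror": (3, "crossinterp"),
}


def format_arg_index(func_name, filename):
    """Return the format-argument index for a call, or None if not matched."""
    entry = FORMAT_FUNCS.get(func_name)
    if entry is None:
        return None
    index, file_filter = entry
    # File-local helpers only match inside the translation unit they live in.
    if file_filter is not None and file_filter not in filename:
        return None
    return index
```

A call to task_set_error_soon in Modules/_asynciomodule.c is matched with index 3, while a same-named function in any other file is ignored.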

Supported CPython-specific specs

| Spec | Expected C type |
|---|---|
| %R, %S, %A, %U, %T, %#T | PyObject * (any Py*-typed pointer) |
| %N, %#N | PyTypeObject * |
| %V | PyObject * + const char * (two arguments) |
| %lV | PyObject * + const wchar_t * |
| %ls | const wchar_t * |

Standard specs (%s, %d, %u, %ld, %zd, %p, %x, %o, …) and * width/precision arguments are also supported.

PyObject compatibility check

An argument satisfies <PyObject*> if any of the following hold:

  • Its pointee typedef name starts with Py or _Py (covers all public API types before canonical unwrapping).
  • Its pointee struct's first field is named ob_base (structural PyObject_HEAD / PyObject_VAR_HEAD check — covers internal types such as TaskObj, FutureObj, buffered, ElementObject, etc. without maintaining an explicit name list).
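The two rules above amount to a short predicate. The sketch below models a pointer argument by two strings, the pointee's typedef name and the name of the pointee struct's first field; both parameters and the function name are assumptions for illustration, since the plugin reads this information from the Clang AST.

```python
def satisfies_pyobject_ptr(pointee_typedef, first_field):
    """Model of the <PyObject*> compatibility rules listed above.

    pointee_typedef: typedef name of the pointee, or None if there is none.
    first_field:     name of the pointee struct's first field, or None.
    """
    # Rule 1: public API types are recognised by their Py / _Py prefix.
    if pointee_typedef and pointee_typedef.startswith(("Py", "_Py")):
        return True
    # Rule 2: structural check for PyObject_HEAD / PyObject_VAR_HEAD.
    return first_field == "ob_base"
```

Under this model PyListObject * passes by name, TaskObj * passes because its first field is ob_base, and an unrelated pointer such as FILE * is rejected.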

PyTypeObject* compatibility check

An argument satisfies <PyTypeObject*> if any of the following hold:

  • Its pointee typedef name is PyTypeObject (covers all public API types before canonical unwrapping).
  • Its pointee struct's name is _typeobject.

Enum type handling

C enum types have an implementation-defined underlying integer type. The plugin resolves any enum argument to its compiler-chosen underlying integer type before performing width and signedness checks, so e.g. an enum backed by unsigned long is correctly matched against %lu and not %u. Incomplete enums (no underlying type yet) are accepted to avoid false positives.

Adding a new format function

Edit the kFormatFuncs map near the top of py_format_checker.cpp:

{"my_format_helper", {2, "mymodule.c"}},
//                    ^   ^
//                    |   optional filename substring; nullptr = any file
//                    0-based index of the format-string argument

If the function takes a va_list instead of ..., add its name to kVaListFuncs as well (format string is still parsed, but individual argument types cannot be checked).

Rebuild the plugin after any source change:

cd Tools/py-format-checker/build && make -j$(nproc)