fix: crash recovery fails with "Invalid docid page magic" (#291) by tjgreen42 · Pull Request #292 · timescale/pg_textsearch

tjgreen42 · 2026-03-19T04:41:23Z

Summary

Fix write-ordering bug in docid page chain that caused ERROR: Invalid docid page magic after a crash
Flush new docid pages to disk before flushing the pointers that reference them
Downgrade zero-magic page (unflushed crash artifact) to WARNING with graceful chain truncation; keep ERROR for non-zero corruption
On truncation, overwrite next_page on the last valid page instead of clearing the entire chain, preserving valid docid pages for a potential second crash

Root cause

When the docid page chain extended to a new page, the code flushed the chain pointer (on the old page or metapage) before flushing the new page's content. After a crash in that window, the pointer was on disk but the target page contained all zeros. Recovery walked the chain, hit the zero-magic page, and threw ERROR.

The fix ensures FlushOneBuffer is called on new pages before flushing any pointer that references them, in both code paths (first docid page creation and chain extension).
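The ordering argument above can be checked with a small standalone simulation. This is only a sketch and not the code in src/state/state.c: the `Store` type, `extend_chain`, `chain_is_consistent`, the magic value, and the use of `0xFFFFFFFF` as the end-of-chain marker are all assumptions made for illustration. A "flush" copies a cached page to a fake disk, and a crash discards the cache.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPAGES        4
#define DOCID_MAGIC   0x44434944u  /* hypothetical magic value */
#define INVALID_BLOCK 0xFFFFFFFFu  /* assumed end-of-chain sentinel */

typedef struct {
    uint32_t magic;
    uint32_t next_page;
} PageHeader;

typedef struct {
    PageHeader cache[NPAGES];  /* in-memory buffers, lost on crash */
    PageHeader disk[NPAGES];   /* contents that survive a crash */
} Store;

static void flush_page(Store *s, uint32_t blk)
{
    s->disk[blk] = s->cache[blk];
}

/* Build a store whose chain is a single valid, flushed page at block 1. */
static Store make_store(void)
{
    Store s;
    memset(&s, 0, sizeof s);
    s.cache[1].magic = DOCID_MAGIC;
    s.cache[1].next_page = INVALID_BLOCK;
    flush_page(&s, 1);
    return s;
}

/*
 * Link new_blk after old_blk, then crash after `flushes_before_crash`
 * flushes. safe_order != 0 flushes the new page first (the fixed order);
 * safe_order == 0 flushes the pointer first (the buggy order).
 */
static void extend_chain(Store *s, uint32_t old_blk, uint32_t new_blk,
                         int safe_order, int flushes_before_crash)
{
    uint32_t order[2];

    s->cache[new_blk].magic = DOCID_MAGIC;
    s->cache[new_blk].next_page = INVALID_BLOCK;
    s->cache[old_blk].next_page = new_blk;

    order[0] = safe_order ? new_blk : old_blk;
    order[1] = safe_order ? old_blk : new_blk;
    for (int i = 0; i < 2 && i < flushes_before_crash; i++)
        flush_page(s, order[i]);

    /* Simulated crash: the cache is rebuilt from what reached disk. */
    memcpy(s->cache, s->disk, sizeof s->disk);
}

/* Returns 1 if every on-disk pointer reachable from `first` targets a
 * page with valid magic and the chain terminates. */
static int chain_is_consistent(const Store *s, uint32_t first)
{
    uint32_t blk = first;

    for (int steps = 0; blk != INVALID_BLOCK && steps < NPAGES; steps++) {
        if (blk >= NPAGES || s->disk[blk].magic != DOCID_MAGIC)
            return 0;
        blk = s->disk[blk].next_page;
    }
    return blk == INVALID_BLOCK;
}
```

With `safe_order` set, a crash after zero, one, or two flushes leaves a chain that either omits the new page or includes it fully written. With the pointer-first order, a crash after one flush leaves block 1 pointing at an all-zero block 2.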

Testing

New shell test test/scripts/docid_chain_recovery.sh: inserts 2000 rows (enough to fill more than one docid page), crashes Postgres, and verifies that recovery succeeds
Test fails without the fix, reproducing the exact error from issue #291 ("BM25 index on TimescaleDB hypertable: Invalid docid page magic on chunk scans"), and passes with it
Runs in all 4 CI jobs (test PG17, test PG18, PG17 sanitizer, PG18 sanitizer)

Closes #291

When the docid page chain extended to a new page, the chain pointer was flushed to disk before the new page itself. A crash in that window left the pointer referencing an all-zero block, causing recovery to fail with ERROR. Fix: flush new docid pages before their referencing pointers, in both the first-page and chain-extension code paths. Also downgrade the magic-check failure from ERROR to WARNING so that indexes with existing corruption can partially recover instead of failing entirely.

The CI sanitizer job uses a non-default socket directory, so createdb/psql connections fail. Explicitly configure unix_socket_directories and pass -h to all client commands.

When recovery encounters an unflushed (all-zero) docid page and truncates the chain, it must also clear first_docid_page in the metapage. Otherwise, subsequent inserts walk the stale chain pointer into the zeroed page, read next_page=0 (the metapage block), and deadlock trying to lock the metapage they already hold exclusively.
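To illustrate why a zeroed page is hazardous to walk: its `next_page` field reads as 0, which is a real block number (the metapage) rather than an end-of-chain marker. The sketch below is hypothetical. `DocidHeader`, `chain_step`, the magic value, and the `0xFFFFFFFF` sentinel are assumptions for illustration and are not taken from the extension's source. It shows a walker that checks the magic before trusting `next_page`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define METAPAGE_BLOCK 0u           /* block 0 is the metapage */
#define INVALID_BLOCK  0xFFFFFFFFu  /* assumed end-of-chain sentinel */
#define DOCID_MAGIC    0x44434944u  /* hypothetical magic value */

typedef struct {
    uint32_t magic;
    uint32_t next_page;
} DocidHeader;

typedef enum { STEP_FOLLOW, STEP_END, STEP_INVALID } ChainStep;

/* What a never-flushed page looks like after a crash: all zero bytes. */
static DocidHeader zeroed_header(void)
{
    DocidHeader h;
    memset(&h, 0, sizeof h);
    return h;
}

/*
 * Decide how to continue from page header `h`. *next is written only
 * when the page is valid and has a successor, so a zeroed page can
 * never send the caller to block 0.
 */
static ChainStep chain_step(const DocidHeader *h, uint32_t *next)
{
    if (h->magic != DOCID_MAGIC)
        return STEP_INVALID;
    if (h->next_page == INVALID_BLOCK)
        return STEP_END;
    *next = h->next_page;
    return STEP_FOLLOW;
}
```

A caller that follows `next_page` without this check would treat the zeroed page's 0 as "go to the metapage", which is the lock it may already hold.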

The comments said "docid page" as if there's a single page, but it's a linked chain of pages. Reword to describe what's actually happening: flushing a page before flushing the pointer that references it.

src/state/state.c

test/scripts/docid_chain_recovery.sh

- Save docid_header->magic to a local before UnlockReleaseBuffer to avoid use-after-release of buffer memory
- On chain truncation, overwrite next_page on the last valid page instead of clearing first_docid_page entirely, so a second crash before memtable spill doesn't lose valid docid pages
- Change test port from 55435 to 55437 to avoid collision with recovery.sh
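The recovery behaviour described in this PR can be sketched as a standalone chain walk. This is a simplified model and not the code in src/state/state.c: pages are a plain array, `recover_chain` and `WalkResult` are hypothetical names, the magic value and the `0xFFFFFFFF` sentinel are assumed, and clearing `*first` when no valid page precedes the zeroed one is an assumption about a case the PR text does not spell out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define NPAGES        8
#define DOCID_MAGIC   0x44434944u  /* hypothetical magic value */
#define INVALID_BLOCK 0xFFFFFFFFu  /* assumed end-of-chain sentinel */

typedef struct {
    uint32_t magic;
    uint32_t next_page;
} DocidHeader;

typedef enum { WALK_OK, WALK_TRUNCATED, WALK_CORRUPT } WalkResult;

/*
 * Walk the chain starting at *first and count the valid pages.
 *   zero magic     -> warn, cut the chain after the last valid page
 *   non-zero wrong -> report an error, leave the chain untouched
 */
static WalkResult recover_chain(DocidHeader *pages, uint32_t *first,
                                int *valid_pages)
{
    uint32_t blk = *first;
    uint32_t last_valid = INVALID_BLOCK;

    *valid_pages = 0;
    while (blk != INVALID_BLOCK) {
        if (blk >= NPAGES || *valid_pages >= NPAGES)
            return WALK_CORRUPT;  /* out of range or cyclic chain */

        /* Copy the fields to locals first; in the real code the buffer
         * must not be read after it has been released. */
        uint32_t magic = pages[blk].magic;
        uint32_t next = pages[blk].next_page;

        if (magic == 0) {
            fprintf(stderr, "WARNING: unflushed docid page at block %u\n",
                    (unsigned) blk);
            if (last_valid == INVALID_BLOCK)
                *first = INVALID_BLOCK;  /* assumption: no page to keep */
            else
                pages[last_valid].next_page = INVALID_BLOCK;
            return WALK_TRUNCATED;
        }
        if (magic != DOCID_MAGIC) {
            fprintf(stderr, "ERROR: Invalid docid page magic\n");
            return WALK_CORRUPT;
        }

        last_valid = blk;
        (*valid_pages)++;
        blk = next;
    }
    return WALK_OK;
}
```

Because truncation only rewrites `next_page` on the last valid page, a second walk over the same pages sees a shorter but well-formed chain, which matches the goal of keeping valid docid pages across a second crash.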

creatorrr mentioned this pull request Mar 20, 2026

fix: attnum drift causes index mismatch with inheritance/hypertables (#288) #289

Merged

tjgreen42 mentioned this pull request Mar 21, 2026

Adopt GenericXLog for WAL-based crash atomicity and replication #294

Open

tjgreen42 added 3 commits March 23, 2026 14:52

fix: set unix_socket_directories in docid chain recovery test

d678fc9

tjgreen42 force-pushed the fix/docid-flush-ordering branch from 74543bd to 93cc84d Compare March 23, 2026 21:52

fix: clean up misleading comments on docid chain flush ordering

1b27454

tjgreen42 force-pushed the fix/docid-flush-ordering branch from 93cc84d to 1b27454 Compare March 23, 2026 21:54

tjgreen42 marked this pull request as ready for review March 23, 2026 22:09

claude bot reviewed Mar 23, 2026

src/state/state.c (2 resolved comments)

test/scripts/docid_chain_recovery.sh (1 resolved comment)

tjgreen42 added 2 commits March 23, 2026 16:02

Merge branch 'main' into fix/docid-flush-ordering

2244af8

tjgreen42 merged commit 76ea737 into main Mar 23, 2026
14 checks passed

tjgreen42 deleted the fix/docid-flush-ordering branch March 23, 2026 23:43

tjgreen42 merged 6 commits into main from fix/docid-flush-ordering
