fix: crash recovery fails with "Invalid docid page magic" (#291)#292
Merged
fix: crash recovery fails with "Invalid docid page magic" (#291)#292
Conversation
When the docid page chain extended to a new page, the chain pointer was flushed to disk before the new page itself. A crash in that window left the pointer referencing an all-zero block, causing recovery to fail with ERROR. Fix: flush new docid pages before their referencing pointers, in both the first-page and chain-extension code paths. Also downgrade the magic-check failure from ERROR to WARNING so that indexes with existing corruption can partially recover instead of failing entirely.
The CI sanitizer job uses a non-default socket directory, so createdb/psql connections fail. Explicitly configure unix_socket_directories and pass -h to all client commands.
When recovery encounters an unflushed (all-zero) docid page and truncates the chain, it must also clear first_docid_page in the metapage. Otherwise, subsequent inserts walk the stale chain pointer into the zeroed page, read next_page=0 (the metapage block), and deadlock trying to lock the metapage they already hold exclusively.
74543bd to
93cc84d
Compare
The comments said "docid page" as if there's a single page, but it's a linked chain of pages. Reword to describe what's actually happening: flushing a page before flushing the pointer that references it.
93cc84d to
1b27454
Compare
- Save docid_header->magic to a local before UnlockReleaseBuffer to avoid use-after-release of buffer memory - On chain truncation, overwrite next_page on the last valid page instead of clearing first_docid_page entirely, so a second crash before memtable spill doesn't lose valid docid pages - Change test port from 55435 to 55437 to avoid collision with recovery.sh
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
ERROR: Invalid docid page magicafter a crashnext_pageon the last valid page instead of clearing the entire chain, preserving valid docid pages for a potential second crashRoot cause
When the docid page chain extended to a new page, the code flushed the chain pointer (on the old page or metapage) before flushing the new page's content. After a crash in that window, the pointer was on disk but the target page contained all zeros. Recovery walked the chain, hit the zero-magic page, and threw ERROR.
The fix ensures
FlushOneBufferis called on new pages before flushing any pointer that references them, in both code paths (first docid page creation and chain extension).Testing
test/scripts/docid_chain_recovery.sh— inserts 2000 rows (fills >1 docid page), crashes postgres, verifies recovery succeedsCloses #291