Add troubleshooting docs about data corruption #88760
DaveCTurner merged 7 commits into elastic:main from
Adds some docs giving more detailed background about what data corruption really means and some suggestions about how to narrow down the root cause.
Pinging @elastic/es-docs (Team:Docs)
Pinging @elastic/es-core-infra (Team:Core/Infra)
than {es} and look for data integrity errors. On Linux the `fio` and
`stress-ng` tools can both generate challenging I/O workloads and verify the
integrity of the data they write. You can check that durable writes persist
across power outages using a script such as
https://gist.github.com/bradfitz/3172656[`diskchecker.pl`]. Try different
Unclear if we should mention these tools like this. Maybe we should give example invocations for fio and stress-ng? diskchecker.pl looks janky but it's pretty much the best way to find fsync() bugs (it's in the PostgreSQL docs).
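As a rough illustration of the write-then-verify idea behind such tools, here is a minimal Python sketch. It is not an invocation of `fio` or `stress-ng` and does not reproduce their behaviour; the function names `write_blocks` and `verify_blocks` and the 4 KiB block size are assumptions made for this example.

```python
import hashlib
import os


def write_blocks(path, num_blocks, block_size=4096):
    """Write random blocks to `path`, fsync, and return each block's SHA-256 digest."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(num_blocks):
            block = os.urandom(block_size)
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage
    return digests


def verify_blocks(path, digests, block_size=4096):
    """Re-read the file and return the indices of blocks whose digest no longer matches."""
    bad = []
    with open(path, "rb") as f:
        for i, expected in enumerate(digests):
            block = f.read(block_size)
            if hashlib.sha256(block).hexdigest() != expected:
                bad.append(i)
    return bad
```

An immediate re-read is likely to be served from the filesystem cache rather than the device, so a sketch like this only exercises the storage itself if the verification runs after the cache has been dropped or the machine restarted.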
henningandersen left a comment
This looks good, left a couple of mostly optional comments.
apart from the data corruption, but data corruption itself is a very strong
indicator that your storage subsystem is not working correctly.

To narrow down the source of the corruptions, systematically change components
I think I'd like to reword this a bit to be more of a non-exhaustive list of suggestions for ways to help diagnose corruptions.
Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
arteam left a comment
This looks awesome! Thank you David!
Verifying a checksum is expensive since it involves reading every byte of the
file which takes significant effort and might evict more useful data from the
filesystem cache, so systems typically don't verify the checksum on a file
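The cost described in the excerpt above can be seen in a small Python sketch: verification has to stream the whole file through the checksum function. The layout here (a 4-byte big-endian CRC32 appended to the payload) and the names `write_with_checksum` and `verify_checksum` are assumptions for illustration only; this is not Lucene's actual file format.

```python
import os
import struct
import zlib

# Hypothetical layout: payload followed by a 4-byte big-endian CRC32 footer.
FOOTER = struct.Struct(">I")


def write_with_checksum(path, payload):
    """Write `payload` followed by its CRC32 footer."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(FOOTER.pack(zlib.crc32(payload)))


def verify_checksum(path, chunk_size=1 << 20):
    """Return True if the stored CRC32 matches. Reads every byte of the file."""
    body_len = os.path.getsize(path) - FOOTER.size
    if body_len < 0:
        return False  # too short to even contain a footer
    crc = 0
    with open(path, "rb") as f:
        remaining = body_len
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                return False  # file shrank while we were reading it
            crc = zlib.crc32(chunk, crc)  # fold this chunk into the running CRC
            remaining -= len(chunk)
        (stored,) = FOOTER.unpack(f.read(FOOTER.size))
    return crc == stored
```

Because the loop touches every byte, running it on a large file pulls that file through the filesystem cache, which is the eviction effect the excerpt mentions.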

The files that make up a Lucene index are written in full before they are used.
If a file is needed to recover an index after a restart then your storage
system will previously have confirmed to {es} that this file was durably synced
Did you mean "had previously confirmed" here? "will previously have confirmed" sounds a bit strange to me.
This one is legitimate English - if you want the technical details, "will have confirmed" is the future perfect tense, and that's acceptable to use in the first conditional form as I'm doing here. But I think it's fair to say that it could be confusing too, so I simplified it in 85a8ee2.
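For readers unfamiliar with what "durably synced" involves in the excerpt under discussion, a common pattern is sketched below in Python. It assumes a POSIX system, and the helper name `durable_write` and the `.tmp` suffix are hypothetical; it illustrates the general write, fsync, rename, fsync-directory sequence and is not a description of how Lucene or {es} implement it.

```python
import os


def durable_write(path, data):
    """Write `data` in full to a temporary file, sync it, then publish it under `path`."""
    tmp = path + ".tmp"  # hypothetical temporary name, for illustration only
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # the storage system confirms the file contents are durable
    os.replace(tmp, path)  # atomically expose the fully written file
    # Sync the parent directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the storage system acknowledges these `fsync` calls without actually persisting the data, a file can come back damaged or missing after a power loss even though the writer did everything correctly.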
Adds some docs giving more detailed background about what data corruption really means and some suggestions about how to narrow down the root cause.
Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
* upstream/main:
  Add 8.5 migration docs (elastic#88923)
  Script: Reindex & UpdateByQuery Metadata (elastic#88665)
  Remove unused plugins dir var from server CLI (elastic#88917)
  Use tracing API in TaskManager (elastic#88885)
  Add source fallback for keyword fields using operation (elastic#88735)
  Prune changelogs after 8.3.3 release
  Bump versions after 8.3.3 release
  Add a test for checking for misspelled "dry_run" parameters for Desired Nodes API (elastic#88898)
  Speedup BalanceUnbalancedClusterTests (elastic#88794)
  Preventing exceptions on node shutdown in integration tests (elastic#88827)
  Do not trigger check part3 for test mute and docs PRs (elastic#88895)
  Add troubleshooting docs about data corruption (elastic#88760)
  Mute RollupActionSingleNodeTests#testRollupDatastream (elastic#88891)
  [DOCS] Domain splitting impacts API keys (elastic#88677)
  Fix SqlSearchIT testAllTypesWithRequestToOldNodes (elastic#88866) (elastic#88883)
  Update synthetic-source.asciidoc (elastic#88880)
  Log more details in TaskAssertions (elastic#88864)
  Make Tuple a record (elastic#88280)