Add troubleshooting docs about data corruption #88760

Merged
DaveCTurner merged 7 commits into elastic:main from DaveCTurner:2022-07-25-troubleshooting-corruption
Jul 28, 2022

Conversation

@DaveCTurner
Member

Adds some docs giving more detailed background about what data
corruption really means and some suggestions about how to narrow down
the root cause.

@DaveCTurner added the >docs (General docs changes), :Core/Infra/Core (Core issues without another label), v8.4.0, and v8.3.4 labels on Jul 25, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-docs (Team:Docs)

@elasticsearchmachine added the Team:Core/Infra (Meta label for core/infra team) and Team:Docs (Meta label for docs team) labels on Jul 25, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

Comment on lines +77 to +81
than {es} and look for data integrity errors. On Linux the `fio` and
`stress-ng` tools can both generate challenging I/O workloads and verify the
integrity of the data they write. You can check that durable writes persist
across power outages using a script such as
https://gist.github.com/bradfitz/3172656[`diskchecker.pl`]. Try different
Member Author

Unclear if we should mention these tools like this. Maybe we should give example invocations for fio and stress-ng? diskchecker.pl looks janky but it's pretty much the best way to find fsync() bugs (it's in the PostgreSQL docs).
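
To illustrate the write-then-verify idea discussed in the comment above, here is a minimal Python sketch. It is not an invocation of `fio` or `stress-ng` and not a replacement for them; the function names and `BLOCK_SIZE` are hypothetical, chosen only for this example. It writes random blocks, syncs them, and later re-reads every block to compare digests.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def write_blocks(path, num_blocks):
    """Write random blocks to path, fsync, and return each block's SHA-256 digest."""
    digests = []
    with open(path, "wb") as f:
        for _ in range(num_blocks):
            block = os.urandom(BLOCK_SIZE)
            digests.append(hashlib.sha256(block).hexdigest())
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage
    return digests


def verify_blocks(path, digests):
    """Re-read every block and return the indices whose digest no longer matches."""
    bad = []
    with open(path, "rb") as f:
        for index, expected in enumerate(digests):
            block = f.read(BLOCK_SIZE)
            if hashlib.sha256(block).hexdigest() != expected:
                bad.append(index)
    return bad
```

A read that immediately follows the write may be served from the filesystem cache rather than from the device, so a sketch like this is only meaningful if the verification runs after the cache has been dropped or the machine restarted. Dedicated tools handle a much wider range of I/O patterns.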

@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
Contributor

@henningandersen henningandersen left a comment

This looks good, left a couple of mostly optional comments.

apart from the data corruption, but data corruption itself is a very strong
indicator that your storage subsystem is not working correctly.

To narrow down the source of the corruptions, systematically change components
Contributor

I think I'd like to reword this a bit to be more of a non-exhaustive list of suggestions for ways to help diagnose corruptions.

Member Author

👍 see 37a0caa.

Contributor

@arteam arteam left a comment

This looks awesome! Thank you David!


Verifying a checksum is expensive since it involves reading every byte of the
file which takes significant effort and might evict more useful data from the
filesystem cache, so systems typically doesn't verify the checksum on a file
Contributor

s/doesn't/don't ?

Member Author

:) well spotted, thanks - fixed in 3889101
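
The cost described in the quoted passage, reading every byte of a file to check it, can be seen in a minimal Python sketch. The file layout here (payload followed by a 4-byte big-endian CRC32 footer) and the function names are assumptions made for illustration only; they are not the actual Lucene file format or any {es} API.

```python
import struct
import zlib

# Hypothetical, simplified layout: payload bytes followed by a 4-byte big-endian CRC32.
FOOTER = struct.Struct(">I")


def write_with_checksum(path, payload):
    """Write payload followed by a footer holding its CRC32."""
    with open(path, "wb") as f:
        f.write(payload)
        f.write(FOOTER.pack(zlib.crc32(payload)))


def verify_checksum(path, chunk_size=1 << 20):
    """Return True if the stored CRC32 matches; this has to read the whole file."""
    with open(path, "rb") as f:
        f.seek(0, 2)  # seek to the end to find the file length
        payload_len = f.tell() - FOOTER.size
        if payload_len < 0:
            return False  # too short to even hold a footer
        f.seek(0)
        crc = 0
        remaining = payload_len
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                return False  # file shrank while we were reading it
            crc = zlib.crc32(chunk, crc)
            remaining -= len(chunk)
        (stored,) = FOOTER.unpack(f.read(FOOTER.size))
    return crc == stored
```

Because `verify_checksum` streams the entire payload through the CRC, its cost grows with file size and every chunk it reads passes through the filesystem cache, which is the trade-off the passage describes.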


The files that make up a Lucene index are written in full before they are used.
If a file is needed to recover an index after a restart then your storage
system will previously have confirmed to {es} that this file was durably synced
Contributor

Did you mean "had previously confirmed" here? "will previously have confirmed" sounds a bit strange to me.

Member Author

This one is legitimate English - if you want the technical details, "will have confirmed" is the future perfect tense, and that's acceptable to use in the first conditional form as I'm doing here. But I think it's fair to say that it could be confusing too, so I simplified it in 85a8ee2.
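
For readers unfamiliar with what "durably synced" means in the passage under review, the following Python sketch shows one common POSIX pattern: write the file in full, `fsync` it, rename it into place, then `fsync` the containing directory. This is an illustration under stated assumptions, not a description of how Lucene or {es} implement it; `durable_write` and the `.tmp` suffix are hypothetical, and syncing a directory this way is assumed to run on a POSIX system such as Linux.

```python
import os


def durable_write(path, data):
    """Write data to path so that it should survive a crash once this returns."""
    tmp_path = path + ".tmp"  # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # the storage system confirms the contents are durable
    os.replace(tmp_path, path)  # atomically move the complete file into place
    # Sync the directory too, so the new directory entry is also durable.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

If the storage system acknowledges these syncs without actually persisting the data, a later read of the file can see missing or damaged contents, which is the kind of fault the troubleshooting docs aim to help diagnose.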

Contributor

@henningandersen henningandersen left a comment

LGTM.

@DaveCTurner DaveCTurner merged commit 7103053 into elastic:main Jul 28, 2022
@DaveCTurner DaveCTurner deleted the 2022-07-25-troubleshooting-corruption branch July 28, 2022 10:23
DaveCTurner added a commit that referenced this pull request Jul 28, 2022
Adds some docs giving more detailed background about what data
corruption really means and some suggestions about how to narrow down
the root cause.

Co-authored-by: Henning Andersen <33268011+henningandersen@users.noreply.github.com>
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Jul 29, 2022
* upstream/main:
  Add 8.5 migration docs (elastic#88923)
  Script: Reindex & UpdateByQuery Metadata (elastic#88665)
  Remove unused plugins dir var from server CLI (elastic#88917)
  Use tracing API in TaskManager (elastic#88885)
  Add source fallback for keyword fields using operation (elastic#88735)
  Prune changelogs after 8.3.3 release
  Bump versions after 8.3.3 release
  Add a test for checking for misspelled "dry_run" parameters for Desired Nodes API (elastic#88898)
  Speedup BalanceUnbalancedClusterTests (elastic#88794)
  Preventing exceptions on node shutdown in integration tests (elastic#88827)
  Do not trigger check part3 for test mute and docs PRs (elastic#88895)
  Add troubleshooting docs about data corruption (elastic#88760)
  Mute RollupActionSingleNodeTests#testRollupDatastream (elastic#88891)
  [DOCS] Domain splitting impacts API keys (elastic#88677)
  Fix SqlSearchIT testAllTypesWithRequestToOldNodes (elastic#88866) (elastic#88883)
  Update synthetic-source.asciidoc (elastic#88880)
  Log more details in TaskAssertions (elastic#88864)
  Make Tuple a record (elastic#88280)
@mark-vieira mark-vieira added v8.4.0 and removed v8.4.1 labels Aug 15, 2022
Labels

:Core/Infra/Core (Core issues without another label), >docs (General docs changes), Team:Core/Infra (Meta label for core/infra team), Team:Docs (Meta label for docs team), v8.3.4, v8.4.0, v8.5.0

5 participants