(Doc+) Recovery ".kibana*" indices write block #189999

Closed
stefnestor wants to merge 3 commits into main from stefnestor-patch-3

@stefnestor
Member

@stefnestor stefnestor commented Aug 6, 2024

👋🏽 howdy, team!

Summary

Adds how to recover from Kibana index write blocks, which users commonly surface to Support as a recovery step after hitting either flood-stage watermark or max shards open Elasticsearch errors. Adding this now that elastic/elasticsearch#111315 has merged.

Recovery is either manually removing the write block (which will be automatically reapplied if Kibana thinks the issue is ongoing) or resetting to an earlier Kibana version.
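
For readers unfamiliar with the first option, here is a minimal sketch of the request that clears a write block. It assumes the standard Elasticsearch update index settings API (`PUT /<index>/_settings`), where setting a value to `null` resets it to its default. The helper name `build_remove_write_block_request` is hypothetical, and the code only builds the request rather than sending it.

```python
import json


def build_remove_write_block_request(index_pattern=".kibana*"):
    """Return (method, path, body) for clearing index.blocks.write.

    Hypothetical helper for illustration: it only constructs the request
    and does not contact a cluster.
    """
    # A null value resets the setting to its default (no write block).
    body = {"index.blocks.write": None}
    return "PUT", f"/{index_pattern}/_settings", json.dumps(body)


method, path, body = build_remove_write_block_request()
print(method, path, body)
# PUT /.kibana*/_settings {"index.blocks.write": null}
```

As the summary notes, the block can be reapplied if the underlying problem is still present, so this request alone is not necessarily a fix.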

Adds how to recover from Kibana index write blocks, which users commonly surface to Support as a recovery step after hitting either [flood-stage watermark](https://www.elastic.co/guide/en/elasticsearch/reference/current/fix-watermark-errors.html) or [max shards open](https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html#_this_action_would_add_x_total_shards_but_this_cluster_currently_has_yz_maximum_shards_open) Elasticsearch errors.
@stefnestor added the Team:Docs, enhancement (New value added to drive a business result), and docs labels on Aug 6, 2024
@elasticmachine
Contributor

Pinging @elastic/kibana-docs (Team:Docs)

@github-actions
Contributor

github-actions bot commented Aug 6, 2024

A documentation preview will be available soon.

Request a new doc build by commenting
  • Rebuild this PR: run docs-build
  • Rebuild this PR and all Elastic docs: run docs-build rebuild

run docs-build is much faster than run docs-build rebuild. A rebuild should only be needed in rare situations.

If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here.

@stefnestor
Member Author

👋 @rudolf ,

I've heard through the grapevine that you might have concerns with this recovery outline, and that it may be a sunk-cost belief that Dev would actually recommend avoiding across all versions. Could you kindly confirm if so?

Context I know:

  • kibana#158733 lists that these steps can cause silent data loss specifically for v8.8.0 but not other versions
  • internal link lists this workaround or a close variation is viable for v7.17.9

I believe a separate meeting may be scheduled between Support+Dev on how to better avoid/answer this ballpark long-term, but I'm bumping here so this PR can merge or close unmerged rather than continue sitting WoDev. TIA! 🙏

@rudolf
Contributor

rudolf commented Sep 17, 2024

@stefnestor Yeah, what happens is that flood-stage watermarks, max shards open, or other temporary ES problems like continuously hitting circuit breakers cause a migration to continuously fail.

The solution is always to fix the underlying ES problem. Once this is done Kibana will automatically reattempt the migration without further intervention by a user. Users might choose to revert to the previous version if Kibana availability is important and they believe an ES fix might take prohibitively long.
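
To make "the underlying ES problem" concrete, here is a rough sketch of the two threshold checks mentioned in this thread. It assumes Elasticsearch's documented defaults: `cluster.max_shards_per_node` of 1000 per non-frozen data node, and a flood-stage disk watermark of 95%. A given cluster may override either, and the function names are hypothetical.

```python
def shards_over_limit(current_open, shards_to_add, data_nodes,
                      max_shards_per_node=1000):
    """Return (would_exceed, limit) for the cluster-wide open shard limit.

    Assumes the limit is data_nodes * max_shards_per_node, with the
    default of 1000 shards per non-frozen data node.
    """
    limit = data_nodes * max_shards_per_node
    return current_open + shards_to_add > limit, limit


def over_flood_stage(used_bytes, total_bytes, flood_stage_pct=95.0):
    """Return True if disk usage exceeds the flood-stage watermark.

    Assumes the default flood-stage watermark of 95% disk usage.
    """
    return (used_bytes / total_bytes) * 100.0 > flood_stage_pct


print(shards_over_limit(2995, 10, data_nodes=3))   # (True, 3000)
print(over_flood_stage(96, 100))                   # True
```

These checks are only arithmetic; the real values would come from the cluster itself, for example from its health and allocation APIs.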

The problem arises on ECH, where a failed Kibana upgrade is automatically rolled back without properly following the rollback instructions. This usually leaves Kibana version N unable to upgrade because a partial Kibana version N+1 migration has been started (the cause of the write block). The recommended fix is to follow our documentation to correctly roll back to the previous version by restoring the Kibana feature state from before the upgrade.
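
As an illustration of the restore step, here is a minimal sketch of a snapshot restore request limited to the Kibana feature state. It assumes the Elasticsearch restore snapshot API (`POST /_snapshot/<repository>/<snapshot>/_restore`) with `feature_states` set to `kibana` and `"indices": "-*"` to skip regular indices. The repository and snapshot names are placeholders, and the official rollback instructions include further steps (such as stopping Kibana first) that this sketch does not cover.

```python
import json


def build_feature_state_restore(repository, snapshot):
    """Return (method, path, body) to restore only the Kibana feature state.

    Hypothetical helper: repository and snapshot names are supplied by the
    caller, and the request is built but never sent.
    """
    path = f"/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        "indices": "-*",                # exclude regular indices
        "feature_states": ["kibana"],   # restore Kibana system indices only
        "include_global_state": False,
    }
    return "POST", path, json.dumps(body)


# "my_repository" and "pre_upgrade_snapshot" are placeholder names.
print(build_feature_state_restore("my_repository", "pre_upgrade_snapshot"))
```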

Perhaps what we can do to help users better is to document the write block under https://www.elastic.co/guide/en/kibana/current/resolve-migrations-failures.html and explain that this is most likely a case of incorrectly rolling back, recommending that users follow our rollback instructions. WDYT?

@stefnestor
Member Author

stefnestor commented Sep 27, 2024

@rudolf sounds like a game plan to me. If I may confirm, you just linked to a guide that also recommends manually removing the write block instead of the Rollback Kibana guide, was that intentional? I may have misunderstood, sorry.

EDIT: Noting that for Sev1 01756480, whose workaround KB outlines what we think we don't want it to say, restoring from snapshot didn't work because the restored feature state from the last successful snapshot was already write-blocked. 😬

@rudolf
Contributor

rudolf commented Sep 30, 2024

@stefnestor https://www.elastic.co/guide/en/kibana/current/resolve-migrations-failures.html documents how to fix migrations failing due to corrupt documents. Since these corrupt documents exist in the existing index they are also present in any feature state snapshots. So in this case a feature state snapshot would not be able to resolve the problem.

@stefnestor
Member Author

Sorry for the delay, I was out on vacation. I think I'm caught on a Y of an XY problem.

> The solution is always to fix the underlying ES problem. Once this is done Kibana will automatically reattempt the migration without further intervention by a user.

My belief is that KB migrations will error because of ES; but after fixing ES, KB will continue erroring that its indices are write-blocked, so the migration will not progress even after restarting KB. That's why some of my teammates have thought they needed to manually remove the write block (to push the state one step back so KB starts the migration without erroring, even though the first thing it does is re-establish the write block). Does that line up with your expectations, or does it sound unexpected to you?
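
One way to check the state described here is to look for indices that still carry `index.blocks.write` after the Elasticsearch problem is fixed. The sketch below assumes the nested response shape of the Elasticsearch get index settings API (for example `GET /.kibana*/_settings`); the function name and the index names in the sample are made up for illustration.

```python
def write_blocked_indices(settings_response):
    """Return the sorted names of indices whose index.blocks.write is true.

    Expects the nested (non-flat) shape returned by GET /<index>/_settings.
    """
    blocked = []
    for name, data in settings_response.items():
        blocks = data.get("settings", {}).get("index", {}).get("blocks", {})
        # Setting values are returned as strings, e.g. "true".
        if str(blocks.get("write", "false")).lower() == "true":
            blocked.append(name)
    return sorted(blocked)


# Sample response with hypothetical index names.
sample = {
    ".kibana_8.8.0_001": {"settings": {"index": {"blocks": {"write": "true"}}}},
    ".kibana_task_manager_8.8.0_001": {"settings": {"index": {}}},
}
print(write_blocked_indices(sample))
# ['.kibana_8.8.0_001']
```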

@florent-leborgne
Member

Hi @stefnestor, what's the status of this PR? We're going to be migrating to a new docs format and repo this week, making this PR invalid.
If the information contained here is still relevant to add, let me know and I will recreate the PR in the new repo & format.
Thanks!

@stefnestor stefnestor closed this Jan 30, 2025
@jbudz jbudz deleted the stefnestor-patch-3 branch February 19, 2025 22:00

Labels

docs enhancement New value added to drive a business result release_note:enhancement Team:Docs
