Skip to content

Adding impacts block to the health info API response#84899

Merged
masseyke merged 30 commits intoelastic:masterfrom
masseyke:feature/health-api-impacts-block
Mar 24, 2022
Merged

Adding impacts block to the health info API response#84899
masseyke merged 30 commits intoelastic:masterfrom
masseyke:feature/health-api-impacts-block

Conversation

@masseyke
Copy link
Copy Markdown
Member

This change introduces the notion of an impact to a health API indicator, and adds impacts to all currently-existing
indicators.
Closes #84773

@masseyke masseyke added :Distributed/Health Issues for the health report API v8.2.0 labels Mar 10, 2022
@elasticmachine elasticmachine added the Team:Data Management (obsolete) DO NOT USE. This team no longer exists. label Mar 10, 2022
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Hi @masseyke, I've created a changelog YAML for you.

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this Keith.

I've left a couple of minor questions. Can you please post an example of the output in the description?

@masseyke masseyke requested a review from andreidan March 16, 2022 15:36
@masseyke
Copy link
Copy Markdown
Member Author

Here's an example response. You can see empty impact blocks for green items, and a filled out impact block for the shards_availability indicator.

{
    "cluster_name": "elasticsearch",
    "components": {
        "cluster_coordination": {
            "indicators": {
                "instance_has_master": {
                    "details": {
                        "coordinating_node": {
                            "name": "keiths-mbp.lan",
                            "node_id": "R3iN7tj1TP6dfAQCNFzFmA"
                        },
                        "master_node": {
                            "name": "keiths-mbp.lan",
                            "node_id": "R3iN7tj1TP6dfAQCNFzFmA"
                        }
                    },
                    "impact": {},
                    "status": "green",
                    "summary": "Health coordinating instance has a master node."
                }
            },
            "status": "green"
        },
        "data": {
            "indicators": {
                "ilm": {
                    "details": {
                        "ilm_status": "RUNNING",
                        "policies": 15
                    },
                    "impact": {},
                    "status": "green",
                    "summary": "ILM is running"
                },
                "shards_availability": {
                    "details": {
                        "creating_primaries": 0,
                        "initializing_primaries": 0,
                        "initializing_replicas": 0,
                        "restarting_primaries": 0,
                        "restarting_replicas": 0,
                        "started_primaries": 12,
                        "started_replicas": 0,
                        "unassigned_primaries": 0,
                        "unassigned_replicas": 18
                    },
                    "impact": {
                        "impact_description": "Redundancy for 3 indices [my-index-000001, my-index-000003, my-index-000002] is currently disrupted. Fault tolerance and search scalability are reduced.",
                        "severity": 3
                    },
                    "status": "yellow",
                    "summary": "This cluster has 18 unavailable replicas."
                }
            },
            "status": "yellow"
        },
        "snapshot": {
            "indicators": {
                "repository_integrity": {
                    "details": {},
                    "impact": {},
                    "status": "green",
                    "summary": "No repositories configured."
                },
                "slm": {
                    "details": {
                        "policies": 0,
                        "slm_status": "RUNNING"
                    },
                    "impact": {},
                    "status": "green",
                    "summary": "No policies configured"
                }
            },
            "status": "green"
        }
    },
    "status": "yellow"
}

Here's the impact if the primaries and replicas are unavailable:

                    "impact": {
                        "impact_description": "Indices [my-index-000001, my-index-000003, my-index-000002] are unavailable. You are at risk of losing the data in these indices.",
                        "severity": 1
                    },

And here is the impact if primaries are unavailable but replicas are:

                    "impact": {
                        "impact_description": "Cannot add data to 1 index [my-index-000001]. Searches might return incomplete results.",
                        "severity": 2
                    },

In all of the above, it will show at most 10 indices. If there are more, it will include ... at the end. Also note that the impact can be populated even if the status is green. For example if a user has disabled replication there will be zero replicas. The status will be green (because no replicas are unavailable). But we still populate the impact to say that redundancy and fault tolerance are impacted.
Here is the impact block if there is no master:

                    "impact": {
                        "impact_description": "The cluster cannot create, delete, or rebalance indices, and is likely to be very unstable",
                        "severity": 1
                    },

And if there are corrupted repositories:

                    "impact": {
                        "impact_description": "Snapshots in corrupted repositories cannot be restored. Data loss is possible.",
                        "severity": 2
                    },

If ILM is expected to be running but is not:

                    "impact": {
                        "impact_description": "Indices are not being rolled over, which could lead to future instability.",
                        "severity": 3
                    },

And if SLM is expected to be running but is not:

                    "impact": {
                        "impact_description": "Scheduled snapshots are not happening, which could lead to future data loss.",
                        "severity": 3
                    },

@masseyke
Copy link
Copy Markdown
Member Author

@elasticmachine update branch

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Keith.

I just had a couple of comments.

In the interest of keeping scope to a minimum can you please remove the impacts for all indicators except the shards_availability one? I think it'd be easier to have separate PRs for the other impacts and merge them independently The cluster coordination one for e.g. needs a lot of work to be able to safely indicate the master is unstable, the snapshot one might want to indicate which repository is corrupted etc
What do you think?

@masseyke
Copy link
Copy Markdown
Member Author

Thanks Keith.

I just had a couple of comments.

In the interest of keeping scope to a minimum can you please remove the impacts for all indicators except the shards_availability one? I think it'd be easier to have separate PRs for the other impacts and merge them independently The cluster coordination one for e.g. needs a lot of work to be able to safely indicate the master is unstable, the snapshot one might want to indicate which repository is corrupted etc What do you think?

I'll change them to "TODO". With this change, we have to have some impact so I figured I'd at least put a placeholder in for all of them. The snapshot health indicator already reports the repository name in the summary so I was avoiding duplicating that in the impact.

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this Keith.

On the TODO aspect, I think this would make it harder for the UI team to work out if they should display the impact or not.
How about we don't include the impact section at all for the indicators for which we don't have the impacts yet? (so for this PR only the shards_availability indicator will report an impact section)

@masseyke
Copy link
Copy Markdown
Member Author

Thanks for iterating on this Keith.

On the TODO aspect, I think this would make it harder for the UI team to work out if they should display the impact or not. How about we don't include the impact section at all for the indicators for which we don't have the impacts yet? (so for this PR only the shards_availability indicator will report an impact section)

I assumed the UI team was just showing whatever string we gave them, and that it would be better to see a placeholder than nothing. I'll change it to an empty impact (there's no such thing as no impact any more).

@andreidan
Copy link
Copy Markdown
Contributor

I assumed the UI team was just showing whatever string we gave them, and that it would be better to see a placeholder than nothing. I'll change it to an empty impact (there's no such thing as no impact any more).

Thanks for iterating on this Keith.

Whilst they're testing the first iterations we might not have impacts for all indicators. I think our APIs generally don't report empty blocks e.g. https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/action/admin/cluster/allocation/ClusterAllocationExplanation.java#L171 , https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ilm/IndexLifecycleExplainResponse.java#L471

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this Keith. I think this is nearly done.

I've left one question.

if (indicesWithUnavailablePrimariesAndReplicas.isEmpty() == false) {
String impactDescription = String.format(
Locale.ROOT,
"%s [%s] %s unavailable. You are at risk of losing the data in %s.",
Copy link
Copy Markdown
Contributor

@andreidan andreidan Mar 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current impacts are:

Primaries available, replicas unassigned:

"impact": {
  "description": "Redundancy for 3 indices [my-index-000001, my-index-000003, my-index-000002] is currently disrupted. Fault tolerance and search scalability are reduced.",
  "severity": 3
}

edit: update category to "Primaries unassigned"
Primaries unassigned:

"impact": {
  "description": "Cannot add data to 1 index [my-index-000001]. Searches might return incomplete results.",
  "severity": 2
}

Both primaries and replicas unavailable:

 "impact": {
   "description": "Indices [my-index-000001, my-index-000003, my-index-000002] are unavailable. You are at risk of losing the data in these indices.",
   "severity": 1
}

@cristina-eleonora @shubhaat @Leaf-Lin Can you please provide some feedback on these messages?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping @andreidan !

Primaries available, replicas unassigned:

"impact": {
  "description": "Redundancy for 3 indices [my-index-000001, my-index-000003, my-index-000002] is currently disrupted. Fault tolerance and search scalability are reduced.",
  "severity": 3
}

Could we rephrase the first half to something like: Data redundancy is affected on 3 indices [my-index-000001, my-index-000003, my-index-000002]. ? Are there any impacts on search/ingest? And could we phrase Fault tolerance and search scalability are reduced. in a way that a non technical / minimally technical person would understand? Perhaps some of the writers could help with this.

Primaries unassigned, but replicas available:

"impact": {
  "description": "Cannot add data to 1 index [my-index-000001]. Searches might return incomplete results.",
  "severity": 2
}

Looks great 👍🏻

Both primaries and replicas unavailable:

 "impact": {
   "description": "Indices [my-index-000001, my-index-000003, my-index-000002] are unavailable. You are at risk of losing the data in these indices.",
   "severity": 1
}

Does this mean that those indices cannot be accessed anymore? Or that the data on those indices cannot be accessed? How will that impact search/ingest?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought Primaries unassigned, but replicas available would not happen because replica will be promoted?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on @Leaf-Lin comment above we decided to simplify the impacts into the following categories:

  1. primaries unavailable for indices:
Cannot add data to 1 index [my-index-000001]. Searches will return incomplete results.
  1. replicas unavailable for indices:
Data redundancy is affected on 3 indices [my-index-000001, my-index-000003, my-index-000002]. Fault tolerance and search scalability are reduced.

Thanks for your thoughts on the messages @cristina-eleonora
I think with the simplified cases we have the Fault tolerance and search scalability are reduced. bit to rephrase somehow.
When talking about replicas in our docs we say:

Replicas provide redundant copies of your data to protect against hardware failure and increase capacity to serve read requests like searching or retrieving a document

Would impact number 2 make more sense if it was: (or something along those lines?)

Data redundancy is affected on 3 indices [my-index-000001, my-index-000003, my-index-000002]. The capacity to serve search requests is reduced and you could be susceptible to data loss in case of hardware failure.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I have the wording in from Andre's last comment. I can easily change it if we need to.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For missing replicas, how about just:
Searches might return slower than expected. Fewer redundant copies of the data exist than expected on 3 indices [my-index-000001, my-index-000003, my-index-000002].
Or maybe the slightly scarier:
Searches might return slower than expected, and you are at higher risk of the cluster becoming unstable in the future. Fewer redundant copies of the data exist than expected on 3 indices [my-index-000001, my-index-000003, my-index-000002].
Or the even scarier:
Searches might return slower than expected, and you are at higher risk of losing data. Fewer redundant copies of the data exist than expected on 3 indices [my-index-000001, my-index-000003, my-index-000002].

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally like the 1st and the 3rd 🙂 The third is more explicit on the importance of replicas, but may scare the users. If they should be scared, let's keep it, if not, perhaps use the 1st.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pinging @alaudazzi , our UX writer.

Copy link
Copy Markdown
Contributor

@andreidan andreidan Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like 1 too, with the only mentions that "than expected ..." is repeated and in the 2nd sentence it should come earlier (?)
I suggest:

Searches might return slower than usual. Fewer than expected redundant copies of the data exist on 3 indices [my-index-000001, my-index-000003, my-index-000002].

Or maybe even drop "than expected" in the 2nd sentence altogether (it reads very strange to me)

Searches might return slower than usual. Fewer redundant copies of the data exist on 3 indices [my-index-000001, my-index-000003, my-index-000002].

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with Andre's last wording, but can update at any point later if we work out something better.

Copy link
Copy Markdown
Contributor

@andreidan andreidan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for iterating on this Keith

Left 2 very minor comments, but I think we can merge this soon as the messaging seems to be getting some consensus.

@masseyke
Copy link
Copy Markdown
Member Author

@elasticmachine run CLA

@masseyke masseyke merged commit 99fe8dc into elastic:master Mar 24, 2022
@masseyke masseyke deleted the feature/health-api-impacts-block branch March 24, 2022 17:01
elasticsearchmachine pushed a commit that referenced this pull request May 9, 2022
The ability to add impacts to indicators was added in #84899, but
impacts for all indicators other than shards availability were left
empty. This commit adds potential impacts for the other indicators.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Health Issues for the health report API >feature Team:Data Management (obsolete) DO NOT USE. This team no longer exists. v8.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement impacts block for the shards availability indicator

7 participants