
Issue 25323: Fixed error catch and route handling#43338

Closed
igoristic wants to merge 5 commits intoelastic:masterfrom
igoristic:25323

Conversation

@igoristic
Contributor

@igoristic igoristic commented Aug 15, 2019

EDIT: This is an old iteration, please check out the new PR at: #44800


Resolves #25323

In some cases we weren't catching monitoringClusters failures, which resulted in a blank error screen.

The default route was set to /no-data, which didn't have any loading state, so I rerouted all the entry points to /loading.

I also removed some on-route resolve triggers, since their controllers were already doing the same loading logic.


Steps to cause monitoringClusters() to time out (on master):

  1. Start with a fresh ES/Kibana stack from master
  2. Enable Monitoring in Stack Monitoring
  3. Start a second, independent ES instance (without security), e.g.:
bin/elasticsearch -E xpack.license.self_generated.type=trial -E xpack.monitoring.collection.enabled=true -E xpack.security.enabled=false
  4. Go to Dev Tools and register the second cluster as a remote cluster (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-remote-clusters.html) with something like:
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_two": {
          "seeds": [
            "127.0.0.1:9201"
          ],
          "skip_unavailable": true
        }
      }
    }
  }
}
  5. Go to Stack Monitoring and notice that Kibana goes to /no-data for about 30 seconds with a completely blank screen. After that you'll see a toast error that goes away, but the blank screen will still be there

The fixed code goes to the /loading page instead (so there is now a loading indicator) and then to /no-data (with the relevant error details) after the timeout
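The described behavior can be sketched roughly as follows. All names here (router, enterApp, the stubbed monitoringClusters) are hypothetical stand-ins, not the actual PR code — the point is only the flow: land on /loading first, then route to /no-data with the error details once the clusters call rejects.

```javascript
// Sketch only: stand-ins for Kibana's routing and the monitoringClusters() call.
const router = {
  path: '/',
  goTo(path) { this.path = path; },
};

// Pretend monitoringClusters() times out against the unreachable remote cluster.
function monitoringClusters() {
  return Promise.reject(new Error('Request timed out'));
}

async function enterApp() {
  router.goTo('/loading'); // show a loading indicator instead of a blank screen
  try {
    const clusters = await monitoringClusters();
    router.goTo(clusters.length ? '/home' : '/no-data');
  } catch (err) {
    // Previously this rejection went uncaught, leaving a blank page;
    // now we land on /no-data with the error attached.
    router.goTo(`/no-data?err=${encodeURIComponent(err.message)}`);
  }
  return router.path;
}
```

The key change is that every entry point passes through a route that renders *something* while the request is in flight, and every rejection has a destination.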


TODO:

  • Added cluster.remoteInfo pre-check
  • Removed AppState and added a temp state to kbnUrl
  • Fix unit tests
  • Fix unhandled rejected promises

@igoristic igoristic added bug Fixes for quality problems that affect the customer experience review release_note:fix Team:Monitoring Stack Monitoring team v8.0.0 v7.4.0 labels Aug 15, 2019
@elasticmachine
Contributor

Pinging @elastic/stack-monitoring

@elasticmachine
Contributor

💔 Build Failed

@igoristic igoristic requested review from cachedout and removed request for ycombinator August 15, 2019 13:20
@cachedout
Contributor

After reading the description of this PR, I'm not sure I understand what the new behavior will be, so it's hard to test. Could you please update the description to further explain the change?

@elasticmachine
Contributor

💔 Build Failed

@igoristic
Contributor Author

@cachedout I added step-by-step instructions and clarified the current behavior vs. the fixed/expected behavior. Hope it helps. Let me know if you have any questions; I'm happy to walk you through it

Contributor

@chrisronline chrisronline left a comment


The problem with using AppState is we need to ensure we are removing it properly.

If I bring down the remote cluster, I see the error page (which is great!), but then once I bring that same remote cluster back up, the error is stuck in app state:
[Screenshot: error stuck in app state — Screen Shot 2019-08-19 at 4:35:47 PM]

We need to clear that out


initSetupModeState($scope, $injector);

const setupMode = getSetupModeState();
Contributor


Why was this moved?

Contributor Author


The '/elasticsearch/nodes' route does its own monitoringClusters() call through routeInitProvider, and if that fails it will be stuck at a blank screen again. There is no point in sending it to the nodes route if the clusters call fails.

We are calling monitoringClusters() twice this way, but at least it has less chance of failing (right after it has just succeeded). I figured this is fine for now, since this will go away once the setup becomes obsolete

uiRoutes
.when('/no-data', {
template,
resolve: {
Contributor


What's the thinking behind this page? I'm guessing it's to avoid doing the check that we might fail, but we still should be doing this check so we can do the redirect, if necessary.

Contributor Author


The redirect logic will still happen here: https://github.com/elastic/kibana/pull/43338/files#diff-87ec11d43f1e502792408786cf726eb6R96

We should avoid doing any kind of request in resolve, since it provides no template in which we can indicate a loading state to the user.

@igoristic
Contributor Author

igoristic commented Aug 20, 2019

The problem with using AppState is we need to ensure we are removing it properly.

@chrisronline I agree 💯

Unfortunately there is no way to clear it without refreshing the state again. This is why I wanted to create something that is not coupled with URL state.

I was actually thinking of extending kbnUrl.changePath to accept state that would be destroyed on the next hash/URL change. Or maybe we can do something similar with AppState
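The "state destroyed on the next URL change" idea could look something like the sketch below. This is purely illustrative (kbnUrl itself is not modified here, and createTransientState is an invented name): each navigation drops whatever one-shot state the previous navigation installed, so a stale error can't get stuck the way it does with AppState.

```javascript
// Illustrative sketch of one-shot navigation state, not actual kbnUrl code.
// State installed alongside a path change survives until the NEXT path
// change, at which point it is discarded automatically.
function createTransientState() {
  let state = null;
  return {
    changePath(path, nextState = null) {
      // Any navigation drops the previous one-shot state and
      // optionally installs a new one for the destination page.
      state = nextState;
      // ...the actual hash/URL change would happen here...
      return path;
    },
    getState() { return state; },
  };
}
```

With this shape, routing to /no-data with error details and then navigating anywhere else leaves no residue in the URL or app state.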

@chrisronline
Contributor

So I was rereading the original issue and I'm thinking we might be able to simplify and improve this fix.

Instead of needing to incur even the first 30s of timeout, I think we might be able to use the cluster.remoteInfo api to do a sort of "pre check" in the server-side code and then throw an appropriate error (which should result in a toast showing). This feels more closely aligned with some other checks we're doing on the server side, the amount of code is much less, and we can avoid the first timeout altogether!

Do you mind opening a new PR and trying to get this approach working? I think it'll be nice to compare both solutions
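A rough sketch of the suggested cluster.remoteInfo pre-check might look like the following. The callCluster wrapper, the error message, and the response shape are assumptions for illustration, not the eventual implementation; the Elasticsearch remote info API does report a connected flag per remote cluster.

```javascript
// Sketch of a server-side remote-cluster pre-check. `callCluster` stands in
// for Kibana's ES client wrapper; names here are illustrative only.
async function verifyRemoteClusters(callCluster) {
  const remotes = await callCluster('cluster.remoteInfo');
  const unreachable = Object.entries(remotes)
    .filter(([, info]) => info.connected === false)
    .map(([name]) => name);
  if (unreachable.length) {
    // Throwing here should surface as an error toast on the client side,
    // instead of a 30s timeout followed by a blank screen.
    throw new Error(`Unable to connect to remote cluster(s): ${unreachable.join(', ')}`);
  }
  return remotes;
}
```

Because the check runs before the expensive monitoring queries, the user gets immediate feedback when a remote cluster is down.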

@cachedout
Contributor

@igoristic Could you please comment on the plan for this? Are you going to open a new PR as @chrisronline requested above, or are you waiting on a review from me on this?

@igoristic
Contributor Author

@cachedout Sorry, I should have mentioned. Yes, I'm adding the cluster.remoteInfo pre-check to this PR.

It's working pretty well, just need to check a couple more cases. Should have it up soon

@elasticmachine
Contributor

💔 Build Failed

@chrisronline
Contributor

This seems like a lot of code to solve this problem. I'm wondering if we can do something even simpler.

What if we add a verify check like we do for auth, and if there is a remote cluster that is not found, we throw an error like the auth logic does? Any error thrown in the server code results in a toast on the client side. We might be able to get away with only changing a couple of files if this approach works.

I think you made some efforts to remove data fetching in resolve blocks, but we should probably separate that out into a separate PR (though I do agree we should change that).

WDYT?

@igoristic
Contributor Author

igoristic commented Aug 28, 2019

@chrisronline This might seem like a trivial issue with a "quick fix", but it's a little more complex and tightly coupled with the majority of our app.

I like your suggestion, but I don't think it'll yield less code (or a better-quality fix), just one fewer link in the clusters.js promise chain: https://github.com/elastic/kibana/pull/43338/files#diff-7bd0dcc7ef2ad499c1f11c39f0d092eb

The majority of the problem also came from route responses not catching any errors (hence the blank white screen), so it might be inefficient to add .catch to all the route responses in this PR and then completely remove the response requests in a different PR.

I might be missing your point though. Happy to zoom and discuss anytime 😄

@chrisronline
Contributor

@igoristic Check out this draft PR I just put up: #44297

I think this approach might also work, and it might be easier to reason about since it follows a more common convention in our monitoring code base.

Maybe we can bring in @cachedout for thoughts here too

@igoristic
Contributor Author

@chrisronline Thank you I really appreciate the draft PR!

Let me poke at it a little before I make a decision

@igoristic
Contributor Author

This is an old iteration, please check out the new PR at: #44800

@igoristic igoristic closed this Sep 4, 2019
@elasticmachine
Contributor

💔 Build Failed



Development

Successfully merging this pull request may close these issues.

[Monitoring] Check for timeouts to ES requests

4 participants