Skip to content

Make CLI more robust to discovery latency.#494

Merged
hidmic merged 8 commits intomasterfrom
hidmic/robust-cli
May 13, 2020
Merged

Make CLI more robust to discovery latency.#494
hidmic merged 8 commits intomasterfrom
hidmic/robust-cli

Conversation

@hidmic
Copy link
Copy Markdown

@hidmic hidmic commented Apr 22, 2020

Split from #421. This pull request builds on top of #493 and ensures the daemon gets used throughout all CLI tools. It also updates some tests which showed to be flakey nonetheless (with and without this patch).

@hidmic
Copy link
Copy Markdown
Author

hidmic commented Apr 27, 2020

With fac0a49, this fixes #488.

@hidmic hidmic force-pushed the hidmic/robust-cli branch from fac0a49 to 02c8d4a Compare April 27, 2020 18:48
@hidmic
Copy link
Copy Markdown
Author

hidmic commented Apr 27, 2020

CI up to ros2topic, ros2service, ros2action, ros2node and ros2component:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • OSX Build Status
  • Windows Build Status

@dirk-thomas
Copy link
Copy Markdown
Member

With fac0a49, #488 should be now addressed.

Please use a format which will ensure the referenced ticket is automatically closed when the PR is merged.

)
try:
list_nodes_client.wait_for_service()
if not list_nodes_client.wait_for_service(timeout_sec=5.0):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When used in a loop (as in component list) a blocking timeout of 5s can add up to a very long time for the user call.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough. Timeout is a somewhat arbitrary, larger than zero quantity. What about 2.0 seconds?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind the value. For this PR you can choose whatever.

The problem is that with the current API all waits happen sequentially. E.g. if the service of N nodes isn't available the CLI will hang for N * timeout. I would prefer that the CLI only blocks for timeout seconds (whatever that value is). But that is pretty much our of scope for this PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a follow up ticket for this.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, good point. See #507.

qos_profile = qos_profile_from_short_keys(
args.qos_profile, reliability=args.qos_reliability, durability=args.qos_durability)
if args.message_type is None:
with NodeStrategy(args) as node:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This potentially creates a second direct node (beside the one created a few lines below) which doesn't seem like a good idea.

Same pattern below in multiple cases (even though they preexisted before).

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, you might end up creating node, destroying it and creating another one. We do want to use the daemon if available though.

A solution could be to extend NodeStrategy's API to ensure you're dealing with a direct node in case you have to. Something like a force_direct() method (or a context manager?). WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A solution could be to extend NodeStrategy's API to ensure you're dealing with a direct node in case you have to.

What is the difference to a DirectNode then?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should create direct node, and conditionally if a daemon is running query some information from that to not have to wait for "complete" discovery for the direct node.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference to a DirectNode then?

That it'd use the daemon outside the scope where you actually need it to be direct, e.g.:

with NodeStrategy(args) as node:
    # ...
    with node.force_direct():
        node.subscribe()

I think this should create direct node, and conditionally if a daemon is running query some information from that to not have to wait for "complete" discovery for the direct node.

That sounds a lot like replicating NodeStrategy's implementation N times. I'd rather improve it in a single place.

Copy link
Copy Markdown
Author

@hidmic hidmic Apr 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #499 for an alternative.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #499 for an alternative.

This looks much better 👍

hidmic added 5 commits May 11, 2020 16:15
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@hidmic hidmic force-pushed the hidmic/robust-cli branch from 02c8d4a to d334eaa Compare May 11, 2020 19:16
@hidmic
Copy link
Copy Markdown
Author

hidmic commented May 11, 2020

CI up to ros2topic, ros2service, ros2action, ros2node and ros2component:

  • Linux Build Status (likely a flake, can't reproduce locally)
  • Linux-aarch64 Build Status
  • OSX Build Status
  • Windows Build Status

@hidmic hidmic requested a review from dirk-thomas May 11, 2020 19:17
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Copy link
Copy Markdown
Member

@jacobperron jacobperron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with green CI

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@dawonn-haval
Copy link
Copy Markdown
Contributor

Just wanted to chime in and mention that this patch resolves the cli instability for me across multiple hosts that I've been seeing for a long time.

I wonder if the same changes need to be made to the ros2bag package?

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@hidmic
Copy link
Copy Markdown
Author

hidmic commented May 12, 2020

Running CI once again, got linter test failures due to recent flake8 updates.

@clalancette
Copy link
Copy Markdown
Contributor

I might also suggest a full CI run on Linux, just to see what the system_tests stuff looks like with this in place.

@dirk-thomas
Copy link
Copy Markdown
Member

I might also suggest a full CI run on Linux, just to see what the system_tests stuff looks like with this in place.

Which part of ros2cli do the system_tests use?

@clalancette
Copy link
Copy Markdown
Contributor

Which part of ros2cli do the system_tests use?

I'm honestly not sure, but this is a big enough change that it makes me wary of the overnights. I'd prefer to see what it does on a full run, as I'd like not to regress at this point (we are actually getting pretty close to green for Linux CI, for instance).

@dirk-thomas
Copy link
Copy Markdown
Member

I'd prefer to see what it does on a full run,

A more complete run always makes sense. But also only for downstream packages which actually depend on something in the ros2cli repo. And in system_tests the only package depending on ros2cli is test_security - so running CI on e.g. test_communication would be not helpful and a waste of time / resources.

@hidmic
Copy link
Copy Markdown
Author

hidmic commented May 13, 2020

CI above ros2cli:

  • Linux Build Status

@hidmic
Copy link
Copy Markdown
Author

hidmic commented May 13, 2020

@clalancette @dirk-thomas all test failures here are unrelated to this patch. I've sent separate PRs to fix flake8 errors and sros2 test failures. Do you want to move forward or do you want to wait for those to land and make another CI run, this time across platforms?

Copy link
Copy Markdown
Member

@dirk-thomas dirk-thomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with it as is.

It would be good to investigate why the test failure happened in the PR build.

@hidmic
Copy link
Copy Markdown
Author

hidmic commented May 13, 2020

It would be good to investigate why the test failure happened in the PR build.

It's a strange one, I cannot reproduce locally. It doesn't show on nightlies either.

@aitazhixin
Copy link
Copy Markdown

In dashing 0.7.11 version, 'Could not determine the type for the passed topic' also happens.

@hidmic
Copy link
Copy Markdown
Author

hidmic commented Aug 19, 2020

@aitazhixin I don't think these changes were backported to Dashing. This patch (and connected ones) only extends and improves API, but I will say that anything bigger than a bugfix is less likely to be considered for a backport. @nuclearsandwich thoughts?

@aitazhixin
Copy link
Copy Markdown

@hidmic so, I need a version which is bigger than 0.9.3?
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants