Implement basic node restarting#1245

Merged
phil-opp merged 17 commits into main from restart-failed-nodes
Jan 7, 2026

Conversation

@phil-opp
Collaborator

@phil-opp phil-opp commented Dec 1, 2025

  • Make PreparedNode clonable and prepare for node restarting
  • Restart nodes according to restart policy

Proposed in https://github.com/orgs/dora-rs/discussions/1181

The more common spelling is cloneable, but we can't do anything about the crate name.
We still want dataflows to be able to exit normally, so nodes should not restart after manual stop commands or when all the node's inputs were already closed.
Starting a dataflow involves creating timer tasks etc, so we only want to do it once.
@haixuanTao
Collaborator

One breaking change that could be nice is to restart nodes that fail before the dataflow starts, as this is quite common with power cycles or occasional networking issues. We could try respawning a node 3 times when it fails before giving up.

@oortlieb pointed this issue out
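
To make the "try 3 times before giving up" suggestion concrete, here is a minimal sketch of a bounded retry loop. It is not taken from this PR: `run_with_retries`, `spawn`, and `max_attempts` are hypothetical names, and `spawn` stands in for whatever launches the node and returns its exit code.

```python
from typing import Callable, Tuple


def run_with_retries(spawn: Callable[[], int], max_attempts: int = 3) -> Tuple[int, int]:
    """Call `spawn` until it returns exit code 0 or `max_attempts` is reached.

    Returns (last_exit_code, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exit_code = 1
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        exit_code = spawn()
        if exit_code == 0:
            break  # the node came up cleanly, so stop retrying
    return exit_code, attempts


# Hypothetical flaky node: fails twice (e.g. USB device not ready), then succeeds.
outcomes = iter([1, 1, 0])
result = run_with_retries(lambda: next(outcomes))
print(result)
```

A hard cap like this avoids an endless respawn loop when the failure is permanent, such as a missing executable.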

@phil-opp phil-opp marked this pull request as ready for review December 11, 2025 14:40
@phil-opp
Collaborator Author

One idea that popped up in our meeting was to reuse the Event::Reload to signal that a node was restarted. However, the event is currently used for something else:

/// Instructs the node to reload itself or one of its operators.

Reusing this event to also notify nodes about restarts that already happened is not a good idea imo.

@phil-opp
Collaborator Author

One breaking change that could be nice is to restart nodes that fail before the dataflow starts, as this is quite common with power cycles or occasional networking issues. We could try respawning a node 3 times when it fails before giving up.

So this sounds like errors that happen after the node is spawned, but before initializing the Dora node API? If so, they should also be restarted as part of this PR if they have a restart policy set.

Or were you talking about failures to spawn, e.g. because the executable doesn't exist? I can also implement retries in that case, but I'm not sure if that is really something that is fixable by a retry.

@haixuanTao
Collaborator

So this sounds like errors that happen after the node is spawned, but before initializing the Dora node API? If so, they should also be restarted as part of this PR if they have a restart policy set.

Yes exactly

Or were you talking about failures to spawn, e.g. because the executable doesn't exist? I can also implement retries in that case, but I'm not sure if that is really something that is fixable by a retry.

No, I mean errors where the executable starts but something fails due to a power cycle or USB error (fairly common with multiple USB devices), so the dataflow has not started yet.

@phil-opp
Collaborator Author

Failures after start are all treated the same by this PR, no matter if the dataflow was started already or not. Could you try whether it fixes your issues?

@haixuanTao
Collaborator

I see!

Failures after start are all treated the same by this PR, no matter if the dataflow was started already or not. Could you try whether it fixes your issues?

Just tried and I think it's awesome!

I don't see the difference between on-failure and always. I guess it has something to do with a node failing versus stopping on its own, right?

On the naming, this is the Docker Compose naming:

restart_policy

restart_policy configures if and how to restart containers when they exit. If restart_policy is not set, Compose considers the restart field set by the service configuration.

condition. When set to:
  • none, containers are not automatically restarted regardless of the exit status.
  • on-failure, the container is restarted if it exits due to an error, which manifests as a non-zero exit code.
  • any (default), containers are restarted regardless of the exit status.

See: https://github.com/compose-spec/compose-spec/blob/main/deploy.md#restart_policy

Could be neat to copy them. :)

@haixuanTao
Collaborator

Actually, after reading more documentation, I think the current policy follows Kubernetes and systemctl more closely, which makes sense, so we can keep it as is!

@haixuanTao
Collaborator

It could be nice to have better error logging: in my test, we don't see the error log appear before the node restarts, but it's probably WIP.

@haixuanTao
Collaborator

Very excited to merge this PR, as we will be able to resume working on "reloading" and "hot-reloading", but this time for custom nodes and graphs.

@phil-opp
Collaborator Author

I don't see the difference between on-failure and always. I guess it has something to do with a node failing versus stopping on its own, right?

The difference is the exit code. on-failure only restarts when the node exits with a non-zero exit code, indicating that it was an abnormal exit. An exit with code 0 is considered a successful exit, meaning that the node decided that it was done and that there is nothing else for it to do. With on-failure, Dora then treats the node as stopped and closes all of its outputs. With restart-policy: always, Dora also restarts the node after a successful exit, which can be useful if the node is expected to keep running indefinitely.
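
The rules above, together with the guards from the PR description (no restart after a manual stop or once all of the node's inputs are closed), can be sketched as a small decision function. This is only an illustration in Python, not the daemon's actual code: `RestartPolicy`, `should_restart`, and the `NEVER` variant name are assumptions.

```python
from enum import Enum


class RestartPolicy(Enum):
    NEVER = "never"            # assumed name for "no restart policy set"
    ON_FAILURE = "on-failure"  # restart only after a non-zero exit code
    ALWAYS = "always"          # restart after any exit, including code 0


def should_restart(
    policy: RestartPolicy,
    exit_code: int,
    manually_stopped: bool = False,
    all_inputs_closed: bool = False,
) -> bool:
    """Decide whether a node that just exited should be spawned again."""
    # Dataflows must still be able to finish normally, so these guards
    # take precedence over every policy.
    if manually_stopped or all_inputs_closed:
        return False
    if policy is RestartPolicy.ALWAYS:
        return True
    if policy is RestartPolicy.ON_FAILURE:
        return exit_code != 0
    return False


print(should_restart(RestartPolicy.ON_FAILURE, exit_code=0))
print(should_restart(RestartPolicy.ALWAYS, exit_code=0))
```

The two calls differ only in the policy: a clean exit stops the node under on-failure but triggers a restart under always.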

@phil-opp
Collaborator Author

It could be nice to have better error logging: in my test, we don't see the error log appear before the node restarts, but it's probably WIP.

Could you give some more details on that? The error messages and the node failure should be reported as usual, so it should be visible in the logs.

@haixuanTao
Collaborator

In the following example:

~/D/w/d/e/python-log ❯❯❯ dora run dataflow.yaml --uv
2025-12-30T14:14:03.252219Z  INFO dora_core::descriptor::validate: skipping path check for node with build command
2025-12-30T14:14:03.252245Z  INFO dora_core::descriptor::validate: skipping path check for node with build command
2025-12-30T14:14:03.252408Z  INFO zenoh::net::runtime: Using ZID: c53c0f992598e35924d9088a4eb38716
2025-12-30T14:14:03.253212Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[2a01:cb08:67:900:1806:329:6bcf:5ea]:63504
2025-12-30T14:14:03.253222Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[2a01:cb08:67:900:c47:12a2:8dd0:8d00]:63504
2025-12-30T14:14:03.253224Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::1]:63504
2025-12-30T14:14:03.253225Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::1c95:bf60:48f8:e4ca]:63504
2025-12-30T14:14:03.253227Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::270f:606e:f50e:5599]:63504
2025-12-30T14:14:03.253228Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::fe6e:ef27:c7ff:573c]:63504
2025-12-30T14:14:03.253229Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::ce81:b1c:bd2c:69e]:63504
2025-12-30T14:14:03.253261Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::102b:5c43:ecda:af33]:63504
2025-12-30T14:14:03.253271Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::3412:6aff:fef8:619b]:63504
2025-12-30T14:14:03.253272Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::3412:6aff:fef8:619b]:63504
2025-12-30T14:14:03.253274Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::e293:2bbc:9833:63f2]:63504
2025-12-30T14:14:03.253276Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/[fe80::5208:6a08:1aa0:e35d]:63504
2025-12-30T14:14:03.253277Z  INFO zenoh::net::runtime::orchestrator: Zenoh can be reached at: tcp/192.168.1.28:63504
2025-12-30T14:14:03.253354Z  INFO zenoh::net::runtime::orchestrator: zenohd listening scout messages on 224.0.0.224:7446
15:14:03 DEBUG   receive_data_with_sleep: daemon::spawner  spawning node
15:14:03 DEBUG   send_data: daemon::spawner  spawning node
15:14:03 INFO    receive_data_with_sleep: spawner  spawning: uv run python -u /Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py
15:14:03 INFO    send_data: spawner  spawning: uv run python -u /Users/xaviertao/Documents/work/dora/examples/python-log/send_data.py
15:14:03 INFO    dora daemon  finished building nodes, spawning...
15:14:03 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:03 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61459
15:14:03 INFO    send_data: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:03 DEBUG   send_data: spawner  spawned node with pid 61462
15:14:03 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:03 INFO    receive_data_with_sleep: daemon  node is ready
15:14:03 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:03 INFO    send_data: daemon  node is ready
15:14:03 INFO    daemon  all nodes are ready, starting dataflow
15:14:03 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:03 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:03 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:03 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:03 stdout  receive_data_with_sleep:      main()
15:14:03 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:03 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:03 stdout  receive_data_with_sleep:             ^^^^^
15:14:03 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:03 stdout  receive_data_with_sleep:  
15:14:03 stdout  receive_data_with_sleep:  
15:14:03 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:03 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:03 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61467
15:14:03 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:03 INFO    receive_data_with_sleep: daemon  node is ready
15:14:03 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:03 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:03 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:03 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:03 stdout  receive_data_with_sleep:      main()
15:14:03 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:03 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:03 stdout  receive_data_with_sleep:             ^^^^^
15:14:03 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:03 stdout  receive_data_with_sleep:  
15:14:03 stdout  receive_data_with_sleep:  
15:14:03 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:03 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:03 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61471
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61475
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61479
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61485
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61489
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61493
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61497
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61501
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61505
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 DEBUG   receive_data_with_sleep: daemon  skipping CloseOutputs because node might restart
15:14:04 DEBUG   receive_data_with_sleep: daemon  keeping outputs open because node might restart
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 WARN    receive_data_with_sleep: daemon  restarting node after failure
15:14:04 INFO    receive_data_with_sleep: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
15:14:04 DEBUG   receive_data_with_sleep: spawner  spawned node with pid 61509
15:14:04 INFO    opentelemetry  Global meter provider is set. Meters can now be created using global::meter() or global::meter_with_scope().
15:14:04 INFO    receive_data_with_sleep: daemon  node is ready
15:14:04 stdout  send_data:  
15:14:04 stdout  send_data:  
15:14:04 DEBUG   send_data: daemon  handling node stop with exit status Success
15:14:04 INFO    send_data: daemon  send_data finished successfully
15:14:04 stdout  receive_data_with_sleep:  Traceback (most recent call last):
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
15:14:04 stdout  receive_data_with_sleep:      main()
15:14:04 stdout  receive_data_with_sleep:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
15:14:04 stdout  receive_data_with_sleep:      assert False, "This is an assertion error"
15:14:04 stdout  receive_data_with_sleep:             ^^^^^
15:14:04 stdout  receive_data_with_sleep:  AssertionError: This is an assertion error
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 stdout  receive_data_with_sleep:  
15:14:04 INFO    receive_data_with_sleep: daemon  not restarting node because all inputs are already closed
15:14:04 DEBUG   receive_data_with_sleep: daemon  handling node stop with exit status ExitCode(1)
15:14:04 ERROR   receive_data_with_sleep: daemon  exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]AssertionError: This is an assertion error
Traceback (most recent call last):
  File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
    main()
  File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
    assert False, "This is an assertion error"
           ^^^^^
AssertionError: This is an assertion error
---------------------------------------------------------------------------------

15:14:04 INFO    daemon  dataflow finished on machine `01d955ea-4e1b-4391-830a-03d08f5f1081`
2025-12-30T14:14:04.902506Z  INFO run_inner: dora_daemon: exiting daemon because all required dataflows are finished self.daemon_id=DaemonId { machine_id: None, uuid: 01d955ea-4e1b-4391-830a-03d08f5f1081 }
2025-12-30T14:14:04.902538Z  INFO run_inner: zenoh::api::session: close session zid=c53c0f992598e35924d9088a4eb38716 self.daemon_id=DaemonId { machine_id: None, uuid: 01d955ea-4e1b-4391-830a-03d08f5f1081 }


[ERROR]
Dataflow failed:

Node `receive_data_with_sleep` failed: exited with code 1 with stderr output:
---------------------------------------------------------------------------------
[...]AssertionError: This is an assertion error
Traceback (most recent call last):
  File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 23, in <module>
    main()
  File "/Users/xaviertao/Documents/work/dora/examples/python-log/receive_data.py", line 10, in main
    assert False, "This is an assertion error"
           ^^^^^
AssertionError: This is an assertion error
---------------------------------------------------------------------------------



Location:
    binaries/cli/src/common.rs:33:17

I think the error message only appeared once when I would have expected the daemon to raise it each time the node failed.

@haixuanTao
Collaborator

I can double check why

@phil-opp
Collaborator Author

phil-opp commented Jan 7, 2026

Thanks for clarifying! I pushed 724dc7d to log the node output as before, i.e. print the node error to the logs even if it's going to be restarted.

@phil-opp phil-opp enabled auto-merge January 7, 2026 16:16
@phil-opp phil-opp merged commit 9a550aa into main Jan 7, 2026
27 checks passed
@phil-opp phil-opp deleted the restart-failed-nodes branch January 7, 2026 16:28
@haixuanTao
Collaborator

That's great thanks!

@haixuanTao
Collaborator

I think as a follow-up PR we could try using regexes to detect Python errors, Rust panics, or Rust eyre errors and format them in a way that is easy to debug.

We could then also avoid logging stderr twice:

17:49:49 stdout  send_data:  Traceback (most recent call last):
17:49:49 DEBUG   send_data: daemon  skipping CloseOutputs because node might restart
17:49:49 DEBUG   send_data: daemon  keeping outputs open because node might restart
17:49:49 WARN    receive_data_with_sleep: dora  THIS IS A WARNING
17:49:49 stdout  send_data:    File "/Users/xaviertao/Documents/work/dora/examples/python-log/send_data.py", line 23, in <module>
17:49:49 stdout  send_data:      assert False
17:49:49 stdout  send_data:             ^^^^^
17:49:49 stdout  send_data:  AssertionError
17:49:49 stdout  send_data:  
17:49:49 stdout  send_data:  
17:49:49 WARN    send_data: daemon  restarting node after failure
17:49:49 DEBUG   send_data: daemon  handling node stop with exit status ExitCode(1) (restart: true)
17:49:49 INFO    send_data: spawner  spawning `uv` in `/Users/xaviertao/Documents/work/dora/examples/python-log`
17:49:49 ERROR   send_data: daemon  exited with code 1 with stderr output:
---------------------------------------------------------------------------------
Sent data: 30304092390791
Traceback (most recent call last):
  File "/Users/xaviertao/Documents/work/dora/examples/python-log/send_data.py", line 23, in <module>
    assert False
           ^^^^^
AssertionError
---------------------------------------------------------------------------------
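
As a rough illustration of the regex idea from this comment, the sketch below classifies a captured stderr string by error kind. Nothing here comes from Dora: `ERROR_PATTERNS` and `classify_stderr` are hypothetical names, the patterns are assumptions about typical output (in particular, treating a leading `Error:` line as an eyre report is a guess), and the sample is a shortened version of the traceback shown above.

```python
import re

# Illustrative patterns only; real output varies between toolchain versions.
ERROR_PATTERNS = [
    ("python-traceback", re.compile(r"^Traceback \(most recent call last\):", re.MULTILINE)),
    ("rust-panic", re.compile(r"^thread '.*?'.*? panicked at ", re.MULTILINE)),
    # Assumption: an error returned from `main` is printed with an "Error:" prefix.
    ("rust-eyre", re.compile(r"^Error:\s", re.MULTILINE)),
]


def classify_stderr(stderr: str) -> str:
    """Return the first matching error kind, or "unknown" if nothing matches."""
    for kind, pattern in ERROR_PATTERNS:
        if pattern.search(stderr):
            return kind
    return "unknown"


# Shortened sample based on the send_data traceback above (file path abbreviated).
sample = (
    "Sent data: 30304092390791\n"
    "Traceback (most recent call last):\n"
    '  File "send_data.py", line 23, in <module>\n'
    "    assert False\n"
    "AssertionError\n"
)
print(classify_stderr(sample))
```

Once the kind is known, the daemon could print a single formatted report per failure rather than both the streamed lines and the collected stderr block.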
