Skip to content

feat: make path and build field in dataflow file more flexible#1098

Closed
drindr wants to merge 0 commit intodora-rs:mainfrom
drindr:main
Closed

feat: make path and build field in dataflow file more flexible#1098
drindr wants to merge 0 commit intodora-rs:mainfrom
drindr:main

Conversation

@drindr
Copy link
Copy Markdown
Contributor

@drindr drindr commented Aug 4, 2025

  • support environment variable in the path like $HOME/a/bc
  • support quoted argument in the build command like bash -c "..."

@Hennzau
Copy link
Copy Markdown
Collaborator

Hennzau commented Aug 6, 2025

Hi, quick question, which env is passed to the path command in case of distributed dataflow? The CLI's, the Daemon's? which computer?

@drindr
Copy link
Copy Markdown
Contributor Author

drindr commented Aug 6, 2025

Hi, quick question, which env is passed to the path command in case of distributed dataflow? The CLI's, the Daemon's? which computer?

I'm not so familiar with the distributed feature yet. The env is defined in the dataflow.yml. As it is deserialised with the crate serde-with-expand-env, it can expand the env variables with cli's env. So that the original source of the env is where the dataflow.yml is deserialized. Both the start command and daemon command can parse the yaml file.

In addition, it cannot use the env variables from the cli directly. The variables must "be declared” in the yaml file.
available env variable:

...
  env:
    DORA_HOME: $DORA
    WITH_DEFAULT: ${DEFAULT: FOO}
    BAR: /path/to/bar
...

Those variables will be expanded when execute commands like DORA=/path/to/dora dora start ...

@Hennzau
Copy link
Copy Markdown
Collaborator

Hennzau commented Aug 6, 2025

Ok that makes sense!

Copy link
Copy Markdown
Collaborator

@phil-opp phil-opp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I wonder why we only support quoted parts in build fields, but not in args? And maybe we should also support env variable in build fields for consistency?

@phil-opp
Copy link
Copy Markdown
Collaborator

Hi, quick question, which env is passed to the path command in case of distributed dataflow? The CLI's, the Daemon's? which computer?

As it is deserialised with the crate serde-with-expand-env, it can expand the env variables with cli's env. So that the original source of the env is where the dataflow.yml is deserialized.

Thanks for raising this! I think we should discuss whether we want to change this behavior. Right now, the dataflow is deserialized in the CLI, so all env variables are expanded there. However, this might not want we want if the binaries are run on a different machine. For example, consider the $HOME variable, which might be something like /home/my-personal-user-name on your CLI machine. This folder might not even exist on the target machine....

In addition, it cannot use the env variables from the cli directly. The variables must "be declared” in the yaml file.
available env variable:

This limitation might be a bit annoying, but it's probably a good starting point until we decide on where to expand env variables.

@drindr
Copy link
Copy Markdown
Contributor Author

drindr commented Aug 14, 2025

Thanks for your response.

  • In my understanding, in current implementation, the dataflow should be loaded on the host if the host runs the node, so do as local machine. So it is reasonable to extend them with CLI's env. Or are there any plan for loading dataflow from remote machine, or I missed something have achieved out.

  • In addition, it cannot use the env variables from the cli directly. The variables must "be declared” in the yaml file.
    available env variable:

    This limitation might be a bit annoying, but it's probably a good starting point until we decide on where to expand env variables.

    I think it is not a limitation. It isolates the node from complex CLI env. It makes the env's expansion explicit.

  • off-topic: maybe a method like the dolar-expr in XML launch of ROS or just a script programming language would make launching behaviour flexible

@drindr
Copy link
Copy Markdown
Contributor Author

drindr commented Aug 17, 2025

Oh, I got it.
With the _unstable_deploy, nodes can be launched remotely.

If so, I have an idea is that every daemon maintains a parameter service and extend the variable like ${VAR_A} with CLI and parse ${REMOTE: VAR_A} with daemon's env.

What about your opinion? @phil-opp
If you think it is available, I can start to implement it.

@Hennzau
Copy link
Copy Markdown
Collaborator

Hennzau commented Aug 17, 2025

Oh, I got it. With the _unstable_deploy, nodes can be launched remotely.

If so, I have an idea is that every daemon maintains a parameter service and extend the variable like ${VAR_A} with CLI and parse ${REMOTE: VAR_A} with daemon's env.

What about your opinion? @phil-opp If you think it is available, I can start to implement it.

I thnk this is a good idea! maybe for another PR?

@phil-opp
Copy link
Copy Markdown
Collaborator

I like the idea too! Could you clarify the syntax that you propose?

These are the different variants that you mentioned in this thread:

...
env:
DORA_HOME: $DORA
WITH_DEFAULT: ${DEFAULT: FOO}
BAR: /path/to/bar
BAZ: ${REMOTE: VAR_A}
...

You propose that $DORA refers to a local environment variable (i.e. the machine where the CLI is used), right? The ${REMOTE: VAR_A} is refering to the env variable on the target machine (i.e. expanded by the daemon), right? What is the ${DEFAULT: FOO} syntax doing?

@phil-opp
Copy link
Copy Markdown
Collaborator

maybe for another PR?

Yes, it's probably a good idea to do smaller PRs. The quoted arguments change of this PR seems quite isolated and unrelated to the env expansion. Could you move these changes to a separate PR @drindr ? And apply the same quoting logic to the args field there:

cmd.args(args.split_ascii_whitespace());

Thanks!

@drindr
Copy link
Copy Markdown
Contributor Author

drindr commented Aug 19, 2025

I like the idea too! Could you clarify the syntax that you propose?

These are the different variants that you mentioned in this thread:

... env: DORA_HOME: $DORA WITH_DEFAULT: ${DEFAULT: FOO} BAR: /path/to/bar BAZ: ${REMOTE: VAR_A} ...

You propose that $DORA refers to a local environment variable (i.e. the machine where the CLI is used), right? The ${REMOTE: VAR_A} is refering to the env variable on the target machine (i.e. expanded by the daemon), right? What is the ${DEFAULT: FOO} syntax doing?

The syntax can be ${CLI: VAR_A} for CLI, which CLI: can be omitted, and ${DAEMON: VAR_B} for DAEMON.
Maybe a default value can be provided by the assign expression ${VAR_A = DEFAULT_VALUE}

@phil-opp
Copy link
Copy Markdown
Collaborator

Thanks for clarifying, this sounds good to me in general. Instead of CLI: and DAEMON: we could also use LOCAL: and REMOTE:, maybe that's easier to understand.

@haixuanTao What do you think about this this change?

@drindr drindr closed this Aug 20, 2025
@drindr drindr reopened this Aug 20, 2025
@drindr drindr marked this pull request as draft August 20, 2025 03:56
@drindr
Copy link
Copy Markdown
Contributor Author

drindr commented Aug 20, 2025

The implementation of build command with quoted arguments have been modified and submit on #1103

@haixuanTao
Copy link
Copy Markdown
Collaborator

Thanks for clarifying, this sounds good to me in general. Instead of CLI: and DAEMON: we could also use LOCAL: and REMOTE:, maybe that's easier to understand.

@haixuanTao What do you think about this this change?

To be honest, I'm not really a fan of introducing a new concept of environment variable resolving while every one knows linux env variable resolving. Here are the reason why:

  • We will need to implement it. And I can see so many edge cases, that we will have to handle: None env variable value, Empty string env variable value, naming collusion ...
  • This introduce a new surface for input errors that can be very hard to debug.

I would very much favor using runtime environment variable within each node code ( including build ) for daemon env variable and, use configured env within the dataflow for cli based env variable.

I don't think that dora should be in charge of daemon -> to -> node env variable passing.

Very sorry.

Open for discussion still!

@phil-opp
Copy link
Copy Markdown
Collaborator

I would very much favor using runtime environment variable within each node code ( including build ) for daemon env variable and, use configured env within the dataflow for cli based env variable.

Could you clarify this a bit more? I don't understand what you mean with this.

@phil-opp
Copy link
Copy Markdown
Collaborator

There are two proposals in this PR:

  • expand specified env variables in the path so that you can do $HOME/a/bc
  • a special syntax for expanding env variables on the remote machine
    • otherwise things like $HOME/a/bc would probably not work in distributed dataflows, as the remote $HOME folder can be different

Are you against both of these things? Or just the second point?

@haixuanTao
Copy link
Copy Markdown
Collaborator

haixuanTao commented Aug 20, 2025

expand specified env variables in the path so that you can do $HOME/a/bc

This would be okay for me, but would resolve in daemon env variable indeed and not cli env variable.

I would expect that cli env variable would only apply to the env section of the dataflow.

a special syntax for expanding env variables on the remote machine

This, I'm not okay with because I think it's going to be hard to implement effectively.

I would very much favor using runtime environment variable within each node code ( including build ) for daemon env variable and, use configured env within the dataflow for cli based env variable.

The idea is that let say:

  • The daemon has a unique env variable named IP=192.168.1.2
  • We need to access this variable to be able to build or run a node within dora
  • It's possible to use: std::env::var("IP") at runtime or compile time and this will always resolve as the daemon environment variable so there is no need to use dora for passing daemon related env variable.
  • if we need to pass a cli related env variable. I would expect the user to use the env variable section of the dataflow.

As a matter of fact, we almost will always do a worst job at passing daemon -> node env variable.

@phil-opp
Copy link
Copy Markdown
Collaborator

Thanks for clarifying! This seems reasonable to me.

So let's do it this way:

  • Environment variables referenced in the env section are expanded by the CLI
    • e.g. RUST_LOG: $RUST_LOG
  • Environment variables specified in the path field are expanded by the daemon

Does that sound good to you? Do we also want to support expanding env variables in build commands?

@haixuanTao
Copy link
Copy Markdown
Collaborator

Yes, I think this is the current implementation.

Do we also want to support expanding env variables in build commands?

Yes I think that this is a good idea with daemon.

I think one thing which is confusing is that sometimes user don't do RUST_LOG: $RUST_LOG and then expect rust_log be the one from the cli and is the one from when the daemon is spawned.

@drindr
Copy link
Copy Markdown
Contributor Author

drindr commented Aug 20, 2025

The environment variable expansion for path field has been re-implemented in #1104.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants