Add support for node integration testing #1163

Merged
phil-opp merged 47 commits into main from node-integration-testing
Nov 24, 2025
Conversation

@phil-opp (Collaborator) commented Oct 15, 2025

How it works

  • Set the DORA_TEST_WITH_INPUTS env variable with the path to your input JSON file
  • (Optional) Set the DORA_TEST_WRITE_OUTPUTS_TO env variable with the path where the outputs should be written. If not set, dora will write an outputs.jsonl file next to the given inputs file
  • Start the node executable/script

The node will be run as usual, but its event channel will be filled from the given inputs JSON file. No connection to a dora daemon will be made.

Input JSON file format example:

{
     // ID of the node
    "id": "foo",
    // defines the events that the node should receive
    "events": [
        {
            // specifies when the event arrives (seconds since node start)
            "time_offset_secs": 0.7,
            // type of the event (supported types are `Input`, `Stop`, `InputClosed`, `AllInputsClosed`)
            "type": "Input",
            // input ID
            "id": "tick"
            // optional: `data` field with input data
        },
        {
            "time_offset_secs": 0.9,
            "type": "Input",
            "id": "tick"
        },
        {
            "time_offset_secs": 1.2,
            "type": "Stop"
        }
    ]
    // other supported fields: name, description, args, env, outputs, inputs, send_stdout_as (they all behave like in dataflow.yaml)
}
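Input files in this format can also be generated programmatically, which is handy for creating reproducible test data. A minimal sketch in plain Python (independent of the node language); the field names (`id`, `events`, `time_offset_secs`, `type`) come from the example above, the concrete values are illustrative:

```python
import json

# Build an input description matching the documented format. Field names
# are taken from the example above; the concrete values are illustrative.
test_input = {
    "id": "foo",
    "events": [
        {"time_offset_secs": 0.7, "type": "Input", "id": "tick"},
        {"time_offset_secs": 0.9, "type": "Input", "id": "tick"},
        {"time_offset_secs": 1.2, "type": "Stop"},
    ],
}

# Write the file that DORA_TEST_WITH_INPUTS would point at.
with open("inputs.json", "w") as f:
    json.dump(test_input, f, indent=4)
```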

Output JSON file format example:

{"id":"random","data":9267023440904143729,"time_offset_secs":0.700793541}
{"id":"random","data":5753749540645363621,"time_offset_secs":0.900897584}

TODO:

  • Documentation
    • API docs
    • dora-rs.ai docs
  • add some tests for our examples and use them on our CI
  • take a look at arrow_integration_test JSON format -> this might be better suited than our custom input JSON format
  • add option (via env variable) to write out received inputs as inputs.json files during normal dataflow operation -> to make creating files with complex input data easier
  • add option (via env variable) to omit time offsets in output formats -> to make them diff-able with expected outputs (the time offsets are a bit different on each run)
  • use ArrowTestUnwrap format for outputs.jsonl (not currently possible because ArrowJsonBatch::from_batch is incomplete, see apache/arrow-rs#8684)
    • instead: Include data type in output JSON file
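The diff-ability item above (time offsets differ slightly on each run) can be emulated with a few lines of post-processing. A plain-Python sketch that assumes only the JSONL format shown earlier:

```python
import json

def strip_time_offsets(jsonl_text: str) -> str:
    """Drop the non-deterministic time_offset_secs field from each line
    of an outputs.jsonl file so two runs can be diffed directly."""
    cleaned = []
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        record.pop("time_offset_secs", None)
        cleaned.append(json.dumps(record, separators=(",", ":")))
    return "\n".join(cleaned)

sample = '{"id":"random","data":42,"time_offset_secs":0.700793541}'
print(strip_time_offsets(sample))  # → {"id":"random","data":42}
```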

Add an optional `data_format` field that specifies the format of the `data` field. It defaults to deriving the schema from the given JSON object and converting it to the closest Arrow representation. The `ArrowTest` and `ArrowTestUnwrap` formats expect the `data` field to follow the Arrow integration test data format. The `ArrowTestUnwrap` format unwraps the first column of the deserialized RecordBatch to make other Arrow types representable (i.e. not just StructArrays).
The arrow integration test format crate panics in certain situations, which leads to a closed integration test channel. We want to panic on the sending side too in this case to avoid endless loops.
Useful for diffing the file against an expected file (as time offsets are not deterministic).
@phil-opp (Collaborator, Author)

I opened apache/arrow-rs#8737 to add support for binary decoding to arrow-json.

@phil-opp phil-opp marked this pull request as ready for review November 19, 2025 14:53
@phil-opp (Collaborator, Author)

I also added detailed docs in 618689d

@haixuanTao (Collaborator)

There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd

I find the test examples very difficult to decipher. Could we make them easier to understand?

@haixuanTao (Collaborator)

Also could you put somewhere within the test folder the way you generated the test data?

@phil-opp (Collaborator, Author)

> There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd
>
> I find the test examples very difficult to decipher. Could we make them easier to understand?

Which part do you find hard to decipher?

We just call two commands for each test. One build command and one run command that sets some env variables. So basically:

  • cargo build -p rust-dataflow-example-node
  • DORA_TEST_WITH_INPUTS=tests/sample-inputs/inputs-rust-node.json DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 DORA_TEST_WRITE_OUTPUTS_TO=<tempfile> target/debug/rust-dataflow-example-node

To make this extensible, I introduced a test function that takes the variable parameters as input.

The other code in that file is necessary to account for the differences between Windows and Linux (e.g. exe extension, line endings) and make it work with the CARGO_TARGET_DIR env variable that we set on CI for Windows.

I'm not sure how we can simplify this more?

@phil-opp (Collaborator, Author)

> Also could you put somewhere within the test folder the way you generated the test data?

Done in 554ae3a

@haixuanTao (Collaborator) commented Nov 21, 2025

>> There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd
>>
>> I find the test examples very difficult to decipher. Could we make them easier to understand?
>
> Which part do you find hard to decipher?
>
> We just call two commands for each test. One build command and one run command that sets some env variables. So basically:
>
>   • cargo build -p rust-dataflow-example-node
>   • DORA_TEST_WITH_INPUTS=tests/sample-inputs/inputs-rust-node.json DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 DORA_TEST_WRITE_OUTPUTS_TO=<tempfile> target/debug/rust-dataflow-example-node
>
> To make this extensible, I introduced a test function that takes the variable parameters as input.
>
> The other code in that file is necessary to account for the differences between Windows and Linux (e.g. exe extension, line endings) and make it work with the CARGO_TARGET_DIR env variable that we set on CI for Windows.
>
> I'm not sure how we can simplify this more?

I don't think that, as a user of dora, I would be able to reproduce the node test on my own codebase.

Would it be possible to make the node integration test something like this, within the main.rs of the node:

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_integration() {
        let input_json = ...
        let output_json = ...
        
        // Maybe add them to env variable 
        // ... 

        let result = main();

        assert_eq! ...
    }
}

So that it's easier to read and also uses native testing that can be invoked with: cargo test -p rust-dataflow-example-node

P.S: Edited for better understanding

@phil-opp (Collaborator, Author)

I don't think that there is a way around using Command to spawn a subprocess. Setting an env variable for the current process is an unsafe operation and I don't think that we should encourage users to use unsafe in tests. Also, the main function does not exist when building a crate in test mode (it is replaced by a test handling main function).

What we can do is use cargo run instead of a separate build + manual run. This way we can also avoid the manual exe extension handling and specifying the target/debug/x path.

Moving the test to the respective crate is also possible of course, we just have to duplicate the test function then. But I agree that it makes it easier to copy a test for your own node.

I'll look into it, thanks for the suggestion!

@haixuanTao (Collaborator) commented Nov 21, 2025

I mean I genuinely think that using cargo test is the optimal approach as it's well integrated and native within the rust ecosystem.

I find using cargo run to be less readable and harder to use overall as we can see within the current run.rs file.

Would there be a way to pass mock input and output data into the node, whether via a compile-time env variable, something like https://doc.rust-lang.org/cargo/reference/config.html#env, configuration within Cargo, or basically just a predefined mock_input.json / mock_output.json name that we can use?

p.s: Edit env variable to compile time env variable

@phil-opp (Collaborator, Author)

> I mean I genuinely think that using cargo test is the optimal approach as it's well integrated and native within the rust ecosystem.

We're using cargo test, we just need to invoke cargo run within it.

The difference to standard cargo tests is that we want to test a binary, not a library.

> Would there be a way to pass mock input and output data into the node, whether via a compile-time env variable, something like https://doc.rust-lang.org/cargo/reference/config.html#env, configuration within Cargo, or basically just a predefined mock_input.json / mock_output.json name that we can use?

One of the goals of this PR is to make every node integration testable, without recompiling it. This way, you can test the actual main function and it will behave exactly as during a dataflow run.

We can of course also add some DoraNode::init_with_test_inputs function, but this way the main function is not part of the test anymore.

@haixuanTao (Collaborator)

> We're using cargo test, we just need to invoke cargo run within it.
>
> The difference to standard cargo tests is that we want to test a binary, not a library.

But wouldn't this be way harder to grok, more error-prone, and harder to develop with than just testing it as a library?

Especially for people new to Rust and cargo.

> One of the goals of this PR is to make every node integration testable, without recompiling it. This way, you can test the actual main function and it will behave exactly as during a dataflow run.

Yeah, but can't you just call the whole main() function within a main.rs test section? Just like any other test in rust?

@phil-opp (Collaborator, Author) commented Nov 21, 2025

I pushed 05eed12 to simplify the test run function to use cargo run.

I also added a DoraNode::init_testing function in 6e138e3, which can be used to write library-style tests. It returns a custom DoraNode and EventStream, so you cannot run the main function directly. However, you can move most of the main function to a run(node, event_stream) function and then call that.

I used the new init_testing feature in d274190 to add some internal testing to the rust-dataflow-example-node crate. I also added a self-contained test_sample_output function that shows how to write test cases with loading/writing external files.

@haixuanTao (Collaborator)

I'm not against the proposed changes, but this seems to require users to learn about init_testing, which does not seem easy to understand.

I think that what we're trying to do is really close to what cargo nextest does natively: https://github.com/nextest-rs/nextest?tab=readme-ov-file . Especially the section on altering the environment within tests: https://nexte.st/docs/configuration/env-vars/#altering-the-environment-within-tests

I would be in favor of, instead of rewriting our tests, using nextest with a small bin test section that is native in Rust:

// test main
#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_status_node_main_1() {
        set_var(DORA_TEST_WITH_INPUTS_1); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }
    
    #[test]
    fn test_status_node_main_2() {
        set_var(DORA_TEST_WITH_INPUTS_2); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }
}

And then:

cargo nextest run -p rust-dataflow-example-node

@phil-opp (Collaborator, Author)

So you think it's easier for users to learn about and understand a third-party testing tool than to call a function? This seems unlikely to me.

In general, one goal of Dora is to support the standard build and test frameworks for each language, so that users don't need to learn extra tools. Removing support for cargo test would be directly against that goal.

I also don't understand how set_var can become safe, just by using cargo nextest. As far as I understand it, nextest is just an alternative test runner that is compatible with standard Rust tests. So the test functions would still need to call the unsafe std::env::set_var function, right?

The docs you linked just seem to say that set_var won't invoke undefined behavior under the right circumstances when using cargo nextest. That seems like a very weak guarantee. To me it sounds like there is still undefined behavior if you accidentally run cargo test instead of cargo nextest, which seems quite dangerous.

@phil-opp (Collaborator, Author)

I pushed another improvement in 68723bb. There is now a setup_integration_testing function that will store the testing config in thread local data. The init_from_env function checks for this thread-local data and enters integration testing mode if set.

This means that you can now test the main function like this:

#[test]
fn test_main_function() -> eyre::Result<()> {
    let inputs = dora_node_api::integration_testing::TestingInput::FromJsonFile(
        "../../../tests/sample-inputs/inputs-rust-node.json".into(),
    );
    let mut output_file = Arc::new(tempfile::tempfile()?);
    let testing_output =
        dora_node_api::integration_testing::TestingOutput::ToWriter(Box::new(output_file.clone()));
    let options = TestingOptions {
        skip_output_time_offsets: true,
    };

    integration_testing::setup_integration_testing(inputs, testing_output, options);

    crate::main()?;

    let mut output = String::new();
    output_file.seek(std::io::SeekFrom::Start(0))?;
    output_file.read_to_string(&mut output)?;
    let expected =
        std::fs::read_to_string("../../../tests/sample-inputs/expected-outputs-rust-node.jsonl")?;

    assert_eq!(output, expected.replace("\r\n", "\n")); // normalize line endings

    Ok(())
}

@haixuanTao (Collaborator)

> I pushed another improvement in 68723bb. There is now a setup_integration_testing function that will store the testing config in thread local data. The init_from_env function checks for this thread-local data and enters integration testing mode if set.
>
> This means that you can now test the main function like this:
>
> #[test]
> fn test_main_function() -> eyre::Result<()> {
>     let inputs = dora_node_api::integration_testing::TestingInput::FromJsonFile(
>         "../../../tests/sample-inputs/inputs-rust-node.json".into(),
>     );
>     let mut output_file = Arc::new(tempfile::tempfile()?);
>     let testing_output =
>         dora_node_api::integration_testing::TestingOutput::ToWriter(Box::new(output_file.clone()));
>     let options = TestingOptions {
>         skip_output_time_offsets: true,
>     };
>
>     integration_testing::setup_integration_testing(inputs, testing_output, options);
>
>     crate::main()?;
>
>     let mut output = String::new();
>     output_file.seek(std::io::SeekFrom::Start(0))?;
>     output_file.read_to_string(&mut output)?;
>     let expected =
>         std::fs::read_to_string("../../../tests/sample-inputs/expected-outputs-rust-node.jsonl")?;
>
>     assert_eq!(output, expected.replace("\r\n", "\n")); // normalize line endings
>
>     Ok(())
> }

OK, this works for me.

phil-opp merged commit da26f6e into main Nov 24, 2025
75 of 76 checks passed
phil-opp deleted the node-integration-testing branch November 24, 2025 10:24