Add support for node integration testing #1163

Merged
phil-opp merged 47 commits into main from node-integration-testing
Nov 24, 2025
Conversation

@phil-opp (Collaborator) commented Oct 15, 2025

How it works

  • Set the DORA_TEST_WITH_INPUTS env variable with the path to your input JSON file
  • (Optional) Set the DORA_TEST_WRITE_OUTPUTS_TO env variable with the path where the outputs should be written. If not set, dora will write an outputs.jsonl file next to the given inputs file
  • Start the node executable/script

The node will be run as usual, but its event channel will be filled from the given inputs JSON file. No connection to a dora daemon will be made.

Input JSON file format example:

{
     // ID of the node
    "id": "foo",
    // defines the events that the node should receive
    "events": [
        {
            // specifies when the event arrives (seconds since node start)
            "time_offset_secs": 0.7,
            // type of the event (supported types are `Input`, `Stop`, `InputClosed`, `AllInputsClosed`)
            "type": "Input",
            // input ID
            "id": "tick"
            // optional: `data` field with input data
        },
        {
            "time_offset_secs": 0.9,
            "type": "Input",
            "id": "tick"
        },
        {
            "time_offset_secs": 1.2,
            "type": "Stop"
        }
    ]
    // other supported fields: name, description, args, env, outputs, inputs, send_stdout_as (they all behave like in dataflow.yaml)
}
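Input files in this format can also be generated programmatically, which is handy for creating reproducible test data. A minimal sketch in plain Python (independent of the node language); the field names (`id`, `events`, `time_offset_secs`, `type`) come from the example above, the concrete values are illustrative:

```python
import json

# Build an input description matching the documented format. Field names
# are taken from the example above; the concrete values are illustrative.
test_input = {
    "id": "foo",
    "events": [
        {"time_offset_secs": 0.7, "type": "Input", "id": "tick"},
        {"time_offset_secs": 0.9, "type": "Input", "id": "tick"},
        {"time_offset_secs": 1.2, "type": "Stop"},
    ],
}

# Write the file that DORA_TEST_WITH_INPUTS would point at.
with open("inputs.json", "w") as f:
    json.dump(test_input, f, indent=4)
```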

Output JSON file format example:

{"id":"random","data":9267023440904143729,"time_offset_secs":0.700793541}
{"id":"random","data":5753749540645363621,"time_offset_secs":0.900897584}

TODO:

  • Documentation
    • API docs
    • dora-rs.ai docs
  • add some tests for our examples and use them on our CI
  • take a look at arrow_integration_test JSON format -> this might be better suited than our custom input JSON format
  • add option (via env variable) to write out received inputs as inputs.json files during normal dataflow operation -> to make creating files with complex input data easier
  • add option (via env variable) to omit time offsets in output formats -> to make them diff-able with expected outputs (the time offsets are a bit different on each run)
  • use ArrowTestUnwrap format for outputs.jsonl (not currently possible because ArrowJsonBatch::from_batch is incomplete, see apache/arrow-rs#8684)
    • instead: Include data type in output JSON file
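The diff-ability item above (time offsets differ slightly on each run) can be emulated with a few lines of post-processing. A plain-Python sketch that assumes only the JSONL format shown earlier:

```python
import json

def strip_time_offsets(jsonl_text: str) -> str:
    """Drop the non-deterministic time_offset_secs field from each line
    of an outputs.jsonl file so two runs can be diffed directly."""
    cleaned = []
    for line in jsonl_text.splitlines():
        record = json.loads(line)
        record.pop("time_offset_secs", None)
        cleaned.append(json.dumps(record, separators=(",", ":")))
    return "\n".join(cleaned)

sample = '{"id":"random","data":42,"time_offset_secs":0.700793541}'
print(strip_time_offsets(sample))  # → {"id":"random","data":42}
```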

Add an optional `data_format` field that specifies the format of the `data` field. It defaults to deriving the schema from the given JSON object and converting it to the closest Arrow representation. The `ArrowTest` and `ArrowTestUnwrap` formats expect the `data` field to follow the Arrow integration test data format. The `ArrowTestUnwrap` format unwraps the first column of the deserialized RecordBatch to make other Arrow types representable (i.e. not just StructArrays).
The arrow integration test format crate panics in certain situations, which leads to a closed integration test channel. We want to panic on the sending side too in this case to avoid endless loops.
Useful for diffing the file against an expected file (as time offsets are not deterministic).
@phil-opp (Collaborator, Author)

I opened apache/arrow-rs#8737 to add support for binary decoding to arrow-json.

@phil-opp phil-opp marked this pull request as ready for review November 19, 2025 14:53
@phil-opp (Collaborator, Author)

I also added detailed docs in 618689d

@haixuanTao (Collaborator)

There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd

I find the test examples very difficult to decipher. Could we make them easier to understand?

@haixuanTao (Collaborator)

Also could you put somewhere within the test folder the way you generated the test data?

@phil-opp (Collaborator, Author)

> There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd
>
> I find the test examples very difficult to decipher. Could we make them easier to understand?

Which part do you find hard to decipher?

We just call two commands for each test. One build command and one run command that sets some env variables. So basically:

  • cargo build -p rust-dataflow-example-node
  • DORA_TEST_WITH_INPUTS=tests/sample-inputs/inputs-rust-node.json DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 DORA_TEST_WRITE_OUTPUTS_TO=<tempfile> target/debug/rust-dataflow-example-node

To make this extensible, I introduced a test function that takes the variable parameters as input.

The other code in that file is necessary to account for the differences between Windows and Linux (e.g. exe extension, line endings) and make it work with the CARGO_TARGET_DIR env variable that we set on CI for Windows.

I'm not sure how we can simplify this more?

@phil-opp (Collaborator, Author)

> Also could you put somewhere within the test folder the way you generated the test data?

Done in 554ae3a

@haixuanTao (Collaborator) commented Nov 21, 2025

>> There are two tests for the rust-dataflow example in https://github.com/dora-rs/dora/pull/1163/files#diff-c1dad61976e06858147411d3a1b21177dc82b8eda35818b881c08a67add4cdfd
>>
>> I find the test examples very difficult to decipher. Could we make them easier to understand?
>
> Which part do you find hard to decipher?
>
> We just call two commands for each test. One build command and one run command that sets some env variables. So basically:
>
>   • cargo build -p rust-dataflow-example-node
>   • DORA_TEST_WITH_INPUTS=tests/sample-inputs/inputs-rust-node.json DORA_TEST_NO_OUTPUT_TIME_OFFSET=1 DORA_TEST_WRITE_OUTPUTS_TO=<tempfile> target/debug/rust-dataflow-example-node
>
> To make this extensible, I introduced a test function that takes the variable parameters as input.
>
> The other code in that file is necessary to account for the differences between Windows and Linux (e.g. exe extension, line endings) and make it work with the CARGO_TARGET_DIR env variable that we set on CI for Windows.
>
> I'm not sure how we can simplify this more?

I don't think that, as a user of dora, I would be able to reproduce the node test on my own codebase.

Would it be possible to make the node integration test something like this, within the main.rs of the node:

#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_integration() {
        let input_json = ...
        let output_json = ...
        
        // Maybe add them to env variable 
        // ... 

        let result = main();

        assert_eq! ...
    }
}

So that it's easier to read and also uses native testing that can be invoked with: cargo test -p rust-dataflow-example-node

P.S: Edited for better understanding

@phil-opp (Collaborator, Author)

I don't think that there is a way around using Command to spawn a subprocess. Setting an env variable for the current process is an unsafe operation and I don't think that we should encourage users to use unsafe in tests. Also, the main function does not exist when building a crate in test mode (it is replaced by a test handling main function).

What we can do is use cargo run instead of a separate build + manual run. This way we can also avoid the manual exe extension handling and specifying the target/debug/x path.

Moving the test to the respective crate is also possible of course, we just have to duplicate the test function then. But I agree that it makes it easier to copy a test for your own node.

I'll look into it, thanks for the suggestion!

@haixuanTao (Collaborator) commented Nov 21, 2025

I mean I genuinely think that using cargo test is the optimal approach as it's well integrated and native within the rust ecosystem.

I find using cargo run to be less readable and harder to use overall as we can see within the current run.rs file.

Would there be a way to pass mock input and output data into the node, whether via a compile-time env variable, something like https://doc.rust-lang.org/cargo/reference/config.html#env, configuration within Cargo, or basically just a predefined mock_input.json / mock_output.json name that we can use?

p.s: Edit env variable to compile time env variable

@phil-opp (Collaborator, Author)

> I mean I genuinely think that using cargo test is the optimal approach as it's well integrated and native within the rust ecosystem.

We're using cargo test, we just need to invoke cargo run within it.

The difference to standard cargo tests is that we want to test a binary, not a library.

> Would there be a way to pass mock input and output data into the node, whether via a compile-time env variable, something like https://doc.rust-lang.org/cargo/reference/config.html#env, configuration within Cargo, or basically just a predefined mock_input.json / mock_output.json name that we can use?

One of the goals of this PR is to make every node integration testable, without recompiling it. This way, you can test the actual main function and it will behave exactly as during a dataflow run.

We can of course also add some DoraNode::init_with_test_inputs function, but this way the main function is not part of the test anymore.

@haixuanTao (Collaborator)

> We're using cargo test, we just need to invoke cargo run within it.
>
> The difference to standard cargo tests is that we want to test a binary, not a library.

But wouldn't this be way harder to grok, more error-prone, and harder to develop with than just testing it as a library?

Especially for people new to Rust and cargo.

> One of the goals of this PR is to make every node integration testable, without recompiling it. This way, you can test the actual main function and it will behave exactly as during a dataflow run.

Yeah, but can't you just call the whole main() function within a main.rs test section? Just like any other test in rust?

@phil-opp (Collaborator, Author) commented Nov 21, 2025

I pushed 05eed12 to simplify the test run function to use cargo run.

I also added a DoraNode::init_testing function in 6e138e3, which can be used to write library-style tests. It returns a custom DoraNode and EventStream, so you cannot run the main function directly. However, you can move most of the main function to a run(node, event_stream) function and then call that.

I used the new init_testing feature in d274190 to add some internal testing to the rust-dataflow-example-node crate. I also added a self-contained test_sample_output function that shows how to write test cases with loading/writing external files.

@haixuanTao (Collaborator)

I'm not against the proposed changes, but this seems to require users to learn about init_testing, which does not seem easy to understand.

I think that what we're trying to do is really close to what cargo nextest does natively: https://github.com/nextest-rs/nextest?tab=readme-ov-file . Especially the section on altering the environment within tests: https://nexte.st/docs/configuration/env-vars/#altering-the-environment-within-tests

I would be in favor of, instead of rewriting our tests, using nextest with a small bin test section that is native in Rust:

// test main
#[cfg(test)]
mod tests {
    use super::*;
    #[test]
    fn test_status_node_main_1() {
        set_var(DORA_TEST_WITH_INPUTS_1); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }
    
    #[test]
    fn test_status_node_main_2() {
        set_var(DORA_TEST_WITH_INPUTS_2); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }
}

And then:

cargo nextest run -p rust-dataflow-example-node

@phil-opp (Collaborator, Author)

So you think it's easier for users to learn about and understand a third-party testing tool than to call a function? This seems unlikely to me.

In general, one goal of Dora is to support the standard build and test frameworks for each language, so that users don't need to learn extra tools. Removing support for cargo test would be directly against that goal.

I also don't understand how set_var can become safe, just by using cargo nextest. As far as I understand it, nextest is just an alternative test runner that is compatible with standard Rust tests. So the test functions would still need to call the unsafe std::env::set_var function, right?

The docs you linked just seem to say that set_var won't invoke undefined behavior under the right circumstances when using cargo nextest. That seems like a very weak guarantee. To me it sounds like there is still undefined behavior if you accidentally run cargo test instead of cargo nextest, which seems quite dangerous.

@phil-opp (Collaborator, Author)

I pushed another improvement in 68723bb. There is now a setup_integration_testing function that will store the testing config in thread local data. The init_from_env function checks for this thread-local data and enters integration testing mode if set.

This means that you can now test the main function like this:

#[test]
fn test_main_function() -> eyre::Result<()> {
    let inputs = dora_node_api::integration_testing::TestingInput::FromJsonFile(
        "../../../tests/sample-inputs/inputs-rust-node.json".into(),
    );
    let mut output_file = Arc::new(tempfile::tempfile()?);
    let testing_output =
        dora_node_api::integration_testing::TestingOutput::ToWriter(Box::new(output_file.clone()));
    let options = TestingOptions {
        skip_output_time_offsets: true,
    };

    integration_testing::setup_integration_testing(inputs, testing_output, options);

    crate::main()?;

    let mut output = String::new();
    output_file.seek(std::io::SeekFrom::Start(0))?;
    output_file.read_to_string(&mut output)?;
    let expected =
        std::fs::read_to_string("../../../tests/sample-inputs/expected-outputs-rust-node.jsonl")?;

    assert_eq!(output, expected.replace("\r\n", "\n")); // normalize line endings

    Ok(())
}

@haixuanTao (Collaborator)

> I pushed another improvement in 68723bb. There is now a setup_integration_testing function that will store the testing config in thread local data. The init_from_env function checks for this thread-local data and enters integration testing mode if set.
>
> This means that you can now test the main function like this:
>
> #[test]
> fn test_main_function() -> eyre::Result<()> {
>     let inputs = dora_node_api::integration_testing::TestingInput::FromJsonFile(
>         "../../../tests/sample-inputs/inputs-rust-node.json".into(),
>     );
>     let mut output_file = Arc::new(tempfile::tempfile()?);
>     let testing_output =
>         dora_node_api::integration_testing::TestingOutput::ToWriter(Box::new(output_file.clone()));
>     let options = TestingOptions {
>         skip_output_time_offsets: true,
>     };
>
>     integration_testing::setup_integration_testing(inputs, testing_output, options);
>
>     crate::main()?;
>
>     let mut output = String::new();
>     output_file.seek(std::io::SeekFrom::Start(0))?;
>     output_file.read_to_string(&mut output)?;
>     let expected =
>         std::fs::read_to_string("../../../tests/sample-inputs/expected-outputs-rust-node.jsonl")?;
>
>     assert_eq!(output, expected.replace("\r\n", "\n")); // normalize line endings
>
>     Ok(())
> }

OK, this works for me.

phil-opp merged commit da26f6e into main Nov 24, 2025
75 of 76 checks passed
phil-opp deleted the node-integration-testing branch November 24, 2025 10:24