Add support for node integration testing #1163
Conversation
Add an optional `data_format` field that specifies the format of the `data` field. It defaults to deriving the schema from the given JSON object and converting it to the closest Arrow representation. The `ArrowTest` and `ArrowTestUnwrap` formats expect the `data` field to follow the Arrow integration test data format. The `ArrowTestUnwrap` format unwraps the first column of the deserialized RecordBatch to make other Arrow types representable (i.e. not just StructArrays).
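A hedged sketch of what an input entry using the new field might look like (the exact file layout and the field names other than `data_format` are assumptions, loosely modeled on the output format shown later in this PR):

```json
{"id": "image", "data_format": "ArrowTest", "data": {"...": "..."}}
```

With `ArrowTestUnwrap`, the first column of the deserialized RecordBatch would be unwrapped, so `data` could represent e.g. a plain Int64Array instead of only StructArrays.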
The arrow integration test format crate panics in certain situations, which leads to a closed integration test channel. We want to panic on the sending side too in this case, to avoid endless loops.
Useful for diffing the file against an expected file (as time offsets are not deterministic).
I opened apache/arrow-rs#8737 to add support for binary decoding to …
The `arrow_integration_test` crate is incomplete and apparently only for internal use.
- wrap values if necessary
- avoid double-wrapping array values
To make them available for integration tests.
So that we get the exact same output each time.
To make outputs comparable
There are two tests for the …
I also added detailed docs in 618689d
I find the test examples very difficult to decipher. Could we make them easier to understand?
Also, could you put somewhere within the test folder the way you generated the test data?
Which part do you find hard to decipher? We just call two commands for each test: one build command and one run command that sets some env variables. So basically:
To make this extensible, I introduced a … The other code in that file is necessary to account for the differences between Windows and Linux (e.g. the `.exe` extension, line endings) and to make it work with the … I'm not sure how we can simplify this more?
Done in 554ae3a |
I don't think that as a user of dora, I would be able to reproduce the node test on my own codebase. Would it be possible to make a node integration test be something like the following, within the main.rs of the node:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_integration() {
        let input_json = ...
        let output_json = ...
        // Maybe add them to env variable
        // ...
        let result = main();
        assert_eq! ...
    }
}
```

So that it's easier to read but also uses native testing that can be used with: …

P.S.: Edited for better understanding
I don't think that there is a way around using … What we can do is using … Moving the test to the respective crate is also possible of course, we just have to duplicate the … I'll look into it, thanks for the suggestion!
I mean, I genuinely think that using … I find using … Would there be a way to pass mock input and output data into the node, whether it's via a compile-time env variable, something like https://doc.rust-lang.org/cargo/reference/config.html#env, configuration within the cargo config, or basically just a predefined mock_input.json / mock_output.json name that we can use?

P.S.: Edited "env variable" to "compile-time env variable"
We're using … The difference from standard cargo tests is that we want to test a binary, not a library.
One of the goals of this PR is to make every node integration-testable without recompiling it. This way, you test the actual main function and it will behave exactly as during a dataflow run. We can of course also add some …
But wouldn't this be way harder to grok, more error-prone, and harder to develop with than just using a test as a library? Especially for people who are beginners with Rust and cargo.
Yeah, but can't you just call the whole main() function within a main.rs test section? Just like any other test in Rust?
This function can be used to write unit tests within the node itself.
I pushed 05eed12 to simplify the test run function to use … I also added a … I used the new …
I'm not against the proposed changes but this seems to require users to learn about … I think that what we're trying to do is really close to what cargo nextest is doing natively: https://github.com/nextest-rs/nextest?tab=readme-ov-file . Especially the section on: https://nexte.st/docs/configuration/env-vars/#altering-the-environment-within-tests

I would be in favor of, instead of rewriting our tests, using nextest with a small bin test section that is native in Rust:

```rust
// test main
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_status_node_main_1() {
        set_var(DORA_TEST_WITH_INPUTS_1); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }

    #[test]
    fn test_status_node_main_2() {
        set_var(DORA_TEST_WITH_INPUTS_2); // this is safe with nextest
        let result = main();
        assert!(result.is_ok());
    }
}
```

And then:

```
cargo nextest run -p rust-dataflow-example-node
```
So you think it's easier for users to learn about and understand a third-party testing tool than to call a function? This seems unlikely to me. In general, one goal of Dora is to support the standard build and test frameworks for each language, so that users don't need to learn extra tools. Removing support for … I also don't understand how … The docs you linked just seem to say that …
I pushed another improvement in 68723bb. There is now a … This means that you can now test the …

```rust
#[test]
fn test_main_function() -> eyre::Result<()> {
    use std::io::{Read, Seek};
    use std::sync::Arc;

    let inputs = dora_node_api::integration_testing::TestingInput::FromJsonFile(
        "../../../tests/sample-inputs/inputs-rust-node.json".into(),
    );
    let mut output_file = Arc::new(tempfile::tempfile()?);
    let testing_output =
        dora_node_api::integration_testing::TestingOutput::ToWriter(Box::new(output_file.clone()));
    let options = TestingOptions {
        skip_output_time_offsets: true,
    };
    integration_testing::setup_integration_testing(inputs, testing_output, options);

    crate::main()?;

    let mut output = String::new();
    output_file.seek(std::io::SeekFrom::Start(0))?;
    output_file.read_to_string(&mut output)?;
    let expected =
        std::fs::read_to_string("../../../tests/sample-inputs/expected-outputs-rust-node.jsonl")?;
    assert_eq!(output, expected.replace("\r\n", "\n")); // normalize line endings
    Ok(())
}
```
Ok, this works for me.
How it works
- `DORA_TEST_WITH_INPUTS` env variable with the path to your input JSON file
- `DORA_TEST_WRITE_OUTPUTS_TO` env variable with the path where the outputs should be written. If not set, dora will write an `outputs.jsonl` file next to the given inputs file

The node will be run as usual, but its event channel will be filled from the given inputs JSON file. No connection to a `dora daemon` will be made.

Input JSON file format example:

Output JSON file format example:

```
{"id":"random","data":9267023440904143729,"time_offset_secs":0.700793541}
{"id":"random","data":5753749540645363621,"time_offset_secs":0.900897584}
```

TODO:

- `arrow_integration_test` JSON format -> this might be better suited than our custom input JSON format
- ~~use `ArrowTestUnwrap` format for `outputs.jsonl`~~ (not possible because `ArrowJsonBatch::from_batch` is incomplete, see apache/arrow-rs#8684)