-
Notifications
You must be signed in to change notification settings - Fork 1.9k
refactor: move DataSource to datafusion-datasource
#14671
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
So exciting... |
|
@alamb any Idea on what to do with unit tests in |
I looked at the errors, and most of them appear to be related to the tests in physical_plan using Some other ideas:
|
|
BTW thank you for working on this -- I suggest we try the "make a mock ExecutionPlan" approach -- that would be the cleanest I think and would be a good way for you to learn more about how ExecutionPlans worked I can try and find time to sketch it out if needed |
Thanks for the suggestions. I will try to make |
|
@alamb, I have implemented a MockMemorySourceConfig and MockDataSource which is essentially same as what was previously in physical_plan but now for tests only. You can find the implementation here. The tests are passing on my PC, although there are some warnings, so cleaning up that part is still pending. I think this is different from what you were suggesting. It would be pretty helpful if you can point me in the right direction. Thanks, |
I think this sounds reasonable -- I was just saying that the split between |
It does look that way, I will try to make it one Mock struct. |
DataSource to datafusion-datasourceDataSource to datafusion-datasource
|
@alamb What do we want to do with benches? (they are causing circular dependency) |
I think we should move the benches into |
|
Thank you very much for this PR @logan-keede -- I will try and review it later today (I am out of the office this week so I have only limited time to review PRs) |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you SO much @logan-keede -- this is a really nice PR -- its structure is 💯 and the fact you got everything working is very impressive
I had a few small comments (unecessary dependency and a rename) but otherwise this PR is ready to go. I also think if necessary we could do those refactors as follow on PRs
Again thank you so much
FYI @mertak-synnada
| use arrow::util::pretty::pretty_format_batches; | ||
| use datafusion::catalog::{Session, TableFunctionImpl}; | ||
| use datafusion::common::{plan_err, Column}; | ||
| use datafusion::datasource::memory::MemorySourceConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| name = "dataframe" | ||
|
|
||
| [[bench]] | ||
| harness = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
|
||
| /// Data source configuration for reading in-memory batches of data | ||
| #[derive(Clone)] | ||
| pub struct MemorySourceConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just confirming that the plan is that datasource will have MemorySourceConfig / MemorySource and that we will move the other format specific things (like parquet) to their own crates like datafusion-datasource-parquet ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, That's what I had in my mind so far. However, now that I think about it, I don't think format-specific crate(except parquet) will have more than 2(fileformat, Source+mod) files, If we are going to keep a separate crate for them, maybe we can keep a separate crate for MemorySourceConfig(1 file, Source + mod.rs ) too. What do you think?
| use crate::memory::MemorySourceConfig; | ||
| use crate::source::DataSourceExec; | ||
| use crate::test::MockMemorySourceConfig; | ||
| // use crate::test::MockMemorySourceConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, at a few more places too. I think I have cleaned this out with latest commit.
| use crate::repartition::RepartitionExec; | ||
| use crate::source::DataSourceExec; | ||
| use crate::test::MockMemorySourceConfig; | ||
| // use crate::test::MockMemorySourceConfig; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // use crate::test::MockMemorySourceConfig; |
datafusion/physical-plan/src/test.rs
Outdated
| pub mod exec; | ||
|
|
||
| #[derive(Clone, Debug)] | ||
| pub struct MockMemorySourceConfig { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some suggestions:
- Document that this implements an in memory
DataSourcefor testing purposes (and maybe we can add a link to the real memory source) and that it duplicates much of theMemorySourceConfigto avoid a dependency between physical-plan and datasource - Rename it to something like
TestMemoryExecas this is not a config but rather something thatimpl ExecutionPlanand is used as a source of data in the tests
mertak-synnada
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thank you for all the hard work. I’ve added a few optional minor comments
datafusion/physical-plan/src/test.rs
Outdated
| /// A memory table can be ordered by multiple expressions simultaneously. | ||
| /// [`EquivalenceProperties`] keeps track of expressions that describe the | ||
| /// global ordering of the schema. These columns are not necessarily same; e.g. | ||
| /// ```text | ||
| /// ┌-------┐ | ||
| /// | a | b | | ||
| /// |---|---| | ||
| /// | 1 | 9 | | ||
| /// | 2 | 8 | | ||
| /// | 3 | 7 | | ||
| /// | 5 | 5 | | ||
| /// └---┴---┘ | ||
| /// ``` | ||
| /// where both `a ASC` and `b DESC` can describe the table ordering. With | ||
| /// [`EquivalenceProperties`], we can keep track of these equivalences | ||
| /// and treat `a ASC` and `b DESC` as the same ordering requirement. | ||
| /// | ||
| /// Note that if there is an internal projection, that projection will be | ||
| /// also applied to the given `sort_information`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can remove this documentation and add a link to the original MemorySourceConfig
datafusion/datasource/src/memory.rs
Outdated
| } | ||
|
|
||
| #[cfg(test)] | ||
| mod memory_exec_tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This looks forgotten. While moving the tests, I think we can rename the module as memory_source_tests since there's no MemoryExec
|
@alamb @mertak-synnada TYSM for the review!! I think I have addressed your suggestions in the latest commit. Let me know if they look good to you (especially the documentation part) or if you have anything else. |
| pub mod exec; | ||
|
|
||
| /// `TestMemoryExec` is a mock equivalent to [`MemorySourceConfig`] with [`ExecutionPlan`] implemented for testing. | ||
| /// i.e. It has some but not all the functionality of [`MemorySourceConfig`]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
Thanks @logan-keede and @mertak-synnada -- this looks great and thanks (again) for driving the plan forward |
* build moves only tests+benches pending * unstable * some tests fixed * Mock MemorySourceConfig and DataSource * some cleanup * test pass but Mock is not efficient * temporary stable * one struct, test pass, cleaning pending * cleaning * more cleaning * clippy * 🧹🧹*cleaning*🧹🧹 * adding re-export * fix:cargo fmt * fix: doctest * fix: leftout doctest * fix: circular dependency * clean, rename, document, improve
I mostly just like to see changes in Github. We can merge this once Datafusion 46 released, assuming it includes the following PRs: - apache/datafusion#14754 - apache/datafusion#14671 - also hoping for apache/datafusion#14798 Seems like everything we wanted to get in made it, this branch now compiles AND passes all checks, so we're only waiting for the actual release in apache/datafusion#14123 --------- Co-authored-by: Alexander Droste <alexander.droste@protonmail.com>

Which issue does this PR close?
datafusioncrate (datafusion/core) #14444.Rationale for this change
What changes are included in this PR?
refactor of
DataSourceAre these changes tested?
not completely, only
cargo checkwork for nowAre there any user-facing changes?
yes, but only if they are building from source.