Skip to content

Conversation

@0xPoe
Copy link
Contributor

@0xPoe 0xPoe commented Jul 28, 2025

Which issue does this PR close?

Rationale for this change

This pull request implements the partition_statistics function specifically for HashJoinExec, taking into account the different partitioning modes.

What changes are included in this PR?

  • Implements the partition_statistics function specifically for HashJoinExec.
  • Added a unit test for it.

Are these changes tested?

Yes, check the unit tests.

Are there any user-facing changes?

I'm not sure; please let me know if any document changes are needed.

/cc @xudong963

@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Jul 28, 2025
@0xPoe 0xPoe force-pushed the poe-patch-hash-join branch from 4240606 to 27efc67 Compare July 28, 2025 20:56
Copy link
Contributor Author

@0xPoe 0xPoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔢 Self-check (PR reviewed by myself and ready for feedback)

  • Code compiles successfully

  • Unit tests added

  • All tests pass

  • Comments added where necessary

  • PR title and description updated

  • Documentation PR created (or confirmed not needed)

  • PR size is reasonable

@0xPoe
Copy link
Contributor Author

0xPoe commented Aug 2, 2025

@xudong963 Could you please take a look when you have time? Thank you!

@0xPoe 0xPoe force-pushed the poe-patch-hash-join branch from 27efc67 to 7babcc7 Compare August 3, 2025 11:44
@xudong963
Copy link
Member

@xudong963 Could you please take a look when you have time? Thank you!

Thanks! I'll take a look tomorrow

}

#[test]
fn test_partition_statistics() -> Result<()> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also suggest testing with real execution, like this: https://github.com/apache/datafusion/blob/main/datafusion/core/tests/physical_optimizer/partition_statistics.rs#L722-L729

It'll verify the output partitioning count too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. I forgot to remove this test. This test has been moved to partition_statistics.rs. I thought I deleted it before I opened the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok(stats.project(self.projection.as_ref()))
}

// For Auto mode or when no specific partition is requested, fall back to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For auto mode, as the comment says:

/// DataFusion optimizer decides which PartitionMode
    /// mode(Partitioned/CollectLeft) is optimal based on statistics. It will
    /// also consider swapping the left and right inputs for the Join
    Auto,

So if the method with partition is called after JoinSelection rule, it's impossible to see Auto mode, however, the method may be called first, I suggest evaluating the cost of related code about choosing partition mode in JoinSelection, then decide if we can decide the PartitionMode here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. However, it seems challenging to get the threshold here for conducting the same evaluation. Do you have any other ideas on how to share the same logic from the JoinSelection to determine the PartitionMode? I'm uncertain how to obtain the optimizer config for accessing the threshold settings in the current API design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check later

Copy link
Contributor Author

@0xPoe 0xPoe Nov 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xudong963 Do you have time to look at the question I asked above? Thank you. Also, could you please help me reopen the PR?
For this question, I've thought about it a little more.

however, the method may be called first,

I searched the code base, no optimizer rules currently call partition_statistics(Some(partition))) before JoinSelection.

I’m also wondering — based on this, should we assert that the auto mode will never happen in this code path?

Or do we still want to determine the PartitionMode here? If so, that would mean we need to store collect_threshold_byte_size and collect_threshold_num_rows in HashJoinExec. I’m not sure if it’s a good design to expose these optimizer-related configs to the executor layer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched the code base, no optimizer rules currently call partition_statistics(Some(partition))) before JoinSelection.

The only rule before JoinSelection, called partition_statistics, is AggregateStatistics. However, it only calls partial_agg_exec.input().partition_statistics(None).

@0xPoe 0xPoe force-pushed the poe-patch-hash-join branch from 7babcc7 to 64f74da Compare August 11, 2025 18:41
@0xPoe 0xPoe force-pushed the poe-patch-hash-join branch from 64f74da to 3dcc212 Compare August 11, 2025 19:39
@github-actions
Copy link

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale PR has not had any activity for some time label Oct 19, 2025
@github-actions github-actions bot closed this Oct 29, 2025
@0xPoe
Copy link
Contributor Author

0xPoe commented Jan 12, 2026

@xudong963 @Jefffrey Could you please help me reopen this PR? Thanks! ref #16956 (comment)

@Jefffrey Jefffrey reopened this Jan 13, 2026
@Jefffrey
Copy link
Contributor

Re-opened this; I'm not as familiar with the statistics part of the codebase so maybe will need @xudong963 to take a look.

Looks like have some merge conflicts to resolve first (unless there's a higher level discussion to take place first?)

@xudong963
Copy link
Member

I'll review the PR again today

@github-actions github-actions bot removed the Stale PR has not had any activity for some time label Jan 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants