Skip to content

[Data] Add contributing guide to Ray Data documentation#58589

Merged
bveeramani merged 15 commits intomasterfrom
add-contributing-guide
Dec 9, 2025
Merged

[Data] Add contributing guide to Ray Data documentation#58589
bveeramani merged 15 commits intomasterfrom
add-contributing-guide

Conversation

@bveeramani
Copy link
Copy Markdown
Member

@bveeramani bveeramani commented Nov 13, 2025

Description

This PR adds a new contributing guide to the Ray Data documentation to help contributors write better code and tests.

This change adds three new documentation files to help Ray Data contributors:

  1. How to contribute an improvement - Guidelines for finding work, getting early feedback, writing clear PRs, keeping changes small, and writing simple code that follows Ray Data's design principles.

  2. How to write tests - Ray-specific and general best practices for writing tests that are fast, reliable, and maintainable. Includes guidance on avoiding common pitfalls like assuming output order, using shutdown_only unnecessarily, and testing against implementation details.

  3. Contributing section - A new top-level section in the documentation that organizes these guides.

The documentation follows principles from "A Philosophy of Software Design" and adapts them specifically for Ray Data's codebase. It aims to help both new and experienced contributors write code that is simple, maintainable, and aligned with the project's design philosophy.

This PR adds a new contributing guide to the Ray Data documentation to help contributors write better code and tests. The guide includes best practices for submitting improvements and writing maintainable tests.

## Description

This change adds three new documentation files to help Ray Data contributors:

1. **How to contribute an improvement** - Guidelines for finding work, getting early feedback, writing clear PRs, keeping changes small, and writing simple code that follows deep module design principles.

2. **How to write tests** - Ray-specific and general best practices for writing tests that are fast, reliable, and maintainable. Includes guidance on avoiding common pitfalls like assuming output order, using `shutdown_only` unnecessarily, and testing against implementation details.

3. **Contributing section** - A new top-level section in the documentation that organizes these guides.

## Related issues

Related to ongoing efforts to improve code quality and contributor experience in Ray Data.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner November 13, 2025 07:33
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This PR adds a valuable contributing guide to the Ray Data documentation. The guides for contributing improvements and writing tests are well-structured and provide clear best practices. I've identified a few issues, including a broken link in the table of contents, a syntax error in a code example, and some minor typos and formatting inconsistencies. Addressing these will ensure the documentation is accurate and easy for contributors to follow.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@ray-gardener ray-gardener bot added docs An issue or change related to documentation data Ray Data-related issues labels Nov 13, 2025
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Copy link
Copy Markdown
Collaborator

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic!

```python
ds2 = ds.repartition(5)
assert sum(len(bundle.blocks) for bundle in ds.iter_internal_ref_bundles()) == 5
# Assertion about the number of rows in each block has been removed.
Copy link
Copy Markdown
Member

@owenowenisme owenowenisme Nov 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a good example?
I think here we really should check the row num from every block.
Maybe the better version should be adding comments for these magic numbers.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you decide how many rows go in each block? Both [10, 10, 0, 0, 0] and [2, 2, 2, 2, 2] satisfy the APIs contract

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right.


```python
ds = ray.data.read_parquet_bulk(paths + [txt_path], filesystem=fs)
# Assertion has been removed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but this one doesn't assert anything.
It won't catch any issues.
If people do want to test against initial_num_blocks, it's okay to add the assertion.
Just document it clearly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The actual test that this is based on has additional assertions, but I didn't originally include them.

Added an additional assertion so this looks a bit less weird.

If people do want to test against initial_num_blocks, it's okay to add the assertion. Just document it clearly.

I think we should discourage this for a couple of reasons:

  1. It requires testing against internal attributes. If we refactor (e.g., remove ExecutionPlan), the test will break even though nothing is broken.
  2. The number of blocks isn't gaurenteed by the interface and can change depending on our specific implementation. For example, if there are two small files, it could be reasonable to either produce two blocks (to ensure parellism), or one block (to make blocks larger).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, sometimes we do want to test against an internal implementation, that's fine for unit tests.
In this case, if this assertion isn't helpful, we should just remove the entire test.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree we do (and should) test lower levels of abstraction. I guess it'd be more accurate to say that the test breaks abstraction barriers, and tests across multiple layers of abstraction (user-facing APIs and ExecutionPlan, and makes as assumptions about which attributes a Dataset has)

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 29, 2025
@owenowenisme owenowenisme added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Nov 29, 2025
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani added the go add ONLY when ready to merge, run all tests label Dec 8, 2025
@bveeramani bveeramani enabled auto-merge (squash) December 8, 2025 22:51
@bveeramani bveeramani disabled auto-merge December 8, 2025 22:51
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
```python
ds2 = ds.repartition(5)
assert sum(len(bundle.blocks) for bundle in ds.iter_internal_ref_bundles()) == 5
# Assertion about the number of rows in each block has been removed.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah you're right.

bveeramani and others added 2 commits December 9, 2025 01:48
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ct/ray into add-contributing-guide

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Copy link
Copy Markdown
Member

@owenowenisme owenowenisme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits but up to you, also might be good to add this somewhere
https://docs.ray.io/en/master/ray-contribute/getting-involved.html

@@ -0,0 +1,9 @@
============
Contributing
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contributing to Ray Data

@@ -0,0 +1,74 @@
# How to contribute an improvement
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contribution Guide sounds more natural to me.

@bveeramani bveeramani merged commit 00b1f9d into master Dec 9, 2025
5 of 6 checks passed
@bveeramani bveeramani deleted the add-contributing-guide branch December 9, 2025 21:10
bveeramani added a commit that referenced this pull request Dec 10, 2025
…ntributing guide (#59320)

This is a follow-up PR to address these review comments:
#58589 (review).

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…58589)

## Description

This PR adds a new contributing guide to the Ray Data documentation to
help contributors write better code and tests.

This change adds three new documentation files to help Ray Data
contributors:

1. **How to contribute an improvement** - Guidelines for finding work,
getting early feedback, writing clear PRs, keeping changes small, and
writing simple code that follows Ray Data's design principles.

2. **How to write tests** - Ray-specific and general best practices for
writing tests that are fast, reliable, and maintainable. Includes
guidance on avoiding common pitfalls like assuming output order, using
`shutdown_only` unnecessarily, and testing against implementation
details.

3. **Contributing section** - A new top-level section in the
documentation that organizes these guides.

The documentation follows principles from "A Philosophy of Software
Design" and adapts them specifically for Ray Data's codebase. It aims to
help both new and experienced contributors write code that is simple,
maintainable, and aligned with the project's design philosophy.

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
…ntributing guide (ray-project#59320)

This is a follow-up PR to address these review comments:
ray-project#58589 (review).

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: peterxcli <peterxcli@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data Ray Data-related issues docs An issue or change related to documentation go add ONLY when ready to merge, run all tests unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants