[Data] Add contributing guide to Ray Data documentation#58589
[Data] Add contributing guide to Ray Data documentation#58589bveeramani merged 15 commits intomasterfrom
Conversation
This PR adds a new contributing guide to the Ray Data documentation to help contributors write better code and tests. The guide includes best practices for submitting improvements and writing maintainable tests. ## Description This change adds three new documentation files to help Ray Data contributors: 1. **How to contribute an improvement** - Guidelines for finding work, getting early feedback, writing clear PRs, keeping changes small, and writing simple code that follows deep module design principles. 2. **How to write tests** - Ray-specific and general best practices for writing tests that are fast, reliable, and maintainable. Includes guidance on avoiding common pitfalls like assuming output order, using `shutdown_only` unnecessarily, and testing against implementation details. 3. **Contributing section** - A new top-level section in the documentation that organizes these guides. ## Related issues Related to ongoing efforts to improve code quality and contributor experience in Ray Data. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
There was a problem hiding this comment.
Code Review
This PR adds a valuable contributing guide to the Ray Data documentation. The guides for contributing improvements and writing tests are well-structured and provide clear best practices. I've identified a few issues, including a broken link in the table of contents, a syntax error in a code example, and some minor typos and formatting inconsistencies. Addressing these will ensure the documentation is accurate and easy for contributors to follow.
doc/source/data/contributing/how-to-contribute-an-improvement.md
Outdated
Show resolved
Hide resolved
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
doc/source/data/contributing/how-to-contribute-an-improvement.md
Outdated
Show resolved
Hide resolved
doc/source/data/contributing/how-to-contribute-an-improvement.md
Outdated
Show resolved
Hide resolved
doc/source/data/contributing/how-to-contribute-an-improvement.md
Outdated
Show resolved
Hide resolved
| ```python | ||
| ds2 = ds.repartition(5) | ||
| assert sum(len(bundle.blocks) for bundle in ds.iter_internal_ref_bundles()) == 5 | ||
| # Assertion about the number of rows in each block has been removed. |
There was a problem hiding this comment.
Is this a good example?
I think here we really should check the row num from every block.
Maybe the better version should be adding comments for these magic numbers.
There was a problem hiding this comment.
How would you decide how many rows go in each block? Both [10, 10, 0, 0, 0] and [2, 2, 2, 2, 2] satisfy the APIs contract
doc/source/data/contributing/how-to-contribute-an-improvement.md
Outdated
Show resolved
Hide resolved
|
|
||
| ```python | ||
| ds = ray.data.read_parquet_bulk(paths + [txt_path], filesystem=fs) | ||
| # Assertion has been removed. |
There was a problem hiding this comment.
but this one doesn't assert anything.
It won't catch any issues.
If people do want to test against initial_num_blocks, it's okay to add the assertion.
Just document it clearly.
There was a problem hiding this comment.
The actual test that this is based on has additional assertions, but I didn't originally include them.
Added an additional assertion so this looks a bit less weird.
If people do want to test against initial_num_blocks, it's okay to add the assertion. Just document it clearly.
I think we should discourage this for a couple of reasons:
- It requires testing against internal attributes. If we refactor (e.g., remove
ExecutionPlan), the test will break even though nothing is broken. - The number of blocks isn't gaurenteed by the interface and can change depending on our specific implementation. For example, if there are two small files, it could be reasonable to either produce two blocks (to ensure parellism), or one block (to make blocks larger).
There was a problem hiding this comment.
I mean, sometimes we do want to test against an internal implementation, that's fine for unit tests.
In this case, if this assertion isn't helpful, we should just remove the entire test.
There was a problem hiding this comment.
Yeah, I agree we do (and should) test lower levels of abstraction. I guess it'd be more accurate to say that the test breaks abstraction barriers, and tests across multiple layers of abstraction (user-facing APIs and ExecutionPlan, and makes as assumptions about which attributes a Dataset has)
|
This pull request has been automatically marked as stale because it has not had You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
doc/source/data/contributing/how-to-contribute-an-improvement.md
Outdated
Show resolved
Hide resolved
| ```python | ||
| ds2 = ds.repartition(5) | ||
| assert sum(len(bundle.blocks) for bundle in ds.iter_internal_ref_bundles()) == 5 | ||
| # Assertion about the number of rows in each block has been removed. |
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
doc/source/data/contributing/how-to-contribute-an-improvement.md
Outdated
Show resolved
Hide resolved
Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ct/ray into add-contributing-guide Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
owenowenisme
left a comment
There was a problem hiding this comment.
Some nits but up to you, also might be good to add this somewhere
https://docs.ray.io/en/master/ray-contribute/getting-involved.html
| @@ -0,0 +1,9 @@ | |||
| ============ | |||
| Contributing | |||
| @@ -0,0 +1,74 @@ | |||
| # How to contribute an improvement | |||
There was a problem hiding this comment.
Contribution Guide sounds more natural to me.
…ntributing guide (#59320) This is a follow-up PR to address these review comments: #58589 (review). --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…58589) ## Description This PR adds a new contributing guide to the Ray Data documentation to help contributors write better code and tests. This change adds three new documentation files to help Ray Data contributors: 1. **How to contribute an improvement** - Guidelines for finding work, getting early feedback, writing clear PRs, keeping changes small, and writing simple code that follows Ray Data's design principles. 2. **How to write tests** - Ray-specific and general best practices for writing tests that are fast, reliable, and maintainable. Includes guidance on avoiding common pitfalls like assuming output order, using `shutdown_only` unnecessarily, and testing against implementation details. 3. **Contributing section** - A new top-level section in the documentation that organizes these guides. The documentation follows principles from "A Philosophy of Software Design" and adapts them specifically for Ray Data's codebase. It aims to help both new and experienced contributors write code that is simple, maintainable, and aligned with the project's design philosophy. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
…ntributing guide (ray-project#59320) This is a follow-up PR to address these review comments: ray-project#58589 (review). --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Signed-off-by: peterxcli <peterxcli@gmail.com>
Description
This PR adds a new contributing guide to the Ray Data documentation to help contributors write better code and tests.
This change adds three new documentation files to help Ray Data contributors:
How to contribute an improvement - Guidelines for finding work, getting early feedback, writing clear PRs, keeping changes small, and writing simple code that follows Ray Data's design principles.
How to write tests - Ray-specific and general best practices for writing tests that are fast, reliable, and maintainable. Includes guidance on avoiding common pitfalls like assuming output order, using
shutdown_onlyunnecessarily, and testing against implementation details.Contributing section - A new top-level section in the documentation that organizes these guides.
The documentation follows principles from "A Philosophy of Software Design" and adapts them specifically for Ray Data's codebase. It aims to help both new and experienced contributors write code that is simple, maintainable, and aligned with the project's design philosophy.