Start splitting up dataset.py#10039
Merged
max-sixty merged 2 commits intopydata:mainfrom Feb 10, 2025
Merged
Conversation
Currently, `dataset.py` is 10956 lines long. This makes doing any work with current LLMs basically impossible — with Claude's tokenizer, the file is 104K tokens, or >2.5x the size of the _per-minute_ rate limit for basic accounts. Most of xarray touches it in some way, so you generally want to give it the file for context. Even if you don't think "LLMs are the future, let's code with vibes!", the file is still really long; can be difficult to navigate (though OTOH it can be easy to just grep, to be fair...). So I would propose: - We start breaking it up, while also being cognizant that big changes can cause merge conflicts - Start with the low-hanging fruit - For example, this PR moves code outside the class (but that's quite limited) - Then move some of the code from the big methods into functions in other files, like `curve_fit` - Possibly (has tradeoffs; needs discussion) build some mixins so we can split up the class, if we want to have much smaller files - We can also think about other files: `dataarray.py` is 7.5K lines. The tests are also huge (`test_dataset` is 7.5K lines), but unlike with the library code, we can copy out & in chunks of tests when developing. (Note that I don't have any strong views on exactly what code should go in which file; I made a quick guess — very open to any suggestions; also easy to change later, particularly since this code doesn't change much so is less likely to cause conflicts)
shoyer
approved these changes
Feb 9, 2025
Member
shoyer
left a comment
There was a problem hiding this comment.
Thanks Max for getting started on this!
The main reason why dataset.py is so gigantic in the first place is that most of the Xarray API uses Dataset methods. But local imports are definitely less problematic than 10,000 line files!
This was referenced Mar 2, 2025
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Currently,
dataset.pyis 10956 lines long. This makes doing any work with current LLMs basically impossible — with Claude's tokenizer, the file is 104K tokens, or >2.5x the size of the per-minute rate limit for basic accounts. Most of xarray touches it in some way, so you generally want to give it the file for context.Even if you don't think "LLMs are the future, let's code with vibes!", the file is still really long; can be difficult to navigate (though OTOH it can be easy to just grep, to be fair...).
So I would propose:
curve_fitdataarray.pyis 7.5K lines. The tests are also huge (test_datasetis 7.5K lines), but unlike with the library code, we can copy out & in chunks of tests when developing.(Note that I don't have any strong views on exactly what code should go in which file; I made a quick guess — very open to any suggestions; also easy to change later, particularly since this code doesn't change much so is less likely to cause conflicts)