MAINT refactor ARFF parser#22026
Conversation
|
@thomasjpfan did you have this refactoring in mind? |
ogrisel
left a comment
There was a problem hiding this comment.
I was not involved in the initial implementation of the ARFF parser in the OpenMP fetcher from scikit-learn and I am a bit confused by the name liac_arff_parser:
- do we vendor code that was originally developed outside of scikit-learn (e.g. https://github.com/renatopp/liac-arff)? If so the functions that are derivatives from outside code should mention it in their docstrings or at the top of the module, including the original author name (or copyright holder name) and the license.
- do we plan to resync this code with upstream from time to time in the future? Probably not because OpenML is moving towards parquet as a replacement for ARFF so I suspect that we won't have to deal with long term maintenance of the ARFF parser.
|
Beyond the point of making the relationship between upstream code and derivative code explicit, I am fine with the idea of moving ARFF parsing code in a dedicated private module. So +1 once the above comment is addressed. |
Yes, we do vendor in the
Basically, the development in the original repository is not super active: https://github.com/renatopp/liac-arff
I open the following issue to ask a rough timeline regarding the landing of parquet format in OpenML: openml/OpenML#1133 |
ogrisel
left a comment
There was a problem hiding this comment.
Thanks for the clarification. LGTM.
thomasjpfan
left a comment
There was a problem hiding this comment.
Thank you for the PR @glemaitre !
LGTM
|
Merging because the CI failure is unrelated to this PR. |
Refactoring ARFF parser in order to introduce pandas parser.
Preparation of #21938
This PR is only moving functions aside and slightly modifying the signature of one of them.