Creation of a 'Reading scikit-learn code' section in the docs? #12869

@NicolasHug

Would you welcome a new subsection in (e.g.) the contributing guide, giving pointers on how to actually read and digest the existing code base? It could be a mix of general and scikit-learn-specific tips.

Lots of things may seem obvious to experienced programmers and contributors, but less experienced people might find the code base quite overwhelming at first. For example, even simple utilities like scale() take more than 50 lines, while the bulk of the work could fit in a one-liner. It's easy to get lost in the details.
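
To make the scale() example concrete, here is a rough sketch of the one-liner I have in mind, compared against the real function. The toy array X is made up for illustration, and the one-liner only covers the default dense case; as far as I can tell, most of the remaining lines in scale() deal with things like input validation, sparse matrices and constant columns.

```python
import numpy as np
from sklearn.preprocessing import scale

# Made-up toy data: 3 samples, 2 features.
X = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])

# The "bulk of the work" for the default dense case: center each column
# and divide it by its (population) standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The library function should agree on this well-behaved input.
X_sklearn = scale(X)
same = np.allclose(X_manual, X_sklearn)
```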

I was thinking of something like this (very roughly):

  • It takes time and experience to read code efficiently. It's normal if it seems hard, because it is.
  • Get acquainted with the estimator API: fit, predict, fit_predict...
  • Identify the important parts and ignore the rest. In particular, a lot of the code (especially at the beginning of the fit() methods) is just doing input checking. Focusing on this part isn't worth it if you only need to understand what an algorithm is doing. Make sure you can tell those parts apart from the actual ML algorithm.
  • Before trying to read a function / class, briefly read the parameter docstrings to get at least a vague idea of what each parameter is used for. Same for the attributes.
  • Explain the check_*() helper functions, e.g. check_random_state() or check_cv(): they accept several kinds of input (for instance None, an int, or an existing object) but always return an object of a predictable type.
  • We use Cython to make things fast (.pyx and .pxd files). Those files usually contain low-level routines that can probably be ignored during the first "reading sessions".
  • ...
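
As a very rough illustration of the input-checking and check_*() points above, something like the following could go in such a section. ToyEstimator and its prototype_ attribute are hypothetical and exist only for this sketch; check_array, check_random_state, check_cv and KFold are the real scikit-learn utilities.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import KFold, check_cv
from sklearn.utils import check_array, check_random_state

# check_random_state accepts None, an int seed, or a RandomState instance,
# and returns a RandomState instance in every case.
rngs = [check_random_state(s) for s in (None, 0, np.random.RandomState(0))]

# check_cv turns an int into a cross-validation splitter object
# (a KFold here, since no classification target is given).
cv = check_cv(5)


class ToyEstimator(BaseEstimator):
    """Hypothetical estimator, used only to show the layout of fit()."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y=None):
        # --- input checking: safe to skim on a first read ---
        X = check_array(X)
        rng = check_random_state(self.random_state)

        # --- the actual "algorithm": pick one random sample ---
        self.prototype_ = X[rng.randint(X.shape[0])]
        return self


est = ToyEstimator(random_state=0).fit([[0.0, 1.0], [2.0, 3.0]])
```

The point is only the shape of fit(): a block of validation first, then the part worth reading closely.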

Any suggestions are welcome.
