
DOC Mentioned efficiency and precision issues in glossary.rst#28223

Merged
glemaitre merged 5 commits into scikit-learn:main from suhasid098:suhaBranch1
Feb 19, 2024

Conversation

@suhasid098
Contributor

@suhasid098 suhasid098 commented Jan 22, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

I found a TODO in glossary.rst asking to describe the efficiency and precision considerations. This is a documentation-only change.

Any other comments?

@github-actions

github-actions bot commented Jan 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6347901.

@suhasid098 suhasid098 changed the title Mentioned efficiency and precision issues in glossary.rst DOC: Mentioned efficiency and precision issues in glossary.rst Jan 24, 2024
@suhasid098
Copy link
Copy Markdown
Contributor Author

Hey @glemaitre, I saw that you mentioned we could ping you for PRs — wondering if I could get a review, please! 😃

@glemaitre glemaitre changed the title DOC: Mentioned efficiency and precision issues in glossary.rst DOC Mentioned efficiency and precision issues in glossary.rst Jan 26, 2024
doc/glossary.rst Outdated

In regards to the efficiency and precision of the NumPy array, the data type
(`dtype`) can make a significant impact. If one chooses a data type with
a higher precision like `np.float64` or `np.int64` it will result in slow
Member


I think that we need to mitigate the claim about slow operations. Basically, using lower precision can provide a speed-up if the operation under the hood leverages optimizations such as vectorization, SIMD, or cache optimization. If those are not implemented or taken care of, you might not get any speed-up.
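A minimal sketch of this point: the memory saving from `np.float32` is guaranteed (half the bytes), while any speed-up depends on whether the underlying kernels exploit the narrower type. The array shapes here are arbitrary, chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X64 = rng.standard_normal((1000, 100))   # NumPy defaults to float64
X32 = X64.astype(np.float32)

# The memory footprint is exactly halved.
# A speed-up, however, is NOT guaranteed: it only materializes when the
# code underneath (BLAS routines, SIMD paths, cache-friendly loops)
# actually leverages the narrower dtype.
print(X64.nbytes, X32.nbytes)  # 800000 400000
```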

doc/glossary.rst Outdated
In regards to the efficiency and precision of the NumPy array, the data type
(`dtype`) can make a significant impact. If one chooses a data type with
a higher precision like `np.float64` or `np.int64` it will result in slow
operations, and increase memory usage, but it will result in accurate results.
Member


accurate results -> more accurate results, due to the ability to represent floating-point numbers with a lower floating-point error.
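A small illustration of this lower floating-point error: `float32` carries roughly 7 decimal digits of precision versus roughly 16 for `float64`, so a perturbation of `1e-8` near `1.0` is below `float32` resolution but well within `float64`'s:

```python
import numpy as np

# 1e-8 is smaller than half of float32's machine epsilon (~1.19e-7),
# so adding it to 1.0 in float32 rounds away entirely.
a32 = np.float32(1.0) + np.float32(1e-8)
a64 = np.float64(1.0) + np.float64(1e-8)

print(a32 == np.float32(1.0))  # True: the update is lost in float32
print(a64 == np.float64(1.0))  # False: float64 still resolves it
```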

doc/glossary.rst Outdated
(`dtype`) can make a significant impact. If one chooses a data type with
a higher precision like `np.float64` or `np.int64` it will result in slow
operations, and increase memory usage, but it will result in accurate results.
However, if one opts for lower precision types like `np.float32` or `np.int32`
Member


I don't think that we need to repeat. We can just make the comparison between 64-bit and 32-bit numbers explicit in the first sentence.

doc/glossary.rst Outdated
Comment on lines +294 to +297
results. The analysis of the choice will be dependent on the size of the
dataset needed in machine learning tasks. For example, if a dataset is large,
it would be worth considering a lower-precision data type for a larger dataset
since speed and accuracy would be a priority.
Member


Instead of speaking about the dataset, it all comes down to whether the algorithm used under the hood leverages np.float32. For instance, some minimization methods are only coded in np.float64, and even passing np.float32 will trigger a conversion; the code will then be slower and consume more memory with np.float32 input due to the extra conversion.
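A hedged sketch of this pitfall: suppose a hypothetical solver is only implemented for `float64` (as some minimization routines are — in scikit-learn, such input validation and conversion typically happens via `sklearn.utils.check_array`). Passing `float32` then triggers a converting copy, so during the call both buffers are alive at once:

```python
import numpy as np

# Hypothetical scenario: the routine we want to call only accepts
# float64, so the float32 input must first be upcast via a copy.
X32 = np.ones((1000, 100), dtype=np.float32)  # 400 kB, what we pass in
X64 = X32.astype(np.float64)                  # 800 kB converting copy

# Peak memory is the SUM of both footprints: worse than having passed
# float64 in the first place, with no speed benefit either.
peak_bytes = X32.nbytes + X64.nbytes
print(peak_bytes)  # 1200000
```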

Member


You can remove the TODO below.

I've incorporated the valuable insights and suggestions from glemaitre to enhance the clarity and technical accuracy of the document. Here's a summary of the implemented changes:

Mitigating Slow Operations: As glemaitre suggested, we explored how using lower precision (e.g., np.float32 over np.float64) can speed up operations if the underlying process leverages optimizations like vectorization, SIMD, or cache optimization. This acknowledgment led to a more nuanced discussion on when and why these optimizations matter and how their absence could negate expected speed improvements.

Enhancing Accuracy Description: I updated the text to reflect the point about "more accurate results" being achievable with higher precision types due to their lower floating-point error. This change underscores the precision-efficiency trade-off more clearly.

Streamlined Comparison: Following the advice to avoid repetition, the initial comparison between 64-bit and 32-bit data types was made more explicit in the first sentence. This direct approach simplifies the narrative by immediately setting the stage for the discussion on trade-offs.

Algorithm Dependency: I shifted the focus from dataset size to the relevance of the algorithm's compatibility with np.float32. This section now highlights how some algorithms, particularly certain minimization methods, may not benefit from lower precision due to inherent coding in np.float64, leading to inadvertent conversions that can slow down the process and increase memory usage.

Removal of the TODO Section: As suggested, the TODO prompting further discussion on efficiency, precision, and casting policy has been removed to present the analysis as complete in its current form.
@glemaitre glemaitre self-requested a review February 19, 2024 09:56
Member

@glemaitre glemaitre left a comment


I removed the trailing spaces and rephrased a part that really looked like it was written by ChatGPT, with an overly complicated (and inconsistent with the rest of the document) choice of words.

@glemaitre
Member

Enabling auto-merge. Thanks @suhasid098

@glemaitre glemaitre enabled auto-merge (squash) February 19, 2024 12:05
@glemaitre glemaitre merged commit d7b6238 into scikit-learn:main Feb 19, 2024
