
DOC Mentioned efficiency and precision issues in glossary.rst#28223

Merged
glemaitre merged 5 commits into scikit-learn:main from suhasid098:suhaBranch1
Feb 19, 2024

Conversation

@suhasid098
Contributor

@suhasid098 suhasid098 commented Jan 22, 2024

Reference Issues/PRs

What does this implement/fix? Explain your changes.

I found a TODO in glossary.rst asking to describe the efficiency and precision considerations. This is a documentation-only change.

Any other comments?

@github-actions

github-actions bot commented Jan 22, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 6347901.

@suhasid098 suhasid098 changed the title Mentioned efficiency and precision issues in glossary.rst DOC: Mentioned efficiency and precision issues in glossary.rst Jan 24, 2024
@suhasid098
Copy link
Copy Markdown
Contributor Author

Hey @glemaitre, I saw that you mentioned we could ping you for PRs — wondering if I could get a review, please! 😃

@glemaitre glemaitre changed the title DOC: Mentioned efficiency and precision issues in glossary.rst DOC Mentioned efficiency and precision issues in glossary.rst Jan 26, 2024
doc/glossary.rst Outdated

In regards to the efficiency and precision of the NumPy array, the data type
(`dtype`) can make a significant impact. If one chooses a data type with
a higher precision like `np.float64` or `np.int64` it will result in slow
Member


I think that we need to mitigate the claim about slow operations. Basically, using lower precision can provide a speed-up if the operation under the hood leverages optimizations such as vectorization, SIMD, or cache optimization. If those are not implemented or taken care of, you might not get any speed-up.
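A minimal sketch of this point: the memory saving from `np.float32` is guaranteed (half the bytes), while any speed-up depends on whether the underlying kernels exploit the narrower type. The array shapes here are arbitrary, chosen just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X64 = rng.standard_normal((1000, 100))   # NumPy defaults to float64
X32 = X64.astype(np.float32)

# The memory footprint is exactly halved.
# A speed-up, however, is NOT guaranteed: it only materializes when the
# code underneath (BLAS routines, SIMD paths, cache-friendly loops)
# actually leverages the narrower dtype.
print(X64.nbytes, X32.nbytes)  # 800000 400000
```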

doc/glossary.rst Outdated
In regards to the efficiency and precision of the NumPy array, the data type
(`dtype`) can make a significant impact. If one chooses a data type with
a higher precision like `np.float64` or `np.int64` it will result in slow
operations, and increase memory usage, but it will result in accurate results.
Member


accurate results -> more accurate results, due to the ability to represent floating-point numbers with a lower floating-point error.
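A small illustration of this lower floating-point error: `float32` carries roughly 7 decimal digits of precision versus roughly 16 for `float64`, so a perturbation of `1e-8` near `1.0` is below `float32` resolution but well within `float64`'s:

```python
import numpy as np

# 1e-8 is smaller than half of float32's machine epsilon (~1.19e-7),
# so adding it to 1.0 in float32 rounds away entirely.
a32 = np.float32(1.0) + np.float32(1e-8)
a64 = np.float64(1.0) + np.float64(1e-8)

print(a32 == np.float32(1.0))  # True: the update is lost in float32
print(a64 == np.float64(1.0))  # False: float64 still resolves it
```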

doc/glossary.rst Outdated
(`dtype`) can make a significant impact. If one chooses a data type with
a higher precision like `np.float64` or `np.int64` it will result in slow
operations, and increase memory usage, but it will result in accurate results.
However, if one opts for lower precision types like `np.float32` or `np.int32`
Member


I don't think that we need to repeat. We can just make the comparison between 64-bit and 32-bit numbers explicit in the first sentence.

doc/glossary.rst Outdated
Comment on lines +294 to +297
results. The analysis of the choice will be dependent on the size of the
dataset needed in machine learning tasks. For example, if a dataset is large,
it would be worth considering a lower-precision data type for a larger dataset
since speed and accuracy would be a priority.
Member


Instead of speaking about the dataset, it all comes down to whether the algorithm used under the hood leverages np.float32. For instance, some minimization methods are only coded in np.float64, and even passing np.float32 will trigger a conversion; the code will then be slower and consume more memory with np.float32 input due to the extra conversion.
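A hedged sketch of this pitfall: suppose a hypothetical solver is only implemented for `float64` (as some minimization routines are — in scikit-learn, such input validation and conversion typically happens via `sklearn.utils.check_array`). Passing `float32` then triggers a converting copy, so during the call both buffers are alive at once:

```python
import numpy as np

# Hypothetical scenario: the routine we want to call only accepts
# float64, so the float32 input must first be upcast via a copy.
X32 = np.ones((1000, 100), dtype=np.float32)  # 400 kB, what we pass in
X64 = X32.astype(np.float64)                  # 800 kB converting copy

# Peak memory is the SUM of both footprints: worse than having passed
# float64 in the first place, with no speed benefit either.
peak_bytes = X32.nbytes + X64.nbytes
print(peak_bytes)  # 1200000
```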

Member


You can remove the TODO below.

I've incorporated the valuable insights and suggestions from glemaitre to enhance the clarity and technical accuracy of the document. Here's a summary of the implemented changes:

Mitigating Slow Operations: As glemaitre suggested, we explored how using lower precision (e.g., np.float32 over np.float64) can speed up operations if the underlying process leverages optimizations like vectorization, SIMD, or cache optimization. This acknowledgment led to a more nuanced discussion on when and why these optimizations matter and how their absence could negate expected speed improvements.

Enhancing Accuracy Description: I updated the text to reflect the point about "more accurate results" being achievable with higher precision types due to their lower floating-point error. This change underscores the precision-efficiency trade-off more clearly.

Streamlined Comparison: Following the advice to avoid repetition, the initial comparison between 64-bit and 32-bit data types was made more explicit in the first sentence. This direct approach simplifies the narrative by immediately setting the stage for the discussion on trade-offs.

Algorithm Dependency: I shifted the focus from dataset size to the relevance of the algorithm's compatibility with np.float32. This section now highlights how some algorithms, particularly certain minimization methods, may not benefit from lower precision due to inherent coding in np.float64, leading to inadvertent conversions that can slow down the process and increase memory usage.

Removal of the TODO Section: As suggested, the TODO prompting further discussion on efficiency, precision, and casting policy has been removed to present the analysis as complete in its current form.
@glemaitre glemaitre self-requested a review February 19, 2024 09:56
Member

@glemaitre glemaitre left a comment


I removed the trailing spaces and rephrased a part that really looked like it was written by ChatGPT, with an overly complicated (and inconsistent with the rest of the document) choice of words.

@glemaitre
Member

Enabling auto-merge. Thanks @suhasid098

@glemaitre glemaitre enabled auto-merge (squash) February 19, 2024 12:05
@glemaitre glemaitre merged commit d7b6238 into scikit-learn:main Feb 19, 2024
