DOC Mentioned efficiency and precision issues in glossary.rst#28223
glemaitre merged 5 commits into scikit-learn:main from
Conversation
Hey @glemaitre, I saw that you mentioned we could ping you for PRs. I'm wondering if I could get a review please! 😃
doc/glossary.rst (Outdated)
> In regards to the efficiency and precision of the NumPy array, the data type
> (`dtype`) can make a significant impact. If one chooses a data type with
> a higher precision like `np.float64` or `np.int64` it will result in slow
I think that we need to mitigate the "slow operations" claim. Basically, using lower precision can provide a speed-up if the operation under the hood leverages optimizations such as vectorization, SIMD, or cache optimization. If those are not implemented or taken care of, you might not get any speed-up.
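To make the memory half of this trade-off concrete, here is a minimal illustrative NumPy sketch (not part of the PR diff): the memory saving from `np.float32` is guaranteed, while any speed-up depends on the underlying operation being vectorized for that dtype.

```python
import numpy as np

# float64 stores 8 bytes per element, float32 stores 4.
x64 = np.ones(1_000_000, dtype=np.float64)
x32 = x64.astype(np.float32)

print(x64.nbytes)  # 8000000
print(x32.nbytes)  # 4000000
```

The halved footprint is unconditional; whether the arithmetic itself runs faster depends on SIMD support in the underlying kernels, as noted above.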
doc/glossary.rst (Outdated)
> In regards to the efficiency and precision of the NumPy array, the data type
> (`dtype`) can make a significant impact. If one chooses a data type with
> a higher precision like `np.float64` or `np.int64` it will result in slow
> operations, and increase memory usage, but it will result in accurate results.
accurate results -> more accurate results, due to the ability to represent floating-point numbers with a lower floating-point error.
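The lower floating-point error of 64-bit types can be read directly off the machine epsilon of each dtype (an illustrative snippet, not code from the diff):

```python
import numpy as np

# Machine epsilon: the gap between 1.0 and the next representable number.
eps32 = np.finfo(np.float32).eps  # 2**-23, about 1.19e-07
eps64 = np.finfo(np.float64).eps  # 2**-52, about 2.22e-16

# float64 carries roughly nine more decimal digits of precision.
print(eps32 > eps64)  # True
```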
doc/glossary.rst (Outdated)
> (`dtype`) can make a significant impact. If one chooses a data type with
> a higher precision like `np.float64` or `np.int64` it will result in slow
> operations, and increase memory usage, but it will result in accurate results.
> However, if one opts for lower precision types like `np.float32` or `np.int32`
I don't think that we need to repeat. We can just make the comparison between 64-bit and 32-bit numbers explicit in the first sentence.
doc/glossary.rst (Outdated)
> results. The analysis of the choice will be dependent on the size of the
> dataset needed in machine learning tasks. For example, if a dataset is large,
> it would be worth considering a lower-precision data type for a larger dataset
> since speed and accuracy would be a priority.
Instead of speaking about the dataset, it all comes down to whether the algorithm used under the hood leverages np.float32. For instance, some minimization methods are only coded in np.float64, and even passing np.float32 will trigger a conversion; the code will then be slower and consume more memory with np.float32 input, due to the extra conversion.
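That silent conversion can be sketched with NumPy's type-promotion rules (a hypothetical illustration, not code from scikit-learn): when float32 input meets float64 internals, the result is upcast to float64 and a converted copy is allocated, so the float32 input saved nothing.

```python
import numpy as np

x32 = np.ones(5, dtype=np.float32)  # caller passes float32 data
w64 = np.ones(5, dtype=np.float64)  # internals are coded in float64

# Mixing the two dtypes upcasts the whole computation to float64;
# a new float64 array is allocated for the converted operand.
result = x32 + w64
print(result.dtype)  # float64
```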
You can remove the TODO below.
I've incorporated the insights and suggestions from glemaitre to improve the clarity and technical accuracy of the document. Here's a summary of the changes:

- Mitigating slow operations: as suggested, the text now explains that lower precision (e.g. np.float32 over np.float64) can speed up operations only if the underlying process leverages optimizations like vectorization, SIMD, or cache optimization, and that their absence can negate the expected speed improvements.
- Accuracy description: updated the text to say "more accurate results" are achievable with higher-precision types due to their lower floating-point error, which states the precision-efficiency trade-off more clearly.
- Streamlined comparison: following the advice to avoid repetition, the comparison between 64-bit and 32-bit data types is now made explicit in the first sentence.
- Algorithm dependency: shifted the focus from dataset size to whether the algorithm leverages np.float32. The section now notes that some algorithms, such as certain minimization methods, are coded only in np.float64, so passing np.float32 triggers a conversion that slows down the process and increases memory usage.
- Removed the TODO: the TODO prompting further discussion of efficiency, precision, and the casting policy has been removed.
glemaitre left a comment
I removed the trailing spaces and rephrased a part that really looks like it was written by ChatGPT, with an overly complicated (and not consistent with the rest of the document) choice of words.
Enabling auto-merge. Thanks @suhasid098
Reference Issues/PRs
What does this implement/fix? Explain your changes.
I found a TODO in glossary.rst asking for a description of the efficiency and precision implications of NumPy data types. This is a documentation-only change.
Any other comments?