fixed Float16 from Float64 and BigFloat#42837
Conversation
|
Testing these conversions are really hard because the way you generally test math is computing in higher precision and checking if the result is the same. I should probably add tests for a few likely problematic values like |
|
Maybe one test for previous float, not sure if it would add coverage or just add clutter. |
|
Tests added. This still has a bug for some values less than |
base/mpfr.jl
Outdated
| function Float16(x::BigFloat) :: Float16 | ||
| res = Float32(x) | ||
| resi = reinterpret(UInt32, res) | ||
| if (resi&0x7fffffff) < 0x38800000 |
There was a problem hiding this comment.
Maybe this starts to need some comments now. It's a bit of magic number soup now.
There was a problem hiding this comment.
good point. that's just if Float16(res) is subnormal.
There was a problem hiding this comment.
And this is more efficient than issubnormal(Float16(res))? Does it necessarily have to be?
There was a problem hiding this comment.
yeah. this is only 2 instructions while converting to float16 is relatively expensive.
|
Should we merge this as is? This is already a major improvement over current status and I'm not 100% sure how to fix the last remaining case the edge between |
|
Do you mean all of |
|
yeah. Fixed. |
|
The only test error remaining appears to be openblas threading issue. |
|
last remaining bug fixed! from my perspective, this is ready to merge. |
|
How is it ensured that these tests both test the C version and the Julia version? |
|
the c version is float64. the Julia version is bigfloat |
vtjnash
left a comment
There was a problem hiding this comment.
I helped co-author this, so I will approve here too.
|
Any objections to merging this? |
|
Needs a manual backport towards 1.7. |
|
What branch should I target the PR to? |
|
|
|
backported: #43092 is the backport |
* fixed Float16 from Float64 and BigFloat. Many thanks to Jamison.
* fixed Float16 from Float64 and BigFloat. Many thanks to Jamison.
|
Also needs a manual backport towards |
previously
Float16(::Float64)andFloat16(::BigFloat)had double rounding issues.