Skip to content

Vectorize Acos for Vector64/128/256/512 and TensorPrimitives#126423

Open
stephentoub wants to merge 6 commits intomainfrom
copilot/vectorize-acos
Open

Vectorize Acos for Vector64/128/256/512 and TensorPrimitives#126423
stephentoub wants to merge 6 commits intomainfrom
copilot/vectorize-acos

Conversation

@stephentoub
Copy link
Copy Markdown
Member

@stephentoub stephentoub commented Apr 1, 2026

Description

Add vectorized Acos implementations for float and double across all SIMD vector types, with unit tests and XML doc consistency.

Changes

  • Introduces Vector64/128/256/512.Acos public APIs (ref + CoreLib implementations).
  • Implements vectorized acos kernels in VectorMath for both double and float (via widening to double), ported from AMD's aocl-libm-ose optimized source (src/optimized/acos.c and src/optimized/acosf.c) using Sollya polynomial coefficients — 12-term polynomial for double precision, 5-term polynomial for float precision.
  • Enables TensorPrimitives.Acos vectorization for float/double and adjusts tensor test tolerances for Acos/AcosPi.
  • Uses <inheritdoc cref=...> for Vector128/256/512 Acos XML docs, matching the established pattern used by Asin/Cos.
  • Adds AcosDouble and AcosSingle test data to GenericMathTestMemberData with 19 test cases each covering edge cases (NaN, ±∞, ±0, ±1, and values across the valid input range).
  • Adds AcosDoubleTest and AcosSingleTest methods to Vector64Tests, Vector128Tests, Vector256Tests, and Vector512Tests (152 tests total, all passing).
  • All polynomial coefficients are bitwise-verified against the aocl-libm-ose hex float constants.
  • AcosDouble uses a single generic type parameter (TVectorDouble) since the optimized algorithm does not require the TVectorUInt64 bit-manipulation used in AsinDouble.

Add vectorized Acos implementations for float and double across all SIMD vector types.
- AcosDouble: AMD acos.c [5,4] rational polynomial with s1/c high-precision reconstruction
- AcosSingle: widens to double, uses AMD acosf.c [3,1] rational polynomial
- Hook up TensorPrimitives.Acos to use vectorized implementations

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds vectorized Acos implementations for float/double across the System.Runtime.Intrinsics SIMD surface area and enables TensorPrimitives.Acos to use those SIMD implementations when available.

Changes:

  • Introduces Vector64/128/256/512.Acos public APIs (ref + CoreLib implementations).
  • Implements vectorized acos kernels in VectorMath for both double and float (via widening to double).
  • Enables TensorPrimitives.Acos vectorization for float/double and adjusts tensor test tolerances for Acos/AcosPi.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/libraries/System.Runtime.Intrinsics/ref/System.Runtime.Intrinsics.cs Adds new public ref declarations for Vector*.Acos overloads.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/VectorMath.cs Adds the core vectorized AcosDouble / AcosSingle implementations.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector64.cs Exposes Vector64.Acos overloads and scalar fallback helper.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs Exposes Vector128.Acos overloads and fallback to Vector64.Acos.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector256.cs Exposes Vector256.Acos overloads and fallback to Vector128.Acos.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector512.cs Exposes Vector512.Acos overloads and fallback to Vector256.Acos.
src/libraries/System.Numerics.Tensors/tests/TensorPrimitives.Generic.cs Adjusts tolerances for Acos / AcosPi validation.
src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.Acos.cs Enables vectorized TensorPrimitives.Acos for float/double using Vector*.Acos.

TVectorDouble result = TVectorDouble.ConditionalSelect(transformMask, vTransform, vNormal);

// Handle special cases: |x| > 1 returns NaN, x = ±1 returns 0 or π
TVectorDouble absXGreaterThanOne = TVectorDouble.GreaterThan(TVectorDouble.Abs(x), TVectorDouble.One);
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In AcosDouble, absXGreaterThanOne recomputes Abs(x) even though ax (the absolute value) was already computed earlier. Reuse ax here to avoid redundant vector work and keep the special-case handling consistent with the rest of the method.

Suggested change
TVectorDouble absXGreaterThanOne = TVectorDouble.GreaterThan(TVectorDouble.Abs(x), TVectorDouble.One);
TVectorDouble absXGreaterThanOne = TVectorDouble.GreaterThan(ax, TVectorDouble.One);

Copilot uses AI. Check for mistakes.
@stephentoub
Copy link
Copy Markdown
Member Author

@copilot please evaluate and address the feedback

@github-actions

This comment has been minimized.

Copilot AI and others added 2 commits April 1, 2026 18:57
Copilot AI review requested due to automatic review settings April 1, 2026 19:00
@stephentoub stephentoub review requested due to automatic review settings April 1, 2026 19:00
Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 1, 2026

@copilot please evaluate and address the feedback

Addressed all four pieces of feedback in commits e7ddf77 and 548d000:

  1. Reuse ax: Changed TVectorDouble.Abs(x) to ax in absXGreaterThanOne computation
  2. <inheritdoc> for XML docs: Switched Vector128/256/512 Acos docs to use <inheritdoc cref=...> matching the Asin/Cos pattern (Vector64 → Vector128 → Vector256 → Vector512)
  3. Acos tests: Added AcosDouble/AcosSingle test data to GenericMathTestMemberData and Acos test methods to all four vector test files (152 tests, all passing)

- Fix extra leading space in AcosDouble, AcosSingle, and AcosSingleCoreDouble
  methods to match the indentation of AsinDouble/AsinSingle
- Replace magic float literal 3.1415927f with float.Pi for clarity

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 2, 2026 17:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated no new comments.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 2, 2026

🤖 Copilot Code Review — PR #126423

Note

This review was AI-generated by Copilot using Claude Opus 4.6 (primary) and GPT-5.4 (secondary).

Holistic Assessment

Motivation: Justified. TensorPrimitives.Acos was explicitly marked TODO: Vectorize, and the corresponding Asin was already vectorized in the same codebase. Vectorized trig functions provide significant batch-processing speedups. This is a natural and expected extension.

Approach: Correct and well-structured. The implementation mirrors the existing Asin vectorization pattern exactly — same VectorMath infrastructure, same Vector64→128→256→512 decomposition, same TensorPrimitives operator hookup. The math is based on AMD's AOCL-LIBM-OSE (BSD-3 licensed, referenced in THIRD-PARTY-NOTICES.TXT). The double-precision implementation shares the same [5,4] rational minimax coefficients (C1–C6, D1–D5) as AsinDouble, and the single-precision version correctly widens to double before computing.

Summary: ⚠️ Needs Human Review. The code is mathematically correct and structurally consistent with the existing Asin vectorization. Both independent model reviews (Claude Opus 4.6 and GPT-5.4) found no correctness issues. The one item requiring human judgment is the new public API surface — 8 new public methods appear in the ref assembly without a visible api-approved issue link. Given the author and that Asin counterparts already exist, this is very likely approved, but a human reviewer should confirm.


Detailed Findings

✅ Mathematical Correctness — Algorithms are sound

Both AcosDouble and AcosSingleCoreDouble correctly implement the standard acos approximation:

  • |x| ≤ 0.5: acos(x) = π/2 - (x + x³·R(x²)) via rational polynomial
  • |x| > 0.5: sign-dependent reconstruction using acos(x) = π - 2·asin(√((1-|x|)/2)) (negative) or 2·asin(√((1-x)/2)) (positive)

Key verifications:

  • Polynomial coefficients C1–C6 and D1–D5 in AcosDouble are identical to AsinDouble (as expected — both share the same underlying rational minimax)
  • The s1/c high-precision bit-masking (0xFFFFFFFF00000000) matches AsinDouble exactly
  • AcosSingleCoreDouble correctly omits the s1/c correction since double precision is sufficient for float output (matching AsinSingleCoreDouble's approach)
  • The sign-dependent transform path correctly handles acos being a non-odd function (unlike asin where sign can simply be factored out)
  • PI constant 3.1415926535897933e+00 (0x400921fb54442d18) is the correct IEEE 754 double representation

✅ Special Case Handling — All edge cases covered

Input AcosDouble AcosSingle Expected Correct
` x > 1` → NaN → NaN
x = 1 → 0 → 0 0
x = -1 → PI float.Pi π
NaN → NaN → NaN NaN
±Inf → NaN → NaN NaN
±0 → π/2 → π/2 π/2

Intermediate NaN/division-by-zero at boundary values (e.g., sqrt(negative) when |x| > 1, or (r - s1*s1)/(s + s1) when x = ±1) are all masked by the final ConditionalSelect special-case handlers. This is the same branchless SIMD approach used by AsinDouble.

✅ Structural Consistency — Matches Asin pattern exactly

  • VectorMath methods: AcosDouble/AcosSingle/AcosSingleCoreDouble mirror AsinDouble/AsinSingle/AsinSingleCoreDouble structurally
  • Vector64/128/256/512: All Acos methods use the same IsHardwareAccelerated dispatch, wider-vector upgrade path, and lower/upper decomposition as their Asin counterparts
  • TensorPrimitives: AcosOperator uses the same NET11_0_OR_GREATER gating and type dispatch as AsinOperator
  • Ref assembly: Acos entries are correctly placed alphabetically before Asin in all four Vector classes
  • XML docs: Vector64 defines full doc comments; Vector128→256→512 chain inheritdoc correctly
  • Scalar fallback: Vector64.Acos<T> matches the existing Vector64.Asin<T> loop pattern

✅ Test Coverage — Adequate

  • Theory-based tests added for all 4 vector sizes (Vector64, Vector128, Vector256, Vector512) × 2 types (float, double)
  • Test data in GenericMathTestMemberData.AcosDouble/AcosSingle covers: NaN, ±Inf, ±1, ±0, and 14 intermediate values at mathematically significant points (π/4, 1/√2, ln(2), 2/π, log₁₀(e), 1/π)
  • Variance tolerances use CrossPlatformMachineEpsilon scaled appropriately
  • TensorPrimitives.Generic.cs tolerance updated from null to 2e-9/1e-6f to account for MultiplyAddEstimate differences — reasonable for vectorized approximations

⚠️ API Surface — Needs human confirmation of approval

This PR adds 8 new public methods to the ref assembly (Vector64.Acos(float), Vector64.Acos(double), etc. for all 4 vector types). No linked issue with api-approved label was found in the commit messages. Given that:

  • The author (stephentoub) is a .NET runtime maintainer and API reviewer
  • The identical Asin counterpart APIs already exist
  • This follows an established vectorization pattern

...this is very likely already approved, but a human reviewer should confirm the API approval status before merge.

Generated by Code Review for issue #126423 ·

// where R(x^2) is a [5,4] rational minimax approximation (same as asin.c).

// Rational polynomial coefficients (same as asin.c / AMD acos.c)
const double C1 = 0.227485835556935010735943483075;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot where do these polynomials come from?

The aocl-libm acos implementation uses 0x1.55555555552aap-3 as the value for C1, which is bitwise 0x3fc55555555552aa, representing 0.166666666666647700356662653575767762959003448486328125 as the infinitely precise value.

The algorithm needs to be ported from the aocl-libm-ose source.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ported the correct algorithm from src/optimized/acos.c and src/optimized/acosf.c in commit 58d2ed8. The previous implementation incorrectly used the rational polynomial from src/ref/acos.c. The new implementation uses:

  • AcosDouble: 12-term Sollya polynomial (C1-C12) matching the optimized acos.c, with split-constant reconstruction (A - poly) + A / (B + poly) + B for precision
  • AcosSingleCoreDouble: 5-term Sollya polynomial (C1-C5) matching the optimized acosf.c, same reconstruction approach

All 152 Acos vector tests pass.

Copilot AI and others added 2 commits April 6, 2026 17:00
…coefficients

Agent-Logs-Url: https://github.com/dotnet/runtime/sessions/22f9ee13-f6b9-4c35-ad05-63ff26534e31

Co-authored-by: tannergooding <10487869+tannergooding@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 6, 2026 17:05
@stephentoub stephentoub review requested due to automatic review settings April 6, 2026 17:05
Copilot AI requested a review from tannergooding April 6, 2026 17:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants