
Inline rounding.c and reduce.c #410

Merged: mkannwischer merged 2 commits into main from inline2 on Aug 7, 2025

Conversation

@mkannwischer mkannwischer (Contributor) commented Aug 4, 2025

With constant-time hardening in place, we should be able to safely inline reduce.c and rounding.c.
This should vastly improve non-LTO performance. As CI is using LTO, I do not expect much improvement there.
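As a rough illustration of why inlining helps builds without LTO: a small helper that lives in its own .c file costs a real function call from every other translation unit, whereas a `static inline` definition in a header can be inlined at each call site. The sketch below is hypothetical and is not the project's actual code. The names `example_reduce32` and `example_poly_reduce` are assumptions, and the reduction is a simplified variant for the ML-DSA modulus q = 8380417.

```c
#include <stddef.h>
#include <stdint.h>

/* ML-DSA modulus q = 2^23 - 2^13 + 1 */
#define EXAMPLE_Q 8380417

/*
 * Hypothetical helper (not the project's actual code). Defined as
 * `static inline` in a header, it is visible to every translation unit
 * that includes it, so the compiler can inline it without LTO.
 * Assumes a <= 2^31 - 2^22 - 1 and an arithmetic right shift for
 * negative values (implementation-defined in C, but near-universal).
 */
static inline int32_t example_reduce32(int32_t a)
{
  int32_t t = (a + (1 << 22)) >> 23; /* t = a / 2^23, rounded */
  return a - t * EXAMPLE_Q;          /* congruent to a mod q, small magnitude */
}

/* Typical hot loop: one reduction per coefficient. */
static inline void example_poly_reduce(int32_t *coeffs, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
  {
    coeffs[i] = example_reduce32(coeffs[i]);
  }
}
```

With the helper in a separate .c file, the loop in `example_poly_reduce` would pay call overhead per coefficient unless LTO merges the translation units, which is consistent with the larger gains reported for signing.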

I ran benchmarks on an Apple M1 (using `./scripts/tests bench -c MAC -r` with clang 16, i.e., no LTO):

Performance Results (no_opt)

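ML-DSA-44

| Version | Keypair (cycles) | Sign (cycles) | Verify (cycles) |
| --- | --- | --- | --- |
| Main | 133,231 | 600,444 | 159,982 |
| Commit 1 | 130,432 | 595,679 | 157,727 |
| Commit 2 | 117,094 | 463,152 | 138,418 |

ML-DSA-65

| Version | Keypair (cycles) | Sign (cycles) | Verify (cycles) |
| --- | --- | --- | --- |
| Main | 224,854 | 985,587 | 251,728 |
| Commit 1 | 219,835 | 975,350 | 248,134 |
| Commit 2 | 200,026 | 749,140 | 219,666 |

ML-DSA-87

| Version | Keypair (cycles) | Sign (cycles) | Verify (cycles) |
| --- | --- | --- | --- |
| Main | 366,192 | 1,208,165 | 403,781 |
| Commit 1 | 360,544 | 1,195,145 | 398,647 |
| Commit 2 | 330,288 | 932,454 | 357,744 |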

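Commit 1 vs Main (inline rounding.c)

| Algorithm | Keypair | Sign | Verify |
| --- | --- | --- | --- |
| ML-DSA-44 | 1.02x | 1.01x | 1.01x |
| ML-DSA-65 | 1.02x | 1.01x | 1.01x |
| ML-DSA-87 | 1.02x | 1.01x | 1.01x |

Commit 2 vs Main (cumulative improvement from inlining rounding.c and reduce.c)

| Algorithm | Keypair | Sign | Verify |
| --- | --- | --- | --- |
| ML-DSA-44 | 1.14x | 1.30x | 1.16x |
| ML-DSA-65 | 1.12x | 1.32x | 1.15x |
| ML-DSA-87 | 1.11x | 1.30x | 1.13x |

Commit 2 vs Commit 1 (inline reduce.c only)

| Algorithm | Keypair | Sign | Verify |
| --- | --- | --- | --- |
| ML-DSA-44 | 1.11x | 1.29x | 1.14x |
| ML-DSA-65 | 1.10x | 1.30x | 1.13x |
| ML-DSA-87 | 1.09x | 1.28x | 1.11x |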

Performance Results (opt)

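ML-DSA-44

| Version | Keypair (cycles) | Sign (cycles) | Verify (cycles) |
| --- | --- | --- | --- |
| Main | 59,570 | 295,524 | 84,262 |
| Commit 1 | 56,699 | 281,187 | 80,827 |
| Commit 2 | 50,752 | 222,915 | 73,456 |

ML-DSA-65

| Version | Keypair (cycles) | Sign (cycles) | Verify (cycles) |
| --- | --- | --- | --- |
| Main | 102,171 | 475,945 | 129,972 |
| Commit 1 | 98,002 | 466,443 | 126,838 |
| Commit 2 | 87,869 | 357,814 | 113,948 |

ML-DSA-87

| Version | Keypair (cycles) | Sign (cycles) | Verify (cycles) |
| --- | --- | --- | --- |
| Main | 162,700 | 569,587 | 200,068 |
| Commit 1 | 157,204 | 558,493 | 195,431 |
| Commit 2 | 140,383 | 427,560 | 175,672 |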

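Commit 1 vs Main (inline rounding.c)

| Algorithm | Keypair | Sign | Verify |
| --- | --- | --- | --- |
| ML-DSA-44 | 1.05x | 1.05x | 1.04x |
| ML-DSA-65 | 1.04x | 1.02x | 1.02x |
| ML-DSA-87 | 1.03x | 1.02x | 1.02x |

Commit 2 vs Main (cumulative improvement from inlining rounding.c and reduce.c)

| Algorithm | Keypair | Sign | Verify |
| --- | --- | --- | --- |
| ML-DSA-44 | 1.17x | 1.33x | 1.15x |
| ML-DSA-65 | 1.16x | 1.33x | 1.14x |
| ML-DSA-87 | 1.16x | 1.33x | 1.14x |

Commit 2 vs Commit 1 (inline reduce.c only)

| Algorithm | Keypair | Sign | Verify |
| --- | --- | --- | --- |
| ML-DSA-44 | 1.12x | 1.26x | 1.10x |
| ML-DSA-65 | 1.12x | 1.30x | 1.11x |
| ML-DSA-87 | 1.12x | 1.31x | 1.11x |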

This also closes the gap between LTO and no-LTO builds: LTO now gives less than a 2% speed-up.
Comparing the second commit from above with and without LTO on Apple M1 (using `./scripts/tests bench -c MAC -r --cflags="-flto"`):

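| Operation (avg cycles) | Build | Configuration | ML-DSA-44 | ML-DSA-65 | ML-DSA-87 |
| --- | --- | --- | --- | --- | --- |
| Keypair | No Opt | No LTO | 117,094 | 200,026 | 330,288 |
| Keypair | No Opt | LTO | 116,422 | 198,518 | 325,527 |
| Keypair | No Opt | LTO improvement | 0.6% better | 0.8% better | 1.4% better |
| Keypair | Opt | No LTO | 50,752 | 87,869 | 140,383 |
| Keypair | Opt | LTO | 50,404 | 87,459 | 139,564 |
| Keypair | Opt | LTO improvement | 0.7% better | 0.5% better | 0.6% better |
| Sign | No Opt | No LTO | 463,152 | 749,140 | 932,454 |
| Sign | No Opt | LTO | 461,629 | 747,895 | 931,065 |
| Sign | No Opt | LTO improvement | 0.3% better | 0.2% better | 0.1% better |
| Sign | Opt | No LTO | 222,915 | 357,814 | 427,560 |
| Sign | Opt | LTO | 222,269 | 357,119 | 427,335 |
| Sign | Opt | LTO improvement | 0.3% better | 0.2% better | 0.1% better |
| Verify | No Opt | No LTO | 138,418 | 219,666 | 357,744 |
| Verify | No Opt | LTO | 137,450 | 217,326 | 353,520 |
| Verify | No Opt | LTO improvement | 0.7% better | 1.1% better | 1.2% better |
| Verify | Opt | No LTO | 73,456 | 113,948 | 175,672 |
| Verify | Opt | LTO | 72,874 | 112,457 | 172,648 |
| Verify | Opt | LTO improvement | 0.8% better | 1.3% better | 1.7% better |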
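For reference, the speed-up factors and LTO percentages reported above can be reproduced from the raw cycle counts roughly as follows. The function names are illustrative, and the formulas (baseline divided by new count for the speed-up, relative cycle reduction for the percentage) are inferred from the reported numbers rather than taken from the benchmark scripts.

```c
#include <stdint.h>

/* Speed-up factor: how many times faster `after` is than `before`. */
static double example_speedup(uint64_t before_cycles, uint64_t after_cycles)
{
  return (double)before_cycles / (double)after_cycles;
}

/* Relative cycle reduction in percent (positive means fewer cycles). */
static double example_percent_fewer(uint64_t before_cycles,
                                    uint64_t after_cycles)
{
  return 100.0 * ((double)before_cycles - (double)after_cycles) /
         (double)before_cycles;
}
```

For ML-DSA-44 signing (no_opt), `example_speedup(600444, 463152)` is about 1.30, and for ML-DSA-87 verification (opt), `example_percent_fewer(175672, 172648)` is about 1.7.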

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 50490 cycles 50488 cycles 1.00
ML-DSA-44 sign 222972 cycles 221550 cycles 1.01
ML-DSA-44 verify 72845 cycles 72861 cycles 1.00
ML-DSA-65 keypair 87373 cycles 87430 cycles 1.00
ML-DSA-65 sign 356131 cycles 355968 cycles 1.00
ML-DSA-65 verify 112686 cycles 112701 cycles 1.00
ML-DSA-87 keypair 140128 cycles 139983 cycles 1.00
ML-DSA-87 sign 425731 cycles 425589 cycles 1.00
ML-DSA-87 verify 173211 cycles 173161 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 116006 cycles 116048 cycles 1.00
ML-DSA-44 sign 455013 cycles 453867 cycles 1.00
ML-DSA-44 verify 136874 cycles 136916 cycles 1.00
ML-DSA-65 keypair 197992 cycles 197974 cycles 1.00
ML-DSA-65 sign 733195 cycles 733093 cycles 1.00
ML-DSA-65 verify 216797 cycles 216913 cycles 1.00
ML-DSA-87 keypair 335075 cycles 325071 cycles 1.03
ML-DSA-87 sign 915069 cycles 914947 cycles 1.00
ML-DSA-87 verify 353281 cycles 353151 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Mac Mini (M1, 2020) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-87 keypair 335075 cycles 325071 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.
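The alert condition can be sketched as a simple ratio test. This is an illustration inferred from the message above (ratio of current to previous cycles, flagged when it exceeds 1.03), not the actual github-action-benchmark implementation; the function name and the strict comparison are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold quoted in the alert message. */
#define EXAMPLE_ALERT_THRESHOLD 1.03

/* Returns true when current / previous exceeds the alert threshold. */
static bool example_is_regression(uint64_t current_cycles,
                                  uint64_t previous_cycles)
{
  double ratio = (double)current_cycles / (double)previous_cycles;
  return ratio > EXAMPLE_ALERT_THRESHOLD;
}
```

With the numbers above, `example_is_regression(335075, 325071)` is true because the unrounded ratio is about 1.031, which would explain why a result displayed as 1.03 still triggers the alert.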

@oqs-bot oqs-bot (Contributor) left a comment

Intel Xeon 4th gen (c7i)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 37475 cycles 37315 cycles 1.00
ML-DSA-44 sign 169677 cycles 169696 cycles 1.00
ML-DSA-44 verify 50158 cycles 50120 cycles 1.00
ML-DSA-65 keypair 66785 cycles 66560 cycles 1.00
ML-DSA-65 sign 280070 cycles 280908 cycles 1.00
ML-DSA-65 verify 78663 cycles 78493 cycles 1.00
ML-DSA-87 keypair 100811 cycles 101215 cycles 1.00
ML-DSA-87 sign 326631 cycles 326489 cycles 1.00
ML-DSA-87 verify 116185 cycles 117806 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Intel Xeon 3rd gen (c6i)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 61097 cycles 61319 cycles 1.00
ML-DSA-44 sign 264587 cycles 265061 cycles 1.00
ML-DSA-44 verify 80366 cycles 80809 cycles 0.99
ML-DSA-65 keypair 106800 cycles 106734 cycles 1.00
ML-DSA-65 sign 436283 cycles 436028 cycles 1.00
ML-DSA-65 verify 127404 cycles 127309 cycles 1.00
ML-DSA-87 keypair 165529 cycles 165748 cycles 1.00
ML-DSA-87 sign 518220 cycles 515316 cycles 1.01
ML-DSA-87 verify 192218 cycles 191132 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 96141 cycles 96567 cycles 1.00
ML-DSA-44 sign 351718 cycles 352042 cycles 1.00
ML-DSA-44 verify 105171 cycles 105507 cycles 1.00
ML-DSA-65 keypair 163839 cycles 164849 cycles 0.99
ML-DSA-65 sign 579619 cycles 578557 cycles 1.00
ML-DSA-65 verify 169522 cycles 170920 cycles 0.99
ML-DSA-87 keypair 273631 cycles 275791 cycles 0.99
ML-DSA-87 sign 734824 cycles 732477 cycles 1.00
ML-DSA-87 verify 281021 cycles 281986 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

AMD EPYC 3rd gen (c6a)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 73968 cycles 77878 cycles 0.95
ML-DSA-44 sign 269536 cycles 281354 cycles 0.96
ML-DSA-44 verify 88927 cycles 93507 cycles 0.95
ML-DSA-65 keypair 128959 cycles 126862 cycles 1.02
ML-DSA-65 sign 440114 cycles 436323 cycles 1.01
ML-DSA-65 verify 144544 cycles 143280 cycles 1.01
ML-DSA-87 keypair 210813 cycles 210856 cycles 1.00
ML-DSA-87 sign 544841 cycles 544492 cycles 1.00
ML-DSA-87 verify 230027 cycles 230070 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 159236 cycles 159603 cycles 1.00
ML-DSA-44 sign 576240 cycles 574646 cycles 1.00
ML-DSA-44 verify 175523 cycles 175554 cycles 1.00
ML-DSA-65 keypair 272628 cycles 272349 cycles 1.00
ML-DSA-65 sign 948903 cycles 939166 cycles 1.01
ML-DSA-65 verify 283749 cycles 283581 cycles 1.00
ML-DSA-87 keypair 453937 cycles 452169 cycles 1.00
ML-DSA-87 sign 1197536 cycles 1185848 cycles 1.01
ML-DSA-87 verify 471800 cycles 469407 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

AMD EPYC 4th gen (c7a)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 44597 cycles 44555 cycles 1.00
ML-DSA-44 sign 197066 cycles 196360 cycles 1.00
ML-DSA-44 verify 60229 cycles 60162 cycles 1.00
ML-DSA-65 keypair 78871 cycles 76125 cycles 1.04
ML-DSA-65 sign 331848 cycles 318170 cycles 1.04
ML-DSA-65 verify 99320 cycles 93458 cycles 1.06
ML-DSA-87 keypair 115528 cycles 115550 cycles 1.00
ML-DSA-87 sign 368922 cycles 367938 cycles 1.00
ML-DSA-87 verify 138832 cycles 136834 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-65 keypair 78871 cycles 76125 cycles 1.04
ML-DSA-65 sign 331848 cycles 318170 cycles 1.04
ML-DSA-65 verify 99320 cycles 93458 cycles 1.06

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 136592 cycles 136387 cycles 1.00
ML-DSA-44 sign 553396 cycles 552602 cycles 1.00
ML-DSA-44 verify 154957 cycles 154968 cycles 1.00
ML-DSA-65 keypair 227998 cycles 228515 cycles 1.00
ML-DSA-65 sign 896453 cycles 892614 cycles 1.00
ML-DSA-65 verify 243881 cycles 243755 cycles 1.00
ML-DSA-87 keypair 376788 cycles 376213 cycles 1.00
ML-DSA-87 sign 1116492 cycles 1119589 cycles 1.00
ML-DSA-87 verify 398522 cycles 397698 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Graviton4

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 73176 cycles 73120 cycles 1.00
ML-DSA-44 sign 282505 cycles 282739 cycles 1.00
ML-DSA-44 verify 87069 cycles 87104 cycles 1.00
ML-DSA-65 keypair 128592 cycles 128241 cycles 1.00
ML-DSA-65 sign 460852 cycles 460971 cycles 1.00
ML-DSA-65 verify 139113 cycles 138989 cycles 1.00
ML-DSA-87 keypair 208290 cycles 207784 cycles 1.00
ML-DSA-87 sign 564264 cycles 563219 cycles 1.00
ML-DSA-87 verify 222990 cycles 222452 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 121280 cycles 121795 cycles 1.00
ML-DSA-44 sign 463697 cycles 463017 cycles 1.00
ML-DSA-44 verify 137500 cycles 137430 cycles 1.00
ML-DSA-65 keypair 206358 cycles 206690 cycles 1.00
ML-DSA-65 sign 753174 cycles 745788 cycles 1.01
ML-DSA-65 verify 216373 cycles 216570 cycles 1.00
ML-DSA-87 keypair 340934 cycles 339701 cycles 1.00
ML-DSA-87 sign 956008 cycles 948690 cycles 1.01
ML-DSA-87 verify 357881 cycles 355383 cycles 1.01

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Graviton3

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 78065 cycles 77998 cycles 1.00
ML-DSA-44 sign 303838 cycles 304214 cycles 1.00
ML-DSA-44 verify 95596 cycles 95550 cycles 1.00
ML-DSA-65 keypair 135065 cycles 134987 cycles 1.00
ML-DSA-65 sign 496942 cycles 496900 cycles 1.00
ML-DSA-65 verify 151295 cycles 151237 cycles 1.00
ML-DSA-87 keypair 217777 cycles 217475 cycles 1.00
ML-DSA-87 sign 606834 cycles 606659 cycles 1.00
ML-DSA-87 verify 239581 cycles 239653 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Graviton4 (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 134654 cycles 134693 cycles 1.00
ML-DSA-44 sign 520087 cycles 508601 cycles 1.02
ML-DSA-44 verify 149526 cycles 149568 cycles 1.00
ML-DSA-65 keypair 228421 cycles 228621 cycles 1.00
ML-DSA-65 sign 824504 cycles 819792 cycles 1.01
ML-DSA-65 verify 237394 cycles 237589 cycles 1.00
ML-DSA-87 keypair 377029 cycles 377513 cycles 1.00
ML-DSA-87 sign 1029694 cycles 1030352 cycles 1.00
ML-DSA-87 verify 391025 cycles 390819 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Graviton2

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 120565 cycles 120764 cycles 1.00
ML-DSA-44 sign 488358 cycles 488941 cycles 1.00
ML-DSA-44 verify 145830 cycles 145977 cycles 1.00
ML-DSA-65 keypair 207541 cycles 207520 cycles 1.00
ML-DSA-65 sign 801841 cycles 802514 cycles 1.00
ML-DSA-65 verify 232008 cycles 231762 cycles 1.00
ML-DSA-87 keypair 336864 cycles 336645 cycles 1.00
ML-DSA-87 sign 986735 cycles 986284 cycles 1.00
ML-DSA-87 verify 369607 cycles 370182 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Graviton3 (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 139834 cycles 139752 cycles 1.00
ML-DSA-44 sign 513319 cycles 505561 cycles 1.02
ML-DSA-44 verify 154555 cycles 154400 cycles 1.00
ML-DSA-65 keypair 245231 cycles 244774 cycles 1.00
ML-DSA-65 sign 824646 cycles 822040 cycles 1.00
ML-DSA-65 verify 248436 cycles 248601 cycles 1.00
ML-DSA-87 keypair 397826 cycles 397508 cycles 1.00
ML-DSA-87 sign 1043598 cycles 1043535 cycles 1.00
ML-DSA-87 verify 411641 cycles 411198 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot (Contributor) left a comment

Graviton2 (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 215285 cycles 215420 cycles 1.00
ML-DSA-44 sign 798500 cycles 810106 cycles 0.99
ML-DSA-44 verify 239636 cycles 239763 cycles 1.00
ML-DSA-65 keypair 383817 cycles 384187 cycles 1.00
ML-DSA-65 sign 1313303 cycles 1308318 cycles 1.00
ML-DSA-65 verify 385056 cycles 385474 cycles 1.00
ML-DSA-87 keypair 611765 cycles 613297 cycles 1.00
ML-DSA-87 sign 1666280 cycles 1667822 cycles 1.00
ML-DSA-87 verify 637589 cycles 638454 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 294317 cycles 293781 cycles 1.00
ML-DSA-44 sign 1225926 cycles 1224715 cycles 1.00
ML-DSA-44 verify 356148 cycles 356492 cycles 1.00
ML-DSA-65 keypair 498809 cycles 499030 cycles 1.00
ML-DSA-65 sign 1966811 cycles 1968154 cycles 1.00
ML-DSA-65 verify 554922 cycles 555182 cycles 1.00
ML-DSA-87 keypair 846548 cycles 859424 cycles 0.99
ML-DSA-87 sign 2536389 cycles 2621075 cycles 0.97
ML-DSA-87 verify 920026 cycles 926631 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 828262 cycles 946682 cycles 0.87
ML-DSA-44 sign 3363987 cycles 4434339 cycles 0.76
ML-DSA-44 verify 931672 cycles 1090557 cycles 0.85
ML-DSA-65 keypair 1395687 cycles 1578765 cycles 0.88
ML-DSA-65 sign 5489358 cycles 7340298 cycles 0.75
ML-DSA-65 verify 1481977 cycles 1723027 cycles 0.86
ML-DSA-87 keypair 2314070 cycles 2543418 cycles 0.91
ML-DSA-87 sign 6937123 cycles 8975747 cycles 0.77
ML-DSA-87 verify 2444260 cycles 2750040 cycles 0.89

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 120435 cycles 120254 cycles 1.00
ML-DSA-44 sign 488092 cycles 488304 cycles 1.00
ML-DSA-44 verify 145648 cycles 145589 cycles 1.00
ML-DSA-65 keypair 207583 cycles 207315 cycles 1.00
ML-DSA-65 sign 801742 cycles 802722 cycles 1.00
ML-DSA-65 verify 232036 cycles 231653 cycles 1.00
ML-DSA-87 keypair 336621 cycles 336116 cycles 1.00
ML-DSA-87 sign 985366 cycles 985370 cycles 1.00
ML-DSA-87 verify 369932 cycles 370098 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 214982 cycles 214931 cycles 1.00
ML-DSA-44 sign 797243 cycles 797190 cycles 1.00
ML-DSA-44 verify 239466 cycles 239916 cycles 1.00
ML-DSA-65 keypair 383206 cycles 384108 cycles 1.00
ML-DSA-65 sign 1320407 cycles 1313611 cycles 1.01
ML-DSA-65 verify 384384 cycles 385437 cycles 1.00
ML-DSA-87 keypair 611054 cycles 612519 cycles 1.00
ML-DSA-87 sign 1663336 cycles 1665118 cycles 1.00
ML-DSA-87 verify 637137 cycles 638211 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
ML-DSA-44 keypair 465292 cycles 464533 cycles 1.00
ML-DSA-44 sign 2235730 cycles 2239528 cycles 1.00
ML-DSA-44 verify 560391 cycles 560538 cycles 1.00
ML-DSA-65 keypair 778045 cycles 778157 cycles 1.00
ML-DSA-65 sign 3647080 cycles 3662068 cycles 1.00
ML-DSA-65 verify 865702 cycles 866024 cycles 1.00
ML-DSA-87 keypair 1254784 cycles 1257170 cycles 1.00
ML-DSA-87 sign 4531602 cycles 4516332 cycles 1.00
ML-DSA-87 verify 1384203 cycles 1387643 cycles 1.00

This comment was automatically generated by workflow using github-action-benchmark.

@mkannwischer (Contributor, Author) commented

> SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)
>
> Benchmark suite Current: d0da416 Previous: 108d2d3 Ratio
> ML-DSA-44 keypair 828262 cycles 946682 cycles 0.87
> ML-DSA-44 sign 3363987 cycles 4434339 cycles 0.76
> ML-DSA-44 verify 931672 cycles 1090557 cycles 0.85
> ML-DSA-65 keypair 1395687 cycles 1578765 cycles 0.88
> ML-DSA-65 sign 5489358 cycles 7340298 cycles 0.75
> ML-DSA-65 verify 1481977 cycles 1723027 cycles 0.86
> ML-DSA-87 keypair 2314070 cycles 2543418 cycles 0.91
> ML-DSA-87 sign 6937123 cycles 8975747 cycles 0.77
> ML-DSA-87 verify 2444260 cycles 2750040 cycles 0.89
>
> This comment was automatically generated by workflow using github-action-benchmark.

This improvement is very extreme (9% to 25% fewer cycles). Everything else looks as expected.

@mkannwischer mkannwischer marked this pull request as ready for review August 4, 2025 05:11
@mkannwischer mkannwischer requested a review from a team as a code owner August 4, 2025 05:11
@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Benchmark suite Current: d0da416 Previous: faefbbe Ratio
ML-DSA-44 keypair 243965 cycles 240968 cycles 1.01
ML-DSA-44 sign 848049 cycles 830594 cycles 1.02
ML-DSA-44 verify 273856 cycles 262743 cycles 1.04
ML-DSA-65 keypair 419643 cycles 408068 cycles 1.03
ML-DSA-65 sign 1354778 cycles 1340034 cycles 1.01
ML-DSA-65 verify 435934 cycles 427802 cycles 1.02
ML-DSA-87 keypair 687064 cycles 683402 cycles 1.01
ML-DSA-87 sign 1727385 cycles 1740210 cycles 0.99
ML-DSA-87 verify 696746 cycles 702708 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: d0da416 Previous: faefbbe Ratio
ML-DSA-44 verify 273856 cycles 262743 cycles 1.04

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Benchmark suite Current: d0da416 Previous: faefbbe Ratio
ML-DSA-44 keypair 322104 cycles 306602 cycles 1.05
ML-DSA-44 sign 1259169 cycles 1197826 cycles 1.05
ML-DSA-44 verify 353015 cycles 342184 cycles 1.03
ML-DSA-65 keypair 593365 cycles 560307 cycles 1.06
ML-DSA-65 sign 1997127 cycles 1972800 cycles 1.01
ML-DSA-65 verify 560706 cycles 546049 cycles 1.03
ML-DSA-87 keypair 890525 cycles 869087 cycles 1.02
ML-DSA-87 sign 2589746 cycles 2463008 cycles 1.05
ML-DSA-87 verify 934872 cycles 890530 cycles 1.05

This comment was automatically generated by workflow using github-action-benchmark.

@github-actions github-actions bot left a comment

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: d0da416 Previous: faefbbe Ratio
ML-DSA-44 keypair 322104 cycles 306602 cycles 1.05
ML-DSA-44 sign 1259169 cycles 1197826 cycles 1.05
ML-DSA-44 verify 353015 cycles 342184 cycles 1.03
ML-DSA-65 keypair 593365 cycles 560307 cycles 1.06
ML-DSA-87 sign 2589746 cycles 2463008 cycles 1.05
ML-DSA-87 verify 934872 cycles 890530 cycles 1.05

This comment was automatically generated by workflow using github-action-benchmark.

@hanno-becker hanno-becker (Contributor) left a comment

Noted some pre-existing copy-pasta you can fix along the way if you like. Otherwise LGTM.

After the constant-time hardening has been applied, we can simply inline
all functions. This should dramatically improve non-LTO performance.

Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
@mkannwischer mkannwischer merged commit cede4fa into main Aug 7, 2025
209 checks passed
@mkannwischer mkannwischer deleted the inline2 branch August 7, 2025 08:11

Development

Successfully merging this pull request may close these issues.

Consider inlining reduce.c

3 participants