8244778: Archive full module graph in CDS#80

Closed
iklam wants to merge 10 commits into openjdk:master from iklam:8244778-archive-full-module-graph

Member

@iklam iklam commented Sep 8, 2020

This is the same patch as 8244778-archive-full-module-graph.v03 published in hotspot-runtime-dev@openjdk.java.net.

The rest of the review will continue on GitHub. I will add new commits to respond to comments to the above e-mail.


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

Reviewers

Download

$ git fetch https://git.openjdk.java.net/jdk pull/80/head:pull/80
$ git checkout pull/80

Member Author

iklam commented Sep 8, 2020

/cc core-libs


bridgekeeper Bot commented Sep 8, 2020

👋 Welcome back iklam! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk Bot added the core-libs core-libs-dev@openjdk.org label Sep 8, 2020

openjdk Bot commented Sep 8, 2020

@iklam
The core-libs label was successfully added.


openjdk Bot commented Sep 8, 2020

@iklam The following labels will be automatically applied to this pull request: build hotspot security.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label (add|remove) "label" command.

@openjdk openjdk Bot added security security-dev@openjdk.org hotspot hotspot-dev@openjdk.org build build-dev@openjdk.org labels Sep 8, 2020
Member Author

iklam commented Sep 8, 2020

/reviewer add lfoltan,coleenp,alanb,mchung


openjdk Bot commented Sep 8, 2020

@iklam
Reviewer lfoltan successfully added.

Reviewer coleenp successfully added.

Reviewer alanb successfully added.

Reviewer mchung successfully added.

@iklam iklam force-pushed the 8244778-archive-full-module-graph branch from eaa9125 to 89f3327 Compare September 8, 2020 16:24

openjdk Bot commented Sep 8, 2020

@iklam This change now passes all automated pre-integration checks. In addition to the automated checks, the change must also fulfill all project-specific requirements.

After integration, the commit message will be:

8244778: Archive full module graph in CDS

Reviewed-by: erikj, coleenp, lfoltan, redestad, alanb, mchung
  • If you would like to add a summary, use the /summary command.
  • To credit additional contributors, use the /contributor command.
  • To add additional solved issues, use the /issue command.

There are currently no new commits on the master branch since the last update of the source branch of this PR. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you would like to avoid potential automatic rebasing, specify the current head hash when integrating, like this: /integrate 998ce78e530ccb52a76369ea6f5bdd9a3f90601c.

➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.

@openjdk openjdk Bot added ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 8, 2020

mlbridge Bot commented Sep 8, 2020

Webrevs

Member Author

iklam commented Sep 8, 2020

In response to Lois Foltan's comments on hotspot-runtime-dev@openjdk.java.net:

> Minor nit in moduleEntry.cpp & packageEntry.cpp when dealing with the ModuleEntry's reads list and a PackageEntry's exports list. The names of the methods to write and read those arrays are somewhat confusing.
>
> ModuleEntry::write_archived_entry_array
> ModuleEntry::read_archived_entry_array
>
> At first I thought you were reading/writing an array of archived entries, not the array within an archived entry itself. I was trying to think of a better name. Please consider adding a comment at line #400 & line #417 ahead of those methods in moduleEntry.cpp to indicate that they are used for both reading/writing a ModuleEntry's reads list and a PackageEntry's exports list.

I renamed the functions to ModuleEntry::write_growable_array and ModuleEntry::restore_growable_array, and added comments as you suggested. See commit 4f90e77.

// This function is used to archive ModuleEntry::_reads and PackageEntry::_qualified_exports.
// GrowableArray cannot be directly archived, as it needs to be expandable at runtime.
// Write it out as an Array, and convert it back to GrowableArray at runtime.
Array<ModuleEntry*>* ModuleEntry::write_growable_array(GrowableArray<ModuleEntry*>* array) {
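The round trip described in the comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not the HotSpot code: `std::vector` stands in for `GrowableArray`, the `FixedArray` struct stands in for the archivable `Array`, and the simplified `ModuleEntry` and both function bodies are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins: ModuleEntry is reduced to an id, std::vector plays
// the role of GrowableArray, and FixedArray plays the role of Array.
struct ModuleEntry { int id; };

struct FixedArray {
  std::size_t length;
  std::unique_ptr<ModuleEntry*[]> data;  // exactly `length` slots, never resized
};

// Dump time: copy the growable list into a fixed-size snapshot.
// A missing or empty list is archived as "no array".
std::unique_ptr<FixedArray> write_growable_array(const std::vector<ModuleEntry*>* list) {
  if (list == nullptr || list->empty()) return nullptr;
  auto archived = std::make_unique<FixedArray>();
  archived->length = list->size();
  archived->data = std::make_unique<ModuleEntry*[]>(list->size());
  for (std::size_t i = 0; i < list->size(); i++) archived->data[i] = (*list)[i];
  return archived;
}

// Run time: rebuild an expandable list from the snapshot.
std::vector<ModuleEntry*> restore_growable_array(const FixedArray* archived) {
  std::vector<ModuleEntry*> list;
  if (archived == nullptr) return list;
  assert(archived->length > 0);  // write_growable_array never emits empty snapshots
  list.assign(archived->data.get(), archived->data.get() + archived->length);
  return list;
}
```

The restored list can grow again at run time (for example when a module gains a new read edge), which is the property the fixed-size snapshot alone would not provide.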

> A question about this because a user's program can define modules post module initialization via ModuleDescriptor.newModule(). See for example, tests within open/test/hotspot/jtreg/runtime/module/AccessCheck. So all of these tests would trigger check_cds_restrictions() if -Xshare:dump was turned on. Is that a concern?

Arbitrary user code cannot be executed during -Xshare:dump. The only way to do it is to use a JVMTI agent, which requires specifying -XX:+AllowArchivingWithJavaAgent. You can see an example in the GCDuringDump.java test. If the agent tries to define an extra module, it will get an UnsupportedOperationException thrown by check_cds_restrictions().

Member Author

iklam commented Sep 8, 2020

/label remove build,hotspot

Member Author

iklam commented Sep 8, 2020

/cc hotspot-runtime

@openjdk openjdk Bot removed the build build-dev@openjdk.org label Sep 8, 2020

openjdk Bot commented Sep 8, 2020

@iklam
The build label was successfully removed.

The hotspot label was successfully removed.

@openjdk openjdk Bot added hotspot-runtime hotspot-runtime-dev@openjdk.org and removed hotspot hotspot-dev@openjdk.org labels Sep 8, 2020

openjdk Bot commented Sep 8, 2020

@iklam
The hotspot-runtime label was successfully added.

Member Author

iklam commented Sep 8, 2020

/label remove security

@openjdk openjdk Bot removed the security security-dev@openjdk.org label Sep 8, 2020

openjdk Bot commented Sep 8, 2020

@iklam
The security label was successfully removed.

Member

@erikj79 erikj79 left a comment

Build changes look good.

Comment thread src/hotspot/share/classfile/classLoaderDataShared.cpp
Comment thread src/hotspot/share/classfile/classLoaderDataShared.hpp Outdated
Comment thread src/hotspot/share/classfile/modules.cpp Outdated
Comment thread src/hotspot/share/classfile/classLoaderDataShared.cpp Outdated
Member

@lfoltan lfoltan left a comment

Thanks Ioi for addressing my review comments. Overall, looks great!

Comment thread src/hotspot/share/classfile/moduleEntry.cpp
Comment thread src/hotspot/share/oops/instanceKlass.cpp Outdated
Contributor

coleenp commented Sep 9, 2020

Ok thanks! So many emails ...

Member

@cl4es cl4es left a comment

Excellent work!

Only a few minor comments inline, which you can choose to ignore.

assert(DumpSharedSpaces, "must be");
assert_valid(loader_data);
if (loader_data != NULL) {
// We can't create hashtables at dump time because the hashcode dependes on the
Member

dependes -> depends

if (loader_data != NULL) {
// We can't create hashtables at dump time because the hashcode dependes on the
// address of the Symbols, which may be relocated at run time due to ASLR.
// So we store the packages/modules in a Arrays. At run time, we create
Member

run time -> runtime
a Arrays -> Arrays
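The reasoning in the quoted code comment (no hashtables at dump time, flat arrays instead, tables rebuilt at run time) can be sketched as below. This is an illustration only: `Symbol`, `PackageEntry`, the address-based hash functor, and both functions are hypothetical stand-ins, and `std::unordered_map` replaces HotSpot's own hashtable classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for HotSpot's Symbol and PackageEntry.
struct Symbol { const char* text; };
struct PackageEntry { const Symbol* name; };

// An address-based hash: its value changes whenever the Symbol is mapped at
// a different address, so it must not be baked into the archive.
struct SymbolAddressHash {
  std::size_t operator()(const Symbol* s) const {
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(s) >> 3);
  }
};

using PackageTable = std::unordered_map<const Symbol*, PackageEntry*, SymbolAddressHash>;

// Dump time: record the entries in a flat array. No hash is computed here.
std::vector<PackageEntry*> archive_packages(const std::vector<PackageEntry*>& packages) {
  return packages;
}

// Run time: the Symbols now sit at their final addresses, so hashing is safe.
PackageTable restore_package_table(const std::vector<PackageEntry*>& archived) {
  PackageTable table;
  table.reserve(archived.size());
  for (PackageEntry* entry : archived) {
    assert(entry != nullptr && entry->name != nullptr);
    table.emplace(entry->name, entry);
  }
  return table;
}
```

The cost of this design is one pass over the array at startup; in exchange, the archived data stays valid regardless of where the archive is mapped.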

if (klass == SystemDictionary::ClassLoader_klass() || // ClassLoader::loader_data is malloc'ed.
klass == SystemDictionary::Module_klass() || // Module::module_entry is malloc'ed
// The next 3 classes are used to implement java.lang.invoke, and are not used directly in
// regular Java code. The implementation of java.lang.invoke uses generated anonymoys classes
Member

pre-existing: anonymoys

Comment thread src/hotspot/share/classfile/modules.cpp Outdated
assert(UseSharedSpaces && MetaspaceShared::use_full_module_graph(), "must be");

// We don't want the classes used by the archived full module graph to be redefined by JVMTI.
// Luckily, such classes are loaded in the JVMTI "early" phase, and CDS is disable if a JVMTI
Member

disabled

Comment thread src/hotspot/share/memory/heapShared.cpp Outdated
{"java/lang/Character$CharacterCache", "archivedCache"},
{"java/util/jar/Attributes$Name", "KNOWN_NAMES"},
{"sun/util/locale/BaseLocale", "constantBaseLocales"},
{"java/lang/Integer$IntegerCache", 0, "archivedCache"},
Member

Could the changes here be simplified or clarified? I think the new field should be a bool, or we could instead introduce a new array for the fields archived only when archiving the full module graph (the field is ignored on iteration over closed_archive_subgraph_entry_fields anyhow)

Member Author

I split out the new fields into a separate array as you suggested. Also fixed the typos you found. See commit e987110.
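A rough sketch of the "separate array" shape discussed in this thread, for readers who have not opened the commit. The array names, the struct name, the `example/ArchivedBootLayer` entry, and the `fields_to_archive` helper are all illustrative assumptions; only the two `archivedCache` entries are copied from the snippet under review.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ArchivableStaticFieldInfo {
  const char* klass_name;
  const char* field_name;
};

// Always archived. Entries copied from the snippet under review;
// the array name is hypothetical.
static const ArchivableStaticFieldInfo subgraph_entry_fields[] = {
  {"java/lang/Integer$IntegerCache", "archivedCache"},
  {"java/lang/Character$CharacterCache", "archivedCache"},
};

// Archived only together with the full module graph.
// Array name and entry are placeholders, not taken from the patch.
static const ArchivableStaticFieldInfo fmg_subgraph_entry_fields[] = {
  {"example/ArchivedBootLayer", "archivedBootLayer"},
};

template <std::size_t N>
void append_fields(const ArchivableStaticFieldInfo (&fields)[N],
                   std::vector<std::string>& out) {
  for (const auto& f : fields) {
    out.push_back(std::string(f.klass_name) + "." + f.field_name);
  }
}

// The per-entry flag disappears: the decision is made once, per array.
std::vector<std::string> fields_to_archive(bool use_full_module_graph) {
  std::vector<std::string> out;
  append_fields(subgraph_entry_fields, out);
  if (use_full_module_graph) append_fields(fmg_subgraph_entry_fields, out);
  return out;
}
```

Compared with an extra integer column in every row, this keeps each row a plain class/field pair and makes the conditional set visible at a glance.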

@iklam
Copy link
Copy Markdown
Member Author

iklam commented Sep 13, 2020

/integrate

@openjdk openjdk Bot closed this Sep 13, 2020
@openjdk openjdk Bot added integrated Pull request has been integrated and removed ready Pull request is ready to be integrated rfr Pull request is ready for review labels Sep 13, 2020
@openjdk
Copy link
Copy Markdown

openjdk Bot commented Sep 13, 2020

@iklam Pushed as commit 03a4df0.

💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.

@iklam iklam deleted the 8244778-archive-full-module-graph branch February 18, 2021 00:46
lewurm added a commit to lewurm/openjdk that referenced this pull request Oct 6, 2021
Restore looks like this now:
```
  0x0000000106e4dfcc:   movk    x9, #0x5e4, lsl #16
  0x0000000106e4dfd0:   movk    x9, #0x1, lsl #32
  0x0000000106e4dfd4:   blr x9
  0x0000000106e4dfd8:   ldp x2, x3, [sp, #16]
  0x0000000106e4dfdc:   ldp x4, x5, [sp, #32]
  0x0000000106e4dfe0:   ldp x6, x7, [sp, #48]
  0x0000000106e4dfe4:   ldp x8, x9, [sp, #64]
  0x0000000106e4dfe8:   ldp x10, x11, [sp, #80]
  0x0000000106e4dfec:   ldp x12, x13, [sp, #96]
  0x0000000106e4dff0:   ldp x14, x15, [sp, #112]
  0x0000000106e4dff4:   ldp x16, x17, [sp, #128]
  0x0000000106e4dff8:   ldp x0, x1, [sp], #144
  0x0000000106e4dffc:   ldp xzr, x19, [sp], #16
  0x0000000106e4e000:   ldp x22, x23, [sp, #16]
  0x0000000106e4e004:   ldp x24, x25, [sp, #32]
  0x0000000106e4e008:   ldp x26, x27, [sp, #48]
  0x0000000106e4e00c:   ldp x28, x29, [sp, #64]
  0x0000000106e4e010:   ldp x30, xzr, [sp, #80]
  0x0000000106e4e014:   ldp x20, x21, [sp], #96
  0x0000000106e4e018:   ldur    x12, [x29, #-24]
  0x0000000106e4e01c:   ldr x22, [x12, #16]
  0x0000000106e4e020:   add x22, x22, #0x30
  0x0000000106e4e024:   ldr x8, [x28, #8]
```
pf0n pushed a commit to pf0n/jdk that referenced this pull request Jul 9, 2025
* Use JOL to compute size of AllocObject and enforce this as minimum object size

* Include array overhead in total
fg1417 added a commit to fg1417/jdk that referenced this pull request Mar 13, 2026
…marks after JDK-8340093

JDK-8340093 enabled auto-vectorization for more reduction loop cases
using 128-bit vector operations. As a result, the following
microbenchmarks are negatively affected:
VectorReduction2.longAddDotProduct
VectorReduction2.longMulDotProduct
VectorReduction2.longMulSimple

This patch fixes these regressions.

1. Improve code generation for MLA

For longAddDotProduct[1], the current implementation generates
vectorized code similar to:
```
ldr     q17, [x12, #16]
ldr     q18, [x11, #16]
mla     z16.d, p7/m, z17.d, z18.d
ldr     q17, [x11, #32]
ldr     q18, [x12, #32]
mla     z16.d, p7/m, z18.d, z17.d
...
ldr     q17, [x11, #128]
ldr     q18, [x12, #128]
mla     z16.d, p7/m, z18.d, z17.d
```
`z16` is the third source and destination register. There are
true dependencies between consecutive mla[2] instructions.
As a result, this vectorized code performs significantly worse
than the scalar version due to limited instruction-level
parallelism.

These mla instructions are produced by a backend match rule that
fuses AddVL and MulVL into a vector MLA[3]. In this situation,
avoiding instruction fusion and instead generating separate SVE
mul and add instructions can improve instruction-level parallelism
and overall performance.

To address this, this patch introduces
is_multiply_accumulate_candidate() to determine whether a node is
a suitable vector MLA candidate. For node patterns that may
increase execution latency, instruction fusion into MLA is
disabled.

After applying this patch, the generated assembly looks like:
```
ldr     q17, [x12, #16]
ldr     q18, [x11, #16]
ldr     q19, [x11, #32]
mul     z17.d, p7/m, z17.d, z18.d
ldr     q18, [x12, #32]
ldr     q20, [x11, #48]
mul     z18.d, p7/m, z18.d, z19.d
ldr     q19, [x12, #48]
add     v16.2d, v17.2d, v16.2d
ldr     q17, [x11, #64]
add     v16.2d, v18.2d, v16.2d
ldr     q18, [x12, #64]
mul     z19.d, p7/m, z19.d, z20.d
ldr     q20, [x12, #80]
add     v16.2d, v19.2d, v16.2d
```
This sequence exposes more independent operations and reduces
dependency chains, leading to improved performance.
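To make the dependency-chain argument concrete, here is a back-of-the-envelope critical-path model. It is a sketch under strong assumptions: the latency values are caller-supplied placeholders rather than measured figures for any core, loads are ignored, and the machine is assumed to have enough issue width to overlap all independent multiplies. Both function names are hypothetical.

```cpp
#include <cassert>

// Assumed per-instruction latencies in cycles. Placeholders, not measurements.
struct Latencies {
  int mla;  // fused multiply-accumulate
  int mul;
  int add;
};

// Fused form: every mla reads and writes the accumulator, so the n
// instructions execute strictly one after another.
int fused_critical_path(int n, const Latencies& lat) {
  assert(n >= 0);
  return n * lat.mla;
}

// Split form: the n multiplies are mutually independent and may overlap;
// only the n adds into the accumulator form a serial chain.
int split_critical_path(int n, const Latencies& lat) {
  assert(n >= 0);
  return n == 0 ? 0 : lat.mul + n * lat.add;
}
```

With illustrative latencies of 4 cycles for `mla` and `mul` and 2 cycles for `add`, eight fused operations give a 32-cycle chain while the split form gives 4 + 8 * 2 = 20 cycles. The split form wins in this model whenever the add latency is meaningfully lower than the fused latency.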

Since SVE mls instructions may suffer from similar issues, the
same logic has been extended to cover MLS as well. Additional
microbenchmarks have been added accordingly.

2. Avoid vectorizing MUL-heavy loops

For longMulSimple[4], the generated vectorized code exhibits
long dependency chains of SVE mul instructions, which results
in worse performance than scalar execution:
```
ldr     q17, [x1, #16]
ldr     q18, [x1, #32]
mul     z17.d, p7/m, z17.d, z16.d
ldr     q16, [x1, #48]
mul     z17.d, p7/m, z17.d, z18.d
ldr     q18, [x1, #64]
mul     z16.d, p7/m, z16.d, z17.d
...
ldr     q16, [x1, #256]
mul     z17.d, p7/m, z17.d, z19.d
mul     z16.d, p7/m, z16.d, z17.d
```

To address this, the patch introduces a platform-specific interface:
`VTransformElementWiseVectorNode::node_weight()`.

For 128-bit operations, this interface detects consecutive vector
long multiply operations and increases the node weight to 4, which is
the minimum value required for the cost model to avoid vectorization
on both 128-bit and 256-bit platforms.

3. Results
Performance measurements on 128-bit and 256-bit SVE machines show that
these changes avoid harmful vectorization and improve overall
performance for the affected benchmarks.

patch: results obtained after applying this patch, using default
auto-vectorization settings (-XX:+UseSuperWord,
-XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)

main-default: results on mainline using the same default
auto-vectorization settings (-XX:+UseSuperWord,
-XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)

main-scalar: results on mainline with -XX:+UseSuperWord and
-XX:AutoVectorizationOverrideProfitability=0 (force scalar code)

The table below reports relative performance changes:
p/m1 = (patch - main-default) / main-default
p/m0 = (patch - main-scalar) / main-scalar

Mode: avgt
Unit: ns/op
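
For reference, the p/m1 and p/m0 columns follow the formulas above; a small helper (hypothetical name, not part of the patch or of JMH) makes the sign convention explicit:

```cpp
#include <cassert>
#include <cmath>

// Relative change of `patch` against `baseline`, in percent.
// For avgt results in ns/op, a negative value means the patch is faster.
double relative_change_percent(double patch, double baseline) {
  assert(baseline != 0.0);
  return (patch - baseline) / baseline * 100.0;
}
```

For example, a made-up pair of scores of 80 ns/op with the patch and 100 ns/op on the baseline gives about -20%, i.e. an improvement.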

Arm Neoverse V2 machine (128-bit SVE):
Benchmark                                         (COUNT)    p/m1       p/m0
TypeVectorOperationsSuperWord.mlaL                  512     0.16%      -50.42%
TypeVectorOperationsSuperWord.mlaL                  2048    0.26%      -56.70%
TypeVectorOperationsSuperWord.mlsL                  512     -0.10%     -50.37%
TypeVectorOperationsSuperWord.mlsL                  2048    0.14%      -56.82%
TypeVectorOperationsSuperWord.mulBigL               512     0.06%      -25.77%
TypeVectorOperationsSuperWord.mulBigL               2048    -0.02%     -19.63%
TypeVectorOperationsSuperWord.mulI                  512     0.63%      -63.44%
TypeVectorOperationsSuperWord.mulI                  2048    0.28%      -63.07%
TypeVectorOperationsSuperWord.mulL                  512     -0.03%     -50.47%
TypeVectorOperationsSuperWord.mulL                  2048    0.29%      -50.82%
TypeVectorOperationsSuperWord.mulMediumL            512     -0.19%     -27.54%
TypeVectorOperationsSuperWord.mulMediumL            2048    0.24%      -25.18%
TypeVectorOperationsSuperWord.mulMlaLDependent      512     0.30%      -28.70%
TypeVectorOperationsSuperWord.mulMlaLDependent      2048    0.12%      -26.74%
TypeVectorOperationsSuperWord.mulMlaLIndependent    512     -10.43%    -43.09%
TypeVectorOperationsSuperWord.mulMlaLIndependent    2048    -14.82%    -42.68%
VectorReduction2.WithSuperword.longAddBig           2048    -15.15%    -44.01%
VectorReduction2.WithSuperword.longAddBigMixSub1    2048    -6.19%     -43.92%
VectorReduction2.WithSuperword.longAddBigMixSub2    2048    -15.18%    -43.90%
VectorReduction2.WithSuperword.longAddBigMixSub3    2048    -5.74%     -43.87%
VectorReduction2.WithSuperword.longAddDotProduct    2048    -33.36%    -18.16%
VectorReduction2.WithSuperword.longAddSimple        2048    -0.02%     -6.72%
VectorReduction2.WithSuperword.longAndBig           2048    -16.32%    -44.06%
VectorReduction2.WithSuperword.longAndDotProduct    2048    -0.01%     -3.74%
VectorReduction2.WithSuperword.longAndSimple        2048    0.00%      -6.35%
VectorReduction2.WithSuperword.longMaxBig           2048    -15.29%    -52.09%
VectorReduction2.WithSuperword.longMaxDotProduct    2048    -0.03%     -52.08%
VectorReduction2.WithSuperword.longMaxSimple        2048    -0.40%     -52.74%
VectorReduction2.WithSuperword.longMinBig           2048    -14.88%    -51.70%
VectorReduction2.WithSuperword.longMinDotProduct    2048    0.01%      -52.21%
VectorReduction2.WithSuperword.longMinSimple        2048    0.26%      -52.88%
VectorReduction2.WithSuperword.longMulBig           2048    -2.21%     -0.07%
VectorReduction2.WithSuperword.longMulDotProduct    2048    -15.47%    0.00%
VectorReduction2.WithSuperword.longMulSimple        2048    -17.87%    -0.33%
VectorReduction2.WithSuperword.longOrBig            2048    -15.23%    -43.94%
VectorReduction2.WithSuperword.longOrDotProduct     2048    -0.01%     -3.83%
VectorReduction2.WithSuperword.longOrSimple         2048    -0.01%     -6.60%
VectorReduction2.WithSuperword.longXorBig           2048    -10.03%    -41.62%
VectorReduction2.WithSuperword.longXorDotProduct    2048    0.01%      -38.61%
VectorReduction2.WithSuperword.longXorSimple        2048    0.02%      -53.18%

Arm Neoverse V1 machine (256-bit SVE):
Note: In the current mainline code, the AArch64 backend supports
only 128-bit multiply long operations. Auto-vectorization accounts
for this backend constraint and splits 256-bit vectors into 128-bit
chunks so that the loop can still be vectorized. This is why
256-bit platforms also benefit from this patch.

No obvious performance changes are observed for other benchmarks.

Benchmark                           (COUNT)       p/m1       p/m0
VectorReduction2.longMulDotProduct    2048       -28.23%    0.00%
VectorReduction2.longMulSimple        2048       -19.29%    0.01%

Tier 1 - 3 passed on both aarch64 and x86 platforms.

[1] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096
[2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/MLA--vectors---Multiply-add--predicated--?lang=en
[3] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2617
[4] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1035
fg1417 added a commit to fg1417/jdk that referenced this pull request Mar 30, 2026
The microbenchmark ArraysFill.testLongFill[1] on
128-bit vector platforms generates vectorized store instructions
with non-monotonic memory offsets, e.g.:

str q16, [x12, #80]
str q16, [x12, #48]
str q16, [x12, #128]
...

This arises because SuperWord only considers true dependencies
when building edges (see [3]), and
therefore does not enforce ordering among independent vector
memory operations. These nodes are later scheduled using RPO,
which can result in an apparently unordered sequence of memory
accesses.

This patch replaces RPO-based scheduling with a priority-based
topological sort to improve ordering and locality.

The scheduling policy is:
1. Prefer nodes whose weak predecessors have already been
scheduled.
2. Prioritize node types in the following order: scalar
operations (loads/stores, address expressions), vector arithmetic,
vector loads, vector stores, then others.
3. For independent loads/stores sharing the same base address,
prefer ascending offsets.
4. Use VTransformNodeIDX to ensure stable ordering.
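
The policy above can be sketched as Kahn's algorithm with a priority queue over the ready set. This is a simplified illustration, not the SuperWord implementation: `Node`, `Kind`, and `schedule` are hypothetical, rule 1 (weak predecessors) is left out, and all memory nodes are assumed to share one base address so that rule 3 reduces to comparing offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <tuple>
#include <vector>

// Lower enumerator value = scheduled earlier among ready nodes (rule 2).
enum class Kind { Scalar = 0, VectorArith = 1, VectorLoad = 2, VectorStore = 3, Other = 4 };

struct Node {
  int idx;                 // stable identifier, final tie-break (rule 4)
  Kind kind;
  int offset;              // memory offset; use 0 for non-memory nodes (rule 3)
  std::vector<int> preds;  // indices of nodes that must be scheduled first
};

// Precondition: nodes[i].idx == i for every i.
std::vector<int> schedule(const std::vector<Node>& nodes) {
  const std::size_t n = nodes.size();
  std::vector<int> pending(n, 0);
  std::vector<std::vector<int>> succs(n);
  for (std::size_t i = 0; i < n; i++) {
    assert(nodes[i].idx == static_cast<int>(i));
    pending[i] = static_cast<int>(nodes[i].preds.size());
    for (int p : nodes[i].preds) succs[p].push_back(nodes[i].idx);
  }

  auto key = [&nodes](int i) {
    return std::make_tuple(static_cast<int>(nodes[i].kind), nodes[i].offset, nodes[i].idx);
  };
  // priority_queue pops the "largest" element, so comparing with > on the
  // key makes the node with the smallest key come out first.
  auto later = [&key](int a, int b) { return key(a) > key(b); };
  std::priority_queue<int, std::vector<int>, decltype(later)> ready(later);
  for (std::size_t i = 0; i < n; i++) {
    if (pending[i] == 0) ready.push(static_cast<int>(i));
  }

  std::vector<int> order;
  while (!ready.empty()) {
    int cur = ready.top();
    ready.pop();
    order.push_back(cur);
    for (int s : succs[cur]) {
      if (--pending[s] == 0) ready.push(s);
    }
  }
  assert(order.size() == n);  // fewer scheduled nodes would mean a cycle
  return order;
}
```

Because the comparison key ends in the unique `idx`, the order is fully deterministic, which is the point of rule 4.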

With this change, the generated code becomes monotonic in memory
offsets:

str q16, [x12, #16]
str q16, [x12, #32]
str q16, [x12, #48]
...

On an Arm Neoverse V2 machine (128-bit SVE), this improves the
following benchmarks:

TypeVectorOperationsSuperWord.java[2]

Benchmark          (COUNT)   Mode    Units    Difference
absD                 512     avgt    ns/op    -27.05%
absD                 2048    avgt    ns/op    -27.05%
absL                 512     avgt    ns/op    -24.46%
absL                 2048    avgt    ns/op    -27.26%
convertD2LBitsRaw    512     avgt    ns/op    -20.39%
convertD2LBitsRaw    2048    avgt    ns/op    -23.92%
convertF2L           512     avgt    ns/op    -16.82%
convertF2L           2048    avgt    ns/op    -22.60%
convertI2D           512     avgt    ns/op    -12.50%
convertI2D           2048    avgt    ns/op    -17.92%
convertLBits2D       512     avgt    ns/op    -27.13%
convertLBits2D       2048    avgt    ns/op    -31.69%
negD                 512     avgt    ns/op    -26.85%
negD                 2048    avgt    ns/op    -27.09%

ArraysFill.java[1]:

Benchmark       (size)    Mode     Units     Difference
testDoubleFill    250     thrpt    ops/ms    26.46%
testDoubleFill    266     thrpt    ops/ms    32.69%
testDoubleFill    511     thrpt    ops/ms    33.83%
testDoubleFill    2047    thrpt    ops/ms    45.35%
testDoubleFill    2048    thrpt    ops/ms    45.38%
testDoubleFill    8195    thrpt    ops/ms    49.32%
testLongFill      250     thrpt    ops/ms    28.12%
testLongFill      266     thrpt    ops/ms    40.30%
testLongFill      511     thrpt    ops/ms    34.79%
testLongFill      2047    thrpt    ops/ms    45.71%
testLongFill      2048    thrpt    ops/ms    53.07%
testLongFill      8195    thrpt    ops/ms    49.52%

No significant performance changes are observed on wider vector
platforms (e.g., 256-bit or 512-bit), where fewer vector
operations are generated in SuperWord and scheduling has less
impact.

[1] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/java/util/ArraysFill.java#L92
[2] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java
[3] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/src/hotspot/share/opto/superwordVTransformBuilder.cpp#L99
ruben-arm added a commit to ruben-arm/jdk that referenced this pull request Mar 30, 2026
Some vector operations do not have inputs and essentially initialize
vectors with a constant value. These operations can be marked for
spilling and subsequently rematerialized at every use. The result of
the transformation might look as follows:
   movi    v16.2d, #0x0
   str     q16, [x16, #64]
   movi    v16.2d, #0x0
   str     q16, [x16, #32]
   movi    v16.2d, #0x0
   str     q16, [x16, #16]
   movi    v16.2d, #0x0
   str     q16, [x16]
   movi    v16.2d, #0x0
   str     q16, [x16, #48]
   movi    v16.2d, #0x0
   str     q16, [x16, #112]
   movi    v16.2d, #0x0
   str     q16, [x16, #80]
   movi    v16.2d, #0x0
   str     q16, [x16, #96]

Introduce deduplication of these rematerialized vector
constant initializations, reducing the above sequence to:
   movi    v16.2d, #0x0
   str     q16, [x16, #64]
   str     q16, [x16, #32]
   str     q16, [x16, #16]
   str     q16, [x16]
   str     q16, [x16, #48]
   str     q16, [x16, #112]
   str     q16, [x16, #80]
   str     q16, [x16, #96]
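
The effect of such a deduplication can be illustrated with a toy pass over one basic block. This is a sketch under stated assumptions, not the C2 change itself: the `Instr` model and `dedup_const_inits` are hypothetical, and the pass only tracks which constant each register currently holds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, highly simplified instruction model for a single basic block.
struct Instr {
  enum Op { ConstInit, Store, Other } op;
  int reg;              // ConstInit/Other: register written; Store: register read
  std::uint64_t value;  // ConstInit only: the constant being materialized
};

// Drop a constant initialization when the target register is already known
// to hold the same constant.
std::vector<Instr> dedup_const_inits(const std::vector<Instr>& block) {
  std::map<int, std::uint64_t> known;  // register -> constant it currently holds
  std::vector<Instr> out;
  for (const Instr& in : block) {
    switch (in.op) {
      case Instr::ConstInit: {
        auto it = known.find(in.reg);
        if (it != known.end() && it->second == in.value) continue;  // redundant, skip
        known[in.reg] = in.value;
        break;
      }
      case Instr::Other:
        known.erase(in.reg);  // register overwritten with an unknown value
        break;
      case Instr::Store:
        break;  // reads the register, leaves it unchanged
    }
    out.push_back(in);
  }
  return out;
}
```

Applied to eight `movi`/`str` pairs on the same register and constant, the pass keeps the first `movi` and all eight stores, matching the shape of the reduced sequence above. Any intervening write to the register invalidates the tracked constant, so a later initialization is kept.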
Labels

core-libs core-libs-dev@openjdk.org hotspot-runtime hotspot-runtime-dev@openjdk.org integrated Pull request has been integrated


5 participants