8253049: Enhance itable_stub for AArch64 and x86_64 by kuaiwei · Pull Request #128 · openjdk/jdk

kuaiwei · 2020-09-11T11:58:34Z

Now itable_stub will go through instanceKlass's itable twice to look up a method entry. resolved klass is used for type checking and method holder klass is used to find method entry. In many cases , we observed resolved klass is as same as holder klass. So we can improve itable stub based on it. If they are same klass, stub uses a fast loop to check only one klass. If not, a slow loop is used to checking both klasses.

Even entering in slow loop, new implementation can be better than old one in some cases. Because new stub just need go through itable once and reduce memory operations.

bug: https://bugs.openjdk.java.net/browse/JDK-8253049

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Issue

JDK-8253049: Enhance itable_stub for AArch64 and x86_64

Download

$ git fetch https://git.openjdk.java.net/jdk pull/128/head:pull/128
$ git checkout pull/128

bridgekeeper · 2020-09-11T12:01:20Z

Hi @kuaiwei, welcome to this OpenJDK project and thanks for contributing!

We do not recognize you as Contributor and need to ensure you have signed the Oracle Contributor Agreement (OCA). If you have not signed the OCA, please follow the instructions. Please fill in your GitHub username in the "Username" field of the application. Once you have signed the OCA, please let us know by writing /signed in a comment in this pull request.

If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please use "Add GitHub user kuaiwei" as summary for the issue.

If you are contributing this work on behalf of your employer and your employer has signed the OCA, please let us know by writing /covered in a comment in this pull request.

openjdk · 2020-09-11T12:02:26Z

@kuaiwei The following label will be automatically applied to this pull request: hotspot.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

kuaiwei · 2020-09-11T15:22:33Z

/covered

bridgekeeper · 2020-09-11T15:24:31Z

Thank you! Please allow for a few business days to verify that your employer has signed the OCA. Also, please note that pull requests that are pending an OCA check will not usually be evaluated, so your patience is appreciated!

kuaiwei · 2020-09-11T15:26:32Z

/label add hotspot-compiler

openjdk · 2020-09-11T15:26:45Z

@kuaiwei Usage: /label <add|remove> [label[, label, ...]]wherelabel` is an additional classification that should be applied to this PR. These labels are valid:

serviceability
hotspot
sound
hotspot-compiler
kulla
i18n
shenandoah
jdk
javadoc
2d
security
swing
hotspot-runtime
jmx
build
nio
beans
core-libs
compiler
net
hotspot-gc
hotspot-jfr
awt

kuaiwei · 2020-09-11T15:56:36Z

/label add hotspot-compiler

openjdk · 2020-09-11T15:57:05Z

@kuaiwei
The hotspot-compiler label was successfully added.

mlbridge · 2020-09-14T10:52:43Z

Webrevs

01: Full - Incremental (6d79d5e)
00: Full (b2f12ccc4311cc310a57a1e49fecfee74718349b)

mlbridge · 2020-09-14T12:26:40Z

Mailing list message from Vladimir Ivanov on hotspot-dev:

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

mlbridge · 2020-09-15T07:39:40Z

Mailing list message from Kuai Wei on hotspot-compiler-dev:

Hi Vladimir,

Thanks for your review.

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

aarch64:
=== testStubPoly3 ===
orig: 38430308.215 38438769.040 38325616.152
opt : 39425275.311 39626194.985 39374242.065

=== testStubPoly5 ===
orig: 23227433.053 23210843.937 23212518.073
opt : 23805995.657 23797837.061 23861764.978

=== testSlowStubPoly3 ===
orig: 30838750.839 30886603.202 30841314.152
opt : 36166775.967 36242733.807 36041506.263

=== testSlowStubPoly5 ===
orig: 18713218.115 18706994.686 18686729.040
opt : 21827549.808 21836822.173 21861920.069

x86:
=== testStubPoly3 ===
orig: 36339726.912 36322863.060 36363196.132
opt : 38631086.341 38465649.400 38466044.926

=== testStubPoly5 ===
orig: 22240149.674 22218724.450 22225970.358
opt : 23498941.840 23454580.221 23497053.570

=== testSlowStubPoly3 ===
orig: 28693696.199 28700714.257 28587900.429
opt : 34187319.519 34171321.762 34138648.599

=== testSlowStubPoly5 ===
orig: 17388480.977 17389247.386 17177206.666
opt : 20697609.518 20771108.051 20699215.655

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Thanks,
Kevin

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?14?(???) 22:10
To:kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

mlbridge · 2020-09-15T09:04:08Z

Mailing list message from Vladimir Ivanov on hotspot-dev:

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

Good, thanks for the numbers. I'm curious have you observed any
improvements on larger scale benchmarks or real world apps?

I'm asking because linear scan is already far from optimal when there
are many superinterfaces present.

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Frankly speaking, I'd like to avoid the duplication.

Also, absence of guarantees about order of interfaces in the itable
complicates things: REFC and DECC can be encountered in arbitrary order
and the pass should take that into account. For example, I don't see
early exit on success in slow variant, so every lookup has to go through
the whole itable irrespective of whether it succeeds or fails. I
attribute that to the complications induced by aforementioned aspect.

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

But I'm happy to change my mind if the rewritten implementation makes it
easier to reason about the code.

(FTR subtype checks suffer from a similar problem: unless
Klass::_secondary_super_cache catches it, subtype check for an interface
does a linear scan over _secondary_supers array.)

Best regards,
Vladimir Ivanov

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?14?(???) 22:10
To:kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

mlbridge · 2020-09-15T09:15:17Z

Mailing list message from Andrew Haley on hotspot-dev:

On 15/09/2020 10:02, Vladimir Ivanov wrote:

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

Indeed. When I first came to HotSpot after working on GCJ for years
I was very surprised to see a linear scan used for interface dispatch.

The code improvements look to be fairly minor.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

mlbridge · 2020-09-15T10:06:33Z

Mailing list message from Kuai Wei on hotspot-dev:

Thanks for your quick reply.

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

Good, thanks for the numbers. I'm curious have you observed any
improvements on larger scale benchmarks or real world apps?

I'm asking because linear scan is already far from optimal when there
are many superinterfaces present.

Kevin: itable_stub was found hot on several online applications. So I started to work on this. Now I don't have chance to verify it online. So I uses microbenchmarks to verify. I will
test with some benchmarks.

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Frankly speaking, I'd like to avoid the duplication.

Kevin: Ok, I will try to merge them.

Also, absence of guarantees about order of interfaces in the itable
complicates things: REFC and DECC can be encountered in arbitrary order
and the pass should take that into account. For example, I don't see
early exit on success in slow variant, so every lookup has to go through
the whole itable irrespective of whether it succeeds or fails. I
attribute that to the complications induced by aforementioned aspect.

Kevin: I use a counter for matching. If it reaches zero, the iteration can exit early.

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

Kevin: I agree we can improve itable design. My initial think is jvm may reorder itable at safepoint. I can take it as a follow up optimization.

But I'm happy to change my mind if the rewritten implementation makes it
easier to reason about the code.

(FTR subtype checks suffer from a similar problem: unless
Klass::_secondary_super_cache catches it, subtype check for an interface
does a linear scan over _secondary_supers array.)

Regards,
Kevin

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?15?(???) 17:03
To:??(??) <kuaiwei.kw at alibaba-inc.com>; hotspot-dev <hotspot-dev-retn at openjdk.java.net>; kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

I updated my test cases in test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java . My tests will not inline interface methods and most cpu are used by itable_stub.
every test will run 10 warmup iterations and 5 measure iterations for one score. I took 3 score for every test.
Below is test result on my machines, it looks slow loop has more improvement than origin one.

Good, thanks for the numbers. I'm curious have you observed any
improvements on larger scale benchmarks or real world apps?

I'm asking because linear scan is already far from optimal when there
are many superinterfaces present.

I think lookup_interface_method can be reused as fast path. And it is also used by templateTable::invoke_interface and generate_method_handle_dispatch.
My implementation in slow path need more registers (6 registers so far), I need to check if there's register conflict in these methods. I'd like to keep a separate
slow path implementation. How do you think about it?

Frankly speaking, I'd like to avoid the duplication.

Also, absence of guarantees about order of interfaces in the itable
complicates things: REFC and DECC can be encountered in arbitrary order
and the pass should take that into account. For example, I don't see
early exit on success in slow variant, so every lookup has to go through
the whole itable irrespective of whether it succeeds or fails. I
attribute that to the complications induced by aforementioned aspect.

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

But I'm happy to change my mind if the rewritten implementation makes it
easier to reason about the code.

(FTR subtype checks suffer from a similar problem: unless
Klass::_secondary_super_cache catches it, subtype check for an interface
does a linear scan over _secondary_supers array.)

Best regards,
Vladimir Ivanov

------------------------------------------------------------------
From:Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Send Time:2020?9?14?(???) 22:10
To:kuaiwei <github.com+1981974+kuaiwei at openjdk.java.net>; hotspot-dev <hotspot-dev at openjdk.java.net>; hotspot-compiler-dev <hotspot-compiler-dev at openjdk.java.net>
Subject:Re: RFR: 8253049: Enhance itable_stub for AArch64 and x86_64

Hi Kevin,

Very interesting observations. I like the idea to optimize for the case
when REFC == DECC.

Fusing 2 passes over the itable into one does look attractive, but I'm
not sure the proposed variant is correct. I suggest to split the patch
into 2 enhancements and handle them separately.

I'm curious what kind of benchmarks you used and what are the
improvements observed with the patch.

One suggestion about the implementation:

src/hotspot/cpu/x86/macroAssembler_x86.cpp:

+void MacroAssembler::lookup_interface_method_in_stub(Register recv_klass,

I'd like to avoid having 2 independent implementations of itable lookup
(MacroAssembler::lookup_interface_method_in_stub() and
MacroAssembler::lookup_interface_method()). It would be nice to keep the
implementation unified between itable and MethodHandle linkToInterface
linker stubs.

What MacroAssembler::lookup_interface_method(..., true
/*return_method*/) does is interface method lookup w/o proper subtype
check and it is equivalent to fast loop in
MacroAssembler::lookup_interface_method_in_stub().

As a possible path forward, you could introduce the fast path check
first by moving the fast path check into
VtableStubs::create_itable_stub() and guard the first path over the
itable. It would make the type checking pass over itable optional based
on runtime check.

Then you could refactor MacroAssembler::lookup_interface_method() to
optionally do REFC and DECC checks on every iteration and migrate
VtableStubs::create_itable_stub() and
MethodHandles::generate_method_handle_dispatch() to it.

Best regards,
Vladimir Ivanov

On 14.09.2020 13:52, kuaiwei wrote:

mlbridge · 2020-09-15T15:39:16Z

Mailing list message from Vladimir Ivanov on hotspot-dev:

And speaking of the overall approach (as it is implemented now), IMO
increased complexity doesn't worth it. If interface calls become a
bottleneck, the problem lies not in itable stub, but the overall design
which requires linear scan over itables. It's better to put the effort
there than micro-optimizing the stub.

Indeed. When I first came to HotSpot after working on GCJ for years
I was very surprised to see a linear scan used for interface dispatch.

FTR Erik? has been looking into rewriting virtual dispatch logic:

http://openjdk.java.net/jeps/8221828

Best regards,
Vladimir Ivanov

The code improvements look to be fairly minor.

mlbridge · 2020-09-15T17:13:14Z

Mailing list message from Vladimir Ivanov on hotspot-dev:

????I?updated?my?test?cases?in?test/micro/org/openjdk/bench/vm/compiler/InterfaceCalls.java?.?My?tests?will?not?inline?interface?methods?and?most?cpu?are?used?by?itable_stub.
?every?test?will?run?10?warmup?iterations?and?5?measure?iterations?for?one?score.?I?took?3?score?for?every?test.
????Below?is?test?result?on?my?machines,?it?looks?slow?loop?has?more?improvement?than?origin?one.

Good,?thanks?for?the?numbers.?I'm?curious?have?you?observed?any
improvements?on?larger?scale?benchmarks?or?real?world?apps?

I'm?asking?because?linear?scan?is?already?far?from?optimal?when?there
are?many?superinterfaces?present.

Kevin: itable_stub was found hot on several online applications. So I
started to work on this. Now I don't have chance to verify it online. So
I uses microbenchmarks to verify. I will
test with some benchmarks.

That's unfortunate. It would be very helpful to confirm the results of
the micro-benchmarks (nano-, in this particular case).

????I?think?lookup_interface_method?can?be?reused?as?fast?path.?And?it?is?also?used?by?templateTable::invoke_interface?and?generate_method_handle_dispatch.
?My?implementation?in?slow?path?need?more?registers?(6?registers?so?far),?I?need?to?check?if?there's?register?conflict?in?these?methods.?I'd?like?to?keep?a?separate
?slow?path?implementation.?How?do?you?think?about?it?

Frankly?speaking,?I'd?like?to?avoid?the?duplication.

Kevin: Ok, I will try to merge them.

Also,?absence?of?guarantees?about?order?of?interfaces?in?the?itable
complicates?things:?REFC?and?DECC?can?be?encountered?in?arbitrary?order
and?the?pass?should?take?that?into?account.?For?example,?I?don't?see
early?exit?on?success?in?slow?variant,?so?every?lookup?has?to?go?through
the?whole?itable?irrespective?of?whether?it?succeeds?or?fails.?I
attribute?that?to?the?complications?induced?by?aforementioned?aspect.

Kevin: I use a counter for matching. If it reaches zero, the iteration
can exit early.

Good. Thanks for the clarification.

Alternatively, you could use 2 bits in the temp register to code the
state. IMO it's clearer and more robust w.r.t. possible bugs.

Or even explicitly encode the state in the code as an automaton by
generating 3 loop variants (check REFC + check DECC + check both).
But IMO it falls into over-engineering category :-)

Also, on naming: I find it hard to reason about the logic.
Registers are re-used for different purposes and the names don't help at
all (even adds to the confusion).

As an example:

movptr(method_result, Address(recv_klass, holder_klass,
Address::times_1));

And?speaking?of?the?overall?approach?(as?it?is?implemented?now),?IMO
increased?complexity?doesn't?worth?it.?If?interface?calls?become?a
bottleneck,?the?problem?lies?not?in?itable?stub,?but?the?overall?design
which?requires?linear?scan?over?itables.?It's?better?to?put?the?effort
there?than?micro-optimizing?the?stub.

Kevin: I agree we can improve itable design. My initial think is jvm may
reorder itable at safepoint. I can take it as a follow up optimization.

Well, I would definitely prefer to avoid additional runtime changes (to
sort interfaces in itables and verify their order later) just to support
minor improvements in itable stubs.

Best regards,
Vladimir Ivanov

\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
From\:Vladimir Ivanov \<vladimir\.x\.ivanov at oracle\.com>
Send Time\:2020\?9\?15\?\(\?\?\?\) 17\:03
To\:\?\?\(\?\?\) \<kuaiwei\.kw at alibaba\-inc\.com>\; hotspot\-dev
\<hotspot\-dev\-retn at openjdk\.java\.net>\; kuaiwei
\<github\.com\+1981974\+kuaiwei at openjdk\.java\.net>\; hotspot\-dev
\<hotspot\-dev at openjdk\.java\.net>\; hotspot\-compiler\-dev
\<hotspot\-compiler\-dev at openjdk\.java\.net>
Subject\:Re\: RFR\: 8253049\: Enhance itable\_stub for AArch64 and x86\_64


 >\?\?\?\?I\?updated\?my\?test\?cases\?in\?test\/micro\/org\/openjdk\/bench\/vm\/compiler\/InterfaceCalls\.java\?\.\?My\?tests\?will\?not\?inline\?interface\?methods\?and\?most\?cpu\?are\?used\?by\?itable\_stub\.
 >\?every\?test\?will\?run\?10\?warmup\?iterations\?and\?5\?measure\?iterations\?for\?one\?score\.\?I\?took\?3\?score\?for\?every\?test\.
 >\?\?\?\?Below\?is\?test\?result\?on\?my\?machines\,\?it\?looks\?slow\?loop\?has\?more\?improvement\?than\?origin\?one\.

Good\,\?thanks\?for\?the\?numbers\.\?I\'m\?curious\?have\?you\?observed\?any
improvements\?on\?larger\?scale\?benchmarks\?or\?real\?world\?apps\?

I\'m\?asking\?because\?linear\?scan\?is\?already\?far\?from\?optimal\?when\?there
are\?many\?superinterfaces\?present\.

 >\?\?\?\?I\?think\?lookup\_interface\_method\?can\?be\?reused\?as\?fast\?path\.\?And\?it\?is\?also\?used\?by\?templateTable\:\:invoke\_interface\?and\?generate\_method\_handle\_dispatch\.
 >\?My\?implementation\?in\?slow\?path\?need\?more\?registers\?\(6\?registers\?so\?far\)\,\?I\?need\?to\?check\?if\?there\'s\?register\?conflict\?in\?these\?methods\.\?I\'d\?like\?to\?keep\?a\?separate
 >\?slow\?path\?implementation\.\?How\?do\?you\?think\?about\?it\?

Frankly\?speaking\,\?I\'d\?like\?to\?avoid\?the\?duplication\.

Also\,\?absence\?of\?guarantees\?about\?order\?of\?interfaces\?in\?the\?itable
complicates\?things\:\?REFC\?and\?DECC\?can\?be\?encountered\?in\?arbitrary\?order
and\?the\?pass\?should\?take\?that\?into\?account\.\?For\?example\,\?I\?don\'t\?see
early\?exit\?on\?success\?in\?slow\?variant\,\?so\?every\?lookup\?has\?to\?go\?through

the\?whole\?itable\?irrespective\?of\?whether\?it\?succeeds\?or\?fails\.\?I
attribute\?that\?to\?the\?complications\?induced\?by\?aforementioned\?aspect\.

And\?speaking\?of\?the\?overall\?approach\?\(as\?it\?is\?implemented\?now\)\,\?IMO
increased\?complexity\?doesn\'t\?worth\?it\.\?If\?interface\?calls\?become\?a
bottleneck\,\?the\?problem\?lies\?not\?in\?itable\?stub\,\?but\?the\?overall\?design
which\?requires\?linear\?scan\?over\?itables\.\?It\'s\?better\?to\?put\?the\?effort
there\?than\?micro\-optimizing\?the\?stub\.

But\?I\'m\?happy\?to\?change\?my\?mind\?if\?the\?rewritten\?implementation\?makes\?it

easier\?to\?reason\?about\?the\?code\.

\(FTR\?subtype\?checks\?suffer\?from\?a\?similar\?problem\:\?unless
Klass\:\:\_secondary\_super\_cache\?catches\?it\,\?subtype\?check\?for\?an\?interface

does\?a\?linear\?scan\?over\?\_secondary\_supers\?array\.\)

Best\?regards\,
Vladimir\?Ivanov


 >\?\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
 >\?From\:Vladimir\?Ivanov\?\<vladimir\.x\.ivanov at oracle\.com>
 >\?Send\?Time\:2020\?9\?14\?\(\?\?\?\)\?22\:10
 >\?To\:kuaiwei\?\<github\.com\+1981974\+kuaiwei at openjdk\.java\.net>\;\?hotspot\-dev\?\<hotspot\-dev at openjdk\.java\.net>\;\?hotspot\-compiler\-dev\?\<hotspot\-compiler\-dev at openjdk\.java\.net>
 >\?Subject\:Re\:\?RFR\:\?8253049\:\?Enhance\?itable\_stub\?for\?AArch64\?and\?x86\_64
 >
 >\?Hi\?Kevin\,
 >
 >\?Very\?interesting\?observations\.\?I\?like\?the\?idea\?to\?optimize\?for\?the\?case
 >\?when\?REFC\?\=\=\?DECC\.
 >
 >\?Fusing\?2\?passes\?over\?the\?itable\?into\?one\?does\?look\?attractive\,\?but\?I\'m
 >\?not\?sure\?the\?proposed\?variant\?is\?correct\.\?I\?suggest\?to\?split\?the\?patch
 >\?into\?2\?enhancements\?and\?handle\?them\?separately\.
 >
 >\?I\'m\?curious\?what\?kind\?of\?benchmarks\?you\?used\?and\?what\?are\?the
 >\?improvements\?observed\?with\?the\?patch\.
 >
 >\?One\?suggestion\?about\?the\?implementation\:
 >
 >\?src\/hotspot\/cpu\/x86\/macroAssembler\_x86\.cpp\:
 >
 >\?\+void\?MacroAssembler\:\:lookup\_interface\_method\_in\_stub\(Register\?recv\_klass\,
 >
 >\?I\'d\?like\?to\?avoid\?having\?2\?independent\?implementations\?of\?itable\?lookup
 >\?\(MacroAssembler\:\:lookup\_interface\_method\_in\_stub\(\)\?and
 >\?MacroAssembler\:\:lookup\_interface\_method\(\)\)\.\?It\?would\?be\?nice\?to\?keep\?the
 >\?implementation\?unified\?between\?itable\?and\?MethodHandle\?linkToInterface
 >\?linker\?stubs\.
 >
 >\?What\?MacroAssembler\:\:lookup\_interface\_method\(\.\.\.\,\?true
 >\?\/\*return\_method\*\/\)\?does\?is\?interface\?method\?lookup\?w\/o\?proper\?subtype
 >\?check\?and\?it\?is\?equivalent\?to\?fast\?loop\?in
 >\?MacroAssembler\:\:lookup\_interface\_method\_in\_stub\(\)\.
 >
 >\?As\?a\?possible\?path\?forward\,\?you\?could\?introduce\?the\?fast\?path\?check
 >\?first\?by\?moving\?the\?fast\?path\?check\?into
 >\?VtableStubs\:\:create\_itable\_stub\(\)\?and\?guard\?the\?first\?path\?over\?the
 >\?itable\.\?It\?would\?make\?the\?type\?checking\?pass\?over\?itable\?optional\?based
 >\?on\?runtime\?check\.
 >
 >\?Then\?you\?could\?refactor\?MacroAssembler\:\:lookup\_interface\_method\(\)\?to
 >\?optionally\?do\?REFC\?and\?DECC\?checks\?on\?every\?iteration\?and\?migrate
 >\?VtableStubs\:\:create\_itable\_stub\(\)\?\?and
 >\?MethodHandles\:\:generate\_method\_handle\_dispatch\(\)\?to\?it\.
 >
 >\?Best\?regards\,
 >\?Vladimir\?Ivanov
 >
 >\?On\?14\.09\.2020\?13\:52\,\?kuaiwei\?wrote\:
https\:\/\/webrevs\.openjdk\.java\.net\/\?repo\=jdk\&pr\=128\&range\=00
https\:\/\/git\.openjdk\.java\.net\/jdk\?pull\/128\/head\:pull\/128
 >

bridgekeeper · 2020-10-13T21:21:51Z

@kuaiwei This pull request has been inactive for more than 4 weeks and will be automatically closed if another 4 weeks passes without any activity. To avoid this, simply add a new comment to the pull request. Feel free to ask for assistance if you need help with progressing this pull request towards integration!

bridgekeeper · 2020-11-10T21:26:31Z

@kuaiwei This pull request has been inactive for more than 8 weeks and will now be automatically closed. If you would like to continue working on this pull request in the future, feel free to reopen it!

r18 should not be used as it is reserved as platform register. Linux is fine with userspace using it, but Windows and also recently macOS ( openjdk/jdk11u-dev#301 (comment) ) are actually using it on the kernel side. The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here: https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404 I haven't seen causing this particular instance any issues in practice _yet_, presumably because it looks hard to align the stars in order to trigger a problem (between stp and ldp of r18 a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macro and that causes troubles as explained in the link above. Output of `-XX:+PrintInterpreter` before this change: ``` ---------------------------------------------------------------------- method entry point (kind = native) [0x0000000138809b00, 0x000000013880a280] 1920 bytes -------------------------------------------------------------------------------- 0x0000000138809b00: ldr x2, [x12, #16] 0x0000000138809b04: ldrh w2, [x2, #44] 0x0000000138809b08: add x24, x20, x2, uxtx #3 0x0000000138809b0c: sub x24, x24, #0x8 [...] 0x0000000138809fa4: stp x16, x17, [sp, #128] 0x0000000138809fa8: stp x18, x19, [sp, #144] 0x0000000138809fac: stp x20, x21, [sp, #160] [...] 0x0000000138809fc0: stp x30, xzr, [sp, #240] 0x0000000138809fc4: mov x0, x28 ;; 0x10864ACCC 0x0000000138809fc8: mov x9, #0xaccc // #44236 0x0000000138809fcc: movk x9, #0x864, lsl #16 0x0000000138809fd0: movk x9, #0x1, lsl #32 0x0000000138809fd4: blr x9 0x0000000138809fd8: ldp x2, x3, [sp, #16] [...] 0x0000000138809ff4: ldp x16, x17, [sp, #128] 0x0000000138809ff8: ldp x18, x19, [sp, #144] 0x0000000138809ffc: ldp x20, x21, [sp, #160] ``` After: ``` ---------------------------------------------------------------------- method entry point (kind = native) [0x0000000108e4db00, 0x0000000108e4e280] 1920 bytes -------------------------------------------------------------------------------- 0x0000000108e4db00: ldr x2, [x12, #16] 0x0000000108e4db04: ldrh w2, [x2, #44] 0x0000000108e4db08: add x24, x20, x2, uxtx #3 0x0000000108e4db0c: sub x24, x24, #0x8 [...] 0x0000000108e4dfa4: stp x16, x17, [sp, #128] 0x0000000108e4dfa8: stp x19, x20, [sp, #144] 0x0000000108e4dfac: stp x21, x22, [sp, #160] [...] 0x0000000108e4dfbc: stp x29, x30, [sp, #224] 0x0000000108e4dfc0: mov x0, x28 ;; 0x107E4A06C 0x0000000108e4dfc4: mov x9, #0xa06c // #41068 0x0000000108e4dfc8: movk x9, #0x7e4, lsl #16 0x0000000108e4dfcc: movk x9, #0x1, lsl #32 0x0000000108e4dfd0: blr x9 0x0000000108e4dfd4: ldp x2, x3, [sp, #16] [...] 0x0000000108e4dff0: ldp x16, x17, [sp, #128] 0x0000000108e4dff4: ldp x19, x20, [sp, #144] 0x0000000108e4dff8: ldp x21, x22, [sp, #160] [...] ```

Restore looks like this now: ``` 0x0000000106e4dfcc: movk x9, #0x5e4, lsl openjdk#16 0x0000000106e4dfd0: movk x9, #0x1, lsl openjdk#32 0x0000000106e4dfd4: blr x9 0x0000000106e4dfd8: ldp x2, x3, [sp, openjdk#16] 0x0000000106e4dfdc: ldp x4, x5, [sp, openjdk#32] 0x0000000106e4dfe0: ldp x6, x7, [sp, openjdk#48] 0x0000000106e4dfe4: ldp x8, x9, [sp, openjdk#64] 0x0000000106e4dfe8: ldp x10, x11, [sp, openjdk#80] 0x0000000106e4dfec: ldp x12, x13, [sp, openjdk#96] 0x0000000106e4dff0: ldp x14, x15, [sp, openjdk#112] 0x0000000106e4dff4: ldp x16, x17, [sp, openjdk#128] 0x0000000106e4dff8: ldp x0, x1, [sp], openjdk#144 0x0000000106e4dffc: ldp xzr, x19, [sp], openjdk#16 0x0000000106e4e000: ldp x22, x23, [sp, openjdk#16] 0x0000000106e4e004: ldp x24, x25, [sp, openjdk#32] 0x0000000106e4e008: ldp x26, x27, [sp, openjdk#48] 0x0000000106e4e00c: ldp x28, x29, [sp, openjdk#64] 0x0000000106e4e010: ldp x30, xzr, [sp, openjdk#80] 0x0000000106e4e014: ldp x20, x21, [sp], openjdk#96 0x0000000106e4e018: ldur x12, [x29, #-24] 0x0000000106e4e01c: ldr x22, [x12, openjdk#16] 0x0000000106e4e020: add x22, x22, #0x30 0x0000000106e4e024: ldr x8, [x28, openjdk#8] ```

…marks after JDK-8340093 JDK-8340093 enabled auto-vectorization for more reduction loop cases using 128-bit vector operations. As a result, the following microbenchmarks are negatively affected: VectorReduction2.longAddDotProduct VectorReduction2.longMulDotProduct VectorReduction2.longMulSimple This patch fixes these regressions. 1. Improve code generation for MLA For longAddDotProduct[1], the current implementation generates vectorized code similar to: ``` ldr q17, [x12, openjdk#16] ldr q18, [x11, openjdk#16] mla z16.d, p7/m, z17.d, z18.d ldr q17, [x11, openjdk#32] ldr q18, [x12, openjdk#32] mla z16.d, p7/m, z18.d, z17.d ... ldr q17, [x11, openjdk#128] ldr q18, [x12, openjdk#128] mla z16.d, p7/m, z18.d, z17.d ``` `z16` is the third source and destination register. There are true dependencies between consecutive mla[2] instructions. As a result, this vectorized code performs significantly worse than the scalar version due to limited instruction-level parallelism. These mla instructions are produced by a backend match rule that fuses AddVL and MulVL into a vector MLA[3]. In this situation, avoiding instruction fusion and instead generating separate SVE mul and add instructions can improve instruction-level parallelism and overall performance. To address this, this patch introduces is_multiply_accumulate_candidate() to determine whether a node is a suitable vector MLA candidate. For node patterns that may increase execution latency, instruction fusion into MLA is disabled. After applying this patch, the generated assembly looks like: ``` ldr q17, [x12, openjdk#16] ldr q18, [x11, openjdk#16] ldr q19, [x11, openjdk#32] mul z17.d, p7/m, z17.d, z18.d ldr q18, [x12, openjdk#32] ldr q20, [x11, openjdk#48] mul z18.d, p7/m, z18.d, z19.d ldr q19, [x12, openjdk#48] add v16.2d, v17.2d, v16.2d ldr q17, [x11, openjdk#64] add v16.2d, v18.2d, v16.2d ldr q18, [x12, openjdk#64] mul z19.d, p7/m, z19.d, z20.d ldr q20, [x12, openjdk#80] add v16.2d, v19.2d, v16.2d ``` This sequence exposes more independent operations and reduces dependency chains, leading to improved performance. Since SVE mls instructions may suffer from similar issues, the same logic has been extended to cover MLS as well. Additional microbenchmarks have been added accordingly. 2. Avoid vectorizing MUL-heavy loops For longMulSimple[3], the generated vectorized code exhibits long dependency chains of SVE mul instructions, which results in worse performance than scalar execution: ``` ldr q17, [x1, openjdk#16] ldr q18, [x1, openjdk#32] mul z17.d, p7/m, z17.d, z16.d ldr q16, [x1, openjdk#48] mul z17.d, p7/m, z17.d, z18.d ldr q18, [x1, openjdk#64] mul z16.d, p7/m, z16.d, z17.d ... ldr q16, [x1, openjdk#256] mul z17.d, p7/m, z17.d, z19.d mul z16.d, p7/m, z16.d, z17.d ``` To address this, the patch introduces a platform-specific interface: `VTransformElementWiseVectorNode::node_weight()`. For 128-bit operations, this interface detects consecutive vector long multiply operations and increases the node weight to 4, which is the minimum value required for the cost model to avoid vectorization on both 128-bit and 256-bit platforms. 3. Results Performance measurements on 128-bit and 256-bit SVE machines show that these changes avoid harmful vectorization and improve overall performance for the affected benchmarks. patch: results obtained after applying this patch, using default auto-vectorization settings (-XX:+UseSuperWord, -XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode) main-default: results on mainline using the same default auto-vectorization settings (-XX:+UseSuperWord, -XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode) main-scalar: results on mainline with -XX:+UseSuperWord and -XX:AutoVectorizationOverrideProfitability=0 (force scalar code) The table below reports relative performance changes: p/m1 = (patch - main-default) / main-default p/m0 = (patch - main-scalar) / main-scalar Mode: avgt Unit: ns/op Arm Neoverse V2 machine (128 bit SVE): Benchmark (COUNT) p/m1 p/m0 TypeVectorOperationsSuperWord.mlaL 512 0.16% -50.42% TypeVectorOperationsSuperWord.mlaL 2048 0.26% -56.70% TypeVectorOperationsSuperWord.mlsL 512 -0.10% -50.37% TypeVectorOperationsSuperWord.mlsL 2048 0.14% -56.82% TypeVectorOperationsSuperWord.mulBigL 512 0.06% -25.77% TypeVectorOperationsSuperWord.mulBigL 2048 -0.02% -19.63% TypeVectorOperationsSuperWord.mulI 512 0.63% -63.44% TypeVectorOperationsSuperWord.mulI 2048 0.28% -63.07% TypeVectorOperationsSuperWord.mulL 512 -0.03% -50.47% TypeVectorOperationsSuperWord.mulL 2048 0.29% -50.82% TypeVectorOperationsSuperWord.mulMediumL 512 -0.19% -27.54% TypeVectorOperationsSuperWord.mulMediumL 2048 0.24% -25.18% TypeVectorOperationsSuperWord.mulMlaLDependent 512 0.30% -28.70% TypeVectorOperationsSuperWord.mulMlaLDependent 2048 0.12% -26.74% TypeVectorOperationsSuperWord.mulMlaLIndependent 512 -10.43% -43.09% TypeVectorOperationsSuperWord.mulMlaLIndependent 2048 -14.82% -42.68% VectorReduction2.WithSuperword.longAddBig 2048 -15.15% -44.01% VectorReduction2.WithSuperword.longAddBigMixSub1 2048 -6.19% -43.92% VectorReduction2.WithSuperword.longAddBigMixSub2 2048 -15.18% -43.90% VectorReduction2.WithSuperword.longAddBigMixSub3 2048 -5.74% -43.87% VectorReduction2.WithSuperword.longAddDotProduct 2048 -33.36% -18.16% VectorReduction2.WithSuperword.longAddSimple 2048 -0.02% -6.72% VectorReduction2.WithSuperword.longAndBig 2048 -16.32% -44.06% VectorReduction2.WithSuperword.longAndDotProduct 2048 -0.01% -3.74% VectorReduction2.WithSuperword.longAndSimple 2048 0.00% -6.35% VectorReduction2.WithSuperword.longMaxBig 2048 -15.29% -52.09% VectorReduction2.WithSuperword.longMaxDotProduct 2048 -0.03% -52.08% VectorReduction2.WithSuperword.longMaxSimple 2048 -0.40% -52.74% VectorReduction2.WithSuperword.longMinBig 2048 -14.88% -51.70% VectorReduction2.WithSuperword.longMinDotProduct 2048 0.01% -52.21% VectorReduction2.WithSuperword.longMinSimple 2048 0.26% -52.88% VectorReduction2.WithSuperword.longMulBig 2048 -2.21% -0.07% VectorReduction2.WithSuperword.longMulDotProduct 2048 -15.47% 0.00% VectorReduction2.WithSuperword.longMulSimple 2048 -17.87% -0.33% VectorReduction2.WithSuperword.longOrBig 2048 -15.23% -43.94% VectorReduction2.WithSuperword.longOrDotProduct 2048 -0.01% -3.83% VectorReduction2.WithSuperword.longOrSimple 2048 -0.01% -6.60% VectorReduction2.WithSuperword.longXorBig 2048 -10.03% -41.62% VectorReduction2.WithSuperword.longXorDotProduct 2048 0.01% -38.61% VectorReduction2.WithSuperword.longXorSimple 2048 0.02% -53.18% Arm Neoverse V1 machine (256 bit SVE): Note: In the current mainline code, the AArch64 backend supports only 128-bit multiply long operations. Auto-vectorization accounts for this backend constraint and splits 256-bit vectors into 128-bit chunks so that the loop can still be vectorized. This is why 256-bit platforms also benefit from this patch. No obvious performance changes are observed for other benchmarks. Benchmark (COUNT) p/m1 p/m0 VectorReduction2.longMulDotProduct 2048 -28.23% 0.00% VectorReduction2.longMulSimple 2048 -19.29% 0.01% Tier 1 - 3 passed on both aarch64 and x86 platforms. [1] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096 [2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/MLA--vectors---Multiply-add--predicated--?lang=en [3] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2617 [4] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1035

The microbenchmark ArraysFill.testLongFill[1] on 128-bit vector platforms generates vectorized store instructions with non-monotonic memory offsets, e.g.: str q16, [x12, openjdk#80] str q16, [x12, openjdk#48] str q16, [x12, openjdk#128] ... This arises because SuperWord only considers true dependencies when building edges (see [3]), and therefore does not enforce ordering among independent vector memory operations. These nodes are later scheduled using RPO, which can result in an apparently unordered sequence of memory accesses. This patch replaces RPO-based scheduling with a priority-based topological sort to improve ordering and locality. The scheduling policy is: 1. Prefer nodes whose weak predecessors have already been scheduled. 2. Prioritize node types in the following order: scalar operations (loads/stores, address expressions), vector arithmetic, vector loads, vector stores, then others. 3. For independent loads/stores sharing the same base address, prefer ascending offsets. 4. Use VTransformNodeIDX to ensure stable ordering. With this change, the generated code becomes monotonic in memory offsets: str q16, [x12, openjdk#16] str q16, [x12, openjdk#32] str q16, [x12, openjdk#48] ... On Arm Neoverse V2 machine (128 bit SVE), this improves the following benchmarks: TypeVectorOperationsSuperWord.java[2] Benchmark (COUNT) Mode Units Difference absD 512 avgt ns/op -27.05% absD 2048 avgt ns/op -27.05% absL 512 avgt ns/op -24.46% absL 2048 avgt ns/op -27.26% convertD2LBitsRaw 512 avgt ns/op -20.39% convertD2LBitsRaw 2048 avgt ns/op -23.92% convertF2L 512 avgt ns/op -16.82% convertF2L 2048 avgt ns/op -22.60% convertI2D 512 avgt ns/op -12.50% convertI2D 2048 avgt ns/op -17.92% convertLBits2D 512 avgt ns/op -27.13% convertLBits2D 2048 avgt ns/op -31.69% negD 512 avgt ns/op -26.85% negD 2048 avgt ns/op -27.09% ArraysFill.java[1]: Benchmark (size) Mode Units Difference testDoubleFill 250 thrpt ops/ms 26.46% testDoubleFill 266 thrpt ops/ms 32.69% testDoubleFill 511 thrpt ops/ms 33.83% testDoubleFill 2047 thrpt ops/ms 45.35% testDoubleFill 2048 thrpt ops/ms 45.38% testDoubleFill 8195 thrpt ops/ms 49.32% testLongFill 250 thrpt ops/ms 28.12% testLongFill 266 thrpt ops/ms 40.30% testLongFill 511 thrpt ops/ms 34.79% testLongFill 2047 thrpt ops/ms 45.71% testLongFill 2048 thrpt ops/ms 53.07% testLongFill 8195 thrpt ops/ms 49.52% No significant performance changes are observed on wider vector platforms (e.g., 256-bit or 512-bit), where fewer vector operations are generated in SuperWord and scheduling has less impact. [1] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/java/util/ArraysFill.java#L92 [2] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/test/micro/org/openjdk/bench/vm/compiler/TypeVectorOperations.java [3] https://github.com/openjdk/jdk/blob/34a0235ed30141e92f064d30fe4378709ea0e135/src/hotspot/share/opto/superwordVTransformBuilder.cpp#L99

bridgekeeper Bot added the oca Needs verification of OCA signatory status label Sep 11, 2020

openjdk Bot added the hotspot hotspot-dev@openjdk.org label Sep 11, 2020

bridgekeeper Bot added the oca-verify Needs verification of OCA signatory status label Sep 11, 2020

openjdk Bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Sep 11, 2020

bridgekeeper Bot removed oca Needs verification of OCA signatory status oca-verify Needs verification of OCA signatory status labels Sep 14, 2020

openjdk Bot added the rfr Pull request is ready for review label Sep 14, 2020

8253049: Enhance itable_stub for AArch64 and x86_64

6d79d5e

kuaiwei force-pushed the itable branch from b2f12cc to 6d79d5e Compare September 15, 2020 07:10

bridgekeeper Bot closed this Nov 10, 2020

pfirmstone referenced this pull request in pfirmstone/DirtyChai Nov 18, 2024

Module :: JGDMS Jini Platform - SpotBugs fixes #128

170a00c

pfirmstone referenced this pull request in pfirmstone/DirtyChai Nov 18, 2024

Module :: JGDMS Jini Platform - SpotBugs fixes #128

ac41a45

Conversation

kuaiwei commented Sep 11, 2020 • edited by openjdk Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Progress

Issue

Download

Uh oh!

bridgekeeper Bot commented Sep 11, 2020

Uh oh!

openjdk Bot commented Sep 11, 2020

Uh oh!

kuaiwei commented Sep 11, 2020

Uh oh!

bridgekeeper Bot commented Sep 11, 2020

Uh oh!

kuaiwei commented Sep 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

openjdk Bot commented Sep 11, 2020

Uh oh!

kuaiwei commented Sep 11, 2020

Uh oh!

openjdk Bot commented Sep 11, 2020

Uh oh!

mlbridge Bot commented Sep 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Webrevs

Uh oh!

mlbridge Bot commented Sep 14, 2020

Uh oh!

mlbridge Bot commented Sep 15, 2020

Uh oh!

mlbridge Bot commented Sep 15, 2020

Uh oh!

mlbridge Bot commented Sep 15, 2020

Uh oh!

mlbridge Bot commented Sep 15, 2020

Uh oh!

mlbridge Bot commented Sep 15, 2020

Uh oh!

mlbridge Bot commented Sep 15, 2020

Uh oh!

bridgekeeper Bot commented Oct 13, 2020

Uh oh!

bridgekeeper Bot commented Nov 10, 2020

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

1 participant

kuaiwei commented Sep 11, 2020 •

edited by openjdk Bot

Loading

kuaiwei commented Sep 11, 2020 •

edited

Loading

mlbridge Bot commented Sep 14, 2020 •

edited

Loading