8249142: java/awt/FontClass/CreateFont/DeleteFont.sh is unstable#256
prrace wants to merge 1 commit into openjdk:master from prrace:deletefont
Conversation
👋 Welcome back prr! A progress list of the required criteria for merging this PR into `master` will be added to the body of your pull request.

@prrace The following label will be automatically applied to this pull request: When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.
@prrace This change now passes all automated pre-integration checks. In addition to the automated checks, the change must also fulfill all project-specific requirements. After integration, the commit message will be:

Since the source branch of this PR was last updated there have been 57 commits pushed to the master branch.

As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid automatic rebasing, please merge master into your branch first. ➡️ To integrate this PR with the above commit message to the master branch, type /integrate in a new comment.
/integrate

/integrate
@prrace Since your change was applied there have been 57 commits pushed to the master branch.

Your commit was automatically rebased without conflicts. Pushed as commit d27835b. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored.
…marks after JDK-8340093

JDK-8340093 enabled auto-vectorization for more reduction loop cases using 128-bit vector operations. As a result, the following microbenchmarks are negatively affected:

- VectorReduction2.longAddDotProduct
- VectorReduction2.longMulDotProduct
- VectorReduction2.longMulSimple

This patch fixes these regressions.

1. Improve code generation for MLA

For longAddDotProduct [1], the current implementation generates vectorized code similar to:

```
ldr q17, [x12, #16]
ldr q18, [x11, #16]
mla z16.d, p7/m, z17.d, z18.d
ldr q17, [x11, #32]
ldr q18, [x12, #32]
mla z16.d, p7/m, z18.d, z17.d
...
ldr q17, [x11, #128]
ldr q18, [x12, #128]
mla z16.d, p7/m, z18.d, z17.d
```

`z16` is the third source and destination register. There are true dependencies between consecutive mla [2] instructions. As a result, this vectorized code performs significantly worse than the scalar version due to limited instruction-level parallelism.

These mla instructions are produced by a backend match rule that fuses AddVL and MulVL into a vector MLA [3]. In this situation, avoiding instruction fusion and instead generating separate SVE mul and add instructions can improve instruction-level parallelism and overall performance. To address this, this patch introduces is_multiply_accumulate_candidate() to determine whether a node is a suitable vector MLA candidate. For node patterns that may increase execution latency, instruction fusion into MLA is disabled.
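The dependency-chain argument above can be illustrated with a toy latency model. This is only a sketch: the latencies and method names below are invented for illustration and are not figures from any Arm documentation.

```java
// Toy model comparing the critical-path length of a serial chain of fused
// MLA instructions against separate MULs feeding an ADD reduction chain.
public class IlpModel {

    // Each fused MLA reads and writes the same accumulator register, so
    // consecutive MLAs form a true dependency chain: n * mlaLatency cycles.
    static int fusedMlaCycles(int n, int mlaLatency) {
        return n * mlaLatency;
    }

    // With separate instructions, the MULs do not depend on each other and
    // can pipeline; only the ADDs into the accumulator are serial, so the
    // critical path is roughly mulLatency + n * addLatency cycles.
    static int splitMulAddCycles(int n, int mulLatency, int addLatency) {
        return mulLatency + n * addLatency;
    }

    public static void main(String[] args) {
        int n = 8; // vector ops in the unrolled loop body (illustrative)
        System.out.println("fused mla chain: " + fusedMlaCycles(n, 4) + " cycles");       // 32
        System.out.println("split mul+add:   " + splitMulAddCycles(n, 4, 1) + " cycles"); // 12
    }
}
```

Even with these made-up numbers, the model shows why breaking the accumulator chain into independent multiplies can shorten the critical path substantially.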
After applying this patch, the generated assembly looks like:

```
ldr q17, [x12, #16]
ldr q18, [x11, #16]
ldr q19, [x11, #32]
mul z17.d, p7/m, z17.d, z18.d
ldr q18, [x12, #32]
ldr q20, [x11, #48]
mul z18.d, p7/m, z18.d, z19.d
ldr q19, [x12, #48]
add v16.2d, v17.2d, v16.2d
ldr q17, [x11, #64]
add v16.2d, v18.2d, v16.2d
ldr q18, [x12, #64]
mul z19.d, p7/m, z19.d, z20.d
ldr q20, [x12, #80]
add v16.2d, v19.2d, v16.2d
```

This sequence exposes more independent operations and reduces dependency chains, leading to improved performance. Since SVE mls instructions may suffer from similar issues, the same logic has been extended to cover MLS as well. Additional microbenchmarks have been added accordingly.

2. Avoid vectorizing MUL-heavy loops

For longMulSimple [4], the generated vectorized code exhibits long dependency chains of SVE mul instructions, which results in worse performance than scalar execution:

```
ldr q17, [x1, #16]
ldr q18, [x1, #32]
mul z17.d, p7/m, z17.d, z16.d
ldr q16, [x1, #48]
mul z17.d, p7/m, z17.d, z18.d
ldr q18, [x1, #64]
mul z16.d, p7/m, z16.d, z17.d
...
ldr q16, [x1, #256]
mul z17.d, p7/m, z17.d, z19.d
mul z16.d, p7/m, z16.d, z17.d
```

To address this, the patch introduces a platform-specific interface: `VTransformElementWiseVectorNode::node_weight()`. For 128-bit operations, this interface detects consecutive vector long multiply operations and increases the node weight to 4, which is the minimum value required for the cost model to avoid vectorization on both 128-bit and 256-bit platforms.

3. Results

Performance measurements on 128-bit and 256-bit SVE machines show that these changes avoid harmful vectorization and improve overall performance for the affected benchmarks.
Configurations:

- patch: results obtained after applying this patch, using default auto-vectorization settings (-XX:+UseSuperWord, -XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)
- main-default: results on mainline using the same default auto-vectorization settings (-XX:+UseSuperWord, -XX:AutoVectorizationOverrideProfitability=1, cost-model decision mode)
- main-scalar: results on mainline with -XX:+UseSuperWord and -XX:AutoVectorizationOverrideProfitability=0 (force scalar code)

The table below reports relative performance changes:

- p/m1 = (patch - main-default) / main-default
- p/m0 = (patch - main-scalar) / main-scalar

Mode: avgt, Unit: ns/op

Arm Neoverse V2 machine (128-bit SVE):

```
Benchmark                                           (COUNT)     p/m1     p/m0
TypeVectorOperationsSuperWord.mlaL                      512    0.16%  -50.42%
TypeVectorOperationsSuperWord.mlaL                     2048    0.26%  -56.70%
TypeVectorOperationsSuperWord.mlsL                      512   -0.10%  -50.37%
TypeVectorOperationsSuperWord.mlsL                     2048    0.14%  -56.82%
TypeVectorOperationsSuperWord.mulBigL                   512    0.06%  -25.77%
TypeVectorOperationsSuperWord.mulBigL                  2048   -0.02%  -19.63%
TypeVectorOperationsSuperWord.mulI                      512    0.63%  -63.44%
TypeVectorOperationsSuperWord.mulI                     2048    0.28%  -63.07%
TypeVectorOperationsSuperWord.mulL                      512   -0.03%  -50.47%
TypeVectorOperationsSuperWord.mulL                     2048    0.29%  -50.82%
TypeVectorOperationsSuperWord.mulMediumL                512   -0.19%  -27.54%
TypeVectorOperationsSuperWord.mulMediumL               2048    0.24%  -25.18%
TypeVectorOperationsSuperWord.mulMlaLDependent          512    0.30%  -28.70%
TypeVectorOperationsSuperWord.mulMlaLDependent         2048    0.12%  -26.74%
TypeVectorOperationsSuperWord.mulMlaLIndependent        512  -10.43%  -43.09%
TypeVectorOperationsSuperWord.mulMlaLIndependent       2048  -14.82%  -42.68%
VectorReduction2.WithSuperword.longAddBig              2048  -15.15%  -44.01%
VectorReduction2.WithSuperword.longAddBigMixSub1       2048   -6.19%  -43.92%
VectorReduction2.WithSuperword.longAddBigMixSub2       2048  -15.18%  -43.90%
VectorReduction2.WithSuperword.longAddBigMixSub3       2048   -5.74%  -43.87%
VectorReduction2.WithSuperword.longAddDotProduct       2048  -33.36%  -18.16%
VectorReduction2.WithSuperword.longAddSimple           2048   -0.02%   -6.72%
VectorReduction2.WithSuperword.longAndBig              2048  -16.32%  -44.06%
VectorReduction2.WithSuperword.longAndDotProduct       2048   -0.01%   -3.74%
VectorReduction2.WithSuperword.longAndSimple           2048    0.00%   -6.35%
VectorReduction2.WithSuperword.longMaxBig              2048  -15.29%  -52.09%
VectorReduction2.WithSuperword.longMaxDotProduct       2048   -0.03%  -52.08%
VectorReduction2.WithSuperword.longMaxSimple           2048   -0.40%  -52.74%
VectorReduction2.WithSuperword.longMinBig              2048  -14.88%  -51.70%
VectorReduction2.WithSuperword.longMinDotProduct       2048    0.01%  -52.21%
VectorReduction2.WithSuperword.longMinSimple           2048    0.26%  -52.88%
VectorReduction2.WithSuperword.longMulBig              2048   -2.21%   -0.07%
VectorReduction2.WithSuperword.longMulDotProduct       2048  -15.47%    0.00%
VectorReduction2.WithSuperword.longMulSimple           2048  -17.87%   -0.33%
VectorReduction2.WithSuperword.longOrBig               2048  -15.23%  -43.94%
VectorReduction2.WithSuperword.longOrDotProduct        2048   -0.01%   -3.83%
VectorReduction2.WithSuperword.longOrSimple            2048   -0.01%   -6.60%
VectorReduction2.WithSuperword.longXorBig              2048  -10.03%  -41.62%
VectorReduction2.WithSuperword.longXorDotProduct       2048    0.01%  -38.61%
VectorReduction2.WithSuperword.longXorSimple           2048    0.02%  -53.18%
```

Arm Neoverse V1 machine (256-bit SVE):

Note: In the current mainline code, the AArch64 backend supports only 128-bit multiply long operations. Auto-vectorization accounts for this backend constraint and splits 256-bit vectors into 128-bit chunks so that the loop can still be vectorized. This is why 256-bit platforms also benefit from this patch. No obvious performance changes are observed for other benchmarks.

```
Benchmark                            (COUNT)     p/m1     p/m0
VectorReduction2.longMulDotProduct      2048  -28.23%    0.00%
VectorReduction2.longMulSimple          2048  -19.29%    0.01%
```

Tiers 1-3 passed on both aarch64 and x86 platforms.
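The node_weight()-based decision described in section 2 can be sketched as a simple profitability check. This is a hedged sketch: the real cost model lives in C2's C++ sources, and the method and parameter names below are invented for illustration.

```java
// Sketch of a SuperWord-style profitability check: vectorize only when the
// weighted cost of the vector nodes is below the cost of the scalar
// operations they replace.
public class CostModelSketch {

    // vectorCost: each vector node contributes its weight (default 1; the
    // patch raises chained 128-bit long multiplies to 4).
    // scalarCost: each vector node replaces `lanes` scalar operations
    // (a 128-bit vector of longs has 2 lanes).
    static boolean profitable(int chainedVectorMuls, int lanes, int mulWeight) {
        int vectorCost = chainedVectorMuls * mulWeight;
        int scalarCost = chainedVectorMuls * lanes;
        return vectorCost < scalarCost;
    }

    public static void main(String[] args) {
        int lanes = 2; // 128-bit vector of 64-bit longs
        // Default weight 1: vectorization looks profitable.
        System.out.println(profitable(8, lanes, 1)); // true
        // Weight raised to 4 for chained long multiplies: not profitable,
        // so the cost model keeps the scalar loop.
        System.out.println(profitable(8, lanes, 4)); // false
    }
}
```

With only 2 long lanes per 128-bit vector, a weight of 4 per multiply is enough to tip the comparison against vectorization, which matches the patch's statement that 4 is the minimum weight needed to suppress it.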
[1] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1096
[2] https://developer.arm.com/documentation/ddi0602/2025-12/SVE-Instructions/MLA--vectors---Multiply-add--predicated--?lang=en
[3] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/src/hotspot/cpu/aarch64/aarch64_vector.ad#L2617
[4] https://github.com/openjdk/jdk/blob/c5f288e2ae2ebe6ee4a0d39d91348f746bd0e353/test/micro/org/openjdk/bench/vm/compiler/VectorReduction2.java#L1035
This test is being marked intermittent, although at the same time I am trying to make it less likely to fail.
However, since we have a known issue around NIO mmap'd files not being directly unmappable, the deletes
the font system makes may be stymied on Windows, so marking it intermittent is probably for the best.
One other change: the tmp files created are now of different sizes, so we can tell which createFont()
call resulted in the font that can't be deleted. If it is always the Type 1 fonts, that will be good
evidence that mmap is the problem.
We likely need to stop using mmap for this reason.
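The distinct-file-size diagnostic described above could look roughly like this. This is an illustrative sketch, not the actual test code; the file names and sizes are invented.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class FontFileSizes {

    // Create a temp file padded to a distinct size, so that if a later
    // delete() fails, the leftover file's length() identifies which
    // createFont() call produced it.
    static File writeTempFile(String prefix, int size) {
        try {
            File f = File.createTempFile(prefix, ".font");
            f.deleteOnExit();
            try (FileOutputStream out = new FileOutputStream(f)) {
                out.write(new byte[size]);
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File ttf = writeTempFile("ttf", 1000); // e.g. the TrueType case
        File t1  = writeTempFile("t1",  2000); // e.g. the Type 1 case
        System.out.println("ttf size=" + ttf.length()); // ttf size=1000
        System.out.println("t1 size="  + t1.length());  // t1 size=2000
        // If a delete fails (e.g. because the file is still mmap'd on
        // Windows), report the size of the survivor.
        for (File f : new File[] { ttf, t1 }) {
            if (!f.delete()) {
                System.out.println("undeleted file of size " + f.length());
            }
        }
    }
}
```

Because each size is unique, a single "undeleted file of size N" message is enough to pin the failure on one specific createFont() path.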
Download
$ git fetch https://git.openjdk.java.net/jdk pull/256/head:pull/256
$ git checkout pull/256