8384571: C2: Add some basic IGVN optimization for VectorBlendNode#31333

Open
erifan wants to merge 1 commit into openjdk:master from erifan:JDK-8384571-vector-blend-opt-pr1
@erifan erifan commented Jun 1, 2026

This PR introduces the basic Ideal/Identity transformations for VectorBlendNode.

The semantics of VectorBlend(X, Y, M) are, per lane: M ? Y : X.

Identity:

  (VectorBlend X Y (Replicate -1)) => Y
  (VectorBlend X Y (MaskAll   -1)) => Y
  (VectorBlend X Y (Replicate  0)) => X
  (VectorBlend X Y (MaskAll    0)) => X
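The identities above can be checked against a scalar model of the blend. The following is a minimal sketch in plain Java, not the C2 implementation: the class name `BlendIdentitySketch` and the helpers `blend` and `maskAll` are hypothetical, and a `boolean[]` stands in for the vector mask.

```java
import java.util.Arrays;

public class BlendIdentitySketch {
    // Scalar model of VectorBlend(X, Y, M): each lane yields M ? Y : X.
    // (Hypothetical helper, not a HotSpot or Vector API method.)
    static int[] blend(int[] x, int[] y, boolean[] m) {
        int[] result = new int[x.length];
        for (int i = 0; i < x.length; i++) {
            result[i] = m[i] ? y[i] : x[i];
        }
        return result;
    }

    // Models an all-true or all-false mask (Replicate/MaskAll of -1 or 0).
    static boolean[] maskAll(int lanes, boolean value) {
        boolean[] m = new boolean[lanes];
        Arrays.fill(m, value);
        return m;
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3, 4};
        int[] y = {10, 20, 30, 40};
        // All-true mask: the blend is just Y.
        System.out.println(Arrays.equals(blend(x, y, maskAll(4, true)), y));
        // All-false mask: the blend is just X.
        System.out.println(Arrays.equals(blend(x, y, maskAll(4, false)), x));
    }
}
```

With a constant mask every lane takes the same branch, so the blend node can be replaced by one of its inputs.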

Ideal:

  (VectorBlend (VectorBlend X A M) B M)  => (VectorBlend X B M)
  (VectorBlend A (VectorBlend B X M) M)  => (VectorBlend A X M)
  (VectorBlend A B (XorV/XorVMask M -1)) => (VectorBlend B A M)
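Because a blend acts on each lane independently, the three Ideal rewrites can be sanity-checked on a single lane for both mask values. This is an illustrative sketch only: the class `BlendIdealSketch` and its method names are assumptions made for this example, and `!m` models the XorV/XorVMask with -1.

```java
public class BlendIdealSketch {
    // Single-lane model of VectorBlend(X, Y, M): M ? Y : X.
    static int blend(int x, int y, boolean m) {
        return m ? y : x;
    }

    // (VectorBlend (VectorBlend X A M) B M) => (VectorBlend X B M)
    static boolean nestedFirstInputHolds(int x, int a, int b, boolean m) {
        return blend(blend(x, a, m), b, m) == blend(x, b, m);
    }

    // (VectorBlend A (VectorBlend B X M) M) => (VectorBlend A X M)
    static boolean nestedSecondInputHolds(int a, int b, int x, boolean m) {
        return blend(a, blend(b, x, m), m) == blend(a, x, m);
    }

    // (VectorBlend A B (not M)) => (VectorBlend B A M)
    static boolean negatedMaskHolds(int a, int b, boolean m) {
        return blend(a, b, !m) == blend(b, a, m);
    }

    // Checks all three rewrites for both possible lane mask values.
    static boolean allRulesHold() {
        int x = 1, a = 2, b = 3; // distinct values so a wrong selection is visible
        for (boolean m : new boolean[] {false, true}) {
            if (!nestedFirstInputHolds(x, a, b, m)) return false;
            if (!nestedSecondInputHolds(a, b, x, m)) return false;
            if (!negatedMaskHolds(a, b, m)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allRulesHold());
    }
}
```

In the nested cases the inner blend's A (or B) input is never observable because both blends share the same mask, which is why the inner node can be bypassed.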

Also corrects the VectorBlendNode header comment: across all backends (X86 SSE/AVX, AArch64 NEON/SVE, RISC-V V) the active mask lane selects vec2 (in(2)), and the inactive lane selects vec1 (in(1)).

JTReg and JMH tests are also added for each optimization pattern. All tests (tier1, tier2, and tier3) passed on AArch64 and X86 platforms.

JMH benchmark test results:

On an Nvidia Grace (Neoverse-V2) machine with 128-bit SVE2:

Benchmark             Unit    Before   Error  After     Error   Uplift
blendNegatedMaskInt   ops/ms  7990.6   2.8    10215.2   11.0    1.3
identityAllOnesInt    ops/ms  3574.8   2.6    7967.1    0.3     2.2
identityAllZerosLong  ops/ms  3575.6   1.0    7966.0    3.6     2.2
nestedBlendInnerLong  ops/ms  3533.8   2.8    478573.0  3178.5  135.4
nestedBlendOuterInt   ops/ms  3537.6   3.4    472242.2  3034.2  133.5
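The Uplift column is not defined in the description; it appears to be After divided by Before, rounded to one decimal place. A small sketch that checks this assumption against two rows of the table above (the class and method names are hypothetical):

```java
public class UpliftCheck {
    // Assumed definition: uplift = after / before, rounded to one decimal.
    static double uplift(double before, double after) {
        return Math.round(after / before * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        // blendNegatedMaskInt row
        System.out.println(uplift(7990.6, 10215.2));
        // nestedBlendInnerLong row
        System.out.println(uplift(3533.8, 478573.0));
    }
}
```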

On an AWS Graviton3 (Neoverse-V1) machine with 256-bit SVE1:

Benchmark             Unit    Before   Error  After     Error   Uplift
blendNegatedMaskInt   ops/ms  5171.9   5.2    8129.0    17.3    1.6
identityAllOnesInt    ops/ms  2722.0   0.1    5891.3    0.1     2.2
identityAllZerosLong  ops/ms  2722.4   0.1    5891.1    0.3     2.2
nestedBlendInnerLong  ops/ms  2697.6   0.0    312148.7  2366.4  115.7
nestedBlendOuterInt   ops/ms  2702.7   0.1    308686.0  2709.8  114.2

On an Nvidia Grace (Neoverse-V2) machine with -XX:UseSVE=0:

Benchmark             Unit    Before   Error  After     Error   Uplift
blendNegatedMaskInt   ops/ms  7718.1   1.9    9515.9    54.0    1.2
identityAllOnesInt    ops/ms  3581.9   0.6    8062.5    0.5     2.3
identityAllZerosLong  ops/ms  3582.7   0.6    8058.5    11.9    2.2
nestedBlendInnerLong  ops/ms  3529.6   1.4    476029.8  5190.2  134.9
nestedBlendOuterInt   ops/ms  3536.9   2.1    486060.0  3442.1  137.4

On an AMD EPYC 9124 16-Core Processor with option -XX:UseAVX=3:

Benchmark             Unit    Before   Error  After     Error     Uplift
blendNegatedMaskInt   ops/ms  36773.6  541.7  46467.4   499.4     1.3
identityAllOnesInt    ops/ms  5262.7   3.7    13644.7   12.1      2.6
identityAllZerosLong  ops/ms  5272.4   3.4    13665.3   8.4       2.6
nestedBlendInnerLong  ops/ms  5256.6   4.9    436643.3  14778.8   83.1
nestedBlendOuterInt   ops/ms  5253.2   1.5    223851.3  106002.6  42.6

On an AMD EPYC 9124 16-Core Processor with option -XX:UseAVX=2:

Benchmark             Unit    Before   Error  After     Error    Uplift
blendNegatedMaskInt   ops/ms  24335.3  32.1   30412.3   28.1     1.2
identityAllOnesInt    ops/ms  5248.8   5.0    13677.5   18.4     2.6
identityAllZerosLong  ops/ms  5248.8   2.2    13655.8   2.9      2.6
nestedBlendInnerLong  ops/ms  5146.2   4.6    649242.6  1174.4   126.2
nestedBlendOuterInt   ops/ms  5141.8   6.2    646255.2  10654.1  125.7

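The microbenchmarks show a significant speedup on every configuration, from roughly 1.2x for the negated-mask pattern to more than 100x for the nested-blend patterns on four of the five setups. This is mainly because this PR eliminates redundant computations inside the loop by hoisting them out of it. At the same time, it reduces the number of IR uses, which can in turn enable further optimizations.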



Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8384571: C2: Add some basic IGVN optimization for VectorBlendNode (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/31333/head:pull/31333
$ git checkout pull/31333

Update a local copy of the PR:
$ git checkout pull/31333
$ git pull https://git.openjdk.org/jdk.git pull/31333/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 31333

View PR using the GUI difftool:
$ git pr show -t 31333

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/31333.diff

Using Webrev

Link to Webrev Comment


bridgekeeper Bot commented Jun 1, 2026

👋 Welcome back erfang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk Bot commented Jun 1, 2026

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk Bot added the hotspot-compiler (hotspot-compiler-dev@openjdk.org) and core-libs (core-libs-dev@openjdk.org) labels Jun 1, 2026

openjdk Bot commented Jun 1, 2026

@erifan The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.


openjdk Bot commented Jun 1, 2026

The total number of required reviews for this PR has been set to 2 based on the presence of this label: hotspot-compiler. This can be overridden with the /reviewers command.

openjdk Bot added the rfr (Pull request is ready for review) label Jun 1, 2026

mlbridge Bot commented Jun 1, 2026

Webrevs
