Use edge-triggered epoll for VT read sub-pollers by franz1981 · Pull Request #223 · openjdk/loom

franz1981 · 2026-03-26T09:09:11Z

Ported from 7ac9ca128885c5dd561e6fbd6bbeaddb86d6264c to the latest upstream fibers branch. Adapted to the current API which renamed implRegister/implDeregister to implStartPoll/implStopPoll and added Mode/EventFD/Cleaner/PollerGroup architecture.

Progress

Change must not contain extraneous whitespace

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/loom.git pull/223/head:pull/223
$ git checkout pull/223

Update a local copy of the PR:
$ git checkout pull/223
$ git pull https://git.openjdk.org/loom.git pull/223/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 223

View PR using the GUI difftool:
$ git pr show -t 223

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/loom/pull/223.diff

Using Webrev

Link to Webrev Comment


bridgekeeper · 2026-03-26T09:10:23Z

👋 Welcome back franz1981! A progress list of the required criteria for merging this PR into fibers will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2026-03-26T09:13:04Z

@franz1981 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

Use edge-triggered epoll for VT read sub-pollers

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the fibers branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

Remove alien file

franz1981 · 2026-03-26T09:21:16Z

-        Thread t = map.remove(fdVal);
+        Thread t;
+        if (retainRegistration()) {
+            t = map.replace(fdVal, REGISTERED);


This can be optimized

franz1981 · 2026-03-26T09:22:41Z

+                p.clearRegistration(fdVal);
+            }
+            for (Poller p : writePollers()) {
+                p.clearRegistration(fdVal);


Useless as we don't use it for write pollers

franz1981 · 2026-03-26T09:29:20Z

@AlanBateman this implements the same scheme used by Go, since the semantics of VTs and goroutines are fairly similar. It works because we always perform a non-blocking read upfront, which parks the VT only if EAGAIN is returned.
Thanks to ET, epoll_wait won't make the pollers spin because of data that is still available to read, matching what users can do while interacting with blocking streams.
There are a few cons:

the CHM contains all the registered FDs
short-lived FDs won't benefit from this
if users forget to close a stream, this will bloat the CHM (but the OS wasn't happy before, either)

The biggest pro is saving one syscall per blocking read, which can be very relevant with small reads and many cores: epoll_ctl always acquires a mutex to update the RB tree within the kernel, and with many cores this can lead to scalability issues too.
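The "one syscall per blocking read" argument above can be illustrated with a toy cost model. This is only a sketch, not JDK code: `ReadArmCostModel`, its methods and the `armed` flag are hypothetical names, and the model assumes every read has to wait for data exactly once.

```java
/** Toy cost model (not JDK code): counts simulated epoll_ctl calls per blocking read. */
final class ReadArmCostModel {
    enum Registration { ONE_SHOT, EDGE_TRIGGERED }

    private final Registration registration;
    private boolean armed;      // would the simulated epoll instance report the next event?
    private int epollCtlCalls;  // simulated epoll_ctl invocations

    ReadArmCostModel(Registration registration) {
        this.registration = registration;
    }

    /** A reader got EAGAIN from its non-blocking read and is about to park. */
    void beforePark() {
        if (!armed) {
            epollCtlCalls++;    // one syscall to (re-)arm the fd
            armed = true;
        }
    }

    /** The poller observed readiness and unparked the reader. */
    void onReadable() {
        if (registration == Registration.ONE_SHOT) {
            armed = false;      // a one-shot registration is disabled after one event
        }
        // An edge-triggered registration stays armed for the next transition.
    }

    /** Simulates {@code reads} blocking reads that each have to wait for data once. */
    static int epollCtlCallsFor(Registration registration, int reads) {
        ReadArmCostModel model = new ReadArmCostModel(registration);
        for (int i = 0; i < reads; i++) {
            model.beforePark();
            model.onReadable();
        }
        return model.epollCtlCalls;
    }

    public static void main(String[] args) {
        for (Registration r : Registration.values()) {
            System.out.println(r + ": " + epollCtlCallsFor(r, 1000) + " simulated epoll_ctl calls");
        }
    }
}
```

In this model the one-shot variant pays one simulated `epoll_ctl` per read, while the edge-triggered variant pays it only for the first read on the fd, which is also why short-lived FDs gain little.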

mlbridge · 2026-03-26T09:29:54Z

Webrevs

AlanBateman · 2026-03-26T14:40:16Z

Just an FYI that we've experimented with epoll edge-triggered mode in the past. The main concerns were that it's very fragile (only works with specific usage patterns) and adds complexity by way of bookkeeping. Yes, it can reduce the need to re-arm a file descriptor, but overall it was never clear whether significant benefits could be proven in real-world cases to justify the complexity.

I'm not opposed to trying again, but I think this requires creating a new branch and iterating there. Would you be okay with that?

I think it would be useful to know what testing has been done so far. I did some quick testing and see failures/timeouts with HTTP3 tests which seems to be UDP or selection ops in the context of a virtual thread. I think it would also be useful to see some benchmark data.

When a sub-poller's epoll_wait returns an edge-triggered event but no VT is parked on that fd (map contains REGISTERED sentinel), the event is consumed and discarded. If a VT parks on the fd afterward, no new edge event fires because the fd is already in "ready" state, causing the VT to wait forever. Introduce a POLLED sentinel to distinguish "registered, no pending event" from "registered, event consumed while idle". When polled() finds no waiting VT, it sets POLLED instead of REGISTERED. When startPoll() finds POLLED, it self-unparks via LockSupport.unpark() so the caller's park() returns immediately and retries the read. Zero cost in the common case (VT parks before event). Only adds one unpark() in the rare race window.
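The sentinel bookkeeping described in this commit message can be sketched as a small state machine over a `ConcurrentHashMap`. The names `REGISTERED`, `POLLED`, `startPoll`, `polled` and `clearRegistration` come from the thread; everything else (the class name, the return values, resetting the entry to `REGISTERED` after a self-unpark) is an assumption made for illustration and not the actual patch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.LockSupport;

/** Sketch of the REGISTERED/POLLED bookkeeping; not the code from the PR. */
final class EdgeTriggeredRegistry {
    private static final Object REGISTERED = new Object(); // fd registered, no event pending
    private static final Object POLLED = new Object();     // event consumed while nobody was parked

    // Each value is REGISTERED, POLLED, or the Thread waiting on that fd.
    private final ConcurrentHashMap<Integer, Object> map = new ConcurrentHashMap<>();

    /** Called after a non-blocking read found no data; returns true if an event was already consumed. */
    boolean startPoll(int fd) {
        Thread me = Thread.currentThread();
        for (;;) {
            Object prev = map.putIfAbsent(fd, me);
            if (prev == null) {
                return false;               // first use: real code would also register the fd with epoll
            }
            if (prev == POLLED) {
                if (map.replace(fd, POLLED, REGISTERED)) {
                    LockSupport.unpark(me); // the caller's next park() returns immediately
                    return true;
                }
            } else if (prev == REGISTERED) {
                if (map.replace(fd, REGISTERED, me)) {
                    return false;
                }
            } else {
                throw new IllegalStateException("another thread is already polling fd " + fd);
            }
            // Lost a race with polled(); retry.
        }
    }

    /** Called by the sub-poller for an fd reported by epoll_wait; returns the unparked thread or null. */
    Thread polled(int fd) {
        for (;;) {
            Object cur = map.get(fd);
            if (cur == null || cur == POLLED) {
                return null;                // registration cleared, or the edge is already recorded
            }
            if (cur == REGISTERED) {
                if (map.replace(fd, REGISTERED, POLLED)) {
                    return null;            // remember the edge for the next startPoll()
                }
            } else if (map.replace(fd, cur, REGISTERED)) {
                Thread waiter = (Thread) cur;
                LockSupport.unpark(waiter);
                return waiter;
            }
        }
    }

    /** Called when the fd is closed. */
    void clearRegistration(int fd) {
        map.remove(fd);
    }

    public static void main(String[] args) {
        EdgeTriggeredRegistry registry = new EdgeTriggeredRegistry();
        registry.startPoll(7);                       // reader is about to park
        registry.polled(7);                          // event wakes the reader
        registry.polled(7);                          // second event arrives while nobody waits
        System.out.println(registry.startPoll(7));   // event was recorded, so the reader must not block
    }
}
```

A spurious wakeup is harmless in this scheme because the caller always retries the non-blocking read after `park()` returns.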

openjdk · 2026-03-26T15:23:55Z

@franz1981 Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

RPC-style benchmark where a platform NIO echo server (selectNow spin loop) handles requests from virtual threads doing blocking socket I/O. Each JMH invocation spawns a VT that writes a request and then blocking-reads the response, exercising the poller registration path. Per-thread connections with pre-allocated buffers for zero GC in steady state. Use with -Djdk.pollerMode=2 to compare edge-triggered vs one-shot epoll registration.

Per-carrier sub-pollers have carrier affinity, which creates a scheduling conflict with edge-triggered registrations: the sub-poller competes with user VTs for the same carrier. By the time the sub-poller runs, user VTs have often already consumed data via tryRead(), causing the sub-poller to find a POLLED sentinel and waste a full park/unpark cycle on the master (each costing an epoll_ctl). Under load this causes a 2x throughput regression. VTHREAD_POLLERS mode is unaffected because its sub-pollers have no carrier affinity and can run on any available carrier, processing events before user VTs consume the data.

franz1981 · 2026-03-26T16:16:09Z

@AlanBateman I've added a JMH benchmark in 28755d9.

IMO, since this is not a CPU-bound test, it requires some care to read its results.
Running it produces this diff:

┌─────────────────────┬──────────┬────────────────┬─────────────────┐
│       Counter       │ Baseline │ Edge-triggered │      Ratio      │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ ops/s               │ 105,620  │ 107,168        │ 1.01x           │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ cycles/op           │ 11,724   │ 3,272          │ 3.6x less       │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ instructions/op     │ 7,009    │ 2,031          │ 3.5x less       │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ branches/op         │ 1,513    │ 438            │ 3.5x less       │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ branch-misses/op    │ 115      │ 33             │ 3.5x less       │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ L1-dcache-loads/op  │ 2,764    │ 816            │ 3.4x less       │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ L1-dcache-misses/op │ 358      │ 101            │ 3.5x less       │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ stalled-frontend/op │ 5,144    │ 1,442          │ 3.6x less       │
├─────────────────────┼──────────┼────────────────┼─────────────────┤
│ CPI                 │ 1.67     │ 1.61           │ slightly better │
└─────────────────────┴──────────┴────────────────┴─────────────────┘

In short, it is a huge CPU saving (for this specific case with tiny reads), but it won't impact latencies, which are bound by the loopback RTT.
But the CPU saving is there, and it is pretty relevant.

As for 7e36c5f: I have pushed a fix for a race condition (you rightly pointed out that ET is complex to deal with, and I agree :D) which disables ET for pollerMode=3 due to this behaviour:

with pollerMode=2, lazy submit allows a sub-poller to first enqueue an awakened VT locally, without finding the CHM entry in the POLLED state
with pollerMode=3, there's no "local" submit (unless a custom scheduler implements it!), so it is likely that another FJ worker competes with the sub-poller and finds the POLLED state: this wastes a full park/unpark cycle on the master poller (each costing an epoll_ctl on it!)

So, in short, the fix at 1ac6dc3 is good enough for pollerMode=2 (which rarely sees it, unless the VT is stolen from the FJ worker's local queue), but with pollerMode=3 and the built-in scheduler it doesn't work that well, bothering the master poller way too much.
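The resulting policy (edge-triggered registration only for read sub-pollers, and only with pollerMode=2) can be written down as a small predicate. This is a hypothetical sketch: `retainRegistration()` appears in the diff without arguments, so the signature below, the `EdgeTriggerGate` class and the enum constant names are assumptions; only the numeric modes 2 and 3 are discussed in this thread.

```java
/** Hypothetical sketch of gating edge-triggered registration by poller mode and poller kind. */
final class EdgeTriggerGate {
    /** Constant names are assumed; the numbers mirror the values passed via -Djdk.pollerMode. */
    enum PollerMode {
        SYSTEM_THREADS(1),
        VTHREAD_POLLERS(2),
        POLLER_PER_CARRIER(3);

        final int value;

        PollerMode(int value) {
            this.value = value;
        }

        static PollerMode of(int value) {
            for (PollerMode mode : values()) {
                if (mode.value == value) {
                    return mode;
                }
            }
            throw new IllegalArgumentException("unknown poller mode: " + value);
        }
    }

    /**
     * Keep the fd registered (edge-triggered) only for read sub-pollers in VTHREAD_POLLERS mode.
     * Write pollers and per-carrier sub-pollers fall back to one-shot registration.
     */
    static boolean retainRegistration(PollerMode mode, boolean readPoller) {
        return readPoller && mode == PollerMode.VTHREAD_POLLERS;
    }

    public static void main(String[] args) {
        for (PollerMode mode : PollerMode.values()) {
            System.out.println(mode + " read=" + retainRegistration(mode, true)
                    + " write=" + retainRegistration(mode, false));
        }
    }
}
```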

AlanBateman · 2026-03-26T17:42:05Z

This looks like a 1% improvement in ops/sec. I think we'll need to get a more real-world benchmark. Do you have something other than the micro?

Do you agree with the proposal to put this in its own branch so that we can iterate on it?

franz1981 · 2026-03-26T17:56:16Z

> This looks like a 1% improvement in ops/sec

That is expected, as most of the cost is the RTT within the kernel (loopback ping/pong), but the CPU saving is relevant.
I am checking how to make it CPU-bound so that it translates more easily into TPS.
That said, the Netty custom scheduler repo shows some results: I just need to work on it.
But really, TPS advantages are not expected for RPC-like behaviour, as the latency is dominated by the RTT on loopback. The main advantage is the CPU saving, and I think CPU is still a relevant resource nowadays.

Switch from spawning a new VT per benchmark call to using JMH's built-in virtual thread executor (-Djmh.executor=VIRTUAL). The benchmark method now runs directly on a persistent VT, eliminating VT creation/join overhead per operation and providing a tighter measurement of the poller registration path.

Switch client connections from Socket+InputStream/OutputStream with heap byte[] to SocketChannel with direct ByteBuffers, eliminating internal heap-to-direct buffer copies on every read/write. Also use direct ByteBuffers on the server side for accepted connections.

franz1981 · 2026-03-26T22:08:50Z

I am still running a few benchmarks because, as you suggested, the saving in a proper benchmark is not that relevant 🙏

JMH's VIRTUAL executor calls @State(Scope.Benchmark) setup() once per worker thread, spawning 100 platform server threads each spinning on selectNow(). Guard with AtomicBoolean so only one server is created.

Use multiple NIO server instances (configurable via @Param serverCount) with round-robin client connections to remove the server as a bottleneck when scaling carrier parallelism. Switch from selectNow() (spinning) to select(1) (blocking with 1ms timeout) to avoid wasting CPU and polluting perfnorm measurements. Reset AtomicBoolean guard and round-robin counter in tearDown() to support multi-fork JMH runs.
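The guard, the round-robin assignment and the reset described in the two commit messages above could look roughly like this. It is a minimal sketch, not the benchmark's code: `SharedServerSetup` and its method names are hypothetical, and starting the servers is reduced to a `Runnable`.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of once-only server startup plus round-robin client assignment. */
final class SharedServerSetup {
    private static final AtomicBoolean STARTED = new AtomicBoolean();
    private static final AtomicInteger NEXT_SERVER = new AtomicInteger();

    /** Safe to call from every worker thread; only the first caller starts the servers. */
    static boolean setup(Runnable startServers) {
        if (!STARTED.compareAndSet(false, true)) {
            return false;       // another worker already started them
        }
        startServers.run();
        return true;
    }

    /** Picks the server for the next client connection in round-robin order. */
    static int nextServerIndex(int serverCount) {
        return Math.floorMod(NEXT_SERVER.getAndIncrement(), serverCount);
    }

    /** Resets the guard and the counter so the next fork starts from a clean state. */
    static void tearDown() {
        STARTED.set(false);
        NEXT_SERVER.set(0);
    }

    public static void main(String[] args) {
        for (int worker = 0; worker < 100; worker++) {
            if (setup(() -> System.out.println("starting servers once"))) {
                System.out.println("worker " + worker + " started the servers");
            }
        }
        for (int client = 0; client < 6; client++) {
            System.out.println("client " + client + " -> server " + nextServerIndex(4));
        }
        tearDown();
    }
}
```

A real setup method would also need the losing workers to wait until the servers are listening before they connect; that part is left out here.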

franz1981 · 2026-03-27T10:59:59Z

@AlanBateman first round of results of the existing benchmark and the explanation of a JMH bug I have found.

Benchmark: Edge-triggered epoll vs EPOLLONESHOT for VT read sub-pollers

Machine: AMD Ryzen 9 7950X 16-Core, Linux 6.19.8

JVM args: -Djdk.pollerMode=2 -Djdk.virtualThreadScheduler.parallelism=P
          -Djdk.virtualThreadScheduler.maxPoolSize=2*P -Djdk.readPollers=P
          -Djmh.executor=VIRTUAL

JMH:      -f 3 -wi 3 -w 5s -i 3 -r 10s -t 100 -p readSize=1

Throughput (ops/s, higher is better)

Raw JMH output

EPOLLONESHOT (baseline):

Benchmark                              (readSize) (serverCount)   Mode  Cnt       Score       Error  Units
# parallelism=1, readPollers=1
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  123365.250 ±  1783.282  ops/s
# parallelism=2, readPollers=2
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  221708.710 ±  4651.978  ops/s
# parallelism=4, readPollers=4
SocketReadPollerBench.rpcRoundTrip              1             8  thrpt    9  436302.988 ± 11788.896  ops/s

EPOLLET (edge-triggered):

Benchmark                              (readSize) (serverCount)   Mode  Cnt       Score       Error  Units
# parallelism=1, readPollers=1
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  134129.081 ±  1199.593  ops/s
# parallelism=2, readPollers=2
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  244213.173 ±  3760.140  ops/s
# parallelism=4, readPollers=4
SocketReadPollerBench.rpcRoundTrip              1             8  thrpt    9  467931.437 ± 17245.761  ops/s

Ratio (ET / baseline):

              parallelism=1:  1.087x  (+8.7%)   non-overlapping CI
              parallelism=2:  1.102x  (+10.2%)  non-overlapping CI
              parallelism=4:  1.072x  (+7.2%)   non-overlapping CI
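
The ratios and the "non-overlapping CI" remark can be reproduced from the raw JMH scores with a few lines of arithmetic. This is a sketch that treats the reported `± Error` as a symmetric interval around the score; the `ThroughputComparison` class and its method names are hypothetical.

```java
import java.util.Locale;

/** Recomputes ET/baseline ratios and checks whether the score ± error intervals overlap. */
final class ThroughputComparison {
    static double ratio(double baseline, double candidate) {
        return candidate / baseline;
    }

    /** Two intervals [score - error, score + error] overlap if neither lies entirely above the other. */
    static boolean intervalsOverlap(double scoreA, double errorA, double scoreB, double errorB) {
        return scoreA - errorA <= scoreB + errorB && scoreB - errorB <= scoreA + errorA;
    }

    public static void main(String[] args) {
        int[] parallelism = {1, 2, 4};
        // {score, error} pairs copied from the raw JMH output above.
        double[][] oneShot = {{123365.250, 1783.282}, {221708.710, 4651.978}, {436302.988, 11788.896}};
        double[][] edgeTriggered = {{134129.081, 1199.593}, {244213.173, 3760.140}, {467931.437, 17245.761}};

        for (int i = 0; i < parallelism.length; i++) {
            double r = ratio(oneShot[i][0], edgeTriggered[i][0]);
            boolean overlap = intervalsOverlap(oneShot[i][0], oneShot[i][1],
                    edgeTriggered[i][0], edgeTriggered[i][1]);
            System.out.printf(Locale.ROOT, "parallelism=%d: %.3fx, intervals overlap: %b%n",
                    parallelism[i], r, overlap);
        }
    }
}
```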

async-profiler CPU breakdown (parallelism=1, 30s, ~58K samples)

Component	EPOLLONESHOT	EPOLLET	Delta
`epoll_ctl` path	2,183 (3.8%)	0	eliminated
Poller loop (carrier)	3,747 (6.5%)	1,589 (2.7%)	-57%
Continuation mount/unmount	29,399	29,417	unchanged

Note: `maxPoolSize=2*P` workaround

JMH's VIRTUAL executor uses VTs for both benchmark workers and iteration control (timing/warmdown signaling). At parallelism>=2 with 100 busy VTs doing tight blocking I/O loops, the iteration control VT can get starved and never signal iteration end (awaitWarmdownReady hangs). Setting maxPoolSize=2*parallelism provides enough carrier headroom for the JMH control VTs to get scheduled. This is a JMH/scheduler interaction issue, not a Loom bug.

franz1981 · 2026-03-27T11:01:02Z

I'm now working on some end-to-end benchmarks to get a better picture of what's going on. 🙏

franz1981 · 2026-03-27T14:10:03Z

@AlanBateman end-to-end results via franz1981/Netty-VirtualThread-Scheduler#97 (will be merged soon!)

EPOLL ET vs ONE_SHOT: VIRTUAL_NETTY, 10K connections, 30ms RTT, 2 server cores

Command

JAVA_HOME=<jdk> OUTPUT_DIR=<out> ./run-benchmark.sh \
  --mode VIRTUAL_NETTY --threads 2 --io nio \
  --server-cpuset "4-5" --mock-cpuset "8-11" --load-cpuset "0-3" \
  --jvm-args "-Xms8g -Xmx8g" \
  --connections 10000 --load-threads 4 \
  --mock-think-time 30 --mock-threads 4 \
  --perf-stat

ONE_SHOT: Shipilev openjdk-jdk-loom b549 (2026-03-20)
ET: Custom Loom build with EPOLL ET patch

Throughput (req/s, best of 3)

JDK	Run 1	Run 2	Run 3	Best
ONE_SHOT	46,137	45,717	45,171	46,137
ET	46,771	48,350	46,173	48,350
Delta				+4.8%

Server perf stat (best runs)

Metric	ONE_SHOT	ET
CPUs utilized	1.999	1.999
Ctx switches/sec	51	52
IPC	1.20	1.24
Branch misses	2.73%	2.51%

This test is not really designed to make this optimization shine, since the read payload is quite big, but it still shows some improvement.

franz1981 · 2026-03-27T15:47:07Z

Another test, this time with "sustainable throughput".

EPOLL ET vs ONE_SHOT: VIRTUAL_NETTY at sustainable rate (35K req/s)

10K connections, 30ms mock RTT, 2 server cores.

Command

JAVA_HOME=<jdk> OUTPUT_DIR=<out> ./run-benchmark.sh \
  --mode VIRTUAL_NETTY --threads 2 --io nio \
  --server-cpuset "4-5" --mock-cpuset "8-11" --load-cpuset "0-3" \
  --jvm-args "-Xms8g -Xmx8g" \
  --connections 10000 --load-threads 4 \
  --mock-think-time 30 --mock-threads 4 \
  --rate 35000 --perf-stat

ONE_SHOT: Shipilev openjdk-jdk-loom b549 (2026-03-20)
ET: Custom Loom build with EPOLL ET patch

Tail latency (median across 8 runs)

Percentile	ONE_SHOT	ET	Improvement
p50	35.3ms	32.7ms	-7%
p75	55.6ms	42.1ms	-24%
p90	96.5ms	63.4ms	-34%
p99	342ms	138ms	-60%
p99.9	443ms	217ms	-51%

CPU usage at equal throughput (perf stat, median of 8 runs)

Metric	ONE_SHOT	ET	Delta
CPUs utilized	1.62	1.58	-2.6%

franz1981 · 2026-03-27T15:47:50Z

@AlanBateman let me know how they look 🙏

AlanBateman · 2026-03-27T16:03:03Z

> @AlanBateman let me know how they look 🙏

I will look at it more closely next week. I have a strong preference that this goes into its own branch as a prototype, as it will go through many iterations. Edge-triggered mode is really fragile and highly dependent on usage patterns. I think the only reason it works here is that the usages that read will only attempt to poll/arm when there are no bytes available. There are many scenarios that will need to be worked through to build up confidence, and I think it will need a knob to opt in or opt out.

franz1981 · 2026-03-27T16:54:11Z

> I have a strong preference that this goes into its own branch as a prototype as it will go through many iterations

Now that I have some results, I think yes, but only if you think the results are desirable/meaningful.

Re the fragility, I agree: these are exactly the same requirements that Go satisfies, but I have no idea whether we have cases where we won't satisfy them.

AlanBateman · 2026-04-07T15:13:35Z

I've pushed the changes to a new branch "epoll" that we can use to iterate on this.

franz1981 · 2026-05-05T09:16:34Z

Any news on the branch, @AlanBateman? I haven't tested it further, so I don't know whether there are HTTP/3-related cases where it fails, if any.

AlanBateman · 2026-05-05T13:51:58Z

The changes are in the epoll branch, but there are test failures and it will need a few iterations before we can decide if the additional complexity is worth trying to bring to the main line. The risk of making this the default is high, so any proposal would have to be opt-in.

Delete test/jdk/java/lang/Thread/virtual/CarrierAffinity.java

ae5a07b

Remove alien file

franz1981 marked this pull request as ready for review March 26, 2026 09:19

openjdk Bot added ready Ready to be integrated rfr Ready for review labels Mar 26, 2026


franz1981 force-pushed the fibers_et branch from 538723f to 1ac6dc3 Compare March 26, 2026 15:22

franz1981 added 2 commits March 26, 2026 16:54

franz1981 added 2 commits March 26, 2026 21:52

franz1981 added 2 commits March 26, 2026 23:48

Fix duplicate server creation in poller benchmark

7d0a0d4

