Use edge-triggered epoll for VT read sub-pollers #223

Open
franz1981 wants to merge 9 commits into openjdk:fibers from franz1981:fibers_et

@franz1981
Contributor

@franz1981 franz1981 commented Mar 26, 2026

Ported from 7ac9ca128885c5dd561e6fbd6bbeaddb86d6264c to the latest upstream fibers branch. Adapted to the current API which renamed implRegister/implDeregister to implStartPoll/implStopPoll and added Mode/EventFD/Cleaner/PollerGroup architecture.


Progress

  • Change must not contain extraneous whitespace

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/loom.git pull/223/head:pull/223
$ git checkout pull/223

Update a local copy of the PR:
$ git checkout pull/223
$ git pull https://git.openjdk.org/loom.git pull/223/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 223

View PR using the GUI difftool:
$ git pr show -t 223

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/loom/pull/223.diff

Using Webrev

Link to Webrev Comment

Ported from 7ac9ca128885c5dd561e6fbd6bbeaddb86d6264c to the latest
upstream fibers branch. Adapted to the current API which renamed
implRegister/implDeregister to implStartPoll/implStopPoll and added
Mode/EventFD/Cleaner/PollerGroup architecture.
@bridgekeeper

bridgekeeper Bot commented Mar 26, 2026

👋 Welcome back franz1981! A progress list of the required criteria for merging this PR into fibers will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk Bot commented Mar 26, 2026

@franz1981 This change now passes all automated pre-integration checks.

ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details.

After integration, the commit message for the final commit will be:

Use edge-triggered epoll for VT read sub-pollers

You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed.

At the time when this comment was updated there had been no new commits pushed to the fibers branch. If another commit should be pushed before you perform the /integrate command, your PR will be automatically rebased. If you prefer to avoid any potential automatic rebasing, please check the documentation for the /integrate command for further details.

As you do not have Committer status in this project an existing Committer must agree to sponsor your change.

➡️ To flag this PR as ready for integration with the above commit message, type /integrate in a new comment. (Afterwards, your sponsor types /sponsor in a new comment to perform the integration).

@franz1981 franz1981 marked this pull request as ready for review March 26, 2026 09:19
@openjdk openjdk Bot added the ready (Ready to be integrated) and rfr (Ready for review) labels Mar 26, 2026
Thread t = map.remove(fdVal);
Thread t;
if (retainRegistration()) {
t = map.replace(fdVal, REGISTERED);
Contributor Author

This can be optimized

p.clearRegistration(fdVal);
}
for (Poller p : writePollers()) {
p.clearRegistration(fdVal);
Contributor Author

Useless as we don't use it for write pollers

@franz1981
Contributor Author

@AlanBateman this implements the same scheme used by Go, since the semantics of VTs and goroutines are fairly similar. It works because we always perform a non-blocking read upfront, which parks the VT only if EAGAIN is returned.
Thanks to ET, data that is already available to read no longer makes epoll_wait return repeatedly and spin the pollers, matching what users can do while interacting with blocking streams.
There are a few cons:

  • the CHM contains all the registered FDs
  • short-lived FDs won't benefit from this
  • if users forget to close a stream this will bloat the CHM (but the OS wasn't happy about that before, either)

The biggest pro is saving 1 syscall per blocking read, which, with small reads and many cores, can be very relevant: epoll_ctl always acquires a mutex to update the RB tree within the kernel, and with many cores this can lead to scalability issues too.
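
To make the trade-off above concrete, here is a toy model (not the patch itself) that counts simulated epoll_ctl calls for one-shot versus retained edge-triggered registrations. The class name, the way the REGISTERED sentinel is modelled, and the counter are assumptions made for illustration; no real epoll is involved.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Toy cost model: counts simulated epoll_ctl calls; hypothetical, not JDK code. */
final class RegistrationCostModel {
    // Sentinel meaning "fd is still registered with epoll, nobody is parked on it".
    static final Object REGISTERED = new Object();

    final ConcurrentHashMap<Integer, Object> map = new ConcurrentHashMap<>();
    final AtomicLong epollCtlCalls = new AtomicLong();
    final boolean edgeTriggered;

    RegistrationCostModel(boolean edgeTriggered) {
        this.edgeTriggered = edgeTriggered;
    }

    /** A read returned EAGAIN: record the waiter, arming the fd only if needed. */
    void startPoll(int fd, Thread waiter) {
        Object prev = map.put(fd, waiter);
        if (prev == null) {
            epollCtlCalls.incrementAndGet(); // simulated epoll_ctl (add or re-arm)
        }
    }

    /** The poller saw a readable event for fd. */
    void polled(int fd) {
        if (edgeTriggered) {
            map.replace(fd, REGISTERED); // keep the entry: the fd stays armed
        } else {
            map.remove(fd);              // one-shot: the next park must re-arm
        }
    }

    /** Closing the fd is the only point where a retained entry goes away. */
    void close(int fd) {
        map.remove(fd);
    }

    /** Runs the given number of park/wake cycles on one fd and returns the call count. */
    static long simulate(boolean edgeTriggered, int blockingReads) {
        RegistrationCostModel m = new RegistrationCostModel(edgeTriggered);
        for (int i = 0; i < blockingReads; i++) {
            m.startPoll(42, Thread.currentThread());
            m.polled(42);
        }
        return m.epollCtlCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("one-shot epoll_ctl calls:       " + simulate(false, 1_000));
        System.out.println("edge-triggered epoll_ctl calls: " + simulate(true, 1_000));
    }
}
```

In this model the one-shot variant pays one simulated epoll_ctl per blocking read, while the retained variant pays one per fd. The map entry also lives until close, which is the CHM growth listed among the cons.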

@mlbridge

mlbridge Bot commented Mar 26, 2026

Webrevs

@AlanBateman
Collaborator

Just an FYI that we've experimented with epoll edge-triggered mode in the past. The main concerns were that it's very fragile (only works with specific usage patterns) and adds complexity by way of bookkeeping. Yes, it can reduce the need to re-arm a file descriptor, but overall it was never clear if significant benefits could be proven in real-world cases to justify the complexity.

I'm not opposed to trying again but I think this requires creating a new branch and iterating there. Would you be okay with that?

I think it would be useful to know what testing has been done so far. I did some quick testing and see failures/timeouts with HTTP3 tests, which seem to involve UDP or selection ops in the context of a virtual thread. I think it would also be useful to see some benchmark data.

When a sub-poller's epoll_wait returns an edge-triggered event but no
VT is parked on that fd (map contains REGISTERED sentinel), the event
is consumed and discarded. If a VT parks on the fd afterward, no new
edge event fires because the fd is already in "ready" state, causing
the VT to wait forever.

Introduce a POLLED sentinel to distinguish "registered, no pending
event" from "registered, event consumed while idle". When polled()
finds no waiting VT, it sets POLLED instead of REGISTERED. When
startPoll() finds POLLED, it self-unparks via LockSupport.unpark()
so the caller's park() returns immediately and retries the read.

Zero cost in the common case (VT parks before event). Only adds one
unpark() in the rare race window.
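
The three states described in this commit message can be sketched as a small state machine. This is a minimal sketch under stated assumptions, not the code from the patch: the class name, the Outcome enum, and the use of never-started Thread objects as sentinels are hypothetical, and one reader per fd is assumed.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical sketch of the REGISTERED / POLLED bookkeeping; not JDK code. */
final class PolledSentinelSketch {
    // Never-started threads used purely as marker values in the map.
    static final Thread REGISTERED = new Thread(() -> {}, "REGISTERED");
    static final Thread POLLED = new Thread(() -> {}, "POLLED");

    enum Outcome { FIRST_REGISTRATION, WAITING, SELF_UNPARKED }

    final ConcurrentHashMap<Integer, Thread> map = new ConcurrentHashMap<>();

    /** Called by the reader right before it parks on fd. */
    Outcome startPoll(int fd) {
        Thread me = Thread.currentThread();
        for (;;) {
            Thread prev = map.get(fd);
            if (prev == null) {
                // First use of this fd: the real code would also add it to epoll here.
                if (map.putIfAbsent(fd, me) == null) return Outcome.FIRST_REGISTRATION;
            } else if (prev == POLLED) {
                // An edge was consumed while nobody was parked: do not wait for another.
                if (map.replace(fd, POLLED, REGISTERED)) {
                    LockSupport.unpark(me); // the caller's next park() returns at once
                    return Outcome.SELF_UNPARKED;
                }
            } else if (prev == REGISTERED) {
                if (map.replace(fd, REGISTERED, me)) return Outcome.WAITING;
            } else {
                throw new IllegalStateException("fd " + fd + " already has a waiter");
            }
        }
    }

    /** Called by the poller when epoll_wait reports an edge for fd. */
    void polled(int fd) {
        for (;;) {
            Thread prev = map.get(fd);
            if (prev == null || prev == POLLED) return;
            if (prev == REGISTERED) {
                // Nobody is parked: remember that the edge already fired.
                if (map.replace(fd, REGISTERED, POLLED)) return;
            } else if (map.replace(fd, prev, REGISTERED)) {
                LockSupport.unpark(prev); // wake the parked reader
                return;
            }
        }
    }

    public static void main(String[] args) {
        PolledSentinelSketch s = new PolledSentinelSketch();
        System.out.println(s.startPoll(7)); // fd seen for the first time
        s.polled(7);                        // wakes the waiter, entry becomes REGISTERED
        s.polled(7);                        // idle edge, entry becomes POLLED
        System.out.println(s.startPoll(7)); // reader arrives late and self-unparks
        LockSupport.park();                 // returns immediately, the permit is set
    }
}
```

The compare-and-replace loops are one way to keep the reader and the poller from overwriting each other's transition; whether the patch uses the same primitives is not stated in the commit message.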
@openjdk

openjdk Bot commented Mar 26, 2026

@franz1981 Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information.

RPC-style benchmark where a platform NIO echo server (selectNow spin
loop) handles requests from virtual threads doing blocking socket I/O.
Each JMH invocation spawns a VT that writes a request then blocking-
reads the response, exercising the poller registration path.

Per-thread connections with pre-allocated buffers for zero GC in
steady state. Use with -Djdk.pollerMode=2 to compare edge-triggered
vs one-shot epoll registration.
Per-carrier sub-pollers have carrier affinity, which creates a
scheduling conflict with edge-triggered registrations: the sub-poller
competes with user VTs for the same carrier. By the time the sub-poller
runs, user VTs have often already consumed data via tryRead(), causing
the sub-poller to find a POLLED sentinel and waste a full park/unpark
cycle on the master (each costing an epoll_ctl). Under load this
causes a 2x throughput regression.

VTHREAD_POLLERS mode is unaffected because its sub-pollers have no
carrier affinity and can run on any available carrier, processing
events before user VTs consume the data.
@franz1981
Contributor Author

@AlanBateman I've added a JMH benchmark in 28755d9.

IMO, since it is not a CPU-bound test, its results require some care to read.
Running it produces this diff:

  ┌─────────────────────┬──────────┬────────────────┬─────────────────┐                                                                                      
  │       Counter       │ Baseline │ Edge-triggered │      Ratio      │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ ops/s               │ 105,620  │ 107,168        │ 1.01x           │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ cycles/op           │ 11,724   │ 3,272          │ 3.6x less       │                                                                                      
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ instructions/op     │ 7,009    │ 2,031          │ 3.5x less       │                                                                                      
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ branches/op         │ 1,513    │ 438            │ 3.5x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ branch-misses/op    │ 115      │ 33             │ 3.5x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ L1-dcache-loads/op  │ 2,764    │ 816            │ 3.4x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ L1-dcache-misses/op │ 358      │ 101            │ 3.5x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ stalled-frontend/op │ 5,144    │ 1,442          │ 3.6x less       │
  ├─────────────────────┼──────────┼────────────────┼─────────────────┤                                                                                      
  │ CPI                 │ 1.67     │ 1.61           │ slightly better │
  └─────────────────────┴──────────┴────────────────┴─────────────────┘                                                                                      
                                                                    

In short, it is a huge CPU saving (for this specific case with tiny reads), but it won't impact latencies, which are bound by the loopback RTT.
But the CPU saving is there, and it is pretty relevant.

As for 7e36c5f: I have pushed a fix for a race condition (you rightly pointed out that ET is complex to deal with, and I agree :D ) which disables ET for pollerMode=3 due to this behaviour:

  • with pollerMode=2, lazy submit allows a sub-poller to first enqueue an awakened VT locally, without finding the CHM entry in the POLLED state
  • with pollerMode=3 there's no "local" submit (unless a custom scheduler implements it!), so it is likely that another FJ worker competes with the sub-poller and finds the POLLED state: this wastes a full park/unpark cycle on the master poller (each costing an epoll_ctl on it!)

So, in short, the fix at 1ac6dc3 is good enough for pollerMode=2 (which rarely sees the POLLED state unless a VT is stolen from the FJ worker's local queue), but with pollerMode=3 and the built-in scheduler it doesn't work as well, bothering the master poller way too much.

@AlanBateman
Collaborator

This looks like a 1% improvement in ops/sec. I think we'll need to get a more real-world benchmark. Do you have something other than the micro?

Do you agree with the proposal to put this in its own branch so that we can iterate on it?

@franz1981
Contributor Author

This looks like a 1% improvement in ops/sec

It is expected, as most of the cost is the RTT within the kernel (loopback ping/pong), but the CPU saving is relevant.
I am checking how to make it CPU-bound so it would translate more easily into TPS.
That said, the Netty custom scheduler repo shows some results: I just need to work on it.
But really, TPS advantages are not expected for RPC-like behaviour, as the latency is dominated by the RTT on loopback. The main advantage is the CPU saving, and I think CPU is still a relevant resource nowadays.

Switch from spawning a new VT per benchmark call to using
JMH's built-in virtual thread executor (-Djmh.executor=VIRTUAL).
The benchmark method now runs directly on a persistent VT,
eliminating VT creation/join overhead per operation and providing
a tighter measurement of the poller registration path.
Switch client connections from Socket+InputStream/OutputStream with
heap byte[] to SocketChannel with direct ByteBuffers, eliminating
internal heap-to-direct buffer copies on every read/write. Also use
direct ByteBuffers on the server side for accepted connections.
@franz1981
Contributor Author

I am still running a few benchmarks because, as you suggested, the saving in a proper benchmark is not that relevant 🙏

JMH's VIRTUAL executor calls @State(Scope.Benchmark) setup() once
per worker thread, spawning 100 platform server threads each spinning
on selectNow(). Guard with AtomicBoolean so only one server is created.
Use multiple NIO server instances (configurable via @Param serverCount)
with round-robin client connections to remove the server as a bottleneck
when scaling carrier parallelism.

Switch from selectNow() (spinning) to select(1) (blocking with 1ms
timeout) to avoid wasting CPU and polluting perfnorm measurements.

Reset AtomicBoolean guard and round-robin counter in tearDown() to
support multi-fork JMH runs.
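
The guard, the round-robin assignment, and the reset described in the two commit messages above could look roughly like this. It is a sketch with a hypothetical class and field names, and a counter stands in for starting real NIO servers.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the benchmark state; no real servers are started. */
final class SharedServerState {
    private final AtomicBoolean started = new AtomicBoolean();
    private final AtomicInteger nextServer = new AtomicInteger();
    final AtomicInteger serversStarted = new AtomicInteger();
    private final int serverCount;

    SharedServerState(int serverCount) {
        this.serverCount = serverCount;
    }

    /** May be called once per worker thread; only the first caller starts servers. */
    void setup() {
        if (started.compareAndSet(false, true)) {
            for (int i = 0; i < serverCount; i++) {
                serversStarted.incrementAndGet(); // placeholder for starting one NIO server
            }
        }
    }

    /** Round-robin choice of the server a new client connection should use. */
    int pickServer() {
        return Math.floorMod(nextServer.getAndIncrement(), serverCount);
    }

    /** Resets the guard and the counter so a later setup() starts from a clean state. */
    void tearDown() {
        serversStarted.set(0);
        nextServer.set(0);
        started.set(false);
    }

    public static void main(String[] args) {
        SharedServerState state = new SharedServerState(4);
        for (int worker = 0; worker < 100; worker++) {
            state.setup(); // 100 callers, servers created once
        }
        System.out.println("servers started: " + state.serversStarted.get());
        System.out.println("first picks: " + state.pickServer() + ", " + state.pickServer());
    }
}
```

One caveat of this simple form: a caller that loses the compareAndSet returns immediately and may run before the winner has finished starting the servers, so a real benchmark would likely need an extra wait (for example a latch) before connecting.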
@franz1981
Contributor Author

@AlanBateman here is the first round of results from the existing benchmark, plus an explanation of a JMH bug I have found.

Benchmark: Edge-triggered epoll vs EPOLLONESHOT for VT read sub-pollers

Machine: AMD Ryzen 9 7950X 16-Core, Linux 6.19.8

JVM args: -Djdk.pollerMode=2 -Djdk.virtualThreadScheduler.parallelism=P
          -Djdk.virtualThreadScheduler.maxPoolSize=2*P -Djdk.readPollers=P
          -Djmh.executor=VIRTUAL

JMH:      -f 3 -wi 3 -w 5s -i 3 -r 10s -t 100 -p readSize=1

Throughput (ops/s, higher is better)

Raw JMH output
EPOLLONESHOT (baseline):

Benchmark                              (readSize) (serverCount)   Mode  Cnt       Score       Error  Units
# parallelism=1, readPollers=1
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  123365.250 ±  1783.282  ops/s
# parallelism=2, readPollers=2
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  221708.710 ±  4651.978  ops/s
# parallelism=4, readPollers=4
SocketReadPollerBench.rpcRoundTrip              1             8  thrpt    9  436302.988 ± 11788.896  ops/s

EPOLLET (edge-triggered):

Benchmark                              (readSize) (serverCount)   Mode  Cnt       Score       Error  Units
# parallelism=1, readPollers=1
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  134129.081 ±  1199.593  ops/s
# parallelism=2, readPollers=2
SocketReadPollerBench.rpcRoundTrip              1             4  thrpt    9  244213.173 ±  3760.140  ops/s
# parallelism=4, readPollers=4
SocketReadPollerBench.rpcRoundTrip              1             8  thrpt    9  467931.437 ± 17245.761  ops/s
Ratio (ET / baseline):

              parallelism=1:  1.087x  (+8.7%)   non-overlapping CI
              parallelism=2:  1.102x  (+10.2%)  non-overlapping CI
              parallelism=4:  1.072x  (+7.2%)   non-overlapping CI
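
For readers who want to re-derive the ratios above, a small helper can recompute them from the raw JMH scores. It is only a sketch: the class and method names are made up, and it treats the JMH Error column as the half-width of the confidence interval.

```java
/** Recomputes the ET / baseline ratios from the JMH scores quoted above. */
final class JmhRatioCheck {
    /** Ratio of mean scores; values above 1.0 favour the candidate. */
    static double ratio(double baselineMean, double candidateMean) {
        return candidateMean / baselineMean;
    }

    /** True when [mean - error, mean + error] of the two runs do not overlap. */
    static boolean nonOverlapping(double meanA, double errA, double meanB, double errB) {
        return meanA + errA < meanB - errB || meanB + errB < meanA - errA;
    }

    public static void main(String[] args) {
        // {parallelism, baseline mean, baseline error, ET mean, ET error}, copied from the JMH output.
        double[][] rows = {
            {1, 123365.250, 1783.282, 134129.081, 1199.593},
            {2, 221708.710, 4651.978, 244213.173, 3760.140},
            {4, 436302.988, 11788.896, 467931.437, 17245.761},
        };
        for (double[] r : rows) {
            System.out.printf("parallelism=%d: %.3fx, non-overlapping CI: %b%n",
                    (int) r[0], ratio(r[1], r[3]), nonOverlapping(r[1], r[2], r[3], r[4]));
        }
    }
}
```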

async-profiler CPU breakdown (parallelism=1, 30s, ~58K samples)

Component                    EPOLLONESHOT    EPOLLET        Delta
epoll_ctl path               2,183 (3.8%)    0              eliminated
Poller loop (carrier)        3,747 (6.5%)    1,589 (2.7%)   -57%
Continuation mount/unmount   29,399          29,417         unchanged

Note: maxPoolSize=2*P workaround

JMH's VIRTUAL executor uses VTs for both benchmark workers and iteration control (timing/warmdown signaling). At parallelism>=2 with 100 busy VTs doing tight blocking I/O loops, the iteration control VT can get starved and never signal iteration end (awaitWarmdownReady hangs). Setting maxPoolSize=2*parallelism provides enough carrier headroom for the JMH control VTs to get scheduled. This is a JMH/scheduler interaction issue, not a Loom bug.

@franz1981
Contributor Author

I'm now working on some end-to-end benchmarks to get a better picture of what's going on. 🙏

@franz1981
Contributor Author

franz1981 commented Mar 27, 2026

@AlanBateman end-to-end results via franz1981/Netty-VirtualThread-Scheduler#97 (will be merged soon!)

EPOLL ET vs ONE_SHOT: VIRTUAL_NETTY, 10K connections, 30ms RTT, 2 server cores

Command

JAVA_HOME=<jdk> OUTPUT_DIR=<out> ./run-benchmark.sh \
  --mode VIRTUAL_NETTY --threads 2 --io nio \
  --server-cpuset "4-5" --mock-cpuset "8-11" --load-cpuset "0-3" \
  --jvm-args "-Xms8g -Xmx8g" \
  --connections 10000 --load-threads 4 \
  --mock-think-time 30 --mock-threads 4 \
  --perf-stat
  • ONE_SHOT: Shipilev openjdk-jdk-loom b549 (2026-03-20)
  • ET: Custom Loom build with EPOLL ET patch

Throughput (req/s, best of 3)

JDK        Run 1    Run 2    Run 3    Best
ONE_SHOT   46,137   45,717   45,171   46,137
ET         46,771   48,350   46,173   48,350
Delta                                 +4.8%

Server perf stat (best runs)

Metric             ONE_SHOT   ET
CPUs utilized      1.999      1.999
Ctx switches/sec   51         52
IPC                1.20       1.24
Branch misses      2.73%      2.51%

This test is not really suited to making this optimization shine, since the read payload is quite big, but it still shows some improvement.

@franz1981
Contributor Author

Another test, this time with "sustainable throughput".

EPOLL ET vs ONE_SHOT: VIRTUAL_NETTY at sustainable rate (35K req/s)

10K connections, 30ms mock RTT, 2 server cores.

Command

JAVA_HOME=<jdk> OUTPUT_DIR=<out> ./run-benchmark.sh \
  --mode VIRTUAL_NETTY --threads 2 --io nio \
  --server-cpuset "4-5" --mock-cpuset "8-11" --load-cpuset "0-3" \
  --jvm-args "-Xms8g -Xmx8g" \
  --connections 10000 --load-threads 4 \
  --mock-think-time 30 --mock-threads 4 \
  --rate 35000 --perf-stat
  • ONE_SHOT: Shipilev openjdk-jdk-loom b549 (2026-03-20)
  • ET: Custom Loom build with EPOLL ET patch

Tail latency (median across 8 runs)

Percentile   ONE_SHOT   ET       Improvement
p50          35.3ms     32.7ms   -7%
p75          55.6ms     42.1ms   -24%
p90          96.5ms     63.4ms   -34%
p99          342ms      138ms    -60%
p99.9        443ms      217ms    -51%

CPU usage at equal throughput (perf stat, median of 8 runs)

Metric          ONE_SHOT   ET     Delta
CPUs utilized   1.62       1.58   -2.6%

@franz1981
Contributor Author

@AlanBateman let me know how they look 🙏

@AlanBateman
Collaborator

@AlanBateman let me know how they look 🙏

I will look at it more closely next week. I have a strong preference that this goes into its own branch as a prototype as it will go through many iterations. Edge-triggered mode is really fragile and highly dependent on usage patterns. I think the only reason it works here is that the usages that read will only attempt to poll/arm when there are no bytes available. There are many scenarios that will need to be worked through to build up confidence, and I think it will need a knob to opt in or opt out.

@franz1981
Contributor Author

I have a strong preference that this goes into its own branch as a prototype as it will go through many iterations

Now that I have some results, I think yes, but only if you think they are desirable/meaningful results.

Re the fragility, I agree: these are the exact same requirements that Go satisfies, but I have no idea if we have cases where we won't satisfy them.

@AlanBateman
Collaborator

I've pushed the changes to a new branch "epoll" that we can use to iterate on this.

@franz1981
Contributor Author

Any news on the branch, @AlanBateman? I haven't tested it further, so IDK if there are HTTP 3 related cases where it has failed, if any.

@AlanBateman
Collaborator

The changes are in the epoll branch but there are test failures and it will need a few iterations before we can decide if the additional complexity is worth trying to bring to the main line. The risk with making this the default is high so any proposal would have to be opt-in.

Labels

ready Ready to be integrated rfr Ready for review

2 participants