Use edge-triggered epoll for VT read sub-pollers#223
Conversation
Ported from 7ac9ca128885c5dd561e6fbd6bbeaddb86d6264c to the latest upstream fibers branch. Adapted to the current API which renamed implRegister/implDeregister to implStartPoll/implStopPoll and added Mode/EventFD/Cleaner/PollerGroup architecture.
|
👋 Welcome back franz1981! A progress list of the required criteria for merging this PR into |
|
@franz1981 This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been no new commits pushed to the As you do not have Committer status in this project an existing Committer must agree to sponsor your change. ➡️ To flag this PR as ready for integration with the above commit message, type |
Remove alien file
| Thread t = map.remove(fdVal); | ||
| Thread t; | ||
| if (retainRegistration()) { | ||
| t = map.replace(fdVal, REGISTERED); |
There was a problem hiding this comment.
This can be optimized
| p.clearRegistration(fdVal); | ||
| } | ||
| for (Poller p : writePollers()) { | ||
| p.clearRegistration(fdVal); |
There was a problem hiding this comment.
Useless as we don't use it for write pollers
|
@AlanBateman this implement the same scheme used by go since the semantic of VT and goroutines is fairly similar: it works as we always perform a non blocking read upfront, which would park VT if EAGAIN is retuned.
The biggest pro is to save 1 syscall per each blocking read, which, with small reads and many core, be very relevant: epoll_ctl always acquire a mutex to update the RB tree within the kernel and with many cores this can led to scalability issues too. |
Webrevs
|
|
Just an FYI that we've been experiments with epoll edge triggered mode in the past. The main concerns were that it's very fragile (only works with specific usage patterns) and adds complexity by way of book keeping. Yes, it can reduce the need to re-arm a file descriptor but overall it was never clear if significant benefits could be proving in real world cases to justify the complexity. I'm not opposed to trying again but I think this require creating a new branch and iterating there. Would you be okay with that? I think it would be useful to know what testing has been done so far. I did some quick testing and see failures/timeouts with HTTP3 tests which seems to be UDP or selection ops in the context of a virtual thread. I think it would also be useful to see some benchmark data. |
When a sub-poller's epoll_wait returns an edge-triggered event but no VT is parked on that fd (map contains REGISTERED sentinel), the event is consumed and discarded. If a VT parks on the fd afterward, no new edge event fires because the fd is already in "ready" state, causing the VT to wait forever. Introduce a POLLED sentinel to distinguish "registered, no pending event" from "registered, event consumed while idle". When polled() finds no waiting VT, it sets POLLED instead of REGISTERED. When startPoll() finds POLLED, it self-unparks via LockSupport.unpark() so the caller's park() returns immediately and retries the read. Zero cost in the common case (VT parks before event). Only adds one unpark() in the rare race window.
|
@franz1981 Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information. |
RPC-style benchmark where a platform NIO echo server (selectNow spin loop) handles requests from virtual threads doing blocking socket I/O. Each JMH invocation spawns a VT that writes a request then blocking- reads the response, exercising the poller registration path. Per-thread connections with pre-allocated buffers for zero GC in steady state. Use with -Djdk.pollerMode=2 to compare edge-triggered vs one-shot epoll registration.
Per-carrier sub-pollers have carrier affinity, which creates a scheduling conflict with edge-triggered registrations: the sub-poller competes with user VTs for the same carrier. By the time the sub-poller runs, user VTs have often already consumed data via tryRead(), causing the sub-poller to find a POLLED sentinel and waste a full park/unpark cycle on the master (each costing an epoll_ctl). Under load this causes a 2x throughput regression. VTHREAD_POLLERS mode is unaffected because its sub-pollers have no carrier affinity and can run on any available carrier, processing events before user VTs consume the data.
|
@AlanBateman I've added a JMH benchmark in 28755d9. IMO, since is not a CPU bound test, it requires some care to read its results In short it is a huge (for the specific case with tiny read) CPU saving, but it won't impact latencies being bound to loopback RTT. As per 7e36c5f instead: I have pushed a fix for a race condition (you rightly pointed out to be complex to deal with ET and I can agree w it :D ) which disable ET for pollerMode=3 due to this behaviour:
So, in short, the fix at 1ac6dc3 is good enough for pollerMode=2 (which rarely see it unless stealing from the FJ worker local queue), but pollerMode=3 with built-in scheduler, won't make it that good, bothering the master poller way too much. |
|
This looks like a 1% improvement in ops/sec. I think we'll need to get a more real-world benchmark. Do you have something other than the micro. Do you agree with the proposal to put this in its own branch so that we can iterate on it? |
it is expected as the most of the cost is the RTT within the kernel (loopback ping/pong), but the CPU saving is relevant. |
Switch from spawning a new VT per benchmark call to using JMH's built-in virtual thread executor (-Djmh.executor=VIRTUAL). The benchmark method now runs directly on a persistent VT, eliminating VT creation/join overhead per operation and providing a tighter measurement of the poller registration path.
Switch client connections from Socket+InputStream/OutputStream with heap byte[] to SocketChannel with direct ByteBuffers, eliminating internal heap-to-direct buffer copies on every read/write. Also use direct ByteBuffers on the server side for accepted connections.
|
I am still running few benchmarks because the saving in proper benchmark is not that relevant as you suggested 🙏 |
JMH's VIRTUAL executor calls @State(Scope.Benchmark) setup() once per worker thread, spawning 100 platform server threads each spinning on selectNow(). Guard with AtomicBoolean so only one server is created.
Use multiple NIO server instances (configurable via @Param serverCount) with round-robin client connections to remove the server as a bottleneck when scaling carrier parallelism. Switch from selectNow() (spinning) to select(1) (blocking with 1ms timeout) to avoid wasting CPU and polluting perfnorm measurements. Reset AtomicBoolean guard and round-robin counter in tearDown() to support multi-fork JMH runs.
|
@AlanBateman first round of results of the existing benchmark and the explanation of a JMH bug I have found. Benchmark: Edge-triggered epoll vs EPOLLONESHOT for VT read sub-pollersMachine: AMD Ryzen 9 7950X 16-Core, Linux 6.19.8 Throughput (ops/s, higher is better)Raw JMH outputasync-profiler CPU breakdown (parallelism=1, 30s, ~58K samples)
Note:
|
|
I'm working now to provide some end 2 end benchmarks to have some better picture of what's goin on. 🙏 |
|
@AlanBateman end 2 end results via franz1981/Netty-VirtualThread-Scheduler#97 (will be merged soon!) EPOLL ET vs ONE_SHOT: VIRTUAL_NETTY, 10K connections, 30ms RTT, 2 server coresCommandJAVA_HOME=<jdk> OUTPUT_DIR=<out> ./run-benchmark.sh \
--mode VIRTUAL_NETTY --threads 2 --io nio \
--server-cpuset "4-5" --mock-cpuset "8-11" --load-cpuset "0-3" \
--jvm-args "-Xms8g -Xmx8g" \
--connections 10000 --load-threads 4 \
--mock-think-time 30 --mock-threads 4 \
--perf-stat
Throughput (req/s, best of 3)
Server perf stat (best runs)
This is a test which is not really meant to be suitable to make this optimization to shine, since the read payload is quite big, but still it shows some improvement. |
|
Another test, this time with "sustainable throughput". EPOLL ET vs ONE_SHOT: VIRTUAL_NETTY at sustainable rate (35K req/s)10K connections, 30ms mock RTT, 2 server cores. CommandJAVA_HOME=<jdk> OUTPUT_DIR=<out> ./run-benchmark.sh \
--mode VIRTUAL_NETTY --threads 2 --io nio \
--server-cpuset "4-5" --mock-cpuset "8-11" --load-cpuset "0-3" \
--jvm-args "-Xms8g -Xmx8g" \
--connections 10000 --load-threads 4 \
--mock-think-time 30 --mock-threads 4 \
--rate 35000 --perf-stat
Tail latency (median across 8 runs)
CPU usage at equal throughput (perf stat, median of 8 runs)
|
|
@AlanBateman let me know how they looks like 🙏 |
I will look at it more closely next week. I have a strong preference that this goes into its own branch as a prototype as it will go through many iterations. Edge triggered mode is really fragile and high dependent on usage patterns. I think the only reason is works here is that the usages that read will only attempt to poll/arm when there are no bytes available. There are many scenarios that will need to worked through to build up confident and I think will need a knob to opt-in or opt-out. |
Now that I have some results, I think yes, but only if you think are desiderabile/meaningful results Re the fragility agree: it is the same exact requirements which go satisfied, but I have no idea if we have cases where we won't satisfy them |
|
I've pushed the changes to a new branch "epoll" that we can use to iterate on this. |
|
Any news @AlanBateman on the branch? I haven't further tested it so IDK if there are HTTP 3 related cases where it has failed - if ever |
|
The changes are in the epoll branch but there are test failures and will need a few iterations before we can decide if the additional complexity is worth trying to bring to mail line. The risk with making this the default is high so any proposal would have to be opt-in. |
Ported from 7ac9ca128885c5dd561e6fbd6bbeaddb86d6264c to the latest upstream fibers branch. Adapted to the current API which renamed implRegister/implDeregister to implStartPoll/implStopPoll and added Mode/EventFD/Cleaner/PollerGroup architecture.
Progress
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/loom.git pull/223/head:pull/223$ git checkout pull/223Update a local copy of the PR:
$ git checkout pull/223$ git pull https://git.openjdk.org/loom.git pull/223/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 223View PR using the GUI difftool:
$ git pr show -t 223Using diff file
Download this PR as a diff file:
https://git.openjdk.org/loom/pull/223.diff
Using Webrev
Link to Webrev Comment