gl-signerproxy: Fix concurrent HSM request blocking via message-passing dispatcher #727

Closed
cdecker wants to merge 2 commits into main from 2026w23-hsmproxy-per-thread-runtime

cdecker commented Jun 7, 2026

Collaborator

Problem

gl-signerproxy shared a single current_thread Tokio runtime across all OS
handler threads via Arc<Runtime>. current_thread only allows one concurrent
block_on caller — when onchaind's thread called block_on(server.request(...))
for a type 143 (SIGN_ANY_REMOTE_HTLC_TO_US) request and blocked indefinitely
(no signer connected), every other handler thread was serialised behind it:

  • Type 27 init requests from other subdaemons had to wait
  • The type 143 gRPC call never reached the plugin's stager.requests
  • stuck_request_types() always returned empty
  • The bgsync session ran the full 10-minute timeout instead of aborting early

Fix

Keep a single Tokio runtime but remove direct block_on calls from handler
threads. A GrpcMessage enum carries either a ping or a forwarded HSM request,
each with a oneshot::Sender for the reply. An mpsc channel feeds a dedicated
async grpc_dispatcher that tokio::spawns each message as an independent task.
Handler threads call blocking_send to enqueue a message and blocking_recv to
wait for their own reply, so requests proceed concurrently and a permanently
stuck type 143 task can no longer delay any other task.
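
For readers who want the shape of the pattern without the gRPC plumbing, here is a minimal sketch. It is not the code from this PR: it swaps Tokio's mpsc/oneshot channels and tokio::spawn for std::sync::mpsc and OS threads so that it compiles with the standard library alone. The field names, mock_signer, spawn_dispatcher, request, and ping are hypothetical; only GrpcMessage and the type numbers 27 and 143 come from the description above.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for GrpcMessage: a ping or a forwarded HSM request,
/// each carrying the channel on which its reply is sent back.
enum GrpcMessage {
    Ping { reply: mpsc::Sender<()> },
    Request { msg_type: u16, reply: mpsc::Sender<Vec<u8>> },
}

/// Mock backend (assumption): type 143 hangs for 60 s as if no signer were
/// connected; every other type is answered immediately with its type number.
fn mock_signer(msg_type: u16) -> Vec<u8> {
    if msg_type == 143 {
        thread::sleep(Duration::from_secs(60));
    }
    msg_type.to_be_bytes().to_vec()
}

/// Dispatcher: receives messages from one channel and hands each to its own
/// worker thread, so a stuck request only ever blocks its own worker.
fn spawn_dispatcher() -> mpsc::Sender<GrpcMessage> {
    let (tx, rx) = mpsc::channel::<GrpcMessage>();
    thread::spawn(move || {
        for msg in rx {
            thread::spawn(move || match msg {
                GrpcMessage::Ping { reply } => {
                    let _ = reply.send(());
                }
                GrpcMessage::Request { msg_type, reply } => {
                    let _ = reply.send(mock_signer(msg_type));
                }
            });
        }
    });
    tx
}

/// Handler-thread side: enqueue a request, then block on this request's own
/// reply channel. Returns None if no reply arrives within `timeout`.
fn request(tx: &mpsc::Sender<GrpcMessage>, msg_type: u16, timeout: Duration) -> Option<Vec<u8>> {
    let (reply, rx) = mpsc::channel();
    tx.send(GrpcMessage::Request { msg_type, reply }).ok()?;
    rx.recv_timeout(timeout).ok()
}

/// Handler-thread side of a ping; true if the dispatcher answered in time.
fn ping(tx: &mpsc::Sender<GrpcMessage>, timeout: Duration) -> bool {
    let (reply, rx) = mpsc::channel();
    tx.send(GrpcMessage::Ping { reply }).is_ok() && rx.recv_timeout(timeout).is_ok()
}
```

Each caller waits only on its own reply channel, so a type 27 request sent while a type 143 request is hanging still gets its answer right away. The timeout in request is an addition for the sketch; the description above only says that handler threads block on blocking_recv.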

With this change, type 143 requests reach the plugin stager, stuck_request_types()
detects them, and the bgsync session aborts within 10 seconds.

Test

type_143_lockup_does_not_block_other_requests — two connections share a mock
dispatcher where type 143 sleeps for 60 s. The test confirms that a type 27
request on the second connection receives its response in well under 5 seconds
despite the concurrent blocked type 143.

Test plan

  • cargo test -p gl-signerproxy passes (regression test runs in ~0.15 s)
  • cargo check -p gl-signerproxy clean
  • Integration: bgsync session log shows stuck_request_types: [143] and
    early-abort within 10 s when a node is stuck on an onchaind type 143 request

cdecker and others added 2 commits June 4, 2026 16:57
Type 18 (WIRE_HSMD_GET_PER_COMMITMENT_POINT) is used by onchaind to
derive historical commitment keys during force-close resolution. If the
signer doesn't respond it blocks block processing just like the signing
types already in the list.

Observed on several deeply-lagged nodes where onchaind fires for a
closing channel and gets stuck on a GET_PER_COMMITMENT_POINT request
before the 10-minute bgsync preemption kicks in.
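
As a rough picture of what adding a type to "the list" means, the sketch below filters pending request types against a set of blocking types. It is an assumption-laden illustration, not the plugin's code: BLOCKING_TYPES and the signature of this stuck_request_types are hypothetical, and the list shows only the two types named in this PR (18 and 143), not the other signing types the real list contains.

```rust
/// Hypothetical list of HSM request types that stall block processing when
/// the signer does not answer. Only 18 and 143 are named in this PR.
const BLOCKING_TYPES: &[u16] = &[18, 143];

/// Returns the distinct blocking types among the pending requests, sorted
/// ascending. An empty result means nothing is considered stuck.
fn stuck_request_types(pending: &[u16]) -> Vec<u16> {
    let mut stuck: Vec<u16> = pending
        .iter()
        .copied()
        .filter(|t| BLOCKING_TYPES.contains(t))
        .collect();
    stuck.sort_unstable();
    stuck.dedup();
    stuck
}
```

Under these assumptions, a pending type 27 request alone yields an empty result, while a pending type 18 or 143 request is reported.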

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…patcher

A single current_thread Tokio runtime was shared across all OS threads via
Arc<Runtime>. When onchaind called block_on() for a type 143 request and
blocked forever (no signer connected), no other thread could enter block_on,
so type 27 init requests and all subsequent HSM calls were serialised behind
the stuck type 143. This prevented type 143 from ever reaching the plugin
stager, so stuck_request_types() returned empty and the bgsync session ran
the full 10-minute timeout instead of aborting early.

Fix: keep one runtime, remove direct block_on from handler threads. Add an
mpsc channel to a grpc_dispatcher async task that spawns each gRPC call as
an independent tokio::spawn'd task. Handler threads block on oneshot::blocking_recv
for their own response; a permanently-stuck type 143 task cannot delay any
other task.

Adds a regression test that confirms a type 143 lockup does not block a
concurrent type 27 request on a different connection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cdecker marked this pull request as draft June 9, 2026 10:11
cdecker closed this Jun 9, 2026

cdecker commented Jun 9, 2026

Collaborator Author

Closing in favor of #728 as it is simpler and more consistent.
