
Add daemon process watchdog with crash rate-limiting #770

Open
yunsmall wants to merge 1 commit into JingMatrix:master from yunsmall:daemon-watchdog-restart

@yunsmall

When the Vector daemon process crashes, all existing Binder connections
are permanently severed:

  • The manager app calls System.exit(0) in its DeathRecipient.
  • Injected app processes clear their VectorServiceClient service
    reference and never attempt to reconnect.
  • Newly forked apps fail to obtain the IPC binder in postAppSpecialize
    and skip Xposed injection entirely.
  • Only system_server can recover, because VectorDaemon re-injects
    into it on startup.

However, all persistent state (module configs, scopes, preferences, logs)
lives on disk and survives the crash. If the daemon is restarted, new
processes can once again get a fresh binder and be injected — so an
automatic restart still restores meaningful functionality.

Changes

zygisk/module/service.sh

Wrap the unshare launch in a rate-limited watchdog loop:

  • If the daemon crashes more than 3 times within 60 seconds, the
    watchdog enters a 3-minute cooldown before retrying, then resets
    the counter. This prevents tight crash-loops from burning CPU.
  • Normal (infrequent) crashes use a 5-second restart delay.
  • The entire loop runs in the background (done &), so Magisk's
    late_start service handler returns immediately and is never blocked.
  • Each iteration calls unshare -m, creating a fresh mount namespace;
    stale mounts from a previous crashed daemon do not accumulate.
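
The rate-limiting rules above could be sketched roughly as follows. This is a minimal illustration, not the actual service.sh from this PR: the names MAX_CRASHES, WINDOW, COOLDOWN, RESTART_DELAY, record_crash, and watchdog are hypothetical, the daemon path in the usage comment is a placeholder, and the sketch assumes a fixed 60-second window (the description does not say whether the window is fixed or sliding).

```shell
# Hypothetical sketch of a crash rate limiter for a watchdog loop.
# All names and the fixed-window policy are assumptions for illustration.

MAX_CRASHES=3      # crashes tolerated inside one window
WINDOW=60          # window length in seconds
COOLDOWN=180       # 3-minute pause after too many crashes
RESTART_DELAY=5    # pause after an ordinary crash

window_start=0     # timestamp at which the current window began
crash_count=0      # crashes seen in the current window
DELAY=0            # result of the last record_crash call

# record_crash NOW: register a crash at time NOW (seconds) and set DELAY
# to the number of seconds the watchdog should wait before restarting.
record_crash() {
    _now=$1
    # Start a new window once the old one has expired.
    if [ $((_now - window_start)) -ge "$WINDOW" ]; then
        window_start=$_now
        crash_count=0
    fi
    crash_count=$((crash_count + 1))
    if [ "$crash_count" -gt "$MAX_CRASHES" ]; then
        crash_count=0          # reset the counter when entering cooldown
        DELAY=$COOLDOWN
    else
        DELAY=$RESTART_DELAY
    fi
}

# watchdog CMD...: run CMD, and restart it every time it exits.
watchdog() {
    while true; do
        "$@"                           # blocks until the daemon exits
        record_crash "$(date +%s)"
        sleep "$DELAY"
    done
}

# Usage in a service script (placeholder path, not executed here):
#   watchdog unshare -m "$MODDIR/daemon" &
```

record_crash reports its result through the DELAY variable rather than printing it, because calling it inside a command substitution would run it in a subshell and discard the updated counters.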

zygisk/module/daemon

Drop the exec keyword from the app_process invocation. Without
exec, app_process runs as a child of the daemon script. When it
exits, control returns to the shell script, which then finishes and
lets the watchdog loop in service.sh iterate. (Previously, exec
replaced the shell process with app_process, so when the daemon died
there was nothing left to restart it.)
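
The difference between the two invocation styles can be seen with a small, self-contained demonstration. It uses the standard `true` command as a stand-in for app_process and only illustrates the shell semantics of exec; the variable names are hypothetical and nothing here is taken from the actual daemon script.

```shell
# Stand-in demonstration of exec semantics; `true` plays the role of
# app_process. Variable names are illustrative only.

# With exec: the shell is replaced by the command, so the trailing echo
# is never reached.
with_exec=$(sh -c 'exec true; echo "returned to script"')

# Without exec: the command runs as a child process, and the script
# continues after it exits.
without_exec=$(sh -c 'true; echo "returned to script"')

echo "with exec:    [$with_exec]"
echo "without exec: [$without_exec]"
```

Only the second variant prints "returned to script", which is the property the daemon script relies on once exec is dropped.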

Compatibility

No effect on the daemon's normal operation path. When the daemon does
not crash, the watchdog loop is parked at unshare waiting for it to
exit — identical to the original behavior.

When the daemon crashes, all existing app connections are permanently
lost (Binder death triggers service reference clears and manager
suicide).  However, persistent state lives on disk and new processes
can still get a fresh binder, so restarting the daemon still recovers
meaningful functionality.

- service.sh: wrap the unshare launch in a rate-limited watchdog loop.
  If the daemon crashes more than 3 times within 60 seconds, the
  watchdog enters a 3-minute cooldown before retrying and resets the
  counter.  Normal restarts use a 5-second delay.  The loop runs in
  the background (&) so Magisk late_start is never blocked.

- daemon: drop 'exec' so that when app_process exits, control returns
  to the daemon script, which finishes, allowing the watchdog in
  service.sh to iterate.