Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches by SuperStrongDinosaur · Pull Request #2581 · google/android-cuttlefish

SuperStrongDinosaur · 2026-05-19T08:55:30Z

Description
This change addresses a set of critical deadlocks, resource-leak hangs, and configuration-mapping bugs encountered in multi-instance CVD deployments, specifically during a cvd restart.

By resolving these architectural synchronization issues, single instances within a multi-instance group can now be safely restarted independently without hanging or crashing other running instances, and without deadlocking host-side shared daemons.

Fixes

Prevent FIFO Unlinking Deadlocks (SharedFD::Fifo & DeleteFifos )

Issue: Previously, SharedFD::Fifo always deleted the path before calling mkfifo(). In a multi-instance setup, global host-side daemons are started once and hold open connections to the VM instances. Unlinking these paths on a single-instance restart destroys the inode mapping, meaning the restarted instance's crosvm would construct new FIFOs that the active netsimd daemon has no knowledge of. This caused crosvm to hang indefinitely on startup, waiting for a reader/writer connection that would never come.

Solution: Modified SharedFD::Fifo to perform a stat() check on the target path first. If the file already exists and is verified to be a FIFO, it is opened directly instead of unlinking and recreating it. Removed bt_fifo_vm, nfc_fifo_vm, and uwb_fifo_vm from the unlinking sequence in ServerLoopImpl::DeleteFifos() to ensure their persistent paths are preserved across instance-specific lifecycles.

Ensure Proper FIFO Provisioning for Shared UWB Services (Files affected: uwb_connector.cpp)

Issue: The UWB host-connector logic only created the uwb_fifo_vm FIFOs if instance.enable_host_uwb_connector() was evaluated to true. If config.enable_host_uwb() was true but the specific instance did not run the connector, the missing FIFOs would crash host-side daemons or break crosvm initialization.

Solution: Decoupled the FIFO generation from the launcher command check. The FIFOs are now always initialized if global UWB is enabled, while the local host-connector service itself is only spawned if the instance enable_host_uwb_connector() is enabled.

Fix ProcessMonitor Shutdown Socket Hangs (Files affected: process_monitor.cc, process_monitor.h)

Issue: During a shutdown or restart event, ProcessMonitor could hang waiting on socket reads in ReadMonitorSocketLoop because the control channel remained blocked. If the socket connection returned an error or was half-closed, the loop could fail or crash rather than exiting cleanly.

Solution: Retained the child socket's raw file descriptor (child_sock_) within the ProcessMonitor class. Added an explicit child_sock_->Shutdown(SHUT_RDWR) call at the end of the MonitorRoutine execution to actively force-unblock any outstanding or stuck reads on the socket during shutdown. Hardened ReadMonitorSocketLoop to ignore read errors gracefully if the monitor has already been marked as shutting down (!running.load()).

Testing and Verification
The stability of these fixes was successfully validated under multi-instance conditions:

Multi-Instance Booting

Booted two parallel VM instances using the locally compiled binaries:
cvd create --config=sdv_core_instance1 --num_instances=2

Both instances initialized, completed guest execution, and reached the VIRTUAL_DEVICE_BOOT_COMPLETED signal cleanly.

Single-Instance Restarting (No Deadlocks)

Executed a single-instance restart command targeting instance 2:
cvd --group_name=cvd_1 --instance_name=2 restart

Instance 2 successfully halted its virtual processes, safely terminated crosvm and its children, preserved the shared Bluetooth/UWB/NFC FIFOs, and booted back to VIRTUAL_DEVICE_BOOT_COMPLETED.

Instance 1 remained completely undisturbed and functional during the entire restart cycle.

Verified using cvd status --print that both instances returned to a healthy Running state.

b/510634395

…B/bootconfig configuration mismatches

Databean · 2026-05-19T18:22:32Z

+    CF_EXPECTF(TEMP_FAILURE_RETRY(mkfifo(path.c_str(), mode)) == 0,
+               "Failed to mkfifo('{}', {:o}): {}", path, mode,
               ::cuttlefish::StrError(errno));


The original motivation for removing fifos before a restart was that if there was some partially transmitted messages still in memory for the pipe, that the guest could receive the second part of a message as data coming in on this channel.

I see the argument though that the persistent host processes get confused when their communication channels get closed and they can't recover from that.

In the case the file already exists, what do you think of draining anything in-flight from the fifo by reading while there is data to read? Then the guest has less chance of seeing something incomplete when it starts up.

Ok. I've updated the implementation: if the FIFO exists, we temporarily switch it to O_NONBLOCK and reading while there is data to read before restoring the original. This gives us clean channels for the guest without disconnecting host processes.

jemoreira · 2026-05-19T18:23:05Z

-  if (TEMP_FAILURE_RETRY(stat(path.c_str(), &st)) == 0) {
-    CF_EXPECTF(TEMP_FAILURE_RETRY(remove(path.c_str())) == 0,
-               "Failed to delete old file at '{}': '{}'", path,
+  if (TEMP_FAILURE_RETRY(stat(path.c_str(), &st)) != 0) {


This isn't the right place for this fix. In fact, this function shouldn't even be deleting the existing file when it existed. This is a utility function to simplify the mkfifo interface, intended to be usable from everywhere in the project, making it delete the existing file or leave it in place if it is a fifo makes it use case specific and more likely to cause similar problems when used elsewhere.

This function should just fail if a file exists, it should be the responsibility of the caller to check first and take the appropriate action.

Most of the callers were changed in aosp/2813549, when before they would error out if the file existed before. They all assumed the files would be deleted based on the DeleteFifos() function used in restart and powerwash flows, but that list could easily get out of sync with the actual fifos used elsewhere in the codebase.

Thank you for the context. Keeping SharedFD::Fifo generic is a valid concern, but multi-instance restarts present an issue: if we unlink the FIFO, running host daemons remain connected to the old, unlinked inode, while the restarted guest's crosvm binds to the newly created one. This breaks communication. To prevent this, we should reuse the existing FIFO node instead of unlinking it.

Pushing this logic to callers would require updating many call sites with existence checks, reuse logic, and the new draining loop. Centralizing this code inside SharedFD::Fifo prevents code duplication across the codebase.

Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UW…

b5225d8

…B/bootconfig configuration mismatches

SuperStrongDinosaur marked this pull request as ready for review May 19, 2026 09:56

SuperStrongDinosaur requested review from Databean and jemoreira May 19, 2026 09:57

SuperStrongDinosaur added bug kokoro:run Run e2e tests. labels May 19, 2026

GoogleCuttlefishTesterBot removed the kokoro:run Run e2e tests. label May 19, 2026

SuperStrongDinosaur changed the title ~~Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UW…~~ Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches May 19, 2026

Databean approved these changes May 19, 2026

View reviewed changes

jemoreira reviewed May 19, 2026

View reviewed changes

Reading while there is data to read in FIFO

1e86456

SuperStrongDinosaur requested review from Databean and jemoreira May 20, 2026 12:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches#2581

Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches#2581
SuperStrongDinosaur wants to merge 2 commits into
google:mainfrom
SuperStrongDinosaur:restartHangFix

SuperStrongDinosaur commented May 19, 2026 •

edited

Loading

Uh oh!

Databean May 19, 2026

Uh oh!

SuperStrongDinosaur May 20, 2026 •

edited

Loading

Uh oh!

jemoreira May 19, 2026

Uh oh!

Databean May 19, 2026

Uh oh!

SuperStrongDinosaur May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

SuperStrongDinosaur commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Databean May 19, 2026

Choose a reason for hiding this comment

Uh oh!

SuperStrongDinosaur May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jemoreira May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Databean May 19, 2026

Choose a reason for hiding this comment

Uh oh!

SuperStrongDinosaur May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

SuperStrongDinosaur commented May 19, 2026 •

edited

Loading

SuperStrongDinosaur May 20, 2026 •

edited

Loading