Skip to content

Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches#2581

Open
SuperStrongDinosaur wants to merge 2 commits into
google:mainfrom
SuperStrongDinosaur:restartHangFix
Open

Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches#2581
SuperStrongDinosaur wants to merge 2 commits into
google:mainfrom
SuperStrongDinosaur:restartHangFix

Conversation

@SuperStrongDinosaur
Copy link
Copy Markdown
Collaborator

@SuperStrongDinosaur SuperStrongDinosaur commented May 19, 2026

Description
This change addresses a set of critical deadlocks, resource-leak hangs, and configuration-mapping bugs encountered in multi-instance CVD deployments, specifically during a cvd restart.

By resolving these architectural synchronization issues, single instances within a multi-instance group can now be safely restarted independently without hanging or crashing other running instances, and without deadlocking host-side shared daemons.

Fixes

  1. Prevent FIFO Unlinking Deadlocks (SharedFD::Fifo & DeleteFifos )

Issue: Previously, SharedFD::Fifo always deleted the path before calling mkfifo(). In a multi-instance setup, global host-side daemons are started once and hold open connections to the VM instances. Unlinking these paths on a single-instance restart destroys the inode mapping, meaning the restarted instance's crosvm would construct new FIFOs that the active netsimd daemon has no knowledge of. This caused crosvm to hang indefinitely on startup, waiting for a reader/writer connection that would never come.

Solution: Modified SharedFD::Fifo to perform a stat() check on the target path first. If the file already exists and is verified to be a FIFO, it is opened directly instead of unlinking and recreating it. Removed bt_fifo_vm, nfc_fifo_vm, and uwb_fifo_vm from the unlinking sequence in ServerLoopImpl::DeleteFifos() to ensure their persistent paths are preserved across instance-specific lifecycles.

  1. Ensure Proper FIFO Provisioning for Shared UWB Services (Files affected: uwb_connector.cpp)

Issue: The UWB host-connector logic only created the uwb_fifo_vm FIFOs if instance.enable_host_uwb_connector() was evaluated to true. If config.enable_host_uwb() was true but the specific instance did not run the connector, the missing FIFOs would crash host-side daemons or break crosvm initialization.

Solution: Decoupled the FIFO generation from the launcher command check. The FIFOs are now always initialized if global UWB is enabled, while the local host-connector service itself is only spawned if the instance enable_host_uwb_connector() is enabled.

  1. Fix ProcessMonitor Shutdown Socket Hangs (Files affected: process_monitor.cc, process_monitor.h)

Issue: During a shutdown or restart event, ProcessMonitor could hang waiting on socket reads in ReadMonitorSocketLoop because the control channel remained blocked. If the socket connection returned an error or was half-closed, the loop could fail or crash rather than exiting cleanly.

Solution: Retained the child socket's raw file descriptor (child_sock_) within the ProcessMonitor class. Added an explicit child_sock_->Shutdown(SHUT_RDWR) call at the end of the MonitorRoutine execution to actively force-unblock any outstanding or stuck reads on the socket during shutdown. Hardened ReadMonitorSocketLoop to ignore read errors gracefully if the monitor has already been marked as shutting down (!running.load()).

Testing and Verification
The stability of these fixes was successfully validated under multi-instance conditions:

Multi-Instance Booting

Booted two parallel VM instances using the locally compiled binaries:
cvd create --config=sdv_core_instance1 --num_instances=2

Both instances initialized, completed guest execution, and reached the VIRTUAL_DEVICE_BOOT_COMPLETED signal cleanly.

Single-Instance Restarting (No Deadlocks)

Executed a single-instance restart command targeting instance 2:
cvd --group_name=cvd_1 --instance_name=2 restart

Instance 2 successfully halted its virtual processes, safely terminated crosvm and its children, preserved the shared Bluetooth/UWB/NFC FIFOs, and booted back to VIRTUAL_DEVICE_BOOT_COMPLETED.

Instance 1 remained completely undisturbed and functional during the entire restart cycle.

Verified using cvd status --print that both instances returned to a healthy Running state.

b/510634395

@SuperStrongDinosaur SuperStrongDinosaur marked this pull request as ready for review May 19, 2026 09:56
@GoogleCuttlefishTesterBot GoogleCuttlefishTesterBot removed the kokoro:run Run e2e tests. label May 19, 2026
@SuperStrongDinosaur SuperStrongDinosaur changed the title Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UW… Fix Cuttlefish multi-instance restart deadlocks, socket hangs, and UWB/bootconfig configuration mismatches May 19, 2026
Comment on lines +522 to 524
CF_EXPECTF(TEMP_FAILURE_RETRY(mkfifo(path.c_str(), mode)) == 0,
"Failed to mkfifo('{}', {:o}): {}", path, mode,
::cuttlefish::StrError(errno));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original motivation for removing fifos before a restart was that if there was some partially transmitted messages still in memory for the pipe, that the guest could receive the second part of a message as data coming in on this channel.

I see the argument though that the persistent host processes get confused when their communication channels get closed and they can't recover from that.

In the case the file already exists, what do you think of draining anything in-flight from the fifo by reading while there is data to read? Then the guest has less chance of seeing something incomplete when it starts up.

Copy link
Copy Markdown
Collaborator Author

@SuperStrongDinosaur SuperStrongDinosaur May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I've updated the implementation: if the FIFO exists, we temporarily switch it to O_NONBLOCK and reading while there is data to read before restoring the original. This gives us clean channels for the guest without disconnecting host processes.

if (TEMP_FAILURE_RETRY(stat(path.c_str(), &st)) == 0) {
CF_EXPECTF(TEMP_FAILURE_RETRY(remove(path.c_str())) == 0,
"Failed to delete old file at '{}': '{}'", path,
if (TEMP_FAILURE_RETRY(stat(path.c_str(), &st)) != 0) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't the right place for this fix. In fact, this function shouldn't even be deleting the existing file when it existed. This is a utility function to simplify the mkfifo interface, intended to be usable from everywhere in the project, making it delete the existing file or leave it in place if it is a fifo makes it use case specific and more likely to cause similar problems when used elsewhere.

This function should just fail if a file exists, it should be the responsibility of the caller to check first and take the appropriate action.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the callers were changed in aosp/2813549, when before they would error out if the file existed before. They all assumed the files would be deleted based on the DeleteFifos() function used in restart and powerwash flows, but that list could easily get out of sync with the actual fifos used elsewhere in the codebase.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the context. Keeping SharedFD::Fifo generic is a valid concern, but multi-instance restarts present an issue: if we unlink the FIFO, running host daemons remain connected to the old, unlinked inode, while the restarted guest's crosvm binds to the newly created one. This breaks communication. To prevent this, we should reuse the existing FIFO node instead of unlinking it.

Pushing this logic to callers would require updating many call sites with existence checks, reuse logic, and the new draining loop. Centralizing this code inside SharedFD::Fifo prevents code duplication across the codebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants