Skip to content

🐛 fix(venv): insidious creation/cleanup race condition#940

Open
13steinj wants to merge 1 commit intoaspect-build:mainfrom
13steinj:fix-venv-race-condition
Open

🐛 fix(venv): insidious creation/cleanup race condition#940
13steinj wants to merge 1 commit intoaspect-build:mainfrom
13steinj:fix-venv-race-condition

Conversation

@13steinj
Copy link
Copy Markdown

@13steinj 13steinj commented Apr 22, 2026

Suffixes the venv created with the running shell's PID, ensuring that each call generates a venv at a unique location.

Also, explicitly cleans the venv up at the end to avoid too many piling up.

Fixes #858, #899.


I am near certain the issue I've been wrestling with is the same as #858 and that the root cause that this PR fixes is the same as the root cause for #899. I don't know if there is any code path in which the venv can be created outside the bazel output base, which is why I added the trap cleanup. If there isn't I am happy to remove that line.


Test plan

  • Manual testing; On a large enough codebase that uses py_binary and specific venvs (say, via the unstable uv based rules and dependency_groups and a uv.lock lockfile, and more specifically, one that has build actions that use a py_binary as a code generator), attempt to build on a beefy 96 core machine with just the right actions cached such that the venvs collide.
    Sadly as this is a race condition your machine needs to be just beefy enough that the actions run in the right order to create the perfect storm of pain.
    Then, do it again with this branch, and see that the venvs no longer collide.
    If one is using remote execution, you will not (or at least it will be very difficult) to be able to reproduce the original bug, because it appears that under remote execution the venvs are (a) potentially sharded across different nodes, and (b) I imagine the remote execution implementation defines its own prefix path per action, such that I am sure that there exist implementations that have a more unique path based on the action information.

@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Apr 22, 2026

CLA assistant check
All committers have signed the CLA.

@aspect-workflows
Copy link
Copy Markdown

aspect-workflows Bot commented Apr 22, 2026

Bazel 8 (Test)

All tests were cache hits

120 tests (100.0%) were fully cached saving 56s.


Bazel 9 (Test)

All tests were cache hits

119 tests (100.0%) were fully cached saving 1m.


Bazel 8 (Test)

e2e

All tests were cache hits

54 tests (100.0%) were fully cached saving 45s.


Bazel 9 (Test)

e2e

All tests were cache hits

54 tests (100.0%) were fully cached saving 43s.


Bazel 8 (Test)

examples/uv_pip_compile

All tests were cache hits

1 test (100.0%) was fully cached saving 444ms.


Buildifier

Suffixes the venv created with the running shell's PID, ensuring that
each call generates a venv at a unique location.

Also, explicitly cleans the venv up at the end to avoid too many
piling up.

Fixes aspect-build#858, aspect-build#899.
@13steinj 13steinj force-pushed the fix-venv-race-condition branch from cca2ab3 to b86722f Compare April 23, 2026 01:51
@13steinj 13steinj changed the title 🐛 fix(venv): insiduous creation/cleanup race condition 🐛 fix(venv): insidious creation/cleanup race condition Apr 23, 2026
@jbedard jbedard requested review from dzbarsky, jbedard and xangcastle May 1, 2026 22:39
@dzbarsky
Copy link
Copy Markdown
Collaborator

dzbarsky commented May 1, 2026

Isn't this moot with static venv? Is the plan to deprecate/delete all this code? :)

@jbedard
Copy link
Copy Markdown
Member

jbedard commented May 2, 2026

Yes I think this will all go away, but this might still be worth merging in 1.x for now?

@13steinj
Copy link
Copy Markdown
Author

13steinj commented May 5, 2026

Isn't this moot with static venv? Is the plan to deprecate/delete all this code? :)

I don't have context on this at all, but would generally appreciate a merge to avoid local patches getting stale :)

@13steinj
Copy link
Copy Markdown
Author

13steinj commented May 5, 2026

Assuming you're referring to #967 now that I am seeing this at a reasonable hour.
I don't care how this gets fixed so long as it does, but I would kindly ask / suggest it's worth merging to fix in 1.x for now, to avoid the scenario that the new behavior has a change or breakage that people can't jump forward to.

@jbedard
Copy link
Copy Markdown
Member

jbedard commented May 6, 2026

@13steinj I am quite hesitant to be creating random directories like this and then depending on trap to delete them which is why I haven't merged this. I'm not certain how reliable the trap is, or if this might start leaving a collection of infinitely-growing files behind?

@13steinj
Copy link
Copy Markdown
Author

13steinj commented May 7, 2026

@jbedard
trap ... EXIT is reliable in bash/zsh: it fires on normal exit, set -e aborts, and most signals. The main miss case is SIGKILL, but that's true of any cleanup mechanism.

As for accumulation: these directories (normally) live inside the output base, so bazel clean sweeps them regardless. They won't grow unboundedly like a /tmp leak would (but if someone does something funky and manages to force these be made in /tmp, most modern linux systems use tmpfs / shm.

Happy to add some additional belt-and-suspenders of your choosing, but I think this is sufficient as-is. We've been running with this patch, with remote execution and not, for the past 3-4 weeks on a fairly large monorepo with a lot of Python and nobody has reported a bug, and none of the internal monitoring has sounded alarms on disk space related to python venvs.

@13steinj
Copy link
Copy Markdown
Author

13steinj commented May 7, 2026

But now that there's a conflict, I assume you'd like me to verify that the latest release (or latest main?) still has the issue?

@jbedard
Copy link
Copy Markdown
Member

jbedard commented May 7, 2026

main is now moving toward an alpha release of v2 which will not have this bug anymore. If you'd still like to debate adding this to a v1 patch I can create a v2 branch that this PR will target?

@13steinj
Copy link
Copy Markdown
Author

13steinj commented May 7, 2026

I am fine using v2 assuming it "Just Works" ™️, once out I can test and report back. When do you expect v2 to be released?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: venv collision on build_tool

4 participants