Add buildcache channel for Build Caching Service runtime resolution #878
LoopedBard3 wants to merge 4 commits
Pull request overview
This PR adds a new buildcache channel to Crank’s agent/runtime resolution pipeline so the runtime binaries can be sourced per-commit from the Build Caching Service (BCS), enabling finer-grained performance regression bisection than VMR feed cadence allows.
Changes:
- Added build-cache-related properties to `Job` for selecting commit/branch/config.
- Added `BuildCacheClient` for resolving latest builds, downloading/extracting artifacts, and overlaying runtime binaries.
- Updated agent startup/version resolution logic and docs to support the new `buildcache` channel and new agent CLI options.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/Microsoft.Crank.Models/Job.cs | Adds job-level inputs for BCS commit/branch/config selection. |
| src/Microsoft.Crank.Agent/Startup.cs | Adds CLI options and integrates buildcache into runtime version resolution + post-publish overlay. |
| src/Microsoft.Crank.Agent/BuildCacheClient.cs | Implements BCS querying/downloading/extraction and overlay helpers. |
| docs/dotnet_versions.md | Documents buildcache channel usage and configuration knobs. |
| docs/build_cache_requirements.md | Describes required BCS-side blob layout/access for the integration. |

            var overlayCount = BuildCacheClient.OverlayPublishedOutput(buildCacheExtractDir, outputFolder);
            Log.Info($"Build Cache: Overlaid {overlayCount} runtime files from commit {buildCacheCommitSha.Substring(0, 8)} into published output");

            // Update the reported runtime version to reflect the BCS commit
            runtimeVersion = $"{runtimeVersion}+buildcache.{buildCacheCommitSha.Substring(0, 8)}";
        }
        catch (Exception ex)

runtimeVersion is being mutated to include +buildcache.<sha> and then passed into PatchRuntimeConfig. For framework-dependent publishes (i.e., when runtimeOptions does not contain includedFrameworks), this will write a runtime version that is not actually installed, which can prevent the app from starting. Keep the feed-resolved runtime version for runtimeconfig patching and store the BCS commit separately for reporting/metadata, or gate buildcache usage to self-contained publishes only.
Addressed by 1ecc239 (round 1) and 56657e2 (round 2): the BCS commit is no longer appended to runtimeVersion. PatchRuntimeConfig keeps the feed-resolved version (which is actually installed), and the BCS commit is reported separately via MeasurementMetadata (runtime-version). FDD startup is no longer broken by the rewrite.

    // Build Cache: overlay BCS runtime binaries into the published output
    // This replaces the NuGet-sourced runtime DLLs with BCS-built ones from the specific commit
    if (useBuildCache && buildCacheExtractDir != null)
    {
        try
        {
            var overlayCount = BuildCacheClient.OverlayPublishedOutput(buildCacheExtractDir, outputFolder);
            Log.Info($"Build Cache: Overlaid {overlayCount} runtime files from commit {buildCacheCommitSha.Substring(0, 8)} into published output");

            // Update the reported runtime version to reflect the BCS commit
            runtimeVersion = $"{runtimeVersion}+buildcache.{buildCacheCommitSha.Substring(0, 8)}";
        }
        catch (Exception ex)
        {
            Log.Info($"Build Cache: Warning - overlay failed: {ex.Message}. Published app will use feed-sourced runtime.");
        }
    }

    PatchRuntimeConfig(job, outputFolder, aspNetCoreVersion, runtimeVersion);

BuildCache overlay is applied to the published output, but for non-self-contained publishes the output typically won’t contain runtime binaries to replace (and PatchRuntimeConfig will still run). If buildcache is intended to work only with job.SelfContained, enforce that with a clear error message; otherwise, consider using the existing BuildCacheClient.InstallRuntimeFromBuildCacheAsync approach to overlay into dotnetHome for framework-dependent runs.
Addressed by 19f909e (round 3) with a different approach: instead of mutating the global _dotnethome (which would create cross-job pollution), the agent now creates a per-job isolated dotnet home at <temp>/crank-buildcache/home-<shortsha>-<config>-<guid>/, mirrored from the global home and overlaid with BCS bits. StartProcess uses jobContext.BuildCacheDotnetHome for runtime resolution. FDD works correctly and concurrent jobs are isolated. The original InstallRuntimeFromBuildCacheAsync method was dead code and was deleted in round 1.
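
The per-job isolation described in this reply can be sketched roughly as follows. This is a minimal illustration, not the PR's `CreateBuildCacheDotnetHome`: the class name, method names, and parameters are hypothetical, and it copies the whole global home rather than only the relevant subtrees.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not the PR's BuildCacheClient API.
public static class IsolatedDotnetHomeSketch
{
    // Recursively copies sourceDir into destDir, overwriting existing files.
    // Returns the number of files copied.
    public static int CopyTree(string sourceDir, string destDir)
    {
        var count = 0;
        Directory.CreateDirectory(destDir);
        foreach (var file in Directory.EnumerateFiles(sourceDir, "*", SearchOption.AllDirectories))
        {
            var relative = Path.GetRelativePath(sourceDir, file);
            var target = Path.Combine(destDir, relative);
            Directory.CreateDirectory(Path.GetDirectoryName(target) ?? destDir);
            File.Copy(file, target, overwrite: true);
            count++;
        }
        return count;
    }

    // Mirrors the global home into a unique per-job directory, then overlays
    // the BCS files on top. The global home is only read, never written.
    public static string CreatePerJobHome(
        string globalHome, string bcsOverlayDir, string shortSha, string config, string tempRoot)
    {
        var jobHome = Path.Combine(
            tempRoot, "crank-buildcache", $"home-{shortSha}-{config}-{Guid.NewGuid():N}");
        CopyTree(globalHome, jobHome);
        CopyTree(bcsOverlayDir, jobHome);
        return jobHome;
    }
}
```

Because every call gets a fresh GUID-suffixed directory, two concurrent jobs for the same commit and config never share a destination, which is the property the reply relies on.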

    private static readonly HttpClient _httpClient = new HttpClient { Timeout = TimeSpan.FromMinutes(10) };

    // Cache latestBuilds.json responses to avoid repeated downloads
    private static readonly Dictionary<string, (DateTimeOffset fetchedAt, LatestBuildsResponse data)> _latestBuildsCache = new();
    private static readonly TimeSpan _latestBuildsCacheDuration = TimeSpan.FromHours(1);

    // Cache of already-installed BCS commit SHAs to avoid re-extracting
    private static readonly HashSet<string> _installedBuildCacheRuntimes = new(StringComparer.OrdinalIgnoreCase);

_latestBuildsCache and _installedBuildCacheRuntimes are static non-thread-safe collections. The agent can process multiple jobs concurrently, so concurrent accesses can corrupt these collections or throw. Use ConcurrentDictionary/ConcurrentHashSet (or locks) around these caches, and also include baseUrl in the latestBuilds cache key to avoid cross-contaminating entries when the base URL is overridden.
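
One way to act on this suggestion is sketched below, under stated assumptions: `TData` stands in for the PR's `LatestBuildsResponse`, and all class and method names are hypothetical. The .NET base class library has no `ConcurrentHashSet` type, so the sketch uses a `ConcurrentDictionary<string, byte>` as a set.

```csharp
using System;
using System.Collections.Concurrent;
using System.Diagnostics.CodeAnalysis;

// Illustrative sketch only; names here are hypothetical, not the PR's code.
public sealed class LatestBuildsCacheSketch<TData>
{
    private readonly ConcurrentDictionary<string, (DateTimeOffset FetchedAt, TData Data)> _entries = new();
    private readonly TimeSpan _lifetime;

    public LatestBuildsCacheSketch(TimeSpan lifetime) => _lifetime = lifetime;

    // The key includes baseUrl so an overridden base URL never reads
    // an entry that was fetched from a different endpoint.
    private static string Key(string baseUrl, string repoName, string branch) =>
        $"{baseUrl.TrimEnd('/')}|{repoName}|{branch}";

    public bool TryGet(string baseUrl, string repoName, string branch, DateTimeOffset now,
        [MaybeNullWhen(false)] out TData data)
    {
        if (_entries.TryGetValue(Key(baseUrl, repoName, branch), out var entry)
            && now - entry.FetchedAt < _lifetime)
        {
            data = entry.Data;
            return true;
        }

        data = default;
        return false;
    }

    public void Set(string baseUrl, string repoName, string branch, DateTimeOffset now, TData data) =>
        _entries[Key(baseUrl, repoName, branch)] = (now, data);
}

public static class InstalledRuntimesSketch
{
    // ConcurrentDictionary used as a thread-safe set; the byte value is unused.
    private static readonly ConcurrentDictionary<string, byte> _installed =
        new(StringComparer.OrdinalIgnoreCase);

    // Returns true only for the first caller that registers this key.
    public static bool TryMarkInstalled(string commitAndConfig) =>
        _installed.TryAdd(commitAndConfig, 0);
}
```

Passing `now` explicitly keeps the expiry logic deterministic in tests; a lock around plain `Dictionary`/`HashSet` instances would be an equally valid choice.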
```
GET https://pvscmdupload.blob.core.windows.net/$web/builds/{repoName}/latest/{branch}/latestBuilds.json
GET https://pvscmdupload.blob.core.windows.net/$web/builds/{repoName}/buildArtifacts/{commitSha}/{configKey}/{artifactFile}
```
This doc’s example URLs use the blob endpoint with /$web/..., but the agent default base URL (and client URL construction) uses the static website endpoint (https://pvscmdupload.z22.web.core.windows.net) without /$web. Please align the documented URLs with what the agent actually calls (or clarify which base URL should be configured via --build-cache-base-url).
Addressed by 1ecc239 (round 1): the doc URLs in docs/build_cache_requirements.md now match what the agent actually calls — the static-website endpoint https://pvscmdupload.z22.web.core.windows.net/builds/... (no /$web/). The default --build-cache-base-url value is documented and matches.
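
For reference, URL construction against the static-website endpoint could look like the sketch below. The path layout and default base URL are taken from this thread; the class and method names are hypothetical, and whether the real client escapes each segment exactly this way is an assumption.

```csharp
using System;

// Hypothetical helper; mirrors the URL layout quoted in this thread.
public static class BuildCacheUrlSketch
{
    public const string DefaultBaseUrl = "https://pvscmdupload.z22.web.core.windows.net";

    // {baseUrl}/builds/{repoName}/latest/{branch}/latestBuilds.json
    public static string LatestBuildsUrl(string baseUrl, string repoName, string branch) =>
        $"{baseUrl.TrimEnd('/')}/builds/{Uri.EscapeDataString(repoName)}" +
        $"/latest/{Uri.EscapeDataString(branch)}/latestBuilds.json";

    // {baseUrl}/builds/{repoName}/buildArtifacts/{commitSha}/{configKey}/{artifactFile}
    public static string ArtifactUrl(
        string baseUrl, string repoName, string commitSha, string configKey, string artifactFile) =>
        $"{baseUrl.TrimEnd('/')}/builds/{Uri.EscapeDataString(repoName)}" +
        $"/buildArtifacts/{Uri.EscapeDataString(commitSha)}" +
        $"/{Uri.EscapeDataString(configKey)}/{Uri.EscapeDataString(artifactFile)}";
}
```

Note that `Uri.EscapeDataString` turns a `/` inside a branch name into `%2F`, so a branch such as `release/10.0` becomes a single path segment; whether that matches the blob layout on the BCS side is not stated in this thread.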
Adds a new 'buildcache' channel that resolves .NET runtime binaries from the Build Caching Service (BCS) instead of VMR feeds. This provides per-commit granularity for performance regression bisection.

Key changes:
- New Job.cs properties: BuildCacheCommitSha, BuildCacheBranch, BuildCacheConfig
- New BuildCacheClient.cs: HTTP client for BCS latestBuilds.json and artifact download/extraction with post-build overlay into published output
- Startup.cs: 'buildcache' channel in version resolution that builds with real NuGet packages then overlays BCS runtime binaries (194 files) after publish
- Agent CLI options: --build-cache-base-url, --build-cache-repo-name, --build-cache-disabled
- Documentation in docs/dotnet_versions.md and docs/build_cache_requirements.md

Usage: --application.channel buildcache [--application.buildCacheCommitSha <sha>]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 99d0195 to 41394e4.
Changes:
* BuildCacheClient: rewrite to drop ~250 LOC of dead code from an earlier
  abandoned design (synthesizing a new shared-framework dir with a synthetic
  version). The FDD overlay path is restored as a new public API.
* New OverlayDotnetHome(extractDir, dotnetHome, runtimeVersion): overlays BCS
  bits into dotnetHome/shared/Microsoft.NETCore.App/{runtimeVersion}/ and
  host/fxr/{runtimeVersion}/ + the dotnet host. Wired into Startup.cs after
  publish so framework-dependent jobs actually run against BCS bits instead of
  silently using the feed runtime.
* OverlayPublishedOutput: copy ALL managed/native runtime files
  unconditionally (previously skipped any file not already in destination,
  which would silently drop new DLLs introduced by the BCS commit). Also
  copies hostfxr/hostpolicy/dotnet for self-contained.
* Hardening:
  - URL-encode repoName/commitSha/buildCacheConfig/branch with
    Uri.EscapeDataString.
  - Atomic download (.partial -> rename) and Content-Length validation so
    truncated archives are not reused after a failed run.
  - Retry transient HTTP failures via ProcessUtil.RetryOnExceptionAsync.
  - Replace shelling out to tar with System.Formats.Tar.TarFile.
  - Wrap synchronous archive extraction in Task.Run.
  - Per-(commit,config) SemaphoreSlim + per-call unique extract dir to avoid
    races between concurrent jobs.
  - Drop unused targetFramework parameter from DownloadAndExtractAsync.
  - All commit-SHA Substring uses go through ShortSha (Math.Min length guard).
  - ParseLatestBuilds: case-insensitive branch_name/BranchName handling and
    skip non-object metadata properties safely.
* Startup.cs:
  - Validate user-supplied BuildCacheCommitSha length (>= 8 chars) up front
    instead of crashing later.
  - Stop mutating runtimeVersion before PatchRuntimeConfig - use feed-resolved
    version so runtimeconfig.json points to a really-installed shared
    framework dir. Suffix +buildcache.<sha> is now applied to
    job.RuntimeVersion only, after PatchRuntimeConfig has run.
  - Treat 0-file overlay as fatal (job.Error + return null) so silent failures
    do not produce wrong-runtime benchmarks.
  - Promote overlay-failure log from Info to fatal job error.
* docs/dotnet_versions.md: add trailing newline.
* test/Microsoft.Crank.UnitTests/BuildCacheClientTests.cs: 17 new tests
  covering ParseLatestBuilds (PascalCase / snake_case / mixed / missing
  fields / case-insensitive lookup / non-object values), GetPlatformMoniker,
  ShortSha, GetNativeLibName, OverlayPublishedOutput unconditional copy,
  pdb/dbg skip, OverlayDotnetHome shared-framework precondition, and full
  shared-framework + hostfxr overlay.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
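
Two of the hardening items in this commit message, the `.partial` then rename download and the `ShortSha` length guard, can be illustrated with the sketch below. It is not the PR's implementation: the class name and the `SaveAtomicallyAsync` signature are hypothetical, and it takes a `Stream` plus an expected length (for example from a Content-Length header) so it can run without a network.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not the PR's BuildCacheClient code.
public static class AtomicDownloadSketch
{
    // Never throws on short inputs: takes at most `length` characters.
    public static string ShortSha(string sha, int length = 8) =>
        sha.Substring(0, Math.Min(length, sha.Length));

    // Writes to "<dest>.partial" first and renames only after the size check,
    // so a truncated file is never left at the final path for later reuse.
    public static async Task SaveAtomicallyAsync(Stream source, long? expectedLength, string destinationPath)
    {
        var partialPath = destinationPath + ".partial";
        try
        {
            await using (var file = new FileStream(partialPath, FileMode.Create, FileAccess.Write, FileShare.None))
            {
                await source.CopyToAsync(file);
            }

            var actual = new FileInfo(partialPath).Length;
            if (expectedLength.HasValue && actual != expectedLength.Value)
            {
                throw new IOException($"Truncated download: expected {expectedLength} bytes, got {actual}.");
            }

            File.Move(partialPath, destinationPath, overwrite: true);
        }
        catch
        {
            // Best-effort cleanup of the partial file before rethrowing.
            if (File.Exists(partialPath))
            {
                File.Delete(partialPath);
            }
            throw;
        }
    }
}
```

A caller would typically pass the response body stream and `response.Content.Headers.ContentLength` from `HttpClient`; when the server sends no length, the size check is simply skipped.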
…verlay
Live-tested against an agent with --application.channel buildcache. The first
end-to-end run was reporting the FEED commit hash in .NET Runtime Version
because:

1. The .version file in shared/Microsoft.NETCore.App/{ver}/ was never rewritten
   after the BCS overlay, so any consumer of that file (notably the agent's own
   BenchmarksNetCoreAppVersion measurement metadata) reported the feed-installed
   commit.
2. Even with .version rewritten, the metadata-capture block runs immediately
   after the feed runtime install -- before the post-publish overlay -- so the
   feed commit was captured into Job.Measurements before BCS bits were in place.

Fixes:

* OverlayDotnetHome now accepts an optional commitSha parameter and rewrites
  shared/Microsoft.NETCore.App/{ver}/.version with the BCS commit so anything
  reading that file gets the correct hash.
* Startup.cs splits the BCS overlay into two stages:
  - dotnet-home overlay runs RIGHT AFTER feed runtime install (before the
    BenchmarksNetCoreAppVersion metadata is captured), so the metadata picks up
    the BCS commit. Treat 0 files as fatal.
  - published-output overlay runs after publish (only path where outputFolder
    exists). For self-contained, 0 files is fatal; for FDD it's expected.
* Two new tests:
  - OverlayDotnetHome_WithCommitSha_RewritesVersionFile
  - OverlayDotnetHome_WithoutCommitSha_LeavesVersionFileUntouched

Live test result with --application.channel buildcache --application.framework
net11.0 against a Linux agent:

| .NET Runtime Version | 11.0.0-preview.5.26256.117+603403d9cb49 |
| Requests/sec | 7,593,977 |

The +603403d9cb49 suffix is the actual BCS commit, and 212+212 files were
overlaid (dotnet home + published output).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…d apphost, and more
Round-3 fixes informed by a rubber-duck critique + a second live test.

Per-job ISOLATED dotnet home (replaces in-place overlay of shared dotnetHome)

The biggest design change: stop mutating the global dotnetHome. After feed
install, build a per-job dotnet home by copying the relevant subtrees
(shared/Microsoft.NETCore.App/{ver}, shared/Microsoft.AspNetCore.App/{ver},
host/fxr/{ver}, dotnet[.exe]) and overlay BCS bits into the copy. The global
dotnetHome stays clean. This fixes three real bugs at once:
- cross-job pollution (a later non-buildcache job would have read the BCS
  commit from .version),
- concurrent buildcache jobs racing on the same shared framework dir,
- mid-overlay failures leaving a permanently-corrupted shared framework that
  _installedDotnetRuntimes would continue to trust.
The per-job home is exposed via JobContext.BuildCacheDotnetHome and used by
StartProcess so framework-dependent jobs actually load BCS bits at runtime.
Cleanup happens at job-end.

Build cache fields in BuildKeyData

Job.BuildKeyData now carries BuildCacheCommitSha / BuildCacheBranch /
BuildCacheConfig so reuseBuild + NoBuild don't silently reuse a build pinned
to a different BCS commit.

SDK-bound apphost is no longer clobbered

A live test exposed this regression: the BCS archive ships the raw,
unbound apphost. Overlaying it on top of the SDK-bound published binary left
the apphost with the placeholder SHA-256 binding and the app refused to start
with "This executable is not bound to a managed DLL to execute. The binding
value is: 'c3ab8ff1...'". Fixed by NOT overlaying apphost. CoreCLR JIT, GC,
managed BCL, hostfxr, hostpolicy still all come from BCS, so the perf
relevant code is correct. (Doc comment explains how a future enhancement
could rebind a BCS apphost via Microsoft.NET.HostModel.HostWriter.)

BCS-config RID for overlay discovery

Stop using the host's GetPlatformMoniker for overlay discovery. Resolve the
RID from the buildCacheConfig instead so an explicit musl/cross-arch override
finds the right runtime pack inside the archive.

Numeric-aware managed lib dir selection

SelectHighestManagedDir parses net{major}.0 numerically so the BCS archive
could ship multiple TFMs without lexically picking net9.0 over net10.0.

Non-retryable 404

New BuildCacheNotFoundException sentinel + RetryTransientAsync helper so 404
responses fail fast instead of being retried 3 times.

Strict SHA validation

ValidateCommitSha requires 8-40 hex chars; rejects "../../../etc/passwd",
non-hex, and silly inputs early instead of letting them propagate into URLs
and temp paths.

Cleanup

CleanupExtractDir wipes per-call extraction dirs after overlay; the archive
in the parent commit dir is intentionally kept so subsequent jobs for the
same commit can skip the download. Per-job dotnet home is also deleted at
job end.

Executable bits

EnsureExecutable preserves +x on native files and the dotnet host on
Unix-like systems (File.Copy can drop the bit if the destination didn't
have it).

Tests (36 BuildCacheClient tests, all passing)

New coverage for:
- SelectHighestManagedDir picks net11.0 over net10.0 over net9.0,
- GetRidForConfig for every supported config,
- ValidateCommitSha (accepts/rejects cases incl. path-traversal-like
  inputs),
- CreateBuildCacheDotnetHome: mirrors global, overlays BCS, rewrites
  .version, and asserts the global home is NOT touched,
- Two concurrent CreateBuildCacheDotnetHome invocations produce isolated
  homes,
- OverlayPublishedOutput preserves a pre-existing SDK-bound apphost.

Live test result

--application.framework net11.0 --application.channel buildcache against a
Linux agent:

| .NET Runtime Version | 11.0.0-preview.6.26277.104+38d408d22a64 |
| Requests/sec | 8,456,046 |

Per-job dotnet home created at:
/tmp/crank-buildcache/home-38d408d2-coreclr_x64_linux-aaa3385bb0ae4bdca01c7c6b4a7ca2bc
Global dotnetHome .version preserved: b520ee7cc01690f70d1431951f554c4e0666a69a
(the feed commit, not the BCS commit), proving the isolation works.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
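
The strict SHA validation and numeric-aware directory selection described in this commit message might look roughly like the sketch below. The method names echo the commit message, but the signatures and bodies are assumptions, not the PR's code.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical sketch of the two helpers named in the commit message.
public static class BuildCacheValidationSketch
{
    // 8 to 40 hex characters, anchored with \A and \z so a trailing newline
    // or any path separator is rejected.
    private static readonly Regex HexSha =
        new Regex(@"\A[0-9a-fA-F]{8,40}\z", RegexOptions.CultureInvariant);

    public static bool IsValidCommitSha(string? sha) => sha != null && HexSha.IsMatch(sha);

    // Picks the directory name with the highest numeric version among names
    // such as "net9.0" or "net11.0". Names that do not parse are ignored.
    // Returns null when nothing parses.
    public static string? SelectHighestManagedDir(IEnumerable<string> dirNames)
    {
        string? best = null;
        Version? bestVersion = null;
        foreach (var name in dirNames)
        {
            if (!name.StartsWith("net", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            if (!Version.TryParse(name.Substring(3), out var version))
            {
                continue; // e.g. "netstandard2.0" leaves "standard2.0", which fails to parse
            }

            if (bestVersion == null || version > bestVersion)
            {
                best = name;
                bestVersion = version;
            }
        }
        return best;
    }
}
```

Comparing parsed `Version` values avoids the lexical trap the commit message mentions, where the string `net9.0` sorts after `net10.0` and `net11.0`.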
Adds a new `buildcache` channel that resolves .NET runtime binaries from the Build Caching Service (BCS) instead of NuGet feeds. This gives per-commit granularity for performance regression bisection, instead of being limited to feed cadence.

Wire-up

Controller (`Job.cs`): new opt-in properties (default `""`):
- `BuildCacheCommitSha`: pin a specific commit; empty = latest from BCS
- `BuildCacheBranch`: default `main`
- `BuildCacheConfig`: defaults to the inferred RID config (e.g. `coreclr_x64_linux`)

Agent (`Startup.cs` / `BuildCacheClient.cs`): new CLI options:
- `--build-cache-base-url` (default: `https://pvscmdupload.z22.web.core.windows.net`)
- `--build-cache-repo-name` (default: `runtime`)
- `--build-cache-disabled`

Usage: `--application.channel buildcache [--application.buildCacheCommitSha <sha>]`

How runtime resolution works

The agent still builds the app with real NuGet packages (so `dotnet restore`/`publish` behave normally and `runtimeconfig.json` references a real installed version), then swaps in BCS bits at the right layer depending on publish kind:
- Self-contained: BCS runtime files are overlaid into `outputFolder`. The SDK-bound apphost is preserved: BCS ships an unbound apphost containing a SHA-256 placeholder where the managed DLL path is encoded, so overlaying it would break startup.
- Framework-dependent: the agent creates a per-job isolated dotnet home at `<temp>/crank-buildcache/home-<shortsha>-<config>-<guid>/`, mirrored from the global `_dotnethome` and overlaid with BCS files. `StartProcess` resolves the runtime from this isolated home. The global `_dotnethome` is never mutated, so concurrent jobs and other channels are unaffected.

The resolved BCS commit is recorded in `MeasurementMetadata` (under `runtime-version`) so reports show the real binary that ran, not the feed-resolved string.

Cache correctness

The BCS commit, branch, and config are included in `BuildKeyData`, so `--application.options.reuseBuild` correctly cache-busts when any of them change. Pre-existing `options.json` files written by older agents will cache-miss exactly once after upgrade, then re-cache in the new format.

Backward compatibility

All new code is gated on `--application.channel buildcache`. Old controllers calling a new agent are unaffected: new `BuildCache*` properties default to `""` and are read defensively (`!string.IsNullOrEmpty(...)`); the BCS path is never entered without the opt-in channel.

Verification

- `dotnet build` clean
- Tests in `test/Microsoft.Crank.UnitTests/BuildCacheClientTests.cs` cover `ParseLatestBuilds` (multiple JSON shapes), `ValidateCommitSha` (incl. path-traversal rejection), `OverlayPublishedOutput` (incl. SDK-apphost preservation), `CreateBuildCacheDotnetHome` (incl. isolation between two concurrent jobs), `SelectHighestManagedDir` (numeric-aware: `net11.0` > `net10.0` > `net9.0`), `CleanupExtractDir`, etc.
- … `11.0.0-preview.6.26277.104+38d408d22a64`.
- … (`hello` sample): 162,667 RPS, metadata reports `11.0.0-preview.6.26277.116+c56e0a2499fa`, isolated dotnet home confirmed.