ci: split otap-dataflow tests into build-once + 3 partitions #2615
Open
jmacd wants to merge 12 commits into open-telemetry:main
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #2615 +/- ##
==========================================
- Coverage 88.06% 88.05% -0.02%
==========================================
Files 630 630
Lines 232988 232986 -2
==========================================
- Hits 205181 205147 -34
- Misses 27283 27315 +32
Partials 524 524
Force-pushed from d381634 to 9ab666a
The monolithic 'cargo nextest run --all-features --workspace' step was timing out in the merge queue (56min on ARM, 32min on Windows, 13min on Linux). This restructures the CI to:

1. build_required / build_extra: compile test binaries once per OS and upload a nextest archive artifact.
2. test_and_coverage_required / test_extra: download the archive and run tests with '--partition count:N/3', splitting into 3 parallel jobs per OS. No recompilation needed.
3. Coverage: Linux build uses 'cargo llvm-cov show-env' to instrument binaries at build time. Test partitions source the same env vars, run their slice of tests, then 'cargo llvm-cov report' generates per-partition lcov files uploaded to codecov with partition flags.
4. The non-required test_and_coverage job now only covers experimental folders (small, no partitioning needed). otap-dataflow on ARM and macOS moves to the new build_extra/test_extra jobs.

Expected improvement: each test partition runs ~1/3 of tests without paying the full compile cost, bringing wall-clock time well under the timeout threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The nextest archive + Go test collector fill the disk on ubuntu-latest runners. Add the same disk cleanup step used by the build jobs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ARM runners are slow: compiling --all-features (3 crypto backends + TLS) takes 50+ minutes. The full feature matrix is already tested on x86 Linux and Windows. ARM/macOS only need to catch architecture-specific issues, which default features cover.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Standardize all disk cleanup steps to the same compact 3-line form. Previously there were 3 variants: verbose (14 lines with df -h), compact (3 lines), and missing entirely. Now every job that needs disk space gets the same pattern. Also adds cleanup to test_extra, which was missing it.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from 5ecf9c6 to fddf97f
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a test fails, nextest cancels all remaining tests by default. With --max-fail=10 via a ci profile, we continue running and report up to 10 failures before stopping, giving better signal on flaky vs. systematic failures.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
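To illustrate the difference between stopping at the first failure and allowing a failure budget, here is a minimal Python sketch. It is not nextest's implementation: the function `run_with_max_fail`, the `suite` list, and the sequential execution are all hypothetical simplifications (nextest runs tests in parallel), and only the `max_fail` idea comes from the commit message above.

```python
from typing import Callable, List, Tuple

# (test name, callable returning True on pass); a hypothetical representation
TestCase = Tuple[str, Callable[[], bool]]


def run_with_max_fail(tests: List[TestCase], max_fail: int) -> Tuple[List[str], List[str]]:
    """Run tests in order and stop starting new ones after `max_fail` failures.

    Returns (names that failed, names that were never started).
    """
    failed: List[str] = []
    not_run: List[str] = []
    for name, test in tests:
        if len(failed) >= max_fail:
            not_run.append(name)  # failure budget exhausted: skip the rest
            continue
        if not test():
            failed.append(name)
    return failed, not_run


# Hypothetical suite of six tests in which the odd-numbered ones fail.
suite: List[TestCase] = [(f"t{i}", (lambda i=i: i % 2 == 0)) for i in range(6)]

fail_fast = run_with_max_fail(suite, max_fail=1)    # stop at the first failure
ci_profile = run_with_max_fail(suite, max_fail=10)  # tolerate up to 10 failures

print(fail_fast)   # (['t1'], ['t2', 't3', 't4', 't5'])
print(ci_profile)  # (['t1', 't3', 't5'], [])
```

With a budget of 1, the run hides the failures in t3 and t5; with a budget of 10, all three failures are reported in one run, which is the extra signal the commit is after.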
The experimental workspaces don't have .config/nextest.toml with the ci profile, so --profile ci fails there. Remove it from the test_and_coverage job, which only runs experimental folders.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- test_and_coverage -> experimental_tests
- build_extra -> build_nonrequired
- test_extra -> test_nonrequired
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This job no longer does coverage -- that moved to a separate job.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When running from an archive, nextest may not find the repository config at .config/nextest.toml. Pass it explicitly so the ci profile (with max-fail=10) is actually used.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
A few weeks ago, while working on the new cargo xtask check, I compared the speed of nextest with regular cargo test, and for an unknown reason cargo test was significantly faster. We should either go back to cargo test or find a way to make nextest faster (I tried for several hours without success).
The monolithic `cargo nextest run --all-features --workspace` step is timing out in the merge queue at around 55min on ARM and 35min on Windows.

This restructures the CI to:

1. build_required / build_extra: compile test binaries once per OS and upload a nextest archive artifact.
2. test_and_coverage_required / test_extra: download the archive and run tests with `--partition count:N/3`, splitting into 3 parallel jobs per OS.
3. Coverage: Linux build uses `cargo llvm-cov show-env` to instrument binaries at build time. Test partitions source the same env vars, run their slice of tests, then `cargo llvm-cov report` generates per-partition lcov files uploaded to codecov.
4. The non-required test_and_coverage job now only covers experimental folders. otap-dataflow on ARM and macOS moves to the new build_extra/test_extra jobs.

Expected improvement: each test partition runs ~1/3 of tests without paying the full compile cost.
Fixes #2534
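
As a rough illustration of what splitting into 3 count-based partitions means, the Python sketch below assigns test names to slices round-robin over a sorted list. This is an assumption made for illustration: the function `partition` and the test names are hypothetical, and nextest's actual bucket assignment may differ. The point is only that the slices are disjoint and together cover the whole suite, so each job runs about a third of the tests.

```python
from typing import List


def partition(tests: List[str], index: int, count: int) -> List[str]:
    """Return slice `index` (1-based) out of `count` slices.

    Assumes round-robin assignment over a stable (sorted) ordering, so every
    job computes the same split independently.
    """
    if not 1 <= index <= count:
        raise ValueError("index must satisfy 1 <= index <= count")
    ordered = sorted(tests)
    return ordered[index - 1::count]


# Hypothetical suite of ten tests, split the way '--partition count:N/3' suggests.
all_tests = [f"crate::test_{i:02d}" for i in range(10)]
slices = [partition(all_tests, n, 3) for n in (1, 2, 3)]

for n, chunk in enumerate(slices, start=1):
    print(f"count:{n}/3 -> {len(chunk)} tests")

# Every test lands in exactly one slice.
covered = sorted(name for chunk in slices for name in chunk)
print(covered == sorted(all_tests))
```

Because each job derives its slice from the same ordered list, no coordination between the three jobs is needed beyond agreeing on the partition count.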