
ci: split otap-dataflow tests into build-once + 3 partitions#2615

Open
jmacd wants to merge 12 commits into open-telemetry:main from jmacd:jmacd/split_tests

@jmacd
Contributor

@jmacd jmacd commented Apr 8, 2026

The monolithic cargo nextest run --all-features --workspace step is timing out in the merge queue, at around 55 min on ARM and 35 min on Windows.

This restructures the CI to:

  1. build_required / build_extra: compile test binaries once per OS and upload a nextest archive artifact.

  2. test_and_coverage_required / test_extra: download the archive and run tests with --partition count:N/3, splitting into 3 parallel jobs per OS.

  3. Coverage: Linux build uses cargo llvm-cov show-env to instrument binaries at build time. Test partitions source the same env vars, run their slice of tests, then cargo llvm-cov report generates per-partition lcov files uploaded to codecov.

  4. The non-required test_and_coverage job now only covers experimental folders. otap-dataflow on ARM and macOS moves to the new build_extra/test_extra jobs.

Expected improvement: each test partition runs ~1/3 of tests without paying the full compile cost.
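
The build-once and partitioned-run steps above can be sketched roughly as follows. This is a minimal illustration, not the workflow from this PR: the archive file name and the helper functions `run`, `build_archive`, and `run_partition` are assumptions made for the example. Only the `cargo nextest archive`, `--archive-file`, and `--partition count:M/N` options are taken from nextest itself.

```shell
#!/bin/sh
# Sketch only: ARCHIVE and the helper names are illustrative assumptions,
# not the names used in the actual workflow.
ARCHIVE="${ARCHIVE:-nextest-archive.tar.zst}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build job: compile every test binary once and pack it into an archive.
build_archive() {
  run cargo nextest archive --workspace --all-features --archive-file "$ARCHIVE"
}

# Test job: run partition $1 of $2 from the archive, without recompiling.
run_partition() {
  run cargo nextest run --archive-file "$ARCHIVE" --partition "count:$1/$2"
}
```

In this sketch the build job would call `build_archive` and upload the resulting file, and each of the three test jobs would download it and call `run_partition 1 3`, `run_partition 2 3`, or `run_partition 3 3`. If the checkout path differs between the build and test jobs, nextest's `--workspace-remap` option may also be needed.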

Fixes #2534

@github-actions github-actions bot added the ci-repo Repository maintenance, build, GH workflows, repo cleanup, or other chores label Apr 8, 2026
@codecov

codecov bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.05%. Comparing base (fdc0cfc) to head (106b63c).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2615      +/-   ##
==========================================
- Coverage   88.06%   88.05%   -0.02%     
==========================================
  Files         630      630              
  Lines      232988   232986       -2     
==========================================
- Hits       205181   205147      -34     
- Misses      27283    27315      +32     
  Partials      524      524              
| Components | Coverage Δ |
|---|---|
| otap-dataflow | 89.75% <ø> (-0.02%) ⬇️ |
| query_abstraction | 80.61% <ø> (ø) |
| query_engine | 90.74% <ø> (ø) |
| otel-arrow-go | 52.45% <ø> (ø) |
| quiver | 92.27% <ø> (ø) |

@jmacd jmacd force-pushed the jmacd/split_tests branch 2 times, most recently from d381634 to 9ab666a Compare April 8, 2026 23:20
The monolithic 'cargo nextest run --all-features --workspace' step was
timing out in the merge queue (56min on ARM, 32min on Windows, 13min
on Linux).

This restructures the CI to:

1. build_required / build_extra: compile test binaries once per OS and
   upload a nextest archive artifact.

2. test_and_coverage_required / test_extra: download the archive and
   run tests with '--partition count:N/3', splitting into 3 parallel
   jobs per OS. No recompilation needed.

3. Coverage: Linux build uses 'cargo llvm-cov show-env' to instrument
   binaries at build time. Test partitions source the same env vars,
   run their slice of tests, then 'cargo llvm-cov report' generates
   per-partition lcov files uploaded to codecov with partition flags.

4. The non-required test_and_coverage job now only covers experimental
   folders (small, no partitioning needed).  otap-dataflow on ARM and
   macOS moves to the new build_extra/test_extra jobs.

Expected improvement: each test partition runs ~1/3 of tests without
paying the full compile cost, bringing wall-clock time well under the
timeout threshold.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
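
As a rough illustration of the coverage flow in item 3 of the commit message above, a test partition might load the instrumentation environment, run its slice, and then emit its own lcov file. This is a hedged sketch: the helper names and the `lcov-N.info` naming are assumptions, and it presumes the archive was built under the same `cargo llvm-cov show-env` environment.

```shell
#!/bin/sh
# Sketch only: helper names and the lcov file naming are illustrative assumptions.
ARCHIVE="${ARCHIVE:-nextest-archive.tar.zst}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Export the variables printed by show-env (RUSTFLAGS, LLVM_PROFILE_FILE, ...)
# into the current shell so that profile data lands where the report expects it.
load_coverage_env() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo 'eval "$(cargo llvm-cov show-env --export-prefix)"'
  else
    eval "$(cargo llvm-cov show-env --export-prefix)"
  fi
}

# Run partition $1 of $2 with coverage, then write a per-partition lcov file.
# The report step runs even if some tests in the partition failed.
test_partition_with_coverage() {
  load_coverage_env
  run cargo nextest run --archive-file "$ARCHIVE" --partition "count:$1/$2"
  run cargo llvm-cov report --lcov --output-path "lcov-$1.info"
}
```

Each partition would then upload its own `lcov-N.info` to codecov, for example under a per-partition flag as the commit message describes.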
@jmacd jmacd force-pushed the jmacd/split_tests branch from eb712dd to 6e55019 Compare April 9, 2026 15:40
@jmacd jmacd mentioned this pull request Apr 9, 2026
1 task
jmacd and others added 3 commits April 9, 2026 09:27
The nextest archive + Go test collector fill the disk on ubuntu-latest
runners. Add the same disk cleanup step used by the build jobs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ARM runners are slow — compiling --all-features (3 crypto backends +
TLS) takes 50+ minutes. The full feature matrix is already tested on
x86 Linux and Windows. ARM/macOS only need to catch architecture-
specific issues, which default features cover.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Standardize all disk cleanup steps to the same compact 3-line form.
Previously there were 3 variants: verbose (14 lines with df -h),
compact (3 lines), and missing entirely. Now every job that needs
disk space gets the same pattern. Also adds cleanup to test_extra
which was missing it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jmacd jmacd force-pushed the jmacd/split_tests branch from 5ecf9c6 to fddf97f Compare April 10, 2026 20:36
jmacd and others added 2 commits April 10, 2026 13:36
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jmacd jmacd marked this pull request as ready for review April 10, 2026 21:22
@jmacd jmacd requested a review from a team as a code owner April 10, 2026 21:22
@jmacd jmacd marked this pull request as draft April 10, 2026 21:25
When a test fails, nextest cancels all remaining tests by default.
With --max-fail=10 via a ci profile, we continue running and report
up to 10 failures before stopping, giving better signal on flaky
vs. systematic failures.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
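
For illustration, a ci profile with this limit could look like the file written below. This is a sketch under assumptions: `write_ci_profile` is a hypothetical helper used only to keep the example self-contained, and the repository's actual `.config/nextest.toml` may contain additional settings.

```shell
#!/bin/sh
# Sketch only: write_ci_profile is a hypothetical helper; the real config
# file in the repository may differ.
write_ci_profile() {
  dir="$1"
  mkdir -p "$dir/.config"
  cat > "$dir/.config/nextest.toml" <<'EOF'
[profile.ci]
# Keep running after a failure and stop only once 10 tests have failed.
fail-fast = { max-fail = 10 }
EOF
}
```

With such a file in place, `cargo nextest run --profile ci` would select the profile.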
@github-actions github-actions bot added the rust Pull requests that update Rust code label Apr 10, 2026
jmacd and others added 5 commits April 10, 2026 14:41
The experimental workspaces don't have .config/nextest.toml with the
ci profile, so --profile ci fails there. Remove it from the
test_and_coverage job which only runs experimental folders.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- test_and_coverage -> experimental_tests
- build_extra -> build_nonrequired
- test_extra -> test_nonrequired

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This job no longer does coverage -- that moved to a separate job.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When running from an archive, nextest may not find the repository
config at .config/nextest.toml. Pass it explicitly so the ci profile
(with max-fail=10) is actually used.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
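
A possible shape for the explicit-config invocation described above, assuming the config lives at `.config/nextest.toml` relative to the working directory. The `ARCHIVE` and `CONFIG` defaults and the `run_ci_partition` helper are illustrative, not taken from the workflow.

```shell
#!/bin/sh
# Sketch only: variable defaults and the helper name are illustrative assumptions.
ARCHIVE="${ARCHIVE:-nextest-archive.tar.zst}"
CONFIG="${CONFIG:-.config/nextest.toml}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Run partition $1 of $2 from the archive, naming the config file explicitly
# so that the ci profile is found even when nextest runs from an archive.
run_ci_partition() {
  run cargo nextest run \
    --archive-file "$ARCHIVE" \
    --config-file "$CONFIG" \
    --profile ci \
    --partition "count:$1/$2"
}
```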
@jmacd jmacd marked this pull request as ready for review April 11, 2026 01:07
@lquerel
Contributor

lquerel commented Apr 12, 2026

A few weeks ago, in the context of the new cargo xtask check, I compared the speed of nextest with regular cargo test, and for an unknown reason cargo test was significantly faster. We should go back to cargo test or find a way to make nextest faster (I tried for several hours without success).



Development

Successfully merging this pull request may close these issues.

Split the tests into multiple units

2 participants