Audience: contributors and AI coding agents working on DataLab-Web. Read this before adding a new test layer for a bug fix or feature.
DataLab-Web has three test layers, each with very different cost/coverage trade-offs:
| Layer | Tool | Typical cost | Best for |
|---|---|---|---|
| Python (in-Pyodide logic) | pytest | ~ms | bootstrap.py helpers, JSON shapes, Sigima glue |
| Component (TS/React) | Vitest + RTL | ~10–100 ms | Form behaviour, reducers, runtime mocks |
| End-to-end (browser) | Playwright | ~10–60 s each | Multi-component interactions, runtime round-trips |
Playwright runs serially (workers: 1, ~3 min Pyodide boot per spec
file in CI), so adding E2E tests has a real, compounding cost. Treat
them as a scarce resource.
A continuous layer ties the three together: GitHub Actions (tests.yml) runs all three on every push / PR. The Python layer runs bootstrap.py directly under CPython through fixtures that stub the Pyodide-only modules (js, pyodide.ffi); this gives fast feedback and high coverage without booting WebAssembly.
# One-time: copy the environment template and create the project venv
# (Python 3.11 or 3.12 — earlier versions trip a quirk in
# ``isinstance(list[T], type)`` that breaks Sigima's processor
# introspection).
Copy-Item .env.template .env
py -3.11 -m venv .venv
.\.venv\Scripts\python -m pip install -r requirements-dev.txt
# Python unit tests + coverage report (htmlcov-python/)
.\.venv\Scripts\python -m pytest tests/python --cov=src/runtime --cov-report=html:htmlcov-python
# TypeScript unit tests + coverage report (coverage-ts/)
npm test
npm run test:cov
# End-to-end browser tests (boots Pyodide in Chromium ~1.5 min)
npx playwright install chromium # one-time
npm run test:e2e
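# Performance benchmarks (opt-in, ~5 min). Includes the image-display
# benchmark and the 50k-sample binary transfer probe.
npx playwright test --project=perf
# Default chromium suite plus the PERF-gated probes
# (bash equivalent: PERF=1 npm run test:e2e)
$env:PERF=1; npm run test:e2e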
# Run a throwaway `_repro_*` probe (the default suite ignores them).
$env:PW_REPRO=1; npx playwright test --project=repro tests/e2e/_repro_x.spec.ts

Test layout:
tests/
├── python/ # pytest suite — exercises bootstrap.py headlessly
├── ts/ # Vitest suite — pure TypeScript modules
└── e2e/ # Playwright specs — real browser smoke tests
VS Code tasks are provided under .vscode/tasks.json (🚀 Pytest, 🟢 Vitest, 🎭 Playwright, …). The default test task (Ctrl+Shift+P → Run Test Task) launches the Python suite.
The default Playwright project (chromium) runs the regression suite and intentionally excludes performance benchmarks and _repro_* throwaway probes. The costly perf and benchmark projects are opt-in: each is registered only when explicitly requested, either via its --project=<name> flag or its PW_<NAME> env var (see wantsProject in playwright.config.ts). A bare playwright test — and therefore the default CI run, which calls npm run test:e2e (--project=chromium) — never pays for them.
npm run test:e2e # default chromium suite (CI default)
npm run test:e2e:perf # perf benchmarks only (--project=perf)
npx playwright test --project=perf # same, explicit form
PERF=1 npm run test:e2e # chromium suite + PERF-gated probes

When you need a perf or budget-style probe, mark it with `test.skip(!process.env.PERF, "...")` (it then runs in chromium only under `PERF=1`) or add the spec to the perf project's `testMatch` glob (currently `image_perf.spec.ts`, `opfs_storage_bench.spec.ts`, `opfs_sync_spike.spec.ts`, `opfs_worker_bench.spec.ts`).
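The opt-in registration described above can be sketched as a small predicate. This is a minimal illustration, not the actual `wantsProject` from `playwright.config.ts`, which may differ: the signature, the `registeredProjects` helper, and the rule that an empty value or `"0"` counts as unset are all assumptions made for the example.

```typescript
// Hypothetical sketch: the real wantsProject in playwright.config.ts may differ.
type Env = Record<string, string | undefined>;

/** True when project `name` was requested via --project or its PW_<NAME> variable. */
function wantsProject(name: string, argv: readonly string[], env: Env): boolean {
  // Accept both `--project=perf` and `--project perf`.
  const viaFlag = argv.some(
    (arg, i) =>
      arg === `--project=${name}` ||
      (arg === "--project" && argv[i + 1] === name),
  );
  // Assumption: an empty value or "0" counts as "not set".
  const value = env[`PW_${name.toUpperCase()}`];
  const viaEnv = value !== undefined && value !== "" && value !== "0";
  return viaFlag || viaEnv;
}

/** Project names to register: chromium always, opt-in projects only on request. */
function registeredProjects(argv: readonly string[], env: Env): string[] {
  const optIn = ["perf", "repro"].filter((name) => wantsProject(name, argv, env));
  return ["chromium", ...optIn];
}

console.log(registeredProjects(["test"], {})); // [ 'chromium' ]
console.log(registeredProjects(["test"], { PW_REPRO: "1" })); // [ 'chromium', 'repro' ]
```

In a real config the arguments would come from `process.argv` and `process.env`; passing them in explicitly keeps the sketch testable without a Playwright install.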
Perf benchmarks are deterministic with respect to the code: if nothing changes, the numbers do not change, so re-running them on every commit (or on a fixed schedule) would add noise, not signal. They are therefore driven by a dedicated, opt-in workflow — .github/workflows/perf.yml — that runs the perf project (npm run test:e2e:perf) only when wanted:
- manually via Run workflow (`workflow_dispatch`);
- on a pull request labelled `run-perf` (opt-in per PR);
- on every push to `main` (release merges);
- on every `vX.Y.Z` release tag.
Results are tracked over time with benchmark-action/github-action-benchmark on the orphan benchmarks branch (open dev/bench/determinist/index.html or dev/bench/timings/index.html from that branch to view the charts). scripts/perf-to-benchmark-json.mjs converts the raw result JSON into the action's customSmallerIsBetter format, splitting metrics into two groups with different policies:
- Deterministic (memory Δheap in MiB, approximate JSON payload in MB): these make trustworthy gates. On a `run-perf` pull request, an increase beyond 125 % of the baseline fails the check and comments on the PR. On `main` / tags the value is recorded but never fails (the change is already merged).
- Timings (milliseconds): noisy on shared CI runners, so they are tracked with a wide 200 % threshold and never fail; they exist purely for trend inspection.
The cheap deterministic invariants encoded directly in the specs (data-integrity checksums, spill counts, the heap-decoupling expect(...)) still run wherever the perf project runs and fail fast on a broken guarantee, independent of the chart thresholds.
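The two threshold policies can be read as a small decision function. The sketch below only restates the policy from this guide; it is not how `benchmark-action/github-action-benchmark` or `scripts/perf-to-benchmark-json.mjs` implement it, and the type and function names are hypothetical.

```typescript
// Hypothetical names; this only encodes the policy described in this guide.
type MetricGroup = "deterministic" | "timings";
type Trigger = "pull_request" | "main" | "tag";

interface Verdict {
  alert: boolean; // the value exceeded the group's threshold
  fail: boolean; // the CI check should fail
}

// Thresholds as ratios of the baseline: 125 % and 200 %.
const THRESHOLD: Record<MetricGroup, number> = {
  deterministic: 1.25,
  timings: 2.0,
};

function evaluate(
  group: MetricGroup,
  trigger: Trigger,
  baseline: number,
  current: number,
): Verdict {
  const alert = current > baseline * THRESHOLD[group];
  // Only deterministic metrics on a run-perf pull request may fail the check.
  const fail = alert && group === "deterministic" && trigger === "pull_request";
  return { alert, fail };
}

// A 26 % heap increase fails on a labelled PR but is only recorded on main.
console.log(evaluate("deterministic", "pull_request", 100, 126)); // { alert: true, fail: true }
console.log(evaluate("deterministic", "main", 100, 126)); // { alert: true, fail: false }
// Timings never fail, even when they more than double.
console.log(evaluate("timings", "pull_request", 100, 250)); // { alert: true, fail: false }
```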
Because the chromium project testIgnores _repro_* (so a forgotten probe never lands in CI) and testIgnore can't be overridden on the CLI, npx playwright test tests/e2e/_repro_x.spec.ts returns "No tests found". Do not rename the file to run it. Instead use the env-gated repro project, which only exists when PW_REPRO is set (so CI, which runs a bare playwright test, is unaffected):
$env:PW_REPRO=1; npx playwright test --project=repro tests/e2e/_repro_x.spec.ts

Run it from the DataLab-Web folder: if the terminal cwd is another workspace folder, npx can't find the local Playwright and stalls on a download prompt.
The runtime can also run inside a Dedicated Web Worker (opt-in via
?runtime=worker; see architecture.md §3.3 and DEW
ADR #2). That path has its own failure surface — Pyodide boot in a
module worker, the postMessage RPC bridge, transferable buffers, the
synchronous mirror, and workspace-mutation events crossing back to the
main thread — which unit tests cannot reach. tests/e2e/worker_mode.spec.ts
is the permanent regression suite that exercises it in the default
chromium project, so it runs on every CI build. It boots the worker
runtime once (describe.serial + beforeAll) and shares the page
across assertions to amortise the cold boot. This suite is the gate for
ever promoting worker mode to the default execution mode.
1. Reproduce the bug with the cheapest possible probe.
   - If the symptom is purely in `bootstrap.py` (or any JSON-serialisable contract): write a pytest test under `tests/python/`.
   - If the symptom is a React state/render bug isolatable from the runtime: write a Vitest + RTL test under `tests/ts/` (mock the runtime via the `DataLabRuntime` interface).
   - Only if neither layer can express the bug (it requires a real Pyodide round-trip or multi-component interaction): write a throwaway Playwright spec under `tests/e2e/_repro_*.spec.ts`.
2. Fix the bug.
3. Decide whether the test becomes permanent, using the matrix below. Default to throwaway unless promotion is justified.
Promote to a permanent test when:
- A cheaper layer cannot express the invariant. A Vitest test that needs to mock half of `DataLabRuntime` and Pyodide's behaviour duplicates production code; an E2E is honest in that case.
- The bug touches a foundational contract you're now committing to (e.g. "the side panel always reflects the selected object", "ROIs survive a reload", "running a feature never corrupts the object tree"). These are the invariants users would notice immediately if broken; permanent tests pay back over many refactors.
- The mechanism is subtle (race conditions, stale state across async boundaries, Pyodide bridge marshalling) and likely to regress silently.

Keep the test throwaway when:
- The bug is in first-implementation code that's likely to be rewritten or restructured soon.
- The fix is structurally obvious (e.g. adding a missing `null` guard) and unlikely to be undone.
- The bug class is prevented by a type-system or lint change; prefer the prevention to a test.
- The scenario is already implicitly covered by an existing test (don't add per-bug duplicates of the same coverage).
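Step 1 of the workflow above (pick the cheapest layer that can express the bug) reduces to a short cascade. The sketch below is illustrative only: the `Symptom` shape and the function name are hypothetical, while the target locations are the ones named in this guide.

```typescript
// Hypothetical helper that restates "reproduce with the cheapest possible probe".
type Layer = "pytest" | "vitest" | "playwright-repro";

interface Symptom {
  /** Purely in bootstrap.py or another JSON-serialisable contract. */
  pureRuntimeLogic: boolean;
  /** React state/render bug that can be isolated from the runtime. */
  isolatableReactBug: boolean;
}

// Where each kind of reproduction lives in the repository.
const TEST_LOCATION: Record<Layer, string> = {
  pytest: "tests/python/",
  vitest: "tests/ts/",
  "playwright-repro": "tests/e2e/_repro_*.spec.ts",
};

function cheapestLayer(symptom: Symptom): Layer {
  if (symptom.pureRuntimeLogic) return "pytest";
  if (symptom.isolatableReactBug) return "vitest";
  // Neither cheap layer can express it: fall back to a throwaway E2E probe.
  return "playwright-repro";
}

const layer = cheapestLayer({ pureRuntimeLogic: false, isolatableReactBug: true });
console.log(layer, TEST_LOCATION[layer]); // vitest tests/ts/
```

The order of the checks matters: a bug that both layers could express goes to pytest, the cheaper of the two.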
When you do promote a test to permanent E2E:
- One spec per invariant, multiple `test()` cases inside. Spec startup (Pyodide boot, page navigation) dominates wall time; sharing it across related cases keeps the suite cheap.
- Test the contract, not the implementation. Assertions should read like product requirements ("size input matches the selected signal's backend size"), not like UI snapshots.
- Use the runtime API for setup, the UI for the action under test, and both for assertions. Cross-checking UI vs. backend catches the largest class of stale-state bugs.
- Use generous timeouts (`waitForRuntimeReady`, then explicit polls); the suite already runs slowly, and retries hide flakes worse than long expects do.
- Name the spec after the invariant, not the bug (`side_panel_mirrors_selection.spec.ts`, not `fix_creation_form_swap_bug.spec.ts`).
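To make the first guideline concrete, here is a rough cost model for a serial run. It is a simplification under stated assumptions: the ~3 min boot per spec file is the CI figure quoted in this guide, the 30 s per test is an assumed value inside the 10–60 s range, and the model treats the two as additive.

```typescript
/** Serial wall time in minutes: every spec file pays one boot, every test its own run time. */
function suiteMinutes(
  specFiles: number,
  tests: number,
  bootMin: number,
  perTestMin: number,
): number {
  return specFiles * bootMin + tests * perTestMin;
}

const BOOT_MIN = 3; // ~3 min Pyodide boot per spec file in CI (figure from this guide)
const PER_TEST_MIN = 0.5; // assumption: 30 s per test

// Five related cases written as five one-test spec files...
const separate = suiteMinutes(5, 5, BOOT_MIN, PER_TEST_MIN);
// ...versus one spec file holding five test() cases.
const shared = suiteMinutes(1, 5, BOOT_MIN, PER_TEST_MIN);

console.log(separate, shared); // 17.5 5.5
```

Under these assumptions the shared spec saves 12 minutes per CI run, all of it boot time.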
Every change to the UI must be exercised end-to-end with Playwright before it is considered done. This applies to:
- Every bug fix that touches a React component or runtime call.
- Every new feature, however small.
- Every phase of a multi-phase implementation — not only the final phase. If a feature is split into Phase 1 (backend) → Phase 2 (UI wiring) → Phase 3 (polish), Phase 2 and Phase 3 each need their own Playwright pass before being declared complete.
The Playwright pass can be:
- A throwaway probe under `tests/e2e/_repro_*.spec.ts` (deleted afterwards): sufficient for incremental progress, refactors, and fixes that don't meet the promotion criteria.
- A permanent spec when the criteria in _Promote to a permanent test when_ are met.
The point is not to grow the suite; it is to never declare a UI change done based solely on type-checks, unit tests, or "looks fine in the dev server". Pyodide round-trips and async state interactions silently break in ways only a browser-driven test catches reliably.
When asked to implement or fix anything that touches the UI:
1. Reproduce / scope the change with a temporary `tests/e2e/_repro_*.spec.ts` (delete afterwards): confirm the starting state and the target behaviour before coding.
2. Apply the change.
3. Run Playwright on the temporary spec to verify. No UI work is declared done without this step, including intermediate phases of a multi-phase plan. Run a `_repro_*` probe through the env-gated `repro` project (a bare run ignores it): `$env:PW_REPRO=1; npx playwright test --project=repro tests/e2e/_repro_x.spec.ts` (see _Running a `_repro_*` throwaway probe_ above).
4. Apply the decision tree above. If promoting, write a single, well-named permanent spec covering the invariant. Delete the reproduction spec.
5. Briefly justify the decision in the PR description (why permanent or why throwaway).