DataLab-Web testing strategy

Audience: contributors and AI coding agents working on DataLab-Web. Read this before adding a new test layer for a bug fix or feature.

DataLab-Web has three test layers, each with very different cost/coverage trade-offs:

| Layer | Tool | Typical cost | Best for |
| --- | --- | --- | --- |
| Python (in-Pyodide logic) | pytest | ~ms | `bootstrap.py` helpers, JSON shapes, Sigima glue |
| Component (TS/React) | Vitest + RTL | ~10–100 ms | Form behaviour, reducers, runtime mocks |
| End-to-end (browser) | Playwright | ~10–60 s each | Multi-component interactions, runtime round-trips |

Playwright runs serially (workers: 1, ~3 min Pyodide boot per spec file in CI), so adding E2E tests has a real, compounding cost. Treat them as a scarce resource.
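To make the compounding cost concrete, here is a rough back-of-the-envelope model. It is only a sketch: `estimateE2eMinutes`, `SpecFile`, and the 30 s per-test figure are illustrative assumptions, and the 3 min boot is the CI figure quoted above.

```typescript
// Rough wall-time model for a serial Playwright run (workers: 1).
// Assumption: every spec file pays one Pyodide boot, then runs its tests back to back.
interface SpecFile {
  name: string;
  testCount: number;
}

function estimateE2eMinutes(
  specs: readonly SpecFile[],
  bootMinutesPerSpec = 3, // CI boot figure quoted in this document
  secondsPerTest = 30, // hypothetical value inside the 10 to 60 s range
): number {
  let total = 0;
  for (const spec of specs) {
    total += bootMinutesPerSpec + (spec.testCount * secondsPerTest) / 60;
  }
  return total;
}

// The same six test cases, packaged two ways.
const onePerFile: SpecFile[] = Array.from({ length: 6 }, (_, i) => ({
  name: `case_${i}.spec.ts`,
  testCount: 1,
}));
const grouped: SpecFile[] = [{ name: "invariant.spec.ts", testCount: 6 }];

console.log(estimateE2eMinutes(onePerFile)); // six boots plus six tests: 21
console.log(estimateE2eMinutes(grouped)); // one boot plus six tests: 6
```

Under these assumptions the boot cost, not the assertions, accounts for most of the difference between the two layouts.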

A continuous layer ties the three together: GitHub Actions (tests.yml) runs all three on every push / PR. The Python layer runs bootstrap.py directly under CPython through fixtures that stub the Pyodide-only modules (js, pyodide.ffi); this gives fast feedback and high coverage without booting WebAssembly.

Running the suites locally

# One-time: copy the environment template and create the project venv
# (Python 3.11 or 3.12 — earlier versions trip a quirk in
# ``isinstance(list[T], type)`` that breaks Sigima's processor
# introspection).
Copy-Item .env.template .env
py -3.11 -m venv .venv
.\.venv\Scripts\python -m pip install -r requirements-dev.txt

# Python unit tests + coverage report (htmlcov-python/)
.\.venv\Scripts\python -m pytest tests/python --cov=src/runtime --cov-report=html:htmlcov-python

# TypeScript unit tests + coverage report (coverage-ts/)
npm test
npm run test:cov

# End-to-end browser tests (boots Pyodide in Chromium ~1.5 min)
npx playwright install chromium   # one-time
npm run test:e2e

# Performance benchmarks (opt-in — ~5 min). Includes the image-display
# benchmark and the 50k-sample binary transfer probe.
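npx playwright test --project=perf      # perf project only
$env:PERF=1; npm run test:e2e           # chromium suite + PERF-gated probes (POSIX shells: PERF=1 npm run test:e2e)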

# Run a throwaway `_repro_*` probe (the default suite ignores them).
$env:PW_REPRO=1; npx playwright test --project=repro tests/e2e/_repro_x.spec.ts

Test layout:

tests/
├── python/          # pytest suite — exercises bootstrap.py headlessly
├── ts/              # Vitest suite — pure TypeScript modules
└── e2e/             # Playwright specs — real browser smoke tests

VS Code tasks are provided under .vscode/tasks.json (🚀 Pytest, 🟢 Vitest, 🎭 Playwright, …). The default test task (Ctrl+Shift+P → Run Test Task) launches the Python suite.

Default suite vs perf project

The default Playwright project (chromium) runs the regression suite and intentionally excludes performance benchmarks and _repro_* throwaway probes. The costly perf and benchmark projects are opt-in: each is registered only when explicitly requested, either via its --project=<name> flag or its PW_<NAME> env var (see wantsProject in playwright.config.ts). A bare playwright test — and therefore the default CI run, which calls npm run test:e2e (--project=chromium) — never pays for them.
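The opt-in check could look roughly like the sketch below. This is not the actual `wantsProject` from playwright.config.ts, whose implementation is not shown here; the signature, the handling of `--project <name>` as two arguments, and treating `"0"` as "off" are all assumptions made for illustration.

```typescript
// Hypothetical sketch of an opt-in check; the real wantsProject in
// playwright.config.ts may be implemented differently.
type Env = Record<string, string | undefined>;

function wantsProject(name: string, argv: readonly string[], env: Env): boolean {
  // CLI form: "--project=perf" or "--project perf".
  const viaFlag = argv.some(
    (arg, i) =>
      arg === `--project=${name}` || (arg === "--project" && argv[i + 1] === name),
  );
  // Env form: PW_PERF=1, PW_REPRO=1, ... (empty string and "0" count as unset here).
  const envValue = env[`PW_${name.toUpperCase()}`];
  const viaEnv = envValue !== undefined && envValue !== "" && envValue !== "0";
  return viaFlag || viaEnv;
}
```

In a config file, such a helper would typically be called as `wantsProject("perf", process.argv, process.env)` and the project appended to the `projects` array only when it returns true.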

npm run test:e2e                       # default chromium suite (CI default)
npm run test:e2e:perf                  # perf benchmarks only (--project=perf)
npx playwright test --project=perf     # same, explicit form
$env:PERF=1; npm run test:e2e          # chromium suite + PERF-gated probes (POSIX shells: PERF=1 npm run test:e2e)

When you need a perf or budget-style probe, mark it with test.skip(!process.env.PERF, "...") (it then runs in chromium only under PERF=1) or add the spec to the perf project's testMatch glob (currently image_perf.spec.ts, opfs_storage_bench.spec.ts, opfs_sync_spike.spec.ts, opfs_worker_bench.spec.ts).

Performance benchmarks: on-demand CI and regression tracking

Perf benchmarks are deterministic with respect to the code: if nothing changes, the numbers do not change, so re-running them on every commit (or on a fixed schedule) would add noise, not signal. They are therefore driven by a dedicated, opt-in workflow — .github/workflows/perf.yml — that runs the perf project (npm run test:e2e:perf) only when wanted:

- manually via Run workflow (workflow_dispatch);
- on a pull request labelled run-perf (opt-in per PR);
- on every push to main (release merges);
- on every vX.Y.Z release tag.

Results are tracked over time with benchmark-action/github-action-benchmark on the orphan benchmarks branch (open dev/bench/determinist/index.html or dev/bench/timings/index.html from that branch to view the charts). scripts/perf-to-benchmark-json.mjs converts the raw result JSON into the action's customSmallerIsBetter format, splitting metrics into two groups with different policies:

- Deterministic (memory Δheap in MiB, approximate JSON payload in MB): these make trustworthy gates. On a run-perf pull request, an increase beyond 125 % of the baseline fails the check and comments on the PR. On main / tags the value is recorded but never fails (the change is already merged).
- Timings (milliseconds): noisy on shared CI runners, so they are tracked with a wide 200 % threshold and never fail; they exist purely for trend inspection.
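The two policies above can be summarised as a small decision function. This is only an illustration of the rules as stated: the real comparison is performed by benchmark-action/github-action-benchmark, and `compareToBaseline`, `MetricGroup`, and `RunContext` are hypothetical names.

```typescript
// Illustrative gate logic for smaller-is-better metrics.
// Assumption: the ratio is current / baseline, and an alert fires when it
// strictly exceeds the group's threshold.
type MetricGroup = "deterministic" | "timings";
type RunContext = "pull_request" | "main_or_tag";

interface Comparison {
  ratio: number; // current / baseline
  alert: boolean; // threshold exceeded
  failsCheck: boolean; // whether the CI check goes red
}

const ALERT_THRESHOLD: Record<MetricGroup, number> = {
  deterministic: 1.25, // 125 %
  timings: 2.0, // 200 %
};

function compareToBaseline(
  group: MetricGroup,
  context: RunContext,
  baseline: number,
  current: number,
): Comparison {
  if (baseline <= 0) throw new RangeError("baseline must be positive");
  const ratio = current / baseline;
  const alert = ratio > ALERT_THRESHOLD[group];
  // Only deterministic metrics on a run-perf pull request can fail the check.
  const failsCheck = alert && group === "deterministic" && context === "pull_request";
  return { ratio, alert, failsCheck };
}

// Heap delta grows from 40 MiB to 52 MiB (ratio 1.3): fails on a PR, recorded on main.
console.log(compareToBaseline("deterministic", "pull_request", 40, 52).failsCheck); // true
console.log(compareToBaseline("deterministic", "main_or_tag", 40, 52).failsCheck); // false
// A timing that goes from 100 ms to 250 ms alerts but never fails.
console.log(compareToBaseline("timings", "pull_request", 100, 250).failsCheck); // false
```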

The cheap deterministic invariants encoded directly in the specs (data-integrity checksums, spill counts, the heap-decoupling expect(...)) still run wherever the perf project runs and fail fast on a broken guarantee, independent of the chart thresholds.

Running a `_repro_*` throwaway probe

Because the chromium project testIgnores _repro_* (so a forgotten probe never lands in CI) and testIgnore can't be overridden on the CLI, npx playwright test tests/e2e/_repro_x.spec.ts returns "No tests found". Do not rename the file to run it. Instead use the env-gated repro project, which only exists when PW_REPRO is set (so CI, which runs a bare playwright test, is unaffected):

$env:PW_REPRO=1; npx playwright test --project=repro tests/e2e/_repro_x.spec.ts

Run it from the DataLab-Web folder — if the terminal cwd is another workspace folder, npx can't find the local Playwright and stalls on a download prompt.
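The env-gated registration described in this section can be sketched as follows. The field names (`name`, `testMatch`, `testIgnore`) are real Playwright project options, but `buildProjects`, the `ProjectSketch` type, and the exact patterns are assumptions; the real playwright.config.ts also excludes the perf specs from chromium, which this sketch omits.

```typescript
// Sketch of env-gated project registration (hypothetical helper and patterns).
interface ProjectSketch {
  name: string;
  testMatch?: RegExp;
  testIgnore?: RegExp[];
}

const REPRO_SPEC = /_repro_.*\.spec\.ts$/;

function buildProjects(env: Record<string, string | undefined>): ProjectSketch[] {
  const projects: ProjectSketch[] = [
    {
      name: "chromium",
      // A forgotten probe never reaches the default suite.
      testIgnore: [REPRO_SPEC],
    },
  ];
  if (env["PW_REPRO"]) {
    // Registered only on request, so the project does not exist in a default run.
    projects.push({ name: "repro", testMatch: REPRO_SPEC });
  }
  return projects;
}

console.log(buildProjects({}).map((p) => p.name)); // [ 'chromium' ]
console.log(buildProjects({ PW_REPRO: "1" }).map((p) => p.name)); // [ 'chromium', 'repro' ]
```

Because the repro project is absent unless PW_REPRO is set, passing `--project=repro` without the variable has nothing to select, which is why the command above sets both.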

Worker-mode coverage

The runtime can also run inside a Dedicated Web Worker (opt-in via ?runtime=worker; see architecture.md §3.3 and DEW ADR #2). That path has its own failure surface — Pyodide boot in a module worker, the postMessage RPC bridge, transferable buffers, the synchronous mirror, and workspace-mutation events crossing back to the main thread — which unit tests cannot reach. tests/e2e/worker_mode.spec.ts is the permanent regression suite that exercises it in the default chromium project, so it runs on every CI build. It boots the worker runtime once (describe.serial + beforeAll) and shares the page across assertions to amortise the cold boot. This suite is the gate for ever promoting worker mode to the default execution mode.

Decision tree for a bug fix

1. Reproduce the bug with the cheapest possible probe.
   - If the symptom is purely in bootstrap.py (or any JSON-serialisable contract): write a pytest test under tests/python/.
   - If the symptom is a React state/render bug isolatable from the runtime: write a Vitest + RTL test under tests/ts/ (mock the runtime via the DataLabRuntime interface).
   - Only if neither layer can express the bug, because it requires a real Pyodide round-trip or multi-component interaction: write a throwaway Playwright spec under tests/e2e/_repro_*.spec.ts.
2. Fix the bug.
3. Decide whether the test becomes permanent, using the criteria below. Default to throwaway unless promotion is justified.
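Step 1 of the decision tree reads naturally as a cascade of checks, cheapest layer first. The sketch below only restates that ordering; `Symptom`, `cheapestLayer`, and the field names are hypothetical and exist purely for illustration.

```typescript
// Cheapest-first layer selection, mirroring step 1 of the decision tree.
type Layer = "pytest" | "vitest-rtl" | "playwright-repro";

interface Symptom {
  // The bug lives purely in bootstrap.py or another JSON-serialisable contract.
  pureBootstrapOrJsonContract: boolean;
  // The bug is a React state/render issue that can be isolated from the runtime.
  isolatableReactBug: boolean;
}

const LAYER_LOCATION: Record<Layer, string> = {
  pytest: "tests/python/",
  "vitest-rtl": "tests/ts/",
  "playwright-repro": "tests/e2e/_repro_*.spec.ts",
};

function cheapestLayer(symptom: Symptom): Layer {
  if (symptom.pureBootstrapOrJsonContract) return "pytest";
  if (symptom.isolatableReactBug) return "vitest-rtl";
  // Neither cheap layer can express the bug: fall back to a throwaway E2E probe.
  return "playwright-repro";
}

const layer = cheapestLayer({ pureBootstrapOrJsonContract: false, isolatableReactBug: true });
console.log(layer, LAYER_LOCATION[layer]); // vitest-rtl tests/ts/
```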

Promote to a permanent test when

- A cheaper layer cannot express the invariant. A Vitest test that needs to mock half of DataLabRuntime and Pyodide's behaviour duplicates production code; an E2E is honest in that case.
- The bug touches a foundational contract you're now committing to (e.g. "the side panel always reflects the selected object", "ROIs survive a reload", "running a feature never corrupts the object tree"). These are the invariants users would notice immediately if broken; permanent tests pay back over many refactors.
- The mechanism is subtle (race conditions, stale state across async boundaries, Pyodide bridge marshalling) and likely to regress silently.

Keep as a throwaway probe when

- The bug is in first-implementation code that's likely to be rewritten or restructured soon.
- The fix is structurally obvious (e.g. adding a missing null guard) and unlikely to be undone.
- The bug class is prevented by a type-system or lint change; prefer the prevention to a test.
- The scenario is already implicitly covered by an existing test (don't add per-bug duplicates of the same coverage).

Authoring rules for permanent E2E tests

When you do promote a test to permanent E2E:

- One spec per invariant, multiple test() cases inside. Spec startup (Pyodide boot, page navigation) dominates wall time; sharing it across related cases keeps the suite cheap.
- Test the contract, not the implementation. Assertions should read like product requirements ("size input matches the selected signal's backend size"), not like UI snapshots.
- Use the runtime API for setup, the UI for the action under test, and both for assertions. Cross-checking UI vs. backend catches the largest class of stale-state bugs.
- Prefer generous timeouts (waitForRuntimeReady, then explicit polls) over retries. The suite already runs slowly, and retries mask flaky tests in a way that long expect timeouts do not.
- Name the spec after the invariant, not the bug (side_panel_mirrors_selection.spec.ts, not fix_creation_form_swap_bug.spec.ts).

Mandatory rule for UI changes

Every change to the UI must be exercised end-to-end with Playwright before it is considered done. This applies to:

- Every bug fix that touches a React component or runtime call.
- Every new feature, however small.
- Every phase of a multi-phase implementation, not only the final phase. If a feature is split into Phase 1 (backend) → Phase 2 (UI wiring) → Phase 3 (polish), Phase 2 and Phase 3 each need their own Playwright pass before being declared complete.

The Playwright pass can be:

- A throwaway probe under tests/e2e/_repro_*.spec.ts (deleted afterwards), which is sufficient for incremental progress, refactors, and fixes that don't meet the promotion criteria above.
- A permanent spec when the criteria in "Promote to a permanent test when" are met.

The point is not to grow the suite; it is to never declare a UI change done based solely on type-checks, unit tests, or "looks fine in the dev server". Pyodide round-trips and async state interactions silently break in ways only a browser-driven test catches reliably.

Workflow for AI agents

When asked to implement or fix anything that touches the UI:

1. Reproduce / scope the change with a temporary tests/e2e/_repro_*.spec.ts (delete afterwards), confirming the starting state and the target behaviour before coding.
2. Apply the change.
3. Run Playwright on the temporary spec to verify. No UI work is declared done without this step, including intermediate phases of a multi-phase plan. Run a _repro_* probe through the env-gated repro project (a bare run ignores it): $env:PW_REPRO=1; npx playwright test --project=repro tests/e2e/_repro_x.spec.ts (see "Running a `_repro_*` throwaway probe" above).
4. Apply the decision tree above. If promoting, write a single, well-named permanent spec covering the invariant. Delete the reproduction spec.
5. Briefly justify the decision in the PR description (why permanent or why throwaway).
