Skip to content

Feature- Tests#10

Open
Jurgee wants to merge 38 commits into
mainfrom
feature/tests
Open

Feature- Tests#10
Jurgee wants to merge 38 commits into
mainfrom
feature/tests

Conversation

@Jurgee

@Jurgee Jurgee commented May 31, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

Release Notes

  • New Features

    • Added HTTP endpoints to trigger test runs and throughput performance measurements
    • Introduced model snapshot testing for binary classification, semantic segmentation, and embedding models
  • Tests

    • Added snapshot test cases validating model outputs against reference data
  • Chores

    • Added Helm deployment configurations for test and performance runners
    • Configured persistent storage for test reference data
    • Updated project dependencies with testing tools

Jurgee and others added 30 commits April 7, 2026 16:58
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Jurgee and others added 5 commits May 31, 2026 10:46
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
Co-authored-by: Copilot <copilot@github.com>
@Jurgee Jurgee self-assigned this May 31, 2026
@Jurgee Jurgee requested review from a team, JakubPekar, Copilot and matejpekar May 31, 2026 11:42
@coderabbitai

coderabbitai Bot commented May 31, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@Jurgee, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 28 minutes and 35 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fa491907-6e69-4a02-b0ad-1b4f10f98e95

📥 Commits

Reviewing files that changed from the base of the PR and between afbcab6 and 11c1dcd.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (9)
  • builders/test_runner.py
  • builders/throughput_runner.py
  • tests/model_snapshots/_shared.py
  • tests/model_snapshots/generate_references.py
  • tests/model_snapshots/test_binary_classifier_model_snapshot.py
  • tests/model_snapshots/test_prov_gigapath_model_snapshot.py
  • tests/model_snapshots/test_semantic_segmentation_model_snapshot.py
  • tests/model_snapshots/test_virchow2_model_snapshot.py
  • tests/perf_throughput.py
📝 Walkthrough

Walkthrough

This PR introduces a comprehensive model testing and performance infrastructure. It adds snapshot-based validation tests for three model types (binary classifier, semantic segmentation, embeddings), a performance benchmarking tool, reference data generation, and Ray Serve HTTP endpoints to execute tests remotely. Kubernetes infrastructure provisions shared storage for test reference data.

Changes

Model Testing and Performance Infrastructure

Layer / File(s) Summary
Shared test validation helpers and pytest configuration
tests/model_snapshots/_shared.py, pyproject.toml
Helper functions extract image tiles via OpenSlide, invoke model SDK endpoints, and validate outputs against reference data stored on disk within numeric tolerances. Semantic segmentation and embedding runners also compute summary statistics. Pytest is added as dev dependency with pythonpath configured for the model snapshots directory.
Snapshot test suites for model validation
tests/model_snapshots/test_binary_classifier_model_snapshot.py, tests/model_snapshots/test_semantic_segmentation_model_snapshot.py, tests/model_snapshots/test_prov_gigapath_model_snapshot.py, tests/model_snapshots/test_virchow2_model_snapshot.py, tests/model_snapshots/new_images.txt
Parametrized pytest tests define cases for prostate classification, colorectum segmentation, and gigapath/virchow2 embeddings. Each test loads per-case reference data (JSON for binary, NPY for embeddings/segmentation) from /mnt/test_refs/ and invokes the corresponding shared validation helper with fixed tile size and pyramid level.
Reference data generation script
tests/model_snapshots/generate_references.py
Standalone script reads image tiles and invokes the model service to generate expected outputs. Writes JSON files containing metadata and classification scores for binary cases, and numpy arrays for segmentation and embedding cases, with error handling and per-case logging.
Performance throughput and latency benchmark script
tests/perf_throughput.py
Measures per-model throughput (requests/second) and latency percentiles (p50/p95/p99) using concurrent requests with configurable duration, concurrency, and timeout. Includes optional readiness polling with retry/backoff, deterministic tile pool generation, and thread-safe stats collection using NumPy percentile computation.
Ray Serve HTTP service wrappers for remote test execution
builders/test_runner.py, builders/throughput_runner.py
Two FastAPI-based Ray Serve deployments expose HTTP POST endpoints. TestRunner installs pytest and runs pytest tests/model_snapshots/, returning combined stdout/stderr as text/plain. ThroughputRunner invokes tests/perf_throughput.py with CLI-configurable duration, concurrency, timeout, and readiness-wait parameters.
Kubernetes and Helm deployment configuration
helm/rayservice/applications/test-runner.yaml, helm/rayservice/applications/throughput-test.yaml, helm/rayservice/values.yaml, helm/rayservice/workers/cpu-workers.yaml, helm/rayservice/workers/mig20-workers.yaml, pvc/model-test-refs-pvc.yaml
Helm application manifests define test-runner and throughput-test services with HTTP route prefixes (/run-tests, /run-throughput), Ray runtime environments (GitHub working directory + SDK pip install), and autoscaling (0–1 replicas, 1 CPU, 2 GiB memory per actor). Worker manifests and a new PersistentVolumeClaim provision shared NFS-CSI storage at /mnt/test_refs (5Gi, ReadWriteMany) for test reference data.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • matejpekar

Poem

🐰 Tests snapshot and serve, performance numbers we measure,
Tiles extracted with care from each histology treasure,
Reference frames stored in volumes for sharing,
Ray Serve deploys the endpoints with bearing,
Model validation through quantitative pleasure! 📊✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Feature- Tests' is extremely vague and generic. It uses a non-descriptive prefix 'Feature-' followed by the bare word 'Tests', failing to convey meaningful information about the specific changes in the changeset. Replace with a concise, specific title summarizing the main changes, such as 'Add model snapshot and throughput performance test infrastructure with Ray Serve deployments'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a test runner and a throughput benchmark runner for Ray Serve deployments, along with Helm chart configurations, PVC setups, and snapshot tests for model validation. The review feedback suggests several improvements: moving dynamic pip installations from the deployment initialization to the runtime_env configuration, returning proper HTTP error status codes (e.g., 500) upon test failures, avoiding hardcoded paths by introducing a helper to dynamically resolve the test references directory, and explicitly handling ValueError in the readiness check to prevent infinite retry loops.

Comment thread builders/test_runner.py
Comment thread builders/test_runner.py
Comment thread builders/throughput_runner.py
Comment thread builders/throughput_runner.py
Comment thread helm/rayservice/applications/test-runner.yaml
Comment thread tests/model_snapshots/test_semantic_segmentation_model_snapshot.py Outdated
Comment thread tests/model_snapshots/test_semantic_segmentation_model_snapshot.py Outdated
Comment thread tests/model_snapshots/test_virchow2_model_snapshot.py Outdated
Comment thread tests/model_snapshots/test_virchow2_model_snapshot.py Outdated
Comment thread tests/perf_throughput.py

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (6)
tests/perf_throughput.py (1)

41-45: ⚡ Quick win

Eliminate code duplication by importing from _shared.py.

This function duplicates _models_base_url from tests/model_snapshots/_shared.py. Maintaining the same logic in two places increases the risk of inconsistency if the default URL changes.

♻️ Refactor to import from shared module
+from tests.model_snapshots._shared import _models_base_url
+
 import numpy as np
 from rationai import Client


-def _models_base_url() -> str:
-    return os.environ.get(
-        "MODEL_SERVICE_MODELS_BASE_URL",
-        "http://rayservice-model-serve-svc.rationai-jobs-ns.svc.cluster.local:8000",
-    )
-
-
 def make_pool(tile_size: int, n: int) -> list[np.ndarray]:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/perf_throughput.py` around lines 41 - 45, Remove the duplicate
_models_base_url function and import it from the shared module instead; update
tests/perf_throughput.py to add an import from tests.model_snapshots._shared (or
the correct module path) and use the imported _models_base_url function where
needed so the default URL is maintained in one place (reference symbol:
_models_base_url in tests/model_snapshots/_shared.py and its usages in
tests/perf_throughput.py).
builders/throughput_runner.py (2)

14-18: ⚡ Quick win

Unnecessary pytest install.

run() invokes tests/perf_throughput.py directly via sys.executable, not through pytest, so installing pytest here serves no purpose and just adds startup cost/failure risk. Drop it, or install the actual deps perf_throughput.py needs (e.g. numpy) instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@builders/throughput_runner.py` around lines 14 - 18, The __init__ in
builders/throughput_runner.py currently runs subprocess.run to pip-install
pytest, which is unnecessary because perf_throughput.py is invoked directly via
sys.executable rather than via pytest; remove the pytest install call from the
__init__ (or replace it with installing real runtime deps perf_throughput.py
requires, e.g., numpy) so that the subprocess.run([... "pip", "install",
"pytest" ...]) invocation is deleted/updated and the class initializer no longer
performs an unneeded pip install.

51-55: ⚡ Quick win

Add a timeout to the benchmark subprocess.

Same pattern as builders/test_runner.py: subprocess.run has no timeout, so if the benchmark hangs it blocks the single replica. A hard cap slightly above duration_s (e.g. duration_s + buffer) plus TimeoutExpired handling would bound it.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@builders/throughput_runner.py` around lines 51 - 55, The subprocess.run call
currently has no timeout and can hang; add a timeout parameter set to duration_s
+ a small buffer (e.g. buffer seconds) when calling subprocess.run(cmd,
capture_output=True, text=True, timeout=duration_s+buffer) and wrap the call in
a try/except that catches subprocess.TimeoutExpired (use the same pattern as
builders/test_runner.py): on timeout assemble output from the exception's
stdout/stderr (or note that it timed out), and return a Response (text/plain)
describing the timeout and partial output (set an appropriate HTTP status like
504 if the test_runner uses that) so the replica is bounded and the caller sees
the timeout.
helm/rayservice/applications/throughput-test.yaml (1)

1-3: 💤 Low value

Naming inconsistency: throughput-runner vs throughput-test.

The app name is throughput-runner, but the file and the values.yaml applications entry are throughput-test, while test-runner is consistent across all three. Align the names to avoid confusion when correlating values entries to app configs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@helm/rayservice/applications/throughput-test.yaml` around lines 1 - 3, The
app config in throughput-test.yaml uses name: throughput-runner which mismatches
the file name and the values.yaml applications entry (throughput-test) while
other configs use test-runner; update the name field in the manifest to match
the canonical identifier used in values.yaml and other configs (e.g., change
name: throughput-runner to name: throughput-test or to test-runner depending on
the canonical choice) so that builders.throughput_runner:app and route_prefix:
/run-throughput remain unchanged and the values.yaml applications entry maps
correctly.
builders/test_runner.py (2)

14-18: ⚡ Quick win

Prefer installing pytest via runtime_env/image instead of at replica startup.

Installing inside __init__ adds cold-start latency and a network-dependent failure surface (check=True will crash replica init on any transient pip failure). Adding pytest to the runtime_env.pip list in helm/rayservice/applications/test-runner.yaml (or baking it into the image) is more reliable.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@builders/test_runner.py` around lines 14 - 18, The __init__ method in
builders.test_runner.py currently runs subprocess.run([sys.executable, "-m",
"pip", "install", "pytest", "-q"], check=True) which causes cold-start latency
and brittle failures; remove that subprocess.run call from the
TestRunner.__init__ and stop installing pytest at replica startup, and instead
add "pytest" to the runtime_env.pip list in the Helm manifest
(helm/rayservice/applications/test-runner.yaml) or bake it into the container
image so TestRunner starts without performing pip installs at init.

23-37: ⚡ Quick win

Add a timeout to the pytest subprocess.

subprocess.run(..., capture_output=True) has no timeout, so a hung or slow snapshot run will block this request — and, with num_replicas=1 / target_ongoing_requests: 1, the whole replica — indefinitely. Consider a bounded timeout and handling TimeoutExpired.

♻️ Suggested change
         result = subprocess.run(
             [
                 sys.executable,
                 "-m",
                 "pytest",
                 "tests/model_snapshots/",
                 "-v",
                 "--tb=short",
                 "--no-header",
                 "-s",
                 "--color=no",
             ],
             capture_output=True,
             text=True,
+            timeout=600,
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@builders/test_runner.py` around lines 23 - 37, The pytest subprocess call
using subprocess.run(...) should include a bounded timeout and handle
subprocess.TimeoutExpired; update the subprocess.run invocation in
builders/test_runner.py to pass a sensible timeout (or a configurable constant)
and wrap the call in a try/except that catches subprocess.TimeoutExpired,
captures/uses the exception's stdout/stderr (or logs them), and returns/fails
the test run with a clear timeout error instead of hanging indefinitely;
reference the existing subprocess.run(...) call and the
subprocess.TimeoutExpired exception for locating where to add the timeout and
handler.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/model_snapshots/_shared.py`:
- Around line 186-194: Move the shape validation to occur before computing the
cosine similarity: check actual.shape == expected.shape (use the same case_name
for context) and call pytest.fail if they differ, then compute similarity via
np.dot and norms; also guard against zero norms by checking actual_norm and
expected_norm (from np.linalg.norm) and fail or handle the case explicitly to
avoid division-by-zero/nan behavior when computing similarity used against
min_cosine_similarity.
- Around line 96-119: The shape comparison should be performed immediately after
computing `actual` and loading `expected` so it fails with the clear message
instead of raising a NumPy error during subtraction; move the `if actual.shape
!= expected.shape: pytest.fail(f"Shape mismatch: expected={expected.shape},
actual={actual.shape}")` check to precede the `max_diff =
np.abs(actual.astype(np.float32) - expected.astype(np.float32)).max()` line (and
before any `stats_slice` slicing/aggregation), keeping the rest of the
`max_diff`, `stats_slice`, `min_val`, `mean_val`, `max_val`, and `frac_05`
computations unchanged.

In `@tests/perf_throughput.py`:
- Line 215: The help text for the CLI option "--models-base-url" is misleading
(it says default http://localhost:8000) — update the help string to reflect the
actual default returned by the helper _models_base_url() (the cluster URL
http://rayservice-model-serve-svc.rationai-jobs-ns.svc.cluster.local:8000 when
the MODEL_SERVICE_MODELS_BASE_URL env var is unset); locate the parser/arg
definition that sets help for "--models-base-url" in tests/perf_throughput.py
and change the descriptive text to mention the real default (or refer to
_models_base_url() / MODEL_SERVICE_MODELS_BASE_URL) so the help is accurate.

---

Nitpick comments:
In `@builders/test_runner.py`:
- Around line 14-18: The __init__ method in builders.test_runner.py currently
runs subprocess.run([sys.executable, "-m", "pip", "install", "pytest", "-q"],
check=True) which causes cold-start latency and brittle failures; remove that
subprocess.run call from the TestRunner.__init__ and stop installing pytest at
replica startup, and instead add "pytest" to the runtime_env.pip list in the
Helm manifest (helm/rayservice/applications/test-runner.yaml) or bake it into
the container image so TestRunner starts without performing pip installs at
init.
- Around line 23-37: The pytest subprocess call using subprocess.run(...) should
include a bounded timeout and handle subprocess.TimeoutExpired; update the
subprocess.run invocation in builders/test_runner.py to pass a sensible timeout
(or a configurable constant) and wrap the call in a try/except that catches
subprocess.TimeoutExpired, captures/uses the exception's stdout/stderr (or logs
them), and returns/fails the test run with a clear timeout error instead of
hanging indefinitely; reference the existing subprocess.run(...) call and the
subprocess.TimeoutExpired exception for locating where to add the timeout and
handler.

In `@builders/throughput_runner.py`:
- Around line 14-18: The __init__ in builders/throughput_runner.py currently
runs subprocess.run to pip-install pytest, which is unnecessary because
perf_throughput.py is invoked directly via sys.executable rather than via
pytest; remove the pytest install call from the __init__ (or replace it with
installing real runtime deps perf_throughput.py requires, e.g., numpy) so that
the subprocess.run([... "pip", "install", "pytest" ...]) invocation is
deleted/updated and the class initializer no longer performs an unneeded pip
install.
- Around line 51-55: The subprocess.run call currently has no timeout and can
hang; add a timeout parameter set to duration_s + a small buffer (e.g. buffer
seconds) when calling subprocess.run(cmd, capture_output=True, text=True,
timeout=duration_s+buffer) and wrap the call in a try/except that catches
subprocess.TimeoutExpired (use the same pattern as builders/test_runner.py): on
timeout assemble output from the exception's stdout/stderr (or note that it
timed out), and return a Response (text/plain) describing the timeout and
partial output (set an appropriate HTTP status like 504 if the test_runner uses
that) so the replica is bounded and the caller sees the timeout.

In `@helm/rayservice/applications/throughput-test.yaml`:
- Around line 1-3: The app config in throughput-test.yaml uses name:
throughput-runner which mismatches the file name and the values.yaml
applications entry (throughput-test) while other configs use test-runner; update
the name field in the manifest to match the canonical identifier used in
values.yaml and other configs (e.g., change name: throughput-runner to name:
throughput-test or to test-runner depending on the canonical choice) so that
builders.throughput_runner:app and route_prefix: /run-throughput remain
unchanged and the values.yaml applications entry maps correctly.

In `@tests/perf_throughput.py`:
- Around line 41-45: Remove the duplicate _models_base_url function and import
it from the shared module instead; update tests/perf_throughput.py to add an
import from tests.model_snapshots._shared (or the correct module path) and use
the imported _models_base_url function where needed so the default URL is
maintained in one place (reference symbol: _models_base_url in
tests/model_snapshots/_shared.py and its usages in tests/perf_throughput.py).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f1bad222-0af5-4760-aa06-357b910d5716

📥 Commits

Reviewing files that changed from the base of the PR and between 3d37a6e and afbcab6.

📒 Files selected for processing (19)
  • builders/test_runner.py
  • builders/throughput_runner.py
  • helm/rayservice/applications/test-runner.yaml
  • helm/rayservice/applications/throughput-test.yaml
  • helm/rayservice/values.yaml
  • helm/rayservice/workers/cpu-workers.yaml
  • helm/rayservice/workers/mig20-workers.yaml
  • pvc/model-test-refs-pvc.yaml
  • pyproject.toml
  • tests/__init__.py
  • tests/model_snapshots/__init__.py
  • tests/model_snapshots/_shared.py
  • tests/model_snapshots/generate_references.py
  • tests/model_snapshots/new_images.txt
  • tests/model_snapshots/test_binary_classifier_model_snapshot.py
  • tests/model_snapshots/test_prov_gigapath_model_snapshot.py
  • tests/model_snapshots/test_semantic_segmentation_model_snapshot.py
  • tests/model_snapshots/test_virchow2_model_snapshot.py
  • tests/perf_throughput.py

Comment thread tests/model_snapshots/_shared.py Outdated
Comment thread tests/model_snapshots/_shared.py Outdated
Comment thread tests/perf_throughput.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants