Add tests for #1475: multishot candidate selection artifacts #1478

Draft
prompt-driven-github[bot] wants to merge 1 commit into main from test/issue-1475

Summary

Adds tests based on the requirements in #1475.

Test Files

  • tests/test_multishot_candidate.py — 37 pytest test functions across 9 scenarios
  • pdd/multishot_candidate.py — implementation module supporting the tests
  • pdd/schemas/selection_policy.schema.json — JSON schema for selection_policy.json artifact
  • pdd/schemas/multishot_candidate.schema.json — JSON schema for candidate records
  • tests/fixtures/multishot/selection_policy.json — golden fixture
  • tests/fixtures/multishot/candidate_records.jsonl — golden fixture (3 synthetic candidate records)
  • tests/fixtures/multishot/pass_at_k.json — golden fixture
  • tests/fixtures/multishot/selection_regret.json — golden fixture

Test Coverage

  • Total Tests: 37
  • Framework: pytest
  • Status: All 37 passing (0.55s)
  • Test Plan Coverage: 25/25 planned cases implemented (100%)

What These Tests Verify

  1. Selection Policy Artifact — schema validation, leakage flags, frozen_before_generation enforcement, golden fixture round-trip
  2. Leakage Rejection — parametrized over all 6 forbidden hidden input classes: hidden_pass_fail, hidden_stdout, hidden_stderr, hidden_test_names, prior_hidden_outcome, manual_post_hoc_choice
  3. Deterministic Tie-Break — 10 repeated calls produce same winner; winner matches declared policy order; k=5 n-way tie stable under 10 input permutations
  4. Candidate Record Completeness — k=3 required fields all present; JSONL schema validation; hidden_status absent before freeze / present after freeze
  5. Malformed/Missing Row Detection — missing task_id, missing candidate_index, wrong type, duplicate index all raise ValidationError
  6. Metric Recomputation — oracle count, selected count, regret count/rate, oracle-lift capture (incl. oracle_count=0 edge case), unbiased pass-at-k, edge cases k=1/all-pass/all-fail
  7. Golden JSON — all four artifacts (selection_policy.json, candidate_records.jsonl, pass_at_k.json, selection_regret.json) validated and exactly recomputed from JSONL
  8. Freeze / Hidden-Label Lifecycle: FrozenStateError raised before freeze; labels accessible after freeze
  9. k=1 Degenerate Case — single candidate always selected; pass-at-1 equals pass rate
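
For reviewers, items 3, 6, and 9 can be illustrated with a minimal sketch. It assumes the standard unbiased pass-at-k estimator, 1 - C(n-c, k) / C(n, k) for n candidates of which c pass, and a tie-break that falls back to the lowest candidate_index. The function names, signatures, and the public_score field are hypothetical and are not taken from pdd/multishot_candidate.py.

```python
from math import comb


def unbiased_pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass-at-k from n sampled candidates, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is exactly 1.0 there.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pick_winner(candidates: list[dict], policy_order: list[str]) -> dict:
    """Select by public scores in declared order (higher is better).

    Ties fall through to the lowest candidate_index, so the result does not
    depend on the order in which candidates are passed in.
    """
    def sort_key(cand: dict) -> tuple:
        scores = tuple(-cand[name] for name in policy_order)
        return scores + (cand["candidate_index"],)

    return min(candidates, key=sort_key)
```

With k=1 the estimator reduces to c / n, which is the pass rate checked by the degenerate-case scenario. Because the sort key ends in candidate_index, any permutation of an n-way tie yields the same winner.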

Contract Test Summary

N/A — no OpenAPI spec found. Schema validation is handled via jsonschema.validate against pdd/schemas/ files.
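
As a rough illustration of that validation path, the sketch below checks a policy dict with jsonschema.validate. The inline schema is a simplified, hypothetical stand-in: the real schema lives in pdd/schemas/selection_policy.schema.json and may be shaped differently, and the inputs field name is an assumption. Only two of the six forbidden input classes are listed to keep it short.

```python
import jsonschema

# Hypothetical, simplified schema for illustration only.
POLICY_SCHEMA = {
    "type": "object",
    "required": ["frozen_before_generation", "inputs"],
    "properties": {
        # The policy must be frozen before any candidate is generated.
        "frozen_before_generation": {"const": True},
        "inputs": {
            "type": "array",
            # Reject any input that names a forbidden hidden class.
            "items": {"not": {"enum": ["hidden_pass_fail", "hidden_stdout"]}},
        },
    },
}


def is_valid_policy(policy: dict) -> bool:
    """Return True when the policy satisfies the sketch schema."""
    try:
        jsonschema.validate(instance=policy, schema=POLICY_SCHEMA)
    except jsonschema.ValidationError:
        return False
    return True
```

Expressing the leakage rule in the schema keeps the rejection declarative, although the PR may enforce it in code via LeakageError instead.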

Accessibility Audit Summary

N/A — not a web test. This is a Python CLI/research harness feature.

Manual Testing Summary

N/A — all scenarios covered by automated pytest tests with synthetic tmp_path fixtures and no external dependencies.
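
A sketch of how a synthetic JSONL fixture might be checked for malformed or missing rows follows. It is an assumption-laden illustration, not the PR's code: load_candidate_rows is a hypothetical helper, it raises ValueError where the PR's tests expect ValidationError, and only the task_id and candidate_index fields named above are checked.

```python
import json
from pathlib import Path

# Field names come from the PR description; the expected types are assumptions.
REQUIRED_FIELDS = {"task_id": str, "candidate_index": int}


def load_candidate_rows(path: Path) -> list[dict]:
    """Parse a JSONL file and reject missing, mistyped, or duplicate rows."""
    rows: list[dict] = []
    seen: set[tuple[str, int]] = set()
    for line_no, line in enumerate(path.read_text().splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        for field, expected in REQUIRED_FIELDS.items():
            if field not in row:
                raise ValueError(f"line {line_no}: missing {field}")
            value = row[field]
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                raise ValueError(f"line {line_no}: {field} has wrong type")
        key = (row["task_id"], row["candidate_index"])
        if key in seen:
            raise ValueError(f"line {line_no}: duplicate candidate {key}")
        seen.add(key)
        rows.append(row)
    return rows
```

In a pytest test, the path argument would typically be a file written under the built-in tmp_path fixture, so each case gets an isolated directory.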

Test Execution

pytest -vv tests/test_multishot_candidate.py

Next Steps

  1. Review the generated tests
  2. Run tests locally to verify
  3. Adjust tests if needed
  4. Mark PR as ready for review

Closes #1475


Generated by PDD agentic test workflow (18-step)

Adds 37 pytest tests across 9 scenarios covering: selection policy artifact
schema validation, leakage rejection (6 parametrized forbidden input classes),
deterministic tie-break stability, candidate record completeness, malformed/missing
row detection, metric recomputation (oracle count, selected count, regret rate,
oracle-lift capture, unbiased pass-at-k), golden JSON fixtures for all four
artifacts, freeze/hidden-label lifecycle, and k=1 degenerate case.

New files:
- pdd/multishot_candidate.py: implementation (SelectionPolicy, CandidateRecord,
  LeakageError, FrozenStateError, select_winner, compute_metrics, pass_at_k)
- pdd/schemas/selection_policy.schema.json: JSON schema for selection_policy.json
- pdd/schemas/multishot_candidate.schema.json: JSON schema for candidate records
- tests/test_multishot_candidate.py: 37 tests, all passing
- tests/fixtures/multishot/: golden fixtures (selection_policy.json,
  candidate_records.jsonl, pass_at_k.json, selection_regret.json)

Closes #1475

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

feat: add public-only multishot candidate selection artifacts
