feat: add downstream task evaluation pipeline #24

Merged: Sudhendra merged 7 commits into main from downstream-eval-framework on Mar 7, 2026

@Sudhendra
Owner

Summary

  • add a downstream-task evaluation framework with offline benchmark prep for HotpotQA, QASPER, and DS-1000 plus deterministic scoring and baseline compressors
  • add a runnable paired evaluation CLI for full-vs-compressed context comparisons, including explicit learned backend support for local MLX adapters and Tinker checkpoints
  • document the downstream eval workflow and fix autoformat.sh so worktree pushes run hooks correctly
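
A rough illustration of what a paired full-vs-compressed comparison with deterministic scoring can look like. This is a minimal sketch, not the code in this PR: `token_f1`, `truncate_compressor`, `paired_eval`, and the `answer_fn` callback are hypothetical names, token-level F1 is assumed as the metric, and a simple head-truncation compressor stands in for the baseline compressors.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()


def token_f1(prediction: str, reference: str) -> float:
    """Deterministic token-level F1 between a prediction and a reference."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty is a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def truncate_compressor(context: str, keep_ratio: float = 0.5) -> str:
    """Hypothetical baseline: keep the leading fraction of the words."""
    words = context.split()
    keep = max(1, int(len(words) * keep_ratio))
    return " ".join(words[:keep])


def paired_eval(examples, answer_fn, compressor):
    """Score every example twice: full context vs. compressed context.

    `examples` is a list of dicts with "question", "context", and "answer"
    keys (an assumed schema). `answer_fn(question, context)` is whatever
    produces a prediction, e.g. a model call.
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    full_scores, comp_scores = [], []
    for ex in examples:
        full_pred = answer_fn(ex["question"], ex["context"])
        comp_pred = answer_fn(ex["question"], compressor(ex["context"]))
        full_scores.append(token_f1(full_pred, ex["answer"]))
        comp_scores.append(token_f1(comp_pred, ex["answer"]))
    full_f1 = sum(full_scores) / len(full_scores)
    comp_f1 = sum(comp_scores) / len(comp_scores)
    return {"full_f1": full_f1, "compressed_f1": comp_f1, "delta": comp_f1 - full_f1}
```

Because both arms use the same examples and the same `answer_fn`, the `delta` field isolates the effect of the compressor; a learned backend would presumably plug in where `truncate_compressor` sits here.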

Test Plan

  • pytest tests/test_downstream_dataset.py tests/test_prepare_downstream_eval.py tests/test_downstream_benchmarks.py tests/test_downstream_baselines.py tests/test_downstream_scoring.py tests/test_evaluate_downstream.py tests/test_evaluate_downstream_cli.py -v
  • ruff check src/evaluation/downstream scripts/evaluate_downstream.py scripts/prepare_downstream_eval.py tests/test_downstream_dataset.py tests/test_prepare_downstream_eval.py tests/test_downstream_benchmarks.py tests/test_downstream_baselines.py tests/test_downstream_scoring.py tests/test_evaluate_downstream.py tests/test_evaluate_downstream_cli.py
  • smoke-tested tiny downstream runs in the worktree and reviewed summary artifacts for baseline vs learned compressors

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66d1107d4f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread scripts/evaluate_downstream.py
Comment thread scripts/evaluate_downstream.py Outdated
@Sudhendra merged commit 77b3619 into main on Mar 7, 2026