Add Reviewer 3 (closed-source HTTP API) as a benchmark system#85
Open
dangng2004 wants to merge 6 commits into
Open
Add Reviewer 3 (closed-source HTTP API) as a benchmark system#85dangng2004 wants to merge 6 commits into
dangng2004 wants to merge 6 commits into
Conversation
Wires reviewer3 into both the perturbation benchmark (run_benchmark.py) and
the conference outcomes study (run_competitors.py), so it can be benchmarked
on the exact same paper sets as openaireview and coarse.
Highlights
- Perturbation: LaTeX-as-md sources are compiled to PDF with pdflatex
before submission (cs_CC corpus is full LaTeX); other domains pass
through. PDF MIME is set; cfg knobs (review_mode, poll_interval_s,
poll_timeout_s) are threaded from YAML through Reviewer3System into
the adapter via the job payload.
- Comment normalization picks up the recently-added Reviewer 3 fields:
citedText (-> quote), title, severity (1-4 -> canonical 3-tier scale).
- Severity mapping is consolidated into benchmarks/perturbation/_severity.py
so the perturbation adapter, compute_auc.py, and report_scaleup.py share
one source of truth (collapses three inlined copies of COARSE_SEVERITY_MAP).
- Conference adapter reuses the perturbation HTTP submit/poll/normalize
helpers via sys.path import; no duplication. R3 has no model selector,
so method_key is fixed at reviewer3__reviewer3.
- 8 perturbation configs (full_<domain>_reviewer3.yaml) mirror the
full_<domain>_coarse.yaml knobs for max_tokens / min_perturbations.
- conference_study/configs/reviewer3.yaml pins models: [reviewer3] so
the per-(paper, model) loop fires once per paper rather than once per
manifest model.
- requests>=2.31 added to [project.optional-dependencies] benchmarks.
- .env.example documents REVIEWER3_API_KEY / REVIEWER3_USER_ID
(UUID, not email — see neurips_2026 setup notes).
Operational note (not in this diff)
- conference_study/{manifests,papers,results} are gitignored data dirs
that live only in the sibling worktree. To run, symlink them in:
cd benchmarks/conference_study
ln -s ../../../OpenAIReview/benchmarks/conference_study/manifests manifests
ln -s ../../../OpenAIReview/benchmarks/conference_study/papers papers
ln -s ../../../OpenAIReview/benchmarks/conference_study/results results
Known gap
- run_competitors.py does not load .env (only run_benchmark.py does);
export REVIEWER3_* before launching, or add a dotenv.load_dotenv() at
the top of run_competitors.py in a follow-up.
Smoke-validated end-to-end on 1 cs_CC paper through the perturbation
runner; recall 10/17 on the staged perturbations (LLM judge).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…figs Two cleanups so `git status` reflects only real work: Root .gitignore - Add slash-less `benchmarks/conference_study/papers` (the existing `papers/` rule in conference_study/.gitignore uses a trailing slash and wouldn't match a symlink — same situation the file already handles for manifests/ and results/). - Ignore `benchmarks/experimental_perturbations/` (removed from the repo in 6373fad but the local dir lingers). benchmarks/perturbation/.gitignore - Extend the "ephemeral configs" rule beyond `configs/_*` to cover the per-domain configs we generate locally but don't check in: configs/cs_*scaleup*.yaml configs/full_*.yaml configs/grok_*.yaml configs/longtail_*.yaml configs/subset_*.yaml configs/r3_smoke*.yaml Add `!configs/full_*_reviewer3.yaml` exception so the canonical reviewer3 configs that ARE tracked don't get hidden by the bulk rule. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ilures)
The token-based truncation in `prepare_units` cuts the LaTeX-as-md staged file
at a token boundary, which routinely leaves the document mid-environment.
pdflatex on the staged file then "produces no usable PDF" for a fraction of
papers, surfacing as a hard failure in the reviewer3 run.
Switches the reviewer3 system to compile the FULL pre-truncation source
(`u.src_corrupted`) and then trim the rendered PDF to its first N pages.
This matches the `max_pages: 20` convention conference_study/configs/coarse.yaml
already uses for coarse, so reviewer3 sees roughly the same content window
the other systems see.
Three pdflatex robustness fixes layered in:
1. Strip orphan `\input{...}` / `\include{...}` whose target file isn't
bundled. pdflatex aborts hard on a missing \input, killing the compile
for the whole paper even when the body is fine (paper_005 cs_CC had
`\input{mypreamble.tex}`).
2. Inject a defensive preamble of `\providecommand` fallbacks for common
author-defined shortcuts (\bbR, \calA, \bfx, \eps, \vvirg, \ootimes,
etc.). Authors typically define these in private preamble files we
don't have; \providecommand is a no-op when the command is already
defined, so the injection is safe blanket coverage.
3. subprocess.run uses bytes (text=False) instead of text=True so the
pdflatex log's non-UTF-8 accent bytes don't blow up Python's decoder
(paper_009 cs_CC had byte 0xaa at offset ~57k).
Changes
- Reviewer3System.build_jobs threads `u.src_corrupted` (full path) and
`cfg["max_pages"]` into the job payload.
- Reviewer3Job adopts `source` + `max_pages` fields; `_submit` / `_ensure_pdf`
forward them.
- `_ensure_pdf` prefers `source` over `paper` for the compile when set;
caches alongside the source with `.trim.pdf` suffix when trimmed.
- `_trim_pages_to` (in-place) and `_maybe_trim_pages` (for already-PDF inputs)
use pymupdf to cap pages.
- `max_pages: 20` added to all 8 `full_*_reviewer3.yaml` configs.
- run_benchmark.py Config gains `max_pages: int | None = None`.
Smoke-validated on three previously-failing cs_CC papers (2604.19872v1 with
missing \input + custom commands, 2604.24325v1 with same pattern, 2604.24879v1
with non-UTF8 bytes in pdflatex output) — all three now produce 20-page
trimmed PDFs in 2–4s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 tasks
…to 1h Two operational fixes after observing the live runs: 1. Conference Reviewer3Adapter wasn't honoring max_pages — it called _submit(pdf) directly without forwarding max_pages, so R3 received the full PDF (sometimes 50+ pages, 5+ MB) and a chunk of those tripped R3's HTTP 413 limit. Adapter now reads `max_pages` from the top-level config (falling back to reviewer3_options.max_pages) and threads it through to _submit -> _ensure_pdf -> _maybe_trim_pages. 2. poll_timeout_s bumped from 1800 (30 min) to 3600 (60 min) in all 8 perturbation configs and conference reviewer3.yaml. Observed wall time per paper under 10-concurrent load was routinely 25-40 min, with a long tail past 30 — causing dozens of false-timeout failures even though R3 was still processing. The session remains live on R3's side regardless, but the adapter abandoned them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three layered fixes for the remaining pdflatex-aborts:
1. `_force_known_documentclass` — rewrite \documentclass[opts]{X} to
\documentclass{amsart} when X.cls isn't installed (kpsewhich miss).
Many papers use journal classes not in TeX Live (aq-amsart, sn-jnl,
atlasdoc, aastex631/701, cas-sc, iopjournal, svjour3, ieeeconf,
informs3, revtex4 without -1/-2). amsart is the math-friendly fallback,
preserving theorem/lemma envs.
Regex now matches only the **first uncommented** \documentclass so
commented-out example lines (e.g. q-bio.GN papers carry
`%%\documentclass[...]{sn-jnl}` followed by the active variant) don't
short-circuit the lookup.
2. `_strip_missing_packages` — same idea for \usepackage{X} when X.sty
isn't installed. pdflatex aborts hard on a missing package too; the
common offender on hep-ex papers is `\usepackage{jinstpub}`. Comment
them out; the body's references to missing-package commands degrade
to undefined-control-sequence warnings (pdflatex in nonstopmode still
produces a usable PDF).
3. Rescue preamble now runs inside \AtBeginDocument{...} so its
\providecommand falls AFTER all \usepackage{...} loads. Previously,
the rescue defined \Bbbk before amsfonts loaded, then amsfonts
`\DeclareSymbolFont` errored with "Command \Bbbk already defined".
Spot-test across 8 domains: 7 of 8 sample papers compile cleanly. The
holdout is q-bio.GN paper_001 (uses sn-jnl class with extensive class-
specific commands that don't degrade gracefully); 9 of 10 q-bio.GN papers
use bundled classes and work. Expected pdflatex-failure cells dropped
from 39 to ~3 (one paper × three error types).
kpsewhich results are cached per process for the common-class /
common-package set so repeated rewrites across the 222-cell run pay the
subprocess cost once per (class|package).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…waste Adds `.sid` file alongside each review output. On run: - If `.sid` exists and points to a still-valid R3 session, we POLL that session instead of submitting a new one. Avoids the duplicate-session credit waste observed when runs were killed mid-poll (~34% of credits spent on duplicates across yesterday's runs, per the rescue audit). - If the sid fetch hard-fails (403 "not found" / 404), drop the stale file and submit fresh. - On submit success: write the sid IMMEDIATELY so a SIGKILL between submit and first poll-tick still leaves a recovery path. - On any failure: keep the sid file — next run resumes the same session. - On success: leave the sid file in place as an audit trail (cheap; the out_json's presence is the real "done" marker for skip-completed). Perturbation: - `Reviewer3Job` gains `sid_file: Path | None`. `Reviewer3System.build_jobs` computes `<review_dir>/<stem>.sid` per cell and threads it through. - `_run_one` handles the resume vs submit branch. Conference: - `run_competitors.py` injects `cfg["_sid_file"]` next to the merged paper JSON (under `<results>/.sids/<slug>.<method_key>.sid`). - Conference `Reviewer3Adapter.review()` honors the underscore-prefixed key and persists/resumes the same way. Also drops the conference `max_per_model: 1` back to 5 after confirming R3's throttle from yesterday has lifted (single-paper probe transitioned waiting -> processing). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
run_benchmark.py) and the conference outcomes study (run_competitors.py) so it can be benchmarked on the exact same paper sets as openaireview and coarse.pdflatexbefore submission; cfg knobs (review_mode, poll_*) thread from YAML throughReviewer3Systeminto the adapter via job payload; new comment fields (citedText, title, severity 1–4) get normalized into our schema.Reviewer3Adapterreuses the perturbation HTTP submit/poll/normalize helpers (no duplication); fixed method keyreviewer3__reviewer3since R3 has no model selector.benchmarks/perturbation/_severity.py— perturbation adapter,compute_auc.py, andreport_scaleup.pynow share one source of truth (collapses three inlined copies ofCOARSE_SEVERITY_MAP).full_<domain>_reviewer3.yaml) mirror the matchingfull_<domain>_coarse.yamlknobs for max_tokens / min_perturbations.requests>=2.31added to[project.optional-dependencies] benchmarks;.env.exampledocumentsREVIEWER3_API_KEY/REVIEWER3_USER_ID(UUID from the vendor's web UI session JSON, not an email).Test plan
run_benchmark.py(10 min wall, 11 comments, recall 10/17 on LLM judge).--dry-runlists all 197 v2 papers without submitting.run_benchmark.load_config(...)and infersystem=reviewer3correctly.Operational notes (not in the diff)
conference_study/{manifests,papers,results}are gitignored data dirs that live only in the sibling worktree. To run conference_study locally, symlink them in:run_competitors.pydoesn't load.env(onlyrun_benchmark.pydoes). ExportREVIEWER3_*before launching, or adddotenv.load_dotenv()at the top ofrun_competitors.pyin a follow-up.🤖 Generated with Claude Code