
Tighten standard playwright_webarena verify scripts + smart-quote normalization#254

Open
Cierra0506 wants to merge 30 commits into pin-all-versions from dlx/verify


Conversation

@Cierra0506
Collaborator

Summary

A bundle of fixes to standard playwright_webarena tasks accumulated on dlx/verify, bringing 30 commits' worth of verify / task-description tightening to pin-all-versions.

Highlights:

  • Tighten verify logic for 14 standard shopping / shopping_admin tasks and several reddit tasks (advanced_product_analysis, gaming_accessories_analysis, multi_category_budget_analysis, printer_keyboard_search, running_shoes_purchase, health_routine_optimization, holiday_baking_competition, customer_segmentation_setup, fitness_promotion_strategy, marketing_customer_analysis, ny_expansion_analysis, products_sales_analysis, sales_inventory_analysis, search_filtering_operations).
  • Unicode smart-quote normalization in 21 standard verify scripts so model output containing ’ ‘ ” “ matches ASCII label.txt values.
  • Bug fix: the icon-only vote button selector in reddit/budget_europe_travel/verify.py — switched from button:has-text("Retract upvote") to button[title="Retract upvote"], since Postmill renders the vote button with no text node.
  • Description ↔ label alignment in shopping_admin/sales_inventory_analysis/description.md — the description now spells out the full Magento timestamp format for the Customer Since field.
  • Misc: route MCP_MESSAGES: log line to stderr in 9 standard verify scripts; handle ™ HTML entity in fitness Bestseller name; add gpt-5.5 model entry + xhigh reasoning-effort choice in pipeline / model_config.

Test plan

  • AST-parse every modified verify.py (already validated locally)
  • Re-run failed task traces from results/pw_debug/ against the updated verifies to confirm previously-incorrect failures now classify correctly
  • Spot-check label.txt comparisons still pass for known-good traces (no false negatives from normalization)

🤖 Generated with Claude Code

Cierra0506 and others added 30 commits May 1, 2026 00:53
- extraction_table: drop numeric-column quotes in data.csv and example,
  load data.csv as ground truth and match rows by set, allow flexible
  column order in model output, scan fallback now requires all 5 headers
- cloudflare_turnstile_challenge: remove redundant note line duplicating
  earlier steps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- verify.py: only check the last completed assistant message (was OR'ing
  across all messages), align content extraction with extraction_table
  (filter text/output_text items instead of joining all content blocks)
- description.md: spell out the strict 4-digit-year output format

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
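The last-message-only extraction described above can be sketched as follows, assuming each message object carries `role`, `status`, `type`, and a `content` list of typed items (field names taken from later commits in this PR; the exact trace schema is the harness's):

```python
def get_model_response(messages: list) -> str:
    # Walk backwards: only the last completed assistant message counts,
    # instead of OR'ing the check across every message in the trace.
    for msg in reversed(messages):
        if (msg.get("role") == "assistant"
                and msg.get("status") == "completed"
                and msg.get("type") == "message"):
            # Keep only text/output_text items rather than joining
            # every content block.
            return "".join(
                item.get("text", "")
                for item in msg.get("content", [])
                if item.get("type") in ("text", "output_text"))
    return ""
```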
- description.md: explicitly require the v1 (initial) version of the paper
  so model output aligns with the v1-sourced ground truth in content.txt
- verify.py: only check the last completed assistant message and align
  content extraction with extraction_table; replace strict whitespace-only
  normalize + equality with markdown/unicode-aware normalize plus difflib
  similarity (threshold 0.9), so bullet style, bold markers, casing, and
  small punctuation drift no longer cause false negatives

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
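The similarity gate can be sketched with stdlib `difflib` (threshold 0.9 as in the commit; the markdown/unicode-aware normalization step that runs first is elided here):

```python
import difflib

def similar_enough(model_text: str, expected_text: str,
                   threshold: float = 0.9) -> bool:
    # Ratio is in [0, 1]; small punctuation or bullet-style drift keeps
    # the ratio high, so it no longer causes a false negative.
    ratio = difflib.SequenceMatcher(None, model_text, expected_text).ratio()
    return ratio >= threshold
```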
The DeepSeek R1 v1 paper's last section is titled "Conclusion, Limitations,
and Future Work", not just "Conclusion". Update the description so the
model extracts the full section that content.txt was derived from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove broken smart-quote replace block in normalize_text (Python parsed
  it as a triple-quoted string + same-char no-op replaces; never normalized
  anything)
- Hard-fail when label.txt is missing instead of silently degrading to a
  "fields are non-empty" check that lets any 7-field submission pass
- Drop the redundant Deeplearning_Post_Count special-case validation; the
  generic numeric-field loop already covers all three count fields
- Replace `if "expected_data" in locals()` introspection with the now
  unconditional expected_data variable
- Add missing f-string prefix on the missing-keys error so it actually
  prints which keys are missing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
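A minimal sketch of the hard-fail and f-string fixes above (path handling and field names are illustrative, not the task's exact code):

```python
import os
import sys

def load_expected_answer(label_path: str) -> dict:
    # Hard-fail when label.txt is missing instead of silently degrading
    # to a "fields are non-empty" check.
    if not os.path.exists(label_path):
        # The f-prefix matters: without it the literal "{label_path}"
        # would be printed instead of the path.
        print(f"ERROR: label file not found: {label_path}", file=sys.stderr)
        sys.exit(1)
    expected_data = {}
    with open(label_path, encoding="utf-8") as f:
        for line in f:
            key, sep, value = line.strip().partition("|")
            if sep:
                expected_data[key] = value
    return expected_data
```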
- Wiki title check: switch substring `in` to exact `==` after stripping,
  matching the strict equality used for forum title/description/sidebar
- Step 5 upvote check: drop the "any vote count >= 1" fallback that would
  pass on any pre-seeded postmill data regardless of user actions; only
  the "Retract upvote" button reliably signals the current user upvoted
- Remove dead normalize_text function (never called; also contained a
  broken smart-quote replace block similar to ai_data_analyst)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove broken smart-quote replace block in normalize_text (Python parsed
  the lines as a triple-quoted string + same-char no-op replaces; never
  normalized anything)
- Drop the redundant "upvotes are descending" check; it's mathematically
  implied by the per-field equality comparison against label.txt (which
  is itself in descending order), so it can only fire alongside earlier
  errors and adds noise
- Drop the now-misleading "Posts ordered by upvotes (descending)" success
  message that claimed a check we no longer perform

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- label.txt: correct Total_LLM_Posts from 9 to 8 (verified against the
  postmill MachineLearning forum first page; only 8 posts contain
  GPT/ChatGPT/LLM as a case-insensitive substring)
- Replace Top1/2/3_Date fields with Top1/2/3_Author across description,
  label.txt, and verify.py — author names are stable while "X years ago"
  drifts as the snapshot ages
- Remove broken smart-quote replace block in normalize_text (Python parsed
  it as a triple-quoted string + same-char no-op replaces)
- Hard-fail when label.txt is missing instead of silently degrading to a
  "fields are non-empty" check
- Drop the redundant Total_LLM_Posts special-case validation; the generic
  per-field comparison loop already covers it
- Replace `if "expected_data" in locals()` introspection with the now
  unconditional expected_data variable
- Drop the descending-order check on top 3 upvotes; mathematically implied
  by per-field equality against label.txt
- parse_key_value_format: drop the colon-separator fallback (label is
  pipe-only and the fallback was secretly accepting non-conforming model
  output) and accept unicode bullet `•` alongside `-` and `*`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
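The parser change can be sketched as follows (pipe-only separator with `-`, `*`, and unicode `•` bullets accepted; the helper name matches the commit, the body is illustrative):

```python
import re

def parse_key_value_format(text: str) -> dict:
    # Strip an optional leading bullet (-, *, or unicode •), then split
    # on '|' only — no colon fallback, so non-conforming model output
    # is rejected instead of silently accepted.
    data = {}
    for raw in text.splitlines():
        line = re.sub(r"^[-*\u2022]\s*", "", raw.strip())
        key, sep, value = line.partition("|")
        if sep:
            data[key.strip()] = value.strip()
    return data
```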
- Remove broken smart-quote replace block in normalize_text (Python parsed
  it as a triple-quoted string + same-char no-op replaces); the working
  &amp;-decode and whitespace collapse are kept
- Hard-fail when label.txt is missing instead of silently degrading to a
  "fields are non-empty" check
- Drop the redundant Total_Year_Posts special-case validation; the generic
  per-field comparison loop already covers it
- Replace `if "expected_data" in locals()` introspection with the now
  unconditional expected_data variable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop Total_NBA_Posts field entirely (description / label / verify);
  the count had irreconcilable semantics — postmill search is whole-word
  and site-wide, but the Top 5 list expected substring matches scoped to
  the sports forum (incl. WNBA-titled posts), making the original 20 vs
  any model's interpretation always disagree
- description: replace "search for posts containing 'NBA' in their titles"
  with "browse its posts to find posts whose titles contain 'NBA'", which
  matches what's actually achievable (no forum-scoped search) and lets
  substring catch WNBA naturally
- label.txt: fix Top3_Title — `tonight|68,323` → `tonight: 68,323` to
  match the actual postmill page rendering
- verify.py:
  - Hard-fail when label.txt is missing
  - Drop the redundant Total_NBA_Posts special-case validation
  - Remove `if "expected_data" in locals()` introspection and the
    unreachable basic-validation else branch
  - Remove broken smart-quote replace lines in normalize_text (kept the
    working unicode-escape apostrophe replacements since label has U+2019)
  - parse_key_value_format: drop the unused `#`-comment-line skip and
    add `*` bullet support to align with sibling tasks
  - Submission body locator now keys on Top1_Title instead of the deleted
    Total_NBA_Posts marker
  - Fix KeyError in success print that still referenced
    extracted_data['Total_NBA_Posts'] (would have failed any compliant
    submission via the broad except Exception clause)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CartTotal: drop the cosmetic startswith("$") check; the existing strip
  step already removes $ and , before comparing, so a numerically-correct
  amount written without the dollar sign was being rejected for no reason
- LatestReviewer: replace the bidirectional substring check with a
  case-insensitive exact match. The old check passed empty strings and
  single characters as long as one was a substring of the other, which
  let unfilled or wildly wrong values slip through

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
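The substring-hole fix can be sketched as (helper name illustrative):

```python
def name_matches(model_value: str, expected: str) -> bool:
    # Old check: expected in model or model in expected — an empty
    # model value is a substring of anything, so unfilled fields passed.
    # New: case-insensitive exact match after strip().
    return model_value.strip().lower() == expected.strip().lower()
```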
- label.txt: correct Products70Plus from 7 to 6. Verified directly via
  the shopping container (port 7770): Video Games category at
  default 12 products/page has 5 products with rating >= 70% on page 1
  and 1 on page 2. The original 7 only matches if you switch to
  24/page first, which the description never instructs
- description.md: clarify that products without any rating do not count
  toward the threshold count
- verify.py:
  - Drop the cosmetic startswith("$") check on price fields
    (CheapestReviewedPrice, N64Subtotal); the strip step already
    handles $ and , so a numerically-correct value without $ was
    being rejected for no reason
  - Drop the +-2 tolerance window on Products70Plus. The data is
    static seed data, the rationale ("dynamic content") is bogus,
    and the tolerance had been hiding the wrong label (anything in
    [5,9] was passing for label=7, so neither the correct 6 nor the
    wrong 7 ever surfaced as a mismatch)
  - Switch ComparisonCount and ShippingMethods to int comparison
    so "02" or whitespace variants don't fail a numerically-correct
    answer
  - Add ShippingState to the case-insensitive exact-match branch
    alongside CheckoutEmail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the cosmetic startswith("$") check on the four price fields
  (Battery1Price, Battery2Price, InitialSubtotal, FinalSubtotal); the
  strip step already removes $ and , before comparing, so a numerically
  correct amount written without $ was being rejected for no reason
- Switch the six count fields (AdvancedSearchResults, ComparisonCount,
  TeaReviews, CartUniqueProducts, CartTotalQuantity, TeaRating) to int
  comparison so "02" / whitespace / a trailing % don't fail a
  numerically-correct answer; TeaRating gets a `replace("%", "")` so
  both "95" and "95%" compare as 95

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
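The numeric coercions used across these commits can be sketched as two small helpers (names illustrative):

```python
def as_count(value: str) -> int:
    # "02", " 2 ", and "95%" all compare as their numeric value.
    return int(float(value.strip().rstrip("%")))

def as_price(value: str) -> float:
    # "$1,234.50" and "1234.5" compare equal as floats.
    return float(value.strip().lstrip("$").replace(",", ""))
```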
- description.md: clarify TotalCartItems is the sum of all product
  quantities, not the number of distinct line items
- verify.py:
  - Drop the cosmetic startswith("$") check on the three price-bearing
    fields (CartSubtotalAfterUpdate, the price portion of
    CheapestChocolatePriceReviews, the price portion of
    Page2ThirdProductSKUPrice); the strip step already handles $ and ,
    so a numerically correct value without $ was being rejected
  - Drop the 0.01 price tolerance — these prices come from Magento's
    rendered string, no float arithmetic happens on either side, so the
    tolerance only ever masked wrong answers like $72.55 vs $72.56
  - Switch the count/numeric pieces to int comparison: TotalCartItems,
    the reviews piece of CheapestChocolatePriceReviews, and the rating
    piece of HighestRatedCookieSKURating (with `%` stripped)
  - Make the SKU portions of HighestRatedCookieSKURating and
    Page2ThirdProductSKUPrice case-insensitive to match
    SecondGingerbreadSKU's behaviour

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o advanced_product_analysis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - drop step 2 "total Revenue" line — dashboard chart is disabled so
    Revenue is fixed at $0.00 with no signal
  - step 3 switch from "search by name" to "open the product page linked
    from the Bestsellers row" — the dashboard row has an implicit
    product-edit link that uniquely identifies one simple SKU, removing
    the ambiguity for configurable products like Sprite Stasis Ball 65 cm
    that have 3 same-name variants
  - step 3 rename vague "Current inventory quantity" to "Salable Quantity"
    so the model reads the actual remaining-stock field, not the static
    source qty (always 100 in seed data)
  - step 4 specify Cart Price Rules path explicitly and replace the
    misleading "rules that might apply to fitness/yoga products" with a
    precise rule signature: "percentage discount on the entire order
    (not tied to a specific product)" — disambiguates the 20% $200+ rule
    from the 70% Luma water bottle rule
  - step 6 drop "other" from "how many other customers are in the same
    group" — label counts the customer himself, so off-by-one is removed
  - schema/example: drop TotalRevenue, rename inventory→salable_quantity,
    align None:0 wording across step 2/example/label, replace
    email@example.com placeholder with <customer email> to avoid
    seeding example.com into model output (the original label.txt
    sarah.miller@example.com was likely fabricated this way)

- label.txt:
  - drop TotalRevenue|$0.00
  - Bestseller salable_quantity 100 → 93/93/88 (verified live)
  - TopCustomer email sarah.miller@example.com → helloworld@yahoo.com
    (Sarah Miller's real email in the customer table; the previous value
    did not exist anywhere in the database)
  - BestsellerInSearch No:0 → None:0 (align with description/example)

- verify.py:
  - get_model_response: add type=='message' filter (matches the
    customer_segmentation_setup tightening)
  - send "MCP_MESSAGES:" log to stderr like the rest of the prints
  - drop dead branches for TotalRevenue, MinimumPurchaseRule,
    LowestInventoryProduct, MostRecentOrderDate (none are in label.txt)
  - compare_answers tightened along the P1–P6 patterns established by
    earlier shopping/* commits:
    - Bestseller price → float() compare so "27" matches "$27.00"
    - Bestseller quantity → int() compare
    - Bestseller salable_quantity: drop ±0.0001 tolerance, keep float
      compare to handle Magento's "93.0000" rendering
    - PercentageDiscountRule percent → float() compare
    - PercentageDiscountRule rule name + TopCustomer name → case-insensitive
      so all name-class fields behave consistently
    - new branch for ActiveRulesCount / TotalOrders / SameGroupCustomers
      using int() compare; MostRecentOrderID stays in the default string
      branch because its leading zeros (000000299) are significant
  - Bestseller1/2/3 are now compared as a set via a new
    _normalize_bestseller helper — the three lines may be listed in any
    order (the dashboard ordering is non-obvious and the three tied
    bestsellers can equally well be listed under any Bestseller{1,2,3} key)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Magento product detail pages may render the trademark symbol as the
HTML entity &trade; rather than the unicode ™. Normalize &trade; → ™
in _normalize_bestseller so a model that copies "Quest Lumaflex&trade;
Band" from the page still matches label's "Quest Lumaflex™ Band".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
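The order-insensitive comparison plus entity normalization from the two commits above can be sketched as follows, assuming a parsed key→value dict on each side (`_normalize_bestseller` is the commit's helper name; the body here is illustrative):

```python
def _normalize_bestseller(value: str) -> str:
    # Magento may render the trademark symbol as the HTML entity
    # &trade; rather than the unicode character.
    value = value.replace("&trade;", "\u2122")
    return " ".join(value.split()).lower()

def bestsellers_match(model: dict, expected: dict) -> bool:
    # The three tied bestsellers may be listed under any
    # Bestseller{1,2,3} key, so compare the normalized values as a set.
    keys = ("Bestseller1", "Bestseller2", "Bestseller3")
    return ({_normalize_bestseller(model.get(k, "")) for k in keys}
            == {_normalize_bestseller(expected.get(k, "")) for k in keys})
```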
- description.md:
  - clarify step 6 ambiguous "best-selling and most expensive" — explicitly
    say "Among the products tied for the highest sales quantity in the
    Bestsellers table, identify the one with the highest price" so the
    label TopProduct = Sprite Stasis Ball 65 cm:6 falls out
    deterministically (the literal "both top-quantity AND top-priced"
    reading would intersect to nothing — Overnight Duffle is the
    most-expensive overall but only sold 5 units)

- verify.py:
  - get_model_response: add type=='message' filter and send the
    "MCP_MESSAGES:" log to stderr (matches customer_segmentation_setup)
  - verify main flow: drop the silent "Will proceed with browser
    verification only" fallback — hard-fail when model_response is
    missing or the <answer> block can't be parsed
  - compare_answers tightened along the P1–P6 patterns established by
    earlier shopping/* commits:
    - CouponCodes: replace the substring check
      ("H20" not in mv or "Luma water bottle" not in mv) — which passed
      any model output containing those two strings anywhere — with a
      proper split-and-set comparison: code case-sensitive, rule name
      case-insensitive (P4 fix)
    - Top2SearchTerms: split each "term:count" pair, compare as
      (term.lower(), int(count)) tuple set so order/case/spacing don't
      fail a numerically-correct answer
    - ZeroResultTerm: split "term:count" — term lower, count as int
    - EmailVerification: build dict of {email.lower(): status.lower()}
      so "Yes" / "YES" don't fail an otherwise correct answer
    - TopProduct: split "name:quantity" — name case-insensitive, qty as
      int (handles "6.0" / "06" variants)
    - new branch for TotalSearchTerms / ActiveRulesCount / SubscribedCount
      using int() compare
    - new branch for TotalRevenue: strip $ and , then compare as float
      so "$0.00" matches "0" / "0.00"
    - malformed model entries (missing ':', non-numeric count, etc.) now
      add specific mismatch messages instead of silently coercing to
      None tuples

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
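The "term:count" comparisons above can be sketched as (helper name illustrative):

```python
def term_count_matches(model_value: str, expected_value: str) -> bool:
    # Split on the last ':' so terms containing ':' still parse;
    # compare as (term.lower(), int(count)) so case, spacing, and
    # "06" / "6.0" variants don't fail a numerically correct answer.
    def parse(value: str):
        term, sep, count = value.strip().rpartition(":")
        if not sep:
            raise ValueError(f"malformed entry: {value!r}")
        return term.strip().lower(), int(float(count))
    try:
        return parse(model_value) == parse(expected_value)
    except ValueError:
        return False
```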
- description.md:
  - rename Second_Bestseller_* schema keys → Cheap_Bestseller_* to match
    description's "lowest price bestseller" semantics (the "Second" prefix
    was a leftover from an earlier design)
  - drop step 3's "compare CA tax with NY" subtask and the
    Higher_Tax_State field — the comparison was not adding signal beyond
    NY_Tax_Rate / CA_Tax_Rate themselves; CA_Tax_Rate is kept as a
    reference data point
  - rewrite step 4 to point at "Stores → Settings → Order Status"
    explicitly (the previous "Filter orders to show only statuses..."
    wording sent models to Sales → Orders) and clean up the awkward
    "if exists one has the status code 'processing'" phrasing
  - rewrite step 6 first sub-item: explicitly say "check whether its
    'Pickup Location' column is 'Enabled' or 'Disabled'" — the original
    "is currently 'Enabled' or shows as 'Disabled' for Pickup Location"
    was easy to misread as the Is Enabled column (which is what the
    label originally encoded)
  - align Default_Source_State schema/example wording (was a mix of
    "state_or_none" / "State or None" / "Yes or No"); description step 6
    asks "Yes or No", so schema/example now match

- label.txt:
  - Total_States_With_Tax 2 → 3 (admin actually has CA, MI and NY tax
    rate rows; the original 2 missed Michigan)
  - Default_Source_Pickup_Status Enabled → Disabled (Pickup Location
    column on the Default Source row is Disabled; the original Enabled
    was the Is Enabled column instead)
  - Default_Source_State No → Yes (Address Data section on Edit Source
    has a 'State/Province' field — region_id select after Country)
  - rename Second_Bestseller_* keys → Cheap_Bestseller_*
  - drop Higher_Tax_State row

- verify.py:
  - get_model_response: send "MCP_MESSAGES:" log to stderr, add
    type=='message' filter on the assistant message scan
  - compare_answers tightened along the P1–P6 patterns from earlier
    shopping/* commits:
    - Lifetime_Sales_Amount / Cheap_Bestseller_Price / Dashboard_Revenue:
      switch from string compare to float() so "$0.00" matches "0" /
      "0.00"
    - Cheap_Bestseller_Quantity / Total_States_With_Tax /
      Number_Of_Websites: int(float()) compare so "04" / "4" / "4.0"
      don't fail
    - Cheap_Bestseller_Name / Default_Source_Pickup_Status /
      Main_Store_Code: case-insensitive compare
    - Default_Source_State joined the Yes/No case-insensitive branch
      (the previous 'None' → '' normalization was a leftover from when
      the field was state-name-or-none)
    - drop dead Empty_Rows_Yes_Effect / Order_Status_Options /
      Chart_Disabled_Message branches — none of these keys are in
      label.txt or the schema

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - step 2 sub-item 1: "Search for all products containing 'Yoga' in
    their name" → "Filter by Name containing 'Yoga'"; the original
    wording sent models to the keyword search box, which is fulltext
    over name+description+sku and returns 281 hits — only the Name
    column filter returns the 172 products whose name actually contains
    Yoga
  - step 2 sub-item 2: "Clear the search" → "Clear all filters" so it
    matches the filter terminology used in sub-item 1
  - step 3 sub-item 1: rewrite the ambiguous "lowest price and lowest
    quantity" — taken literally that's the intersection of two singletons
    and is empty (cheapest = $14 Sprite Yoga Strap with qty=6,
    lowest-qty = qty=5 with $23 Sprite Stasis Ball 55 cm or $45
    Overnight Duffle). The label encoded "qty-lowest tier, then
    cheapest within it", so the description now says exactly that:
    "Among the products tied for the lowest sales quantity, identify
    the one with the lowest price"
  - step 3 sub-item 2: rename schema/example key
    QuestLumaflexQuantity → SecondCheapestQuantity; the original key
    name leaked the answer (Quest Lumaflex) — the model could fill the
    value without ever reading the dashboard
  - step 4 typo: "Father" → "Gather"
  - example output: SarahMillerEmail|email@example.com → <customer email>
    so the placeholder doesn't push models toward example.com (the
    same trap that produced sarah.miller@example.com in fitness)

- label.txt:
  - YogaProducts 171 → 172 (verified live: Name filter "Yoga" returns
    exactly 172 enabled products)
  - LowestProduct: drop the spurious " foot" in the product name
    ("Sprite Stasis Ball 55 cm foot" → "Sprite Stasis Ball 55 cm" —
    the original was a copy/paste artifact mixing two product names)
  - rename QuestLumaflexQuantity → SecondCheapestQuantity
  - TotalCustomers 72 → 70 (the original count included two test
    customers created by the marketing_customer_analysis task; baseline
    state has 70)

- verify.py:
  - get_model_response: send "MCP_MESSAGES:" log to stderr, add
    type=='message' filter on the assistant message scan
  - rename QuestLumaflexQuantity → SecondCheapestQuantity in
    expected_keys
  - compare_answers tightened along the P1–P6 patterns:
    - LowestProduct: name compared case-insensitively, qty as int
      (was strict string compare on both)
    - WH11Price / DashboardRevenue: switch from string compare on the
      stripped value to float() so "$54.00" matches "54" / "54.0"
    - new branch for YogaProducts / ZeroQuantityProducts /
      SecondCheapestQuantity / TotalCustomers / PendingOrders using
      int(float()) compare
    - GraceNguyenOrderID stays in the default string branch because
      its leading zeros (000000189) are significant

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - step 2 sub-item 2: "Clear the search" → "Clear all filters" so the
    terminology matches sub-item 1 (which uses Filter)
  - step 3 sub-item 2: rewrite "Find Grace Nguyen's Complete and the
    most cheap order" — the original was missing words and used "the
    most cheap" instead of "cheapest"; new wording is "Find Grace
    Nguyen's order with Complete status and the lowest price"
  - step 4 sub-item 1: rewrite "the product with most quantity but and
    lowest price" — the "but and" is a grammar error and the literal
    "max-qty AND min-price" is the empty intersection. Use the same
    two-step pattern as marketing_customer_analysis / products_sales_analysis:
    "Among the products tied for the highest sales quantity in the
    Bestsellers table, identify the one with the lowest price"
  - step 5 sub-item 1: "with its email address containing" → "whose
    email address contains" (grammar)

- verify.py:
  - get_model_response: send "MCP_MESSAGES:" log to stderr
  - drop dead Position2Product branch (no such key in label or schema)
  - compare_answers tightened along the P1–P6 patterns:
    - WS12Info: name compared case-insensitively, price as float() so
      "$22.00" matches "22" / "22.0"
    - HighestOrderInfo: customer compared case-insensitively, amount as
      float()
    - new CheapProduct branch: name case-insensitive, qty as int (was
      falling through to the strict default branch)
    - OvernightDufflePrice: switch from string compare on the stripped
      value to float()
    - HollisterPosition: now also strip() before lowercasing
    - SarahMillerInfo: replace the bidirectional substring date check
      ("expected_date in model_date or model_date in expected_date" — a
      classic P4 hole that lets empty/short strings pass) with a
      case-insensitive exact match after strip()
    - Invoice002BillTo: case-insensitive
    - new branch for SpriteProducts / Quantity100Products / PendingOrders
      / CostelloCustomers / PaidInvoices using int(float()) compare
    - GraceOrderID stays in the default string branch (its leading zeros
      "000000114" are significant); also drop the redundant
      startswith("000") gate since every Magento order ID is 000-padded

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - drop the Position3Bestseller sub-item from step 4 — it duplicated
    the bestsellers-table inspection covered by other tasks and was
    just one more thing the model could fail on; schema and example
    keys removed in lockstep
  - step 4 sub-item 2: fix grammar in "with the both the highest
    number of results and uses" → "with both the highest..."; tidy
    "record name and the number of results" → "record its name and
    number of results". The literal AND reading is fine here because
    label data is unambiguous (Antonia Racer Tank wins on uses among
    the results=23 tier)
  - step 5 header: drop the awkward self-reference "in step 2" in
    favour of a direct path "Marketing → Search Terms"
  - step 5 sub-item 2: spell out the tie-break — "Sort by 'Results'
    (ascending), then by 'Uses' (ascending)". Two terms in the grid
    have Results=1 (hollister with 19 uses, WP10 with 1 use); without
    the secondary sort, "first non-zero" was ambiguous and the model
    could land on hollister and fail against label WP10:1
  - step 5 sub-item 3: drop the misleading "unique" — Search Terms
    grid rows are already unique terms, the word made models think
    they had to dedupe across Top/Last Search Terms
  - step 2 sub-item 4 / step 4 sub-item 1: align singular/plural —
    "Find the search term ... record its name" was inconsistent with
    "(record them all)"; both items now say "Find the search terms ...
    record their names"
  - schema/example: change the multi-entry separator from '|' to ';'
    so the three layers of delimiters are visually distinct (newline
    between rows, '|' between key/value, ';' between entries, ':'
    between term and count). Previously the row "OneResultTerm|hollister:19|WP10:1"
    used '|' for two different things, which read awkwardly

- label.txt: same '|' → ';' separator change for the two multi-entry
  rows (Results20to30Term, OneResultTerm)

- verify.py: full rewrite from a single 308-line monolith with
  hardcoded expected values into the standard 4-helper layout used by
  the other shopping_admin tasks:
  - new get_model_response with stderr logging and the
    role+status+type triple filter (and tolerant of both 'text' and
    'output_text' content items)
  - new parse_answer_format with strict line-count and missing-key
    checks
  - new load_expected_answer that reads label.txt instead of
    hardcoding values; this means future label updates don't need a
    parallel verify.py edit
  - compare_answers along the P1–P6 patterns:
    - int branch for TankSearchCount, ZeroResultsCount, Hits15PlusCount,
      DefaultStoreViewCount, TotalUniqueTerms (was string compare via
      .isdigit gate)
    - "term:count" branch for HighestUseTerm, ID10to15MaxResults,
      HighestResultLastSearch, TopUseTerm, FirstNonZeroResult — term
      case-insensitive, count as int (was strict string)
    - multi-entry set branch for Results20to30Term and OneResultTerm
      using the new ';' separator: parse each entry into
      (term.lower(), int(count)) tuples and compare as a set. This
      replaces the old `if not any(val in extracted for val in valid_xxx)`
      substring check, which was a P4 hole — the model could pass with
      just one of the two entries, or with extra junk after a valid
      entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
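The multi-entry set comparison with the new ';' separator can be sketched as:

```python
def parse_entry_set(value: str) -> set:
    # "hollister:19; WP10:1" -> {("hollister", 19), ("wp10", 1)}.
    # Comparing full sets closes the old P4 hole where a single valid
    # entry, or junk trailing a valid entry, passed a substring check.
    entries = set()
    for entry in value.split(";"):
        term, sep, count = entry.strip().rpartition(":")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        entries.add((term.strip().lower(), int(count)))
    return entries
```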
Cosmetic alignment: every other print in get_model_response already uses
file=sys.stderr; the very first print of the messages-file path was the
only line still going to stdout. With the verifier framework redirecting
stdout to result fields, that path log was leaking into the wrong
channel.

Affected:
- playwright/standard/eval_web/cloudflare_turnstile_challenge
- playwright_webarena/standard/shopping/advanced_product_analysis
- playwright_webarena/standard/shopping/gaming_accessories_analysis
- playwright_webarena/standard/shopping/health_routine_optimization
- playwright_webarena/standard/shopping/holiday_baking_competition
- playwright_webarena/standard/shopping/multi_category_budget_analysis
- playwright_webarena/standard/shopping/printer_keyboard_search
- playwright_webarena/standard/shopping/running_shoes_purchase
- playwright_webarena/standard/shopping_admin/customer_segmentation_setup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reddit/postmill renders straight apostrophes (') and quotes (") as smart
quotes (’ ‘ ” “) in post titles and bodies; some shopping report values
may contain them too. label.txt always uses ASCII, so comparisons failed
on otherwise-correct outputs (e.g. "child's jacket" vs "child’s jacket").

Each verify.py now defines `normalize_text` that maps the four smart
quote characters to ASCII and collapses whitespace. It is applied either
at parse time (shopping / shopping_admin and a few reddit) or at
comparison time (the 5 reddit tasks that already had it), keeping both
sides symmetric. Only one file's behavior is unchanged, since its
hardcoded comparisons are ASCII-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
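The shared helper can be sketched as follows (the four characters are U+2018, U+2019, U+201C, U+201D; the body is a minimal illustration of the pattern):

```python
def normalize_text(text: str) -> str:
    # Map the four smart-quote characters to their ASCII equivalents,
    # then collapse runs of whitespace.
    for smart, plain in (("\u2018", "'"), ("\u2019", "'"),
                         ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(smart, plain)
    return " ".join(text.split())
```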
…rify

Postmill renders the vote button as icon-only (no text node), so
button:has-text("Retract upvote") never matched and the upvote step
always failed. Match button[title="Retract upvote"] instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The description's example showed MMM DD, YYYY for SarahMillerInfo, but
label.txt expects the full Magento timestamp including time and AM/PM.
Models that followed the description literally lost a correct point.
Update both the step instruction and the answer template to spell out
the full format; the example timestamp does not leak the real value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Register gpt-5.5 in ModelConfig and accept "xhigh" as a valid
--reasoning-effort choice so pipeline runs can target the new
high-reasoning variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>