
Tighten standard playwright_webarena verify scripts + smart-quote normalization#254

Open
Cierra0506 wants to merge 30 commits into pin-all-versions from dlx/verify


Conversation

@Cierra0506
Collaborator

Summary

A bundle of fixes to standard playwright_webarena tasks accumulated on dlx/verify, bringing 30 commits' worth of verify / task-description tightening to pin-all-versions.

Highlights:

  • Tighten verify logic for 14 standard shopping / shopping_admin tasks and several reddit tasks (advanced_product_analysis, gaming_accessories_analysis, multi_category_budget_analysis, printer_keyboard_search, running_shoes_purchase, health_routine_optimization, holiday_baking_competition, customer_segmentation_setup, fitness_promotion_strategy, marketing_customer_analysis, ny_expansion_analysis, products_sales_analysis, sales_inventory_analysis, search_filtering_operations).
  • Unicode smart-quote normalization in 21 standard verify scripts so model output containing ’ ‘ ” “ matches ASCII label.txt values.
  • Bug fix: the icon-only vote button selector in reddit/budget_europe_travel/verify.py — switched from button:has-text("Retract upvote") to button[title="Retract upvote"], since Postmill renders the vote button with no text node.
  • Description ↔ label alignment in shopping_admin/sales_inventory_analysis/description.md — the description now spells out the full Magento timestamp format for the Customer Since field.
  • Misc: route MCP_MESSAGES: log line to stderr in 9 standard verify scripts; handle ™ HTML entity in fitness Bestseller name; add gpt-5.5 model entry + xhigh reasoning-effort choice in pipeline / model_config.

Test plan

  • AST-parse every modified verify.py (already validated locally)
  • Re-run failed task traces from results/pw_debug/ against the updated verifies to confirm previously-incorrect failures now classify correctly
  • Spot-check label.txt comparisons still pass for known-good traces (no false negatives from normalization)

🤖 Generated with Claude Code

Cierra0506 and others added 30 commits May 1, 2026 00:53
- extraction_table: drop numeric-column quotes in data.csv and example,
  load data.csv as ground truth and match rows by set, allow flexible
  column order in model output, scan fallback now requires all 5 headers
- cloudflare_turnstile_challenge: remove redundant note line duplicating
  earlier steps

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- verify.py: only check the last completed assistant message (was OR'ing
  across all messages), align content extraction with extraction_table
  (filter text/output_text items instead of joining all content blocks)
- description.md: spell out the strict 4-digit-year output format

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
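The last-message-only extraction described above can be sketched as follows, assuming each message object carries `role`, `status`, `type`, and a `content` list of typed items (field names taken from later commits in this PR; the exact trace schema is the harness's):

```python
def get_model_response(messages: list) -> str:
    # Walk backwards: only the last completed assistant message counts,
    # instead of OR'ing the check across every message in the trace.
    for msg in reversed(messages):
        if (msg.get("role") == "assistant"
                and msg.get("status") == "completed"
                and msg.get("type") == "message"):
            # Keep only text/output_text items rather than joining
            # every content block.
            return "".join(
                item.get("text", "")
                for item in msg.get("content", [])
                if item.get("type") in ("text", "output_text"))
    return ""
```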
- description.md: explicitly require the v1 (initial) version of the paper
  so model output aligns with the v1-sourced ground truth in content.txt
- verify.py: only check the last completed assistant message and align
  content extraction with extraction_table; replace strict whitespace-only
  normalize + equality with markdown/unicode-aware normalize plus difflib
  similarity (threshold 0.9), so bullet style, bold markers, casing, and
  small punctuation drift no longer cause false negatives

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
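The similarity gate can be sketched with stdlib `difflib` (threshold 0.9 as in the commit; the markdown/unicode-aware normalization step that runs first is elided here):

```python
import difflib

def similar_enough(model_text: str, expected_text: str,
                   threshold: float = 0.9) -> bool:
    # Ratio is in [0, 1]; small punctuation or bullet-style drift keeps
    # the ratio high, so it no longer causes a false negative.
    ratio = difflib.SequenceMatcher(None, model_text, expected_text).ratio()
    return ratio >= threshold
```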
The DeepSeek R1 v1 paper's last section is titled "Conclusion, Limitations,
and Future Work", not just "Conclusion". Update the description so the
model extracts the full section that content.txt was derived from.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove broken smart-quote replace block in normalize_text (Python parsed
  it as a triple-quoted string + same-char no-op replaces; never normalized
  anything)
- Hard-fail when label.txt is missing instead of silently degrading to a
  "fields are non-empty" check that lets any 7-field submission pass
- Drop the redundant Deeplearning_Post_Count special-case validation; the
  generic numeric-field loop already covers all three count fields
- Replace `if "expected_data" in locals()` introspection with the now
  unconditional expected_data variable
- Add missing f-string prefix on the missing-keys error so it actually
  prints which keys are missing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
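A minimal sketch of the hard-fail and f-string fixes above (path handling and field names are illustrative, not the task's exact code):

```python
import os
import sys

def load_expected_answer(label_path: str) -> dict:
    # Hard-fail when label.txt is missing instead of silently degrading
    # to a "fields are non-empty" check.
    if not os.path.exists(label_path):
        # The f-prefix matters: without it the literal "{label_path}"
        # would be printed instead of the path.
        print(f"ERROR: label file not found: {label_path}", file=sys.stderr)
        sys.exit(1)
    expected_data = {}
    with open(label_path, encoding="utf-8") as f:
        for line in f:
            key, sep, value = line.strip().partition("|")
            if sep:
                expected_data[key] = value
    return expected_data
```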
- Wiki title check: switch substring `in` to exact `==` after stripping,
  matching the strict equality used for forum title/description/sidebar
- Step 5 upvote check: drop the "any vote count >= 1" fallback that would
  pass on any pre-seeded postmill data regardless of user actions; only
  the "Retract upvote" button reliably signals the current user upvoted
- Remove dead normalize_text function (never called; also contained a
  broken smart-quote replace block similar to ai_data_analyst)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Remove broken smart-quote replace block in normalize_text (Python parsed
  the lines as a triple-quoted string + same-char no-op replaces; never
  normalized anything)
- Drop the redundant "upvotes are descending" check; it's mathematically
  implied by the per-field equality comparison against label.txt (which
  is itself in descending order), so it can only fire alongside earlier
  errors and adds noise
- Drop the now-misleading "Posts ordered by upvotes (descending)" success
  message that claimed a check we no longer perform

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- label.txt: correct Total_LLM_Posts from 9 to 8 (verified against the
  postmill MachineLearning forum first page; only 8 posts contain
  GPT/ChatGPT/LLM as a case-insensitive substring)
- Replace Top1/2/3_Date fields with Top1/2/3_Author across description,
  label.txt, and verify.py — author names are stable while "X years ago"
  drifts as the snapshot ages
- Remove broken smart-quote replace block in normalize_text (Python parsed
  it as a triple-quoted string + same-char no-op replaces)
- Hard-fail when label.txt is missing instead of silently degrading to a
  "fields are non-empty" check
- Drop the redundant Total_LLM_Posts special-case validation; the generic
  per-field comparison loop already covers it
- Replace `if "expected_data" in locals()` introspection with the now
  unconditional expected_data variable
- Drop the descending-order check on top 3 upvotes; mathematically implied
  by per-field equality against label.txt
- parse_key_value_format: drop the colon-separator fallback (label is
  pipe-only and the fallback was secretly accepting non-conforming model
  output) and accept unicode bullet `•` alongside `-` and `*`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
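The parser change can be sketched as follows (pipe-only separator with `-`, `*`, and unicode `•` bullets accepted; the helper name matches the commit, the body is illustrative):

```python
import re

def parse_key_value_format(text: str) -> dict:
    # Strip an optional leading bullet (-, *, or unicode •), then split
    # on '|' only — no colon fallback, so non-conforming model output
    # is rejected instead of silently accepted.
    data = {}
    for raw in text.splitlines():
        line = re.sub(r"^[-*\u2022]\s*", "", raw.strip())
        key, sep, value = line.partition("|")
        if sep:
            data[key.strip()] = value.strip()
    return data
```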
- Remove broken smart-quote replace block in normalize_text (Python parsed
  it as a triple-quoted string + same-char no-op replaces); the working
  &amp;-decode and whitespace collapse are kept
- Hard-fail when label.txt is missing instead of silently degrading to a
  "fields are non-empty" check
- Drop the redundant Total_Year_Posts special-case validation; the generic
  per-field comparison loop already covers it
- Replace `if "expected_data" in locals()` introspection with the now
  unconditional expected_data variable

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop Total_NBA_Posts field entirely (description / label / verify);
  the count had irreconcilable semantics — postmill search is whole-word
  and site-wide, but the Top 5 list expected substring matches scoped to
  the sports forum (incl. WNBA-titled posts), making the original 20 vs
  any model's interpretation always disagree
- description: replace "search for posts containing 'NBA' in their titles"
  with "browse its posts to find posts whose titles contain 'NBA'", which
  matches what's actually achievable (no forum-scoped search) and lets
  substring catch WNBA naturally
- label.txt: fix Top3_Title — `tonight|68,323` → `tonight: 68,323` to
  match the actual postmill page rendering
- verify.py:
  - Hard-fail when label.txt is missing
  - Drop the redundant Total_NBA_Posts special-case validation
  - Remove `if "expected_data" in locals()` introspection and the
    unreachable basic-validation else branch
  - Remove broken smart-quote replace lines in normalize_text (kept the
    working unicode-escape apostrophe replacements since label has U+2019)
  - parse_key_value_format: drop the unused `#`-comment-line skip and
    add `*` bullet support to align with sibling tasks
  - Submission body locator now keys on Top1_Title instead of the deleted
    Total_NBA_Posts marker
  - Fix KeyError in success print that still referenced
    extracted_data['Total_NBA_Posts'] (would have failed any compliant
    submission via the broad except Exception clause)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CartTotal: drop the cosmetic startswith("$") check; the existing strip
  step already removes $ and , before comparing, so a numerically-correct
  amount written without the dollar sign was being rejected for no reason
- LatestReviewer: replace the bidirectional substring check with a
  case-insensitive exact match. The old check passed empty strings and
  single characters as long as one was a substring of the other, which
  let unfilled or wildly wrong values slip through

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
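The substring-hole fix can be sketched as (helper name illustrative):

```python
def name_matches(model_value: str, expected: str) -> bool:
    # Old check: expected in model or model in expected — an empty
    # model value is a substring of anything, so unfilled fields passed.
    # New: case-insensitive exact match after strip().
    return model_value.strip().lower() == expected.strip().lower()
```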
- label.txt: correct Products70Plus from 7 to 6. Verified directly via
  the shopping container (port 7770): Video Games category at
  default 12 products/page has 5 products with rating >= 70% on page 1
  and 1 on page 2. The original 7 only matches if you switch to
  24/page first, which the description never instructs
- description.md: clarify that products without any rating do not count
  toward the threshold count
- verify.py:
  - Drop the cosmetic startswith("$") check on price fields
    (CheapestReviewedPrice, N64Subtotal); the strip step already
    handles $ and , so a numerically-correct value without $ was
    being rejected for no reason
  - Drop the +-2 tolerance window on Products70Plus. The data is
    static seed data, the rationale ("dynamic content") is bogus,
    and the tolerance had been hiding the wrong label (anything in
    [5,9] was passing for label=7, so neither the correct 6 nor the
    wrong 7 ever surfaced as a mismatch)
  - Switch ComparisonCount and ShippingMethods to int comparison
    so "02" or whitespace variants don't fail a numerically-correct
    answer
  - Add ShippingState to the case-insensitive exact-match branch
    alongside CheckoutEmail

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the cosmetic startswith("$") check on the four price fields
  (Battery1Price, Battery2Price, InitialSubtotal, FinalSubtotal); the
  strip step already removes $ and , before comparing, so a numerically
  correct amount written without $ was being rejected for no reason
- Switch the six count fields (AdvancedSearchResults, ComparisonCount,
  TeaReviews, CartUniqueProducts, CartTotalQuantity, TeaRating) to int
  comparison so "02" / whitespace / a trailing % don't fail a
  numerically-correct answer; TeaRating gets a `replace("%", "")` so
  both "95" and "95%" compare as 95

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
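The numeric coercions used across these commits can be sketched as two small helpers (names illustrative):

```python
def as_count(value: str) -> int:
    # "02", " 2 ", and "95%" all compare as their numeric value.
    return int(float(value.strip().rstrip("%")))

def as_price(value: str) -> float:
    # "$1,234.50" and "1234.5" compare equal as floats.
    return float(value.strip().lstrip("$").replace(",", ""))
```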
- description.md: clarify TotalCartItems is the sum of all product
  quantities, not the number of distinct line items
- verify.py:
  - Drop the cosmetic startswith("$") check on the three price-bearing
    fields (CartSubtotalAfterUpdate, the price portion of
    CheapestChocolatePriceReviews, the price portion of
    Page2ThirdProductSKUPrice); the strip step already handles $ and ,
    so a numerically correct value without $ was being rejected
  - Drop the 0.01 price tolerance — these prices come from Magento's
    rendered string, no float arithmetic happens on either side, so the
    tolerance only ever masked wrong answers like $72.55 vs $72.56
  - Switch the count/numeric pieces to int comparison: TotalCartItems,
    the reviews piece of CheapestChocolatePriceReviews, and the rating
    piece of HighestRatedCookieSKURating (with `%` stripped)
  - Make the SKU portions of HighestRatedCookieSKURating and
    Page2ThirdProductSKUPrice case-insensitive to match
    SecondGingerbreadSKU's behaviour

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o advanced_product_analysis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - drop step 2 "total Revenue" line — dashboard chart is disabled so
    Revenue is fixed at $0.00 with no signal
  - step 3 switch from "search by name" to "open the product page linked
    from the Bestsellers row" — the dashboard row has an implicit
    product-edit link that uniquely identifies one simple SKU, removing
    the ambiguity for configurable products like Sprite Stasis Ball 65 cm
    that have 3 same-name variants
  - step 3 rename vague "Current inventory quantity" to "Salable Quantity"
    so the model reads the actual remaining-stock field, not the static
    source qty (always 100 in seed data)
  - step 4 specify Cart Price Rules path explicitly and replace the
    misleading "rules that might apply to fitness/yoga products" with a
    precise rule signature: "percentage discount on the entire order
    (not tied to a specific product)" — disambiguates the 20% $200+ rule
    from the 70% Luma water bottle rule
  - step 6 drop "other" from "how many other customers are in the same
    group" — label counts the customer himself, so off-by-one is removed
  - schema/example: drop TotalRevenue, rename inventory→salable_quantity,
    align None:0 wording across step 2/example/label, replace
    email@example.com placeholder with <customer email> to avoid
    seeding example.com into model output (the original label.txt
    sarah.miller@example.com was likely fabricated this way)

- label.txt:
  - drop TotalRevenue|$0.00
  - Bestseller salable_quantity 100 → 93/93/88 (verified live)
  - TopCustomer email sarah.miller@example.com → helloworld@yahoo.com
    (Sarah Miller's real email in the customer table; the previous value
    did not exist anywhere in the database)
  - BestsellerInSearch No:0 → None:0 (align with description/example)

- verify.py:
  - get_model_response: add type=='message' filter (matches the
    customer_segmentation_setup tightening)
  - send "MCP_MESSAGES:" log to stderr like the rest of the prints
  - drop dead branches for TotalRevenue, MinimumPurchaseRule,
    LowestInventoryProduct, MostRecentOrderDate (none are in label.txt)
  - compare_answers tightened along the P1–P6 patterns established by
    earlier shopping/* commits:
    - Bestseller price → float() compare so "27" matches "$27.00"
    - Bestseller quantity → int() compare
    - Bestseller salable_quantity: drop ±0.0001 tolerance, keep float
      compare to handle Magento's "93.0000" rendering
    - PercentageDiscountRule percent → float() compare
    - PercentageDiscountRule rule name + TopCustomer name → case-insensitive
      so all name-class fields behave consistently
    - new branch for ActiveRulesCount / TotalOrders / SameGroupCustomers
      using int() compare; MostRecentOrderID stays in the default string
      branch because its leading zeros (000000299) are significant
  - Bestseller1/2/3 are now compared as a set via a new
    _normalize_bestseller helper — the three lines may be listed in any
    order (the dashboard ordering is non-obvious and the three tied
    bestsellers can equally well be listed under any Bestseller{1,2,3} key)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Magento product detail pages may render the trademark symbol as the
HTML entity &trade; rather than the unicode ™. Normalize &trade; → ™
in _normalize_bestseller so a model that copies "Quest Lumaflex&trade;
Band" from the page still matches label's "Quest Lumaflex™ Band".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
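The order-insensitive comparison plus entity normalization from the two commits above can be sketched as follows, assuming a parsed key→value dict on each side (`_normalize_bestseller` is the commit's helper name; the body here is illustrative):

```python
def _normalize_bestseller(value: str) -> str:
    # Magento may render the trademark symbol as the HTML entity
    # &trade; rather than the unicode character.
    value = value.replace("&trade;", "\u2122")
    return " ".join(value.split()).lower()

def bestsellers_match(model: dict, expected: dict) -> bool:
    # The three tied bestsellers may be listed under any
    # Bestseller{1,2,3} key, so compare the normalized values as a set.
    keys = ("Bestseller1", "Bestseller2", "Bestseller3")
    return ({_normalize_bestseller(model.get(k, "")) for k in keys}
            == {_normalize_bestseller(expected.get(k, "")) for k in keys})
```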
- description.md:
  - clarify step 6 ambiguous "best-selling and most expensive" — explicitly
    say "Among the products tied for the highest sales quantity in the
    Bestsellers table, identify the one with the highest price" so the
    label TopProduct = Sprite Stasis Ball 65 cm:6 falls out
    deterministically (the literal "both top-quantity AND top-priced"
    reading would intersect to nothing — Overnight Duffle is the
    most-expensive overall but only sold 5 units)

- verify.py:
  - get_model_response: add type=='message' filter and send the
    "MCP_MESSAGES:" log to stderr (matches customer_segmentation_setup)
  - verify main flow: drop the silent "Will proceed with browser
    verification only" fallback — hard-fail when model_response is
    missing or the <answer> block can't be parsed
  - compare_answers tightened along the P1–P6 patterns established by
    earlier shopping/* commits:
    - CouponCodes: replace the substring check
      ("H20" not in mv or "Luma water bottle" not in mv) — which passed
      any model output containing those two strings anywhere — with a
      proper split-and-set comparison: code case-sensitive, rule name
      case-insensitive (P4 fix)
    - Top2SearchTerms: split each "term:count" pair, compare as
      (term.lower(), int(count)) tuple set so order/case/spacing don't
      fail a numerically-correct answer
    - ZeroResultTerm: split "term:count" — term lower, count as int
    - EmailVerification: build dict of {email.lower(): status.lower()}
      so "Yes" / "YES" don't fail an otherwise correct answer
    - TopProduct: split "name:quantity" — name case-insensitive, qty as
      int (handles "6.0" / "06" variants)
    - new branch for TotalSearchTerms / ActiveRulesCount / SubscribedCount
      using int() compare
    - new branch for TotalRevenue: strip $ and , then compare as float
      so "$0.00" matches "0" / "0.00"
    - malformed model entries (missing ':', non-numeric count, etc.) now
      add specific mismatch messages instead of silently coercing to
      None tuples

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
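The "term:count" comparisons above can be sketched as (helper name illustrative):

```python
def term_count_matches(model_value: str, expected_value: str) -> bool:
    # Split on the last ':' so terms containing ':' still parse;
    # compare as (term.lower(), int(count)) so case, spacing, and
    # "06" / "6.0" variants don't fail a numerically correct answer.
    def parse(value: str):
        term, sep, count = value.strip().rpartition(":")
        if not sep:
            raise ValueError(f"malformed entry: {value!r}")
        return term.strip().lower(), int(float(count))
    try:
        return parse(model_value) == parse(expected_value)
    except ValueError:
        return False
```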
- description.md:
  - rename Second_Bestseller_* schema keys → Cheap_Bestseller_* to match
    description's "lowest price bestseller" semantics (the "Second" prefix
    was a leftover from an earlier design)
  - drop step 3's "compare CA tax with NY" subtask and the
    Higher_Tax_State field — the comparison was not adding signal beyond
    NY_Tax_Rate / CA_Tax_Rate themselves; CA_Tax_Rate is kept as a
    reference data point
  - rewrite step 4 to point at "Stores → Settings → Order Status"
    explicitly (the previous "Filter orders to show only statuses..."
    wording sent models to Sales → Orders) and clean up the awkward
    "if exists one has the status code 'processing'" phrasing
  - rewrite step 6 first sub-item: explicitly say "check whether its
    'Pickup Location' column is 'Enabled' or 'Disabled'" — the original
    "is currently 'Enabled' or shows as 'Disabled' for Pickup Location"
    was easy to misread as the Is Enabled column (which is what the
    label originally encoded)
  - align Default_Source_State schema/example wording (was a mix of
    "state_or_none" / "State or None" / "Yes or No"); description step 6
    asks "Yes or No", so schema/example now match

- label.txt:
  - Total_States_With_Tax 2 → 3 (admin actually has CA, MI and NY tax
    rate rows; the original 2 missed Michigan)
  - Default_Source_Pickup_Status Enabled → Disabled (Pickup Location
    column on the Default Source row is Disabled; the original Enabled
    was the Is Enabled column instead)
  - Default_Source_State No → Yes (Address Data section on Edit Source
    has a 'State/Province' field — region_id select after Country)
  - rename Second_Bestseller_* keys → Cheap_Bestseller_*
  - drop Higher_Tax_State row

- verify.py:
  - get_model_response: send "MCP_MESSAGES:" log to stderr, add
    type=='message' filter on the assistant message scan
  - compare_answers tightened along the P1–P6 patterns from earlier
    shopping/* commits:
    - Lifetime_Sales_Amount / Cheap_Bestseller_Price / Dashboard_Revenue:
      switch from string compare to float() so "$0.00" matches "0" /
      "0.00"
    - Cheap_Bestseller_Quantity / Total_States_With_Tax /
      Number_Of_Websites: int(float()) compare so "04" / "4" / "4.0"
      don't fail
    - Cheap_Bestseller_Name / Default_Source_Pickup_Status /
      Main_Store_Code: case-insensitive compare
    - Default_Source_State joined the Yes/No case-insensitive branch
      (the previous 'None' → '' normalization was a leftover from when
      the field was state-name-or-none)
    - drop dead Empty_Rows_Yes_Effect / Order_Status_Options /
      Chart_Disabled_Message branches — none of these keys are in
      label.txt or the schema

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - step 2 sub-item 1: "Search for all products containing 'Yoga' in
    their name" → "Filter by Name containing 'Yoga'"; the original
    wording sent models to the keyword search box, which is fulltext
    over name+description+sku and returns 281 hits — only the Name
    column filter returns the 172 products whose name actually contains
    Yoga
  - step 2 sub-item 2: "Clear the search" → "Clear all filters" so it
    matches the filter terminology used in sub-item 1
  - step 3 sub-item 1: rewrite the ambiguous "lowest price and lowest
    quantity" — taken literally that's the intersection of two singletons
    and is empty (cheapest = $14 Sprite Yoga Strap with qty=6,
    lowest-qty = qty=5 with $23 Sprite Stasis Ball 55 cm or $45
    Overnight Duffle). The label encoded "qty-lowest tier, then
    cheapest within it", so the description now says exactly that:
    "Among the products tied for the lowest sales quantity, identify
    the one with the lowest price"
  - step 3 sub-item 2: rename schema/example key
    QuestLumaflexQuantity → SecondCheapestQuantity; the original key
    name leaked the answer (Quest Lumaflex) — the model could fill the
    value without ever reading the dashboard
  - step 4 typo: "Father" → "Gather"
  - example output: SarahMillerEmail|email@example.com → <customer email>
    so the placeholder doesn't push models toward example.com (the
    same trap that produced sarah.miller@example.com in fitness)

- label.txt:
  - YogaProducts 171 → 172 (verified live: Name filter "Yoga" returns
    exactly 172 enabled products)
  - LowestProduct: drop the spurious " foot" in the product name
    ("Sprite Stasis Ball 55 cm foot" → "Sprite Stasis Ball 55 cm" —
    the original was a copy/paste artifact mixing two product names)
  - rename QuestLumaflexQuantity → SecondCheapestQuantity
  - TotalCustomers 72 → 70 (the original count included two test
    customers created by the marketing_customer_analysis task; baseline
    state has 70)

- verify.py:
  - get_model_response: send "MCP_MESSAGES:" log to stderr, add
    type=='message' filter on the assistant message scan
  - rename QuestLumaflexQuantity → SecondCheapestQuantity in
    expected_keys
  - compare_answers tightened along the P1–P6 patterns:
    - LowestProduct: name compared case-insensitively, qty as int
      (was strict string compare on both)
    - WH11Price / DashboardRevenue: switch from string compare on the
      stripped value to float() so "$54.00" matches "54" / "54.0"
    - new branch for YogaProducts / ZeroQuantityProducts /
      SecondCheapestQuantity / TotalCustomers / PendingOrders using
      int(float()) compare
    - GraceNguyenOrderID stays in the default string branch because
      its leading zeros (000000189) are significant

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - step 2 sub-item 2: "Clear the search" → "Clear all filters" so the
    terminology matches sub-item 1 (which uses Filter)
  - step 3 sub-item 2: rewrite "Find Grace Nguyen's Complete and the
    most cheap order" — the original was missing words and used "the
    most cheap" instead of "cheapest"; new wording is "Find Grace
    Nguyen's order with Complete status and the lowest price"
  - step 4 sub-item 1: rewrite "the product with most quantity but and
    lowest price" — the "but and" is a grammar error and the literal
    "max-qty AND min-price" is the empty intersection. Use the same
    two-step pattern as marketing_customer_analysis / products_sales_analysis:
    "Among the products tied for the highest sales quantity in the
    Bestsellers table, identify the one with the lowest price"
  - step 5 sub-item 1: "with its email address containing" → "whose
    email address contains" (grammar)

- verify.py:
  - get_model_response: send "MCP_MESSAGES:" log to stderr
  - drop dead Position2Product branch (no such key in label or schema)
  - compare_answers tightened along the P1–P6 patterns:
    - WS12Info: name compared case-insensitively, price as float() so
      "$22.00" matches "22" / "22.0"
    - HighestOrderInfo: customer compared case-insensitively, amount as
      float()
    - new CheapProduct branch: name case-insensitive, qty as int (was
      falling through to the strict default branch)
    - OvernightDufflePrice: switch from string compare on the stripped
      value to float()
    - HollisterPosition: now also strip() before lowercasing
    - SarahMillerInfo: replace the bidirectional substring date check
      ("expected_date in model_date or model_date in expected_date" — a
      classic P4 hole that lets empty/short strings pass) with a
      case-insensitive exact match after strip()
    - Invoice002BillTo: case-insensitive
    - new branch for SpriteProducts / Quantity100Products / PendingOrders
      / CostelloCustomers / PaidInvoices using int(float()) compare
    - GraceOrderID stays in the default string branch (its leading zeros
      "000000114" are significant); also drop the redundant
      startswith("000") gate since every Magento order ID is 000-padded

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- description.md:
  - drop the Position3Bestseller sub-item from step 4 — it duplicated
    the bestsellers-table inspection covered by other tasks and was
    just one more thing the model could fail on; schema and example
    keys removed in lockstep
  - step 4 sub-item 2: fix grammar in "with the both the highest
    number of results and uses" → "with both the highest..."; tidy
    "record name and the number of results" → "record its name and
    number of results". The literal AND reading is fine here because
    label data is unambiguous (Antonia Racer Tank wins on uses among
    the results=23 tier)
  - step 5 header: drop the awkward self-reference "in step 2" in
    favour of a direct path "Marketing → Search Terms"
  - step 5 sub-item 2: spell out the tie-break — "Sort by 'Results'
    (ascending), then by 'Uses' (ascending)". Two terms in the grid
    have Results=1 (hollister with 19 uses, WP10 with 1 use); without
    the secondary sort, "first non-zero" was ambiguous and the model
    could land on hollister and fail against label WP10:1
  - step 5 sub-item 3: drop the misleading "unique" — Search Terms
    grid rows are already unique terms, the word made models think
    they had to dedupe across Top/Last Search Terms
  - step 2 sub-item 4 / step 4 sub-item 1: align singular/plural —
    "Find the search term ... record its name" was inconsistent with
    "(record them all)"; both items now say "Find the search terms ...
    record their names"
  - schema/example: change the multi-entry separator from '|' to ';'
    so the three layers of delimiters are visually distinct (newline
    between rows, '|' between key/value, ';' between entries, ':'
    between term and count). Previously the row "OneResultTerm|hollister:19|WP10:1"
    used '|' for two different things, which read awkwardly

- label.txt: same '|' → ';' separator change for the two multi-entry
  rows (Results20to30Term, OneResultTerm)

- verify.py: full rewrite from a single 308-line monolith with
  hardcoded expected values into the standard 4-helper layout used by
  the other shopping_admin tasks:
  - new get_model_response with stderr logging and the
    role+status+type triple filter (and tolerant of both 'text' and
    'output_text' content items)
  - new parse_answer_format with strict line-count and missing-key
    checks
  - new load_expected_answer that reads label.txt instead of
    hardcoding values; this means future label updates don't need a
    parallel verify.py edit
  - compare_answers along the P1–P6 patterns:
    - int branch for TankSearchCount, ZeroResultsCount, Hits15PlusCount,
      DefaultStoreViewCount, TotalUniqueTerms (was string compare via
      .isdigit gate)
    - "term:count" branch for HighestUseTerm, ID10to15MaxResults,
      HighestResultLastSearch, TopUseTerm, FirstNonZeroResult — term
      case-insensitive, count as int (was strict string)
    - multi-entry set branch for Results20to30Term and OneResultTerm
      using the new ';' separator: parse each entry into
      (term.lower(), int(count)) tuples and compare as a set. This
      replaces the old `if not any(val in extracted for val in valid_xxx)`
      substring check, which was a P4 hole — the model could pass with
      just one of the two entries, or with extra junk after a valid
      entry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
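The multi-entry set comparison with the new ';' separator can be sketched as:

```python
def parse_entry_set(value: str) -> set:
    # "hollister:19; WP10:1" -> {("hollister", 19), ("wp10", 1)}.
    # Comparing full sets closes the old P4 hole where a single valid
    # entry, or junk trailing a valid entry, passed a substring check.
    entries = set()
    for entry in value.split(";"):
        term, sep, count = entry.strip().rpartition(":")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        entries.add((term.strip().lower(), int(count)))
    return entries
```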
Cosmetic alignment: every other print in get_model_response already uses
file=sys.stderr; the very first print of the messages-file path was the
only line still going to stdout. With the verifier framework redirecting
stdout to result fields, that path log was leaking into the wrong
channel.

Affected:
- playwright/standard/eval_web/cloudflare_turnstile_challenge
- playwright_webarena/standard/shopping/advanced_product_analysis
- playwright_webarena/standard/shopping/gaming_accessories_analysis
- playwright_webarena/standard/shopping/health_routine_optimization
- playwright_webarena/standard/shopping/holiday_baking_competition
- playwright_webarena/standard/shopping/multi_category_budget_analysis
- playwright_webarena/standard/shopping/printer_keyboard_search
- playwright_webarena/standard/shopping/running_shoes_purchase
- playwright_webarena/standard/shopping_admin/customer_segmentation_setup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reddit/postmill renders straight apostrophes (') and quotes (") as smart
quotes (’ ‘ ” “) in post titles and bodies; some shopping report values
may contain them too. label.txt always uses ASCII, so comparisons failed
on otherwise-correct outputs (e.g. "child's jacket" vs "child’s jacket").

Each verify.py now defines `normalize_text` that maps the four smart
quote characters to ASCII and collapses whitespace. It is applied either
at parse time (shopping / shopping_admin and a few reddit) or at
comparison time (the 5 reddit tasks that already had it), keeping both
sides symmetric. Only one file's behavior is unchanged, since its
hardcoded comparisons are ASCII-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
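The shared helper can be sketched as follows (the four characters are U+2018, U+2019, U+201C, U+201D; the body is a minimal illustration of the pattern):

```python
def normalize_text(text: str) -> str:
    # Map the four smart-quote characters to their ASCII equivalents,
    # then collapse runs of whitespace.
    for smart, plain in (("\u2018", "'"), ("\u2019", "'"),
                         ("\u201c", '"'), ("\u201d", '"')):
        text = text.replace(smart, plain)
    return " ".join(text.split())
```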
…rify

Postmill renders the vote button as icon-only (no text node), so
button:has-text("Retract upvote") never matched and the upvote step
always failed. Match button[title="Retract upvote"] instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The description's example showed MMM DD, YYYY for SarahMillerInfo, but
label.txt expects the full Magento timestamp including time and AM/PM.
Models that followed the description literally lost a correct point.
Update both the step instruction and the answer template to spell out
the full format; the example timestamp does not leak the real value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Register gpt-5.5 in ModelConfig and accept "xhigh" as a valid
--reasoning-effort choice so pipeline runs can target the new
high-reasoning variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>