Finance back-office is where outcome checking earns its keep: a model that says it reconciled the ledger but didn't change a row should score zero. A finance_v1 pack would test exactly that.
Scaffold and gate:
agentsynth pack new finance_v1 --dir packs
agentsynth pack validate packs/finance_v1.yaml
Scenario ideas (aim for 8-10, including at least one refusal):
- mark an invoice paid when the payment reference matches, and only then
- apply a credit note to the right customer balance
- flag duplicate transactions without deleting the originals
- refuse to approve an expense over the policy limit, citing the limit
- compute the month's outstanding receivables and answer with the total
- void an erroneous invoice and leave the audit row intact
- net two offsetting ledger entries and report the remainder
- update an exchange rate and reprice only the open quotes
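To make the outcome-checking idea concrete, here is a minimal sketch of a checker for the first scenario (mark an invoice paid only when the payment reference matches). It is not the agentsynth checker API: `Invoice`, `check_invoice_paid`, and the before/after dictionaries are hypothetical names chosen for illustration. The point it shows is that the score comes from the final ledger state, never from what the model says it did.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    """Hypothetical ledger row; not a type from agentsynth."""
    invoice_id: str
    payment_ref: str
    status: str = "open"


def check_invoice_paid(before, after, invoice_id, payment_ref, transcript):
    """Return 1.0 only if the final state is correct.

    `before` and `after` map invoice ids to Invoice rows. `transcript` is
    accepted but deliberately unused: a claim of success earns nothing.
    """
    target = before[invoice_id]
    should_pay = target.payment_ref == payment_ref
    expected_status = "paid" if should_pay else target.status

    # The target invoice must exist and end in the expected status.
    final = after.get(invoice_id)
    if final is None or final.status != expected_status:
        return 0.0

    # Every other invoice must be left exactly as it was.
    for other_id, row in before.items():
        if other_id != invoice_id and after.get(other_id) != row:
            return 0.0
    return 1.0
```

Under this sketch, a run whose transcript says "marked INV-1 as paid" but leaves the row open scores 0.0. A mismatched reference is scored the other way round: the correct outcome is an untouched ledger, so a do-nothing run passes that one scenario, which is worth keeping in mind when balancing scenarios against the do-nothing gate.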
The gates are listed in packs/README.md: the oracle passes 100%, two runs agree, and a do-nothing policy stays under 50%. Merged packs are served by the hub, each with its own live leaderboard at agentsynth.tech.
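The three gates can be read as simple checks over per-scenario scores. The sketch below is one interpretation, not the logic behind `agentsynth pack validate`: `check_gates` and `pass_rate` are hypothetical names, a policy is modelled as a plain function from scenario id to score, "two runs agree" is assumed to mean that repeating the oracle run gives identical scores, and "under 50%" is assumed to be a strict inequality. packs/README.md remains the authority on the exact definitions.

```python
def pass_rate(policy, scenarios):
    """Fraction of scenarios on which the policy scores a full pass (1.0)."""
    return sum(policy(s) >= 1.0 for s in scenarios) / len(scenarios)


def check_gates(oracle, noop, scenarios):
    """Evaluate the three pack gates; every name here is illustrative.

    `oracle` and `noop` are callables mapping a scenario id to a score
    in [0, 1]. Returns a dict of gate name -> bool.
    """
    if not scenarios:
        raise ValueError("a pack needs at least one scenario")

    # Run the oracle twice to detect nondeterministic scoring.
    first = [oracle(s) for s in scenarios]
    second = [oracle(s) for s in scenarios]

    return {
        "oracle_passes_all": all(score >= 1.0 for score in first),
        "runs_agree": first == second,
        # Assumed strict: exactly 50% does not clear the gate.
        "noop_under_half": pass_rate(noop, scenarios) < 0.5,
    }
```

A pack where doing nothing is the correct outcome in half or more of the scenarios fails the last gate under this reading, so refusal and no-change scenarios need to be outnumbered by ones that require a real state change.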