diff --git a/.task b/.task
index 4f33323..d3a31c8 100644
--- a/.task
+++ b/.task
@@ -1,10 +1,10 @@
 {
-  "taskId": "1282",
+  "taskId": "2123",
   "phase": "execution",
-  "fenceToken": 3,
-  "sessionId": "c893aa20-a7b7-4112-9cc3-68c12a747bf1",
-  "journalPath": "/tmp/taskcore-worktrees/journal-T1282/tasks/T1282/",
-  "codeWorktree": "/tmp/taskcore-worktrees/code-T1282",
-  "claimedAt": 1773098423616,
+  "fenceToken": 10,
+  "sessionId": "7be3c6a7-358e-45c7-b5af-6c1c5a51b82e",
+  "journalPath": "/tmp/taskcore-worktrees/journal-T2123/tasks/T2123/",
+  "codeWorktree": "/tmp/taskcore-worktrees/code-T2123",
+  "claimedAt": 1773404837466,
   "reviewNotes": []
 }
diff --git a/docs/ops/t1710-capability-first-orchestration-hardening-plan.md b/docs/ops/t1710-capability-first-orchestration-hardening-plan.md
new file mode 100644
index 0000000..699d221
--- /dev/null
+++ b/docs/ops/t1710-capability-first-orchestration-hardening-plan.md
@@ -0,0 +1,159 @@
+# T1710 — Capability-First Orchestration Hardening
+
+## Executive recommendation
+
+This program is **too large and cross-cutting to execute as a single task**. It should be decomposed into ordered child tasks with one parent artifact that fixes the architecture, sequencing, and acceptance criteria.
+
+The failure pattern is not "one bug". It is a systems problem caused by trying to decompose and execute uncertain missions **before** the system knows:
+- what capabilities are actually available,
+- which prerequisites are missing,
+- which steps are reversible vs. irreversible,
+- how partial progress should be preserved,
+- when repeated failure should trigger a strategy change instead of more retries.
+
+## Problem framing
+
+The target workflow class has five characteristics:
+1. **High uncertainty at start** — entity matching, environment state, or account status may be unknown.
+2. **Browser-mediated execution** — progress depends on live UI state, auth, and fragile selectors.
+3. **Partially irreversible actions** — clicks, submissions, trades, messages, or confirmations may have consequences.
+4. **Mixed research + execution** — discovery work is often bundled with deterministic action steps.
+5. **Infrastructure noise** — browser relay, auth state, tool health, and mutation-path failures create false task churn.
+
+If the orchestrator treats these as ordinary deterministic tasks, it creates the same failure loop:
+- decompose too early,
+- assign execution before readiness,
+- lose partial findings in review/aggregate handoffs,
+- retry the same failing path,
+- escalate risk near irreversible steps.
+
+## Strategic design principles
+
+### 1. Capability-first before decomposition
+Before generating child tasks, the system should produce a **mission capability snapshot**:
+- available tools and runtimes,
+- authenticated systems/accounts,
+- browser availability and attachment state,
+- permission constraints,
+- verification channels,
+- human-approval requirements,
+- known blockers.
+
+If readiness is low, the system should create **prerequisite tasks** first, not execution tasks.
+
+### 2. Separate discovery from deterministic execution
+A mission should not begin with an execution plan when the main unknown is still identification, feasibility, auth, or state verification.
+
+Use two lanes:
+- **Discovery lane**: identify entities, inspect environment, map options, and gather evidence.
+- **Execution lane**: perform deterministic, validated steps only after inputs are stable.
+
+This preserves operator clarity and reduces bogus "execution failures" that are really unresolved discovery problems.
+
+### 3. Preserve partial progress as first-class artifacts
+When a child task uncovers verified facts but cannot finish the end-to-end mission, that output must survive reviews, retries, and replanning.
+
+Required artifact types:
+- capability snapshots,
+- matched entities / rejected candidates,
+- prerequisite checklist state,
+- evidence bundles,
+- environment fingerprints,
+- execution-ready plans,
+- approval packets for irreversible actions.
+
+The parent should aggregate these artifacts instead of forcing children into a binary success/fail shape.
+
+### 4. Fingerprint failure modes, then switch strategy
+Repeated retries are only rational if the failure mode is transient. The orchestrator should classify failures into buckets such as:
+- auth missing/expired,
+- browser relay unattached,
+- selector / UI drift,
+- external system ambiguity,
+- runtime/tool unavailable,
+- mutation accepted but verification unavailable,
+- approval required but not granted.
+
+Each class needs a defined next action: retry, reroute, decompose prerequisite, request approval, or stop.
+
+### 5. Approval-gated lane for irreversible actions
+Irreversible or safety-sensitive actions should require a specific lane with:
+- explicit action summary,
+- preconditions satisfied,
+- target/entity verified,
+- rollback possibilities documented,
+- approval token or human confirmation captured,
+- post-action verification defined.
+
+This should not share the same semantics as low-risk research tasks.
+
+## Proposed decomposition
+
+### Child 1 — Mission capability registry and readiness model
+**Goal:** define machine-readable representation of capabilities, prerequisites, and readiness scoring.  
+**Output:** schema + readiness levels + examples + integration points.
+
+### Child 2 — Execution preflight and prerequisite detection
+**Goal:** build the gate that runs before decomposition/execution to detect missing auth, tools, browser state, permissions, and verification channels.  
+**Output:** preflight rules, fail-fast decisions, prerequisite task generation rules.
+
+### Child 3 — Grounding and uncertain-entity evaluation framework
+**Goal:** handle missions where the target entity, account, page, or record is uncertain.  
+**Output:** match confidence model, evidence requirements, safe stopping conditions.
+
+### Child 4 — Failure fingerprinting and strategy switching
+**Goal:** stop naive retry loops and route repeated failures to the right next strategy.  
+**Output:** failure taxonomy, retry budgets, switching rules, observability requirements.  
+**Design artifact:** `docs/ops/t1712-failure-fingerprinting-strategy-switching.md`
+
+### Child 5 — Approval-gated irreversible-action lane
+**Goal:** create a separate workflow for steps with material consequences.  
+**Output:** approval packet schema, gate conditions, execution/verification semantics.
+
+### Child 6 — Aggregate policy and artifact-first closure semantics
+**Goal:** preserve partial progress through review and parent aggregation.  
+**Output:** child completion semantics, artifact contract, parent merge rules, review checklist.
+
+### Child 7 — Cross-repo implementation and validation plan
+**Goal:** map where changes belong across taskcore and colony, sequence rollout, and define tests.  
+**Output:** implementation order, repo ownership, migration plan, acceptance tests.
+
+## Ordering recommendation
+
+Recommended execution order:
+1. Capability registry and readiness model
+2. Execution preflight and prerequisite detection
+3. Grounding / uncertain-entity evaluation
+4. Failure fingerprinting and strategy switching
+5. Approval-gated lane for irreversible actions
+6. Aggregate policy and artifact-first closure semantics
+7. Cross-repo implementation and validation plan
+
+Rationale:
+- readiness and preflight are foundation layers,
+- grounding determines whether execution should even begin,
+- failure handling is only useful after readiness semantics exist,
+- approval gating depends on stable preconditions and verification semantics,
+- aggregate closure should be shaped after artifact types are defined,
+- implementation planning should be last so it reflects the final architecture.
+
+## Acceptance criteria for the parent task
+
+T1710 should only be considered complete when it produces:
+- a parent architecture memo with the final system model,
+- child tasks covering all six functional areas plus rollout/validation,
+- explicit artifact contracts between children,
+- a recommended order of implementation,
+- concrete acceptance tests for the integrated workflow.
+
+## What not to do
+
+- Do **not** patch a single historical workflow.
+- Do **not** encode browser-specific hacks as general orchestration policy.
+- Do **not** collapse discovery, execution, and approval into one task type.
+- Do **not** use success/failure alone as the parent aggregation model.
+- Do **not** allow irreversible execution without explicit preconditions and approval semantics.
+
+## Immediate next move
+
+Decompose T1710 into the ordered child tasks above, using domain-agnostic language and artifact-focused outputs. The parent remains responsible for the integrated architecture and rollout sequence.
diff --git a/docs/ops/t1712-failure-fingerprinting-strategy-switching.md b/docs/ops/t1712-failure-fingerprinting-strategy-switching.md
new file mode 100644
index 0000000..576c279
--- /dev/null
+++ b/docs/ops/t1712-failure-fingerprinting-strategy-switching.md
@@ -0,0 +1,449 @@
+# T1712 — Failure fingerprinting, strategy switching, and dynamic retry budgets
+
+## Problem statement
+
+TaskCore currently treats most agent failures as variations of the same event:
+- `task-executor.mjs` increments a single `retryCount`
+- work is re-queued with generic exponential backoff
+- only rate limits get a distinct path
+- once `MAX_RETRIES` is exhausted, the task is simply blocked
+
+That is too lossy for uncertain, browser-mediated, or dependency-heavy work. The system cannot distinguish:
+- a stale login that should wake a shared auth blocker
+- anti-bot / access denial that should halt sibling work on the same target
+- missing inputs that should create a prerequisite task instead of more retries
+- invalid state transitions that require replanning, not repetition
+- verification failures after a mutation, where autonomy should stop and escalate
+- repeated cost / provider exhaustion, where dispatch should shift strategy globally
+
+The result is sibling churn: multiple leaves burn retries on the same blocker even when the next rational move is shared blocker-removal or mission replanning.
+
+---
+
+## Design goals
+
+1. **Canonical failure identity** — repeated failures with the same operative cause produce the same fingerprint.
+2. **Scope-aware routing** — leaf-local failures stay local; shared blockers fan in to one prerequisite task; global exhaustion triggers broader throttling.
+3. **Strategy switching over blind retries** — retry only when the failure class is plausibly transient.
+4. **Task-kind-aware budgets** — execution, aggregate, and artifact-only tasks do not share the same retry policy.
+5. **Inspectable decisions** — every retry, pause, reroute, and blocker promotion is visible in task metadata and executor outcome logs.
+6. **Fail-closed near irreversible work** — verification and state-transition failures on execution tasks escalate before more automation is attempted.
+
+### Non-goals
+
+- Replacing provider allocation gating from T755.
+- Replacing the recovery breaker engine for host/service remediation.
+- Solving target grounding/entity ambiguity (that belongs to T1704/T2110-style grounding work).
+
+---
+
+## Where this policy plugs in
+
+Primary integration points:
+1. **`scripts/task-executor.mjs`** — authoritative classification, retry budgeting, and routing.
+2. **task metadata in `.taskmaster/tasks/tasks.json`** — persistent fingerprint, counters, blocker linkage, and route decisions.
+3. **executor outcome log** (`data/task-dashboard/executor_outcomes.jsonl`) — append fingerprint + routing evidence.
+4. **dashboard export / APIs** — surface repeated failure clusters, promoted blockers, and retry-budget exhaustion.
+
+This task defines the contract and routing model. A later execution task should implement the plumbing.
+
+---
+
+## 1) Canonical failure fingerprint model
+
+A **failure fingerprint** is the deduplicated identity of a task failure for orchestration purposes.
+
+### Proposed schema
+
+```json
+{
+  "version": "v1",
+  "fingerprintId": "ffp_01HV...",
+  "taskId": 1712,
+  "taskKind": "execution",
+  "runPhase": "work",
+  "failureClass": "auth_session_failure",
+  "scope": "shared_prerequisite",
+  "resourceKey": "telegram:account:primary",
+  "surface": "browser_relay",
+  "reasonCode": "session_expired",
+  "signature": {
+    "provider": "openai-codex",
+    "model": "gpt-5.4",
+    "exitType": "agent_crash",
+    "stderrClass": "login_required",
+    "targetRef": "telegram-web"
+  },
+  "dedupeKey": "auth_session_failure|telegram:account:primary|browser_relay|session_expired",
+  "firstSeenAt": "2026-03-13T09:00:00Z",
+  "lastSeenAt": "2026-03-13T09:07:00Z",
+  "attempts": 3,
+  "affectedTaskIds": [1712, 1718, 1721],
+  "recommendedStrategy": "wake_or_create_blocker",
+  "recommendedBlockerKey": "blocker:auth:telegram:account:primary"
+}
+```
+
+### Required fields
+
+| Field | Meaning |
+|---|---|
+| `taskKind` | `execution`, `aggregate`, `artifact_only`, `review`, `capability_probe`, etc. |
+| `runPhase` | `work` or `review` |
+| `failureClass` | canonical orchestrator-facing class |
+| `scope` | `leaf_local`, `shared_prerequisite`, or `global_budget` |
+| `resourceKey` | normalized target of the blocker (`browser:relay:chrome`, `human:kas`, `provider:openai-codex`) |
+| `reasonCode` | finer-grained sub-cause |
+| `dedupeKey` | stable routing key used to collapse repeats |
+| `attempts` | count of repeated hits within policy window |
+| `recommendedStrategy` | `retry_same_path`, `switch_strategy`, `wake_or_create_blocker`, `pause_for_review`, `global_throttle` |
+
+### Scope semantics
+
+- **`leaf_local`** — retry/replan only this task. Example: malformed prompt for one artifact-only task.
+- **`shared_prerequisite`** — stop retrying sibling leaves and promote a shared blocker. Example: expired auth, missing credential, inaccessible website.
+- **`global_budget`** — provider/cost saturation or broad platform outage; gate future dispatch and avoid local churn.
+
+---
+
+## 2) Canonical failure classes
+
+These are the minimum classes required for T1712 acceptance.
+
+| Failure class | Typical signals | Default scope | Default strategy |
+|---|---|---|---|
+| `auth_session_failure` | login required, expired cookie/session, wallet disconnected, missing permission grant | `shared_prerequisite` | pause affected leaves, wake/create auth blocker |
+| `access_denial_antibot` | captcha, 403, antibot page, WAF deny, account challenge | `shared_prerequisite` | stop automation path, request human unblock / alternate channel |
+| `missing_input` | required ID/file/approval/parameter absent | `shared_prerequisite` if shared, else `leaf_local` | create prerequisite task or request user input |
+| `invalid_state_transition` | task tries action from wrong state, precondition invalid, already-submitted/closed/cancelled | `leaf_local` or `shared_prerequisite` if state is mission-wide | replan from refreshed state; no same-path retry |
+| `verification_failure` | mutation possibly happened but postcondition cannot be proven; conflicting checks | `leaf_local` on single leaf, fail-closed for execution | halt autonomy, require verification/review path |
+| `cost_exhaustion_repeated` | provider denied for budget/quota reasons across attempts/windows | `global_budget` | throttle dispatch, shift provider/model/priority policy |
+
+Recommended additional classes for implementation completeness:
+- `tool_runtime_unavailable`
+- `ui_selector_drift`
+- `rate_limit_transient`
+- `dependency_blocked`
+- `human_approval_missing`
+
+### Failure-class notes
+
+#### `auth_session_failure`
+Examples:
+- browser relay attached but session logged out
+- API token missing or expired
+- wallet connector present but no connected account
+
+Rule: after the second matching hit within the policy window, stop local retries and create/wake one shared auth blocker keyed by the affected account/resource.
+
+#### `access_denial_antibot`
+Rule: never let sibling leaves keep probing the same blocked surface. Switch to a human-assisted or alternate-channel strategy immediately after first confirmed match.
+
+#### `missing_input`
+Rule: if the missing input is shared by multiple children (e.g. target account id, approval token, attachment), collapse it into one prerequisite task and mark dependent leaves as waiting/blocked-by-dependency.
+
+#### `invalid_state_transition`
+Rule: do not spend retry budget repeating an action against a stale assumption. Refresh state, then either replan or close as not-applicable.
+
+#### `verification_failure`
+Rule: for execution tasks, verification failure after a mutation is **not retry-equivalent** to a normal crash. The system must stop and request human review or explicit verification work.
+
+#### `cost_exhaustion_repeated`
+Rule: once the same quota/budget fingerprint repeats across tasks or time windows, it becomes a dispatch-policy problem, not a leaf problem. Route to global throttling or provider switch.
+
+---
+
+## 3) Fingerprint derivation rules
+
+The executor should derive a fingerprint in four passes:
+
+1. **Normalize runtime facts**
+   - task kind
+   - run phase
+   - assignee/reviewer
+   - exit code / signal
+   - error tail classification
+   - known provider/model metadata
+   - target resource / surface
+
+2. **Assign failure class**
+   - use deterministic rule table before any model-based classifier
+   - allow only a bounded fallback classifier for unknown cases
+
+3. **Resolve routing scope**
+   - infer whether the cause is leaf-local, shared prerequisite, or global budget
+
+4. **Build dedupe key**
+   - `failureClass | resourceKey | surface | reasonCode`
+   - exclude volatile text (timestamps, raw stack traces, run ids)
+
+### Example derivations
+
+```text
+stderr: "Telegram Web shows login required"
+=> failureClass=auth_session_failure
+=> resourceKey=telegram:web:primary
+=> scope=shared_prerequisite
+=> dedupeKey=auth_session_failure|telegram:web:primary|browser_relay|login_required
+```
+
+```text
+stderr: "429 Too Many Requests from provider openai-codex"
+=> failureClass=rate_limit_transient
+=> resourceKey=provider:openai-codex
+=> scope=leaf_local (single task) OR global_budget once repeated threshold trips
+=> dedupeKey=rate_limit_transient|provider:openai-codex|dispatch|429
+```
+
+```text
+stderr: "proposal already published"
+=> failureClass=invalid_state_transition
+=> resourceKey=proposal:1234
+=> scope=shared_prerequisite if many leaves assume draft state
+=> dedupeKey=invalid_state_transition|proposal:1234|mutation_path|already_published
+```
+
+---
+
+## 4) Strategy-switching rules
+
+### Canonical strategies
+
+| Strategy | Use when | Result |
+|---|---|---|
+| `retry_same_path` | transient/local issue and retry budget remains | requeue same task with backoff |
+| `switch_strategy` | same objective still valid but current path is irrational | reroute to alternate tool/channel/plan |
+| `wake_or_create_blocker` | repeated shared blocker across leaves | create or wake one prerequisite task and pause dependents |
+| `pause_for_review` | verification or safety-sensitive ambiguity | send to review / human confirmation |
+| `global_throttle` | provider or budget exhaustion spans multiple tasks | deny/defer future dispatch until healthy |
+
+### Routing decision table
+
+| Failure class | First hit | Repeated hit | Exhausted state |
+|---|---|---|---|
+| `auth_session_failure` | retry once if evidence is weak; otherwise create blocker immediately | wake/create shared auth blocker; pause sibling leaves | mark dependency blocker and stop automation until prerequisite closes |
+| `access_denial_antibot` | stop same-path retries; request alternate route | shared blocker + human review | quarantine target surface for cooldown window |
+| `missing_input` | create/wake prerequisite or ask for input | collapse siblings onto same blocker | leave waiting on prerequisite, no further retries |
+| `invalid_state_transition` | refresh state and re-evaluate | replan or close as superseded | no further retries on stale path |
+| `verification_failure` | require explicit verification task / review | block execution lane on target | escalate to human with evidence bundle |
+| `cost_exhaustion_repeated` | apply local defer/backoff | trigger provider/model/policy switch or dispatch gate deny | global throttle until healthy window returns |
+
+### Shared blocker promotion rule
+
+When all of the following hold, the executor promotes a blocker task:
+1. `scope == shared_prerequisite`
+2. same `dedupeKey` occurs on **>= 2 tasks** or **>= 2 attempts on one task** within the policy window
+3. a blocker with the same `recommendedBlockerKey` is not already active
+
+Result:
+- create or wake one blocker task
+- attach `metadata.failureFingerprint.blockerTaskId`
+- mark affected leaves as dependency-blocked / waiting on that blocker
+- suppress additional same-fingerprint retries until blocker state changes
+
+### Replanning rule
+
+If a task hits `invalid_state_transition` or `verification_failure`, the next action should be a replan/verification step, not another leaf retry. The executor should either:
+- create a child task for state refresh / verification, or
+- send the task back to review with the fingerprint attached.
+
+---
+
+## 5) Dynamic retry budget policy
+
+Retry budgets must be keyed by **task kind** and **failure class**, not one global `MAX_RETRIES`.
+
+### Policy table (recommended v1)
+
+| Task kind | Failure class | Auto retries | Backoff class | On exhaustion |
+|---|---|---:|---|---|
+| `execution` | `rate_limit_transient` | 2 | long | switch provider or defer via gate |
+| `execution` | `auth_session_failure` | 1 | short | create/wake auth blocker |
+| `execution` | `access_denial_antibot` | 0 | none | human unblock / alternate path |
+| `execution` | `missing_input` | 0 | none | prerequisite task |
+| `execution` | `invalid_state_transition` | 0 | none | refresh + replan |
+| `execution` | `verification_failure` | 0 | none | review / verification task |
+| `execution` | `tool_runtime_unavailable` | 1 | medium | reroute tool / pause |
+| `aggregate` | `dependency_blocked` | 0 | none | wait for required children / blocker |
+| `aggregate` | `missing_input` | 0 | none | request missing artifact coverage |
+| `aggregate` | `tool_runtime_unavailable` | 1 | short | rerun reducer/export path |
+| `aggregate` | `verification_failure` | 1 | short | review aggregate evidence |
+| `artifact_only` | `tool_runtime_unavailable` | 2 | short | reroute agent/tool |
+| `artifact_only` | `missing_input` | 0 | none | request source material |
+| `artifact_only` | `invalid_state_transition` | 0 | none | usually close/supersede, not retry |
+| `artifact_only` | `cost_exhaustion_repeated` | 1 | long | defer until budget recovers |
+| `review` | `tool_runtime_unavailable` | 1 | short | reroute reviewer |
+| `review` | `verification_failure` | 0 | none | escalate to human reviewer |
+
+### Why the policies differ
+
+- **Execution tasks** carry the most risk; most non-transient classes should not auto-retry.
+- **Aggregate tasks** should rarely retry; repeated child blockers are usually dependency issues, not execution failures.
+- **Artifact-only tasks** can tolerate slightly more retry on tooling failures because they do not directly mutate external state.
+
+### Budget accounting model
+
+Track two counters per task attempt window:
+1. **`pathRetryCount`** — retries on the same strategy/path
+2. **`strategySwitchCount`** — number of alternate paths already tried
+
+This prevents a task from escaping budget control by bouncing endlessly between weak alternatives.
+
+Recommended defaults:
+- `execution`: `pathRetryCount <= 2`, `strategySwitchCount <= 1`
+- `aggregate`: `pathRetryCount <= 1`, `strategySwitchCount <= 1`
+- `artifact_only`: `pathRetryCount <= 2`, `strategySwitchCount <= 2`
+
+---
+
+## 6) Proposed metadata contract
+
+Add the following metadata shape to task records:
+
+```json
+{
+  "metadata": {
+    "retryPolicy": {
+      "taskKind": "execution",
+      "pathRetryCount": 1,
+      "strategySwitchCount": 0,
+      "budgetWindow": "30m",
+      "lastBudgetDecision": "wake_or_create_blocker"
+    },
+    "failureFingerprint": {
+      "fingerprintId": "ffp_01HV...",
+      "failureClass": "auth_session_failure",
+      "scope": "shared_prerequisite",
+      "dedupeKey": "auth_session_failure|telegram:web:primary|browser_relay|login_required",
+      "attemptsWindow": 2,
+      "firstSeenAt": "2026-03-13T09:00:00Z",
+      "lastSeenAt": "2026-03-13T09:07:00Z",
+      "recommendedStrategy": "wake_or_create_blocker",
+      "blockerTaskId": 1730
+    },
+    "blockedByFingerprint": true,
+    "sharedBlockerKey": "blocker:auth:telegram:web:primary"
+  }
+}
+```
+
+### Executor outcome log extension
+
+Each `executor_outcomes.jsonl` record should append:
+- `taskKind`
+- `failureClass`
+- `fingerprintDedupeKey`
+- `routingDecision`
+- `scope`
+- `strategySwitched` (bool)
+- `blockerTaskId` (if any)
+
+This is the minimum observability needed to prove sibling churn actually dropped after rollout.
+
+---
+
+## 7) Shared blocker lifecycle
+
+### Blocker task creation contract
+
+When promoting shared blocker work, the system should create a task with:
+- kind: `capability_probe` or `blocker_removal`
+- title: deterministic and resource-based
+- description: include fingerprint class, affected resource, and evidence bundle
+- metadata:
+  - `blockerKey`
+  - `sourceFingerprint`
+  - `affectedTaskIds`
+  - `createdFromFailureRouter=true`
+
+### Wake vs create
+
+- **Wake existing blocker** when an active/pending blocker has the same `blockerKey`.
+- **Create new blocker** only when no active blocker exists.
+
+### Leaf behavior while blocker active
+
+Affected leaves should not continue ordinary retry scheduling. Instead they should move to a dependency-held condition with:
+- blocker task id
+- blocker key
+- fingerprint id
+- timestamp of last routed match
+
+This is the mechanism that stops sibling retry burn.
+
+---
+
+## 8) Observability and dashboards
+
+Required dashboard/reporting surfaces:
+1. **Top repeated fingerprints** in the last 24h
+2. **Shared blocker promotions** and number of leaves collapsed behind each blocker
+3. **Retries avoided** after blocker promotion
+4. **Failure-class breakdown by task kind**
+5. **Global budget throttles** with provider/model linkage
+
+Key success metrics:
+- drop in repeated identical failures per sibling set
+- increase in blocker reuse rate
+- reduction in tasks blocked only after max retries
+- fewer execution tasks auto-retrying on verification failures
+
+---
+
+## 9) Acceptance tests for the implementation task
+
+### A. Shared auth blocker
+1. Two execution leaves fail with the same expired-session fingerprint.
+2. System creates/wakes one auth blocker.
+3. Sibling leaves stop consuming retries.
+4. Both leaves reference the same blocker task id/key.
+
+### B. Anti-bot/access denial
+1. First confirmed anti-bot fingerprint occurs on an execution task.
+2. Executor does **not** requeue same path.
+3. Alternate route or human unblock task is requested.
+
+### C. Missing shared input
+1. Multiple tasks lack the same approval token / file.
+2. One prerequisite task is created.
+3. Additional failures attach to existing blocker instead of creating duplicates.
+
+### D. Invalid state transition
+1. Execution task tries to mutate an already-finalized resource.
+2. No same-path retry occurs.
+3. Task routes to refresh/replan or closes as superseded.
+
+### E. Verification failure
+1. Mutation step appears to succeed but verification check is inconclusive.
+2. Task does not spend standard retry budget.
+3. Task escalates to review/verification work.
+
+### F. Dynamic budgets by task kind
+1. `artifact_only` task with tool outage gets >0 retries.
+2. `aggregate` task with dependency blocker gets 0 same-path retries.
+3. `execution` task with auth failure gets <=1 retry before blocker promotion.
+
+### G. Repeated cost exhaustion
+1. Same provider quota exhaustion hits multiple tasks in window.
+2. A global throttle / provider-switch path is activated.
+3. Future dispatch is deferred rather than burning leaf retries.
+
+---
+
+## 10) Implementation order recommendation
+
+1. Add fingerprint schema + deterministic classifier in executor.
+2. Persist metadata and outcome-log extensions.
+3. Add blocker promotion / wake semantics.
+4. Add per-task-kind retry policy table.
+5. Add dashboard aggregation + regression tests.
+
+This order ensures the system can first *see* repeated failure identity, then *route* it, then *enforce* differentiated budgets.
+
+---
+
+## Bottom line
+
+T1712 should change TaskCore from **"every failure increments one retry counter"** to **"each failure class carries a scoped fingerprint and a bounded next strategy."**
+
+That is the key behavior change needed to stop repeated sibling churn, promote blocker-removal work when appropriate, and make retry policy depend on both **what failed** and **what kind of task failed**.