diff --git a/docs/rfcs/0001-directed-evolution.md b/docs/rfcs/0001-directed-evolution.md new file mode 100644 index 00000000..cd372253 --- /dev/null +++ b/docs/rfcs/0001-directed-evolution.md @@ -0,0 +1,788 @@ +# RFC-0001: Directed Evolution + +- Status: Draft +- Date: 2026-05-26 +- Authors: Codex, with product direction from the human director +- Related: + - ADR-0013: Evolution Loop Agent Integration + - ADR-0025: Evolution Records & Governance Decisions as System Entities + - ADR-0034: GEPA-Based Self-Improvement Loop + - ADR-0035: IntentDiscovery Evolution Loop + - `os-apps/evolution/` + - `os-apps/intent-discovery/` + +## Summary + +Directed Evolution is Temper's end-to-end loop for improving a running +application as if it were an organism under guided selection pressure. + +The loop starts from real signals: user behavior, simulated user behavior, +errors, traces, logs, metrics, unmet intents, and holistic observations made by +an agent brain. The brain interprets those signals into possible directions. +Some directions are safe enough to proceed automatically, especially repair +work. Growth directions, product changes, UX changes, and policy changes are +presented to the human director, who decides what should be pursued and +negotiates the Adaptation Goal and Viability Constraints with the brain in +chat. + +Once an episode starts, background Codex agents generate variants, evaluate +them through explicit evaluation stages, eliminate weak variants with recorded +evidence, run surviving variants against AI simulated users, and promote the +winner as the new parent of the organism. Mission Control is the observational +surface for this process: it shows what is happening, why it is happening, what +is automated, how variants are being selected or eliminated, and how the +organism's lineage changes over time. + +The v1 target is not a demo, a mock state machine, or a static specification +exercise. 
It must run a real evolution cycle against the Agent Answers organism, +using real Codex brains, real app variants, real evidence, real observability, +and a visible promotion in Genesis/Temper. + +## Why This RFC Exists + +The Codex and Claude directed-evolution branches explored useful pieces, but +neither completed the full product: + +- Claude moved closer to a reusable engine shape: explicit evolution entities, + WASM integration points, and a mission-control style UI with progress and + elimination affordances. +- Codex moved closer to concrete worker execution and proof discipline: + evidence-gated orchestration, local smoke tests, and working UI surfaces. + +The missing center is the product contract. We need one document that explains +what Directed Evolution is supposed to do, where the brain is used, where WASM +or state machines are used, what the human sees, what the entities mean, and +what counts as "fully working." + +## What We Take From Prior Branches + +From the Claude track, keep: + +- the Mission Control direction: progress, bracket-like variant flow, + elimination visibility, and a dashboard that makes evolution feel alive +- the instinct that Directed Evolution needs reusable engine entities, not only + one-off scripts +- the use of WASM integration points for bounded computation and reusable tools + +From the Codex track, keep: + +- the local worker direction for running Codex jobs outside deployed Genesis +- evidence-gated orchestration, where work is not considered done without + persisted proof +- the concrete app-oriented proof instinct: a runnable organism, variants, and + smoke/e2e verification instead of only specifications + +Do not copy either track wholesale. Claude was closer to the reusable engine +shape. Codex was closer to concrete execution proof. The target system needs +both. + +## Product Principles + +1. The brain is real. + Agent reasoning is not a decorative label on top of scripted logic. 
Where + the system needs judgment, interpretation, product taste, diagnosis, or + code generation, it must use an agent brain. + +2. The organism is real. + The thing being evolved is a running app, not only a set of abstract specs. + For v1, the organism is the Agent Answers app. + +3. Signals are not conclusions. + A failed action, error spike, or deterministic unmet-intent capture is raw + evidence. A brain interprets raw signals into pressures and directions. + +4. Humans direct; they do not micromanage. + The human chooses important growth directions, negotiates what matters, and + sets or pins constraints. The human should not manually choose winners when + the agreed selection process has enough evidence to decide. + +5. Mission Control is mostly observational. + The UI should make the living process legible. Chat remains the place where + human-brain negotiation happens. + +6. Automation must be visible. + The UI must show which repair and growth pressures are allowed to proceed + without human approval, and which require the human director. + +7. Every elimination must explain itself. + A dead variant should have a readable cause, linked evidence, and the + relevant metrics or constraints that killed it. + +8. Lineage matters. + The human should see the organism changing over time, not only the current + episode. + +## Glossary + +| Term | Meaning | +|------|---------| +| Organism | The app lineage being evolved. For v1, Agent Answers. | +| Organism Version | A promoted version of the organism that can be a parent for later episodes. | +| Lineage | The history of organism versions, promotions, branches, and mutations. | +| Signal | A raw observation from telemetry, traces, logs, failures, user behavior, simulated user behavior, or agent observation. | +| Pressure | A brain-interpreted reason the organism may need to change. Common classes are repair pressure and growth pressure. 
| +| Direction | A possible path for evolution, framed by the brain from one or more pressures. | +| Episode | One concrete evolution run pursuing a direction against an organism parent. | +| Generation | One round of variants inside an episode. An episode may have multiple generations. | +| Variant | One candidate app version produced during a generation. | +| Mutation | The concrete change introduced by a variant. | +| Adaptation Goal | The thing the episode is trying to improve. This replaces "North Star." | +| Viability Constraint | A requirement the organism must preserve while adapting. This replaces "Guardrail." | +| Selection Pressure | Episode-specific criteria used to decide which variants survive or win. | +| Evaluation Stage | A concrete trial, check, benchmark, review, or live test applied to variants. This replaces "Assay." | +| Stage Result | The result of one Evaluation Stage for one variant. | +| Metric | A named measurable quantity. | +| Measurement | One observed value for a Metric. | +| Elimination Rule | A hard rule that kills a variant. | +| Scoring Rule | A soft rule that ranks surviving variants. | +| Evidence | Trace, log, screenshot, diff, recording, test output, metric sample, or agent report supporting a result. | +| Trial | A live or simulated use of a variant, usually by AI simulated users. | +| Promotion | The act of making a winning variant the new parent organism version. | +| Autonomy Policy | The policy that says which pressures and episodes can proceed without human approval. | +| Brain Run | One invocation or session of an agent brain doing a bounded task. 
| + +## End-to-End Flow + +```mermaid +flowchart TD + A["Running organism"] --> B["Signals"] + B --> C["Observer brain"] + C --> D["Pressures"] + D --> E["Directions"] + E --> F{"Autonomy Policy"} + F -->|repair auto lane| G["Episode starts"] + F -->|growth or risky lane| H["Human-brain chat negotiation"] + H --> G + G --> I["Adaptation Goal + Viability Constraints"] + I --> J["Generation"] + J --> K["Variant brains generate mutations"] + K --> L["Evaluation Stages"] + L --> M["Eliminations + scoring"] + M --> N{"Winner good enough?"} + N -->|no| J + N -->|yes| O["Promotion"] + O --> P["New organism version"] + P --> Q["Lineage + Mission Control narrative"] + P --> A +``` + +### 1. Signals + +Signals can come from deterministic systems or from brains. + +Deterministic signal sources include: + +- errors +- latency regressions +- failed actions +- unavailable entity sets or actions +- failed guards +- test failures +- trace or log anomalies +- Datadog monitor alerts +- deterministic unmet-intent capture already present in Temper + +Brain-observed signal sources include: + +- a simulated user agent struggling to complete a goal +- a background observer noticing repeated friction that does not appear as a + clean error +- correlation across user sessions, traces, logs, and app state +- product opportunities inferred from actual usage +- confusing UX patterns discovered through agent use + +Signals are stored and tagged. They do not directly become directions. + +### 2. Pressures + +The observer brain reads signals and produces pressures. A pressure is a +reason to consider changing the organism. + +Pressure classes: + +| Class | Meaning | Default autonomy | +|-------|---------|------------------| +| Repair Pressure | Something is broken, degraded, failing, or unsafe. | May auto-start and auto-promote when bounded. | +| Growth Pressure | The organism could become more capable or useful. | Human approval required by default. 
| +| UX Pressure | The app works but human or agent users struggle with flow, clarity, or ergonomics. | Human approval required. | +| Policy Pressure | Behavior may need governance, permissions, or safety changes. | Human approval required. | +| Data Pressure | Schema, retention, indexing, or data movement may need to change. | Human approval required unless explicitly classified as bounded repair. | + +### 3. Directions + +A direction is a brain-framed possible path for evolution. It must include: + +- source pressures +- source signals and evidence +- why the brain thinks this is real +- why it matters to the organism +- whether it is repair, growth, UX, policy, data, or mixed +- recommended autonomy lane +- expected user-visible effect +- expected risks +- initial Adaptation Goal proposal +- initial Viability Constraint proposal + +Mission Control should show directions as a queue, but not as vague cards. +The human must be able to click through and inspect what fed each direction, +why it exists, and what evidence supports it. + +### 4. Autonomy Routing + +Autonomy Policy determines whether a direction can proceed automatically. + +Default lanes: + +| Lane | Can start without human? | Can promote without human? | Examples | +|------|---------------------------|-----------------------------|----------| +| Bounded Repair | Yes | Yes, if all Viability Constraints pass and blast radius is bounded. | Fix broken action, restore failing integration, revert performance regression. | +| Supervised Repair | Yes | No, unless the human has pre-authorized this class. | Data migration repair, risky dependency change. | +| Directed Growth | No | Yes, after the human-approved episode contract passes. There is no manual winner override. | New product feature, new workflow, new capability. | +| UX Change | No | No, unless explicitly pre-authorized. | Layout or interaction changes. | +| Policy Change | No | No. | Permissions, approval rules, data access rules. 
| + +The UI must always show: + +- active autonomy lane +- why the lane was chosen +- what the system may do without human input +- what is blocked until human input +- how the human can ask the brain to adjust policy in chat + +### 5. Human-Brain Negotiation + +For growth, UX, policy, and other ambiguous directions, the human and the brain +negotiate in chat. The chat is outside Mission Control. In v1, this means this +Codex session. + +The output of that negotiation is not merely prose. The brain must materialize: + +- Adaptation Goal +- Viability Constraints +- Selection Pressure +- Evaluation Stages +- Elimination Rules +- Scoring Rules +- required evidence +- any pinned constraints from the human + +Mission Control reflects those artifacts once they exist, but it does not try +to replace the conversation. + +### 6. Episode Start + +An episode starts from: + +- one selected direction +- one parent organism version +- one Adaptation Goal +- one set of Viability Constraints +- one Selection Pressure +- one Autonomy Policy lane +- one initial Evaluation Plan + +An episode is not the same as a direction. A direction can lead to multiple +episodes over time. An episode can have multiple generations. + +### 7. Generations and Variants + +Each generation creates multiple variants from the current episode parent. + +For v1: + +- background Codex jobs generate the variants +- each variant gets its own branch, app ref, deployment slot, or isolated + runtime identity +- each variant records its mutation summary +- each variant records the Brain Run that created it +- no variant may modify its own evaluators, Evaluation Stages, Elimination + Rules, Scoring Rules, or Viability Constraints + +If no variant is good enough, the episode may start another generation. The +next generation may use the best survivor, the original parent, or a deliberate +crossover/refinement source, but that choice must be recorded. + +### 8. 
Evaluation Stages + +Evaluation Stages are the legible checkpoints where variants are tested, +trialed, reviewed, eliminated, or scored. + +Common stages: + +| Stage | Purpose | Typical executor | +|-------|---------|------------------| +| Build Stage | Does the variant compile, install, and start? | Script/tooling, recorded as evidence. | +| Spec Verification Stage | Do affected IOA/CSDL/Cedar artifacts pass required verification? | Temper verification cascade. | +| Static Review Stage | Does the change violate obvious code, safety, determinism, or policy constraints? | Codex review brain plus deterministic checks. | +| Behavioral Stage | Does the variant satisfy the Adaptation Goal in controlled trials? | AI simulated users plus app telemetry. | +| Viability Stage | Does the variant preserve required existing behavior? | Tests, traces, simulated users, metrics. | +| Observability Stage | Does the variant emit required traces, logs, and metrics? | Tooling plus Datadog evidence. | +| Production Trial Stage | Does the variant perform under live or production-like traffic? | AI simulated users, routed traffic, Datadog. | +| Selection Stage | Which surviving variant best satisfies the Selection Pressure? | Selector brain plus scoring rules. | + +Stages can be reused across episodes, but they are not fully universal. The +organism should have a baseline evaluation ladder, and each episode may add +episode-specific stages, metrics, and rules. + +### 9. Metrics, Measurements, and Rules + +Metrics are reusable definitions. Measurements are values observed during a +stage. 
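The split between a reusable definition and an observed value can be sketched as follows. This is a minimal illustration, not the IOA spec: the class names mirror the `MetricDefinition` and `Measurement` entities in this RFC, but every field (`unit`, `higher_is_better`, `stage_result_id`) and the helper `measurements_for` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """Reusable definition, declared once and shared across episodes (hypothetical fields)."""
    name: str
    unit: str
    higher_is_better: bool


@dataclass(frozen=True)
class Measurement:
    """One observed value for a metric, tied to the stage result that produced it."""
    metric: MetricDefinition
    value: float
    variant_id: str
    stage_result_id: str


def measurements_for(variant_id: str, measurements: list[Measurement]) -> dict[str, float]:
    """Collect one variant's observed values, keyed by metric name (last listed value wins)."""
    return {m.metric.name: m.value for m in measurements if m.variant_id == variant_id}


# Two definitions, reused by every variant in the generation.
task_success = MetricDefinition("task_success_rate", "ratio", higher_is_better=True)
p95_latency = MetricDefinition("latency_p95", "ms", higher_is_better=False)

# Illustrative values only; real measurements would come from stage execution.
observed = [
    Measurement(task_success, 0.82, "variant-a", "stage-result-1"),
    Measurement(p95_latency, 410.0, "variant-a", "stage-result-2"),
    Measurement(task_success, 0.74, "variant-b", "stage-result-3"),
]
by_metric = measurements_for("variant-a", observed)
```

Keeping the definition separate means a rule can name `task_success_rate` once and be applied to measurements from any stage or variant.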
+ +Metric examples: + +- task success rate +- unmet-intent rate +- failed action rate +- latency p50/p95/p99 +- error rate +- trace span failure count +- cost per successful task +- number of user-agent retries +- human-readable confusion score from simulated users +- regression count against preserved workflows +- code review severity count +- verification pass/fail +- deployment health + +Elimination Rules kill variants. Examples: + +- build fails +- verification fails +- required observability is missing +- task success rate is below the parent +- any pinned Viability Constraint is violated +- security or policy review finds a blocking issue +- p95 latency exceeds the allowed regression budget + +Scoring Rules rank survivors. Examples: + +- maximize task success rate +- minimize unmet-intent rate +- minimize added complexity +- minimize latency and cost +- prefer smaller mutation when outcomes are equivalent +- prefer clearer user-facing behavior when metrics are close + +Every Elimination Rule and Scoring Rule must name the metrics, evidence, and +stage results it depends on. + +### 10. Trials and AI Simulated Users + +The organism is used by traffic. In v1, this traffic includes AI simulated +users. These users must be agents, not scripts pretending to be users. + +AI simulated users: + +- receive realistic goals +- interact with the running app +- make their own decisions about how to proceed +- produce traces and narrative observations +- can fail, misunderstand, retry, or reveal unmet intent +- are not told which variant should win +- are tagged in observability with `simulated_user_id` + +The simulation harness may provide task setup, accounts, seeded data, and +routing. The user behavior itself should come from agent reasoning. + +### 11. Selection and Promotion + +Selection chooses the best surviving variant according to the agreed Selection +Pressure, Evaluation Stage results, Elimination Rules, Scoring Rules, and +evidence. 
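As a rough sketch of how hard and soft rules might combine during selection: Elimination Rules run first and record a cause, then Scoring Rules rank whoever survives. The rules below are taken from the examples in section 9 (build fails, task success rate below the parent, p95 latency over the regression budget, maximize task success, minimize unmet-intent rate). The types, function names, metric keys, and numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantResult:
    variant_id: str
    metrics: dict[str, float]
    build_passed: bool = True


@dataclass
class Elimination:
    variant_id: str
    rule: str       # the Elimination Rule that killed the variant
    evidence: str   # readable cause; a real record would link evidence artifacts


def check_elimination(v: VariantResult, parent: dict[str, float],
                      latency_budget_ms: float) -> Optional[Elimination]:
    """Return the first hard rule the variant violates, or None if it survives."""
    if not v.build_passed:
        return Elimination(v.variant_id, "build fails", "build stage did not pass")
    if v.metrics["task_success_rate"] < parent["task_success_rate"]:
        return Elimination(v.variant_id, "task success rate is below the parent",
                           f"{v.metrics['task_success_rate']} < {parent['task_success_rate']}")
    if v.metrics["latency_p95"] > parent["latency_p95"] + latency_budget_ms:
        return Elimination(v.variant_id, "p95 latency exceeds the allowed regression budget",
                           f"{v.metrics['latency_p95']} ms")
    return None


def select_winner(variants: list[VariantResult], parent: dict[str, float],
                  latency_budget_ms: float):
    """Apply hard rules, then rank survivors with the soft rules."""
    eliminated, survivors = [], []
    for v in variants:
        verdict = check_elimination(v, parent, latency_budget_ms)
        if verdict is not None:
            eliminated.append(verdict)
        else:
            survivors.append(v)
    # Scoring: maximize task success rate, then minimize unmet-intent rate.
    ranked = sorted(survivors, key=lambda v: (-v.metrics["task_success_rate"],
                                              v.metrics["unmet_intent_rate"]))
    # No survivor means no winner; the episode may start another generation.
    return (ranked[0] if ranked else None), eliminated


parent_metrics = {"task_success_rate": 0.70, "latency_p95": 400.0}
candidates = [
    VariantResult("variant-a", {"task_success_rate": 0.82, "latency_p95": 410.0, "unmet_intent_rate": 0.10}),
    VariantResult("variant-b", {"task_success_rate": 0.65, "latency_p95": 380.0, "unmet_intent_rate": 0.05}),
    VariantResult("variant-c", {"task_success_rate": 0.82, "latency_p95": 420.0, "unmet_intent_rate": 0.07}),
    VariantResult("variant-d", {}, build_passed=False),
]
winner, deaths = select_winner(candidates, parent_metrics, latency_budget_ms=50.0)
```

In this sketch the ranking is deterministic. A selector brain would still be expected to explain the outcome from the same recorded results.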
+ +The selector can be a brain, but it must be constrained by the recorded +selection artifacts. It should explain its conclusion in human-readable terms. + +The human does not manually override the winner in the normal flow. If the +human disagrees with a winner, that means the Adaptation Goal, Viability +Constraints, Selection Pressure, or Evaluation Stages were wrong or incomplete. +The right action is to stop, revise, or run another episode, not silently pick +a favorite. + +Promotion makes the winner the new parent organism version and records: + +- parent version +- winning variant +- mutation summary +- selection explanation +- evidence bundle +- deployment/app ref +- rollback pointer +- lineage edge + +## Brain Roles + +The system uses multiple brain instances. They are the same class of agent +where possible, but they are not the same session. + +| Role | Responsibility | v1 engine | +|------|----------------|-----------| +| Human-facing director brain | Negotiate goals, constraints, and direction with the human. | This Codex chat session. | +| Observer brain | Read signals and infer pressures/directions. | Background Codex via TemperPaw worker. | +| Direction framer brain | Produce direction records with provenance and recommended autonomy lane. | Background Codex via TemperPaw worker. | +| Evaluation designer brain | Propose Adaptation Goal, Viability Constraints, Selection Pressure, stages, metrics, and rules. | This Codex chat for human-facing negotiation; background Codex for draft materialization. | +| Variant generator brain | Produce candidate app variants. | Background Codex jobs via TemperPaw worker. | +| Simulated user brain | Use the organism like a real user with a goal. | Background Codex jobs or another approved agent runner, managed through TemperPaw. | +| Reviewer brain | Review variants for code, UX, determinism, safety, or policy issues. | Background Codex jobs. | +| Selector brain | Explain the winner from recorded evidence and scoring. 
| Background Codex constrained by stage results. | +| Narrator brain | Produce concise human-readable episode and lineage explanations for Mission Control. | Background Codex. | + +TemperPaw is not the brain. TemperPaw is the worker/orchestration layer that can +run local Codex jobs, feed them bounded context, capture outputs, and write +results back to Temper/Genesis. + +## Architecture + +Directed Evolution has three planes. + +### Control Plane + +The Control Plane stores the state machine and audit trail: + +- Organisms +- Directions +- Episodes +- Generations +- Variants +- Evaluation Stages +- Stage Results +- Trials +- Promotions +- Lineage +- Autonomy Policy +- Brain Runs +- Work Items + +This should be Temper-native: IOA entities, OData APIs, Cedar governance, +telemetry, and event sourcing. + +Existing `IntentDiscovery` and `EvolutionRun` are useful predecessors, but the +full Directed Evolution model is broader: + +- `IntentDiscovery` maps most closely to signal gathering, observer brain + analysis, and direction creation. +- `EvolutionRun` maps most closely to one episode/generation loop, but its + current shape is GEPA/spec-mutation oriented and must be extended for app + variants, AI simulated users, live trials, lineage, and Mission Control. + +### Execution Plane + +The Execution Plane performs work outside the state machine: + +- local Codex jobs +- app variant generation +- build/test commands +- deployment/app-ref creation +- simulated user runs +- evidence collection + +For v1, execution should be pull-based: + +1. The Control Plane creates a Work Item. +2. A local TemperPaw worker polls or subscribes for runnable work. +3. The worker starts a Codex job locally. +4. The job produces structured output and artifacts. +5. The worker writes results back to the Control Plane. + +This avoids requiring deployed Genesis or Railway services to directly run +Codex. 
The deployed system can own the entities and UI while local workers do +the agent execution. + +### What Moves State + +The state machine is moved by Temper entity actions, not by hidden UI state and +not by WASM acting on its own. + +| Thing | Moves state? | Role | +|-------|--------------|------| +| Human chat | Indirectly | Human tells the director brain what to pursue or preserve. The brain materializes entity actions. | +| Mission Control UI | Sometimes | Operational clicks can dispatch explicit entity actions such as pause, resume, stop, dismiss, or pin. | +| Temper entities | Yes | The source of truth for Direction, Episode, Generation, Variant, StageResult, Trial, Promotion, and Lineage transitions. | +| TemperPaw worker | Yes, through OData/entity actions | Pulls Work Items, runs Codex jobs, then records results back into Temper/Genesis. | +| Codex brain | Indirectly | Decides, generates, reviews, selects, or explains, then emits structured outputs that workers submit as actions. | +| WASM modules | No direct authority | Compute bounded results, transform data, call allowed tools, or produce reports. Their outputs must be recorded through entity actions. | +| Datadog | No | Provides evidence through telemetry. It does not decide or transition entities. | + +This separation is important: brains make judgments, WASM computes bounded +steps, workers execute jobs, and Temper entities preserve the official state +history. + +### Observability Plane + +The Observability Plane provides evidence: + +- traces +- logs +- metrics +- screenshots +- test output +- deployment health +- simulated user trajectories +- app-level events + +Datadog can be used even when Codex runs locally because local and deployed +processes can emit to the same observability backend. 
All emitted data must +carry correlation tags: + +- `organism_id` +- `organism_version_id` +- `direction_id` +- `episode_id` +- `generation_id` +- `variant_id` +- `trial_id` +- `brain_run_id` +- `simulated_user_id` +- `tenant` +- `app_ref` +- `environment` + +The UI must never depend on only opaque Datadog links. Datadog is source +evidence, but key results should be materialized into Temper entities so the +episode can be understood later. + +## Entity Plan + +The exact IOA specs can evolve, but the model should include these first-class +entities or equivalent records. + +| Entity | Purpose | +|--------|---------| +| Organism | Identifies the app being evolved and its baseline evaluation ladder. | +| OrganismVersion | A promoted parent version with app refs, deployment refs, and evidence. | +| LineageEdge | Connects parent versions, variants, promotions, and mutation summaries. | +| Signal | Raw observation with source, timestamp, tags, and evidence references. | +| Pressure | Brain-interpreted reason to consider evolution. | +| Direction | Candidate path for evolution with provenance, risk, autonomy lane, and proposed goal. | +| Episode | Concrete run pursuing one direction from one parent version. | +| Generation | One variant batch inside an episode. | +| Variant | Candidate app version with mutation, branch/ref, runtime identity, and status. | +| Mutation | Structured summary of what a variant changed. | +| AdaptationGoal | Episode goal the variants are trying to satisfy. | +| ViabilityConstraint | Durable or episode-specific behavior that must be preserved. | +| SelectionPressure | Episode-specific criteria for survivor ranking and winner selection. | +| EvaluationStage | Reusable or episode-specific checkpoint applied to variants. | +| StageResult | Result of a stage for a variant. | +| MetricDefinition | Reusable metric definition. | +| Measurement | Observed metric value tied to a StageResult or Trial. 
| +| EliminationRule | Hard rule that can kill a variant. | +| ScoringRule | Soft rule that ranks survivors. | +| EvidenceArtifact | Trace, log, screenshot, report, diff, or artifact supporting a result. | +| Trial | Live or simulated traffic run against a variant. | +| Promotion | Winning variant becoming the new organism parent. | +| AutonomyPolicy | Which pressure classes and risks can start or promote automatically. | +| BrainRun | One bounded invocation of a brain role. | +| WorkItem | Runnable unit consumed by TemperPaw or another worker. | + +## Mission Control UX + +Mission Control should follow Claude's stronger UI direction: a live, +game-dashboard-like surface showing progress, brackets, eliminations, evidence, +and lineage. It should be useful, not theatrical. + +Primary views: + +| View | Purpose | +|------|---------| +| Direction Queue | Shows possible directions, pressure class, autonomy lane, and provenance. | +| Direction Detail | Explains what fed the direction, why it exists, evidence, and proposed goal. | +| Episode Dashboard | Shows current stage, generation, variants, survival status, and progress. | +| Variant Bracket | Shows variants moving through stages, eliminations, and winner selection. | +| Variant Compare | Compares mutations, metrics, evidence, and constraints across variants. | +| Death Report | Explains why a variant died, with evidence and violated rules. | +| Trial Monitor | Shows AI simulated users, goals, traces, and outcomes per variant. | +| Autonomy Panel | Shows what is currently allowed to proceed automatically. | +| Lineage View | Shows the organism's growth over versions, branches, promotions, and mutations. | +| Specimen View | Shows the current organism and how recent episodes changed it. 
| + +Allowed UI interactions in v1: + +- pause, resume, or stop an episode +- inspect why a variant died +- compare variants +- pin an important Viability Constraint +- dismiss a direction +- optionally select a direction only if that creates a real work item/callback + for the brain and makes clear that chat negotiation may still be required + +Not in v1 UI: + +- approve or revise evaluation criteria in forms +- ask the brain from inside the UI +- manually promote a winner +- pretend that a click replaces human-brain negotiation + +## Relationship Between Chat and UI + +Chat is the collaboration surface. Mission Control is the observation surface. + +The human should be able to say in chat: + +- pursue this direction +- explain this direction +- change the Adaptation Goal +- pin this constraint +- stop this episode +- why did this variant die +- what changed in the organism + +The UI should update because the underlying entities changed, not because the +UI is a separate command surface with its own hidden workflow. + +If a UI action requires back-and-forth human judgment, it should be done in +chat instead. If a UI action is low-ambiguity and operational, it can live in +Mission Control. + +## V1 Vertical Slice + +The first fully working slice should prove the whole loop on one organism. 
+ +Organism: + +- Agent Answers app + +Brains: + +- this Codex chat as human-facing director brain +- background Codex jobs via TemperPaw for observer, variant generation, + simulated users, review, selection, and narration + +Control Plane: + +- Temper/Genesis entities for organism, direction, episode, generation, + variant, stage result, trial, promotion, lineage, and autonomy policy + +Execution Plane: + +- local TemperPaw worker launches Codex jobs +- variants are created as real app refs, branches, deployment slots, or + otherwise isolated runnable versions + +Observability: + +- Datadog receives traces/logs/metrics from the app, variants, simulated users, + and workers +- key evidence is materialized into Temper entities + +Required flow: + +1. Agent Answers is running as the parent organism. +2. AI simulated users use the app for realistic goals. +3. Signals are captured from app behavior, user-agent behavior, errors, traces, + logs, and metrics. +4. Observer brain produces pressures and directions. +5. Mission Control shows directions with provenance. +6. Human selects or confirms a growth direction in chat. +7. Human and Codex negotiate Adaptation Goal and Viability Constraints in chat. +8. Episode starts from the selected direction. +9. Background Codex jobs generate at least three variants. +10. Each variant runs through Evaluation Stages with recorded Stage Results. +11. AI simulated users exercise surviving variants. +12. Weak variants are eliminated with Death Reports. +13. Selector brain chooses the winner from evidence and scoring. +14. Repair episodes may auto-promote if policy allows; growth episodes promote + after the human-approved direction and evaluation contract completes. +15. Winner becomes the new Organism Version. +16. Mission Control shows what changed and why it won. +17. Lineage View shows the organism's new branch/version. +18. The promoted app behavior is visible in Genesis/Temper, not only in docs. 
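Step 14 depends on the autonomy lane of the episode. The default lanes from the Autonomy Routing section could be encoded roughly as below. This is a sketch under assumptions: the enum, the `PromotionContext` fields, and both function names are invented for illustration, and a real Autonomy Policy would live in Temper entities and Cedar governance rather than in Python.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(Enum):
    BOUNDED_REPAIR = "bounded_repair"
    SUPERVISED_REPAIR = "supervised_repair"
    DIRECTED_GROWTH = "directed_growth"
    UX_CHANGE = "ux_change"
    POLICY_CHANGE = "policy_change"


@dataclass(frozen=True)
class PromotionContext:
    """Facts the router needs at promotion time (hypothetical fields)."""
    constraints_pass: bool               # all Viability Constraints pass
    blast_radius_bounded: bool = False   # relevant to Bounded Repair
    pre_authorized: bool = False         # human pre-authorized this class
    contract_passed: bool = False        # human-approved episode contract passed


def can_start_without_human(lane: Lane) -> bool:
    """Only the two repair lanes may start an episode automatically."""
    return lane in (Lane.BOUNDED_REPAIR, Lane.SUPERVISED_REPAIR)


def can_promote_without_human(lane: Lane, ctx: PromotionContext) -> bool:
    """Mirror the 'Can promote without human?' column of the default lanes."""
    # Assumption: a Viability Constraint failure blocks promotion in every lane.
    if not ctx.constraints_pass:
        return False
    if lane is Lane.BOUNDED_REPAIR:
        return ctx.blast_radius_bounded
    if lane in (Lane.SUPERVISED_REPAIR, Lane.UX_CHANGE):
        return ctx.pre_authorized
    if lane is Lane.DIRECTED_GROWTH:
        return ctx.contract_passed
    return False  # Policy Change never promotes without the human


repair_ok = can_promote_without_human(
    Lane.BOUNDED_REPAIR, PromotionContext(constraints_pass=True, blast_radius_bounded=True))
growth_ok = can_promote_without_human(
    Lane.DIRECTED_GROWTH, PromotionContext(constraints_pass=True, contract_passed=True))
policy_ok = can_promote_without_human(
    Lane.POLICY_CHANGE, PromotionContext(constraints_pass=True, pre_authorized=True))
```

Directed Growth cannot start without the human, yet it can promote without a manual winner pick once the approved contract passes. That matches the rule that there is no manual winner override.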
+ +## Acceptance Criteria + +Directed Evolution v1 is fully working when all of the following are true: + +- A real Agent Answers organism is registered with a parent version. +- AI simulated users, not deterministic scripts, exercise the organism. +- Signals from usage and observability are captured with correlation IDs. +- A background observer brain creates at least one direction from real signals. +- The direction shows provenance in Mission Control. +- The human can negotiate the Adaptation Goal and Viability Constraints with + Codex in chat. +- The negotiated artifacts are persisted as entities. +- An episode creates multiple real variants. +- Variants are actually runnable. +- Evaluation Stages execute and produce Stage Results, Metrics, Measurements, + and Evidence Artifacts. +- At least one variant is eliminated with an inspectable Death Report. +- Surviving variants are trialed with AI simulated users. +- The selector brain chooses a winner from recorded evidence. +- The winner is promoted to a new Organism Version. +- Lineage records the parent, winner, mutation, and promotion evidence. +- Mission Control displays directions, episode progress, variant comparison, + eliminations, autonomy policy, and lineage. +- Datadog contains correlated traces/logs/metrics for the episode. +- The final promoted change is visible in the running app. +- No stage is mocked in a way that could pass without real execution. + +## Open Questions + +1. Should the Control Plane live first in deployed Genesis/Temper, local + Temper, or a hybrid where local Temper mirrors deployed entities? +2. What is the first growth direction for Agent Answers that is meaningful but + small enough for a v1 proof? +3. How should app variants be isolated: branches, app refs, deployments, + tenant-scoped routing, or another Genesis primitive? +4. What is the minimal Datadog setup for local worker plus deployed app + correlation? +5. 
Which parts of the current `os-apps/evolution/EvolutionRun` should be kept, + renamed, or split into Episode/Generation/Variant entities? +6. Does `IntentDiscovery` become the observer/direction creator, or does + Directed Evolution introduce a broader `DirectionDiscovery` entity? +7. What policy language should express bounded growth lanes if the human + pre-authorizes some growth classes later? +8. How should rollback be represented in lineage and Promotion records? + +## Implementation Notes + +- Do not hand-write organism-specific specs as the primary workflow. The human + describes intent in chat; the brain materializes entities and changes. +- Do not let variant-generation brains modify evaluators, Selection Pressure, + Elimination Rules, Scoring Rules, or Viability Constraints for their own + variants. +- Keep deterministic checks and WASM computation where they are useful, but do + not replace agent judgment with scripts when the product requires a brain. +- Keep Mission Control sparse in interaction and rich in explanation. +- Prefer Temper entities over markdown progress files for stateful work when + Temper MCP is available. +- Preserve the existing verification cascade and Cedar governance model. + +## Non-Goals + +- Building a generic no-code evolution designer before proving one organism. +- Replacing chat with an in-app assistant. +- Letting the human manually select winners. +- Treating failed deterministic actions as final unmet-intent conclusions + without brain interpretation. +- Running Codex inside a deployed Railway/Genesis process for v1. +- Shipping a dashboard backed by fixture data and calling it complete. 
+ +## Naming Decisions + +Accepted terms: + +- Adaptation Goal +- Viability Constraint +- Selection Pressure +- Evaluation Stage +- Stage Result +- Direction +- Episode +- Generation +- Variant +- Mutation +- Trial +- Promotion +- Lineage +- Autonomy Policy + +Rejected or avoided terms: + +- Fitness Charter +- Fitness Plan +- Assay +- Human winner override + +The word "fitness" may still be used informally or in code where it is already +established, but user-facing v1 language should prefer Selection Pressure, +Evaluation Stage, Metrics, Elimination Rules, and Scoring Rules.