feat(testing): e2e scenario harness — real boot, echo provider, cross-LLM evaluation#171
Conversation
…s-LLM eval pattern, CI on linux+macos Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
CodeAnt AI is reviewing your PR. Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
|
Warning Rate limit exceeded
⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (12)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the testing infrastructure by introducing a robust end-to-end scenario harness for the Zora agent. It provides a new Highlights
Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
Sequence DiagramThis PR adds a deterministic Echo provider and a real boot end to end harness that runs through the CLI in CI without secrets. The core flow validates provider selection including fallback, then performs a two step generate then evaluate cycle while asserting session logs are written. sequenceDiagram
participant CI
participant Harness
participant CLI
participant Router
participant Echo
participant Sessions
CI->>Harness: Run e2e scenarios on linux and macos
Harness->>CLI: Ask generator task
CLI->>Router: Load config and select enabled provider
alt Primary enabled
Router->>Echo: Route task to echo primary
else Primary disabled
Router->>Echo: Fallback to echo evaluator
end
Echo-->>CLI: Return deterministic task output
CLI->>Sessions: Write session events file
Harness->>CLI: Ask evaluator task using generated output
CLI->>Echo: Execute evaluate prompt
Echo-->>Harness: Return EVALUATION response
Generated by CodeAnt AI |
Nitpicks 🔍
|
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive E2E testing framework for Zora, centered around a new EchoProvider that offers deterministic responses for testing without requiring real LLM API keys. The changes include adding new E2E test scenarios in scenario-harness.test.ts, new configuration and policy fixture files, and updates to package.json to support running these tests. The EchoProvider is integrated into the CLI and daemon. Feedback from the review suggests simplifying the test:e2e:real command in the documentation for clarity and refactoring duplicated environment setup logic within the E2E test harness for improved maintainability.
| ```bash | ||
| cp tests/fixtures/e2e-config-real.toml.example tests/fixtures/e2e-config-real.toml | ||
| # Edit to set real provider credentials/models | ||
| ZORA_E2E=1 ZORA_REAL_PROVIDERS=1 npm run test:e2e:real |
There was a problem hiding this comment.
The test:e2e:real script in package.json already includes the ZORA_E2E=1 and ZORA_REAL_PROVIDERS=1 environment variables. To avoid redundancy and simplify the command for users, you can remove them from this documentation.
| ZORA_E2E=1 ZORA_REAL_PROVIDERS=1 npm run test:e2e:real | |
| npm run test:e2e:real |
| const env: NodeJS.ProcessEnv = { | ||
| ...process.env, | ||
| ZORA_CONFIG_DIR: concConfig, | ||
| }; | ||
| delete env['CLAUDECODE']; | ||
| delete env['CLAUDE_CODE_ENTRYPOINT']; | ||
| delete env['CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS']; |
There was a problem hiding this comment.
This environment setup logic is duplicated from the spawnAsk function earlier in this file. To improve maintainability and ensure consistency, this should be refactored. As per our guidelines, duplicated logic should be extracted into a private helper method to improve maintainability and avoid future inconsistencies. I suggest making this block identical to the one in spawnAsk. Ideally, this logic would be extracted into a shared helper function like createTestEnv.
| const env: NodeJS.ProcessEnv = { | |
| ...process.env, | |
| ZORA_CONFIG_DIR: concConfig, | |
| }; | |
| delete env['CLAUDECODE']; | |
| delete env['CLAUDE_CODE_ENTRYPOINT']; | |
| delete env['CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS']; | |
| const env: NodeJS.ProcessEnv = { | |
| ...process.env, | |
| ZORA_CONFIG_DIR: concConfig, | |
| // Strip Claude Code env vars so we don't trigger SDK conflicts | |
| CLAUDECODE: undefined, | |
| CLAUDE_CODE_ENTRYPOINT: undefined, | |
| CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS: undefined, | |
| }; | |
| // Remove undefined keys (spawn passes them as the string "undefined") | |
| for (const key of ['CLAUDECODE', 'CLAUDE_CODE_ENTRYPOINT', 'CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS']) { | |
| delete env[key]; | |
| } |
References
- Extract duplicated logic into a private helper method to improve maintainability and avoid future inconsistencies.
| const sinceMs = Date.now() - 1; | ||
|
|
||
| const result = spawnAsk('summarize this task for the evaluator', { | ||
| configDir: zoraConfigDir, | ||
| cwd: tempDir, | ||
| }); | ||
|
|
||
| expect(result.exitCode, `stderr: ${result.stderr}`).toBe(0); | ||
|
|
||
| const newFiles = sessionFilesNewerThan(sinceMs); | ||
| expect(newFiles.length, 'Expected a new session file').toBeGreaterThan(0); | ||
|
|
||
| const events = parseJsonl(newFiles[0]!); |
There was a problem hiding this comment.
Suggestion: This provider-routing scenario assumes the newest file from a global shared sessions directory belongs to this test run, which is flaky when other Zora tasks write concurrently. Identify the session file created by this invocation (diff before/after list and match the task) before asserting provider source. [possible bug]
Severity Level: Major ⚠️
- ⚠️ Provider-routing assertion may read wrong session file.
- ⚠️ Shared global sessions cause flaky E2E behavior.| const sinceMs = Date.now() - 1; | |
| const result = spawnAsk('summarize this task for the evaluator', { | |
| configDir: zoraConfigDir, | |
| cwd: tempDir, | |
| }); | |
| expect(result.exitCode, `stderr: ${result.stderr}`).toBe(0); | |
| const newFiles = sessionFilesNewerThan(sinceMs); | |
| expect(newFiles.length, 'Expected a new session file').toBeGreaterThan(0); | |
| const events = parseJsonl(newFiles[0]!); | |
| const taskPrompt = 'summarize this task for the evaluator'; | |
| const beforeFiles = new Set(listSessionFiles()); | |
| const result = spawnAsk(taskPrompt, { | |
| configDir: zoraConfigDir, | |
| cwd: tempDir, | |
| }); | |
| expect(result.exitCode, `stderr: ${result.stderr}`).toBe(0); | |
| const createdFiles = listSessionFiles().filter(f => !beforeFiles.has(f)); | |
| expect(createdFiles.length, 'Expected a new session file').toBeGreaterThan(0); | |
| const matchingFile = createdFiles.find((file) => { | |
| const fileEvents = parseJsonl(file); | |
| return fileEvents.some(e => { | |
| if (e['type'] !== 'task.start') return false; | |
| const content = e['content'] as { task?: string } | undefined; | |
| return content?.task === taskPrompt; | |
| }); | |
| }); | |
| expect(matchingFile, 'Expected session file for this scenario').toBeDefined(); | |
| const events = parseJsonl(matchingFile!); |
Steps of Reproduction ✅
1. Run Scenario 3 (`tests/e2e/scenario-harness.test.ts:314`) while another `zora-agent
ask` process runs concurrently on the same machine.
2. Scenario 3 collects files via `sessionFilesNewerThan()` and picks `newFiles[0]`
(`tests/e2e/scenario-harness.test.ts:324-327`), assuming newest file is this test's run.
3. Session files are global (`~/.zora/sessions`) because CLI `ask` creates Orchestrator
without `baseDir` (`src/cli/index.ts:183`), so Orchestrator defaults to
`os.homedir()/.zora` (`src/orchestrator/orchestrator.ts:215`) and SessionManager writes
under `sessions` (`src/orchestrator/session-manager.ts:111-113`).
4. If another process writes a newer file, `parseJsonl(newFiles[0])` reads the wrong run
and provider assertion at `tests/e2e/scenario-harness.test.ts:333` becomes
flaky/incorrect.Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** tests/e2e/scenario-harness.test.ts
**Line:** 315:327
**Comment:**
*Possible Bug: This provider-routing scenario assumes the newest file from a global shared sessions directory belongs to this test run, which is flaky when other Zora tasks write concurrently. Identify the session file created by this invocation (diff before/after list and match the task) before asserting provider source.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.| // Step 2: Evaluate the generated output | ||
| const evalSince = Date.now() - 1; | ||
| const evalPrompt = `evaluate: ${generatedText.slice(0, 200)}`; | ||
| const evalResult = spawnAsk(evalPrompt, { | ||
| configDir: zoraConfigDir, | ||
| cwd: tempDir, | ||
| }); | ||
| expect(evalResult.exitCode, `Evaluation step failed. stderr: ${evalResult.stderr}`).toBe(0); | ||
|
|
||
| // EchoProvider responds with "EVALUATION:" for "evaluate" keyword | ||
| expect(evalResult.stdout, 'Evaluator response should contain EVALUATION:').toContain('EVALUATION:'); | ||
|
|
||
| // Both session files were written | ||
| const evalFiles = sessionFilesNewerThan(evalSince); | ||
| expect(evalFiles.length, 'Evaluator should produce a session file').toBeGreaterThan(0); |
There was a problem hiding this comment.
Suggestion: The "cross-provider evaluation" scenario currently runs both generation and evaluation with the same config, so it does not actually verify cross-provider behavior. Force the evaluation step onto the fallback evaluator provider and assert the recorded source is the evaluator. [logic error]
Severity Level: Major ⚠️
- ⚠️ Scenario 5 misses cross-provider routing regressions.
- ⚠️ "Cross-provider" claim is unverified in current assertions.| // Step 2: Evaluate the generated output | |
| const evalSince = Date.now() - 1; | |
| const evalPrompt = `evaluate: ${generatedText.slice(0, 200)}`; | |
| const evalResult = spawnAsk(evalPrompt, { | |
| configDir: zoraConfigDir, | |
| cwd: tempDir, | |
| }); | |
| expect(evalResult.exitCode, `Evaluation step failed. stderr: ${evalResult.stderr}`).toBe(0); | |
| // EchoProvider responds with "EVALUATION:" for "evaluate" keyword | |
| expect(evalResult.stdout, 'Evaluator response should contain EVALUATION:').toContain('EVALUATION:'); | |
| // Both session files were written | |
| const evalFiles = sessionFilesNewerThan(evalSince); | |
| expect(evalFiles.length, 'Evaluator should produce a session file').toBeGreaterThan(0); | |
| // Step 2: Evaluate the generated output with evaluator provider | |
| const evalSetup = createTempZoraDir('cross-eval'); | |
| const evalDir = evalSetup.dir; | |
| const evalConfig = evalSetup.configDir; | |
| try { | |
| writeConfigWithDisabledPrimary(evalConfig); | |
| const evalSince = Date.now() - 1; | |
| const evalPrompt = `evaluate: ${generatedText.slice(0, 200)}`; | |
| const evalResult = spawnAsk(evalPrompt, { | |
| configDir: evalConfig, | |
| cwd: evalDir, | |
| }); | |
| expect(evalResult.exitCode, `Evaluation step failed. stderr: ${evalResult.stderr}`).toBe(0); | |
| // EchoProvider responds with "EVALUATION:" for "evaluate" keyword | |
| expect(evalResult.stdout, 'Evaluator response should contain EVALUATION:').toContain('EVALUATION:'); | |
| const evalFiles = sessionFilesNewerThan(evalSince); | |
| expect(evalFiles.length, 'Evaluator should produce a session file').toBeGreaterThan(0); | |
| const evalEvents = parseJsonl(evalFiles[0]!); | |
| const evalSources = evalEvents | |
| .filter(e => e['source']) | |
| .map(e => e['source'] as string); | |
| expect(evalSources.some(p => p === 'echo-evaluator'), 'Expected evaluator provider to handle evaluation').toBe(true); | |
| } finally { | |
| removeTempDir(evalDir); | |
| } |
Steps of Reproduction ✅
1. Run Scenario 5 in `tests/e2e/scenario-harness.test.ts:373-402`; both generation and
evaluation call `spawnAsk(..., { configDir: zoraConfigDir })` (`lines 376-379` and
`390-393`).
2. The fixture config enables both providers with same capabilities, ranks 1 and 2
(`tests/fixtures/e2e-config.toml:7-21`).
3. Router default mode `respect_ranking` returns lowest rank candidate
(`src/orchestrator/router.ts:123-126`), so both steps route to `echo-primary`.
4. Test still passes because it only asserts output contains `EVALUATION:`
(`tests/e2e/scenario-harness.test.ts:397`), and EchoProvider emits that for `evaluate`
keyword regardless of provider (`src/providers/echo-provider.ts:183-185`); no
provider-source verification occurs.Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** tests/e2e/scenario-harness.test.ts
**Line:** 387:401
**Comment:**
*Logic Error: The "cross-provider evaluation" scenario currently runs both generation and evaluation with the same config, so it does not actually verify cross-provider behavior. Force the evaluation step onto the fallback evaluator provider and assert the recorded source is the evaluator.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.|
CodeAnt AI finished reviewing your PR. |
User description
Summary
EchoProvider(type:echo) — deterministic, no API keys, works in CI on any OSevaluate: [output]→ evaluator responds withEVALUATION:prefix (Gemini checks Claude's work in real runs)ubuntu-latest+macos-latest, no secrets neededtest:e2eruns in CI;test:e2e:realfor local runs with real API keysTest plan
npm run build:backendpassesZORA_E2E=1 npx vitest run tests/e2e/— all 7 scenarios pass🤖 Generated with Claude Code
CodeAnt-AI Description
Add a built-in Echo provider for repeatable E2E testing
What Changed
echoprovider that returns predictable outputs without API keys, so end-to-end runs can pass in CI on any supported machineecho, allowing configs to route tasks to the new providerImpact
✅ Stable CI end-to-end runs without API keys✅ Clearer provider fallback coverage✅ Faster validation of real CLI behavior💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.