
feat(testing): e2e scenario harness — real boot, echo provider, cross-LLM evaluation#171

Merged: ryaker merged 1 commit into main from feat/e2e-clean on Mar 25, 2026

Conversation

@ryaker (Owner) commented Mar 25, 2026

User description

Summary

  • Adds EchoProvider (type: echo) — deterministic, no API keys, works in CI on any OS
  • 7 e2e scenarios: basic task, session persistence, provider routing, failover, cross-provider evaluation, prompt injection safety, concurrency
  • Cross-LLM evaluation pattern: generator task → evaluate: [output] → evaluator responds with EVALUATION: prefix (Gemini checks Claude's work in real runs)
  • CI job on ubuntu-latest + macos-latest, no secrets needed
  • test:e2e runs in CI; test:e2e:real for local runs with real API keys
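
To make the determinism claim concrete, here is a minimal sketch of a keyword-driven echo responder. It is not the code in src/providers/echo-provider.ts: the function name `echoRespond`, the `ECHO:` prefix, and the wording after each prefix are assumptions for illustration. Only the `EVALUATION:` prefix for evaluate prompts comes from the summary above.

```typescript
// Hypothetical sketch of a deterministic responder in the spirit of EchoProvider.
// `echoRespond` and the "ECHO:" prefix are illustrative assumptions, not the PR's API.
function echoRespond(task: string): string {
  const normalized = task.trim();

  // Per the PR summary, prompts containing the "evaluate" keyword
  // are answered with an EVALUATION: prefix.
  if (normalized.toLowerCase().includes("evaluate")) {
    return `EVALUATION: received ${normalized.length} characters to review`;
  }

  // Everything else is echoed back, so the output depends only on the input.
  return `ECHO: ${normalized}`;
}
```

Because the response is a pure function of the prompt, the same scenario should produce the same stdout on every CI runner, which is what lets the assertions match exact substrings.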

Test plan

  • npm run build:backend passes
  • ZORA_E2E=1 npx vitest run tests/e2e/ — all 7 scenarios pass
  • CI e2e job green on ubuntu-latest and macos-latest

🤖 Generated with Claude Code


CodeAnt-AI Description

Add a built-in Echo provider for repeatable E2E testing

What Changed

  • Added a new echo provider that returns predictable outputs without API keys, so end-to-end runs can pass in CI on any supported machine
  • CLI provider selection now recognizes echo, allowing configs to route tasks to the new provider
  • Added E2E scenario coverage for task routing, session recording, provider fallback, cross-provider evaluation, prompt-injection handling, and concurrent runs
  • Added CI and local commands to run the E2E suite, plus fixtures and notes for both echo-based and real-provider testing

Impact

✅ Stable CI end-to-end runs without API keys
✅ Clearer provider fallback coverage
✅ Faster validation of real CLI behavior

…s-LLM eval pattern, CI on linux+macos

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
codeant-ai Bot commented Mar 25, 2026

CodeAnt AI is reviewing your PR.

coderabbitai Bot commented Mar 25, 2026

Warning

Rate limit exceeded

@ryaker has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 18 minutes and 29 seconds before requesting another review.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 75a0670d-f55b-4608-b7f2-114af30ced59

📥 Commits

Reviewing files that changed from the base of the PR and between 3154d61 and 778ed55.

📒 Files selected for processing (12)
  • .github/workflows/ci.yml
  • docs/testing/e2e-cross-llm-evaluation.md
  • package.json
  • src/cli/daemon.ts
  • src/cli/index.ts
  • src/providers/echo-provider.ts
  • src/providers/index.ts
  • src/types.ts
  • tests/e2e/scenario-harness.test.ts
  • tests/fixtures/e2e-config-real.toml.example
  • tests/fixtures/e2e-config.toml
  • tests/fixtures/e2e-policy.toml
@ryaker ryaker merged commit 825cd4a into main Mar 25, 2026
2 of 6 checks passed
@gemini-code-assist

Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the testing infrastructure by introducing a robust end-to-end scenario harness for the Zora agent. It provides a new EchoProvider that allows for deterministic and API-key-free testing, crucial for reliable continuous integration. The changes include a suite of 7 new scenarios designed to validate various agent behaviors and a sophisticated cross-LLM evaluation pattern, ensuring the agent's interactions with language models are thoroughly tested and verified.

Highlights

  • New EchoProvider: Introduced a deterministic EchoProvider for e2e testing, eliminating the need for API keys and enabling reliable CI execution across various operating systems.
  • Comprehensive E2E Scenarios: Added 7 new end-to-end scenarios covering core functionalities such as basic task routing, session persistence, provider routing, failover, cross-provider evaluation, prompt injection safety, and concurrency.
  • Cross-LLM Evaluation Pattern: Implemented a novel cross-LLM evaluation pattern where a 'generator' LLM produces output and a separate 'evaluator' LLM checks its quality, allowing for independent verification (e.g., Gemini checking Claude's work).
  • CI Integration: Configured CI jobs on ubuntu-latest and macos-latest to automatically run the new e2e test suite, ensuring continuous validation of the agent's behavior.
  • New NPM Scripts: Added test:e2e and test:e2e:real npm scripts to easily run the e2e tests in both mock (EchoProvider) and real (API keys required) provider modes.
Ignored Files
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/ci.yml
@codeant-ai codeant-ai Bot added the size:XL This PR changes 500-999 lines, ignoring generated files label Mar 25, 2026
codeant-ai Bot commented Mar 25, 2026

Sequence Diagram

This PR adds a deterministic Echo provider and a real-boot end-to-end harness that runs through the CLI in CI without secrets. The core flow validates provider selection, including fallback, then performs a two-step generate-then-evaluate cycle while asserting that session logs are written.

sequenceDiagram
    participant CI
    participant Harness
    participant CLI
    participant Router
    participant Echo
    participant Sessions

    CI->>Harness: Run e2e scenarios on linux and macos
    Harness->>CLI: Ask generator task
    CLI->>Router: Load config and select enabled provider
    alt Primary enabled
        Router->>Echo: Route task to echo primary
    else Primary disabled
        Router->>Echo: Fallback to echo evaluator
    end
    Echo-->>CLI: Return deterministic task output
    CLI->>Sessions: Write session events file
    Harness->>CLI: Ask evaluator task using generated output
    CLI->>Echo: Execute evaluate prompt
    Echo-->>Harness: Return EVALUATION response

Generated by CodeAnt AI

codeant-ai Bot commented Mar 25, 2026

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Environment Pollution
    The setup path can create ~/.zora/policy.toml on the host machine when it is missing, and the file is never restored or removed. That makes the harness non-self-contained and can affect later local runs or other test jobs that share the same home directory.

  • Flaky Lookup
    Session discovery is based on mtimeMs > sinceMs with only a 1 ms buffer. On filesystems with coarse timestamp precision, or when multiple runs happen close together, this can miss the session file created by the current scenario and produce intermittent failures.

  • Possible Bug
    The word-count branch will report 1 for empty or whitespace-only tasks because splitting an empty trimmed string still produces one array element. Please verify the expected behavior for blank prompts and handle that case explicitly.
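
The blank-prompt case flagged above is easy to reproduce in isolation. The sketch below contrasts a naive count with a guarded one. Both function names are hypothetical, and the naive version is only a guess at the shape of the code being described, not a copy of the provider.

```typescript
// Hypothetical reconstruction of the flagged pattern: splitting a trimmed
// string on whitespace. An empty string still splits into [""], length 1.
function countWordsNaive(task: string): number {
  return task.trim().split(/\s+/).length;
}

// Guarded variant: treat empty or whitespace-only input as zero words.
function countWords(task: string): number {
  const trimmed = task.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}
```

With these definitions, `countWordsNaive("   ")` yields 1 while `countWords("   ")` yields 0. Whether 0 is the right answer for a blank prompt is a product decision the review leaves open.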

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive E2E testing framework for Zora, centered around a new EchoProvider that offers deterministic responses for testing without requiring real LLM API keys. The changes include adding new E2E test scenarios in scenario-harness.test.ts, new configuration and policy fixture files, and updates to package.json to support running these tests. The EchoProvider is integrated into the CLI and daemon. Feedback from the review suggests simplifying the test:e2e:real command in the documentation for clarity and refactoring duplicated environment setup logic within the E2E test harness for improved maintainability.

```bash
cp tests/fixtures/e2e-config-real.toml.example tests/fixtures/e2e-config-real.toml
# Edit to set real provider credentials/models
ZORA_E2E=1 ZORA_REAL_PROVIDERS=1 npm run test:e2e:real
```

medium

The test:e2e:real script in package.json already includes the ZORA_E2E=1 and ZORA_REAL_PROVIDERS=1 environment variables. To avoid redundancy and simplify the command for users, you can remove them from this documentation.

Suggested change
    - ZORA_E2E=1 ZORA_REAL_PROVIDERS=1 npm run test:e2e:real
    + npm run test:e2e:real

Comment on lines +443 to +449
    const env: NodeJS.ProcessEnv = {
      ...process.env,
      ZORA_CONFIG_DIR: concConfig,
    };
    delete env['CLAUDECODE'];
    delete env['CLAUDE_CODE_ENTRYPOINT'];
    delete env['CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS'];

medium

This environment setup duplicates the logic in the spawnAsk function earlier in this file. Per our guidelines, duplicated logic should be extracted into a private helper method to improve maintainability and avoid future inconsistencies. At minimum, make this block identical to the one in spawnAsk; ideally, extract it into a shared helper function such as createTestEnv.

Suggested change
    - const env: NodeJS.ProcessEnv = {
    -   ...process.env,
    -   ZORA_CONFIG_DIR: concConfig,
    - };
    - delete env['CLAUDECODE'];
    - delete env['CLAUDE_CODE_ENTRYPOINT'];
    - delete env['CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS'];
    + const env: NodeJS.ProcessEnv = {
    +   ...process.env,
    +   ZORA_CONFIG_DIR: concConfig,
    +   // Strip Claude Code env vars so we don't trigger SDK conflicts
    +   CLAUDECODE: undefined,
    +   CLAUDE_CODE_ENTRYPOINT: undefined,
    +   CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS: undefined,
    + };
    + // Remove undefined keys (spawn passes them as the string "undefined")
    + for (const key of ['CLAUDECODE', 'CLAUDE_CODE_ENTRYPOINT', 'CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS']) {
    +   delete env[key];
    + }
References
  1. Extract duplicated logic into a private helper method to improve maintainability and avoid future inconsistencies.
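
One way the shared helper could look is sketched below. The review names `createTestEnv` but does not specify its signature, so the parameters here are assumptions; the base environment is passed in explicitly (the harness would presumably pass `process.env`) so the sketch has no dependency on Node globals.

```typescript
// Env vars the review says must be stripped before spawning the CLI.
const CLAUDE_CODE_VARS = [
  "CLAUDECODE",
  "CLAUDE_CODE_ENTRYPOINT",
  "CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS",
];

type Env = Record<string, string | undefined>;

// Hypothetical helper: the signature is an assumption, not taken from the PR.
// Returns a copy of `baseEnv` pointing at `configDir`, with the Claude Code
// variables removed entirely rather than set to undefined.
function createTestEnv(baseEnv: Env, configDir: string): Env {
  const env: Env = { ...baseEnv, ZORA_CONFIG_DIR: configDir };
  for (const key of CLAUDE_CODE_VARS) {
    delete env[key];
  }
  return env;
}
```

Both spawnAsk and the concurrency scenario could then call `createTestEnv(process.env, someConfigDir)`, which keeps the list of stripped variables in a single place.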

Comment on lines +315 to +327
    const sinceMs = Date.now() - 1;

    const result = spawnAsk('summarize this task for the evaluator', {
      configDir: zoraConfigDir,
      cwd: tempDir,
    });

    expect(result.exitCode, `stderr: ${result.stderr}`).toBe(0);

    const newFiles = sessionFilesNewerThan(sinceMs);
    expect(newFiles.length, 'Expected a new session file').toBeGreaterThan(0);

    const events = parseJsonl(newFiles[0]!);

Suggestion: This provider-routing scenario assumes the newest file from a global shared sessions directory belongs to this test run, which is flaky when other Zora tasks write concurrently. Identify the session file created by this invocation (diff before/after list and match the task) before asserting provider source. [possible bug]

Severity Level: Major ⚠️
- ⚠️ Provider-routing assertion may read wrong session file.
- ⚠️ Shared global sessions cause flaky E2E behavior.
Suggested change
    - const sinceMs = Date.now() - 1;
    - const result = spawnAsk('summarize this task for the evaluator', {
    -   configDir: zoraConfigDir,
    -   cwd: tempDir,
    - });
    - expect(result.exitCode, `stderr: ${result.stderr}`).toBe(0);
    - const newFiles = sessionFilesNewerThan(sinceMs);
    - expect(newFiles.length, 'Expected a new session file').toBeGreaterThan(0);
    - const events = parseJsonl(newFiles[0]!);
    + const taskPrompt = 'summarize this task for the evaluator';
    + const beforeFiles = new Set(listSessionFiles());
    + const result = spawnAsk(taskPrompt, {
    +   configDir: zoraConfigDir,
    +   cwd: tempDir,
    + });
    + expect(result.exitCode, `stderr: ${result.stderr}`).toBe(0);
    + const createdFiles = listSessionFiles().filter(f => !beforeFiles.has(f));
    + expect(createdFiles.length, 'Expected a new session file').toBeGreaterThan(0);
    + const matchingFile = createdFiles.find((file) => {
    +   const fileEvents = parseJsonl(file);
    +   return fileEvents.some(e => {
    +     if (e['type'] !== 'task.start') return false;
    +     const content = e['content'] as { task?: string } | undefined;
    +     return content?.task === taskPrompt;
    +   });
    + });
    + expect(matchingFile, 'Expected session file for this scenario').toBeDefined();
    + const events = parseJsonl(matchingFile!);
Steps of Reproduction ✅
1. Run Scenario 3 (`tests/e2e/scenario-harness.test.ts:314`) while another `zora-agent ask` process runs concurrently on the same machine.

2. Scenario 3 collects files via `sessionFilesNewerThan()` and picks `newFiles[0]` (`tests/e2e/scenario-harness.test.ts:324-327`), assuming newest file is this test's run.

3. Session files are global (`~/.zora/sessions`) because CLI `ask` creates Orchestrator without `baseDir` (`src/cli/index.ts:183`), so Orchestrator defaults to `os.homedir()/.zora` (`src/orchestrator/orchestrator.ts:215`) and SessionManager writes under `sessions` (`src/orchestrator/session-manager.ts:111-113`).

4. If another process writes a newer file, `parseJsonl(newFiles[0])` reads the wrong run and provider assertion at `tests/e2e/scenario-harness.test.ts:333` becomes flaky/incorrect.
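
The before/after diff plus task match described in this comment can be isolated into one small function. The sketch below is an assumption-laden illustration: `findSessionForTask`, `parseJsonlText`, and the `SessionEvent` shape are hypothetical names, and the file reader is injected so the logic can be exercised without touching `~/.zora/sessions` (in the harness it would presumably wrap `fs.readFileSync`).

```typescript
// Assumed event shape, inferred from the suggested change (type + content.task).
type SessionEvent = { type?: string; content?: { task?: string } };

// Parse JSONL text into events, skipping blank lines.
function parseJsonlText(text: string): SessionEvent[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as SessionEvent);
}

// Hypothetical helper: among files that appeared after the `before` snapshot,
// return the one whose task.start event carries exactly this task prompt.
function findSessionForTask(
  before: string[],
  after: string[],
  readFile: (path: string) => string,
  task: string,
): string | undefined {
  const seen = new Set(before);
  return after
    .filter((file) => !seen.has(file))
    .find((file) =>
      parseJsonlText(readFile(file)).some(
        (e) => e.type === "task.start" && e.content?.task === task,
      ),
    );
}
```

Matching on the task text rather than on modification time sidesteps both the coarse-timestamp problem and interference from unrelated runs, at the cost of assuming each scenario uses a unique prompt.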
Comment on lines +387 to +401
    // Step 2: Evaluate the generated output
    const evalSince = Date.now() - 1;
    const evalPrompt = `evaluate: ${generatedText.slice(0, 200)}`;
    const evalResult = spawnAsk(evalPrompt, {
      configDir: zoraConfigDir,
      cwd: tempDir,
    });
    expect(evalResult.exitCode, `Evaluation step failed. stderr: ${evalResult.stderr}`).toBe(0);

    // EchoProvider responds with "EVALUATION:" for "evaluate" keyword
    expect(evalResult.stdout, 'Evaluator response should contain EVALUATION:').toContain('EVALUATION:');

    // Both session files were written
    const evalFiles = sessionFilesNewerThan(evalSince);
    expect(evalFiles.length, 'Evaluator should produce a session file').toBeGreaterThan(0);

Suggestion: The "cross-provider evaluation" scenario currently runs both generation and evaluation with the same config, so it does not actually verify cross-provider behavior. Force the evaluation step onto the fallback evaluator provider and assert the recorded source is the evaluator. [logic error]

Severity Level: Major ⚠️
- ⚠️ Scenario 5 misses cross-provider routing regressions.
- ⚠️ "Cross-provider" claim is unverified in current assertions.
Suggested change
    - // Step 2: Evaluate the generated output
    - const evalSince = Date.now() - 1;
    - const evalPrompt = `evaluate: ${generatedText.slice(0, 200)}`;
    - const evalResult = spawnAsk(evalPrompt, {
    -   configDir: zoraConfigDir,
    -   cwd: tempDir,
    - });
    - expect(evalResult.exitCode, `Evaluation step failed. stderr: ${evalResult.stderr}`).toBe(0);
    - // EchoProvider responds with "EVALUATION:" for "evaluate" keyword
    - expect(evalResult.stdout, 'Evaluator response should contain EVALUATION:').toContain('EVALUATION:');
    - // Both session files were written
    - const evalFiles = sessionFilesNewerThan(evalSince);
    - expect(evalFiles.length, 'Evaluator should produce a session file').toBeGreaterThan(0);
    + // Step 2: Evaluate the generated output with evaluator provider
    + const evalSetup = createTempZoraDir('cross-eval');
    + const evalDir = evalSetup.dir;
    + const evalConfig = evalSetup.configDir;
    + try {
    +   writeConfigWithDisabledPrimary(evalConfig);
    +   const evalSince = Date.now() - 1;
    +   const evalPrompt = `evaluate: ${generatedText.slice(0, 200)}`;
    +   const evalResult = spawnAsk(evalPrompt, {
    +     configDir: evalConfig,
    +     cwd: evalDir,
    +   });
    +   expect(evalResult.exitCode, `Evaluation step failed. stderr: ${evalResult.stderr}`).toBe(0);
    +   // EchoProvider responds with "EVALUATION:" for "evaluate" keyword
    +   expect(evalResult.stdout, 'Evaluator response should contain EVALUATION:').toContain('EVALUATION:');
    +   const evalFiles = sessionFilesNewerThan(evalSince);
    +   expect(evalFiles.length, 'Evaluator should produce a session file').toBeGreaterThan(0);
    +   const evalEvents = parseJsonl(evalFiles[0]!);
    +   const evalSources = evalEvents
    +     .filter(e => e['source'])
    +     .map(e => e['source'] as string);
    +   expect(evalSources.some(p => p === 'echo-evaluator'), 'Expected evaluator provider to handle evaluation').toBe(true);
    + } finally {
    +   removeTempDir(evalDir);
    + }
Steps of Reproduction ✅
1. Run Scenario 5 in `tests/e2e/scenario-harness.test.ts:373-402`; both generation and evaluation call `spawnAsk(..., { configDir: zoraConfigDir })` (`lines 376-379` and `390-393`).

2. The fixture config enables both providers with same capabilities, ranks 1 and 2 (`tests/fixtures/e2e-config.toml:7-21`).

3. Router default mode `respect_ranking` returns lowest rank candidate (`src/orchestrator/router.ts:123-126`), so both steps route to `echo-primary`.

4. Test still passes because it only asserts output contains `EVALUATION:` (`tests/e2e/scenario-harness.test.ts:397`), and EchoProvider emits that for `evaluate` keyword regardless of provider (`src/providers/echo-provider.ts:183-185`); no provider-source verification occurs.
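
The routing behavior in step 3 can be illustrated with a toy selector. This is a sketch of lowest-rank-wins selection only, not the router in src/orchestrator/router.ts: the `ProviderEntry` shape and `selectByRank` name are assumptions, while the provider names and ranks come from the fixture described above.

```typescript
// Assumed minimal provider shape for illustration.
interface ProviderEntry {
  name: string;
  rank: number;
  enabled: boolean;
}

// Hypothetical stand-in for a respect_ranking mode:
// among enabled providers, pick the one with the lowest rank.
function selectByRank(providers: ProviderEntry[]): ProviderEntry | undefined {
  return providers
    .filter((p) => p.enabled) // filter() copies, so sort() leaves the input untouched
    .sort((a, b) => a.rank - b.rank)[0];
}

// Mirrors the fixture: two echo providers with ranks 1 and 2.
const bothEnabled: ProviderEntry[] = [
  { name: "echo-primary", rank: 1, enabled: true },
  { name: "echo-evaluator", rank: 2, enabled: true },
];

// Mirrors the suggested fix: disable the primary so the evaluator is chosen.
const primaryDisabled: ProviderEntry[] = bothEnabled.map((p) =>
  p.name === "echo-primary" ? { ...p, enabled: false } : p,
);
```

Under these assumptions, `selectByRank(bothEnabled)` picks echo-primary for both the generate and evaluate steps, while `selectByRank(primaryDisabled)` picks echo-evaluator. That is why the suggested change writes a second config for the evaluation step.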
codeant-ai Bot commented Mar 25, 2026

CodeAnt AI finished reviewing your PR.
