Skip to content

Add example agents and integration test workflow#340

Merged
kovtcharov merged 18 commits intomainfrom
kalin/examples
Apr 17, 2026
Merged

Add example agents and integration test workflow#340
kovtcharov merged 18 commits intomainfrom
kalin/examples

Conversation

@kovtcharov
Copy link
Copy Markdown
Collaborator

@kovtcharov kovtcharov commented Feb 12, 2026

Summary

  • Add 3 new example agents showcasing GAIA capabilities
  • Add real integration tests that execute agents and validate responses
  • Add CI/CD workflow running on Strix with Lemonade server
  • Update docs homepage with professional, technical messaging

New Examples

  • weather_agent.py - Real-time weather via MCP server integration
  • rag_doc_agent.py - Document Q&A using RAG for private data
  • product_mockup_agent.py - HTML landing page generator for rapid prototyping

All examples use Qwen3-4B-GGUF for faster inference.

Testing

  • tests/integration/test_example_agents.py - Real execution tests with response validation
  • .github/workflows/test_examples.yml - CI/CD on Strix runner with Lemonade server

Test coverage: 10/10 examples (100%)

Tests that actually run:

  • NotesAgent: Creates notes, validates database operations
  • ProductMockupAgent: Generates HTML, validates file creation
  • FileWatcherAgent: Watches directories, validates event handling
  • Structure tests for MCP-based agents (require external servers)

Documentation Updates

  • Updated docs homepage (docs/index.mdx)
  • Replaced marketing slogan with technical value prop: "Agent SDK for AMD Ryzen AI"
  • Added MCP to list of key capabilities
  • Added Computer Use Agents (CUA) as use case
  • More professional, technical tone

CI/CD Workflow

  • Runs on self-hosted Strix runner (stx label)
  • Starts Lemonade server with Qwen3-4B-GGUF
  • Executes agents and validates responses
  • 5-minute timeout per test
  • Skips copyright header validation (allows external contributions)

All examples are verified, copy-paste ready, and validated in CI/CD pipeline.

- Add weather_agent.py for MCP weather integration
- Add rag_doc_agent.py for document Q&A with RAG
- Add product_mockup_agent.py for HTML landing page generation
- Add integration tests for all example agents
- Add CI/CD workflow to validate examples on every PR
@github-actions github-actions bot added devops DevOps/infrastructure changes tests Test changes labels Feb 12, 2026
Claude Code added 4 commits February 11, 2026 17:34
- Add tests for mcp_time_server_agent.py
- Add tests for mcp_windows_system_health_agent.py
- Add tests for sd_agent_example.py
- Coverage now 10/10 (100%) of example files
External contributors can submit examples without AMD copyright
- Use Qwen3-4B-GGUF model for faster inference in examples
- Update workflow to run on self-hosted Strix runner with Lemonade server
- Convert integration tests to actually execute agents and validate responses
- Add Lemonade server startup to CI/CD workflow
- Replace marketing slogan with technical value prop
- Update headline to 'Agent SDK for AMD Ryzen AI'
- Add MCP to list of key capabilities
- Add Computer Use Agents (CUA) to use cases
- More professional tone per leadership feedback
@github-actions github-actions bot added the documentation Documentation changes label Feb 12, 2026
Comment thread docs/index.mdx Outdated
Comment thread docs/index.mdx Outdated
@kovtcharov kovtcharov self-assigned this Feb 27, 2026
Claude Code added 3 commits March 5, 2026 14:07
- Convert test_examples workflow steps to PowerShell for Windows CI
- Make Docker check non-blocking in MCP server tests
- Use python -m pytest instead of bare pytest
- Fix MCPClientMixin import path in weather_agent example
- Improve file watcher test with retry loop to reduce flakiness
- Clean up test formatting (black/isort compliance)
@kovtcharov kovtcharov added this to the Futures milestone Mar 16, 2026
The previous test_examples.yml failed on the stx runner because it ran the
Linux-only ./.github/actions/free-disk-space action (requires bash/df) and
used `shell: pwsh` — neither of which exist on the self-hosted Windows
runner. The workflow is now split:

* test-examples-unit runs on ubuntu-latest, validates syntax for every
  example, and runs the structure/import tests from
  tests/integration/test_example_agents.py. Tests decorated with
  @requires_lemonade auto-skip here.
* test-examples-integration runs on stx using ./.github/actions/setup-venv
  and ./.github/actions/install-lemonade, starts Lemonade via the shared
  start-lemonade.ps1 helper with Qwen3-4B-Instruct-2507-GGUF, and executes
  the full pytest suite including the LLM-backed tests.

Example agent modernization for current SDK:

* rag_doc_agent.py no longer pulls in the ChatAgent-specific RAGToolsMixin
  (whose tools require many ChatAgent-only attributes). It now registers a
  single `query_documents` tool bound to RAGSDK.query() and allow-lists the
  index directory so the SDK's path validator lets it read local files.
* weather_agent.py follows the idiomatic Agent+MCPClientMixin pattern used
  by the builder template: _mcp_manager is wired up before super().__init__()
  and MCP tool registration happens in _register_tools, so a fresh client
  manager is available when Agent.__init__ composes the system prompt.
* product_mockup_agent.py is unchanged apart from an automated black
  reformat of a long line.

All structure/import tests pass locally (7 passed, 3 skipped when Lemonade
is not running) and black/isort report clean.
…leanup

The stx integration run failed because TestNotesAgent instantiated
NotesAgent with the default model_id, which resolves to
Qwen3.5-35B-A3B-GGUF — a model we intentionally do not pull on the runner.
Lemonade returned HTTP 422 ("model_name=... not registered") and the test
then tripped a Windows PermissionError (WinError 32) when pytest tried to
delete the tempdir while the SQLite connection was still open.

Changes:

* Introduce TEST_MODEL_ID (env-override via GAIA_TEST_MODEL, default
  Qwen3-4B-Instruct-2507-GGUF) and thread it through every LLM-backed
  test: NotesAgent, ProductMockupAgent, FileWatcherAgent. This matches
  the model our workflow pulls via start-lemonade.ps1.
* Wrap the NotesAgent and FileWatcherAgent assertions in try/finally so
  close_db() / stop_all_watchers() runs before TemporaryDirectory tries
  to remove the directory, preventing the Windows file-lock error.
* Switch weather_agent.py to the free open-meteo-mcp server (no API key,
  vs the PyPI mcp-server-weather package which requires --api_key).
ProductMockupAgent and DocAgent passed model_id as an explicit kwarg to
super().__init__(**kwargs), which crashed with a duplicate-keyword
TypeError when callers (like the integration tests) also passed
model_id=... themselves.  Switch both to kwargs.setdefault("model_id", ...)
so callers can override without colliding.

Also updated weather_agent.py's connection-failure hint to reference
open-meteo-mcp instead of the stale mcp-server-weather package name.
Users running rag_doc_agent.py without ''[rag]'' extras installed hit an
obscure ImportError from RAGSDK ("Missing required RAG dependencies:
pypdf, sentence-transformers, faiss-cpu"). Add the install hint to the
docstring so the fix is discoverable.
The ''_split_text_with_llm'' chunking helper and ''_extract_text_from_json''
both re-imported ''json'' locally even though it is already imported at
module scope (line 12).  This triggered pylint W0404 (''Reimport'') which
is what caused the 'Run Code Quality Checks' workflow to fail with a
non-zero exit code on every PR.
@github-actions github-actions bot added rag RAG system changes performance Performance-critical changes labels Apr 17, 2026
Per Tomasz's review on PR #340, the original wording over-promised that
GAIA has "no cloud dependency" at all.  The accurate statement is that
the core runtime needs no cloud (so sensitive data stays on-device), but
individual agents can still opt into external services when a use case
requires it — the weather_agent.py example in this same PR is exactly
such a case.

Changes:
* Hero copy now qualifies the cloud-free claim and explicitly lists
  opt-in services (weather APIs, Jira, MCP servers).
* Rename the "No Cloud Dependency" card to "Cloud-Optional" with matching
  clarification.

Tomasz's comment thread on docs/index.mdx:
  #340 (comment)
Two related bugs surfaced during end-to-end testing of all 3 new example
agents with a local Lemonade server:

1. The examples defaulted to ''Qwen3-4B-GGUF'', which is the base
   (non-instruct) model.  With it, the LLM hallucinates tool usage
   instead of actually invoking the registered tools — DocAgent
   confidently answered "60% Product A / 40% Product B" for a document
   that explicitly said "70% product, 20% engineering, 10% marketing".
   The instruct-tuned variant ''Qwen3-4B-Instruct-2507-GGUF'' (already
   used by the CI workflow and test_rag.yml) follows tool-use
   instructions correctly and returns the grounded answer.

2. DocAgent passed ''model_id'' to Agent but the inner RAGSDK was
   constructed with an empty RAGConfig, silently falling back to the
   framework default ''Qwen3.5-35B-A3B-GGUF'' for its answer-synthesis
   call.  On a runner that does not have the 35B model pulled, this
   raises HTTP 422 "not registered with Lemonade Server".  Plumb the
   agent's resolved ''model_id'' into RAGConfig.model so both code paths
   hit the same loaded model.

Verified end-to-end against local Lemonade + Qwen3-4B-Instruct-2507-GGUF:

  ProductMockupAgent → generated testapp.html with Tailwind + all 3 features
  WeatherAgent       → connected to open-meteo-mcp, answered real Tokyo weather
  DocAgent           → returned "70% to product, 20% to engineering, 10% to marketing"
@kovtcharov kovtcharov added this pull request to the merge queue Apr 17, 2026
Merged via the queue into main with commit 0cfbcf4 Apr 17, 2026
37 checks passed
@kovtcharov kovtcharov deleted the kalin/examples branch April 17, 2026 21:59
@itomek itomek mentioned this pull request Apr 20, 2026
6 tasks
github-merge-queue bot pushed a commit that referenced this pull request Apr 20, 2026
# GAIA v0.17.3 Release Notes

GAIA v0.17.3 is an extensibility and resilience release. You can now
package your own agents into a custom GAIA installer and seed them on
first launch, point GAIA at alternative OpenAI-compatible inference
servers from the C++ library (Ollama, for example), and start from three
new reference agents (weather, RAG Q&A, HTML mockup) that execute
against real Lemonade hardware in CI. It also hardens the RAG cache
against an insecure-deserialization class of bug (CWE-502) — all users
should upgrade.

**Why upgrade:**
- **Ship your own GAIA** — Export and import agents between machines,
follow a new guide to produce a custom installer that seeds your agents
on first launch, and on Windows install everything in one step because
the installer now includes the Lemonade Server MSI.
- **Work with alternative inference backends** — The C++ library now
preserves OpenAI-compatible `/v1` base URLs instead of rewriting them to
`/api/v1`, so servers that expose the standard `/v1` path (Ollama, for
example) work out of the box.
- **Start from a working example** — Three new reference agents (weather
via MCP, RAG document Q&A, HTML landing-page generator) with integration
tests that actually execute against Lemonade on a Strix CI runner.
- **Safer RAG cache** — Replaces `pickle` deserialization with JSON +
HMAC-SHA256 (CWE-502). Unsigned or tampered caches are rejected and
transparently rebuilt on the next query.
- **Better document handling** — Encrypted or corrupted PDFs now produce
distinct, actionable errors (`EncryptedPDFError`, `CorruptedPDFError`)
instead of generic failures, and the RAG index is hardened for
concurrent queries.

---

## What's New

### Custom Installers and Agent Portability

You can now package a custom GAIA installer that ships with your own
agents pre-loaded, and move agents between machines with export/import
(PR #795). On Windows, the official installer now includes the Lemonade
Server MSI and runs it during install, so a fresh machine has the
complete local-LLM stack after a single download (PR #781).

**What you can do:**
- Export an agent from `~/.gaia/agents/` to a portable bundle with `gaia
agents export` and import it on another machine with `gaia agents
import`
- Follow the new custom-installer playbook at
[`docs/playbooks/custom-installer/index.mdx`](/playbooks/custom-installer)
to distribute GAIA with your agents pre-loaded — useful for workshops,
team deployments, and internal tooling
- On Windows, the installer now includes Lemonade Server — no separate
download for a complete first-run experience

**Under the hood:**
- `gaia agents export` / `gaia agents import` CLI commands round-trip
agents between machines as portable bundles
- First-launch agent seeder
(`src/gaia/apps/webui/services/agent-seeder.cjs`) copies
`<resourcesPath>/agents/<id>/` into `~/.gaia/agents/<id>/` the first
time the app starts
- Windows NSIS installer embeds `lemonade-server-minimal.msi` into
`$PLUGINSDIR` and runs it via `msiexec /i ... /qn /norestart` during
install (auto-cleaned on exit)

---

### Broader Backend Compatibility in the C++ Library

The C++ library now preserves OpenAI-compatible `/v1` base URLs (PR
#773) instead of rewriting them to `/api/v1`. That means inference
servers that expose the standard OpenAI `/v1` path — for example, Ollama
at `http://localhost:11434/v1` — work out of the box without needing a
special adapter.

---

### Reference Agents and Real-Hardware Integration Tests

Three new example agents and a Strix-runner CI workflow land together
(PR #340).

**What you can do:**
- Copy `examples/weather_agent.py`, `examples/rag_doc_agent.py`, or
`examples/product_mockup_agent.py` as a starting point for your own
agents
- Run the new integration tests locally against Lemonade to validate
agents end-to-end, not just structurally

**Under the hood:**
- `tests/integration/test_example_agents.py` executes agents and
validates responses with a 5-minute-per-test timeout
- `.github/workflows/test_examples.yml` runs on the self-hosted Strix
runner (`stx` label) with Lemonade serving `Qwen3-4B-Instruct-2507-GGUF`
- Docs homepage refreshed with a technical value prop ("Agent SDK for
AMD Ryzen AI") and MCP / CUA added to the capabilities list

---

### Smarter PDF Handling in RAG

Encrypted and corrupted PDFs now surface as distinct, actionable errors
(`EncryptedPDFError`, `CorruptedPDFError`, `EmptyPDFError`) instead of
generic failures or silent 0-chunk indexes (PR #784, closes #451).
Encrypted PDFs are detected before extraction; corrupted PDFs are caught
during extraction with a clear message. Combined with the
indexing-failure surfacing in PR #723, you get a visible indexing-failed
status the moment a document fails — and the RAG index itself is now
thread-safe under concurrent queries (PR #746).

---

## Security

### RAG Cache Deserialization Replaced with JSON + HMAC

Fixes an insecure-deserialization issue in the RAG cache (CWE-502, PR
#768). Previously, cached document indexes were serialized with Python
`pickle`; if an attacker could write to `~/.gaia/` — via a shared drive,
a sync conflict, or a malicious extension — loading that cache could
execute arbitrary code.

v0.17.3 replaces `pickle` with signed JSON: caches are now serialized as
JSON and authenticated with HMAC-SHA256 using a per-install key stored
at `~/.gaia/cache/hmac.key`. Unsigned or tampered caches are rejected
and transparently rebuilt on the next query. Old `.pkl` caches from
previous GAIA versions are ignored and re-indexed the next time you
query a document.

**You should upgrade if you** share `~/.gaia/` across machines (Dropbox,
iCloud, network home directories), run GAIA in a multi-user environment,
or have ever imported RAG caches from another source.

---

## Bug Fixes

- **Ask Agent attaches files before sending to chat** (PR #725) —
Dropped files are indexed into RAG and attached to the active session
before the prompt is consumed, so the model sees the document on the
first turn instead of the second.
- **Document indexing failures are surfaced** (PR #723) — A document
that produces 0 chunks now raises `RuntimeError` in the SDK and surfaces
as `indexing_status: failed` in the UI, instead of looking like a silent
success. Covers RAG SDK, background indexing, and re-index paths.
- **Encrypted or corrupted PDFs produce actionable errors** (PR #784,
closes #451) — RAG now raises distinct `EncryptedPDFError` and
`CorruptedPDFError` exceptions instead of generic failures, so you see
exactly what went wrong.
- **RAG index thread safety hardened** (PR #746) — Adds `RLock`
protection around index mutation paths and rebuilds chunk/index state
atomically before publishing it, so concurrent queries read consistent
snapshots and failed rebuilds no longer leak partial state.
- **MCP JSON-RPC handler guards against non-dict bodies** (PR #803) — A
malformed JSON-RPC payload (array, string, null) now returns HTTP 400
`Invalid Request: expected JSON object` instead of an HTTP 500 from a
`TypeError`.
- **File-search count aligned with accessible results** (PR #754) — The
returned count now matches the number of files the tool actually
surfaces, instead of a pre-filter total that over-reported results the
caller could not access.
- **Tracked block cursor replaces misplaced decorative cursor** (PR
#727) — Fixes the mis-positioned blinking cursor in the chat input box,
which now tracks the actual caret position via a mirror-div technique.
- **Ad-hoc sign the macOS app bundle instead of skipping code signing**
(PR #765) — The `.app` bundle inside the DMG now carries an ad-hoc
signature, so Gatekeeper presents a single "Open Anyway" bypass in
System Settings instead of the unrecoverable "is damaged" error. Full
Apple Developer ID signing is still being finalized.

---

## Release & CI

- **Publish workflow: single approval gate, no legacy Electron apps**
(PR #758) — Removed the legacy jira and example standalone Electron apps
from the publish pipeline; a single `publish` environment gate governs
PyPI, npm, and installer publishing.
- **Claude CI modernization** (PR #797, PR #799, PR #783) — Migrated all
four `claude-code-action` call sites to `v1.0.99` (pinned by SHA, fixes
an issue-handler hang), bumped `--max-turns` from 20 to 50 on both
`pr-review` and `pr-comment` for deeper analysis, upgraded to Opus 4.7,
standardized 23 subagent definitions with explicit when-to-use sections
and tool allowlists, and added agent-builder tooling (manifest schema,
`lint.py --agents`, BuilderAgent mixins).

---

## Docs

- **Roadmap overhaul** (PR #710) — Milestone-aligned plans with
voice-first as P0 and 9 new plan documents for upcoming initiatives.
- **Plan: email triage agent** (PR #796) — Specification for an upcoming
email triage agent.
- **Docs/source drift resolved** (PR #794) — Fixed broken SDK examples
across 15 docs, rewrote 5 spec files against the current source
(including two that documented entire APIs that don't exist in code),
added 20+ missing CLI flags to the CLI reference, and removed 2
already-shipped plan documents (installer, mcp-client).
- **FAQ: data-privacy answer clarified for external LLM providers** (PR
#798) — Sharper guidance on what leaves your machine when you point GAIA
at Claude or OpenAI.

---

## Full Changelog

**21 commits** since v0.17.2:

- `6d3f3f71` — fix: replace misplaced decorative cursor with tracked
terminal block cursor (#727)
- `874cf2a3` — fix: Ask Agent indexes and attaches files before sending
to chat (#725)
- `4fa121e2` — fix: surface document indexing failures instead of silent
0-chunk success (#723)
- `34b1d06e` — fix(ci): ad-hoc sign macOS DMG instead of skipping code
signing (#765)
- `7188b83c` — Roadmap overhaul: milestone-aligned plans with
voice-first P0 and 9 new plan documents (#710)
- `1beddac5` — cpp: support Ollama-compatible /v1 endpoints (#773)
- `cf9ac995` — fix: harden rag index thread safety (#746)
- `1c55c31b` — fix(ci): remove legacy electron apps from publish, single
approval gate (#758)
- `52946a7a` — feat(installer): bundle Lemonade Server MSI into Windows
installer (#774) (#781)
- `e96b3686` — ci(claude): review infra + conventions + subagent
overhaul + agent-builder tooling (#783)
- `058674b5` — fix(rag): detect encrypted and corrupted PDFs with
actionable errors (#451) (#784)
- `7bcb5d51` — fix: replace insecure pickle deserialization with JSON +
HMAC in RAG cache (CWE-502) (#768)
- `a5167e5f` — fix: keep file-search count aligned with accessible
results (#754)
- `da5ba458` — ci(claude): migrate to claude-code-action v1.0.99 + fix
issue-handler hang (#797)
- `03f546b9` — ci(claude): bump pr-review and pr-comment --max-turns 20
-> 50 (#799)
- `4119d564` — docs(faq): clarify data-privacy answer re: external LLM
providers (#798)
- `0cfbcf41` — Add example agents and integration test workflow (#340)
- `c4bd15fb` — docs: fix drift between docs and source (docs review pass
1 + 2) (#794)
- `407ed5b8` — docs(plans): add email triage agent spec (#796)
- `06fb04a4` — fix(mcp): guard JSON-RPC handler against non-dict body
(#803)
- `880ad603` — feat(installer): custom installer guide, agent
export/import, first-launch seeder (#795)

Full Changelog:
[v0.17.2...v0.17.3](v0.17.2...v0.17.3)

---

## Release checklist
- [x] `util/validate_release_notes.py docs/releases/v0.17.3.mdx --tag
v0.17.3` passes
- [x] `src/gaia/version.py` → `0.17.3`
- [x] `src/gaia/apps/webui/package.json` → `0.17.3`
- [x] Navbar label in `docs/docs.json` → `v0.17.3 · Lemonade 10.0.0`
- [x] All 21 PRs in the range (v0.17.2..HEAD) are represented in the
notes
- [ ] Review from @kovtcharov-amd addressed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

devops DevOps/infrastructure changes documentation Documentation changes performance Performance-critical changes rag RAG system changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants