| title | Self Improving Ad Copy Agent |
|---|---|
| emoji | 🧠 |
| colorFrom | gray |
| colorTo | gray |
| sdk | docker |
| pinned | false |
A small agentic system that generates direct-to-consumer (DTC) ad copy, evaluates its own outputs with an LLM-as-judge, stores the results, and uses that signal to improve over time. The loop runs end to end:
retrieve -> generate -> evaluate -> log -> improve -> repeat
Every good output becomes future few-shot fuel and a regression baseline; every bad output gets flagged for review. The system gets better the more it runs, without any human editing prompts in the hot path.
Live demo: https://adcopy.zeetay.dev
The hosted demo runs the full loop from the browser: fill in the brief (pre-filled with a FitFuel example), click Generate, and watch the agent retrieve past winners, generate three variants, and score each on four dimensions in real time.
- Frontend: Next.js (App Router) deployed on Netlify, in `web/`.
- Backend: the FastAPI app deployed on Hugging Face Spaces (Docker, see `Dockerfile`), with CORS, a per-IP rate limit on `/run`, a `/stats` endpoint, and a startup seed so a fresh deployment shows non-zero stats and working retrieval on the very first visit.
Deployment env vars: the backend needs `GROQ_API_KEY` (and optionally `CORS_ORIGINS` = your site URL); the frontend needs `NEXT_PUBLIC_API_URL` = your Hugging Face Space URL. See the per-app `.env.example` files.
Most "LLM app" demos are a single prompt with no memory and no notion of quality. This project is built around three ideas that make it a genuine self-improving loop:
- It judges itself on multiple axes. Each output is scored 1-5 on four independent dimensions (hook strength, brand alignment, clarity, conversion intent) by a separate judge model. Dimension scores are first-class and stored separately; there is no single "vibe" score driving decisions.
- It remembers what worked. High-scoring outputs are embedded by their brand brief and stored in a local vector store. On every new run, the agent retrieves the most similar past winners and injects them as few-shot examples. The quality bar therefore ratchets upward over time.
- It refuses to regress. Winning outputs are captured into a golden dataset. Any change to the prompt is checked against that dataset before it is allowed to drive real runs: if quality drops by more than a tolerance on any known-good brief, the change fails.
The result is a closed feedback loop where generation, evaluation, memory, and prompt versioning reinforce each other.
A single run flows through `agent/core.py`:
- Retrieve top-3 high-scoring past outputs (`agent/memory.py`, ChromaDB)
- Inject them as few-shot examples (`agent/prompts.py`)
- Generate headline / body / cta with Groq (`agent/tools.py`)
- Judge each variant on 4 dimensions with Groq (`evals/judge.py`)
- Store run + outputs + scores in SQLite (`db/store.py`)
- Promote winners (>= 4.0) to golden + memory (`feedback/loop.py`)
- Flag losers (< 2.5) for review (`feedback/loop.py`)
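The last two steps amount to a threshold check on each variant's weighted average. A minimal sketch, assuming a hypothetical `ScoredOutput` container and `route_output` helper (neither name comes from the repository; only the 4.0 and 2.5 thresholds do):

```python
from dataclasses import dataclass

PROMOTE_THRESHOLD = 4.0  # weighted average at or above this is promoted
FLAG_THRESHOLD = 2.5     # weighted average below this is flagged


@dataclass
class ScoredOutput:
    """Hypothetical container for one judged variant."""
    output_id: int
    weighted_average: float


def route_output(output: ScoredOutput) -> str:
    """Return 'promote', 'flag', or 'keep' for a judged variant."""
    if output.weighted_average >= PROMOTE_THRESHOLD:
        return "promote"  # would go to the golden dataset and memory
    if output.weighted_average < FLAG_THRESHOLD:
        return "flag"     # would go to review
    return "keep"         # logged, but neither promoted nor flagged


if __name__ == "__main__":
    for out in [ScoredOutput(1, 4.2), ScoredOutput(2, 3.1), ScoredOutput(3, 2.0)]:
        print(out.output_id, route_output(out))
```

Outputs that land between the two thresholds are still logged; they simply do not feed the golden dataset or the review queue.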
| Layer | File(s) | Responsibility |
|---|---|---|
| Persistence | `db/store.py` | SQLite: runs, outputs, golden, flagged |
| Prompts | `agent/prompts.py` | All prompt text, versioned; nothing hardcoded elsewhere |
| Memory | `agent/memory.py` | ChromaDB + sentence-transformers retrieval |
| Agent loop | `agent/core.py`, `agent/tools.py` | Orchestration + Groq client |
| Evaluation | `evals/*.py` | Rubric, judge, golden dataset, regression runner |
| Feedback | `feedback/loop.py` | Promote winners, flag losers |
| API | `api/main.py` | `POST /run` |
| CLI / tests | `scripts/run_eval.py`, `tests/` | Manual runs and the regression suite |
- Python 3.11+
- A Groq API key (free at https://console.groq.com/keys)
```bash
python -m venv .venv
# Windows: .venv\Scripts\activate
# macOS/Linux: source .venv/bin/activate
pip install -r requirements.txt
```

```bash
cp .env.example .env   # Windows: copy .env.example .env
# then edit .env and set GROQ_API_KEY
```

Only `GROQ_API_KEY` is required. Everything else (SQLite at `./agent.db`, ChromaDB at `./chroma_db`) is local; no other external services.
```bash
# Run the agent on the built-in FitFuel example brief:
python scripts/run_eval.py run
# Or run on your own brief:
python scripts/run_eval.py run --brief my_brief.json
# Inspect golden / flagged counts at any time:
python scripts/run_eval.py status
```

The very first run retrieves zero few-shot examples (memory is empty). As you run more briefs, winners accumulate and later runs start standing on the shoulders of earlier ones.
A brief is a JSON object:
```json
{
  "brand": "FitFuel",
  "product": "High-protein meal replacement shake",
  "audience": "Busy professionals aged 25-40",
  "tone": "Energetic and no-nonsense",
  "goal": "Drive trial purchases"
}
```

```bash
uvicorn api.main:app --reload
```

Then trigger the full loop:

```bash
curl -X POST http://127.0.0.1:8000/run \
  -H "Content-Type: application/json" \
  -d '{"brand":"FitFuel","product":"High-protein meal replacement shake","audience":"Busy professionals aged 25-40","tone":"Energetic and no-nonsense","goal":"Drive trial purchases"}'
```

The response contains the three generated variants, all four dimension scores per variant, the weighted average, and a feedback summary (what was promoted / flagged).
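The same request can be issued from Python with only the standard library. This is a sketch, not part of the repository: `build_run_request` and `run_brief` are hypothetical helper names, the 120-second timeout is an arbitrary assumption, and the response is returned as a plain dict because its exact field names are not documented here.

```python
import json
import urllib.request

# Local uvicorn address used in the curl example.
API_URL = "http://127.0.0.1:8000/run"


def build_run_request(brief: dict, url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the brief."""
    payload = json.dumps(brief).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_brief(brief: dict, url: str = API_URL) -> dict:
    """Send the brief and decode the JSON response (the server must be running)."""
    with urllib.request.urlopen(build_run_request(brief, url), timeout=120) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    brief = {
        "brand": "FitFuel",
        "product": "High-protein meal replacement shake",
        "audience": "Busy professionals aged 25-40",
        "tone": "Energetic and no-nonsense",
        "goal": "Drive trial purchases",
    }
    print(json.dumps(run_brief(brief), indent=2))
```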
- Retrieve. The incoming brief is embedded (locally, via `all-MiniLM-L6-v2`) and used to query ChromaDB for the top-3 most similar past outputs that scored >= 3.5/5. Only proven-good copy is ever retrieved.
- Generate. Those examples are injected into the active generation prompt under a "Past high-performing examples" section, and the model produces a headline hook, body copy, and a CTA in one structured JSON response.
- Evaluate. Each of the three variants is immediately scored by the judge model on the four rubric dimensions. Scores are clamped to 1-5 and a weighted average is computed for internal ranking only.
- Log. The run, every variant, and every dimension score are written to SQLite.
- Improve. The feedback loop:
  - promotes any output with a weighted average >= 4.0 into both the golden dataset and ChromaDB memory (so it can be retrieved next time);
  - flags any output < 2.5 into the `flagged_outputs` table with the reason "below quality threshold."
Because winners re-enter memory, the pool of few-shot exemplars improves run over run; that is the "self-improving" part.
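To make the Retrieve step concrete, here is a dependency-free sketch of "top-3 by cosine similarity, subject to a 3.5 score floor". It is not the repository's code: the real system delegates this to ChromaDB, and the in-memory list layout (`embedding`, `score`, `copy` keys) and the function names are assumptions made for illustration.

```python
import math

SCORE_FLOOR = 3.5  # only outputs that scored at least this are eligible
TOP_K = 3          # number of few-shot examples to return


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_examples(query_embedding, memory, k=TOP_K, floor=SCORE_FLOOR):
    """Return up to k eligible entries, most similar to the query first.

    memory is a list of dicts with 'embedding', 'score' and 'copy' keys
    (a hypothetical layout standing in for the vector store).
    """
    eligible = [m for m in memory if m["score"] >= floor]
    ranked = sorted(
        eligible,
        key=lambda m: cosine_similarity(query_embedding, m["embedding"]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    memory = [
        {"embedding": [1.0, 0.0], "score": 4.5, "copy": "a"},
        {"embedding": [0.0, 1.0], "score": 4.8, "copy": "b"},
        {"embedding": [1.0, 0.1], "score": 3.0, "copy": "c"},  # below the floor
    ]
    print([m["copy"] for m in retrieve_examples([1.0, 0.0], memory)])
```

On an empty memory the function returns an empty list, which matches the first-run behaviour where no few-shot examples are available.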
Four dimensions, each 1-5:
| Dimension | Question | Weight |
|---|---|---|
| `hook_strength` | Does the headline immediately grab attention? | 0.30 |
| `brand_alignment` | Does the copy reflect the brief tone and audience? | 0.25 |
| `clarity` | Is the message immediately understandable? | 0.25 |
| `conversion_intent` | Does it drive toward the stated goal? | 0.20 |
The weighted average is internal only, used for thresholds and ranking. All four raw dimensions are always stored separately.
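As a worked example of the rubric arithmetic, the sketch below clamps each raw score to 1-5 and applies the weights from the table. The function names are illustrative, not taken from `evals/`.

```python
# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS = {
    "hook_strength": 0.30,
    "brand_alignment": 0.25,
    "clarity": 0.25,
    "conversion_intent": 0.20,
}


def clamp_score(value, low=1.0, high=5.0):
    """Force a judge score into the 1-5 range."""
    return max(low, min(high, float(value)))


def weighted_average(scores):
    """Weighted blend of the four clamped dimension scores."""
    return sum(WEIGHTS[dim] * clamp_score(scores[dim]) for dim in WEIGHTS)


if __name__ == "__main__":
    scores = {
        "hook_strength": 4,
        "brand_alignment": 5,
        "clarity": 3,
        "conversion_intent": 4,
    }
    # 0.30*4 + 0.25*5 + 0.25*3 + 0.20*4 = 1.20 + 1.25 + 0.75 + 0.80
    print(f"{weighted_average(scores):.2f}")  # prints 4.00
```

This variant has weak clarity (3) but still reaches the 4.0 promotion threshold on the blend, which is why the raw dimensions are kept alongside the average.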
Any output with weighted average >= 4.0 is captured (brief, output, all scores, prompt version, timestamp). This is the regression baseline.
For each golden entry, the regression runner regenerates copy for the same brief + variant using the prompt version under test, re-judges it, and compares the new weighted score to the stored baseline. If any entry drops by more than 0.5, the run fails with a clear warning.
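The comparison itself is simple once both sets of scores exist. A minimal sketch, assuming baseline and candidate scores are available as dicts keyed by a golden-entry id (a hypothetical layout; the regeneration and re-judging steps are omitted):

```python
TOLERANCE = 0.5  # maximum allowed drop from the stored baseline


def find_regressions(baseline, candidate, tolerance=TOLERANCE):
    """Return (entry_id, baseline_score, new_score) for every entry whose
    weighted score dropped by more than the tolerance."""
    regressions = []
    for entry_id, base_score in baseline.items():
        new_score = candidate[entry_id]
        if base_score - new_score > tolerance:
            regressions.append((entry_id, base_score, new_score))
    return regressions


if __name__ == "__main__":
    baseline = {"fitfuel-1": 4.5, "fitfuel-2": 4.0}
    candidate = {"fitfuel-1": 4.0, "fitfuel-2": 3.0}
    failed = find_regressions(baseline, candidate)
    for entry_id, base_score, new_score in failed:
        print(f"REGRESSION {entry_id}: {base_score:.2f} -> {new_score:.2f}")
    # A runner built on this would fail the run whenever `failed` is non-empty.
```

A drop of exactly 0.5 passes, since only drops of more than the tolerance count as regressions; improvements never fail the check.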
```bash
# Run the regression eval against the active prompt and print a Rich table:
python scripts/run_eval.py regression
```

```bash
pytest tests/test_regression.py
```

It loads the golden dataset, runs the eval runner, and asserts no entry regresses more than 0.5 from baseline. The test skips (rather than failing spuriously) when `GROQ_API_KEY` is unset or the golden dataset is still empty.
All prompt text lives in `agent/prompts.py`; nothing is hardcoded anywhere else. Each prompt is a named, versioned constant (`GENERATION_PROMPT_V1`, `GENERATION_PROMPT_V2`, ...). A single constant selects which is live:

```python
ACTIVE_PROMPT_VERSION = "GENERATION_PROMPT_V1"
```

Every run logs the version it used (stored on the run and on each output).
To promote a new prompt safely:
- Add a new constant (e.g. `GENERATION_PROMPT_V2`) and register it in `PROMPT_REGISTRY`.
- Run the regression check against it before making it active:
  ```bash
  # Either run the suite, or the CLI regression command after flipping the
  # constant in a branch:
  pytest tests/test_regression.py
  python scripts/run_eval.py regression
  ```
- Only if no golden entry regresses more than 0.5, change `ACTIVE_PROMPT_VERSION` to the new version.
This is what "changing the active version triggers a regression check before it
is used in a real run" means in practice: the golden dataset is the gate.
GENERATION_PROMPT_V2 ships in this repo as a worked example you can promote.
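One way such a registry could be laid out is sketched below. The constant names follow the README, but the prompt bodies are placeholders, the assumption that `PROMPT_REGISTRY` is a plain dict is mine, and `get_prompt` is a hypothetical accessor rather than a documented function.

```python
# Placeholder prompt bodies; the real text lives in agent/prompts.py.
GENERATION_PROMPT_V1 = "placeholder text for version 1: {brief}"
GENERATION_PROMPT_V2 = "placeholder text for version 2: {brief}"

# Assumed shape: version name -> prompt text.
PROMPT_REGISTRY = {
    "GENERATION_PROMPT_V1": GENERATION_PROMPT_V1,
    "GENERATION_PROMPT_V2": GENERATION_PROMPT_V2,
}

ACTIVE_PROMPT_VERSION = "GENERATION_PROMPT_V1"


def get_prompt(version=None):
    """Return (version_name, prompt_text); defaults to the active version.

    Passing an explicit version lets a regression run evaluate a candidate
    prompt without changing ACTIVE_PROMPT_VERSION.
    """
    name = version or ACTIVE_PROMPT_VERSION
    if name not in PROMPT_REGISTRY:
        raise ValueError(f"unknown prompt version: {name}")
    return name, PROMPT_REGISTRY[name]


if __name__ == "__main__":
    name, text = get_prompt()
    print(name)  # the version name that would be logged with the run
```

Returning the version name together with the text makes it straightforward to log which prompt produced each output.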
Why ChromaDB for memory. The agent needs semantic retrieval: "find past briefs like this one", not exact lookups. ChromaDB gives a persistent local vector store with cosine similarity and zero external services, and it pairs cleanly with local sentence-transformers embeddings. SQLite alone can't do nearest-neighbour search over brief semantics; a hosted vector DB would violate the "local only" constraint and add ops overhead for no benefit at this scale.
Why dimension scoring over a composite score. A single 1-10 "quality" score is unactionable and easy for a judge to anchor on. Scoring four independent dimensions tells you why copy is weak (great hook, poor clarity) and makes the signal far more stable and debuggable. We do compute a weighted average, but only for internal thresholds/ranking; the four raw dimensions are always stored, so we never lose information by collapsing too early.
Why SQLite for eval storage. Eval results are structured, relational, and queryable (runs -> outputs -> scores; golden; flagged). SQLite gives ACID guarantees, trivial setup, a single-file database, and real SQL, ideal for run history and a golden dataset. It needs no server and ships with Python. A hosted DB would add infrastructure with no upside for a local, single-node harness.
Why the 4.0 threshold for golden inclusion. On a 1-5 scale, 4.0 means "clearly good on the weighted blend" without demanding perfection. Set it lower and the golden set fills with mediocre copy, weakening both the regression baseline and the few-shot exemplars. Set it higher (e.g. 4.5) and you rarely capture anything, so the system never accumulates a baseline or improves. 4.0 is the point where entries are good enough to defend against regressions and to teach future runs. (The retrieval floor is a more permissive 3.5 so memory can draw on a slightly wider pool of solid examples, while only the strongest >=4.0 outputs become protected golden baselines.)
```
self-improving-agent/
├── README.md
├── requirements.txt
├── .env.example
├── agent/       core loop, tools, memory, versioned prompts
├── evals/       rubric, LLM judge, golden dataset, regression runner
├── feedback/    post-run promote/flag loop
├── db/          SQLite store
├── api/         FastAPI app (POST /run)
├── tests/       pytest regression suite
└── scripts/     run_eval.py CLI (run / regression / status)
```