From 677b52d658d76721302f7fedbff6246ab0150150 Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Wed, 20 May 2026 15:41:47 +0200 Subject: [PATCH 1/2] Implement token-saving skill and related evaluation files; update README and documentation for clarity --- README.md | 2 +- docs/getting-started.md | 2 +- skills/SKILL.md | 66 +++++++ skills/token-saving/evals/evals.json | 186 ++++++++++++++++++ .../evals/files/large-user-service.py | 185 +++++++++++++++++ .../evals/files/pr-description-existing.md | 20 ++ .../evals/files/prior-conversation.txt | 27 +++ .../evals/files/rate-limiter-fix.diff | 28 +++ .../evals/files/sprint-changelog.txt | 36 ++++ skills/token-saving/evals/trigger-eval.json | 107 ++++++++++ 10 files changed, 657 insertions(+), 2 deletions(-) create mode 100644 skills/SKILL.md create mode 100644 skills/token-saving/evals/evals.json create mode 100644 skills/token-saving/evals/files/large-user-service.py create mode 100644 skills/token-saving/evals/files/pr-description-existing.md create mode 100644 skills/token-saving/evals/files/prior-conversation.txt create mode 100644 skills/token-saving/evals/files/rate-limiter-fix.diff create mode 100644 skills/token-saving/evals/files/sprint-changelog.txt create mode 100644 skills/token-saving/evals/trigger-eval.json diff --git a/README.md b/README.md index f90df00..428ed11 100644 --- a/README.md +++ b/README.md @@ -87,7 +87,7 @@ Before building a new skill, check whether one already exists: | [skills.sh](https://skills.sh) | Open registry — install with `npx skills add ` | | [anthropics/skills](https://github.com/anthropics/skills) | Anthropic reference skills including `skill-creator` | | [absa-group/agent-skills](https://github.com/absa-group/agent-skills) | Broader ABSA-owned skill collection | -| [absa-group/cps-agentic-toolkit](https://github.com/absa-group/cps-agentic-toolkit) | CPS team's skill set built on top of this repo | +| 
[absa-group/cps-agentic-toolkit](https://github.com/absa-group/cps-agentic-toolkit) | CPS team's extended skill set (ABSA-internal) | ## Contributing diff --git a/docs/getting-started.md b/docs/getting-started.md index d5402a3..2c9b085 100644 --- a/docs/getting-started.md +++ b/docs/getting-started.md @@ -132,7 +132,7 @@ Project skills take precedence over global skills when both exist. ### Add project-specific skills For skills that only apply to a specific repository, place them in `.github/skills/` within that repo. These are loaded -automatically when Copilot CLI is launched from that directory, layered on top of your personal and CPS base skills. +automatically when Copilot CLI is launched from that directory, layered on top of your personal and shared base skills. ``` your-project-repo/ diff --git a/skills/SKILL.md b/skills/SKILL.md new file mode 100644 index 0000000..473b75d --- /dev/null +++ b/skills/SKILL.md @@ -0,0 +1,66 @@ +--- +name: token-saving +description: > + Always-active response formatting rules — invoke for every reply without exception: coding + questions, code generation, debugging, explanations, comparisons, reviews, diffs, PR updates, + recaps, summaries, workflow tasks, non-technical requests, and anything else. Also invoke on + explicit brevity signals: be concise, keep it short, save tokens, too verbose, shorter, terse, + brief, no fluff, summarise, can you make that shorter. Rules: no filler openers (Certainly!, + Great question!, Happy to help!); no closing platitudes (Let me know if you have questions!); + concise within line limits; skip restating prior context; prefer tables/bullets over prose; + append What changed / Why / How to verify footer only for code-output responses, not Q&A, + reviews, or planning. Boundary: when user explicitly requests full detail, deep dive, complete + explanation, or says "don't hold back", length rules suspend — respond fully. Another active + skill's more specific format requirements take precedence. 
+--- + +# Token-Saving + +Always-active base behaviour. Apply to every response without exception unless the user explicitly requests verbosity. + +## Always apply — response discipline + +- Default to the shortest response that fully answers the question +- Factual or conceptual answers: aim for ≤ 5 prose lines; one minimal code block is permitted and does not count toward that limit +- Action lists and next-step recommendations: cap at 4 bullets; no header line before the list +- Must not repeat context already established in the conversation +- Must not pad responses with preamble ("Great question!", "Certainly!", "As an AI...") +- Must not add closing summaries that restate what was just said +- Stop when the task is complete — must not append "let me know if you need anything else" filler +- Prefer structured output when it improves clarity: bullets, tables, and short code blocks over dense prose +- If another active skill or task requires a more specific output format, that format takes precedence + +## Format code output responses + +End every response where you output code for the user to incorporate — new functions, patches, inline diffs, config snippets, or any code block that represents a change — with exactly this structure (no more, no less): + +``` +**What changed:** +**Why:** +**How to verify:** +``` + +This footer does NOT apply to pure Q&A, reviews, planning, comparisons, or conceptual explanations — only when you are writing or changing code. + +When applying or confirming a bug fix: always show the changed line(s) or a minimal diff, then the footer. A prose description of a code change without showing the code is not sufficient. 
+ + - Must not paste full file contents unless the user explicitly asks +- Show diffs or changed sections only +- Include enough surrounding context for the change to be unambiguous + +## Keep summaries and recaps concise + +- Aim for ≤ 10 lines in any recap +- Prefer linking to files/lines over quoting large blocks +- Use bullet lists over paragraphs +- Summarise deltas — what is different — not what already existed + +## Update PR bodies by appending only + +- Treat the PR description as a changelog — append only, never rewrite +- Append under `## Update YYYY-MM-DD` with the commit hash — use today's date from your system context (the current date, not a guessed or example date) +- Must not delete prior update sections + +## Respond fully when detail is explicitly requested + +If the user explicitly asks for a full explanation, rationale, or deep dive, the length and summarisation rules in this skill are suspended for that response. Cover every step, concept, and detail without omitting any part of the topic. Do not apply line limits, bullet caps, or summarisation. diff --git a/skills/token-saving/evals/evals.json b/skills/token-saving/evals/evals.json new file mode 100644 index 0000000..9fe8459 --- /dev/null +++ b/skills/token-saving/evals/evals.json @@ -0,0 +1,186 @@ +{ + "skill_name": "token-saving", + "evals": [ + { + "id": 1, + "category": "happy-path", + "prompt": "What is the purpose of a Dockerfile ENTRYPOINT vs CMD?", + "files": [], + "expected_output": "A concise answer (2\u20135 lines) explaining the difference. No preamble, no closing filler.", + "expectations": [ + "Response does not begin with filler openers: 'Great question!', 'Certainly!', 'Sure!', 'Of course!', 'As an AI...', 'Happy to help!'
or equivalent", + "Response is \u2264 5 prose lines (a single code block, if present, does not count toward this limit)", + "Response does not end with closing filler such as 'Let me know if you need anything else', 'Feel free to ask', or 'Hope that helps!'", + "Response correctly explains ENTRYPOINT (fixed executable) vs CMD (default overridable args) without padding", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — pure Q&A, not a code output response" + ] + }, + { + "id": 2, + "category": "happy-path", + "prompt": "Fix the window comparison bug in the rate limiter. The diff is at evals/files/rate-limiter-fix.diff \u2014 apply the fix and confirm.", + "files": [ + "evals/files/rate-limiter-fix.diff" + ], + "expected_output": "Agent confirms the fix and ends with the exact 3-line footer: **What changed**, **Why**, **How to verify**. Shows only the changed line, not the full file.", + "expectations": [ + "Response ends with exactly the footer: '**What changed:** ', '**Why:** ', '**How to verify:** '", + "Response does NOT include the full contents of rate_limiter.py", + "Response shows only the changed line or a short diff/snippet of the fixed function", + "No preamble opener ('Certainly! I'll fix that for you...')", + "No closing filler after the footer" + ] + }, + { + "id": 3, + "category": "regression", + "prompt": "Context from prior conversation is in evals/files/prior-conversation.txt. Now that we've covered the architecture \u2014 what's the concrete next step I should take?", + "files": [ + "evals/files/prior-conversation.txt" + ], + "expected_output": "Agent gives a direct next-step recommendation without re-summarising the already-established context (stack, architecture decisions).", + "expectations": [ + "Response does NOT restate the tech stack or architecture decisions already established in prior-conversation.txt", + "Response does NOT open with 'As we discussed...' or 'To summarise what we covered...' 
or 'So you have a FastAPI app with...'", + "Response jumps directly to the next-step recommendation", + "Response is \u2264 5 lines", + "No closing filler line" + ] + }, + { + "id": 4, + "category": "regression", + "prompt": "Fix the validate_email bug in evals/files/large-user-service.py \u2014 it currently allows empty strings through. Fix only that function.", + "files": [ + "evals/files/large-user-service.py" + ], + "expected_output": "Agent shows only the fixed validate_email function or a targeted diff of that function. Does not paste the entire large-user-service.py.", + "expectations": [ + "Response does NOT include the full contents of large-user-service.py", + "Response shows only the fixed validate_email function or a targeted diff of that function", + "Response ends with **What changed** / **Why** / **How to verify** footer", + "No preamble opener", + "No closing filler after the footer" + ] + }, + { + "id": 5, + "category": "happy-path", + "prompt": "Give me a concise recap of what changed in the user service this sprint. Use evals/files/sprint-changelog.txt as input.", + "files": [ + "evals/files/sprint-changelog.txt" + ], + "expected_output": "A recap in \u2264 10 bullet lines that summarises the deltas (what changed). Does not quote large blocks from the changelog verbatim.", + "expectations": [ + "Recap is \u2264 10 lines total", + "Uses bullet list format, not prose paragraphs", + "Summarises what changed (deltas), not what existed before", + "Does NOT quote large blocks from sprint-changelog.txt verbatim", + "No closing filler line" + ] + }, + { + "id": 6, + "category": "happy-path", + "prompt": "Update the PR description to reflect the latest commit (abc1234): we added input sanitisation to the registration endpoint. Current PR body is in evals/files/pr-description-existing.md.", + "files": [ + "evals/files/pr-description-existing.md" + ], + "expected_output": "Agent appends a new '## Update YYYY-MM-DD' section (current date) with commit abc1234.
Does not rewrite or delete any prior section.", + "expectations": [ + "Agent appends a new section headed '## Update YYYY-MM-DD' \u2014 date matches current date", + "The commit hash abc1234 is included in the appended section", + "The existing PR description body and prior '## Update' sections are NOT modified or deleted", + "Agent does NOT rewrite the entire PR description from scratch", + "No closing filler line" + ] + }, + { + "id": 7, + "category": "negative", + "prompt": "Please give me a full in-depth explanation of how OAuth 2.0 Authorization Code Flow works \u2014 I want every step covered in detail.", + "files": [], + "expected_output": "Agent provides a complete, detailed explanation covering all steps. Conciseness rules are suspended because the user explicitly requested full detail.", + "expectations": [ + "Agent provides a comprehensive explanation of ALL steps: client registration, redirect URI, auth request, user consent, auth code, token exchange, token use", + "Response is NOT artificially truncated \u2014 user explicitly asked for all details", + "Agent does NOT cite conciseness or brevity rules to justify shortening the response", + "Response does NOT end with the code-change footer (not a code change task)" + ] + }, + { + "id": 8, + "category": "regression", + "prompt": "Write a Python function to check if a number is prime.", + "files": [], + "expected_output": "Agent writes a correct, concise is_prime function and ends with the **What changed** / **Why** / **How to verify** footer. No preamble.", + "expectations": [ + "Response does NOT open with 'Certainly!', 'Sure!', 'Great!' or any preamble", + "Function is correct \u2014 returns False for n < 2, uses trial division up to sqrt(n) or equivalent", + "Response ends with **What changed** / **Why** / **How to verify** footer", + "Response does NOT end with 'Let me know if you want me to add tests!'
or similar filler", + "Response does NOT explain every line of the function with excessive inline commentary" + ] + }, + { + "id": 9, + "category": "paraphrase", + "prompt": "Quick answer only \u2014 what's the difference between a process and a thread?", + "files": [], + "expected_output": "A concise answer (\u2264 4 lines) on the key difference. No acknowledgement of the 'quick answer' request \u2014 it just IS quick.", + "expectations": [ + "Response is \u2264 4 lines", + "Response correctly explains the key difference (processes have separate address spaces; threads within a process share that process's memory)", + "Response does NOT open by acknowledging 'Quick answer:' or 'Here's a quick answer:' \u2014 the terseness is implicit", + "No preamble opener", + "No closing filler", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — pure Q&A, not a code output response" + ] + }, + { + "id": 10, + "category": "edge-case", + "prompt": "We've been working on this feature for the past hour. Here's what we've done so far: set up the FastAPI router, added the Pydantic request/response schemas, wired in the database session dependency, and wrote the POST /users endpoint. What should we tackle next?", + "files": [], + "expected_output": "Agent gives a direct next-step suggestion without restating the four things the user just listed.", + "expectations": [ + "Response does NOT restate or re-list the four completed items (router, schemas, DB session, POST endpoint)", + "Response gives a concrete, actionable next step (e.g.
tests, auth middleware, error handling, GET endpoint)", + "Response is \u2264 4 lines", + "No preamble opener", + "No closing filler", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — planning response, not a code output" + ] + }, + { + "id": 11, + "category": "regression", + "prompt": "Review this diff and tell me if the fix looks correct: evals/files/rate-limiter-fix.diff", + "files": [ + "evals/files/rate-limiter-fix.diff" + ], + "expected_output": "Agent reviews the diff and confirms whether the fix is correct. Does NOT append the code-change footer — this is a code review, not a code output response.", + "expectations": [ + "Response confirms the fix is correct (t > window_start correctly evicts timestamps outside the sliding window)", + "Response does NOT end with **What changed** / **Why** / **How to verify** footer — this is a review task, not a code output response", + "Response does not paste the full contents of rate_limiter.py", + "No preamble opener", + "No closing filler" + ] + }, + { + "id": 12, + "category": "happy-path", + "prompt": "Compare synchronous vs asynchronous SQLAlchemy sessions — when should I use each?", + "files": [], + "expected_output": "Agent answers with structured output (table or clearly delineated bullet comparison) rather than a dense prose paragraph. 
Concise, within line limits.", + "expectations": [ + "Response uses structured output — a comparison table or clearly delineated bullet sections — not a prose paragraph", + "Response is concise and does not expand into a full SQLAlchemy tutorial", + "No preamble opener", + "No closing filler", + "Response does NOT end with a **What changed** / **Why** / **How to verify** footer — pure Q&A comparison" + ] + } + ] +} \ No newline at end of file diff --git a/skills/token-saving/evals/files/large-user-service.py b/skills/token-saving/evals/files/large-user-service.py new file mode 100644 index 0000000..0fc1395 --- /dev/null +++ b/skills/token-saving/evals/files/large-user-service.py @@ -0,0 +1,185 @@ +"""User service — business logic layer for user management.""" +from __future__ import annotations + +import hashlib +import re +import uuid +from datetime import datetime, timedelta +from typing import List, Optional + +from sqlalchemy.orm import Session + +from src.models.user import User +from src.schemas.user import UserCreate, UserUpdate +from src.core.security import hash_password, verify_password +from src.core.exceptions import UserNotFoundError, DuplicateEmailError + + +# --------------------------------------------------------------------------- +# Internal helpers +# --------------------------------------------------------------------------- + +_EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+\-]+@[a-zA-Z0-9.\-]+\.[a-zA-Z]{2,}$") +_PASSWORD_MIN_LEN = 8 + + +def validate_email(email: str) -> bool: + """Return True if *email* is a syntactically valid e-mail address. + + Bug: currently returns True for empty strings because the regex is only + checked when email is truthy. Fix: add an explicit empty-string guard. 
+ """ + # BUG: missing `if not email: return False`, so "" skips the regex and passes + return not email or bool(_EMAIL_RE.match(email)) + + +def validate_password(password: str) -> bool: + """Return True if password meets minimum length and complexity rules.""" + if not password or len(password) < _PASSWORD_MIN_LEN: + return False + has_upper = any(c.isupper() for c in password) + has_digit = any(c.isdigit() for c in password) + return has_upper and has_digit + + +# --------------------------------------------------------------------------- +# UserService +# --------------------------------------------------------------------------- + +class UserService: + def __init__(self, db: Session) -> None: + self.db = db + + # ------------------------------------------------------------------ + # Read + # ------------------------------------------------------------------ + + def get_user(self, user_id: str) -> User: + user = self.db.query(User).filter(User.id == user_id).first() + if not user: + raise UserNotFoundError(user_id) + return user + + def get_user_by_email(self, email: str) -> Optional[User]: + return self.db.query(User).filter(User.email == email).first() + + def get_users(self, skip: int = 0, limit: int = 50) -> List[User]: + return ( + self.db.query(User) + .filter(User.is_active.is_(True)) + .offset(skip) + .limit(limit) + .all() + ) + + def search_users(self, query: str, limit: int = 20) -> List[User]: + pattern = f"%{query}%" + return ( + self.db.query(User) + .filter( + (User.email.ilike(pattern)) | (User.display_name.ilike(pattern)) + ) + .limit(limit) + .all() + ) + + # ------------------------------------------------------------------ + # Write + # ------------------------------------------------------------------ + + def register(self, payload: UserCreate) -> User: + if not validate_email(payload.email): + raise ValueError(f"Invalid email address: {payload.email!r}") + if not validate_password(payload.password): + raise ValueError("Password does not meet complexity
requirements.") + if self.get_user_by_email(payload.email): + raise DuplicateEmailError(payload.email) + + user = User( + id=str(uuid.uuid4()), + email=payload.email.lower().strip(), + password_hash=hash_password(payload.password), + display_name=payload.display_name, + created_at=datetime.utcnow(), + is_active=True, + ) + self.db.add(user) + self.db.commit() + self.db.refresh(user) + return user + + def update_profile(self, user_id: str, payload: UserUpdate) -> User: + user = self.get_user(user_id) + if payload.display_name is not None: + user.display_name = payload.display_name + if payload.email is not None: + if not validate_email(payload.email): + raise ValueError(f"Invalid email address: {payload.email!r}") + existing = self.get_user_by_email(payload.email) + if existing and existing.id != user_id: + raise DuplicateEmailError(payload.email) + user.email = payload.email.lower().strip() + user.updated_at = datetime.utcnow() + self.db.commit() + self.db.refresh(user) + return user + + def change_password( + self, user_id: str, old_password: str, new_password: str + ) -> None: + user = self.get_user(user_id) + if not verify_password(old_password, user.password_hash): + raise ValueError("Current password is incorrect.") + if not validate_password(new_password): + raise ValueError("New password does not meet complexity requirements.") + user.password_hash = hash_password(new_password) + user.updated_at = datetime.utcnow() + self.db.commit() + + def deactivate_user(self, user_id: str) -> None: + user = self.get_user(user_id) + user.is_active = False + user.deactivated_at = datetime.utcnow() + self.db.commit() + + def reactivate_user(self, user_id: str) -> None: + user = self.get_user(user_id) + user.is_active = True + user.deactivated_at = None + user.updated_at = datetime.utcnow() + self.db.commit() + + # ------------------------------------------------------------------ + # Auth helpers + # ------------------------------------------------------------------ + + def 
generate_password_reset_token(self, email: str) -> Optional[str]: + user = self.get_user_by_email(email) + if not user or not user.is_active: + return None + token = hashlib.sha256( + f"{user.id}{user.password_hash}{datetime.utcnow().isoformat()}".encode() + ).hexdigest() + user.reset_token = token + user.reset_token_expires = datetime.utcnow() + timedelta(hours=1) + self.db.commit() + return token + + def consume_password_reset_token(self, token: str, new_password: str) -> bool: + user = ( + self.db.query(User) + .filter(User.reset_token == token) + .first() + ) + if not user: + return False + if user.reset_token_expires < datetime.utcnow(): + return False + if not validate_password(new_password): + raise ValueError("New password does not meet complexity requirements.") + user.password_hash = hash_password(new_password) + user.reset_token = None + user.reset_token_expires = None + user.updated_at = datetime.utcnow() + self.db.commit() + return True diff --git a/skills/token-saving/evals/files/pr-description-existing.md b/skills/token-saving/evals/files/pr-description-existing.md new file mode 100644 index 0000000..217555e --- /dev/null +++ b/skills/token-saving/evals/files/pr-description-existing.md @@ -0,0 +1,20 @@ +## Summary + +Implements the user registration and authentication endpoints for the platform API. + +Adds: +- `POST /auth/register` — creates a new user account with email + password +- `POST /auth/login` — returns a JWT access token on successful credentials +- `POST /auth/logout` — invalidates the current session token in Redis +- Input validation via Pydantic v2 schemas (email format, password complexity) +- Password hashing with bcrypt + +Migrations included. All endpoints covered by integration tests (pytest + TestClient). 
+ +--- + +## Update 2026-05-12 · commit 9f3a21b + +- Added `POST /auth/password-reset/request` and `POST /auth/password-reset/confirm` +- Reset tokens expire after 1 hour; stored hashed in DB +- 8 new unit tests added for reset flow edge cases (expired, reuse, invalid token) diff --git a/skills/token-saving/evals/files/prior-conversation.txt b/skills/token-saving/evals/files/prior-conversation.txt new file mode 100644 index 0000000..1490b10 --- /dev/null +++ b/skills/token-saving/evals/files/prior-conversation.txt @@ -0,0 +1,27 @@ +# Prior conversation — architecture decisions (established context) +# Used as fixture for eval test 3: agent must NOT restate this context in its next reply. + +--- + +User: I'm starting a new backend project. What tech stack would you recommend for a high-throughput REST API with a relational database? +Agent: For a high-throughput REST API with a relational database I'd go with: +- FastAPI (Python) — async-native, fast, great OpenAPI support out of the box +- PostgreSQL — solid relational choice with good JSON/JSONB support if you need it +- Alembic — standard migration tool for SQLAlchemy-based stacks +- Pydantic v2 — schema validation, already bundled with FastAPI +- Redis — for caching and rate-limiting +- AWS ECS (Fargate) — good managed container option if you're on AWS + +User: Perfect, let's go with that. I want to use async SQLAlchemy sessions and run the service on port 8000. + +Agent: Confirmed — async SQLAlchemy with asyncpg driver, port 8000. Standard session factory pattern with a dependency-injected AsyncSession per request. + +User: Great. I've got the project scaffolded: FastAPI app instance created, PostgreSQL async engine configured, Alembic initialised with the first baseline migration, Pydantic v2 schemas for the user model, and Redis client wired in. The service is listening on port 8000. + +Agent: Solid foundation. You have all the core infrastructure pieces in place. + +--- +# End of established context. 
The eval prompt asks: "what's the concrete next step?" +# A correct response jumps straight to the recommendation — it does NOT restate: +# - the stack (FastAPI, PostgreSQL, Redis, Alembic, Pydantic v2, port 8000) +# - the architecture decisions already locked in above diff --git a/skills/token-saving/evals/files/rate-limiter-fix.diff b/skills/token-saving/evals/files/rate-limiter-fix.diff new file mode 100644 index 0000000..858511d --- /dev/null +++ b/skills/token-saving/evals/files/rate-limiter-fix.diff @@ -0,0 +1,28 @@ +diff --git a/src/middleware/rate_limiter.py b/src/middleware/rate_limiter.py +index 3a1f2c8..b9e4d71 100644 +--- a/src/middleware/rate_limiter.py ++++ b/src/middleware/rate_limiter.py +@@ -1,25 +1,25 @@ + import time + from collections import defaultdict + from typing import Dict, List + + + class RateLimiter: + """Sliding window rate limiter keyed by client_id.""" + + def __init__(self, max_requests: int = 100, window_seconds: int = 60): + self.max_requests = max_requests + self.window_seconds = window_seconds + self.requests: Dict[str, List[float]] = defaultdict(list) + + def is_allowed(self, client_id: str) -> bool: + now = time.time() + window_start = now - self.window_seconds +- self.requests[client_id] = [t for t in self.requests[client_id] if t > now] ++ self.requests[client_id] = [t for t in self.requests[client_id] if t > window_start] + self.requests[client_id].append(now) + return len(self.requests[client_id]) <= self.max_requests + + def reset(self, client_id: str) -> None: + self.requests.pop(client_id, None) diff --git a/skills/token-saving/evals/files/sprint-changelog.txt b/skills/token-saving/evals/files/sprint-changelog.txt new file mode 100644 index 0000000..6992ae4 --- /dev/null +++ b/skills/token-saving/evals/files/sprint-changelog.txt @@ -0,0 +1,36 @@ +# Sprint 23 — User Service Changelog +# Period: 2026-04-28 to 2026-05-09 + +## AUTH-201 — Password reset flow +- Added POST /auth/password-reset/request endpoint; sends reset 
token via email +- Added POST /auth/password-reset/confirm endpoint; validates token, sets new password +- Reset tokens expire after 1 hour; stored hashed in users.reset_token column +- Migration: adds reset_token (varchar 64, nullable) and reset_token_expires (timestamptz, nullable) to users table +- Unit tests: 8 new cases covering happy path, expired token, invalid token, reuse prevention + +## USER-88 — Profile update endpoint +- Added PATCH /users/{id}/profile; supports display_name and email changes +- Email change triggers re-verification flow; sets email_verified = false until confirmed +- Duplicate email check added before persisting change +- Integration test: 5 new cases + +## USER-91 — Soft delete / deactivate +- Added DELETE /users/{id} (soft delete); sets is_active = false, records deactivated_at +- GET /users now filters out inactive users by default; added ?include_inactive=true query param for admin use +- Migration: adds deactivated_at (timestamptz, nullable) to users table + +## INFRA-14 — Rate limiting middleware +- Added sliding-window rate limiter (100 req/60 s per client IP) as FastAPI middleware +- Fixed window-pruning bug: the comparison used `now` instead of `window_start`, so every stored timestamp was discarded and the limit was never enforced +- Redis backend for distributed rate limit state; falls back to in-memory if Redis unavailable +- Config: RATE_LIMIT_MAX_REQUESTS, RATE_LIMIT_WINDOW_SECONDS env vars added to .env.example + +## CHORE-09 — Dependency bumps +- Upgraded fastapi 0.109 → 0.111 (security patch: CVE-2024-24762) +- Upgraded pydantic 2.5 → 2.7 (minor; no breaking changes) +- Upgraded alembic 1.13 → 1.14 (minor; added index reflection improvements) + +## TEST-22 — Coverage improvements +- Overall coverage: 61% → 78% +- Added missing edge-case tests for validate_email (empty string, unicode domains, subaddressing) +- Added missing edge-case tests for validate_password (all-digits, all-uppercase, exactly min length) diff --git
a/skills/token-saving/evals/trigger-eval.json b/skills/token-saving/evals/trigger-eval.json new file mode 100644 index 0000000..f739903 --- /dev/null +++ b/skills/token-saving/evals/trigger-eval.json @@ -0,0 +1,107 @@ +[ + { + "_comment": "token-saving is always-active: should_trigger is true for every prompt type. There are no should_trigger:false cases — that is itself the key assertion. Cases are grouped into: (1) keyword-free prompts (tests that 'always active' is not just keyword matching), (2) explicit-keyword prompts (tests that the description's trigger list is recognised), (3) boundary prompts (explicit verbosity request — skill still LOADS but its rules are suspended per the override section), (4) co-trigger prompts (skill loads alongside another skill whose output format takes precedence). The trigger-eval passes when ALL 16 cases return true." + }, + { + "id": "t01-generic-coding-no-keyword", + "query": "What does the `__slots__` attribute do in Python?", + "should_trigger": true, + "reason": "Generic coding question with zero token-saving keywords. Always-active skill must load regardless of prompt content." + }, + { + "id": "t02-code-generation-no-keyword", + "query": "Write a Python function that flattens a nested list.", + "should_trigger": true, + "reason": "Code generation task. No conciseness keywords present. Always-active rule applies." + }, + { + "id": "t03-debugging-no-keyword", + "query": "Why does my Dockerfile build succeed but the container exits immediately at startup?", + "should_trigger": true, + "reason": "Debugging question with no skill keywords. Validates always-active claim holds for question-style prompts." + }, + { + "id": "t04-conceptual-explanation-no-keyword", + "query": "Explain eventual consistency in distributed systems.", + "should_trigger": true, + "reason": "Conceptual explanation request. No trigger keywords. Tests that the skill doesn't require topic-match to load."
+ }, + { + "id": "t05-non-technical-no-keyword", + "query": "What is the difference between a kanban board and a sprint board?", + "should_trigger": true, + "reason": "Non-technical process question. Always-active means the skill loads even outside software engineering topics." + }, + { + "id": "t06-pr-description-update-no-keyword", + "query": "Update the PR description to mention the hotfix for the null pointer in checkout.", + "should_trigger": true, + "reason": "PR workflow task. No explicit conciseness language. Tests always-active across workflow tasks." + }, + { + "id": "t07-recap-request-no-keyword", + "query": "Summarise the changes we made to the auth service today.", + "should_trigger": true, + "reason": "'summarise' appears in the description trigger list. But the primary signal is always-active. Confirms overlap between trigger list and always-active rule." + }, + { + "id": "t08-explicit-be-concise", + "query": "Give me a concise explanation of how TLS handshakes work.", + "should_trigger": true, + "reason": "'concise' is an explicit trigger keyword in the description. Skill must load and apply length constraints." + }, + { + "id": "t09-explicit-save-tokens", + "query": "Keep the response short — save tokens where you can.", + "should_trigger": true, + "reason": "'save tokens' and 'short' are explicit trigger phrases. Tests that the description's keyword list is matched." + }, + { + "id": "t10-explicit-too-verbose", + "query": "Your last answer was too verbose. Give me the same thing but shorter.", + "should_trigger": true, + "reason": "'too verbose' and 'shorter' are both explicit triggers in the description. Should reliably load the skill." + }, + { + "id": "t11-explicit-shorter", + "query": "Can you make that shorter?", + "should_trigger": true, + "reason": "'shorter' is an explicit trigger keyword. Minimal prompt — tests that a single keyword is sufficient." 
+ }, + { + "id": "b01-explicit-full-detail-boundary", + "query": "Give me a full in-depth explanation of OAuth 2.0 Authorization Code Flow — I want every step.", + "should_trigger": true, + "boundary": true, + "reason": "Explicit verbosity request. Skill should still LOAD (trigger=true), but the override section suspends all length and summarisation rules. Trigger result is true; behavioural result is an unconstrained response.", + "expected_behaviour": "Skill loads. All length limits and bullet caps are suspended. Full response with every step is produced." + }, + { + "id": "b02-deep-dive-boundary", + "query": "I want a deep dive — don't hold back on detail.", + "should_trigger": true, + "boundary": true, + "reason": "'deep dive' and 'don't hold back' both match the override trigger. Skill loads but its rules are suspended.", + "expected_behaviour": "Skill loads. No artificial truncation. Response length reflects actual content depth." + }, + { + "id": "b03-rationale-boundary", + "query": "Walk me through the full rationale for choosing event sourcing over CRUD for this audit log.", + "should_trigger": true, + "boundary": true, + "reason": "'full rationale' matches the override trigger. Skill loads but all conciseness rules are suspended for this response.", + "expected_behaviour": "Skill loads. Agent covers all architectural trade-offs without summarising or capping bullets." + }, + { + "id": "t12-co-trigger-pr-review", + "query": "Review this PR for API contract breaking changes before we merge.", + "should_trigger": true, + "reason": "Token-saving is always-active and must co-load alongside the pr-review skill. The pr-review output structure (Blocker/Important/Nit) takes precedence per the skill's precedence rule, but token-saving still loads to suppress filler openers and closing padding."
+ }, + { + "id": "t13-co-trigger-kudos", + "query": "Nominate Sarah for kudos — she fixed a critical auth bug under pressure.", + "should_trigger": true, + "reason": "Token-saving is always-active and must co-load alongside the kudos skill. Kudos nomination format takes precedence, but token-saving still loads to enforce no-preamble and no-filler rules." + } +] From e709f122c670724ccb8e61ba4a1fa0391c6cd9e4 Mon Sep 17 00:00:00 2001 From: miroslavpojer Date: Wed, 20 May 2026 15:54:08 +0200 Subject: [PATCH 2/2] Add title parameter to release notes presence check in PR workflow --- .github/workflows/check_pr_release_notes.yml | 1 + 1 file changed, 1 insertion(+) diff --git a/.github/workflows/check_pr_release_notes.yml b/.github/workflows/check_pr_release_notes.yml index af3761e..646e098 100644 --- a/.github/workflows/check_pr_release_notes.yml +++ b/.github/workflows/check_pr_release_notes.yml @@ -21,3 +21,4 @@ jobs: github-repository: ${{ github.repository }} pr-number: ${{ github.event.number }} skip-labels: "no RN" + title: "## [Rr]elease [Nn]otes"
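The expectations encoded in `trigger-eval.json` (every case has `should_trigger: true`, boundary cases carry `expected_behaviour`, ids are unique) can be sanity-checked statically before the eval itself is run, which also keeps the case count quoted in the `_comment` entry honest. A minimal sketch, assuming entries with a `_comment` key are documentation rather than cases; `load_cases`, `count_cases` and `check_trigger_eval` are hypothetical helpers, not part of this repo or of the `skill-creator` tooling. It checks only the file's consistency, not whether the model actually loads the skill.

```python
import json


def load_cases(path: str) -> list:
    """Read a trigger-eval JSON file (a top-level array of objects)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def count_cases(entries: list) -> int:
    """Count real cases, skipping documentation-only "_comment" entries."""
    return sum(1 for entry in entries if "_comment" not in entry)


def check_trigger_eval(entries: list) -> list:
    """Return human-readable problems; an empty list means the file is consistent."""
    problems = []
    seen_ids = set()
    for entry in entries:
        if "_comment" in entry:
            continue  # documentation entry, not a test case
        case_id = entry.get("id", "<missing id>")
        if case_id in seen_ids:
            problems.append(f"{case_id}: duplicate id")
        seen_ids.add(case_id)
        # The skill is always-active, so every case must expect it to load.
        if entry.get("should_trigger") is not True:
            problems.append(f"{case_id}: should_trigger must be true")
        # Boundary cases must spell out the suspended-rules behaviour.
        if entry.get("boundary") and "expected_behaviour" not in entry:
            problems.append(f"{case_id}: boundary case lacks expected_behaviour")
    return problems
```

For the file in this patch, `check_trigger_eval(load_cases("skills/token-saving/evals/trigger-eval.json"))` should return an empty list, and `count_cases` returns 16 (t01 to t13 plus b01 to b03).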