fix(workflows): paginated GET /workflows — resolves list incompleteness past 500 sessions by surpradhan · Pull Request #155 · surpradhan/agent-event-protocol

surpradhan · 2026-06-26T08:50:55Z

Summary

Root cause: The dashboard's workflow list was assembled client-side from GET /sessions?limit=500. Sessions beyond position 500 were invisible, so workflows whose trace_id only appeared after that threshold were silently absent. setMetric('m-workflows', byTrace.size) also overwrote the accurate count from GET /metrics with a potentially-lower client-side value.
Fix: Add GET /v1/workflows?limit=&cursor= — a proper paginated endpoint that queries all workflows directly from the DB. Dashboard switches from client-side assembly to fetching this endpoint.
No schema change, no new CI job.

Changes

Layer	What changed
`src/db/backends/sqlite.js`	`listWorkflows(tenantId, { limit, cursor })` — subquery keyset pagination on `{ last_active, trace_id }`
`src/db/backends/postgres.js`	Same, with `$N::text` casts for Postgres type inference
`src/db/backends/interface.js`	Stub + JSDoc
`src/db/index.js`	Thin async wrapper + export
`src/server.js`	`GET /v1/workflows` (read-scoped, `validateQueryParams`, tenant-scoped)
`src/openapi.json`	New path entry with request/response schema
`src/public/dashboard.html`	`loadWorkflowList()` + `renderWorkflowList()` replace `buildWorkflowList(sessions)`; active state updates via `data-trace-id`; `debouncedRefreshWorkflowList` wired to SSE; `setMetric('m-workflows', ...)` override removed
`tests/integration/server.test.js`	6 new tests: 200 shape, session counts, pagination, auth, tenant isolation

Test plan

245 tests pass (npm test)
Lint clean (npm run lint)
GET /workflows returns { workflows: [...], next_cursor } with correct session_count per trace
Pagination: limit=1 yields next_cursor; following it returns a different workflow
Tenant isolation: second-tenant key sees only its own workflows
Dashboard workflow list fetches GET /workflows on load and SSE events
Clicking a workflow updates active state without an extra list fetch

🤖 Generated with Claude Code

surpradhan

Validator Review — PR #155 `fix/workflow-list-pagination`

Gates

npm test: 245/245 pass, 0 fail (includes the 6 new integration tests for GET /workflows)
npm run lint: clean
Working tree: clean (git status shows nothing uncommitted)

What was validated empirically

Pagination correctness — I ran a scratch in-memory simulation of the keyset cursor logic ({ last_active, trace_id }, DESC/DESC sort, tie-break on trace_id). Iterating limit=1 over 3 workflows with distinct timestamps produced all 3 items in the correct order with no duplicates and terminated cleanly. The off-by-one fence (pageSize + 1 fetch, then pop() + cursor from the last kept row) is standard and correct.

better-sqlite3 integer type — verified in the Node REPL that COUNT(*) AS session_count comes back as a JS number directly from better-sqlite3. The SQLite backend's return { workflows: rows, next_cursor } (no explicit Number() cast) produces the same type as the Postgres backend's explicit Number(r.session_count). Both are fine.

Tenant isolation — the SQL WHERE clause ($1::text IS NULL OR tenant_id = $2) / (? IS NULL OR tenant_id = ?) is correctly scoped; the integration test mints a key for wl-other-tenant and asserts it cannot see the default tenant's workflows. Test passes.

m-workflows metric fix — confirmed the old setMetric('m-workflows', byTrace.size) override is gone. loadAll() now sets m-workflows from GET /metrics → metrics.workflow_count (line 3015 in dashboard.html), which is a DB-accurate count. The client no longer pollutes it.

Findings

1. MEDIUM — src/public/dashboard.html: Hardcoded limit=200 re-introduces a silent cap

loadWorkflowList() fetches /workflows?limit=200. This replaces the old 500-session cap with a 200-workflow cap, and crucially the dashboard has no "load more" button or pagination loop. A tenant with >200 workflows will see a silently truncated list, the same class of bug that this PR is fixing. The PR description says "proper paginated endpoint" but the dashboard only fetches page 1.

Concrete fix: either bump the constant to 500 (matching the server's enforced max) and add a comment acknowledging it's still a one-page fetch, or implement a "load more" button that appends subsequent pages. The server-side machinery for pagination is there; the client just doesn't use it yet.

2. LOW — src/public/dashboard.html: switchMain('workflows') does not refresh the list on tab activation

Every other lazy-loaded tab refreshes its data when switched to (e.g. if (v === 'rejected') loadRejections()), but switchMain has no if (v === 'workflows') loadWorkflowList(). The list is populated once on page load and then by the SSE debounce (800 ms). If a user visits another page/tab and returns, or if they open the dashboard while the Workflows tab is already selected via #workflows hash routing, the workflow list is stale until the next SSE event.

Concrete fix: add if (v === 'workflows') loadWorkflowList(); alongside the other tab guards in switchMain.

3. MINOR — src/db/backends/sqlite.js vs postgres.js: shape asymmetry in listWorkflows return

SQLite returns rows verbatim: { workflows: rows, next_cursor }. Postgres remaps each row explicitly: { trace_id, session_count: Number(...), last_active }. The contract is the same because better-sqlite3 natively returns COUNT(*) as a JS integer, but the Postgres version is more explicit and defensive. For future maintainability it would be worth making SQLite remap consistently:

return {
  workflows: rows.map(r => ({
    trace_id:      r.trace_id,
    session_count: Number(r.session_count),
    last_active:   r.last_active,
  })),
  next_cursor,
};

4. MINOR — src/server.js GET /workflows: no validateTenantOwnership defense-in-depth check

GET /sessions (line 426) runs validateTenantOwnership(result.sessions, req.tenant_id, "sessions_list") as a defense-in-depth check after the DB call. The new GET /workflows route skips this. In practice the omission is benign — the returned rows contain only { trace_id, session_count, last_active } (no tenant_id field), so validateTenantOwnership's item-level check !item.tenant_id || item.tenant_id === tenantId would always pass. But the absence creates an inconsistency worth noting for audit purposes.

5. MINOR — src/openapi.json: new path registered as /workflows, not /v1/workflows

The other paginated workflow path (/workflows/{traceId}) is registered the same way (/workflows/{traceId} not /v1/workflows/{traceId}) and the spec has "servers" entries for both http://localhost:8787/v1 and http://localhost:8787, so the OpenAPI spec is correct and consistent with the existing convention. No action needed — just noting it is intentional.

Summary table

#	Severity	File	Issue
1	MEDIUM	`dashboard.html:1569`	`limit=200` cap re-introduces silent truncation; no pagination loop
2	LOW	`dashboard.html:1492–1497`	`switchMain('workflows')` doesn't call `loadWorkflowList()` on tab activation
3	MINOR	`sqlite.js:865`	Return shape remapping inconsistent with Postgres backend
4	MINOR	`server.js:530–535`	Missing `validateTenantOwnership` defense-in-depth (benign but inconsistent)
5	MINOR	`openapi.json:919`	`/workflows` path (not `/v1/workflows`) — consistent with existing spec convention, no action needed

Verdict

The core fix is correct and well-tested: the root cause (client-side assembly from a 500-session cap) is eliminated, the keyset pagination algorithm is sound, tenant isolation is enforced in SQL and covered by tests, and the m-workflows metric source is fixed. The 6 new integration tests discriminate the right behaviors (shape, session counts, pagination, auth, isolation).

Finding #1 is the one that needs resolution before merge — the dashboard's hardcoded limit=200 re-introduces a silent cap at a different threshold. Findings #2–4 are low/minor and could be follow-up issues rather than blockers.

surpradhan

Code Review — PR #155: paginated GET /workflows

Summary

Clean, well-scoped fix. The root cause (client-side list capped at 500 sessions) is properly addressed with a server-side paginated endpoint using keyset pagination on { last_active, trace_id }. Both SQLite and Postgres backends, the route, OpenAPI spec, and tests are all present. Gates are green (245 tests pass, lint clean). Below are findings by severity.

Findings

[Suggestion] src/public/dashboard.html line 1569 — Hard-coded limit=200 leaves the original truncation problem partially in place

loadWorkflowList() fetches /workflows?limit=200 and does not implement pagination (no "load more" / cursor tracking). A tenant with more than 200 workflows will still see a silently-truncated list — the threshold just moved from 500 sessions to 200 workflows. Since the endpoint now supports proper pagination, the dashboard should either:

Raise the limit to 500 (matching the backend cap and preventing a new silent floor), or
Implement a "Load more" link that follows next_cursor.

The PR description says the fix resolves list incompleteness; a hard 200-workflow cap without visible pagination means it is only partially resolved for large tenants.

[Suggestion] src/middleware/queryValidation.js line 162 vs src/db/backends/sqlite.js line 827 — Middleware allows limit up to 1000 but the backend silently clamps at 500

validateQueryParams rejects values outside 1–1000. listWorkflows (both backends) clamps pageSize to 500. So GET /workflows?limit=600 passes the middleware, the backend silently serves 500 rows, and the caller has no idea. The OpenAPI spec correctly documents maximum: 500, but the enforcement layer is in the backend, not the middleware.

Fix: either reject limit > 500 for this route at the middleware layer (breaking change if callers already send 501–1000), or — simpler and consistent with getPaginatedSessions — add a per-route post-parse check in the route handler:

v1.get("/workflows", requireReadAccess, validateQueryParams, async (req, res) => {
  const { limit, cursor } = req.query;
  const parsedLimit = parseInt(limit, 10) || 50;
  if (parsedLimit > 500) {
    return res.status(400).json({ error: "Bad Request", message: "limit must be ≤ 500 for this endpoint" });
  }
  ...
});

Alternatively, keep the silent clamp but add the 400 response to the OpenAPI spec and mention the effective cap in the description. The current spec says maximum: 500 in the schema but the middleware lets 501–1000 through silently.

[Nit] src/openapi.json line 967 — Missing 400 response entry

Every other paginated endpoint that accepts limit/cursor documents a 400 response for bad query params (e.g. /analytics/policy-blocked, /analytics/anomalies). The new /workflows entry only documents 401. Add:

"400": { "description": "Invalid query parameter (limit outside 1–500 or malformed cursor)" }

[Nit] src/server.js line 530 — No validateTenantOwnership defense-in-depth check (minor)

All other listing routes (GET /sessions, GET /sessions/:id/events, GET /sessions/:id/tree) call validateTenantOwnership on the returned data as a defense-in-depth guard. The listWorkflows route omits this. It is not a real security gap because the returned rows contain no tenant_id field (only trace_id, session_count, last_active), so validateTenantOwnership would always return true — but it is an inconsistency with the established house pattern. Consider either applying it (it will always pass, serving as a safe future-proofing anchor if tenant_id is added to the response later) or leaving a short comment explaining why it is omitted here.

[Nit] tests/integration/server.test.js line 819 — Pagination test can silently no-op

if (!page1.next_cursor) return; // only one workflow in the tenant — skip

With two distinct trace IDs seeded in before(), limit=1 should always yield a next_cursor for tenant-test. The silent return means that if a regression causes the endpoint to return all results without a cursor, this test passes without asserting anything. Prefer a hard assertion:

assert.ok(page1.next_cursor, "expected next_cursor with limit=1 and 2+ seeded workflows");

[Nit] src/public/dashboard.html line 1495 — switchMain('workflows') does not call loadWorkflowList()

switchMain for rejected, analytics, performance, custom, anomalies, and webhooks all trigger a data-load on tab switch (lines 1490–1495). The workflows branch has no corresponding loadWorkflowList() call. After the loadAll() initial load, if a user switches away and returns to the Workflows tab, the list does not refresh. This was also true before this PR (because buildWorkflowList was driven by loadAll), but now that loadWorkflowList is the entry point it is more visible. A one-liner in switchMain:

if (v === 'workflows') loadWorkflowList();

What looks good

Keyset pagination is correctly implemented in both backends. The { last_active DESC, trace_id DESC } tie-break is sound: trace_id is a non-null TEXT NOT NULL column, so it provides a deterministic tiebreaker when two workflows share the same MAX(updated_at).
Subquery pattern is correct. Aggregating in the inner query and applying the cursor in the outer WHERE avoids repeating aggregate expressions and is the right approach for post-aggregate keyset pagination.
Cursor injection is safe. Values from decodeCursor are used as SQL parameters only, never interpolated. A malformed or adversarially-crafted cursor gracefully degrades to the first page.
SQLite session_count type is correct. better-sqlite3 returns COUNT(*) as a native JS number, not a string. The Postgres backend correctly applies Number() coercion. Types are consistent across both backends at the API boundary.
data-trace-id active-state update in loadWorkflow() (replacing the buildWorkflowList re-render) is a nice improvement — active state toggles without triggering a network round-trip.
Tenant isolation is enforced in SQL (AND ($1::text IS NULL OR tenant_id = $2)) and tested with a second-tenant key.
No schema change, no new CI job — constraints respected.
debouncedRefreshWorkflowList is wired to SSE with an 800 ms debounce (100 ms longer than the session list), which avoids double-fetch on the same SSE tick.

Verdict

Request Changes (minor). The hard-coded limit=200 in the dashboard is the most impactful item: for a tenant with 200+ workflows the stated bug (silent truncation) is still present, just at a different threshold. The pagination test silent-skip and the missing switchMain call are quick fixes. Everything else is nit-level polish.

surpradhan

Round 2 validation — APPROVE

Gates run against worktree /tmp/aep-wf-list-fix (branch fix/workflow-list-pagination, tip 31af134).

Gate results

npm test

ℹ pass 246
ℹ fail 0
ℹ skipped 0

Up from 239 before this PR (7 new tests for GET /workflows). All existing tests green; no regressions.

npm run lint — exits 0, no output.

R1 findings: all resolved

R1 finding	Fix verified
Dashboard fetch `limit=200` missed sessions 201–500	`loadWorkflowList()` now fetches `/workflows?limit=500` (server max); confirmed line 1570 of `dashboard.html`
Workflow tab loaded stale data on first activation	`switchMain('workflows')` calls `loadWorkflowList()` at line 1490
SQLite `listWorkflows` returned raw SQLite integers (not JS numbers)	`Number(r.session_count)` at line 867 of `sqlite.js`; parity with Postgres backend
New route lacked `validateTenantOwnership`	Added at lines 539–541 of `server.js`; passes vacuously for workflow rows (no `tenant_id` field) per the comment at line 537 and the guard logic at `server.js:71`
No explicit 400 for `limit > 500`	Route now returns `res.status(400)` at line 532–534 before delegating to `listWorkflows`; confirmed by test at line 828
OpenAPI `/workflows` entry missing `400` response	Added at line 967 of `openapi.json`: `"400": { "description": "Invalid query parameter (limit outside 1–500 or malformed cursor)" }`
Pagination test silently skipped when `next_cursor` was absent	`assert.ok(page1.next_cursor, "expected next_cursor...")` at line 819 — hard fail, not a skip

Probe outcomes

Route ordering — GET /workflows (static) is registered before GET /workflows/:traceId (parameterized) in server.js (lines 530 vs 547). Express routes in declaration order; static wins correctly.

validateTenantOwnership vacuous-pass — confirmed: the function returns true for array items with no tenant_id field (line 71: !item.tenant_id || ...). The defense-in-depth call is safe and consistent with the sessions/events pattern.

Cursor helpers — decodeCursor/encodeCursor are shared from src/db/backends/_helpers.js (base64url, JSON-safe, null-on-decode-error). Both backends use the same helpers. Malformed cursor → decodeCursor returns null → cursor clause becomes ? IS NULL → full scan from start; harmless and consistent with other paginated endpoints.

Keyset pagination correctness — the OR (last_active = ? AND trace_id < ?) tiebreak in both backends ensures deterministic ordering when two workflows share the same MAX(updated_at). Confirmed in the live pagination test (two seeded traces both inserted in the same test run, limit=1 yields non-overlapping pages).

tiers.test.js in 2-dot diff — noted during review; the branch predates PR #149 (which added that file to main). The 3-dot diff (git diff origin/main...HEAD) does NOT include it. GitHub's squash merge operates on the 3-dot diff so the file survives on main after merge. Not a bug, no action required from the author.

Zero actionable items

All R1 findings addressed correctly. No new issues introduced. 246/246 tests pass, lint clean, working tree clean.

surpradhan

Round 2 Review — fix/workflow-list-pagination

Reviewer: independent code-review agent (R2)
Gates: 246 tests pass, lint clean.

Prior findings — all resolved

Every R1 finding has been addressed correctly:

Dashboard limit: limit=500 in loadWorkflowList() — matches the server-side cap. No further sessions are fetched just to build the list.
Tab activation: switchMain('workflows') now calls loadWorkflowList(), so the list is populated on first tab switch regardless of prior SSE state.
SQLite field mapping: listWorkflows now returns explicit { trace_id, session_count: Number(...), last_active } — byte-for-byte parity with Postgres.
Tenant ownership check: validateTenantOwnership is called on result.workflows. The workflow rows carry no tenant_id field, so Array.prototype.every returns true vacuously (line 71 of server.js: !item.tenant_id || ...). This is intentional and consistent; actual scoping is enforced by the DB query's WHERE tenant_id = ?. The comment in the route handler explains this correctly.
Explicit 400 guard: limit > 500 checked before the DB call and returns 400 with a clear message.
OpenAPI 400 entry: documented on /workflows.
Pagination test strengthened: assert.ok(page1.next_cursor, ...) now hard-fails if fewer than 2 seeded workflows exist.
New limit=501 test: passes correctly.

New code — nothing actionable

Route ordering is correct: /workflows (exact match) is registered before /workflows/:traceId (param route) — Express will never confuse them.
Cursor codec (decodeCursor / encodeCursor) is shared with all other paginated endpoints; a malformed cursor silently falls back to the first page (returns 200). The OpenAPI 400 description mentions "malformed cursor" — this is a minor spec overstatement, but it exactly matches the pre-existing pattern on /sessions (which doesn't document 400 at all for cursors). Not a regression; acceptable as-is.
m-workflows metric card is set from GET /metrics in loadAll() and is NOT refreshed by debouncedRefreshWorkflowList() on SSE events. This is correct: the old buildWorkflowList was overwriting it with a potentially-low client-side count, so removing that was the right move. The metric stays accurate from the last loadAll()/metrics fetch. Not a regression.
parseInt(limit, 10) || 50 in the backend falls back to 50 for NaN (non-numeric string) and 0. Math.max(1, -5) clamps negative values to 1. Both are safe and consistent with the behavior of all other paginated endpoints in the codebase.
Tenant isolation test seeds a dedicated wl-other-tenant key and verifies zero cross-tenant leakage. Combined with the DB-level WHERE tenant_id = ? in both backends, isolation is solid.

What looks good

Keyset pagination is implemented identically in both SQLite and Postgres backends — same subquery shape, same parameter binding pattern, same N+1 detect-and-pop idiom used everywhere else in the project.
Number(r.session_count) coercion guards correctly against Postgres returning COUNT(*) as a string.
data-trace-id on list items lets loadWorkflow() toggle the active state with a simple DOM query scan, avoiding an unnecessary re-fetch.
Six integration tests cover shape, counts, pagination, auth, and tenant isolation — the test that matters most (pagination produces a non-overlapping next page) now hard-asserts the cursor exists.

Verdict: Approve. All R1 findings are resolved, no new issues introduced, gates are green.

… incompleteness past 500 sessions The dashboard's workflow list was assembled client-side from /sessions?limit=500. Any trace_id that appeared only in sessions beyond position 500 was silently absent from the list, and setMetric('m-workflows', byTrace.size) could report a lower count than the accurate workflow_count from GET /metrics. Add listWorkflows(tenantId, { limit, cursor }) to both SQLite and Postgres backends (subquery-based keyset pagination on { last_active, trace_id }; max 500 per page; no schema change). Expose it as GET /v1/workflows with validateQueryParams and tenant scoping. Dashboard now fetches GET /workflows?limit=200 on initial load and on SSE events (debounced, separate from the session-list refresh). loadWorkflow() updates the active state via DOM data-trace-id attributes instead of re-fetching. The setMetric('m-workflows', ...) override is removed from the workflow list — the accurate count already comes from GET /metrics. 6 new integration tests (200, pagination, session counts, shape, auth, tenant isolation). 245 tests pass, lint clean.

…ow mapping, validateTenantOwnership

…nAPI 400 entry, hard assert in pagination test

surpradhan commented Jun 26, 2026

View reviewed changes

surpradhan added 4 commits June 26, 2026 14:38

fix(workflows): address R1 findings — limit 500, tab reload, SQLite r…

17a4a8f

…ow mapping, validateTenantOwnership

fix(workflows): address R2 findings — explicit 400 for limit>500, Ope…

d4da6a7

…nAPI 400 entry, hard assert in pagination test

chore: trigger CI

94497df

surpradhan force-pushed the fix/workflow-list-pagination branch from 9df5eef to 94497df Compare June 26, 2026 09:08

surpradhan merged commit 41b8839 into main Jun 26, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(workflows): paginated GET /workflows — resolves list incompleteness past 500 sessions#155

fix(workflows): paginated GET /workflows — resolves list incompleteness past 500 sessions#155
surpradhan merged 4 commits into
mainfrom
fix/workflow-list-pagination

surpradhan commented Jun 26, 2026

Uh oh!

surpradhan left a comment

Uh oh!

surpradhan left a comment

Uh oh!

surpradhan left a comment

Uh oh!

surpradhan left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

surpradhan commented Jun 26, 2026

Summary

Changes

Test plan

Uh oh!

surpradhan left a comment

Choose a reason for hiding this comment

Validator Review — PR #155 fix/workflow-list-pagination

Gates

What was validated empirically

Findings

Summary table

Verdict

Uh oh!

surpradhan left a comment

Choose a reason for hiding this comment

Code Review — PR #155: paginated GET /workflows

Summary

Findings

What looks good

Verdict

Uh oh!

surpradhan left a comment

Choose a reason for hiding this comment

Gate results

R1 findings: all resolved

Probe outcomes

Zero actionable items

Uh oh!

surpradhan left a comment

Choose a reason for hiding this comment

Round 2 Review — fix/workflow-list-pagination

Prior findings — all resolved

New code — nothing actionable

What looks good

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Validator Review — PR #155 `fix/workflow-list-pagination`