fix(workflows): paginated GET /workflows — resolves list incompleteness past 500 sessions#155
Conversation
surpradhan
left a comment
There was a problem hiding this comment.
Validator Review — PR #155 fix/workflow-list-pagination
Gates
npm test: 245/245 pass, 0 fail (includes the 6 new integration tests forGET /workflows)npm run lint: clean- Working tree: clean (
git statusshows nothing uncommitted)
What was validated empirically
Pagination correctness — I ran a scratch in-memory simulation of the keyset cursor logic ({ last_active, trace_id }, DESC/DESC sort, tie-break on trace_id). Iterating limit=1 over 3 workflows with distinct timestamps produced all 3 items in the correct order with no duplicates and terminated cleanly. The off-by-one fence (pageSize + 1 fetch, then pop() + cursor from the last kept row) is standard and correct.
better-sqlite3 integer type — verified in the Node REPL that COUNT(*) AS session_count comes back as a JS number directly from better-sqlite3. The SQLite backend's return { workflows: rows, next_cursor } (no explicit Number() cast) produces the same type as the Postgres backend's explicit Number(r.session_count). Both are fine.
Tenant isolation — the SQL WHERE clause ($1::text IS NULL OR tenant_id = $2) / (? IS NULL OR tenant_id = ?) is correctly scoped; the integration test mints a key for wl-other-tenant and asserts it cannot see the default tenant's workflows. Test passes.
m-workflows metric fix — confirmed the old setMetric('m-workflows', byTrace.size) override is gone. loadAll() now sets m-workflows from GET /metrics → metrics.workflow_count (line 3015 in dashboard.html), which is a DB-accurate count. The client no longer pollutes it.
Findings
1. MEDIUM — src/public/dashboard.html: Hardcoded limit=200 re-introduces a silent cap
loadWorkflowList() fetches /workflows?limit=200. This replaces the old 500-session cap with a 200-workflow cap, and crucially the dashboard has no "load more" button or pagination loop. A tenant with >200 workflows will see a silently truncated list, the same class of bug that this PR is fixing. The PR description says "proper paginated endpoint" but the dashboard only fetches page 1.
Concrete fix: either bump the constant to 500 (matching the server's enforced max) and add a comment acknowledging it's still a one-page fetch, or implement a "load more" button that appends subsequent pages. The server-side machinery for pagination is there; the client just doesn't use it yet.
2. LOW — src/public/dashboard.html: switchMain('workflows') does not refresh the list on tab activation
Every other lazy-loaded tab refreshes its data when switched to (e.g. if (v === 'rejected') loadRejections()), but switchMain has no if (v === 'workflows') loadWorkflowList(). The list is populated once on page load and then by the SSE debounce (800 ms). If a user visits another page/tab and returns, or if they open the dashboard while the Workflows tab is already selected via #workflows hash routing, the workflow list is stale until the next SSE event.
Concrete fix: add if (v === 'workflows') loadWorkflowList(); alongside the other tab guards in switchMain.
3. MINOR — src/db/backends/sqlite.js vs postgres.js: shape asymmetry in listWorkflows return
SQLite returns rows verbatim: { workflows: rows, next_cursor }. Postgres remaps each row explicitly: { trace_id, session_count: Number(...), last_active }. The contract is the same because better-sqlite3 natively returns COUNT(*) as a JS integer, but the Postgres version is more explicit and defensive. For future maintainability it would be worth making SQLite remap consistently:
return {
workflows: rows.map(r => ({
trace_id: r.trace_id,
session_count: Number(r.session_count),
last_active: r.last_active,
})),
next_cursor,
};4. MINOR — src/server.js GET /workflows: no validateTenantOwnership defense-in-depth check
GET /sessions (line 426) runs validateTenantOwnership(result.sessions, req.tenant_id, "sessions_list") as a defense-in-depth check after the DB call. The new GET /workflows route skips this. In practice the omission is benign — the returned rows contain only { trace_id, session_count, last_active } (no tenant_id field), so validateTenantOwnership's item-level check !item.tenant_id || item.tenant_id === tenantId would always pass. But the absence creates an inconsistency worth noting for audit purposes.
5. MINOR — src/openapi.json: new path registered as /workflows, not /v1/workflows
The other paginated workflow path (/workflows/{traceId}) is registered the same way (/workflows/{traceId} not /v1/workflows/{traceId}) and the spec has "servers" entries for both http://localhost:8787/v1 and http://localhost:8787, so the OpenAPI spec is correct and consistent with the existing convention. No action needed — just noting it is intentional.
Summary table
| # | Severity | File | Issue |
|---|---|---|---|
| 1 | MEDIUM | dashboard.html:1569 |
limit=200 cap re-introduces silent truncation; no pagination loop |
| 2 | LOW | dashboard.html:1492–1497 |
switchMain('workflows') doesn't call loadWorkflowList() on tab activation |
| 3 | MINOR | sqlite.js:865 |
Return shape remapping inconsistent with Postgres backend |
| 4 | MINOR | server.js:530–535 |
Missing validateTenantOwnership defense-in-depth (benign but inconsistent) |
| 5 | MINOR | openapi.json:919 |
/workflows path (not /v1/workflows) — consistent with existing spec convention, no action needed |
Verdict
The core fix is correct and well-tested: the root cause (client-side assembly from a 500-session cap) is eliminated, the keyset pagination algorithm is sound, tenant isolation is enforced in SQL and covered by tests, and the m-workflows metric source is fixed. The 6 new integration tests discriminate the right behaviors (shape, session counts, pagination, auth, isolation).
Finding #1 is the one that needs resolution before merge — the dashboard's hardcoded limit=200 re-introduces a silent cap at a different threshold. Findings #2–4 are low/minor and could be follow-up issues rather than blockers.
surpradhan
left a comment
There was a problem hiding this comment.
Code Review — PR #155: paginated GET /workflows
Summary
Clean, well-scoped fix. The root cause (client-side list capped at 500 sessions) is properly addressed with a server-side paginated endpoint using keyset pagination on { last_active, trace_id }. Both SQLite and Postgres backends, the route, OpenAPI spec, and tests are all present. Gates are green (245 tests pass, lint clean). Below are findings by severity.
Findings
[Suggestion] src/public/dashboard.html line 1569 — Hard-coded limit=200 leaves the original truncation problem partially in place
loadWorkflowList() fetches /workflows?limit=200 and does not implement pagination (no "load more" / cursor tracking). A tenant with more than 200 workflows will still see a silently-truncated list — the threshold just moved from 500 sessions to 200 workflows. Since the endpoint now supports proper pagination, the dashboard should either:
- Raise the limit to 500 (matching the backend cap and preventing a new silent floor), or
- Implement a "Load more" link that follows
next_cursor.
The PR description says the fix resolves list incompleteness; a hard 200-workflow cap without visible pagination means it is only partially resolved for large tenants.
[Suggestion] src/middleware/queryValidation.js line 162 vs src/db/backends/sqlite.js line 827 — Middleware allows limit up to 1000 but the backend silently clamps at 500
validateQueryParams rejects values outside 1–1000. listWorkflows (both backends) clamps pageSize to 500. So GET /workflows?limit=600 passes the middleware, the backend silently serves 500 rows, and the caller has no idea. The OpenAPI spec correctly documents maximum: 500, but the enforcement layer is in the backend, not the middleware.
Fix: either reject limit > 500 for this route at the middleware layer (breaking change if callers already send 501–1000), or — simpler and consistent with getPaginatedSessions — add a per-route post-parse check in the route handler:
v1.get("/workflows", requireReadAccess, validateQueryParams, async (req, res) => {
const { limit, cursor } = req.query;
const parsedLimit = parseInt(limit, 10) || 50;
if (parsedLimit > 500) {
return res.status(400).json({ error: "Bad Request", message: "limit must be ≤ 500 for this endpoint" });
}
...
});Alternatively, keep the silent clamp but add the 400 response to the OpenAPI spec and mention the effective cap in the description. The current spec says maximum: 500 in the schema but the middleware lets 501–1000 through silently.
[Nit] src/openapi.json line 967 — Missing 400 response entry
Every other paginated endpoint that accepts limit/cursor documents a 400 response for bad query params (e.g. /analytics/policy-blocked, /analytics/anomalies). The new /workflows entry only documents 401. Add:
"400": { "description": "Invalid query parameter (limit outside 1–500 or malformed cursor)" }[Nit] src/server.js line 530 — No validateTenantOwnership defense-in-depth check (minor)
All other listing routes (GET /sessions, GET /sessions/:id/events, GET /sessions/:id/tree) call validateTenantOwnership on the returned data as a defense-in-depth guard. The listWorkflows route omits this. It is not a real security gap because the returned rows contain no tenant_id field (only trace_id, session_count, last_active), so validateTenantOwnership would always return true — but it is an inconsistency with the established house pattern. Consider either applying it (it will always pass, serving as a safe future-proofing anchor if tenant_id is added to the response later) or leaving a short comment explaining why it is omitted here.
[Nit] tests/integration/server.test.js line 819 — Pagination test can silently no-op
if (!page1.next_cursor) return; // only one workflow in the tenant — skipWith two distinct trace IDs seeded in before(), limit=1 should always yield a next_cursor for tenant-test. The silent return means that if a regression causes the endpoint to return all results without a cursor, this test passes without asserting anything. Prefer a hard assertion:
assert.ok(page1.next_cursor, "expected next_cursor with limit=1 and 2+ seeded workflows");[Nit] src/public/dashboard.html line 1495 — switchMain('workflows') does not call loadWorkflowList()
switchMain for rejected, analytics, performance, custom, anomalies, and webhooks all trigger a data-load on tab switch (lines 1490–1495). The workflows branch has no corresponding loadWorkflowList() call. After the loadAll() initial load, if a user switches away and returns to the Workflows tab, the list does not refresh. This was also true before this PR (because buildWorkflowList was driven by loadAll), but now that loadWorkflowList is the entry point it is more visible. A one-liner in switchMain:
if (v === 'workflows') loadWorkflowList();What looks good
- Keyset pagination is correctly implemented in both backends. The
{ last_active DESC, trace_id DESC }tie-break is sound:trace_idis a non-nullTEXT NOT NULLcolumn, so it provides a deterministic tiebreaker when two workflows share the sameMAX(updated_at). - Subquery pattern is correct. Aggregating in the inner query and applying the cursor in the outer
WHEREavoids repeating aggregate expressions and is the right approach for post-aggregate keyset pagination. - Cursor injection is safe. Values from
decodeCursorare used as SQL parameters only, never interpolated. A malformed or adversarially-crafted cursor gracefully degrades to the first page. - SQLite
session_counttype is correct.better-sqlite3returnsCOUNT(*)as a native JS number, not a string. The Postgres backend correctly appliesNumber()coercion. Types are consistent across both backends at the API boundary. data-trace-idactive-state update inloadWorkflow()(replacing thebuildWorkflowListre-render) is a nice improvement — active state toggles without triggering a network round-trip.- Tenant isolation is enforced in SQL (
AND ($1::text IS NULL OR tenant_id = $2)) and tested with a second-tenant key. - No schema change, no new CI job — constraints respected.
debouncedRefreshWorkflowListis wired to SSE with an 800 ms debounce (100 ms longer than the session list), which avoids double-fetch on the same SSE tick.
Verdict
Request Changes (minor). The hard-coded limit=200 in the dashboard is the most impactful item: for a tenant with 200+ workflows the stated bug (silent truncation) is still present, just at a different threshold. The pagination test silent-skip and the missing switchMain call are quick fixes. Everything else is nit-level polish.
surpradhan
left a comment
There was a problem hiding this comment.
Round 2 validation — APPROVE
Gates run against worktree /tmp/aep-wf-list-fix (branch fix/workflow-list-pagination, tip 31af134).
Gate results
npm test
ℹ pass 246
ℹ fail 0
ℹ skipped 0
Up from 239 before this PR (7 new tests for GET /workflows). All existing tests green; no regressions.
npm run lint — exits 0, no output.
R1 findings: all resolved
| R1 finding | Fix verified |
|---|---|
Dashboard fetch limit=200 missed sessions 201–500 |
loadWorkflowList() now fetches /workflows?limit=500 (server max); confirmed line 1570 of dashboard.html |
| Workflow tab loaded stale data on first activation | switchMain('workflows') calls loadWorkflowList() at line 1490 |
SQLite listWorkflows returned raw SQLite integers (not JS numbers) |
Number(r.session_count) at line 867 of sqlite.js; parity with Postgres backend |
New route lacked validateTenantOwnership |
Added at lines 539–541 of server.js; passes vacuously for workflow rows (no tenant_id field) per the comment at line 537 and the guard logic at server.js:71 |
No explicit 400 for limit > 500 |
Route now returns res.status(400) at line 532–534 before delegating to listWorkflows; confirmed by test at line 828 |
OpenAPI /workflows entry missing 400 response |
Added at line 967 of openapi.json: "400": { "description": "Invalid query parameter (limit outside 1–500 or malformed cursor)" } |
Pagination test silently skipped when next_cursor was absent |
assert.ok(page1.next_cursor, "expected next_cursor...") at line 819 — hard fail, not a skip |
Probe outcomes
Route ordering — GET /workflows (static) is registered before GET /workflows/:traceId (parameterized) in server.js (lines 530 vs 547). Express routes in declaration order; static wins correctly.
validateTenantOwnership vacuous-pass — confirmed: the function returns true for array items with no tenant_id field (line 71: !item.tenant_id || ...). The defense-in-depth call is safe and consistent with the sessions/events pattern.
Cursor helpers — decodeCursor/encodeCursor are shared from src/db/backends/_helpers.js (base64url, JSON-safe, null-on-decode-error). Both backends use the same helpers. Malformed cursor → decodeCursor returns null → cursor clause becomes ? IS NULL → full scan from start; harmless and consistent with other paginated endpoints.
Keyset pagination correctness — the OR (last_active = ? AND trace_id < ?) tiebreak in both backends ensures deterministic ordering when two workflows share the same MAX(updated_at). Confirmed in the live pagination test (two seeded traces both inserted in the same test run, limit=1 yields non-overlapping pages).
tiers.test.js in 2-dot diff — noted during review; the branch predates PR #149 (which added that file to main). The 3-dot diff (git diff origin/main...HEAD) does NOT include it. GitHub's squash merge operates on the 3-dot diff so the file survives on main after merge. Not a bug, no action required from the author.
Zero actionable items
All R1 findings addressed correctly. No new issues introduced. 246/246 tests pass, lint clean, working tree clean.
surpradhan
left a comment
There was a problem hiding this comment.
Round 2 Review — fix/workflow-list-pagination
Reviewer: independent code-review agent (R2)
Gates: 246 tests pass, lint clean.
Prior findings — all resolved
Every R1 finding has been addressed correctly:
- Dashboard limit:
limit=500inloadWorkflowList()— matches the server-side cap. No further sessions are fetched just to build the list. - Tab activation:
switchMain('workflows')now callsloadWorkflowList(), so the list is populated on first tab switch regardless of prior SSE state. - SQLite field mapping:
listWorkflowsnow returns explicit{ trace_id, session_count: Number(...), last_active }— byte-for-byte parity with Postgres. - Tenant ownership check:
validateTenantOwnershipis called onresult.workflows. The workflow rows carry notenant_idfield, soArray.prototype.everyreturnstruevacuously (line 71 of server.js:!item.tenant_id || ...). This is intentional and consistent; actual scoping is enforced by the DB query'sWHERE tenant_id = ?. The comment in the route handler explains this correctly. - Explicit 400 guard:
limit > 500checked before the DB call and returns400with a clear message. - OpenAPI 400 entry: documented on
/workflows. - Pagination test strengthened:
assert.ok(page1.next_cursor, ...)now hard-fails if fewer than 2 seeded workflows exist. - New limit=501 test: passes correctly.
New code — nothing actionable
- Route ordering is correct:
/workflows(exact match) is registered before/workflows/:traceId(param route) — Express will never confuse them. - Cursor codec (
decodeCursor/encodeCursor) is shared with all other paginated endpoints; a malformed cursor silently falls back to the first page (returns 200). The OpenAPI400description mentions "malformed cursor" — this is a minor spec overstatement, but it exactly matches the pre-existing pattern on/sessions(which doesn't document 400 at all for cursors). Not a regression; acceptable as-is. m-workflowsmetric card is set fromGET /metricsinloadAll()and is NOT refreshed bydebouncedRefreshWorkflowList()on SSE events. This is correct: the oldbuildWorkflowListwas overwriting it with a potentially-low client-side count, so removing that was the right move. The metric stays accurate from the lastloadAll()/metrics fetch. Not a regression.parseInt(limit, 10) || 50in the backend falls back to 50 forNaN(non-numeric string) and0.Math.max(1, -5)clamps negative values to 1. Both are safe and consistent with the behavior of all other paginated endpoints in the codebase.- Tenant isolation test seeds a dedicated
wl-other-tenantkey and verifies zero cross-tenant leakage. Combined with the DB-levelWHERE tenant_id = ?in both backends, isolation is solid.
What looks good
- Keyset pagination is implemented identically in both SQLite and Postgres backends — same subquery shape, same parameter binding pattern, same
N+1detect-and-pop idiom used everywhere else in the project. Number(r.session_count)coercion guards correctly against Postgres returningCOUNT(*)as a string.data-trace-idon list items letsloadWorkflow()toggle the active state with a simple DOM query scan, avoiding an unnecessary re-fetch.- Six integration tests cover shape, counts, pagination, auth, and tenant isolation — the test that matters most (pagination produces a non-overlapping next page) now hard-asserts the cursor exists.
Verdict: Approve. All R1 findings are resolved, no new issues introduced, gates are green.
… incompleteness past 500 sessions
The dashboard's workflow list was assembled client-side from /sessions?limit=500.
Any trace_id that appeared only in sessions beyond position 500 was silently absent
from the list, and setMetric('m-workflows', byTrace.size) could report a lower count
than the accurate workflow_count from GET /metrics.
Add listWorkflows(tenantId, { limit, cursor }) to both SQLite and Postgres backends
(subquery-based keyset pagination on { last_active, trace_id }; max 500 per page;
no schema change). Expose it as GET /v1/workflows with validateQueryParams and
tenant scoping.
Dashboard now fetches GET /workflows?limit=200 on initial load and on SSE events
(debounced, separate from the session-list refresh). loadWorkflow() updates the
active state via DOM data-trace-id attributes instead of re-fetching. The
setMetric('m-workflows', ...) override is removed from the workflow list — the
accurate count already comes from GET /metrics.
6 new integration tests (200, pagination, session counts, shape, auth, tenant isolation).
245 tests pass, lint clean.
…ow mapping, validateTenantOwnership
…nAPI 400 entry, hard assert in pagination test
9df5eef to
94497df
Compare
Summary
GET /sessions?limit=500. Sessions beyond position 500 were invisible, so workflows whosetrace_idonly appeared after that threshold were silently absent.setMetric('m-workflows', byTrace.size)also overwrote the accurate count fromGET /metricswith a potentially-lower client-side value.GET /v1/workflows?limit=&cursor=— a proper paginated endpoint that queries all workflows directly from the DB. Dashboard switches from client-side assembly to fetching this endpoint.Changes
src/db/backends/sqlite.jslistWorkflows(tenantId, { limit, cursor })— subquery keyset pagination on{ last_active, trace_id }src/db/backends/postgres.js$N::textcasts for Postgres type inferencesrc/db/backends/interface.jssrc/db/index.jssrc/server.jsGET /v1/workflows(read-scoped,validateQueryParams, tenant-scoped)src/openapi.jsonsrc/public/dashboard.htmlloadWorkflowList()+renderWorkflowList()replacebuildWorkflowList(sessions); active state updates viadata-trace-id;debouncedRefreshWorkflowListwired to SSE;setMetric('m-workflows', ...)override removedtests/integration/server.test.jsTest plan
npm test)npm run lint)GET /workflowsreturns{ workflows: [...], next_cursor }with correctsession_countper tracelimit=1yieldsnext_cursor; following it returns a different workflowGET /workflowson load and SSE events🤖 Generated with Claude Code