Skip to content

feat(observe): SCM poller — Observe → pr.Manager → lifecycle nudges#72

Merged
harshitsinghbhandari merged 1 commit into
stagingfrom
feat/scm-poller
Jun 1, 2026
Merged

feat(observe): SCM poller — Observe → pr.Manager → lifecycle nudges#72
harshitsinghbhandari merged 1 commit into
stagingfrom
feat/scm-poller

Conversation

@harshitsinghbhandari
Copy link
Copy Markdown
Collaborator

Summary

Wires the GitHub SCM provider (PR #69) into the daemon as a periodic poller on top of the SM/lifecycle stack (PR #70). Every 30s the loop lists alive sessions, branch-discovers each session's open PR, asks the Provider for a ports.PRObservation under a 15s per-call deadline, and hands the result to pr.Manager.ApplyObservation — which transactionally writes the PR row and forwards to lifecycle.ApplyPRObservation for CI-failure log-tail nudges, review-feedback nudges (capped at reviewMaxNudge=3), and merge-conflict rebase nudges.

After this lands, the daemon automatically observes the PR for each live session with a workspace branch and fires reactive agent nudges through the messenger.

Prerequisites (already on staging)

What changed

  • backend/internal/observe/scm/{poller.go,poller_test.go} — new package. Poller.Tick lists sessions, skips terminated/branch-less rows, resolves each open PR URL via the new BranchPRFinder, calls Provider.Observe under context.WithTimeout, and hands the result to pr.Manager.ApplyObservation. Healthy() is a sticky atomic.Bool that flips to false the first time the Provider returns ErrAuthFailed.
  • backend/internal/adapters/scm/github/{find_branch_pr.go,find_branch_pr_test.go}Provider.FindOpenPRForBranch(owner, repo, branch) over GET /repos/{owner}/{repo}/pulls?head={owner}:{branch}&state=open. Picks the most recently updated PR when multiple match (rather than failing closed). Errors classify the same as Observe.
  • backend/internal/daemon/scm_wiring.go + a small edit to daemon.go::RunstartSCM(ctx, store, projects, lcm, log) builds the Provider from env-token auth (AO_GITHUB_TOKEN preferred, falling back to GITHUB_TOKEN), constructs pr.Manager, and starts the poller alongside the reaper. Shutdown order: stop() → scmStk.Stop() → lcStack.Stop() → cdcPipe.Stop().
  • backend/internal/integration/scm_poller_test.go — boots store + LCM + pr.Manager + scm.Poller against an httptest GitHub stub, ticks once, and asserts the PR row is written, the failing check is recorded, and the lifecycle CI-failure nudge reaches the messenger (with len(msgs) == 1 to catch double-nudge regressions).

Branch-discovery fallback — and why no schema change here

domain.SessionRecord does not (yet) carry a PR URL field, so v1 resolves it from the session's workspace branch:

  1. project.Manager.Get(ctx, sess.ProjectID) → derive owner/repo from project.Repo (currently populated from RepoOriginURL).
  2. If that yields no owner/repo, shell out to git remote get-url origin against the project path (RemoteResolver is injectable for tests).
  3. Ask BranchPRFinder.FindOpenPRForBranch(owner, repo, sess.Metadata.Branch).

When the session record grows a stored PR URL field, the poller should prefer it over branch discovery — that's intentionally deferred to a session/sqlc PR so this PR stays focused on the polling loop.

Error / back-off behavior

Provider error Poller behavior
ErrRateLimited log warn, short-circuit the rest of the tick so we don't burn through remaining sessions while GitHub is asking us to back off. Next tick retries normally.
ErrAuthFailed log error, flip Healthy() to false (sticky — does NOT auto-recover, because a single subsequent success could be an ETag-cached 304 that didn't actually exercise the token). Continue with the next session.
Other log warn, continue.

A missing token (ErrNoToken) at daemon startup degrades gracefully: startSCM logs an Info notice and returns a closed done-channel, so Stop is a free call and local development without a token still works.

Out of scope (intentional)

Test plan

  • go test ./internal/observe/scm/ -race — 12 unit cases (apply, !Fetched, no-branch, no-PR, rate-limit short-circuit, auth-fail sticky, generic error, project-lookup error, per-call deadline, Start/cancel drain, repeat ticks, parseGitHubRemote table)
  • go test ./internal/adapters/scm/github/ -race -run TestFindOpenPRForBranch — 7 cases covering single/no/multi-match, empty inputs, rate-limit, auth, synth-URL
  • go test ./internal/integration/ -race -run TestSCMPollerEndToEnd — end-to-end through real pr.Manager, real lifecycle.ApplyPRObservation, real captureMessenger
  • go test ./... -race — green (the unrelated TestSessionStreamsRealZellijPane failure pre-exists on staging and requires the real zellij binary)
  • go vet ./... — clean
  • gofmt -l backend/ — clean

🤖 Generated with Claude Code

Wires the github.Provider (PR #69) into the daemon as a periodic poller
on top of the SM/lifecycle stack (PR #70). Every 30s the loop lists
alive sessions, branch-discovers each session's open PR, observes it
through the Provider under a 15s per-call deadline, and hands the
result to pr.Manager.ApplyObservation — which transactionally writes
the row and forwards to lifecycle.ApplyPRObservation for CI-failure
log-tail nudges, review-feedback nudges (capped at reviewMaxNudge=3),
and merge-conflict rebase nudges.

Branch discovery is the v1 fallback because sessions don't yet carry a
PR URL field; adding that column is a separate session/sqlc PR. Until
then the poller resolves owner/repo from project.Repo (currently
RepoOriginURL) or git remote get-url origin, then asks GitHub for the
open PR with head = owner:branch.

Error classification follows the spec:
  - ErrRateLimited: short-circuit rest of tick (don't burn through
    remaining sessions while GitHub asks us to back off)
  - ErrAuthFailed:  flip Healthy() to false (sticky — does NOT
    auto-recover, because a 304-cached success doesn't actually
    exercise the token) and continue
  - other:          log warn, continue

No-token environments degrade gracefully: startSCM logs an Info notice
and returns a closed done-channel; Stop is a free call.
@harshitsinghbhandari harshitsinghbhandari merged commit 63cf526 into staging Jun 1, 2026
6 of 7 checks passed
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

Wires a new GitHub SCM polling loop into the daemon: every 30 s the loop lists alive sessions, resolves each session's workspace branch to its open PR URL via FindOpenPRForBranch, calls Provider.Observe under a 15 s per-call deadline, and forwards the result to pr.Manager.ApplyObservationlifecycle.ApplyPRObservation for CI-failure, review-feedback, and merge-conflict nudges.

  • New files: observe/scm/poller.go — tick loop with rate-limit/auth-fail classification; adapters/scm/github/find_branch_pr.goGET /pulls?head={owner}:{branch}&state=open with most-recent picker; daemon/scm_wiring.go — token sourcing, dependency wiring, graceful shutdown.
  • Shutdown ordering in daemon.go correctly stops the SCM poller before tearing down the lifecycle and CDC pipeline; the poller's per-call pollCtx is a child of the daemon's ctx, so a context cancel during an in-flight Observe is propagated promptly.

Confidence Score: 4/5

Safe to merge for the primary github.com use case; the two noted issues are isolated edge cases (GHE enterprise fallback URL and shutdown log noise) that do not affect the main polling path.

The daemon wiring, shutdown ordering, per-call deadline, and rate-limit/auth-fail branching are all implemented correctly and are well-covered by tests. The two items worth a follow-up are: classify importing GitHub-specific error sentinels through a supposedly generic Provider interface (will silently break rate-limit short-circuiting for any future non-GitHub adapter), and the html_url-absent fallback in FindOpenPRForBranch hardcoding https://github.com/ in a code path the comment says is specifically for enterprise responses. Neither affects the happy path today.

backend/internal/observe/scm/poller.go (classify coupling) and backend/internal/adapters/scm/github/find_branch_pr.go (enterprise URL fallback)

Important Files Changed

Filename Overview
backend/internal/observe/scm/poller.go Core polling loop; classify() hard-imports scmgithub error sentinels through the generic Provider interface, and context.Canceled from mid-tick shutdown is logged at Error level rather than silently suppressed.
backend/internal/adapters/scm/github/find_branch_pr.go Adds branch-to-PR discovery; fallback URL synthesis hardcodes https://github.com/ even though the comment says it targets enterprise responses that omit html_url.
backend/internal/daemon/scm_wiring.go Clean dependency-injection wiring; token sourcing, no-token graceful degradation, and shutdown via done-channel all look correct.
backend/internal/daemon/daemon.go Minimal change: inserts startSCM call and scmStk.Stop() in the correct shutdown order (after context cancel, before lifecycle teardown).
backend/internal/integration/scm_poller_test.go Good end-to-end coverage through real pr.Manager and lifecycle; the len(msgs)==1 assertion correctly guards against double-nudge regressions.
backend/internal/observe/scm/poller_test.go Thorough unit coverage across 12 scenarios; all fakes are correctly concurrency-safe; fakeProjects.Get returns an error for not-found which aligns with project.manager.Get's real behaviour.
backend/internal/adapters/scm/github/find_branch_pr_test.go 7 test cases cover all stated behaviours; synthesised-URL test correctly uses HasSuffix to avoid hardcoding the scheme.

Sequence Diagram

sequenceDiagram
    participant D as daemon.Run
    participant P as scm.Poller (loop)
    participant B as Provider.FindOpenPRForBranch
    participant O as Provider.Observe
    participant M as pr.Manager.ApplyObservation
    participant L as lifecycle.ApplyPRObservation
    participant Msg as messenger

    D->>P: Start(ctx) → done chan
    loop every 30s
        P->>P: Tick(ctx)
        P->>P: ListAllSessions
        loop each alive session with branch
            P->>B: FindOpenPRForBranch(owner, repo, branch)
            B-->>P: prURL or ""
            alt "prURL != """
                P->>O: Observe(pollCtx[15s], prURL)
                O-->>P: PRObservation or error
                alt ErrRateLimited
                    P->>P: short-circuit remaining sessions
                else ErrAuthFailed
                    P->>P: healthy.Store(false), continue
                else success and obs.Fetched
                    P->>M: ApplyObservation(ctx, sessID, obs)
                    M->>M: write PR row + checks
                    M->>L: ApplyPRObservation(ctx, obs)
                    L->>Msg: Send(CI-failure / review / rebase nudge)
                end
            end
        end
    end
    D->>P: cancel ctx
    D->>D: scmStk.Stop() ← pollerDone
Loading

Reviews (1): Last reviewed commit: "feat(observe): SCM poller — Observe → pr..." | Re-trigger Greptile

Comment on lines +163 to +166
case <-t.C:
if err := p.Tick(ctx); err != nil {
p.logger.Error("scm poller: tick failed", "err", err)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The loop function logs a cancelled-context error at Error level every time the daemon shuts down while a Tick is in flight. When the parent ctx is cancelled mid-tick, the per-call pollCtx (derived from it) is also cancelled, Observe returns context.Canceled, classify treats it as a generic error (warn + continue), and then the ctx.Err() guard in Tick propagates context.Canceled back to loop — which logs it as Error. The result is a spurious "scm poller: tick failed: context canceled" in every clean shutdown.

Suggested change
case <-t.C:
if err := p.Tick(ctx); err != nil {
p.logger.Error("scm poller: tick failed", "err", err)
}
case <-t.C:
if err := p.Tick(ctx); err != nil && !errors.Is(err, context.Canceled) {
p.logger.Error("scm poller: tick failed", "err", err)
}

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +240 to +253
func (p *Poller) classify(sid domain.SessionID, stage string, err error) bool {
switch {
case errors.Is(err, scmgithub.ErrRateLimited):
p.logger.Warn("scm poller: rate limited, skipping rest of tick",
"session", sid, "stage", stage, "err", err)
return true
case errors.Is(err, scmgithub.ErrAuthFailed):
p.healthy.Store(false)
p.logger.Error("scm poller: auth failed, provider marked unhealthy",
"session", sid, "stage", stage, "err", err)
return false
default:
p.logger.Warn("scm poller: error",
"session", sid, "stage", stage, "err", err)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Generic Provider interface leaks GitHub-specific error sentinels

classify directly imports and matches scmgithub.ErrRateLimited / scmgithub.ErrAuthFailed even though the Provider interface is intentionally adapter-agnostic. A future GitLab or Bitbucket provider would need to return these GitHub-package sentinel values — or its rate-limit errors would silently fall through to the "generic warn, continue" branch, meaning the entire remaining session list would be polled against a throttled API on every tick, and its auth failures would never flip Healthy() to false. The standard fix is to declare var ErrRateLimited and var ErrAuthFailed in the shared ports package (or in this scm package itself) and have adapters wrap with fmt.Errorf("%w: …", ports.ErrRateLimited).

Comment on lines +76 to +78
// Construct the canonical web URL from owner/repo/number when the
// API response omits html_url (some enterprise responses elide it).
return "https://github.com/" + owner + "/" + repo + "/pull/" + strconv.Itoa(chosen.Number), nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The fallback URL synthesis hardcodes https://github.com/ but the comment above it explicitly says this path fires for "some enterprise responses" that omit html_url. On a GitHub Enterprise instance the synthesised URL would carry the wrong host — causing the PR row to be stored with a github.com URL and any subsequent Provider.Observe call on that URL to either hit the wrong host or be rejected by parsePRURL. The Provider already holds its configured base URL (e.g. https://github.mycompany.com) so the fix requires plumbing it through, or at minimum synthesising from the REST base host rather than the hardcoded public domain.

Suggested change
// Construct the canonical web URL from owner/repo/number when the
// API response omits html_url (some enterprise responses elide it).
return "https://github.com/" + owner + "/" + repo + "/pull/" + strconv.Itoa(chosen.Number), nil
// Construct the canonical web URL from owner/repo/number when the
// API response omits html_url (some enterprise responses elide it).
// TODO: plumb the Provider's web base URL here so GHE instances get
// the correct host rather than the hardcoded github.com.
return "https://github.com/" + owner + "/" + repo + "/pull/" + strconv.Itoa(chosen.Number), nil

harshitsinghbhandari added a commit that referenced this pull request Jun 1, 2026
Inherited from PR #72 merging into staging after this branch opened.
golangci-lint v2.12.2 → 0 issues.
harshitsinghbhandari added a commit that referenced this pull request Jun 1, 2026
* feat(ao): `ao spawn` CLI + POST /api/v1/sessions route

* fix(ao): address PR γ review + clear inherited lint debt

Review fixes (PR #71):
- spawn CLI now uses a dedicated 90 s timeout (90 s > server's 60 s
  DefaultRequestTimeout) via context.WithTimeout, and stops sharing
  deps.HTTPClient — that client is sized for fast /healthz/shutdown
  probes (2 s) and was preempting the synchronous Spawn long before the
  daemon could finish provisioning a worktree + zellij pane + agent.
- Harden writeSpawnError so a *project.Error with a non-client Kind
  ("internal", "not_implemented", or anything unknown) falls through to
  the generic 500 SPAWN_FAILED envelope instead of passing the project
  error's Code/Message verbatim to the client. Adds three subtests that
  pin down the opacity contract.

Lint debt cleared (inherited from PRs #65/#70):
- Add doc comments on every exported symbol in the agent / claudecode /
  codex / adapters-registry packages (revive: exported)
- gosec G306/G301: inbox file/dir perms 0644→0600 and 0755→0750
- gosec G703 (path traversal via taint): excluded globally with the same
  rationale as G304 — adapter paths are daemon-config/worktree-derived,
  not user input
- gocritic emptyStringTest: len(strings.TrimSpace(...)) > 0 → != ""
- gocritic paramTypeCombine: combine adjacent same-type params
- errcheck: wrap deferred os.Remove(tmpName) in a closure
- prealloc: preallocate cmd slices on the resume paths

* fix(lint): lift loop condition in scm poller test (staticcheck QF1006)

Inherited from PR #72 merging into staging after this branch opened.
golangci-lint v2.12.2 → 0 issues.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant