Skip to content

fix(verify): one-shot consumption of email confirmation tokens#281

Merged
umputun merged 1 commit into
masterfrom
fix/verify-replay
May 9, 2026
Merged

fix(verify): one-shot consumption of email confirmation tokens#281
umputun merged 1 commit into
masterfrom
fix/verify-replay

Conversation

@paskal
Copy link
Copy Markdown
Collaborator

@paskal paskal commented May 8, 2026

Summary

The email confirmation JWT issued by VerifyHandler.sendConfirmation was protected only by its 30-minute expiry: any party who could read the confirmation link (forwarded email, mail-gateway logs, mailbox archive) could redeem it independently within the window, creating a separate authenticated session for the same address.

Change

Add a pluggable VerifConfirmationStore interface and an in-memory default. On first redemption the token is recorded as consumed; replays are rejected with 403 confirmation token already consumed.

type VerifConfirmationStore interface {
    // MUST retain the marker for at least the supplied ttl, or return a
    // non-nil err if it cannot. Callers MUST treat a non-nil err as
    // fail-closed (reject the redemption) to avoid replay during outages.
    MarkUsed(key string, ttl time.Duration) (alreadyUsed bool, err error)
}

Key design choices:

  • (bool, error) signature -- a Redis/DB adapter must be able to signal a backend outage so callers can fail closed instead of either fail-open (allow replay) or always-reject (lock users out). The TTL contract is strict: dropping a marker before its ttl while the JWT is still valid reopens the replay window the store is meant to close.
  • Key = SHA-256(rawToken), not a jti claim. Existing tokens issued before this fix carry no jti, but they are still de-dup'd correctly because the signed token bytes are unique per issuance. No wire-format change.
  • VerifConfirmationStoreFunc adapter for closure-based config, mirroring the SenderFunc / token.AllowedHostsFunc house pattern. Includes a typed-nil guard at the call site (a non-nil interface wrapping a nil func is treated as no store configured, matching AllowedHostsFunc).
  • In-memory default uses amortized eviction (inMemoryVerifStoreSweepEvery=256) instead of walking the map under the mutex on every redemption -- the lookup-then-insert pattern keeps correctness; sweeping every Nth insert keeps the lock window short.
  • Pluggable via Opts.VerifConfirmationStore. Multi-instance deployments behind a load balancer MUST supply a shared backend (Redis or any shared KV) -- otherwise instance A's "used" set is unknown to instance B and the protection silently fails. The README change makes this an explicit must-do, not a recommendation.
  • Single-init via sync.Once in AddVerifProvider: precedence is Opts.VerifConfirmationStore if non-nil, else provider.NewInMemoryVerifStore() default. Subsequent AddVerifProvider calls reuse the same store.
  • Token scrubbed on fail-closed branches. The new markErr != nil and alreadyUsed SendErrorJSON sites use a scrubTokenFromRequest(r) helper to set ?token=<redacted> before logging. The fail-closed branch is materially worse than the existing parse/expired branches because the store did NOT record consumption -- the JWT in the URL is still live.

Both auth.go and v2/auth.go get the same wiring.

Test

  • TestInMemoryVerifStore -- direct unit coverage:
    • first vs second MarkUsed, expiry, key independence
    • concurrent same-key redemption -- 50 goroutines redeem the same key; exactly one observes alreadyUsed=false. Locks the atomicity contract that the mutex ordering provides.
    • amortized sweep evicts expired entries -- lowers the sweep threshold to 4 via the test-friendly var, inserts 3 nanosecond-TTL keys, sleeps to expire them, asserts the 4th insert triggers the sweep and prunes them.
  • TestVerifConfirmationStoreFunc -- adapter calls f and surfaces both bool and err.
  • TestVerifyHandler_LoginAcceptConfirm_RejectsReplay -- integration: same testConfirmedToken redeemed twice; first 200, second 403 with "already" in body.
  • TestVerifyHandler_LoginAcceptConfirm_TypedNilStoreFuncDoesNotPanic -- var nilFn VerifConfirmationStoreFunc; ConfirmationStore: nilFn survives the != nil check; the handler must treat it as no store, not panic.
  • TestVerifyHandler_LoginAcceptConfirm_FailClosedOnStoreError -- VerifConfirmationStoreFunc(func(...) (bool, error) { return false, fmt.Errorf("redis down") }) triggers the security-critical fail-closed branch; asserts 403 + "store unavailable".
  • TestService_AddVerifProvider_UsesCustomConfirmationStore -- Opts.VerifConfirmationStore injection takes precedence over the default.
  • TestService_AddVerifProvider_DefaultsToInMemory -- default install when not configured; sync.Once ensures the same store is reused across multiple AddVerifProvider calls.

Same suite added in v1 and v2. Full go test -race ./... green in both modules; golangci-lint run --new-from-rev=master 0 issues.

Multi-instance migration note

Existing single-instance callers transparently get replay rejection on upgrade. LB-fronted callers must opt into a shared store via Opts.VerifConfirmationStore (e.g. Redis), or replay rejection will be inconsistent across instances. The README change in this PR makes this explicit and adds a sample Redis adapter showing the (bool, error) contract.

@paskal paskal requested a review from umputun as a code owner May 8, 2026 17:41
@coveralls
Copy link
Copy Markdown

coveralls commented May 8, 2026

Coverage Report for CI Build 25600756364

Coverage increased (+0.2%) to 85.165%

Details

  • Coverage increased (+0.2%) from the base build.
  • Patch coverage: 2 uncovered changes across 1 file (67 of 69 lines covered, 97.1%).
  • No coverage regressions found.

Uncovered Changes

File Changed Covered %
v2/provider/verify.go 62 60 96.77%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Coverage Status
Relevant Lines: 3485
Covered Lines: 2968
Line Coverage: 85.16%
Coverage Strength: 7.97 hits per line

💛 - Coveralls

@paskal paskal force-pushed the fix/verify-replay branch 2 times, most recently from 3a2a2d6 to 0e39694 Compare May 8, 2026 18:16
Copy link
Copy Markdown
Member

@umputun umputun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a few concerns, the first two worth resolving before merge:

  1. MarkUsed interface should return (bool, error) before this ships. Right now any future Redis/DB adapter has to choose between fail-open (return false on backend timeout, allow replay) and fail-closed (return true, lock users out during outages). Both choices are wrong, and once VerifConfirmationStore is exported with the bool-only signature, changing it later is a breaking change. Suggest:

    type VerifConfirmationStore interface {
        MarkUsed(key string, ttl time.Duration) (alreadyUsed bool, err error)
    }

    LoginHandler then fails closed on non-nil err with a logged warning. The in-memory default just returns nil. (provider/verify.go:53-59, v2/provider/verify.go:65-70)

  2. (*inMemoryVerifStore).MarkUsed is O(n) under the only mutex. Every redemption walks the entire used map for expired-entry eviction, holding sync.Mutex the whole time. At sustained traffic the map grows toward requests/sec × 30min and each handler call serializes on a full-map scan. Cheap fixes: gate the sweep on len(used) > N or on a lastSweep clock, or skip the loop entirely on the hot path (correctness is preserved by the lookup-then-insert pattern, the cap is just requests within TTL window). (provider/verify.go:74-86, v2/provider/verify.go:84-98)

design points worth addressing (suggested resolutions inline):

  1. Default fail-open behind a load balancer is the right default, but the README warning needs to be sharper. Keep the in-memory store as the default. The library's deployment profile is mostly single-instance services that get free protection. The problem is that "instance A consumed, instance B accepts" is a silent failure with no log line, and the current README reads as a recommendation rather than a hard requirement. Suggest replacing the warning with something like:

    Multi-instance deployments behind a load balancer must supply a shared backend. With the default in-memory store, replay protection works only on the instance that originally consumed the token; an attacker who hits any other instance can still replay within the TTL. This is silent: the request succeeds and the auth flow completes normally, with no log indicating the protection was bypassed.

  2. MarkUsed runs before the handshake-format check, avatar fetch, and TokenService.Set. That's the right point for security (atomic check-and-mark before side effects), but it means a transient downstream failure (avatar saver flapping, token store hiccup) burns a valid token and the user gets 403 on retry. Suggest keeping the current ordering. Magic-link flows let the user re-request a fresh email link, which is preferable to a non-atomic consume-then-rollback design. Worth adding a one-line note in the godoc on LoginHandler or in the README saying "consumption is final; transient errors require requesting a new confirmation email." (provider/verify.go:147-203)

minor:

  1. The integration replay test only fires N=2 sequentially. The atomicity claim of MarkUsed rests on the mutex serializing concurrent same-key calls, but no test exercises that. A t.Parallel/errgroup test (50 goroutines redeem the same token, exactly one returns false) locks in the contract cheaply.

  2. House style for similar policy hooks (token.AllowedHosts, token.Audience, token.Validator) provides a *Func adapter for closure-based config. Consider adding VerifConfirmationStoreFunc for symmetry, or note that the omission is intentional.

umputun pushed a commit that referenced this pull request May 8, 2026
The verify provider sends one-shot magic-link tokens by email; the
Email sender's debug log dumped the full body verbatim:

    [DEBUG] send "Confirmation token: <jwt>" to victim@example.com

Anyone with log access (centralized logging, crash bundles, support
tools, mail-gateway-adjacent observability) could redeem that token
within its TTL — independently of the legitimate recipient. The
verify-replay PR (#281) limits reuse to one consumption, but the
log-reader can still race the user for that one consumption.

Log only the recipient and the body length:

    [DEBUG] send 142-byte message to victim@example.com

Same fix in v1 (provider/sender/email.go:84) and v2
(v2/provider/sender/email.go:84), single PR.

Test: TestEmail_SendDoesNotLogBody captures logger output, sends a
known-secret body to a non-existent SMTP host, and asserts the body
substring is absent while the recipient is present. Added in both
modules.
@paskal paskal force-pushed the fix/verify-replay branch 4 times, most recently from a3b60da to d903497 Compare May 9, 2026 00:31
@paskal
Copy link
Copy Markdown
Collaborator Author

paskal commented May 9, 2026

addressed all blocking and the in-scope non-blocking points in the last push:

  1. MarkUsed interface now returns (alreadyUsed bool, err error). LoginHandler fails closed on non-nil err with a [WARN] log; in-memory default returns nil err. README has the new signature in the Redis adapter example.

  2. (*inMemoryVerifStore).MarkUsed now uses amortized eviction -- inMemoryVerifStoreSweepEvery=256 insert counter, walks the map only every Nth call. lookup-then-insert preserves correctness; staleness bound is N inserts between sweeps.

  3. README warning rewritten with the silent-failure callout: explicit "the request succeeds and the auth flow completes normally, with no log indicating the protection was bypassed" line, multi-instance requirement marked must.

  4. godoc on LoginHandler now states consumption is final and that transient downstream failures burn the token (users must re-request a new confirmation email).

  5. concurrent same-key test added to TestInMemoryVerifStore -- 50 goroutines redeem the same token, exactly one observes alreadyUsed=false. uses sync.WaitGroup + atomic counter.

  6. added VerifConfirmationStoreFunc adapter (provider/verify.go, v2/provider/verify.go) for closure-based config, mirroring SenderFunc / token.AllowedHostsFunc. test in both modules verifies adapter calls f and surfaces both bool and err.

go fix + race + lint clean in both modules.

@paskal
Copy link
Copy Markdown
Collaborator Author

paskal commented May 9, 2026

no extra changes this round — the previous batch (1-6 from the original review) covered everything in scope. CI is the only thing left to wait on.

Copy link
Copy Markdown
Member

@umputun umputun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all 6 round-1 points addressed cleanly, thx. Round-2 introduced a few new things that need a look, plus two I missed in round 1.

new in round 2:

  1. VerifConfirmationStoreFunc is missing a typed-nil guard. Opts{VerifConfirmationStore: provider.VerifConfirmationStoreFunc(nil)} produces a non-nil interface that survives the if e.ConfirmationStore != nil check at auth.go:435 and panics at MarkUsed on the first redemption. The mirror token.AllowedHostsFunc does have this guard (see v2/provider/redirect.go:35). Same in provider/verify.go:65-67 and v2.

  2. fail-closed branch has no test. The (bool, error) signature was added precisely so a backend outage rejects rather than fails open, and the godoc at provider/verify.go:55-62 makes it a security guarantee, but no test exercises markErr != nil → 403. A VerifConfirmationStoreFunc(func(...) (bool, error) { return false, errors.New("boom") }) test that asserts 403 + body would lock the contract cheaply. (provider/verify.go:190-194, v2 mirror)

  3. amortized sweep is dark code. inMemoryVerifStoreSweepEvery=256 plus the eviction branch at provider/verify.go:109-115 are not exercised by any test. Cheap fix: convert the const to a private var so a test can lower it to e.g. 4 and verify a sweep actually runs and prunes expired entries.

  4. TTL contract godoc allows non-conforming early eviction. provider/verify.go:55-62 says TTL is upper bound and "implementations may evict earlier under memory pressure". A conforming Redis adapter under memory pressure could drop the consumed marker before the JWT expires, reopening the replay window in exactly the multi-instance case the PR is supposed to harden against. Tighten the contract to "must retain markers for at least the supplied TTL, or return an error if it cannot".

things I missed in round 1:

  1. verifConfirmStoreO field name. trailing-O for sync.Once is non-idiomatic, house style is once (see avatar/localfs.go:19, v2/token/jwt.go:91). auth.go:39, v2/auth.go:39.

  2. token logged via r.URL.String() on the new fail-closed branch. rest.SendErrorJSON logs the full URL including ?token=<jwt>. The handler's parse/expired branches already do this, so no regression in baseline behavior, but the new fail-closed branch is materially worse: when a Redis outage triggers it the token is still live (store didn't record consumption), so a single log line equals an unredeemed magic link. Worth either hashing the token in the log message on the two new branches, or scrubbing token= from the URL before passing the request to SendErrorJSON. (provider/verify.go:189-193, v2 mirror)

coveralls -0.08% is patch-coverage on the new fail-closed + sweep branches, which is exactly what items 2 and 3 above would fix.

@paskal paskal force-pushed the fix/verify-replay branch from d903497 to 2e0f421 Compare May 9, 2026 11:38
The email confirmation JWT issued by VerifyHandler.sendConfirmation was
only protected by its 30-minute expiry: any party who could read the
confirmation link (forwarded email, mail-gateway logs, mailbox archive)
could redeem it independently within the window, creating a separate
authenticated session for the same address.

Add a VerifConfirmationStore interface (MarkUsed key, ttl -> bool) and a
default in-memory implementation (sync.Mutex-protected map keyed by
SHA-256 of the raw token). When VerifyHandler.ConfirmationStore is
non-nil, LoginHandler records the token as consumed on first redemption
and rejects replays with 403 "confirmation token already consumed".

The store key is the SHA-256 of the raw token rather than a jti claim,
so existing token fixtures (which carry no jti) are still de-dup'd
correctly without changing the wire format.

Wired through Service.AddVerifProvider in both auth.go and v2/auth.go.
The store is selected once via sync.Once, with this precedence:

  1. Opts.VerifConfirmationStore if non-nil (caller-supplied; required
     for multi-instance deployments — Redis or any shared KV).
  2. provider.NewInMemoryVerifStore() default — fine for single
     instance, broken for LB-fronted multi-instance deployments where
     instance A's "used" set is unknown to instance B. Documented in
     README under "Confirmation token replay protection".

Tests:
* TestInMemoryVerifStore — store-level coverage (replay, expiry, key
  independence).
* TestVerifyHandler_LoginAcceptConfirm_RejectsReplay — integration:
  same token redeemed twice -> 200 then 403.
* TestService_AddVerifProvider_UsesCustomConfirmationStore — Opts
  injection works.
* TestService_AddVerifProvider_DefaultsToInMemory — default install
  + sync.Once reuse on subsequent AddVerifProvider calls.
@paskal paskal force-pushed the fix/verify-replay branch from 2e0f421 to c7e9b50 Compare May 9, 2026 12:08
@paskal
Copy link
Copy Markdown
Collaborator Author

paskal commented May 9, 2026

addressed all 6 round-2 items in the last push:

  1. typed-nil VerifConfirmationStoreFunc guard -- LoginHandler now mirrors the token.AllowedHostsFunc pattern: type-asserts the store to VerifConfirmationStoreFunc, treats a typed-nil func as if no store were configured. New test TestVerifyHandler_LoginAcceptConfirm_TypedNilStoreFuncDoesNotPanic wires var nilFn VerifConfirmationStoreFunc; ConfirmationStore: nilFn, drives the redemption, asserts no panic + 200 (legacy fall-through). v1 + v2.

  2. fail-closed branch test -- TestVerifyHandler_LoginAcceptConfirm_FailClosedOnStoreError uses VerifConfirmationStoreFunc(func(...) (bool, error) { return false, fmt.Errorf("redis down") }) and asserts 403 + body contains "store unavailable". v1 + v2.

  3. sweep is now testable -- inMemoryVerifStoreSweepEvery converted from const to var. New subtest TestInMemoryVerifStore/amortized sweep evicts expired entries lowers it to 4, inserts 3 nanosecond-TTL keys, sleeps to expire them, sanity-checks the map still holds 3, then the 4th insert triggers the sweep and the map is asserted to hold only the fresh entry. v1 + v2.

  4. TTL contract tightened -- godoc on VerifConfirmationStore.MarkUsed rewritten:

    "The implementation MUST retain the marker for at least the supplied ttl, or return a non-nil err if it cannot -- dropping a marker before its ttl while the underlying JWT is still valid reopens the replay window the store is meant to close."
    No more "may evict earlier under memory pressure" wording. v1 + v2.

  5. verifConfirmStoreOverifConfirmStoreOnce rename. v1 + v2 auth.go.

  6. token scrubbed on the new fail-closed branches -- new scrubTokenFromRequest(r) helper returns a shallow r.Clone with ?token= set to <redacted>, used at both the markErr != nil and the alreadyUsed SendErrorJSON sites. The existing parse/expired branches keep logging the URL as-is per your "no regression in baseline behavior" note (those tokens are invalid/expired). The fail-closed-with-live-token leak is closed. v1 + v2.

Verified the scrub on v1 fail-closed test output:

confirmation token store unavailable - redis down - 403 -  - /login?token=<redacted>

go fix + race + lint clean in both modules.

@paskal paskal requested a review from umputun May 9, 2026 12:12
Copy link
Copy Markdown
Member

@umputun umputun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all 6 round-2 items verified clean: typed-nil func guard at the handler boundary, fail-closed branch test, sweep-testable via var, TTL godoc tightened, store-name rename, scrub helper applied only on the live-token branches via r.Clone. Tests + race + lint green in both modules.

one issue worth a follow-up (not blocking, marking approve so this can land):

typed-nil VerifConfirmationStoreFunc in Opts silently disables the default in-memory store. The handler-level guard added in round 2 (provider/verify.go:217) correctly normalizes a typed-nil func to no-store, but the Service-level check at auth.go:449-454 (and v2/auth.go:450-455) is still a plain s.opts.VerifConfirmationStore != nil test. A typed-nil VerifConfirmationStoreFunc is a non-nil interface wrapping a nil func, so it survives that check, provider.NewInMemoryVerifStore() is skipped, and at redemption the handler's typed-nil guard normalizes the func to nil. Net result: a user who writes Opts{VerifConfirmationStore: VerifConfirmationStoreFunc(nil)} gets neither their func nor the default store, and replay protection is silently disabled for that exact configuration.

Same shape as the *avatar.Proxy typed-nil in #286 just with a different consequence (silent loss of protection vs panic). The fix is the same as you used in LoginHandler, applied one layer up:

s.verifConfirmStoreOnce.Do(func() {
    store := s.opts.VerifConfirmationStore
    if fn, ok := store.(provider.VerifConfirmationStoreFunc); ok && fn == nil {
        store = nil
    }
    if store != nil {
        s.verifConfirmStore = store
        return
    }
    s.verifConfirmStore = provider.NewInMemoryVerifStore()
})

Worth adding TestService_AddVerifProvider_TypedNilStoreFuncFallsBackToDefault symmetric to the existing handler-level typed-nil test, in both v1 and v2.

minor follow-ups:

  1. The verifConfirmStoreOverifConfirmStoreOnce rename made the field longer than the column other fields are aligned to in Service. gofmt -d auth.go v2/auth.go reports a diff. CI doesn't enforce gofmt so this is cosmetic, but a one-line gofmt -w pass before merge avoids a noisy diff in the next contributor's IDE-on-save.

  2. scrubTokenFromRequest's defensive early-return at provider/verify.go:148-149 (r == nil || r.URL == nil || token == "") is the 2 uncovered lines coveralls is flagging. They're dead code along the only invocation path. Either drop the early-return (function is private, contract is "callers always pass a request with token") or add a 5-line direct unit test. Defensive code is fine to keep, just worth noting why coverage isn't 100%.

  3. adapter-author guidance for MarkUsed errors: a poorly-implemented Redis/DB adapter that constructs fmt.Errorf("SET %s ...: %w", key, err) would surface the SHA-256 hash through the err log. Not a defect in the in-tree implementation. Worth a one-line note in the VerifConfirmationStore interface godoc telling adapter authors not to embed key (or any caller-supplied data) in returned errors.

nothing blocking. The remaining round-2 work is solid.

@umputun umputun merged commit 1861958 into master May 9, 2026
9 checks passed
@umputun umputun deleted the fix/verify-replay branch May 9, 2026 16:51
umputun pushed a commit that referenced this pull request May 10, 2026
…dance

Followups to #281 (verify replay protection) raised on the post-merge
review. None blocking, all small.

1. Service-level typed-nil VerifConfirmationStoreFunc guard. The
   handler-level guard added in round 2 normalizes a typed-nil func to
   nil, but the AddVerifProvider check at auth.go was still a plain
   `s.opts.VerifConfirmationStore != nil` test. A typed-nil
   VerifConfirmationStoreFunc is a non-nil interface wrapping a nil
   func, so it survived that check, the in-memory default was skipped,
   and at redemption the handler's typed-nil guard normalized it to
   nil — net result: a user who wrote
   `Opts{VerifConfirmationStore: VerifConfirmationStoreFunc(nil)}`
   got neither their func nor the default, and replay protection was
   silently disabled for that exact configuration.

   Same shape as the *avatar.Proxy typed-nil case fixed in #286 with a
   different consequence (silent loss of protection vs panic). Apply
   the same shape of guard one layer up. New regression test
   TestService_AddVerifProvider_TypedNilStoreFuncFallsBackToDefault in
   both v1 and v2.

2. gofmt -w on auth.go / v2/auth.go. The verifConfirmStoreO ->
   verifConfirmStoreOnce rename in #281 made the field longer than the
   surrounding column alignment. Cosmetic; CI doesn't enforce gofmt
   but a noisy IDE-on-save diff for the next contributor.

3. scrubTokenFromRequest unit test (TestScrubTokenFromRequest) in both
   v1 and v2 — covers the defensive early-return that coveralls flagged
   as uncovered after #281 (token-missing returns r unchanged, nil r
   returns nil, token-present returns redacted clone with other query
   params preserved).

4. Adapter-author guidance on VerifConfirmationStore.MarkUsed godoc:
   tells external Redis/DB adapter authors not to embed the supplied
   key in returned errors, since the handler logs err on the
   fail-closed branch and the key is the SHA-256 of a still-live JWT.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants