41 changes: 31 additions & 10 deletions CLAUDE.md
Original file line number Diff line number Diff line change
@@ -15,19 +15,40 @@ Recipes live in the `Justfile` (`just --list`); the bare `just` runs the full

The CI `DB_DSN` format: `postgresql+asyncpg://postgres:postgres@localhost:5432/postgres`

## Architecture

The package (`db_retry/`) exposes five public symbols via `__init__.py`:

- **`postgres_retry`** (`retry.py`) — async tenacity decorator that retries on `asyncpg.SerializationError` (40001) and `asyncpg.PostgresConnectionError` (08000/08003). Walks the outer `__cause__`/`__context__` chain to find any `DBAPIError`, then inspects `DBAPIError.orig.__cause__` to distinguish retriable errors from others like `StatementCompletionUnknownError` (40002). The chain walk lets retries fire when the `DBAPIError` is re-raised by a wrapper (e.g. advanced-alchemy's `wrap_sqlalchemy_exception()` surfacing it as `RepositoryError`/`IntegrityError`). Supports bare `@postgres_retry` (uses default) and `@postgres_retry(retries=N)` for per-callsite override.
## Workflow

- **`build_connection_factory`** (`connections.py`) — returns an async callable suitable for SQLAlchemy's `async_engine_from_config`. Handles multi-host DSNs by randomizing host order (load balancing) and attempting all hosts on timeout before raising `TargetServerAttributeNotMatched`.
Planning follows [`planning/README.md`](planning/README.md) — its **Quick path**
is the authoritative convention for making a change (choose a lane, create a
bundle under `planning/changes/`, ship the `architecture/` promotion in the same
PR). Run `just check-planning` (also wired into `just lint-ci`) before pushing.

- **`build_db_dsn`** / **`is_dsn_multihost`** (`dsn.py`) — parse and construct `sqlalchemy.URL` objects. Multi-host DSNs encode additional hosts in query parameters. Existing `target_session_attrs` in the DSN is preserved (not overwritten).

- **`Transaction`** (`transaction.py`) — frozen dataclass context manager wrapping `AsyncSession`. Supports optional isolation level (e.g., `"SERIALIZABLE"`). Auto-rolls back on `__aexit__` if the session is still in a transaction (i.e. no explicit `.commit()` or `.rollback()` was called). Uses `typing.Self` (no `typing_extensions` dependency).
## Architecture

- **`settings.py`** — exposes `get_retries_number()` which reads `DB_RETRY_RETRIES_NUMBER` env var at call time (default: 3), allowing `monkeypatch.setenv` to work in tests.
> Quick orientation only. The authoritative, code-current account of each
> capability lives in [`architecture/`](architecture/) — one file per
> capability. **When a change alters a capability's behavior, update the matching
> `architecture/<capability>.md` in the same PR** — that promotion is what keeps
> `architecture/` true; code that changes without it silently rots the truth home.

The package (`db_retry/`) exposes five public symbols via `__init__.py`. Read
the matching capability file before changing behavior:

| Symbol(s) | Source | Capability file |
|---|---|---|
| `postgres_retry` | `retry.py` | [architecture/retry.md](architecture/retry.md) |
| `build_connection_factory` | `connections.py` | [architecture/connections.md](architecture/connections.md) |
| `build_db_dsn`, `is_dsn_multihost` | `dsn.py` | [architecture/dsn.md](architecture/dsn.md) |
| `Transaction` | `transaction.py` | [architecture/transaction.md](architecture/transaction.md) |
| `get_retries_number` | `settings.py` | [architecture/settings.md](architecture/settings.md) |

- **`postgres_retry`** is an async tenacity decorator that retries on
`asyncpg.SerializationError` (40001) and `asyncpg.PostgresConnectionError`
(08xxx), walking the `__cause__`/`__context__` chain to find a retriable
`DBAPIError` even when re-wrapped. Bare `@postgres_retry` or
`@postgres_retry(retries=N)`.
- **`build_connection_factory`** returns an async creator for
`async_engine_from_config`, load-balancing and failing over across multi-host
DSNs before raising `TargetServerAttributeNotMatched`.

## Linting / Type Checking

7 changes: 7 additions & 0 deletions Justfile
@@ -27,6 +27,13 @@ lint-ci:
    uv run ruff format --check
    uv run ruff check --no-fix
    uv run ty check
    uv run python planning/index.py --check

index:
    uv run python planning/index.py

check-planning:
    uv run python planning/index.py --check

publish:
    rm -rf dist
26 changes: 26 additions & 0 deletions architecture/README.md
@@ -0,0 +1,26 @@
# Architecture

The living truth about what `db-retry` does **now** — one file per capability,
updated by hand whenever a change ships. The *why* and *how it got here* live in
[`../planning/changes/`](../planning/changes/) — and decisions deliberately taken,
including options rejected, in [`../planning/decisions/`](../planning/decisions/);
this directory is the present.

These files carry **no frontmatter** — they are prose, dated by git.

## Capabilities

- [retry.md](retry.md) — `postgres_retry`, the async tenacity decorator and its
cause-chain retry predicate.
- [connections.md](connections.md) — `build_connection_factory`, multi-host
load balancing and failover.
- [dsn.md](dsn.md) — `build_db_dsn` / `is_dsn_multihost`, DSN parsing and
construction.
- [transaction.md](transaction.md) — `Transaction`, the session context manager.
- [settings.md](settings.md) — `get_retries_number`, the env-driven retry count.

## Promotion rule

Shipping a change hand-edits the affected capability file(s) here to match the
new reality, in the same PR as the code. The change bundle stays in place under
[`../planning/changes/`](../planning/changes/) — no folder move.
52 changes: 52 additions & 0 deletions architecture/connections.md
@@ -0,0 +1,52 @@
# Connections

`build_connection_factory` (in `db_retry/connections.py`) returns an async
callable `() -> asyncpg.Connection` suitable for SQLAlchemy's
`async_engine_from_config` (the `async_creator` hook). Its job is to connect to
PostgreSQL across a multi-host DSN with load balancing and per-host failover.

## Signature

```python
build_connection_factory(url: sqlalchemy.URL, timeout: float)
    -> Callable[[], Awaitable[asyncpg.Connection]]
```

The `url` is translated into asyncpg connect args **once**, at factory-build
time, via `PGDialect_asyncpg().create_connect_args(url)`. `target_session_attrs`
is popped from those args and, if present, wrapped in asyncpg's
`SessionAttribute` enum — so a `target_session_attrs` carried on the DSN (e.g.
`read-write`/`prefer-standby` set by [`build_db_dsn`](dsn.md)) **is honored**,
not discarded.

## Host handling

`host` and `port` are popped from the connect args:

- **Multi-host** (both are lists): they are zipped into `(host, port)` pairs
(`strict=True` — lengths must match), the pair list is `random.shuffle`d for
load balancing, then split back into parallel `hosts`/`ports` lists. The
shuffled pair list is also retained for the failover path.
- **Single-host** (scalars): used as-is; the retained pair list is empty.

## Connect and failover

The returned `_connection_factory`:

1. Attempts one `asyncpg.connect(...)` against the full host/port set with the
given `timeout` and `target_session_attrs`. On success, returns immediately.
2. On `TimeoutError`, if there is no multi-host pair list, it re-raises. With a
pair list, it logs a warning and falls through to host-by-host probing.
3. Re-shuffles a copy of the pairs and tries each `(host, port)` individually,
swallowing `TimeoutError`, `OSError`, and `asyncpg.TargetServerAttributeNotMatched`
(logging a warning per failed host) and returning the first that connects.
4. If every host fails, raises `asyncpg.TargetServerAttributeNotMatched` naming
the unmet `target_session_attrs`.

The two-stage design lets the fast path use asyncpg's own multi-host attempt,
and only pays the per-host cost when the bulk attempt times out — typically when
`target_session_attrs` (e.g. `read-write`) excludes some hosts.
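
The two-stage flow above can be sketched without asyncpg by injecting the connect call. Everything named here is an assumption for illustration: `connect_with_failover`, the `connect` callable, and `NoMatchingHost` (a stand-in for `asyncpg.TargetServerAttributeNotMatched`). The sketch also simplifies by treating any target list longer than one as multi-host, and it omits the warning logs.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable

Target = tuple[str, int]


class NoMatchingHost(Exception):
    """Hypothetical stand-in for asyncpg.TargetServerAttributeNotMatched."""


async def connect_with_failover(
    connect: Callable[[list[Target]], Awaitable[str]],
    targets: list[Target],
) -> str:
    try:
        # Stage 1: a single bulk attempt against the full host/port set.
        return await connect(targets)
    except TimeoutError:
        if len(targets) < 2:
            raise  # single host: there is nothing to fail over to

    # Stage 2: probe each host on its own, in a freshly shuffled order.
    candidates = list(targets)
    random.shuffle(candidates)
    for target in candidates:
        try:
            return await connect([target])
        except (TimeoutError, OSError, NoMatchingHost):
            continue  # the real factory logs a warning per failed host
    raise NoMatchingHost("no host satisfied the requested session attributes")


async def fake_connect(targets: list[Target]) -> str:
    """Test double: the bulk attempt times out, and only 'primary' accepts."""
    if len(targets) > 1:
        raise TimeoutError
    host, _ = targets[0]
    if host != "primary":
        raise NoMatchingHost(host)
    return host
```

With `fake_connect`, `asyncio.run(connect_with_failover(fake_connect, [("replica", 5432), ("primary", 5432)]))` falls through stage 1 and returns from whichever probe reaches `primary`.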

## Related

- [dsn.md](dsn.md) — how multi-host DSNs and `target_session_attrs` are encoded.
47 changes: 47 additions & 0 deletions architecture/dsn.md
@@ -0,0 +1,47 @@
# DSN

`db_retry/dsn.py` parses and constructs `sqlalchemy.URL` objects for
multi-host PostgreSQL DSNs. Two public functions.

## `build_db_dsn`

```python
build_db_dsn(
    db_dsn: str,
    database_name: str,
    use_replica: bool = False,
    drivername: str = "postgresql",
) -> sqlalchemy.URL
```

Takes a stored DSN and returns a new `URL` with three things replaced:

- **`database`** ← `database_name`. The stored DSN carries a placeholder
database (the maintained format is
`postgresql://login:password@/db_placeholder?host=host1&host=host2` — empty
host in the authority, real hosts in repeated `host` query params, per
SQLAlchemy's [multiple-fallback-hosts](https://docs.sqlalchemy.org/en/20/dialects/postgresql.html#specifying-multiple-fallback-hosts)
form); the real service database name is substituted here.
- **`drivername`** ← `drivername` (default `postgresql`; callers pass
`postgresql+asyncpg` to get the async dialect).
- **`target_session_attrs`** ← `prefer-standby` when `use_replica` else
`read-write`. This is a dict union (`existing_query | {"target_session_attrs": …}`),
so the computed value **overwrites** any `target_session_attrs` already on the
DSN — `use_replica` is authoritative. (The other existing query params are
preserved.) Note: honoring a *pre-existing* `target_session_attrs` happens
downstream in [`build_connection_factory`](connections.md), not here.
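
The overwrite behavior follows from how Python's dict union resolves key collisions: the right-hand operand wins. A minimal sketch on plain dicts, where `merge_query` is a hypothetical name and the real function operates on a `sqlalchemy.URL` query rather than a bare dict:

```python
def merge_query(existing_query: dict[str, object], use_replica: bool) -> dict[str, object]:
    """Hypothetical helper showing the union semantics described above."""
    target = "prefer-standby" if use_replica else "read-write"
    # On a key collision the right-hand dict wins, so the computed value
    # replaces any target_session_attrs already present; other keys survive.
    return existing_query | {"target_session_attrs": target}


merged = merge_query(
    {"host": ("host1", "host2"), "target_session_attrs": "any"},
    use_replica=False,
)
```

Here `merged` keeps the `host` entry untouched while `target_session_attrs` becomes `read-write`, regardless of the `any` that was stored.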

## `is_dsn_multihost`

```python
is_dsn_multihost(db_dsn: str) -> bool
```

`True` when the DSN's `host` query param is a tuple of length > 1 — i.e. the
multiple-fallback-hosts form with at least two hosts. A single `host` param or a
host in the authority (`@host/db`) is **not** multi-host.
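
The three cases can be approximated with the standard library. This is only a sketch: `looks_multihost` is a hypothetical name, and it counts repeated `host` query parameters with `urllib.parse` rather than inspecting a parsed `sqlalchemy.URL` as the package does.

```python
from urllib.parse import parse_qs, urlsplit


def looks_multihost(db_dsn: str) -> bool:
    """Approximate check: True when `host` appears more than once in the query."""
    hosts = parse_qs(urlsplit(db_dsn).query).get("host", [])
    return len(hosts) > 1


multi = looks_multihost("postgresql://login:password@/db_placeholder?host=host1&host=host2")
single_param = looks_multihost("postgresql://login:password@/db_placeholder?host=host1")
authority = looks_multihost("postgresql://login:password@host1/db_placeholder")
```

Only the first DSN qualifies; a lone `host` parameter and a host in the authority both count as single-host.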

## Related

- [connections.md](connections.md) — consumes the constructed `URL` and the
multi-host encoding.
59 changes: 59 additions & 0 deletions architecture/retry.md
@@ -0,0 +1,59 @@
# Retry

`postgres_retry` (in `db_retry/retry.py`) is an async decorator that retries a
coroutine function when PostgreSQL raises a transient error — a serialization
failure or a lost connection — and gives up on everything else.

## Public surface

```python
@postgres_retry # bare — uses the default retry count
async def handler(...) -> ...: ...

@postgres_retry(retries=5) # per-callsite override
async def handler(...) -> ...: ...
```

Two `typing.overload`s back the dual form: called with a function it returns the
wrapped function; called with `func=None` (i.e. `@postgres_retry(...)`) it
returns a decorator. The wrapped function keeps its signature via
`functools.wraps`. `retries` defaults to `None`, which defers to
[`settings.get_retries_number()`](settings.md) **at call time** — so the env var
is read per invocation, not frozen at decoration.

## Retry engine

Each call builds a `tenacity.AsyncRetrying` with:

- `stop=stop_after_attempt(retries or get_retries_number())`
- `wait=wait_exponential_jitter()` — exponential backoff with jitter
- `retry=retry_if_exception(_retry_handler)` — the predicate below
- `reraise=True` — the **original** exception propagates after the last attempt,
not tenacity's `RetryError`
- `before=before_log(logger, DEBUG)` — debug log before each attempt

## What counts as retriable

`_is_retriable_dbapi_error` returns `True` only for a `sqlalchemy.exc.DBAPIError`
whose `.orig` is set and whose `.orig.__cause__` is an
`asyncpg.SerializationError` (SQLSTATE `40001`) or
`asyncpg.PostgresConnectionError` (class `08`, e.g. `08000`/`08003`). This
deliberately excludes lookalikes such as `StatementCompletionUnknownError`
(`40002`), where the statement's outcome is unknown and a blind retry is unsafe.

## Cause-chain walk

`_retry_handler` does not inspect only the raised exception — it walks the
`__cause__`/`__context__` chain (following `__cause__` first, then
`__context__`), guarding against cycles with a `seen` set of `id()`s, and
returns `True` as soon as any link is a retriable `DBAPIError`.

The walk matters because the `DBAPIError` is often re-raised inside another
exception. For example, advanced-alchemy's `wrap_sqlalchemy_exception()` surfaces
it as `RepositoryError`/`IntegrityError` with the real `DBAPIError` hanging off
`__cause__`; the walk lets the retry still fire. Both retry and give-up paths
emit a debug log.
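
A cycle-safe chain walk of this kind can be sketched as follows. `chain_contains` and its `is_match` parameter are hypothetical names, and the linear "cause, else context" step is one reading of the description above rather than the package's exact traversal.

```python
from collections.abc import Callable


def chain_contains(
    exc: BaseException, is_match: Callable[[BaseException], bool]
) -> bool:
    """Return True if exc or anything it was raised from satisfies is_match."""
    seen: set[int] = set()  # ids of visited exceptions, guards against cycles
    current: BaseException | None = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if is_match(current):
            return True
        # Prefer the explicit cause; fall back to the implicit context.
        if current.__cause__ is not None:
            current = current.__cause__
        else:
            current = current.__context__
    return False


class Retriable(Exception):
    """Illustrative marker for an error worth retrying."""


class Wrapper(Exception):
    """Illustrative outer exception, like a repository-level error."""


try:
    try:
        raise Retriable("inner")
    except Retriable as inner:
        raise Wrapper("outer") from inner
except Wrapper as outer:
    wrapped = outer

found = chain_contains(wrapped, lambda e: isinstance(e, Retriable))
```

Here `found` is `True` even though the exception that reached the caller is a `Wrapper`, which is the situation the walk exists for.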

## Related

- [settings.md](settings.md) — where the default attempt count comes from.
24 changes: 24 additions & 0 deletions architecture/settings.md
@@ -0,0 +1,24 @@
# Settings

`db_retry/settings.py` holds the package's one piece of runtime configuration:
the default retry count.

## `get_retries_number`

```python
def get_retries_number() -> int:
    return int(os.getenv("DB_RETRY_RETRIES_NUMBER", "3"))
```

Reads the `DB_RETRY_RETRIES_NUMBER` environment variable **at call time**,
defaulting to `3`. It is a function — not a module-level constant — precisely so
the value is re-read on every call: tests can `monkeypatch.setenv(...)` and see
the new value, and a deployment can change the env without re-importing.

[`postgres_retry`](retry.md) calls this whenever its own `retries` argument is
`None`, so the env var sets the default attempt count while a per-callsite
`retries=N` overrides it.
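
The difference between a call-time read and a module-level constant can be seen in a small sketch. `FROZEN_DEFAULT`, `current_default`, and `with_env` are hypothetical names introduced only for the comparison.

```python
import os

ENV_NAME = "DB_RETRY_RETRIES_NUMBER"

# For comparison only: evaluated once, when this module is first executed.
FROZEN_DEFAULT = int(os.getenv(ENV_NAME, "3"))


def current_default() -> int:
    """Re-read the environment on every call, like get_retries_number()."""
    return int(os.getenv(ENV_NAME, "3"))


def with_env(value: str) -> tuple[int, int]:
    """Set the variable, then report (call-time value, import-time value)."""
    os.environ[ENV_NAME] = value
    return current_default(), FROZEN_DEFAULT
```

After `with_env("7")`, the first element reflects the new setting while the second is still whatever the environment held when the module loaded, which is why a constant would defeat `monkeypatch.setenv`.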

## Related

- [retry.md](retry.md) — the sole consumer.
42 changes: 42 additions & 0 deletions architecture/transaction.md
@@ -0,0 +1,42 @@
# Transaction

`Transaction` (in `db_retry/transaction.py`) is an async context manager that
wraps a SQLAlchemy `AsyncSession`, opening a transaction on entry and cleaning
up on exit.

## Shape

```python
@dataclasses.dataclass(kw_only=True, frozen=True, slots=True)
class Transaction:
    session: AsyncSession
    isolation_level: IsolationLevel | None = None
```

A frozen, slotted, keyword-only dataclass: `Transaction(session=..., isolation_level=...)`.
`__aenter__` returns `typing.Self` (no `typing_extensions` dependency).

## Entry

`async with Transaction(session=s) as tx:`

1. If `isolation_level` is set, calls `session.connection(execution_options={"isolation_level": ...})`
to apply it (e.g. `"SERIALIZABLE"`).
2. If the session is **not** already in a transaction, calls `session.begin()`.
An already-open transaction is adopted rather than nested.

## Exit

`__aexit__` does **not** suppress exceptions (returns `None`) and always runs,
on both the success and error paths:

1. If the session is still in a transaction, `session.rollback()`. This is the
auto-rollback: a block that neither `.commit()`s nor `.rollback()`s — or that
raises — leaves the transaction open, and exit rolls it back. A block that
already committed/rolled back is no longer in a transaction, so nothing is
undone.
2. `session.close()` — always, regardless of outcome.

So commit is explicit: call `tx.commit()` inside the block to persist; otherwise
the work is rolled back. `tx.rollback()` is also exposed for an explicit early
rollback. Both delegate straight to the session.
1 change: 1 addition & 0 deletions planning/.convention-version
@@ -0,0 +1 @@
1.0.0