ENG-37: type participant enum + unnamed in AI prompt + backfill by henokteixeira · Pull Request #80 · shemaobt/tripod-api

henokteixeira · 2026-05-20T15:01:54Z

Summary

Linear: https://linear.app/shema-obt/issue/ENG-37
Counterpart of meaning-map-ui#24 (frontend dropdown).
Introduces ParticipantType (StrEnum: named | unnamed | group | divine | role) following the project's enum convention (BCDStatus, EntryProvenance, Testament, RagNamespace).
Types both BCDParticipantEntry.type (API) and ParticipantSchema.type (LLM output schema) with the enum. Because ParticipantSchema is passed to call_llm via LangChain.with_structured_output(...), the model is now schema-forced to emit only valid values rather than relying solely on the prompt.
Updates the participant generation prompt to also teach the AI about unnamed for anonymous individuals (e.g. "the man's servant", "a woman from Bethlehem").

Supersedes #53

This PR replaces the previously-conflicting #53. Same intent, same code (cherry-picked from the original branch onto fresh main), with the merge conflict in app/models/book_context.py resolved during cherry-pick — EntryProvenance (added on main in the meantime) and the new ParticipantType enum now coexist as sibling StrEnum classes.

Test plan

uv run pytest tests/ — all passing.
uv run ruff check . clean.
uv run ruff format --check . clean.
uv run mypy app/ clean.
Boot the backend, create a BCD, edit type via the UI to each of the 5 values, confirm round-trip on GET /book-context/{id}.
Submit a type outside the enum via API — should get a 422 (enum validation rejecting it).
Trigger a fresh POST /book-context/{id}/generate for a small book (e.g. Ruth) and confirm AI-generated participants stay within the enum and may emit unnamed where appropriate.

Rollout work added in response to the review (audit, migration, LLM resilience, scope expansion) — see follow-up comment.

Defines the canonical 5-value vocabulary (named, unnamed, group, divine, role) for participant naming categories and types BCDParticipantEntry.type against it. ENG-37

ParticipantSchema is the output schema passed to call_llm via LangChain.with_structured_output, so typing it with the enum forces the model to emit only valid values via structured output rather than relying solely on the prompt. ENG-37

Adds 'unnamed' to the type list in Source A with semantic guidance so the AI also classifies anonymous individuals (e.g. "the man's servant", "a woman from Bethlehem"). The prompt focuses on when to use each value; the schema enforces the value set. ENG-37

…t types Adds backfill_participant_enums(db, dry_run) which iterates over BCDs with a participant_register, normalises any 'type' value outside the enum to 'named' (with a WARN log) and uses flag_modified so SQLAlchemy persists the JSON change. Idempotent and safe to re-run. The thin scripts/backfill_participant_enums.py wrapper exposes a --dry-run flag, following the pattern of scripts/backfill_split_index.py. This is required before deploying the enum-typed BCDParticipantEntry, otherwise GETs on legacy BCDs with out-of-enum 'type' would 500. ENG-37

ENG-37

Removed for now; we'll likely re-introduce normalization as an Alembic migration instead. The enum-typed schema still rejects out-of-enum values on read, so any rollout will need a follow-up before deploy. ENG-37

Ruff's isort rule now requires combining same-module plain imports onto a single line. Merges EntryProvenance and ParticipantType into one statement; aliased imports stay on separate lines.

joaocarvoli

Blocking on this one. The direction is right (structured types over free strings), but the rollout strategy is missing the prerequisite work and will regress a previously-fixed production incident.

Verified regression

I ran the failure case directly against this branch's BCDParticipantEntry:

'location': FAILS -> type
'place':    FAILS -> type
'animal':   FAILS -> type
'object':   FAILS -> type
'concept':  FAILS -> type

All raise pydantic.ValidationError in BCDResponse.model_validate(bcd) at app/services/book_context/enrich_bcd_response.py:17. That call happens on every GET /book-context/{id}. Inline comments below tie this to the specific lines.

What's missing in this PR

Data audit. No evidence of an audit query against production participant_register to enumerate the actual set of type values in the wild. Required before we can know the blast radius:
```
SELECT DISTINCT jsonb_array_elements(participant_register)->>'type'
FROM book_context_documents
WHERE participant_register IS NOT NULL;
```
Migration. A backfill was added in 1ec3527 and reverted in 69df5d1. The PR description defers normalization to a future Alembic migration, but that migration is not in this PR. An open PR that's "safe to merge but not to deploy" is one merge away from production.
Tests. PR description says "this PR doesn't add new tests." For a change that introduces a hard validation boundary on existing data, the failure-path test (out-of-enum type → ValidationError) is exactly what would have caught this risk in CI.
Manual test plan. 3 of 4 items unchecked — and they're the ones that exercise the actual integration (round-trip via UI, deliberate out-of-enum submission, fresh generation against a real book).

What I need to flip this to approve

Audit query result attached to the PR description so we know what's in production.
Alembic data migration attached to this PR (idempotent + reversible).
A unit test that verifies an out-of-enum type raises a clean error (or, after normalization, gets mapped to named).
Prompt fix per the inline comment on participants.py.
The unchecked manual test plan items executed and ticked.

joaocarvoli · 2026-05-22T03:09:35Z

+    UNNAMED = "unnamed"
+    GROUP = "group"
+    DIVINE = "divine"
+    ROLE = "role"


BLOCKER — the enum is missing values that exist in production.

Git history on this exact field:

Commit Date Change

40dffee Mar 26 2026 Added "location" to the Literal — "BCD generation produced location-type participants, Pydantic validation failed with a 500 error, also causing downstream 504 timeouts on the frontend"

4b45a6d Mar 27 2026 Widened to str — "The LLM sometimes produces unexpected type values like 'location' instead of the strict … enum. This caused a Pydantic validation error (500) when fetching the BCD"

The new enum named | unnamed | group | divine | role does not include location, which was confirmed in production. Other ad-hoc values may exist that we haven't enumerated.

This PR cannot ship until we either (a) include location (and whatever else the audit query turns up) in the enum, or (b) ship the data migration alongside this PR so the data matches the constraint at deploy time. Doing neither — which is what's proposed — guarantees a recurrence of the prior incident.

Audit ran — production has no values outside named | group | divine | role anymore. The location incident from 40dffee/4b45a6d is dead data at this point (cleaned up at some point between then and now). Numbers in the follow-up comment.

Migration 20260522_0001 is in the PR now and will normalize anything that appears between merge and deploy. Enum unchanged: named | unnamed | group | divine | role.

joaocarvoli · 2026-05-22T03:09:35Z

    english_gloss: str = ""
    entity_type: str = "person"
-    type: str = "named"
+    type: ParticipantType = ParticipantType.NAMED


This line is the failure point. BCDResponse.model_validate(bcd) (called for every GET on a BCD at app/services/book_context/enrich_bcd_response.py:17) parses participant_register through list[BCDParticipantEntry], which now enforces the enum on every row.

I verified the regression directly on this branch:

from app.models.book_context import BCDParticipantEntry BCDParticipantEntry.model_validate({ 'name': 'X', 'type': 'location', 'entry_verse': {'chapter': 1, 'verse': 1}, 'role_in_book': 'setting', }) # pydantic_core._pydantic_core.ValidationError: 1 validation error for BCDParticipantEntry # type: Input should be 'named', 'unnamed', 'group', 'divine' or 'role'

Note: model_config = {"extra": "allow"} does not help here — it only governs unknown fields, not validation on known fields. The original 4b45a6d commit message read it the same way ("since the data has model_config extra=allow anyway") which was a misread of Pydantic semantics. Worth flagging for the team.

Secondary observation: this model has two str "category" fields — entity_type: str = "person" on line 44 is left as free string. BCDPlace.type: str = "" (sibling model) is also untouched. Either enum all three or comment on why only BCDParticipantEntry.type is constrained — the asymmetry will confuse the next reader.

Failure point is resolved by the new 20260522_0001 migration (normalizes any out-of-enum value before deploy, with _legacy_<field> for reversibility).

Asymmetry observation: addressed in this round. BCDParticipantEntry.entity_type is now ParticipantEntityType (person | person_common | ambiguous) and BCDPlace.type is now PlaceType (17 values calibrated from the audit). Both go through the same migration. The extra=allow ≠ field validation point is well taken — kept the comment in mind while writing the migration so it specifically rewrites the typed field rather than relying on extras.

joaocarvoli · 2026-05-22T03:09:35Z

    english_gloss: str = ""
    entity_type: str = "person"
-    type: str = "named"
+    type: ParticipantType = ParticipantType.NAMED


The PR description claims "the model is now schema-forced to emit only valid values rather than relying solely on the prompt." Worth being precise about what that buys us in practice.

The LLM here is ChatGoogleGenerativeAI (Gemini) per app/services/book_context/generation/llm.py:7. Gemini's structured output via LangChain enforces JSON Schema, but its enum-constraint enforcement is softer than OpenAI's strict mode — the model can still emit out-of-enum strings under load. When that happens, LangChain's with_structured_output raises rather than coercing, so await call_llm(...) throws and the entire participant generation step fails instead of producing partial output.

That's a separate failure mode from the legacy-data one and deserves a guard. Suggest:

# in participants.py around line 154 try: result = await call_llm(prompt, output_schema=ParticipantRegisterSchema) except ValidationError as e: logger.warning("LLM emitted out-of-enum participant type, normalizing to 'named': %s", e) # re-parse with type coerced to ParticipantType.NAMED for any failing rows ...

Without this, a single bad row in the LLM response kills the whole generation. The current free-string field is what lets the pipeline survive minor LLM drift; enum-ifying it removes that resilience.

Good catch — and the exception type is OutputParserException (subclass of ValueError), not ValidationError. The guard I added catches both: except (ValidationError, OutputParserException) as exc.

Applied in four places, not just _generate_batch, because BCDPlace is enum-typed too now and ContextSectionsSchema nests it:

_generate_batch (participants)

_generate_place_batch (places)

generate_context_sections small path

generate_context_sections large path (ContextSectionsBatchSchema)

Fallback re-issues call_llm(..., output_schema=None), parses the raw payload via _parse_and_normalize_*, and coerces invalid type/entity_type values back into the enum with _legacy_<field> preserved. Tests in tests/test_participant_llm_resilience.py and tests/test_place_llm_resilience.py exercise both exception paths and the small-path bundled fallback.

joaocarvoli · 2026-05-22T03:09:35Z

 Persia when used as country/region references.
+  - "unnamed" — for a person entity referenced anonymously in the narrative \
+(e.g., "the man's servant", "a woman from Bethlehem"). Rare for proper-noun \
+sources but possible when the BHSA entity stands for an unnamed individual.


Semantic problem with where this is placed.

This guidance lives under Source A — Person Entities (proper nouns) (see the header at line ~62 and "name: copy EXACTLY from the entity data" at the top of this block). unnamed is by definition not a proper noun. The examples given in the PR description — "the man's servant", "a woman from Bethlehem" — are common-noun descriptors, which belong under Source B — Common Noun Groups, Roles, and Officials.

As written, the prompt asks the AI to apply unnamed to BHSA proper-noun entities, which contradicts Source A's premise ("FOR EACH person entity in the Person Entities list … copy EXACTLY from the entity data" — those entities ARE named, that's why BHSA flags them as proper-noun person entities).

Two possible fixes:

Restructure Source A as "Individuals" with named vs unnamed sub-options, and clarify which BHSA nametype values map to unnamed (if any).

Move unnamed to Source B with examples that match BHSA common-noun lemmas (e.g., "man's servant" → some BHSA common noun for servant).

Without one of these, the AI is being given contradictory guidance: "these are proper-noun entities, but classify some as unnamed."

Fixed per your option 2. unnamed moved from Source A to Source B as category (c) UNNAMED INDIVIDUAL, with examples that now match BHSA common-noun lemmas:

"the man's servant", "a woman from Bethlehem", "the foreman of the reapers". Use when the BHSA common-noun lemma stands for a SINGLE individual whose identity is narratively important but who is never named.

Source A's type list now explicitly states it produces only named | group | divine and notes that unnamed | role are reserved for Source B — closes the contradiction the AI was being asked to navigate. Source B's category (c) is distinct from (a) collective groups and (b) institutional roles, so the LLM has a clean three-way choice for common nouns.

Adds two new StrEnums alongside the existing ParticipantType: - ParticipantEntityType (person | person_common | ambiguous) — typed BCDParticipantEntry.entity_type and ParticipantSchema.entity_type so the LLM structured_output emits only valid values and reads validate the same way. - PlaceType (city | country | region | district | empire | village | mountain | valley | river | well | field | road | gate | tower | wall | structure | other) — typed BCDPlace.type. The 17 values come from a production audit; values outside this set will be normalized by a follow-up migration. The enum sets were calibrated against actual production data so the read-path validation no longer regresses the previously-fixed 500/504 incident: every value present in production either maps to an enum member or will be backfilled by the migration in the next commit. Updates the two existing place fixtures (Boaz's field) from the legacy "location" string to the new "field" enum value, which is semantically more accurate per the BHSA classification. 🤖 Generated with [Nori](https://noriagentic.com) Co-Authored-By: Nori <contact@tilework.tech>

Normalizes participant_register[*].type, participant_register[*].entity_type and places[*].type so existing rows match the new StrEnums. Strategy: - Any out-of-enum value is replaced with a safe default (named / person / other). - The original value is preserved alongside as `_legacy_<field>` on the same JSON object. Pydantic's `extra="allow"` keeps it through reads without affecting validation. - `downgrade()` restores `<field>` from `_legacy_<field>`. Production audit before writing this migration: - participant.type: all values already inside the enum (no-op). - participant.entity_type: 12 rows with "ambiguous" (now an enum member); 6 rows with "group" (bug) → person. - places.type: ~85% covered by the enum; the long tail ("city/empire", "place", "", etc.) → other. Idempotent: re-running the migration on already-normalized rows is a no-op because the legacy marker is the trigger, not the field value. Tests cover normalize, restore, idempotence, and the upgrade/downgrade round-trip property. 🤖 Generated with [Nori](https://noriagentic.com) Co-Authored-By: Nori <contact@tilework.tech>

Two related changes to the participant generation node: 1. Resilience guard around `call_llm(prompt, output_schema=...)`. LangChain's `with_structured_output` raises `OutputParserException` (not `ValidationError`) when the model emits a value outside the StrEnum — for Gemini under load this happens occasionally and previously killed the entire participant generation step. The fallback catches both exception types, re-issues the call without a schema, and normalizes invalid `type`/`entity_type` values back into the enum via `_parse_and_normalize_participants`, preserving the offending value as `_legacy_<field>` for visibility. 2. Prompt fix per review on #80. The `unnamed` type belongs in Source B (common-noun candidates), not Source A (proper-noun entities) — the original placement contradicted Source A's premise. Source A now explicitly lists only `named | group | divine` and notes that `unnamed | role` are reserved for Source B. Source B gains a third category (c) for unnamed individuals with examples that match BHSA common-noun lemmas. 🤖 Generated with [Nori](https://noriagentic.com) Co-Authored-By: Nori <contact@tilework.tech>

Same exception-type fix applied in three places where `call_llm` is issued with a schema that nests `BCDPlace`/`PlaceSchema`: - `_generate_place_batch` (PlacesRegisterSchema) - `generate_context_sections` small path (ContextSectionsSchema) - `generate_context_sections` large path (ContextSectionsBatchSchema) Without the guard, a single out-of-enum `places[*].type` from the LLM would not just lose the one bad row — it would kill the entire context_sections step (theological_spine, objects, institutions, genre_context, maintenance_notes all included). This is structurally worse than the participant case because more fields are bundled. The small path now falls back to a raw call + `_parse_and_normalize_ context_sections` which preserves the other fields and normalizes invalid place types to `other` with `_legacy_type`. The large path treats the no-places bundle the same way; place batches were already isolated and use `_parse_and_normalize_places` on fallback. 🤖 Generated with [Nori](https://noriagentic.com) Co-Authored-By: Nori <contact@tilework.tech>

henokteixeira · 2026-05-22T18:05:35Z

Review follow-up — rollout work + scope expansion

Thanks @joaocarvoli for the detailed review. Addressing each point:

1. Production audit (commit 43ce4d1)

Ran the three queries against the Neon production DB before deciding the enum sets.

participant_register[*].type — all production values already inside the proposed enum (named=1271, group=96, divine=24, role=9). Migration is a no-op on this field. location and the other concerning values from the historical incident no longer appear — they were already cleaned up.

participant_register[*].entity_type — person (1365), person_common (17), ambiguous (12), group (6). ambiguous is a legitimate BHSA classification state → kept as an enum member. group (6 rows) is a conflation bug → backfill normalizes to person.

places[*].type — 21 distinct values across ~367 entries. The 17-value PlaceType covers ~85% directly (city 217, region 32, district 12, gate 12, tower 11, country 6, empire 6, village 4, well 4, mountain 4, river 3, valley/field/road/structure 1 each, plus wall, other). The long tail (place 17, composite slashes like city/empire 6, hill/district 4, mountain/city 2, plus single-occurrence oddities) is normalized to other with _legacy_type preserving the original.

2. Backfill migration (commit 02863e7)

alembic/versions/20260522_0001_normalize_bcd_enum_fields.py. Idempotent + reversible: any out-of-enum value is replaced with a safe default and the original is preserved as _legacy_<field> on the same JSON object (Pydantic's extra=allow keeps it through reads). downgrade() restores <field> from the marker. Tests in tests/test_normalize_bcd_enum_fields_migration.py cover normalize, restore, idempotence, and the upgrade/downgrade round-trip.

3. LLM resilience + scope expansion (commit c116423, commit edb3573)

The resilience guard you suggested — but the exception type matters. LangChain's with_structured_output raises OutputParserException (not ValidationError) when Gemini emits an out-of-enum value, so the guard catches both. Applied in four places:

_generate_batch (participants)
_generate_place_batch (places)
generate_context_sections small path (was even more catastrophic — would have lost theological_spine + objects + institutions + genre_context + maintenance_notes together)
generate_context_sections large path (ContextSectionsBatchSchema)

Each fallback re-issues the call without a schema, parses the raw output, and normalizes invalid values back into the enum, preserving the offending value as _legacy_<field> for visibility.

This also addresses the asymmetry observation in your inline comment on line 45: BCDParticipantEntry.entity_type (ParticipantEntityType: person | person_common | ambiguous) and BCDPlace.type (PlaceType: 17 values) are now enums too, with the migration handling both.

4. Prompt fix (commit c116423)

Per your inline on line 75: moved unnamed from Source A to Source B as category (c) — UNNAMED INDIVIDUAL. Source A now explicitly states it produces only named | group | divine (with a note that unnamed | role belong to Source B). Source B's unnamed examples now match BHSA common-noun lemmas (the man's servant, a woman from Bethlehem, the foreman of the reapers).

5. Test plan

uv run pytest tests/ — 728 passing, 2 skipped (baseline 633 + 95 new across 4 test files).
ruff check, ruff format --check, mypy app/ — all clean.
The 3 manual items remain pending — will tick after running against a live backend.

Rollout order

Migration is now in this PR, so deploy in two steps:

uv run alembic upgrade head against production — this is read-tolerant (legacy data still validates after this because of the _legacy_<field> marker pattern + extra=allow).
Deploy the application code. Every BCD row is guaranteed to validate against the new enums.

_legacy_<field> markers persist in the JSON until the row is next edited via update_section (which overwrites the column with the UI payload). If you need to audit downstream consumers of those markers, do it before the next bulk edit cycle.

henokteixeira added 8 commits May 20, 2026 11:58

feat(bcd): add ParticipantType StrEnum and type the participant entry

e6e38d0

Defines the canonical 5-value vocabulary (named, unnamed, group, divine, role) for participant naming categories and types BCDParticipantEntry.type against it. ENG-37

chore(bcd): reorder ParticipantType import to satisfy ruff I001

db02dd8

ENG-37

chore(bcd): apply ruff format to participant_backfill

694ff7c

ENG-37

revert(bcd): remove backfill service and script

69df5d1

Removed for now; we'll likely re-introduce normalization as an Alembic migration instead. The enum-typed schema still rejects out-of-enum values on read, so any rollout will need a follow-up before deploy. ENG-37

chore(bcd): combine plain ParticipantType import to satisfy ruff I001

e4d2b29

Ruff's isort rule now requires combining same-module plain imports onto a single line. Merges EntryProvenance and ParticipantType into one statement; aliased imports stay on separate lines.

henokteixeira self-assigned this May 22, 2026

joaocarvoli requested changes May 22, 2026

View reviewed changes

henokteixeira and others added 4 commits May 22, 2026 14:36

henokteixeira requested a review from joaocarvoli May 22, 2026 23:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENG-37: type participant enum + unnamed in AI prompt + backfill#80

ENG-37: type participant enum + unnamed in AI prompt + backfill#80
henokteixeira wants to merge 12 commits into
mainfrom
henok/eng-37-participant-type-enum-v2

henokteixeira commented May 20, 2026 •

edited

Loading

Uh oh!

joaocarvoli left a comment

Uh oh!

joaocarvoli May 22, 2026

Uh oh!

henokteixeira May 22, 2026

Uh oh!

joaocarvoli May 22, 2026

Uh oh!

henokteixeira May 22, 2026

Uh oh!

joaocarvoli May 22, 2026

Uh oh!

henokteixeira May 22, 2026

Uh oh!

joaocarvoli May 22, 2026

Uh oh!

henokteixeira May 22, 2026

Uh oh!

henokteixeira commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Commit	Date	Change
`40dffee`	Mar 26 2026	Added `"location"` to the Literal — "BCD generation produced location-type participants, Pydantic validation failed with a 500 error, also causing downstream 504 timeouts on the frontend"
`4b45a6d`	Mar 27 2026	Widened to `str` — "The LLM sometimes produces unexpected type values like 'location' instead of the strict … enum. This caused a Pydantic validation error (500) when fetching the BCD"

Conversation

henokteixeira commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Supersedes #53

Test plan

Uh oh!

joaocarvoli left a comment

Choose a reason for hiding this comment

Verified regression

What's missing in this PR

What I need to flip this to approve

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

henokteixeira commented May 22, 2026

Review follow-up — rollout work + scope expansion

1. Production audit (commit 43ce4d1)

2. Backfill migration (commit 02863e7)

3. LLM resilience + scope expansion (commit c116423, commit edb3573)

4. Prompt fix (commit c116423)

5. Test plan

Rollout order

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

henokteixeira commented May 20, 2026 •

edited

Loading