ENG-37: type participant enum + unnamed in AI prompt + backfill#80
ENG-37: type participant enum + unnamed in AI prompt + backfill#80henokteixeira wants to merge 12 commits into
Conversation
Defines the canonical 5-value vocabulary (named, unnamed, group, divine, role) for participant naming categories and types BCDParticipantEntry.type against it. ENG-37
ParticipantSchema is the output schema passed to call_llm via LangChain.with_structured_output, so typing it with the enum forces the model to emit only valid values via structured output rather than relying solely on the prompt. ENG-37
Adds 'unnamed' to the type list in Source A with semantic guidance so the AI also classifies anonymous individuals (e.g. "the man's servant", "a woman from Bethlehem"). The prompt focuses on when to use each value; the schema enforces the value set. ENG-37
…t types Adds backfill_participant_enums(db, dry_run) which iterates over BCDs with a participant_register, normalises any 'type' value outside the enum to 'named' (with a WARN log) and uses flag_modified so SQLAlchemy persists the JSON change. Idempotent and safe to re-run. The thin scripts/backfill_participant_enums.py wrapper exposes a --dry-run flag, following the pattern of scripts/backfill_split_index.py. This is required before deploying the enum-typed BCDParticipantEntry, otherwise GETs on legacy BCDs with out-of-enum 'type' would 500. ENG-37
Removed for now; we'll likely re-introduce normalization as an Alembic migration instead. The enum-typed schema still rejects out-of-enum values on read, so any rollout will need a follow-up before deploy. ENG-37
Ruff's isort rule now requires combining same-module plain imports onto a single line. Merges EntryProvenance and ParticipantType into one statement; aliased imports stay on separate lines.
joaocarvoli
left a comment
There was a problem hiding this comment.
Blocking on this one. The direction is right (structured types over free strings), but the rollout strategy is missing the prerequisite work and will regress a previously-fixed production incident.
Verified regression
I ran the failure case directly against this branch's BCDParticipantEntry:
'location': FAILS -> type
'place': FAILS -> type
'animal': FAILS -> type
'object': FAILS -> type
'concept': FAILS -> type
All raise pydantic.ValidationError in BCDResponse.model_validate(bcd) at app/services/book_context/enrich_bcd_response.py:17. That call happens on every GET /book-context/{id}. Inline comments below tie this to the specific lines.
What's missing in this PR
-
Data audit. No evidence of an audit query against production
participant_registerto enumerate the actual set oftypevalues in the wild. Required before we can know the blast radius:SELECT DISTINCT jsonb_array_elements(participant_register)->>'type' FROM book_context_documents WHERE participant_register IS NOT NULL;
-
Migration. A backfill was added in
1ec3527and reverted in69df5d1. The PR description defers normalization to a future Alembic migration, but that migration is not in this PR. An open PR that's "safe to merge but not to deploy" is one merge away from production. -
Tests. PR description says "this PR doesn't add new tests." For a change that introduces a hard validation boundary on existing data, the failure-path test (out-of-enum
type→ValidationError) is exactly what would have caught this risk in CI. -
Manual test plan. 3 of 4 items unchecked — and they're the ones that exercise the actual integration (round-trip via UI, deliberate out-of-enum submission, fresh generation against a real book).
What I need to flip this to approve
- Audit query result attached to the PR description so we know what's in production.
- Alembic data migration attached to this PR (idempotent + reversible).
- A unit test that verifies an out-of-enum
typeraises a clean error (or, after normalization, gets mapped tonamed). - Prompt fix per the inline comment on
participants.py. - The unchecked manual test plan items executed and ticked.
| UNNAMED = "unnamed" | ||
| GROUP = "group" | ||
| DIVINE = "divine" | ||
| ROLE = "role" |
There was a problem hiding this comment.
BLOCKER — the enum is missing values that exist in production.
Git history on this exact field:
| Commit | Date | Change |
|---|---|---|
40dffee |
Mar 26 2026 | Added "location" to the Literal — "BCD generation produced location-type participants, Pydantic validation failed with a 500 error, also causing downstream 504 timeouts on the frontend" |
4b45a6d |
Mar 27 2026 | Widened to str — "The LLM sometimes produces unexpected type values like 'location' instead of the strict … enum. This caused a Pydantic validation error (500) when fetching the BCD" |
The new enum named | unnamed | group | divine | role does not include location, which was confirmed in production. Other ad-hoc values may exist that we haven't enumerated.
This PR cannot ship until we either (a) include location (and whatever else the audit query turns up) in the enum, or (b) ship the data migration alongside this PR so the data matches the constraint at deploy time. Doing neither — which is what's proposed — guarantees a recurrence of the prior incident.
There was a problem hiding this comment.
Audit ran — production has no values outside named | group | divine | role anymore. The location incident from 40dffee/4b45a6d is dead data at this point (cleaned up at some point between then and now). Numbers in the follow-up comment.
Migration 20260522_0001 is in the PR now and will normalize anything that appears between merge and deploy. Enum unchanged: named | unnamed | group | divine | role.
| english_gloss: str = "" | ||
| entity_type: str = "person" | ||
| type: str = "named" | ||
| type: ParticipantType = ParticipantType.NAMED |
There was a problem hiding this comment.
This line is the failure point. BCDResponse.model_validate(bcd) (called for every GET on a BCD at app/services/book_context/enrich_bcd_response.py:17) parses participant_register through list[BCDParticipantEntry], which now enforces the enum on every row.
I verified the regression directly on this branch:
from app.models.book_context import BCDParticipantEntry
BCDParticipantEntry.model_validate({
'name': 'X', 'type': 'location',
'entry_verse': {'chapter': 1, 'verse': 1},
'role_in_book': 'setting',
})
# pydantic_core._pydantic_core.ValidationError: 1 validation error for BCDParticipantEntry
# type: Input should be 'named', 'unnamed', 'group', 'divine' or 'role'Note: model_config = {"extra": "allow"} does not help here — it only governs unknown fields, not validation on known fields. The original 4b45a6d commit message read it the same way ("since the data has model_config extra=allow anyway") which was a misread of Pydantic semantics. Worth flagging for the team.
Secondary observation: this model has two str "category" fields — entity_type: str = "person" on line 44 is left as free string. BCDPlace.type: str = "" (sibling model) is also untouched. Either enum all three or comment on why only BCDParticipantEntry.type is constrained — the asymmetry will confuse the next reader.
There was a problem hiding this comment.
Failure point is resolved by the new 20260522_0001 migration (normalizes any out-of-enum value before deploy, with _legacy_<field> for reversibility).
Asymmetry observation: addressed in this round. BCDParticipantEntry.entity_type is now ParticipantEntityType (person | person_common | ambiguous) and BCDPlace.type is now PlaceType (17 values calibrated from the audit). Both go through the same migration. The extra=allow ≠ field validation point is well taken — kept the comment in mind while writing the migration so it specifically rewrites the typed field rather than relying on extras.
| english_gloss: str = "" | ||
| entity_type: str = "person" | ||
| type: str = "named" | ||
| type: ParticipantType = ParticipantType.NAMED |
There was a problem hiding this comment.
The PR description claims "the model is now schema-forced to emit only valid values rather than relying solely on the prompt." Worth being precise about what that buys us in practice.
The LLM here is ChatGoogleGenerativeAI (Gemini) per app/services/book_context/generation/llm.py:7. Gemini's structured output via LangChain enforces JSON Schema, but its enum-constraint enforcement is softer than OpenAI's strict mode — the model can still emit out-of-enum strings under load. When that happens, LangChain's with_structured_output raises rather than coercing, so await call_llm(...) throws and the entire participant generation step fails instead of producing partial output.
That's a separate failure mode from the legacy-data one and deserves a guard. Suggest:
# in participants.py around line 154
try:
result = await call_llm(prompt, output_schema=ParticipantRegisterSchema)
except ValidationError as e:
logger.warning("LLM emitted out-of-enum participant type, normalizing to 'named': %s", e)
# re-parse with type coerced to ParticipantType.NAMED for any failing rows
...Without this, a single bad row in the LLM response kills the whole generation. The current free-string field is what lets the pipeline survive minor LLM drift; enum-ifying it removes that resilience.
There was a problem hiding this comment.
Good catch — and the exception type is OutputParserException (subclass of ValueError), not ValidationError. The guard I added catches both: except (ValidationError, OutputParserException) as exc.
Applied in four places, not just _generate_batch, because BCDPlace is enum-typed too now and ContextSectionsSchema nests it:
_generate_batch(participants)_generate_place_batch(places)generate_context_sectionssmall pathgenerate_context_sectionslarge path (ContextSectionsBatchSchema)
Fallback re-issues call_llm(..., output_schema=None), parses the raw payload via _parse_and_normalize_*, and coerces invalid type/entity_type values back into the enum with _legacy_<field> preserved. Tests in tests/test_participant_llm_resilience.py and tests/test_place_llm_resilience.py exercise both exception paths and the small-path bundled fallback.
| Persia when used as country/region references. | ||
| - "unnamed" — for a person entity referenced anonymously in the narrative \ | ||
| (e.g., "the man's servant", "a woman from Bethlehem"). Rare for proper-noun \ | ||
| sources but possible when the BHSA entity stands for an unnamed individual. |
There was a problem hiding this comment.
Semantic problem with where this is placed.
This guidance lives under Source A — Person Entities (proper nouns) (see the header at line ~62 and "name: copy EXACTLY from the entity data" at the top of this block). unnamed is by definition not a proper noun. The examples given in the PR description — "the man's servant", "a woman from Bethlehem" — are common-noun descriptors, which belong under Source B — Common Noun Groups, Roles, and Officials.
As written, the prompt asks the AI to apply unnamed to BHSA proper-noun entities, which contradicts Source A's premise ("FOR EACH person entity in the Person Entities list … copy EXACTLY from the entity data" — those entities ARE named, that's why BHSA flags them as proper-noun person entities).
Two possible fixes:
- Restructure Source A as "Individuals" with
namedvsunnamedsub-options, and clarify which BHSAnametypevalues map tounnamed(if any). - Move
unnamedto Source B with examples that match BHSA common-noun lemmas (e.g., "man's servant" → some BHSA common noun for servant).
Without one of these, the AI is being given contradictory guidance: "these are proper-noun entities, but classify some as unnamed."
There was a problem hiding this comment.
Fixed per your option 2. unnamed moved from Source A to Source B as category (c) UNNAMED INDIVIDUAL, with examples that now match BHSA common-noun lemmas:
"the man's servant", "a woman from Bethlehem", "the foreman of the reapers". Use when the BHSA common-noun lemma stands for a SINGLE individual whose identity is narratively important but who is never named.
Source A's type list now explicitly states it produces only named | group | divine and notes that unnamed | role are reserved for Source B — closes the contradiction the AI was being asked to navigate. Source B's category (c) is distinct from (a) collective groups and (b) institutional roles, so the LLM has a clean three-way choice for common nouns.
Adds two new StrEnums alongside the existing ParticipantType: - ParticipantEntityType (person | person_common | ambiguous) — typed BCDParticipantEntry.entity_type and ParticipantSchema.entity_type so the LLM structured_output emits only valid values and reads validate the same way. - PlaceType (city | country | region | district | empire | village | mountain | valley | river | well | field | road | gate | tower | wall | structure | other) — typed BCDPlace.type. The 17 values come from a production audit; values outside this set will be normalized by a follow-up migration. The enum sets were calibrated against actual production data so the read-path validation no longer regresses the previously-fixed 500/504 incident: every value present in production either maps to an enum member or will be backfilled by the migration in the next commit. Updates the two existing place fixtures (Boaz's field) from the legacy "location" string to the new "field" enum value, which is semantically more accurate per the BHSA classification. 🤖 Generated with [Nori](https://noriagentic.com) Co-Authored-By: Nori <contact@tilework.tech>
Normalizes participant_register[*].type, participant_register[*].entity_type
and places[*].type so existing rows match the new StrEnums.
Strategy:
- Any out-of-enum value is replaced with a safe default (named / person
/ other).
- The original value is preserved alongside as `_legacy_<field>` on the
same JSON object. Pydantic's `extra="allow"` keeps it through reads
without affecting validation.
- `downgrade()` restores `<field>` from `_legacy_<field>`.
Production audit before writing this migration:
- participant.type: all values already inside the enum (no-op).
- participant.entity_type: 12 rows with "ambiguous" (now an enum
member); 6 rows with "group" (bug) → person.
- places.type: ~85% covered by the enum; the long tail
("city/empire", "place", "", etc.) → other.
Idempotent: re-running the migration on already-normalized rows is a
no-op because the legacy marker is the trigger, not the field value.
Tests cover normalize, restore, idempotence, and the upgrade/downgrade
round-trip property.
🤖 Generated with [Nori](https://noriagentic.com)
Co-Authored-By: Nori <contact@tilework.tech>
Two related changes to the participant generation node: 1. Resilience guard around `call_llm(prompt, output_schema=...)`. LangChain's `with_structured_output` raises `OutputParserException` (not `ValidationError`) when the model emits a value outside the StrEnum — for Gemini under load this happens occasionally and previously killed the entire participant generation step. The fallback catches both exception types, re-issues the call without a schema, and normalizes invalid `type`/`entity_type` values back into the enum via `_parse_and_normalize_participants`, preserving the offending value as `_legacy_<field>` for visibility. 2. Prompt fix per review on #80. The `unnamed` type belongs in Source B (common-noun candidates), not Source A (proper-noun entities) — the original placement contradicted Source A's premise. Source A now explicitly lists only `named | group | divine` and notes that `unnamed | role` are reserved for Source B. Source B gains a third category (c) for unnamed individuals with examples that match BHSA common-noun lemmas. 🤖 Generated with [Nori](https://noriagentic.com) Co-Authored-By: Nori <contact@tilework.tech>
Same exception-type fix applied in three places where `call_llm` is issued with a schema that nests `BCDPlace`/`PlaceSchema`: - `_generate_place_batch` (PlacesRegisterSchema) - `generate_context_sections` small path (ContextSectionsSchema) - `generate_context_sections` large path (ContextSectionsBatchSchema) Without the guard, a single out-of-enum `places[*].type` from the LLM would not just lose the one bad row — it would kill the entire context_sections step (theological_spine, objects, institutions, genre_context, maintenance_notes all included). This is structurally worse than the participant case because more fields are bundled. The small path now falls back to a raw call + `_parse_and_normalize_ context_sections` which preserves the other fields and normalizes invalid place types to `other` with `_legacy_type`. The large path treats the no-places bundle the same way; place batches were already isolated and use `_parse_and_normalize_places` on fallback. 🤖 Generated with [Nori](https://noriagentic.com) Co-Authored-By: Nori <contact@tilework.tech>
Review follow-up — rollout work + scope expansionThanks @joaocarvoli for the detailed review. Addressing each point: 1. Production audit (commit 43ce4d1)Ran the three queries against the Neon production DB before deciding the enum sets.
2. Backfill migration (commit 02863e7)
3. LLM resilience + scope expansion (commit c116423, commit edb3573)The resilience guard you suggested — but the exception type matters. LangChain's
Each fallback re-issues the call without a schema, parses the raw output, and normalizes invalid values back into the enum, preserving the offending value as This also addresses the asymmetry observation in your inline comment on line 45: 4. Prompt fix (commit c116423)Per your inline on line 75: moved 5. Test plan
Rollout orderMigration is now in this PR, so deploy in two steps:
|
Summary
meaning-map-ui#24(frontend dropdown).ParticipantType(StrEnum:named | unnamed | group | divine | role) following the project's enum convention (BCDStatus,EntryProvenance,Testament,RagNamespace).BCDParticipantEntry.type(API) andParticipantSchema.type(LLM output schema) with the enum. BecauseParticipantSchemais passed tocall_llmviaLangChain.with_structured_output(...), the model is now schema-forced to emit only valid values rather than relying solely on the prompt.unnamedfor anonymous individuals (e.g. "the man's servant", "a woman from Bethlehem").Supersedes #53
This PR replaces the previously-conflicting #53. Same intent, same code (cherry-picked from the original branch onto fresh
main), with the merge conflict inapp/models/book_context.pyresolved during cherry-pick —EntryProvenance(added onmainin the meantime) and the newParticipantTypeenum now coexist as siblingStrEnumclasses.Test plan
uv run pytest tests/— all passing.uv run ruff check .clean.uv run ruff format --check .clean.uv run mypy app/clean.typevia the UI to each of the 5 values, confirm round-trip onGET /book-context/{id}.typeoutside the enum via API — should get a 422 (enum validation rejecting it).POST /book-context/{id}/generatefor a small book (e.g. Ruth) and confirm AI-generated participants stay within the enum and may emitunnamedwhere appropriate.Rollout work added in response to the review (audit, migration, LLM resilience, scope expansion) — see follow-up comment.