Skip to content

ENG-37: type participant enum + unnamed in AI prompt + backfill#80

Open
henokteixeira wants to merge 12 commits into
mainfrom
henok/eng-37-participant-type-enum-v2
Open

ENG-37: type participant enum + unnamed in AI prompt + backfill#80
henokteixeira wants to merge 12 commits into
mainfrom
henok/eng-37-participant-type-enum-v2

Conversation

@henokteixeira

@henokteixeira henokteixeira commented May 20, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Linear: https://linear.app/shema-obt/issue/ENG-37
  • Counterpart of meaning-map-ui#24 (frontend dropdown).
  • Introduces ParticipantType (StrEnum: named | unnamed | group | divine | role) following the project's enum convention (BCDStatus, EntryProvenance, Testament, RagNamespace).
  • Types both BCDParticipantEntry.type (API) and ParticipantSchema.type (LLM output schema) with the enum. Because ParticipantSchema is passed to call_llm via LangChain.with_structured_output(...), the model is now schema-forced to emit only valid values rather than relying solely on the prompt.
  • Updates the participant generation prompt to also teach the AI about unnamed for anonymous individuals (e.g. "the man's servant", "a woman from Bethlehem").

Supersedes #53

This PR replaces the previously-conflicting #53. Same intent, same code (cherry-picked from the original branch onto fresh main), with the merge conflict in app/models/book_context.py resolved during cherry-pick — EntryProvenance (added on main in the meantime) and the new ParticipantType enum now coexist as sibling StrEnum classes.

Test plan

  • uv run pytest tests/ — all passing.
  • uv run ruff check . clean.
  • uv run ruff format --check . clean.
  • uv run mypy app/ clean.
  • Boot the backend, create a BCD, edit type via the UI to each of the 5 values, confirm round-trip on GET /book-context/{id}.
  • Submit a type outside the enum via API — should get a 422 (enum validation rejecting it).
  • Trigger a fresh POST /book-context/{id}/generate for a small book (e.g. Ruth) and confirm AI-generated participants stay within the enum and may emit unnamed where appropriate.

Rollout work added in response to the review (audit, migration, LLM resilience, scope expansion) — see follow-up comment.

Defines the canonical 5-value vocabulary (named, unnamed, group, divine,
role) for participant naming categories and types BCDParticipantEntry.type
against it.

ENG-37
ParticipantSchema is the output schema passed to call_llm via
LangChain.with_structured_output, so typing it with the enum forces the
model to emit only valid values via structured output rather than relying
solely on the prompt.

ENG-37
Adds 'unnamed' to the type list in Source A with semantic guidance so
the AI also classifies anonymous individuals (e.g. "the man's servant",
"a woman from Bethlehem"). The prompt focuses on when to use each value;
the schema enforces the value set.

ENG-37
…t types

Adds backfill_participant_enums(db, dry_run) which iterates over BCDs
with a participant_register, normalises any 'type' value outside the
enum to 'named' (with a WARN log) and uses flag_modified so SQLAlchemy
persists the JSON change. Idempotent and safe to re-run.

The thin scripts/backfill_participant_enums.py wrapper exposes a
--dry-run flag, following the pattern of scripts/backfill_split_index.py.

This is required before deploying the enum-typed BCDParticipantEntry,
otherwise GETs on legacy BCDs with out-of-enum 'type' would 500.

ENG-37
Removed for now; we'll likely re-introduce normalization as an Alembic
migration instead. The enum-typed schema still rejects out-of-enum values
on read, so any rollout will need a follow-up before deploy.

ENG-37
Ruff's isort rule now requires combining same-module plain imports
onto a single line. Merges EntryProvenance and ParticipantType into
one statement; aliased imports stay on separate lines.
@henokteixeira henokteixeira self-assigned this May 22, 2026

@joaocarvoli joaocarvoli left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking on this one. The direction is right (structured types over free strings), but the rollout strategy is missing the prerequisite work and will regress a previously-fixed production incident.

Verified regression

I ran the failure case directly against this branch's BCDParticipantEntry:

'location': FAILS -> type
'place':    FAILS -> type
'animal':   FAILS -> type
'object':   FAILS -> type
'concept':  FAILS -> type

All raise pydantic.ValidationError in BCDResponse.model_validate(bcd) at app/services/book_context/enrich_bcd_response.py:17. That call happens on every GET /book-context/{id}. Inline comments below tie this to the specific lines.

What's missing in this PR

  1. Data audit. No evidence of an audit query against production participant_register to enumerate the actual set of type values in the wild. Required before we can know the blast radius:

    SELECT DISTINCT jsonb_array_elements(participant_register)->>'type'
    FROM book_context_documents
    WHERE participant_register IS NOT NULL;
  2. Migration. A backfill was added in 1ec3527 and reverted in 69df5d1. The PR description defers normalization to a future Alembic migration, but that migration is not in this PR. An open PR that's "safe to merge but not to deploy" is one merge away from production.

  3. Tests. PR description says "this PR doesn't add new tests." For a change that introduces a hard validation boundary on existing data, the failure-path test (out-of-enum typeValidationError) is exactly what would have caught this risk in CI.

  4. Manual test plan. 3 of 4 items unchecked — and they're the ones that exercise the actual integration (round-trip via UI, deliberate out-of-enum submission, fresh generation against a real book).

What I need to flip this to approve

  • Audit query result attached to the PR description so we know what's in production.
  • Alembic data migration attached to this PR (idempotent + reversible).
  • A unit test that verifies an out-of-enum type raises a clean error (or, after normalization, gets mapped to named).
  • Prompt fix per the inline comment on participants.py.
  • The unchecked manual test plan items executed and ticked.

UNNAMED = "unnamed"
GROUP = "group"
DIVINE = "divine"
ROLE = "role"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BLOCKER — the enum is missing values that exist in production.

Git history on this exact field:

Commit Date Change
40dffee Mar 26 2026 Added "location" to the Literal — "BCD generation produced location-type participants, Pydantic validation failed with a 500 error, also causing downstream 504 timeouts on the frontend"
4b45a6d Mar 27 2026 Widened to str — "The LLM sometimes produces unexpected type values like 'location' instead of the strict … enum. This caused a Pydantic validation error (500) when fetching the BCD"

The new enum named | unnamed | group | divine | role does not include location, which was confirmed in production. Other ad-hoc values may exist that we haven't enumerated.

This PR cannot ship until we either (a) include location (and whatever else the audit query turns up) in the enum, or (b) ship the data migration alongside this PR so the data matches the constraint at deploy time. Doing neither — which is what's proposed — guarantees a recurrence of the prior incident.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Audit ran — production has no values outside named | group | divine | role anymore. The location incident from 40dffee/4b45a6d is dead data at this point (cleaned up at some point between then and now). Numbers in the follow-up comment.

Migration 20260522_0001 is in the PR now and will normalize anything that appears between merge and deploy. Enum unchanged: named | unnamed | group | divine | role.

english_gloss: str = ""
entity_type: str = "person"
type: str = "named"
type: ParticipantType = ParticipantType.NAMED

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is the failure point. BCDResponse.model_validate(bcd) (called for every GET on a BCD at app/services/book_context/enrich_bcd_response.py:17) parses participant_register through list[BCDParticipantEntry], which now enforces the enum on every row.

I verified the regression directly on this branch:

from app.models.book_context import BCDParticipantEntry
BCDParticipantEntry.model_validate({
    'name': 'X', 'type': 'location',
    'entry_verse': {'chapter': 1, 'verse': 1},
    'role_in_book': 'setting',
})
# pydantic_core._pydantic_core.ValidationError: 1 validation error for BCDParticipantEntry
# type: Input should be 'named', 'unnamed', 'group', 'divine' or 'role'

Note: model_config = {"extra": "allow"} does not help here — it only governs unknown fields, not validation on known fields. The original 4b45a6d commit message read it the same way ("since the data has model_config extra=allow anyway") which was a misread of Pydantic semantics. Worth flagging for the team.

Secondary observation: this model has two str "category" fields — entity_type: str = "person" on line 44 is left as free string. BCDPlace.type: str = "" (sibling model) is also untouched. Either enum all three or comment on why only BCDParticipantEntry.type is constrained — the asymmetry will confuse the next reader.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failure point is resolved by the new 20260522_0001 migration (normalizes any out-of-enum value before deploy, with _legacy_<field> for reversibility).

Asymmetry observation: addressed in this round. BCDParticipantEntry.entity_type is now ParticipantEntityType (person | person_common | ambiguous) and BCDPlace.type is now PlaceType (17 values calibrated from the audit). Both go through the same migration. The extra=allow ≠ field validation point is well taken — kept the comment in mind while writing the migration so it specifically rewrites the typed field rather than relying on extras.

english_gloss: str = ""
entity_type: str = "person"
type: str = "named"
type: ParticipantType = ParticipantType.NAMED

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description claims "the model is now schema-forced to emit only valid values rather than relying solely on the prompt." Worth being precise about what that buys us in practice.

The LLM here is ChatGoogleGenerativeAI (Gemini) per app/services/book_context/generation/llm.py:7. Gemini's structured output via LangChain enforces JSON Schema, but its enum-constraint enforcement is softer than OpenAI's strict mode — the model can still emit out-of-enum strings under load. When that happens, LangChain's with_structured_output raises rather than coercing, so await call_llm(...) throws and the entire participant generation step fails instead of producing partial output.

That's a separate failure mode from the legacy-data one and deserves a guard. Suggest:

# in participants.py around line 154
try:
    result = await call_llm(prompt, output_schema=ParticipantRegisterSchema)
except ValidationError as e:
    logger.warning("LLM emitted out-of-enum participant type, normalizing to 'named': %s", e)
    # re-parse with type coerced to ParticipantType.NAMED for any failing rows
    ...

Without this, a single bad row in the LLM response kills the whole generation. The current free-string field is what lets the pipeline survive minor LLM drift; enum-ifying it removes that resilience.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — and the exception type is OutputParserException (subclass of ValueError), not ValidationError. The guard I added catches both: except (ValidationError, OutputParserException) as exc.

Applied in four places, not just _generate_batch, because BCDPlace is enum-typed too now and ContextSectionsSchema nests it:

  • _generate_batch (participants)
  • _generate_place_batch (places)
  • generate_context_sections small path
  • generate_context_sections large path (ContextSectionsBatchSchema)

Fallback re-issues call_llm(..., output_schema=None), parses the raw payload via _parse_and_normalize_*, and coerces invalid type/entity_type values back into the enum with _legacy_<field> preserved. Tests in tests/test_participant_llm_resilience.py and tests/test_place_llm_resilience.py exercise both exception paths and the small-path bundled fallback.

Persia when used as country/region references.
- "unnamed" — for a person entity referenced anonymously in the narrative \
(e.g., "the man's servant", "a woman from Bethlehem"). Rare for proper-noun \
sources but possible when the BHSA entity stands for an unnamed individual.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Semantic problem with where this is placed.

This guidance lives under Source A — Person Entities (proper nouns) (see the header at line ~62 and "name: copy EXACTLY from the entity data" at the top of this block). unnamed is by definition not a proper noun. The examples given in the PR description — "the man's servant", "a woman from Bethlehem" — are common-noun descriptors, which belong under Source B — Common Noun Groups, Roles, and Officials.

As written, the prompt asks the AI to apply unnamed to BHSA proper-noun entities, which contradicts Source A's premise ("FOR EACH person entity in the Person Entities list … copy EXACTLY from the entity data" — those entities ARE named, that's why BHSA flags them as proper-noun person entities).

Two possible fixes:

  1. Restructure Source A as "Individuals" with named vs unnamed sub-options, and clarify which BHSA nametype values map to unnamed (if any).
  2. Move unnamed to Source B with examples that match BHSA common-noun lemmas (e.g., "man's servant" → some BHSA common noun for servant).

Without one of these, the AI is being given contradictory guidance: "these are proper-noun entities, but classify some as unnamed."

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed per your option 2. unnamed moved from Source A to Source B as category (c) UNNAMED INDIVIDUAL, with examples that now match BHSA common-noun lemmas:

"the man's servant", "a woman from Bethlehem", "the foreman of the reapers". Use when the BHSA common-noun lemma stands for a SINGLE individual whose identity is narratively important but who is never named.

Source A's type list now explicitly states it produces only named | group | divine and notes that unnamed | role are reserved for Source B — closes the contradiction the AI was being asked to navigate. Source B's category (c) is distinct from (a) collective groups and (b) institutional roles, so the LLM has a clean three-way choice for common nouns.

henokteixeira and others added 4 commits May 22, 2026 14:36
Adds two new StrEnums alongside the existing ParticipantType:

- ParticipantEntityType (person | person_common | ambiguous) — typed
  BCDParticipantEntry.entity_type and ParticipantSchema.entity_type so
  the LLM structured_output emits only valid values and reads validate
  the same way.
- PlaceType (city | country | region | district | empire | village |
  mountain | valley | river | well | field | road | gate | tower |
  wall | structure | other) — typed BCDPlace.type. The 17 values come
  from a production audit; values outside this set will be normalized
  by a follow-up migration.

The enum sets were calibrated against actual production data so the
read-path validation no longer regresses the previously-fixed 500/504
incident: every value present in production either maps to an enum
member or will be backfilled by the migration in the next commit.

Updates the two existing place fixtures (Boaz's field) from the legacy
"location" string to the new "field" enum value, which is semantically
more accurate per the BHSA classification.
🤖 Generated with [Nori](https://noriagentic.com)

Co-Authored-By: Nori <contact@tilework.tech>
Normalizes participant_register[*].type, participant_register[*].entity_type
and places[*].type so existing rows match the new StrEnums.

Strategy:
- Any out-of-enum value is replaced with a safe default (named / person
  / other).
- The original value is preserved alongside as `_legacy_<field>` on the
  same JSON object. Pydantic's `extra="allow"` keeps it through reads
  without affecting validation.
- `downgrade()` restores `<field>` from `_legacy_<field>`.

Production audit before writing this migration:
- participant.type:        all values already inside the enum (no-op).
- participant.entity_type: 12 rows with "ambiguous" (now an enum
                          member); 6 rows with "group" (bug) → person.
- places.type:             ~85% covered by the enum; the long tail
                          ("city/empire", "place", "", etc.) → other.

Idempotent: re-running the migration on already-normalized rows is a
no-op because the legacy marker is the trigger, not the field value.
Tests cover normalize, restore, idempotence, and the upgrade/downgrade
round-trip property.
🤖 Generated with [Nori](https://noriagentic.com)

Co-Authored-By: Nori <contact@tilework.tech>
Two related changes to the participant generation node:

1. Resilience guard around `call_llm(prompt, output_schema=...)`.
   LangChain's `with_structured_output` raises `OutputParserException`
   (not `ValidationError`) when the model emits a value outside the
   StrEnum — for Gemini under load this happens occasionally and
   previously killed the entire participant generation step. The
   fallback catches both exception types, re-issues the call without
   a schema, and normalizes invalid `type`/`entity_type` values back
   into the enum via `_parse_and_normalize_participants`, preserving
   the offending value as `_legacy_<field>` for visibility.

2. Prompt fix per review on #80. The `unnamed` type belongs in Source
   B (common-noun candidates), not Source A (proper-noun entities) —
   the original placement contradicted Source A's premise. Source A
   now explicitly lists only `named | group | divine` and notes that
   `unnamed | role` are reserved for Source B. Source B gains a third
   category (c) for unnamed individuals with examples that match
   BHSA common-noun lemmas.
🤖 Generated with [Nori](https://noriagentic.com)

Co-Authored-By: Nori <contact@tilework.tech>
Same exception-type fix applied in three places where `call_llm` is
issued with a schema that nests `BCDPlace`/`PlaceSchema`:

- `_generate_place_batch` (PlacesRegisterSchema)
- `generate_context_sections` small path (ContextSectionsSchema)
- `generate_context_sections` large path (ContextSectionsBatchSchema)

Without the guard, a single out-of-enum `places[*].type` from the LLM
would not just lose the one bad row — it would kill the entire
context_sections step (theological_spine, objects, institutions,
genre_context, maintenance_notes all included). This is structurally
worse than the participant case because more fields are bundled.

The small path now falls back to a raw call + `_parse_and_normalize_
context_sections` which preserves the other fields and normalizes
invalid place types to `other` with `_legacy_type`. The large path
treats the no-places bundle the same way; place batches were already
isolated and use `_parse_and_normalize_places` on fallback.
🤖 Generated with [Nori](https://noriagentic.com)

Co-Authored-By: Nori <contact@tilework.tech>
@henokteixeira

Copy link
Copy Markdown
Contributor Author

Review follow-up — rollout work + scope expansion

Thanks @joaocarvoli for the detailed review. Addressing each point:

1. Production audit (commit 43ce4d1)

Ran the three queries against the Neon production DB before deciding the enum sets.

participant_register[*].type — all production values already inside the proposed enum (named=1271, group=96, divine=24, role=9). Migration is a no-op on this field. location and the other concerning values from the historical incident no longer appear — they were already cleaned up.

participant_register[*].entity_typeperson (1365), person_common (17), ambiguous (12), group (6). ambiguous is a legitimate BHSA classification state → kept as an enum member. group (6 rows) is a conflation bug → backfill normalizes to person.

places[*].type — 21 distinct values across ~367 entries. The 17-value PlaceType covers ~85% directly (city 217, region 32, district 12, gate 12, tower 11, country 6, empire 6, village 4, well 4, mountain 4, river 3, valley/field/road/structure 1 each, plus wall, other). The long tail (place 17, composite slashes like city/empire 6, hill/district 4, mountain/city 2, plus single-occurrence oddities) is normalized to other with _legacy_type preserving the original.

2. Backfill migration (commit 02863e7)

alembic/versions/20260522_0001_normalize_bcd_enum_fields.py. Idempotent + reversible: any out-of-enum value is replaced with a safe default and the original is preserved as _legacy_<field> on the same JSON object (Pydantic's extra=allow keeps it through reads). downgrade() restores <field> from the marker. Tests in tests/test_normalize_bcd_enum_fields_migration.py cover normalize, restore, idempotence, and the upgrade/downgrade round-trip.

3. LLM resilience + scope expansion (commit c116423, commit edb3573)

The resilience guard you suggested — but the exception type matters. LangChain's with_structured_output raises OutputParserException (not ValidationError) when Gemini emits an out-of-enum value, so the guard catches both. Applied in four places:

  • _generate_batch (participants)
  • _generate_place_batch (places)
  • generate_context_sections small path (was even more catastrophic — would have lost theological_spine + objects + institutions + genre_context + maintenance_notes together)
  • generate_context_sections large path (ContextSectionsBatchSchema)

Each fallback re-issues the call without a schema, parses the raw output, and normalizes invalid values back into the enum, preserving the offending value as _legacy_<field> for visibility.

This also addresses the asymmetry observation in your inline comment on line 45: BCDParticipantEntry.entity_type (ParticipantEntityType: person | person_common | ambiguous) and BCDPlace.type (PlaceType: 17 values) are now enums too, with the migration handling both.

4. Prompt fix (commit c116423)

Per your inline on line 75: moved unnamed from Source A to Source B as category (c) — UNNAMED INDIVIDUAL. Source A now explicitly states it produces only named | group | divine (with a note that unnamed | role belong to Source B). Source B's unnamed examples now match BHSA common-noun lemmas (the man's servant, a woman from Bethlehem, the foreman of the reapers).

5. Test plan

  • uv run pytest tests/728 passing, 2 skipped (baseline 633 + 95 new across 4 test files).
  • ruff check, ruff format --check, mypy app/ — all clean.
  • The 3 manual items remain pending — will tick after running against a live backend.

Rollout order

Migration is now in this PR, so deploy in two steps:

  1. uv run alembic upgrade head against production — this is read-tolerant (legacy data still validates after this because of the _legacy_<field> marker pattern + extra=allow).
  2. Deploy the application code. Every BCD row is guaranteed to validate against the new enums.

_legacy_<field> markers persist in the JSON until the row is next edited via update_section (which overwrites the column with the UI payload). If you need to audit downstream consumers of those markers, do it before the next bulk edit cycle.

@henokteixeira henokteixeira requested a review from joaocarvoli May 22, 2026 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants