[FEATURE](pyspark) PySpark validation generated from the Pydantic schema by sethfitz · Pull Request #518 · OvertureMaps/schema

Seth Fitzsimmons (sethfitz) · 2026-05-11T23:10:22Z

Summary

Adds a new runtime package (overture-schema-pyspark) plus a new output target in overture-schema-codegen that emits PySpark validation expressions and conformance tests from the same Pydantic models that define the schema. Ships an overture-validate CLI (to be merged with the overture-schema CLI later; that will be an opportunity to improve its ergonomics) for running the validation against Parquet on disk or in S3.

PySpark plugs in as a peer of the existing Markdown output target: same ModelSpec extraction, same four-layer architecture (Discovery -> Extraction -> Output Layout -> Rendering), new pipeline module. The codegen abstraction is model-general (ModelSpec = RecordSpec | UnionSpec: one record that compiles to a Spark StructType, or a tagged union of records); in this schema every model is an Overture feature, so the generated modules are organized by theme/feature. See packages/overture-schema-codegen/docs/design.md for the full picture; the "PySpark Pipeline" section there covers the new stages in detail.

What's in the PR

packages/overture-schema-pyspark/ -- runtime. Public API in validate.py (validate_model, explain_errors), schema comparison in schema_check.py, dataclasses in check.py, the overture-validate CLI in cli.py, and shared expression building blocks in expressions/{constraint_expressions,column_patterns,_schema_structs}.py. The per-feature expression modules under expressions/generated/overture/schema/<theme>/<feature>.py and per-feature conformance tests under tests/generated/overture/schema/<theme>/test_<feature>.py are emitted by codegen and confined to a generated/ boundary that make generate-pyspark wipes and recreates. _registry.py walks that tree at import time and exposes REGISTRY: dict[str, ModelValidation] keyed by each module's ENTRY_POINT value (e.g. "overture.schema.transportation:Segment"), reading each module's emitted MODEL_VALIDATION constant, plus a parallel PARTITION_MAP derived from each module's PARTITIONS dict for partitioned features.

Validation degrades gracefully when columns are skipped (--skip-columns / --suppress) or structurally absent from the data. Every check declares the top-level columns it reads, and a check over a missing column -- top-level, nested inside an array or map (sources[].confidence resolves to sources), or model-level -- is dropped before Spark planning rather than raising a raw AnalysisException. ValidationResult.absent_columns surfaces the columns dropped for this reason, and the CLI backstop classifies any residual planning error through the structured PySpark error API so a generator bug propagates as a traceback instead of being mistaken for a missing column.
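A minimal, Spark-free sketch of the dropping rule described above, assuming each check carries the set of top-level columns it reads. The `Check` dataclass, `partition_checks`, and `absent_columns` helpers here are hypothetical stand-ins for illustration, not the package's actual API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Hypothetical stand-in for the runtime's check dataclass."""

    field: str                    # dotted path, e.g. "sources[].confidence"
    name: str                     # check name, e.g. "bounds"
    read_columns: frozenset       # top-level columns the expression reads


def partition_checks(checks, data_columns, skip_columns=()):
    """Split checks into (runnable, dropped) before any Spark planning."""
    available = set(data_columns) - set(skip_columns)
    runnable, dropped = [], []
    for check in checks:
        # A check is runnable only if every column it reads is available.
        target = runnable if check.read_columns <= available else dropped
        target.append(check)
    return runnable, dropped


def absent_columns(checks, data_columns):
    """Columns that some check reads but the data does not contain."""
    needed = set()
    for check in checks:
        needed |= check.read_columns
    return sorted(needed - set(data_columns))


checks = [
    Check("id", "required", frozenset({"id"})),
    # A nested path resolves to its top-level column, "sources".
    Check("sources[].confidence", "bounds", frozenset({"sources"})),
    Check("names.common", "map_keys", frozenset({"names"})),
]
runnable, dropped = partition_checks(checks, data_columns=["id", "names"])
```

With `sources` missing from the data, only the `sources[].confidence` check is dropped, and `absent_columns` reports `sources` so the caller can surface it.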

This has been tested to work with Spark versions 3.4.0 - 4.1.1 (and the lowest-direct CI check will verify that it continues to work with the lowest declared PySpark version, which is currently 3.4).

Adam Lastowka (@Rachmanin0xFF), the public API has changed since the previous version you tested (the entry point is now validate_model, which returns a ValidationResult, in an attempt to simplify how it gets integrated with the CDP). If you point me to its current state, I can propose an update that uses the new API.

packages/overture-schema-codegen/src/overture/schema/codegen/pyspark/ -- new output target. Pipeline stages:

ModelSpec (from extraction)
    |
constraint_dispatch.py    constraints -> ExpressionDescriptor / ModelConstraintDescriptor
    |
check_builder.py          FieldShape tree -> Check / ModelCheck IR
                          (check_ir.py; resolves array nesting, map key/value
                          projection, variant gating via
                          FieldPath = ScalarPath | ArrayPath | MapPath and Guard
                          sum types)
schema_builder.py         FieldShape tree -> SchemaField list (StructType source)
test_data/                ModelSpec -> BASE_ROW_SPARSE / BASE_ROW_POPULATED, scaffold,
                          invalid_value
    |
renderer.py               Check IR -> per-feature expression module
test_renderer.py          Check IR -> per-feature conformance test module
    |
pipeline.py               orchestrates, returns GeneratedModule list

make generate-pyspark wipes both generated/ trees and recreates them; make check gates on regeneration being current.

What's covered

Every constraint Pydantic enforces today is dispatched to a PySpark expression, which allows Spark to push down predicates and take advantage of Catalyst to avoid unnecessary deserialization:

Field constraints: Ge / Gt / Le / Lt / Interval, ArrayMinLen / ArrayMaxLen / ScalarMinLen / ScalarMaxLen (typed length constraints split by attachment layer), StrippedConstraint, PatternConstraint, UniqueItemsConstraint, GeometryTypeConstraint, JsonPointerConstraint.
Map key/value constraints: dict-typed fields (e.g. names.common, a dict[LanguageTag, StrippedString] present on most named features) have their per-key and per-value constraints validated by projecting map_keys / map_values and applying the same field checks. A map value that is a sub-model is descended into, validating its field and model-level constraints. dict[K, Any] maps (e.g. Infrastructure.source_tags) carry no constraint and emit no checks; container-valued map values (a list or map inside a map value) are not yet supported and raise at generation time rather than emit an unanchored check.
NewType overrides: LinearlyReferencedRange (length / bounds / order).
Base-type overrides: HttpUrl (format + length), EmailStr, BBox (completeness, lat ordering, lat range).
Model constraints: RequireAnyOfConstraint, RadioGroupConstraint, RequireIfConstraint, ForbidIfConstraint, MinFieldsSetConstraint. MinFieldsSetConstraint mirrors Pydantic's model_fields_set semantics: required fields are always set (the constructor enforces them) and count toward the threshold alongside any explicitly-set optional fields. NoExtraFieldsConstraint is intentionally skipped, as its behavior is implicit when using DataFrames.
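The MinFieldsSetConstraint counting rule can be sketched in plain Python. This assumes that, for tabular data, an optional field counts as "set" when its value is not None (the same proxy the description uses for require_any_of); the function name and row shape are hypothetical:

```python
def min_fields_set_ok(row, required, optional, minimum):
    """Return True when enough fields count as set for a tabular row.

    Assumption: every required field always counts toward the threshold;
    an optional field counts only when its value is not None.
    """
    optional_set = sum(1 for name in optional if row.get(name) is not None)
    return len(required) + optional_set >= minimum


row = {"id": "a", "height": None, "name": "x"}
passes_two = min_fields_set_ok(row, ["id"], ["height", "name"], minimum=2)
passes_three = min_fields_set_ok(row, ["id"], ["height", "name"], minimum=3)
```

Here `id` (required) and `name` (non-null optional) count, while the null `height` does not, so a threshold of 2 is met and a threshold of 3 is not.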

Known semantic gaps / differences

Divergences from Pydantic:

UniqueItemsConstraint uses Spark's array_distinct, which compares whole elements with structural equality on raw stored values. Pydantic compares normalized Python objects -- e.g., list[HttpUrl] is compared after URL normalization (such as ensuring that URLs like http://example.com contain a trailing /: http://example.com/). The PySpark check catches exact duplicates, not duplicates after normalization.
require_any_of checks isNotNull as a proxy for Pydantic's model_fields_set. Parquet has no equivalent of "explicitly provided" because of the tabular nature of DataFrames; isNotNull is stricter (it rejects fields explicitly set to null).
BBox's PySpark expressions include checking latitude ordering and range. Longitude is skipped due to undefined behavior around how we handle features that cross the antimeridian.
Regex checks run on the JVM, so a PatternConstraint is evaluated with Java's regex dialect rather than Python's -- the two agree for the schema's patterns, but ASCII-vs-Unicode shorthand classes (\w, \d) and interior-newline handling differ in general.
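To illustrate the shorthand-class difference from the Python side only: for str patterns, Python's `\d` matches any Unicode decimal digit by default, while `re.ASCII` restricts it to `[0-9]`, which approximates Java's default behavior. A small sketch (the sample character is chosen for illustration):

```python
import re

ARABIC_INDIC_THREE = "\u0663"  # a Unicode decimal digit outside [0-9]

unicode_aware = re.compile(r"\d+")          # Python default for str patterns
ascii_only = re.compile(r"\d+", re.ASCII)   # approximates Java's default \d
explicit_class = re.compile(r"[0-9]+")      # unambiguous in either dialect

python_default_matches = unicode_aware.fullmatch(ARABIC_INDIC_THREE) is not None
ascii_mode_matches = ascii_only.fullmatch(ARABIC_INDIC_THREE) is not None
explicit_matches = explicit_class.fullmatch(ARABIC_INDIC_THREE) is not None
```

Patterns written with explicit character classes such as `[0-9]` sidestep the difference, which is one way a schema's patterns can agree across both engines.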

CLI

$ overture-validate --help
Usage: overture-validate [OPTIONS] FEATURE_TYPE PATH

  Validate Overture data at PATH and write annotated Parquet.

Options:
  -o, --output TEXT            Output path for validated Parquet.
  --head INTEGER               Error rows to display.  [default: 20]
  --conf TEXT                  Spark config key=value pairs.
  --count-only                 Report error count only; skip explain/unpivot.
  --skip-schema-check          Warn on schema mismatches instead of aborting.
  --skip-columns TEXT          Columns declared absent from data; skips their
                               checks.
  --ignore-extra-columns TEXT  Extra data columns to ignore in schema
                               comparison.
  --suppress TEXT              Suppress checks: FIELD (all checks) or
                               FIELD:CHECK (specific).
  --help                       Show this message and exit.

Output is one row per violation: feature ID, theme/type, failing field, check name, message, offending value.
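The "one row per violation" shape can be sketched without Spark. This assumes each input row carries a list of per-check results aligned with a list of (field, check) labels, where None means the check passed; the `explain` function and the row layout are hypothetical, not the package's explain_errors:

```python
def explain(rows, check_labels):
    """Turn per-row check results into one record per violation."""
    violations = []
    for row in rows:
        for (field, check), message in zip(check_labels, row["errors"]):
            if message is not None:
                violations.append(
                    {"id": row["id"], "field": field, "check": check, "message": message}
                )
    return violations


check_labels = [("id", "required"), ("sources[].confidence", "bounds")]
rows = [
    {"id": "a", "errors": [None, None]},                    # clean row
    {"id": "b", "errors": [None, "confidence out of range"]},
    {"id": "c", "errors": ["id is null", "confidence out of range"]},
]
violations = explain(rows, check_labels)
```

A clean row contributes nothing, and a row that fails two checks contributes two records.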

Testing

overture-schema-codegen generates conformance tests alongside the PySpark expressions as a sanity-check by evaluating expected-good and expected-bad rows within a single DataFrame (setup/teardown overhead for DataFrame-per-test-case was much too high). Each feature gets two baseline rows -- BASE_ROW_SPARSE (required fields only) and BASE_ROW_POPULATED (every optional field filled in) -- and every scenario is exercised against both, so checks that depend on optional structure being present are covered as well as the minimal-row case.
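A rough sketch of the single-batch idea, assuming every scenario is applied to both baseline rows and tagged with a deterministic ID so results can be mapped back after one validation pass. The namespace, row contents, scenario names, and the `build_batch` helper are all hypothetical:

```python
import copy
import uuid

# Hypothetical namespace; any fixed UUID makes the row IDs reproducible.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "conformance.example")

BASE_ROW_SPARSE = {"id": None, "version": 0}                 # required fields only
BASE_ROW_POPULATED = {"id": None, "version": 0, "level": 1}  # optional fields filled in


def build_batch(scenarios):
    """Return (rows, index); index maps row ID -> (scenario name, base name)."""
    rows, index = [], {}
    bases = (("sparse", BASE_ROW_SPARSE), ("populated", BASE_ROW_POPULATED))
    for name, mutate in scenarios.items():
        for base_name, base in bases:
            row = copy.deepcopy(base)
            mutate(row)
            row["id"] = str(uuid.uuid5(NAMESPACE, f"{name}/{base_name}"))
            rows.append(row)
            index[row["id"]] = (name, base_name)
    return rows, index


scenarios = {
    "valid": lambda row: None,
    "negative_version": lambda row: row.update(version=-1),
}
rows, index = build_batch(scenarios)

# Stand-in for a single validation pass over the whole batch.
failing_ids = {row["id"] for row in rows if row["version"] < 0}
failed_scenarios = {index[row_id] for row_id in failing_ids}
```

Because the IDs are derived from the scenario and base names, rebuilding the batch yields identical rows, which keeps failures traceable across runs.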

make check generates the tests and runs the full suite over everything; the generated tests use PySpark and increase test runtime again (they run in just over a minute for me). testmon was introduced to improve the developer experience by skipping tests that weren't affected by recent edits.

Beyond the unit and conformance tests:

# Local Parquet
overture-validate segment samples/segment.parquet --count-only

# Real release prefix (expect bbox Float-vs-Double mismatch -> use --skip-schema-check)
# partitions (theme=places, type=place) will automatically be detected
# note that the s3a protocol must be used when AWS-provided Hadoop filesystem JARs are not available
overture-validate place s3a://overturemaps-us-west-2/release/<release>/ --skip-schema-check --head 50

Notes for review

This PR is intentionally large because the generated tree is large. The interesting surface is smaller: pyspark/{constraint_dispatch,check_builder,check_ir,schema_builder,renderer,test_renderer,pipeline}.py plus the path IR in overture-schema-system's field_path.py and the runtime in overture-schema-pyspark/src/overture/schema/pyspark/. Everything under generated/ is regenerable output -- review the codegen, but skim the output to understand the shape of what's produced.
Bundles supporting changes (testmon for incremental test runs, VehicleSelectorBase extraction, Java 17 CI pin, build/CI hardening so test-all runs unconditionally and the check job triggers on the Makefile/workflow themselves, typing-extensions declared) rather than splitting them into separate PRs.
The branch has been squashed to a single commit; expect a force-push. The review/refactoring/hardening pass is complete.

Closes #517.

github-actions · 2026-05-11T23:12:12Z

🗺️ Schema reference docs preview is live!


🌍 Preview	https://staging.overturemaps.org/schema/pr/518/schema/index.html
🕐 Updated	Jun 24, 2026 03:51 UTC
📝 Commit	1c21815
🔧 env SCHEMA_PREVIEW	`true`

Note

♻️ This preview updates automatically with each push to this PR.

pytest-testmon tracks which tests cover which source files and skips unaffected tests on subsequent runs. Activated via a TESTMON Makefile variable so the default `make check` uses incremental selection while `make check TESTMON=` runs the full suite. Lock the dependency in the dev group, gitignore the local cache file, and thread $(TESTMON) through the test, test-all, and test-only targets. Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Pull the shared `dimension` and `comparison` fields of the five vehicle selector subtypes into a `VehicleSelectorBase` parent, and thread `discriminator="dimension"` through the `VehicleSelector` annotated union. The discriminator turns the union into a Pydantic discriminated union, so it serializes as JSON Schema's `oneOf` + `discriminator` rather than `anyOf`. Regenerated segment_baseline_schema.json captures the new shape. This is a prerequisite for downstream tooling that walks discriminated unions structurally (e.g. PySpark codegen for segment's nested vehicle scoping). Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Replace the Tonga-based Division/DivisionArea/DivisionBoundary fixtures with Kauaʻi County samples that exercise admin_level, capital_division_ids, wikidata, and source license alongside the existing fields. Replace the Tonga-based Connector/Segment fixtures with a Vermooten Street junction in Pretoria that exercises access_restrictions with when.vehicle, speed_limits with when.heading, routes with ref, road_surface, and multi-source attribution. Reformat the TOML with 4-space indents and sorted keys to match sibling theme packages. Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Introduce overture-schema-pyspark, a runtime PySpark validation package whose per-feature expression modules and conformance tests are generated from the same Pydantic models that define the schema, along with an `overture-validate` CLI.

Runtime (overture-schema-pyspark/src/overture/schema/pyspark/):

- check.py -- Check, CheckShape, FeatureValidation dataclasses.
- schema_check.py -- write-first comparison of Spark schemas against an expected StructType, with structural type matching and SchemaMismatch reporting.
- validate.py -- public API: validate_feature(), evaluate_checks(), explain_errors(). The explain stage UNPIVOTs per-row check results into one row per violation, preserving all input columns for downstream join-back.
- cli.py -- `overture-validate <parquet-or-directory>` runs the validation pipeline against a path of GeoParquet files. Output is one row per violation: feature ID, theme/type, failing field, check name, offending value. Single-pass evaluation keeps memory bounded for arbitrarily large inputs.
- expressions/ -- shared runtime utilities (constraint_expressions, column_patterns, _schema_structs). Per-feature expression modules live under expressions/overture/ and are added by the codegen in a follow-up commit.
- tests/_support/ -- conformance test infrastructure (scenarios, harness, helpers, mutations). The harness builds one DataFrame per feature, applies all scenarios as deterministic-UUID-tagged rows, runs validation once, and indexes violations back to scenario IDs -- O(checks) rather than O(checks * scenarios).

CLI filtering options:

  --theme <theme>        limit to one theme
  --feature <feature>    limit to one feature type
  --skip-schema-check    run only constraint checks (no schema comparison)
  --count-only           print violation counts per check rather than rows
  --suppress <key>       suppress specific (feature, field, check) triples
                         per a YAML config

Codegen pipeline (overture-schema-codegen/src/.../pyspark/):

FeatureSpec
    |
constraint_dispatch.py    map constraints to descriptors
    |
check_builder.py          walk FieldSpec -> CheckNode IR; resolve array nesting,
                          variant gating
    |
schema_builder.py         FieldSpec -> SchemaField list (StructType source)
    |
renderer.py               CheckNode -> per-feature expression module
test_renderer.py          CheckNode -> per-feature conformance test module
synthetic.py              FeatureSpec -> BASE_ROW + invalid values
    |
pipeline.py               orchestrate, return GeneratedModule list

The dispatch tables map every supported constraint (Ge/Gt/Le/Lt/Interval, MinLen/MaxLen, StrippedConstraint, PatternConstraint, UniqueItemsConstraint, GeometryTypeConstraint, JsonPointerConstraint, RequireAnyOfConstraint, RadioGroupConstraint, RequireIfConstraint, ForbidIfConstraint, MinFieldsSetConstraint), NewType (CountryCodeAlpha2, LinearlyReferencedRange, RegionCode), and base type (HttpUrl, EmailStr) to constraint_expressions check functions.

Discriminated unions (segment is the canonical hard case) split into per-arm test files. The codegen handles arm splitting via generate_arm_rows in synthetic.py and _filter_field_nodes_for_arm in test_renderer.py.

The Makefile gains a `generate-pyspark` target and gates `check` on it so a stale generation surfaces immediately. The CLI is exposed as a `[project.scripts]` entry point so `overture-validate` becomes available after `pip install` / `uv sync`.

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Generate PySpark expressions (and tests) for models defined in the workspace Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

PySpark 3.4 (the declared floor) doesn't run on Java 21, the default JDK on ubuntu-latest runners -- it hits NoSuchMethodException on java.nio.DirectByteBuffer.<init>(long, int), removed in JDK 21. Pin the lowest-direct cell to Java 17 so the resolved pyspark==3.4.0 can actually start. The default cell (which resolves to a current pyspark 4.x) keeps the runner's default Java 21. Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

validate_feature built check expressions referencing every column the schema declares, then evaluated them with an eager df.select. When the input DataFrame lacked a declared column, Spark's plan analysis raised an AnalysisException before the caller could inspect the schema mismatch, so a file missing a required column produced a Java stack trace instead of the schema-mismatch report the CLI is built to emit. Columns that compare_schemas reports as absent from the data now have their checks dropped, the same as --skip-columns columns; referencing them is what crashes Spark. The mismatch is still recorded in schema_mismatches, so the CLI reports it and exits cleanly (or, with --skip-schema-check, validates the columns that are present). The CLI also prints the --skip-columns invocation for the absent columns, so the escape hatch is discoverable from the error itself. Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Model-level constraints (require_any_of and the like) generated for a sub-model reached through an optional field fired even when that field was null. Pydantic skips a model validator when the optional sub-model is absent, so the generated PySpark expression produced a false positive the schema itself never raises. ModelCheck now carries a gate: the optional-ancestor path that must be non-null for the constraint to apply. check_builder sets it when the constrained model is reached via an optional struct field inside an array; the renderer wraps the constraint in F.when(<accessor>.isNotNull(), ...). Regenerated Segment expressions: the speed_limits[].when, access_restrictions[].when, and prohibited_transitions[].when require_any_of checks are now skipped when their when sub-model is null. Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
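The gating semantics from this commit can be sketched in plain Python, under the assumption that a null optional sub-model means the constraint does not apply at all. Both functions are hypothetical illustrations of the behavior, not the generated PySpark expressions:

```python
def require_any_of(record, fields):
    """True when at least one listed field is not None."""
    return any(record.get(name) is not None for name in fields)


def gated_require_any_of(parent, gate_field, fields):
    """Apply require_any_of to parent[gate_field] only when it is present.

    Returns None when the optional sub-model is absent, meaning the
    constraint does not apply (neither pass nor fail).
    """
    sub_model = parent.get(gate_field)
    if sub_model is None:
        return None
    return require_any_of(sub_model, fields)


# Hypothetical speed_limits[] elements with an optional "when" sub-model.
no_when = {"max_speed": 50, "when": None}
empty_when = {"max_speed": 50, "when": {"heading": None, "vehicle": None}}
good_when = {"max_speed": 50, "when": {"heading": "forward", "vehicle": None}}

results = [
    gated_require_any_of(element, "when", ["heading", "vehicle"])
    for element in (no_when, empty_when, good_when)
]
```

Without the gate, the first element would be reported as a violation even though Pydantic never runs the sub-model's validator for it.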

Victor Schappert (vcschapp) · 2026-05-27T03:34:20Z

This PR is intentionally large because the generated tree is large. The interesting surface is smaller: pyspark/{constraint_dispatch,check_builder,check_ir,schema_builder,renderer,test_renderer,pipeline}.py plus the runtime in overture-schema-pyspark/src/overture/schema/pyspark/. Everything under generated/ is regenerable output -- review the codegen, but skim the output to understand the shape of what's produced.

This seems reasonable as a basis for review, but for merging, I don't see the upside in committing 15-30 generated files and 15K-30K generated lines to source control. We want to source control the generator, not the generatee, right?

Victor Schappert (vcschapp) · 2026-05-27T03:39:46Z

Divergence from Pydantic no. 1 comment:

UniqueItemsConstraint uses Spark's array_distinct, which compares whole elements with structural equality on raw stored values. Pydantic compares normalized Python objects -- e.g., list[HttpUrl] is compared after URL normalization (such as ensuring that URLs like http://example.com contain a trailing /: http://example.com/). The PySpark check catches exact duplicates, not duplicates after normalization.

Basically we're saying PySpark will slightly under-count duplicates in some rare edge cases. While annoying, this doesn't seem like it's a very big deal.
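A small sketch of that edge case, comparing exact-value duplicate detection with detection after normalization. The `normalize_url` helper is a hypothetical, deliberately minimal normalizer (it only adds a trailing `/` to an empty path) and does not reproduce Pydantic's HttpUrl behavior in full:

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Hypothetical normalizer: add '/' when an http(s) URL has an empty path."""
    parts = urlsplit(url)
    if parts.scheme in ("http", "https") and parts.path == "":
        parts = parts._replace(path="/")
    return urlunsplit(parts)


def has_exact_duplicates(items):
    """Duplicate detection on raw stored values."""
    return len(set(items)) != len(items)


def has_normalized_duplicates(items):
    """Duplicate detection after normalizing each value."""
    return has_exact_duplicates([normalize_url(item) for item in items])


urls = ["http://example.com", "http://example.com/"]
exact = has_exact_duplicates(urls)
normalized = has_normalized_duplicates(urls)
```

The two raw strings differ, so an exact comparison sees no duplicate, while the normalized comparison does; this is the under-count being discussed.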

Victor Schappert (vcschapp) · 2026-05-27T03:55:14Z

Divergence from Pydantic no. 3 question:

BBox's PySpark expressions include checking latitude ordering and range. Longitude is skipped due to undefined behavior around how we handle features that cross the antimeridian.

Does this mean we have stricter constraints in PySpark than in Pydantic? If yes, shouldn't they be the same?

Victor Schappert (vcschapp) · 2026-05-27T04:00:15Z

Regarding @no_extra_fields:

NoExtraFieldsConstraint is intentionally skipped, as its behavior is implicit when using DataFrames.

This is true for struct-typed columns, but is it also true for a top-level DataFrame? Suppose I have a model like:

@no_extra_fields
class Foo(BaseModel):
    bar: int

And a dataframe like:

bar	baz	qux
42	"garply"	3.14

Wouldn't the baz and qux columns sneak through without causing a violation?

Victor Schappert (vcschapp) · 2026-05-27T04:22:11Z

Divergence from Pydantic no. 2 comment:

require_any_of checks isNotNull as a proxy for Pydantic's model_fields_set. Parquet has no equivalent of "explicitly provided" because of the tabular nature of DataFrames; isNotNull is stricter (it rejects fields explicitly set to null).

Let's see if I understand this right.

I have a model like this:

@require_any_of("bar", "baz")
class Foo(BaseModel):
    bar: int | None = None
    baz: str | None = None

I think you're saying that Pydantic would allow Foo(bar=None) because bar is set explicitly but PySpark would reject it...

The latest version of the Pydantic constraint should also reject it... The docs say:

To ensure parity between Python and JSON Schema validation, a field's value must be explicitly set to a non-None value to satisfy the constraint. Fields whose value was set by Pydantic using a default value violate the constraint, as do fields explicitly set to None.

That seems to be the current behavior:

>>> from pydantic import BaseModel
>>> from overture.schema.system.model_constraint import require_any_of
>>> @require_any_of("bar", "baz")
... class Foo(BaseModel):
...     bar: int | None = None
...     baz: str | None = None
... 
>>> Foo(bar=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ANT.AMAZON.COM/schapper/overture/schema/.venv/lib/python3.10/site-packages/pydantic/main.py", line 250, in __init__
    validated_self = self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Foo
  Value error, at least one of these fields must be set to a value other than None, but none are: bar, baz (`@require_any_of`) [type=value_error, input_value={'bar': None}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.12/v/value_error

Am I missing something, or did you actually achieve perfect parity?

Victor Schappert (vcschapp)

Nice work on this.

It's a huge PR, but I think its success is best captured by this comment:

Every constraint Pydantic enforces today is dispatched to a PySpark expression, which allows Spark to push down predicates and take advantage of Catalyst to avoid unnecessary deserialization ...

That's phenomenal.

A few points to discuss. My biggest item is the committing of generated code; there are also some minor conceptual issues around the use of the word "feature".

Victor Schappert (vcschapp) · 2026-05-27T04:32:33Z

+These are codegen-internal classes -- Pydantic users continue to write
+`Annotated[X, MinLen(n)]` in their schemas; the wrapping happens inside


For my understanding: Does Field(max_length=n) also produce this type of internal constraint, or is that a different code path?

Good catch; that should say Field(max_length=n).

Victor Schappert (vcschapp) · 2026-05-27T04:38:17Z

+{% from '_check_function.py.jinja2' import check_function -%}
+# This file is auto-generated by overture-schema-codegen. Do not edit.
+
+"""{{ feature_title }} validation expression builders."""


Naming question regarding "feature": can this handle any BaseModel that gets pulled in via discovery, or does it have to be a "feature"—and if so, what flavor of feature?

It can handle any BaseModel.

Victor Schappert (vcschapp) · 2026-05-27T04:46:12Z

+FeatureSpec: TypeAlias = ModelSpec | UnionSpec
+"""Top-level feature types passed through the extraction pipeline.
+
+Consumers narrow with `isinstance` when an arm-specific attribute
+is needed (e.g. `UnionSpec.discriminator_field`).
+"""


I didn't realize this before, but this seems to be overloading the term "feature" to mean basically, "a thing that is either derived from BaseModel, or a union of such things"; whereas in system and common we use "feature" to mean a geospatial feature, i.e., a thing with geometry and related attributes.

Can we use some more neutral naming for this type and the things flowing from it?

RootSpec

TopLevelSpec

TargetSpec

True. That was probably in the back of my head because I was thinking about validating Overture Features as its primary purpose even while the implementation was more generic.

None of those names really resonate, but I'll see if I can come up with something that feels less overloaded. Something that properly encompasses ModelSpec | UnionSpec (where UnionSpec is implicitly "union of models"). If we weren't dealing with unions, ModelSpec could be appropriate. PrimarySpec (complementing SupplementarySpec, which encompasses enums, NewTypes, nested models, other types) feels close.

I went with renaming FeatureSpec to ModelSpec (in line with Pydantic conventions) and ModelSpec turned into RecordSpec to distinguish it from UnionSpec.

Seth Fitzsimmons (sethfitz) · 2026-06-17T16:00:14Z

This seems reasonable as a basis for review, but for merging, I don't see the upside in committing 15-30 generated files and 15K-30K generated lines to source control. We want to source control the generator, not the generatee, right?

We agreed in one of the Schema WG meetings to drop the generated code from the PR to merge (once we've reviewed it), as it can be regenerated on demand and will exist in published packages for posterity.

Basically we're saying PySpark will slightly under-count duplicates in some rare edge cases. While annoying, this doesn't seem like it's a very big deal.

Exactly. Agreed.

Does this mean we have stricter constraints in the PySpark than in Pydantic? If yes, shouldn't they be the same?

For BBox, yes. We're going to follow up with Adam Lastowka (@Rachmanin0xFF) and the Data Platform folks to understand what their validation requirements + desires are today (in PySpark) and eventually make sure that PySpark and Pydantic are equivalent.

Wouldn't the baz and qux columns sneak through without causing a violation?

No, these would be rejected when checking the Spark schema (before moving into per-row validation).
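As a rough illustration of that answer, a top-level column comparison could look like the sketch below. It only compares column names; per the PR description, the real schema check compares Spark StructTypes structurally. The `compare_columns` function and its `ignore_extra` parameter (loosely modeled on --ignore-extra-columns) are hypothetical:

```python
def compare_columns(expected, actual, ignore_extra=()):
    """Return (missing, extra) top-level column names, each sorted."""
    missing = sorted(set(expected) - set(actual))
    extra = sorted(set(actual) - set(expected) - set(ignore_extra))
    return missing, extra


# The Foo model declares only "bar"; the DataFrame also has "baz" and "qux".
missing, extra = compare_columns(["bar"], ["bar", "baz", "qux"])
_, extra_ignoring_baz = compare_columns(["bar"], ["bar", "baz", "qux"], ignore_extra=["baz"])
```

Under this sketch, `baz` and `qux` are reported as extra columns before any per-row check runs.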

require_any_of checks isNotNull as a proxy for Pydantic's model_fields_set. Parquet has no equivalent of "explicitly provided" because of the tabular nature of DataFrames; isNotNull is stricter (it rejects fields explicitly set to null).
Am I missing something, or did you actually achieve perfect parity?

I did achieve parity with Pydantic, but wanted to call this out because of the ability to omit values there (and when using JSON).

This lands map key/value validation in the PySpark and Markdown generators, renames the codegen's core "feature" vocabulary to "model", and closes correctness gaps across validation generation, constraint identity, base-row synthesis, and the runtime checks. Map key/value validation (PySpark). Dict-typed fields such as names.common (dict[LanguageTag, StrippedString], present on nearly every feature with names) got no key or value validation, so invalid language-tag keys and unstripped values passed overture-validate while failing Pydantic. A MapPath IR locates a map's keys or values, the check builder descends into MapOf.key/.value, and the renderer emits map_keys_check/map_values_check over F.map_keys/F.map_values, reusing the null-guarded array transform. A map value that is a sub-model is descended into the way a list element is, so its field and model-level constraints generate checks. Shapes with no representable MapPath -- a container nested in a map value, an array-shaped map projection, a map reached through an array -- raise at generation time rather than silently drop a constraint or emit a mis-typed check. Map key/value rendering (Markdown). Map references render every key and value type and their directly-applied constraints. Map sides link to their type pages, container wrappers (a list or map inside a map) survive instead of being peeled, model-valued map values link, and every variant is named rather than collapsing to map<K, ?>. A bare side folds into the surrounding map<...> code span so two adjacent backtick spans cannot corrupt the CommonMark. feature to model vocabulary. 
The codegen generates validation for any Pydantic BaseModel, not only geospatial features, so the root abstraction is now ModelSpec (a RecordSpec, or a tagged UnionSpec of records) and the runtime surface is validate_model, model_keys, model_names, ModelValidation, and the emitted MODEL_VALIDATION constant, in place of validate_feature, feature_keys, feature_names, FeatureValidation, and FEATURE_VALIDATION. The registry walks the new constant name and the module template is renamed to match. The overture-validate CLI argument and the GeoJSON-feature validator keep feature vocabulary, where the geospatial meaning is correct.

Graceful degradation over absent and skipped columns.

Each Check carries one read_columns set, the top-level columns it actually reads, replacing the split root_field/referenced_fields mechanism that left an unresolvable F.col() for a row-root constraint over a struct field and for an in-array constraint branching on a row-root discriminator. Checks whose columns are skipped or structurally absent -- top-level, nested (sources[].confidence resolves to sources), or model-level -- are dropped before Spark instead of crashing with a raw AnalysisException. The CLI backstop classifies the exception through the structured PySpark error API rather than its message text, so a real planning bug propagates as a traceback instead of steering the operator to suppress it with --skip-columns. ValidationResult exposes absent_columns, the same derivation that drives check dropping. evaluate_checks and explain_errors reject input columns colliding with their reserved _err_N, field, check, and message names.

Field-check label disambiguation.

Discriminated unions produced multiple field checks sharing a (field, name) identity -- segment emitted two required checks on access_restrictions[].when.vehicle[].value, and sources[].confidence collided on bounds across many models -- leaving colliding rows indistinguishable in suppression, explain_errors metadata, and the conformance expected_field. Symmetric _N suffixes disambiguate them, computed over the unfiltered check list so per-arm test filtering cannot hide a collision the shared expression module still carries. Label and disambiguated function-name derivation is unified into one flattening shared by the expression and test renderers, which also fixes a per-arm asymmetry where a model-check base label spanning an arm boundary made the per-arm test expect a field the module never emits.

Base-row and test-data value synthesis.

Row synthesis handles Not(FieldEqCondition), used by Division and DivisionBoundary as ~IS_COUNTRY, instead of silently treating it as unsatisfied, and merges every bound constraint on a field before choosing a value so both-exclusive intervals and sibling Gt/Lt constraints are jointly satisfied. forbid_if fill values are typed for non-string scalars (0, 0.0, False) rather than falling back to a string sentinel for a non-string column. One CONSTRAINT_VALUES table pairs the valid and invalid value for each string constraint, replacing two parallel tables that drifted silently. SparkCategory dispatch is exhaustive: an unhandled or binary category raises at generation time instead of emitting a wrong fill value. Raw Field(pattern=) handling fails loud on an uncurated pattern naming the table to update, and requires PydanticMetadata before reading an object's .pattern, restoring the unhandled-constraint TypeError contract.

Constraint identity.

Deduplication compares constraints by value. System FieldConstraint subclasses inherited object identity, so two structurally identical UniqueItemsConstraint or PatternConstraint instances reported as divergent for any union field shared across members. FieldConstraint now defines value equality and hashing keyed on the concrete type and normalized instance state, reducing a compiled re.Pattern to (pattern, flags) so a case-insensitive pattern is not masked and reducing container attributes to hashable forms, so a set of constraints deduplicates by rule. The union fingerprint compares the constraint objects directly, falling back to a value-stable repr only for pydantic's internal Field() metadata, the lone constraint type that still compares by identity. Constraint attributes must themselves be value types, a contract the base states for future authors.

Runtime checks.

check_geometry_type reads the full four-byte WKB type word and normalizes ISO and EWKB encodings to the OGC base type, so 3D (Z/M/ZM) geometries in GeoParquet no longer false-fail every geometry-type check. check_url_format matches the scheme case-insensitively, matching Pydantic's HttpUrl. check_bounds rejects NaN under a lower-only bound, which Spark's ordering otherwise lets pass. resolve_read completes a partial partition path, appending a type=Y leaf below a theme=X/ path so one feature's checks no longer run against every type sharing a theme directory. A compiled, flagged re.Pattern, the only Pydantic carrier for re.IGNORECASE, is honored in both the pyspark check and the markdown display instead of crashing the model's generation.

Variant gating and extraction.

A discriminated union reached under a struct prefix, and a gate crossing mismatched array nesting, raise rather than emit a mis-gated check; both are preemptive, since no current schema reaches that state. Forward-ref and self-referential field annotations resolve against their owning model before classification rather than crashing the terminal classifier on an unresolved string. A required list[X | None] field no longer inherits is_optional from element nullability.

Build, structure, and docs.

The check-python-code CI paths trigger on the Makefile and the workflow themselves, uv-sync no longer routes its errors to /dev/null, and test-all runs the full suite unconditionally so golden-JSON and example-only changes are not deselected by testmon. typing-extensions is declared to match its unconditional import. Shared helpers replace duplicated logic -- register_model, schema_const_name, enum_source, a struct-only-prefix predicate, and a model-spec discovery entry point that gives every discovery site one extraction carrying partition layout -- and several docstrings and CLI help strings are corrected to match current behavior.

Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>
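
For readers unfamiliar with the WKB header, the geometry-type normalization described in the commit message above can be sketched in plain Python. This is an illustration only, not the PR's Spark expression: the function name `wkb_base_type` and the mask constant are hypothetical. It assumes the standard layouts, where ISO WKB adds 1000/2000/3000 to the base type for Z/M/ZM and EWKB sets high flag bits for Z, M, and SRID.

```python
import struct

# EWKB stores Z, M, and SRID flags in the top three bits of the type word;
# the remaining 29 bits carry the geometry type.
EWKB_TYPE_MASK = 0x1FFFFFFF


def wkb_base_type(wkb: bytes) -> int:
    """Return the OGC base geometry type (1=Point, 2=LineString, 3=Polygon, ...)."""
    if len(wkb) < 5:
        raise ValueError("a WKB header needs at least 5 bytes")
    byte_order = wkb[0]
    if byte_order == 1:
        fmt = "<I"  # little-endian (NDR)
    elif byte_order == 0:
        fmt = ">I"  # big-endian (XDR)
    else:
        raise ValueError(f"unknown WKB byte-order marker: {byte_order}")
    # Read the full four-byte type word rather than a single byte.
    (type_word,) = struct.unpack_from(fmt, wkb, 1)
    # Drop EWKB flag bits, then fold the ISO Z/M/ZM offsets onto the base type.
    return (type_word & EWKB_TYPE_MASK) % 1000
```

With this normalization an ISO `Point Z` (type word 1001) and an EWKB `Point` with the Z flag set both compare equal to the 2D `Point` code 1, which is the behavior the commit message attributes to check_geometry_type.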
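
The "symmetric _N suffixes" idea from the label-disambiguation notes above might look roughly like the following. This is a sketch under assumptions: the function name `disambiguate_labels`, the 1-based numbering, and the exact suffix format are hypothetical and not taken from the PR. The point it illustrates is that every member of a colliding group receives a suffix, and that counts are taken over the full list before any filtering.

```python
from collections import Counter


def disambiguate_labels(checks: list[tuple[str, str]]) -> list[str]:
    """Label each (field, name) check, suffixing every member of a colliding group."""
    totals = Counter(checks)          # collisions counted over the unfiltered list
    seen: Counter = Counter()
    labels = []
    for key in checks:
        _field, name = key
        if totals[key] == 1:
            labels.append(name)       # unique identity: no suffix needed
        else:
            seen[key] += 1
            labels.append(f"{name}_{seen[key]}")  # symmetric: first one is _1, not bare
    return labels


checks = [
    ("access_restrictions[].when.vehicle[].value", "required"),
    ("access_restrictions[].when.vehicle[].value", "required"),
    ("sources[].confidence", "bounds"),
]
labels = disambiguate_labels(checks)
```

Because the labels are computed before any per-arm filtering, a test that keeps only the second `required` check would still see `required_2`, matching what a shared expression module would emit.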
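
The value-equality contract described under "Constraint identity" above can be illustrated with a small stand-in. The classes here (`ValueConstraint`, `PatternRule`, `UniqueItemsRule`) are hypothetical and are not the PR's FieldConstraint hierarchy; the sketch only shows one way to key equality and hashing on the concrete type plus normalized instance state, with a compiled pattern reduced to its pattern string and flags.

```python
import re
from typing import Any


def _normalize(value: Any) -> Any:
    """Reduce a constraint attribute to a hashable, value-comparable form."""
    if isinstance(value, re.Pattern):
        return ("re.Pattern", value.pattern, value.flags)
    if isinstance(value, dict):
        items = sorted(value.items(), key=lambda kv: kv[0])
        return tuple((k, _normalize(v)) for k, v in items)
    if isinstance(value, (list, tuple)):
        return tuple(_normalize(v) for v in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(_normalize(v) for v in value)
    return value


class ValueConstraint:
    """Base class: equality and hash depend on type and state, not identity."""

    def _key(self) -> tuple:
        return (type(self), _normalize(vars(self)))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, ValueConstraint):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self) -> int:
        return hash(self._key())


class PatternRule(ValueConstraint):
    def __init__(self, pattern: re.Pattern) -> None:
        self.pattern = pattern


class UniqueItemsRule(ValueConstraint):
    pass


case_sensitive = PatternRule(re.compile(r"^[a-z]+$"))
case_insensitive = PatternRule(re.compile(r"^[a-z]+$", re.IGNORECASE))
deduplicated = {UniqueItemsRule(), UniqueItemsRule(), case_sensitive, case_insensitive}
```

Two `UniqueItemsRule` instances collapse to one set member, while the two patterns stay distinct because their flags differ, so a case-insensitive pattern is not masked by a case-sensitive one with the same source text.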

Seth Fitzsimmons (sethfitz) added the change type - cosmetic 🌹 Cosmetic change label May 11, 2026

Seth Fitzsimmons (sethfitz) temporarily deployed to staging May 11, 2026 23:11 — with GitHub Actions Inactive

Seth Fitzsimmons (sethfitz) force-pushed the pyspark-codegen branch from b23c092 to 038a1aa Compare May 13, 2026 16:51

Seth Fitzsimmons (sethfitz) temporarily deployed to staging May 13, 2026 16:53 — with GitHub Actions Inactive

John McCall (lowlydba) changed the base branch from dev to main May 14, 2026 18:47

Seth Fitzsimmons (sethfitz) added 3 commits May 14, 2026 14:12

Seth Fitzsimmons (sethfitz) force-pushed the pyspark-codegen branch from 038a1aa to 36d9a39 Compare May 18, 2026 18:47

Seth Fitzsimmons (sethfitz) temporarily deployed to staging May 18, 2026 18:48 — with GitHub Actions Inactive

Seth Fitzsimmons (sethfitz) force-pushed the pyspark-codegen branch from 36d9a39 to 0bdd012 Compare May 19, 2026 16:14

Seth Fitzsimmons (sethfitz) temporarily deployed to staging May 19, 2026 16:15 — with GitHub Actions Inactive

Seth Fitzsimmons (sethfitz) added 3 commits May 19, 2026 09:36

chore(pyspark): generate PySpark expressions

ef2ef01

Generate PySpark expressions (and tests) for models defined in the workspace Signed-off-by: Seth Fitzsimmons <seth@mojodna.net>

Seth Fitzsimmons (sethfitz) force-pushed the pyspark-codegen branch from 0bdd012 to 1da7d9e Compare May 19, 2026 16:37

Seth Fitzsimmons (sethfitz) temporarily deployed to staging May 19, 2026 16:38 — with GitHub Actions Inactive

Seth Fitzsimmons (sethfitz) added 2 commits May 20, 2026 14:15

Seth Fitzsimmons (sethfitz) temporarily deployed to staging May 20, 2026 21:18 — with GitHub Actions Inactive

Victor Schappert (vcschapp) self-requested a review May 27, 2026 03:28

Victor Schappert (vcschapp) reviewed May 27, 2026

Seth Fitzsimmons (sethfitz) temporarily deployed to staging June 24, 2026 03:08 — with GitHub Actions Inactive

Seth Fitzsimmons (sethfitz) force-pushed the pyspark-codegen branch from 44393a5 to 1c21815 Compare June 24, 2026 03:49

Seth Fitzsimmons (sethfitz) deployed to staging June 24, 2026 03:51 — with GitHub Actions View deployment

		These are codegen-internal classes -- Pydantic users continue to write
		`Annotated[X, MinLen(n)]` in their schemas; the wrapping happens inside

Seth Fitzsimmons (sethfitz) commented May 11, 2026 • edited

Summary

What's in the PR

What's covered

Known semantic gaps / differences

CLI

Testing

Notes for review

github-actions Bot commented May 11, 2026 • edited

🗺️ Schema reference docs preview is live!

Victor Schappert (vcschapp) commented May 27, 2026

Victor Schappert (vcschapp) commented May 27, 2026 • edited

Victor Schappert (vcschapp) commented May 27, 2026

Victor Schappert (vcschapp) commented May 27, 2026

Victor Schappert (vcschapp) commented May 27, 2026

Victor Schappert (vcschapp) left a comment

Victor Schappert (vcschapp) May 27, 2026

Seth Fitzsimmons (sethfitz) Jun 17, 2026

Victor Schappert (vcschapp) May 27, 2026

Seth Fitzsimmons (sethfitz) Jun 17, 2026

Victor Schappert (vcschapp) May 27, 2026

Seth Fitzsimmons (sethfitz) Jun 17, 2026

Seth Fitzsimmons (sethfitz) Jun 24, 2026

Seth Fitzsimmons (sethfitz) commented Jun 17, 2026

2 participants
