Skip to content

Iceberg write Invalid POC#1202

Open
michaeldjeffrey wants to merge 5 commits into
mainfrom
mj/write-invalid-poc
Open

Iceberg write Invalid POC#1202
michaeldjeffrey wants to merge 5 commits into
mainfrom
mj/write-invalid-poc

Conversation

@michaeldjeffrey
Copy link
Copy Markdown
Contributor

This pull request introduces a new pattern for handling and storing invalid (rejected) records for bans, heartbeats, speedtests, and speedtest averages in the mobile_verifier service. Now, for each type of record, invalid entries are written to sibling invalid_* tables with an added reason column explaining why the record was rejected. This is achieved through a new ValidInvalidWriter abstraction and supporting data structures. The changes also make the schema for valid and invalid tables consistent and easier to maintain.

The most important changes are:

1. Introduction of Invalid Record Tables and Writers

  • Added new modules and data structures for invalid_ban, invalid_heartbeat, invalid_speedtest, and invalid_speedtest_avg, each mirroring the schema of their valid counterpart and adding a reason column for the rejection cause. These modules include constructors for easy conversion from valid to invalid records. [1] [2] [3] [4]
  • Introduced the ValidInvalidWriter abstraction, which writes valid records to the main table and invalid records to the corresponding invalid_* table, and updated all writer types to use this pattern. [1] [2]

2. Schema and API Improvements

  • Added helper methods to TableDefinition (with_name, with_field) to simplify deriving sibling table definitions for invalid records, ensuring schemas stay in sync.
  • Updated the field visibility of record structs (e.g., IcebergBan, IcebergHeartbeat, IcebergSpeedtest) to allow construction of invalid record variants. [1] [2] [3]

3. Ingestion Pipeline Updates

  • Modified the ban and heartbeat ingestion logic to collect both valid and invalid records, writing each to their appropriate tables using the new writer abstraction. [1] [2] [3]

4. Utility Improvements

  • Added a utility function to strip enum prefixes from rejection reasons for cleaner storage in the reason column.

These changes make the handling of invalid data explicit, auditable, and easier to maintain, while keeping valid and invalid table schemas tightly coupled and reducing code duplication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant