Improve callsign matching for spelled-out vs. numeric variants #4

@MuddyWinds

Description

Whisper transcribes callsigns inconsistently depending on how the controller/pilot speaks and how the audio comes through — e.g. "United two five", "United 25", and "UAL25" can all refer to the same aircraft. This fragments analysis and weakens ADS-B correlation (which matches on callsign).

Goal

Add a normalization pass that canonicalizes each callsign into one consistent form (airline prefix + flight number, e.g. UAL25) before the transcript batch is sent to Gemini and before ADS-B correlation.

Suggested approach

  • Map airline telephony names → ICAO prefixes (e.g. United → UAL, Speedbird → BAW, Cathay → CPA).
  • Convert spelled-out and word-number digits to numeric (two five → 25, niner → 9).
  • Produce a canonical token (e.g. UAL25) while preserving the raw transcript text for display.
  • Be conservative: when confidence is low or no airline match is found, leave the raw text untouched rather than guessing.
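
The steps above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `normalize_callsign`, the `Callsign` dataclass, and both lookup tables are hypothetical names, and the tables hold only a few sample entries. It handles digit-by-digit speech only; grouped forms such as "twenty five" are left unmatched on purpose.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical sample tables; a real mapping would be far larger.
TELEPHONY_TO_ICAO = {"united": "UAL", "speedbird": "BAW", "cathay": "CPA"}
ICAO_PREFIXES = set(TELEPHONY_TO_ICAO.values())

WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "tree": "3",
    "four": "4", "five": "5", "fife": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9", "niner": "9",
}


@dataclass
class Callsign:
    raw: str                  # transcript text, kept untouched for display
    canonical: Optional[str]  # e.g. "UAL25", or None when unsure


def normalize_callsign(raw: str) -> Callsign:
    """Return the raw text plus a canonical token, or None if not confident."""
    # Split into alphabetic runs and digit runs, so "UAL25" -> ["ual", "25"].
    tokens = re.findall(r"[a-z]+|\d+", raw.lower())
    if not tokens:
        return Callsign(raw, None)

    head, rest = tokens[0], tokens[1:]
    prefix = TELEPHONY_TO_ICAO.get(head)
    if prefix is None and head.upper() in ICAO_PREFIXES:
        prefix = head.upper()  # already an ICAO prefix
    if prefix is None:
        return Callsign(raw, None)  # no airline match: do not guess

    digits = []
    for tok in rest:
        if tok.isdigit():
            digits.append(tok)
        elif tok in WORD_TO_DIGIT:
            digits.append(WORD_TO_DIGIT[tok])
        else:
            return Callsign(raw, None)  # unexpected word: stay conservative

    number = "".join(digits)
    if not 1 <= len(number) <= 4:
        return Callsign(raw, None)  # missing or implausibly long number
    return Callsign(raw, prefix + number)
```

With these sample tables, "United two five", "United 25", and "UAL25" all yield the canonical token `UAL25`, while an unknown airline such as "Delta two five" comes back with `canonical=None` and its raw text intact.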

Where to look

  • backend/core/batcher.py (batch assembly + AIRPORT_GEO / ADS-B correlation)
  • The Gemini prompt assembly path

Acceptance

  • A short unit test covering several spelled-out / numeric / ICAO variants resolving to the same canonical callsign.
  • Raw transcript text remains visible on the observation card.
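
One possible shape for the unit test is sketched below. The `canonicalize` function here is only a compact stand-in so the test runs on its own; in the repo the test would import whatever normalizer is actually added, and its name and location are not decided in this issue.

```python
import re
import unittest

# Stand-in tables and function (hypothetical), so the test is self-contained.
_AIRLINES = {"united": "UAL", "ual": "UAL", "speedbird": "BAW", "baw": "BAW"}
_DIGITS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "niner": "9",
}


def canonicalize(raw):
    """Return e.g. 'UAL25', or None when the callsign cannot be resolved."""
    tokens = re.findall(r"[a-z]+|\d+", raw.lower())
    prefix = _AIRLINES.get(tokens[0]) if tokens else None
    # Number words become digits; anything else is kept and fails isdigit().
    number = "".join(_DIGITS.get(t, t) for t in tokens[1:])
    if prefix is None or not number.isdigit():
        return None
    return prefix + number


class CallsignVariantsTest(unittest.TestCase):
    def test_variants_resolve_to_same_callsign(self):
        variants = ["United two five", "United 25", "UAL25",
                    "ual 2 5", "UNITED TWO 5"]
        for raw in variants:
            with self.subTest(raw=raw):
                self.assertEqual(canonicalize(raw), "UAL25")

    def test_unmatched_input_is_not_guessed(self):
        for raw in ["Delta two five", "United heavy", ""]:
            with self.subTest(raw=raw):
                self.assertIsNone(canonicalize(raw))
```

Saved as a test module, this can be run with `python -m unittest`. The second test case covers the "be conservative" requirement: input with no airline match or with unexpected words resolves to `None` instead of a guess.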

This maps directly to the "callsign matching isn't perfect yet" limitation in the README — one of the most impactful accuracy fixes available.

Metadata

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)