Rewrite intake stage from WDL to Python Polars#49

Merged
vincent-octo merged 22 commits into master from rewrite-intake-with-polars
Jun 9, 2026

@vincent-octo vincent-octo commented May 26, 2026


TODO

  • Encapsulate the intake stages into a mini WDL script?
    Out of scope for this PR.
  • Documentation on how to run the intake.assemble and intake.tidyup stages
  • Maybe provide a single command to run both intake stages at once
  • Add a sanitization step to remove or replace newline characters, since users are not used to handling newlines inside values. A possible test: can the output be read with non-CSV-aware tools such as basic awk?
  • Compare the new output (this rewrite) to the previous output. The expectation is that they match; otherwise, document the differences.
    • N rows
    • Column names and order
      Diff:
      • ROW_ID is now _rowid
      • new _rowid_source column to have end-to-end data tracing
    • N rows for each FINNGENID
    • Exact value match for maximal selection of columns (= all except row ID columns)
      Diff:
      • Behavior change: Newlines inside values are now replaced by the Unicode character `␤` (U+2424) instead of a space character.
      • Bug fix: Values within quoted TSV fields are now correctly preserved, whereas the previous implementation added extra quotes.
      • Bug fix: Tab characters inside quoted TSV values are now correctly preserved, whereas the previous implementation treated them as field separators, resulting in shifted values.
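
The comparison checklist above could be sketched roughly as follows. This is a dependency-free illustration using the standard `csv` module, not the check actually used in this PR. The function names, the sample data, and the `_rowid_source` value format are all hypothetical, and the value comparison assumes both outputs list their rows in the same order.

```python
import csv
import io
from collections import Counter

# Row ID columns are excluded from the comparison, as in the checklist.
ROW_ID_COLUMNS = {"ROW_ID", "_rowid", "_rowid_source"}


def read_tsv(text):
    """Parse TSV text into (header, rows) with a CSV-aware reader."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    header = next(reader)
    return header, list(reader)


def project(header, rows, columns):
    """Keep only the given columns of each row, in the given order."""
    idx = [header.index(c) for c in columns]
    return [tuple(row[i] for i in idx) for row in rows]


def rows_per_id(header, rows, id_column):
    """Count rows for each value of the ID column."""
    i = header.index(id_column)
    return Counter(row[i] for row in rows)


def compare_outputs(old_text, new_text, id_column="FINNGENID"):
    old_header, old_rows = read_tsv(old_text)
    new_header, new_rows = read_tsv(new_text)

    # Columns in file order, with row ID columns dropped.
    old_cols = [c for c in old_header if c not in ROW_ID_COLUMNS]
    new_cols = [c for c in new_header if c not in ROW_ID_COLUMNS]
    same_columns = old_cols == new_cols

    return {
        "same_row_count": len(old_rows) == len(new_rows),
        "same_columns": same_columns,
        "same_rows_per_id": rows_per_id(old_header, old_rows, id_column)
        == rows_per_id(new_header, new_rows, id_column),
        # Values are only comparable when the column sets agree.
        "same_values": same_columns
        and project(old_header, old_rows, old_cols)
        == project(new_header, new_rows, new_cols),
    }


# Hypothetical sample data; the "_rowid_source" format is made up.
old = "ROW_ID\tFINNGENID\tVALUE\n1\tFG1\ta\n2\tFG2\tb\n"
new = (
    "_rowid\t_rowid_source\tFINNGENID\tVALUE\n"
    "1\tmain.tsv:1\tFG1\ta\n"
    "2\tmain.tsv:2\tFG2\tb\n"
)
report = compare_outputs(old, new)
```

Each entry in `report` maps to one bullet of the checklist, so a `False` value points directly at the difference that needs documenting.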

Switching to these made it memory-friendly:
- `pl.concat(..., how="horizontal")`
  instead of `.join`
- `.collect(engine="streaming")`
  instead of just `.collect()`

Also added another check for the merging of main <> freetext files.


NOTE: Polars is preferred over DuckDB for this because it assigns line
numbers deterministically; only Polars guarantees this, DuckDB does
not.
Prerequisite for having the Polars implementation pull the config
shared with the other import packages.
@vincent-octo vincent-octo requested a review from piotor87 May 26, 2026 12:16
Known differences:
- Behavior change: Newlines inside values are now replaced by the Unicode
  character `␤` (U+2424) instead of a space character ` `.
- Bug fix: Values within quoted TSV fields are now correctly preserved,
  whereas the previous implementation added extra quotes.
- Bug fix: Tab characters inside quoted TSV values are now correctly
  preserved, whereas the previous implementation treated them as field
  separators, resulting in shifted values.
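
The behaviors listed above can be illustrated with the standard `csv` module. This is a sketch, not the Polars code from this PR: the `sanitize` helper, the sample data, and the treatment of `\r` and `\r\n` are assumptions.

```python
import csv
import io

NEWLINE_SYMBOL = "\u2424"  # the SYMBOL FOR NEWLINE character


def sanitize(value):
    """Replace line breaks inside a value with U+2424.

    Treating "\r\n" and "\r" like "\n" is an assumption of this sketch.
    """
    return (
        value.replace("\r\n", NEWLINE_SYMBOL)
        .replace("\n", NEWLINE_SYMBOL)
        .replace("\r", NEWLINE_SYMBOL)
    )


# Hypothetical input: one quoted value contains a newline, another a tab.
raw = 'FINNGENID\tNOTE\nFG1\t"line one\nline two"\nFG2\t"tab\there"\n'

# A CSV-aware reader keeps quoted newlines and tabs inside the value
# instead of treating them as row or field separators.
reader = csv.reader(io.StringIO(raw, newline=""), delimiter="\t")
rows = [[sanitize(value) for value in row] for row in reader]
```

After sanitization no value contains a line break, so each record fits on one physical line when written back out.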
@vincent-octo vincent-octo self-assigned this Jun 9, 2026
@vincent-octo vincent-octo force-pushed the rewrite-intake-with-polars branch from e4ae095 to 7d9b7cd Compare June 9, 2026 08:20
@vincent-octo vincent-octo marked this pull request as ready for review June 9, 2026 08:21
@vincent-octo vincent-octo merged commit c65d72f into master Jun 9, 2026
1 check passed