KANTA LAB values preprocessing & QC

Based on Kira Detrois' existing repo.

CORE FILE

N.B. The raw data contains a MEASUREMENT_FREE_TEXT column that unfortunately cannot be directly released as it contains data that is potentially sensitive. It contains a mix of numerical measurement values, positive/negative outcomes, outcomes linked to thresholds (e.g. <3ml) and general notes. Our approach has been to extract such data from the original column through a process of cleaning and whitelisting of the field.

This is the file that contains mostly just workable data for analysis, like values,outcomes and pos/neg status both raw and extracted from free text where possible.

Column Name	Easy Description	Technical Notes
`ROW_ID`	ROW ID that allows to compare lines across steps of the pipeline
`FINNGENID`	Pseudoanonimized IDs
`SEX`	Sex of the sample
`EVENT_AGE`	Age of the individual at the time fo the event
`APPROX_EVENT_DATETIME`	Approximate event (+- two weeks) of the event
`OMOP_CONCEPT_ID`	OMOP id of the mapping	The mapping is done using a table
`TEST_NAME`	Name of test. It's either the local one (manipulated to standard) or the national one mapped via the lab_id table	The map for the national (THL) IDs is in the data folder,and was downloaded from Kuntaliitto - Laboratoriotutkimusnimikkeistö
`MEASUREMENT_VALUE_HARMONIZED`	The harmonized numeric value of the measurement
`MEASUREMENT_UNIT_HARMONIZED`	The harmonized unit of the measurement from the data	The conversion happens from a table and by defining a target unit from another table built using the most common unit for each concept ID
`MEASUREMENT_VALUE_EXTRACTED`	Numerical value extracted from the measurement free text field	The column remove special prefixes (check config) and specific characteres (as before). If the remaining text is a float it's kept. In our analysis, the values statistically align with the distribution of harmonized values, thus suggesting that the units are the same, although some QC is still needed.
`MEASUREMENT_VALUE_MERGED`	Column containing harmonized values and merged values, where former is missing and the latter present.	The merged column undergoes a final "ad-hoc" harmonization step. Since extracted values are most often without unit we assume that they are in the target unit and perform post merging QCing. (see below `QC_PASS`). When we see that there is clearly an issue at the srouce we use a table to perform a final conversion based on numerical thresholds (e.g. hematocrit >100 as it's sometimes reported in the wrong unit). At times we're also forced to blacklist extracted values since there is a mismatch of units and we cannot use numerical thresholds to disentangle the faulty entries.
`TEST_OUTCOME`	Outcome of the lab measurement. (typical values H=high, L=low, N=Normal, A=Abnormal)	The AR/LABRA - Exception Messages classification is used in local and national healthcare information systems to indicate deviations or changes in laboratory measurements and other measurements or examinations in relation to reference values or normal values. The content of the classification follows the content of the following classification approved by HL7International (FHIR Value set/code system definition for HL7 v2 table 0078 (Interpretation Codes), see AR/LABRA - Poikkeustilanneviestit for the abbreviations. N.B. We have found at times this columns to be contraddicting the `OUTCOME_POS_EXTRACTED` column. Our clinicians have pointed out that this column is automatically generated and might not always reliable. Such mismatches are most often linked to panel tests where the outcome is assigned to the whole set of panel tests as a whole while the information in the free text data is different for each entry. These mismatches are flagged in the `QC_NOTES` so users can be aware of the mismatch. In general, we recommend using the free text column for these values.
`TEST_OUTCOME_IMPUTED`	Imputed outcome values of the lab measurement	We automatically tried to detecth high/low value thresholds based on lab value entries where both MEASUREMENT_VALUE_HARMONIED and TEST_OUTCOME appear together at least 100 times. Values are `H,L,N` for high,low and normal. All entries with valid OMOP ids with >100 counts were imputed, including already labelled ones. The `*` sign indicates that there were issues in defining the threshold for `H/L` thresholds so one should use these values with caution.In the data folder there's a table that shows the values used to determine lower/higher limits for each OMOP id. Please see the [handbook] (https://docs.finngen.fi/finngen-data-specifics/red-library-data-individual-level-data/what-phenotype-files-are-available-in-sandbox-1/kanta-lab-values/data) for more info.
`OUTCOME_POS_EXTRACTED`	POSITIVE(1) or NEGATIVE(0) status extracted from free text.	We extracted all strings from the measurement free text field containing either `pos` or `neg` as a substring. We then checked them one by one and assigned to each a POS or NEG status. The mapping table can be seen here. Virtually all cases are straightforward as the Finnish language doesn't contain false positives (pun kinda intended). The only problematic entries are linked to RH pos/neg, which we decided to skip as they are not relevant. More fields are included from a table through a similar process of whitelisting. These are families of values are the ones that show a + sign in their text, mostly from presence in urine measurements.
`TEST_OUTCOME_TEXT_EXTRACTED`	Abnormality texts extracted from free text after standardization. It's formatted as `[<\>]\|[VALUE]\|[UNIT?]`	We applied similar filters as for `MEASUREMENT_VALUE_EXTRACTED` also allowing for a comparison (e.g. `Yli/Alle/</>`) and a unit as long as it's in list of target units (after basic string manipulation). More fields are included from a table through a similar process of whitelisting. These are families of values are the ones that show a + sign in their text, mostly from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+.
`QC_PASS`	Boolean indicator of numerical QC.	Values are initialized as `2` (unchcked). Through a table containing numerical thresholds (e.g. impossible or implausible values) the column is updated to `1` (checked and passed) or `0` (checked and not passed). These filters are not to be taken as definite but they are the ones used in the analysis side for GWAS and downstream analyses. We always recommend to double check yourself based on your needs.
`QC_NOTES`	Description of QC considerations.	Post-harmonization ad hoc conversions are annotated here (like Hematocrit [0,1] --> [0,100] conversion). It also includes flagging of mismatches between `TEST_OUTCOME` and `OUTCOME_POS_EXTRACTED`

EXTENDED_COLUMNS FILE

This is the file that contains all columns, including mostly empty and/or uninformative ones. It can be used as a default file for analysis, but there are pre-harmonization (source) columns that might lead to confusion. Very useful for debugging pruposes or for custom QCing. The columns unique to this file are:

Column Name	Easy Description	Technical Notes
`TEST_ID`	National (THL) or local lab ID of the measurement
`TEST_ID_IS_NATIONAL`	Source of the lab ID	0: local and 1: national (THL)
`TEST_NAME_SOURCE`	The original abbreviation of the measurement
`MEASUREMENT_VALUE_SOURCE`	Original measurement value
`MEASUREMENT_UNIT_SOURCE`	Original measurement unit
`MEASUREMENT_STATUS`	The measurement status	The final data contains only `C` - corrected results or `F` - final result \| See Koodistopalvelu - AR/LABRA - Tutkimusvastauksien tulkintakoodit 1997
`REFERENCE_RANGE_GROUP`	The reference values for the measurement in text form	This can be used to define the lab abnormality with better coverage using regex expressions (-to be implemented for the whole data).
`REFERENCE_RANGE_[LOWER\|UPPER]_[VALUE\|UNIT]`	Reference lower\|upper value\|unit \| Mostly mutually exclusive with `REFERENCE_RANGE_GROUP`
`CODING_SYSTEM_ORG`	Name (if present) of the uploading org, mapped via Business Identity Code	From a table indata folder. We noticed that the central digits in most codes were BIC (sometimes missing leadin 0) and thus we could extract the macro information from the bigger table. The result is not as granular as all subfields (e.g. `Espoo_221`) are now merged into just `Espoon_Kaupunki (City of Espoo)`.
`CODING_SYSTEM_OID`	ID of what is probably the org that logged the data in the system.

The raw to output column mapping is as follows:

Column in raw file	Description
FINNGENID	FINNGENID
EVENT_AGE	EVENT_AGE
tutkimuskoodistonjarjestelmaid	CODING_SYSTEM
paikallinentutkimusnimike_selite	TEST_NAME_ABBREVIATION
tutkimustulosarvo	MEASUREMENT_VALUE
tutkimustulosyksikko	MEASUREMENT_UNIT
tutkimusvastauksentilaid	MEASUREMENT_STATUS
tuloksenpoikkeavuus	TEST_OUTCOME
viitearvoryhma	REFERENCE_RANGE_GROUP
viitevalialkuarvo	REFERENCE_RANGE_LOWER_VALUE
viitevalialkuyksikko	REFERENCE_RANGE_LOWER_UNIT
viitevaliloppuarvo	REFERENCE_RANGE_UPPER_VALUE
viitevaliloppuyksikko	REFERENCE_RANGE_UPPER_UNIT
tutkimustulosteksti	MEASUREMENT_FREE_TEXT

UNIT INJECTION

Many lab records carry a numeric value but no unit (the MEASUREMENT_UNIT_PRE_FIX field is NULL). The unit injection pipeline (scripts/injection/) identifies these records and, where possible, assigns a unit by comparing their value distribution against records of the same TEST_NAME that already have a unit.

Each TEST_NAME is classified into one of three categories based on the amount of reference data available (records that have both a value and a unit):

Category	Condition	How a unit is assigned
`UNAMBIGUOUS`	One unit accounts for ≥ threshold% of reference records	Distribution comparison against that dominant unit via KS / t / MAD tests
`AMBIGUOUS`	Multiple units present; no single one dominates	Per-unit comparison, optionally after splitting a bimodal distribution
`NO_DATA`	Fewer than `--min-target-n` reference records with any unit	No engine run; unit rescued via OMOP (see below)

NO_DATA handling

NO_DATA TEST_NAMEs have too little reference data for a statistical distribution comparison. Instead, they are enriched using the OMOP unit table: if the OMOP concept mapped to a NO_DATA test has category SINGLE (only one unit ever observed) or EQUIVALENT (multiple units but all mutually convertible), the canonical OMOP unit is injected directly. Tests with OMOP category MULTIPLE or with no OMOP mapping receive no unit.

Results are written to no_data_results.tsv. At the active 98% threshold (~500 min-count), ~28% of TEST_NAMEs (≈202) fall into this category, representing ~3.5% of measurements.

For full details on the injection engine, bimodality detection, and output columns, see the injection README.

MERGE OF RAW DATA

Inputs reports and responses are sorted over the same keys and merged

DUPLICATE REMOVAL & SORTING

First the wdl trims the data of only the relevant columns (taken from the config) and sorts it by a series of columns (also specified in the config under sort_cols) so we can also discard duplicate entries.

Preprocessing

There is a config file that contains all the relevant "choices" about how to manipulate the data (e.g. which columns to include, how to rename columns, which values of which column to include etc.) so there are virtually no hard coded elements in the code itself.

TECHNICAL INFO

The code performs the following actions:

MINIMAL

output columns are initialized
spaces are removed everywhere in the text
the date is built in the right format
all type of NA/missing values (Puuttuu,"_" etc.) are replaced with "NA"
entries with invalid hetu root are removed (and logged)
entries with invalid measurement status are removed (and logged)
TEST_ID_IS_NATIONAL is created checking if laboratoriotutkimusnimike_koodi is not NA (1 national/0 regional)
TEST_ID is created assigning the regional id for regional labs and a
TEST_NAME_ABBREVIATION is updated for national labs through a mapping
CODING_SYSTEM is updated when available (problematic ATM, see table above)
CODING_SYSTEM_MAP is created from y-tunnukset in table
~~SERVICE_PROVIDER_ID is updated where the mapping can happen~~ --> scrapped in v3
TEST_NAME_ABBREVIATIONs with problematic characters are edited (see abbreviation_replacements in the config)

UNIT

Fixes strange characters in the lab unit field. Also moves to lower case for non NA values.
Mapping of units. This can be done either via regex (from config) or through a mapping
TEST_OUTCOME is edited to be consistent with the standard definition see AR/LABRA - Poikkeustilanneviestit. This means replacing < with L, > with H, POS with A and NEG with N.

HARMONIZATION

Mapping status is updated (internal thing)
IS_UNIT_VALID column is populated based on whether the unit is in usagi list
Unit edit or injection to make sure all abbreviations with similar units are mapped to the same one (e.g. mg --> mg/24h for du-prot). Based on a table
OMOP mapping from a table
unit harmonization (optional but default) from a table

How it works

The script reads in the data in chunks of --chunksize length and it processes the lines with Python's pandas. With the flag --mp and --jobs the script runs each chunk into other smaller subchunks in parallel (efficiency TBD). The filter folder contains separate scripts that perform conceptually separate tasks. Each of them contains a global function of the same name of the script that gathers all individual functions that populate the script. In this way we can easily compartmentalize the munging/qc and add new features.

usage: main.py [-h] [--raw-data RAW_DATA] [--log {critical,error,warn,warning,info,debug}] [--test] [--gz] [--mp [MP]] [-o OUT] [--prefix PREFIX] [--sep SEP] [--chunk-size CHUNK_SIZE] [--lines LINES] [--unit-map {regex,map,none}] [--harmonization [HARMONIZATION]]

Kanta Lab preprocessing pipeline: raw data ⇒ clean data.

options:
  -h, --help            show this help message and exit
  --raw-data RAW_DATA   Path to input raw file. File should be tsv.
 --log {critical,error,warn,warning,info,debug}Provide logging level. Example '--log debug', default = 'warning'
  --test                Reads first chunk only
  --gz                  Ouputs to gz
  --mp [MP]             Flag for multiproc. Default is '0' (no multiproc). If passed it defaults to cpu count, but one can also specify the number of cpus to use: e.g. '--mp' or '--mp 4'.
  -o OUT, --out OUT     Folder in which to save the results (default = current working directory)
  --prefix PREFIX       Prefix of the out files (default = 'kanta_YYYY_MM_DD')
  --sep SEP             Separator (default = tab)
  --chunk-size CHUNK_SIZE      Number of rows to be processed by each chunk (default = '100').
  --lines LINES         Number of lines in input file (calculated/estimated otherwise).
  --unit-map {regex,map,none}  How to replace units. Map uses the unit_mapping.txt mapping in data and regex after. Regex does only regex. none skips it entirely.
  --harmonization [HARMONIZATION]  Path to tsv with concept id and target unit.

Core

The core folder contains the pipeline that produces a similar file, which is used for analysis purposes.

The current version outputs new columns, which are taken from the MEASUREMENT_FREE_TEXT column, which cannot be shared due to data privacy reasons for the time being. The columns are:

Column Name	Easy Description	General Notes	Technical Notes
`MEASUREMENT_VALUE_EXTRACTED`	Numerical values	The column is mutually exclusive with `harmonization_omop::MEASUREMENT_VALUE`. In our analysis the units seemed to be pretty much always coherent with our target OMOP units, albeit for occasional clusters of outliers that are present also in the core data.	The content of the `MEASUREMENT_FREE_TEXT` colum is cleaned by removal of spaces, converted to lower case, the target omop unit is removed if present, certain result strings are removed. If we're left with a pure float number, it's considered to be an extracted value.
`extracted::MEASUREMENT_VALUE_MERGED`	extracted::MEASUREMENT_VALUE_MERGED	Merge of harmonized value column and of extracted value column
`OUTCOME_POS_EXTRACTED`	Boolean	Pos(1) or Neg(0) status	We extracted all strings containing the substring pos/neg and we manually mapped them to pos/neg statuses. The mapping is stored in the data folder. We also added positive extractions linked to (mostly) from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+. The mapping is stored in the data folder
`TEST_OUTCOME_TEXT_EXTRACTED`	Strings	Simplified/summarized strings present in free text. E.g. (YLI 3.0 ml --> >3ml).
`QC_PASS`	Boolean indicator of numerical QC.	Values are initialized as `2` (unchcked). Through a table containing numerical thresholds (e.g. impossible or implausible values) the column is updated to `1` (checked and passed) or `0` (checked and not passed). These filters are not to be taken as definite but they are the ones used in the analysis side for GWAS and downstream analyses. We always recommend to double check yourself based on your needs.
`QC_NOTES`	Description of QC considerations.	Post-harmonization ad hoc conversions are annotated here (like Hematocrit [0,1] --> [0,100] conversion). It also includes flagging of mismatches between `TEST_OUTCOME` and `OUTCOME_POS_EXTRACTED`

TECHNICAL INFO

The princples are identical to the finngen_qc pipeline, just with different filters being used.

EXTRACT

This filter extracts info from the free text column:

extract_outcome: This function extracts test outcomes (e.g., "<5", ">10") from the "MEASUREMENT_FREE_TEXT" column. It identifies rows with status indicators, standardizes the text, parses the comparison operator, value, and unit, and stores the extracted outcome in a new column named "extracted::TEST_OUTCOME_TEXT". It also performs some data cleaning and unit mapping. Another type of families of values are the ones that show a + sign in their text, mostly from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+. These values are taken from a mapping file.
extract_measurement: This function extracts numerical measurement values from the "MEASUREMENT_FREE_TEXT" column and stores them in a new column called "extracted::MEASUREMENT_VALUE". It also creates a "extracted::MEASUREMENT_VALUE_MERGED" column, which prioritizes values from the "harmonization_omop::MEASUREMENT_VALUE" column if available, otherwise using the newly extracted values.
extract_positive: This function takes in a table and maps these full free text fields (manually checked by us) to 1(pos)/0(neg) outcomes. These free text contains the substrings pos or neg in them.
extract_plus_ab" This function takes in a table and maps these full free text fields (manually checked by us) to 1(pos)/0(neg) outcomes. These strings contains the sign + in them.

QC

This is for all sorts of QCing of the data, ideally for outlier filter/removal:

all extracted values that can be misinterpreted as dates are removed (DDMMYY) as they can either be birth dates or the exact date of the examination, which is shifted +- 2 weeks for each FINNGENID
we perform a post harmonization fix for certain values where a mismatch of units exist. For the time being we've only seen his happen on the whole OMOP_ID scale with Hematocrit, where some values are in the [0,1] range instead of the [0,100] range as expected. This is mostly linked to extracted values, but it also affects heavily harmonized values. The mapping file can be found in the data folder
we use a table to perform basic QC. The table enumerates thresholds for which we find values that are impossible/implausible or for which the pipeline cannot handle potential issues with unit. If en entry matches the criteria in the table, QC_PASS is set to 0 and QC_NOTES is updated to describe what went wrong according to our analysis
we use a table to mask out extracted values from the MEASUREMENT_VALUE_MERGED column. QC columns are also updated. ATM it only involves only a handful of OMOP_IDS that clearly have problematic coexistence of units, either by nature (e.g. hematocrit in two different units) or clearly typos at source. In general, we prefer to QC problematic values and the idea is to use this list carefully only for clear cut cases where there is no space for ambiguity whatsoever.
we flag values for which the TEST_OUTCOME column does not match the extracted::IS_POS column by adding the string OUTCOME_EXTRACT_CONFLICT to QC_NOTES

OUTCOME

TEST_OUTCOME_IMPUTED is generated from the (merged) numerical values based on thresholds learned from the data when both values and outcomes are present

How it works

The princples are identical to the finngen_qc pipeline.

usage: main.py [-h] --raw-data RAW_DATA [--log {critical,error,warn,warning,info,debug}] [--test] [--gz] [--mp [MP]] [-o OUT] [--prefix PREFIX] [--sep SEP]
               [--chunk-size CHUNK_SIZE] [--lines LINES]

Kanta Lab analysis pipeline: clean data ⇒ analysis data.

options:
  -h, --help            show this help message and exit
  --raw-data RAW_DATA   Path to input raw file. File should be tsv.
  --log {critical,error,warn,warning,info,debug Provide logging level. Example '--log debug', default = 'warning'
  --test                Reads first chunk only
  --gz                  Ouputs to gz
  --mp [MP]             Flag for multiproc. Default is '0' (no multiproc). If passed it defaults to cpu count, but one can also specify the number of cpus  to use: e.g. '--mp' or '--mp 4'.
  -o OUT, --out OUT     Folder in which to save the results (default = current working directory)
  --prefix PREFIX       Prefix of the out files (default = 'kanta_YYYY_MM_DD')
  --sep SEP             Separator (default = tab)
  --chunk-size CHUNK_SIZE
                        Number of rows to be processed by each chunk (default = '1000*n_cpus').
  --lines LINES         Number of lines in input file (calculated/estimated otherwise).

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
.github/workflows		.github/workflows
docker		docker
qc_scripts		qc_scripts
sb_release		sb_release
scripts/injection		scripts/injection
src/kanta		src/kanta
tests		tests
wdl		wdl
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
kanta_pipeline.png		kanta_pipeline.png
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

KANTA LAB values preprocessing & QC

CORE FILE

EXTENDED_COLUMNS FILE

UNIT INJECTION

NO_DATA handling

MERGE OF RAW DATA

DUPLICATE REMOVAL & SORTING

Preprocessing

TECHNICAL INFO

MINIMAL

UNIT

HARMONIZATION

How it works

Core

TECHNICAL INFO

EXTRACT

QC

OUTCOME

How it works

About

Uh oh!

Releases 2

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

KANTA LAB values preprocessing & QC

CORE FILE

EXTENDED_COLUMNS FILE

UNIT INJECTION

NO_DATA handling

MERGE OF RAW DATA

DUPLICATE REMOVAL & SORTING

Preprocessing

TECHNICAL INFO

How it works

Core

TECHNICAL INFO

How it works

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Uh oh!

Uh oh!

Languages