Based on Kira Detrois' existing repo.
N.B. The raw data contains a MEASUREMENT_FREE_TEXT column that unfortunately cannot be directly released as it contains data that is potentially sensitive. It contains a mix of numerical measurement values, positive/negative outcomes, outcomes linked to thresholds (e.g. <3ml) and general notes. Our approach has been to extract such data from the original column through a process of cleaning and whitelisting of the field.
This is the file that contains mostly just workable data for analysis, like values,outcomes and pos/neg status both raw and extracted from free text where possible.
| Column Name | Easy Description | Technical Notes |
|---|---|---|
ROW_ID |
ROW ID that allows to compare lines across steps of the pipeline | |
FINNGENID |
Pseudoanonimized IDs | |
SEX |
Sex of the sample | |
EVENT_AGE |
Age of the individual at the time fo the event | |
APPROX_EVENT_DATETIME |
Approximate event (+- two weeks) of the event | |
OMOP_CONCEPT_ID |
OMOP id of the mapping | The mapping is done using a table |
TEST_NAME |
Name of test. It's either the local one (manipulated to standard) or the national one mapped via the lab_id table | The map for the national (THL) IDs is in the data folder,and was downloaded from Kuntaliitto - Laboratoriotutkimusnimikkeistö |
MEASUREMENT_VALUE_HARMONIZED |
The harmonized numeric value of the measurement | |
MEASUREMENT_UNIT_HARMONIZED |
The harmonized unit of the measurement from the data | The conversion happens from a table and by defining a target unit from another table built using the most common unit for each concept ID |
MEASUREMENT_VALUE_EXTRACTED |
Numerical value extracted from the measurement free text field | The column remove special prefixes (check config) and specific characteres (as before). If the remaining text is a float it's kept. In our analysis, the values statistically align with the distribution of harmonized values, thus suggesting that the units are the same, although some QC is still needed. |
MEASUREMENT_VALUE_MERGED |
Column containing harmonized values and merged values, where former is missing and the latter present. | The merged column undergoes a final "ad-hoc" harmonization step. Since extracted values are most often without unit we assume that they are in the target unit and perform post merging QCing. (see below QC_PASS). When we see that there is clearly an issue at the srouce we use a table to perform a final conversion based on numerical thresholds (e.g. hematocrit >100 as it's sometimes reported in the wrong unit). At times we're also forced to blacklist extracted values since there is a mismatch of units and we cannot use numerical thresholds to disentangle the faulty entries. |
TEST_OUTCOME |
Outcome of the lab measurement. (typical values H=high, L=low, N=Normal, A=Abnormal) | The AR/LABRA - Exception Messages classification is used in local and national healthcare information systems to indicate deviations or changes in laboratory measurements and other measurements or examinations in relation to reference values or normal values. The content of the classification follows the content of the following classification approved by HL7International (FHIR Value set/code system definition for HL7 v2 table 0078 (Interpretation Codes), see AR/LABRA - Poikkeustilanneviestit for the abbreviations. N.B. We have found at times this columns to be contraddicting the OUTCOME_POS_EXTRACTED column. Our clinicians have pointed out that this column is automatically generated and might not always reliable. Such mismatches are most often linked to panel tests where the outcome is assigned to the whole set of panel tests as a whole while the information in the free text data is different for each entry. These mismatches are flagged in the QC_NOTES so users can be aware of the mismatch. In general, we recommend using the free text column for these values. |
TEST_OUTCOME_IMPUTED |
Imputed outcome values of the lab measurement | We automatically tried to detecth high/low value thresholds based on lab value entries where both MEASUREMENT_VALUE_HARMONIED and TEST_OUTCOME appear together at least 100 times. Values are H,L,N for high,low and normal. All entries with valid OMOP ids with >100 counts were imputed, including already labelled ones. The * sign indicates that there were issues in defining the threshold for H/L thresholds so one should use these values with caution.In the data folder there's a table that shows the values used to determine lower/higher limits for each OMOP id. Please see the [handbook] (https://docs.finngen.fi/finngen-data-specifics/red-library-data-individual-level-data/what-phenotype-files-are-available-in-sandbox-1/kanta-lab-values/data) for more info. |
OUTCOME_POS_EXTRACTED |
POSITIVE(1) or NEGATIVE(0) status extracted from free text. | We extracted all strings from the measurement free text field containing either pos or neg as a substring. We then checked them one by one and assigned to each a POS or NEG status. The mapping table can be seen here. Virtually all cases are straightforward as the Finnish language doesn't contain false positives (pun kinda intended). The only problematic entries are linked to RH pos/neg, which we decided to skip as they are not relevant. More fields are included from a table through a similar process of whitelisting. These are families of values are the ones that show a + sign in their text, mostly from presence in urine measurements. |
TEST_OUTCOME_TEXT_EXTRACTED |
Abnormality texts extracted from free text after standardization. It's formatted as [<\>]|[VALUE]|[UNIT?] |
We applied similar filters as for MEASUREMENT_VALUE_EXTRACTED also allowing for a comparison (e.g. Yli/Alle/</>) and a unit as long as it's in list of target units (after basic string manipulation). More fields are included from a table through a similar process of whitelisting. These are families of values are the ones that show a + sign in their text, mostly from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+. |
QC_PASS |
Boolean indicator of numerical QC. | Values are initialized as 2 (unchcked). Through a table containing numerical thresholds (e.g. impossible or implausible values) the column is updated to 1 (checked and passed) or 0 (checked and not passed). These filters are not to be taken as definite but they are the ones used in the analysis side for GWAS and downstream analyses. We always recommend to double check yourself based on your needs. |
QC_NOTES |
Description of QC considerations. | Post-harmonization ad hoc conversions are annotated here (like Hematocrit [0,1] --> [0,100] conversion). It also includes flagging of mismatches between TEST_OUTCOME and OUTCOME_POS_EXTRACTED |
This is the file that contains all columns, including mostly empty and/or uninformative ones. It can be used as a default file for analysis, but there are pre-harmonization (source) columns that might lead to confusion. Very useful for debugging pruposes or for custom QCing. The columns unique to this file are:
| Column Name | Easy Description | Technical Notes |
|---|---|---|
TEST_ID |
National (THL) or local lab ID of the measurement | |
TEST_ID_IS_NATIONAL |
Source of the lab ID | 0: local and 1: national (THL) |
TEST_NAME_SOURCE |
The original abbreviation of the measurement | |
MEASUREMENT_VALUE_SOURCE |
Original measurement value | |
MEASUREMENT_UNIT_SOURCE |
Original measurement unit | |
MEASUREMENT_STATUS |
The measurement status | The final data contains only C - corrected results or F - final result | See Koodistopalvelu - AR/LABRA - Tutkimusvastauksien tulkintakoodit 1997 |
REFERENCE_RANGE_GROUP |
The reference values for the measurement in text form | This can be used to define the lab abnormality with better coverage using regex expressions (-to be implemented for the whole data). |
REFERENCE_RANGE_[LOWER|UPPER]_[VALUE|UNIT] |
Reference lower|upper value|unit | Mostly mutually exclusive with REFERENCE_RANGE_GROUP |
|
CODING_SYSTEM_ORG |
Name (if present) of the uploading org, mapped via Business Identity Code | From a table indata folder. We noticed that the central digits in most codes were BIC (sometimes missing leadin 0) and thus we could extract the macro information from the bigger table. The result is not as granular as all subfields (e.g. Espoo_221) are now merged into just Espoon_Kaupunki (City of Espoo). |
CODING_SYSTEM_OID |
ID of what is probably the org that logged the data in the system. |
The raw to output column mapping is as follows:
| Column in raw file | Description |
|---|---|
| FINNGENID | FINNGENID |
| EVENT_AGE | EVENT_AGE |
| tutkimuskoodistonjarjestelmaid | CODING_SYSTEM |
| paikallinentutkimusnimike_selite | TEST_NAME_ABBREVIATION |
| tutkimustulosarvo | MEASUREMENT_VALUE |
| tutkimustulosyksikko | MEASUREMENT_UNIT |
| tutkimusvastauksentilaid | MEASUREMENT_STATUS |
| tuloksenpoikkeavuus | TEST_OUTCOME |
| viitearvoryhma | REFERENCE_RANGE_GROUP |
| viitevalialkuarvo | REFERENCE_RANGE_LOWER_VALUE |
| viitevalialkuyksikko | REFERENCE_RANGE_LOWER_UNIT |
| viitevaliloppuarvo | REFERENCE_RANGE_UPPER_VALUE |
| viitevaliloppuyksikko | REFERENCE_RANGE_UPPER_UNIT |
| tutkimustulosteksti | MEASUREMENT_FREE_TEXT |
Many lab records carry a numeric value but no unit (the MEASUREMENT_UNIT_PRE_FIX field is NULL). The unit injection pipeline (scripts/injection/) identifies these records and, where possible, assigns a unit by comparing their value distribution against records of the same TEST_NAME that already have a unit.
Each TEST_NAME is classified into one of three categories based on the amount of reference data available (records that have both a value and a unit):
| Category | Condition | How a unit is assigned |
|---|---|---|
UNAMBIGUOUS |
One unit accounts for ≥ threshold% of reference records | Distribution comparison against that dominant unit via KS / t / MAD tests |
AMBIGUOUS |
Multiple units present; no single one dominates | Per-unit comparison, optionally after splitting a bimodal distribution |
NO_DATA |
Fewer than --min-target-n reference records with any unit |
No engine run; unit rescued via OMOP (see below) |
NO_DATA TEST_NAMEs have too little reference data for a statistical distribution comparison. Instead, they are enriched using the OMOP unit table: if the OMOP concept mapped to a NO_DATA test has category SINGLE (only one unit ever observed) or EQUIVALENT (multiple units but all mutually convertible), the canonical OMOP unit is injected directly. Tests with OMOP category MULTIPLE or with no OMOP mapping receive no unit.
Results are written to no_data_results.tsv. At the active 98% threshold (~500 min-count), ~28% of TEST_NAMEs (≈202) fall into this category, representing ~3.5% of measurements.
For full details on the injection engine, bimodality detection, and output columns, see the injection README.
Inputs reports and responses are sorted over the same keys and merged
First the wdl trims the data of only the relevant columns (taken from the config) and sorts it by a series of columns (also specified in the config under sort_cols) so we can also discard duplicate entries.
There is a config file that contains all the relevant "choices" about how to manipulate the data (e.g. which columns to include, how to rename columns, which values of which column to include etc.) so there are virtually no hard coded elements in the code itself.
The code performs the following actions:
- output columns are initialized
- spaces are removed everywhere in the text
- the date is built in the right format
- all type of NA/missing values (Puuttuu,"_" etc.) are replaced with "NA"
- entries with invalid hetu root are removed (and logged)
- entries with invalid measurement status are removed (and logged)
- TEST_ID_IS_NATIONAL is created checking if
laboratoriotutkimusnimike_koodiis not NA (1 national/0 regional) - TEST_ID is created assigning the regional id for regional labs and a
- TEST_NAME_ABBREVIATION is updated for national labs through a mapping
- CODING_SYSTEM is updated when available (problematic ATM, see table above)
- CODING_SYSTEM_MAP is created from y-tunnukset in table
SERVICE_PROVIDER_ID is updated where the mapping can happen--> scrapped in v3- TEST_NAME_ABBREVIATIONs with problematic characters are edited (see
abbreviation_replacementsin the config)
- Fixes strange characters in the lab unit field. Also moves to lower case for non NA values.
- Mapping of units. This can be done either via regex (from config) or through a mapping
- TEST_OUTCOME is edited to be consistent with the standard definition see AR/LABRA - Poikkeustilanneviestit. This means replacing
<withL,>withH,POSwithAandNEGwithN.
- Mapping status is updated (internal thing)
- IS_UNIT_VALID column is populated based on whether the unit is in usagi list
- Unit edit or injection to make sure all abbreviations with similar units are mapped to the same one (e.g. mg --> mg/24h for du-prot). Based on a table
- OMOP mapping from a table
- unit harmonization (optional but default) from a table
The script reads in the data in chunks of --chunksize length and it processes the lines with Python's pandas. With the flag --mp and --jobs the script runs each chunk into other smaller subchunks in parallel (efficiency TBD). The filter folder contains separate scripts that perform conceptually separate tasks. Each of them contains a global function of the same name of the script that gathers all individual functions that populate the script. In this way we can easily compartmentalize the munging/qc and add new features.
usage: main.py [-h] [--raw-data RAW_DATA] [--log {critical,error,warn,warning,info,debug}] [--test] [--gz] [--mp [MP]] [-o OUT] [--prefix PREFIX] [--sep SEP] [--chunk-size CHUNK_SIZE] [--lines LINES] [--unit-map {regex,map,none}] [--harmonization [HARMONIZATION]]
Kanta Lab preprocessing pipeline: raw data ⇒ clean data.
options:
-h, --help show this help message and exit
--raw-data RAW_DATA Path to input raw file. File should be tsv.
--log {critical,error,warn,warning,info,debug}Provide logging level. Example '--log debug', default = 'warning'
--test Reads first chunk only
--gz Ouputs to gz
--mp [MP] Flag for multiproc. Default is '0' (no multiproc). If passed it defaults to cpu count, but one can also specify the number of cpus to use: e.g. '--mp' or '--mp 4'.
-o OUT, --out OUT Folder in which to save the results (default = current working directory)
--prefix PREFIX Prefix of the out files (default = 'kanta_YYYY_MM_DD')
--sep SEP Separator (default = tab)
--chunk-size CHUNK_SIZE Number of rows to be processed by each chunk (default = '100').
--lines LINES Number of lines in input file (calculated/estimated otherwise).
--unit-map {regex,map,none} How to replace units. Map uses the unit_mapping.txt mapping in data and regex after. Regex does only regex. none skips it entirely.
--harmonization [HARMONIZATION] Path to tsv with concept id and target unit.
The core folder contains the pipeline that produces a similar file, which is used for analysis purposes.
The current version outputs new columns, which are taken from the MEASUREMENT_FREE_TEXT column, which cannot be shared due to data privacy reasons for the time being. The columns are:
| Column Name | Easy Description | General Notes | Technical Notes |
|---|---|---|---|
MEASUREMENT_VALUE_EXTRACTED |
Numerical values | The column is mutually exclusive with harmonization_omop::MEASUREMENT_VALUE. In our analysis the units seemed to be pretty much always coherent with our target OMOP units, albeit for occasional clusters of outliers that are present also in the core data. |
The content of the MEASUREMENT_FREE_TEXT colum is cleaned by removal of spaces, converted to lower case, the target omop unit is removed if present, certain result strings are removed. If we're left with a pure float number, it's considered to be an extracted value. |
extracted::MEASUREMENT_VALUE_MERGED |
extracted::MEASUREMENT_VALUE_MERGED | Merge of harmonized value column and of extracted value column | |
OUTCOME_POS_EXTRACTED |
Boolean | Pos(1) or Neg(0) status | We extracted all strings containing the substring pos/neg and we manually mapped them to pos/neg statuses. The mapping is stored in the data folder. We also added positive extractions linked to (mostly) from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+. The mapping is stored in the data folder |
TEST_OUTCOME_TEXT_EXTRACTED |
Strings | Simplified/summarized strings present in free text. E.g. (YLI 3.0 ml --> >3ml). | |
QC_PASS |
Boolean indicator of numerical QC. | Values are initialized as 2 (unchcked). Through a table containing numerical thresholds (e.g. impossible or implausible values) the column is updated to 1 (checked and passed) or 0 (checked and not passed). These filters are not to be taken as definite but they are the ones used in the analysis side for GWAS and downstream analyses. We always recommend to double check yourself based on your needs. |
|
QC_NOTES |
Description of QC considerations. | Post-harmonization ad hoc conversions are annotated here (like Hematocrit [0,1] --> [0,100] conversion). It also includes flagging of mismatches between TEST_OUTCOME and OUTCOME_POS_EXTRACTED |
The princples are identical to the finngen_qc pipeline, just with different filters being used.
This filter extracts info from the free text column:
-
extract_outcome: This function extracts test outcomes (e.g., "<5", ">10") from the "MEASUREMENT_FREE_TEXT" column. It identifies rows with status indicators, standardizes the text, parses the comparison operator, value, and unit, and stores the extracted outcome in a new column named "extracted::TEST_OUTCOME_TEXT". It also performs some data cleaning and unit mapping. Another type of families of values are the ones that show a + sign in their text, mostly from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+. These values are taken from a mapping file.
-
extract_measurement: This function extracts numerical measurement values from the "MEASUREMENT_FREE_TEXT" column and stores them in a new column called "extracted::MEASUREMENT_VALUE". It also creates a "extracted::MEASUREMENT_VALUE_MERGED" column, which prioritizes values from the "harmonization_omop::MEASUREMENT_VALUE" column if available, otherwise using the newly extracted values.
-
extract_positive: This function takes in a table and maps these full free text fields (manually checked by us) to 1(pos)/0(neg) outcomes. These free text contains the substrings
posornegin them. -
extract_plus_ab" This function takes in a table and maps these full free text fields (manually checked by us) to 1(pos)/0(neg) outcomes. These strings contains the sign
+in them.
This is for all sorts of QCing of the data, ideally for outlier filter/removal:
- all extracted values that can be misinterpreted as dates are removed (DDMMYY) as they can either be birth dates or the exact date of the examination, which is shifted +- 2 weeks for each FINNGENID
- we perform a post harmonization fix for certain values where a mismatch of units exist. For the time being we've only seen his happen on the whole OMOP_ID scale with Hematocrit, where some values are in the [0,1] range instead of the [0,100] range as expected. This is mostly linked to extracted values, but it also affects heavily harmonized values. The mapping file can be found in the data folder
- we use a table to perform basic QC. The table enumerates thresholds for which we find values that are impossible/implausible or for which the pipeline cannot handle potential issues with unit. If en entry matches the criteria in the table,
QC_PASSis set to 0 andQC_NOTESis updated to describe what went wrong according to our analysis - we use a table to mask out extracted values from the
MEASUREMENT_VALUE_MERGEDcolumn. QC columns are also updated. ATM it only involves only a handful of OMOP_IDS that clearly have problematic coexistence of units, either by nature (e.g. hematocrit in two different units) or clearly typos at source. In general, we prefer to QC problematic values and the idea is to use this list carefully only for clear cut cases where there is no space for ambiguity whatsoever. - we flag values for which the
TEST_OUTCOMEcolumn does not match theextracted::IS_POScolumn by adding the stringOUTCOME_EXTRACT_CONFLICTto QC_NOTES
TEST_OUTCOME_IMPUTEDis generated from the (merged) numerical values based on thresholds learned from the data when both values and outcomes are present
The princples are identical to the finngen_qc pipeline.
usage: main.py [-h] --raw-data RAW_DATA [--log {critical,error,warn,warning,info,debug}] [--test] [--gz] [--mp [MP]] [-o OUT] [--prefix PREFIX] [--sep SEP]
[--chunk-size CHUNK_SIZE] [--lines LINES]
Kanta Lab analysis pipeline: clean data ⇒ analysis data.
options:
-h, --help show this help message and exit
--raw-data RAW_DATA Path to input raw file. File should be tsv.
--log {critical,error,warn,warning,info,debug Provide logging level. Example '--log debug', default = 'warning'
--test Reads first chunk only
--gz Ouputs to gz
--mp [MP] Flag for multiproc. Default is '0' (no multiproc). If passed it defaults to cpu count, but one can also specify the number of cpus to use: e.g. '--mp' or '--mp 4'.
-o OUT, --out OUT Folder in which to save the results (default = current working directory)
--prefix PREFIX Prefix of the out files (default = 'kanta_YYYY_MM_DD')
--sep SEP Separator (default = tab)
--chunk-size CHUNK_SIZE
Number of rows to be processed by each chunk (default = '1000*n_cpus').
--lines LINES Number of lines in input file (calculated/estimated otherwise).
