Skip to content

FINNGEN/kanta_lab_preprocessing

Repository files navigation

KANTA LAB values preprocessing & QC

Based on Kira Detrois' existing repo.

CORE FILE

N.B. The raw data contains a MEASUREMENT_FREE_TEXT column that unfortunately cannot be directly released as it contains data that is potentially sensitive. It contains a mix of numerical measurement values, positive/negative outcomes, outcomes linked to thresholds (e.g. <3ml) and general notes. Our approach has been to extract such data from the original column through a process of cleaning and whitelisting of the field.

This is the file that contains mostly just workable data for analysis, like values,outcomes and pos/neg status both raw and extracted from free text where possible.

Column Name Easy Description Technical Notes
ROW_ID ROW ID that allows to compare lines across steps of the pipeline
FINNGENID Pseudoanonimized IDs
SEX Sex of the sample
EVENT_AGE Age of the individual at the time fo the event
APPROX_EVENT_DATETIME Approximate event (+- two weeks) of the event
OMOP_CONCEPT_ID OMOP id of the mapping The mapping is done using a table
TEST_NAME Name of test. It's either the local one (manipulated to standard) or the national one mapped via the lab_id table The map for the national (THL) IDs is in the data folder,and was downloaded from Kuntaliitto - Laboratoriotutkimusnimikkeistö
MEASUREMENT_VALUE_HARMONIZED The harmonized numeric value of the measurement
MEASUREMENT_UNIT_HARMONIZED The harmonized unit of the measurement from the data The conversion happens from a table and by defining a target unit from another table built using the most common unit for each concept ID
MEASUREMENT_VALUE_EXTRACTED Numerical value extracted from the measurement free text field The column remove special prefixes (check config) and specific characteres (as before). If the remaining text is a float it's kept. In our analysis, the values statistically align with the distribution of harmonized values, thus suggesting that the units are the same, although some QC is still needed.
MEASUREMENT_VALUE_MERGED Column containing harmonized values and merged values, where former is missing and the latter present. The merged column undergoes a final "ad-hoc" harmonization step. Since extracted values are most often without unit we assume that they are in the target unit and perform post merging QCing. (see below QC_PASS). When we see that there is clearly an issue at the srouce we use a table to perform a final conversion based on numerical thresholds (e.g. hematocrit >100 as it's sometimes reported in the wrong unit). At times we're also forced to blacklist extracted values since there is a mismatch of units and we cannot use numerical thresholds to disentangle the faulty entries.
TEST_OUTCOME Outcome of the lab measurement. (typical values H=high, L=low, N=Normal, A=Abnormal) The AR/LABRA - Exception Messages classification is used in local and national healthcare information systems to indicate deviations or changes in laboratory measurements and other measurements or examinations in relation to reference values ​​or normal values. The content of the classification follows the content of the following classification approved by HL7International (FHIR Value set/code system definition for HL7 v2 table 0078 (Interpretation Codes), see AR/LABRA - Poikkeustilanneviestit for the abbreviations. N.B. We have found at times this columns to be contraddicting the OUTCOME_POS_EXTRACTED column. Our clinicians have pointed out that this column is automatically generated and might not always reliable. Such mismatches are most often linked to panel tests where the outcome is assigned to the whole set of panel tests as a whole while the information in the free text data is different for each entry. These mismatches are flagged in the QC_NOTES so users can be aware of the mismatch. In general, we recommend using the free text column for these values.
TEST_OUTCOME_IMPUTED Imputed outcome values of the lab measurement We automatically tried to detecth high/low value thresholds based on lab value entries where both MEASUREMENT_VALUE_HARMONIED and TEST_OUTCOME appear together at least 100 times. Values are H,L,N for high,low and normal. All entries with valid OMOP ids with >100 counts were imputed, including already labelled ones. The * sign indicates that there were issues in defining the threshold for H/L thresholds so one should use these values with caution.In the data folder there's a table that shows the values used to determine lower/higher limits for each OMOP id. Please see the [handbook] (https://docs.finngen.fi/finngen-data-specifics/red-library-data-individual-level-data/what-phenotype-files-are-available-in-sandbox-1/kanta-lab-values/data) for more info.
OUTCOME_POS_EXTRACTED POSITIVE(1) or NEGATIVE(0) status extracted from free text. We extracted all strings from the measurement free text field containing either pos or neg as a substring. We then checked them one by one and assigned to each a POS or NEG status. The mapping table can be seen here. Virtually all cases are straightforward as the Finnish language doesn't contain false positives (pun kinda intended). The only problematic entries are linked to RH pos/neg, which we decided to skip as they are not relevant. More fields are included from a table through a similar process of whitelisting. These are families of values are the ones that show a + sign in their text, mostly from presence in urine measurements.
TEST_OUTCOME_TEXT_EXTRACTED Abnormality texts extracted from free text after standardization. It's formatted as [<\>]|[VALUE]|[UNIT?] We applied similar filters as for MEASUREMENT_VALUE_EXTRACTED also allowing for a comparison (e.g. Yli/Alle/</>) and a unit as long as it's in list of target units (after basic string manipulation). More fields are included from a table through a similar process of whitelisting. These are families of values are the ones that show a + sign in their text, mostly from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+.
QC_PASS Boolean indicator of numerical QC. Values are initialized as 2 (unchcked). Through a table containing numerical thresholds (e.g. impossible or implausible values) the column is updated to 1 (checked and passed) or 0 (checked and not passed). These filters are not to be taken as definite but they are the ones used in the analysis side for GWAS and downstream analyses. We always recommend to double check yourself based on your needs.
QC_NOTES Description of QC considerations. Post-harmonization ad hoc conversions are annotated here (like Hematocrit [0,1] --> [0,100] conversion). It also includes flagging of mismatches between TEST_OUTCOME and OUTCOME_POS_EXTRACTED

EXTENDED_COLUMNS FILE

This is the file that contains all columns, including mostly empty and/or uninformative ones. It can be used as a default file for analysis, but there are pre-harmonization (source) columns that might lead to confusion. Very useful for debugging pruposes or for custom QCing. The columns unique to this file are:

Column Name Easy Description Technical Notes
TEST_ID National (THL) or local lab ID of the measurement
TEST_ID_IS_NATIONAL Source of the lab ID 0: local and 1: national (THL)
TEST_NAME_SOURCE The original abbreviation of the measurement
MEASUREMENT_VALUE_SOURCE Original measurement value
MEASUREMENT_UNIT_SOURCE Original measurement unit
MEASUREMENT_STATUS The measurement status The final data contains only C - corrected results or F - final result | See Koodistopalvelu - AR/LABRA - Tutkimusvastauksien tulkintakoodit 1997
REFERENCE_RANGE_GROUP The reference values for the measurement in text form This can be used to define the lab abnormality with better coverage using regex expressions (-to be implemented for the whole data).
REFERENCE_RANGE_[LOWER|UPPER]_[VALUE|UNIT] Reference lower|upper value|unit | Mostly mutually exclusive with REFERENCE_RANGE_GROUP
CODING_SYSTEM_ORG Name (if present) of the uploading org, mapped via Business Identity Code From a table indata folder. We noticed that the central digits in most codes were BIC (sometimes missing leadin 0) and thus we could extract the macro information from the bigger table. The result is not as granular as all subfields (e.g. Espoo_221) are now merged into just Espoon_Kaupunki (City of Espoo).
CODING_SYSTEM_OID ID of what is probably the org that logged the data in the system.

The raw to output column mapping is as follows:

Column in raw file Description
FINNGENID FINNGENID
EVENT_AGE EVENT_AGE
tutkimuskoodistonjarjestelmaid CODING_SYSTEM
paikallinentutkimusnimike_selite TEST_NAME_ABBREVIATION
tutkimustulosarvo MEASUREMENT_VALUE
tutkimustulosyksikko MEASUREMENT_UNIT
tutkimusvastauksentilaid MEASUREMENT_STATUS
tuloksenpoikkeavuus TEST_OUTCOME
viitearvoryhma REFERENCE_RANGE_GROUP
viitevalialkuarvo REFERENCE_RANGE_LOWER_VALUE
viitevalialkuyksikko REFERENCE_RANGE_LOWER_UNIT
viitevaliloppuarvo REFERENCE_RANGE_UPPER_VALUE
viitevaliloppuyksikko REFERENCE_RANGE_UPPER_UNIT
tutkimustulosteksti MEASUREMENT_FREE_TEXT

Summary of the wdl pipeline

UNIT INJECTION

Many lab records carry a numeric value but no unit (the MEASUREMENT_UNIT_PRE_FIX field is NULL). The unit injection pipeline (scripts/injection/) identifies these records and, where possible, assigns a unit by comparing their value distribution against records of the same TEST_NAME that already have a unit.

Each TEST_NAME is classified into one of three categories based on the amount of reference data available (records that have both a value and a unit):

Category Condition How a unit is assigned
UNAMBIGUOUS One unit accounts for ≥ threshold% of reference records Distribution comparison against that dominant unit via KS / t / MAD tests
AMBIGUOUS Multiple units present; no single one dominates Per-unit comparison, optionally after splitting a bimodal distribution
NO_DATA Fewer than --min-target-n reference records with any unit No engine run; unit rescued via OMOP (see below)

NO_DATA handling

NO_DATA TEST_NAMEs have too little reference data for a statistical distribution comparison. Instead, they are enriched using the OMOP unit table: if the OMOP concept mapped to a NO_DATA test has category SINGLE (only one unit ever observed) or EQUIVALENT (multiple units but all mutually convertible), the canonical OMOP unit is injected directly. Tests with OMOP category MULTIPLE or with no OMOP mapping receive no unit.

Results are written to no_data_results.tsv. At the active 98% threshold (~500 min-count), ~28% of TEST_NAMEs (≈202) fall into this category, representing ~3.5% of measurements.

For full details on the injection engine, bimodality detection, and output columns, see the injection README.

MERGE OF RAW DATA

Inputs reports and responses are sorted over the same keys and merged

DUPLICATE REMOVAL & SORTING

First the wdl trims the data of only the relevant columns (taken from the config) and sorts it by a series of columns (also specified in the config under sort_cols) so we can also discard duplicate entries.

Preprocessing

There is a config file that contains all the relevant "choices" about how to manipulate the data (e.g. which columns to include, how to rename columns, which values of which column to include etc.) so there are virtually no hard coded elements in the code itself.

TECHNICAL INFO

The code performs the following actions:

  • output columns are initialized
  • spaces are removed everywhere in the text
  • the date is built in the right format
  • all type of NA/missing values (Puuttuu,"_" etc.) are replaced with "NA"
  • entries with invalid hetu root are removed (and logged)
  • entries with invalid measurement status are removed (and logged)
  • TEST_ID_IS_NATIONAL is created checking if laboratoriotutkimusnimike_koodi is not NA (1 national/0 regional)
  • TEST_ID is created assigning the regional id for regional labs and a
  • TEST_NAME_ABBREVIATION is updated for national labs through a mapping
  • CODING_SYSTEM is updated when available (problematic ATM, see table above)
  • CODING_SYSTEM_MAP is created from y-tunnukset in table
  • SERVICE_PROVIDER_ID is updated where the mapping can happen --> scrapped in v3
  • TEST_NAME_ABBREVIATIONs with problematic characters are edited (see abbreviation_replacements in the config)
  • Fixes strange characters in the lab unit field. Also moves to lower case for non NA values.
  • Mapping of units. This can be done either via regex (from config) or through a mapping
  • TEST_OUTCOME is edited to be consistent with the standard definition see AR/LABRA - Poikkeustilanneviestit. This means replacing < with L, > with H, POS with A and NEG with N.
  • Mapping status is updated (internal thing)
  • IS_UNIT_VALID column is populated based on whether the unit is in usagi list
  • Unit edit or injection to make sure all abbreviations with similar units are mapped to the same one (e.g. mg --> mg/24h for du-prot). Based on a table
  • OMOP mapping from a table
  • unit harmonization (optional but default) from a table

How it works

The script reads in the data in chunks of --chunksize length and it processes the lines with Python's pandas. With the flag --mp and --jobs the script runs each chunk into other smaller subchunks in parallel (efficiency TBD). The filter folder contains separate scripts that perform conceptually separate tasks. Each of them contains a global function of the same name of the script that gathers all individual functions that populate the script. In this way we can easily compartmentalize the munging/qc and add new features.

usage: main.py [-h] [--raw-data RAW_DATA] [--log {critical,error,warn,warning,info,debug}] [--test] [--gz] [--mp [MP]] [-o OUT] [--prefix PREFIX] [--sep SEP] [--chunk-size CHUNK_SIZE] [--lines LINES] [--unit-map {regex,map,none}] [--harmonization [HARMONIZATION]]

Kanta Lab preprocessing pipeline: raw data ⇒ clean data.

options:
  -h, --help            show this help message and exit
  --raw-data RAW_DATA   Path to input raw file. File should be tsv.
 --log {critical,error,warn,warning,info,debug}Provide logging level. Example '--log debug', default = 'warning'
  --test                Reads first chunk only
  --gz                  Ouputs to gz
  --mp [MP]             Flag for multiproc. Default is '0' (no multiproc). If passed it defaults to cpu count, but one can also specify the number of cpus to use: e.g. '--mp' or '--mp 4'.
  -o OUT, --out OUT     Folder in which to save the results (default = current working directory)
  --prefix PREFIX       Prefix of the out files (default = 'kanta_YYYY_MM_DD')
  --sep SEP             Separator (default = tab)
  --chunk-size CHUNK_SIZE      Number of rows to be processed by each chunk (default = '100').
  --lines LINES         Number of lines in input file (calculated/estimated otherwise).
  --unit-map {regex,map,none}  How to replace units. Map uses the unit_mapping.txt mapping in data and regex after. Regex does only regex. none skips it entirely.
  --harmonization [HARMONIZATION]  Path to tsv with concept id and target unit.

Core

The core folder contains the pipeline that produces a similar file, which is used for analysis purposes.

The current version outputs new columns, which are taken from the MEASUREMENT_FREE_TEXT column, which cannot be shared due to data privacy reasons for the time being. The columns are:

Column Name Easy Description General Notes Technical Notes
MEASUREMENT_VALUE_EXTRACTED Numerical values The column is mutually exclusive with harmonization_omop::MEASUREMENT_VALUE. In our analysis the units seemed to be pretty much always coherent with our target OMOP units, albeit for occasional clusters of outliers that are present also in the core data. The content of the MEASUREMENT_FREE_TEXT colum is cleaned by removal of spaces, converted to lower case, the target omop unit is removed if present, certain result strings are removed. If we're left with a pure float number, it's considered to be an extracted value.
extracted::MEASUREMENT_VALUE_MERGED extracted::MEASUREMENT_VALUE_MERGED Merge of harmonized value column and of extracted value column
OUTCOME_POS_EXTRACTED Boolean Pos(1) or Neg(0) status We extracted all strings containing the substring pos/neg and we manually mapped them to pos/neg statuses. The mapping is stored in the data folder. We also added positive extractions linked to (mostly) from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+. The mapping is stored in the data folder
TEST_OUTCOME_TEXT_EXTRACTED Strings Simplified/summarized strings present in free text. E.g. (YLI 3.0 ml --> >3ml).
QC_PASS Boolean indicator of numerical QC. Values are initialized as 2 (unchcked). Through a table containing numerical thresholds (e.g. impossible or implausible values) the column is updated to 1 (checked and passed) or 0 (checked and not passed). These filters are not to be taken as definite but they are the ones used in the analysis side for GWAS and downstream analyses. We always recommend to double check yourself based on your needs.
QC_NOTES Description of QC considerations. Post-harmonization ad hoc conversions are annotated here (like Hematocrit [0,1] --> [0,100] conversion). It also includes flagging of mismatches between TEST_OUTCOME and OUTCOME_POS_EXTRACTED

TECHNICAL INFO

The princples are identical to the finngen_qc pipeline, just with different filters being used.

This filter extracts info from the free text column:

  • extract_outcome: This function extracts test outcomes (e.g., "<5", ">10") from the "MEASUREMENT_FREE_TEXT" column. It identifies rows with status indicators, standardizes the text, parses the comparison operator, value, and unit, and stores the extracted outcome in a new column named "extracted::TEST_OUTCOME_TEXT". It also performs some data cleaning and unit mapping. Another type of families of values are the ones that show a + sign in their text, mostly from presence in urine measurements. They are often formatted as +/++ or with numerical indicators like 3+. These values are taken from a mapping file.

  • extract_measurement: This function extracts numerical measurement values from the "MEASUREMENT_FREE_TEXT" column and stores them in a new column called "extracted::MEASUREMENT_VALUE". It also creates a "extracted::MEASUREMENT_VALUE_MERGED" column, which prioritizes values from the "harmonization_omop::MEASUREMENT_VALUE" column if available, otherwise using the newly extracted values.

  • extract_positive: This function takes in a table and maps these full free text fields (manually checked by us) to 1(pos)/0(neg) outcomes. These free text contains the substrings pos or neg in them.

  • extract_plus_ab" This function takes in a table and maps these full free text fields (manually checked by us) to 1(pos)/0(neg) outcomes. These strings contains the sign + in them.

This is for all sorts of QCing of the data, ideally for outlier filter/removal:

  • all extracted values that can be misinterpreted as dates are removed (DDMMYY) as they can either be birth dates or the exact date of the examination, which is shifted +- 2 weeks for each FINNGENID
  • we perform a post harmonization fix for certain values where a mismatch of units exist. For the time being we've only seen his happen on the whole OMOP_ID scale with Hematocrit, where some values are in the [0,1] range instead of the [0,100] range as expected. This is mostly linked to extracted values, but it also affects heavily harmonized values. The mapping file can be found in the data folder
  • we use a table to perform basic QC. The table enumerates thresholds for which we find values that are impossible/implausible or for which the pipeline cannot handle potential issues with unit. If en entry matches the criteria in the table, QC_PASS is set to 0 and QC_NOTES is updated to describe what went wrong according to our analysis
  • we use a table to mask out extracted values from the MEASUREMENT_VALUE_MERGED column. QC columns are also updated. ATM it only involves only a handful of OMOP_IDS that clearly have problematic coexistence of units, either by nature (e.g. hematocrit in two different units) or clearly typos at source. In general, we prefer to QC problematic values and the idea is to use this list carefully only for clear cut cases where there is no space for ambiguity whatsoever.
  • we flag values for which the TEST_OUTCOME column does not match the extracted::IS_POS column by adding the string OUTCOME_EXTRACT_CONFLICT to QC_NOTES
  • TEST_OUTCOME_IMPUTED is generated from the (merged) numerical values based on thresholds learned from the data when both values and outcomes are present

How it works

The princples are identical to the finngen_qc pipeline.

usage: main.py [-h] --raw-data RAW_DATA [--log {critical,error,warn,warning,info,debug}] [--test] [--gz] [--mp [MP]] [-o OUT] [--prefix PREFIX] [--sep SEP]
               [--chunk-size CHUNK_SIZE] [--lines LINES]

Kanta Lab analysis pipeline: clean data ⇒ analysis data.

options:
  -h, --help            show this help message and exit
  --raw-data RAW_DATA   Path to input raw file. File should be tsv.
  --log {critical,error,warn,warning,info,debug Provide logging level. Example '--log debug', default = 'warning'
  --test                Reads first chunk only
  --gz                  Ouputs to gz
  --mp [MP]             Flag for multiproc. Default is '0' (no multiproc). If passed it defaults to cpu count, but one can also specify the number of cpus  to use: e.g. '--mp' or '--mp 4'.
  -o OUT, --out OUT     Folder in which to save the results (default = current working directory)
  --prefix PREFIX       Prefix of the out files (default = 'kanta_YYYY_MM_DD')
  --sep SEP             Separator (default = tab)
  --chunk-size CHUNK_SIZE
                        Number of rows to be processed by each chunk (default = '1000*n_cpus').
  --lines LINES         Number of lines in input file (calculated/estimated otherwise).

About

Repo for kanta lab QC

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors