Skip to content

Fix 'flan global prepare' and 'global fit' pipeline execution errors#4

Open
TheVidz wants to merge 12 commits into
genxnetwork:mainfrom
TheVidz:main
Open

Fix 'flan global prepare' and 'global fit' pipeline execution errors#4
TheVidz wants to merge 12 commits into
genxnetwork:mainfrom
TheVidz:main

Conversation

@TheVidz

@TheVidz TheVidz commented Jun 15, 2026

Copy link
Copy Markdown

This Pull Request addresses critical workflow failures within the repository that were preventing the flan global prepare and flan global fit pipelines from executing properly. Following a fork of the codebase, multiple underlying bugs, data formatting constraints, and file errors were identified and resolved, resulting in a fully functional end-to-end execution.

Key Changes & Bug Fixes

  1. Pipeline Execution & Logic Corrections
    global fit Resolution: Fixed an issue within loader.py that directly blocked the global fit process from completing successfully.

prepare Step Stabilization: Addressed a critical crash in qc.py impacting the data preparation phase. Follow-up stabilization was implemented in local_plink.py to ensure prepare functions reliably across local environments.

Revert Correction: Resolved regression errors introduced during troubleshooting by successfully handling and reverting conflicting changes (de891d0).

  1. Bioinformatics & Data Quality Filters
    Variant ID Resolution: Fixed a major issue causing crashes during --read-freq operations due to duplicate variant IDs formatted as '.'.

Allele Count Filtering: Implemented a minimum allele count threshold (--mac 1) to ensure data integrity during processing.

Sample Splitter Fix: Patched a num_folds validation error occurring inside the sample splitting logic.

  1. Repository Cleanups
    Corrected a naming issue in deep_ancestry to ensure proper routing/redirecting to flan.

Verification & Testing
Verified that flan global prepare successfully runs through completion without qc or variant ID errors.

Verified that flan global fit correctly initializes and executes training using the corrected loader.py implementation.

Summary by Sourcery

Restore end-to-end execution of the flan global prepare and global fit pipelines by aligning phenotypes with PCA features, wiring QC outputs through subsequent steps, and hardening local PCA and dataset splitting behavior.

Bug Fixes:

  • Align phenotype loading and PLINK PC scores by IID to prevent shape and ordering mismatches in the global fit pipeline.
  • Ensure QC-processed genotype data is used consistently for splitting genotypes and running PCA, avoiding crashes due to unclean source data.
  • Remove reliance on --read-freq and add a minimum allele count filter in local PLINK PCA steps to prevent failures from invalid or zero-frequency variants.
  • Guard sample splitting with a fallback/default num_folds value to avoid errors when the argument is missing.
  • Update cache configuration paths and deep_ancestry package wiring so node1 runs correctly in the current environment.

Enhancements:

  • Relax overly strict index assertions and intersect available IDs when loading PLINK PCs to make data loading more robust to partial overlaps.

@sourcery-ai

sourcery-ai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Reviewer's Guide

Aligns phenotypes and PCA features by IID, routes all downstream processing through QC’d PLINK data, hardens sample splitting and PCA scoring to avoid frequency/ID crashes, and updates local configs and package init for deep_ancestry.

Sequence diagram for IID-aligned phenotype and PCA loading

sequenceDiagram
    participant Loader as LocalDataLoader
    participant Cache as FileCache
    participant Pandas as pandas
    participant LoadPCs as load_plink_pcs
    participant LoadPheno as load_phenotype

    Loader->>Cache: pca_path(fold, 'train', 'sscore')
    Cache-->>Loader: train_sscore_path
    Loader->>Pandas: read_csv(train_sscore_path, sep='\t')
    Pandas-->>Loader: df_train_sscore
    Loader->>Loader: iids_train = df_train_sscore['IID']

    Loader->>LoadPCs: load_plink_pcs(path=pca_path('train'), order_as_in_file=phenotype_path('train'))
    LoadPCs->>Pandas: read_table(path)
    Pandas-->>LoadPCs: df_pcs
    LoadPCs->>Pandas: read_csv(order_as_in_file, sep='\t').set_index('IID')
    Pandas-->>LoadPCs: df_pheno_indexed
    LoadPCs->>LoadPCs: common_ids = df_pheno_indexed.index ∩ df_pcs.index
    LoadPCs->>LoadPCs: df_pcs = df_pcs.reindex(common_ids)
    LoadPCs-->>Loader: X_train_df
    Loader->>Loader: X_train_df = X_train_df.reindex(iids_train)

    Loader->>Cache: phenotype_path(fold, 'train')
    Cache-->>Loader: train_pheno_path
    Loader->>LoadPheno: load_phenotype(train_pheno_path, out_type=int64, encode=True, keep_iids=iids_train)
    LoadPheno->>Pandas: read_table(train_pheno_path)
    Pandas-->>LoadPheno: pheno_df
    LoadPheno->>LoadPheno: pheno_df = pheno_df.set_index('IID').reindex(keep_iids).reset_index()
    LoadPheno-->>Loader: y_train

    Loader->>Loader: repeat for val, test
    Loader-->>Loader: X, Y aligned by IID
Loading

File-Level Changes

Change Details Files
Align phenotype loading and PCA feature matrices by common IIDs to prevent global fit crashes.
  • Extend load_phenotype to accept an optional keep_iids list and reindex phenotype table on IID before extracting the phenotype column.
  • Relax load_plink_pcs ordering by removing the strict length assert and reindexing PCA dataframe to the intersection of phenotype and PCA IIDs.
  • Update LocalDataLoader.load to read IIDs from sscore files for train/val/test and pass them into phenotype and PCA loaders so X and Y are keyed to the same sample IDs.
  • Modify _load_pcs to optionally reindex train/val/test PCA matrices by explicit IID arrays before converting to numpy arrays.
flan/nn/loader.py
Ensure QC-processed PLINK data is used throughout splitting and downstream steps, and adjust PLINK QC arguments to fix crashes on missing/duplicate variant IDs.
  • In qc.fit_transform, write QC output to a new *_qc pfile, set --set-missing-var-ids to a more specific template, and update FileCache._pfile_path to the QC output.
  • In SampleSplitter._split_genotypes and fit_transform, force use of the *_qc PLINK base path when creating per-fold train/val/test splits and phenotypes.
  • Introduce a default num_folds=5 when args.num_folds is missing to avoid splitter validation errors.
flan/preprocess/qc.py
flan/preprocess/sample_splitter.py
Stabilize local PCA projection by avoiding problematic read-freq usage and enforcing minimal allele count filtering.
  • In local_plink.transform, remove --read-freq from PLINK score call and add --mac 1 so per-part PCA scoring filters low-frequency/zero variants locally.
  • Mirror the same argument change in local_plink.predict for prediction-time PCA scoring.
flan/pca/local_plink.py
Adjust local paths and package structure to work in the forked environment and correctly route deep_ancestry imports to flan.
  • Update cache and source node1 configs to use relative local paths instead of absolute /data paths.
  • Add a deep_ancestry/init.py module so the package resolves correctly and can be used as an entry point to flan.
scripts/configs/cache/node1.yaml
scripts/configs/source/node1_50.yaml
deep_ancestry/__init__.py

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 5 issues, and left some high level feedback:

  • Several places now mutate cache._pfile_path directly and rely on the "_qc" suffix; consider exposing a public method or dedicated QC-aware accessor on FileCache instead of reaching into a private attribute and using string conventions.
  • The changes to load_plink_pcs and _load_pcs silently intersect and reindex on IIDs, which can drop or reorder samples without any visibility; it would be safer to log or assert when expected IDs are missing/mismatched so data leakage or misalignment is easier to detect.
  • In SampleSplitter._split_ids, forcing self.args.num_folds to a minimum of 5 may unexpectedly override user configuration; if the intent is validation, consider raising an error or warning instead of silently changing the argument.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- Several places now mutate `cache._pfile_path` directly and rely on the `"_qc"` suffix; consider exposing a public method or dedicated QC-aware accessor on `FileCache` instead of reaching into a private attribute and using string conventions.
- The changes to `load_plink_pcs` and `_load_pcs` silently intersect and reindex on IIDs, which can drop or reorder samples without any visibility; it would be safer to log or assert when expected IDs are missing/mismatched so data leakage or misalignment is easier to detect.
- In `SampleSplitter._split_ids`, forcing `self.args.num_folds` to a minimum of 5 may unexpectedly override user configuration; if the intent is validation, consider raising an error or warning instead of silently changing the argument.

## Individual Comments

### Comment 1
<location path="flan/nn/loader.py" line_range="31-40" />
<code_context>


-def load_phenotype(phenotype_path: str, out_type = numpy.float32, encode = False) -> numpy.ndarray:
+def load_phenotype(phenotype_path: str, out_type = numpy.float32, encode = False, keep_iids = None) -> numpy.ndarray:
     """
     :param phenotype_path: Phenotypes location
     :param out_type: convert to type
     :param encode: whether phenotypes are strings and we want to code them as ints)
     """
     data = pandas.read_table(phenotype_path)
+    # Highlighted Fix: If a list of aligned IIDs is provided, filter and order by them
+    if keep_iids is not None:
+        data = data.set_index('IID').reindex(keep_iids).reset_index()
+        
     data = data.iloc[:, -1].values
     if encode:
         _, data = numpy.unique(data, return_inverse=True)
</code_context>
<issue_to_address>
**issue (bug_risk):** NaNs introduced by `keep_iids` are silently encoded into integers, hiding missing phenotypes.

When `keep_iids` is used, `set_index(...).reindex(keep_iids).reset_index()` will produce NaNs for IIDs present in `keep_iids` but missing from the phenotype file. Because `numpy.unique(..., return_inverse=True)` runs before any NaN check, those NaNs are treated as a category and converted to integers, so a later `numpy.isnan(phenotype)` check can never detect them. Please either (a) validate for NaNs immediately after the `reindex` and fail if any are found, or (b) perform the NaN check on the unencoded `data` before calling `numpy.unique`.
</issue_to_address>

### Comment 2
<location path="flan/nn/loader.py" line_range="91-100" />
<code_context>
+    def _load_pcs(self, cache: FileCache, fold: int, iids_train=None, iids_val=None, iids_test=None) -> X:
</code_context>
<issue_to_address>
**issue (bug_risk):** Reindexing PCs with IIDs can create all-NaN rows when PCA IIDs and phenotype IIDs differ.

In `_load_pcs`, `load_plink_pcs` already intersects PCA sscore IIDs with the phenotype IIDs from `order_as_in_file`. When you then reindex on `iids_train` / `iids_val` / `iids_test`, any IIDs present only in the sscore file will be reintroduced as all-NaN rows. These NaNs persist through `.values.astype(numpy.float32)`, so you can end up with feature rows that are empty or misaligned with your phenotype filtering. Consider intersecting `iids_*` with the DataFrame index before reindexing, or at least asserting that the number of non-NaN rows matches expectations.
</issue_to_address>

### Comment 3
<location path="flan/preprocess/sample_splitter.py" line_range="32-34" />
<code_context>
             y: y can be passed to trigger StratifiedKFold instead of KFold
             random_state (int): Fixed random_state for train_test_split sklearn function
         """
+        # adding min 5 folds
+        num_folds = getattr(self.args, "num_folds", 5)
+        self.args.num_folds = num_folds
+
         ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
</code_context>
<issue_to_address>
**issue (bug_risk):** Forcing a minimum of 5 folds on `self.args.num_folds` can desync from `cache.num_folds` and the on-disk layout.

Here `self.args.num_folds` is forced to ≥5, but other code (e.g. `_split_genotypes`, PCA) iterates over `range(cache.num_folds)`. With configs like `scripts/configs/cache/node1.yaml` (`num_folds: 1`), `_split_ids` will create 5 folds of IDs while only 1 fold of genotypes/artifacts exists. This can break downstream logic for folds > 0. Instead of enforcing a minimum here, keep `self.args.num_folds` aligned with `cache.num_folds` (or validate and update them together).
</issue_to_address>

### Comment 4
<location path="flan/preprocess/qc.py" line_range="34-35" />
<code_context>
+            }
+        )
+
+        # ✅ VERY IMPORTANT: update cache to point to QC output
+        cache._pfile_path = qc_path

     def transform(self, source_path: str, dest_path: str) -> None:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Directly mutating `cache._pfile_path` and using string concatenation may break expectations about `pfile_path` type and structure.

Assigning `qc_path` (built as `str(cache.pfile_path()) + "_qc"`) back into `cache._pfile_path` changes its type from a `Path`-like object to a plain string. Any downstream code using `pfile_path()` as a `Path` (e.g. with `/` or `.with_suffix`) may then fail unexpectedly. The same pattern appears in `SampleSplitter.fit_transform`. Please construct `qc_path` as a `Path` (for example via `with_name`) and update the cache via a public method rather than mutating the private `_pfile_path` attribute directly.

Suggested implementation:

```python
            args_dict={
                '--out': str(qc_path),
                '--set-missing-var-ids': '@:#:$r:$a',
                **self.qc_config
            }
        )

        # ✅ VERY IMPORTANT: update cache to point to QC output
        cache.set_pfile_path(qc_path)

```

1. Ensure `qc_path` is constructed as a Path-like object rather than via string concatenation. For example, wherever `qc_path` is currently defined (likely as `str(cache.pfile_path()) + "_qc"`), replace it with something like:
   - `qc_path = cache.pfile_path().with_name(cache.pfile_path().name + "_qc")`
   or, if you need to preserve suffixes differently, use an appropriate `pathlib.Path` method.
2. Implement the new public method `set_pfile_path` on the `cache` object’s class (if it does not already exist). It should:
   - Accept a Path-like object.
   - Store it internally (e.g. in `_pfile_path`) without changing its type.
   - Maintain any invariants or side effects previously associated with updating `_pfile_path`.
3. Update `SampleSplitter.fit_transform` to follow the same pattern:
   - Construct its QC (or split) path using Path operations, not string concatenation.
   - Use `cache.set_pfile_path(...)` (or the equivalent public API) instead of mutating `cache._pfile_path` directly.
</issue_to_address>

### Comment 5
<location path="flan/nn/loader.py" line_range="53-59" />
<code_context>
         y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID')
-        assert len(df) == len(y)
-        df = df.reindex(y.index)
+        # Highlighted Fix: Drop the strict assert check and intersect valid indices instead
+        common_ids = y.index.intersection(df.index)
+        df = df.reindex(common_ids)

     return df
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Silently dropping non-common IDs in `load_plink_pcs` can hide serious data issues.

Using `common_ids = y.index.intersection(df.index)` and reindexing means mismatched phenotype/PCA IDs now result in silently dropped samples. If overlap is very small (or zero), downstream code may unknowingly run on an unrepresentative or empty dataset. Please add a sanity check (e.g., ensure `len(common_ids) > 0` and that the overlap is close to `len(y)`), and raise or log if the overlap is unexpectedly small.

```suggestion
    if order_as_in_file is not None:
        y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID')
        # Drop the strict assert check but enforce sanity checks on the overlap
        common_ids = y.index.intersection(df.index)

        # Sanity checks on overlap between phenotype/order file and PCA data
        n_expected = len(y)
        n_overlap = len(common_ids)
        if n_overlap == 0:
            raise ValueError(
                f"No overlapping IIDs between order file ({order_as_in_file}) "
                f"and PCA data (0 / {n_expected} overlapped)."
            )

        overlap_ratio = n_overlap / n_expected
        if overlap_ratio < 0.95:
            logging.getLogger(__name__).warning(
                "Only %d/%d (%.1f%%) IIDs overlap between order file (%s) and PCA data; "
                "samples with mismatched IDs will be dropped.",
                n_overlap,
                n_expected,
                overlap_ratio * 100.0,
                order_as_in_file,
            )

        df = df.reindex(common_ids)

    return df
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment thread flan/nn/loader.py
Comment on lines +31 to +40
def load_phenotype(phenotype_path: str, out_type = numpy.float32, encode = False, keep_iids = None) -> numpy.ndarray:
"""
:param phenotype_path: Phenotypes location
:param out_type: convert to type
:param encode: whether phenotypes are strings and we want to code them as ints)
"""
data = pandas.read_table(phenotype_path)
# Highlighted Fix: If a list of aligned IIDs is provided, filter and order by them
if keep_iids is not None:
data = data.set_index('IID').reindex(keep_iids).reset_index()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): NaNs introduced by keep_iids are silently encoded into integers, hiding missing phenotypes.

When keep_iids is used, set_index(...).reindex(keep_iids).reset_index() will produce NaNs for IIDs present in keep_iids but missing from the phenotype file. Because numpy.unique(..., return_inverse=True) runs before any NaN check, those NaNs are treated as a category and converted to integers, so a later numpy.isnan(phenotype) check can never detect them. Please either (a) validate for NaNs immediately after the reindex and fail if any are found, or (b) perform the NaN check on the unencoded data before calling numpy.unique.

Comment thread flan/nn/loader.py
Comment on lines +91 to +100
def _load_pcs(self, cache: FileCache, fold: int, iids_train=None, iids_val=None, iids_test=None) -> X:
X_train = load_plink_pcs(path=cache.pca_path(fold, 'train', 'sscore'),
order_as_in_file=cache.phenotype_path(fold, 'train')).values.astype(numpy.float32)
order_as_in_file=cache.phenotype_path(fold, 'train'))
if iids_train is not None:
X_train = X_train.reindex(iids_train)
X_train = X_train.values.astype(numpy.float32)

X_val = load_plink_pcs(path=cache.pca_path(fold, 'val', 'sscore'),
order_as_in_file=cache.phenotype_path(fold, 'val')).values.astype(numpy.float32)
order_as_in_file=cache.phenotype_path(fold, 'val'))
if iids_val is not None:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Reindexing PCs with IIDs can create all-NaN rows when PCA IIDs and phenotype IIDs differ.

In _load_pcs, load_plink_pcs already intersects PCA sscore IIDs with the phenotype IIDs from order_as_in_file. When you then reindex on iids_train / iids_val / iids_test, any IIDs present only in the sscore file will be reintroduced as all-NaN rows. These NaNs persist through .values.astype(numpy.float32), so you can end up with feature rows that are empty or misaligned with your phenotype filtering. Consider intersecting iids_* with the DataFrame index before reindexing, or at least asserting that the number of non-NaN rows matches expectations.

Comment on lines +32 to +34
# adding min 5 folds
num_folds = getattr(self.args, "num_folds", 5)
self.args.num_folds = num_folds

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Forcing a minimum of 5 folds on self.args.num_folds can desync from cache.num_folds and the on-disk layout.

Here self.args.num_folds is forced to ≥5, but other code (e.g. _split_genotypes, PCA) iterates over range(cache.num_folds). With configs like scripts/configs/cache/node1.yaml (num_folds: 1), _split_ids will create 5 folds of IDs while only 1 fold of genotypes/artifacts exists. This can break downstream logic for folds > 0. Instead of enforcing a minimum here, keep self.args.num_folds aligned with cache.num_folds (or validate and update them together).

Comment thread flan/preprocess/qc.py
Comment on lines +34 to +35
# ✅ VERY IMPORTANT: update cache to point to QC output
cache._pfile_path = qc_path

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Directly mutating cache._pfile_path and using string concatenation may break expectations about pfile_path type and structure.

Assigning qc_path (built as str(cache.pfile_path()) + "_qc") back into cache._pfile_path changes its type from a Path-like object to a plain string. Any downstream code using pfile_path() as a Path (e.g. with / or .with_suffix) may then fail unexpectedly. The same pattern appears in SampleSplitter.fit_transform. Please construct qc_path as a Path (for example via with_name) and update the cache via a public method rather than mutating the private _pfile_path attribute directly.

Suggested implementation:

            args_dict={
                '--out': str(qc_path),
                '--set-missing-var-ids': '@:#:$r:$a',
                **self.qc_config
            }
        )

        # ✅ VERY IMPORTANT: update cache to point to QC output
        cache.set_pfile_path(qc_path)
  1. Ensure qc_path is constructed as a Path-like object rather than via string concatenation. For example, wherever qc_path is currently defined (likely as str(cache.pfile_path()) + "_qc"), replace it with something like:
    • qc_path = cache.pfile_path().with_name(cache.pfile_path().name + "_qc")
      or, if you need to preserve suffixes differently, use an appropriate pathlib.Path method.
  2. Implement the new public method set_pfile_path on the cache object’s class (if it does not already exist). It should:
    • Accept a Path-like object.
    • Store it internally (e.g. in _pfile_path) without changing its type.
    • Maintain any invariants or side effects previously associated with updating _pfile_path.
  3. Update SampleSplitter.fit_transform to follow the same pattern:
    • Construct its QC (or split) path using Path operations, not string concatenation.
    • Use cache.set_pfile_path(...) (or the equivalent public API) instead of mutating cache._pfile_path directly.

Comment thread flan/nn/loader.py
Comment on lines 53 to 59
if order_as_in_file is not None:
y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID')
assert len(df) == len(y)
df = df.reindex(y.index)
# Highlighted Fix: Drop the strict assert check and intersect valid indices instead
common_ids = y.index.intersection(df.index)
df = df.reindex(common_ids)

return df

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (bug_risk): Silently dropping non-common IDs in load_plink_pcs can hide serious data issues.

Using common_ids = y.index.intersection(df.index) and reindexing means mismatched phenotype/PCA IDs now result in silently dropped samples. If overlap is very small (or zero), downstream code may unknowingly run on an unrepresentative or empty dataset. Please add a sanity check (e.g., ensure len(common_ids) > 0 and that the overlap is close to len(y)), and raise or log if the overlap is unexpectedly small.

Suggested change
if order_as_in_file is not None:
y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID')
assert len(df) == len(y)
df = df.reindex(y.index)
# Highlighted Fix: Drop the strict assert check and intersect valid indices instead
common_ids = y.index.intersection(df.index)
df = df.reindex(common_ids)
return df
if order_as_in_file is not None:
y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID')
# Drop the strict assert check but enforce sanity checks on the overlap
common_ids = y.index.intersection(df.index)
# Sanity checks on overlap between phenotype/order file and PCA data
n_expected = len(y)
n_overlap = len(common_ids)
if n_overlap == 0:
raise ValueError(
f"No overlapping IIDs between order file ({order_as_in_file}) "
f"and PCA data (0 / {n_expected} overlapped)."
)
overlap_ratio = n_overlap / n_expected
if overlap_ratio < 0.95:
logging.getLogger(__name__).warning(
"Only %d/%d (%.1f%%) IIDs overlap between order file (%s) and PCA data; "
"samples with mismatched IDs will be dropped.",
n_overlap,
n_expected,
overlap_ratio * 100.0,
order_as_in_file,
)
df = df.reindex(common_ids)
return df

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant