Fix 'flan global prepare' and 'global fit' pipeline execution errors#4
Fix 'flan global prepare' and 'global fit' pipeline execution errors#4TheVidz wants to merge 12 commits into
Conversation
Reviewer's GuideAligns phenotypes and PCA features by IID, routes all downstream processing through QC’d PLINK data, hardens sample splitting and PCA scoring to avoid frequency/ID crashes, and updates local configs and package init for deep_ancestry. Sequence diagram for IID-aligned phenotype and PCA loadingsequenceDiagram
participant Loader as LocalDataLoader
participant Cache as FileCache
participant Pandas as pandas
participant LoadPCs as load_plink_pcs
participant LoadPheno as load_phenotype
Loader->>Cache: pca_path(fold, 'train', 'sscore')
Cache-->>Loader: train_sscore_path
Loader->>Pandas: read_csv(train_sscore_path, sep='\t')
Pandas-->>Loader: df_train_sscore
Loader->>Loader: iids_train = df_train_sscore['IID']
Loader->>LoadPCs: load_plink_pcs(path=pca_path('train'), order_as_in_file=phenotype_path('train'))
LoadPCs->>Pandas: read_table(path)
Pandas-->>LoadPCs: df_pcs
LoadPCs->>Pandas: read_csv(order_as_in_file, sep='\t').set_index('IID')
Pandas-->>LoadPCs: df_pheno_indexed
LoadPCs->>LoadPCs: common_ids = df_pheno_indexed.index ∩ df_pcs.index
LoadPCs->>LoadPCs: df_pcs = df_pcs.reindex(common_ids)
LoadPCs-->>Loader: X_train_df
Loader->>Loader: X_train_df = X_train_df.reindex(iids_train)
Loader->>Cache: phenotype_path(fold, 'train')
Cache-->>Loader: train_pheno_path
Loader->>LoadPheno: load_phenotype(train_pheno_path, out_type=int64, encode=True, keep_iids=iids_train)
LoadPheno->>Pandas: read_table(train_pheno_path)
Pandas-->>LoadPheno: pheno_df
LoadPheno->>LoadPheno: pheno_df = pheno_df.set_index('IID').reindex(keep_iids).reset_index()
LoadPheno-->>Loader: y_train
Loader->>Loader: repeat for val, test
Loader-->>Loader: X, Y aligned by IID
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 5 issues, and left some high level feedback:
- Several places now mutate
cache._pfile_pathdirectly and rely on the"_qc"suffix; consider exposing a public method or dedicated QC-aware accessor onFileCacheinstead of reaching into a private attribute and using string conventions. - The changes to
load_plink_pcsand_load_pcssilently intersect and reindex on IIDs, which can drop or reorder samples without any visibility; it would be safer to log or assert when expected IDs are missing/mismatched so data leakage or misalignment is easier to detect. - In
SampleSplitter._split_ids, forcingself.args.num_foldsto a minimum of 5 may unexpectedly override user configuration; if the intent is validation, consider raising an error or warning instead of silently changing the argument.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Several places now mutate `cache._pfile_path` directly and rely on the `"_qc"` suffix; consider exposing a public method or dedicated QC-aware accessor on `FileCache` instead of reaching into a private attribute and using string conventions.
- The changes to `load_plink_pcs` and `_load_pcs` silently intersect and reindex on IIDs, which can drop or reorder samples without any visibility; it would be safer to log or assert when expected IDs are missing/mismatched so data leakage or misalignment is easier to detect.
- In `SampleSplitter._split_ids`, forcing `self.args.num_folds` to a minimum of 5 may unexpectedly override user configuration; if the intent is validation, consider raising an error or warning instead of silently changing the argument.
## Individual Comments
### Comment 1
<location path="flan/nn/loader.py" line_range="31-40" />
<code_context>
-def load_phenotype(phenotype_path: str, out_type = numpy.float32, encode = False) -> numpy.ndarray:
+def load_phenotype(phenotype_path: str, out_type = numpy.float32, encode = False, keep_iids = None) -> numpy.ndarray:
"""
:param phenotype_path: Phenotypes location
:param out_type: convert to type
:param encode: whether phenotypes are strings and we want to code them as ints)
"""
data = pandas.read_table(phenotype_path)
+ # Highlighted Fix: If a list of aligned IIDs is provided, filter and order by them
+ if keep_iids is not None:
+ data = data.set_index('IID').reindex(keep_iids).reset_index()
+
data = data.iloc[:, -1].values
if encode:
_, data = numpy.unique(data, return_inverse=True)
</code_context>
<issue_to_address>
**issue (bug_risk):** NaNs introduced by `keep_iids` are silently encoded into integers, hiding missing phenotypes.
When `keep_iids` is used, `set_index(...).reindex(keep_iids).reset_index()` will produce NaNs for IIDs present in `keep_iids` but missing from the phenotype file. Because `numpy.unique(..., return_inverse=True)` runs before any NaN check, those NaNs are treated as a category and converted to integers, so a later `numpy.isnan(phenotype)` check can never detect them. Please either (a) validate for NaNs immediately after the `reindex` and fail if any are found, or (b) perform the NaN check on the unencoded `data` before calling `numpy.unique`.
</issue_to_address>
### Comment 2
<location path="flan/nn/loader.py" line_range="91-100" />
<code_context>
+ def _load_pcs(self, cache: FileCache, fold: int, iids_train=None, iids_val=None, iids_test=None) -> X:
</code_context>
<issue_to_address>
**issue (bug_risk):** Reindexing PCs with IIDs can create all-NaN rows when PCA IIDs and phenotype IIDs differ.
In `_load_pcs`, `load_plink_pcs` already intersects PCA sscore IIDs with the phenotype IIDs from `order_as_in_file`. When you then reindex on `iids_train` / `iids_val` / `iids_test`, any IIDs present only in the sscore file will be reintroduced as all-NaN rows. These NaNs persist through `.values.astype(numpy.float32)`, so you can end up with feature rows that are empty or misaligned with your phenotype filtering. Consider intersecting `iids_*` with the DataFrame index before reindexing, or at least asserting that the number of non-NaN rows matches expectations.
</issue_to_address>
### Comment 3
<location path="flan/preprocess/sample_splitter.py" line_range="32-34" />
<code_context>
y: y can be passed to trigger StratifiedKFold instead of KFold
random_state (int): Fixed random_state for train_test_split sklearn function
"""
+ # adding min 5 folds
+ num_folds = getattr(self.args, "num_folds", 5)
+ self.args.num_folds = num_folds
+
ids = pandas.read_table(cache.ids_path()).rename(columns={'#IID': 'IID'}).filter(['FID', 'IID'])
</code_context>
<issue_to_address>
**issue (bug_risk):** Forcing a minimum of 5 folds on `self.args.num_folds` can desync from `cache.num_folds` and the on-disk layout.
Here `self.args.num_folds` is forced to ≥5, but other code (e.g. `_split_genotypes`, PCA) iterates over `range(cache.num_folds)`. With configs like `scripts/configs/cache/node1.yaml` (`num_folds: 1`), `_split_ids` will create 5 folds of IDs while only 1 fold of genotypes/artifacts exists. This can break downstream logic for folds > 0. Instead of enforcing a minimum here, keep `self.args.num_folds` aligned with `cache.num_folds` (or validate and update them together).
</issue_to_address>
### Comment 4
<location path="flan/preprocess/qc.py" line_range="34-35" />
<code_context>
+ }
+ )
+
+ # ✅ VERY IMPORTANT: update cache to point to QC output
+ cache._pfile_path = qc_path
def transform(self, source_path: str, dest_path: str) -> None:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Directly mutating `cache._pfile_path` and using string concatenation may break expectations about `pfile_path` type and structure.
Assigning `qc_path` (built as `str(cache.pfile_path()) + "_qc"`) back into `cache._pfile_path` changes its type from a `Path`-like object to a plain string. Any downstream code using `pfile_path()` as a `Path` (e.g. with `/` or `.with_suffix`) may then fail unexpectedly. The same pattern appears in `SampleSplitter.fit_transform`. Please construct `qc_path` as a `Path` (for example via `with_name`) and update the cache via a public method rather than mutating the private `_pfile_path` attribute directly.
Suggested implementation:
```python
args_dict={
'--out': str(qc_path),
'--set-missing-var-ids': '@:#:$r:$a',
**self.qc_config
}
)
# ✅ VERY IMPORTANT: update cache to point to QC output
cache.set_pfile_path(qc_path)
```
1. Ensure `qc_path` is constructed as a Path-like object rather than via string concatenation. For example, wherever `qc_path` is currently defined (likely as `str(cache.pfile_path()) + "_qc"`), replace it with something like:
- `qc_path = cache.pfile_path().with_name(cache.pfile_path().name + "_qc")`
or, if you need to preserve suffixes differently, use an appropriate `pathlib.Path` method.
2. Implement the new public method `set_pfile_path` on the `cache` object’s class (if it does not already exist). It should:
- Accept a Path-like object.
- Store it internally (e.g. in `_pfile_path`) without changing its type.
- Maintain any invariants or side effects previously associated with updating `_pfile_path`.
3. Update `SampleSplitter.fit_transform` to follow the same pattern:
- Construct its QC (or split) path using Path operations, not string concatenation.
- Use `cache.set_pfile_path(...)` (or the equivalent public API) instead of mutating `cache._pfile_path` directly.
</issue_to_address>
### Comment 5
<location path="flan/nn/loader.py" line_range="53-59" />
<code_context>
y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID')
- assert len(df) == len(y)
- df = df.reindex(y.index)
+ # Highlighted Fix: Drop the strict assert check and intersect valid indices instead
+ common_ids = y.index.intersection(df.index)
+ df = df.reindex(common_ids)
return df
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Silently dropping non-common IDs in `load_plink_pcs` can hide serious data issues.
Using `common_ids = y.index.intersection(df.index)` and reindexing means mismatched phenotype/PCA IDs now result in silently dropped samples. If overlap is very small (or zero), downstream code may unknowingly run on an unrepresentative or empty dataset. Please add a sanity check (e.g., ensure `len(common_ids) > 0` and that the overlap is close to `len(y)`), and raise or log if the overlap is unexpectedly small.
```suggestion
if order_as_in_file is not None:
y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID')
# Drop the strict assert check but enforce sanity checks on the overlap
common_ids = y.index.intersection(df.index)
# Sanity checks on overlap between phenotype/order file and PCA data
n_expected = len(y)
n_overlap = len(common_ids)
if n_overlap == 0:
raise ValueError(
f"No overlapping IIDs between order file ({order_as_in_file}) "
f"and PCA data (0 / {n_expected} overlapped)."
)
overlap_ratio = n_overlap / n_expected
if overlap_ratio < 0.95:
logging.getLogger(__name__).warning(
"Only %d/%d (%.1f%%) IIDs overlap between order file (%s) and PCA data; "
"samples with mismatched IDs will be dropped.",
n_overlap,
n_expected,
overlap_ratio * 100.0,
order_as_in_file,
)
df = df.reindex(common_ids)
return df
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| def load_phenotype(phenotype_path: str, out_type = numpy.float32, encode = False, keep_iids = None) -> numpy.ndarray: | ||
| """ | ||
| :param phenotype_path: Phenotypes location | ||
| :param out_type: convert to type | ||
| :param encode: whether phenotypes are strings and we want to code them as ints) | ||
| """ | ||
| data = pandas.read_table(phenotype_path) | ||
| # Highlighted Fix: If a list of aligned IIDs is provided, filter and order by them | ||
| if keep_iids is not None: | ||
| data = data.set_index('IID').reindex(keep_iids).reset_index() |
There was a problem hiding this comment.
issue (bug_risk): NaNs introduced by keep_iids are silently encoded into integers, hiding missing phenotypes.
When keep_iids is used, set_index(...).reindex(keep_iids).reset_index() will produce NaNs for IIDs present in keep_iids but missing from the phenotype file. Because numpy.unique(..., return_inverse=True) runs before any NaN check, those NaNs are treated as a category and converted to integers, so a later numpy.isnan(phenotype) check can never detect them. Please either (a) validate for NaNs immediately after the reindex and fail if any are found, or (b) perform the NaN check on the unencoded data before calling numpy.unique.
| def _load_pcs(self, cache: FileCache, fold: int, iids_train=None, iids_val=None, iids_test=None) -> X: | ||
| X_train = load_plink_pcs(path=cache.pca_path(fold, 'train', 'sscore'), | ||
| order_as_in_file=cache.phenotype_path(fold, 'train')).values.astype(numpy.float32) | ||
| order_as_in_file=cache.phenotype_path(fold, 'train')) | ||
| if iids_train is not None: | ||
| X_train = X_train.reindex(iids_train) | ||
| X_train = X_train.values.astype(numpy.float32) | ||
|
|
||
| X_val = load_plink_pcs(path=cache.pca_path(fold, 'val', 'sscore'), | ||
| order_as_in_file=cache.phenotype_path(fold, 'val')).values.astype(numpy.float32) | ||
| order_as_in_file=cache.phenotype_path(fold, 'val')) | ||
| if iids_val is not None: |
There was a problem hiding this comment.
issue (bug_risk): Reindexing PCs with IIDs can create all-NaN rows when PCA IIDs and phenotype IIDs differ.
In _load_pcs, load_plink_pcs already intersects PCA sscore IIDs with the phenotype IIDs from order_as_in_file. When you then reindex on iids_train / iids_val / iids_test, any IIDs present only in the sscore file will be reintroduced as all-NaN rows. These NaNs persist through .values.astype(numpy.float32), so you can end up with feature rows that are empty or misaligned with your phenotype filtering. Consider intersecting iids_* with the DataFrame index before reindexing, or at least asserting that the number of non-NaN rows matches expectations.
| # adding min 5 folds | ||
| num_folds = getattr(self.args, "num_folds", 5) | ||
| self.args.num_folds = num_folds |
There was a problem hiding this comment.
issue (bug_risk): Forcing a minimum of 5 folds on self.args.num_folds can desync from cache.num_folds and the on-disk layout.
Here self.args.num_folds is forced to ≥5, but other code (e.g. _split_genotypes, PCA) iterates over range(cache.num_folds). With configs like scripts/configs/cache/node1.yaml (num_folds: 1), _split_ids will create 5 folds of IDs while only 1 fold of genotypes/artifacts exists. This can break downstream logic for folds > 0. Instead of enforcing a minimum here, keep self.args.num_folds aligned with cache.num_folds (or validate and update them together).
| # ✅ VERY IMPORTANT: update cache to point to QC output | ||
| cache._pfile_path = qc_path |
There was a problem hiding this comment.
suggestion (bug_risk): Directly mutating cache._pfile_path and using string concatenation may break expectations about pfile_path type and structure.
Assigning qc_path (built as str(cache.pfile_path()) + "_qc") back into cache._pfile_path changes its type from a Path-like object to a plain string. Any downstream code using pfile_path() as a Path (e.g. with / or .with_suffix) may then fail unexpectedly. The same pattern appears in SampleSplitter.fit_transform. Please construct qc_path as a Path (for example via with_name) and update the cache via a public method rather than mutating the private _pfile_path attribute directly.
Suggested implementation:
args_dict={
'--out': str(qc_path),
'--set-missing-var-ids': '@:#:$r:$a',
**self.qc_config
}
)
# ✅ VERY IMPORTANT: update cache to point to QC output
cache.set_pfile_path(qc_path)- Ensure
qc_pathis constructed as a Path-like object rather than via string concatenation. For example, whereverqc_pathis currently defined (likely asstr(cache.pfile_path()) + "_qc"), replace it with something like:qc_path = cache.pfile_path().with_name(cache.pfile_path().name + "_qc")
or, if you need to preserve suffixes differently, use an appropriatepathlib.Pathmethod.
- Implement the new public method
set_pfile_pathon thecacheobject’s class (if it does not already exist). It should:- Accept a Path-like object.
- Store it internally (e.g. in
_pfile_path) without changing its type. - Maintain any invariants or side effects previously associated with updating
_pfile_path.
- Update
SampleSplitter.fit_transformto follow the same pattern:- Construct its QC (or split) path using Path operations, not string concatenation.
- Use
cache.set_pfile_path(...)(or the equivalent public API) instead of mutatingcache._pfile_pathdirectly.
| if order_as_in_file is not None: | ||
| y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID') | ||
| assert len(df) == len(y) | ||
| df = df.reindex(y.index) | ||
| # Highlighted Fix: Drop the strict assert check and intersect valid indices instead | ||
| common_ids = y.index.intersection(df.index) | ||
| df = df.reindex(common_ids) | ||
|
|
||
| return df |
There was a problem hiding this comment.
suggestion (bug_risk): Silently dropping non-common IDs in load_plink_pcs can hide serious data issues.
Using common_ids = y.index.intersection(df.index) and reindexing means mismatched phenotype/PCA IDs now result in silently dropped samples. If overlap is very small (or zero), downstream code may unknowingly run on an unrepresentative or empty dataset. Please add a sanity check (e.g., ensure len(common_ids) > 0 and that the overlap is close to len(y)), and raise or log if the overlap is unexpectedly small.
| if order_as_in_file is not None: | |
| y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID') | |
| assert len(df) == len(y) | |
| df = df.reindex(y.index) | |
| # Highlighted Fix: Drop the strict assert check and intersect valid indices instead | |
| common_ids = y.index.intersection(df.index) | |
| df = df.reindex(common_ids) | |
| return df | |
| if order_as_in_file is not None: | |
| y = pandas.read_csv(order_as_in_file, sep='\t').set_index('IID') | |
| # Drop the strict assert check but enforce sanity checks on the overlap | |
| common_ids = y.index.intersection(df.index) | |
| # Sanity checks on overlap between phenotype/order file and PCA data | |
| n_expected = len(y) | |
| n_overlap = len(common_ids) | |
| if n_overlap == 0: | |
| raise ValueError( | |
| f"No overlapping IIDs between order file ({order_as_in_file}) " | |
| f"and PCA data (0 / {n_expected} overlapped)." | |
| ) | |
| overlap_ratio = n_overlap / n_expected | |
| if overlap_ratio < 0.95: | |
| logging.getLogger(__name__).warning( | |
| "Only %d/%d (%.1f%%) IIDs overlap between order file (%s) and PCA data; " | |
| "samples with mismatched IDs will be dropped.", | |
| n_overlap, | |
| n_expected, | |
| overlap_ratio * 100.0, | |
| order_as_in_file, | |
| ) | |
| df = df.reindex(common_ids) | |
| return df |
This Pull Request addresses critical workflow failures within the repository that were preventing the flan global prepare and flan global fit pipelines from executing properly. Following a fork of the codebase, multiple underlying bugs, data formatting constraints, and file errors were identified and resolved, resulting in a fully functional end-to-end execution.
global fit Resolution: Fixed an issue within loader.py that directly blocked the global fit process from completing successfully.
prepare Step Stabilization: Addressed a critical crash in qc.py impacting the data preparation phase. Follow-up stabilization was implemented in local_plink.py to ensure prepare functions reliably across local environments.
Revert Correction: Resolved regression errors introduced during troubleshooting by successfully handling and reverting conflicting changes (de891d0).
Variant ID Resolution: Fixed a major issue causing crashes during --read-freq operations due to duplicate variant IDs formatted as '.'.
Allele Count Filtering: Implemented a minimum allele count threshold (--mac 1) to ensure data integrity during processing.
Sample Splitter Fix: Patched a num_folds validation error occurring inside the sample splitting logic.
Corrected a naming issue in deep_ancestry to ensure proper routing/redirecting to flan.
Verified that flan global fit correctly initializes and executes training using the corrected loader.py implementation.
Summary by Sourcery
Restore end-to-end execution of the flan global prepare and global fit pipelines by aligning phenotypes with PCA features, wiring QC outputs through subsequent steps, and hardening local PCA and dataset splitting behavior.
Bug Fixes:
Enhancements: