Remove scikit-learn 1.5.0 limitation by thomasATbayer · Pull Request #24 · Bayer-Group/MotherML

thomasATbayer · 2026-06-17T16:49:16Z

Branch: `update_sklearn_to_current_version`

PR #24 — "Remove scikit-learn 1.5.0 limitation"

Why this branch exists

The previous release pinned scikit-learn to <=1.5.0, blocking the entire dependency tree from upgrading. This branch lifts that pin and adapts the codebase to the API changes introduced in scikit-learn 1.6–1.8. Several companion libraries (CatBoost, TabPFN, UMAP, HDBSCAN) were also updated at the same time.

Summary of changes

Area	File(s)	What changed
Dependency versions	`pyproject.toml`, `uv.lock`	sklearn unpinned; catboost, tabpfn, umap bumped
Python support	`pyproject.toml`, `workflow.yml`	Drop 3.10, add 3.14
HDBSCAN clustering	`cv/cv_methods.py`	Switch from external `hdbscan` to `sklearn.cluster.HDBSCAN`; fix fingerprint conversion for 2D input
CatBoost sklearn compat	`ml/models/m_catboost.py`	Custom `__sklearn_clone__` base class
TabPFN compat	`ml/models/m_tabpfn.py`	`force_all_finite` → `ensure_all_finite`; pin V2 model weights
rnanorm removal	`ml/rna.py`, `pyproject.toml`	Native CPM/UQ/CUF replace unmaintained rnanorm
RNA input validation	`ml/rna.py`	Fail-fast guards for zero library sizes and degenerate inputs; `CPM.transform` enforces fit-before-transform via `check_is_fitted`
Pipeline utils	`pipeline_utils.py`	Use `is_classifier()` helper; `get_ranking_pipeline` saves/restores global metadata routing state
Learning to rank	`optimization/core.py`	Redesigned `handle_metadata_routing` with routing guard
CI — TabPFN weights	`workflow.yml`	Download step added to fast `test` job (was only in `test-slow`)
CI — prepareTabpfn	`scripts/prepareTabpfn.py`	Fixed missing `from pathlib import Path` import
Tests	`test/unit/test_catboost_clone.py`	New test suite for `__sklearn_clone__`
RNA tests	`test/unit/test_rna.py`	Removed `xfail` markers for `UQ`/`CUF`

Detailed changes

1. Dependency updates (`pyproject.toml`)

scikit-learn :  <=1.5.0           →  >=1.8.0,<1.9
catboost     :  >=1.2.6,<=1.2.8   →  >=1.2.9,<=1.2.10
tabpfn       :  ==2.1.0           →  ==8.0.8
umap-learn   :  >=0.5.7,<0.6      →  >=0.5.12,<0.6
hdbscan      :  removed from clustering extras (now provided by sklearn)
rnanorm      :  removed from rna extras (replaced by native implementation)
Python       :  >=3.10, <3.14     →  >=3.11, <3.15

scikit-learn upper bound raised from <=1.5.0 to <1.9:

sklearn 1.8.0 requires Python ≥ 3.11, which drove the Python version change below.
sklearn 1.9.0 is excluded: quantile-forest uses the removed private symbol sklearn.tree._tree.DTYPE. The fix is already merged on the quantile-forest GitHub but not yet released; the upper bound will be raised once they publish a compatible version.

catboost bumped to >=1.2.9,<=1.2.10:

1.2.9 adds Python 3.14 support, fixes the setuptools.Distribution.dry_run removal (which broke installs on modern setuptools), and adds __sklearn_tags__ for sklearn ≥ 1.8 compatibility.

tabpfn bumped to ==8.0.8:

TabPFN 3 (the new default in 8.x) ships under a non-commercial license. The code now explicitly enforces the commercially-licensed V2 weights (see section 4).

umap-learn bumped to >=0.5.12,<0.6:

The 0.5.12 release (April 2026) is the first to support sklearn 1.8's revised __sklearn_tags__ API.

hdbscan external package removed — sklearn.cluster.HDBSCAN (available since sklearn 1.0) is sufficient and reduces the install footprint.

rnanorm removed — replaced by native numpy implementations inside rna.py (see section 6).

2. HDBSCAN clustering (`cv/cv_methods.py`)

Before:

try:
    import hdbscan
except ImportError as import_error:
    raise ExtrasDependencyImportError("clustering", import_error)

clusterer = hdbscan.HDBSCAN(...)

After:

from sklearn.cluster import HDBSCAN

clusterer = HDBSCAN(...)

Why: The standalone hdbscan package is no longer maintained at the same pace as sklearn, and sklearn ships its own HDBSCAN implementation. Removing the external dependency simplifies the [clustering] extra.

sklearn.cluster.HDBSCAN is imported inside hdbscan_clustering() (not at module level), and scikit-learn is a core dependency — so no ImportError can occur at runtime regardless of which extras are installed. The [clustering] extra now only lists kmedoids. test/test-dist.bash was updated to check for kmedoids instead of hdbscan when verifying the extra is installed.

Bug fix — fingerprint input conversion (hdbscan_clustering):

sklearn.cluster.HDBSCAN requires a 2D numeric feature matrix of shape (n_samples, n_bits). The old code converted a numpy array input to a list of ExplicitBitVect objects and then called np.array(fingerprints), producing a 1D object array that HDBSCAN cannot process.

# Before — wrong: produces a 1D object array of ExplicitBitVect
if isinstance(fingerprints, np.ndarray):
    fingerprints = [np_to_bv(fv) for fv in fingerprints]
clusterer.fit_predict(np.array(fingerprints))

# After — correct: 2D uint8 matrix (n_samples × n_bits)
if len(fingerprints) == 0:
    module_logger.warning("No valid fingerprints provided for clustering")
    return {}
if isinstance(fingerprints, np.ndarray):
    X = fingerprints  # already a 2D numeric matrix
else:
    nbits = fingerprints[0].GetNumBits()
    X = np.zeros((len(fingerprints), nbits), dtype=np.uint8)
    for i, fp in enumerate(fingerprints):
        DataStructs.ConvertToNumpyArray(fp, X[i])
clusterer.fit_predict(X)

Empty-input guard: The fixed code also adds a length check before accessing fingerprints[0], mirroring the guard already present in tanimoto_sphere_exclusion_clustering. Without it, an empty list or empty tuple input would cause IndexError at fingerprints[0].GetNumBits(). The guard returns {} early, consistent with the other clustering functions.

3. CatBoost sklearn clone compatibility (`ml/models/m_catboost.py`)

Problem: Newer CatBoost versions (>=1.2.9) internally deep-copy mutable constructor parameters such as cat_features, embedding_features, and text_features. This breaks sklearn's default clone() implementation, which uses identity (is) to verify that parameters were faithfully round-tripped after reconstruction.

Solution: A new mixin _CatboostModelMotherBase is introduced with a custom __sklearn_clone__ method. It is a pure mixin (inherits only from AbstractMotherPipeline, not from any CatBoost model class) and is inserted into the MRO of the three classes that expose mutable constructor params (cat_features, embedding_features, text_features) through get_params():

Note: CatboostGaussianProcessRegressorMother was initially excluded. Its explicit
constructor params (samples, prior_iterations, learning_rate, etc.) are all scalars,
so the default identity-based clone works for those. However, it also accepts **kwargs
that are forwarded directly to CatBoostRegressor.__init__, meaning mutable params such
as cat_features, embedding_features, and text_features can be supplied. CatBoost
1.2.9+ copies those lists internally, causing the same identity-check failure.
_CatboostModelMotherBase is therefore added to its MRO as well.

Class	Before	After
`CatboostRegressorMother`	`CatBoostRegressor, _CatboostHyperParams`	`CatBoostRegressor, _CatboostModelMotherBase, _CatboostHyperParams`
`CatboostGaussianProcessRegressorMother`	`CatBoostRegressor, _CatboostHyperParams`	`CatBoostRegressor, _CatboostModelMotherBase, _CatboostHyperParams` — its explicit constructor params are all scalars, but it also accepts **kwargs forwarded to `CatBoostRegressor.__init__`; mutable params such as `cat_features`, `embedding_features`, and `text_features` can be supplied that way and are copied internally by CatBoost 1.2.9+, so the override is required here too
`CatboostClassifierMother`	`CatBoostClassifier, _CatboostHyperParams, AbstractMotherPipeline`	`CatBoostClassifier, _CatboostModelMotherBase, _CatboostHyperParams`
`CatboostRankerMother`	`CatBoostRanker, _CatboostHyperParams, BaseEstimator`	`CatBoostRanker, _CatboostModelMotherBase, _CatboostHyperParams, BaseEstimator`

Solution: A new mixin _CatboostModelMotherBase is introduced with a custom __sklearn_clone__ method:

class _CatboostModelMotherBase(AbstractMotherPipeline):
    def __sklearn_clone__(self):
        klass = self.__class__
        params = self.get_params(deep=False)
        new_params = {k: copy.deepcopy(v) for k, v in params.items()}
        new_obj = klass(**new_params)
        # ... content-equality verification (type-dispatched for numpy/pandas) ...
        # ... preserve _metadata_request across clone ...
        return new_obj

Key points:

Uses type-dispatched content equality (numpy array_equal, pandas equals, or plain ==) instead of identity (is) to verify that mutable params were cloned correctly — avoiding the ValueError: The truth value of an array is ambiguous error.
Preserves _metadata_request on the new instance so that set_fit_request(group_id=True) calls survive a clone (which is what happens inside cross_val_score).
The old ensure_metadata_routing decorator on CatboostRankerMother.__init__ is removed; metadata routing is now handled purely by the caller (see the Learning to Rank section below).
Added the missing import copy (the method used copy.deepcopy but the module was previously never imported).

New tests — test/unit/test_catboost_clone.py (136 lines):

Test	What it checks
`test_clone_regressor_with_mutable_params`	`cat_features` / `embedding_features` / `text_features` survive clone on `CatboostRegressorMother`
`test_clone_classifier_with_mutable_params`	Same for `CatboostClassifierMother`
`test_clone_without_mutable_params`	Plain clone works for regressor, classifier, and ranker
`test_clone_preserves_metadata_routing_request`	`set_fit_request` state is preserved on `CatboostRankerMother` (skipped on sklearn < 1.3)
`test_cross_val_score_regressor_with_cat_features`	`cross_val_score` completes end-to-end with `cat_features`
`test_cross_val_score_classifier_with_cat_features`	Same for classifier
`test_clone_gp_regressor_without_mutable_params`	`clone()` succeeds for `CatboostGaussianProcessRegressorMother` without mutable kwargs
`test_clone_gp_regressor_with_mutable_params`	`clone()` succeeds with `cat_features` / `embedding_features` / `text_features` passed via `**kwargs` to the GP regressor

4. TabPFN compatibility (`ml/models/m_tabpfn.py`)

force_all_finite removal (sklearn 1.8):

sklearn 1.8 removed the force_all_finite argument from check_array, replacing it with ensure_all_finite (introduced in 1.6). A small compatibility shim is built once at module load:

_sklearn_version = tuple(int(x) for x in sklearn.__version__.split(".")[:2])
_ALLOW_NAN_KWARG = (
    {"ensure_all_finite": "allow-nan"} if _sklearn_version >= (1, 6)
    else {"force_all_finite": "allow-nan"}
)
# usage:
check_array(X, ensure_2d=False, **_ALLOW_NAN_KWARG)

TabPFN v8 API change and commercial licensing:

TabPFN 8.x introduced TabPFN 3 as the new default model, but TabPFN 3 is released under a non-commercial license. To ensure Mother always uses the commercially-licensed V2 weights, both TabPFNRegressorMother and TabPFNClassifierMother now explicitly select V2 at construction time, following the official recommendation:

from tabpfn.constants import ModelVersion

if "model_path" not in kwargs:
    kwargs["model_path"] = TabPFNRegressor.create_default_for_version(ModelVersion.V2).model_path

Users can still override model_path to select a different version.

5. `is_classifier()` in pipeline utils (`pipeline_utils.py`)

# Before
if hasattr(val_estimator, "_estimator_type") and val_estimator._estimator_type == "classifier":

# After
if is_classifier(val_estimator):

Why this matters beyond style: In sklearn 1.8, Pipeline no longer exposes _estimator_type as a public attribute. The old hasattr check therefore silently returned False for pipeline-wrapped classifiers, causing _proba_ columns to go missing from multi-target classification CV output. Using sklearn.base.is_classifier() — the canonical API — works correctly across all supported versions.

Learning to rank — detailed changes (`optimization/core.py`, `pipeline_utils.py`)

Background

The learning-to-rank path in MotherTuner.optimize() uses cross_val_score with a CatboostRankerMother estimator and an NDCG scorer. This requires that group_id (the query-group array used by the ranker) be forwarded to both fit() and score() during cross-validation. sklearn only forwards scorer kwargs when enable_metadata_routing=True is set globally.

Why sklearn 1.5–1.8 forced these changes:

sklearn version	Relevant behaviour
≤ 1.4	Metadata routing does not exist. `groups=` can always be passed directly to `cross_val_score`.
1.5	Metadata routing introduced as an opt-in experimental feature. When enabled, sklearn starts enforcing that CV-split `groups` travel through `params=` rather than as a direct `groups=` kwarg — passing both raises `ValueError`.
1.6	`force_all_finite` deprecated in `check_array` in favour of `ensure_all_finite`.
1.8	`force_all_finite` removed entirely. `Pipeline._estimator_type` removed. Metadata routing enforcement is stricter and more widely tested.

Under sklearn ≤ 1.5 the old ranking code worked by coincidence: routing was off by default, groups= was always accepted, and the enforcement gap was never triggered. Upgrading to ≥ 1.5 (and especially 1.8) exposed all three latent bugs described below.

The old design (problems)

The handle_metadata_routing decorator previously:

Called skl_get_config() on every invocation to check whether metadata routing was globally enabled.
If ranking groups were passed without metadata routing enabled, it raised a hard AssertionError telling the caller to call sklearn.set_config(enable_metadata_routing=True) manually.
When metadata routing was enabled, it moved groups from the direct cross_val_score kwarg into fit_kwargs; but this introduced a subtle bug: the groups key was not always present in fit_kwargs when it was later removed (fit_kwargs.pop("groups")), causing a KeyError.
get_ranking_pipeline permanently set enable_metadata_routing=True with no teardown. Under sklearn ≤ 1.5 this was harmless because the enforcement of routing rules was not yet applied to subsequent non-ranking calls. Under sklearn 1.5+ any subsequent non-ranking cross_val_score call with groups= raised ValueError.

The new design

The decorator reads the global config once per call (sklearn.get_config()["enable_metadata_routing"]) and uses three distinct paths:

ranking_groups is not None  →  "ranking path"
  Guard: raises RuntimeError immediately if routing is not enabled.
  Both CV groups and group_id are injected into fit_kwargs and forwarded
  via params=.  The direct groups= kwarg is NOT set (sklearn 1.5+ rejects
  it when routing is enabled).

ranking_groups is None + routing OFF (default)  →  "non-ranking classic path"
  CV split groups are passed as groups= directly to cross_val_score.

ranking_groups is None + routing ON  →  "non-ranking routing-aware path"
  sklearn 1.5+ raises ValueError when groups= is passed directly to
  cross_val_score while enable_metadata_routing=True (e.g. after a prior
  ranking call in the same process).  Groups are redirected into
  fit_kwargs / params= instead.

routing_on = sklearn.get_config()["enable_metadata_routing"]

if ranking_groups is not None:
    if not routing_on:
        raise RuntimeError(
            "Ranking optimization requires sklearn metadata routing to be enabled. "
            "Call sklearn.set_config(enable_metadata_routing=True) before optimize()."
        )
    fit_kwargs["group_id"] = ranking_groups
    if groups is not None:
        fit_kwargs["groups"] = groups
    kwargs["_groups_as_cross_val_args"] = False
else:
    if routing_on:
        if groups is not None:
            fit_kwargs["groups"] = groups
        kwargs["_groups_as_cross_val_args"] = False
    else:
        kwargs["_groups_as_cross_val_args"] = True

CatboostRankerMother.__init__'s @ensure_metadata_routing decorator is removed — the responsibility for calling sklearn.set_config(enable_metadata_routing=True) belongs to the caller, not the estimator constructor.

Bug fix: fit_kwargs.pop("groups") → fit_kwargs.pop("groups", None) in MotherTuner.optimize(). When groups is None it was never injected into fit_kwargs, so a direct .pop() raised KeyError. The safe form avoids this.

`get_ranking_pipeline` no longer mutates global routing state (`pipeline_utils.py`)

Old behaviour: get_ranking_pipeline called sklearn.set_config(enable_metadata_routing=True) with no teardown, permanently changing global sklearn state for the rest of the process. Any subsequent non-ranking cross_val_score call in the same process could silently break because sklearn 1.5+ rejects groups= as a direct kwarg when routing is on.

New behaviour: get_ranking_pipeline snapshots the current routing setting before enabling it, then restores the original value once the pipeline and tuner objects have been constructed. A try/finally block guarantees restoration even if an exception is raised during pipeline construction:

_prev_routing = sklearn.get_config().get("enable_metadata_routing", False)
sklearn.set_config(enable_metadata_routing=True)
try:
    # ... build ranking_pipeline and tuner ...
finally:
    sklearn.set_config(enable_metadata_routing=_prev_routing)
return ranking_pipeline, tuner

Callers that need metadata routing active for fitting or tuning (i.e. every ranking workflow) must now enable it explicitly around those calls:

sklearn.set_config(enable_metadata_routing=True)
pipeline.fit(X, y, group_id=ranking_groups)
tuner.optimize(..., ranking_groups=ranking_groups)

Or equivalently, use sklearn.config_context(enable_metadata_routing=True) as a context manager.

Test update (test/unit/test_pipeline_utils.py): test_get_ranking_pipeline now calls sklearn.set_config(enable_metadata_routing=True) explicitly after get_ranking_pipeline() returns, since the helper no longer leaves routing permanently on. A preserve_metadata_routing fixture restores the original routing state on teardown.

Why sklearn 1.5+ rejects `groups=` as a direct kwarg when routing is on

sklearn 1.5+ enforces that when metadata routing is active, CV-split groups must travel through params=. Passing groups= as a direct kwarg alongside params= raises ValueError. The new decorator handles this by routing groups through fit_kwargs whenever routing is enabled.

CI changes (`.github/workflows/workflow.yml`)

Python version	Before	After	Reason
3.10	✅	❌ dropped	sklearn 1.8 requires Python ≥ 3.11
3.11	✅	✅	—
3.12	✅	✅	—
3.13	✅	✅	—
3.14	❌	✅ added	Enabled by catboost 1.2.9 (Python 3.14 support + setuptools `dry_run` fix)

Release and docs jobs: Python pinned from 3.10 → 3.11.

TabPFN weight download added to fast test job:
The scripts/prepareTabpfn.py step was originally only present in test-slow. The fast test matrix job also imports TabPFNRegressorMother/TabPFNClassifierMother (e.g. in test_tabpfn.py import-time checks), so the download step is now also run before uv run poe coverage.

scripts/prepareTabpfn.py — missing import fixed:
The script used Path(model_path) without from pathlib import Path, causing a NameError on every CI run. The import was added.

RNA normalizer tests (`test/unit/test_rna.py`)

On the main branch, test_different_normalisation_methods was a single parametrize over ["Scanpy", "UQ", "CUF", "CPM"] with all four cases expected to pass. After sklearn was updated to 1.8 (in an intermediate commit on this branch), UQ and CUF started failing because rnanorm 2.2.0 calls the internal sklearn._validate_data method which was removed in sklearn 1.8. The tests were therefore temporarily marked xfail(strict=True) to document the known breakage:

# intermediate state on this branch — xfail because rnanorm broke under sklearn 1.8
pytest.param("UQ",  marks=pytest.mark.xfail(reason="rnanorm<=2.2.0 calls deprecated sklearn._validate_data removed in sklearn>=1.8", strict=True))
pytest.param("CUF", marks=pytest.mark.xfail(reason="rnanorm<=2.2.0 calls deprecated sklearn._validate_data removed in sklearn>=1.8", strict=True))

Once rnanorm was replaced with the native implementations in rna.py (see section 6 below), the xfail marks were removed and the test reverts to the same simple form as main:

@pytest.mark.parametrize("norm_method", ["Scanpy", "UQ", "CUF", "CPM"])
def test_different_normalisation_methods(...)

All four normalization methods now pass cleanly against sklearn 1.8.

6. RNA input validation guards (`ml/rna.py`)

Zero or degenerate inputs previously propagated silently as inf/nan through downstream pipelines. Three fail-fast checks are now raised as ValueError with specific messages:

Location	Condition checked	Error message
`CPM.transform`	Any sample row-sum is 0	`"CPM normalization requires each sample to have a positive library size (row sum > 0). Found one or more all-zero samples."`
`UQ._raw_factors`	All genes are zero after filtering (empty matrix)	`"UQ/CUF normalization requires at least one gene with non-zero counts, but all genes are zero across all samples."`
`UQ._raw_factors`	Any sample row-sum is 0 over expressed genes	`"UQ/CUF normalization requires each sample to have a positive library size (row sum > 0 over expressed genes)."`
`UQ._raw_factors`	Any 75th-percentile factor is 0 (would cause `log(0)` in `fit`)	`"UQ/CUF normalization produced a zero scaling factor ... Check for samples with very sparse counts."`

sklearn fit-before-transform contract (CPM.transform):

UQ.transform and CUF.transform already called check_is_fitted(self) as their first line, raising NotFittedError if called before fit(). CPM.transform was missing this guard, so calling it unfitted would silently proceed (skipping n_features_in_ / feature_names_in_ population) and produce incorrect output when used with set_output(transform="pandas"). The guard is now added for consistency:

def transform(self, X):
    check_is_fitted(self)  # added — consistent with UQ and CUF
    X_arr = np.asarray(X, dtype=float)
    ...

7. rnanorm replacement (`ml/rna.py`)

Problem: rnanorm 2.2.0 calls sklearn._validate_data (an internal, underscore-prefixed method) which was removed entirely in sklearn 1.8, causing UQ and CUF normalization to raise AttributeError at runtime. The package is unmaintained and unlikely to be fixed.

Solution: CPM, UQ, and CUF are now implemented as native sklearn transformers directly in rna.py, using only numpy. rnanorm is removed from the [rna] extras in pyproject.toml.

The math faithfully replicates the rnanorm / edgeR algorithm:

Class	Formula
`CPM`	`X[i,j] / sum(X[i,:]) × 10⁶`
`UQ`	CPM with effective library size: `lib_size × (UQ_factor / geomean(UQ_factors))`, where `UQ_factor[i] = percentile75(non-zero genes of sample i) / lib_size[i]`
`CUF`	`X[i,j] / (UQ_factor[i] / geomean(UQ_factors))` — same factors as UQ but raw counts divided directly, no CPM scaling

Output values were verified against rnanorm's own documentation examples (toy dataset from Bullard et al. 2010).

…eady working on it

Copilot

Pull request overview

This PR aims to lift the previous scikit-learn version cap (<=1.5.0) while keeping MotherML’s model wrappers compatible with scikit-learn cloning behavior—especially for CatBoost estimators where newer CatBoost versions copy mutable constructor parameters.

Changes:

Adds a custom __sklearn_clone__ implementation for CatBoost Mother estimators and introduces unit tests to validate clone behavior and CV integration.
Expands supported Python/scikit-learn versions in packaging and CI (Python >=3.11; scikit-learn up to 1.8.0).
Updates TabPFN dependency/version and changes TabPFN Mother defaults to select V2 model weights when model_path is not provided.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`src/mother/ml/models/m_catboost.py`	Adds a custom sklearn clone implementation and wires it into CatBoost Mother estimator classes.
`test/unit/test_catboost_clone.py`	Introduces tests covering clone round-trips, metadata routing preservation, and `cross_val_score` integration.
`src/mother/ml/models/m_tabpfn.py`	Changes TabPFN Mother defaults to force V2 model weights when `model_path` is absent.
`pyproject.toml`	Updates Python/scikit-learn constraints and bumps TabPFN dependency; adjusts CatBoost constraint.
`.github/workflows/workflow.yml`	Updates CI Python matrix and default Python version from 3.10 to 3.11.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 5 comments.

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

Copilot

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

src/mother/ml/models/m_catboost.py:1414

The CatboostRankerMother docstring still claims the estimator “automatically enables sklearn's metadata routing”, but the @ensure_metadata_routing decorator was removed. This is now misleading for users relying on group_id routing in CV/ranking flows.

    and automatic metadata routing enablement.

    This class extends the CatBoostRanker and integrates with the Mother framework to provide
    dynamic hyperparameter tuning using Optuna. It automatically enables sklearn's metadata routing
    if not already enabled, which is required for passing group_id parameters during training.

Copilot

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/mother/ml/models/m_catboost.py:1414

The CatboostRankerMother docstring still claims it 'automatically enables sklearn's metadata routing' after the @ensure_metadata_routing decorator was removed. This is now misleading for users and conflicts with the updated guidance below about passing group_id via routing.

    and automatic metadata routing enablement.

    This class extends the CatBoostRanker and integrates with the Mother framework to provide
    dynamic hyperparameter tuning using Optuna. It automatically enables sklearn's metadata routing
    if not already enabled, which is required for passing group_id parameters during training.

Copilot

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

…features

Copilot

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

src/mother/ml/rna.py:10

The module raises ExtrasDependencyImportError("rna", ...) if either anndata or scanpy is missing. That prevents importing/using the new pure-numpy normalizers (CPM/UQ/CUF) even though they don't depend on scanpy/anndata. Consider importing scanpy/anndata lazily (e.g., inside ScanpyPreprocessor/RNA) so CPM/UQ/CUF remain usable without the full [rna] extra.

try:
    import anndata
    import scanpy as sc
except ImportError as import_error:
    from mother.errors import ExtrasDependencyImportError

    raise ExtrasDependencyImportError("rna", import_error)

@@ -67,26 +65,33 @@ def wrapper(*args, **kwargs):
            fit_kwargs = {}
            kwargs["fit_kwargs"] = fit_kwargs


thomasATbayer added 7 commits June 17, 2026 15:52

introduce sklearn_clone to a base mother catboost class

6b1397f

update tabpfn while still using v2

ebfa546

update sklearn to the newest version

75bf74b

some changes due to newer sklearn

0f2ad76

limit sklearn to 1.8 till quantile forest supports 1.9 , they are alr…

bf861c8

…eady working on it

only use python > 3.10

bac0d54

add pytjon 3.14

7ab03f6

Copilot AI review requested due to automatic review settings June 17, 2026 16:49

thomasATbayer linked an issue Jun 17, 2026 that may be closed by this pull request

Get rid of scikit-learn 1.5.0 limitation #23

Open

Copilot started reviewing on behalf of thomasATbayer June 17, 2026 16:49 View session

Copilot AI reviewed Jun 17, 2026

View reviewed changes

make sure that we use the latest catboost

c65ef6e

thomasATbayer added bug Something isn't working enhancement New feature or request dependencies Pull requests that update a dependency file labels Jun 17, 2026

thomasATbayer self-assigned this Jun 17, 2026

adapt check array in tabpfn to current sklearn

260c87c

Copilot AI review requested due to automatic review settings June 17, 2026 17:11

Copilot started reviewing on behalf of thomasATbayer June 17, 2026 17:11 View session

rnanorm tests fail right now, but already in main brnach for rnanorm

d303dbf

Copilot AI reviewed Jun 17, 2026

View reviewed changes

Comment thread src/mother/ml/models/m_tabpfn.py Outdated

Comment thread src/mother/ml/models/m_catboost.py Outdated

Comment thread src/mother/ml/models/m_catboost.py

Comment thread test/unit/test_catboost_clone.py

Comment thread test/unit/test_catboost_clone.py

fix ranking for sklearn > 1.8

b27d70b

Copilot AI review requested due to automatic review settings June 17, 2026 17:41

Copilot started reviewing on behalf of thomasATbayer June 17, 2026 17:41 View session

fiy umap problem

5956580

Copilot AI reviewed Jun 17, 2026

View reviewed changes

update umap

37b1329

Copilot AI review requested due to automatic review settings June 17, 2026 18:32

Copilot started reviewing on behalf of thomasATbayer June 17, 2026 18:33 View session

use is_classifier

e6e54aa

Copilot started reviewing on behalf of thomasATbayer June 18, 2026 15:18 View session

Copilot AI reviewed Jun 18, 2026

View reviewed changes

fix ranking settings for pipeline

ee1c18c

thomasATbayer requested a review from Copilot June 19, 2026 07:21

Copilot started reviewing on behalf of thomasATbayer June 19, 2026 07:22 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread src/mother/ml/rna.py

Comment thread src/mother/pipeline_utils.py

additional chage to ranking pipeline

d3a0c59

thomasATbayer linked an issue Jun 19, 2026 that may be closed by this pull request

Fix sklearn.base.clone compatibility after CatBoost get_params change to enable update to current catboost version #7

Open

update ranking notebook

6f703eb

Copilot AI review requested due to automatic review settings June 19, 2026 12:31

Copilot started reviewing on behalf of thomasATbayer June 19, 2026 12:32 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread src/mother/pipeline_utils.py

thomasATbayer added 2 commits June 19, 2026 12:37

fix style check for noteook

664e636

update some rnaking description and details

5dff56f

Copilot AI review requested due to automatic review settings June 19, 2026 12:41

Copilot started reviewing on behalf of thomasATbayer June 19, 2026 12:42 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread src/mother/ml/rna.py

thomasATbayer added 2 commits June 19, 2026 13:14

also fix the clone for gaussian catboost and check for clone of text …

a54425b

…features

fix hdbscan dimension bug

55908ab

Copilot AI review requested due to automatic review settings June 19, 2026 13:24

Copilot started reviewing on behalf of thomasATbayer June 19, 2026 13:25 View session

proper check for rna normalization

a214915

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread src/mother/cv/cv_methods.py

mirror pattern from sphere exclusion for hdbscan for empty patterns

4886c57

Copilot AI review requested due to automatic review settings June 19, 2026 13:31

Copilot started reviewing on behalf of thomasATbayer June 19, 2026 13:32 View session

Copilot AI reviewed Jun 19, 2026

View reviewed changes

Comment thread src/mother/optimization/core.py

Comment on lines 62 to 66

@@ -67,26 +65,33 @@ def wrapper(*args, **kwargs):

fit_kwargs = {}

kwargs["fit_kwargs"] = fit_kwargs

thomasATbayer mentioned this pull request Jun 19, 2026

Get rid of scikit-learn 1.5.0 limitation #23

Open

thomasATbayer requested review from SommerKai and agzieba June 19, 2026 14:46

		@@ -67,26 +65,33 @@ def wrapper(args, *kwargs):
		fit_kwargs = {}
		kwargs["fit_kwargs"] = fit_kwargs

Conversation

thomasATbayer commented Jun 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Branch: update_sklearn_to_current_version

Why this branch exists

Summary of changes

Detailed changes

1. Dependency updates (pyproject.toml)

2. HDBSCAN clustering (cv/cv_methods.py)

3. CatBoost sklearn clone compatibility (ml/models/m_catboost.py)

4. TabPFN compatibility (ml/models/m_tabpfn.py)

5. is_classifier() in pipeline utils (pipeline_utils.py)

Learning to rank — detailed changes (optimization/core.py, pipeline_utils.py)

Background

The old design (problems)

The new design

get_ranking_pipeline no longer mutates global routing state (pipeline_utils.py)

Why sklearn 1.5+ rejects groups= as a direct kwarg when routing is on

CI changes (.github/workflows/workflow.yml)

RNA normalizer tests (test/unit/test_rna.py)

6. RNA input validation guards (ml/rna.py)

7. rnanorm replacement (ml/rna.py)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

thomasATbayer commented Jun 17, 2026 •

edited

Loading

Branch: `update_sklearn_to_current_version`

1. Dependency updates (`pyproject.toml`)

2. HDBSCAN clustering (`cv/cv_methods.py`)

3. CatBoost sklearn clone compatibility (`ml/models/m_catboost.py`)

4. TabPFN compatibility (`ml/models/m_tabpfn.py`)

5. `is_classifier()` in pipeline utils (`pipeline_utils.py`)

Learning to rank — detailed changes (`optimization/core.py`, `pipeline_utils.py`)

`get_ranking_pipeline` no longer mutates global routing state (`pipeline_utils.py`)

Why sklearn 1.5+ rejects `groups=` as a direct kwarg when routing is on

CI changes (`.github/workflows/workflow.yml`)

RNA normalizer tests (`test/unit/test_rna.py`)

6. RNA input validation guards (`ml/rna.py`)

7. rnanorm replacement (`ml/rna.py`)