Remove scikit-learn 1.5.0 limitation#24
Conversation
…eady working on it
There was a problem hiding this comment.
Pull request overview
This PR aims to lift the previous scikit-learn version cap (<=1.5.0) while keeping MotherML’s model wrappers compatible with scikit-learn cloning behavior—especially for CatBoost estimators where newer CatBoost versions copy mutable constructor parameters.
Changes:
- Adds a custom
__sklearn_clone__implementation for CatBoost Mother estimators and introduces unit tests to validate clone behavior and CV integration. - Expands supported Python/scikit-learn versions in packaging and CI (Python >=3.11; scikit-learn up to 1.8.0).
- Updates TabPFN dependency/version and changes TabPFN Mother defaults to select V2 model weights when
model_pathis not provided.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
src/mother/ml/models/m_catboost.py |
Adds a custom sklearn clone implementation and wires it into CatBoost Mother estimator classes. |
test/unit/test_catboost_clone.py |
Introduces tests covering clone round-trips, metadata routing preservation, and cross_val_score integration. |
src/mother/ml/models/m_tabpfn.py |
Changes TabPFN Mother defaults to force V2 model weights when model_path is absent. |
pyproject.toml |
Updates Python/scikit-learn constraints and bumps TabPFN dependency; adjusts CatBoost constraint. |
.github/workflows/workflow.yml |
Updates CI Python matrix and default Python version from 3.10 to 3.11. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 14 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
src/mother/ml/models/m_catboost.py:1414
- The CatboostRankerMother docstring still claims the estimator “automatically enables sklearn's metadata routing”, but the
@ensure_metadata_routingdecorator was removed. This is now misleading for users relying on group_id routing in CV/ranking flows.
and automatic metadata routing enablement.
This class extends the CatBoostRanker and integrates with the Mother framework to provide
dynamic hyperparameter tuning using Optuna. It automatically enables sklearn's metadata routing
if not already enabled, which is required for passing group_id parameters during training.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/mother/ml/models/m_catboost.py:1414
- The CatboostRankerMother docstring still claims it 'automatically enables sklearn's metadata routing' after the
@ensure_metadata_routingdecorator was removed. This is now misleading for users and conflicts with the updated guidance below about passing group_id via routing.
and automatic metadata routing enablement.
This class extends the CatBoostRanker and integrates with the Mother framework to provide
dynamic hyperparameter tuning using Optuna. It automatically enables sklearn's metadata routing
if not already enabled, which is required for passing group_id parameters during training.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 16 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
src/mother/ml/rna.py:10
- The module raises
ExtrasDependencyImportError("rna", ...)if eitheranndataorscanpyis missing. That prevents importing/using the new pure-numpy normalizers (CPM/UQ/CUF) even though they don't depend on scanpy/anndata. Consider importing scanpy/anndata lazily (e.g., insideScanpyPreprocessor/RNA) so CPM/UQ/CUF remain usable without the full[rna]extra.
try:
import anndata
import scanpy as sc
except ImportError as import_error:
from mother.errors import ExtrasDependencyImportError
raise ExtrasDependencyImportError("rna", import_error)
| @@ -67,26 +65,33 @@ def wrapper(*args, **kwargs): | |||
| fit_kwargs = {} | |||
| kwargs["fit_kwargs"] = fit_kwargs | |||
Branch:
update_sklearn_to_current_versionWhy this branch exists
The previous release pinned scikit-learn to
<=1.5.0, blocking the entire dependency tree from upgrading. This branch lifts that pin and adapts the codebase to the API changes introduced in scikit-learn 1.6–1.8. Several companion libraries (CatBoost, TabPFN, UMAP, HDBSCAN) were also updated at the same time.Summary of changes
pyproject.toml,uv.lockpyproject.toml,workflow.ymlcv/cv_methods.pyhdbscantosklearn.cluster.HDBSCAN; fix fingerprint conversion for 2D inputml/models/m_catboost.py__sklearn_clone__base classml/models/m_tabpfn.pyforce_all_finite→ensure_all_finite; pin V2 model weightsml/rna.py,pyproject.tomlml/rna.pyCPM.transformenforces fit-before-transform viacheck_is_fittedpipeline_utils.pyis_classifier()helper;get_ranking_pipelinesaves/restores global metadata routing stateoptimization/core.pyhandle_metadata_routingwith routing guardworkflow.ymltestjob (was only intest-slow)scripts/prepareTabpfn.pyfrom pathlib import Pathimporttest/unit/test_catboost_clone.py__sklearn_clone__test/unit/test_rna.pyxfailmarkers forUQ/CUFDetailed changes
1. Dependency updates (
pyproject.toml)scikit-learn upper bound raised from
<=1.5.0to<1.9:quantile-forestuses the removed private symbolsklearn.tree._tree.DTYPE. The fix is already merged on thequantile-forestGitHub but not yet released; the upper bound will be raised once they publish a compatible version.catboost bumped to
>=1.2.9,<=1.2.10:setuptools.Distribution.dry_runremoval (which broke installs on modern setuptools), and adds__sklearn_tags__for sklearn ≥ 1.8 compatibility.tabpfn bumped to
==8.0.8:umap-learn bumped to
>=0.5.12,<0.6:__sklearn_tags__API.hdbscan external package removed —
sklearn.cluster.HDBSCAN(available since sklearn 1.0) is sufficient and reduces the install footprint.rnanorm removed — replaced by native numpy implementations inside
rna.py(see section 6).2. HDBSCAN clustering (
cv/cv_methods.py)Before:
After:
Why: The standalone
hdbscanpackage is no longer maintained at the same pace as sklearn, and sklearn ships its own HDBSCAN implementation. Removing the external dependency simplifies the[clustering]extra.sklearn.cluster.HDBSCANis imported insidehdbscan_clustering()(not at module level), andscikit-learnis a core dependency — so noImportErrorcan occur at runtime regardless of which extras are installed. The[clustering]extra now only listskmedoids.test/test-dist.bashwas updated to check forkmedoidsinstead ofhdbscanwhen verifying the extra is installed.Bug fix — fingerprint input conversion (
hdbscan_clustering):sklearn.cluster.HDBSCANrequires a 2D numeric feature matrix of shape(n_samples, n_bits). The old code converted a numpy array input to a list ofExplicitBitVectobjects and then callednp.array(fingerprints), producing a 1D object array that HDBSCAN cannot process.Empty-input guard: The fixed code also adds a length check before accessing
fingerprints[0], mirroring the guard already present intanimoto_sphere_exclusion_clustering. Without it, an empty list or empty tuple input would causeIndexErroratfingerprints[0].GetNumBits(). The guard returns{}early, consistent with the other clustering functions.3. CatBoost sklearn clone compatibility (
ml/models/m_catboost.py)Problem: Newer CatBoost versions (
>=1.2.9) internally deep-copy mutable constructor parameters such ascat_features,embedding_features, andtext_features. This breaks sklearn's defaultclone()implementation, which uses identity (is) to verify that parameters were faithfully round-tripped after reconstruction.Solution: A new mixin
_CatboostModelMotherBaseis introduced with a custom__sklearn_clone__method. It is a pure mixin (inherits only fromAbstractMotherPipeline, not from any CatBoost model class) and is inserted into the MRO of the three classes that expose mutable constructor params (cat_features,embedding_features,text_features) throughget_params():CatboostRegressorMotherCatBoostRegressor, _CatboostHyperParamsCatBoostRegressor, _CatboostModelMotherBase, _CatboostHyperParamsCatboostGaussianProcessRegressorMotherCatBoostRegressor, _CatboostHyperParamsCatBoostRegressor, _CatboostModelMotherBase, _CatboostHyperParams— its explicit constructor params are all scalars, but it also accepts **kwargs forwarded toCatBoostRegressor.__init__; mutable params such ascat_features,embedding_features, andtext_featurescan be supplied that way and are copied internally by CatBoost 1.2.9+, so the override is required here tooCatboostClassifierMotherCatBoostClassifier, _CatboostHyperParams, AbstractMotherPipelineCatBoostClassifier, _CatboostModelMotherBase, _CatboostHyperParamsCatboostRankerMotherCatBoostRanker, _CatboostHyperParams, BaseEstimatorCatBoostRanker, _CatboostModelMotherBase, _CatboostHyperParams, BaseEstimatorSolution: A new mixin
_CatboostModelMotherBaseis introduced with a custom__sklearn_clone__method:Key points:
array_equal, pandasequals, or plain==) instead of identity (is) to verify that mutable params were cloned correctly — avoiding theValueError: The truth value of an array is ambiguouserror._metadata_requeston the new instance so thatset_fit_request(group_id=True)calls survive a clone (which is what happens insidecross_val_score).ensure_metadata_routingdecorator onCatboostRankerMother.__init__is removed; metadata routing is now handled purely by the caller (see the Learning to Rank section below).import copy(the method usedcopy.deepcopybut the module was previously never imported).New tests —
test/unit/test_catboost_clone.py(136 lines):test_clone_regressor_with_mutable_paramscat_features/embedding_features/text_featuressurvive clone onCatboostRegressorMothertest_clone_classifier_with_mutable_paramsCatboostClassifierMothertest_clone_without_mutable_paramstest_clone_preserves_metadata_routing_requestset_fit_requeststate is preserved onCatboostRankerMother(skipped on sklearn < 1.3)test_cross_val_score_regressor_with_cat_featurescross_val_scorecompletes end-to-end withcat_featurestest_cross_val_score_classifier_with_cat_featurestest_clone_gp_regressor_without_mutable_paramsclone()succeeds forCatboostGaussianProcessRegressorMotherwithout mutable kwargstest_clone_gp_regressor_with_mutable_paramsclone()succeeds withcat_features/embedding_features/text_featurespassed via**kwargsto the GP regressor4. TabPFN compatibility (
ml/models/m_tabpfn.py)force_all_finiteremoval (sklearn 1.8):sklearn 1.8 removed the
force_all_finiteargument fromcheck_array, replacing it withensure_all_finite(introduced in 1.6). A small compatibility shim is built once at module load:TabPFN v8 API change and commercial licensing:
TabPFN 8.x introduced TabPFN 3 as the new default model, but TabPFN 3 is released under a non-commercial license. To ensure Mother always uses the commercially-licensed V2 weights, both
TabPFNRegressorMotherandTabPFNClassifierMothernow explicitly select V2 at construction time, following the official recommendation:Users can still override
model_pathto select a different version.5.
is_classifier()in pipeline utils (pipeline_utils.py)Why this matters beyond style: In sklearn 1.8,
Pipelineno longer exposes_estimator_typeas a public attribute. The oldhasattrcheck therefore silently returnedFalsefor pipeline-wrapped classifiers, causing_proba_columns to go missing from multi-target classification CV output. Usingsklearn.base.is_classifier()— the canonical API — works correctly across all supported versions.Learning to rank — detailed changes (
optimization/core.py,pipeline_utils.py)Background
The learning-to-rank path in
MotherTuner.optimize()usescross_val_scorewith aCatboostRankerMotherestimator and an NDCG scorer. This requires thatgroup_id(the query-group array used by the ranker) be forwarded to bothfit()andscore()during cross-validation. sklearn only forwards scorer kwargs whenenable_metadata_routing=Trueis set globally.Why sklearn 1.5–1.8 forced these changes:
groups=can always be passed directly tocross_val_score.groupstravel throughparams=rather than as a directgroups=kwarg — passing both raisesValueError.force_all_finitedeprecated incheck_arrayin favour ofensure_all_finite.force_all_finiteremoved entirely.Pipeline._estimator_typeremoved. Metadata routing enforcement is stricter and more widely tested.Under sklearn ≤ 1.5 the old ranking code worked by coincidence: routing was off by default,
groups=was always accepted, and the enforcement gap was never triggered. Upgrading to ≥ 1.5 (and especially 1.8) exposed all three latent bugs described below.The old design (problems)
The
handle_metadata_routingdecorator previously:skl_get_config()on every invocation to check whether metadata routing was globally enabled.AssertionErrortelling the caller to callsklearn.set_config(enable_metadata_routing=True)manually.groupsfrom the directcross_val_scorekwarg intofit_kwargs; but this introduced a subtle bug: thegroupskey was not always present infit_kwargswhen it was later removed (fit_kwargs.pop("groups")), causing aKeyError.get_ranking_pipelinepermanently setenable_metadata_routing=Truewith no teardown. Under sklearn ≤ 1.5 this was harmless because the enforcement of routing rules was not yet applied to subsequent non-ranking calls. Under sklearn 1.5+ any subsequent non-rankingcross_val_scorecall withgroups=raisedValueError.The new design
The decorator reads the global config once per call (
sklearn.get_config()["enable_metadata_routing"]) and uses three distinct paths:CatboostRankerMother.__init__'s@ensure_metadata_routingdecorator is removed — the responsibility for callingsklearn.set_config(enable_metadata_routing=True)belongs to the caller, not the estimator constructor.Bug fix:
fit_kwargs.pop("groups")→fit_kwargs.pop("groups", None)inMotherTuner.optimize(). When groups isNoneit was never injected intofit_kwargs, so a direct.pop()raisedKeyError. The safe form avoids this.get_ranking_pipelineno longer mutates global routing state (pipeline_utils.py)Old behaviour:
get_ranking_pipelinecalledsklearn.set_config(enable_metadata_routing=True)with no teardown, permanently changing global sklearn state for the rest of the process. Any subsequent non-rankingcross_val_scorecall in the same process could silently break because sklearn 1.5+ rejectsgroups=as a direct kwarg when routing is on.New behaviour:
get_ranking_pipelinesnapshots the current routing setting before enabling it, then restores the original value once the pipeline and tuner objects have been constructed. Atry/finallyblock guarantees restoration even if an exception is raised during pipeline construction:Callers that need metadata routing active for fitting or tuning (i.e. every ranking workflow) must now enable it explicitly around those calls:
Or equivalently, use
sklearn.config_context(enable_metadata_routing=True)as a context manager.Test update (
test/unit/test_pipeline_utils.py):test_get_ranking_pipelinenow callssklearn.set_config(enable_metadata_routing=True)explicitly afterget_ranking_pipeline()returns, since the helper no longer leaves routing permanently on. Apreserve_metadata_routingfixture restores the original routing state on teardown.Why sklearn 1.5+ rejects
groups=as a direct kwarg when routing is onsklearn 1.5+ enforces that when metadata routing is active, CV-split
groupsmust travel throughparams=. Passinggroups=as a direct kwarg alongsideparams=raisesValueError. The new decorator handles this by routing groups throughfit_kwargswhenever routing is enabled.CI changes (
.github/workflows/workflow.yml)dry_runfix)Release and docs jobs: Python pinned from
3.10→3.11.TabPFN weight download added to fast
testjob:The
scripts/prepareTabpfn.pystep was originally only present intest-slow. The fasttestmatrix job also importsTabPFNRegressorMother/TabPFNClassifierMother(e.g. intest_tabpfn.pyimport-time checks), so the download step is now also run beforeuv run poe coverage.scripts/prepareTabpfn.py— missing import fixed:The script used
Path(model_path)withoutfrom pathlib import Path, causing aNameErroron every CI run. The import was added.RNA normalizer tests (
test/unit/test_rna.py)On the
mainbranch,test_different_normalisation_methodswas a single parametrize over["Scanpy", "UQ", "CUF", "CPM"]with all four cases expected to pass. After sklearn was updated to 1.8 (in an intermediate commit on this branch),UQandCUFstarted failing because rnanorm 2.2.0 calls the internalsklearn._validate_datamethod which was removed in sklearn 1.8. The tests were therefore temporarily markedxfail(strict=True)to document the known breakage:Once
rnanormwas replaced with the native implementations inrna.py(see section 6 below), thexfailmarks were removed and the test reverts to the same simple form asmain:All four normalization methods now pass cleanly against sklearn 1.8.
6. RNA input validation guards (
ml/rna.py)Zero or degenerate inputs previously propagated silently as
inf/nanthrough downstream pipelines. Three fail-fast checks are now raised asValueErrorwith specific messages:CPM.transform"CPM normalization requires each sample to have a positive library size (row sum > 0). Found one or more all-zero samples."UQ._raw_factors"UQ/CUF normalization requires at least one gene with non-zero counts, but all genes are zero across all samples."UQ._raw_factors"UQ/CUF normalization requires each sample to have a positive library size (row sum > 0 over expressed genes)."UQ._raw_factorslog(0)infit)"UQ/CUF normalization produced a zero scaling factor ... Check for samples with very sparse counts."sklearn fit-before-transform contract (
CPM.transform):UQ.transformandCUF.transformalready calledcheck_is_fitted(self)as their first line, raisingNotFittedErrorif called beforefit().CPM.transformwas missing this guard, so calling it unfitted would silently proceed (skippingn_features_in_/feature_names_in_population) and produce incorrect output when used withset_output(transform="pandas"). The guard is now added for consistency:7. rnanorm replacement (
ml/rna.py)Problem:
rnanorm2.2.0 callssklearn._validate_data(an internal, underscore-prefixed method) which was removed entirely in sklearn 1.8, causingUQandCUFnormalization to raiseAttributeErrorat runtime. The package is unmaintained and unlikely to be fixed.Solution:
CPM,UQ, andCUFare now implemented as native sklearn transformers directly inrna.py, using only numpy.rnanormis removed from the[rna]extras inpyproject.toml.The math faithfully replicates the rnanorm / edgeR algorithm:
CPMX[i,j] / sum(X[i,:]) × 10⁶UQlib_size × (UQ_factor / geomean(UQ_factors)), whereUQ_factor[i] = percentile75(non-zero genes of sample i) / lib_size[i]CUFX[i,j] / (UQ_factor[i] / geomean(UQ_factors))— same factors as UQ but raw counts divided directly, no CPM scalingOutput values were verified against rnanorm's own documentation examples (toy dataset from Bullard et al. 2010).