.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_starting/plot_stratified_kfold_reg.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_starting_plot_stratified_kfold_reg.py:


Stratified K-fold CV for regression analysis
============================================

This example uses the ``diabetes`` data from ``sklearn.datasets`` to
perform stratified K-fold CV for a regression problem.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 10-25

.. code-block:: Python


    # Authors: Shammi More
    #          Federico Raimondo
    #          Leonard Sasse
    # License: AGPL

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import KFold

    from julearn import run_cross_validation
    from julearn.utils import configure_logging
    from julearn.model_selection import ContinuousStratifiedKFold

.. GENERATED FROM PYTHON SOURCE LINES 26-27

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 27-29

.. code-block:: Python

    configure_logging(level="INFO")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:53:54,840 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:53:54,841 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:53:54,841 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:53:54,841 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:53:54,841 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:53:54,841 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:53:54,841 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 30-31

Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 31-33

..
.. code-block:: Python

    features, target = load_diabetes(return_X_y=True, as_frame=True)

.. GENERATED FROM PYTHON SOURCE LINES 34-38

The dataset contains ten variables (age, sex, body mass index, average blood
pressure, and six blood serum measurements, s1-s6) measured in diabetes
patients, and a quantitative measure of disease progression one year after
baseline, which is the target we are interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 38-42

.. code-block:: Python

    print("Features: \n", features.head())
    print("Target: \n", target.describe())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Features:
              age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

    [5 rows x 10 columns]
    Target:
     count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 43-46

Let's combine the features and the target in one dataframe and create some
outliers to see the difference in model performance with and without
stratification.

.. GENERATED FROM PYTHON SOURCE LINES 46-60

.. code-block:: Python

    data_df = pd.concat([features, target], axis=1)

    # Create outliers for test purpose
    new_df = data_df[(data_df.target > 145) & (data_df.target <= 150)]
    new_df["target"] = [590, 580, 597, 595, 590, 590, 600]
    data_df = pd.concat([data_df, new_df], axis=0)
    data_df = data_df.reset_index(drop=True)

    # Define X, y
    X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
    y = "target"

.. rst-class:: sphx-glr-script-out

..
.. code-block:: none

    /home/runner/work/julearn/julearn/examples/00_starting/plot_stratified_kfold_reg.py:51: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      new_df["target"] = [590, 580, 597, 595, 590, 590, 600]

.. GENERATED FROM PYTHON SOURCE LINES 61-66

Define the number of bins (groups) for stratification. The idea is that each
bin will be equally represented in each fold. The number of bins should be
chosen such that each bin has enough samples for every fold to contain more
than one sample from each bin. Let's look at a few histograms with different
numbers of bins.

.. GENERATED FROM PYTHON SOURCE LINES 66-73

.. code-block:: Python

    sns.displot(data_df, x="target", bins=60)

    sns.displot(data_df, x="target", bins=40)

    sns.displot(data_df, x="target", bins=20)

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_001.png
         :alt: plot stratified kfold reg
         :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_001.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_002.png
         :alt: plot stratified kfold reg
         :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_002.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_003.png
         :alt: plot stratified kfold reg
         :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_003.png
         :class: sphx-glr-multi-img

.. GENERATED FROM PYTHON SOURCE LINES 74-80

From the histograms above, we can see that the data is not uniformly distributed.
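How sparsely a given number of bins is populated can also be checked
numerically. The following is a minimal sketch, assuming equal-width bins as
computed by ``np.histogram``; the binning used internally by
``ContinuousStratifiedKFold`` may differ. The names ``sparse_bins`` and
``toy_target`` are hypothetical and introduced only for illustration, and the
toy values are not the diabetes data.

```python
import numpy as np


def sparse_bins(target, n_bins, n_splits):
    """Count samples per equal-width bin and flag under-populated bins.

    Returns the per-bin counts and the number of non-empty bins that hold
    fewer than ``n_splits`` samples.
    """
    counts, _ = np.histogram(np.asarray(target, dtype=float), bins=n_bins)
    n_sparse = int(np.sum((counts > 0) & (counts < n_splits)))
    return counts, n_sparse


# Hypothetical toy target: a dense low range plus a few high outliers.
toy_target = np.concatenate([np.arange(25, 225), [590, 595, 600]])

for n_bins in (60, 40, 20):
    counts, n_sparse = sparse_bins(toy_target, n_bins=n_bins, n_splits=5)
    print(f"{n_bins} bins: {n_sparse} non-empty bins with fewer than 5 samples")
```

A bin with fewer samples than ``n_splits`` cannot contribute a sample to every
fold, which is worth keeping in mind when choosing ``n_bins``.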
The data is skewed towards the lower end of the target variable, and there are
some outliers. In any case, even with a low number of splits, some bins will
not be represented in every fold. Let's continue with 40 bins, which gives
good granularity.

.. GENERATED FROM PYTHON SOURCE LINES 80-83

.. code-block:: Python

    cv_stratified = ContinuousStratifiedKFold(n_bins=40, n_splits=5, shuffle=False)

.. GENERATED FROM PYTHON SOURCE LINES 84-85

Train a linear regression model with stratification on the target.

.. GENERATED FROM PYTHON SOURCE LINES 85-98

.. code-block:: Python

    scores_strat, model = run_cross_validation(
        X=X,
        y=y,
        data=data_df,
        preprocess="zscore",
        cv=cv_stratified,
        problem_type="regression",
        model="linreg",
        return_estimator="final",
        scoring="neg_mean_absolute_error",
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:53:55,313 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:55,314 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:55,314 - julearn - INFO - 	Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,314 - julearn - INFO - 	Target: target
    2026-01-16 10:53:55,314 - julearn - INFO - 	Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,314 - julearn - INFO - 	X_types:{}
    2026-01-16 10:53:55,314 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:55,315 - julearn - INFO - ====================
    2026-01-16 10:53:55,315 - julearn - INFO -
    2026-01-16 10:53:55,316 - julearn - INFO - Adding step zscore that applies to ColumnTypes
    2026-01-16 10:53:55,316 - julearn - INFO - Step added
    2026-01-16 10:53:55,316 - julearn - INFO - Adding step linreg that applies to ColumnTypes
    2026-01-16 10:53:55,316 - julearn - INFO - Step added
    2026-01-16 10:53:55,316 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:55,317 - julearn - INFO - ====================
    2026-01-16 10:53:55,317 - julearn - INFO -
    2026-01-16 10:53:55,317 - julearn - INFO - = Data Information =
    2026-01-16 10:53:55,317 - julearn - INFO - 	Problem type: regression
    2026-01-16 10:53:55,317 - julearn - INFO - 	Number of samples: 449
    2026-01-16 10:53:55,317 - julearn - INFO - 	Number of features: 10
    2026-01-16 10:53:55,317 - julearn - INFO - ====================
    2026-01-16 10:53:55,317 - julearn - INFO -
    2026-01-16 10:53:55,317 - julearn - INFO - 	Target type: float64
    2026-01-16 10:53:55,317 - julearn - INFO - Using outer CV scheme ContinuousStratifiedKFold(method='binning', n_bins=40, n_splits=5, random_state=None, shuffle=False) (incl. final model)
    /opt/hostedtoolcache/Python/3.14.2/x64/lib/python3.14/site-packages/sklearn/model_selection/_split.py:811: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
      warnings.warn(
    /opt/hostedtoolcache/Python/3.14.2/x64/lib/python3.14/site-packages/sklearn/model_selection/_split.py:811: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
      warnings.warn(

.. GENERATED FROM PYTHON SOURCE LINES 99-100

Train a linear regression model without stratification on target.

.. GENERATED FROM PYTHON SOURCE LINES 100-114

..
.. code-block:: Python

    cv = KFold(n_splits=5, shuffle=False, random_state=None)
    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=data_df,
        preprocess="zscore",
        cv=cv,
        problem_type="regression",
        model="linreg",
        return_estimator="final",
        scoring="neg_mean_absolute_error",
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:53:55,369 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:55,369 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:55,369 - julearn - INFO - 	Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,369 - julearn - INFO - 	Target: target
    2026-01-16 10:53:55,369 - julearn - INFO - 	Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,369 - julearn - INFO - 	X_types:{}
    2026-01-16 10:53:55,369 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:55,370 - julearn - INFO - ====================
    2026-01-16 10:53:55,370 - julearn - INFO -
    2026-01-16 10:53:55,370 - julearn - INFO - Adding step zscore that applies to ColumnTypes
    2026-01-16 10:53:55,371 - julearn - INFO - Step added
    2026-01-16 10:53:55,371 - julearn - INFO - Adding step linreg that applies to ColumnTypes
    2026-01-16 10:53:55,371 - julearn - INFO - Step added
    2026-01-16 10:53:55,371 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:55,371 - julearn - INFO - ====================
    2026-01-16 10:53:55,371 - julearn - INFO -
    2026-01-16 10:53:55,372 - julearn - INFO - = Data Information =
    2026-01-16 10:53:55,372 - julearn - INFO - 	Problem type: regression
    2026-01-16 10:53:55,372 - julearn - INFO - 	Number of samples: 449
    2026-01-16 10:53:55,372 - julearn - INFO - 	Number of features: 10
    2026-01-16 10:53:55,372 - julearn - INFO - ====================
    2026-01-16 10:53:55,372 - julearn - INFO -
    2026-01-16 10:53:55,372 - julearn - INFO - 	Target type: float64
    2026-01-16 10:53:55,372 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)

.. GENERATED FROM PYTHON SOURCE LINES 115-117

Now we can compare the test scores of the models trained with and without
stratification. We can combine the two outputs into a single
``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 117-123

.. code-block:: Python

    scores_strat["model"] = "With stratification"
    scores["model"] = "Without stratification"
    df_scores = scores_strat[["test_score", "model"]]
    df_scores = pd.concat([df_scores, scores[["test_score", "model"]]])

.. GENERATED FROM PYTHON SOURCE LINES 124-126

Plot a boxplot with the test scores from both models. We see here that the
test score is higher when the CV splits were not stratified (with
``neg_mean_absolute_error``, a higher score, i.e. one closer to zero, means a
lower error).

.. GENERATED FROM PYTHON SOURCE LINES 126-131

..
.. code-block:: Python

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    ax = sns.boxplot(x="model", y="test_score", data=df_scores)
    ax = sns.swarmplot(x="model", y="test_score", data=df_scores, color=".25")

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_004.png
   :alt: plot stratified kfold reg
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_004.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.764 seconds)


.. _sphx_glr_download_auto_examples_00_starting_plot_stratified_kfold_reg.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_stratified_kfold_reg.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_stratified_kfold_reg.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_stratified_kfold_reg.zip `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_