.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_starting/plot_stratified_kfold_reg.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_00_starting_plot_stratified_kfold_reg.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_starting_plot_stratified_kfold_reg.py:


Stratified K-fold CV for regression analysis
============================================

This example uses the ``diabetes`` dataset from ``sklearn.datasets`` to
perform stratified K-fold cross-validation (CV) for a regression problem.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 10-25

.. code-block:: Python

    # Authors: Shammi More <s.more@fz-juelich.de>
    #          Federico Raimondo <f.raimondo@fz-juelich.de>
    #          Leonard Sasse <l.sasse@fz-juelich.de>
    # License: AGPL

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import KFold

    from julearn import run_cross_validation
    from julearn.utils import configure_logging
    from julearn.model_selection import ContinuousStratifiedKFold


.. GENERATED FROM PYTHON SOURCE LINES 26-27

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 27-29

.. code-block:: Python

    configure_logging(level="INFO")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:54,840 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:53:54,841 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:53:54,841 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:53:54,841 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:53:54,841 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:53:54,841 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:53:54,841 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 30-31

Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 31-33

.. code-block:: Python

    features, target = load_diabetes(return_X_y=True, as_frame=True)


.. GENERATED FROM PYTHON SOURCE LINES 34-38

The dataset contains ten variables (age, sex, body mass index, average blood
pressure, and six blood serum measurements, s1-s6) for diabetes patients,
plus a quantitative measure of disease progression one year after baseline,
which is the target we are interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 38-42

.. code-block:: Python


    print("Features: \n", features.head())
    print("Target: \n", target.describe())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Features: 
             age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

    [5 rows x 10 columns]
    Target: 
     count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64


.. GENERATED FROM PYTHON SOURCE LINES 43-46

Let's combine features and target together in one dataframe and create some
outliers to see the difference in model performance with and without
stratification.

.. GENERATED FROM PYTHON SOURCE LINES 46-60

.. code-block:: Python


    data_df = pd.concat([features, target], axis=1)

    # Create outliers for test purposes. Work on an explicit copy so that the
    # assignment below does not trigger pandas' SettingWithCopyWarning.
    new_df = data_df[(data_df.target > 145) & (data_df.target <= 150)].copy()
    # Seven rows fall in this range, hence seven new target values
    new_df["target"] = [590, 580, 597, 595, 590, 590, 600]
    data_df = pd.concat([data_df, new_df], axis=0)
    data_df = data_df.reset_index(drop=True)

    # Define X, y
    X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
    y = "target"


.. GENERATED FROM PYTHON SOURCE LINES 61-66

Define the number of bins (groups) for stratification. The idea is that each
"bin" will be equally represented in each fold. The number of bins should be
chosen such that each bin has a sufficient number of samples so that each
fold has more than one sample from each bin.
Let's look at a few histograms with different numbers of bins.
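
One way to check whether a candidate number of bins leaves enough samples per
bin is to count them directly. The sketch below is only a rough illustration
on a synthetic, skewed target (``toy_target`` is made up for this purpose).
It assumes equal-width bins like those of a histogram; the binning used by
``ContinuousStratifiedKFold`` may differ in detail.

.. code-block:: Python

    import numpy as np
    import pandas as pd

    # Synthetic, skewed target used only for illustration (not the diabetes data)
    rng = np.random.default_rng(0)
    toy_target = pd.Series(rng.gamma(shape=2.0, scale=50.0, size=449))

    n_splits = 5
    sparse_bins = {}
    for n_bins in (20, 40, 60):
        # Equal-width bins over the observed range, as in a histogram
        counts = pd.cut(toy_target, bins=n_bins).value_counts()
        # Bins with fewer samples than folds cannot appear in every fold
        sparse_bins[n_bins] = int((counts < n_splits).sum())

    for n_bins, n_sparse in sparse_bins.items():
        print(f"{n_bins} bins: {n_sparse} hold fewer than {n_splits} samples")

A bin with fewer samples than ``n_splits`` cannot contribute to every fold,
so a large count here suggests trying fewer bins.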

.. GENERATED FROM PYTHON SOURCE LINES 66-73

.. code-block:: Python


    sns.displot(data_df, x="target", bins=60)

    sns.displot(data_df, x="target", bins=40)

    sns.displot(data_df, x="target", bins=20)


.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_001.png
         :alt: plot stratified kfold reg
         :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_001.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_002.png
         :alt: plot stratified kfold reg
         :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_002.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_003.png
         :alt: plot stratified kfold reg
         :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_003.png
         :class: sphx-glr-multi-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none


    <seaborn.axisgrid.FacetGrid object at 0x7f2d836f9bd0>


.. GENERATED FROM PYTHON SOURCE LINES 74-80

From the histograms above, we can see that the data is not uniformly
distributed: it is skewed towards the lower end of the target variable, and
the outliers we added sit far from the rest. In any case, even with a low
number of splits, some groups will not be represented in each fold. Let's
continue with 40 bins, which give good granularity.
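
To make the idea of stratifying on a continuous target more concrete, the
following minimal sketch discretises a target into bins and passes the bin
labels to ``sklearn.model_selection.StratifiedKFold``. This is an assumption
about the general technique, not a copy of how ``ContinuousStratifiedKFold``
is implemented; ``toy_X``, ``toy_y`` and the choice of four equal-width bins
are hypothetical.

.. code-block:: Python

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    # Synthetic data used only for illustration
    rng = np.random.default_rng(0)
    toy_X = rng.normal(size=(200, 3))
    toy_y = rng.uniform(0.0, 100.0, size=200)

    n_bins, n_splits = 4, 5

    # Equal-width bin edges; the interior edges map each value to 0..n_bins-1
    edges = np.histogram_bin_edges(toy_y, bins=n_bins)
    bin_labels = np.digitize(toy_y, edges[1:-1])

    # Stratify on the bin labels instead of the continuous target
    skf = StratifiedKFold(n_splits=n_splits, shuffle=False)
    fold_counts = np.array(
        [
            np.bincount(bin_labels[test_idx], minlength=n_bins)
            for _, test_idx in skf.split(toy_X, bin_labels)
        ]
    )

    # Rows are test folds, columns are bins
    print(fold_counts)

Each row of ``fold_counts`` should hold roughly one fifth of every bin, which
is what "equally represented in each fold" means in practice.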

.. GENERATED FROM PYTHON SOURCE LINES 80-83

.. code-block:: Python


    cv_stratified = ContinuousStratifiedKFold(n_bins=40, n_splits=5, shuffle=False)


.. GENERATED FROM PYTHON SOURCE LINES 84-85

Train a linear regression model with stratification on target.

.. GENERATED FROM PYTHON SOURCE LINES 85-98

.. code-block:: Python


    scores_strat, model = run_cross_validation(
        X=X,
        y=y,
        data=data_df,
        preprocess="zscore",
        cv=cv_stratified,
        problem_type="regression",
        model="linreg",
        return_estimator="final",
        scoring="neg_mean_absolute_error",
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:55,313 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:55,314 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:55,314 - julearn - INFO -      Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,314 - julearn - INFO -      Target: target
    2026-01-16 10:53:55,314 - julearn - INFO -      Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,314 - julearn - INFO -      X_types:{}
    2026-01-16 10:53:55,314 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:55,315 - julearn - INFO - ====================
    2026-01-16 10:53:55,315 - julearn - INFO - 
    2026-01-16 10:53:55,316 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:55,316 - julearn - INFO - Step added
    2026-01-16 10:53:55,316 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:55,316 - julearn - INFO - Step added
    2026-01-16 10:53:55,316 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:55,317 - julearn - INFO - ====================
    2026-01-16 10:53:55,317 - julearn - INFO - 
    2026-01-16 10:53:55,317 - julearn - INFO - = Data Information =
    2026-01-16 10:53:55,317 - julearn - INFO -      Problem type: regression
    2026-01-16 10:53:55,317 - julearn - INFO -      Number of samples: 449
    2026-01-16 10:53:55,317 - julearn - INFO -      Number of features: 10
    2026-01-16 10:53:55,317 - julearn - INFO - ====================
    2026-01-16 10:53:55,317 - julearn - INFO - 
    2026-01-16 10:53:55,317 - julearn - INFO -      Target type: float64
    2026-01-16 10:53:55,317 - julearn - INFO - Using outer CV scheme ContinuousStratifiedKFold(method='binning', n_bins=40, n_splits=5,
                 random_state=None, shuffle=False) (incl. final model)
    /opt/hostedtoolcache/Python/3.14.2/x64/lib/python3.14/site-packages/sklearn/model_selection/_split.py:811: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
      warnings.warn(
    /opt/hostedtoolcache/Python/3.14.2/x64/lib/python3.14/site-packages/sklearn/model_selection/_split.py:811: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
      warnings.warn(


.. GENERATED FROM PYTHON SOURCE LINES 99-100

Train a linear regression model without stratification on target.

.. GENERATED FROM PYTHON SOURCE LINES 100-114

.. code-block:: Python


    cv = KFold(n_splits=5, shuffle=False, random_state=None)
    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=data_df,
        preprocess="zscore",
        cv=cv,
        problem_type="regression",
        model="linreg",
        return_estimator="final",
        scoring="neg_mean_absolute_error",
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:55,369 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:55,369 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:55,369 - julearn - INFO -      Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,369 - julearn - INFO -      Target: target
    2026-01-16 10:53:55,369 - julearn - INFO -      Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,369 - julearn - INFO -      X_types:{}
    2026-01-16 10:53:55,369 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:55,370 - julearn - INFO - ====================
    2026-01-16 10:53:55,370 - julearn - INFO - 
    2026-01-16 10:53:55,370 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:55,371 - julearn - INFO - Step added
    2026-01-16 10:53:55,371 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:55,371 - julearn - INFO - Step added
    2026-01-16 10:53:55,371 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:55,371 - julearn - INFO - ====================
    2026-01-16 10:53:55,371 - julearn - INFO - 
    2026-01-16 10:53:55,372 - julearn - INFO - = Data Information =
    2026-01-16 10:53:55,372 - julearn - INFO -      Problem type: regression
    2026-01-16 10:53:55,372 - julearn - INFO -      Number of samples: 449
    2026-01-16 10:53:55,372 - julearn - INFO -      Number of features: 10
    2026-01-16 10:53:55,372 - julearn - INFO - ====================
    2026-01-16 10:53:55,372 - julearn - INFO - 
    2026-01-16 10:53:55,372 - julearn - INFO -      Target type: float64
    2026-01-16 10:53:55,372 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)


.. GENERATED FROM PYTHON SOURCE LINES 115-117

Now we can compare the test scores of the models trained with and without
stratification. We can combine the two outputs into a single
``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 117-123

.. code-block:: Python


    scores_strat["model"] = "With stratification"
    scores["model"] = "Without stratification"
    df_scores = scores_strat[["test_score", "model"]]
    df_scores = pd.concat([df_scores, scores[["test_score", "model"]]])


.. GENERATED FROM PYTHON SOURCE LINES 124-126

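Plot a boxplot with the test scores from both models. Since the scoring is
``neg_mean_absolute_error``, a higher score (closer to zero) means a lower
error. We see here that the test score is higher when the CV splits were not
stratified.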
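
A numerical summary can complement the boxplot. The sketch below assumes a
frame shaped like ``df_scores`` (columns ``test_score`` and ``model``); the
values in ``toy_scores`` are arbitrary placeholders, not results from this
example. It flips the sign of the negative MAE and reports the mean and
spread per model.

.. code-block:: Python

    import pandas as pd

    # Placeholder scores for illustration only (NOT results from this example)
    toy_scores = pd.DataFrame(
        {
            "test_score": [-10.0, -20.0, -30.0, -40.0, -50.0]
            + [-30.0, -30.0, -30.0, -30.0, -30.0],
            "model": ["With stratification"] * 5
            + ["Without stratification"] * 5,
        }
    )

    # neg_mean_absolute_error is the negated MAE, so flip the sign back
    toy_scores["mae"] = -toy_scores["test_score"]

    # Mean and standard deviation of the MAE across folds, per model
    summary = toy_scores.groupby("model")["mae"].agg(["mean", "std"])
    print(summary)

Looking at the standard deviation alongside the mean helps to judge how much
the score varies from fold to fold.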

.. GENERATED FROM PYTHON SOURCE LINES 126-131

.. code-block:: Python


    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    ax = sns.boxplot(x="model", y="test_score", data=df_scores)
    ax = sns.swarmplot(x="model", y="test_score", data=df_scores, color=".25")


.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_004.png
   :alt: plot stratified kfold reg
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_stratified_kfold_reg_004.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.764 seconds)


.. _sphx_glr_download_auto_examples_00_starting_plot_stratified_kfold_reg.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_stratified_kfold_reg.ipynb <plot_stratified_kfold_reg.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_stratified_kfold_reg.py <plot_stratified_kfold_reg.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_stratified_kfold_reg.zip <plot_stratified_kfold_reg.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_