Stratified K-fold CV for regression analysis

This example uses the 'diabetes' data from sklearn datasets to perform stratified K-fold CV for a regression problem.

# Authors: Shammi More <s.more@fz-juelich.de>
#          Federico Raimondo <f.raimondo@fz-juelich.de>
#          Leonard Sasse <l.sasse@fz-juelich.de>
# License: AGPL

import math
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold

from julearn import run_cross_validation
from julearn.utils import configure_logging
from julearn.model_selection import ContinuousStratifiedKFold

Set the logging level to info to see extra information.

configure_logging(level="INFO")
2023-07-19 12:41:47,422 - julearn - INFO - ===== Lib Versions =====
2023-07-19 12:41:47,422 - julearn - INFO - numpy: 1.25.1
2023-07-19 12:41:47,422 - julearn - INFO - scipy: 1.11.1
2023-07-19 12:41:47,422 - julearn - INFO - sklearn: 1.3.0
2023-07-19 12:41:47,422 - julearn - INFO - pandas: 2.0.3
2023-07-19 12:41:47,422 - julearn - INFO - julearn: 0.3.1.dev1
2023-07-19 12:41:47,422 - julearn - INFO - ========================

Load the diabetes data from sklearn as a pandas dataframe.

features, target = load_diabetes(return_X_y=True, as_frame=True)

The dataset contains ten variables (age, sex, body mass index, average blood pressure, and six blood serum measurements, s1-s6) for 442 diabetes patients, together with a quantitative measure of disease progression one year after baseline, which is the target we are interested in predicting.

print("Features: \n", features.head())
print("Target: \n", target.describe())
Features:
         age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]
Target:
 count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64

Let's combine features and target in one dataframe and add some outliers to see the difference in model performance with and without stratification.

data_df = pd.concat([features, target], axis=1)

# Create outliers for test purposes. Note the explicit .copy(): assigning
# to a plain slice would trigger pandas' SettingWithCopyWarning.
new_df = data_df[(data_df.target > 145) & (data_df.target <= 150)].copy()
new_df["target"] = [590, 580, 597, 595, 590, 590, 600]
data_df = pd.concat([data_df, new_df], axis=0)
data_df = data_df.reset_index(drop=True)

# define X, y
X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"

Define the number of bins/groups for stratification. The idea is that each "bin" will be equally represented in each fold. The number of bins should be chosen so that each bin contains enough samples for every fold to receive more than one sample from it. Let's look at a few histograms with different numbers of bins.

sns.displot(data_df, x="target", bins=60)

sns.displot(data_df, x="target", bins=40)

sns.displot(data_df, x="target", bins=20)
[Three histograms of the target variable, with 60, 40 and 20 bins]
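The bin-occupancy trade-off can also be checked numerically. The sketch below uses a synthetic right-skewed target (illustrative values only, not the real diabetes data) and pandas.cut to count how many bins would be too sparse to appear in every fold:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target standing in for the diabetes target
# (illustrative only; not the real dataset values).
rng = np.random.default_rng(0)
target_demo = pd.Series(rng.gamma(shape=2.0, scale=75.0, size=449))

# A bin with fewer samples than n_splits cannot be represented in
# every fold, so count how many such bins each choice produces.
for n_bins in (60, 40, 20):
    counts = pd.cut(target_demo, bins=n_bins).value_counts()
    n_sparse = int((counts < 5).sum())  # fewer samples than n_splits=5
    print(f"{n_bins} bins -> {n_sparse} bins with fewer than 5 samples")
```

Fewer bins mean better-populated groups, at the cost of coarser stratification.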

The histograms show that the data is not uniformly distributed: it is skewed towards the lower end of the target variable and contains some outliers. In any case, even with a low number of splits, some bins will not be represented in every fold. Let's continue with 40 bins, which gives good granularity.

cv_stratified = ContinuousStratifiedKFold(n_bins=40, n_splits=5, shuffle=False)
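Conceptually, continuous stratification amounts to discretizing the target and stratifying on the bin labels. The sketch below reproduces that idea with plain scikit-learn on a synthetic target (the names and data are illustrative, not julearn's internals); note that sparse bins trigger the same "least populated class" warning that appears in the output below:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# A minimal sketch of continuous stratification, assuming it reduces to
# binning the target and stratifying on the bin labels (illustrative
# data, not julearn's implementation).
rng = np.random.default_rng(42)
y_demo = rng.gamma(shape=2.0, scale=75.0, size=449)
bin_labels = pd.cut(y_demo, bins=40, labels=False)

skf = StratifiedKFold(n_splits=5, shuffle=False)
X_dummy = np.zeros((len(y_demo), 1))  # features are irrelevant for splitting
for i, (train_idx, test_idx) in enumerate(skf.split(X_dummy, bin_labels)):
    # Each test fold should have a similar target distribution.
    print(f"fold {i}: test mean = {y_demo[test_idx].mean():.1f}")
```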

Train a linear regression model with stratification on the target.

scores_strat, model = run_cross_validation(
    X=X,
    y=y,
    data=data_df,
    preprocess="zscore",
    cv=cv_stratified,
    problem_type="regression",
    model="linreg",
    return_estimator="final",
    scoring="neg_mean_absolute_error",
)
2023-07-19 12:41:48,017 - julearn - INFO - ==== Input Data ====
2023-07-19 12:41:48,017 - julearn - INFO - Using dataframe as input
2023-07-19 12:41:48,017 - julearn - INFO -      Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:41:48,017 - julearn - INFO -      Target: target
2023-07-19 12:41:48,018 - julearn - INFO -      Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:41:48,018 - julearn - INFO -      X_types:{}
2023-07-19 12:41:48,018 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/utils/logging.py:238: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
  warn(msg, category=category)
2023-07-19 12:41:48,019 - julearn - INFO - ====================
2023-07-19 12:41:48,019 - julearn - INFO -
2023-07-19 12:41:48,019 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:41:48,019 - julearn - INFO - Step added
2023-07-19 12:41:48,019 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:41:48,020 - julearn - INFO - Step added
2023-07-19 12:41:48,020 - julearn - INFO - = Model Parameters =
2023-07-19 12:41:48,020 - julearn - INFO - ====================
2023-07-19 12:41:48,020 - julearn - INFO -
2023-07-19 12:41:48,020 - julearn - INFO - = Data Information =
2023-07-19 12:41:48,020 - julearn - INFO -      Problem type: regression
2023-07-19 12:41:48,020 - julearn - INFO -      Number of samples: 449
2023-07-19 12:41:48,020 - julearn - INFO -      Number of features: 10
2023-07-19 12:41:48,020 - julearn - INFO - ====================
2023-07-19 12:41:48,021 - julearn - INFO -
2023-07-19 12:41:48,021 - julearn - INFO -      Target type: float64
2023-07-19 12:41:48,021 - julearn - INFO - Using outer CV scheme ContinuousStratifiedKFold(method=None, n_bins=None, n_splits=5,
             random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.12/x64/lib/python3.10/site-packages/sklearn/model_selection/_split.py:725: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(

Train a linear regression model without stratification on the target.

cv = KFold(n_splits=5, shuffle=False, random_state=None)
scores, model = run_cross_validation(
    X=X,
    y=y,
    data=data_df,
    preprocess="zscore",
    cv=cv,
    problem_type="regression",
    model="linreg",
    return_estimator="final",
    scoring="neg_mean_absolute_error",
)
2023-07-19 12:41:48,075 - julearn - INFO - ==== Input Data ====
2023-07-19 12:41:48,075 - julearn - INFO - Using dataframe as input
2023-07-19 12:41:48,076 - julearn - INFO -      Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:41:48,076 - julearn - INFO -      Target: target
2023-07-19 12:41:48,076 - julearn - INFO -      Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:41:48,076 - julearn - INFO -      X_types:{}
2023-07-19 12:41:48,076 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/utils/logging.py:238: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
  warn(msg, category=category)
2023-07-19 12:41:48,077 - julearn - INFO - ====================
2023-07-19 12:41:48,077 - julearn - INFO -
2023-07-19 12:41:48,077 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:41:48,077 - julearn - INFO - Step added
2023-07-19 12:41:48,077 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:41:48,077 - julearn - INFO - Step added
2023-07-19 12:41:48,078 - julearn - INFO - = Model Parameters =
2023-07-19 12:41:48,078 - julearn - INFO - ====================
2023-07-19 12:41:48,078 - julearn - INFO -
2023-07-19 12:41:48,078 - julearn - INFO - = Data Information =
2023-07-19 12:41:48,078 - julearn - INFO -      Problem type: regression
2023-07-19 12:41:48,078 - julearn - INFO -      Number of samples: 449
2023-07-19 12:41:48,078 - julearn - INFO -      Number of features: 10
2023-07-19 12:41:48,078 - julearn - INFO - ====================
2023-07-19 12:41:48,078 - julearn - INFO -
2023-07-19 12:41:48,078 - julearn - INFO -      Target type: float64
2023-07-19 12:41:48,078 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)

Now we can compare the test scores for the models trained with and without stratification. We can combine the two outputs into one pandas dataframe.

scores_strat["model"] = "With stratification"
scores["model"] = "Without stratification"
df_scores = scores_strat[["test_score", "model"]]
df_scores = pd.concat([df_scores, scores[["test_score", "model"]]])
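Besides plotting, the two score sets can be summarized numerically with a groupby. The sketch below uses hypothetical stand-in scores (df_demo and its values are illustrative; the real scores come from the run_cross_validation calls above):

```python
import pandas as pd

# Hypothetical fold scores (negative mean absolute error), standing in
# for the two cross-validation outputs combined above.
df_demo = pd.DataFrame({
    "test_score": [-44.0, -46.5, -43.2, -45.1, -47.3,
                   -39.8, -41.2, -55.6, -40.3, -42.9],
    "model": ["With stratification"] * 5 + ["Without stratification"] * 5,
})

# Mean and spread per model make the comparison explicit before plotting.
print(df_demo.groupby("model")["test_score"].agg(["mean", "std"]))
```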

Plot a boxplot with the test scores from both models. We see here that the test score is higher when the CV splits were not stratified.

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
sns.set_style("darkgrid")
ax = sns.boxplot(x="model", y="test_score", data=df_scores)
ax = sns.swarmplot(x="model", y="test_score", data=df_scores, color=".25")
[Boxplot of test scores with and without stratification]

Total running time of the script: (0 minutes 0.872 seconds)
