Stratified K-fold CV for regression analysis
This example uses the diabetes data from sklearn datasets to perform stratified K-fold CV for a regression problem.
# Authors: Shammi More <s.more@fz-juelich.de>
# Federico Raimondo <f.raimondo@fz-juelich.de>
# Leonard Sasse <l.sasse@fz-juelich.de>
# License: AGPL
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold
from julearn import run_cross_validation
from julearn.utils import configure_logging
from julearn.model_selection import ContinuousStratifiedKFold
Set the logging level to info to see extra information.
configure_logging(level="INFO")
2024-05-03 15:21:44,870 - julearn - INFO - ===== Lib Versions =====
2024-05-03 15:21:44,870 - julearn - INFO - numpy: 1.26.4
2024-05-03 15:21:44,871 - julearn - INFO - scipy: 1.13.0
2024-05-03 15:21:44,871 - julearn - INFO - sklearn: 1.4.2
2024-05-03 15:21:44,871 - julearn - INFO - pandas: 2.1.4
2024-05-03 15:21:44,871 - julearn - INFO - julearn: 0.3.2
2024-05-03 15:21:44,871 - julearn - INFO - ========================
Load the diabetes data from sklearn as a pandas.DataFrame.
features, target = load_diabetes(return_X_y=True, as_frame=True)
The dataset contains ten variables (age, sex, body mass index, average blood pressure, and six blood serum measurements, s1-s6) recorded for diabetes patients, plus a quantitative measure of disease progression one year after baseline, which is the target we want to predict.
print("Features: \n", features.head())
print("Target: \n", target.describe())
Features:
age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Target:
count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
Name: target, dtype: float64
Let’s combine features and target together in one dataframe and create some outliers to see the difference in model performance with and without stratification.
data_df = pd.concat([features, target], axis=1)
# Create outliers for test purpose
# Copy the selection so the assignment below does not act on a slice of data_df
# (this avoids pandas' SettingWithCopyWarning)
new_df = data_df[(data_df.target > 145) & (data_df.target <= 150)].copy()
new_df["target"] = [590, 580, 597, 595, 590, 590, 600]
data_df = pd.concat([data_df, new_df], axis=0)
data_df = data_df.reset_index(drop=True)
# Define X, y
X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"
Define the number of bins (groups) for stratification. The idea is that each “bin” will be equally represented in each fold. The number of bins should be chosen such that each bin contains enough samples for every fold to receive more than one sample from it. Let’s look at a few histograms with different numbers of bins.
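The sample-count requirement described above can also be checked numerically. The following is a minimal sketch using only the standard library; the helper names `bin_counts` and `sparse_bins` and the toy target are made up for illustration, and equal-width bins are assumed (this is not julearn's code).

```python
from collections import Counter


def bin_counts(values, n_bins):
    """Count samples per equal-width bin spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    counts = Counter()
    for v in values:
        # The maximum value goes into the last bin rather than a new one.
        counts[min(int((v - lo) / width), n_bins - 1)] += 1
    return [counts[i] for i in range(n_bins)]


def sparse_bins(values, n_bins, n_splits):
    """Return indices of non-empty bins that hold fewer samples than folds."""
    counts = bin_counts(values, n_bins)
    return [i for i, c in enumerate(counts) if 0 < c < n_splits]


# Toy target: an evenly spread bulk plus a handful of extreme values.
toy_target = list(range(25, 346, 4)) + [580, 590, 590, 595, 597, 600]
for n_bins in (5, 20, 40):
    n_sparse = len(sparse_bins(toy_target, n_bins, n_splits=5))
    print(f"{n_bins} bins -> {n_sparse} sparse bin(s)")
```

A bin reported as sparse holds fewer samples than there are folds, so it cannot contribute a sample to every fold. More bins give finer granularity but make sparse bins more likely.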
[Figures: histograms of the target variable with different numbers of bins]
The histograms show that the target is not uniformly distributed: it is skewed towards the lower end and contains some outliers. In any case, even with a low number of splits, some groups will not be represented in each fold. Let’s continue with 40 bins, which gives a good granularity.
cv_stratified = ContinuousStratifiedKFold(n_bins=40, n_splits=5, shuffle=False)
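To make the idea of stratifying on a continuous target concrete, the sketch below bins the target and then deals the members of each bin out to the folds in turn. It is an illustration only, not julearn's implementation of ContinuousStratifiedKFold: the function name `continuous_stratified_folds` and the toy data are hypothetical, equal-width bins are assumed, and only the standard library is used.

```python
from collections import defaultdict


def continuous_stratified_folds(values, n_bins, n_splits):
    """Return one test-fold index per sample, stratified on binned values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical

    # Step 1: discretise the continuous target into equal-width bins.
    members = defaultdict(list)
    for idx, v in enumerate(values):
        members[min(int((v - lo) / width), n_bins - 1)].append(idx)

    # Step 2: deal the samples of each bin out to the folds in turn, carrying
    # the position over between bins so that fold sizes stay balanced.
    folds = [0] * len(values)
    position = 0
    for b in sorted(members):
        for idx in members[b]:
            folds[idx] = position % n_splits
            position += 1
    return folds


# Toy target with the extreme values appended at the end of the data.
toy_target = list(range(20)) + [100, 101, 102, 103, 104]
n_splits = 5

strat_folds = continuous_stratified_folds(toy_target, n_bins=2, n_splits=n_splits)
# Unshuffled, unstratified split into contiguous blocks for comparison.
plain_folds = [i * n_splits // len(toy_target) for i in range(len(toy_target))]

for name, folds in (("stratified", strat_folds), ("contiguous", plain_folds)):
    extremes_per_fold = [folds[20:].count(k) for k in range(n_splits)]
    print(name, extremes_per_fold)
```

In this toy data each stratified fold receives exactly one of the extreme values, whereas the contiguous split puts all five into the last fold. The example data is built in a similar way: the outliers are appended at the end of data_df, so an unshuffled KFold also places them together in the last fold.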
Train a linear regression model with stratification on target.
2024-05-03 15:21:45,452 - julearn - INFO - ==== Input Data ====
2024-05-03 15:21:45,452 - julearn - INFO - Using dataframe as input
2024-05-03 15:21:45,452 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-03 15:21:45,452 - julearn - INFO - Target: target
2024-05-03 15:21:45,452 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-03 15:21:45,452 - julearn - INFO - X_types:{}
2024-05-03 15:21:45,453 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:505: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
warn_with_log(
2024-05-03 15:21:45,453 - julearn - INFO - ====================
2024-05-03 15:21:45,453 - julearn - INFO -
2024-05-03 15:21:45,453 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:21:45,454 - julearn - INFO - Step added
2024-05-03 15:21:45,454 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:21:45,454 - julearn - INFO - Step added
2024-05-03 15:21:45,454 - julearn - INFO - = Model Parameters =
2024-05-03 15:21:45,454 - julearn - INFO - ====================
2024-05-03 15:21:45,454 - julearn - INFO -
2024-05-03 15:21:45,454 - julearn - INFO - = Data Information =
2024-05-03 15:21:45,454 - julearn - INFO - Problem type: regression
2024-05-03 15:21:45,454 - julearn - INFO - Number of samples: 449
2024-05-03 15:21:45,454 - julearn - INFO - Number of features: 10
2024-05-03 15:21:45,454 - julearn - INFO - ====================
2024-05-03 15:21:45,455 - julearn - INFO -
2024-05-03 15:21:45,455 - julearn - INFO - Target type: float64
2024-05-03 15:21:45,455 - julearn - INFO - Using outer CV scheme ContinuousStratifiedKFold(method='binning', n_bins=40, n_splits=5,
random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
warnings.warn(
2024-05-03 15:21:45,494 - julearn - INFO - Fitting final model
Train a linear regression model without stratification on target.
2024-05-03 15:21:45,498 - julearn - INFO - ==== Input Data ====
2024-05-03 15:21:45,499 - julearn - INFO - Using dataframe as input
2024-05-03 15:21:45,499 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-03 15:21:45,499 - julearn - INFO - Target: target
2024-05-03 15:21:45,499 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-03 15:21:45,499 - julearn - INFO - X_types:{}
2024-05-03 15:21:45,499 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:505: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
warn_with_log(
2024-05-03 15:21:45,499 - julearn - INFO - ====================
2024-05-03 15:21:45,499 - julearn - INFO -
2024-05-03 15:21:45,500 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:21:45,500 - julearn - INFO - Step added
2024-05-03 15:21:45,500 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:21:45,500 - julearn - INFO - Step added
2024-05-03 15:21:45,500 - julearn - INFO - = Model Parameters =
2024-05-03 15:21:45,500 - julearn - INFO - ====================
2024-05-03 15:21:45,500 - julearn - INFO -
2024-05-03 15:21:45,500 - julearn - INFO - = Data Information =
2024-05-03 15:21:45,500 - julearn - INFO - Problem type: regression
2024-05-03 15:21:45,500 - julearn - INFO - Number of samples: 449
2024-05-03 15:21:45,500 - julearn - INFO - Number of features: 10
2024-05-03 15:21:45,500 - julearn - INFO - ====================
2024-05-03 15:21:45,501 - julearn - INFO -
2024-05-03 15:21:45,501 - julearn - INFO - Target type: float64
2024-05-03 15:21:45,501 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:21:45,537 - julearn - INFO - Fitting final model
Now we can compare the test scores of the models trained with and without stratification. We can combine the two outputs into a single pandas.DataFrame.
scores_strat["model"] = "With stratification"
scores["model"] = "Without stratification"
df_scores = scores_strat[["test_score", "model"]]
df_scores = pd.concat([df_scores, scores[["test_score", "model"]]])
Plot a boxplot of the test scores from both models. We see here that the test score is higher when the CV splits were not stratified.
[Figure: boxplot of test scores with and without stratification]