.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/plot_stratified_kfold_reg.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_basic_plot_stratified_kfold_reg.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_plot_stratified_kfold_reg.py:


Stratified K-fold CV for regression analysis
============================================

This example uses the 'diabetes' data from sklearn datasets to
perform stratified K-fold CV for a regression problem.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 10-26

.. code-block:: default


    # Authors: Shammi More
    #          Federico Raimondo
    #
    # License: AGPL

    import math
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.model_selection import KFold

    from julearn import run_cross_validation
    from julearn.utils import configure_logging
    from julearn.model_selection import StratifiedGroupsKFold


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'rocket' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'rocket_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'mako' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'mako_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'icefire' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'icefire_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'vlag' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'vlag_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'flare' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'flare_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'crest' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'crest_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)


.. GENERATED FROM PYTHON SOURCE LINES 27-28

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 28-30

.. code-block:: default


    configure_logging(level='INFO')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2022-07-21 09:54:48,532 - julearn - INFO - ===== Lib Versions =====
    2022-07-21 09:54:48,532 - julearn - INFO - numpy: 1.23.1
    2022-07-21 09:54:48,532 - julearn - INFO - scipy: 1.8.1
    2022-07-21 09:54:48,532 - julearn - INFO - sklearn: 1.0.2
    2022-07-21 09:54:48,532 - julearn - INFO - pandas: 1.4.3
    2022-07-21 09:54:48,532 - julearn - INFO - julearn: 0.2.5
    2022-07-21 09:54:48,532 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 31-32

Load the diabetes data from sklearn as a pandas DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 32-34

.. code-block:: default


    features, target = load_diabetes(return_X_y=True, as_frame=True)


.. GENERATED FROM PYTHON SOURCE LINES 35-39

The dataset contains ten variables (age, sex, body mass index, average blood
pressure, and six blood serum measurements, s1-s6) from diabetes patients,
together with a quantitative measure of disease progression one year after
baseline, which is the target we are interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 39-43

.. code-block:: default


    print('Features: \n', features.head())
    print('Target: \n', target.describe())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Features:
              age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019908 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068330 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002864 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022692 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031991 -0.046641

    [5 rows x 10 columns]
    Target:
     count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64


.. GENERATED FROM PYTHON SOURCE LINES 44-47

Let's combine the features and the target in one DataFrame and create some
outliers to see the difference in model performance with and without
stratification.

.. GENERATED FROM PYTHON SOURCE LINES 47-60

.. code-block:: default


    data_df = pd.concat([features, target], axis=1)

    # Create outliers for test purpose
    new_df = data_df[(data_df.target > 145) & (data_df.target <= 150)]
    new_df['target'] = [590, 580, 597, 595, 590, 590, 600]
    data_df = pd.concat([data_df, new_df], axis=0)
    data_df = data_df.reset_index(drop=True)

    # define X, y
    X = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    y = 'target'


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /tmp/tmpyof5mika/d9ae8920fc6747d21aebbef95e7684232509be64/examples/basic/plot_stratified_kfold_reg.py:52: SettingWithCopyWarning:
    A value is trying to be set on a copy of a slice from a DataFrame.
    Try using .loc[row_indexer,col_indexer] = value instead

    See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
      new_df['target'] = [590, 580, 597, 595, 590, 590, 600]


.. GENERATED FROM PYTHON SOURCE LINES 61-62

Define the number of splits for CV and create the bins/groups used for
stratification.

.. GENERATED FROM PYTHON SOURCE LINES 62-70

.. code-block:: default


    num_splits = 7

    num_bins = math.floor(len(data_df) / num_splits)  # num of bins to be created
    bins_on = data_df.target  # variable to be used for stratification
    qc = pd.cut(bins_on.tolist(), num_bins)  # divides data in bins
    data_df['bins'] = qc.codes
    groups = 'bins'


.. GENERATED FROM PYTHON SOURCE LINES 71-72

Train a linear regression model with stratification on the target.

.. GENERATED FROM PYTHON SOURCE LINES 72-79

.. code-block:: default


    cv_stratified = StratifiedGroupsKFold(n_splits=num_splits, shuffle=False)
    scores_strat, model = run_cross_validation(
        X=X, y=y, data=data_df, preprocess_X='zscore', cv=cv_stratified,
        groups=groups, problem_type='regression', model='linreg',
        return_estimator='final', scoring='neg_mean_absolute_error')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2022-07-21 09:54:48,556 - julearn - INFO - ==== Input Data ====
    2022-07-21 09:54:48,556 - julearn - INFO - Using dataframe as input
    2022-07-21 09:54:48,556 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2022-07-21 09:54:48,556 - julearn - INFO - Target: target
    2022-07-21 09:54:48,557 - julearn - INFO - Expanded X: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2022-07-21 09:54:48,557 - julearn - INFO - Expanded Confounds: []
    2022-07-21 09:54:48,558 - julearn - INFO - Using bins as groups
    2022-07-21 09:54:48,558 - julearn - INFO - ====================
    2022-07-21 09:54:48,558 - julearn - INFO -
    2022-07-21 09:54:48,558 - julearn - INFO - ====== Model ======
    2022-07-21 09:54:48,558 - julearn - INFO - Obtaining model by name: linreg
    2022-07-21 09:54:48,558 - julearn - INFO - ===================
    2022-07-21 09:54:48,558 - julearn - INFO -
    2022-07-21 09:54:48,559 - julearn - INFO - Using scikit-learn CV scheme StratifiedGroupsKFold(n_splits=7, random_state=None, shuffle=False)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/sklearn/model_selection/_split.py:676: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=7.
      warnings.warn(


.. GENERATED FROM PYTHON SOURCE LINES 80-81

Train a linear regression model without stratification on the target.

.. GENERATED FROM PYTHON SOURCE LINES 81-88

.. code-block:: default


    cv = KFold(n_splits=num_splits, shuffle=False, random_state=None)
    scores, model = run_cross_validation(
        X=X, y=y, data=data_df, preprocess_X='zscore', cv=cv,
        problem_type='regression', model='linreg', return_estimator='final',
        scoring='neg_mean_absolute_error')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2022-07-21 09:54:48,670 - julearn - INFO - ==== Input Data ====
    2022-07-21 09:54:48,671 - julearn - INFO - Using dataframe as input
    2022-07-21 09:54:48,671 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2022-07-21 09:54:48,671 - julearn - INFO - Target: target
    2022-07-21 09:54:48,671 - julearn - INFO - Expanded X: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2022-07-21 09:54:48,671 - julearn - INFO - Expanded Confounds: []
    2022-07-21 09:54:48,672 - julearn - INFO - ====================
    2022-07-21 09:54:48,672 - julearn - INFO -
    2022-07-21 09:54:48,672 - julearn - INFO - ====== Model ======
    2022-07-21 09:54:48,672 - julearn - INFO - Obtaining model by name: linreg
    2022-07-21 09:54:48,672 - julearn - INFO - ===================
    2022-07-21 09:54:48,672 - julearn - INFO -
    2022-07-21 09:54:48,672 - julearn - INFO - Using scikit-learn CV scheme KFold(n_splits=7, random_state=None, shuffle=False)


.. GENERATED FROM PYTHON SOURCE LINES 89-91

Now we can compare the test scores of the models trained with and without
stratification. We combine the two outputs into a single pandas DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 91-97

.. code-block:: default


    scores_strat['model'] = 'With stratification'
    scores['model'] = 'Without stratification'
    df_scores = scores_strat[['test_score', 'model']]
    df_scores = pd.concat([df_scores, scores[['test_score', 'model']]])


.. GENERATED FROM PYTHON SOURCE LINES 98-101

Plot a boxplot with the test scores from both models. We see here that the
variance of the test score is much higher when the CV splits were not
stratified.

.. GENERATED FROM PYTHON SOURCE LINES 101-106

.. code-block:: default


    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    ax = sns.boxplot(x='model', y='test_score', data=df_scores)
    ax = sns.swarmplot(x="model", y="test_score", data=df_scores, color=".25")


.. image-sg:: /auto_examples/basic/images/sphx_glr_plot_stratified_kfold_reg_001.png
   :alt: plot stratified kfold reg
   :srcset: /auto_examples/basic/images/sphx_glr_plot_stratified_kfold_reg_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 0.389 seconds)


.. _sphx_glr_download_auto_examples_basic_plot_stratified_kfold_reg.py:

.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_stratified_kfold_reg.py `


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_stratified_kfold_reg.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
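
The contrast between the two CV schemes in this example can be illustrated
with a small, dependency-free sketch. It is not julearn's
``StratifiedGroupsKFold`` implementation: the helper names
(``equal_width_bins``, ``stratified_folds``, ``contiguous_folds``) and the toy
target are hypothetical, chosen only to suggest why binning a continuous
target and spreading each bin across the folds keeps outliers that were
appended at the end of the data from piling up in the last folds of an
unshuffled split.

.. code-block:: python

    from collections import defaultdict


    def equal_width_bins(values, num_bins):
        """Assign each value a 0-based code for one of num_bins equal-width bins.

        Assumes the values are not all identical (otherwise the width is zero).
        """
        lo, hi = min(values), max(values)
        width = (hi - lo) / num_bins
        # The maximum value would fall just past the last bin, so clamp it.
        return [min(int((v - lo) / width), num_bins - 1) for v in values]


    def stratified_folds(codes, n_splits):
        """Deal the members of each bin round-robin across n_splits test folds."""
        by_bin = defaultdict(list)
        for idx, code in enumerate(codes):
            by_bin[code].append(idx)
        folds = [[] for _ in range(n_splits)]
        position = 0  # keeps rotating across bins so small bins do not all hit fold 0
        for code in sorted(by_bin):
            for idx in by_bin[code]:
                folds[position % n_splits].append(idx)
                position += 1
        return folds


    def contiguous_folds(n_samples, n_splits):
        """Split indices into consecutive blocks, as an unshuffled K-fold would."""
        base, extra = divmod(n_samples, n_splits)
        folds, start = [], 0
        for k in range(n_splits):
            size = base + (1 if k < extra else 0)
            folds.append(list(range(start, start + size)))
            start += size
        return folds


    def count_outliers(folds, target, threshold=500):
        """Count how many target values above the threshold land in each fold."""
        return [sum(target[i] > threshold for i in fold) for fold in folds]


    # Hypothetical toy target: 14 ordinary values followed by 7 appended outliers.
    toy_target = [100 + 5 * i for i in range(14)] + [590, 580, 597, 595, 590, 590, 600]

    codes = equal_width_bins(toy_target, num_bins=3)
    strat = stratified_folds(codes, n_splits=7)
    plain = contiguous_folds(len(toy_target), n_splits=7)

    outliers_stratified = count_outliers(strat, toy_target)
    outliers_contiguous = count_outliers(plain, toy_target)
    print("stratified:", outliers_stratified)
    print("contiguous:", outliers_contiguous)

Under these assumptions, the contiguous split places six of the seven
outliers in the last two test folds, while the bin-aware split gives every
test fold exactly one, which is the kind of imbalance that inflates the
variance of the unstratified test scores.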