.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_starting/run_grouped_cv.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_starting_run_grouped_cv.py:


Grouped CV
==========

This example uses the ``fMRI`` dataset and performs grouped cross-validation
(``GroupKFold`` and ``StratifiedGroupKFold``) for classification using a
Random Forest classifier.

References
----------

  Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of
  cognitive control in context-dependent decision-making. Cerebral Cortex.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 17-32

.. code-block:: Python

    # Authors: Federico Raimondo
    #          Shammi More
    #          Kimia Nazarzadeh
    # License: AGPL

    # Importing the necessary Python libraries
    import numpy as np

    from seaborn import load_dataset
    from sklearn.model_selection import GroupKFold, StratifiedGroupKFold

    from julearn.utils import configure_logging
    from julearn import run_cross_validation


.. GENERATED FROM PYTHON SOURCE LINES 33-34

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 34-36

.. code-block:: Python

    configure_logging(level="INFO")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:52,286 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:53:52,286 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:53:52,286 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:53:52,286 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:53:52,286 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:53:52,286 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:53:52,286 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 37-39

Dealing with Cross-Validation techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 39-42

.. code-block:: Python

    df_fmri = load_dataset("fmri")


.. GENERATED FROM PYTHON SOURCE LINES 43-44

First, let's get some information on what the dataset has:

.. GENERATED FROM PYTHON SOURCE LINES 44-47

.. code-block:: Python

    print(df_fmri.head())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

      subject  timepoint event    region    signal
    0     s13         18  stim  parietal -0.017552
    1      s5         14  stim  parietal -0.080883
    2     s12         18  stim  parietal -0.081033
    3     s11         18  stim  parietal -0.046134
    4     s10         18  stim  parietal -0.037970


.. GENERATED FROM PYTHON SOURCE LINES 48-53

From this information, we can infer that it is an fMRI study in which there
were several subjects, timepoints and events, with signals extracted from
several brain regions.

Let's check how many kinds of each we have.

.. GENERATED FROM PYTHON SOURCE LINES 53-58

.. code-block:: Python

    print(df_fmri["event"].unique())
    print(df_fmri["region"].unique())
    print(sorted(df_fmri["timepoint"].unique()))
    print(df_fmri["subject"].unique())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['stim' 'cue']
    ['parietal' 'frontal']
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
    ['s13' 's5' 's12' 's11' 's10' 's9' 's8' 's7' 's6' 's4' 's3' 's2' 's1' 's0']


.. GENERATED FROM PYTHON SOURCE LINES 59-62

We have data from parietal and frontal regions during 2 types of events
(*cue* and *stim*), at 19 timepoints (0 to 18) and for 14 subjects.
Let's see how many samples we have for each condition.

.. GENERATED FROM PYTHON SOURCE LINES 62-72

.. code-block:: Python

    print(df_fmri.groupby(["subject", "timepoint", "event", "region"]).count())
    print(
        np.unique(
            df_fmri.groupby(["subject", "timepoint", "event", "region"])
            .count()
            .values
        )
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                                      signal
    subject timepoint event region
    s0      0         cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1
            1         cue   frontal        1
    ...                                  ...
    s9      17        stim  parietal       1
            18        cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1

    [1064 rows x 1 columns]
    [1]


.. GENERATED FROM PYTHON SOURCE LINES 73-79

We have exactly one value per condition.

Let's try to build a model that uses the parietal and frontal signals to
predict whether the event was a *cue* or a *stim*.

First we define our X and y variables.

.. GENERATED FROM PYTHON SOURCE LINES 79-82

.. code-block:: Python

    X = ["parietal", "frontal"]
    y = "event"


.. GENERATED FROM PYTHON SOURCE LINES 83-88

In order for this to work, both *parietal* and *frontal* must be columns.
We need to *pivot* the table.

The values of *region* will be the columns. The column *signal* will be the
values. And the columns *subject*, *timepoint* and *event* will be the index.

.. GENERATED FROM PYTHON SOURCE LINES 88-94

.. code-block:: Python

    df_fmri = df_fmri.pivot(
        index=["subject", "timepoint", "event"], columns="region", values="signal"
    )

    df_fmri = df_fmri.reset_index()


.. GENERATED FROM PYTHON SOURCE LINES 95-97

Here we want to zscore all the features and then train a Random Forest
classifier.

.. GENERATED FROM PYTHON SOURCE LINES 97-109

.. code-block:: Python

    scores = run_cross_validation(
        X=X,
        y=y,
        data=df_fmri,
        preprocess="zscore",
        model="rf",
        problem_type="classification",
    )

    print(scores["test_score"].mean())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:52,304 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:52,305 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:52,305 - julearn - INFO - 	Features: ['parietal', 'frontal']
    2026-01-16 10:53:52,305 - julearn - INFO - 	Target: event
    2026-01-16 10:53:52,305 - julearn - INFO - 	Expanded features: ['parietal', 'frontal']
    2026-01-16 10:53:52,305 - julearn - INFO - 	X_types:{}
    2026-01-16 10:53:52,305 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:52,306 - julearn - INFO - ====================
    2026-01-16 10:53:52,306 - julearn - INFO -
    2026-01-16 10:53:52,306 - julearn - INFO - Adding step zscore that applies to ColumnTypes
    2026-01-16 10:53:52,306 - julearn - INFO - Step added
    2026-01-16 10:53:52,307 - julearn - INFO - Adding step rf that applies to ColumnTypes
    2026-01-16 10:53:52,307 - julearn - INFO - Step added
    2026-01-16 10:53:52,307 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:52,308 - julearn - INFO - ====================
    2026-01-16 10:53:52,308 - julearn - INFO -
    2026-01-16 10:53:52,308 - julearn - INFO - = Data Information =
    2026-01-16 10:53:52,308 - julearn - INFO - 	Problem type: classification
    2026-01-16 10:53:52,308 - julearn - INFO - 	Number of samples: 532
    2026-01-16 10:53:52,308 - julearn - INFO - 	Number of features: 2
    2026-01-16 10:53:52,308 - julearn - INFO - ====================
    2026-01-16 10:53:52,308 - julearn - INFO -
    2026-01-16 10:53:52,308 - julearn - INFO - 	Number of classes: 2
    2026-01-16 10:53:52,309 - julearn - INFO - 	Target type: object
    2026-01-16 10:53:52,309 - julearn - INFO - 	Class distributions: event
    cue     266
    stim    266
    Name: count, dtype: int64
    2026-01-16 10:53:52,309 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
    2026-01-16 10:53:52,310 - julearn - INFO - Binary classification problem detected.
    0.6841826838300122


.. GENERATED FROM PYTHON SOURCE LINES 110-111

Train a classification model with grouped, stratified cross-validation,
using the subject as the grouping variable.

.. GENERATED FROM PYTHON SOURCE LINES 111-125

.. code-block:: Python

    cv_stratified = StratifiedGroupKFold(n_splits=2)
    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=df_fmri,
        groups="subject",
        model="rf",
        problem_type="classification",
        cv=cv_stratified,
        return_estimator="final",
    )

    print(scores["test_score"].mean())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:53,042 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:53,042 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:53,042 - julearn - INFO - 	Features: ['parietal', 'frontal']
    2026-01-16 10:53:53,042 - julearn - INFO - 	Target: event
    2026-01-16 10:53:53,042 - julearn - INFO - 	Expanded features: ['parietal', 'frontal']
    2026-01-16 10:53:53,042 - julearn - INFO - 	X_types:{}
    2026-01-16 10:53:53,042 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:53,043 - julearn - INFO - Using subject as groups
    2026-01-16 10:53:53,043 - julearn - INFO - ====================
    2026-01-16 10:53:53,043 - julearn - INFO -
    2026-01-16 10:53:53,043 - julearn - INFO - Adding step rf that applies to ColumnTypes
    2026-01-16 10:53:53,044 - julearn - INFO - Step added
    2026-01-16 10:53:53,044 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:53,044 - julearn - INFO - ====================
    2026-01-16 10:53:53,044 - julearn - INFO -
    2026-01-16 10:53:53,044 - julearn - INFO - = Data Information =
    2026-01-16 10:53:53,045 - julearn - INFO - 	Problem type: classification
    2026-01-16 10:53:53,045 - julearn - INFO - 	Number of samples: 532
    2026-01-16 10:53:53,045 - julearn - INFO - 	Number of features: 2
    2026-01-16 10:53:53,045 - julearn - INFO - ====================
    2026-01-16 10:53:53,045 - julearn - INFO -
    2026-01-16 10:53:53,045 - julearn - INFO - 	Number of classes: 2
    2026-01-16 10:53:53,045 - julearn - INFO - 	Target type: object
    2026-01-16 10:53:53,046 - julearn - INFO - 	Class distributions: event
    cue     266
    stim    266
    Name: count, dtype: int64
    2026-01-16 10:53:53,046 - julearn - INFO - Using outer CV scheme StratifiedGroupKFold(n_splits=2, random_state=None, shuffle=False) (incl. final model)
    2026-01-16 10:53:53,047 - julearn - INFO - Binary classification problem detected.
    0.6710526315789473


.. GENERATED FROM PYTHON SOURCE LINES 126-127

Train a classification model with grouped cross-validation, this time
without stratification.

.. GENERATED FROM PYTHON SOURCE LINES 127-140

.. code-block:: Python

    cv = GroupKFold(n_splits=2)

    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=df_fmri,
        groups="subject",
        model="rf",
        problem_type="classification",
        cv=cv,
        return_estimator="final",
    )

    print(scores["test_score"].mean())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:53,513 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:53,513 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:53,513 - julearn - INFO - 	Features: ['parietal', 'frontal']
    2026-01-16 10:53:53,513 - julearn - INFO - 	Target: event
    2026-01-16 10:53:53,513 - julearn - INFO - 	Expanded features: ['parietal', 'frontal']
    2026-01-16 10:53:53,513 - julearn - INFO - 	X_types:{}
    2026-01-16 10:53:53,514 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:53,514 - julearn - INFO - Using subject as groups
    2026-01-16 10:53:53,514 - julearn - INFO - ====================
    2026-01-16 10:53:53,515 - julearn - INFO -
    2026-01-16 10:53:53,515 - julearn - INFO - Adding step rf that applies to ColumnTypes
    2026-01-16 10:53:53,515 - julearn - INFO - Step added
    2026-01-16 10:53:53,515 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:53,516 - julearn - INFO - ====================
    2026-01-16 10:53:53,516 - julearn - INFO -
    2026-01-16 10:53:53,516 - julearn - INFO - = Data Information =
    2026-01-16 10:53:53,516 - julearn - INFO - 	Problem type: classification
    2026-01-16 10:53:53,516 - julearn - INFO - 	Number of samples: 532
    2026-01-16 10:53:53,516 - julearn - INFO - 	Number of features: 2
    2026-01-16 10:53:53,516 - julearn - INFO - ====================
    2026-01-16 10:53:53,516 - julearn - INFO -
    2026-01-16 10:53:53,517 - julearn - INFO - 	Number of classes: 2
    2026-01-16 10:53:53,517 - julearn - INFO - 	Target type: object
    2026-01-16 10:53:53,517 - julearn - INFO - 	Class distributions: event
    cue     266
    stim    266
    Name: count, dtype: int64
    2026-01-16 10:53:53,518 - julearn - INFO - Using outer CV scheme GroupKFold(n_splits=2, random_state=None, shuffle=False) (incl. final model)
    2026-01-16 10:53:53,518 - julearn - INFO - Binary classification problem detected.
    0.6672932330827068


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.660 seconds)
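
To make explicit what the two grouped schemes above guarantee, the sketch
below assigns whole subjects to folds and checks that no subject ends up on
both sides of a split. It is a simplified, pure-Python illustration and not
scikit-learn's implementation: the helper ``group_folds`` and its balancing
rule (largest group goes into the currently smallest fold) are assumptions
made for this example, and the subject labels are synthetic stand-ins that
only mimic the shape of the pivoted table (14 subjects, 38 samples each).

.. code-block:: Python

    from collections import Counter


    def group_folds(groups, n_splits):
        """Return one fold index per sample, keeping each group in one fold.

        Hypothetical helper for illustration only.
        """
        counts = Counter(groups)
        fold_sizes = [0] * n_splits
        group_to_fold = {}
        # Place the largest groups first, each into the currently smallest fold.
        for group, size in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            fold = fold_sizes.index(min(fold_sizes))
            group_to_fold[group] = fold
            fold_sizes[fold] += size
        return [group_to_fold[g] for g in groups]


    # Synthetic labels: 14 subjects x 19 timepoints x 2 events = 532 samples.
    groups = [f"s{i}" for i in range(14) for _ in range(19 * 2)]
    folds = group_folds(groups, n_splits=2)

    for test_fold in range(2):
        train_subjects = {g for g, f in zip(groups, folds) if f != test_fold}
        test_subjects = {g for g, f in zip(groups, folds) if f == test_fold}
        # No subject contributes samples to both the train and the test side.
        assert train_subjects.isdisjoint(test_subjects)
        print(test_fold, sorted(test_subjects))

The first run in this example used the default ``KFold`` scheme without
groups, so samples from the same subject could appear in both the training
and the test folds. Passing ``groups="subject"`` together with a grouped
splitter rules that out, which is the property the assertion checks here.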