Grouped CV

This example uses the seaborn fMRI dataset and runs grouped cross-validation (GroupKFold and StratifiedGroupKFold) for classification with a Random Forest classifier.

References

Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of cognitive control in context-dependent decision-making. Cerebral Cortex.

# Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
#          Shammi More <s.more@fz-juelich.de>
#          Kimia Nazarzadeh <k.nazarzadeh@fz-juelich.de>
# License: AGPL

# Importing the necessary Python libraries
import numpy as np

from seaborn import load_dataset
from sklearn.model_selection import GroupKFold, StratifiedGroupKFold

from julearn.utils import configure_logging
from julearn import run_cross_validation

Set the logging level to INFO to see extra information.

configure_logging(level="INFO")
2024-04-04 14:43:46,606 - julearn - INFO - ===== Lib Versions =====
2024-04-04 14:43:46,606 - julearn - INFO - numpy: 1.26.4
2024-04-04 14:43:46,606 - julearn - INFO - scipy: 1.13.0
2024-04-04 14:43:46,606 - julearn - INFO - sklearn: 1.4.1.post1
2024-04-04 14:43:46,606 - julearn - INFO - pandas: 2.1.4
2024-04-04 14:43:46,606 - julearn - INFO - julearn: 0.3.2.dev24
2024-04-04 14:43:46,606 - julearn - INFO - ========================

Dealing with Cross-Validation techniques

df_fmri = load_dataset("fmri")

First, let's get some information on what the dataset contains:

print(df_fmri.head())
  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970

From this information, we can infer that it is an fMRI study with several subjects, timepoints, and events, and with a signal extracted from several brain regions.

Let's check which values each of these columns takes.

print(df_fmri["event"].unique())
print(df_fmri["region"].unique())
print(sorted(df_fmri["timepoint"].unique()))
print(df_fmri["subject"].unique())
['stim' 'cue']
['parietal' 'frontal']
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
['s13' 's5' 's12' 's11' 's10' 's9' 's8' 's7' 's6' 's4' 's3' 's2' 's1' 's0']

We have data from parietal and frontal regions for 2 types of events (cue and stim), at 19 timepoints (0 to 18), and for 14 subjects. Let's see how many samples we have for each condition.

print(df_fmri.groupby(["subject", "timepoint", "event", "region"]).count())
print(
    np.unique(
        df_fmri.groupby(["subject", "timepoint", "event", "region"])
        .count()
        .values
    )
)
                                  signal
subject timepoint event region
s0      0         cue   frontal        1
                        parietal       1
                  stim  frontal        1
                        parietal       1
        1         cue   frontal        1
...                                  ...
s9      17        stim  parietal       1
        18        cue   frontal        1
                        parietal       1
                  stim  frontal        1
                        parietal       1

[1064 rows x 1 columns]
[1]

We have exactly one value per condition.

Let's try to build a model that uses the parietal and frontal signals to predict whether the event was a cue or a stim.

First, we define our X and y variables.

X = ["parietal", "frontal"]
y = "event"

In order for this to work, both parietal and frontal must be columns. We need to pivot the table.

The values of region will become the columns, the column signal will provide the values, and the columns subject, timepoint and event will form the index.

df_fmri = df_fmri.pivot(
    index=["subject", "timepoint", "event"], columns="region", values="signal"
)

df_fmri = df_fmri.reset_index()
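To make the reshaping step concrete, here is a minimal sketch of the same pivot on a tiny hand-made table. The values and the names long_df and wide_df are made up for illustration and are not part of the original example.

```python
import pandas as pd

# Toy long-format table (made-up values, not the real fMRI data):
# one row per (subject, timepoint, event, region) combination.
long_df = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s0", "s0"],
        "timepoint": [0, 0, 0, 0],
        "event": ["cue", "cue", "stim", "stim"],
        "region": ["frontal", "parietal", "frontal", "parietal"],
        "signal": [0.1, 0.2, 0.3, 0.4],
    }
)

# Each region becomes its own column; the remaining identifiers form the
# index, which reset_index() turns back into regular columns.
wide_df = long_df.pivot(
    index=["subject", "timepoint", "event"], columns="region", values="signal"
).reset_index()

print(wide_df)
```

Four long-format rows collapse into two wide rows, one per event, each holding both a frontal and a parietal value.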

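Here we want to z-score all the features and then train a Random Forest classifier. No groups are passed in this first run, so julearn falls back to its default CV scheme, which does not take the subject into account.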

scores = run_cross_validation(
    X=X,
    y=y,
    data=df_fmri,
    preprocess="zscore",
    model="rf",
    problem_type="classification",
)

print(scores["test_score"].mean())
2024-04-04 14:43:46,625 - julearn - INFO - ==== Input Data ====
2024-04-04 14:43:46,625 - julearn - INFO - Using dataframe as input
2024-04-04 14:43:46,625 - julearn - INFO -      Features: ['parietal', 'frontal']
2024-04-04 14:43:46,625 - julearn - INFO -      Target: event
2024-04-04 14:43:46,626 - julearn - INFO -      Expanded features: ['parietal', 'frontal']
2024-04-04 14:43:46,626 - julearn - INFO -      X_types:{}
2024-04-04 14:43:46,626 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:507: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
  warn_with_log(
2024-04-04 14:43:46,626 - julearn - INFO - ====================
2024-04-04 14:43:46,626 - julearn - INFO -
2024-04-04 14:43:46,627 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:43:46,627 - julearn - INFO - Step added
2024-04-04 14:43:46,627 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:43:46,627 - julearn - INFO - Step added
2024-04-04 14:43:46,628 - julearn - INFO - = Model Parameters =
2024-04-04 14:43:46,628 - julearn - INFO - ====================
2024-04-04 14:43:46,628 - julearn - INFO -
2024-04-04 14:43:46,628 - julearn - INFO - = Data Information =
2024-04-04 14:43:46,628 - julearn - INFO -      Problem type: classification
2024-04-04 14:43:46,628 - julearn - INFO -      Number of samples: 532
2024-04-04 14:43:46,628 - julearn - INFO -      Number of features: 2
2024-04-04 14:43:46,628 - julearn - INFO - ====================
2024-04-04 14:43:46,628 - julearn - INFO -
2024-04-04 14:43:46,628 - julearn - INFO -      Number of classes: 2
2024-04-04 14:43:46,628 - julearn - INFO -      Target type: object
2024-04-04 14:43:46,629 - julearn - INFO -      Class distributions: event
cue     266
stim    266
Name: count, dtype: int64
2024-04-04 14:43:46,629 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:43:46,629 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
0.6841826838300122
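According to the log, this run used a plain KFold scheme, which knows nothing about subjects. The following sketch, on made-up data, illustrates why that can be a problem: samples from the same subject may end up in both the training and the test set. The names groups_toy, X_toy and shared_subjects are hypothetical, and the interleaved row order is chosen on purpose to make the overlap obvious; how much overlap occurs in practice depends on how the rows are ordered.

```python
import numpy as np
from sklearn.model_selection import KFold

# 4 hypothetical subjects with 5 samples each, interleaved row by row.
groups_toy = np.tile(np.array(["s0", "s1", "s2", "s3"]), 5)
X_toy = np.zeros((len(groups_toy), 2))  # feature values are irrelevant here

shared_subjects = []
for train_idx, test_idx in KFold(n_splits=5).split(X_toy):
    # Subjects that appear on both sides of the split.
    overlap = set(groups_toy[train_idx]) & set(groups_toy[test_idx])
    shared_subjects.append(sorted(overlap))

for fold, overlap in enumerate(shared_subjects):
    print(f"fold {fold}: subjects in both train and test: {overlap}")
```

In this toy setup every fold shares all four subjects between training and test data, which is the situation that group-aware CV schemes are designed to avoid.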

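Now train the classification model using a stratified, grouped CV scheme (StratifiedGroupKFold), with subject as the grouping variable, so that no subject appears in both the training and the test set.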

cv_stratified = StratifiedGroupKFold(n_splits=2)
scores, model = run_cross_validation(
    X=X,
    y=y,
    data=df_fmri,
    groups="subject",
    model="rf",
    problem_type="classification",
    cv=cv_stratified,
    return_estimator="final",
)

print(scores["test_score"].mean())
2024-04-04 14:43:47,268 - julearn - INFO - ==== Input Data ====
2024-04-04 14:43:47,268 - julearn - INFO - Using dataframe as input
2024-04-04 14:43:47,268 - julearn - INFO -      Features: ['parietal', 'frontal']
2024-04-04 14:43:47,268 - julearn - INFO -      Target: event
2024-04-04 14:43:47,268 - julearn - INFO -      Expanded features: ['parietal', 'frontal']
2024-04-04 14:43:47,268 - julearn - INFO -      X_types:{}
2024-04-04 14:43:47,268 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:507: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
  warn_with_log(
2024-04-04 14:43:47,269 - julearn - INFO - Using subject as groups
2024-04-04 14:43:47,269 - julearn - INFO - ====================
2024-04-04 14:43:47,269 - julearn - INFO -
2024-04-04 14:43:47,269 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:43:47,269 - julearn - INFO - Step added
2024-04-04 14:43:47,270 - julearn - INFO - = Model Parameters =
2024-04-04 14:43:47,270 - julearn - INFO - ====================
2024-04-04 14:43:47,270 - julearn - INFO -
2024-04-04 14:43:47,270 - julearn - INFO - = Data Information =
2024-04-04 14:43:47,270 - julearn - INFO -      Problem type: classification
2024-04-04 14:43:47,270 - julearn - INFO -      Number of samples: 532
2024-04-04 14:43:47,270 - julearn - INFO -      Number of features: 2
2024-04-04 14:43:47,270 - julearn - INFO - ====================
2024-04-04 14:43:47,270 - julearn - INFO -
2024-04-04 14:43:47,270 - julearn - INFO -      Number of classes: 2
2024-04-04 14:43:47,270 - julearn - INFO -      Target type: object
2024-04-04 14:43:47,271 - julearn - INFO -      Class distributions: event
cue     266
stim    266
Name: count, dtype: int64
2024-04-04 14:43:47,271 - julearn - INFO - Using outer CV scheme StratifiedGroupKFold(n_splits=2, random_state=None, shuffle=False)
2024-04-04 14:43:47,271 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
0.6898496240601504

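Finally, train the classification model using a grouped CV scheme that does not stratify on the target (GroupKFold), again with subject as the grouping variable.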

cv = GroupKFold(n_splits=2)
scores, model = run_cross_validation(
    X=X,
    y=y,
    data=df_fmri,
    groups="subject",
    model="rf",
    problem_type="classification",
    cv=cv,
    return_estimator="final",
)

print(scores["test_score"].mean())
2024-04-04 14:43:47,625 - julearn - INFO - ==== Input Data ====
2024-04-04 14:43:47,626 - julearn - INFO - Using dataframe as input
2024-04-04 14:43:47,626 - julearn - INFO -      Features: ['parietal', 'frontal']
2024-04-04 14:43:47,626 - julearn - INFO -      Target: event
2024-04-04 14:43:47,626 - julearn - INFO -      Expanded features: ['parietal', 'frontal']
2024-04-04 14:43:47,626 - julearn - INFO -      X_types:{}
2024-04-04 14:43:47,626 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:507: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
  warn_with_log(
2024-04-04 14:43:47,627 - julearn - INFO - Using subject as groups
2024-04-04 14:43:47,627 - julearn - INFO - ====================
2024-04-04 14:43:47,627 - julearn - INFO -
2024-04-04 14:43:47,627 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:43:47,627 - julearn - INFO - Step added
2024-04-04 14:43:47,627 - julearn - INFO - = Model Parameters =
2024-04-04 14:43:47,628 - julearn - INFO - ====================
2024-04-04 14:43:47,628 - julearn - INFO -
2024-04-04 14:43:47,628 - julearn - INFO - = Data Information =
2024-04-04 14:43:47,628 - julearn - INFO -      Problem type: classification
2024-04-04 14:43:47,628 - julearn - INFO -      Number of samples: 532
2024-04-04 14:43:47,628 - julearn - INFO -      Number of features: 2
2024-04-04 14:43:47,628 - julearn - INFO - ====================
2024-04-04 14:43:47,628 - julearn - INFO -
2024-04-04 14:43:47,628 - julearn - INFO -      Number of classes: 2
2024-04-04 14:43:47,628 - julearn - INFO -      Target type: object
2024-04-04 14:43:47,629 - julearn - INFO -      Class distributions: event
cue     266
stim    266
Name: count, dtype: int64
2024-04-04 14:43:47,629 - julearn - INFO - Using outer CV scheme GroupKFold(n_splits=2)
2024-04-04 14:43:47,629 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
0.6879699248120301
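For readers more familiar with plain scikit-learn, a rough analogue of the grouped run can be sketched with cross_val_score. This is not a description of julearn's internals; the data are synthetic, and the names X_toy, y_toy, groups_toy and toy_scores are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: 6 hypothetical subjects, 20 samples each.
n_subjects, n_per_subject = 6, 20
groups_toy = np.repeat(np.arange(n_subjects), n_per_subject)
y_toy = np.tile(np.array([0, 1]), n_subjects * n_per_subject // 2)
# Two features whose mean shifts slightly with the class label.
X_toy = rng.normal(size=(len(y_toy), 2)) + 0.5 * y_toy[:, None]

# GroupKFold keeps all samples of a subject in the same fold.
toy_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_toy,
    y_toy,
    groups=groups_toy,
    cv=GroupKFold(n_splits=2),
)
print(toy_scores.mean())
```

The groups argument is forwarded to the splitter, so each of the two test folds contains only subjects that the model did not see during training.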
