.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_starting/run_grouped_cv.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_00_starting_run_grouped_cv.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_starting_run_grouped_cv.py:


Grouped CV
==========

This example uses the ``fMRI`` dataset and performs GroupKFold
Cross-Validation for classification using Random Forest Classifier.

References
----------

  Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of
  cognitive control in context-dependent decision-making. Cerebral Cortex.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 17-32

.. code-block:: Python


    # Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
    #          Shammi More <s.more@fz-juelich.de>
    #          Kimia Nazarzadeh <k.nazarzadeh@fz-juelich.de>
    # License: AGPL

    # Importing the necessary Python libraries
    import numpy as np

    from seaborn import load_dataset
    from sklearn.model_selection import GroupKFold, StratifiedGroupKFold

    from julearn.utils import configure_logging
    from julearn import run_cross_validation


.. GENERATED FROM PYTHON SOURCE LINES 33-34

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 34-36

.. code-block:: Python

    configure_logging(level="INFO")


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:52,286 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:53:52,286 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:53:52,286 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:53:52,286 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:53:52,286 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:53:52,286 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:53:52,286 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 37-39

Dealing with Cross-Validation techniques
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. GENERATED FROM PYTHON SOURCE LINES 39-42

.. code-block:: Python


    df_fmri = load_dataset("fmri")


.. GENERATED FROM PYTHON SOURCE LINES 43-44

First, let's get some information on what the dataset has:

.. GENERATED FROM PYTHON SOURCE LINES 44-47

.. code-block:: Python


    print(df_fmri.head())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

      subject  timepoint event    region    signal
    0     s13         18  stim  parietal -0.017552
    1      s5         14  stim  parietal -0.080883
    2     s12         18  stim  parietal -0.081033
    3     s11         18  stim  parietal -0.046134
    4     s10         18  stim  parietal -0.037970


.. GENERATED FROM PYTHON SOURCE LINES 48-53

From this information, we can infer that it is an fMRI study with several
subjects, timepoints and events, in which a signal was extracted from several
brain regions.

Let's check how many kinds of each we have.

.. GENERATED FROM PYTHON SOURCE LINES 53-58

.. code-block:: Python

    print(df_fmri["event"].unique())
    print(df_fmri["region"].unique())
    print(sorted(df_fmri["timepoint"].unique()))
    print(df_fmri["subject"].unique())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    ['stim' 'cue']
    ['parietal' 'frontal']
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
    ['s13' 's5' 's12' 's11' 's10' 's9' 's8' 's7' 's6' 's4' 's3' 's2' 's1' 's0']


.. GENERATED FROM PYTHON SOURCE LINES 59-62

We have data from the parietal and frontal regions during 2 types of events
(*cue* and *stim*), at 19 timepoints (0 to 18) and for 14 subjects.
Let's see how many samples we have for each condition.

.. GENERATED FROM PYTHON SOURCE LINES 62-72

.. code-block:: Python


    print(df_fmri.groupby(["subject", "timepoint", "event", "region"]).count())
    print(
        np.unique(
            df_fmri.groupby(["subject", "timepoint", "event", "region"])
            .count()
            .values
        )
    )


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

                                      signal
    subject timepoint event region          
    s0      0         cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1
            1         cue   frontal        1
    ...                                  ...
    s9      17        stim  parietal       1
            18        cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1

    [1064 rows x 1 columns]
    [1]


.. GENERATED FROM PYTHON SOURCE LINES 73-79

We have exactly one value per condition.

Let's try to build a model that uses the parietal and frontal signals to
predict whether the event was a *cue* or a *stim*.

First we define our X and y variables.

.. GENERATED FROM PYTHON SOURCE LINES 79-82

.. code-block:: Python

    X = ["parietal", "frontal"]
    y = "event"


.. GENERATED FROM PYTHON SOURCE LINES 83-88

In order for this to work, both *parietal* and *frontal* must be columns, so
we need to *pivot* the table.

The values of *region* become the columns, the column *signal* provides the
values, and the columns *subject*, *timepoint* and *event* form the index.
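
As a minimal sketch of this reshaping, consider a hypothetical toy table
(``long_df``, not part of the dataset) with one subject, one timepoint and two
events. Each pair of region rows collapses into a single row with one column
per region, which is why the 1064 long-format rows become the 532 samples
reported in the logs.

```python
import pandas as pd

# Hypothetical toy data in the same long format as the fMRI table.
long_df = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s0", "s0"],
        "timepoint": [0, 0, 0, 0],
        "event": ["cue", "cue", "stim", "stim"],
        "region": ["frontal", "parietal", "frontal", "parietal"],
        "signal": [0.1, 0.2, 0.3, 0.4],
    }
)

# One row per (subject, timepoint, event); one column per region.
wide_df = long_df.pivot(
    index=["subject", "timepoint", "event"], columns="region", values="signal"
).reset_index()

print(wide_df)
```

After ``reset_index()``, *subject*, *timepoint* and *event* are ordinary
columns again, so they can be referenced by name as the target or the groups.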

.. GENERATED FROM PYTHON SOURCE LINES 88-94

.. code-block:: Python

    df_fmri = df_fmri.pivot(
        index=["subject", "timepoint", "event"], columns="region", values="signal"
    )

    df_fmri = df_fmri.reset_index()


.. GENERATED FROM PYTHON SOURCE LINES 95-97

Here we want to zscore all the features and then train a Random Forest
classifier. No ``cv`` is given, so the default 5-fold ``KFold`` is used
without taking the subjects into account.
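
For orientation, the following sketch shows roughly what such a run looks like
in plain scikit-learn. It assumes that ``"zscore"`` corresponds to
standardization as done by ``StandardScaler`` and that ``"rf"`` corresponds to
``RandomForestClassifier``; it runs on hypothetical synthetic data (``X_toy``,
``y_toy``) and is not julearn's internal implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data: 100 samples, 2 features, binary target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# The scaler is fitted inside each training fold, so no information
# from the test fold leaks into the preprocessing.
pipeline = make_pipeline(
    StandardScaler(), RandomForestClassifier(random_state=0)
)
toy_scores = cross_val_score(pipeline, X_toy, y_toy, cv=KFold(n_splits=5))

print(toy_scores.mean())
```

Wrapping the preprocessing and the model in a single pipeline is the design
choice that keeps each fold's test data out of the fitting step.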

.. GENERATED FROM PYTHON SOURCE LINES 97-109

.. code-block:: Python


    scores = run_cross_validation(
        X=X,
        y=y,
        data=df_fmri,
        preprocess="zscore",
        model="rf",
        problem_type="classification",
    )

    print(scores["test_score"].mean())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:52,304 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:52,305 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:52,305 - julearn - INFO -      Features: ['parietal', 'frontal']
    2026-01-16 10:53:52,305 - julearn - INFO -      Target: event
    2026-01-16 10:53:52,305 - julearn - INFO -      Expanded features: ['parietal', 'frontal']
    2026-01-16 10:53:52,305 - julearn - INFO -      X_types:{}
    2026-01-16 10:53:52,305 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:52,306 - julearn - INFO - ====================
    2026-01-16 10:53:52,306 - julearn - INFO - 
    2026-01-16 10:53:52,306 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:52,306 - julearn - INFO - Step added
    2026-01-16 10:53:52,307 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:52,307 - julearn - INFO - Step added
    2026-01-16 10:53:52,307 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:52,308 - julearn - INFO - ====================
    2026-01-16 10:53:52,308 - julearn - INFO - 
    2026-01-16 10:53:52,308 - julearn - INFO - = Data Information =
    2026-01-16 10:53:52,308 - julearn - INFO -      Problem type: classification
    2026-01-16 10:53:52,308 - julearn - INFO -      Number of samples: 532
    2026-01-16 10:53:52,308 - julearn - INFO -      Number of features: 2
    2026-01-16 10:53:52,308 - julearn - INFO - ====================
    2026-01-16 10:53:52,308 - julearn - INFO - 
    2026-01-16 10:53:52,308 - julearn - INFO -      Number of classes: 2
    2026-01-16 10:53:52,309 - julearn - INFO -      Target type: object
    2026-01-16 10:53:52,309 - julearn - INFO -      Class distributions: event
    cue     266
    stim    266
    Name: count, dtype: int64
    2026-01-16 10:53:52,309 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
    2026-01-16 10:53:52,310 - julearn - INFO - Binary classification problem detected.
    0.6841826838300122


.. GENERATED FROM PYTHON SOURCE LINES 110-111

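Train a classification model with grouped and stratified CV. Passing
``groups="subject"`` together with ``StratifiedGroupKFold`` keeps all samples
of a subject in the same fold while trying to preserve the class proportions.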

.. GENERATED FROM PYTHON SOURCE LINES 111-125

.. code-block:: Python

    cv_stratified = StratifiedGroupKFold(n_splits=2)
    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=df_fmri,
        groups="subject",
        model="rf",
        problem_type="classification",
        cv=cv_stratified,
        return_estimator="final",
    )

    print(scores["test_score"].mean())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:53,042 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:53,042 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:53,042 - julearn - INFO -      Features: ['parietal', 'frontal']
    2026-01-16 10:53:53,042 - julearn - INFO -      Target: event
    2026-01-16 10:53:53,042 - julearn - INFO -      Expanded features: ['parietal', 'frontal']
    2026-01-16 10:53:53,042 - julearn - INFO -      X_types:{}
    2026-01-16 10:53:53,042 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:53,043 - julearn - INFO - Using subject as groups
    2026-01-16 10:53:53,043 - julearn - INFO - ====================
    2026-01-16 10:53:53,043 - julearn - INFO - 
    2026-01-16 10:53:53,043 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:53,044 - julearn - INFO - Step added
    2026-01-16 10:53:53,044 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:53,044 - julearn - INFO - ====================
    2026-01-16 10:53:53,044 - julearn - INFO - 
    2026-01-16 10:53:53,044 - julearn - INFO - = Data Information =
    2026-01-16 10:53:53,045 - julearn - INFO -      Problem type: classification
    2026-01-16 10:53:53,045 - julearn - INFO -      Number of samples: 532
    2026-01-16 10:53:53,045 - julearn - INFO -      Number of features: 2
    2026-01-16 10:53:53,045 - julearn - INFO - ====================
    2026-01-16 10:53:53,045 - julearn - INFO - 
    2026-01-16 10:53:53,045 - julearn - INFO -      Number of classes: 2
    2026-01-16 10:53:53,045 - julearn - INFO -      Target type: object
    2026-01-16 10:53:53,046 - julearn - INFO -      Class distributions: event
    cue     266
    stim    266
    Name: count, dtype: int64
    2026-01-16 10:53:53,046 - julearn - INFO - Using outer CV scheme StratifiedGroupKFold(n_splits=2, random_state=None, shuffle=False) (incl. final model)
    2026-01-16 10:53:53,047 - julearn - INFO - Binary classification problem detected.
    0.6710526315789473


.. GENERATED FROM PYTHON SOURCE LINES 126-127

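Train a classification model with grouped CV but without stratification.
``GroupKFold`` still keeps all samples of a subject in the same fold, but does
not try to balance the classes across folds.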

.. GENERATED FROM PYTHON SOURCE LINES 127-140

.. code-block:: Python

    cv = GroupKFold(n_splits=2)
    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=df_fmri,
        groups="subject",
        model="rf",
        problem_type="classification",
        cv=cv,
        return_estimator="final",
    )

    print(scores["test_score"].mean())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2026-01-16 10:53:53,513 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:53,513 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:53,513 - julearn - INFO -      Features: ['parietal', 'frontal']
    2026-01-16 10:53:53,513 - julearn - INFO -      Target: event
    2026-01-16 10:53:53,513 - julearn - INFO -      Expanded features: ['parietal', 'frontal']
    2026-01-16 10:53:53,513 - julearn - INFO -      X_types:{}
    2026-01-16 10:53:53,514 - julearn - WARNING - The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['parietal', 'frontal']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:53,514 - julearn - INFO - Using subject as groups
    2026-01-16 10:53:53,514 - julearn - INFO - ====================
    2026-01-16 10:53:53,515 - julearn - INFO - 
    2026-01-16 10:53:53,515 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    2026-01-16 10:53:53,515 - julearn - INFO - Step added
    2026-01-16 10:53:53,515 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:53,516 - julearn - INFO - ====================
    2026-01-16 10:53:53,516 - julearn - INFO - 
    2026-01-16 10:53:53,516 - julearn - INFO - = Data Information =
    2026-01-16 10:53:53,516 - julearn - INFO -      Problem type: classification
    2026-01-16 10:53:53,516 - julearn - INFO -      Number of samples: 532
    2026-01-16 10:53:53,516 - julearn - INFO -      Number of features: 2
    2026-01-16 10:53:53,516 - julearn - INFO - ====================
    2026-01-16 10:53:53,516 - julearn - INFO - 
    2026-01-16 10:53:53,517 - julearn - INFO -      Number of classes: 2
    2026-01-16 10:53:53,517 - julearn - INFO -      Target type: object
    2026-01-16 10:53:53,517 - julearn - INFO -      Class distributions: event
    cue     266
    stim    266
    Name: count, dtype: int64
    2026-01-16 10:53:53,518 - julearn - INFO - Using outer CV scheme GroupKFold(n_splits=2, random_state=None, shuffle=False) (incl. final model)
    2026-01-16 10:53:53,518 - julearn - INFO - Binary classification problem detected.
    0.6672932330827068


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.660 seconds)


.. _sphx_glr_download_auto_examples_00_starting_run_grouped_cv.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: run_grouped_cv.ipynb <run_grouped_cv.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: run_grouped_cv.py <run_grouped_cv.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: run_grouped_cv.zip <run_grouped_cv.zip>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_