.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/plot_groupcv_inspect_svm.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_basic_plot_groupcv_inspect_svm.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_plot_groupcv_inspect_svm.py:


Inspecting SVM models
=====================

This example uses the 'fmri' dataset, performs a simple binary classification
using a Support Vector Machine classifier, and analyses the model.


References
----------
Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of
cognitive control in context-dependent decision-making. Cerebral Cortex.


.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 17-31

.. code-block:: default

    # Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
    #
    # License: AGPL
    import numpy as np

    from sklearn.model_selection import GroupShuffleSplit

    import matplotlib.pyplot as plt
    import seaborn as sns
    from seaborn import load_dataset

    from julearn import run_cross_validation
    from julearn.utils import configure_logging




.. GENERATED FROM PYTHON SOURCE LINES 32-33

Set the logging level to INFO to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 33-36

.. code-block:: default

    configure_logging(level='INFO')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2022-07-21 09:54:47,414 - julearn - INFO - ===== Lib Versions =====
    2022-07-21 09:54:47,414 - julearn - INFO - numpy: 1.23.1
    2022-07-21 09:54:47,415 - julearn - INFO - scipy: 1.8.1
    2022-07-21 09:54:47,415 - julearn - INFO - sklearn: 1.0.2
    2022-07-21 09:54:47,415 - julearn - INFO - pandas: 1.4.3
    2022-07-21 09:54:47,415 - julearn - INFO - julearn: 0.2.5
    2022-07-21 09:54:47,415 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 37-40

Dealing with Cross-Validation techniques
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


.. GENERATED FROM PYTHON SOURCE LINES 40-43

.. code-block:: default


    df_fmri = load_dataset('fmri')


.. GENERATED FROM PYTHON SOURCE LINES 44-46

First, let's look at what the dataset contains:


.. GENERATED FROM PYTHON SOURCE LINES 46-48

.. code-block:: default

    print(df_fmri.head())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

      subject  timepoint event    region    signal
    0     s13         18  stim  parietal -0.017552
    1      s5         14  stim  parietal -0.080883
    2     s12         18  stim  parietal -0.081033
    3     s11         18  stim  parietal -0.046134
    4     s10         18  stim  parietal -0.037970


.. GENERATED FROM PYTHON SOURCE LINES 49-54

From this information, we can infer that this is an fMRI study with several
subjects, timepoints and events, and with a signal extracted from several
brain regions.

Let's check how many kinds of each we have.

.. GENERATED FROM PYTHON SOURCE LINES 54-60

.. code-block:: default


    print(df_fmri['event'].unique())
    print(df_fmri['region'].unique())
    print(sorted(df_fmri['timepoint'].unique()))
    print(df_fmri['subject'].unique())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    ['stim' 'cue']
    ['parietal' 'frontal']
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
    ['s13' 's5' 's12' 's11' 's10' 's9' 's8' 's7' 's6' 's4' 's3' 's2' 's1' 's0']


.. GENERATED FROM PYTHON SOURCE LINES 61-64

We have data from parietal and frontal regions during 2 types of events
(*cue* and *stim*), at 19 timepoints (0 to 18), and for 14 subjects.
Let's see how many samples we have for each condition.

.. GENERATED FROM PYTHON SOURCE LINES 64-69

.. code-block:: default


    print(df_fmri.groupby(['subject', 'timepoint', 'event', 'region']).count())
    print(np.unique(df_fmri.groupby(
        ['subject', 'timepoint', 'event', 'region']).count().values))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

                                      signal
    subject timepoint event region          
    s0      0         cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1
            1         cue   frontal        1
    ...                                  ...
    s9      17        stim  parietal       1
            18        cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1

    [1064 rows x 1 columns]
    [1]


.. GENERATED FROM PYTHON SOURCE LINES 70-76

We have exactly one value per condition.

Let's try to build a model that, given both the parietal and the frontal
signal, predicts whether the event was a *cue* or a *stim*.

First, we define our X and y variables.

.. GENERATED FROM PYTHON SOURCE LINES 76-79

.. code-block:: default

    X = ['parietal', 'frontal']
    y = 'event'


.. GENERATED FROM PYTHON SOURCE LINES 80-85

In order for this to work, both *parietal* and *frontal* must be columns.
We need to *pivot* the table.

The values of *region* will become the columns, the column *signal* will
provide the values, and the columns *subject*, *timepoint* and *event* will
form the index.

.. GENERATED FROM PYTHON SOURCE LINES 85-92

.. code-block:: default

    df_fmri = df_fmri.pivot(
        index=['subject', 'timepoint', 'event'],
        columns='region',
        values='signal')

    df_fmri = df_fmri.reset_index()
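
The same long-to-wide reshaping can be tried on a small table. The following
is only a sketch with made-up values; ``toy_long`` and ``toy_wide`` are
hypothetical names that do not appear in the example, and it assumes
pandas 1.1 or later so that ``pivot`` accepts a list of index columns.

.. code-block:: python

    import pandas as pd

    # Made-up long-format table: one row per (subject, event, region).
    toy_long = pd.DataFrame({
        "subject": ["s0"] * 4 + ["s1"] * 4,
        "event": ["cue", "cue", "stim", "stim"] * 2,
        "region": ["parietal", "frontal"] * 4,
        "signal": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    })

    # Each value of "region" becomes a column holding the matching "signal".
    toy_wide = toy_long.pivot(
        index=["subject", "event"],
        columns="region",
        values="signal")

    # Turn the index levels back into ordinary columns.
    toy_wide = toy_wide.reset_index()
    print(toy_wide)

After the pivot there is one row per subject and event, with the two regions
side by side, which is the layout that the ``X`` list defined earlier expects.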


.. GENERATED FROM PYTHON SOURCE LINES 93-94

We will use a Support Vector Machine.

.. GENERATED FROM PYTHON SOURCE LINES 94-100

.. code-block:: default


    scores = run_cross_validation(X=X, y=y, preprocess_X='zscore', data=df_fmri,
                                  model='svm')

    print(scores['test_score'].mean())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2022-07-21 09:54:47,512 - julearn - INFO - Using default CV
    2022-07-21 09:54:47,512 - julearn - INFO - ==== Input Data ====
    2022-07-21 09:54:47,512 - julearn - INFO - Using dataframe as input
    2022-07-21 09:54:47,512 - julearn - INFO - Features: ['parietal', 'frontal']
    2022-07-21 09:54:47,512 - julearn - INFO - Target: event
    2022-07-21 09:54:47,512 - julearn - INFO - Expanded X: ['parietal', 'frontal']
    2022-07-21 09:54:47,512 - julearn - INFO - Expanded Confounds: []
    2022-07-21 09:54:47,513 - julearn - INFO - ====================
    2022-07-21 09:54:47,513 - julearn - INFO - 
    2022-07-21 09:54:47,513 - julearn - INFO - ====== Model ======
    2022-07-21 09:54:47,513 - julearn - INFO - Obtaining model by name: svm
    2022-07-21 09:54:47,513 - julearn - INFO - ===================
    2022-07-21 09:54:47,513 - julearn - INFO - 
    2022-07-21 09:54:47,514 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0.7221618762123082


.. GENERATED FROM PYTHON SOURCE LINES 101-122

These results indicate that we can decode the kind of event by looking at
the *parietal* and *frontal* signal. However, that claim is true only if we
have already acquired some data from the same subject.

The problem is that we split the data randomly into 5 folds (default, see
:func:`.run_cross_validation`). This means that data from one subject could
be in both the training and the testing set. If this is the case, then the
model can learn the subject-specific characteristics and apply them to the
testing set. Thus, we have not shown that we can decode the event for an
unseen subject, only for an unseen timepoint of a subject for whom we already
have data.

To test on unseen subjects, we need to make sure that all the data from each
subject is either in the training or in the testing set, but not in both.

We can use scikit-learn's GroupShuffleSplit (see `Cross Validation`_)
and specify the grouping column using the `groups` parameter.
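
The difference between the two splitting strategies can be checked directly.
The sketch below is illustrative only: it builds a toy group layout that
mirrors the pivoted table (14 subjects with 38 rows each), and the helper
``shared_subjects`` is a hypothetical function written for this illustration,
not part of julearn or scikit-learn.

.. code-block:: python

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit, KFold

    # Toy layout: 14 subjects with 38 samples each (19 timepoints x 2 events).
    groups = np.repeat([f"s{i}" for i in range(14)], 38)
    X_toy = np.zeros((len(groups), 2))  # feature values do not affect the split


    def shared_subjects(cv, **split_kwargs):
        """Count, per split, the subjects found in both train and test."""
        return [
            len(set(groups[train]) & set(groups[test]))
            for train, test in cv.split(X_toy, **split_kwargs)
        ]


    # Random folds ignore the subject labels.
    random_overlap = shared_subjects(
        KFold(n_splits=5, shuffle=True, random_state=42))

    # Group-aware splits keep every subject on one side only.
    grouped_overlap = shared_subjects(
        GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=42),
        groups=groups)

    print(random_overlap)
    print(grouped_overlap)

With random folds, every split shares subjects between the training and the
testing set, whereas the group-aware splits share none.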

By setting `return_estimator='final'`, the :func:`.run_cross_validation`
function returns the estimator fitted with all the data. We will use this
later to do some analysis.

.. GENERATED FROM PYTHON SOURCE LINES 122-129

.. code-block:: default

    cv = GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=42)

    scores, model = run_cross_validation(
        X=X, y=y, data=df_fmri, model='svm', preprocess_X='zscore', cv=cv,
        groups='subject', return_estimator='final')
    print(scores['test_score'].mean())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2022-07-21 09:54:47,992 - julearn - INFO - ==== Input Data ====
    2022-07-21 09:54:47,992 - julearn - INFO - Using dataframe as input
    2022-07-21 09:54:47,993 - julearn - INFO - Features: ['parietal', 'frontal']
    2022-07-21 09:54:47,993 - julearn - INFO - Target: event
    2022-07-21 09:54:47,993 - julearn - INFO - Expanded X: ['parietal', 'frontal']
    2022-07-21 09:54:47,993 - julearn - INFO - Expanded Confounds: []
    2022-07-21 09:54:47,993 - julearn - INFO - Using subject as groups
    2022-07-21 09:54:47,994 - julearn - INFO - ====================
    2022-07-21 09:54:47,994 - julearn - INFO - 
    2022-07-21 09:54:47,994 - julearn - INFO - ====== Model ======
    2022-07-21 09:54:47,994 - julearn - INFO - Obtaining model by name: svm
    2022-07-21 09:54:47,994 - julearn - INFO - ===================
    2022-07-21 09:54:47,994 - julearn - INFO - 
    2022-07-21 09:54:47,994 - julearn - INFO - Using scikit-learn CV scheme GroupShuffleSplit(n_splits=5, random_state=42, test_size=0.5, train_size=None)
    0.7210526315789474


.. GENERATED FROM PYTHON SOURCE LINES 130-135

After testing on independent subjects, we can now claim that, given a new
subject, we can predict the kind of event.

Let's visualize how these two features interact and what the preprocessing
part of the model does.

.. GENERATED FROM PYTHON SOURCE LINES 135-146

.. code-block:: default

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    sns.scatterplot(x='parietal', y='frontal', hue='event', data=df_fmri,
                    ax=axes[0], s=5)
    axes[0].set_title('Raw data')

    pre_X, pre_y = model.preprocess(df_fmri[X], df_fmri[y])
    pre_df = pre_X.join(pre_y)
    sns.scatterplot(x='parietal', y='frontal', hue='event', data=pre_df,
                    ax=axes[1], s=5)
    axes[1].set_title('Preprocessed data')


.. image-sg:: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_001.png
   :alt: Raw data, Preprocessed data
   :srcset: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    Text(0.5, 1.0, 'Preprocessed data')


.. GENERATED FROM PYTHON SOURCE LINES 147-151

In this case, the preprocessing is nothing more than a `Standard Scaler`_.
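
Standard scaling subtracts the mean of each feature and divides by its
standard deviation. As a minimal sketch on made-up data (the array ``toy`` is
hypothetical and unrelated to the fMRI table), the result of scikit-learn's
``StandardScaler`` can be compared with the manual computation:

.. code-block:: python

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Made-up data: two features with different means and spreads.
    rng = np.random.default_rng(0)
    toy = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(100, 2))

    scaled = StandardScaler().fit_transform(toy)

    # Same operation by hand; StandardScaler uses the population
    # standard deviation (ddof=0), which is also NumPy's default.
    manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

    print(np.allclose(scaled, manual))

Each scaled feature ends up with zero mean and unit variance, which is why
the two panels above differ in their axis ranges but not in the shape of the
point cloud.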

It seems that the data is not quite linearly separable. Let's now visualize
how the SVM handles this complex task.

.. GENERATED FROM PYTHON SOURCE LINES 151-165

.. code-block:: default

    clf = model['svm']
    ax = sns.scatterplot(x='parietal', y='frontal', hue='event', data=pre_df, s=5)

    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = clf.decision_function(xy).reshape(XX.shape)
    a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5, linestyles=['-'])
    ax.set_title('Preprocessed data with SVM decision function boundaries')


.. image-sg:: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_002.png
   :alt: Preprocessed data with SVM decision function boundaries
   :srcset: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_002.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but SVC was fitted with feature names
      warnings.warn(

    Text(0.5, 1.0, 'Preprocessed data with SVM decision function boundaries')
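
The ``UserWarning`` in the output above is raised because the evaluation grid
is a plain NumPy array, while the classifier was fitted on data with column
names. One possible way to avoid it, sketched here on made-up data rather
than on the fitted julearn model, is to wrap the grid in a DataFrame that
uses the same column names. The names ``toy_X``, ``toy_y``, ``toy_clf`` and
``grid`` are hypothetical.

.. code-block:: python

    import warnings

    import numpy as np
    import pandas as pd
    from sklearn.svm import SVC

    # Made-up training data with named columns.
    rng = np.random.default_rng(0)
    toy_X = pd.DataFrame(
        rng.normal(size=(40, 2)), columns=["parietal", "frontal"])
    toy_y = (toy_X["parietal"] + toy_X["frontal"] > 0).astype(int)
    toy_clf = SVC().fit(toy_X, toy_y)

    # Build the evaluation grid as in the example above.
    xx = np.linspace(-2, 2, 30)
    yy = np.linspace(-2, 2, 30)
    YY, XX = np.meshgrid(yy, xx)

    # Keep the column names and order used during fitting.
    grid = pd.DataFrame({"parietal": XX.ravel(), "frontal": YY.ravel()})

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        Z = toy_clf.decision_function(grid).reshape(XX.shape)

    feature_name_warnings = [
        w for w in caught if "feature names" in str(w.message)]
    print(len(feature_name_warnings))

The resulting ``Z`` has the same shape as ``XX`` and can be passed to
``contour`` in the same way as in the example.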


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  1.003 seconds)


.. _sphx_glr_download_auto_examples_basic_plot_groupcv_inspect_svm.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_groupcv_inspect_svm.py <plot_groupcv_inspect_svm.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_groupcv_inspect_svm.ipynb <plot_groupcv_inspect_svm.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_