.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/plot_groupcv_inspect_svm.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_basic_plot_groupcv_inspect_svm.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_plot_groupcv_inspect_svm.py:


Inspecting SVM models
=====================

This example uses the 'fmri' dataset, performs a simple binary classification
using a Support Vector Machine classifier and analyses the model.

References
----------
Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of
cognitive control in context-dependent decision-making. Cerebral Cortex.


.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 17-31

.. code-block:: default

    # Authors: Federico Raimondo
    #
    # License: AGPL
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    import matplotlib.pyplot as plt
    import seaborn as sns
    from seaborn import load_dataset

    from julearn import run_cross_validation
    from julearn.utils import configure_logging

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'rocket' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'rocket_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'mako' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'mako_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'icefire' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'icefire_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'vlag' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'vlag_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'flare' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'flare_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1582: UserWarning: Trying to register the cmap 'crest' which already exists.
      mpl_cm.register_cmap(_name, _cmap)
    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/seaborn/cm.py:1583: UserWarning: Trying to register the cmap 'crest_r' which already exists.
      mpl_cm.register_cmap(_name + "_r", _cmap_r)

.. GENERATED FROM PYTHON SOURCE LINES 32-33

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 33-36

.. code-block:: default

    configure_logging(level='INFO')

.. rst-class:: sphx-glr-script-out

 Out:

 ..
.. code-block:: none

    2022-07-21 09:54:47,414 - julearn - INFO - ===== Lib Versions =====
    2022-07-21 09:54:47,414 - julearn - INFO - numpy: 1.23.1
    2022-07-21 09:54:47,415 - julearn - INFO - scipy: 1.8.1
    2022-07-21 09:54:47,415 - julearn - INFO - sklearn: 1.0.2
    2022-07-21 09:54:47,415 - julearn - INFO - pandas: 1.4.3
    2022-07-21 09:54:47,415 - julearn - INFO - julearn: 0.2.5
    2022-07-21 09:54:47,415 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 37-40

Dealing with Cross-Validation techniques
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 40-43

.. code-block:: default

    df_fmri = load_dataset('fmri')

.. GENERATED FROM PYTHON SOURCE LINES 44-46

First, let's get some information about what the dataset contains:

.. GENERATED FROM PYTHON SOURCE LINES 46-48

.. code-block:: default

    print(df_fmri.head())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

      subject  timepoint event    region    signal
    0     s13         18  stim  parietal -0.017552
    1      s5         14  stim  parietal -0.080883
    2     s12         18  stim  parietal -0.081033
    3     s11         18  stim  parietal -0.046134
    4     s10         18  stim  parietal -0.037970

.. GENERATED FROM PYTHON SOURCE LINES 49-54

From this information, we can infer that this is an fMRI study with several
subjects, timepoints and events, and with signals extracted from several
brain regions.

Let's check how many distinct values of each we have.

.. GENERATED FROM PYTHON SOURCE LINES 54-60

.. code-block:: default

    print(df_fmri['event'].unique())
    print(df_fmri['region'].unique())
    print(sorted(df_fmri['timepoint'].unique()))
    print(df_fmri['subject'].unique())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    ['stim' 'cue']
    ['parietal' 'frontal']
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
    ['s13' 's5' 's12' 's11' 's10' 's9' 's8' 's7' 's6' 's4' 's3' 's2' 's1' 's0']

..
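The level counts can also be computed rather than read off the printed lists.
The following is a minimal sketch on a synthetic stand-in that only mirrors the
levels printed above (the `toy` frame and its random signal values are
assumptions made for illustration, not the real fMRI measurements). It counts
the distinct values per column with `DataFrame.nunique` and compares the
number of rows with the product of the level counts, which is what a fully
crossed design would give.

.. code-block:: default

    from itertools import product

    import numpy as np
    import pandas as pd

    # Synthetic stand-in with the same levels as the printed output;
    # the signal values are random, NOT the real fMRI data.
    subjects = [f's{i}' for i in range(14)]
    timepoints = range(19)  # 0 to 18 inclusive
    events = ['stim', 'cue']
    regions = ['parietal', 'frontal']

    design_columns = ['subject', 'timepoint', 'event', 'region']
    rows = list(product(subjects, timepoints, events, regions))
    toy = pd.DataFrame(rows, columns=design_columns)
    toy['signal'] = np.random.default_rng(0).normal(size=len(toy))

    # Number of distinct levels in each design column.
    n_levels = toy[design_columns].nunique()
    print(n_levels)

    # In a fully crossed design, the row count equals the product of the levels.
    print(len(toy), int(n_levels.prod()))

Note that timepoints numbered 0 to 18 amount to 19 levels, not 18.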
.. GENERATED FROM PYTHON SOURCE LINES 61-64

We have data from the parietal and frontal regions for 2 types of events
(*cue* and *stim*), 19 timepoints (0 to 18) and 14 subjects.
Let's see how many samples we have for each condition.

.. GENERATED FROM PYTHON SOURCE LINES 64-69

.. code-block:: default

    print(df_fmri.groupby(['subject', 'timepoint', 'event', 'region']).count())
    print(np.unique(df_fmri.groupby(
        ['subject', 'timepoint', 'event', 'region']).count().values))

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

                                      signal
    subject timepoint event region
    s0      0         cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1
            1         cue   frontal        1
    ...                                  ...
    s9      17        stim  parietal       1
            18        cue   frontal        1
                            parietal       1
                      stim  frontal        1
                            parietal       1

    [1064 rows x 1 columns]
    [1]

.. GENERATED FROM PYTHON SOURCE LINES 70-76

We have exactly one value per condition.

Let's try to build a model that, given both the parietal and frontal signals,
predicts whether the event was a *cue* or a *stim*.

First, we define our X and y variables.

.. GENERATED FROM PYTHON SOURCE LINES 76-79

.. code-block:: default

    X = ['parietal', 'frontal']
    y = 'event'

.. GENERATED FROM PYTHON SOURCE LINES 80-85

For this to work, both *parietal* and *frontal* must be columns, so we need
to *pivot* the table.

The values of *region* will become the columns, the column *signal* will
provide the values, and the columns *subject*, *timepoint* and *event* will
form the index.

.. GENERATED FROM PYTHON SOURCE LINES 85-92

.. code-block:: default

    df_fmri = df_fmri.pivot(
        index=['subject', 'timepoint', 'event'],
        columns='region',
        values='signal')

    df_fmri = df_fmri.reset_index()

.. GENERATED FROM PYTHON SOURCE LINES 93-94

We will use a Support Vector Machine.

.. GENERATED FROM PYTHON SOURCE LINES 94-100

.. code-block:: default

    scores = run_cross_validation(X=X, y=y, preprocess_X='zscore', data=df_fmri,
                                  model='svm')

    print(scores['test_score'].mean())

.. rst-class:: sphx-glr-script-out

 Out:

 ..
.. code-block:: none

    2022-07-21 09:54:47,512 - julearn - INFO - Using default CV
    2022-07-21 09:54:47,512 - julearn - INFO - ==== Input Data ====
    2022-07-21 09:54:47,512 - julearn - INFO - Using dataframe as input
    2022-07-21 09:54:47,512 - julearn - INFO - Features: ['parietal', 'frontal']
    2022-07-21 09:54:47,512 - julearn - INFO - Target: event
    2022-07-21 09:54:47,512 - julearn - INFO - Expanded X: ['parietal', 'frontal']
    2022-07-21 09:54:47,512 - julearn - INFO - Expanded Confounds: []
    2022-07-21 09:54:47,513 - julearn - INFO - ====================
    2022-07-21 09:54:47,513 - julearn - INFO - 
    2022-07-21 09:54:47,513 - julearn - INFO - ====== Model ======
    2022-07-21 09:54:47,513 - julearn - INFO - Obtaining model by name: svm
    2022-07-21 09:54:47,513 - julearn - INFO - ===================
    2022-07-21 09:54:47,513 - julearn - INFO - 
    2022-07-21 09:54:47,514 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0.7221618762123082

.. GENERATED FROM PYTHON SOURCE LINES 101-122

These results indicate that we can decode the kind of event by looking at
the *parietal* and *frontal* signals. However, that claim is true only if we
have already acquired some data from the same subject.

The problem is that we split the data randomly into 5 folds (default, see
:func:`.run_cross_validation`). This means that data from one subject could be
in both the training and the testing set. If this is the case, then the model
can learn the subject's specific characteristics and apply them to the testing
set. Thus, it is not true that we can decode the event for an unseen subject,
only for an unseen timepoint of a subject for whom we already have data.

To test on unseen subjects, we need to make sure that all the data from each
subject is either in the training or in the testing set, but not in both.

We can use scikit-learn's GroupShuffleSplit (see `Cross Validation`_)
and specify the grouping column using the `groups` parameter.
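The leakage described above can be checked directly on the splits. The
following is a minimal sketch on a toy grouping (four hypothetical subjects
with five samples each, not the fmri data); the helper `shared_subjects` is
made up for this illustration. For every split, it lists the subjects that
appear in both the training and the testing set.

.. code-block:: default

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit, KFold

    # Toy stand-in for the pivoted table: 4 hypothetical subjects,
    # 5 samples each. The feature values themselves are irrelevant here.
    groups = np.repeat(['s0', 's1', 's2', 's3'], 5)
    X_toy = np.arange(len(groups)).reshape(-1, 1)


    def shared_subjects(cv, **split_kwargs):
        """Return, for each split, the subjects found in both train and test."""
        return [
            set(groups[train]) & set(groups[test])
            for train, test in cv.split(X_toy, **split_kwargs)
        ]


    # Random K-fold ignores the subject labels.
    kfold_overlap = shared_subjects(
        KFold(n_splits=5, shuffle=True, random_state=0))

    # GroupShuffleSplit assigns whole subjects to one side of each split.
    group_overlap = shared_subjects(
        GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=42),
        groups=groups)

    print(kfold_overlap)
    print(group_overlap)

In this toy setting each K-fold test set holds 4 samples while every subject
has 5, so any subject that is tested also contributes training data. With
GroupShuffleSplit the intersection is empty for every split.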
By setting `return_estimator='final'`, the :func:`.run_cross_validation`
function returns the estimator fitted with all the data. We will use this later
for some analysis.

.. GENERATED FROM PYTHON SOURCE LINES 122-129

.. code-block:: default

    cv = GroupShuffleSplit(n_splits=5, test_size=0.5, random_state=42)

    scores, model = run_cross_validation(
        X=X, y=y, data=df_fmri, model='svm', preprocess_X='zscore', cv=cv,
        groups='subject', return_estimator='final')
    print(scores['test_score'].mean())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2022-07-21 09:54:47,992 - julearn - INFO - ==== Input Data ====
    2022-07-21 09:54:47,992 - julearn - INFO - Using dataframe as input
    2022-07-21 09:54:47,993 - julearn - INFO - Features: ['parietal', 'frontal']
    2022-07-21 09:54:47,993 - julearn - INFO - Target: event
    2022-07-21 09:54:47,993 - julearn - INFO - Expanded X: ['parietal', 'frontal']
    2022-07-21 09:54:47,993 - julearn - INFO - Expanded Confounds: []
    2022-07-21 09:54:47,993 - julearn - INFO - Using subject as groups
    2022-07-21 09:54:47,994 - julearn - INFO - ====================
    2022-07-21 09:54:47,994 - julearn - INFO - 
    2022-07-21 09:54:47,994 - julearn - INFO - ====== Model ======
    2022-07-21 09:54:47,994 - julearn - INFO - Obtaining model by name: svm
    2022-07-21 09:54:47,994 - julearn - INFO - ===================
    2022-07-21 09:54:47,994 - julearn - INFO - 
    2022-07-21 09:54:47,994 - julearn - INFO - Using scikit-learn CV scheme GroupShuffleSplit(n_splits=5, random_state=42, test_size=0.5, train_size=None)
    0.7210526315789474

.. GENERATED FROM PYTHON SOURCE LINES 130-135

After testing on independent subjects, we can now claim that, given a new
subject, we can predict the kind of event (mean test score of about 0.72).

Let's visualize how these two features interact and what the preprocessing
part of the model does.

.. GENERATED FROM PYTHON SOURCE LINES 135-146

..
.. code-block:: default

    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    # Left panel: the raw signals
    sns.scatterplot(x='parietal', y='frontal', hue='event', data=df_fmri,
                    ax=axes[0], s=5)
    axes[0].set_title('Raw data')

    # Right panel: the same data after the model's preprocessing step
    pre_X, pre_y = model.preprocess(df_fmri[X], df_fmri[y])
    pre_df = pre_X.join(pre_y)
    sns.scatterplot(x='parietal', y='frontal', hue='event', data=pre_df,
                    ax=axes[1], s=5)
    axes[1].set_title('Preprocessed data')

.. image-sg:: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_001.png
   :alt: Raw data, Preprocessed data
   :srcset: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Text(0.5, 1.0, 'Preprocessed data')

.. GENERATED FROM PYTHON SOURCE LINES 147-151

In this case, the preprocessing is nothing more than a `Standard Scaler`_.

It seems that the data is not quite linearly separable. Let's now visualize
how the SVM handles this complex task.

.. GENERATED FROM PYTHON SOURCE LINES 151-165

.. code-block:: default

    # Extract the fitted SVM step from the returned model
    clf = model['svm']
    ax = sns.scatterplot(x='parietal', y='frontal', hue='event', data=pre_df,
                         s=5)

    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # create grid to evaluate model
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    xy = np.vstack([XX.ravel(), YY.ravel()]).T
    Z = clf.decision_function(xy).reshape(XX.shape)
    # draw the decision boundary (decision function equal to 0)
    a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5,
                   linestyles=['-'])
    ax.set_title('Preprocessed data with SVM decision function boundaries')

.. image-sg:: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_002.png
   :alt: Preprocessed data with SVM decision function boundaries
   :srcset: /auto_examples/basic/images/sphx_glr_plot_groupcv_inspect_svm_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 Out:

 ..
.. code-block:: none

    /opt/hostedtoolcache/Python/3.8.13/x64/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but SVC was fitted with feature names
      warnings.warn(

    Text(0.5, 1.0, 'Preprocessed data with SVM decision function boundaries')

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.003 seconds)

.. _sphx_glr_download_auto_examples_basic_plot_groupcv_inspect_svm.py:

.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_groupcv_inspect_svm.py <plot_groupcv_inspect_svm.py>`

  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_groupcv_inspect_svm.ipynb <plot_groupcv_inspect_svm.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_