.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/advanced/run_combine_pandas.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_advanced_run_combine_pandas.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_advanced_run_combine_pandas.py:


Working with pandas
===================

This example uses the 'fmri' dataset to transform and combine data in order
to prepare it to be used by julearn.


References
----------
Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of
cognitive control in context-dependent decision-making. Cerebral Cortex.


.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 17-23

.. code-block:: default

    # Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
    #
    # License: AGPL
    from seaborn import load_dataset
    import pandas as pd


.. GENERATED FROM PYTHON SOURCE LINES 24-44

One of the key elements that makes julearn easy to use is the possibility of
working directly with pandas data frames, which are similar to Excel
spreadsheets or CSV files.

Ideally, we will have everything tabulated and organised for julearn, but
that might not be your case. You might have some files with the fMRI values,
others with demographics, and others with diagnostic metrics or behavioural
results.

You need to prepare these files for julearn.

One option is to manually edit the files and make sure that everything is
ready to do some machine learning. However, this is prone to introducing
errors.

Fortunately, `pandas`_ provides several tools to deal with this task.

This example is a collection of some of these useful methods.

Let's start with the fmri dataset.

.. GENERATED FROM PYTHON SOURCE LINES 44-47

.. code-block:: default


    df_fmri = load_dataset('fmri')


.. GENERATED FROM PYTHON SOURCE LINES 48-50

Let's see what this dataset contains.


.. GENERATED FROM PYTHON SOURCE LINES 50-52

.. code-block:: default

    print(df_fmri.head())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

      subject  timepoint event    region    signal
    0     s13         18  stim  parietal -0.017552
    1      s5         14  stim  parietal -0.080883
    2     s12         18  stim  parietal -0.081033
    3     s11         18  stim  parietal -0.046134
    4     s10         18  stim  parietal -0.037970


.. GENERATED FROM PYTHON SOURCE LINES 53-61

From long to wide format
^^^^^^^^^^^^^^^^^^^^^^^^
We have seen this in other examples. If we want to use julearn, each feature
must be a column. In order to use the signals from different regions as
features, we need to convert this dataframe from the long format to the wide
format.

We will use the ``pivot`` method.

.. GENERATED FROM PYTHON SOURCE LINES 61-66

.. code-block:: default

    df_fmri = df_fmri.pivot(
        index=['subject', 'timepoint', 'event'],
        columns='region',
        values='signal')


.. GENERATED FROM PYTHON SOURCE LINES 67-79

This method reshapes the table, using the specified elements as index,
columns and values.

In our case, the values are extracted from the *signal* column, the columns
come from the *region* column, and *subject*, *timepoint* and *event* become
the index.

The index is what identifies each sample. As a rule, each index entry must be
unique. If each subject has more than one timepoint, and each timepoint
has more than one event, then these three elements are needed as the index.

Let's see what we have here:

.. GENERATED FROM PYTHON SOURCE LINES 79-82

.. code-block:: default

    print(df_fmri.head())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    region                    frontal  parietal
    subject timepoint event                    
    s0      0         cue    0.007766 -0.006899
                      stim  -0.021452 -0.039327
            1         cue    0.016440  0.000300
                      stim  -0.021054 -0.035735
            2         cue    0.024296  0.033220


.. GENERATED FROM PYTHON SOURCE LINES 83-87

Now this is in the format we want. However, in order to access the index
levels as columns (e.g. ``df_fmri['subject']``), we need to reset the index.

Check the subtle but important difference:

.. GENERATED FROM PYTHON SOURCE LINES 87-90

.. code-block:: default

    df_fmri = df_fmri.reset_index()
    print(df_fmri.head())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    region subject  timepoint event   frontal  parietal
    0           s0          0   cue  0.007766 -0.006899
    1           s0          0  stim -0.021452 -0.039327
    2           s0          1   cue  0.016440  0.000300
    3           s0          1  stim -0.021054 -0.035735
    4           s0          2   cue  0.024296  0.033220
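
To make the difference concrete, here is a small sketch on made-up data
(``df_indexed`` and ``df_flat`` are hypothetical names, not part of the
original example). While *subject* is the index it is not among the columns;
after ``reset_index`` it becomes a regular column.

.. code-block:: default

    import pandas as pd

    # Hypothetical frame where 'subject' is the index, not a column.
    df_indexed = pd.DataFrame(
        {'frontal': [0.1, 0.2]},
        index=pd.Index(['s0', 's1'], name='subject'))

    # 'subject' cannot be selected as a column at this point.
    print('subject' in df_indexed.columns)

    # reset_index moves the index into a regular column.
    df_flat = df_indexed.reset_index()
    print('subject' in df_flat.columns)
    print(df_flat['subject'].tolist())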


.. GENERATED FROM PYTHON SOURCE LINES 91-99

Merging or joining data frames
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So now we have our fMRI data tabulated for julearn. However, it might be the
case that we have some important information in another file. For example,
the subjects' age and the place where they were scanned.

For the purpose of the example, I will create the dataframe here.

.. GENERATED FROM PYTHON SOURCE LINES 99-107

.. code-block:: default

    metadata = {
        'subject': [f's{i}' for i in range(14)],
        'age': [23, 21, 31, 29, 43, 23, 43, 28, 48, 29, 35, 23, 34, 25],
        'scanner': ['a'] * 6 + ['b'] * 8
    }
    df_meta = pd.DataFrame(metadata)
    print(df_meta)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

       subject  age scanner
    0       s0   23       a
    1       s1   21       a
    2       s2   31       a
    3       s3   29       a
    4       s4   43       a
    5       s5   23       a
    6       s6   43       b
    7       s7   28       b
    8       s8   48       b
    9       s9   29       b
    10     s10   35       b
    11     s11   23       b
    12     s12   34       b
    13     s13   25       b


.. GENERATED FROM PYTHON SOURCE LINES 108-113

We will use the ``join`` method. This method will join the two dataframes,
matching elements by the *index*.

In this case, the matching element (or index) will be the column ``subject``.
We need to set the index in each dataframe before joining.

.. GENERATED FROM PYTHON SOURCE LINES 113-118

.. code-block:: default

    df_fmri = df_fmri.set_index('subject')
    df_meta = df_meta.set_index('subject')
    df_fmri = df_fmri.join(df_meta)
    print(df_fmri)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

             timepoint event   frontal  parietal  age scanner
    subject                                                  
    s0               0   cue  0.007766 -0.006899   23       a
    s0               0  stim -0.021452 -0.039327   23       a
    s0               1   cue  0.016440  0.000300   23       a
    s0               1  stim -0.021054 -0.035735   23       a
    s0               2   cue  0.024296  0.033220   23       a
    ...            ...   ...       ...       ...  ...     ...
    s9              16  stim -0.036739 -0.131641   29       b
    s9              17   cue -0.004900 -0.036362   29       b
    s9              17  stim -0.030099 -0.121574   29       b
    s9              18   cue -0.000643 -0.051040   29       b
    s9              18  stim -0.009959 -0.103513   29       b

    [532 rows x 6 columns]
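
As an alternative sketch, pandas also offers ``merge``, which matches on a
column directly instead of on the index. The example below uses small
made-up dataframes (``df_signal`` and ``df_info`` are hypothetical) to show
both routes side by side. It is an illustration under these assumptions, not
the approach used in the rest of this example.

.. code-block:: default

    import pandas as pd

    # Hypothetical data: several rows per subject, one metadata row each.
    df_signal = pd.DataFrame({
        'subject': ['s0', 's0', 's1'],
        'frontal': [0.1, 0.2, 0.3],
    })
    df_info = pd.DataFrame({
        'subject': ['s0', 's1'],
        'age': [23, 21],
    })

    # join: both frames are matched on their index.
    joined = df_signal.set_index('subject').join(
        df_info.set_index('subject'))
    joined = joined.reset_index()

    # merge: match on the 'subject' column; how='left' keeps every row
    # of df_signal, which is also what join does by default.
    merged = df_signal.merge(df_info, on='subject', how='left')

    print(joined)
    print(merged)

With ``merge`` there is no need to set and reset the index, which can be
convenient when the matching key is already a regular column.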


.. GENERATED FROM PYTHON SOURCE LINES 119-120

Finally, let's reset the index and have it ready for julearn.

.. GENERATED FROM PYTHON SOURCE LINES 120-122

.. code-block:: default

    df_fmri = df_fmri.reset_index()


.. GENERATED FROM PYTHON SOURCE LINES 123-124

Now we can use, for example, *age* and *scanner* as confounds.

.. GENERATED FROM PYTHON SOURCE LINES 126-134

Reshaping data frames (more complex)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Let's suppose that our prediction target is now the *age* and we want to use
the frontal and parietal values during each event as features. For this
purpose, we need to convert the event values into columns. There are two
events: *cue* and *stim*, so this will result in 4 columns.

We will still use ``pivot``, but in this case, we will have two values:

.. GENERATED FROM PYTHON SOURCE LINES 134-140

.. code-block:: default

    df_fmri = df_fmri.pivot(
        index=['subject', 'timepoint', 'age', 'scanner'],
        columns='event',
        values=['frontal', 'parietal'])

    print(df_fmri)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

                                    frontal            parietal          
    event                               cue      stim       cue      stim
    subject timepoint age scanner                                        
    s0      0         23  a        0.007766 -0.021452 -0.006899 -0.039327
            1         23  a        0.016440 -0.021054  0.000300 -0.035735
            2         23  a        0.024296 -0.009038  0.033220  0.009642
            3         23  a        0.047859  0.026727  0.085040  0.086399
            4         23  a        0.069775  0.070558  0.115321  0.154058
    ...                                 ...       ...       ...       ...
    s9      14        29  b        0.010535 -0.061817 -0.034386 -0.130267
            15        29  b        0.002170 -0.048007 -0.038257 -0.134828
            16        29  b       -0.004290 -0.036739 -0.035395 -0.131641
            17        29  b       -0.004900 -0.030099 -0.036362 -0.121574
            18        29  b       -0.000643 -0.009959 -0.051040 -0.103513

    [266 rows x 4 columns]


.. GENERATED FROM PYTHON SOURCE LINES 141-144

Since the column names are combinations of the values in the *event* column
and the previous *frontal* and *parietal* columns, the result has
multi-level column names.

.. GENERATED FROM PYTHON SOURCE LINES 144-145

.. code-block:: default

    print(df_fmri.columns)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    MultiIndex([( 'frontal',  'cue'),
                ( 'frontal', 'stim'),
                ('parietal',  'cue'),
                ('parietal', 'stim')],
               names=[None, 'event'])


.. GENERATED FROM PYTHON SOURCE LINES 146-147

The following trick will join the different levels using an underscore (``_``):

.. GENERATED FROM PYTHON SOURCE LINES 147-150

.. code-block:: default

    df_fmri.columns = ['_'.join(x) for x in df_fmri.columns]

    print(df_fmri)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

                                   frontal_cue  ...  parietal_stim
    subject timepoint age scanner               ...               
    s0      0         23  a           0.007766  ...      -0.039327
            1         23  a           0.016440  ...      -0.035735
            2         23  a           0.024296  ...       0.009642
            3         23  a           0.047859  ...       0.086399
            4         23  a           0.069775  ...       0.154058
    ...                                    ...  ...            ...
    s9      14        29  b           0.010535  ...      -0.130267
            15        29  b           0.002170  ...      -0.134828
            16        29  b          -0.004290  ...      -0.131641
            17        29  b          -0.004900  ...      -0.121574
            18        29  b          -0.000643  ...      -0.103513

    [266 rows x 4 columns]
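
The renaming step can be isolated in a short sketch. The frame ``df_multi``
below is made up for illustration and only mimics the column structure shown
earlier. Iterating over a ``MultiIndex`` yields one tuple per column, and
``'_'.join`` glues the parts of each tuple together.

.. code-block:: default

    import pandas as pd

    # Hypothetical frame with the same two-level column layout as above.
    columns = pd.MultiIndex.from_product(
        [['frontal', 'parietal'], ['cue', 'stim']], names=[None, 'event'])
    df_multi = pd.DataFrame([[0.1, 0.2, 0.3, 0.4]], columns=columns)

    # Each column label is a tuple such as ('frontal', 'cue').
    flat_names = ['_'.join(levels) for levels in df_multi.columns]
    df_multi.columns = flat_names
    print(list(df_multi.columns))

Note that ``'_'.join`` expects every level value to be a string. If a level
holds numbers, converting each part with ``str`` first is one way around it.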


.. GENERATED FROM PYTHON SOURCE LINES 151-152

We finally have the information we want. We can now reset the index.

.. GENERATED FROM PYTHON SOURCE LINES 152-153

.. code-block:: default

    df_fmri = df_fmri.reset_index()


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  0.108 seconds)


.. _sphx_glr_download_auto_examples_advanced_run_combine_pandas.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: run_combine_pandas.py <run_combine_pandas.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: run_combine_pandas.ipynb <run_combine_pandas.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_