Working with pandas

This example uses the ‘fmri’ dataset to show how to transform and combine data so that it is ready to be used by julearn.

References

Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of cognitive control in context-dependent decision-making. Cerebral Cortex.

# Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
#
# License: AGPL
from seaborn import load_dataset
import pandas as pd

One of the key elements that makes julearn easy to use is the possibility to work directly with pandas data frames, which are tables similar to Excel spreadsheets or CSV files.

Ideally, we will have everything tabulated and organised for julearn, but that might not be your case. You might have some files with the fMRI values, others with demographics, and others with diagnostic metrics or behavioural results.

You need to prepare these files for julearn.

One option is to manually edit the files and make sure that everything is ready for machine learning. However, this is prone to introducing errors.

Fortunately, pandas provides several tools to deal with this task.

This example is a collection of some of these useful methods.

Let's start with the fmri dataset.

df_fmri = load_dataset('fmri')

Let's see what this dataset contains.

print(df_fmri.head())

Out:

  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970

From long to wide format

We have seen this in other examples. If we want to use julearn, each feature must be a column. In order to use the signals from different regions as features, we need to convert this dataframe from the long format to the wide format.

We will use the pivot method.

df_fmri = df_fmri.pivot(
    index=['subject', 'timepoint', 'event'],
    columns='region',
    values='signal')

This method reshapes the table, using the specified elements as index, columns and values.

In our case, the values are extracted from the signal column and the column names from the region column, while subject, timepoint and event become the index.

The index is what identifies each sample. As a rule, index entries must be unique. If each subject has more than one timepoint, and each timepoint has more than one event, then all three elements are needed as the index.
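To illustrate the uniqueness rule, here is a minimal sketch on a small, made-up table (the values and the `df_long` and `keys` names are hypothetical and not part of the fmri dataset). It flags repeated key combinations with `duplicated` and shows that `pivot` raises a `ValueError` when they are present.

```python
import pandas as pd

# Hypothetical long-format table: the last row repeats the combination
# (s0, 0, 'cue', 'frontal') that already appears in the first row.
df_long = pd.DataFrame({
    'subject': ['s0', 's0', 's0', 's0'],
    'timepoint': [0, 0, 0, 0],
    'event': ['cue', 'cue', 'stim', 'cue'],
    'region': ['frontal', 'parietal', 'frontal', 'frontal'],
    'signal': [0.1, 0.2, 0.3, 0.4],
})

keys = ['subject', 'timepoint', 'event', 'region']

# Flag every row whose key combination occurs more than once.
dup_mask = df_long.duplicated(subset=keys, keep=False)
print(df_long[dup_mask])

# pivot refuses to reshape when index/column pairs are not unique.
try:
    df_long.pivot(
        index=['subject', 'timepoint', 'event'],
        columns='region',
        values='signal')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# After resolving the duplicates (here simply keeping the first
# occurrence, which is an arbitrary choice), pivot works.
df_wide = df_long.drop_duplicates(subset=keys).pivot(
    index=['subject', 'timepoint', 'event'],
    columns='region',
    values='signal')
print(df_wide)
```

Whether dropping, averaging or otherwise resolving duplicated rows is appropriate depends on the data; the point is to inspect them before reshaping.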

Let's see what we have here:

print(df_fmri.head())

Out:

region                    frontal  parietal
subject timepoint event
s0      0         cue    0.007766 -0.006899
                  stim  -0.021452 -0.039327
        1         cue    0.016440  0.000300
                  stim  -0.021054 -0.035735
        2         cue    0.024296  0.033220

Now this is in the format we want. However, in order to access the index levels as columns (for example, df_fmri['subject']), we need to reset the index.

Check the subtle but important difference:

df_fmri = df_fmri.reset_index()
print(df_fmri.head())

Out:

region subject  timepoint event   frontal  parietal
0           s0          0   cue  0.007766 -0.006899
1           s0          0  stim -0.021452 -0.039327
2           s0          1   cue  0.016440  0.000300
3           s0          1  stim -0.021054 -0.035735
4           s0          2   cue  0.024296  0.033220
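Note that the header of the output above still shows region as the name of the columns axis, a leftover from the pivot. It is harmless, but it can be removed if it gets in the way. A minimal sketch on a made-up table (the `df_long` and `df_wide` names and values are hypothetical):

```python
import pandas as pd

# Hypothetical long-format table with one signal per subject and region.
df_long = pd.DataFrame({
    'subject': ['s0', 's0', 's1', 's1'],
    'region': ['frontal', 'parietal', 'frontal', 'parietal'],
    'signal': [0.1, 0.2, 0.3, 0.4],
})

df_wide = df_long.pivot(index='subject', columns='region', values='signal')
df_wide = df_wide.reset_index()

# The columns axis keeps the name of the pivoted column.
name_before = df_wide.columns.name
print(name_before)

# rename_axis(columns=None) clears that name; the data is unchanged.
df_wide = df_wide.rename_axis(columns=None)
print(df_wide)
```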

Merging or joining data frames

So now we have our fMRI data tabulated for julearn. However, it might be the case that we have some important information in another file. For example, the subjects’ age and the place where they were scanned.

For the purpose of the example, I will create the dataframe here.

metadata = {
    'subject': [f's{i}' for i in range(14)],
    'age': [23, 21, 31, 29, 43, 23, 43, 28, 48, 29, 35, 23, 34, 25],
    'scanner': ['a'] * 6 + ['b'] * 8
}
df_meta = pd.DataFrame(metadata)
print(df_meta)

Out:

   subject  age scanner
0       s0   23       a
1       s1   21       a
2       s2   31       a
3       s3   29       a
4       s4   43       a
5       s5   23       a
6       s6   43       b
7       s7   28       b
8       s8   48       b
9       s9   29       b
10     s10   35       b
11     s11   23       b
12     s12   34       b
13     s13   25       b

We will use the join method. This method joins the two dataframes, matching elements by their index.

In this case, the matching element (or index) will be the subject column. We need to set it as the index of each dataframe before joining.

df_fmri = df_fmri.set_index('subject')
df_meta = df_meta.set_index('subject')
df_fmri = df_fmri.join(df_meta)
print(df_fmri)

Out:

         timepoint event   frontal  parietal  age scanner
subject
s0               0   cue  0.007766 -0.006899   23       a
s0               0  stim -0.021452 -0.039327   23       a
s0               1   cue  0.016440  0.000300   23       a
s0               1  stim -0.021054 -0.035735   23       a
s0               2   cue  0.024296  0.033220   23       a
...            ...   ...       ...       ...  ...     ...
s9              16  stim -0.036739 -0.131641   29       b
s9              17   cue -0.004900 -0.036362   29       b
s9              17  stim -0.030099 -0.121574   29       b
s9              18   cue -0.000643 -0.051040   29       b
s9              18  stim -0.009959 -0.103513   29       b

[532 rows x 6 columns]
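As an alternative sketch, `merge` can match on a column directly, without setting the index first. The example below uses small made-up tables (the `df_data` and `df_info` names and values are hypothetical). Like `join`, it keeps every row of the left dataframe (`how='left'`), so subjects without metadata end up with missing values that are worth checking for.

```python
import pandas as pd

# Hypothetical data: several rows per subject on the left,
# one row per subject on the right, and no metadata for s2.
df_data = pd.DataFrame({
    'subject': ['s0', 's0', 's1', 's2'],
    'frontal': [0.1, 0.2, 0.3, 0.4],
})
df_info = pd.DataFrame({
    'subject': ['s0', 's1'],
    'age': [23, 21],
})

# validate='many_to_one' raises an error if a subject appears
# more than once in the right-hand dataframe.
df_merged = df_data.merge(
    df_info, on='subject', how='left', validate='many_to_one')

# Subjects with no match in the metadata get NaN in the new columns.
missing = df_merged.loc[df_merged['age'].isna(), 'subject'].unique()
print(df_merged)
print(missing)
```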

Finally, let's reset the index and have it ready for julearn.

df_fmri = df_fmri.reset_index()

Now we can use, for example, age and scanner as confounds.

Reshaping data frames (more complex)

Let's suppose that our prediction target is now the age and we want to use as features the frontal and parietal values during each event. For this purpose, we need to convert the event values into columns. There are two events, cue and stim, so this will result in 4 columns.

We will still use the pivot method, but in this case, we will have two values:

df_fmri = df_fmri.pivot(
    index=['subject', 'timepoint', 'age', 'scanner'],
    columns='event',
    values=['frontal', 'parietal'])

print(df_fmri)

Out:

                                frontal            parietal
event                               cue      stim       cue      stim
subject timepoint age scanner
s0      0         23  a        0.007766 -0.021452 -0.006899 -0.039327
        1         23  a        0.016440 -0.021054  0.000300 -0.035735
        2         23  a        0.024296 -0.009038  0.033220  0.009642
        3         23  a        0.047859  0.026727  0.085040  0.086399
        4         23  a        0.069775  0.070558  0.115321  0.154058
...                                 ...       ...       ...       ...
s9      14        29  b        0.010535 -0.061817 -0.034386 -0.130267
        15        29  b        0.002170 -0.048007 -0.038257 -0.134828
        16        29  b       -0.004290 -0.036739 -0.035395 -0.131641
        17        29  b       -0.004900 -0.030099 -0.036362 -0.121574
        18        29  b       -0.000643 -0.009959 -0.051040 -0.103513

[266 rows x 4 columns]

Since the column names are combinations of the values in the event column and the previous frontal and parietal columns, the dataframe now has multi-level column names.

print(df_fmri.columns)

Out:

MultiIndex([( 'frontal',  'cue'),
            ( 'frontal', 'stim'),
            ('parietal',  'cue'),
            ('parietal', 'stim')],
           names=[None, 'event'])

The following trick joins the different levels using an underscore (_):

df_fmri.columns = ['_'.join(x) for x in df_fmri.columns]

print(df_fmri)

Out:

                               frontal_cue  ...  parietal_stim
subject timepoint age scanner               ...
s0      0         23  a           0.007766  ...      -0.039327
        1         23  a           0.016440  ...      -0.035735
        2         23  a           0.024296  ...       0.009642
        3         23  a           0.047859  ...       0.086399
        4         23  a           0.069775  ...       0.154058
...                                    ...  ...            ...
s9      14        29  b           0.010535  ...      -0.130267
        15        29  b           0.002170  ...      -0.134828
        16        29  b          -0.004290  ...      -0.131641
        17        29  b          -0.004900  ...      -0.121574
        18        29  b          -0.000643  ...      -0.103513

[266 rows x 4 columns]
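The join above works because every level label is a string. As a hedged variant, the sketch below converts each label with `str` first, so it would also handle non-string labels such as integers. It uses a made-up one-row dataframe (the `df` name and values are hypothetical) with the same column structure.

```python
import pandas as pd

# Hypothetical dataframe with the same two-level columns as above.
columns = pd.MultiIndex.from_tuples(
    [('frontal', 'cue'), ('frontal', 'stim'),
     ('parietal', 'cue'), ('parietal', 'stim')],
    names=[None, 'event'])
df = pd.DataFrame([[0.1, 0.2, 0.3, 0.4]], columns=columns)

# to_flat_index() turns the MultiIndex into an Index of tuples;
# str() makes the join robust to labels that are not strings.
df.columns = [
    '_'.join(str(level) for level in col)
    for col in df.columns.to_flat_index()
]
print(list(df.columns))
```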

We finally have the information we want. We can now reset the index.

df_fmri = df_fmri.reset_index()
