Working with pandas#

This example uses the fmri dataset to show how to transform and combine data so that it is ready to be used by julearn.

References#

Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of cognitive control in context-dependent decision-making. Cerebral Cortex.

# Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
#
# License: AGPL

from seaborn import load_dataset
import pandas as pd

One of the key elements that makes julearn easy to use is the possibility to work directly with pandas.DataFrame objects, which are similar to MS Excel spreadsheets or csv files.

Ideally, we will have everything tabulated and organised for julearn, but this might not be the case for you. You might have some files with the fMRI values, others with demographics, and others with diagnostic metrics or behavioral results.

You need to prepare these files for julearn.

One option is to manually edit the files until everything is ready for machine learning. However, this is error-prone.

Fortunately, pandas provides several tools to deal with this task.

This example is a collection of some of these useful methods.

Let’s start with the fmri dataset.

df_fmri = load_dataset("fmri")

Let’s see what this dataset has.

subject timepoint event region signal
0 s13 18 stim parietal -0.017552
1 s5 14 stim parietal -0.080883
2 s12 18 stim parietal -0.081033
3 s11 18 stim parietal -0.046134
4 s10 18 stim parietal -0.037970
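
A preview like the one above is what head() returns. As a minimal sketch that runs without downloading anything, the following builds a small stand-in frame (the values are made up; only the column names mirror the fmri dataset) and inspects it in the same way.

```python
import pandas as pd

# Made-up values for illustration; only the column names mirror the fmri dataset.
df_long = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s0", "s0", "s1", "s1", "s1", "s1"],
        "timepoint": [0, 0, 0, 0, 0, 0, 0, 0],
        "event": ["cue", "cue", "stim", "stim"] * 2,
        "region": ["frontal", "parietal"] * 4,
        "signal": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    }
)

# First five rows of the long-format table.
preview = df_long.head()
print(preview)

# Distinct values of the columns that will later be turned into column names.
levels = {col: sorted(df_long[col].unique()) for col in ["event", "region"]}
print(levels)
```

Checking the distinct values of event and region up front tells us how many columns to expect after reshaping.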


From long to wide format#

We have seen this in other examples. If we want to use julearn, each feature must be a column. In order to use the signals from different regions as features, we need to convert this dataframe from the long format to the wide format.

We will use the pivot method.

df_fmri = df_fmri.pivot(
    index=["subject", "timepoint", "event"], columns="region", values="signal"
)

This method reshapes the table, keeping the specified elements as index, columns and values.

In our case, the values are extracted from the signal column. The column names come from the region column, and subject, timepoint and event become the index.

The index is what identifies each sample. For pivot to work, each combination of index and column values must be unique; otherwise pivot raises an error. Since each subject has more than one timepoint, and each timepoint has more than one event, all 3 elements are needed as the index.
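
To make the uniqueness requirement concrete, here is a minimal sketch on a small made-up frame (the values are invented; only the column names follow the fmri dataset). It pivots the frame as above and then shows that pivot rejects a duplicated combination.

```python
import pandas as pd

# Made-up values for illustration; only the column names mirror the fmri dataset.
df_long = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s0", "s0", "s1", "s1", "s1", "s1"],
        "timepoint": [0, 0, 0, 0, 0, 0, 0, 0],
        "event": ["cue", "cue", "stim", "stim"] * 2,
        "region": ["frontal", "parietal"] * 4,
        "signal": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    }
)

# One row per (subject, timepoint, event), one column per region.
df_wide = df_long.pivot(
    index=["subject", "timepoint", "event"], columns="region", values="signal"
)
print(df_wide)

# Appending a copy of the first row duplicates one
# (subject, timepoint, event, region) combination, which pivot rejects.
df_dup = pd.concat([df_long, df_long.iloc[[0]]], ignore_index=True)
try:
    df_dup.pivot(
        index=["subject", "timepoint", "event"], columns="region", values="signal"
    )
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(f"pivot failed: {err}")
```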

Let’s see what we have here:

region frontal parietal
subject timepoint event
s0 0 cue 0.007766 -0.006899
stim -0.021452 -0.039327
1 cue 0.016440 0.000300
stim -0.021054 -0.035735
2 cue 0.024296 0.033220


Now this is in the format we want. However, in order to access the index levels as columns (for example, df_fmri["subject"]), we need to reset the index.

Check the subtle but important difference:

region subject timepoint event frontal parietal
0 s0 0 cue 0.007766 -0.006899
1 s0 0 stim -0.021452 -0.039327
2 s0 1 cue 0.016440 0.000300
3 s0 1 stim -0.021054 -0.035735
4 s0 2 cue 0.024296 0.033220
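
The difference can be checked directly. The sketch below uses a small hand-built frame (made-up values, with the same index level names as above) and shows that subject only becomes a regular column after reset_index.

```python
import pandas as pd

# Hand-built wide frame with made-up values and a three-level index.
df_wide = pd.DataFrame(
    {"frontal": [0.1, 0.3], "parietal": [0.2, 0.4]},
    index=pd.MultiIndex.from_tuples(
        [("s0", 0, "cue"), ("s0", 0, "stim")],
        names=["subject", "timepoint", "event"],
    ),
)

# Before the reset, "subject" is an index level, not a column.
print("subject" in df_wide.columns)

# reset_index returns a new frame in which every index level is a column.
df_flat = df_wide.reset_index()
print("subject" in df_flat.columns)
print(df_flat["subject"].tolist())
```

Note that reset_index returns a new dataframe, so the result has to be assigned for the change to be kept.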


Merging or joining DataFrames#

So now we have our fMRI data tabulated for julearn. However, it might be the case that we have some important information in another file. For example, the subjects’ age and the place where they were scanned.

For the purpose of the example, we’ll create the dataframe here.

metadata = {
    "subject": [f"s{i}" for i in range(14)],
    "age": [23, 21, 31, 29, 43, 23, 43, 28, 48, 29, 35, 23, 34, 25],
    "scanner": ["a"] * 6 + ["b"] * 8,
}
df_meta = pd.DataFrame(metadata)
df_meta
subject age scanner
0 s0 23 a
1 s1 21 a
2 s2 31 a
3 s3 29 a
4 s4 43 a
5 s5 23 a
6 s6 43 b
7 s7 28 b
8 s8 48 b
9 s9 29 b
10 s10 35 b
11 s11 23 b
12 s12 34 b
13 s13 25 b


We will use the join method. This method will join the two dataframes, matching elements by the index.

In this case, the matching element (or index) will be the column subject. We need to set subject as the index in each dataframe before joining.

timepoint event frontal parietal age scanner
subject
s0 0 cue 0.007766 -0.006899 23 a
s0 0 stim -0.021452 -0.039327 23 a
s0 1 cue 0.016440 0.000300 23 a
s0 1 stim -0.021054 -0.035735 23 a
s0 2 cue 0.024296 0.033220 23 a
... ... ... ... ... ... ...
s9 16 stim -0.036739 -0.131641 29 b
s9 17 cue -0.004900 -0.036362 29 b
s9 17 stim -0.030099 -0.121574 29 b
s9 18 cue -0.000643 -0.051040 29 b
s9 18 stim -0.009959 -0.103513 29 b

532 rows × 6 columns
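
As a minimal sketch of this step, the following joins two small made-up frames on subject. The names df_signals and df_info are placeholders for this illustration, not variables from the example.

```python
import pandas as pd

# Made-up stand-ins for the signal table and the metadata table.
df_signals = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s1"],
        "timepoint": [0, 1, 0],
        "frontal": [0.1, 0.2, 0.3],
    }
)
df_info = pd.DataFrame({"subject": ["s0", "s1"], "age": [23, 21]})

# join aligns rows on the index, so "subject" becomes the index of both frames.
df_joined = df_signals.set_index("subject").join(df_info.set_index("subject"))
print(df_joined)
```

Each metadata row is repeated for every row of the matching subject, which is why subject appears several times in the joined index.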



Finally, let’s reset the index and have it ready for julearn.
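
For comparison, pandas also provides merge, which matches on a column directly. This is an alternative to the join method used in this example, shown here only as a sketch on made-up frames (df_signals and df_info are placeholder names) to illustrate that both routes lead to the same flat table.

```python
import pandas as pd

# Made-up stand-ins for the signal table and the metadata table.
df_signals = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s1"],
        "timepoint": [0, 1, 0],
        "frontal": [0.1, 0.2, 0.3],
    }
)
df_info = pd.DataFrame({"subject": ["s0", "s1"], "age": [23, 21]})

# Route used in this example: set the index, join, then reset the index.
via_join = (
    df_signals.set_index("subject")
    .join(df_info.set_index("subject"))
    .reset_index()
)

# Alternative: merge matches on the "subject" column, so no index handling is needed.
via_merge = df_signals.merge(df_info, on="subject", how="left")

print(via_join)
print(via_merge)
```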

Now we can use, for example, age and scanner as confounds.

Reshaping data frames (more complex)#

Let’s suppose that our prediction target is now the age and we want to use as features the frontal and parietal values during each event. For this purpose, we need to convert the event values into columns. There are two events: cue and stim. So this will result in 4 columns.

We will still use the pivot method, but in this case we will pass two value columns:

df_fmri = df_fmri.pivot(
    index=["subject", "timepoint", "age", "scanner"],
    columns="event",
    values=["frontal", "parietal"],
)
df_fmri
frontal parietal
event cue stim cue stim
subject timepoint age scanner
s0 0 23 a 0.007766 -0.021452 -0.006899 -0.039327
1 23 a 0.016440 -0.021054 0.000300 -0.035735
2 23 a 0.024296 -0.009038 0.033220 0.009642
3 23 a 0.047859 0.026727 0.085040 0.086399
4 23 a 0.069775 0.070558 0.115321 0.154058
... ... ... ... ... ... ... ...
s9 14 29 b 0.010535 -0.061817 -0.034386 -0.130267
15 29 b 0.002170 -0.048007 -0.038257 -0.134828
16 29 b -0.004290 -0.036739 -0.035395 -0.131641
17 29 b -0.004900 -0.030099 -0.036362 -0.121574
18 29 b -0.000643 -0.009959 -0.051040 -0.103513

266 rows × 4 columns
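
The same reshaping can be tried on a small made-up frame. In the sketch below, df_events is a placeholder name and the numbers are invented. With two value columns and two events, the result has four columns.

```python
import pandas as pd

# Made-up values: one row per subject and event, two signal columns.
df_events = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s1", "s1"],
        "age": [23, 23, 21, 21],
        "event": ["cue", "stim", "cue", "stim"],
        "frontal": [0.1, 0.2, 0.3, 0.4],
        "parietal": [0.5, 0.6, 0.7, 0.8],
    }
)

# Passing a list to `values` creates one block of event columns per value column.
df_wide = df_events.pivot(
    index=["subject", "age"], columns="event", values=["frontal", "parietal"]
)
print(df_wide)

# Each column label is now a (value column, event) tuple.
print(df_wide.columns.to_list())
```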



Since the column names are combinations of the values in the event column and the previous frontal and parietal columns, the columns now form a multi-level index (a MultiIndex).

MultiIndex([( 'frontal',  'cue'),
            ( 'frontal', 'stim'),
            ('parietal',  'cue'),
            ('parietal', 'stim')],
           names=[None, 'event'])

The following trick will join the different levels using an underscore (_):

df_fmri.columns = ["_".join(x) for x in df_fmri.columns]
df_fmri
frontal_cue frontal_stim parietal_cue parietal_stim
subject timepoint age scanner
s0 0 23 a 0.007766 -0.021452 -0.006899 -0.039327
1 23 a 0.016440 -0.021054 0.000300 -0.035735
2 23 a 0.024296 -0.009038 0.033220 0.009642
3 23 a 0.047859 0.026727 0.085040 0.086399
4 23 a 0.069775 0.070558 0.115321 0.154058
... ... ... ... ... ... ... ...
s9 14 29 b 0.010535 -0.061817 -0.034386 -0.130267
15 29 b 0.002170 -0.048007 -0.038257 -0.134828
16 29 b -0.004290 -0.036739 -0.035395 -0.131641
17 29 b -0.004900 -0.030099 -0.036362 -0.121574
18 29 b -0.000643 -0.009959 -0.051040 -0.103513

266 rows × 4 columns
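
To see why the trick works, note that iterating over MultiIndex columns yields one tuple per column. The following sketch rebuilds the same column structure by hand (with made-up values) and flattens it.

```python
import pandas as pd

# Hand-built two-level columns matching the structure shown above.
columns = pd.MultiIndex.from_product(
    [["frontal", "parietal"], ["cue", "stim"]], names=[None, "event"]
)
df_wide = pd.DataFrame([[0.1, 0.2, 0.3, 0.4]], columns=columns)

# Each column label is a tuple such as ("frontal", "cue"); join its parts.
flat_names = ["_".join(levels) for levels in df_wide.columns]
df_wide.columns = flat_names
print(df_wide.columns.tolist())
```

This relies on every level being a string. If a level held numbers, each part would need to be converted with str() before joining.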



We finally have the information we want. We can now reset the index.
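
As a last sketch, the following resets the index of a small made-up frame and collects the column names that would serve as features and target. The names df_features, features and target are placeholders for this illustration; how they are passed on to julearn is not shown here.

```python
import pandas as pd

# Made-up frame with flattened feature columns and a four-level index.
index = pd.MultiIndex.from_tuples(
    [("s0", 0, 23, "a"), ("s1", 0, 21, "a")],
    names=["subject", "timepoint", "age", "scanner"],
)
df_features = pd.DataFrame(
    {"frontal_cue": [0.1, 0.2], "parietal_cue": [0.3, 0.4]}, index=index
)

# After the reset, age and scanner are ordinary columns again.
df_ready = df_features.reset_index()

# Hypothetical split of the column names into features and target.
target = "age"
features = [
    col for col in df_ready.columns if col.startswith(("frontal", "parietal"))
]
print(df_ready[features + [target]])
```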
