.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_starting/run_combine_pandas.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_starting_run_combine_pandas.py:

Working with ``pandas``
=======================

This example uses the ``fmri`` dataset to transform and combine data in order
to prepare it to be used by ``julearn``.

References
----------

  Waskom, M.L., Frank, M.C., Wagner, A.D. (2016). Adaptive engagement of
  cognitive control in context-dependent decision-making. Cerebral Cortex.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 17-24

.. code-block:: Python

    # Authors: Federico Raimondo
    #
    # License: AGPL

    from seaborn import load_dataset
    import pandas as pd

.. GENERATED FROM PYTHON SOURCE LINES 25-44

One of the key elements that make ``julearn`` easy to use is that it works
directly with ``pandas.DataFrame`` objects, which are similar to MS Excel
spreadsheets or CSV files.

Ideally, everything is already tabulated and organised for ``julearn``, but
that might not be your case. You might have some files with the fMRI values,
others with demographics, and others with diagnostic metrics or behavioral
results.

You need to prepare these files for ``julearn``.

One option is to manually edit the files and make sure that everything is
ready to do some machine learning. However, this is error-prone.

Fortunately, `pandas`_ provides several tools to deal with this task.

This example is a collection of some of these useful methods.

Let's start with the ``fmri`` dataset.

.. GENERATED FROM PYTHON SOURCE LINES 44-47

.. code-block:: Python

    df_fmri = load_dataset("fmri")

.. GENERATED FROM PYTHON SOURCE LINES 48-50

Let's see what this dataset has.

.. GENERATED FROM PYTHON SOURCE LINES 50-52

.. code-block:: Python

    df_fmri.head()

.. raw:: html

	subject	timepoint	event	region	signal
0	s13	18	stim	parietal	-0.017552
1	s5	14	stim	parietal	-0.080883
2	s12	18	stim	parietal	-0.081033
3	s11	18	stim	parietal	-0.046134
4	s10	18	stim	parietal	-0.037970

.. GENERATED FROM PYTHON SOURCE LINES 53-61

From long to wide format
~~~~~~~~~~~~~~~~~~~~~~~~
We have seen this in other examples. If we want to use ``julearn``, each
feature must be a column. In order to use the signals from different regions
as features, we need to convert this dataframe from the long format to the
wide format.

We will use the ``pivot`` method.

.. GENERATED FROM PYTHON SOURCE LINES 61-65

.. code-block:: Python

    df_fmri = df_fmri.pivot(
        index=["subject", "timepoint", "event"], columns="region", values="signal"
    )

.. GENERATED FROM PYTHON SOURCE LINES 66-78

This method reshapes the table, keeping the specified elements as index,
columns and values.

In our case, the values are extracted from the *signal* column. The columns
come from the *region* column, and *subject*, *timepoint* and *event* become
the index.

The index is what identifies each sample. As a rule, the index can't be
duplicated. If each subject has more than one timepoint, and each timepoint
has more than one event, then these 3 elements are needed as the index.

Let's see what we have here:

.. GENERATED FROM PYTHON SOURCE LINES 78-80

.. code-block:: Python

    df_fmri.head()

.. raw:: html

		region	frontal	parietal
subject	timepoint	event
s0	0	cue	0.007766	-0.006899
	0	stim	-0.021452	-0.039327
	1	cue	0.016440	0.000300
	1	stim	-0.021054	-0.035735
	2	cue	0.024296	0.033220
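
The rule that index entries must be unique can be checked on a toy table. The following is a minimal sketch, not part of the original example: the dataframe ``df_long`` and its values are made up for illustration. It shows that ``pivot`` raises a ``ValueError`` when the chosen index and columns do not identify each row uniquely, and that adding the missing identifier to the index resolves it.

```python
import pandas as pd

# Hypothetical long-format table: two events per subject and region.
df_long = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s0", "s0"],
        "event": ["cue", "cue", "stim", "stim"],
        "region": ["frontal", "parietal", "frontal", "parietal"],
        "signal": [0.1, 0.2, 0.3, 0.4],
    }
)

# Using only "subject" as index is ambiguous: each (subject, region) pair
# appears twice, once per event, so pivot cannot reshape the table.
try:
    df_long.pivot(index="subject", columns="region", values="signal")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Adding "event" to the index makes every row uniquely identified.
df_wide = df_long.pivot(
    index=["subject", "event"], columns="region", values="signal"
)
```

If duplicated entries are expected and should be aggregated instead (for example, averaged), ``pivot_table`` is the usual alternative.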

.. GENERATED FROM PYTHON SOURCE LINES 81-85

Now this is in the format we want. However, in order to access the index
levels as columns (for example, ``df_fmri["subject"]``), we need to reset the
index.

Check the subtle but important difference:

.. GENERATED FROM PYTHON SOURCE LINES 85-88

.. code-block:: Python

    df_fmri = df_fmri.reset_index()
    df_fmri.head()

.. raw:: html

region	subject	timepoint	event	frontal	parietal
0	s0	0	cue	0.007766	-0.006899
1	s0	0	stim	-0.021452	-0.039327
2	s0	1	cue	0.016440	0.000300
3	s0	1	stim	-0.021054	-0.035735
4	s0	2	cue	0.024296	0.033220
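
To make the difference between index levels and columns concrete, here is a small sketch on a made-up table (``df_indexed`` and its values are hypothetical, not taken from the ``fmri`` dataset). It shows that column-style access fails while *subject* is part of the index, and works once the index has been reset.

```python
import pandas as pd

# Hypothetical wide table with "subject" and "timepoint" as the index.
df_indexed = pd.DataFrame(
    {"frontal": [0.1, 0.2], "parietal": [0.3, 0.4]},
    index=pd.MultiIndex.from_tuples(
        [("s0", 0), ("s0", 1)], names=["subject", "timepoint"]
    ),
)

# Index levels are not columns, so column-style access raises a KeyError.
try:
    df_indexed["subject"]
    subject_is_column = True
except KeyError:
    subject_is_column = False

# After reset_index, the former index levels are regular columns.
df_flat = df_indexed.reset_index()
subjects = df_flat["subject"].tolist()
```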

.. GENERATED FROM PYTHON SOURCE LINES 89-97

Merging or joining ``DataFrame``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So now we have our fMRI data tabulated for ``julearn``. However, we might
have some important information in another file, for example the subjects'
age and the place where they were scanned.

For the purpose of the example, we'll create the dataframe here.

.. GENERATED FROM PYTHON SOURCE LINES 97-105

.. code-block:: Python

    metadata = {
        "subject": [f"s{i}" for i in range(14)],
        "age": [23, 21, 31, 29, 43, 23, 43, 28, 48, 29, 35, 23, 34, 25],
        "scanner": ["a"] * 6 + ["b"] * 8,
    }
    df_meta = pd.DataFrame(metadata)
    df_meta

.. raw:: html

	subject	age	scanner
0	s0	23	a
1	s1	21	a
2	s2	31	a
3	s3	29	a
4	s4	43	a
5	s5	23	a
6	s6	43	b
7	s7	28	b
8	s8	48	b
9	s9	29	b
10	s10	35	b
11	s11	23	b
12	s12	34	b
13	s13	25	b

.. GENERATED FROM PYTHON SOURCE LINES 106-111

We will use the ``join`` method. This method joins the two dataframes,
matching elements by the *index*.

In this case, the matching element (or index) will be the column ``subject``.
We need to set the index in each dataframe before joining.

.. GENERATED FROM PYTHON SOURCE LINES 111-116

.. code-block:: Python

    df_fmri = df_fmri.set_index("subject")
    df_meta = df_meta.set_index("subject")
    df_fmri = df_fmri.join(df_meta)
    df_fmri

.. raw:: html

	timepoint	event	frontal	parietal	age	scanner
subject
s0	0	cue	0.007766	-0.006899	23	a
s0	0	stim	-0.021452	-0.039327	23	a
s0	1	cue	0.016440	0.000300	23	a
s0	1	stim	-0.021054	-0.035735	23	a
s0	2	cue	0.024296	0.033220	23	a
...	...	...	...	...	...	...
s9	16	stim	-0.036739	-0.131641	29	b
s9	17	cue	-0.004900	-0.036362	29	b
s9	17	stim	-0.030099	-0.121574	29	b
s9	18	cue	-0.000643	-0.051040	29	b
s9	18	stim	-0.009959	-0.103513	29	b

532 rows × 6 columns
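
As a side note, the same result can usually be obtained without touching the index by merging on the shared column. The sketch below uses two made-up tables (``df_signal`` and ``df_info`` are hypothetical stand-ins for the fMRI data and the metadata) and compares the ``set_index`` plus ``join`` approach used above with ``merge``.

```python
import pandas as pd

# Hypothetical toy tables, standing in for the fMRI data and the metadata.
df_signal = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s1"],
        "timepoint": [0, 1, 0],
        "frontal": [0.1, 0.2, 0.3],
    }
)
df_info = pd.DataFrame({"subject": ["s0", "s1"], "age": [23, 21]})

# Option 1: set the index on both tables, join, and reset the index.
joined = (
    df_signal.set_index("subject")
    .join(df_info.set_index("subject"))
    .reset_index()
)

# Option 2: merge on the shared column. validate="many_to_one" makes pandas
# raise an error if a subject appears more than once in the metadata.
merged = df_signal.merge(
    df_info, on="subject", how="left", validate="many_to_one"
)
```

Which one to use is mostly a matter of taste; ``merge`` saves the ``set_index`` and ``reset_index`` steps, and the ``validate`` argument guards against accidentally duplicating rows.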

.. GENERATED FROM PYTHON SOURCE LINES 117-118

Finally, let's reset the index and have it ready for ``julearn``.

.. GENERATED FROM PYTHON SOURCE LINES 118-120

.. code-block:: Python

    df_fmri = df_fmri.reset_index()

.. GENERATED FROM PYTHON SOURCE LINES 121-122

Now we can use, for example, *age* and *scanner* as confounds.

.. GENERATED FROM PYTHON SOURCE LINES 124-132

Reshaping data frames (more complex)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Let's suppose that our prediction target is now the *age* and we want to use
as features the frontal and parietal values during each event. For this
purpose, we need to convert the event values into columns. With two regions
and two events (*cue* and *stim*), this results in 4 columns.

We will still use ``pivot``, but in this case with two value columns:

.. GENERATED FROM PYTHON SOURCE LINES 132-139

.. code-block:: Python

    df_fmri = df_fmri.pivot(
        index=["subject", "timepoint", "age", "scanner"],
        columns="event",
        values=["frontal", "parietal"],
    )
    df_fmri

.. raw:: html

				frontal		parietal
			event	cue	stim	cue	stim
subject	timepoint	age	scanner
s0	0	23	a	0.007766	-0.021452	-0.006899	-0.039327
	1	23	a	0.016440	-0.021054	0.000300	-0.035735
	2	23	a	0.024296	-0.009038	0.033220	0.009642
	3	23	a	0.047859	0.026727	0.085040	0.086399
	4	23	a	0.069775	0.070558	0.115321	0.154058
...	...	...	...	...	...	...	...
s9	14	29	b	0.010535	-0.061817	-0.034386	-0.130267
	15	29	b	0.002170	-0.048007	-0.038257	-0.134828
	16	29	b	-0.004290	-0.036739	-0.035395	-0.131641
	17	29	b	-0.004900	-0.030099	-0.036362	-0.121574
	18	29	b	-0.000643	-0.009959	-0.051040	-0.103513

266 rows × 4 columns

.. GENERATED FROM PYTHON SOURCE LINES 140-143

Since the column names are combinations of the values in the *event* column
and the previous *frontal* and *parietal* columns, the column names now have
multiple levels.

.. GENERATED FROM PYTHON SOURCE LINES 143-145

.. code-block:: Python

    df_fmri.columns

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    MultiIndex([( 'frontal',  'cue'),
                ( 'frontal', 'stim'),
                ('parietal',  'cue'),
                ('parietal', 'stim')],
               names=[None, 'event'])

.. GENERATED FROM PYTHON SOURCE LINES 146-147

The following trick joins the different levels using an underscore (``_``).

.. GENERATED FROM PYTHON SOURCE LINES 147-150

.. code-block:: Python

    df_fmri.columns = ["_".join(x) for x in df_fmri.columns]
    df_fmri

.. raw:: html

				frontal_cue	frontal_stim	parietal_cue	parietal_stim
subject	timepoint	age	scanner
s0	0	23	a	0.007766	-0.021452	-0.006899	-0.039327
	1	23	a	0.016440	-0.021054	0.000300	-0.035735
	2	23	a	0.024296	-0.009038	0.033220	0.009642
	3	23	a	0.047859	0.026727	0.085040	0.086399
	4	23	a	0.069775	0.070558	0.115321	0.154058
...	...	...	...	...	...	...	...
s9	14	29	b	0.010535	-0.061817	-0.034386	-0.130267
	15	29	b	0.002170	-0.048007	-0.038257	-0.134828
	16	29	b	-0.004290	-0.036739	-0.035395	-0.131641
	17	29	b	-0.004900	-0.030099	-0.036362	-0.121574
	18	29	b	-0.000643	-0.009959	-0.051040	-0.103513

266 rows × 4 columns
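
The reshaping and renaming steps can be reproduced on a much smaller table, which makes it easier to see where each column name comes from. This is only an illustrative sketch: ``df_events`` and its values are made up, and only ``subject`` is used as index to keep it short.

```python
import pandas as pd

# Hypothetical table with one row per subject and event.
df_events = pd.DataFrame(
    {
        "subject": ["s0", "s0", "s1", "s1"],
        "event": ["cue", "stim", "cue", "stim"],
        "frontal": [0.1, 0.2, 0.3, 0.4],
        "parietal": [0.5, 0.6, 0.7, 0.8],
    }
)

# Pivoting two value columns yields two-level column names.
df_pivoted = df_events.pivot(
    index="subject", columns="event", values=["frontal", "parietal"]
)

# Each column name is a tuple such as ("frontal", "cue"). Joining the tuple
# elements with "_" gives a single-level name. This works only when every
# level is a string.
df_pivoted.columns = ["_".join(levels) for levels in df_pivoted.columns]
flat_names = list(df_pivoted.columns)
```

If one of the levels holds non-string values (for example, integer timepoints), the elements need to be converted with ``str`` before joining.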

.. GENERATED FROM PYTHON SOURCE LINES 151-152

We finally have the information we want. We can now reset the index.

.. GENERATED FROM PYTHON SOURCE LINES 152-153

.. code-block:: Python

    df_fmri = df_fmri.reset_index()

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.288 seconds)

.. _sphx_glr_download_auto_examples_00_starting_run_combine_pandas.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: run_combine_pandas.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: run_combine_pandas.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: run_combine_pandas.zip `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
