.. include:: ../links.inc

.. _extending_datagrabbers:

Creating Data Grabbers
======================

Data Grabbers are the first step of the pipeline. Their purpose is to interpret
the structure of a dataset and provide two specific functionalities:

#. Given an *element*, provide the path to each kind of data available for this
   element (e.g. the path to the T1 image, the path to the T2 image, etc.)
#. Provide the list of *elements* available in the dataset.

In this section, we will see how to create a DataGrabber for a dataset. Basic
aspects of DataGrabbers are covered in the
:ref:`Understanding Data Grabbers ` section.

.. _extending_datagrabbers_think:

Step 1: Think about the element
-------------------------------

Like with any programming-related task, the first step is to think. When
creating a DataGrabber, we first need to define what an *element* is. The
*element* should be the smallest unit of data that can be processed. That is,
for each element, there should be a set of data that can be processed, but only
one of each *data type* (see :ref:`data_types`).

For example, if we have a dataset from an fMRI study in which:

a. both T1w and fMRI were acquired
b. 20 subjects went through an experiment twice
c. the experiment included resting-state fMRI and a task named *stroop*

then the *element* should be composed of 3 items:

* ``subject``: The subject IDs, e.g. ``sub001``, ``sub002``, ... ``sub020``
* ``session``: The session number, e.g. ``ses1``, ``ses2``
* ``task``: The task performed, e.g. ``rest``, ``stroop``

If any of these items were not part of the element, then we would have more
than one ``T1w`` and / or ``BOLD`` image for each element, which is not allowed.

Importantly, nothing prevents one image from being part of two different
elements. For example, it is usually the case that the ``T1w`` image is not
acquired for each task, but once in the entire session.
So in this case, the ``T1w`` image for the element (``sub001``, ``ses1``,
``rest``) will be the same as the ``T1w`` image for the element (``sub001``,
``ses1``, ``stroop``).

We will now continue this section using as an example a dataset in BIDS format
in which 9 subjects (``sub-01`` to ``sub-09``) were each scanned during 3
sessions (``ses-01``, ``ses-02``, ``ses-03``). Each session includes a ``T1w``
and a ``BOLD`` image (resting-state), except for ``ses-03``, which contains
only anatomical data.

Step 2: Think about the dataset's structure
-------------------------------------------

Now that we have our element defined, we need to think about the structure of
the dataset, because it determines how the DataGrabber needs to be implemented.

``junifer`` provides a concrete class to deal with datasets that can be thought
of in terms of *patterns*. A *pattern* is a string that contains placeholders
that are replaced by the actual values of the element.

In our BIDS example, the path to the T1w image of subject ``sub-01`` and session
``ses-01``, relative to the dataset location, is
``sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz``. By replacing ``sub-01`` with
``sub-02``, we can obtain the T1w image of the first session of the second
subject. Indeed, the path to the T1w images can be expressed as a pattern:
``{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz`` where ``{subject}``
is the replacement for the subject id and ``{session}`` is the replacement for
the session id.

Since it is a BIDS dataset, the same happens with the BOLD images. The path to
the BOLD images can be expressed as a pattern:
``{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz``

This will be the norm in most datasets. If your dataset can be expressed in
terms of patterns, then follow :ref:`extending_datagrabbers_pattern`.
Otherwise, we recommend that you take time to rethink your dataset structure
and why it does not have clear *patterns*. Feel free to open a discussion on the
`junifer Discussions`_ page. Most probably we can help you get your dataset in
order. If there is no other way, then you can follow
:ref:`extending_datagrabbers_base` to create a DataGrabber from scratch.

.. _extending_datagrabbers_pattern:

Step 3: Create a Data Grabber
-----------------------------

Option A: Extending from PatternDataGrabber
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`.PatternDataGrabber` class is a concrete class with built-in
support for patterns. Before creating the DataGrabber, we need to define 3
variables:

* ``types``: A list with the available :ref:`data_types` in our dataset.
* ``patterns``: A dictionary that specifies the pattern and some additional
  information for each data type.
* ``replacements``: A list indicating which of the placeholders in the patterns
  should be replaced by the values of the element.

In our BIDS example, the variables will be:

.. code-block:: python

    types = ["T1w", "BOLD"]
    patterns = {
        "T1w": {
            "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
            "space": "native",
        },
        "BOLD": {
            "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
            "space": "MNI152NLin6Asym",
        },
    }
    replacements = ["subject", "session"]

An additional fourth variable is the ``datadir``, which should be the path to
where the dataset is located. For example, if the dataset is located in
``/data/project/test/data``, then ``datadir`` should be
``/data/project/test/data``. Or, if we want to allow the user to specify the
location of the dataset, we can expose the variable in the constructor, as in
the following example.

With the variables defined above, we can create our DataGrabber and name it
``ExampleBIDSDataGrabber``:

.. code-block:: python

    from pathlib import Path

    from junifer.datagrabber import PatternDataGrabber


    class ExampleBIDSDataGrabber(PatternDataGrabber):
        def __init__(self, datadir: str | Path) -> None:
            types = ["T1w", "BOLD"]
            patterns = {
                "T1w": {
                    "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                    "space": "native",
                },
                "BOLD": {
                    "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                    "space": "MNI152NLin6Asym",
                },
            }
            replacements = ["subject", "session"]
            super().__init__(
                datadir=datadir,
                types=types,
                patterns=patterns,
                replacements=replacements,
            )

Our DataGrabber is ready to be used by ``junifer``. However, it is still
unknown to the library, so we need to register it. To do so, we use the
:func:`.register_datagrabber` decorator.

.. code-block:: python

    from pathlib import Path

    from junifer.api.decorators import register_datagrabber
    from junifer.datagrabber import PatternDataGrabber


    @register_datagrabber
    class ExampleBIDSDataGrabber(PatternDataGrabber):
        def __init__(self, datadir: str | Path) -> None:
            types = ["T1w", "BOLD"]
            patterns = {
                "T1w": {
                    "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                    "space": "native",
                },
                "BOLD": {
                    "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                    "space": "MNI152NLin6Asym",
                },
            }
            replacements = ["subject", "session"]
            super().__init__(
                datadir=datadir,
                types=types,
                patterns=patterns,
                replacements=replacements,
            )

Now, we can use our DataGrabber in ``junifer`` by setting the ``datagrabber``
kind in the YAML file to ``ExampleBIDSDataGrabber``. Remember that we still need
to set the ``datadir``.

.. code-block:: yaml

    datagrabber:
      kind: ExampleBIDSDataGrabber
      datadir: /data/project/test/data

Optional: Using datalad
~~~~~~~~~~~~~~~~~~~~~~~

If you are using `datalad`_, you can use the
:class:`.PatternDataladDataGrabber` instead of the :class:`.PatternDataGrabber`.
This class will not only interpret patterns, but also use `datalad`_ to
``clone`` and ``get`` the data.

The main difference between the two is that the ``datadir`` is not the actual
location of the dataset, but the location where the dataset will be cloned. It
can now be ``None``, which means that the data will be downloaded to a temporary
directory. To set the location of the dataset, you can use the ``uri`` argument
in the constructor. Additionally, a ``rootdir`` argument can be used to specify
the path to the root directory of the dataset after doing ``datalad clone``.

In the example, the dataset is hosted on Gin
(``https://gin.g-node.org/juaml/datalad-example-bids``). When we clone this
dataset, we will see the following structure:

.. code-block::

    .
    └── example_bids_ses
        ├── sub-01
        │   ├── ses-01
        │   ├── ses-02
        │   └── ses-03
        ├── sub-02
        │   ├── ses-01
        │   ├── ses-02
        │   └── ses-03
        ├── sub-03
        ...

So the patterns will start after ``example_bids_ses``. This is our ``rootdir``.

Now we have our 2 additional variables:

.. code-block:: python

    uri = "https://gin.g-node.org/juaml/datalad-example-bids"
    rootdir = "example_bids_ses"

And we can create our DataGrabber:

.. code-block:: python

    from junifer.api.decorators import register_datagrabber
    from junifer.datagrabber import PatternDataladDataGrabber


    @register_datagrabber
    class ExampleBIDSDataGrabber(PatternDataladDataGrabber):
        def __init__(self) -> None:
            types = ["T1w", "BOLD"]
            patterns = {
                "T1w": {
                    "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                    "space": "native",
                },
                "BOLD": {
                    "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                    "space": "MNI152NLin6Asym",
                },
            }
            replacements = ["subject", "session"]
            uri = "https://gin.g-node.org/juaml/datalad-example-bids"
            rootdir = "example_bids_ses"
            super().__init__(
                datadir=None,
                uri=uri,
                rootdir=rootdir,
                types=types,
                patterns=patterns,
                replacements=replacements,
            )

This approach can be used directly from the YAML, like so:

.. code-block:: yaml

    datagrabber:
      kind: PatternDataladDataGrabber
      types:
        - BOLD
        - T1w
      patterns:
        BOLD:
          pattern: "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz"
          space: MNI152NLin6Asym
        T1w:
          pattern: "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz"
          space: native
      replacements:
        - subject
        - session
      uri: "https://gin.g-node.org/juaml/datalad-example-bids"
      rootdir: example_bids_ses

Advanced: Using Unix-like path expansion directives
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also possible to use some advanced Unix-like path expansion tricks to
define our patterns. A very common choice is ``*``, which matches any number of
characters. However, we cannot use it right after a replacement, as in:

.. code-block:: python

    "derivatives/freesurfer/{subject}*"

nor can we use it if the pattern would match multiple files or no files.

We can also use ``[]`` and ``[!]`` to glob certain tricky files, as is the case
with FreeSurfer derivatives. The file structure seen in a typical FreeSurfer
derivative of a dataset (like the ``AOMIC`` ones) is like so:

.. code-block::

    .
    └── derivatives
        └── freesurfer
            ├── fsaverage
            │   ├── mri
            │   │   ├── T1.mgz
            │   │   └── ...
            │   └── ...
            ├── sub-01
            │   ├── mri
            │   │   ├── T1.mgz
            │   │   └── ...
            │   └── ...
            ...

With a structure like this, it would be cumbersome to write custom methods for
the class, so we could instead use a pattern like this:

.. code-block:: python

    "derivatives/freesurfer/[!f]{subject}/mri/T1.mg[z]"

This would prevent the ``fsaverage`` directory from being treated as a subject
and let ``T1.mgz`` be fetched, even though there can be many files with the same
prefix.

.. _extending_datagrabbers_base:

Option B: Extending from BaseDataGrabber
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

While we could not think of a use case in which the pattern-based DataGrabber
would not be suitable, it is still possible to create a DataGrabber extending
from the :class:`.BaseDataGrabber` class.
In order to create a DataGrabber extending from :class:`.BaseDataGrabber`, we
need to implement the following methods:

- ``get_item``: to get a single item from the dataset.
- ``get_elements``: to get the list of all elements present in the dataset.
- ``get_element_keys``: to get the keys of the elements in the dataset.

.. note:: The ``__init__`` method could also be implemented, but it is not
   mandatory. It is only needed if the DataGrabber requires any extra
   parameter.

We will now implement our BIDS example with this method.

The first method, ``get_item``, needs to obtain a single item from the dataset.
Since this dataset requires two variables, ``subject`` and ``session``, we will
use them as parameters of ``get_item``:

.. code-block:: python

    def get_item(self, subject: str, session: str) -> dict[str, dict[str, str]]:
        out = {
            "T1w": {
                "path": f"{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                "space": "native",
            },
            "BOLD": {
                "path": f"{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                "space": "MNI152NLin6Asym",
            },
        }
        return out

The second method, ``get_elements``, needs to return a list of all the elements
in the dataset. In this case, we know which subjects and sessions the dataset
contains, so we can create a list of all the possible combinations (for brevity,
only the first 3 of the 9 subjects are listed below). However, we need to
remember that for session *ses-03* there is no BOLD data.

.. code-block:: python

    from itertools import product


    def get_elements(self) -> list[dict[str, str]]:
        subjects = ["sub-01", "sub-02", "sub-03"]
        sessions = ["ses-01", "ses-02"]

        # If we are not working on BOLD data, we can add "ses-03"
        if "BOLD" not in self.types:
            sessions.append("ses-03")
        elements = []
        # Build one element per (subject, session) combination
        for subject, session in product(subjects, sessions):
            elements.append({"subject": subject, "session": session})
        return elements

And finally, we can implement the ``get_element_keys`` method. This method needs
to return a list of the keys that represent each of the items in the element
tuple.
As a rule of thumb, they should be the parameters of the ``get_item`` method,
in the same order.

.. code-block:: python

    def get_element_keys(self) -> list[str]:
        return ["subject", "session"]

So, to summarise, our DataGrabber will look like this:

.. code-block:: python

    from junifer.api.decorators import register_datagrabber
    from junifer.datagrabber import BaseDataGrabber


    @register_datagrabber
    class ExampleBIDSDataGrabber(BaseDataGrabber):
        def get_item(
            self, subject: str, session: str
        ) -> dict[str, dict[str, str]]:
            out = {
                "T1w": {
                    "path": f"{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                    "space": "native",
                },
                "BOLD": {
                    "path": f"{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                    "space": "MNI152NLin6Asym",
                },
            }
            return out

        def get_elements(self) -> list[dict[str, str]]:
            subjects = ["sub-01", "sub-02", "sub-03"]
            sessions = ["ses-01", "ses-02"]

            # If we are not working on BOLD data, we can add "ses-03"
            if "BOLD" not in self.types:
                sessions.append("ses-03")
            elements = []
            for subject in subjects:
                for session in sessions:
                    elements.append({"subject": subject, "session": session})
            return elements

        def get_element_keys(self) -> list[str]:
            return ["subject", "session"]

Optional: Using datalad
~~~~~~~~~~~~~~~~~~~~~~~

If the data is stored in a datalad dataset, we can extend from
:class:`.DataladDataGrabber` instead of :class:`.BaseDataGrabber`. This will
allow us to use the datalad API to obtain the data.

Step 4: Optional: Adding *BOLD confounds*
-----------------------------------------

For some analyses, it is useful to have the confounds associated with the BOLD
data. This corresponds to the ``BOLD.confounds`` item in the
:ref:`Data Object ` (see :ref:`data_types`). However, the ``BOLD.confounds``
element does not consist only of a ``path``; it also requires more information
about the format of the confounds file. Thus, the ``BOLD.confounds`` element is
a dictionary with the following keys:

- ``path``: the path to the confounds file.
- ``format``: the format of the confounds file. Currently, this can be either
  ``fmriprep`` or ``adhoc``. The ``fmriprep`` format corresponds to the format
  of the confounds files generated by `fMRIPrep`_. The ``adhoc`` format
  corresponds to a format that is not standardised.
- ``mappings``: a dictionary with the mappings between the *ad-hoc* variable
  names and the ``fmriprep`` variable names.

.. note:: The ``mappings`` key is required only if the ``format`` is ``adhoc``.
   It is not needed for the ``fmriprep`` format.

Currently, ``junifer`` provides only one confound remover step
(:class:`.fMRIPrepConfoundRemover`), which relies entirely on the ``fmriprep``
confound variable names. Thus, if the confounds are not in ``fmriprep`` format,
the user will need to provide the mappings between the *ad-hoc* variable names
and the ``fmriprep`` variable names. This is done by specifying the ``adhoc``
format and providing the mappings as a dictionary in the ``mappings`` key.

In the following example, the confounds file has 3 variables that do not use
the ``fmriprep`` names, so we provide the mappings from these variables to the
``fmriprep`` names. The ``get_item`` method could look like this:

.. code-block:: python

    def get_item(self, subject: str, session: str) -> dict:
        out = {
            "BOLD": {
                "path": f"{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                "space": "MNI152NLin6Asym",
                "confounds": {
                    "path": f"{subject}/{session}/func/{subject}_{session}_confounds.tsv",
                    "format": "adhoc",
                    # Keys are the ad-hoc names, values the fMRIPrep names
                    "mappings": {
                        "fmriprep": {
                            "variable1": "rot_x",
                            "variable2": "rot_z",
                            "variable3": "rot_y",
                        },
                    },
                },
            },
        }
        return out

.. note:: Not all of the mappings need to be provided. For the moment, this is
   used only by the :class:`.fMRIPrepConfoundRemover` step, which requires
   variables based on the strategy selected. However, it is recommended to
   provide all the mappings, as this will allow the user to choose different
   strategies with the same dataset.
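
To make the direction of the mapping concrete, the following is a minimal
sketch of how such a ``mappings`` dictionary could be applied to the columns of
an *ad-hoc* confounds file. It is not how ``junifer`` implements this
internally: ``rename_confounds`` is a hypothetical helper, and the two-row TSV
content is made up for illustration. Only the Python standard library is used.

.. code-block:: python

    import csv
    import io

    # Same shape as the ``mappings`` entry in the example above:
    # ad-hoc column name -> fMRIPrep column name.
    mappings = {
        "fmriprep": {
            "variable1": "rot_x",
            "variable2": "rot_z",
            "variable3": "rot_y",
        },
    }


    def rename_confounds(
        tsv_text: str, mapping: dict[str, str]
    ) -> list[dict[str, str]]:
        """Read a TSV and rename its columns (hypothetical helper)."""
        reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
        # Columns without a mapping keep their original name
        return [
            {mapping.get(name, name): value for name, value in row.items()}
            for row in reader
        ]


    # Made-up confounds file with ad-hoc column names, for illustration only
    example_tsv = (
        "variable1\tvariable2\tvariable3\n"
        "0.1\t0.2\t0.3\n"
        "0.4\t0.5\t0.6\n"
    )
    rows = rename_confounds(example_tsv, mappings["fmriprep"])
    print(sorted(rows[0]))  # ['rot_x', 'rot_y', 'rot_z']

After renaming, the values originally stored under ``variable2`` are found
under ``rot_z``, which is the name a step relying on ``fmriprep`` variable names
would look for.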