7.2. Creating Data Grabbers

Data Grabbers are the first step of the pipeline. Their purpose is to interpret the structure of a dataset and provide two specific functionalities:

Given an element, provide the path to each kind of data available for this element (e.g. the path to the T1 image, the path to the T2 image, etc.)
Provide the list of elements available in the dataset.

In this section, we will see how to create a DataGrabber for a dataset. Basic aspects of DataGrabbers are covered in the Understanding Data Grabbers section.

7.2.1. Step 1: Think about the element

Like with any programming-related task, the first step is to think. When creating a DataGrabber, we need to first define what an element is. The element should be the smallest unit of data that can be processed. That is, for each element, there should be a set of data that can be processed, but only one of each data type (see Data Types).

For example, if we have a dataset from an fMRI study in which:

both T1w and fMRI were acquired
20 subjects went through an experiment twice
the experiment included resting-state fMRI and a task named stroop

then the element should be composed of 3 items:

subject: The subject IDs, e.g. sub001, sub002, … sub020
session: The session number, e.g. ses1, ses2
task: The task performed, e.g. rest, stroop

If any of these items were not part of the element, then we would have more than one T1w and/or BOLD image for each element, which is not allowed.

Importantly, nothing prevents one image from being part of two different elements. For example, it is usually the case that the T1w image is not acquired for each task, but once in the entire session. So in this case, the T1w image for the element (sub001, ses1, rest) will be the same as the T1w image for the element (sub001, ses1, stroop).
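
As a rough illustration of the example above, the elements can be enumerated as the product of subjects, sessions and tasks. The file layout and the helper names (t1w_path, bold_path) are hypothetical and not part of junifer; the sketch only shows that a T1w path which ignores the task is shared by two elements.

```python
from itertools import product

# The three items that make up an element in the example study.
subjects = [f"sub{i:03d}" for i in range(1, 21)]  # sub001 ... sub020
sessions = ["ses1", "ses2"]
tasks = ["rest", "stroop"]

# One element per (subject, session, task) combination.
elements = list(product(subjects, sessions, tasks))


def t1w_path(subject: str, session: str, task: str) -> str:
    """Hypothetical layout: the T1w image is acquired once per session."""
    # The task is deliberately unused, so both tasks share one image.
    return f"{subject}/{session}/anat/T1w.nii.gz"


def bold_path(subject: str, session: str, task: str) -> str:
    """Hypothetical layout: one BOLD image per task."""
    return f"{subject}/{session}/func/{task}_bold.nii.gz"


print(len(elements))  # 20 subjects x 2 sessions x 2 tasks = 80
print(t1w_path("sub001", "ses1", "rest") == t1w_path("sub001", "ses1", "stroop"))
```

Each element still resolves to exactly one T1w and one BOLD image, which is the constraint that defines an element.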

For the rest of this section, we will use as an example a dataset in BIDS format in which 9 subjects (sub-01 to sub-09) were each scanned in 3 sessions (ses-01, ses-02, ses-03). Each session includes a T1w image and a resting-state BOLD image, except for ses-03, which contains only anatomical data.

7.2.2. Step 2: Think about the dataset’s structure

Now that we have our element defined, we need to think about the structure of the dataset, mainly because it determines how the DataGrabber needs to be implemented.

junifer provides a concrete class to deal with datasets that can be described in terms of patterns. A pattern is a string that contains placeholders that are replaced by the actual values of the element. In our BIDS example, the path to the T1w image of subject sub-01 and session ses-01, relative to the dataset location, is sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz. By replacing sub-01 with sub-02, we can obtain the T1w image of the first session of the second subject. Indeed, the path to the T1w images can be expressed as a pattern:

{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz

where {subject} is the replacement for the subject id and {session} is the replacement for the session id.
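
The substitution described above can be sketched with Python's built-in str.format. This only illustrates the idea of a pattern; the helper name resolve is hypothetical, and junifer's own resolution of patterns may work differently.

```python
pattern = "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz"


def resolve(pattern: str, **element: str) -> str:
    """Replace each {placeholder} with the matching value of the element."""
    return pattern.format(**element)


first = resolve(pattern, subject="sub-01", session="ses-01")
second = resolve(pattern, subject="sub-02", session="ses-01")

print(first)   # sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz
print(second)  # sub-02/ses-01/anat/sub-02_ses-01_T1w.nii.gz
```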

Since it is a BIDS dataset, the same happens with the BOLD images. The path to the BOLD images can be expressed as a pattern:

{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz

Most datasets can be described this way. If your dataset can be expressed in terms of patterns, then follow Step 3: Create a Data Grabber. Otherwise, we recommend that you take time to rethink your dataset structure and why it does not have clear patterns. Feel free to open a discussion in the junifer Discussions page. Most probably we can help you get your dataset in order.

If there is no other way, then you can follow Option B: Extending from BaseDataGrabber to create a DataGrabber from scratch.

7.2.3. Step 3: Create a Data Grabber

Option A: Extending from PatternDataGrabber

The PatternDataGrabber class is a concrete class that has the functionality of understanding patterns embedded in it.

Before creating the DataGrabber, we need to define 3 variables:

types: A list with the available Data Types in our dataset.
patterns: A dictionary that specifies the pattern and some additional information for each data type.
replacements: A list indicating which of the elements in the patterns should be replaced by the values of the element.

For example, in our BIDS example, the variables will be:

types = ["T1w", "BOLD"]
patterns = {
    "T1w": {
        "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
        "space": "native",
    },
    "BOLD": {
        "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
        "space": "MNI152NLin6Asym",
    },
}
replacements = ["subject", "session"]
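
Before passing these variables on, it can help to check that every placeholder used in a pattern is listed in replacements. The following is a minimal sketch using only the standard library; the helpers placeholders and unknown_placeholders are hypothetical names, not junifer functions, and junifer may perform its own validation.

```python
from string import Formatter

types = ["T1w", "BOLD"]
patterns = {
    "T1w": {
        "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
        "space": "native",
    },
    "BOLD": {
        "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
        "space": "MNI152NLin6Asym",
    },
}
replacements = ["subject", "session"]


def placeholders(pattern: str) -> set[str]:
    """Return the names of the {placeholders} found in a pattern."""
    return {field for _, field, _, _ in Formatter().parse(pattern) if field}


def unknown_placeholders(
    types: list[str], patterns: dict, replacements: list[str]
) -> dict[str, set[str]]:
    """For each data type, list placeholders that are not in replacements."""
    known = set(replacements)
    return {t: placeholders(patterns[t]["pattern"]) - known for t in types}


problems = unknown_placeholders(types, patterns, replacements)
print(problems)  # {'T1w': set(), 'BOLD': set()}
```

An empty set for every data type means that the patterns and the replacements agree.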

An additional fourth variable is the datadir, which should be the path to where the dataset is located. For example, if the dataset is located in /data/project/test/data, then datadir should be /data/project/test/data. Or, if we want to allow the user to specify the location of the dataset, we can expose the variable in the constructor, as in the following example.

With the variables defined above, we can create our DataGrabber and name it ExampleBIDSDataGrabber:

from pathlib import Path

from junifer.datagrabber import PatternDataGrabber


class ExampleBIDSDataGrabber(PatternDataGrabber):
    def __init__(self, datadir: str | Path) -> None:
        types = ["T1w", "BOLD"]
        patterns = {
            "T1w": {
                "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                "space": "native",
            },
            "BOLD": {
                "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                "space": "MNI152NLin6Asym",
            },
        }
        replacements = ["subject", "session"]
        super().__init__(
            datadir=datadir,
            types=types,
            patterns=patterns,
            replacements=replacements,
        )

Our DataGrabber is ready to be used by junifer. However, it is still unknown to the library. We need to register it in the library. To do so, we need to use the register_datagrabber() decorator.

from pathlib import Path

from junifer.api.decorators import register_datagrabber
from junifer.datagrabber import PatternDataGrabber


@register_datagrabber
class ExampleBIDSDataGrabber(PatternDataGrabber):
    def __init__(self, datadir: str | Path) -> None:
        types = ["T1w", "BOLD"]
        patterns = {
            "T1w": {
                "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                "space": "native",
            },
            "BOLD": {
                "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                "space": "MNI152NLin6Asym",
            },
        }
        replacements = ["subject", "session"]
        super().__init__(
            datadir=datadir,
            types=types,
            patterns=patterns,
            replacements=replacements,
        )

Now, we can use our DataGrabber in junifer, by setting the datagrabber kind in the yaml file to ExampleBIDSDataGrabber. Remember that we still need to set the datadir.

datagrabber:
   kind: ExampleBIDSDataGrabber
   datadir: /data/project/test/data

Optional: Using datalad

If you are using datalad, you can use the PatternDataladDataGrabber instead of the PatternDataGrabber. This class will not only interpret patterns, but also use datalad to clone and get the data.

The main difference between the two is that the datadir is not the actual location of the dataset, but the location where the dataset will be cloned. It can now be None, which means that the data will be downloaded to a temporary directory. To set the location of the dataset, you can use the uri argument in the constructor. Additionally, a rootdir argument can be used to specify the path to the root directory of the dataset after doing datalad clone.

In the example, the dataset is hosted in Gin (https://gin.g-node.org/juaml/datalad-example-bids).

When we clone this dataset, we will see the following structure:

.
└── example_bids_ses
   ├── sub-01
   │   ├── ses-01
   │   ├── ses-02
   │   └── ses-03
   ├── sub-02
   │   ├── ses-01
   │   ├── ses-02
   │   └── ses-03
   ├── sub-03
   ...

So the patterns will start after example_bids_ses. This is our rootdir.

Now we have our 2 additional variables:

uri = "https://gin.g-node.org/juaml/datalad-example-bids"
rootdir = "example_bids_ses"

And we can create our DataGrabber:

from junifer.api.decorators import register_datagrabber
from junifer.datagrabber import PatternDataladDataGrabber


@register_datagrabber
class ExampleBIDSDataGrabber(PatternDataladDataGrabber):
    def __init__(self) -> None:
        types = ["T1w", "BOLD"]
        patterns = {
            "T1w": {
                "pattern": "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                "space": "native",
            },
            "BOLD": {
                "pattern": "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                "space": "MNI152NLin6Asym",
            },
        }
        replacements = ["subject", "session"]
        uri = "https://gin.g-node.org/juaml/datalad-example-bids"
        rootdir = "example_bids_ses"
        super().__init__(
            datadir=None,
            uri=uri,
            rootdir=rootdir,
            types=types,
            patterns=patterns,
            replacements=replacements,
        )

This approach can be used directly from the YAML, like so:

datagrabber:
  kind: PatternDataladDataGrabber
  types:
    - BOLD
    - T1w
  patterns:
    BOLD:
      pattern: "{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz"
      space: MNI152NLin6Asym
    T1w:
      pattern: "{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz"
      space: native
  replacements:
    - subject
    - session
  uri: "https://gin.g-node.org/juaml/datalad-example-bids"
  rootdir: example_bids_ses

Advanced: Using Unix-like path expansion directives

It is also possible to use some advanced Unix-like path expansion tricks to define our patterns.

A very common approach is to use * to match any number of characters. However, it cannot be used right after a replacement, as in:

"derivatives/freesurfer/{subject}*"

nor when the glob would match multiple files or no files at all.

We can also use [] and [!] to glob certain tricky files like with the case of FreeSurfer derivatives. The file structure seen in a typical FreeSurfer derivative of a dataset (like AOMIC ones) is like so:

.
└── derivatives
    └── freesurfer
        ├── fsaverage
        │   ├── mri
        │   │   ├── T1.mgz
        │   │   └── ...
        │   └── ...
        ├── sub-01
        │   ├── mri
        │   │   ├── T1.mgz
        │   │   └── ...
        │   └── ...
        ...

With a structure like this, it would be cumbersome to write custom methods for the class and thus we could use a pattern like this:

"derivatives/freesurfer/[!f]{subject}/mri/T1.mg[z]"

This pattern skips the fsaverage directory, so it is not treated as a subject, and fetches exactly T1.mgz even though other files may share the same prefix.
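
The behaviour of [!f] and [z] can be tried out with Python's fnmatch module, which implements the same shell-style wildcards. This sketch only demonstrates the wildcard semantics on made-up names (T1.mgh and T1w.mgz are invented for the example); it does not show how junifer resolves the pattern internally.

```python
from fnmatch import fnmatchcase

# [!f] matches any single character except "f", so "fsaverage" is skipped.
directories = ["fsaverage", "sub-01", "sub-02"]
subject_dirs = [d for d in directories if fnmatchcase(d, "[!f]*")]
print(subject_dirs)  # ['sub-01', 'sub-02']

# [z] matches exactly the character "z", so only T1.mgz is selected.
files = ["T1.mgz", "T1.mgh", "T1w.mgz"]
matches = [f for f in files if fnmatchcase(f, "T1.mg[z]")]
print(matches)  # ['T1.mgz']
```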

Option B: Extending from BaseDataGrabber

While we could not think of a use case in which the pattern-based DataGrabber would not be suitable, it is still possible to create a DataGrabber extending from the BaseDataGrabber class.

In order to create a DataGrabber extending from BaseDataGrabber, we need to implement the following methods:

get_item: to get a single item from the dataset.
get_elements: to get the list of all elements present in the dataset.
get_element_keys: to get the keys of the elements in the dataset.

Note

Implementing the __init__ method is not mandatory. It is needed only if the DataGrabber requires extra parameters.

We will now implement our BIDS example with this method.

The first method, get_item, needs to obtain a single item from the dataset. Since this dataset requires two variables, subject and session, we will use them as parameters of get_item:

def get_item(self, subject: str, session: str) -> dict[str, dict[str, str]]:
    out = {
        "T1w": {
            "path": f"{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
            "space": "native",
        },
        "BOLD": {
            "path": f"{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
            "space": "MNI152NLin6Asym",
        },
    }
    return out

The second method, get_elements, needs to return a list of all the elements in the dataset. In this case, we know that the dataset contains 9 subjects and 3 sessions, so we can create a list of all the possible combinations. However, we need to remember that for session ses-03 there is no BOLD data.

from itertools import product


def get_elements(self) -> list[dict[str, str]]:
    # The dataset contains sub-01 to sub-09
    subjects = [f"sub-{i:02d}" for i in range(1, 10)]
    sessions = ["ses-01", "ses-02"]

    # If we are not working on BOLD data, we can add "ses-03"
    if "BOLD" not in self.types:
        sessions.append("ses-03")
    elements = []
    for subject, session in product(subjects, sessions):
        elements.append({"subject": subject, "session": session})
    return elements

And finally, we can implement the get_element_keys method. This method needs to return a list of the keys that represent each of the items in the element tuple. As a rule of thumb, they should be the parameters of the get_item method, in the same order.

def get_element_keys(self) -> list[str]:
    return ["subject", "session"]
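
The rule of thumb above can be checked mechanically with inspect.signature. The class below is a hypothetical stand-in that does not inherit from BaseDataGrabber, so that the sketch runs without junifer installed; it only shows how the parameters of get_item can be compared with the keys.

```python
import inspect


class SketchDataGrabber:
    """Hypothetical stand-in; a real DataGrabber would extend BaseDataGrabber."""

    def get_item(self, subject: str, session: str) -> dict[str, dict[str, str]]:
        return {
            "T1w": {
                "path": f"{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                "space": "native",
            },
        }

    def get_element_keys(self) -> list[str]:
        return ["subject", "session"]


grabber = SketchDataGrabber()

# The signature of a bound method does not include "self".
item_params = list(inspect.signature(grabber.get_item).parameters)
keys_match = item_params == grabber.get_element_keys()
print(item_params)  # ['subject', 'session']
print(keys_match)   # True
```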

So, to summarise, our DataGrabber will look like this:

from junifer.api.decorators import register_datagrabber
from junifer.datagrabber import BaseDataGrabber


@register_datagrabber
class ExampleBIDSDataGrabber(BaseDataGrabber):
    def get_item(
        self, subject: str, session: str
    ) -> dict[str, dict[str, str]]:
        out = {
            "T1w": {
                "path": f"{subject}/{session}/anat/{subject}_{session}_T1w.nii.gz",
                "space": "native",
            },
            "BOLD": {
                "path": f"{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
                "space": "MNI152NLin6Asym",
            },
        }
        return out

    def get_elements(self) -> list[dict[str, str]]:
        # The dataset contains sub-01 to sub-09
        subjects = [f"sub-{i:02d}" for i in range(1, 10)]
        sessions = ["ses-01", "ses-02"]

        # If we are not working on BOLD data, we can add "ses-03"
        if "BOLD" not in self.types:
            sessions.append("ses-03")
        elements = []
        for subject in subjects:
            for session in sessions:
                elements.append({"subject": subject, "session": session})
        return elements

    def get_element_keys(self) -> list[str]:
        return ["subject", "session"]

Optional: Using datalad

If this dataset is in a datalad dataset, we can extend from DataladDataGrabber instead of BaseDataGrabber. This will allow us to use the datalad API to obtain the data.

7.2.4. Step 4: Optional: Adding BOLD confounds

For some analyses, it is useful to have the confounds associated with the BOLD data. This corresponds to the BOLD.confounds item in the Data Object (see Data Types). However, the BOLD.confounds item consists of more than a path: it also requires information about the format of the confounds file. Thus, the BOLD.confounds item is a dictionary with the following keys:

path: the path to the confounds file.
format: the format of the confounds file. Currently, this can be either fmriprep or adhoc.
mappings: a dictionary that maps the ad-hoc variable names to the fmriprep variable names.

The fmriprep format corresponds to the format of the confounds files generated by fMRIPrep. The adhoc format corresponds to a format that is not standardised.

Note

The mappings key is only required if the format is adhoc. If the format is fmriprep, the mappings key is not required.

Currently, junifer provides only one confound remover step (fMRIPrepConfoundRemover), which relies entirely on the fmriprep confound variable names. Thus, if the confounds are not in fmriprep format, the user will need to provide the mappings between the ad-hoc variable names and the fmriprep variable names. This is done by specifying the adhoc format and providing the mappings as a dictionary in the mappings key.

In the following example, the confounds file has 3 variables that are not in the fmriprep format. Thus, we will provide the mappings for these variables to the fmriprep format. For example, the get_item method could look like this:

def get_item(self, subject: str, session: str) -> dict:
    out = {
        "BOLD": {
            "path": f"{subject}/{session}/func/{subject}_{session}_task-rest_bold.nii.gz",
            "space": "MNI152NLin6Asym",
            "confounds": {
                "path": f"{subject}/{session}/func/{subject}_{session}_confounds.tsv",
                "format": "adhoc",
                "mappings": {
                    "fmriprep": {
                        "variable1": "rot_x",
                        "variable2": "rot_z",
                        "variable3": "rot_y",
                    },
                },
            },
        },
    }
    return out

Note

Not all of the mappings need to be provided. For the moment, this is used only by the fMRIPrepConfoundRemover step, which requires variables based on the strategy selected. However, it is recommended to provide all the mappings, as this will allow the user to choose different strategies with the same dataset.
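
To make the role of the mappings more concrete, here is a minimal sketch of how ad-hoc column names could be translated to fmriprep names, and how to find out which variables a strategy would still be missing. The helper names and the list of required variables are assumptions made for illustration; this is not how fMRIPrepConfoundRemover is implemented.

```python
mappings = {
    "fmriprep": {
        "variable1": "rot_x",
        "variable2": "rot_z",
        "variable3": "rot_y",
    },
}


def to_fmriprep_names(columns: list[str], mappings: dict) -> list[str]:
    """Rename ad-hoc columns to fmriprep names; unmapped names are kept."""
    table = mappings["fmriprep"]
    return [table.get(name, name) for name in columns]


def missing_variables(required: list[str], mappings: dict) -> list[str]:
    """Return the required fmriprep variables that no mapping provides."""
    provided = set(mappings["fmriprep"].values())
    return [name for name in required if name not in provided]


columns = ["variable1", "variable2", "variable3"]
renamed = to_fmriprep_names(columns, mappings)
print(renamed)  # ['rot_x', 'rot_z', 'rot_y']

# Hypothetical strategy that needs one variable that is not mapped.
required = ["rot_x", "rot_y", "rot_z", "trans_x"]
missing = missing_variables(required, mappings)
print(missing)  # ['trans_x']
```

A non-empty result from missing_variables indicates that the chosen strategy cannot be applied until the remaining mappings are provided.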