6.7. Connectome-based Predictive Modeling (CBPM)

Applications of machine learning in neuroimaging research typically have to deal with high-dimensional input features. This can be problematic due to the curse of dimensionality, especially when sample sizes are also small. Connectome-based predictive modeling (CBPM) has been proposed as an approach to deal with this problem [4] in regression settings. It has been used to predict fluid intelligence [5] as well as sustained attention [6] from functional brain connectivity.

In a nutshell, CBPM consists of:

  1. Feature selection

  2. Feature aggregation

  3. Model building

In CBPM, a feature is selected if its correlation with the target is significant at a specified significance threshold alpha. The selected features are then summarized by an aggregation function, and the aggregated features are used to fit a machine learning model. A linear model is the most common choice, but in principle any other machine learning model could be used.
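The three steps above can be sketched in a few lines of NumPy, SciPy, and scikit-learn. This is a minimal illustration under stated assumptions, not julearn's implementation: the helper name cbpm_aggregate, the synthetic data, and the choice of Pearson correlation with a sum as aggregation are all picked for the example.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression


def cbpm_aggregate(X, y, alpha=0.01):
    """Hypothetical helper: select features by p-value, then sum them per sign."""
    n_features = X.shape[1]
    r = np.empty(n_features)
    p = np.empty(n_features)

    # 1. Feature selection: correlate every column of X with the target
    for j in range(n_features):
        r[j], p[j] = pearsonr(X[:, j], y)
    pos_mask = (p < alpha) & (r > 0)
    neg_mask = (p < alpha) & (r < 0)

    # 2. Feature aggregation: one summed feature per non-empty group
    groups = [X[:, mask].sum(axis=1) for mask in (pos_mask, neg_mask) if mask.any()]
    if not groups:
        raise ValueError("No feature passed the significance threshold.")
    return np.column_stack(groups), pos_mask, neg_mask


# Synthetic data (an assumption for the example)
X, y = make_regression(
    n_features=20, n_informative=5, n_samples=200, noise=10.0, random_state=0
)
X_agg, pos_mask, neg_mask = cbpm_aggregate(X, y, alpha=0.01)

# 3. Model building: fit a linear model on the aggregated features
model = LinearRegression().fit(X_agg, y)
print(X_agg.shape, int(pos_mask.sum()), int(neg_mask.sum()))
```

Note that selecting features on the full dataset, as done here for brevity, would leak information in a real analysis. The selection has to be repeated inside each cross-validation fold, which is what wrapping it in a transformer achieves.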

CBPM in julearn

julearn implements a simple, scikit-learn-compatible transformer ("cbpm") that performs the first two parts of this approach, i.e., feature selection and feature aggregation. Using julearn’s PipelineCreator, you can therefore apply the "cbpm" transformer as a preprocessing step and then use any scikit-learn-compatible estimator for the model-building part.

For example, to build a simple CBPM workflow, you can create a pipeline and run a cross-validation as follows:

# Authors: Leonard Sasse <l.sasse@fz-juelich.de>
# License: AGPL

from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator

from sklearn.datasets import make_regression
import pandas as pd

# Prepare data
X, y = make_regression(n_features=20, n_samples=200)

# Make dataframe with named feature columns and the target
X_names = [f"feature_{x}" for x in range(1, 21)]
data = pd.DataFrame(X, columns=X_names)
data["target"] = y

# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm")
cbpm_pipeline_creator.add("linreg")

# Cross-validate the cbpm pipeline
scores, final_model = run_cross_validation(
    data=data,
    X=X_names,
    y="target",
    model=cbpm_pipeline_creator,
    return_estimator="all",
)
2024-04-29 11:45:55,360 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,360 - julearn - INFO - Step added
2024-04-29 11:45:55,360 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,360 - julearn - INFO - Step added
2024-04-29 11:45:55,360 - julearn - INFO - ==== Input Data ====
2024-04-29 11:45:55,360 - julearn - INFO - Using dataframe as input
2024-04-29 11:45:55,360 - julearn - INFO -      Features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-04-29 11:45:55,360 - julearn - INFO -      Target: target
2024-04-29 11:45:55,361 - julearn - INFO -      Expanded features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-04-29 11:45:55,361 - julearn - INFO -      X_types:{}
2024-04-29 11:45:55,361 - julearn - WARNING - The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:507: RuntimeWarning: The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
  warn_with_log(
2024-04-29 11:45:55,362 - julearn - INFO - ====================
2024-04-29 11:45:55,362 - julearn - INFO -
2024-04-29 11:45:55,363 - julearn - INFO - = Model Parameters =
2024-04-29 11:45:55,363 - julearn - INFO - ====================
2024-04-29 11:45:55,363 - julearn - INFO -
2024-04-29 11:45:55,363 - julearn - INFO - = Data Information =
2024-04-29 11:45:55,363 - julearn - INFO -      Problem type: regression
2024-04-29 11:45:55,363 - julearn - INFO -      Number of samples: 200
2024-04-29 11:45:55,363 - julearn - INFO -      Number of features: 20
2024-04-29 11:45:55,363 - julearn - INFO - ====================
2024-04-29 11:45:55,363 - julearn - INFO -
2024-04-29 11:45:55,363 - julearn - INFO -      Target type: float64
2024-04-29 11:45:55,363 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
2024-04-29 11:45:55,380 - julearn - WARNING - No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
/home/runner/work/julearn/julearn/julearn/transformers/cbpm.py:267: RuntimeWarning: No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
  warn_with_log(
2024-04-29 11:45:55,468 - julearn - INFO - Fitting final model
2024-04-29 11:45:55,483 - julearn - WARNING - No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
/home/runner/work/julearn/julearn/julearn/transformers/cbpm.py:267: RuntimeWarning: No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
  warn_with_log(

By default, the "cbpm" transformer performs feature selection using the Pearson correlation between each feature and the target, and selects the features for which the p-value of the correlation falls below the default significance threshold of 0.01. It then groups the selected features into positively and negatively correlated features and sums the features within each group using numpy.sum(). If one group is empty, as in the run above, where no feature had a significant negative correlation, only the other group is used. Otherwise, the linear model is fitted on two features:

  1. Sum of features that are positively correlated to the target

  2. Sum of features that are negatively correlated to the target

The pipeline creator also makes it easy to customise the parameters of the "cbpm" transformer. For example, to use a different significance threshold during feature selection, set the significance_threshold keyword, here raising it to 0.05:

# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm", significance_threshold=0.05)
cbpm_pipeline_creator.add("linreg")

print(cbpm_pipeline_creator)
2024-04-29 11:45:55,486 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,486 - julearn - INFO - Setting hyperparameter significance_threshold = 0.05
2024-04-29 11:45:55,486 - julearn - INFO - Step added
2024-04-29 11:45:55,486 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,486 - julearn - INFO - Step added
PipelineCreator:
  Step 0: cbpm
    estimator:     CBPM()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: linreg
    estimator:     LinearRegression()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

julearn can also tune this threshold as a hyperparameter in a nested cross-validation. Simply pass an iterable of candidate values:

# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm", significance_threshold=[0.01, 0.05])
cbpm_pipeline_creator.add("linreg")

print(cbpm_pipeline_creator)
2024-04-29 11:45:55,487 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,487 - julearn - INFO - Tuning hyperparameter significance_threshold = [0.01, 0.05]
2024-04-29 11:45:55,487 - julearn - INFO - Step added
2024-04-29 11:45:55,487 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,487 - julearn - INFO - Step added
PipelineCreator:
  Step 0: cbpm
    estimator:     CBPM()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'cbpm__significance_threshold': [0.01, 0.05]}
  Step 1: linreg
    estimator:     LinearRegression()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
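To make the idea of tuning the threshold in a nested cross-validation concrete, here is a rough equivalent in plain scikit-learn. It is a sketch under assumptions: the class SumOfSignificantFeatures is a hypothetical stand-in written for this example (it is not julearn's CBPM class), and using GridSearchCV as the inner search is an assumption about the search strategy. Only the parameter name cbpm__significance_threshold is taken from the output above.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline


class SumOfSignificantFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical CBPM-style transformer, written for illustration only."""

    def __init__(self, significance_threshold=0.01):
        self.significance_threshold = significance_threshold

    def fit(self, X, y):
        X = np.asarray(X)
        r = np.empty(X.shape[1])
        p = np.empty(X.shape[1])
        for j in range(X.shape[1]):
            r[j], p[j] = pearsonr(X[:, j], y)
        significant = p < self.significance_threshold
        self.pos_mask_ = significant & (r > 0)
        self.neg_mask_ = significant & (r < 0)
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Always return two columns; an empty group sums to a column of zeros
        return np.column_stack(
            [X[:, self.pos_mask_].sum(axis=1), X[:, self.neg_mask_].sum(axis=1)]
        )


# Synthetic data (an assumption for the example)
X, y = make_regression(
    n_features=20, n_informative=5, n_samples=200, noise=10.0, random_state=0
)

pipeline = Pipeline(
    [("cbpm", SumOfSignificantFeatures()), ("linreg", LinearRegression())]
)
param_grid = {"cbpm__significance_threshold": [0.01, 0.05]}

# Inner loop chooses the threshold; outer loop estimates generalisation
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
search = GridSearchCV(pipeline, param_grid, cv=inner_cv)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print(outer_scores.shape)
```

Because the threshold is chosen using only the training portion of each outer fold, the outer scores are not biased by the hyperparameter search.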

In addition, you can customize the correlation method, the aggregation method, and the sign ("pos", "neg", or "posneg") of the feature-target correlations that should be selected. For example, a pipeline that specifies each of these parameters may look as follows:

import numpy as np
from scipy.stats import spearmanr

# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add(
    "cbpm",
    significance_threshold=0.05,
    corr_method=spearmanr,
    agg_method=np.average,
    corr_sign="pos",
)
cbpm_pipeline_creator.add("linreg")

print(cbpm_pipeline_creator)
2024-04-29 11:45:55,488 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,488 - julearn - INFO - Setting hyperparameter significance_threshold = 0.05
2024-04-29 11:45:55,488 - julearn - INFO - Setting hyperparameter corr_method = <function spearmanr at 0x7fe30c0d8dc0>
2024-04-29 11:45:55,488 - julearn - INFO - Setting hyperparameter agg_method = <function average at 0x7fe32db4ad40>
2024-04-29 11:45:55,488 - julearn - INFO - Setting hyperparameter corr_sign = pos
2024-04-29 11:45:55,488 - julearn - INFO - Step added
2024-04-29 11:45:55,488 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:55,488 - julearn - INFO - Step added
PipelineCreator:
  Step 0: cbpm
    estimator:     CBPM(agg_method=<function average at 0x7fe32db5c8f0>,
     corr_method=<function spearmanr at 0x7fe30c0d8dc0>, corr_sign='pos')
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: linreg
    estimator:     LinearRegression()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

As you may have guessed, this pipeline will use a Spearman correlation and a significance level of 0.05 for feature selection. It will only select features that are positively correlated to the target and aggregate them using the numpy.average() aggregation function.
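What this configuration amounts to can be sketched directly with SciPy and NumPy. The snippet below is an illustration under assumptions, not the transformer's actual code: the synthetic data and the variable names are invented for the example, and it only mirrors the described behaviour (Spearman correlation, threshold 0.05, positive correlations only, numpy.average() as aggregation).

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic data (an assumption for the example): the target depends
# monotonically, but not linearly, on the first two columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 3 + X[:, 1] ** 3 + rng.normal(scale=0.1, size=200)

alpha = 0.05
selected = []
for j in range(X.shape[1]):
    rho, p_value = spearmanr(X[:, j], y)
    # Positive sign only: keep significant positive correlations
    if p_value < alpha and rho > 0:
        selected.append(j)

# Aggregate the selected columns into a single feature per sample
X_agg = np.average(X[:, selected], axis=1)
print(selected, X_agg.shape)
```

Because Spearman correlation works on ranks, it can pick up monotonic but non-linear relationships such as the cubic terms used here.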
