3.7. Connectome-based Predictive Modeling (CBPM)
Applications of machine learning in neuroimaging research typically have to deal with high-dimensional input features. This can be problematic due to the curse of dimensionality, especially when sample sizes are small. Connectome-based predictive modeling (CBPM) has been proposed as an approach to deal with this problem in regression settings [4]. It has been used to predict fluid intelligence [5] as well as sustained attention [6] from functional brain connectivity.
In a nutshell, CBPM consists of:
feature selection
feature aggregation
model building
In CBPM, features are selected if their correlation with the target is significant at a specified significance threshold alpha. The selected features are then summarized by an aggregation function, and the aggregated features are used to fit a machine learning model. Most commonly a linear model is used, but in principle any other machine learning model could take its place.
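The three steps can be sketched with NumPy and SciPy alone. This is only an illustration under simple assumptions, not julearn's implementation: the helper names `select_features` and `aggregate`, the synthetic data, and the single train/test split are all made up for this example. Note that the features are selected on the training samples only, so that no information from the test samples leaks into the model.

```python
import numpy as np
from scipy.stats import pearsonr


def select_features(X, y, alpha=0.05):
    """Return boolean masks of columns that correlate significantly with y.

    Hypothetical helper for illustration; one mask per correlation sign.
    """
    r = np.empty(X.shape[1])
    p = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        r[j], p[j] = pearsonr(X[:, j], y)
    significant = p < alpha
    return significant & (r > 0), significant & (r < 0)


def aggregate(X, pos_mask, neg_mask):
    """Sum the selected columns per sample, giving one column per sign."""
    return np.column_stack(
        [X[:, pos_mask].sum(axis=1), X[:, neg_mask].sum(axis=1)]
    )


rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
# Assumed ground truth: two features push the target up, one pushes it down.
y = X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=n_samples)

train, test = slice(0, 150), slice(150, None)

# 1. feature selection, on the training samples only
pos_mask, neg_mask = select_features(X[train], y[train])

# 2. feature aggregation, reusing the training masks for the test samples
Z_train = aggregate(X[train], pos_mask, neg_mask)
Z_test = aggregate(X[test], pos_mask, neg_mask)

# 3. model building: ordinary least squares with an intercept
A_train = np.column_stack([np.ones(len(Z_train)), Z_train])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
y_pred = np.column_stack([np.ones(len(Z_test)), Z_test]) @ coef

print("aggregated training features:", Z_train.shape)
print("correlation of predictions with held-out target:",
      np.corrcoef(y_pred, y[test])[0, 1])
```

Whatever the number of input features, the model in step 3 only ever sees two columns, which is how the approach sidesteps the high dimensionality of the input.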
CBPM in Julearn
Julearn implements a simple, scikit-learn compatible transformer (“cbpm”) that performs the first two parts of this approach, i.e. feature selection and feature aggregation. Using julearn’s PipelineCreator, one can apply the “cbpm” transformer as a preprocessing step and then use any sklearn-compatible estimator for the model-building part.
For example, to build a simple CBPM workflow, you can create a pipeline and run a cross-validation as follows:
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator
from sklearn.datasets import make_regression
import pandas as pd
# prepare some synthetic regression data
X, y = make_regression(n_features=20, n_samples=200)
# put the features and the target into a dataframe
X_names = [f"feature_{x}" for x in range(1, 21)]
data = pd.DataFrame(X, columns=X_names)
data["target"] = y
# prepare a pipeline creator:
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm")
cbpm_pipeline_creator.add("linreg")
# cross-validate the cbpm pipeline
scores, final_model = run_cross_validation(
    data=data,
    X=X_names,
    y="target",
    model=cbpm_pipeline_creator,
    return_estimator="all",
)
2023-07-19 12:42:16,778 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,778 - julearn - INFO - Step added
2023-07-19 12:42:16,778 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,779 - julearn - INFO - Step added
2023-07-19 12:42:16,779 - julearn - INFO - ==== Input Data ====
2023-07-19 12:42:16,779 - julearn - INFO - Using dataframe as input
2023-07-19 12:42:16,779 - julearn - INFO - Features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2023-07-19 12:42:16,779 - julearn - INFO - Target: target
2023-07-19 12:42:16,780 - julearn - INFO - Expanded features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2023-07-19 12:42:16,780 - julearn - INFO - X_types:{}
2023-07-19 12:42:16,780 - julearn - WARNING - The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/utils/logging.py:238: RuntimeWarning: The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
warn(msg, category=category)
2023-07-19 12:42:16,781 - julearn - INFO - ====================
2023-07-19 12:42:16,781 - julearn - INFO -
2023-07-19 12:42:16,782 - julearn - INFO - = Model Parameters =
2023-07-19 12:42:16,782 - julearn - INFO - ====================
2023-07-19 12:42:16,782 - julearn - INFO -
2023-07-19 12:42:16,782 - julearn - INFO - = Data Information =
2023-07-19 12:42:16,782 - julearn - INFO - Problem type: regression
2023-07-19 12:42:16,782 - julearn - INFO - Number of samples: 200
2023-07-19 12:42:16,782 - julearn - INFO - Number of features: 20
2023-07-19 12:42:16,782 - julearn - INFO - ====================
2023-07-19 12:42:16,782 - julearn - INFO -
2023-07-19 12:42:16,782 - julearn - INFO - Target type: float64
2023-07-19 12:42:16,782 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:42:16,804 - julearn - WARNING - No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
/home/runner/work/julearn/julearn/julearn/utils/logging.py:238: RuntimeWarning: No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
warn(msg, category=category)
By default the “cbpm” transformer performs feature selection using the Pearson correlation between each feature and the target, and selects the features for which the p-value of the correlation falls below the default significance threshold. Judging from the CBPM() representations printed further down, where an explicitly set threshold of 0.05 is not listed as a changed parameter, this default appears to be 0.05. The transformer then groups the selected features into negatively and positively correlated features and sums the features within each of these groups using numpy.sum(). That is, the linear model in this case is fitted on two features:
sum of features that are positively correlated to the target
sum of features that are negatively correlated to the target
If one of the two groups is empty, as in the warning logged above, only the other group is used.
The pipeline creator also makes it easy to customise the parameters of the “cbpm” transformer according to your needs. For example, the significance threshold used during feature selection can be set explicitly through the significance_threshold keyword, here to 0.05:
# prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm", significance_threshold=0.05)
cbpm_pipeline_creator.add("linreg")
print(cbpm_pipeline_creator)
2023-07-19 12:42:16,937 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,937 - julearn - INFO - Setting hyperparameter significance_threshold = 0.05
2023-07-19 12:42:16,937 - julearn - INFO - Step added
2023-07-19 12:42:16,937 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,938 - julearn - INFO - Step added
PipelineCreator:
Step 0: cbpm
estimator: CBPM()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: linreg
estimator: LinearRegression()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Julearn also allows the significance threshold to be tuned as a hyperparameter in a nested cross-validation. Simply pass an iterable of candidate values:
# prepare a pipeline creator:
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm", significance_threshold=[0.01, 0.05])
cbpm_pipeline_creator.add("linreg")
print(cbpm_pipeline_creator)
2023-07-19 12:42:16,938 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,939 - julearn - INFO - Tuning hyperparameter significance_threshold = [0.01, 0.05]
2023-07-19 12:42:16,939 - julearn - INFO - Step added
2023-07-19 12:42:16,939 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,939 - julearn - INFO - Step added
PipelineCreator:
Step 0: cbpm
estimator: CBPM()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'cbpm__significance_threshold': [0.01, 0.05]}
Step 1: linreg
estimator: LinearRegression()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
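To make concrete what tuning the threshold in a nested cross-validation involves, the following sketch reproduces the idea in plain scikit-learn. It makes no claim about how julearn builds its search internally. `SimpleCBPM` is a hypothetical stand-in written for this example, and the fold counts, noise level and random seed are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline


class SimpleCBPM(TransformerMixin, BaseEstimator):
    """Hypothetical minimal CBPM-like transformer (not julearn's class)."""

    def __init__(self, significance_threshold=0.05):
        self.significance_threshold = significance_threshold

    def fit(self, X, y):
        X = np.asarray(X)
        r, p = [], []
        for j in range(X.shape[1]):
            r_j, p_j = pearsonr(X[:, j], y)
            r.append(r_j)
            p.append(p_j)
        r, p = np.array(r), np.array(p)
        significant = p < self.significance_threshold
        self.pos_mask_ = significant & (r > 0)
        self.neg_mask_ = significant & (r < 0)
        return self

    def transform(self, X):
        X = np.asarray(X)
        # One column per sign; an empty group sums to a column of zeros.
        return np.column_stack(
            [X[:, self.pos_mask_].sum(axis=1), X[:, self.neg_mask_].sum(axis=1)]
        )


X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

pipeline = Pipeline([("cbpm", SimpleCBPM()), ("linreg", LinearRegression())])

# Inner loop: choose the threshold by cross-validation on the training folds.
search = GridSearchCV(
    pipeline,
    param_grid={"cbpm__significance_threshold": [0.01, 0.05]},
    cv=KFold(n_splits=3),
)

# Outer loop: estimate how well the whole tuning procedure generalises.
outer_scores = cross_val_score(search, X, y, cv=KFold(n_splits=5))
print("outer-fold R^2 scores:", outer_scores)

# Refit on all data to see which threshold the search settles on.
search.fit(X, y)
print("selected threshold:", search.best_params_["cbpm__significance_threshold"])
```

The parameter name `cbpm__significance_threshold` follows the same `step__parameter` convention that appears in the tuning params printed above.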
In addition, you can customise the correlation method, the aggregation method, and the sign (“pos”, “neg”, or “posneg”) of the feature-target correlations that should be selected. For example, a pipeline that specifies each of these parameters may look as follows:
import numpy as np
from scipy.stats import spearmanr
# prepare a pipeline creator:
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add(
    "cbpm",
    significance_threshold=0.05,
    corr_method=spearmanr,
    agg_method=np.average,
    corr_sign="pos",
)
cbpm_pipeline_creator.add("linreg")
print(cbpm_pipeline_creator)
2023-07-19 12:42:16,940 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,940 - julearn - INFO - Setting hyperparameter significance_threshold = 0.05
2023-07-19 12:42:16,940 - julearn - INFO - Setting hyperparameter corr_method = <function spearmanr at 0x7f7f67d97ac0>
2023-07-19 12:42:16,940 - julearn - INFO - Setting hyperparameter agg_method = <function average at 0x7f7f71ba0dc0>
2023-07-19 12:42:16,940 - julearn - INFO - Setting hyperparameter corr_sign = pos
2023-07-19 12:42:16,940 - julearn - INFO - Step added
2023-07-19 12:42:16,940 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:16,940 - julearn - INFO - Step added
PipelineCreator:
Step 0: cbpm
estimator: CBPM(agg_method=<function average at 0x7f7f71b9c8b0>,
corr_method=<function spearmanr at 0x7f7f67d97ac0>, corr_sign='pos')
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: linreg
estimator: LinearRegression()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
As you may have guessed, this pipeline will use a Spearman correlation and a significance level of 0.05 for feature selection. It will only select features that are positively correlated to the target and aggregate them using the numpy.average() aggregation function.
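As a rough illustration of how a correlation callable and an aggregation callable can be swapped in, the sketch below selects positively correlated features with scipy.stats.spearmanr and averages them with numpy.average. The helper `aggregate_positive` is hypothetical, and calling the correlation function once per column is an assumption of this sketch rather than a description of how julearn invokes corr_method.

```python
import numpy as np
from scipy.stats import spearmanr


def aggregate_positive(X, y, corr_method, agg_method, alpha=0.05):
    """Aggregate the columns of X that correlate positively with y.

    Hypothetical helper: corr_method(x, y) must return (correlation, p-value)
    and agg_method must accept an ``axis`` keyword.
    """
    r, p = [], []
    for j in range(X.shape[1]):
        r_j, p_j = corr_method(X[:, j], y)
        r.append(r_j)
        p.append(p_j)
    mask = (np.array(p) < alpha) & (np.array(r) > 0)
    return mask, agg_method(X[:, mask], axis=1)


rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
# Assumed ground truth: features 0 and 1 raise the target, feature 2 lowers it.
y = X[:, 0] + X[:, 1] - X[:, 2] + rng.normal(scale=0.5, size=100)

mask, aggregated = aggregate_positive(
    X, y, corr_method=spearmanr, agg_method=np.average
)
print("selected columns:", np.flatnonzero(mask))
print("aggregated feature shape:", aggregated.shape)
```

With only the positive sign selected, the downstream model receives a single aggregated feature instead of two.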