6.7. Connectome-based Predictive Modeling (CBPM)
Applications of machine learning in neuroimaging research typically have to deal with high-dimensional input features. This is problematic because of the curse of dimensionality, especially when sample sizes are small at the same time. Connectome-based predictive modeling (CBPM) has been proposed as an approach to this problem in regression settings [4]. It has been used to predict fluid intelligence [5] as well as sustained attention [6] from functional brain connectivity.
In a nutshell, CBPM consists of:
Feature selection
Feature aggregation
Model building
In CBPM, features are selected if their correlation with the target is significant at a specified significance threshold alpha. The selected features are then summarized by an aggregation function and used to fit a machine learning model. A linear model is the most common choice, but in principle any other machine learning model could be used.
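The three steps can be sketched from scratch with NumPy and SciPy. This is a minimal illustration under stated assumptions, not julearn's implementation: the helper names `select_by_correlation` and `aggregate` are hypothetical, the toy data are made up, and the selection is done on the training split only so that the held-out samples do not influence which features are chosen.

```python
import numpy as np
from scipy.stats import pearsonr


def select_by_correlation(X, y, alpha=0.05):
    """Return boolean masks of significantly positive / negative features."""
    r = np.empty(X.shape[1])
    p = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        r[j], p[j] = pearsonr(X[:, j], y)
    significant = p < alpha
    return significant & (r > 0), significant & (r < 0)


def aggregate(X, masks, agg=np.sum):
    """Summarise each non-empty group of selected features row-wise."""
    columns = [agg(X[:, m], axis=1) for m in masks if m.any()]
    if not columns:
        raise ValueError("no feature passed the significance threshold")
    return np.column_stack(columns)


# Toy data (hypothetical): features 0 and 1 raise the target,
# feature 2 lowers it, features 3-5 are pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] + rng.normal(size=200)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

# 1. Feature selection, on the training data only
pos_mask, neg_mask = select_by_correlation(X_train, y_train, alpha=0.01)

# 2. Feature aggregation, reusing the masks learned on the training data
Z_train = aggregate(X_train, (pos_mask, neg_mask))
Z_test = aggregate(X_test, (pos_mask, neg_mask))

# 3. Model building: ordinary least squares with an intercept column
A_train = np.column_stack([Z_train, np.ones(len(Z_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([Z_test, np.ones(len(Z_test))]) @ coef

ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print("positive features:", np.flatnonzero(pos_mask))
print("negative features:", np.flatnonzero(neg_mask))
print(f"held-out R^2: {r2:.2f}")
```

However many features survive the selection, the model in this sketch is fitted on at most two summary features, which is how the approach sidesteps the dimensionality problem.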
CBPM in julearn
julearn implements a simple, scikit-learn compatible transformer ("cbpm") that performs the first two parts of this approach, i.e., the feature selection and feature aggregation. Leveraging julearn's PipelineCreator, one can therefore easily apply the "cbpm" transformer as a preprocessing step, and then apply any scikit-learn-compatible estimator for the model building part.
For example, to build a simple CBPM workflow, you can create a pipeline and run a cross-validation as follows:
# Authors: Leonard Sasse <l.sasse@fz-juelich.de>
# License: AGPL
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator
from sklearn.datasets import make_regression
import pandas as pd
# Prepare data
X, y = make_regression(n_features=20, n_samples=200)
# Make dataframe
X_names = [f"feature_{x}" for x in range(1, 21)]
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y
# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm")
cbpm_pipeline_creator.add("linreg")
# Cross-validate the cbpm pipeline
scores, final_model = run_cross_validation(
    data=data,
    X=X_names,
    y="target",
    model=cbpm_pipeline_creator,
    return_estimator="all",
)
2024-10-17 14:16:00,520 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,520 - julearn - INFO - Step added
2024-10-17 14:16:00,520 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,520 - julearn - INFO - Step added
2024-10-17 14:16:00,520 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:00,520 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:00,521 - julearn - INFO - Features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-10-17 14:16:00,521 - julearn - INFO - Target: target
2024-10-17 14:16:00,522 - julearn - INFO - Expanded features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-10-17 14:16:00,522 - julearn - INFO - X_types:{}
2024-10-17 14:16:00,522 - julearn - WARNING - The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:509: RuntimeWarning: The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
warn_with_log(
2024-10-17 14:16:00,523 - julearn - INFO - ====================
2024-10-17 14:16:00,523 - julearn - INFO -
2024-10-17 14:16:00,523 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:00,523 - julearn - INFO - ====================
2024-10-17 14:16:00,523 - julearn - INFO -
2024-10-17 14:16:00,523 - julearn - INFO - = Data Information =
2024-10-17 14:16:00,523 - julearn - INFO - Problem type: regression
2024-10-17 14:16:00,523 - julearn - INFO - Number of samples: 200
2024-10-17 14:16:00,523 - julearn - INFO - Number of features: 20
2024-10-17 14:16:00,524 - julearn - INFO - ====================
2024-10-17 14:16:00,524 - julearn - INFO -
2024-10-17 14:16:00,524 - julearn - INFO - Target type: float64
2024-10-17 14:16:00,524 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
2024-10-17 14:16:00,545 - julearn - WARNING - No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
/home/runner/work/julearn/julearn/julearn/transformers/cbpm.py:267: RuntimeWarning: No feature with significant negative correlations was present. Only features with positive correlations will be used. To get rid of this message, set `corr_sign = 'pos'`.
warn_with_log(
By default the "cbpm" transformer performs feature selection using the Pearson correlation between each feature and the target, and selects the features for which the p-value of the correlation falls below the significance threshold. The default threshold appears to be 0.05: the estimator printed further below is shown as CBPM(), without any changed parameters, when 0.05 is passed. The transformer then groups the selected features into positively and negatively correlated features, and sums up the features within each of these groups using numpy.sum(). That is, the linear model in this case is fitted on two features:
Sum of features that are positively correlated to the target
Sum of features that are negatively correlated to the target
If no feature is significantly correlated with one of the two signs, as the warning logged above shows for this synthetic data set, only the other sum is used.
The pipeline creator also makes it easy to customise these parameters of the "cbpm" transformer according to your needs. For example, the significance threshold used during feature selection can be set explicitly through the significance_threshold keyword, here to 0.05:
# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm", significance_threshold=0.05)
cbpm_pipeline_creator.add("linreg")
print(cbpm_pipeline_creator)
2024-10-17 14:16:00,676 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,676 - julearn - INFO - Setting hyperparameter significance_threshold = 0.05
2024-10-17 14:16:00,676 - julearn - INFO - Step added
2024-10-17 14:16:00,676 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,676 - julearn - INFO - Step added
PipelineCreator:
  Step 0: cbpm
    estimator: CBPM()
    apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: linreg
    estimator: LinearRegression()
    apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
julearn also allows the significance threshold to be tuned as a hyperparameter in a nested cross-validation. Simply hand over an iterable of values:
# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add("cbpm", significance_threshold=[0.01, 0.05])
cbpm_pipeline_creator.add("linreg")
print(cbpm_pipeline_creator)
2024-10-17 14:16:00,677 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,677 - julearn - INFO - Tuning hyperparameter significance_threshold = [0.01, 0.05]
2024-10-17 14:16:00,677 - julearn - INFO - Step added
2024-10-17 14:16:00,677 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,677 - julearn - INFO - Step added
PipelineCreator:
  Step 0: cbpm
    estimator: CBPM()
    apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'cbpm__significance_threshold': [0.01, 0.05]}
  Step 1: linreg
    estimator: LinearRegression()
    apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
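To make the idea of tuning a selection threshold inside a nested cross-validation concrete, here is a sketch in plain scikit-learn. It is an analogy under stated assumptions, not julearn's internals and not a full CBPM: scikit-learn's SelectFpr with f_regression (whose p-values derive from the Pearson correlation between each feature and the target) stands in for the selection step, there is no aggregation step, and the data set parameters, fold counts, and step names ("select", "linreg") are chosen only for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFpr, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic regression data (illustrative parameters)
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Selection by p-value threshold, followed by a linear model
pipeline = Pipeline(
    [
        ("select", SelectFpr(score_func=f_regression)),
        ("linreg", LinearRegression()),
    ]
)

# Inner loop: choose the threshold using the training part of each outer fold
inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    pipeline,
    param_grid={"select__alpha": [0.01, 0.05]},
    cv=inner_cv,
    scoring="r2",
)

# Outer loop: estimate how well the whole tuning procedure generalises
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="r2")
print("outer-fold R^2:", outer_scores.round(2))

# Refit on all data to see which threshold the search settles on
search.fit(X, y)
print("selected threshold:", search.best_params_["select__alpha"])
```

The outer scores describe the tuning procedure as a whole rather than any single threshold, which is why the threshold must be chosen inside each outer training fold.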
In addition, you can customize the correlation method, the aggregation method, and the sign ("pos", "neg", or "posneg") of the feature-target correlations that should be selected. For example, a pipeline that specifies each of these parameters may look as follows:
import numpy as np
from scipy.stats import spearmanr
# Prepare a pipeline creator
cbpm_pipeline_creator = PipelineCreator(problem_type="regression")
cbpm_pipeline_creator.add(
    "cbpm",
    significance_threshold=0.05,
    corr_method=spearmanr,
    agg_method=np.average,
    corr_sign="pos",
)
cbpm_pipeline_creator.add("linreg")
print(cbpm_pipeline_creator)
2024-10-17 14:16:00,678 - julearn - INFO - Adding step cbpm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,678 - julearn - INFO - Setting hyperparameter significance_threshold = 0.05
2024-10-17 14:16:00,678 - julearn - INFO - Setting hyperparameter corr_method = <function spearmanr at 0x7f45e9f484c0>
2024-10-17 14:16:00,678 - julearn - INFO - Setting hyperparameter agg_method = <function average at 0x7f460aba9fc0>
2024-10-17 14:16:00,678 - julearn - INFO - Setting hyperparameter corr_sign = pos
2024-10-17 14:16:00,678 - julearn - INFO - Step added
2024-10-17 14:16:00,678 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:00,679 - julearn - INFO - Step added
PipelineCreator:
  Step 0: cbpm
    estimator: CBPM(agg_method=<function average at 0x7f460ab9b570>,
      corr_method=<function spearmanr at 0x7f45e9f484c0>, corr_sign='pos')
    apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: linreg
    estimator: LinearRegression()
    apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
As you may have guessed, this pipeline will use a Spearman correlation and a significance level of 0.05 for feature selection. It will only select features that are positively correlated to the target and aggregate them using the numpy.average() aggregation function.
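The following sketch illustrates what these two choices change, using made-up data. Two points are assumptions for the sake of the example: that the aggregation function is applied row-wise across the selected features (axis=1), and that the block of selected features looks like the random matrix X_sel used here.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# A strictly monotonic but strongly nonlinear feature-target relationship
x = rng.normal(size=100)
y = np.exp(3 * x)

r_pearson, _ = pearsonr(x, y)
rho_spearman, _ = spearmanr(x, y)
print(f"Pearson r:    {r_pearson:.3f}")
print(f"Spearman rho: {rho_spearman:.3f}")

# Row-wise aggregation of a block of selected features
X_sel = rng.normal(size=(100, 4))
total = np.sum(X_sel, axis=1)
mean = np.average(X_sel, axis=1)

# The average is the sum divided by the number of selected features
print(np.allclose(mean, total / X_sel.shape[1]))
```

Spearman correlation is rank-based, so it captures the monotonic relationship fully even where the Pearson correlation is weakened by the nonlinearity. Averaging instead of summing rescales the summary feature by the number of selected features, which keeps its scale comparable when that number differs between cross-validation folds.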