6.2. Cross-validation Consistent Confound Removal
In many machine learning applications, researchers ultimately want to assess whether the features are related to the target. However, in most real-world scenarios the supposed relationship between the features and the target may be confounded by one or more (un)observed variables. Therefore, the effect of potential confounding variables is often removed by training a linear regression to predict each feature given the confounds, and using the residuals from this confound removal model to predict the target [8], [9]. Alternatively, one may remove the confounding effect by performing confound regression on the target. That is, one may predict the target given the confounds, and then predict the residuals from such a confound removal model using the features [10]. In either case, it is important that such confound regression models are trained within the cross-validation splits, rather than on the training and testing data jointly, in order to prevent test-to-train data leakage [11], [12].
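To make the leakage concern concrete, the following is a minimal sketch (our illustration using plain scikit-learn, not julearn’s implementation) of confound removal on the features done consistently within cross-validation: the confound model is fit on the training fold only and then used to residualize both folds.
# Minimal sketch (not julearn's code): fold-wise confound removal
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
rng = np.random.default_rng(42)
conf = rng.normal(size=(100, 2))  # two confounds
X = rng.normal(size=(100, 5)) + conf @ rng.normal(size=(2, 5))
y = X[:, 0] + conf[:, 0] + rng.normal(size=100)
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # fit the confound model on the training fold only ...
    conf_model = LinearRegression().fit(conf[train_idx], X[train_idx])
    # ... and residualize both folds with the train-fitted model,
    # so no test-fold information enters the confound model
    X_train_res = X[train_idx] - conf_model.predict(conf[train_idx])
    X_test_res = X[test_idx] - conf_model.predict(conf[test_idx])
    # X_train_res and X_test_res would then be used to predict y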
Confound Removal in julearn
julearn implements cross-validation consistent confound regression for both of the scenarios laid out above (i.e., confound regression on either the features or the target), allowing the user to implement complex machine learning pipelines with relatively little code while avoiding test-to-train leakage during confound removal.
Let us initially consider removing a confounding variable from the features.
Removing Confounds from the Features
The first scenario involves confound regression on the features. To do this, we can simply configure an instance of a PipelineCreator by adding the "confound_removal" step. We can create some data using scikit-learn’s make_regression() and then simulate a normally distributed random variable that has a linear relationship with the target, which we can use as a confound.
Let’s import some of the functionality we will need:
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
First, we create the features and the target, and based on these we create two artificial confounds to use as an example:
# make X and y
X, y = make_regression(n_features=20)
# create two normally distributed random variables with the same mean
# and standard deviation as y
normal_dist_conf_one = np.random.normal(y.mean(), y.std(), y.size)
normal_dist_conf_two = np.random.normal(y.mean(), y.std(), y.size)
# prepare noise factors that scale the target's contribution to each confound
noise_conf_one = np.random.rand(len(y))
noise_conf_two = np.random.rand(len(y))
# create each confound by adding y scaled by its noise factor
confound_one = normal_dist_conf_one + y * noise_conf_one
confound_two = normal_dist_conf_two + y * noise_conf_two
Let’s organise these data as a pandas.DataFrame, which is the preferred data format when using julearn:
# put the features into a dataframe
data = pd.DataFrame(X)
# give features and confounds human readable names
features = [f"feature_{x}" for x in data.columns]
confounds = ["confound_1", "confound_2"]
# make sure that feature names and column names are the same
data.columns = features
# add the target to the dataframe
data["my_target"] = y
# add the confounds to the dataframe
data["confound_1"] = confound_one
data["confound_2"] = confound_two
In this example, we only distinguish between two types of variables in X: 1. our features (or predictors) and 2. our confounds. Let’s prepare the X_types dictionary that we hand over to run_cross_validation() accordingly:
X_types = {"features": features, "confounds": confounds}
Now that we have all the data prepared and have defined our X_types, we can think about creating the pipeline that we want to run. This is the crucial point at which we parametrize the confound removal. We initialize the PipelineCreator and add the "confound_removal" transformer as a step (the underlying transformer object is the ConfoundRemover).
pipeline_creator = PipelineCreator(
    problem_type="regression", apply_to="features"
)
pipeline_creator.add("confound_removal", confounds="confounds")
pipeline_creator.add("linreg")
print(pipeline_creator)
2024-10-23 11:29:51,772 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
2024-10-23 11:29:51,772 - julearn - INFO - Setting hyperparameter confounds = confounds
2024-10-23 11:29:51,772 - julearn - INFO - Step added
2024-10-23 11:29:51,772 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
2024-10-23 11:29:51,772 - julearn - INFO - Step added
PipelineCreator:
Step 0: confound_removal
estimator: ConfoundRemover(apply_to=ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>,
confounds=ColumnTypes<types={'confounds'}; pattern=(?:__:type:__confounds)>,
model_confound=LinearRegression())
apply to: ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
needed types: ColumnTypes<types={'features', 'confounds'}; pattern=(?:__:type:__features|__:type:__confounds)>
tuning params: {}
Step 1: linreg
estimator: LinearRegression()
apply to: ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
needed types: ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
tuning params: {}
As you can see, we tell the PipelineCreator that we want to work on a "regression" problem when we initialize the class. We also specify that, by default, each "step" of the pipeline should be applied to the columns whose type is "features". In the first step that we add, we specify that we want to perform "confound_removal" and that the columns that have the type "confounds" should be used as confounds in the confound regression. Note that because we already specified apply_to="features" during the initialization, we do not need to explicitly state this again. In short, the "confounds" will be removed from the "features". As a second and last step, we simply add a linear regression ("linreg") to fit a predictive model to the de-confounded X and y.
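For illustration, the same creator can also be written with the per-step apply_to stated explicitly rather than inherited from the initialization (an equivalent sketch):
# equivalent sketch: per-step apply_to spelled out explicitly
explicit_creator = PipelineCreator(
    problem_type="regression", apply_to="features"
)
explicit_creator.add("confound_removal", apply_to="features", confounds="confounds")
explicit_creator.add("linreg", apply_to="features")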
Lastly, we only need to pass this pipeline to the run_cross_validation() function to perform confound removal on the features in a cross-validation consistent way:
scores = run_cross_validation(
    data=data,
    X=features + confounds,
    y="my_target",
    X_types=X_types,
    model=pipeline_creator,
    scoring="r2",
)
print(scores)
2024-10-23 11:29:51,773 - julearn - INFO - ==== Input Data ====
2024-10-23 11:29:51,774 - julearn - INFO - Using dataframe as input
2024-10-23 11:29:51,774 - julearn - INFO - Features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-10-23 11:29:51,774 - julearn - INFO - Target: my_target
2024-10-23 11:29:51,774 - julearn - INFO - Expanded features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-10-23 11:29:51,774 - julearn - INFO - X_types:{'features': ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19'], 'confounds': ['confound_1', 'confound_2']}
2024-10-23 11:29:51,775 - julearn - INFO - ====================
2024-10-23 11:29:51,775 - julearn - INFO -
2024-10-23 11:29:51,776 - julearn - INFO - = Model Parameters =
2024-10-23 11:29:51,776 - julearn - INFO - ====================
2024-10-23 11:29:51,777 - julearn - INFO -
2024-10-23 11:29:51,777 - julearn - INFO - = Data Information =
2024-10-23 11:29:51,777 - julearn - INFO - Problem type: regression
2024-10-23 11:29:51,777 - julearn - INFO - Number of samples: 100
2024-10-23 11:29:51,777 - julearn - INFO - Number of features: 22
2024-10-23 11:29:51,777 - julearn - INFO - ====================
2024-10-23 11:29:51,777 - julearn - INFO -
2024-10-23 11:29:51,777 - julearn - INFO - Target type: float64
2024-10-23 11:29:51,777 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
fit_time score_time test_score n_train n_test repeat fold \
0 0.025076 0.006863 0.786706 80 20 0 0
1 0.024267 0.006824 0.592156 80 20 0 1
2 0.024372 0.006876 0.722494 80 20 0 2
3 0.024238 0.006837 0.674188 80 20 0 3
4 0.024254 0.006925 0.570243 80 20 0 4
cv_mdsum
0 b10eef89b4192178d482d7a1587a248a
1 b10eef89b4192178d482d7a1587a248a
2 b10eef89b4192178d482d7a1587a248a
3 b10eef89b4192178d482d7a1587a248a
4 b10eef89b4192178d482d7a1587a248a
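Since run_cross_validation() returns the fold-wise results as a pandas.DataFrame, we can summarize them directly, for example:
# average R2 across the five folds (using the scores returned above)
print(f"Mean test R2: {scores['test_score'].mean():.3f}")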
Now, what if we want to remove the confounds from the target rather than from the features?
Removing Confounds from the Target
If we want to remove the confounds from the target rather than from the features, we need to create a slightly different pipeline. julearn has a specific TargetPipelineCreator to perform transformations on the target. We first configure this pipeline and add the "confound_removal" step.
target_pipeline_creator = TargetPipelineCreator()
target_pipeline_creator.add("confound_removal", confounds="confounds")
print(target_pipeline_creator)
TargetPipelineCreator:
Step 0: confound_removal
estimator: <julearn.transformers.target.target_confound_remover.TargetConfoundRemover object at 0x7f90ad74e9e0>
Now we insert the target pipeline into the main pipeline that will be used to do the prediction. Importantly, we specify that the target pipeline should be applied to the "target".
pipeline_creator = PipelineCreator(
    problem_type="regression", apply_to="features"
)
pipeline_creator.add(target_pipeline_creator, apply_to="target")
pipeline_creator.add("linreg")
print(pipeline_creator)
2024-10-23 11:29:51,941 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2024-10-23 11:29:51,941 - julearn - INFO - Step added
2024-10-23 11:29:51,941 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
2024-10-23 11:29:51,941 - julearn - INFO - Step added
PipelineCreator:
Step 0: jutargetpipeline
estimator: <julearn.pipeline.target_pipeline.JuTargetPipeline object at 0x7f90ad74c490>
apply to: ColumnTypes<types={'target', 'confounds'}; pattern=(?:target|__:type:__confounds)>
needed types: ColumnTypes<types={'target', 'confounds'}; pattern=(?:target|__:type:__confounds)>
tuning params: {}
Step 1: linreg
estimator: LinearRegression()
apply to: ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
needed types: ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
tuning params: {}
Having configured this pipeline, we can then simply use the same run_cross_validation() call to obtain our results:
scores = run_cross_validation(
    data=data,
    X=features + confounds,
    y="my_target",
    X_types=X_types,
    model=pipeline_creator,
    scoring="r2",
)
print(scores)
2024-10-23 11:29:51,942 - julearn - INFO - ==== Input Data ====
2024-10-23 11:29:51,942 - julearn - INFO - Using dataframe as input
2024-10-23 11:29:51,942 - julearn - INFO - Features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-10-23 11:29:51,942 - julearn - INFO - Target: my_target
2024-10-23 11:29:51,943 - julearn - INFO - Expanded features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-10-23 11:29:51,943 - julearn - INFO - X_types:{'features': ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19'], 'confounds': ['confound_1', 'confound_2']}
2024-10-23 11:29:51,944 - julearn - INFO - ====================
2024-10-23 11:29:51,944 - julearn - INFO -
2024-10-23 11:29:51,944 - julearn - INFO - = Model Parameters =
2024-10-23 11:29:51,944 - julearn - INFO - ====================
2024-10-23 11:29:51,944 - julearn - INFO -
2024-10-23 11:29:51,944 - julearn - INFO - = Data Information =
2024-10-23 11:29:51,945 - julearn - INFO - Problem type: regression
2024-10-23 11:29:51,945 - julearn - INFO - Number of samples: 100
2024-10-23 11:29:51,945 - julearn - INFO - Number of features: 22
2024-10-23 11:29:51,945 - julearn - INFO - ====================
2024-10-23 11:29:51,945 - julearn - INFO -
2024-10-23 11:29:51,945 - julearn - INFO - Target type: float64
2024-10-23 11:29:51,945 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
fit_time score_time test_score n_train n_test repeat fold \
0 0.005858 0.003843 0.253033 80 20 0 0
1 0.005841 0.003807 0.059126 80 20 0 1
2 0.005746 0.003765 -0.256962 80 20 0 2
3 0.005877 0.003804 -0.096475 80 20 0 3
4 0.005782 0.003780 -0.534386 80 20 0 4
cv_mdsum
0 b10eef89b4192178d482d7a1587a248a
1 b10eef89b4192178d482d7a1587a248a
2 b10eef89b4192178d482d7a1587a248a
3 b10eef89b4192178d482d7a1587a248a
4 b10eef89b4192178d482d7a1587a248a
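Conceptually, what the target pipeline does inside each split can be sketched as follows (our illustration, assuming a linear confound model; it mirrors, but is not, julearn’s TargetConfoundRemover):
# Sketch: fold-wise confound removal on the target
from sklearn.linear_model import LinearRegression
def deconfound_target(conf_train, y_train, conf_test, y_test):
    # fit confounds -> target on the training fold only
    conf_model = LinearRegression().fit(conf_train, y_train)
    # residualize both folds with the train-fitted model
    y_train_res = y_train - conf_model.predict(conf_train)
    y_test_res = y_test - conf_model.predict(conf_test)
    return y_train_res, y_test_res
The predictive model is then fit on the features and the train-fold residuals, and evaluated against the test-fold residuals, keeping the procedure leakage-free.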
As you can see, applying confound regression in your machine learning pipeline in a cross-validated fashion is reasonably easy using julearn. If you are considering whether or not to use confound regression, however, there are further important considerations:
Should I Use Confound Regression?
One reason to perform confound regression in a machine learning pipeline is to account for the effects of confounding variables on the target. This can help mitigate the potential bias introduced by the confounding variables and provide more accurate estimates of the true relationship between the features and the target.
On the other hand, some argue that confound regression may not always be necessary or appropriate, as it can lead to a loss of valuable information in the data. Additionally, confounding variables can be difficult to identify or measure accurately, which can make confound regression challenging or ineffective. In particular, controlling for variables that are not confounds but are in fact colliders may introduce spurious relationships between your features and your target [13]. Lastly, there is also some evidence that removing confounds can leak information about the target into the features, biasing the resulting predictive models [14]. Ultimately, the decision to perform confound regression in a machine learning pipeline should be based on careful consideration of the specific dataset and research question at hand, as well as a thorough understanding of the strengths and limitations of this technique.
Total running time of the script: (0 minutes 0.231 seconds)