6.2. Cross-validation consistent Confound Removal#

In many machine learning applications, researchers ultimately want to assess whether the features are related to the target. However, in most real-world scenarios, the supposed relationship between the features and the target may be confounded by one or more (un)observed variables. Therefore, the effect of potential confounding variables is often removed by training a linear regression to predict each feature given the confounds, and using the residuals from this confound removal model to predict the target [8], [9]. Similarly, one may instead remove the confounding effect by performing confound regression on the target. That is, one may predict the target given the confounds, and then predict the residuals from such a confound removal model using the features [10]. In either case, it is important that such confound regression models are trained within the cross-validation splits, rather than on the training and testing data jointly, in order to prevent test-to-train data leakage [11], [12].
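To make the leakage concern concrete, here is a minimal sketch of feature residualization done correctly, i.e., with the confound model fit on the training split only. It uses plain scikit-learn and hypothetical variable names, not julearn’s API:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
confounds = rng.normal(size=(200, 2))
# features that are partly driven by the confounds
features = confounds @ rng.normal(size=(2, 5)) + rng.normal(size=(200, 5))

X_train, X_test, conf_train, conf_test = train_test_split(
    features, confounds, random_state=0
)

# fit one linear model per feature, predicting it from the confounds,
# on the training split only
confound_model = LinearRegression().fit(conf_train, X_train)

# the residuals replace the original features in both splits
X_train_res = X_train - confound_model.predict(conf_train)
X_test_res = X_test - confound_model.predict(conf_test)

Doing this by hand for every cross-validation fold quickly becomes error-prone, which is where julearn comes in.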

Confound Removal in julearn#

julearn implements cross-validation consistent confound regression for both of the scenarios laid out above (i.e., confound regression on either the features or the target), allowing the user to implement complex machine learning pipelines with relatively little code while avoiding test-to-train leakage during confound removal.

Let us initially consider removing a confounding variable from the features.

Removing Confounds from the Features#

The first scenario involves confound regression on the features. To do this, we can simply configure an instance of a PipelineCreator by adding the "confound_removal" step.

We can create some data using scikit-learn’s make_regression() and then simulate two normally distributed random variables that have a linear relationship with the target, which we can use as confounds.

Let’s import some of the functionality we will need:

from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator
from sklearn.datasets import make_regression

import numpy as np
import pandas as pd

First we create the features and the target, and based on this we create two artificial confounds that we can use as an example:

# make X and y
X, y = make_regression(n_features=20)

# create two normally distributed random variables with the same mean
# and standard deviation as y
normal_dist_conf_one = np.random.normal(y.mean(), y.std(), y.size)
normal_dist_conf_two = np.random.normal(y.mean(), y.std(), y.size)

# prepare some noise to add to the confounds
noise_conf_one = np.random.rand(len(y))
noise_conf_two = np.random.rand(len(y))

# create the confounds by adding y, scaled by a noise factor,
# to the random variables
confound_one = normal_dist_conf_one + y * noise_conf_one
confound_two = normal_dist_conf_two + y * noise_conf_two

Let’s organise these data as a pandas.DataFrame, which is the preferred data format when using julearn:

# put the features into a dataframe
data = pd.DataFrame(X)

# give features and confounds human readable names
features = [f"feature_{x}" for x in data.columns]
confounds = ["confound_1", "confound_2"]

# make sure that feature names and column names are the same
data.columns = features

# add the target to the dataframe
data["my_target"] = y

# add the confounds to the dataframe
data["confound_1"] = confound_one
data["confound_2"] = confound_two

In this example, we only distinguish between two types of variables in X: 1. our features (or predictors) and 2. our confounds. Let’s prepare the X_types dictionary that we hand over to run_cross_validation() accordingly:

X_types = {"features": features, "confounds": confounds}

Now that we have all the data prepared and our X_types defined, we can think about creating the pipeline that we want to run. This is the crucial point at which we parametrize the confound removal. We initialize the PipelineCreator and add the "confound_removal" transformer as a step (the underlying transformer object is the ConfoundRemover).

pipeline_creator = PipelineCreator(
    problem_type="regression", apply_to="features"
)
pipeline_creator.add("confound_removal", confounds="confounds")
pipeline_creator.add("linreg")

print(pipeline_creator)
2024-05-03 15:22:35,496 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
2024-05-03 15:22:35,496 - julearn - INFO - Setting hyperparameter confounds = confounds
2024-05-03 15:22:35,496 - julearn - INFO - Step added
2024-05-03 15:22:35,496 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
2024-05-03 15:22:35,496 - julearn - INFO - Step added
PipelineCreator:
  Step 0: confound_removal
    estimator:     ConfoundRemover(apply_to=ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>,
                confounds=ColumnTypes<types={'confounds'}; pattern=(?:__:type:__confounds)>,
                model_confound=LinearRegression())
    apply to:      ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
    needed types:  ColumnTypes<types={'confounds', 'features'}; pattern=(?:__:type:__confounds|__:type:__features)>
    tuning params: {}
  Step 1: linreg
    estimator:     LinearRegression()
    apply to:      ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
    needed types:  ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
    tuning params: {}

As you can see, we tell the PipelineCreator that we want to work on a “regression” problem when we initialize the class. We also specify that, by default, each “step” of the pipeline should be applied to the columns whose type is "features". In the first step that we add, we specify that we want to perform "confound_removal", and that the columns of type "confounds" should be used as confounds in the confound regression. Note that because we already specified apply_to="features" during initialization, we do not need to state this again. In short, the "confounds" will be removed from the "features".

As a second and last step, we simply add a linear regression ("linreg") to fit a predictive model to the de-confounded X and the y.
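As an aside, the same pipeline can also be written with the apply_to of each step stated explicitly. The following sketch should be equivalent to the creator configured above:

# equivalent spelling with the per-step apply_to made explicit
pipeline_creator = PipelineCreator(problem_type="regression")
pipeline_creator.add("confound_removal", apply_to="features", confounds="confounds")
pipeline_creator.add("linreg", apply_to="features")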

Lastly, we only need to pass this pipeline to the run_cross_validation() function to perform confound removal on the features in a cross-validation consistent way:

scores = run_cross_validation(
    data=data,
    X=features + confounds,
    y="my_target",
    X_types=X_types,
    model=pipeline_creator,
    scoring="r2",
)

print(scores)
2024-05-03 15:22:35,498 - julearn - INFO - ==== Input Data ====
2024-05-03 15:22:35,498 - julearn - INFO - Using dataframe as input
2024-05-03 15:22:35,498 - julearn - INFO -      Features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-05-03 15:22:35,498 - julearn - INFO -      Target: my_target
2024-05-03 15:22:35,498 - julearn - INFO -      Expanded features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-05-03 15:22:35,498 - julearn - INFO -      X_types:{'features': ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19'], 'confounds': ['confound_1', 'confound_2']}
2024-05-03 15:22:35,500 - julearn - INFO - ====================
2024-05-03 15:22:35,500 - julearn - INFO -
2024-05-03 15:22:35,501 - julearn - INFO - = Model Parameters =
2024-05-03 15:22:35,501 - julearn - INFO - ====================
2024-05-03 15:22:35,501 - julearn - INFO -
2024-05-03 15:22:35,501 - julearn - INFO - = Data Information =
2024-05-03 15:22:35,501 - julearn - INFO -      Problem type: regression
2024-05-03 15:22:35,501 - julearn - INFO -      Number of samples: 100
2024-05-03 15:22:35,501 - julearn - INFO -      Number of features: 22
2024-05-03 15:22:35,501 - julearn - INFO - ====================
2024-05-03 15:22:35,501 - julearn - INFO -
2024-05-03 15:22:35,501 - julearn - INFO -      Target type: float64
2024-05-03 15:22:35,501 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
   fit_time  score_time  test_score  n_train  n_test  repeat  fold  \
0  0.024430    0.006834    0.786706       80      20       0     0
1  0.023413    0.006827    0.592156       80      20       0     1
2  0.023649    0.006777    0.722494       80      20       0     2
3  0.023315    0.006799    0.674188       80      20       0     3
4  0.023639    0.007036    0.570243       80      20       0     4

                           cv_mdsum
0  b10eef89b4192178d482d7a1587a248a
1  b10eef89b4192178d482d7a1587a248a
2  b10eef89b4192178d482d7a1587a248a
3  b10eef89b4192178d482d7a1587a248a
4  b10eef89b4192178d482d7a1587a248a

Now, what if we want to remove the confounds from the target rather than from the features?

Removing Confounds from the Target#

If we want to remove the confounds from the target rather than from the features, we need to create a slightly different pipeline. julearn has a specific TargetPipelineCreator to perform transformations on the target. We first configure this pipeline and add the "confound_removal" step.

target_pipeline_creator = TargetPipelineCreator()
target_pipeline_creator.add("confound_removal", confounds="confounds")

print(target_pipeline_creator)
TargetPipelineCreator:
  Step 0: confound_removal
    estimator:     <julearn.transformers.target.target_confound_remover.TargetConfoundRemover object at 0x7fd484589f30>
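Conceptually, this target transformer implements the second scenario described at the start of this section: it regresses the confounds out of y within each training split. A minimal sketch of the idea (a hypothetical helper, not julearn’s internals):

from sklearn.linear_model import LinearRegression

def residualize_target(y_train, conf_train, y_new, conf_new):
    # fit y ~ confounds on the training split only, then return
    # residuals for new data, avoiding test-to-train leakage
    model = LinearRegression().fit(conf_train, y_train)
    return y_new - model.predict(conf_new)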

Now we insert the target pipeline into the main pipeline that will be used to do the prediction. Importantly, we specify that the target pipeline should be applied to the "target".

pipeline_creator = PipelineCreator(
    problem_type="regression", apply_to="features"
)
pipeline_creator.add(target_pipeline_creator, apply_to="target")
pipeline_creator.add("linreg")

print(pipeline_creator)
2024-05-03 15:22:35,662 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2024-05-03 15:22:35,662 - julearn - INFO - Step added
2024-05-03 15:22:35,662 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
2024-05-03 15:22:35,662 - julearn - INFO - Step added
PipelineCreator:
  Step 0: jutargetpipeline
    estimator:     <julearn.pipeline.target_pipeline.JuTargetPipeline object at 0x7fd48687d510>
    apply to:      ColumnTypes<types={'confounds', 'target'}; pattern=(?:__:type:__confounds|target)>
    needed types:  ColumnTypes<types={'confounds', 'target'}; pattern=(?:__:type:__confounds|target)>
    tuning params: {}
  Step 1: linreg
    estimator:     LinearRegression()
    apply to:      ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
    needed types:  ColumnTypes<types={'features'}; pattern=(?:__:type:__features)>
    tuning params: {}

Having configured this pipeline, we can then simply use the same run_cross_validation() call to obtain our results:

scores = run_cross_validation(
    data=data,
    X=features + confounds,
    y="my_target",
    X_types=X_types,
    model=pipeline_creator,
    scoring="r2",
)

print(scores)
2024-05-03 15:22:35,663 - julearn - INFO - ==== Input Data ====
2024-05-03 15:22:35,663 - julearn - INFO - Using dataframe as input
2024-05-03 15:22:35,663 - julearn - INFO -      Features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-05-03 15:22:35,663 - julearn - INFO -      Target: my_target
2024-05-03 15:22:35,663 - julearn - INFO -      Expanded features: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'confound_1', 'confound_2']
2024-05-03 15:22:35,664 - julearn - INFO -      X_types:{'features': ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19'], 'confounds': ['confound_1', 'confound_2']}
2024-05-03 15:22:35,665 - julearn - INFO - ====================
2024-05-03 15:22:35,665 - julearn - INFO -
2024-05-03 15:22:35,665 - julearn - INFO - = Model Parameters =
2024-05-03 15:22:35,665 - julearn - INFO - ====================
2024-05-03 15:22:35,665 - julearn - INFO -
2024-05-03 15:22:35,665 - julearn - INFO - = Data Information =
2024-05-03 15:22:35,666 - julearn - INFO -      Problem type: regression
2024-05-03 15:22:35,666 - julearn - INFO -      Number of samples: 100
2024-05-03 15:22:35,666 - julearn - INFO -      Number of features: 22
2024-05-03 15:22:35,666 - julearn - INFO - ====================
2024-05-03 15:22:35,666 - julearn - INFO -
2024-05-03 15:22:35,666 - julearn - INFO -      Target type: float64
2024-05-03 15:22:35,666 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
   fit_time  score_time  test_score  n_train  n_test  repeat  fold  \
0  0.006018    0.003858    0.253033       80      20       0     0
1  0.005797    0.003812    0.059126       80      20       0     1
2  0.005787    0.003765   -0.256962       80      20       0     2
3  0.005827    0.003758   -0.096475       80      20       0     3
4  0.005747    0.003751   -0.534386       80      20       0     4

                           cv_mdsum
0  b10eef89b4192178d482d7a1587a248a
1  b10eef89b4192178d482d7a1587a248a
2  b10eef89b4192178d482d7a1587a248a
3  b10eef89b4192178d482d7a1587a248a
4  b10eef89b4192178d482d7a1587a248a

As you can see, julearn makes it reasonably easy to apply confound regression in your machine learning pipeline in a cross-validated fashion. If you are considering whether or not to use confound regression, however, there are further important considerations:

Should I use Confound Regression?#

One reason to perform confound regression in a machine learning pipeline is to account for the effects of the confounding variables on the target. This can help to mitigate the potential bias introduced by the confounding variables and provide more accurate estimates of the true relationship between the features and the target.

On the other hand, some argue that confound regression may not always be necessary or appropriate, as it can lead to a loss of valuable information in the data. Additionally, confounding variables may sometimes be difficult to identify or measure accurately, which can make confound regression challenging or ineffective. In particular, controlling for variables that are not confounds, but in fact colliders, may introduce spurious relationships between your features and your targets [13] (see the sketch below). Lastly, there is also some evidence that removing confounds can leak information about the target into the features, biasing the resulting predictive models [14]. Ultimately, the decision to perform confound regression in a machine learning pipeline should be based on careful consideration of the specific dataset and research question at hand, as well as a thorough understanding of the strengths and limitations of this technique.
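The collider problem in particular can be demonstrated with a small simulation. In the following sketch (plain numpy, hypothetical variable names), x and y are independent by construction, yet residualizing x on the collider c induces a clear spurious association:

import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# x and y are independent by construction
x = rng.normal(size=n)
y = rng.normal(size=n)

# c is a collider: it is caused by both x and y
c = x + y + 0.5 * rng.normal(size=n)

# "remove" c from x via simple linear confound regression
beta = np.cov(x, c)[0, 1] / np.var(c)
x_res = x - beta * c

print(np.corrcoef(x, y)[0, 1])      # close to 0: no true relationship
print(np.corrcoef(x_res, y)[0, 1])  # clearly negative: spurious association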
