Return Confounds in Confound Removal#

In most cases, confound removal is a simple operation: you regress the confound out of the features and continue working only with the resulting confound-removed features. This is also the default behaviour of julearn’s confound_removal step. Sometimes, however, you want to keep working with the confound even after removing it from the features. In this example, we will discuss the options you have.
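To make the operation concrete, here is a rough sketch of confound removal in plain numpy/scikit-learn. The helper name `remove_confound_manually` and the synthetic data are illustrative assumptions, not part of julearn’s API.

```python
# A rough sketch of what confound removal does, using plain scikit-learn.
# `remove_confound_manually` and the synthetic data are illustrative
# assumptions, not julearn's API.
import numpy as np
from sklearn.linear_model import LinearRegression

def remove_confound_manually(X, confound):
    """Return X with the variance explained by `confound` regressed out."""
    model = LinearRegression().fit(confound, X)
    return X - model.predict(confound)

rng = np.random.default_rng(0)
confound = rng.normal(size=(100, 1))
X = 2.0 * confound + rng.normal(scale=0.1, size=(100, 3))  # confounded features
X_clean = remove_confound_manually(X, confound)

# The residuals are (numerically) uncorrelated with the confound.
print(abs(np.corrcoef(confound.ravel(), X_clean[:, 0])[0, 1]) < 1e-6)  # True
```

Because ordinary least squares residuals are orthogonal to the regressors, the cleaned features carry no linear trace of the confound.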

# Authors: Sami Hamdan <s.hamdan@fz-juelich.de>
# License: AGPL

from sklearn.datasets import load_diabetes  # to load data
from julearn.pipeline import PipelineCreator
from julearn import run_cross_validation
from julearn.inspect import preprocess

# Load in the data
df_features, target = load_diabetes(return_X_y=True, as_frame=True)

First, we can have a look at our features. You can see that they include age, BMI, average blood pressure (bp) and six other measures, s1 to s6. Furthermore, they include sex, which will be considered a confound in this example.

print("Features: ", df_features.head())
Features:          age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]

Second, we can have a look at the target.

print("Target: ", target.describe())
Target:  count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64

Now, we can put both into one DataFrame:

data = df_features.copy()
data["target"] = target

In the following, we will explore different confound removal settings using julearn’s pipeline functionality.

Confound Removal Typical Use Case#

Here, we want to deconfound the features and not include the confound as a feature in our final model. We will use the confound_removal step for this. Then we will use the pca step to reduce the dimensionality of the features. Finally, we will fit a linear regression model.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal")
creator.add("pca")
creator.add("linreg")
2024-04-04 14:44:07,078 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,078 - julearn - INFO - Step added
2024-04-04 14:44:07,079 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,079 - julearn - INFO - Step added
2024-04-04 14:44:07,079 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,079 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacd3c7820>

Now we need to set the X_types argument of the run_cross_validation function. This argument is a dictionary that maps the names of the different types of X to the features that belong to each type. In this example, we have two types of features: continuous and confound. The continuous features are the features that we want to deconfound, and the confound features are the ones we want to remove from the continuous features.

feature_names = list(df_features.drop(columns="sex").columns)
X_types = {"continuous": feature_names, "confound": "sex"}

X = feature_names + ["sex"]

Now we can run the cross validation and get the scores.

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
2024-04-04 14:44:07,080 - julearn - INFO - ==== Input Data ====
2024-04-04 14:44:07,080 - julearn - INFO - Using dataframe as input
2024-04-04 14:44:07,080 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,080 - julearn - INFO -      Target: target
2024-04-04 14:44:07,081 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,081 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-04-04 14:44:07,082 - julearn - INFO - ====================
2024-04-04 14:44:07,082 - julearn - INFO -
2024-04-04 14:44:07,084 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:07,084 - julearn - INFO - ====================
2024-04-04 14:44:07,084 - julearn - INFO -
2024-04-04 14:44:07,084 - julearn - INFO - = Data Information =
2024-04-04 14:44:07,084 - julearn - INFO -      Problem type: regression
2024-04-04 14:44:07,084 - julearn - INFO -      Number of samples: 442
2024-04-04 14:44:07,084 - julearn - INFO -      Number of features: 10
2024-04-04 14:44:07,084 - julearn - INFO - ====================
2024-04-04 14:44:07,084 - julearn - INFO -
2024-04-04 14:44:07,084 - julearn - INFO -      Target type: float64
2024-04-04 14:44:07,084 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(

We can use the preprocess method of the inspect module to inspect the transformation steps of the returned estimator. By providing a step name to the until argument of the preprocess method, we get the transformed X and y up to the provided step (inclusive).

df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
df_deconfounded.head()
age bmi bp s1 s2 s3 s4 s5 s6
0 0.029271 0.057228 0.009658 -0.046011 -0.042050 -0.024189 -0.019424 0.012310 -0.028194
1 0.005874 -0.047538 -0.015568 -0.006874 -0.012796 0.057488 -0.024667 -0.061639 -0.082913
2 0.076494 0.039983 -0.017885 -0.047387 -0.041423 -0.013144 -0.019424 -0.004736 -0.036479
3 -0.081307 -0.007659 -0.025897 0.013765 0.031358 -0.052961 0.049135 0.029380 -0.000071
4 0.013139 -0.032449 0.032631 0.005510 0.021964 -0.008781 0.012234 -0.025295 -0.037349


As you can see, the confound sex was dropped, and only the confound-removed features are passed on to the following PCA.

But what if you want to keep the confound after removal for other transformations?

For example, let’s assume that you want to run a PCA on the confound-removed features but keep the confound for the actual modelling step. Let us have a closer look at the confound remover to understand how we could achieve this:

class julearn.transformers.confound_remover.ConfoundRemover(apply_to='continuous', model_confound=None, confounds='confound', threshold=None, keep_confounds=False, row_select_col_type=None, row_select_vals=None)

Remove confounds from specific features.

Transformer which removes the confounds from specific features by subtracting the predicted features, given the confounds, from the actual features.

Parameters:

apply_to : ColumnTypesLike, optional
    From which feature types (‘X_types’) to remove confounds. If not specified, apply_to defaults to ‘continuous’. To apply confound removal to all features, you can use the ‘*’ regular expression syntax.

model_confound : ModelLike, optional
    Sklearn-compatible model used to predict each of the specified features independently, using the confounds as input features. The predictions of these models are then subtracted from the actual features. Defaults to LinearRegression().

confounds : str or list of str, optional
    The name of the ‘confounds’ type(s), i.e. which column type(s) represent the confounds. By default this is set to ‘confound’.

threshold : float, optional
    All residual values after confound removal that fall below the threshold will be set to 0. None (default) means that no threshold is applied.

keep_confounds : bool, optional
    Whether to return the confounds together with the confound-removed features. Default is False.

row_select_col_type : str or list of str or set of str or ColumnTypes, optional
    The column types needed to select rows (default is None).

row_select_vals : str, int, bool or list of str, int, bool, optional
    The value(s) to select in the row_select_col_type in order to pick the rows used for training (default is None).

get_metadata_routing()

    Get metadata routing of this object.

    Please check the User Guide on how the routing mechanism works.

    Returns:

    routing : MetadataRequest
        A MetadataRequest encapsulating routing information.
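To illustrate the keep_confounds and threshold parameters described above, here is a rough numpy/scikit-learn sketch of the remover’s behaviour. The function `confound_remover` and the synthetic data are our illustrative assumptions, not julearn’s actual implementation.

```python
# A rough numpy/scikit-learn sketch of the behaviour described above.
# `confound_remover` and the synthetic data are illustrative assumptions,
# not julearn's actual implementation.
import numpy as np
from sklearn.linear_model import LinearRegression

def confound_remover(X, confound, keep_confounds=False, threshold=None):
    # Residualize each feature on the confound(s), as the docstring describes.
    resid = X - LinearRegression().fit(confound, X).predict(confound)
    if threshold is not None:
        # Residuals with magnitude below the threshold are set to 0.
        resid = np.where(np.abs(resid) < threshold, 0.0, resid)
    if keep_confounds:
        # Return the confound column(s) alongside the cleaned features.
        resid = np.hstack([resid, confound])
    return resid

rng = np.random.default_rng(42)
conf = rng.normal(size=(50, 1))
X = conf + rng.normal(scale=0.5, size=(50, 4))

print(confound_remover(X, conf).shape)                       # (50, 4)
print(confound_remover(X, conf, keep_confounds=True).shape)  # (50, 5)
```

Note how keep_confounds=True simply appends the confound column to the residualized features, which is exactly what allows later steps to still see it.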

In this example, we will set the keep_confounds argument to True. This will keep the confounds after confound removal.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg")
2024-04-04 14:44:07,299 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,299 - julearn - INFO - Setting hyperparameter keep_confounds = True
2024-04-04 14:44:07,299 - julearn - INFO - Step added
2024-04-04 14:44:07,299 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,299 - julearn - INFO - Step added
2024-04-04 14:44:07,299 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,300 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacd3c5ae0>

Now we can run the cross validation and get the scores.

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
2024-04-04 14:44:07,300 - julearn - INFO - ==== Input Data ====
2024-04-04 14:44:07,300 - julearn - INFO - Using dataframe as input
2024-04-04 14:44:07,300 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,300 - julearn - INFO -      Target: target
2024-04-04 14:44:07,300 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,300 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-04-04 14:44:07,301 - julearn - INFO - ====================
2024-04-04 14:44:07,301 - julearn - INFO -
2024-04-04 14:44:07,302 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:07,302 - julearn - INFO - ====================
2024-04-04 14:44:07,302 - julearn - INFO -
2024-04-04 14:44:07,302 - julearn - INFO - = Data Information =
2024-04-04 14:44:07,302 - julearn - INFO -      Problem type: regression
2024-04-04 14:44:07,302 - julearn - INFO -      Number of samples: 442
2024-04-04 14:44:07,302 - julearn - INFO -      Number of features: 10
2024-04-04 14:44:07,302 - julearn - INFO - ====================
2024-04-04 14:44:07,302 - julearn - INFO -
2024-04-04 14:44:07,302 - julearn - INFO -      Target type: float64
2024-04-04 14:44:07,303 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(

As you can see, this kept the confound variable sex in the data.

df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
df_deconfounded.head()
age bmi bp s1 s2 s3 s4 s5 s6 sex
0 0.029271 0.057228 0.009658 -0.046011 -0.042050 -0.024189 -0.019424 0.012310 -0.028194 0.050680
1 0.005874 -0.047538 -0.015568 -0.006874 -0.012796 0.057488 -0.024667 -0.061639 -0.082913 -0.044642
2 0.076494 0.039983 -0.017885 -0.047387 -0.041423 -0.013144 -0.019424 -0.004736 -0.036479 0.050680
3 -0.081307 -0.007659 -0.025897 0.013765 0.031358 -0.052961 0.049135 0.029380 -0.000071 -0.044642
4 0.013139 -0.032449 0.032631 0.005510 0.021964 -0.008781 0.012234 -0.025295 -0.037349 -0.044642


Even after the PCA, the confound is still present. This is because, by default, transformers only transform continuous features (including features without a specified type) and ignore confounds and categorical variables. We can verify this by inspecting the data up to the pca step:

df_pca = preprocess(model, X=X, data=data, until="pca")
df_pca.head()

pca__pca0 pca__pca1 pca__pca2 pca__pca3 pca__pca4 pca__pca5 pca__pca6 pca__pca7 pca__pca8 sex
0 -0.014051 0.075715 0.017395 -0.012591 -0.046676 -0.013408 0.034497 -0.008604 -0.002330 0.050680
1 -0.099883 -0.062829 0.014516 -0.013673 -0.048058 0.010254 -0.004124 0.024022 0.002075 -0.044642
2 -0.029015 0.053253 0.032477 -0.061933 -0.049167 -0.029565 0.042031 -0.001197 -0.002579 0.050680
3 0.035162 -0.001324 -0.106807 0.028981 0.020850 0.023413 -0.008421 -0.006566 -0.003545 -0.044642
4 -0.003951 -0.025445 0.000421 -0.018411 -0.039692 0.025022 -0.043086 0.002095 -0.000517 -0.044642


This means that the resulting linear regression can use the deconfounded features together with the confound to predict the target. However, in the pipeline creator, the model is applied only to the continuous features, so the confound is not used by the model. Here we can see that the model uses 9 features:

print(len(model.steps[-1][1].model.coef_))
9
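The 9 coefficients follow directly from the column restriction: the estimator is fit only on the continuous (PCA) columns, while the confound column is merely carried along. The following plain-sklearn illustration uses synthetic data of our own, not the diabetes example.

```python
# Illustration with plain scikit-learn on synthetic data (ours, not the
# diabetes example): fitting on 9 of 10 columns yields 9 coefficients,
# mirroring how the confound column is carried along but never passed
# to the estimator.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
data10 = rng.normal(size=(100, 10))            # 9 feature columns + 1 "confound"
y = data10[:, :9] @ rng.normal(size=9)         # target built from the 9 features
lr = LinearRegression().fit(data10[:, :9], y)  # estimator never sees column 10
print(len(lr.coef_))  # 9
```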

Lastly, you can also use the confound as a normal feature after confound removal.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg", apply_to="*")

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
scores
2024-04-04 14:44:07,536 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,536 - julearn - INFO - Setting hyperparameter keep_confounds = True
2024-04-04 14:44:07,536 - julearn - INFO - Step added
2024-04-04 14:44:07,536 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,536 - julearn - INFO - Step added
2024-04-04 14:44:07,536 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:44:07,536 - julearn - INFO - Step added
2024-04-04 14:44:07,536 - julearn - INFO - ==== Input Data ====
2024-04-04 14:44:07,537 - julearn - INFO - Using dataframe as input
2024-04-04 14:44:07,537 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,537 - julearn - INFO -      Target: target
2024-04-04 14:44:07,537 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,537 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-04-04 14:44:07,537 - julearn - INFO - ====================
2024-04-04 14:44:07,537 - julearn - INFO -
2024-04-04 14:44:07,539 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:07,539 - julearn - INFO - ====================
2024-04-04 14:44:07,539 - julearn - INFO -
2024-04-04 14:44:07,539 - julearn - INFO - = Data Information =
2024-04-04 14:44:07,539 - julearn - INFO -      Problem type: regression
2024-04-04 14:44:07,539 - julearn - INFO -      Number of samples: 442
2024-04-04 14:44:07,539 - julearn - INFO -      Number of features: 10
2024-04-04 14:44:07,539 - julearn - INFO - ====================
2024-04-04 14:44:07,539 - julearn - INFO -
2024-04-04 14:44:07,539 - julearn - INFO -      Target type: float64
2024-04-04 14:44:07,539 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
fit_time score_time test_score n_train n_test repeat fold cv_mdsum
0 0.022809 0.008506 0.429556 353 89 0 0 b10eef89b4192178d482d7a1587a248a
1 0.027137 0.008872 0.522599 353 89 0 1 b10eef89b4192178d482d7a1587a248a
2 0.025200 0.009709 0.482681 354 88 0 2 b10eef89b4192178d482d7a1587a248a
3 0.024055 0.009382 0.426498 354 88 0 3 b10eef89b4192178d482d7a1587a248a
4 0.023276 0.008520 0.550248 354 88 0 4 b10eef89b4192178d482d7a1587a248a


As you can see, the confound is now used in the linear regression model. This is because we set the apply_to argument of the linreg step to *, which means the step is applied to all features (including confounds and categorical variables). Here we can see that the model uses 10 features (the 9 deconfounded features plus the confound):

print(len(model.steps[-1][1].model.coef_))
10

Total running time of the script: (0 minutes 0.688 seconds)
