Return Confounds in Confound Removal

In most cases, confound removal is a simple operation: you regress out the confound from the features and continue working only with the confound-removed features. This is also the default behavior of julearn's confound_removal step. But sometimes you want to keep working with the confound even after removing it from the features. In this example, we will discuss the options you have.
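Conceptually, "regressing out" a confound means fitting a linear model that predicts each feature from the confound and keeping only the residuals. The following is a minimal sketch of that idea using plain NumPy and scikit-learn on synthetic data; it illustrates the operation, not julearn's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
confound = rng.normal(size=(100, 1))                    # one confound column
features = 0.8 * confound + rng.normal(size=(100, 3))   # features contaminated by it

# Regress each feature on the confound and keep only the residuals.
lr = LinearRegression().fit(confound, features)
deconfounded = features - lr.predict(confound)

# The residuals are linearly uncorrelated with the confound.
corr = np.corrcoef(confound.ravel(), deconfounded[:, 0])[0, 1]
print(abs(corr) < 1e-8)  # True: the linear confound signal is gone
```

Because ordinary least squares residuals are orthogonal to the regressors, the deconfounded features carry no remaining linear association with the confound.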

# Authors: Sami Hamdan <s.hamdan@fz-juelich.de>
#
# License: AGPL
from sklearn.datasets import load_diabetes  # to load data
from julearn.pipeline import PipelineCreator
from julearn import run_cross_validation
from julearn.inspect import preprocess

# load in the data
df_features, target = load_diabetes(return_X_y=True, as_frame=True)

First, we can have a look at our features. You can see that they include age, BMI, average blood pressure (bp) and six other measures, s1 to s6. Furthermore, they include sex, which will be considered a confound in this example.

print("Features: ", df_features.head())
Features:          age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]

Second, we can have a look at the target:

print("Target: ", target.describe())
Target:  count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64

Now, we can put both into one DataFrame:

data = df_features.copy()
data["target"] = target

In the following, we will explore different settings of confound removal using julearn's pipeline functionalities.

Confound Removal: Typical Use Case

Here, we want to deconfound the features and not include the confound as a feature in our final model. We will use the confound_removal step for this. Then we will use the pca step to reduce the dimensionality of the features. Finally, we will fit a linear regression model.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal")
creator.add("pca")
creator.add("linreg")
2023-07-19 12:42:08,263 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,264 - julearn - INFO - Step added
2023-07-19 12:42:08,264 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,264 - julearn - INFO - Step added
2023-07-19 12:42:08,264 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,264 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f7f62aa5ba0>

Now we need to set the X_types argument of the run_cross_validation function. This argument is a dictionary that maps each feature type to the features belonging to that type. In this example, we have two types of features: continuous and confound. The continuous features are the ones we want to deconfound, and the confound is the variable we want to remove from them.

feature_names = list(df_features.drop(columns="sex").columns)
X_types = {"continuous": feature_names, "confound": "sex"}

X = feature_names + ["sex"]

Now we can run the cross-validation and get the scores.

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
2023-07-19 12:42:08,266 - julearn - INFO - ==== Input Data ====
2023-07-19 12:42:08,266 - julearn - INFO - Using dataframe as input
2023-07-19 12:42:08,266 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2023-07-19 12:42:08,266 - julearn - INFO -      Target: target
2023-07-19 12:42:08,266 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2023-07-19 12:42:08,266 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2023-07-19 12:42:08,267 - julearn - INFO - ====================
2023-07-19 12:42:08,267 - julearn - INFO -
2023-07-19 12:42:08,269 - julearn - INFO - = Model Parameters =
2023-07-19 12:42:08,269 - julearn - INFO - ====================
2023-07-19 12:42:08,269 - julearn - INFO -
2023-07-19 12:42:08,269 - julearn - INFO - = Data Information =
2023-07-19 12:42:08,269 - julearn - INFO -      Problem type: regression
2023-07-19 12:42:08,269 - julearn - INFO -      Number of samples: 442
2023-07-19 12:42:08,269 - julearn - INFO -      Number of features: 10
2023-07-19 12:42:08,269 - julearn - INFO - ====================
2023-07-19 12:42:08,269 - julearn - INFO -
2023-07-19 12:42:08,269 - julearn - INFO -      Target type: float64
2023-07-19 12:42:08,269 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)

We can use the preprocess method of the inspect module to inspect the transformation steps of the returned estimator. By providing a step name to the until argument of the preprocess method, we get back the transformed X and y up to that step (inclusive).

df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
print(df_deconfounded.head())
        age       bmi        bp  ...        s4        s5        s6
0  0.029271  0.057228  0.009658  ... -0.019424  0.012310 -0.028194
1  0.005874 -0.047538 -0.015568  ... -0.024667 -0.061639 -0.082913
2  0.076494  0.039983 -0.017885  ... -0.019424 -0.004736 -0.036479
3 -0.081307 -0.007659 -0.025897  ...  0.049135  0.029380 -0.000071
4  0.013139 -0.032449  0.032631  ...  0.012234 -0.025295 -0.037349

[5 rows x 9 columns]

As you can see, the confound sex was dropped and only the confound-removed features are used in the following pca.

But what if you want to keep the confound after removal for other transformations? For example, let's assume that you want to do a pca on the confound-removed features, but want to keep the confound for the actual modelling step. Let us have a closer look at the confound remover in order to understand how we could achieve such a task:

.. autoclass:: julearn.transformers.ConfoundRemover

To achieve this, we can set the keep_confounds argument to True. This will keep the confounds in the data after confound removal.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg")
2023-07-19 12:42:08,521 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,522 - julearn - INFO - Setting hyperparameter keep_confounds = True
2023-07-19 12:42:08,522 - julearn - INFO - Step added
2023-07-19 12:42:08,522 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,522 - julearn - INFO - Step added
2023-07-19 12:42:08,522 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,522 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f7f62aa5f60>

Now we can run the cross-validation and get the scores.

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
2023-07-19 12:42:08,523 - julearn - INFO - ==== Input Data ====
2023-07-19 12:42:08,523 - julearn - INFO - Using dataframe as input
2023-07-19 12:42:08,523 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2023-07-19 12:42:08,523 - julearn - INFO -      Target: target
2023-07-19 12:42:08,523 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2023-07-19 12:42:08,523 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2023-07-19 12:42:08,524 - julearn - INFO - ====================
2023-07-19 12:42:08,524 - julearn - INFO -
2023-07-19 12:42:08,526 - julearn - INFO - = Model Parameters =
2023-07-19 12:42:08,526 - julearn - INFO - ====================
2023-07-19 12:42:08,526 - julearn - INFO -
2023-07-19 12:42:08,526 - julearn - INFO - = Data Information =
2023-07-19 12:42:08,526 - julearn - INFO -      Problem type: regression
2023-07-19 12:42:08,526 - julearn - INFO -      Number of samples: 442
2023-07-19 12:42:08,526 - julearn - INFO -      Number of features: 10
2023-07-19 12:42:08,526 - julearn - INFO - ====================
2023-07-19 12:42:08,526 - julearn - INFO -
2023-07-19 12:42:08,526 - julearn - INFO -      Target type: float64
2023-07-19 12:42:08,526 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)

As you can see, this kept the confound variable sex in the data.

df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
print(df_deconfounded.head())
        age       bmi        bp  ...        s5        s6       sex
0  0.029271  0.057228  0.009658  ...  0.012310 -0.028194  0.050680
1  0.005874 -0.047538 -0.015568  ... -0.061639 -0.082913 -0.044642
2  0.076494  0.039983 -0.017885  ... -0.004736 -0.036479  0.050680
3 -0.081307 -0.007659 -0.025897  ...  0.029380 -0.000071 -0.044642
4  0.013139 -0.032449  0.032631  ... -0.025295 -0.037349 -0.044642

[5 rows x 10 columns]

Even after the pca, the confound will still be present. This is because, by default, transformers only transform continuous features (including features without a specified type) and ignore confounds and categorical variables.
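As an aside, this "transform some columns, pass the rest through" behavior can be sketched outside julearn with scikit-learn's ColumnTransformer. This is only an illustrative analogy with made-up column names, not how julearn handles column types internally:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(20, 4)), columns=["f1", "f2", "f3", "sex"]
)

# PCA is applied to the "continuous" columns only; the confound column
# is passed through unchanged (appended after the transformed columns).
ct = ColumnTransformer(
    [("pca", PCA(n_components=2), ["f1", "f2", "f3"])],
    remainder="passthrough",
)
out = ct.fit_transform(df)
print(out.shape)  # (20, 3): two PCA components plus the untouched confound
```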

df_transformed = preprocess(model, X=X, data=data)
print(df_transformed.head())
   pca__pca0  pca__pca1  pca__pca2  ...  pca__pca7  pca__pca8       sex
0  -0.014051   0.075715   0.017395  ...  -0.008604  -0.002330  0.050680
1  -0.099883  -0.062829   0.014516  ...   0.024022   0.002075 -0.044642
2  -0.029015   0.053253   0.032477  ...  -0.001197  -0.002579  0.050680
3   0.035162  -0.001324  -0.106807  ...  -0.006566  -0.003545 -0.044642
4  -0.003951  -0.025445   0.000421  ...   0.002095  -0.000517 -0.044642

[5 rows x 10 columns]

This means that the resulting linear regression can use the deconfounded features together with the confound to predict the target. However, in the pipeline creator, the model is only applied to the continuous features, so the confound is not used in the model. Here we can see that the model is using 9 features:

print(len(model.steps[-1][1].model.coef_))
9

Lastly, you can also use the confound as a normal feature after confound removal.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg", apply_to="*")

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
print(scores)
2023-07-19 12:42:08,806 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,806 - julearn - INFO - Setting hyperparameter keep_confounds = True
2023-07-19 12:42:08,806 - julearn - INFO - Step added
2023-07-19 12:42:08,806 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:42:08,806 - julearn - INFO - Step added
2023-07-19 12:42:08,806 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'*'}; pattern=.*>
2023-07-19 12:42:08,807 - julearn - INFO - Step added
2023-07-19 12:42:08,807 - julearn - INFO - ==== Input Data ====
2023-07-19 12:42:08,807 - julearn - INFO - Using dataframe as input
2023-07-19 12:42:08,807 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2023-07-19 12:42:08,807 - julearn - INFO -      Target: target
2023-07-19 12:42:08,807 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2023-07-19 12:42:08,807 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2023-07-19 12:42:08,808 - julearn - INFO - ====================
2023-07-19 12:42:08,808 - julearn - INFO -
2023-07-19 12:42:08,809 - julearn - INFO - = Model Parameters =
2023-07-19 12:42:08,810 - julearn - INFO - ====================
2023-07-19 12:42:08,810 - julearn - INFO -
2023-07-19 12:42:08,810 - julearn - INFO - = Data Information =
2023-07-19 12:42:08,810 - julearn - INFO -      Problem type: regression
2023-07-19 12:42:08,810 - julearn - INFO -      Number of samples: 442
2023-07-19 12:42:08,810 - julearn - INFO -      Number of features: 10
2023-07-19 12:42:08,810 - julearn - INFO - ====================
2023-07-19 12:42:08,810 - julearn - INFO -
2023-07-19 12:42:08,810 - julearn - INFO -      Target type: float64
2023-07-19 12:42:08,810 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.027553    0.010742  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.027331    0.010737  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.027732    0.010712  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.027545    0.010639  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.027785    0.010670  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

As you can see, the confound is now used in the linear regression model. This is because we set the apply_to argument of the linreg step to *, which means that the step will be applied to all features (including confounds and categorical variables). Here we can see that the model is using 10 features: the 9 deconfounded features plus the confound.

print(len(model.steps[-1][1].model.coef_))
10

Total running time of the script: (0 minutes 0.803 seconds)

Gallery generated by Sphinx-Gallery