Return Confounds in Confound Removal#

In most cases, confound removal is a simple operation: you regress the confound out of the features and continue working only with the resulting confound-removed features. This is also the default behaviour of julearn’s confound_removal step. Sometimes, however, you want to keep working with the confound even after removing it from the features. In this example, we will discuss the options you have.
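To make the operation concrete, here is a rough sketch of confound removal in plain numpy/scikit-learn. The helper name `remove_confound_manually` and the synthetic data are illustrative assumptions, not part of julearn’s API.

```python
# A rough sketch of what confound removal does, using plain scikit-learn.
# `remove_confound_manually` and the synthetic data are illustrative
# assumptions, not julearn's API.
import numpy as np
from sklearn.linear_model import LinearRegression

def remove_confound_manually(X, confound):
    """Return X with the variance explained by `confound` regressed out."""
    model = LinearRegression().fit(confound, X)
    return X - model.predict(confound)

rng = np.random.default_rng(0)
confound = rng.normal(size=(100, 1))
X = 2.0 * confound + rng.normal(scale=0.1, size=(100, 3))  # confounded features
X_clean = remove_confound_manually(X, confound)

# The residuals are (numerically) uncorrelated with the confound.
print(abs(np.corrcoef(confound.ravel(), X_clean[:, 0])[0, 1]) < 1e-6)  # True
```

Because ordinary least squares residuals are orthogonal to the regressors, the cleaned features carry no linear trace of the confound.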

# Authors: Sami Hamdan <s.hamdan@fz-juelich.de>
# License: AGPL

from sklearn.datasets import load_diabetes  # to load data
from julearn.pipeline import PipelineCreator
from julearn import run_cross_validation
from julearn.inspect import preprocess

# Load in the data
df_features, target = load_diabetes(return_X_y=True, as_frame=True)

First, we can have a look at our features. You can see that they include age, BMI, average blood pressure (bp) and six other measures, s1 to s6. Furthermore, they include sex, which will be considered a confound in this example.

print("Features: ", df_features.head())
Features:          age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]

Second, we can have a look at the target.

print("Target: ", target.describe())
Target:  count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64

Now, we can put both into one DataFrame:

data = df_features.copy()
data["target"] = target

In the following, we will explore different confound removal settings using julearn’s pipeline functionality.

Confound Removal Typical Use Case#

Here, we want to deconfound the features and not include the confound as a feature in our final model. We will use the confound_removal step for this. Then we will use the pca step to reduce the dimensionality of the features. Finally, we will fit a linear regression model.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal")
creator.add("pca")
creator.add("linreg")
2024-04-04 14:44:07,078 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,078 - julearn - INFO - Step added
2024-04-04 14:44:07,079 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,079 - julearn - INFO - Step added
2024-04-04 14:44:07,079 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,079 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacd3c7820>

Now we need to set the X_types argument of the run_cross_validation function. This argument is a dictionary that maps the names of the different types of X to the features that belong to each type. In this example, we have two types of features: continuous and confound. The continuous features are the features that we want to deconfound, and the confound features are the ones we want to remove from the continuous features.

feature_names = list(df_features.drop(columns="sex").columns)
X_types = {"continuous": feature_names, "confound": "sex"}

X = feature_names + ["sex"]

Now we can run the cross validation and get the scores.

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
2024-04-04 14:44:07,080 - julearn - INFO - ==== Input Data ====
2024-04-04 14:44:07,080 - julearn - INFO - Using dataframe as input
2024-04-04 14:44:07,080 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,080 - julearn - INFO -      Target: target
2024-04-04 14:44:07,081 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,081 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-04-04 14:44:07,082 - julearn - INFO - ====================
2024-04-04 14:44:07,082 - julearn - INFO -
2024-04-04 14:44:07,084 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:07,084 - julearn - INFO - ====================
2024-04-04 14:44:07,084 - julearn - INFO -
2024-04-04 14:44:07,084 - julearn - INFO - = Data Information =
2024-04-04 14:44:07,084 - julearn - INFO -      Problem type: regression
2024-04-04 14:44:07,084 - julearn - INFO -      Number of samples: 442
2024-04-04 14:44:07,084 - julearn - INFO -      Number of features: 10
2024-04-04 14:44:07,084 - julearn - INFO - ====================
2024-04-04 14:44:07,084 - julearn - INFO -
2024-04-04 14:44:07,084 - julearn - INFO -      Target type: float64
2024-04-04 14:44:07,084 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(

We can use the preprocess method of the inspect module to inspect the transformation steps of the returned estimator. By providing a step name to the until argument of the preprocess method, we get the transformed X and y up to the provided step (inclusive).

df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
df_deconfounded.head()
age bmi bp s1 s2 s3 s4 s5 s6
0 0.029271 0.057228 0.009658 -0.046011 -0.042050 -0.024189 -0.019424 0.012310 -0.028194
1 0.005874 -0.047538 -0.015568 -0.006874 -0.012796 0.057488 -0.024667 -0.061639 -0.082913
2 0.076494 0.039983 -0.017885 -0.047387 -0.041423 -0.013144 -0.019424 -0.004736 -0.036479
3 -0.081307 -0.007659 -0.025897 0.013765 0.031358 -0.052961 0.049135 0.029380 -0.000071
4 0.013139 -0.032449 0.032631 0.005510 0.021964 -0.008781 0.012234 -0.025295 -0.037349


As you can see, the confound sex was dropped, and only the confound-removed features are passed on to the following PCA.

But what if you want to keep the confound after removal for other transformations?

For example, let’s assume that you want to run a PCA on the confound-removed features but keep the confound for the actual modelling step. Let us have a closer look at the confound remover to understand how we could achieve this:

class julearn.transformers.confound_remover.ConfoundRemover(apply_to='continuous', model_confound=None, confounds='confound', threshold=None, keep_confounds=False, row_select_col_type=None, row_select_vals=None)

Remove confounds from specific features.

Transformer which removes the confounds from specific features by subtracting the predicted features, given the confounds, from the actual features.

Parameters:

apply_to : ColumnTypesLike, optional
    From which feature types (‘X_types’) to remove confounds. If not specified, apply_to defaults to ‘continuous’. To apply confound removal to all features, you can use the ‘*’ regular expression syntax.

model_confound : ModelLike, optional
    Sklearn-compatible model used to predict each of the specified features independently, using the confounds as input features. The predictions of these models are then subtracted from the actual features. Defaults to LinearRegression().

confounds : str or list of str, optional
    The name of the ‘confounds’ type(s), i.e. which column type(s) represent the confounds. By default this is set to ‘confound’.

threshold : float, optional
    All residual values after confound removal that fall below the threshold will be set to 0. None (default) means that no threshold is applied.

keep_confounds : bool, optional
    Whether to return the confounds together with the confound-removed features. Default is False.

row_select_col_type : str or list of str or set of str or ColumnTypes, optional
    The column types needed to select rows (default is None).

row_select_vals : str, int, bool or list of str, int, bool, optional
    The value(s) to select in the row_select_col_type in order to pick the rows used for training (default is None).

get_metadata_routing()

    Get metadata routing of this object.

    Please check the User Guide on how the routing mechanism works.

    Returns:

    routing : MetadataRequest
        A MetadataRequest encapsulating routing information.
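To illustrate the keep_confounds and threshold parameters described above, here is a rough numpy/scikit-learn sketch of the remover’s behaviour. The function `confound_remover` and the synthetic data are our illustrative assumptions, not julearn’s actual implementation.

```python
# A rough numpy/scikit-learn sketch of the behaviour described above.
# `confound_remover` and the synthetic data are illustrative assumptions,
# not julearn's actual implementation.
import numpy as np
from sklearn.linear_model import LinearRegression

def confound_remover(X, confound, keep_confounds=False, threshold=None):
    # Residualize each feature on the confound(s), as the docstring describes.
    resid = X - LinearRegression().fit(confound, X).predict(confound)
    if threshold is not None:
        # Residuals with magnitude below the threshold are set to 0.
        resid = np.where(np.abs(resid) < threshold, 0.0, resid)
    if keep_confounds:
        # Return the confound column(s) alongside the cleaned features.
        resid = np.hstack([resid, confound])
    return resid

rng = np.random.default_rng(42)
conf = rng.normal(size=(50, 1))
X = conf + rng.normal(scale=0.5, size=(50, 4))

print(confound_remover(X, conf).shape)                       # (50, 4)
print(confound_remover(X, conf, keep_confounds=True).shape)  # (50, 5)
```

Note how keep_confounds=True simply appends the confound column to the residualized features, which is exactly what allows later steps to still see it.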

In this example, we will set the keep_confounds argument to True. This will keep the confounds after confound removal.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg")
2024-04-04 14:44:07,299 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,299 - julearn - INFO - Setting hyperparameter keep_confounds = True
2024-04-04 14:44:07,299 - julearn - INFO - Step added
2024-04-04 14:44:07,299 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,299 - julearn - INFO - Step added
2024-04-04 14:44:07,299 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,300 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacd3c5ae0>

Now we can run the cross validation and get the scores.

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
2024-04-04 14:44:07,300 - julearn - INFO - ==== Input Data ====
2024-04-04 14:44:07,300 - julearn - INFO - Using dataframe as input
2024-04-04 14:44:07,300 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,300 - julearn - INFO -      Target: target
2024-04-04 14:44:07,300 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,300 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-04-04 14:44:07,301 - julearn - INFO - ====================
2024-04-04 14:44:07,301 - julearn - INFO -
2024-04-04 14:44:07,302 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:07,302 - julearn - INFO - ====================
2024-04-04 14:44:07,302 - julearn - INFO -
2024-04-04 14:44:07,302 - julearn - INFO - = Data Information =
2024-04-04 14:44:07,302 - julearn - INFO -      Problem type: regression
2024-04-04 14:44:07,302 - julearn - INFO -      Number of samples: 442
2024-04-04 14:44:07,302 - julearn - INFO -      Number of features: 10
2024-04-04 14:44:07,302 - julearn - INFO - ====================
2024-04-04 14:44:07,302 - julearn - INFO -
2024-04-04 14:44:07,302 - julearn - INFO -      Target type: float64
2024-04-04 14:44:07,303 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(

As you can see, this kept the confound variable sex in the data.

df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
df_deconfounded.head()
age bmi bp s1 s2 s3 s4 s5 s6 sex
0 0.029271 0.057228 0.009658 -0.046011 -0.042050 -0.024189 -0.019424 0.012310 -0.028194 0.050680
1 0.005874 -0.047538 -0.015568 -0.006874 -0.012796 0.057488 -0.024667 -0.061639 -0.082913 -0.044642
2 0.076494 0.039983 -0.017885 -0.047387 -0.041423 -0.013144 -0.019424 -0.004736 -0.036479 0.050680
3 -0.081307 -0.007659 -0.025897 0.013765 0.031358 -0.052961 0.049135 0.029380 -0.000071 -0.044642
4 0.013139 -0.032449 0.032631 0.005510 0.021964 -0.008781 0.012234 -0.025295 -0.037349 -0.044642


Even after the PCA, the confound is still present. This is because, by default, transformers only transform continuous features (including features without a specified type) and ignore confounds and categorical variables. We can verify this by inspecting the data up to the pca step:

df_pca = preprocess(model, X=X, data=data, until="pca")
df_pca.head()

pca__pca0 pca__pca1 pca__pca2 pca__pca3 pca__pca4 pca__pca5 pca__pca6 pca__pca7 pca__pca8 sex
0 -0.014051 0.075715 0.017395 -0.012591 -0.046676 -0.013408 0.034497 -0.008604 -0.002330 0.050680
1 -0.099883 -0.062829 0.014516 -0.013673 -0.048058 0.010254 -0.004124 0.024022 0.002075 -0.044642
2 -0.029015 0.053253 0.032477 -0.061933 -0.049167 -0.029565 0.042031 -0.001197 -0.002579 0.050680
3 0.035162 -0.001324 -0.106807 0.028981 0.020850 0.023413 -0.008421 -0.006566 -0.003545 -0.044642
4 -0.003951 -0.025445 0.000421 -0.018411 -0.039692 0.025022 -0.043086 0.002095 -0.000517 -0.044642


This means that the resulting linear regression can use the deconfounded features together with the confound to predict the target. However, in the pipeline creator, the model is applied only to the continuous features, so the confound is not used by the model. Here we can see that the model uses 9 features:

print(len(model.steps[-1][1].model.coef_))
9
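The 9 coefficients follow directly from the column restriction: the estimator is fit only on the continuous (PCA) columns, while the confound column is merely carried along. The following plain-sklearn illustration uses synthetic data of our own, not the diabetes example.

```python
# Illustration with plain scikit-learn on synthetic data (ours, not the
# diabetes example): fitting on 9 of 10 columns yields 9 coefficients,
# mirroring how the confound column is carried along but never passed
# to the estimator.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
data10 = rng.normal(size=(100, 10))            # 9 feature columns + 1 "confound"
y = data10[:, :9] @ rng.normal(size=9)         # target built from the 9 features
lr = LinearRegression().fit(data10[:, :9], y)  # estimator never sees column 10
print(len(lr.coef_))  # 9
```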

Lastly, you can also use the confound as a normal feature after confound removal.

creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg", apply_to="*")

scores, model = run_cross_validation(
    X=X,
    y="target",
    X_types=X_types,
    data=data,
    model=creator,
    return_estimator="final",
)
scores
2024-04-04 14:44:07,536 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,536 - julearn - INFO - Setting hyperparameter keep_confounds = True
2024-04-04 14:44:07,536 - julearn - INFO - Step added
2024-04-04 14:44:07,536 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-04 14:44:07,536 - julearn - INFO - Step added
2024-04-04 14:44:07,536 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:44:07,536 - julearn - INFO - Step added
2024-04-04 14:44:07,536 - julearn - INFO - ==== Input Data ====
2024-04-04 14:44:07,537 - julearn - INFO - Using dataframe as input
2024-04-04 14:44:07,537 - julearn - INFO -      Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,537 - julearn - INFO -      Target: target
2024-04-04 14:44:07,537 - julearn - INFO -      Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-04-04 14:44:07,537 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-04-04 14:44:07,537 - julearn - INFO - ====================
2024-04-04 14:44:07,537 - julearn - INFO -
2024-04-04 14:44:07,539 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:07,539 - julearn - INFO - ====================
2024-04-04 14:44:07,539 - julearn - INFO -
2024-04-04 14:44:07,539 - julearn - INFO - = Data Information =
2024-04-04 14:44:07,539 - julearn - INFO -      Problem type: regression
2024-04-04 14:44:07,539 - julearn - INFO -      Number of samples: 442
2024-04-04 14:44:07,539 - julearn - INFO -      Number of features: 10
2024-04-04 14:44:07,539 - julearn - INFO - ====================
2024-04-04 14:44:07,539 - julearn - INFO -
2024-04-04 14:44:07,539 - julearn - INFO -      Target type: float64
2024-04-04 14:44:07,539 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
fit_time score_time test_score n_train n_test repeat fold cv_mdsum
0 0.022809 0.008506 0.429556 353 89 0 0 b10eef89b4192178d482d7a1587a248a
1 0.027137 0.008872 0.522599 353 89 0 1 b10eef89b4192178d482d7a1587a248a
2 0.025200 0.009709 0.482681 354 88 0 2 b10eef89b4192178d482d7a1587a248a
3 0.024055 0.009382 0.426498 354 88 0 3 b10eef89b4192178d482d7a1587a248a
4 0.023276 0.008520 0.550248 354 88 0 4 b10eef89b4192178d482d7a1587a248a


As you can see, the confound is now used in the linear regression model. This is because we set the apply_to argument of the linreg step to *, which means the step is applied to all features (including confounds and categorical variables). Here we can see that the model uses 10 features (the 9 deconfounded features plus the confound):

print(len(model.steps[-1][1].model.coef_))
10

Total running time of the script: (0 minutes 0.688 seconds)
