Return Confounds in Confound Removal
In most cases, confound removal is a simple operation: you regress the confound out of the features and continue working only with these new, confound-removed features. This is also the default behaviour of julearn's confound_removal step. But sometimes you want to work with the confound even after removing it from the features. In this example, we will discuss the options you have.
# Authors: Sami Hamdan <s.hamdan@fz-juelich.de>
# License: AGPL
from sklearn.datasets import load_diabetes # to load data
from julearn.pipeline import PipelineCreator
from julearn import run_cross_validation
from julearn.inspect import preprocess
# Load in the data
df_features, target = load_diabetes(return_X_y=True, as_frame=True)
First, we can have a look at our features. You can see that they include age, BMI, average blood pressure (bp) and six other measures, s1 to s6. Furthermore, they include sex, which will be considered a confound in this example.
print("Features: ", df_features.head())
Features: age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Second, we can have a look at the target.
print("Target: ", target.describe())
Target: count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
Name: target, dtype: float64
Now, we can put both into one DataFrame:
data = df_features.copy()
data["target"] = target
In the following, we will explore different settings of confound removal using julearn's pipeline functionalities.
Confound Removal: Typical Use Case
Here, we want to deconfound the features and not include the confound as a feature in our final model. We will use the confound_removal step for this. Then we will use the pca step to reduce the dimensionality of the features. Finally, we will fit a linear regression model.
creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal")
creator.add("pca")
creator.add("linreg")
2024-10-23 11:29:30,405 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,405 - julearn - INFO - Step added
2024-10-23 11:29:30,405 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,406 - julearn - INFO - Step added
2024-10-23 11:29:30,406 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,406 - julearn - INFO - Step added
<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f90ae92c460>
Now we need to set the X_types argument of the run_cross_validation function. This argument is a dictionary that maps the names of the different types of X to the features that belong to each type. In this example, we have two types of features: continuous and confound. The continuous features are the features that we want to deconfound, and the confound features are the features that we want to remove from the continuous features.
feature_names = list(df_features.drop(columns="sex").columns)
X_types = {"continuous": feature_names, "confound": "sex"}
X = feature_names + ["sex"]
Now we can run the cross validation and get the scores.
2024-10-23 11:29:30,407 - julearn - INFO - ==== Input Data ====
2024-10-23 11:29:30,407 - julearn - INFO - Using dataframe as input
2024-10-23 11:29:30,407 - julearn - INFO - Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-10-23 11:29:30,407 - julearn - INFO - Target: target
2024-10-23 11:29:30,407 - julearn - INFO - Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-10-23 11:29:30,407 - julearn - INFO - X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-10-23 11:29:30,408 - julearn - INFO - ====================
2024-10-23 11:29:30,408 - julearn - INFO -
2024-10-23 11:29:30,409 - julearn - INFO - = Model Parameters =
2024-10-23 11:29:30,409 - julearn - INFO - ====================
2024-10-23 11:29:30,409 - julearn - INFO -
2024-10-23 11:29:30,409 - julearn - INFO - = Data Information =
2024-10-23 11:29:30,409 - julearn - INFO - Problem type: regression
2024-10-23 11:29:30,409 - julearn - INFO - Number of samples: 442
2024-10-23 11:29:30,409 - julearn - INFO - Number of features: 10
2024-10-23 11:29:30,409 - julearn - INFO - ====================
2024-10-23 11:29:30,409 - julearn - INFO -
2024-10-23 11:29:30,409 - julearn - INFO - Target type: float64
2024-10-23 11:29:30,410 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
We can use the preprocess method of the inspect module to inspect the transformation steps of the returned estimator. By providing a step name to the until argument of the preprocess method, we return the transformed X and y up to the provided step (inclusive).
df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
df_deconfounded.head()
As you can see, the confound sex was dropped and only the confound-removed features are used in the following PCA.
But what if you want to keep the confound after removal for other transformations?
For example, let's assume that you want to do a PCA on the confound-removed features, but want to keep the confound for the actual modelling step. Let us have a closer look at the confound remover in order to understand how we could achieve such a task:
- class julearn.transformers.confound_remover.ConfoundRemover(apply_to='continuous', model_confound=None, confounds='confound', threshold=None, keep_confounds=False, row_select_col_type=None, row_select_vals=None)
Remove confounds from specific features.
Transformer which removes the confounds from specific features by subtracting the predicted features, given the confounds, from the actual features.
- Parameters:
- apply_to : ColumnTypesLike, optional
From which feature types ('X_types') to remove confounds. If not specified, 'apply_to' defaults to 'continuous'. To apply confound removal to all features, you can use the '*' regular expression syntax.
- model_confound : ModelLike, optional
Sklearn-compatible model used to predict the specified features independently, using the confounds as input features. The predictions of these models are then subtracted from each of the specified features. Defaults to LinearRegression().
- confounds : str or list of str, optional
The name of the 'confounds' type(s), i.e. which column type(s) represent the confounds. By default this is set to 'confound'.
- threshold : float, optional
All residual values after confound removal which fall under the threshold will be set to 0. None (default) means that no threshold will be applied.
- keep_confounds : bool, optional
Whether to return the confounds together with the confound-removed features. Default is False.
- row_select_col_type : str or list of str or set of str or ColumnTypes, optional
The column types needed to select rows (default is None).
- row_select_vals : str, int, bool or list of str, int, bool, optional
The value(s) which should be selected in row_select_col_type to select the rows used for training (default is None).
- get_metadata_routing()
Get metadata routing of this object. Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
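Conceptually, what the confound remover does can be sketched with plain scikit-learn: fit one model per feature that predicts the feature from the confound, and keep only the residuals. Here is a minimal sketch, assuming the default of one LinearRegression per feature (as in model_confound=None):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Same data as in this example: sex is the confound, the rest are features.
df, _ = load_diabetes(return_X_y=True, as_frame=True)
confound = df[["sex"]]  # 2D: the confound(s) act as predictors
features = df.drop(columns="sex")

# For each feature, predict it from the confound and keep the residual.
deconfounded = features.copy()
for col in features.columns:
    lin = LinearRegression().fit(confound, features[col])
    deconfounded[col] = features[col] - lin.predict(confound)

# The residuals are (numerically) uncorrelated with the confound.
print(deconfounded.corrwith(confound["sex"]).abs().max())
```

This is only the core idea; the actual ConfoundRemover additionally handles column-type bookkeeping, the threshold option, and fitting the per-feature models within each cross-validation fold.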
In this example, we will set the keep_confounds argument to True. This will keep the confounds after confound removal.
creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg")
2024-10-23 11:29:30,616 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,616 - julearn - INFO - Setting hyperparameter keep_confounds = True
2024-10-23 11:29:30,616 - julearn - INFO - Step added
2024-10-23 11:29:30,616 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,616 - julearn - INFO - Step added
2024-10-23 11:29:30,616 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,616 - julearn - INFO - Step added
<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f90adde7220>
Now we can run the cross validation and get the scores.
2024-10-23 11:29:30,617 - julearn - INFO - ==== Input Data ====
2024-10-23 11:29:30,617 - julearn - INFO - Using dataframe as input
2024-10-23 11:29:30,617 - julearn - INFO - Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-10-23 11:29:30,617 - julearn - INFO - Target: target
2024-10-23 11:29:30,617 - julearn - INFO - Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-10-23 11:29:30,617 - julearn - INFO - X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-10-23 11:29:30,618 - julearn - INFO - ====================
2024-10-23 11:29:30,618 - julearn - INFO -
2024-10-23 11:29:30,619 - julearn - INFO - = Model Parameters =
2024-10-23 11:29:30,619 - julearn - INFO - ====================
2024-10-23 11:29:30,619 - julearn - INFO -
2024-10-23 11:29:30,619 - julearn - INFO - = Data Information =
2024-10-23 11:29:30,619 - julearn - INFO - Problem type: regression
2024-10-23 11:29:30,619 - julearn - INFO - Number of samples: 442
2024-10-23 11:29:30,619 - julearn - INFO - Number of features: 10
2024-10-23 11:29:30,619 - julearn - INFO - ====================
2024-10-23 11:29:30,619 - julearn - INFO -
2024-10-23 11:29:30,619 - julearn - INFO - Target type: float64
2024-10-23 11:29:30,620 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
As you can see, this kept the confound variable sex in the data.
df_deconfounded = preprocess(model, X=X, data=data, until="confound_removal")
df_deconfounded.head()
Even after the PCA, the confound will still be present. This is the case because by default transformers only transform continuous features (including features without a specified type) and ignore confounds and categorical variables.
df_transformed = preprocess(model, X=X, data=data)
df_transformed.head()
This means that the resulting Linear Regression can use the deconfounded features together with the confound to predict the target. However, in the pipeline creator, the model is only applied to the continuous features. This means that the confound is not used in the model. Here we can see that the model is using 9 features.
print(len(model.steps[-1][1].model.coef_))
9
Lastly, you can also use the confound as a normal feature after confound removal.
creator = PipelineCreator(problem_type="regression", apply_to="continuous")
creator.add("confound_removal", keep_confounds=True)
creator.add("pca")
creator.add("linreg", apply_to="*")
scores, model = run_cross_validation(
X=X,
y="target",
X_types=X_types,
data=data,
model=creator,
return_estimator="final",
)
scores
2024-10-23 11:29:30,842 - julearn - INFO - Adding step confound_removal that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,842 - julearn - INFO - Setting hyperparameter keep_confounds = True
2024-10-23 11:29:30,842 - julearn - INFO - Step added
2024-10-23 11:29:30,842 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:30,842 - julearn - INFO - Step added
2024-10-23 11:29:30,842 - julearn - INFO - Adding step linreg that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-10-23 11:29:30,842 - julearn - INFO - Step added
2024-10-23 11:29:30,842 - julearn - INFO - ==== Input Data ====
2024-10-23 11:29:30,842 - julearn - INFO - Using dataframe as input
2024-10-23 11:29:30,842 - julearn - INFO - Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-10-23 11:29:30,842 - julearn - INFO - Target: target
2024-10-23 11:29:30,843 - julearn - INFO - Expanded features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6', 'sex']
2024-10-23 11:29:30,843 - julearn - INFO - X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'confound': ['sex']}
2024-10-23 11:29:30,843 - julearn - INFO - ====================
2024-10-23 11:29:30,843 - julearn - INFO -
2024-10-23 11:29:30,844 - julearn - INFO - = Model Parameters =
2024-10-23 11:29:30,844 - julearn - INFO - ====================
2024-10-23 11:29:30,844 - julearn - INFO -
2024-10-23 11:29:30,844 - julearn - INFO - = Data Information =
2024-10-23 11:29:30,844 - julearn - INFO - Problem type: regression
2024-10-23 11:29:30,844 - julearn - INFO - Number of samples: 442
2024-10-23 11:29:30,844 - julearn - INFO - Number of features: 10
2024-10-23 11:29:30,844 - julearn - INFO - ====================
2024-10-23 11:29:30,844 - julearn - INFO -
2024-10-23 11:29:30,845 - julearn - INFO - Target type: float64
2024-10-23 11:29:30,845 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
As you can see, the confound is now used in the linear regression model. This is the case because we set the apply_to argument of the linreg step to "*", which means that the step will be applied to all features (including confounds and categorical variables). Here we can see that the model is using 10 features (the 9 deconfounded features plus the confound).
print(len(model.steps[-1][1].coef_))
10
Total running time of the script: (0 minutes 0.668 seconds)