Return Confounds in Confound Removal
In most cases, confound removal is a simple operation: you regress the confound out of the features and continue working only with the confound-removed features. This is also the default behavior of julearn's remove_confound step. Sometimes, however, you want to keep working with the confound even after removing it from the features. In this example, we will discuss the options you have.
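As context, "regressing out" a confound means fitting, per feature, a linear regression of the feature on the confound and keeping only the residuals. A minimal NumPy sketch of that idea (independent of julearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
confound = rng.normal(size=(100, 1))                   # e.g. a confound like sex
feature = 2.0 * confound[:, 0] + rng.normal(size=100)  # feature contaminated by the confound

# Regress the feature on the confound (with intercept) and keep the residuals
design = np.column_stack([np.ones(100), confound])
beta, *_ = np.linalg.lstsq(design, feature, rcond=None)
deconfounded = feature - design @ beta

# OLS residuals are orthogonal to the design, so the linear
# correlation with the confound is zero (up to numerical precision)
print(np.corrcoef(confound[:, 0], deconfounded)[0, 1])
```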
# Authors: Sami Hamdan <s.hamdan@fz-juelich.de>
#
# License: AGPL
from sklearn.datasets import load_diabetes # to load data
from julearn.transformers import ChangeColumnTypes
from julearn import run_cross_validation
# load in the data
df_features, target = load_diabetes(return_X_y=True, as_frame=True)
First, we can have a look at our features. You can see that they include age, BMI, average blood pressure (bp) and six other measures, s1 to s6. Furthermore, they include sex, which will be considered a confound in this example.
print('Features: ', df_features.head())
Out:
Features: age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019908 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068330 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002864 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022692 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031991 -0.046641
[5 rows x 10 columns]
Second, we can have a look at the target
print('Target: ', target.describe())
Out:
Target: count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
Name: target, dtype: float64
Now, we can put both into one DataFrame:
data = df_features.copy()
data['target'] = target
In the following, we will explore different settings of confound removal using julearn's pipeline functionality.
Confound Removal: Typical Use Case
Here, we want to deconfound the features and not include the confound as a feature in our final model. Afterwards, we will transform the deconfounded features with a PCA and fit a linear regression.
feature_names = list(df_features.drop(columns='sex').columns)
scores, model = run_cross_validation(
X=feature_names, y='target', data=data,
confounds='sex', model='linreg', problem_type='regression',
preprocess_X=['remove_confound', 'pca'],
return_estimator='final')
Out:
2021-01-28 20:07:36,285 - julearn - INFO - Using default CV
2021-01-28 20:07:36,285 - julearn - INFO - ==== Input Data ====
2021-01-28 20:07:36,285 - julearn - INFO - Using dataframe as input
2021-01-28 20:07:36,285 - julearn - INFO - Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2021-01-28 20:07:36,286 - julearn - INFO - Target: target
2021-01-28 20:07:36,286 - julearn - INFO - Confounds: sex
2021-01-28 20:07:36,286 - julearn - INFO - Expanded X: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2021-01-28 20:07:36,286 - julearn - INFO - Expanded Confounds: ['sex']
2021-01-28 20:07:36,287 - julearn - INFO - ====================
2021-01-28 20:07:36,287 - julearn - INFO -
2021-01-28 20:07:36,287 - julearn - INFO - ====== Model ======
2021-01-28 20:07:36,287 - julearn - INFO - Obtaining model by name: linreg
2021-01-28 20:07:36,287 - julearn - INFO - ===================
2021-01-28 20:07:36,287 - julearn - INFO -
2021-01-28 20:07:36,287 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
We can use the preprocess method of the ExtendedDataFramePipeline to inspect the transformations/preprocessing steps of the returned estimator. By providing a step name to the until argument of the preprocess method, we get back the transformed X and y up to the provided step (inclusive). The output is a tuple containing the transformed X and y.
X_deconfounded, _ = model.preprocess(
df_features, target, until='remove_confound')
print(X_deconfounded.head())
# As you can see, the confound `sex` was dropped
# and only the confound-removed features are used in the following PCA.
# But what if you want to keep the confound after removal for
# other transformations?
#
# For example, let's assume that you want to do a PCA on the confound-removed
# features, but want to keep the confound for the actual modelling step.
# Let us have a closer look at the confound remover in order to understand
# how we could achieve such a task:
#
# .. autoclass:: julearn.transformers.DataFrameConfoundRemover
Out:
age bmi bp ... s4 s5 s6
0 0.029271 0.057228 0.009658 ... -0.019424 0.012311 -0.028194
1 0.005874 -0.047538 -0.015569 ... -0.024667 -0.061637 -0.082913
2 0.076494 0.039983 -0.017885 ... -0.019424 -0.004734 -0.036479
3 -0.081307 -0.007659 -0.025897 ... 0.049135 0.029385 -0.000071
4 0.013139 -0.032449 0.032632 ... 0.012234 -0.025299 -0.037349
[5 rows x 9 columns]
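To make the keep_confounds behavior concrete, here is a toy, self-contained confound remover (an illustrative sketch only, not julearn's DataFrameConfoundRemover): it regresses each feature on the confounds and, if requested, appends the untouched confound columns back onto the output.

```python
import numpy as np
import pandas as pd

class SimpleConfoundRemover:
    """Toy confound remover sketch (not julearn's implementation)."""

    def __init__(self, confounds, keep_confounds=False):
        self.confounds = confounds
        self.keep_confounds = keep_confounds

    def fit(self, df):
        X = df.drop(columns=self.confounds)
        design = np.column_stack(
            [np.ones(len(df)), df[self.confounds].to_numpy()])
        # One OLS fit per feature column: feature ~ intercept + confounds
        self.beta_, *_ = np.linalg.lstsq(design, X.to_numpy(), rcond=None)
        self.feature_names_ = list(X.columns)
        return self

    def transform(self, df):
        design = np.column_stack(
            [np.ones(len(df)), df[self.confounds].to_numpy()])
        residuals = df[self.feature_names_].to_numpy() - design @ self.beta_
        out = pd.DataFrame(residuals, columns=self.feature_names_,
                           index=df.index)
        if self.keep_confounds:
            # Keep the raw confound columns alongside the residuals
            out[self.confounds] = df[self.confounds]
        return out

rng = np.random.default_rng(0)
df = pd.DataFrame({'sex': rng.integers(0, 2, 50).astype(float),
                   'bmi': rng.normal(size=50)})
df['bmi'] += df['sex']  # contaminate the feature with the confound

out = SimpleConfoundRemover(['sex'], keep_confounds=True).fit(df).transform(df)
print(list(out.columns))  # ['bmi', 'sex'] — the confound survives the transform
```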
Above, you can see that we can set the keep_confounds argument to True. This will keep the confounds after confound removal. Here is an example of how this can look:
scores, model = run_cross_validation(
X=feature_names, y='target', data=data,
confounds='sex', model='linreg', problem_type='regression',
preprocess_X=['remove_confound', 'pca'],
model_params=dict(remove_confound__keep_confounds=True),
return_estimator='final')
Out:
2021-01-28 20:07:38,490 - julearn - INFO - Using default CV
2021-01-28 20:07:38,490 - julearn - INFO - ==== Input Data ====
2021-01-28 20:07:38,490 - julearn - INFO - Using dataframe as input
2021-01-28 20:07:38,491 - julearn - INFO - Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2021-01-28 20:07:38,491 - julearn - INFO - Target: target
2021-01-28 20:07:38,491 - julearn - INFO - Confounds: sex
2021-01-28 20:07:38,491 - julearn - INFO - Expanded X: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2021-01-28 20:07:38,491 - julearn - INFO - Expanded Confounds: ['sex']
2021-01-28 20:07:38,492 - julearn - INFO - ====================
2021-01-28 20:07:38,492 - julearn - INFO -
2021-01-28 20:07:38,492 - julearn - INFO - ====== Model ======
2021-01-28 20:07:38,492 - julearn - INFO - Obtaining model by name: linreg
2021-01-28 20:07:38,492 - julearn - INFO - ===================
2021-01-28 20:07:38,492 - julearn - INFO -
2021-01-28 20:07:38,492 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
2021-01-28 20:07:38,492 - julearn - INFO - = Model Parameters =
2021-01-28 20:07:38,493 - julearn - INFO - Setting hyperparameter remove_confound__keep_confounds = True
2021-01-28 20:07:38,494 - julearn - INFO - ====================
2021-01-28 20:07:38,494 - julearn - INFO -
As you can see, this will keep the confound:
X_deconfounded, _ = model.preprocess(
df_features, target, until='remove_confound')
print(X_deconfounded.head())
Out:
age sex bmi ... s4 s5 s6
0 0.029271 0.050680 0.057228 ... -0.019424 0.012311 -0.028194
1 0.005874 -0.044642 -0.047538 ... -0.024667 -0.061637 -0.082913
2 0.076494 0.050680 0.039983 ... -0.019424 -0.004734 -0.036479
3 -0.081307 -0.044642 -0.007659 ... 0.049135 0.029385 -0.000071
4 0.013139 -0.044642 -0.032449 ... 0.012234 -0.025299 -0.037349
[5 rows x 10 columns]
Even after the PCA, the confound will still be present. This is because, by default, transformers only transform continuous features (including features without a specified type) and ignore confounds and categorical variables.
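In plain scikit-learn terms, this default behavior is roughly analogous to a ColumnTransformer that applies PCA only to the continuous columns and passes everything else through untouched (a rough analogy only, not julearn's internals):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)),
                  columns=['age', 'bmi', 'bp', 'sex'])

# PCA only on the continuous features; 'sex' is passed through untouched
ct = ColumnTransformer(
    [('pca', PCA(n_components=2), ['age', 'bmi', 'bp'])],
    remainder='passthrough')
out = ct.fit_transform(df)
print(out.shape)  # (50, 3): 2 PCA components + the untouched 'sex' column
```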
X_transformed, _ = model.preprocess(df_features, target)
print(X_transformed.head())
# This means that the resulting Linear Regression will use the deconfounded
# features together with the confound to predict the target.
Out:
pca_component_0 pca_component_1 ... pca_component_7 pca_component_8
0 -0.014050 0.075715 ... -0.008604 -0.002330
1 -0.099883 -0.062830 ... 0.024022 0.002074
2 -0.029014 0.053253 ... -0.001197 -0.002579
3 0.035164 -0.001321 ... -0.006567 -0.003546
4 -0.003952 -0.025446 ... 0.002095 -0.000516
[5 rows x 9 columns]
Lastly, you can also use the confound as a normal feature after confound removal. To do so, you can use the ChangeColumnTypes transformer to change the type of the returned confounds to continuous, like this:
scores, model = run_cross_validation(
X=feature_names, y='target', data=data,
confounds='sex', model='linreg', problem_type='regression',
preprocess_X=['remove_confound',
ChangeColumnTypes('.*confound', 'continuous'),
'pca'],
preprocess_confounds='zscore',
model_params=dict(remove_confound__keep_confounds=True),
return_estimator='final'
)
Out:
2021-01-28 20:07:40,739 - julearn - INFO - Using default CV
2021-01-28 20:07:40,739 - julearn - INFO - ==== Input Data ====
2021-01-28 20:07:40,739 - julearn - INFO - Using dataframe as input
2021-01-28 20:07:40,739 - julearn - INFO - Features: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2021-01-28 20:07:40,739 - julearn - INFO - Target: target
2021-01-28 20:07:40,739 - julearn - INFO - Confounds: sex
2021-01-28 20:07:40,740 - julearn - INFO - Expanded X: ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2021-01-28 20:07:40,740 - julearn - INFO - Expanded Confounds: ['sex']
2021-01-28 20:07:40,740 - julearn - INFO - ====================
2021-01-28 20:07:40,741 - julearn - INFO -
2021-01-28 20:07:40,741 - julearn - INFO - ====== Model ======
2021-01-28 20:07:40,741 - julearn - INFO - Obtaining model by name: linreg
2021-01-28 20:07:40,741 - julearn - INFO - ===================
2021-01-28 20:07:40,741 - julearn - INFO -
2021-01-28 20:07:40,741 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
2021-01-28 20:07:40,741 - julearn - INFO - = Model Parameters =
2021-01-28 20:07:40,741 - julearn - INFO - Setting hyperparameter remove_confound__keep_confounds = True
2021-01-28 20:07:40,742 - julearn - INFO - ====================
2021-01-28 20:07:40,742 - julearn - INFO -
As you can see, this will keep the confound and change its type to continuous:
X_deconfounded, _ = model.preprocess(
df_features, target, until='changecolumntypes',
return_trans_column_type=True)
print(X_deconfounded.head())
Out:
age bmi bp ... s5 s6 sex__:type:__continuous
0 0.029271 0.057228 0.009658 ... 0.012311 -0.028194 1.065488
1 0.005874 -0.047538 -0.015569 ... -0.061637 -0.082913 -0.938537
2 0.076494 0.039983 -0.017885 ... -0.004734 -0.036479 1.065488
3 -0.081307 -0.007659 -0.025897 ... 0.029385 -0.000071 -0.938537
4 0.013139 -0.032449 0.032632 ... -0.025299 -0.037349 -0.938537
[5 rows x 10 columns]
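The two repeated values in the sex__:type:__continuous column are what a z-scored binary column looks like: after standardization (here via preprocess_confounds='zscore'), each of the two classes maps to a single value. A small illustration with a toy binary column:

```python
import numpy as np

sex = np.array([1., 1., 0., 0., 0.])  # toy binary confound
z = (sex - sex.mean()) / sex.std()    # z-scoring maps each class to one value
print(np.unique(z))  # exactly two distinct values, with overall mean zero
```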
Because the confound is treated as a normal continuous feature after removal, it will be transformed by the PCA as well:
X_transformed, _ = model.preprocess(df_features, target)
print(X_transformed.head())
Out:
pca_component_0 pca_component_1 ... pca_component_8 pca_component_9
0 1.065488 -0.014050 ... -0.008604 -0.002330
1 -0.938537 -0.099883 ... 0.024022 0.002074
2 1.065488 -0.029014 ... -0.001197 -0.002579
3 -0.938537 0.035164 ... -0.006567 -0.003546
4 -0.938537 -0.003952 ... 0.002095 -0.000516
[5 rows x 10 columns]
Total running time of the script: (0 minutes 7.183 seconds)