6.1. Applying preprocessing to the target
So far, we have covered how to apply preprocessing to the features and train
a model in a CV-consistent manner by building a pipeline.
However, sometimes one also wants to apply preprocessing to the target. For
example, in a regression task (continuous target variable), one might want to
predict the z-scored target.
This can be achieved by using a TargetPipelineCreator as a step in the general pipeline.
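To make "CV-consistent" concrete for the target, the following minimal sketch z-scores a toy target by hand. It is only an illustration of the idea, not julearn's implementation: the helper names `fit_zscore` and `apply_zscore`, the toy values and the single train/test split are all assumptions made for this example. The key point is that the mean and standard deviation are estimated on the training targets only and then reused for the test targets.

```python
from statistics import mean, pstdev


def fit_zscore(y_train):
    """Estimate mean and (population) standard deviation on training targets only."""
    return mean(y_train), pstdev(y_train)


def apply_zscore(y, mu, sigma):
    """Z-score the targets with previously estimated parameters."""
    return [(value - mu) / sigma for value in y]


# Toy target values and one hypothetical train/test split.
y = [25.0, 87.0, 140.0, 211.0, 346.0, 152.0]
train_idx, test_idx = [0, 1, 2, 3], [4, 5]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

# Parameters come from the training fold only, so no information
# about the test fold leaks into the transformation.
mu, sigma = fit_zscore(y_train)
y_train_z = apply_zscore(y_train, mu, sigma)
y_test_z = apply_zscore(y_test, mu, sigma)

print("train (z-scored):", y_train_z)
print("test (z-scored): ", y_test_z)
```

In a full cross-validation this fit-on-train, apply-to-test step is repeated for every fold, which is the bookkeeping a pipeline takes care of.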
Let’s start by loading the data and importing the required modules:
import pandas as pd
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator
from sklearn.datasets import load_diabetes
Load the diabetes dataset from scikit-learn as a pandas.DataFrame:
features, target = load_diabetes(return_X_y=True, as_frame=True)
print("Features: \n", features.head())
print("Target: \n", target.describe())
data_diabetes = pd.concat([features, target], axis=1)
X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"
X_types = {
    "continuous": ["age", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"],
    "categorical": ["sex"],
}
Features:
age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Target:
count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
Name: target, dtype: float64
We first create a TargetPipelineCreator:
target_creator = TargetPipelineCreator()
target_creator.add("zscore")
print(target_creator)
TargetPipelineCreator:
Step 0: zscore
estimator: StandardScaler()
Next, we create the general pipeline using a PipelineCreator. We pass the
target_creator as a step of the pipeline and specify that it should only be
applied to the target, which tells julearn to apply it only to y:
creator = PipelineCreator(
    problem_type="regression", apply_to=["categorical", "continuous"]
)
creator.add(target_creator, apply_to="target")
creator.add("svm")
print(creator)
2024-05-16 08:52:54,839 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2024-05-16 08:52:54,839 - julearn - INFO - Step added
2024-05-16 08:52:54,839 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
2024-05-16 08:52:54,839 - julearn - INFO - Step added
PipelineCreator:
Step 0: target_jutargetpipeline
estimator: <julearn.pipeline.target_pipeline.JuTargetPipeline object at 0x7f9204730310>
apply to: ColumnTypes<types={'target'}; pattern=(?:target)>
needed types: ColumnTypes<types={'target'}; pattern=(?:target)>
tuning params: {}
Step 1: svm
estimator: SVR()
apply to: ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
needed types: ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
tuning params: {}
This creator can then be passed to run_cross_validation():
scores = run_cross_validation(
    X=X, y=y, data=data_diabetes, X_types=X_types, model=creator
)
print(scores)
2024-05-16 08:52:54,840 - julearn - INFO - ==== Input Data ====
2024-05-16 08:52:54,840 - julearn - INFO - Using dataframe as input
2024-05-16 08:52:54,840 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-16 08:52:54,840 - julearn - INFO - Target: target
2024-05-16 08:52:54,840 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-16 08:52:54,840 - julearn - INFO - X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'categorical': ['sex']}
2024-05-16 08:52:54,841 - julearn - INFO - ====================
2024-05-16 08:52:54,841 - julearn - INFO -
2024-05-16 08:52:54,842 - julearn - INFO - = Model Parameters =
2024-05-16 08:52:54,842 - julearn - INFO - ====================
2024-05-16 08:52:54,842 - julearn - INFO -
2024-05-16 08:52:54,842 - julearn - INFO - = Data Information =
2024-05-16 08:52:54,842 - julearn - INFO - Problem type: regression
2024-05-16 08:52:54,842 - julearn - INFO - Number of samples: 442
2024-05-16 08:52:54,842 - julearn - INFO - Number of features: 10
2024-05-16 08:52:54,842 - julearn - INFO - ====================
2024-05-16 08:52:54,842 - julearn - INFO -
2024-05-16 08:52:54,842 - julearn - INFO - Target type: float64
2024-05-16 08:52:54,842 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
fit_time score_time ... fold cv_mdsum
0 0.010292 0.003381 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.008929 0.003293 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.009403 0.003320 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.009150 0.003210 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.009237 0.003223 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
All transformers listed in (Transformers) can be used for both feature and
target transformations. However, feature transformations can be specified
directly as a step in the PipelineCreator, while target transformations have
to be specified using the TargetPipelineCreator, which is then passed to the
overall PipelineCreator as an extra step.
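For readers who know plain scikit-learn, the same split between feature and target preprocessing can be sketched as follows. This is an analogy under stated assumptions, not a description of julearn's internals: it uses scikit-learn's TransformedTargetRegressor, adds a feature StandardScaler that the julearn example above does not include, and fixes the scoring to R2. Note that TransformedTargetRegressor maps predictions back to the original target scale before scoring; whether julearn scores on the original or the transformed scale is not covered here.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = load_diabetes(return_X_y=True)

# Feature preprocessing is an ordinary pipeline step (added here for illustration).
feature_pipeline = make_pipeline(StandardScaler(), SVR())

# Target preprocessing wraps the whole pipeline: y is z-scored before fitting,
# and predictions are inverse-transformed back to the original scale.
model = TransformedTargetRegressor(
    regressor=feature_pipeline, transformer=StandardScaler()
)

# Same outer CV scheme as reported in the log above: 5-fold, no shuffling.
cv = KFold(n_splits=5)
r2_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
print(r2_scores)
```

Because both scalers are refit inside every training fold, neither the feature nor the target statistics of the test fold influence the fitted model.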