6.1. Applying preprocessing to the target
So far we have covered how to apply preprocessing to the features and train
a model in a CV-consistent manner by building a pipeline.
However, sometimes one also wants to apply preprocessing to the target. For
example, in a regression task (continuous target variable), one might want to
predict the z-scored target.
This can be achieved by using a TargetPipelineCreator
as a step in the general pipeline.
Let’s start by loading the data and importing the required modules:
import pandas as pd
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator
from sklearn.datasets import load_diabetes
Load the diabetes dataset from scikit-learn as a pandas.DataFrame
features, target = load_diabetes(return_X_y=True, as_frame=True)
print("Features: \n", features.head())
print("Target: \n", target.describe())
data_diabetes = pd.concat([features, target], axis=1)
X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"
X_types = {
"continuous": ["age", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"],
"categorical": ["sex"],
}
Features:
age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Target:
count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
Name: target, dtype: float64
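To make explicit what "z-scoring the target" means, here is a small, purely
illustrative computation by hand. The "zscore" target transformer uses
StandardScaler (as shown below), which divides by the population standard
deviation, hence ddof=0:

# Illustration only: z-score the target by hand
target_z = (target - target.mean()) / target.std(ddof=0)
print(target_z.describe())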
We first create a TargetPipelineCreator:
target_creator = TargetPipelineCreator()
target_creator.add("zscore")
print(target_creator)
TargetPipelineCreator:
Step 0: zscore
estimator: StandardScaler()
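Conceptually, this is similar to wrapping a regressor in scikit-learn's
TransformedTargetRegressor with a StandardScaler as the target transformer.
The sketch below is only an analogy and not necessarily what julearn builds
internally:

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Rough scikit-learn analogue: the regressor is fit on the scaled target and
# predictions are inverse-transformed back to the original scale.
sklearn_analogue = TransformedTargetRegressor(
    regressor=SVR(), transformer=StandardScaler()
)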
Next, we create the general pipeline using a PipelineCreator. We
pass the target_creator as a step of the pipeline with
apply_to="target", so that julearn applies it only to y:
creator = PipelineCreator(
problem_type="regression", apply_to=["categorical", "continuous"]
)
creator.add(target_creator, apply_to="target")
creator.add("svm")
print(creator)
2024-10-17 14:16:00,302 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2024-10-17 14:16:00,302 - julearn - INFO - Step added
2024-10-17 14:16:00,302 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
2024-10-17 14:16:00,302 - julearn - INFO - Step added
PipelineCreator:
Step 0: target_jutargetpipeline
estimator: <julearn.pipeline.target_pipeline.JuTargetPipeline object at 0x7f45e4b75cc0>
apply to: ColumnTypes<types={'target'}; pattern=(?:target)>
needed types: ColumnTypes<types={'target'}; pattern=(?:target)>
tuning params: {}
Step 1: svm
estimator: SVR()
apply to: ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
needed types: ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
tuning params: {}
This creator can then be passed to run_cross_validation():
scores = run_cross_validation(
X=X, y=y, data=data_diabetes, X_types=X_types, model=creator
)
print(scores)
2024-10-17 14:16:00,303 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:00,303 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:00,303 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:16:00,303 - julearn - INFO - Target: target
2024-10-17 14:16:00,303 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:16:00,303 - julearn - INFO - X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'categorical': ['sex']}
2024-10-17 14:16:00,304 - julearn - INFO - ====================
2024-10-17 14:16:00,304 - julearn - INFO -
2024-10-17 14:16:00,305 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:00,305 - julearn - INFO - ====================
2024-10-17 14:16:00,305 - julearn - INFO -
2024-10-17 14:16:00,305 - julearn - INFO - = Data Information =
2024-10-17 14:16:00,305 - julearn - INFO - Problem type: regression
2024-10-17 14:16:00,305 - julearn - INFO - Number of samples: 442
2024-10-17 14:16:00,305 - julearn - INFO - Number of features: 10
2024-10-17 14:16:00,305 - julearn - INFO - ====================
2024-10-17 14:16:00,305 - julearn - INFO -
2024-10-17 14:16:00,305 - julearn - INFO - Target type: float64
2024-10-17 14:16:00,305 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
fit_time score_time ... fold cv_mdsum
0 0.008690 0.002672 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.007508 0.002633 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.007941 0.002605 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.007759 0.002611 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.007791 0.002571 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
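The returned scores object is a pandas.DataFrame with one row per fold.
Assuming the default scorer is used, the per-fold scores are stored in the
"test_score" column and can be summarized directly:

# Summarize the per-fold scores (column name assumes the default scorer)
print(scores["test_score"].agg(["mean", "std"]))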
All transformers (see Transformers) can be used for both
feature and target transformations. However, feature transformations can be
specified directly as steps in the PipelineCreator, while target
transformations have to be specified using a
TargetPipelineCreator, which is then passed to the overall
PipelineCreator as an extra step.
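As a closing sketch, feature and target preprocessing can be combined in a
single pipeline. Here the features are z-scored as a regular step, while the
target transformation still goes through its own TargetPipelineCreator (the
same building blocks as above, just combined):

target_creator = TargetPipelineCreator()
target_creator.add("zscore")

creator = PipelineCreator(
    problem_type="regression", apply_to=["categorical", "continuous"]
)
creator.add("zscore", apply_to="continuous")    # feature transformation
creator.add(target_creator, apply_to="target")  # target transformation
creator.add("svm")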
Total running time of the script: (0 minutes 0.082 seconds)