3.1. Applying preprocessing to the target#
What we covered so far is how to apply preprocessing to the features and train
a model in a CV-consistent manner by building a pipeline.
However, sometimes one wants to apply preprocessing to the target. For example,
in a regression task (continuous target variable), one might want to
predict the z-scored target.
This can be achieved by using a TargetPipelineCreator
as one step in the general pipeline.
Let's start by importing the required modules and loading the data:
import pandas as pd
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator
from sklearn.datasets import load_diabetes
Load the diabetes dataset from scikit-learn as a pandas DataFrame:
features, target = load_diabetes(return_X_y=True, as_frame=True)
print("Features: \n", features.head())
print("Target: \n", target.describe())
data_diabetes = pd.concat([features, target], axis=1)
X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"
X_types = {
"continuous": ["age", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"],
"categorical": ["sex"],
}
Features:
age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Target:
count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
Name: target, dtype: float64
We first create a TargetPipelineCreator:
target_creator = TargetPipelineCreator()
target_creator.add("zscore")
print(target_creator)
TargetPipelineCreator:
Step 0: zscore
estimator: StandardScaler()
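As the printout shows, the "zscore" step is backed by scikit-learn's
StandardScaler. The following is a minimal sketch (for illustration only,
not julearn's internals) of what z-scoring the target amounts to; within
cross-validation, the scaler is fitted on the training fold's target only:

from sklearn.preprocessing import StandardScaler

# Sketch: z-score the target with a StandardScaler, as the "zscore" step does.
y_2d = target.to_numpy().reshape(-1, 1)   # StandardScaler expects a 2D array
scaler = StandardScaler().fit(y_2d)       # in CV, fitted on the training fold only
y_scaled = scaler.transform(y_2d).ravel()
print(round(y_scaled.mean(), 3), round(y_scaled.std(), 3))  # ~0.0 and ~1.0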
Next, we create the general pipeline using a PipelineCreator. We
pass the target_creator as one step of the pipeline and specify that it
should be applied to the target, so that julearn applies it only to y:
creator = PipelineCreator(
problem_type="regression", apply_to=["categorical", "continuous"]
)
creator.add(target_creator, apply_to="target")
creator.add("svm")
print(creator)
2023-07-19 12:42:16,571 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2023-07-19 12:42:16,571 - julearn - INFO - Step added
2023-07-19 12:42:16,571 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
2023-07-19 12:42:16,571 - julearn - INFO - Step added
PipelineCreator:
Step 0: target_jutargetpipeline
estimator: <julearn.pipeline.target_pipeline.JuTargetPipeline object at 0x7f7f62659330>
apply to: ColumnTypes<types={'target'}; pattern=(?:target)>
needed types: ColumnTypes<types={'target'}; pattern=(?:target)>
tuning params: {}
Step 1: svm
estimator: SVR()
apply to: ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
needed types: ColumnTypes<types={'categorical', 'continuous'}; pattern=(?:__:type:__categorical|__:type:__continuous)>
tuning params: {}
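For reference, a roughly comparable setup can be built in plain
scikit-learn with TransformedTargetRegressor. This is only a sketch of the
idea, not how julearn constructs its pipeline internally; note that
TransformedTargetRegressor also maps predictions back to the original
scale, so the resulting scores are not necessarily identical:

from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Sketch: an SVR trained on the z-scored target.
sk_model = TransformedTargetRegressor(regressor=SVR(), transformer=StandardScaler())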
This creator can then be passed to run_cross_validation():
scores = run_cross_validation(
X=X, y=y, data=data_diabetes, X_types=X_types, model=creator
)
print(scores)
2023-07-19 12:42:16,572 - julearn - INFO - ==== Input Data ====
2023-07-19 12:42:16,572 - julearn - INFO - Using dataframe as input
2023-07-19 12:42:16,572 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:42:16,572 - julearn - INFO - Target: target
2023-07-19 12:42:16,572 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:42:16,572 - julearn - INFO - X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'categorical': ['sex']}
2023-07-19 12:42:16,573 - julearn - INFO - ====================
2023-07-19 12:42:16,573 - julearn - INFO -
2023-07-19 12:42:16,574 - julearn - INFO - = Model Parameters =
2023-07-19 12:42:16,574 - julearn - INFO - ====================
2023-07-19 12:42:16,575 - julearn - INFO -
2023-07-19 12:42:16,575 - julearn - INFO - = Data Information =
2023-07-19 12:42:16,575 - julearn - INFO - Problem type: regression
2023-07-19 12:42:16,575 - julearn - INFO - Number of samples: 442
2023-07-19 12:42:16,575 - julearn - INFO - Number of features: 10
2023-07-19 12:42:16,575 - julearn - INFO - ====================
2023-07-19 12:42:16,575 - julearn - INFO -
2023-07-19 12:42:16,575 - julearn - INFO - Target type: float64
2023-07-19 12:42:16,575 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
fit_time score_time ... fold cv_mdsum
0 0.012334 0.004221 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.010788 0.004138 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.011340 0.004147 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.011090 0.004127 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.011126 0.004113 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
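The returned scores DataFrame can then be summarized as usual, for
example (assuming the default scoring column is named test_score, as in
other julearn examples):

# Mean and standard deviation of the CV test scores (assumed column name).
print(scores["test_score"].agg(["mean", "std"]))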
All transformers (see Transformers) can be used for both feature and
target transformations. However, feature transformations can be specified
directly as a step in the PipelineCreator, while target
transformations have to be specified using a
TargetPipelineCreator, which is then passed to the overall
PipelineCreator as an extra step.
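To make the distinction concrete, here is a minimal sketch that applies
the same "zscore" transformer once to the features (added directly as a
step) and once to the target (wrapped in a TargetPipelineCreator):

# Feature transformation: added directly as a step of the PipelineCreator.
creator_both = PipelineCreator(problem_type="regression")
creator_both.add("zscore", apply_to="continuous")

# Target transformation: wrapped in a TargetPipelineCreator first.
target_creator = TargetPipelineCreator()
target_creator.add("zscore")
creator_both.add(target_creator, apply_to="target")
creator_both.add("svm")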
Total running time of the script: ( 0 minutes 0.114 seconds)