6.1. Applying preprocessing to the target

What we have covered so far is how to apply preprocessing to the features and train a model in a CV-consistent manner by building a pipeline. However, sometimes one wants to apply preprocessing to the target as well. For example, in a regression task (continuous target variable), one might want to predict the z-scored target. This can be achieved by using a TargetPipelineCreator as a step in the general pipeline.
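To see why the transformation must live inside the pipeline, consider what z-scoring the target means under cross-validation: the mean and standard deviation must be computed on the training folds only, otherwise test-fold information leaks into training. A minimal sketch of fold-safe z-scoring (the values and the split are made up for illustration):

```python
import numpy as np

# Hypothetical target values, for illustration only.
y = np.array([25.0, 87.0, 140.5, 211.5, 346.0, 99.0, 180.0, 60.0])

# Split into a training and a test part (a single CV fold, sketched).
y_train, y_test = y[:6], y[6:]

# Fit the z-score parameters on the TRAINING part only ...
mu, sigma = y_train.mean(), y_train.std()

# ... and apply them to both parts. This is what a target pipeline
# does automatically within each cross-validation fold.
z_train = (y_train - mu) / sigma
z_test = (y_test - mu) / sigma

print(z_train.mean())  # ~0 by construction
```

Doing this bookkeeping by hand for every fold is error-prone, which is exactly what the TargetPipelineCreator takes care of.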

Let’s start by loading the data and importing the required modules:

import pandas as pd
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator, TargetPipelineCreator
from sklearn.datasets import load_diabetes

Load the diabetes dataset from scikit-learn as a pandas.DataFrame:

features, target = load_diabetes(return_X_y=True, as_frame=True)

print("Features: \n", features.head())
print("Target: \n", target.describe())

data_diabetes = pd.concat([features, target], axis=1)

X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"

X_types = {
    "continuous": ["age", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"],
    "categorical": ["sex"],
}
Features:
         age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]
Target:
 count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64

We first create a TargetPipelineCreator:

target_creator = TargetPipelineCreator()
target_creator.add("zscore")

print(target_creator)
TargetPipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()

Next, we create the general pipeline using a PipelineCreator. We pass the target_creator as a step of the pipeline with apply_to="target", which tells julearn to apply it only to y:

creator = PipelineCreator(
    problem_type="regression", apply_to=["categorical", "continuous"]
)
creator.add(target_creator, apply_to="target")
creator.add("svm")
print(creator)
2024-04-29 11:45:55,149 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2024-04-29 11:45:55,149 - julearn - INFO - Step added
2024-04-29 11:45:55,149 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous', 'categorical'}; pattern=(?:__:type:__continuous|__:type:__categorical)>
2024-04-29 11:45:55,149 - julearn - INFO - Step added
PipelineCreator:
  Step 0: target_jutargetpipeline
    estimator:     <julearn.pipeline.target_pipeline.JuTargetPipeline object at 0x7fe3064197e0>
    apply to:      ColumnTypes<types={'target'}; pattern=(?:target)>
    needed types:  ColumnTypes<types={'target'}; pattern=(?:target)>
    tuning params: {}
  Step 1: svm
    estimator:     SVR()
    apply to:      ColumnTypes<types={'continuous', 'categorical'}; pattern=(?:__:type:__continuous|__:type:__categorical)>
    needed types:  ColumnTypes<types={'continuous', 'categorical'}; pattern=(?:__:type:__continuous|__:type:__categorical)>
    tuning params: {}
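For reference, scikit-learn offers a conceptually similar construct, TransformedTargetRegressor, which fits the regressor on the transformed target and inverts the transformation at prediction time. The sketch below is not what julearn builds internally, just an analogous plain-scikit-learn setup for the same idea:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X_sk, y_sk = load_diabetes(return_X_y=True)

# SVR fitted on the z-scored target; predictions are transformed back
# to the original scale automatically before scoring.
model = TransformedTargetRegressor(
    regressor=SVR(), transformer=StandardScaler()
)
cv_scores = cross_val_score(model, X_sk, y_sk, cv=5)
print(cv_scores)
```

Unlike this plain-scikit-learn version, the julearn creator also keeps track of column types, so the target transformation composes with the feature-type logic shown above.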

This creator can then be passed to run_cross_validation():

scores = run_cross_validation(
    X=X, y=y, data=data_diabetes, X_types=X_types, model=creator
)

print(scores)
2024-04-29 11:45:55,150 - julearn - INFO - ==== Input Data ====
2024-04-29 11:45:55,150 - julearn - INFO - Using dataframe as input
2024-04-29 11:45:55,150 - julearn - INFO -      Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-04-29 11:45:55,150 - julearn - INFO -      Target: target
2024-04-29 11:45:55,150 - julearn - INFO -      Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-04-29 11:45:55,150 - julearn - INFO -      X_types:{'continuous': ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6'], 'categorical': ['sex']}
2024-04-29 11:45:55,151 - julearn - INFO - ====================
2024-04-29 11:45:55,151 - julearn - INFO -
2024-04-29 11:45:55,152 - julearn - INFO - = Model Parameters =
2024-04-29 11:45:55,152 - julearn - INFO - ====================
2024-04-29 11:45:55,152 - julearn - INFO -
2024-04-29 11:45:55,152 - julearn - INFO - = Data Information =
2024-04-29 11:45:55,152 - julearn - INFO -      Problem type: regression
2024-04-29 11:45:55,152 - julearn - INFO -      Number of samples: 442
2024-04-29 11:45:55,152 - julearn - INFO -      Number of features: 10
2024-04-29 11:45:55,152 - julearn - INFO - ====================
2024-04-29 11:45:55,152 - julearn - INFO -
2024-04-29 11:45:55,152 - julearn - INFO -      Target type: float64
2024-04-29 11:45:55,152 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.009789    0.003202  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.008546    0.003155  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.009091    0.003164  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.008926    0.003134  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.008935    0.003226  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]
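The returned scores object is an ordinary pandas.DataFrame, so the per-fold results can be aggregated with standard pandas operations. A minimal sketch using a mock DataFrame (the values are made up; the real one is what run_cross_validation returns, with the fold scores in the test_score column):

```python
import pandas as pd

# Mock of the scores DataFrame, for illustration only.
scores = pd.DataFrame(
    {"fold": [0, 1, 2, 3, 4], "test_score": [0.12, 0.15, 0.10, 0.14, 0.13]}
)

mean_score = scores["test_score"].mean()
print(f"Mean test score across folds: {mean_score:.3f}")
```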

All transformers listed in Transformers can be used for both feature and target transformations. However, feature transformations can be specified directly as steps in the PipelineCreator, while target transformations have to be specified using a TargetPipelineCreator, which is then passed to the overall PipelineCreator as an extra step.

Total running time of the script: (0 minutes 0.091 seconds)