Transforming target variable with z-score

This example uses the sklearn diabetes regression dataset and transforms the target variable using a z-score. We then perform a regression analysis using a Ridge Regression model.

# Authors: Lya K. Paas Oliveros <l.paas.oliveros@fz-juelich.de>
#          Sami Hamdan <s.hamdan@fz-juelich.de>
#
# License: AGPL

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

from julearn import run_cross_validation
from julearn.utils import configure_logging

from julearn.pipeline import PipelineCreator, TargetPipelineCreator

Set the logging level to info to see extra information.

configure_logging(level="INFO")
2024-04-29 11:45:34,757 - julearn - INFO - ===== Lib Versions =====
2024-04-29 11:45:34,757 - julearn - INFO - numpy: 1.26.4
2024-04-29 11:45:34,757 - julearn - INFO - scipy: 1.13.0
2024-04-29 11:45:34,757 - julearn - INFO - sklearn: 1.4.2
2024-04-29 11:45:34,757 - julearn - INFO - pandas: 2.1.4
2024-04-29 11:45:34,757 - julearn - INFO - julearn: 0.3.2.dev57
2024-04-29 11:45:34,757 - julearn - INFO - ========================

Load the diabetes dataset from sklearn as a pandas.DataFrame.

features, target = load_diabetes(return_X_y=True, as_frame=True)

The dataset contains ten variables (age, sex, body mass index, average blood pressure, and six blood serum measurements, s1-s6) measured in diabetes patients, together with a quantitative measure of disease progression one year after baseline, which is the target we are interested in predicting.

print("Features: \n", features.head())
print("Target: \n", target.describe())
Features:
         age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]
Target:
 count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64

Let’s combine features and target in one dataframe and define X and y as the names of the feature columns and the target column.

data_diabetes = pd.concat([features, target], axis=1)

X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"

Split the dataset into train and test sets. The training portion, train_diabetes, is what we pass to cross-validation below.
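
The code for this step is not shown here, although train_diabetes is used later on. A minimal sketch, assuming a 70/30 split (which matches the 309 training samples reported in the log further down) and an arbitrary random_state that is an assumption, not a value taken from the original example:

```python
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Rebuild the combined dataframe so this block runs on its own.
features, target = load_diabetes(return_X_y=True, as_frame=True)
data_diabetes = pd.concat([features, target], axis=1)

# test_size=0.3 is inferred from the log (309 of 442 samples used for training);
# random_state=0 is an arbitrary, assumed seed for reproducibility.
train_diabetes, test_diabetes = train_test_split(
    data_diabetes, test_size=0.3, random_state=0
)

print(train_diabetes.shape, test_diabetes.shape)
```

With a different seed the rows in each part change, so the scores reported below would not be reproduced exactly.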

Let’s create the model. Since we will be transforming the target variable, we first need to create a TargetPipelineCreator.

target_creator = TargetPipelineCreator()
target_creator.add("zscore")
<julearn.pipeline.target_pipeline_creator.TargetPipelineCreator object at 0x7fe306b0a740>
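
To make the transformation concrete, the following sketch z-scores a small made-up target vector with NumPy and then inverts it. It only illustrates the arithmetic and is not julearn's internal implementation; the values and variable names are hypothetical. The mean and standard deviation are estimated once on the training target and reused to map values back to the original scale.

```python
import numpy as np

# Hypothetical target values, for illustration only.
y_train = np.array([25.0, 87.0, 140.0, 211.0, 346.0])

# Estimate the statistics on the training target.
mu = y_train.mean()
sigma = y_train.std()  # population std (ddof=0), as sklearn's StandardScaler uses

# Forward transform: zero mean, unit variance.
y_z = (y_train - mu) / sigma

# Inverse transform: map z-scored values back to the original units.
y_back = y_z * sigma + mu

print(np.isclose(y_z.mean(), 0.0), np.isclose(y_z.std(), 1.0))
print(np.allclose(y_back, y_train))
```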

Now we can create the pipeline using a PipelineCreator.

creator = PipelineCreator(problem_type="regression")
creator.add(target_creator, apply_to="target")
creator.add("ridge")

scores, model = run_cross_validation(
    X=X,
    y=y,
    data=train_diabetes,
    model=creator,
    return_estimator="final",
    scoring="neg_mean_absolute_error",
)

print(scores.head(5))
2024-04-29 11:45:34,773 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2024-04-29 11:45:34,773 - julearn - INFO - Step added
2024-04-29 11:45:34,773 - julearn - INFO - Adding step ridge that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:34,773 - julearn - INFO - Step added
2024-04-29 11:45:34,774 - julearn - INFO - ==== Input Data ====
2024-04-29 11:45:34,774 - julearn - INFO - Using dataframe as input
2024-04-29 11:45:34,774 - julearn - INFO -      Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-04-29 11:45:34,774 - julearn - INFO -      Target: target
2024-04-29 11:45:34,774 - julearn - INFO -      Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-04-29 11:45:34,774 - julearn - INFO -      X_types:{}
2024-04-29 11:45:34,774 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:507: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
  warn_with_log(
2024-04-29 11:45:34,775 - julearn - INFO - ====================
2024-04-29 11:45:34,775 - julearn - INFO -
2024-04-29 11:45:34,775 - julearn - INFO - = Model Parameters =
2024-04-29 11:45:34,776 - julearn - INFO - ====================
2024-04-29 11:45:34,776 - julearn - INFO -
2024-04-29 11:45:34,776 - julearn - INFO - = Data Information =
2024-04-29 11:45:34,776 - julearn - INFO -      Problem type: regression
2024-04-29 11:45:34,776 - julearn - INFO -      Number of samples: 309
2024-04-29 11:45:34,776 - julearn - INFO -      Number of features: 10
2024-04-29 11:45:34,776 - julearn - INFO - ====================
2024-04-29 11:45:34,776 - julearn - INFO -
2024-04-29 11:45:34,776 - julearn - INFO -      Target type: float64
2024-04-29 11:45:34,776 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
2024-04-29 11:45:34,812 - julearn - INFO - Fitting final model
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.004417    0.002245  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.004056    0.002215  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.004124    0.002175  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.003975    0.002162  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.003942    0.002188  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

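Mean value of the mean absolute error across CV folds. Because the scorer is neg_mean_absolute_error, the scores are negated errors, so we multiply the mean by -1.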

print(scores["test_score"].mean() * -1)
51.51357151914367
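
As a rough cross-check, a comparable setup can be written with scikit-learn alone. This is a sketch of an analogous pipeline, not what julearn does internally: TransformedTargetRegressor with a StandardScaler stands in for the z-scored target, the regressor is Ridge with default settings, and the split parameters (test_size=0.3, random_state=0) are assumptions. It also shows why the score is multiplied by -1: the neg_mean_absolute_error scorer returns negated errors so that larger is better.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler

features, target = load_diabetes(return_X_y=True, as_frame=True)

# Assumed split parameters; the original split is not shown in this example.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.3, random_state=0
)

# Z-score the target before fitting and invert it when predicting.
model = TransformedTargetRegressor(
    regressor=Ridge(), transformer=StandardScaler()
)

# Same CV scheme as reported in the log: 5-fold KFold without shuffling.
cv_results = cross_validate(
    model,
    X_train,
    y_train,
    cv=KFold(n_splits=5),
    scoring="neg_mean_absolute_error",
)

# Scores are negated errors, so flip the sign to report the MAE.
mean_mae = -cv_results["test_score"].mean()
print(mean_mae)
```

Because the predictions are mapped back through the inverse transform before scoring, the error in this sketch is expressed in the original target units rather than in z-score units. The exact value depends on the assumed split and need not match the number above.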

