Transforming target variable with z-score
This example uses the sklearn “diabetes” regression dataset and transforms the target variable, in this case with a z-score. We then perform a regression analysis using a Ridge regression model.
# Authors: Lya K. Paas Oliveros <l.paas.oliveros@fz-juelich.de>
# Sami Hamdan <s.hamdan@fz-juelich.de>
#
# License: AGPL
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from julearn import run_cross_validation
from julearn.utils import configure_logging
# PipelineCreator and TargetPipelineCreator are required to build the model
# in newer julearn versions
from julearn.pipeline import PipelineCreator, TargetPipelineCreator
Set the logging level to INFO to see extra information.
configure_logging(level="INFO")
2023-07-19 12:41:57,868 - julearn - INFO - ===== Lib Versions =====
2023-07-19 12:41:57,868 - julearn - INFO - numpy: 1.25.1
2023-07-19 12:41:57,868 - julearn - INFO - scipy: 1.11.1
2023-07-19 12:41:57,868 - julearn - INFO - sklearn: 1.3.0
2023-07-19 12:41:57,868 - julearn - INFO - pandas: 2.0.3
2023-07-19 12:41:57,868 - julearn - INFO - julearn: 0.3.1.dev1
2023-07-19 12:41:57,868 - julearn - INFO - ========================
Load the diabetes dataset from sklearn as pandas objects (a dataframe of features and a series holding the target).
features, target = load_diabetes(return_X_y=True, as_frame=True)
The dataset contains ten variables (age, sex, body mass index, average blood pressure, and six blood serum measurements, s1-s6) for diabetes patients, plus a quantitative measure of disease progression one year after baseline, which is the target we are interested in predicting.
print("Features: \n", features.head())
print("Target: \n", target.describe())
Features:
        age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Target:
count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64
Let’s combine features and target in one dataframe and define X (the names of the feature columns) and y (the name of the target column).
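The code for this step does not appear on this page. A minimal sketch of what it could look like, assuming X is the list of feature column names and y is the name of the target column (this matches the features and target that julearn reports in its log further down, but the exact original lines are an assumption):

```python
import pandas as pd
from sklearn.datasets import load_diabetes

features, target = load_diabetes(return_X_y=True, as_frame=True)

# Put features and target side by side in a single dataframe.
data_diabetes = pd.concat([features, target], axis=1)

# Column names, not arrays (assumed from the julearn log output).
X = list(features.columns)  # 'age', 'sex', 'bmi', 'bp', 's1' ... 's6'
y = target.name             # 'target'
```

With these three names defined, the `train_test_split` and `run_cross_validation` calls that follow have everything they refer to.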
Split the dataset into train and test sets.
train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)
Let’s create the model. Since we will be transforming the target variable, we first need to create a TargetPipelineCreator that holds the target transformation.
target_creator = TargetPipelineCreator()
target_creator.add("zscore")
<julearn.pipeline.target_pipeline_creator.TargetPipelineCreator object at 0x7f7f62841e40>
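To make the transformation concrete: a z-score subtracts the mean and divides by the standard deviation, so the transformed target has mean 0 and standard deviation 1. The sketch below does this by hand with NumPy on made-up values. It is not julearn's implementation, and the use of the population standard deviation (`ddof=0`) is an assumption.

```python
import numpy as np

# Hypothetical target values, for illustration only.
y_train = np.array([25.0, 87.0, 140.0, 211.0, 346.0])

mean = y_train.mean()
std = y_train.std()  # population standard deviation (ddof=0), assumed

# Forward transformation: z-score the target.
z = (y_train - mean) / std

# Inverse transformation: map z-scores back to the original units.
y_back = z * std + mean

print(z.mean(), z.std())
```

The inverse step matters for interpretation: a model fitted on `z` predicts in z-score units, and those predictions have to be mapped back before they can be compared with the original target values.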
Now we can create the pipeline using a PipelineCreator.
creator = PipelineCreator(problem_type="regression")
creator.add(target_creator, apply_to="target")
creator.add("ridge")
scores, model = run_cross_validation(
    X=X,
    y=y,
    data=train_diabetes,
    model=creator,
    return_estimator="final",
    scoring="neg_mean_absolute_error",
)
print(scores.head(5))
2023-07-19 12:41:57,887 - julearn - INFO - Adding step jutargetpipeline that applies to ColumnTypes<types={'target'}; pattern=(?:target)>
2023-07-19 12:41:57,887 - julearn - INFO - Step added
2023-07-19 12:41:57,887 - julearn - INFO - Adding step ridge that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:41:57,887 - julearn - INFO - Step added
2023-07-19 12:41:57,888 - julearn - INFO - ==== Input Data ====
2023-07-19 12:41:57,888 - julearn - INFO - Using dataframe as input
2023-07-19 12:41:57,888 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:41:57,888 - julearn - INFO - Target: target
2023-07-19 12:41:57,888 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2023-07-19 12:41:57,888 - julearn - INFO - X_types:{}
2023-07-19 12:41:57,888 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/utils/logging.py:238: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
warn(msg, category=category)
2023-07-19 12:41:57,889 - julearn - INFO - ====================
2023-07-19 12:41:57,889 - julearn - INFO -
2023-07-19 12:41:57,890 - julearn - INFO - = Model Parameters =
2023-07-19 12:41:57,890 - julearn - INFO - ====================
2023-07-19 12:41:57,890 - julearn - INFO -
2023-07-19 12:41:57,890 - julearn - INFO - = Data Information =
2023-07-19 12:41:57,890 - julearn - INFO - Problem type: regression
2023-07-19 12:41:57,890 - julearn - INFO - Number of samples: 309
2023-07-19 12:41:57,890 - julearn - INFO - Number of features: 10
2023-07-19 12:41:57,890 - julearn - INFO - ====================
2023-07-19 12:41:57,890 - julearn - INFO -
2023-07-19 12:41:57,890 - julearn - INFO - Target type: float64
2023-07-19 12:41:57,891 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.005344    0.003926  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.004971    0.003868  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.004951    0.003897  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.004963    0.003906  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.004969    0.003894  ...     4  b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
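For readers more familiar with plain scikit-learn, the same idea (z-score the target, fit Ridge, evaluate with 5-fold CV) can be sketched with `TransformedTargetRegressor`. This is offered as an analogy only, not as a description of what julearn does internally. It also uses the full dataset rather than the training split, so its numbers need not match the ones on this page.

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X_df, y_ser = load_diabetes(return_X_y=True, as_frame=True)

# Ridge is fitted on the z-scored target; predictions are mapped back
# to the original units before they are scored.
sk_model = TransformedTargetRegressor(
    regressor=Ridge(),
    transformer=StandardScaler(),
)

neg_mae = cross_val_score(
    sk_model, X_df, y_ser, cv=5, scoring="neg_mean_absolute_error"
)
print(-neg_mae.mean())  # mean absolute error in original target units
```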
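Mean absolute error, averaged across the CV folds. The "neg_mean_absolute_error" scorer returns negated errors, so we multiply by -1 to recover the error itself.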
print(scores["test_score"].mean() * -1)
154.0615805903489
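The sign flip follows from the scikit-learn convention that a higher score is always better, which is why error metrics are exposed in negated form. A small hand-computed illustration with NumPy, using made-up values:

```python
import numpy as np

# Hypothetical true values and predictions, for illustration only.
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 230.0])

mae = np.mean(np.abs(y_true - y_pred))  # an error: lower is better
neg_mae = -mae                          # a score: higher is better

# Multiplying the score by -1 recovers the error.
print(neg_mae * -1)
```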