Custom Scoring Function for Regression
This example uses the diabetes data from sklearn datasets and performs a regression analysis using a Ridge Regression model. As scorers, it uses scikit-learn, julearn and a custom metric defined by the user.
# Authors: Shammi More <s.more@fz-juelich.de>
# Federico Raimondo <f.raimondo@fz-juelich.de>
# License: AGPL
import pandas as pd
import scipy
from sklearn.datasets import load_diabetes
from sklearn.metrics import make_scorer
from julearn.scoring import register_scorer
from julearn import run_cross_validation
from julearn.utils import configure_logging
Set the logging level to info to see extra information.
configure_logging(level="INFO")
/home/runner/work/julearn/julearn/julearn/utils/logging.py:66: UserWarning: The '__version__' attribute is deprecated and will be removed in MarkupSafe 3.1. Use feature detection, or `importlib.metadata.version("markupsafe")`, instead.
vstring = str(getattr(module, "__version__", None))
2024-10-17 14:15:59,787 - julearn - INFO - ===== Lib Versions =====
2024-10-17 14:15:59,787 - julearn - INFO - numpy: 1.26.4
2024-10-17 14:15:59,787 - julearn - INFO - scipy: 1.14.1
2024-10-17 14:15:59,787 - julearn - INFO - sklearn: 1.5.2
2024-10-17 14:15:59,787 - julearn - INFO - pandas: 2.2.3
2024-10-17 14:15:59,787 - julearn - INFO - julearn: 0.3.4
2024-10-17 14:15:59,787 - julearn - INFO - ========================
Load the diabetes data from sklearn as a pandas.DataFrame.
features, target = load_diabetes(return_X_y=True, as_frame=True)
The dataset contains ten variables (age, sex, body mass index, average blood pressure, and six blood serum measurements, s1-s6) for diabetes patients, plus a quantitative measure of disease progression one year after baseline, which is the target we are interested in predicting.
print("Features: \n", features.head())
print("Target: \n", target.describe())
Features:
age sex bmi ... s4 s5 s6
0 0.038076 0.050680 0.061696 ... -0.002592 0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474 ... -0.039493 -0.068332 -0.092204
2 0.085299 0.050680 0.044451 ... -0.002592 0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595 ... 0.034309 0.022688 -0.009362
4 0.005383 -0.044642 -0.036385 ... -0.002592 -0.031988 -0.046641
[5 rows x 10 columns]
Target:
count 442.000000
mean 152.133484
std 77.093005
min 25.000000
25% 87.000000
50% 140.500000
75% 211.500000
max 346.000000
Name: target, dtype: float64
Let’s combine features and target together in one dataframe and define X and y.
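The code cell for this step is not shown here. A minimal sketch of what it could look like follows, assuming that X and y are given as column names (the log output reports the features and the target by name). The variable name data_diabetes is an assumption made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load features and target as pandas objects (same call as earlier in this example).
features, target = load_diabetes(return_X_y=True, as_frame=True)

# Combine features and target column-wise into a single dataframe.
# "data_diabetes" is an assumed name, not taken from the original page.
data_diabetes = pd.concat([features, target], axis=1)

# X and y hold column names of the dataframe, not the arrays themselves.
X = list(features.columns)
y = "target"

print(data_diabetes.shape)
```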
Train a ridge regression model on the train dataset and use mean absolute error for scoring.
2024-10-17 14:15:59,802 - julearn - INFO - ==== Input Data ====
2024-10-17 14:15:59,802 - julearn - INFO - Using dataframe as input
2024-10-17 14:15:59,802 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:15:59,802 - julearn - INFO - Target: target
2024-10-17 14:15:59,802 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:15:59,802 - julearn - INFO - X_types:{}
2024-10-17 14:15:59,803 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:509: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
warn_with_log(
2024-10-17 14:15:59,803 - julearn - INFO - ====================
2024-10-17 14:15:59,803 - julearn - INFO -
2024-10-17 14:15:59,803 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:15:59,803 - julearn - INFO - Step added
2024-10-17 14:15:59,804 - julearn - INFO - Adding step ridge that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:15:59,804 - julearn - INFO - Step added
2024-10-17 14:15:59,804 - julearn - INFO - = Model Parameters =
2024-10-17 14:15:59,804 - julearn - INFO - ====================
2024-10-17 14:15:59,804 - julearn - INFO -
2024-10-17 14:15:59,804 - julearn - INFO - = Data Information =
2024-10-17 14:15:59,804 - julearn - INFO - Problem type: regression
2024-10-17 14:15:59,804 - julearn - INFO - Number of samples: 442
2024-10-17 14:15:59,804 - julearn - INFO - Number of features: 10
2024-10-17 14:15:59,804 - julearn - INFO - ====================
2024-10-17 14:15:59,804 - julearn - INFO -
2024-10-17 14:15:59,805 - julearn - INFO - Target type: float64
2024-10-17 14:15:59,805 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
The scores dataframe has all the values for each CV split.
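Mean value of mean absolute error across CV. The scorer is "neg_mean_absolute_error", which returns the negated error so that higher is better, hence the multiplication by -1.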
print(scores["test_score"].mean() * -1)
44.264653948271885
Now do the same thing, but use mean absolute error and Pearson product-moment correlation coefficient (squared) as scoring functions.
2024-10-17 14:15:59,852 - julearn - INFO - ==== Input Data ====
2024-10-17 14:15:59,852 - julearn - INFO - Using dataframe as input
2024-10-17 14:15:59,853 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:15:59,853 - julearn - INFO - Target: target
2024-10-17 14:15:59,853 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:15:59,853 - julearn - INFO - X_types:{}
2024-10-17 14:15:59,853 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:509: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
warn_with_log(
2024-10-17 14:15:59,853 - julearn - INFO - ====================
2024-10-17 14:15:59,854 - julearn - INFO -
2024-10-17 14:15:59,854 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:15:59,854 - julearn - INFO - Step added
2024-10-17 14:15:59,854 - julearn - INFO - Adding step ridge that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:15:59,854 - julearn - INFO - Step added
2024-10-17 14:15:59,854 - julearn - INFO - = Model Parameters =
2024-10-17 14:15:59,854 - julearn - INFO - ====================
2024-10-17 14:15:59,854 - julearn - INFO -
2024-10-17 14:15:59,854 - julearn - INFO - = Data Information =
2024-10-17 14:15:59,855 - julearn - INFO - Problem type: regression
2024-10-17 14:15:59,855 - julearn - INFO - Number of samples: 442
2024-10-17 14:15:59,855 - julearn - INFO - Number of features: 10
2024-10-17 14:15:59,855 - julearn - INFO - ====================
2024-10-17 14:15:59,855 - julearn - INFO -
2024-10-17 14:15:59,855 - julearn - INFO - Target type: float64
2024-10-17 14:15:59,855 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
Now the scores dataframe has all the values for each CV split, but with two scores, under the column names "test_neg_mean_absolute_error" and "test_r2_corr".
print(scores[["test_neg_mean_absolute_error", "test_r2_corr"]].mean())
test_neg_mean_absolute_error -44.264654
test_r2_corr 0.486498
dtype: float64
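The squared Pearson correlation is easy to confuse with the coefficient of determination (sklearn's "r2"). The toy comparison below shows the difference, assuming the squared-correlation score is computed as the square of the Pearson r between true and predicted values, as the description above suggests. The arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Predictions that are perfectly correlated with y_true but rescaled and offset.
y_pred = 2.0 * y_true + 10.0

# Squared Pearson correlation: unaffected by linear rescaling of the predictions.
r2_corr = np.corrcoef(y_true, y_pred)[0, 1] ** 2

# Coefficient of determination: penalizes the scale and offset errors.
r2 = r2_score(y_true, y_pred)

print(f"squared Pearson r: {r2_corr:.3f}")
print(f"coefficient of determination: {r2:.3f}")
```

A high squared correlation therefore says that predictions track the target, not that they are close to it in absolute terms, which is why it is reported alongside the mean absolute error.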
If we want to define a custom scoring metric, we need to define a function that takes the actual and the predicted values as input (in that order) and returns a single value. In this case, we want to compute the Pearson correlation coefficient (r).
def pearson_scorer(y_true, y_pred):
    return scipy.stats.pearsonr(y_true.squeeze(), y_pred.squeeze())[0]
Before using it, we need to convert it to a sklearn scorer and register it with julearn.
register_scorer(scorer_name="pearsonr", scorer=make_scorer(pearson_scorer))
2024-10-17 14:15:59,903 - julearn - INFO - registering scorer named pearsonr
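To make the conversion step concrete, the sketch below shows what make_scorer does with a metric function: it wraps metric(y_true, y_pred) into a callable scorer(estimator, X, y) that predicts first and then applies the metric. The synthetic data and the names X_demo and y_demo are assumptions for illustration and are unrelated to the diabetes data.

```python
import numpy as np
import scipy.stats
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer


def pearson_scorer(y_true, y_pred):
    return scipy.stats.pearsonr(y_true.squeeze(), y_pred.squeeze())[0]


# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

estimator = Ridge().fit(X_demo, y_demo)

# The scorer calls estimator.predict(X) and passes the result to the metric.
scorer = make_scorer(pearson_scorer)
score_via_scorer = scorer(estimator, X_demo, y_demo)

# The same value computed by hand.
score_manual = pearson_scorer(y_demo, estimator.predict(X_demo))

print(score_via_scorer, score_manual)
```

By default make_scorer assumes that greater is better, which suits a correlation. For an error metric, greater_is_better=False would negate the value.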
Now we can use it as another scoring metric.
2024-10-17 14:15:59,903 - julearn - INFO - ==== Input Data ====
2024-10-17 14:15:59,903 - julearn - INFO - Using dataframe as input
2024-10-17 14:15:59,904 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:15:59,904 - julearn - INFO - Target: target
2024-10-17 14:15:59,904 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-10-17 14:15:59,904 - julearn - INFO - X_types:{}
2024-10-17 14:15:59,904 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:509: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
warn_with_log(
2024-10-17 14:15:59,904 - julearn - INFO - ====================
2024-10-17 14:15:59,905 - julearn - INFO -
2024-10-17 14:15:59,905 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:15:59,905 - julearn - INFO - Step added
2024-10-17 14:15:59,905 - julearn - INFO - Adding step ridge that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:15:59,905 - julearn - INFO - Step added
2024-10-17 14:15:59,905 - julearn - INFO - = Model Parameters =
2024-10-17 14:15:59,905 - julearn - INFO - ====================
2024-10-17 14:15:59,905 - julearn - INFO -
2024-10-17 14:15:59,905 - julearn - INFO - = Data Information =
2024-10-17 14:15:59,905 - julearn - INFO - Problem type: regression
2024-10-17 14:15:59,906 - julearn - INFO - Number of samples: 442
2024-10-17 14:15:59,906 - julearn - INFO - Number of features: 10
2024-10-17 14:15:59,906 - julearn - INFO - ====================
2024-10-17 14:15:59,906 - julearn - INFO -
2024-10-17 14:15:59,906 - julearn - INFO - Target type: float64
2024-10-17 14:15:59,906 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)