5.4. Model Evaluation#
The output of run_cross_validation()#
So far, we saw how to run a cross-validation using the PipelineCreator and run_cross_validation(). But what do we get as output from such a pipeline?
Cross-validation scores#
We consider the iris data example and one of the pipelines from the previous section (feature z-scoring followed by an svm).
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator
from seaborn import load_dataset

df = load_dataset("iris")
X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = "species"
X_types = {
    "continuous": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ]
}

# Create a pipeline
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm")

# Run cross-validation
scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
)
2024-05-16 08:53:14,157 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:53:14,157 - julearn - INFO - Step added
2024-05-16 08:53:14,157 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:53:14,158 - julearn - INFO - Step added
2024-05-16 08:53:14,158 - julearn - INFO - ==== Input Data ====
2024-05-16 08:53:14,158 - julearn - INFO - Using dataframe as input
2024-05-16 08:53:14,158 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,158 - julearn - INFO - Target: species
2024-05-16 08:53:14,158 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,158 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-16 08:53:14,159 - julearn - INFO - ====================
2024-05-16 08:53:14,159 - julearn - INFO -
2024-05-16 08:53:14,159 - julearn - INFO - = Model Parameters =
2024-05-16 08:53:14,159 - julearn - INFO - ====================
2024-05-16 08:53:14,160 - julearn - INFO -
2024-05-16 08:53:14,160 - julearn - INFO - = Data Information =
2024-05-16 08:53:14,160 - julearn - INFO - Problem type: classification
2024-05-16 08:53:14,160 - julearn - INFO - Number of samples: 150
2024-05-16 08:53:14,160 - julearn - INFO - Number of features: 4
2024-05-16 08:53:14,160 - julearn - INFO - ====================
2024-05-16 08:53:14,160 - julearn - INFO -
2024-05-16 08:53:14,160 - julearn - INFO - Number of classes: 3
2024-05-16 08:53:14,160 - julearn - INFO - Target type: object
2024-05-16 08:53:14,161 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-16 08:53:14,161 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-16 08:53:14,161 - julearn - INFO - Multi-class classification problem detected #classes = 3.
The scores variable is a pandas.DataFrame object that contains the cross-validated metrics, with one column per metric and one row per fold.
print(scores)
fit_time score_time ... fold cv_mdsum
0 0.005626 0.002875 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.005048 0.002621 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.004768 0.002676 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.004935 0.002719 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.005126 0.002661 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
We see that, for example, the test_score for the third fold is 0.933. This means that the model achieved a score of 0.933 on the validation set of this fold.
We can also see more information, such as the number of samples used for training and testing.
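Because the printed DataFrame is truncated with "...", it can help to display all columns and summarise the test scores across folds. The sketch below uses a small stand-in DataFrame rather than a real julearn result: apart from the 0.933 for the third fold, all values are made up, and the column names n_train and n_test are assumptions about how the sample counts are labelled.

```python
import pandas as pd

# Stand-in for the DataFrame returned by run_cross_validation().
# Values are illustrative only; "n_train" and "n_test" are assumed column names.
scores = pd.DataFrame(
    {
        "test_score": [0.967, 1.0, 0.933, 0.933, 1.0],
        "n_train": [120] * 5,
        "n_test": [30] * 5,
        "fold": [0, 1, 2, 3, 4],
    }
)

# Show every column instead of the "..." placeholder.
with pd.option_context("display.max_columns", None, "display.width", 200):
    print(scores)

# Score of the third fold (folds are numbered from 0).
third_fold = scores.loc[scores["fold"] == 2, "test_score"].iloc[0]

# Mean and standard deviation of the test score across folds.
mean_score = scores["test_score"].mean()
std_score = scores["test_score"].std()
print(f"fold 2: {third_fold:.3f}, mean: {mean_score:.3f}, std: {std_score:.3f}")
```

With a real scores object, the construction step can be skipped and the remaining lines applied directly.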
Cross-validation is particularly useful for checking whether a model is overfitting. For this purpose it is useful to see not only the test scores for each fold but also the training scores. This can be achieved by setting the return_train_score parameter to True in run_cross_validation():
scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
)
print(scores)
2024-05-16 08:53:14,212 - julearn - INFO - ==== Input Data ====
2024-05-16 08:53:14,212 - julearn - INFO - Using dataframe as input
2024-05-16 08:53:14,212 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,212 - julearn - INFO - Target: species
2024-05-16 08:53:14,212 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,212 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-16 08:53:14,213 - julearn - INFO - ====================
2024-05-16 08:53:14,213 - julearn - INFO -
2024-05-16 08:53:14,214 - julearn - INFO - = Model Parameters =
2024-05-16 08:53:14,214 - julearn - INFO - ====================
2024-05-16 08:53:14,214 - julearn - INFO -
2024-05-16 08:53:14,214 - julearn - INFO - = Data Information =
2024-05-16 08:53:14,214 - julearn - INFO - Problem type: classification
2024-05-16 08:53:14,214 - julearn - INFO - Number of samples: 150
2024-05-16 08:53:14,214 - julearn - INFO - Number of features: 4
2024-05-16 08:53:14,214 - julearn - INFO - ====================
2024-05-16 08:53:14,214 - julearn - INFO -
2024-05-16 08:53:14,214 - julearn - INFO - Number of classes: 3
2024-05-16 08:53:14,214 - julearn - INFO - Target type: object
2024-05-16 08:53:14,215 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-16 08:53:14,215 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-16 08:53:14,215 - julearn - INFO - Multi-class classification problem detected #classes = 3.
fit_time score_time ... fold cv_mdsum
0 0.005365 0.002706 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.004995 0.002661 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.005130 0.002757 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.005427 0.002702 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.005094 0.002557 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 9 columns]
The additional column train_score indicates the score on the training set.
For a model that is not overfitting, the training and test scores should be similar. In our example, the training and test scores are indeed similar.
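One simple way to make "similar" concrete is to look at the per-fold difference between training and test scores. The following is a minimal sketch with hypothetical score values; the threshold is an arbitrary illustration and not a julearn convention.

```python
from statistics import mean

# Hypothetical per-fold scores, e.g. scores["train_score"].tolist()
# and scores["test_score"].tolist() from a real run.
train_scores = [0.975, 0.967, 0.983, 0.975, 0.967]
test_scores = [0.967, 1.0, 0.933, 0.933, 1.0]

# Per-fold gap: positive values mean the model did better on the training data.
gaps = [tr - te for tr, te in zip(train_scores, test_scores)]
mean_gap = mean(gaps)

# Arbitrary illustrative threshold (an assumption, not a rule).
THRESHOLD = 0.05
looks_overfit = mean_gap > THRESHOLD
print(f"mean train-test gap: {mean_gap:+.3f}, overfitting suspected: {looks_overfit}")
```

A mean gap close to zero, as in these made-up numbers, is what one would expect from a model that generalises about as well as it fits.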
The column cv_mdsum might appear a bit cryptic at first glance. This column is used in internal checks, to verify that the same CV was used when results are compared using julearn’s provided statistical tests. This is nothing you need to worry about at this point.
Returning a model (estimator)#
Now that we saw that our model doesn’t seem to overfit, we might be interested in checking what our model parameters look like. By setting the parameter return_estimator, we can tell run_cross_validation() to give us the models. It can have three different values:
"cv": This option indicates that we want to get the model that was trained on the entire training data of each CV fold. This means that we get as many models as we have CV folds. They will be returned within the scores DataFrame.
"final": With this setting, an additional model will be trained on the entire input dataset. This model will be returned as a separate variable.
"all": In this scenario, all the estimators ("final" and "cv") will be returned.
For demonstration purposes we will have a closer look at the "final" estimator option.
scores, model = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
    return_estimator="final",
)
print(scores)
2024-05-16 08:53:14,281 - julearn - INFO - ==== Input Data ====
2024-05-16 08:53:14,281 - julearn - INFO - Using dataframe as input
2024-05-16 08:53:14,281 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,281 - julearn - INFO - Target: species
2024-05-16 08:53:14,281 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,281 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-16 08:53:14,282 - julearn - INFO - ====================
2024-05-16 08:53:14,282 - julearn - INFO -
2024-05-16 08:53:14,283 - julearn - INFO - = Model Parameters =
2024-05-16 08:53:14,283 - julearn - INFO - ====================
2024-05-16 08:53:14,283 - julearn - INFO -
2024-05-16 08:53:14,283 - julearn - INFO - = Data Information =
2024-05-16 08:53:14,283 - julearn - INFO - Problem type: classification
2024-05-16 08:53:14,283 - julearn - INFO - Number of samples: 150
2024-05-16 08:53:14,283 - julearn - INFO - Number of features: 4
2024-05-16 08:53:14,283 - julearn - INFO - ====================
2024-05-16 08:53:14,283 - julearn - INFO -
2024-05-16 08:53:14,283 - julearn - INFO - Number of classes: 3
2024-05-16 08:53:14,283 - julearn - INFO - Target type: object
2024-05-16 08:53:14,284 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-16 08:53:14,284 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-16 08:53:14,284 - julearn - INFO - Multi-class classification problem detected #classes = 3.
2024-05-16 08:53:14,344 - julearn - INFO - Fitting final model
fit_time score_time ... fold cv_mdsum
0 0.005320 0.002996 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.006621 0.002935 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.005409 0.002712 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.005277 0.002659 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.004907 0.002529 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 9 columns]
As we see, the scores DataFrame is the same as before. However, we now have an additional variable model. This variable contains the final estimator that was trained on the entire input dataset.
model
We can use this estimator object, for example, to inspect the coefficients of the model or to make predictions on a hold-out test set. To learn more about how to inspect models, please have a look at Inspecting Models.
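As a rough illustration of what can be done with a fitted estimator, the sketch below builds a comparable scikit-learn pipeline by hand (z-scoring followed by a support vector classifier) and fits it on the iris data shipped with scikit-learn. It is a stand-in, not the object that julearn returns: step names, column handling and defaults may differ, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in for the "final" estimator: z-score the features, then fit an SVC
# on the whole dataset. This is NOT the julearn-returned object.
iris = load_iris()
X_all, y_all = iris.data, iris.target
stand_in = make_pipeline(StandardScaler(), SVC())
stand_in.fit(X_all, y_all)

# Predict on new samples (here simply the first three rows of the data).
predictions = stand_in.predict(X_all[:3])
print(predictions)

# Inspect a fitted step: number of support vectors per class.
svc = stand_in.named_steps["svc"]
print(svc.n_support_)
```

In practice the samples passed to predict() would come from data that was not used for fitting.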
Cross-validation splitters#
When performing a cross-validation, we need to split the data into training and validation sets. This is done by a cross-validation splitter, which defines how the data should be split, how many folds should be used and whether to repeat the process several times. For example, we might want to shuffle the data before splitting, stratify the splits so that the distribution of targets is represented in each individual fold, or consider certain grouping variables in the splitting process, so that samples from the same group are always in the same fold and not split across folds.
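To make the idea of stratification concrete, here is a minimal pure-Python sketch that deals the samples of each class to the folds in turn, so every fold receives roughly the same class proportions. It is a simplified illustration without shuffling, not the algorithm scikit-learn implements; the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds):
    """Assign each sample index to a fold so class proportions stay balanced."""
    folds = [[] for _ in range(n_folds)]
    per_class = defaultdict(list)
    for idx, label in enumerate(labels):
        per_class[label].append(idx)
    # Deal the samples of each class to the folds in turn (round-robin).
    for indices in per_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


# Toy target: 10 samples of class "a" and 5 of class "b".
labels = ["a"] * 10 + ["b"] * 5
folds = stratified_folds(labels, n_folds=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

Each fold ends up with the same 2:1 ratio of "a" to "b" as the full toy target.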
So far, however, we didn’t specify anything in that regard, and still the cross-validation was performed and we got five folds (see the five rows above in the scores DataFrame). This is because the default behaviour in run_cross_validation() falls back to the scikit-learn defaults, which are a sklearn.model_selection.StratifiedKFold (with k=5) for classification and a sklearn.model_selection.KFold (with k=5) for regression. The "Using outer CV scheme" line in the log output reports the splitter that was actually used; in the runs above it shows KFold(n_splits=5, random_state=None, shuffle=False).
Note
These defaults will change when they are changed in scikit-learn, as julearn uses scikit-learn’s defaults here.
We can define the cross-validation splitting strategy ourselves by passing an int, str or cross-validation generator to the cv parameter of run_cross_validation(). The first option is the default described above, cv=None.
The second option is to pass only an integer to cv. In that case, the same default splitting strategies will be used (sklearn.model_selection.StratifiedKFold for classification, sklearn.model_selection.KFold for regression), but the number of folds will be changed to the value of the provided integer (e.g., cv=10).
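The way scikit-learn itself resolves cv=None and an integer cv can be checked with sklearn.model_selection.check_cv. The sketch below only demonstrates scikit-learn's behaviour; whether julearn routes through this exact function is an assumption, and the variable names are hypothetical.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

y_iris = load_iris().target

# cv=None: scikit-learn's default of 5 folds.
default_clf = check_cv(cv=None, y=y_iris, classifier=True)
default_reg = check_cv(cv=None, classifier=False)

# cv=10: same strategy as the classification default, but with 10 folds.
ten_clf = check_cv(cv=10, y=y_iris, classifier=True)

print(type(default_clf).__name__, default_clf.get_n_splits())
print(type(default_reg).__name__, default_reg.get_n_splits())
print(type(ten_clf).__name__, ten_clf.get_n_splits())
```

The classifier flag together with a multi-class target is what switches scikit-learn from KFold to StratifiedKFold.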
The third option is to define the entire splitting strategy by passing any scikit-learn compatible splitter from sklearn.model_selection to cv. In addition, julearn provides a built-in set of additional splitters that can be found under julearn.model_selection (see more about them in Cross-validation splitters).
The fourth option is to pass an iterable that yields the train and test
indices for each split.
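Such an iterable can be as simple as a generator of (train, test) index pairs. The following is a minimal sketch of consecutive, unshuffled folds in pure Python; the function name is hypothetical and the scheme is deliberately basic.

```python
def contiguous_kfold(n_samples, n_folds):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        stop = start + base + (1 if fold < extra else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


# Materialise the generator so the splits can be iterated more than once.
splits = list(contiguous_kfold(n_samples=10, n_folds=3))
for train, test in splits:
    print("train:", train, "test:", test)
```

A generator is exhausted after one pass, so converting it to a list first is the safer choice when the splits may be consumed more than once.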
Using the same pipeline creator as above, we can define a cv-splitter and pass it to run_cross_validation() as follows:
from sklearn.model_selection import RepeatedStratifiedKFold
cv_splitter = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
    cv=cv_splitter,
)
2024-05-16 08:53:14,367 - julearn - INFO - ==== Input Data ====
2024-05-16 08:53:14,368 - julearn - INFO - Using dataframe as input
2024-05-16 08:53:14,368 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,368 - julearn - INFO - Target: species
2024-05-16 08:53:14,368 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,368 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-16 08:53:14,369 - julearn - INFO - ====================
2024-05-16 08:53:14,369 - julearn - INFO -
2024-05-16 08:53:14,369 - julearn - INFO - = Model Parameters =
2024-05-16 08:53:14,369 - julearn - INFO - ====================
2024-05-16 08:53:14,369 - julearn - INFO -
2024-05-16 08:53:14,369 - julearn - INFO - = Data Information =
2024-05-16 08:53:14,369 - julearn - INFO - Problem type: classification
2024-05-16 08:53:14,370 - julearn - INFO - Number of samples: 150
2024-05-16 08:53:14,370 - julearn - INFO - Number of features: 4
2024-05-16 08:53:14,370 - julearn - INFO - ====================
2024-05-16 08:53:14,370 - julearn - INFO -
2024-05-16 08:53:14,370 - julearn - INFO - Number of classes: 3
2024-05-16 08:53:14,370 - julearn - INFO - Target type: object
2024-05-16 08:53:14,370 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-16 08:53:14,371 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
2024-05-16 08:53:14,371 - julearn - INFO - Multi-class classification problem detected #classes = 3.
This will repeat a 5-fold stratified cross-validation 2 times, so the returned scores DataFrame will have 10 rows. We set the random_state to an arbitrary integer to make the splitting of the data reproducible.
print(scores)
fit_time score_time ... fold cv_mdsum
0 0.005634 0.002784 ... 0 7449876d309382acfef94df9d102aa76
1 0.005147 0.002719 ... 1 7449876d309382acfef94df9d102aa76
2 0.005214 0.002629 ... 2 7449876d309382acfef94df9d102aa76
3 0.005085 0.002744 ... 3 7449876d309382acfef94df9d102aa76
4 0.005271 0.002647 ... 4 7449876d309382acfef94df9d102aa76
5 0.005548 0.002772 ... 0 7449876d309382acfef94df9d102aa76
6 0.005117 0.002553 ... 1 7449876d309382acfef94df9d102aa76
7 0.004966 0.002624 ... 2 7449876d309382acfef94df9d102aa76
8 0.005060 0.002633 ... 3 7449876d309382acfef94df9d102aa76
9 0.005027 0.002562 ... 4 7449876d309382acfef94df9d102aa76
[10 rows x 9 columns]
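Both properties, the number of splits and the reproducibility given by random_state, can be checked on the splitter alone, without fitting any model. This is a small sketch on arbitrary toy data; the helper name and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy data: 30 samples, three balanced classes (feature values are arbitrary).
X_toy = np.zeros((30, 1))
y_toy = np.repeat([0, 1, 2], 10)


def collect_test_indices(random_state):
    """Return the test indices of every split for a given seed."""
    cv = RepeatedStratifiedKFold(
        n_splits=5, n_repeats=2, random_state=random_state
    )
    return [test.tolist() for _, test in cv.split(X_toy, y_toy)]


splits_a = collect_test_indices(42)
splits_b = collect_test_indices(42)

print(len(splits_a))         # n_splits * n_repeats
print(splits_a == splits_b)  # same seed, same splits
```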
Scoring metrics#
Nice, we have a basic pipeline with feature preprocessing and a model, we defined the splitting strategy for the cross-validation the way we want it, and we had a look at the resulting train and test scores. But what do these scores even mean?
As with the cv-splitter, run_cross_validation() has a default assumption for the scorer used to evaluate the cross-validation, which is always the model’s default scorer. Remember, we used a support vector classifier with the y (target) variable being the species of the iris dataset (possible values: 'setosa', 'versicolor' or 'virginica'). Therefore we have a multi-class classification (not to be confused with a multi-label classification!). Checking scikit-learn’s documentation of a support vector classifier’s default scorer sklearn.svm.SVC.score(), we can see that this is the ‘mean accuracy on the given test data and labels’.
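Mean accuracy is simply the fraction of samples whose predicted label matches the true label. A minimal sketch with made-up labels (the function name and the label lists are hypothetical):

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only: 5 of the 6 predictions are correct.
y_true = ["setosa", "versicolor", "virginica", "virginica", "setosa", "versicolor"]
y_pred = ["setosa", "versicolor", "versicolor", "virginica", "setosa", "versicolor"]
accuracy = mean_accuracy(y_true, y_pred)
print(f"{accuracy:.3f}")
```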
With the scoring parameter of run_cross_validation(), one can define the scoring function to be used. On top of the scorers available in scikit-learn (sklearn.metrics), julearn adds further internal scorers and the possibility to define custom scorers. To see the available julearn scorers, one can use the list_scorers() function:
from julearn import scoring
from pprint import pprint # for nice printing
pprint(scoring.list_scorers())
['accuracy',
'adjusted_mutual_info_score',
'adjusted_rand_score',
'average_precision',
'balanced_accuracy',
'completeness_score',
'explained_variance',
'f1',
'f1_macro',
'f1_micro',
'f1_samples',
'f1_weighted',
'fowlkes_mallows_score',
'homogeneity_score',
'jaccard',
'jaccard_macro',
'jaccard_micro',
'jaccard_samples',
'jaccard_weighted',
'matthews_corrcoef',
'max_error',
'mutual_info_score',
'neg_brier_score',
'neg_log_loss',
'neg_mean_absolute_error',
'neg_mean_absolute_percentage_error',
'neg_mean_gamma_deviance',
'neg_mean_poisson_deviance',
'neg_mean_squared_error',
'neg_mean_squared_log_error',
'neg_median_absolute_error',
'neg_negative_likelihood_ratio',
'neg_root_mean_squared_error',
'neg_root_mean_squared_log_error',
'normalized_mutual_info_score',
'positive_likelihood_ratio',
'precision',
'precision_macro',
'precision_micro',
'precision_samples',
'precision_weighted',
'r2',
'rand_score',
'recall',
'recall_macro',
'recall_micro',
'recall_samples',
'recall_weighted',
'roc_auc',
'roc_auc_ovo',
'roc_auc_ovo_weighted',
'roc_auc_ovr',
'roc_auc_ovr_weighted',
'top_k_accuracy',
'v_measure_score',
'r2_corr',
'r_corr',
'pearsonr']
To use a julearn scorer, one can pass the name of the scorer as a string to the scoring parameter of run_cross_validation(). If multiple different scorers need to be used, a list of strings can be passed. For example, if we were interested in the accuracy and the macro-averaged f1 scores, we could do the following:
# Use a distinct name so the imported julearn.scoring module is not shadowed
scoring_names = ["accuracy", "f1_macro"]
scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
    cv=cv_splitter,
    scoring=scoring_names,
)
2024-05-16 08:53:14,496 - julearn - INFO - ==== Input Data ====
2024-05-16 08:53:14,496 - julearn - INFO - Using dataframe as input
2024-05-16 08:53:14,496 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,496 - julearn - INFO - Target: species
2024-05-16 08:53:14,496 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-16 08:53:14,496 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-16 08:53:14,497 - julearn - INFO - ====================
2024-05-16 08:53:14,497 - julearn - INFO -
2024-05-16 08:53:14,497 - julearn - INFO - = Model Parameters =
2024-05-16 08:53:14,498 - julearn - INFO - ====================
2024-05-16 08:53:14,498 - julearn - INFO -
2024-05-16 08:53:14,498 - julearn - INFO - = Data Information =
2024-05-16 08:53:14,498 - julearn - INFO - Problem type: classification
2024-05-16 08:53:14,498 - julearn - INFO - Number of samples: 150
2024-05-16 08:53:14,498 - julearn - INFO - Number of features: 4
2024-05-16 08:53:14,498 - julearn - INFO - ====================
2024-05-16 08:53:14,498 - julearn - INFO -
2024-05-16 08:53:14,498 - julearn - INFO - Number of classes: 3
2024-05-16 08:53:14,498 - julearn - INFO - Target type: object
2024-05-16 08:53:14,499 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-16 08:53:14,499 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
2024-05-16 08:53:14,499 - julearn - INFO - Multi-class classification problem detected #classes = 3.
The scores DataFrame will now have train- and test-score columns for both scorers:
print(scores)
fit_time score_time ... fold cv_mdsum
0 0.005236 0.004477 ... 0 7449876d309382acfef94df9d102aa76
1 0.004908 0.004165 ... 1 7449876d309382acfef94df9d102aa76
2 0.005112 0.004352 ... 2 7449876d309382acfef94df9d102aa76
3 0.004920 0.004140 ... 3 7449876d309382acfef94df9d102aa76
4 0.004872 0.004271 ... 4 7449876d309382acfef94df9d102aa76
5 0.004958 0.004198 ... 0 7449876d309382acfef94df9d102aa76
6 0.005766 0.005517 ... 1 7449876d309382acfef94df9d102aa76
7 0.005805 0.004511 ... 2 7449876d309382acfef94df9d102aa76
8 0.005545 0.004497 ... 3 7449876d309382acfef94df9d102aa76
9 0.005196 0.004300 ... 4 7449876d309382acfef94df9d102aa76
[10 rows x 11 columns]
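To see how a macro-averaged F1 score differs from accuracy, the sketch below computes it by hand: the F1 score is calculated for each class separately and the per-class values are then averaged without weighting. The function name and labels are hypothetical, and giving a class without any true or predicted samples a score of 0 is a convention chosen for this sketch.

```python
def f1_macro_by_hand(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class_f1 = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Convention used here: a class with no true or predicted samples scores 0.
        per_class_f1.append(2 * tp / denom if denom else 0.0)
    return sum(per_class_f1) / len(per_class_f1)


# Made-up labels for illustration only.
true_labels = ["setosa", "versicolor", "virginica", "virginica", "setosa", "versicolor"]
pred_labels = ["setosa", "versicolor", "versicolor", "virginica", "setosa", "versicolor"]
macro_f1 = f1_macro_by_hand(true_labels, pred_labels)
print(f"{macro_f1:.3f}")
```

Because every class contributes equally to the average, a poorly predicted rare class lowers the macro F1 more than it lowers the accuracy.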