5.4. Model Evaluation#

The output of run_cross_validation()#

So far, we have seen how to run a cross-validation using the PipelineCreator and run_cross_validation(). But what do we get as output from such a pipeline?

Cross-validation scores#

We consider the iris data example and one of the pipelines from the previous section (feature z-scoring and an SVM).

from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator

from seaborn import load_dataset


df = load_dataset("iris")
X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = "species"
X_types = {
    "continuous": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ]
}

# Create a pipeline
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm")

# Run cross-validation
scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
)
2024-10-17 14:16:11,632 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:11,632 - julearn - INFO - Step added
2024-10-17 14:16:11,632 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:11,632 - julearn - INFO - Step added
2024-10-17 14:16:11,632 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:11,632 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:11,632 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,632 - julearn - INFO -      Target: species
2024-10-17 14:16:11,632 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,633 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:11,633 - julearn - INFO - ====================
2024-10-17 14:16:11,633 - julearn - INFO -
2024-10-17 14:16:11,634 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:11,634 - julearn - INFO - ====================
2024-10-17 14:16:11,634 - julearn - INFO -
2024-10-17 14:16:11,634 - julearn - INFO - = Data Information =
2024-10-17 14:16:11,634 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:11,634 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:11,634 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:11,634 - julearn - INFO - ====================
2024-10-17 14:16:11,634 - julearn - INFO -
2024-10-17 14:16:11,635 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:11,635 - julearn - INFO -      Target type: object
2024-10-17 14:16:11,635 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:11,636 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:11,636 - julearn - INFO - Multi-class classification problem detected #classes = 3.

The scores variable is a pandas.DataFrame object that contains the cross-validated metrics, with one row per fold and one column per metric.

print(scores)
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.005250    0.002500  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.004530    0.002443  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.004432    0.002537  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.004533    0.002444  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.004460    0.002433  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

We see, for example, that the test_score for the third fold is 0.933. This means that the model achieved a score of 0.933 on the validation set of this fold.

We can also see more information, such as the number of samples used for training and testing.

Cross-validation is particularly useful for checking whether a model is overfitting. For this purpose it helps to see not only the test scores for each fold but also the training scores. This can be achieved by setting the return_train_score parameter to True in run_cross_validation():

scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
)

print(scores)
2024-10-17 14:16:11,682 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:11,682 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:11,683 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,683 - julearn - INFO -      Target: species
2024-10-17 14:16:11,683 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,683 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:11,683 - julearn - INFO - ====================
2024-10-17 14:16:11,683 - julearn - INFO -
2024-10-17 14:16:11,684 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:11,684 - julearn - INFO - ====================
2024-10-17 14:16:11,684 - julearn - INFO -
2024-10-17 14:16:11,684 - julearn - INFO - = Data Information =
2024-10-17 14:16:11,684 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:11,684 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:11,684 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:11,684 - julearn - INFO - ====================
2024-10-17 14:16:11,684 - julearn - INFO -
2024-10-17 14:16:11,684 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:11,684 - julearn - INFO -      Target type: object
2024-10-17 14:16:11,685 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:11,685 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:11,685 - julearn - INFO - Multi-class classification problem detected #classes = 3.
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.004789    0.002506  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.004516    0.002410  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.004546    0.002456  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.004501    0.002476  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.004558    0.002456  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 9 columns]

The additional column train_score indicates the score on the training set.

For a model that is not overfitting, the training and test scores should be similar. In our example, the training and test scores are indeed similar.
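As a quick sanity check, one can compare the mean train and test scores directly on the scores DataFrame. The sketch below uses a hand-made stand-in DataFrame (the score values are invented for illustration); with real julearn output, the same train_score and test_score columns exist when return_train_score=True:

```python
import pandas as pd

# Toy stand-in for the `scores` DataFrame returned by run_cross_validation()
# with return_train_score=True (only the relevant columns, invented values).
scores = pd.DataFrame(
    {
        "fold": [0, 1, 2, 3, 4],
        "train_score": [0.975, 0.967, 0.975, 0.958, 0.967],
        "test_score": [0.967, 0.967, 0.933, 0.933, 1.000],
    }
)

# A large gap between mean train and test score hints at overfitting.
gap = scores["train_score"].mean() - scores["test_score"].mean()
print(f"mean train: {scores['train_score'].mean():.3f}")
print(f"mean test:  {scores['test_score'].mean():.3f}")
print(f"gap:        {gap:.3f}")
```

Here the gap is below one percentage point, so there is no sign of overfitting in this toy example.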

The column cv_mdsum might appear a bit cryptic at first glance. It is used in internal checks to verify that the same CV scheme was used when results are compared using julearn’s provided statistical tests. This is nothing you need to worry about at this point.

Returning a model (estimator)#

Now that we have seen that our model doesn’t seem to overfit, we might be interested in checking what our model parameters look like. By setting the parameter return_estimator, we can tell run_cross_validation() to give us the models. It can have three different values:

  1. "cv": This option indicates that we want to get the model that was trained on the entire training data of each CV fold. This means that we get as many models as we have CV folds. They will be returned within the scores DataFrame.

  2. "final": With this setting, an additional model will be trained on the entire input dataset. This model will be returned as a separate variable.

  3. "all": In this scenario, all the estimators ("final" and "cv") will be returned.

For demonstration purposes we will have a closer look at the "final" estimator option.

scores, model = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
    return_estimator="final",
)

print(scores)
2024-10-17 14:16:11,745 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:11,745 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:11,745 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,745 - julearn - INFO -      Target: species
2024-10-17 14:16:11,745 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,745 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:11,745 - julearn - INFO - ====================
2024-10-17 14:16:11,745 - julearn - INFO -
2024-10-17 14:16:11,746 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:11,746 - julearn - INFO - ====================
2024-10-17 14:16:11,746 - julearn - INFO -
2024-10-17 14:16:11,746 - julearn - INFO - = Data Information =
2024-10-17 14:16:11,746 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:11,746 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:11,746 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:11,746 - julearn - INFO - ====================
2024-10-17 14:16:11,746 - julearn - INFO -
2024-10-17 14:16:11,747 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:11,747 - julearn - INFO -      Target type: object
2024-10-17 14:16:11,747 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:11,747 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
2024-10-17 14:16:11,748 - julearn - INFO - Multi-class classification problem detected #classes = 3.
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.004550    0.002423  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.004603    0.002417  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.004570    0.002503  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.004580    0.002424  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.004519    0.002447  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 9 columns]

As we see, the scores DataFrame is the same as before. However, we now have an additional variable model. This variable contains the final estimator that was trained on the entire input dataset.

model
Pipeline(steps=[('set_column_types',
                 SetColumnTypes(X_types={'continuous': ['sepal_length',
                                                        'sepal_width',
                                                        'petal_length',
                                                        'petal_width']})),
                ('zscore', StandardScaler()), ('svm', SVC())])


We can use this estimator object to, for example, inspect the coefficients of the model or make predictions on a held-out test set. To learn more about how to inspect models, please have a look at Inspecting Models.
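Since the returned model is a scikit-learn Pipeline, the usual estimator methods such as predict() are available. The sketch below builds a rough stand-in for it (z-scoring plus SVM, without julearn's set_column_types step) using sklearn's iris arrays instead of the seaborn DataFrame, and predicts on a few samples as one would on a hold-out set:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_arr, y_arr = load_iris(return_X_y=True)

# Rough stand-in for the returned `model`: zscore + SVM, without julearn's
# set_column_types step.
model = Pipeline([("zscore", StandardScaler()), ("svm", SVC())])
model.fit(X_arr, y_arr)

# Predict on (here: previously seen) samples, as one would on a hold-out set.
preds = model.predict(X_arr[:5])
print(preds)  # -> [0 0 0 0 0] (the first five iris samples are all setosa)
```

In practice one would of course predict on data that was not used for fitting.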

Cross-validation splitters#

When performing a cross-validation, we need to split the data into training and validation sets. This is done by a cross-validation splitter, which defines how the data should be split, how many folds should be used, and whether the process should be repeated several times. For example, we might want to shuffle the data before splitting, stratify the splits so that the distribution of targets is represented in each fold, or take certain grouping variables into account so that samples from the same group always end up in the same fold rather than being split across folds.

So far, however, we did not specify anything in that regard, yet the cross-validation was performed and we got five folds (see the five rows in the scores DataFrame above). This is because run_cross_validation() falls back to the scikit-learn defaults: sklearn.model_selection.StratifiedKFold (with k=5) for classification and sklearn.model_selection.KFold (with k=5) for regression.

Note

julearn uses scikit-learn’s defaults here, so these defaults will change whenever scikit-learn changes them.

We can define the cross-validation splitting strategy ourselves by passing an int, str or cross-validation generator to the cv parameter of run_cross_validation(). There are four options:

  1. cv=None: the default described above.

  2. An integer: the same default splitting strategies are used (sklearn.model_selection.StratifiedKFold for classification, sklearn.model_selection.KFold for regression), but the number of folds is set to the provided integer (e.g., cv=10).

  3. A splitter object: to define the entire splitting strategy, one can pass any scikit-learn compatible splitter from sklearn.model_selection. In addition, julearn provides a built-in set of extra splitters that can be found under model_selection (see more about them in Cross-validation splitters).

  4. An iterable that yields the train and test indices for each split.
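The splitter-object and iterable options can be sketched with plain scikit-learn objects; either value could then be passed as cv= to run_cross_validation(). The toy arrays and the manual index pairs below are made up purely for illustration:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 6 samples, 2 features, balanced binary target
X_demo = np.arange(12).reshape(6, 2)
y_demo = np.array([0, 0, 0, 1, 1, 1])

# A full splitter object controls folds, shuffling and seeding
splitter = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
print(splitter.get_n_splits())  # -> 3

# An iterable of (train_indices, test_indices) pairs also works
manual_cv = [([0, 1, 3, 4], [2, 5]), ([0, 2, 3, 5], [1, 4])]
for train_idx, test_idx in manual_cv:
    print("train:", train_idx, "test:", test_idx)
```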

Using the same pipeline creator as above, we can define a cv-splitter and pass it to run_cross_validation() as follows:

from sklearn.model_selection import RepeatedStratifiedKFold

cv_splitter = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
    cv=cv_splitter,
)
2024-10-17 14:16:11,821 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:11,821 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:11,821 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,821 - julearn - INFO -      Target: species
2024-10-17 14:16:11,821 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,821 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:11,822 - julearn - INFO - ====================
2024-10-17 14:16:11,822 - julearn - INFO -
2024-10-17 14:16:11,822 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:11,822 - julearn - INFO - ====================
2024-10-17 14:16:11,822 - julearn - INFO -
2024-10-17 14:16:11,822 - julearn - INFO - = Data Information =
2024-10-17 14:16:11,822 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:11,822 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:11,822 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:11,822 - julearn - INFO - ====================
2024-10-17 14:16:11,822 - julearn - INFO -
2024-10-17 14:16:11,823 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:11,823 - julearn - INFO -      Target type: object
2024-10-17 14:16:11,823 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:11,823 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
2024-10-17 14:16:11,824 - julearn - INFO - Multi-class classification problem detected #classes = 3.

This will repeat a 5-fold stratified cross-validation 2 times, so the returned scores DataFrame will have 10 rows. We set random_state to an arbitrary integer to make the splitting of the data reproducible.
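We can verify the number of splits the splitter produces without running the full cross-validation. A small sketch, using sklearn's iris arrays in place of the seaborn DataFrame:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold

X_arr, y_arr = load_iris(return_X_y=True)
cv_splitter = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# 5 folds x 2 repeats = 10 train/test splits, hence 10 rows in `scores`
n_splits = sum(1 for _ in cv_splitter.split(X_arr, y_arr))
print(n_splits)  # -> 10
```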

print(scores)
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.004861    0.002455  ...     0  7449876d309382acfef94df9d102aa76
1  0.004528    0.002451  ...     1  7449876d309382acfef94df9d102aa76
2  0.004600    0.002416  ...     2  7449876d309382acfef94df9d102aa76
3  0.004559    0.002458  ...     3  7449876d309382acfef94df9d102aa76
4  0.004479    0.002392  ...     4  7449876d309382acfef94df9d102aa76
5  0.004464    0.002456  ...     0  7449876d309382acfef94df9d102aa76
6  0.004496    0.002401  ...     1  7449876d309382acfef94df9d102aa76
7  0.004746    0.002468  ...     2  7449876d309382acfef94df9d102aa76
8  0.004680    0.002466  ...     3  7449876d309382acfef94df9d102aa76
9  0.004513    0.002440  ...     4  7449876d309382acfef94df9d102aa76

[10 rows x 9 columns]

Scoring metrics#

Nice: we have a basic pipeline with feature preprocessing and a model, we defined the splitting strategy for the cross-validation the way we want it, and we had a look at the resulting train and test scores. But what do these scores actually mean?

As with the cv-splitter, run_cross_validation() has a default for the scorer used to evaluate the cross-validation: the model’s default scorer. Remember, we used a support vector classifier with the y (target) variable being the species of the iris dataset (possible values: 'setosa', 'versicolor' or 'virginica'). We therefore have a multi-class classification problem (not to be confused with multi-label classification!). Checking scikit-learn’s documentation of the support vector classifier’s default scorer, sklearn.svm.SVC.score(), we can see that it returns the “mean accuracy on the given test data and labels”.
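To see that this default is plain accuracy, here is a small scikit-learn-only sketch. Note that it fits and scores on the same data purely for brevity, which is not how one should evaluate a model:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X_arr, y_arr = load_iris(return_X_y=True)
clf = SVC().fit(X_arr, y_arr)

# SVC.score() is the mean accuracy, i.e. identical to accuracy_score
# computed on the same predictions.
default_score = clf.score(X_arr, y_arr)
manual_score = accuracy_score(y_arr, clf.predict(X_arr))
print(default_score == manual_score)  # -> True
```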

With the scoring parameter of run_cross_validation(), one can define the scoring function to be used. In addition to the scorers available in scikit-learn’s sklearn.metrics, julearn provides extra internal scorers and the possibility to define custom scorers. To see the available julearn scorers, one can use the list_scorers() function:

from julearn import scoring
from pprint import pprint  # for nice printing

pprint(scoring.list_scorers())
['accuracy',
 'adjusted_mutual_info_score',
 'adjusted_rand_score',
 'average_precision',
 'balanced_accuracy',
 'completeness_score',
 'd2_absolute_error_score',
 'explained_variance',
 'f1',
 'f1_macro',
 'f1_micro',
 'f1_samples',
 'f1_weighted',
 'fowlkes_mallows_score',
 'homogeneity_score',
 'jaccard',
 'jaccard_macro',
 'jaccard_micro',
 'jaccard_samples',
 'jaccard_weighted',
 'matthews_corrcoef',
 'max_error',
 'mutual_info_score',
 'neg_brier_score',
 'neg_log_loss',
 'neg_mean_absolute_error',
 'neg_mean_absolute_percentage_error',
 'neg_mean_gamma_deviance',
 'neg_mean_poisson_deviance',
 'neg_mean_squared_error',
 'neg_mean_squared_log_error',
 'neg_median_absolute_error',
 'neg_negative_likelihood_ratio',
 'neg_root_mean_squared_error',
 'neg_root_mean_squared_log_error',
 'normalized_mutual_info_score',
 'positive_likelihood_ratio',
 'precision',
 'precision_macro',
 'precision_micro',
 'precision_samples',
 'precision_weighted',
 'r2',
 'rand_score',
 'recall',
 'recall_macro',
 'recall_micro',
 'recall_samples',
 'recall_weighted',
 'roc_auc',
 'roc_auc_ovo',
 'roc_auc_ovo_weighted',
 'roc_auc_ovr',
 'roc_auc_ovr_weighted',
 'top_k_accuracy',
 'v_measure_score',
 'r2_corr',
 'r_corr',
 'pearsonr']
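Many of these names coincide with scikit-learn's own scorer names. For those, sklearn.metrics.get_scorer turns the string into a callable of the form scorer(estimator, X, y), which is a quick way to check what a given name computes (scored in-sample here, for brevity only):

```python
from sklearn.datasets import load_iris
from sklearn.metrics import get_scorer
from sklearn.svm import SVC

X_arr, y_arr = load_iris(return_X_y=True)
clf = SVC().fit(X_arr, y_arr)

# Resolve a scorer name to a callable and apply it to a fitted estimator
scorer = get_scorer("balanced_accuracy")
bal_acc = scorer(clf, X_arr, y_arr)
print(round(bal_acc, 3))
```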

To use a julearn scorer, one can pass its name as a string to the scoring parameter of run_cross_validation(). If multiple scorers are needed, a list of strings can be passed. For example, if we are interested in both the accuracy and the f1 scores, we can do the following:

scoring = ["accuracy", "f1_macro"]

scores = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator,
    return_train_score=True,
    cv=cv_splitter,
    scoring=scoring,
)
2024-10-17 14:16:11,937 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:11,937 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:11,937 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,937 - julearn - INFO -      Target: species
2024-10-17 14:16:11,937 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:11,937 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:11,938 - julearn - INFO - ====================
2024-10-17 14:16:11,938 - julearn - INFO -
2024-10-17 14:16:11,938 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:11,938 - julearn - INFO - ====================
2024-10-17 14:16:11,938 - julearn - INFO -
2024-10-17 14:16:11,938 - julearn - INFO - = Data Information =
2024-10-17 14:16:11,938 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:11,939 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:11,939 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:11,939 - julearn - INFO - ====================
2024-10-17 14:16:11,939 - julearn - INFO -
2024-10-17 14:16:11,939 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:11,939 - julearn - INFO -      Target type: object
2024-10-17 14:16:11,939 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:11,940 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
2024-10-17 14:16:11,940 - julearn - INFO - Multi-class classification problem detected #classes = 3.

The scores DataFrame will now have train- and test-score columns for both scorers:

print(scores)
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.004915    0.004139  ...     0  7449876d309382acfef94df9d102aa76
1  0.004510    0.004036  ...     1  7449876d309382acfef94df9d102aa76
2  0.004515    0.004013  ...     2  7449876d309382acfef94df9d102aa76
3  0.004513    0.004041  ...     3  7449876d309382acfef94df9d102aa76
4  0.004461    0.004018  ...     4  7449876d309382acfef94df9d102aa76
5  0.004534    0.004063  ...     0  7449876d309382acfef94df9d102aa76
6  0.004610    0.004025  ...     1  7449876d309382acfef94df9d102aa76
7  0.004615    0.004115  ...     2  7449876d309382acfef94df9d102aa76
8  0.004813    0.004244  ...     3  7449876d309382acfef94df9d102aa76
9  0.005003    0.004204  ...     4  7449876d309382acfef94df9d102aa76

[10 rows x 11 columns]
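This naming mirrors plain scikit-learn, where a comparable cross_validate call produces one train and one test entry per scorer. A sketch using sklearn's iris arrays instead of the seaborn DataFrame:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_arr, y_arr = load_iris(return_X_y=True)
pipe = Pipeline([("zscore", StandardScaler()), ("svm", SVC())])
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

# One train_* and one test_* score array per requested scorer
res = cross_validate(
    pipe, X_arr, y_arr, cv=cv,
    scoring=["accuracy", "f1_macro"], return_train_score=True,
)
print(sorted(k for k in res if k.endswith(("accuracy", "f1_macro"))))
# -> ['test_accuracy', 'test_f1_macro', 'train_accuracy', 'train_f1_macro']
```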

Total running time of the script: (0 minutes 0.459 seconds)