5.5. Model Comparison#

In the previous section, we saw how to evaluate a single model using cross-validation, and the example model seemed to perform decently well. But how do we know it can't be better? Building machine-learning models is always a matter of benchmarking: we want to know how our model performs compared to alternatives. Cross-validation tells us how well a single model performs, but it does not, by itself, tell us whether one model is better than another. We could build several models, cross-validate each, and compare the results by hand, but that would be both tedious and error-prone. What we need is a way to compare models in a statistically sound manner.

To statistically compare different models, julearn provides a built-in corrected t-test. To see how to apply it, we will first build three different models, each using a different learning algorithm.
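Why a *corrected* test? In repeated cross-validation the training sets overlap across folds, so the fold-wise score differences between two models are not independent, and a standard paired t-test becomes overconfident. As a sketch, assuming the widely used Nadeau and Bengio (2003) formulation of the corrected resampled t-test:

t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_{\text{test}}}{n_{\text{train}}}\right)\,\hat{\sigma}^2}}

where d_1, ..., d_k are the fold-wise score differences between two models (k = splits × repeats), \bar{d} is their mean, \hat{\sigma}^2 their sample variance, and n_train, n_test are the fold sizes. The n_test/n_train term inflates the variance estimate to compensate for the correlation induced by overlapping training sets.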

To perform a binary classification (rather than a multi-class one), we will switch to the breast cancer dataset from scikit-learn as an example. The target to be predicted is whether the cancer is malignant or benign.

import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

X = data.feature_names.tolist()
y = "target"
X_types = {"continuous": [".*"]}


df.head()
   mean radius  mean texture  mean perimeter  mean area  ...  worst concave points  worst symmetry  worst fractal dimension  target
0        17.99         10.38          122.80     1001.0  ...                0.2654          0.4601                  0.11890       0
1        20.57         17.77          132.90     1326.0  ...                0.1860          0.2750                  0.08902       0
2        19.69         21.25          130.00     1203.0  ...                0.2430          0.3613                  0.08758       0
3        11.42         20.38           77.58      386.1  ...                0.2575          0.6638                  0.17300       0
4        20.29         14.34          135.10     1297.0  ...                0.1625          0.2364                  0.07678       0

5 rows × 31 columns
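As a side note, recent scikit-learn versions can return a DataFrame directly. The following sketch is equivalent to the loading code above (assuming `as_frame` is available in your scikit-learn version):

# Hypothetical alternative: load the dataset directly as a DataFrame
# (the resulting frame already contains a "target" column).
df = load_breast_cancer(as_frame=True).frame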



We will use the same cross-validation splitter as in the previous section and two scorers: accuracy and roc_auc.

from sklearn.model_selection import RepeatedStratifiedKFold

cv_splitter = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=42)

scoring = ["accuracy", "roc_auc"]
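As a quick sanity check (not part of the original example), we can confirm how many train/test splits this scheme yields; 5 splits times 2 repeats should give 10 scores per model and per scorer:

# Total number of CV iterations: 5 splits x 2 repeats = 10.
print(cv_splitter.get_n_splits())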

We build the three models with three different learning algorithms, keeping the default hyperparameters for each.

Model 1: default SVM.

from julearn.pipeline import PipelineCreator
from julearn import run_cross_validation

creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add("svm")

scores1 = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator1,
    scoring=scoring,
    cv=cv_splitter,
)
2024-04-29 11:46:13,782 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:46:13,782 - julearn - INFO - Step added
2024-04-29 11:46:13,782 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:46:13,782 - julearn - INFO - Step added
2024-04-29 11:46:13,782 - julearn - INFO - ==== Input Data ====
2024-04-29 11:46:13,783 - julearn - INFO - Using dataframe as input
2024-04-29 11:46:13,783 - julearn - INFO -      Features: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
2024-04-29 11:46:13,783 - julearn - INFO -      Target: target
2024-04-29 11:46:13,784 - julearn - INFO -      Expanded features: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
2024-04-29 11:46:13,784 - julearn - INFO -      X_types:{'continuous': ['.*']}
2024-04-29 11:46:13,785 - julearn - INFO - ====================
2024-04-29 11:46:13,785 - julearn - INFO -
2024-04-29 11:46:13,786 - julearn - INFO - = Model Parameters =
2024-04-29 11:46:13,786 - julearn - INFO - ====================
2024-04-29 11:46:13,786 - julearn - INFO -
2024-04-29 11:46:13,786 - julearn - INFO - = Data Information =
2024-04-29 11:46:13,786 - julearn - INFO -      Problem type: classification
2024-04-29 11:46:13,786 - julearn - INFO -      Number of samples: 569
2024-04-29 11:46:13,786 - julearn - INFO -      Number of features: 30
2024-04-29 11:46:13,786 - julearn - INFO - ====================
2024-04-29 11:46:13,786 - julearn - INFO -
2024-04-29 11:46:13,786 - julearn - INFO -      Number of classes: 2
2024-04-29 11:46:13,786 - julearn - INFO -      Target type: int64
2024-04-29 11:46:13,787 - julearn - INFO -      Class distributions: target
1    357
0    212
Name: count, dtype: int64
2024-04-29 11:46:13,787 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
2024-04-29 11:46:13,787 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
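Before moving on, we can peek at the returned scores. run_cross_validation returns a DataFrame with one row per fold; with the scoring list above it contains one test_<scorer> column per scorer (a quick inspection, not part of the original example):

# Mean test scores across the 10 folds. Column names follow the
# "test_<scorer>" convention of run_cross_validation.
print(scores1[["test_accuracy", "test_roc_auc"]].mean())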

Model 2: default Random Forest.

As listed in Models (Estimators), the "rf" string selects a random forest.

creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add("rf")

scores2 = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator2,
    scoring=scoring,
    cv=cv_splitter,
)
2024-04-29 11:46:13,941 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:46:13,941 - julearn - INFO - Step added
2024-04-29 11:46:13,941 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:46:13,941 - julearn - INFO - Step added
2024-04-29 11:46:13,941 - julearn - INFO - ==== Input Data ====
2024-04-29 11:46:13,941 - julearn - INFO - Using dataframe as input
2024-04-29 11:46:13,941 - julearn - INFO -      Features: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
2024-04-29 11:46:13,941 - julearn - INFO -      Target: target
2024-04-29 11:46:13,942 - julearn - INFO -      Expanded features: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
2024-04-29 11:46:13,942 - julearn - INFO -      X_types:{'continuous': ['.*']}
2024-04-29 11:46:13,942 - julearn - INFO - ====================
2024-04-29 11:46:13,943 - julearn - INFO -
2024-04-29 11:46:13,943 - julearn - INFO - = Model Parameters =
2024-04-29 11:46:13,943 - julearn - INFO - ====================
2024-04-29 11:46:13,943 - julearn - INFO -
2024-04-29 11:46:13,943 - julearn - INFO - = Data Information =
2024-04-29 11:46:13,943 - julearn - INFO -      Problem type: classification
2024-04-29 11:46:13,943 - julearn - INFO -      Number of samples: 569
2024-04-29 11:46:13,943 - julearn - INFO -      Number of features: 30
2024-04-29 11:46:13,943 - julearn - INFO - ====================
2024-04-29 11:46:13,943 - julearn - INFO -
2024-04-29 11:46:13,944 - julearn - INFO -      Number of classes: 2
2024-04-29 11:46:13,944 - julearn - INFO -      Target type: int64
2024-04-29 11:46:13,944 - julearn - INFO -      Class distributions: target
1    357
0    212
Name: count, dtype: int64
2024-04-29 11:46:13,944 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
2024-04-29 11:46:13,944 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(

Model 3: default Logistic Regression.

creator3 = PipelineCreator(problem_type="classification")
creator3.add("zscore")
creator3.add("logit")

scores3 = run_cross_validation(
    X=X,
    y=y,
    X_types=X_types,
    data=df,
    model=creator3,
    scoring=scoring,
    cv=cv_splitter,
)
2024-04-29 11:46:15,640 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:46:15,641 - julearn - INFO - Step added
2024-04-29 11:46:15,641 - julearn - INFO - Adding step logit that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:46:15,641 - julearn - INFO - Step added
2024-04-29 11:46:15,641 - julearn - INFO - ==== Input Data ====
2024-04-29 11:46:15,641 - julearn - INFO - Using dataframe as input
2024-04-29 11:46:15,641 - julearn - INFO -      Features: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
2024-04-29 11:46:15,641 - julearn - INFO -      Target: target
2024-04-29 11:46:15,642 - julearn - INFO -      Expanded features: ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
2024-04-29 11:46:15,642 - julearn - INFO -      X_types:{'continuous': ['.*']}
2024-04-29 11:46:15,642 - julearn - INFO - ====================
2024-04-29 11:46:15,642 - julearn - INFO -
2024-04-29 11:46:15,643 - julearn - INFO - = Model Parameters =
2024-04-29 11:46:15,643 - julearn - INFO - ====================
2024-04-29 11:46:15,643 - julearn - INFO -
2024-04-29 11:46:15,643 - julearn - INFO - = Data Information =
2024-04-29 11:46:15,643 - julearn - INFO -      Problem type: classification
2024-04-29 11:46:15,643 - julearn - INFO -      Number of samples: 569
2024-04-29 11:46:15,643 - julearn - INFO -      Number of features: 30
2024-04-29 11:46:15,643 - julearn - INFO - ====================
2024-04-29 11:46:15,643 - julearn - INFO -
2024-04-29 11:46:15,643 - julearn - INFO -      Number of classes: 2
2024-04-29 11:46:15,643 - julearn - INFO -      Target type: int64
2024-04-29 11:46:15,644 - julearn - INFO -      Class distributions: target
1    357
0    212
Name: count, dtype: int64
2024-04-29 11:46:15,644 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=2, n_splits=5, random_state=42)
2024-04-29 11:46:15,644 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(

We will add a column to each scores DataFrame so that we can identify the models by name later on.

scores1["model"] = "svm"
scores2["model"] = "rf"
scores3["model"] = "logit"
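Since the three score DataFrames share the same columns, we can also stack them for a quick side-by-side summary before running any statistics (a hypothetical convenience step, not part of the original example):

import pandas as pd

# Stack all fold-wise scores and summarize per model.
all_scores = pd.concat([scores1, scores2, scores3])
print(
    all_scores.groupby("model")[["test_accuracy", "test_roc_auc"]]
    .agg(["mean", "std"])
)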

Statistical comparisons#

Comparing the performance of these three models is now as easy as the following one-liner:

from julearn.stats.corrected_ttest import corrected_ttest

stats_df = corrected_ttest(scores1, scores2, scores3)
2024-04-29 11:46:15,836 - julearn - WARNING - The training set sizes are not the same. Will use a rounded average.
/home/runner/work/julearn/julearn/julearn/stats/corrected_ttest.py:171: RuntimeWarning: The training set sizes are not the same. Will use a rounded average.
  warn_with_log(
2024-04-29 11:46:15,836 - julearn - WARNING - The testing set sizes are not the same. Will use a rounded average.
/home/runner/work/julearn/julearn/julearn/stats/corrected_ttest.py:180: RuntimeWarning: The testing set sizes are not the same. Will use a rounded average.
  warn_with_log(
2024-04-29 11:46:15,839 - julearn - WARNING - The training set sizes are not the same. Will use a rounded average.
/home/runner/work/julearn/julearn/julearn/stats/corrected_ttest.py:171: RuntimeWarning: The training set sizes are not the same. Will use a rounded average.
  warn_with_log(
2024-04-29 11:46:15,840 - julearn - WARNING - The testing set sizes are not the same. Will use a rounded average.
/home/runner/work/julearn/julearn/julearn/stats/corrected_ttest.py:180: RuntimeWarning: The testing set sizes are not the same. Will use a rounded average.
  warn_with_log(
2024-04-29 11:46:15,843 - julearn - WARNING - The training set sizes are not the same. Will use a rounded average.
/home/runner/work/julearn/julearn/julearn/stats/corrected_ttest.py:171: RuntimeWarning: The training set sizes are not the same. Will use a rounded average.
  warn_with_log(
2024-04-29 11:46:15,843 - julearn - WARNING - The testing set sizes are not the same. Will use a rounded average.
/home/runner/work/julearn/julearn/julearn/stats/corrected_ttest.py:180: RuntimeWarning: The testing set sizes are not the same. Will use a rounded average.
  warn_with_log(

The warnings above are expected: with 569 samples the five folds cannot all have exactly the same size, so the corrected t-test uses a rounded average of the training and testing set sizes.

The call gives us a DataFrame with the corrected t-test results for each pairwise comparison of the three models' test scores. We can see that none of the models performed significantly better than the others with respect to either accuracy or roc_auc:

print(stats_df)
          metric    t-stat     p-val model_1 model_2  p-val-corrected
0  test_accuracy  1.946304  0.083461     svm      rf         0.250382
2  test_accuracy  0.140882  0.891066     svm   logit         1.000000
4  test_accuracy -2.285373  0.048138      rf   logit         0.144413
1   test_roc_auc  1.361847  0.206356     svm      rf         0.619069
3   test_roc_auc  0.029499  0.977110     svm   logit         1.000000
5   test_roc_auc -1.084010  0.306544      rf   logit         0.919632
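If we had many comparisons, we could also filter the table programmatically, for example keeping only the pairs whose corrected p-value falls below a conventional 0.05 threshold (here, none do):

# Keep only comparisons that remain significant after correction
# (this yields an empty DataFrame for the results above).
significant = stats_df[stats_df["p-val-corrected"] < 0.05]
print(significant)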

Score visualizations#

Visualizations can help build a better intuitive understanding of the differences between the models. To get an overview of the performance of our three models, we can use julearn's visualization tool to plot the scores interactively. Since visualization is not part of julearn's core functionality, you will first need to manually install the additional visualization dependencies (e.g., julearn's optional viz dependencies).

From here we can create the plot. It is interactive: you can choose which models to display and which scorer to plot.

from julearn.viz import plot_scores

panel = plot_scores(scores1, scores2, scores3)
# panel.show()
# Uncomment the previous line to show the plot.
# Read the documentation for more information:
#  https://panel.holoviz.org/getting_started/build_app.html#deploying-panels
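If you are running this in a script rather than a notebook, you could alternatively export the figure. Assuming plot_scores returns a standard holoviz Panel viewable, something like the following sketch should work:

# Hypothetical: save the interactive plot to a standalone HTML file.
# embed=True embeds widget states so the controls work offline.
panel.save("model_scores.html", embed=True)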

Note

The plot is interactive. You can zoom in and out, and hover over the data points to see their values. However, the buttons will not work in this documentation.

Well done, you made it this far and are now ready to dive into Selected deeper topics! Maybe you are curious to learn about Cross-validation consistent Confound Removal, or want to learn more about Inspecting Models.

Total running time of the script: (0 minutes 2.178 seconds)