Simple Model Comparison

This example uses the iris dataset to perform binary classification with several models. At the end, it compares the models' performance using different scoring functions and runs a statistical test to assess whether the differences in performance are significant.

# Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
# License: AGPL

from seaborn import load_dataset
from sklearn.model_selection import RepeatedStratifiedKFold
from julearn import run_cross_validation
from julearn.utils import configure_logging
from julearn.stats.corrected_ttest import corrected_ttest

Set the logging level to info to see extra information.

configure_logging(level="INFO")
2024-05-16 08:52:25,739 - julearn - INFO - ===== Lib Versions =====
2024-05-16 08:52:25,739 - julearn - INFO - numpy: 1.26.4
2024-05-16 08:52:25,739 - julearn - INFO - scipy: 1.13.0
2024-05-16 08:52:25,739 - julearn - INFO - sklearn: 1.4.2
2024-05-16 08:52:25,739 - julearn - INFO - pandas: 2.1.4
2024-05-16 08:52:25,739 - julearn - INFO - julearn: 0.3.3
2024-05-16 08:52:25,739 - julearn - INFO - ========================
df_iris = load_dataset("iris")

The dataset contains three species. We will keep two of them to perform a binary classification.

df_iris = df_iris[df_iris["species"].isin(["versicolor", "virginica"])]

As features, we will use the sepal length, sepal width, and petal length. We will try to predict the species.

X = ["sepal_length", "sepal_width", "petal_length"]
y = "species"
scores = run_cross_validation(
    X=X,
    y=y,
    data=df_iris,
    model="svm",
    problem_type="classification",
    preprocess="zscore",
)

print(scores["test_score"])
2024-05-16 08:52:25,742 - julearn - INFO - ==== Input Data ====
2024-05-16 08:52:25,742 - julearn - INFO - Using dataframe as input
2024-05-16 08:52:25,742 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:25,742 - julearn - INFO -      Target: species
2024-05-16 08:52:25,742 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:25,742 - julearn - INFO -      X_types:{}
2024-05-16 08:52:25,742 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:505: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
  warn_with_log(
2024-05-16 08:52:25,743 - julearn - INFO - ====================
2024-05-16 08:52:25,743 - julearn - INFO -
2024-05-16 08:52:25,743 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:25,743 - julearn - INFO - Step added
2024-05-16 08:52:25,744 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:25,744 - julearn - INFO - Step added
2024-05-16 08:52:25,744 - julearn - INFO - = Model Parameters =
2024-05-16 08:52:25,744 - julearn - INFO - ====================
2024-05-16 08:52:25,744 - julearn - INFO -
2024-05-16 08:52:25,744 - julearn - INFO - = Data Information =
2024-05-16 08:52:25,744 - julearn - INFO -      Problem type: classification
2024-05-16 08:52:25,744 - julearn - INFO -      Number of samples: 100
2024-05-16 08:52:25,744 - julearn - INFO -      Number of features: 3
2024-05-16 08:52:25,745 - julearn - INFO - ====================
2024-05-16 08:52:25,745 - julearn - INFO -
2024-05-16 08:52:25,745 - julearn - INFO -      Number of classes: 2
2024-05-16 08:52:25,745 - julearn - INFO -      Target type: object
2024-05-16 08:52:25,745 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-16 08:52:25,745 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-16 08:52:25,746 - julearn - INFO - Binary classification problem detected.
0    0.90
1    0.75
2    0.95
3    0.70
4    0.90
Name: test_score, dtype: float64
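
The per-fold scores are easier to compare once they are summarised. The following is a minimal sketch, assuming the five values printed above are re-entered by hand into a pandas Series; in a live session, the same methods can be called on scores["test_score"] directly.

```python
import pandas as pd

# Fold scores copied from the output above; in a live session, use
# scores["test_score"] instead of retyping them.
fold_scores = pd.Series([0.90, 0.75, 0.95, 0.70, 0.90], name="test_score")

mean_score = fold_scores.mean()
std_score = fold_scores.std()  # sample standard deviation (ddof=1)

print(f"mean = {mean_score:.2f}, std = {std_score:.2f}")
```

The spread across folds is large relative to the mean, which is one reason the later comparison relies on repeated cross-validation rather than a single 5-fold run.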

Additionally, we can choose to assess the performance of the model using different scoring functions.

For example, we might have an unbalanced dataset:

df_unbalanced = df_iris[20:]  # drop the first 20 versicolor samples
print(df_unbalanced["species"].value_counts())
species
virginica     50
versicolor    30
Name: count, dtype: int64

Plain accuracy can be misleading when the classes are unbalanced, so we will use the balanced_accuracy and roc_auc metrics.

scoring = ["balanced_accuracy", "roc_auc"]
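
To see why balanced accuracy is a reasonable choice here, consider a sketch in plain Python. The baseline that always predicts the majority class is hypothetical and is not part of the original example; the class counts are taken from the value_counts() output above. Balanced accuracy is computed as the mean of the per-class recalls, which is also how scikit-learn's balanced_accuracy_score defines it.

```python
# Class counts taken from the value_counts() output above.
y_true = ["virginica"] * 50 + ["versicolor"] * 30

# Hypothetical baseline that always predicts the majority class.
y_pred = ["virginica"] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for label in sorted(set(y_true)):
        # Predictions made for the samples whose true class is `label`.
        preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(preds_for_label.count(label) / len(preds_for_label))
    return sum(recalls) / len(recalls)


print(accuracy(y_true, y_pred))           # 0.625
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The baseline reaches 62.5% accuracy without learning anything, while its balanced accuracy stays at chance level.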

Since we are comparing the performance of different models, every model must be evaluated on the same splits, so we fix the random seed of the CV scheme.

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
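
The effect of the fixed seed can be checked directly. The sketch below assumes toy labels with the same 50/30 class counts as df_unbalanced and a placeholder feature matrix (both are illustrative, not part of the original example). It calls split() twice on the same object, as would presumably happen when the same cv is passed to several run_cross_validation calls.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Toy labels with the same 50/30 class counts as df_unbalanced.
labels = np.array(["virginica"] * 50 + ["versicolor"] * 30)
placeholder_X = np.zeros((len(labels), 1))  # feature values do not affect the split

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)

# Collect the test indices of every split, twice.
first_pass = [test for _, test in cv.split(placeholder_X, labels)]
second_pass = [test for _, test in cv.split(placeholder_X, labels)]

same_splits = all(
    np.array_equal(a, b) for a, b in zip(first_pass, second_pass)
)
print(len(first_pass), same_splits)
```

With an integer random_state, both passes yield the same 25 splits (5 folds times 5 repeats), so the per-fold scores of different models are paired and can be compared fold by fold.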

First, we will use a default SVM model.

scores1 = run_cross_validation(
    X=X,
    y=y,
    data=df_unbalanced,
    model="svm",
    preprocess="zscore",
    problem_type="classification",
    scoring=scoring,
    cv=cv,
)

scores1["model"] = "svm"
2024-05-16 08:52:25,787 - julearn - INFO - ==== Input Data ====
2024-05-16 08:52:25,787 - julearn - INFO - Using dataframe as input
2024-05-16 08:52:25,787 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:25,787 - julearn - INFO -      Target: species
2024-05-16 08:52:25,787 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:25,787 - julearn - INFO -      X_types:{}
2024-05-16 08:52:25,787 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:505: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
  warn_with_log(
2024-05-16 08:52:25,788 - julearn - INFO - ====================
2024-05-16 08:52:25,788 - julearn - INFO -
2024-05-16 08:52:25,788 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:25,788 - julearn - INFO - Step added
2024-05-16 08:52:25,788 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:25,788 - julearn - INFO - Step added
2024-05-16 08:52:25,789 - julearn - INFO - = Model Parameters =
2024-05-16 08:52:25,789 - julearn - INFO - ====================
2024-05-16 08:52:25,789 - julearn - INFO -
2024-05-16 08:52:25,789 - julearn - INFO - = Data Information =
2024-05-16 08:52:25,789 - julearn - INFO -      Problem type: classification
2024-05-16 08:52:25,789 - julearn - INFO -      Number of samples: 80
2024-05-16 08:52:25,789 - julearn - INFO -      Number of features: 3
2024-05-16 08:52:25,789 - julearn - INFO - ====================
2024-05-16 08:52:25,789 - julearn - INFO -
2024-05-16 08:52:25,789 - julearn - INFO -      Number of classes: 2
2024-05-16 08:52:25,789 - julearn - INFO -      Target type: object
2024-05-16 08:52:25,790 - julearn - INFO -      Class distributions: species
virginica     50
versicolor    30
Name: count, dtype: int64
2024-05-16 08:52:25,790 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=42)
2024-05-16 08:52:25,790 - julearn - INFO - Binary classification problem detected.

Second, we will use a default Random Forest model.

scores2 = run_cross_validation(
    X=X,
    y=y,
    data=df_unbalanced,
    model="rf",
    preprocess="zscore",
    problem_type="classification",
    scoring=scoring,
    cv=cv,
)

scores2["model"] = "rf"
2024-05-16 08:52:26,077 - julearn - INFO - ==== Input Data ====
2024-05-16 08:52:26,078 - julearn - INFO - Using dataframe as input
2024-05-16 08:52:26,078 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:26,078 - julearn - INFO -      Target: species
2024-05-16 08:52:26,078 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:26,078 - julearn - INFO -      X_types:{}
2024-05-16 08:52:26,078 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:505: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
  warn_with_log(
2024-05-16 08:52:26,078 - julearn - INFO - ====================
2024-05-16 08:52:26,079 - julearn - INFO -
2024-05-16 08:52:26,079 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:26,079 - julearn - INFO - Step added
2024-05-16 08:52:26,079 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:26,079 - julearn - INFO - Step added
2024-05-16 08:52:26,080 - julearn - INFO - = Model Parameters =
2024-05-16 08:52:26,080 - julearn - INFO - ====================
2024-05-16 08:52:26,080 - julearn - INFO -
2024-05-16 08:52:26,080 - julearn - INFO - = Data Information =
2024-05-16 08:52:26,080 - julearn - INFO -      Problem type: classification
2024-05-16 08:52:26,080 - julearn - INFO -      Number of samples: 80
2024-05-16 08:52:26,080 - julearn - INFO -      Number of features: 3
2024-05-16 08:52:26,080 - julearn - INFO - ====================
2024-05-16 08:52:26,080 - julearn - INFO -
2024-05-16 08:52:26,080 - julearn - INFO -      Number of classes: 2
2024-05-16 08:52:26,080 - julearn - INFO -      Target type: object
2024-05-16 08:52:26,081 - julearn - INFO -      Class distributions: species
virginica     50
versicolor    30
Name: count, dtype: int64
2024-05-16 08:52:26,081 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=42)
2024-05-16 08:52:26,081 - julearn - INFO - Binary classification problem detected.

The third model will be an SVM with a linear kernel.

scores3 = run_cross_validation(
    X=X,
    y=y,
    data=df_unbalanced,
    model="svm",
    model_params={"svm__kernel": "linear"},
    preprocess="zscore",
    problem_type="classification",
    scoring=scoring,
    cv=cv,
)

scores3["model"] = "svm_linear"
2024-05-16 08:52:28,775 - julearn - INFO - ==== Input Data ====
2024-05-16 08:52:28,776 - julearn - INFO - Using dataframe as input
2024-05-16 08:52:28,776 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:28,776 - julearn - INFO -      Target: species
2024-05-16 08:52:28,776 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-05-16 08:52:28,776 - julearn - INFO -      X_types:{}
2024-05-16 08:52:28,776 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:505: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
  warn_with_log(
2024-05-16 08:52:28,777 - julearn - INFO - ====================
2024-05-16 08:52:28,777 - julearn - INFO -
2024-05-16 08:52:28,777 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:28,777 - julearn - INFO - Step added
2024-05-16 08:52:28,777 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:28,777 - julearn - INFO - Setting hyperparameter kernel = linear
2024-05-16 08:52:28,777 - julearn - INFO - Step added
2024-05-16 08:52:28,778 - julearn - INFO - = Model Parameters =
2024-05-16 08:52:28,778 - julearn - INFO - ====================
2024-05-16 08:52:28,778 - julearn - INFO -
2024-05-16 08:52:28,778 - julearn - INFO - = Data Information =
2024-05-16 08:52:28,778 - julearn - INFO -      Problem type: classification
2024-05-16 08:52:28,778 - julearn - INFO -      Number of samples: 80
2024-05-16 08:52:28,778 - julearn - INFO -      Number of features: 3
2024-05-16 08:52:28,778 - julearn - INFO - ====================
2024-05-16 08:52:28,778 - julearn - INFO -
2024-05-16 08:52:28,778 - julearn - INFO -      Number of classes: 2
2024-05-16 08:52:28,778 - julearn - INFO -      Target type: object
2024-05-16 08:52:28,779 - julearn - INFO -      Class distributions: species
virginica     50
versicolor    30
Name: count, dtype: int64
2024-05-16 08:52:28,779 - julearn - INFO - Using outer CV scheme RepeatedStratifiedKFold(n_repeats=5, n_splits=5, random_state=42)
2024-05-16 08:52:28,779 - julearn - INFO - Binary classification problem detected.

We can now compare the performance of the models using corrected statistics.

stats_df = corrected_ttest(scores1, scores2, scores3)
print(stats_df)
                   metric    t-stat  ...     model_2 p-val-corrected
0  test_balanced_accuracy -0.175075  ...          rf        1.000000
2  test_balanced_accuracy -1.062567  ...  svm_linear        0.895662
4  test_balanced_accuracy -1.151390  ...  svm_linear        0.782741
1            test_roc_auc  1.108944  ...          rf        0.835331
3            test_roc_auc -1.236153  ...  svm_linear        0.685092
5            test_roc_auc -1.669010  ...  svm_linear        0.324331

[6 rows x 6 columns]
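
A correction is needed because scores from repeated cross-validation are not independent: the training sets of different folds overlap. A widely used remedy is the Nadeau-Bengio corrected resampled t-test. The sketch below assumes that form of the correction; it illustrates the idea and is not julearn's implementation. The function name and the per-fold scores are hypothetical.

```python
import math
import statistics


def corrected_t_statistic(scores_a, scores_b, n_train, n_test):
    """Nadeau-Bengio corrected t-statistic for paired CV scores (sketch)."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    var_diff = statistics.variance(differences)  # sample variance (ddof=1)
    # The n_test / n_train term inflates the variance to account for the
    # overlap between training sets across folds.
    corrected_var = var_diff * (1 / n + n_test / n_train)
    return mean_diff / math.sqrt(corrected_var)


# Hypothetical per-fold scores, made up purely for illustration.
model_a = [0.80, 0.85, 0.78, 0.90, 0.82]
model_b = [0.78, 0.80, 0.80, 0.85, 0.80]

# With 80 samples and 5 folds, each fold trains on 64 and tests on 16.
t_corrected = corrected_t_statistic(model_a, model_b, n_train=64, n_test=16)
# Setting n_test=0 removes the correction and gives the usual paired t-statistic.
t_naive = corrected_t_statistic(model_a, model_b, n_train=64, n_test=0)
print(t_corrected, t_naive)
```

The corrected statistic is always smaller in magnitude than the uncorrected one, which makes the test more conservative.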

We can also plot the performance of the models using the julearn Score Viewer.

from julearn.viz import plot_scores

panel = plot_scores(scores1, scores2, scores3)
# panel.show()
# uncomment the previous line to show the plot
# read the documentation for more information
#  https://panel.holoviz.org/getting_started/build_app.html#deploying-panels

This is what the plot looks like.

Note

The plot is interactive. You can zoom in and out and hover over the plot. However, the buttons will not work in this documentation.

