Multiclass Classification

This example uses the iris dataset to perform multiclass classification with a Support Vector Machine (SVM) classifier. It plots a heatmap of the cross-validation accuracies and a confusion matrix for the test data.

# Authors: Shammi More <s.more@fz-juelich.de>
#          Federico Raimondo <f.raimondo@fz-juelich.de>
# License: AGPL

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from seaborn import load_dataset
from sklearn.model_selection import train_test_split, RepeatedKFold
from sklearn.metrics import confusion_matrix

from julearn import run_cross_validation
from julearn.utils import configure_logging

Set the logging level to info to see extra information.

configure_logging(level="INFO")
2024-04-29 11:45:24,738 - julearn - INFO - ===== Lib Versions =====
2024-04-29 11:45:24,739 - julearn - INFO - numpy: 1.26.4
2024-04-29 11:45:24,739 - julearn - INFO - scipy: 1.13.0
2024-04-29 11:45:24,739 - julearn - INFO - sklearn: 1.4.2
2024-04-29 11:45:24,739 - julearn - INFO - pandas: 2.1.4
2024-04-29 11:45:24,739 - julearn - INFO - julearn: 0.3.2.dev57
2024-04-29 11:45:24,739 - julearn - INFO - ========================

Load the iris data from seaborn.

df_iris = load_dataset("iris")
X = ["sepal_length", "sepal_width", "petal_length"]
y = "species"

Split the dataset into train and test sets.

train_iris, test_iris = train_test_split(
    df_iris, test_size=0.2, stratify=df_iris[y], random_state=200
)
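
Since we stratify by species, both splits should keep the three classes balanced; a quick check with pandas, for instance (the training split should contain 40 samples per class, matching the class distribution reported in the log below):

print(train_iris[y].value_counts())
print(test_iris[y].value_counts())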

We want to perform multiclass classification, as the iris dataset contains 3 species. We will first z-score all the features and then train a Support Vector Machine classifier.

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=200)
scores, model_iris = run_cross_validation(
    X=X,
    y=y,
    data=train_iris,
    model="svm",
    preprocess="zscore",
    problem_type="classification",
    cv=cv,
    scoring=["accuracy"],
    return_estimator="final",
)
2024-04-29 11:45:24,742 - julearn - INFO - ==== Input Data ====
2024-04-29 11:45:24,743 - julearn - INFO - Using dataframe as input
2024-04-29 11:45:24,743 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-04-29 11:45:24,743 - julearn - INFO -      Target: species
2024-04-29 11:45:24,743 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-04-29 11:45:24,743 - julearn - INFO -      X_types:{}
2024-04-29 11:45:24,743 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:507: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
  warn_with_log(
2024-04-29 11:45:24,744 - julearn - INFO - ====================
2024-04-29 11:45:24,744 - julearn - INFO -
2024-04-29 11:45:24,744 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:24,744 - julearn - INFO - Step added
2024-04-29 11:45:24,744 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-04-29 11:45:24,744 - julearn - INFO - Step added
2024-04-29 11:45:24,745 - julearn - INFO - = Model Parameters =
2024-04-29 11:45:24,745 - julearn - INFO - ====================
2024-04-29 11:45:24,745 - julearn - INFO -
2024-04-29 11:45:24,745 - julearn - INFO - = Data Information =
2024-04-29 11:45:24,745 - julearn - INFO -      Problem type: classification
2024-04-29 11:45:24,745 - julearn - INFO -      Number of samples: 120
2024-04-29 11:45:24,745 - julearn - INFO -      Number of features: 3
2024-04-29 11:45:24,745 - julearn - INFO - ====================
2024-04-29 11:45:24,745 - julearn - INFO -
2024-04-29 11:45:24,745 - julearn - INFO -      Number of classes: 3
2024-04-29 11:45:24,745 - julearn - INFO -      Target type: object
2024-04-29 11:45:24,746 - julearn - INFO -      Class distributions: species
versicolor    40
virginica     40
setosa        40
Name: count, dtype: int64
2024-04-29 11:45:24,746 - julearn - INFO - Using outer CV scheme RepeatedKFold(n_repeats=5, n_splits=5, random_state=200)
2024-04-29 11:45:24,746 - julearn - INFO - Multi-class classification problem detected #classes = 3.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
2024-04-29 11:45:24,936 - julearn - INFO - Fitting final model

The scores dataframe has all the values for each CV split.

   fit_time  score_time  test_accuracy  n_train  n_test  repeat  fold                          cv_mdsum
0  0.005056    0.002651       0.916667       96      24       0     0  fa5ab7a2b930761687a8e82d9971ebca
1  0.004513    0.002574       0.833333       96      24       0     1  fa5ab7a2b930761687a8e82d9971ebca
2  0.004429    0.002554       0.958333       96      24       0     2  fa5ab7a2b930761687a8e82d9971ebca
3  0.004478    0.002538       0.916667       96      24       0     3  fa5ab7a2b930761687a8e82d9971ebca
4  0.004434    0.002512       0.833333       96      24       0     4  fa5ab7a2b930761687a8e82d9971ebca
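
A quick way to summarize these is to aggregate the test accuracy over all splits, for example:

print(scores["test_accuracy"].agg(["mean", "std"]))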


Now we can get the accuracy per fold and repetition:

df_accuracy = scores.set_index(["repeat", "fold"])["test_accuracy"].unstack()
df_accuracy.index.name = "Repeats"
df_accuracy.columns.name = "K-fold splits"
df_accuracy
K-fold splits         0         1         2         3         4
Repeats
0              0.916667  0.833333  0.958333  0.916667  0.833333
1              0.875000  0.833333  0.916667  0.833333  0.833333
2              0.750000  0.916667  0.916667  0.958333  0.916667
3              1.000000  0.791667  0.875000  1.000000  0.791667
4              0.875000  0.833333  0.875000  0.916667  0.958333
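
Averaging over the folds of each repeat gives one accuracy estimate per repeat, for example:

print(df_accuracy.mean(axis=1))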


Plot a heatmap of the accuracy over all repeats and CV splits.

sns.set(font_scale=1.2)
fig, ax = plt.subplots(1, 1, figsize=(10, 7))
sns.heatmap(df_accuracy, cmap="YlGnBu")
plt.title("Cross-validation Accuracy")
[Figure: heatmap of cross-validation accuracy]
Text(0.5, 1.0, 'Cross-validation Accuracy')

We can also test our final model's accuracy and plot the confusion matrix for the test data as an annotated heatmap.
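
The prediction code is only sketched here, assuming the final estimator returned by run_cross_validation follows the usual scikit-learn predict API (y_true and cm are reused below):

y_true = test_iris[y]
y_pred = model_iris.predict(test_iris[X])
cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
print(cm)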

[[9 1 0]
 [0 9 1]
 [0 2 8]]

Now that we have our confusion matrix, let’s build another matrix with annotations.

cm_sum = np.sum(cm, axis=1, keepdims=True)  # row totals (true class counts)
cm_perc = cm / cm_sum.astype(float) * 100  # row-wise percentages
annot = np.empty_like(cm).astype(str)
nrows, ncols = cm.shape
for i in range(nrows):
    for j in range(ncols):
        c = cm[i, j]
        p = cm_perc[i, j]
        if c == 0:
            annot[i, j] = ""
        else:
            s = cm_sum[i, 0]  # scalar row total (avoids NumPy's ndim > 0 to scalar deprecation)
            annot[i, j] = "%.1f%%\n%d/%d" % (p, c, s)

Finally, we create another DataFrame with the confusion matrix and plot the heatmap with annotations.

cm = pd.DataFrame(cm, index=np.unique(y_true), columns=np.unique(y_true))
cm.index.name = "Actual"
cm.columns.name = "Predicted"

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
sns.heatmap(cm, cmap="YlGnBu", annot=annot, fmt="", ax=ax)
plt.title("Confusion matrix")
[Figure: confusion matrix heatmap with annotations]
Text(0.5, 1.0, 'Confusion matrix')

Total running time of the script: (0 minutes 0.526 seconds)

Gallery generated by Sphinx-Gallery