Simple Binary Classification
This example uses the iris dataset and performs a simple binary classification using a Support Vector Machine classifier.
# Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
#
# License: AGPL
from seaborn import load_dataset
from julearn import run_cross_validation
from julearn.utils import configure_logging
Set the logging level to INFO to see extra information.
configure_logging(level="INFO")
2024-01-23 10:53:15,374 - julearn - INFO - ===== Lib Versions =====
2024-01-23 10:53:15,375 - julearn - INFO - numpy: 1.26.3
2024-01-23 10:53:15,375 - julearn - INFO - scipy: 1.12.0
2024-01-23 10:53:15,375 - julearn - INFO - sklearn: 1.3.2
2024-01-23 10:53:15,375 - julearn - INFO - pandas: 2.1.4
2024-01-23 10:53:15,375 - julearn - INFO - julearn: 0.3.1
2024-01-23 10:53:15,375 - julearn - INFO - ========================
df_iris = load_dataset("iris")
The dataset contains three species. We will keep two of them to perform a binary classification.
As features, we will use the sepal length, sepal width, and petal length. We will try to predict the species.
2024-01-23 10:53:15,442 - julearn - INFO - ==== Input Data ====
2024-01-23 10:53:15,442 - julearn - INFO - Using dataframe as input
2024-01-23 10:53:15,442 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-01-23 10:53:15,442 - julearn - INFO - Target: species
2024-01-23 10:53:15,443 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-01-23 10:53:15,443 - julearn - INFO - X_types:{}
2024-01-23 10:53:15,443 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:503: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
warn_with_log(
2024-01-23 10:53:15,444 - julearn - INFO - ====================
2024-01-23 10:53:15,444 - julearn - INFO -
2024-01-23 10:53:15,444 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:53:15,444 - julearn - INFO - Step added
2024-01-23 10:53:15,444 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:53:15,444 - julearn - INFO - Step added
2024-01-23 10:53:15,445 - julearn - INFO - = Model Parameters =
2024-01-23 10:53:15,445 - julearn - INFO - ====================
2024-01-23 10:53:15,445 - julearn - INFO -
2024-01-23 10:53:15,445 - julearn - INFO - = Data Information =
2024-01-23 10:53:15,445 - julearn - INFO - Problem type: classification
2024-01-23 10:53:15,445 - julearn - INFO - Number of samples: 100
2024-01-23 10:53:15,445 - julearn - INFO - Number of features: 3
2024-01-23 10:53:15,445 - julearn - INFO - ====================
2024-01-23 10:53:15,445 - julearn - INFO -
2024-01-23 10:53:15,446 - julearn - INFO - Number of classes: 2
2024-01-23 10:53:15,446 - julearn - INFO - Target type: object
2024-01-23 10:53:15,446 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:53:15,446 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:53:15,447 - julearn - INFO - Binary classification problem detected.
0 0.90
1 0.75
2 0.95
3 0.70
4 0.90
Name: test_score, dtype: float64
Additionally, we can choose to assess the performance of the model using different scoring functions.
For example, we might have an unbalanced dataset:
df_unbalanced = df_iris[20:] # drop the first 20 versicolor samples
print(df_unbalanced["species"].value_counts())
species
virginica 50
versicolor 30
Name: count, dtype: int64
If we compute the accuracy, we might not account for this imbalance. A more suitable metric is the balanced_accuracy, which averages the recall obtained on each class. More information in scikit-learn: balanced_accuracy_score().
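To make the difference between the two metrics concrete, here is a minimal pure-Python sketch. It is not julearn or scikit-learn code: the helper functions `accuracy` and `balanced_accuracy` are written here for illustration only, and the always-predict-virginica classifier is hypothetical. Only the class counts (50 virginica, 30 versicolor) come from the example.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so each class weighs the same."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Same class counts as df_unbalanced: 50 virginica and 30 versicolor.
y_true = ["virginica"] * 50 + ["versicolor"] * 30
# Hypothetical classifier that always predicts the majority class.
y_pred = ["virginica"] * 80

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(acc)      # 0.625
print(bal_acc)  # 0.5
```

A classifier that ignores the minority class still reaches 62.5% accuracy here, while its balanced accuracy drops to chance level.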
We will also set the random seed so we always split the data in the same way.
2024-01-23 10:53:15,486 - julearn - INFO - Setting random seed to 42
2024-01-23 10:53:15,486 - julearn - INFO - ==== Input Data ====
2024-01-23 10:53:15,487 - julearn - INFO - Using dataframe as input
2024-01-23 10:53:15,487 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-01-23 10:53:15,487 - julearn - INFO - Target: species
2024-01-23 10:53:15,487 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-01-23 10:53:15,487 - julearn - INFO - X_types:{}
2024-01-23 10:53:15,487 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:503: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
warn_with_log(
2024-01-23 10:53:15,487 - julearn - INFO - ====================
2024-01-23 10:53:15,487 - julearn - INFO -
2024-01-23 10:53:15,488 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:53:15,488 - julearn - INFO - Step added
2024-01-23 10:53:15,488 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:53:15,488 - julearn - INFO - Step added
2024-01-23 10:53:15,488 - julearn - INFO - = Model Parameters =
2024-01-23 10:53:15,488 - julearn - INFO - ====================
2024-01-23 10:53:15,488 - julearn - INFO -
2024-01-23 10:53:15,488 - julearn - INFO - = Data Information =
2024-01-23 10:53:15,488 - julearn - INFO - Problem type: classification
2024-01-23 10:53:15,488 - julearn - INFO - Number of samples: 80
2024-01-23 10:53:15,489 - julearn - INFO - Number of features: 3
2024-01-23 10:53:15,489 - julearn - INFO - ====================
2024-01-23 10:53:15,489 - julearn - INFO -
2024-01-23 10:53:15,489 - julearn - INFO - Number of classes: 2
2024-01-23 10:53:15,489 - julearn - INFO - Target type: object
2024-01-23 10:53:15,489 - julearn - INFO - Class distributions: species
virginica 50
versicolor 30
Name: count, dtype: int64
2024-01-23 10:53:15,489 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:53:15,490 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/sklearn/metrics/_classification.py:2399: UserWarning: y_pred contains classes not in y_true
warnings.warn("y_pred contains classes not in y_true")
/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/sklearn/metrics/_classification.py:2399: UserWarning: y_pred contains classes not in y_true
warnings.warn("y_pred contains classes not in y_true")
/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/sklearn/metrics/_classification.py:2399: UserWarning: y_pred contains classes not in y_true
warnings.warn("y_pred contains classes not in y_true")
/opt/hostedtoolcache/Python/3.10.13/x64/lib/python3.10/site-packages/sklearn/metrics/_classification.py:2399: UserWarning: y_pred contains classes not in y_true
warnings.warn("y_pred contains classes not in y_true")
0.8625
0.8678571428571429
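The log above reports an outer CV scheme of KFold(n_splits=5, shuffle=False). The sketch below is an illustration of what such a split could look like on this data, not julearn's implementation. It assumes that the rows of df_unbalanced are ordered as 30 versicolor samples followed by 50 virginica samples; the helper `unshuffled_test_folds` is a hypothetical name that mimics consecutive, unshuffled folds.

```python
from collections import Counter


def unshuffled_test_folds(n_samples, n_splits):
    """Split range(n_samples) into consecutive test folds, without shuffling."""
    base, extra = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for i in range(n_splits):
        # The first `extra` folds get one more sample than the rest.
        size = base + (1 if i < extra else 0)
        folds.append(range(start, start + size))
        start += size
    return folds


# Assumed row order of df_unbalanced: 30 versicolor rows, then 50 virginica rows.
labels = ["versicolor"] * 30 + ["virginica"] * 50

folds = unshuffled_test_folds(len(labels), n_splits=5)
fold_counts = [dict(Counter(labels[i] for i in fold)) for fold in folds]
for k, counts in enumerate(fold_counts):
    print(k, counts)
```

Under that assumption, four of the five test folds contain a single class, which is consistent with the four "y_pred contains classes not in y_true" warnings in the log.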
Other kinds of metrics allow us to evaluate how good our model is at detecting specific targets. Suppose we want to create a model that correctly identifies the versicolor samples.
Now we might want to evaluate the precision score, that is, the ratio of true positives (tp) over all predicted positives (true and false positives). More information in scikit-learn: precision_score().
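As an illustration of the ratio described above, the following pure-Python sketch computes tp / (tp + fp) for a chosen positive label. The function `precision` and the toy labels are hypothetical and written for this illustration; they are not julearn or scikit-learn code. Returning 0.0 when nothing is predicted as positive is a choice made for this sketch.

```python
def precision(y_true, y_pred, pos_label):
    """Ratio of true positives over all samples predicted as pos_label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    if tp + fp == 0:
        # Nothing was predicted as positive; this sketch returns 0.0 in that case.
        return 0.0
    return tp / (tp + fp)


# Toy labels, made up for illustration: 3 versicolor and 5 virginica samples.
y_true = ["versicolor"] * 3 + ["virginica"] * 5
y_pred = [
    "versicolor", "versicolor", "virginica",
    "versicolor", "versicolor", "virginica", "virginica", "virginica",
]

prec_versicolor = precision(y_true, y_pred, pos_label="versicolor")
prec_virginica = precision(y_true, y_pred, pos_label="virginica")
print(prec_versicolor)  # 0.5
print(prec_virginica)   # 0.75
```

The same predictions give different precision values depending on which class is treated as positive, which is why the positive label has to be stated explicitly.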
For this metric to work, we need to define which labels count as positive. In this example, we are interested in detecting versicolor.
precision_scores = run_cross_validation(
X=X,
y=y,
data=df_unbalanced,
model="svm",
preprocess="zscore",
problem_type="classification",
seed=42,
scoring="precision",
pos_labels="versicolor",
)
print(precision_scores["test_score"].mean())
2024-01-23 10:53:15,532 - julearn - INFO - Setting random seed to 42
2024-01-23 10:53:15,532 - julearn - INFO - ==== Input Data ====
2024-01-23 10:53:15,532 - julearn - INFO - Using dataframe as input
2024-01-23 10:53:15,532 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-01-23 10:53:15,533 - julearn - INFO - Target: species
2024-01-23 10:53:15,533 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-01-23 10:53:15,533 - julearn - INFO - X_types:{}
2024-01-23 10:53:15,533 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:503: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
warn_with_log(
2024-01-23 10:53:15,533 - julearn - INFO - Setting the following as positive labels ['versicolor']
2024-01-23 10:53:15,534 - julearn - INFO - ====================
2024-01-23 10:53:15,534 - julearn - INFO -
2024-01-23 10:53:15,534 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:53:15,534 - julearn - INFO - Step added
2024-01-23 10:53:15,534 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:53:15,534 - julearn - INFO - Step added
2024-01-23 10:53:15,535 - julearn - INFO - = Model Parameters =
2024-01-23 10:53:15,535 - julearn - INFO - ====================
2024-01-23 10:53:15,535 - julearn - INFO -
2024-01-23 10:53:15,535 - julearn - INFO - = Data Information =
2024-01-23 10:53:15,535 - julearn - INFO - Problem type: classification
2024-01-23 10:53:15,535 - julearn - INFO - Number of samples: 80
2024-01-23 10:53:15,535 - julearn - INFO - Number of features: 3
2024-01-23 10:53:15,535 - julearn - INFO - ====================
2024-01-23 10:53:15,535 - julearn - INFO -
2024-01-23 10:53:15,535 - julearn - INFO - Number of classes: 2
2024-01-23 10:53:15,535 - julearn - INFO - Target type: int64
2024-01-23 10:53:15,535 - julearn - INFO - Class distributions: species
0 50
1 30
Name: count, dtype: int64
2024-01-23 10:53:15,536 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:53:15,536 - julearn - INFO - Binary classification problem detected.
0.4