Simple Binary Classification
This example uses the ‘iris’ dataset and performs a simple binary classification using a Support Vector Machine classifier.
# Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
#
# License: AGPL
from seaborn import load_dataset
from julearn import run_cross_validation
from julearn.utils import configure_logging
Set the logging level to info to see extra information:
configure_logging(level='INFO')
Out:
2021-01-28 20:07:48,537 - julearn - INFO - ===== Lib Versions =====
2021-01-28 20:07:48,537 - julearn - INFO - numpy: 1.19.5
2021-01-28 20:07:48,538 - julearn - INFO - scipy: 1.6.0
2021-01-28 20:07:48,538 - julearn - INFO - sklearn: 0.24.1
2021-01-28 20:07:48,538 - julearn - INFO - pandas: 1.2.1
2021-01-28 20:07:48,538 - julearn - INFO - julearn: 0.2.5.dev19+g9c15c5f
2021-01-28 20:07:48,538 - julearn - INFO - ========================
df_iris = load_dataset('iris')
The dataset has three kinds of species. We will keep two of them to perform a binary classification.
df_iris = df_iris[df_iris['species'].isin(['versicolor', 'virginica'])]
As features, we will use the sepal length, sepal width, and petal length. We will try to predict the species.
Out:
2021-01-28 20:07:48,543 - julearn - INFO - Using default CV
2021-01-28 20:07:48,543 - julearn - INFO - ==== Input Data ====
2021-01-28 20:07:48,543 - julearn - INFO - Using dataframe as input
2021-01-28 20:07:48,543 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
2021-01-28 20:07:48,543 - julearn - INFO - Target: species
2021-01-28 20:07:48,543 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
2021-01-28 20:07:48,543 - julearn - INFO - Expanded Confounds: []
2021-01-28 20:07:48,544 - julearn - INFO - ====================
2021-01-28 20:07:48,544 - julearn - INFO -
2021-01-28 20:07:48,544 - julearn - INFO - ====== Model ======
2021-01-28 20:07:48,544 - julearn - INFO - Obtaining model by name: svm
2021-01-28 20:07:48,545 - julearn - INFO - ===================
2021-01-28 20:07:48,545 - julearn - INFO -
2021-01-28 20:07:48,545 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
0 0.95
1 0.90
2 0.90
3 1.00
4 0.90
5 1.00
6 0.90
7 0.85
8 0.95
9 0.90
10 0.95
11 0.95
12 0.90
13 0.80
14 1.00
15 0.90
16 0.85
17 1.00
18 0.85
19 0.95
20 0.90
21 0.85
22 0.95
23 0.95
24 1.00
Name: test_score, dtype: float64
Additionally, we can choose to assess the performance of the model using different scoring functions.
For example, we might have an unbalanced dataset:
df_unbalanced = df_iris[20:] # drop the first 20 versicolor samples
print(df_unbalanced['species'].value_counts())
Out:
virginica 50
versicolor 30
Name: species, dtype: int64
If we compute the accuracy, it will not account for this imbalance. A more suitable metric is the balanced_accuracy. More information in scikit-learn: Balanced Accuracy
We will also set the random seed so we always split the data in the same way.
Out:
2021-01-28 20:07:49,054 - julearn - INFO - Setting random seed to 42
2021-01-28 20:07:49,054 - julearn - INFO - Using default CV
2021-01-28 20:07:49,054 - julearn - INFO - ==== Input Data ====
2021-01-28 20:07:49,054 - julearn - INFO - Using dataframe as input
2021-01-28 20:07:49,054 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
2021-01-28 20:07:49,054 - julearn - INFO - Target: species
2021-01-28 20:07:49,054 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
2021-01-28 20:07:49,054 - julearn - INFO - Expanded Confounds: []
2021-01-28 20:07:49,055 - julearn - INFO - ====================
2021-01-28 20:07:49,055 - julearn - INFO -
2021-01-28 20:07:49,055 - julearn - INFO - ====== Model ======
2021-01-28 20:07:49,055 - julearn - INFO - Obtaining model by name: svm
2021-01-28 20:07:49,055 - julearn - INFO - ===================
2021-01-28 20:07:49,055 - julearn - INFO -
2021-01-28 20:07:49,055 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
0.895
0.8708886668886668
Other kinds of metrics allow us to evaluate how well our model detects specific targets. Suppose we want to create a model that correctly identifies the versicolor samples.
Now we might want to evaluate the precision score, i.e. the ratio of true positives (TP) over all predicted positives (TP + FP). More information in scikit-learn: Precision
For this metric to work, we need to define which labels count as positive. In this example, we are interested in detecting versicolor.
Out:
2021-01-28 20:07:49,763 - julearn - INFO - Setting random seed to 42
2021-01-28 20:07:49,763 - julearn - INFO - Using default CV
2021-01-28 20:07:49,763 - julearn - INFO - ==== Input Data ====
2021-01-28 20:07:49,763 - julearn - INFO - Using dataframe as input
2021-01-28 20:07:49,763 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
2021-01-28 20:07:49,764 - julearn - INFO - Target: species
2021-01-28 20:07:49,764 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
2021-01-28 20:07:49,764 - julearn - INFO - Expanded Confounds: []
2021-01-28 20:07:49,764 - julearn - INFO - Setting the following as positive labels ['versicolor']
2021-01-28 20:07:49,765 - julearn - INFO - ====================
2021-01-28 20:07:49,765 - julearn - INFO -
2021-01-28 20:07:49,765 - julearn - INFO - ====== Model ======
2021-01-28 20:07:49,765 - julearn - INFO - Obtaining model by name: svm
2021-01-28 20:07:49,765 - julearn - INFO - ===================
2021-01-28 20:07:49,765 - julearn - INFO -
2021-01-28 20:07:49,765 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
0.9223333333333333
Total running time of the script: (0 minutes 1.772 seconds)