.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/run_simple_binary_classification.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_basic_run_simple_binary_classification.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_run_simple_binary_classification.py:

Simple Binary Classification
============================

This example uses the 'iris' dataset and performs a simple binary
classification using a Support Vector Machine classifier.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 10-17

.. code-block:: default

    # Authors: Federico Raimondo
    #
    # License: AGPL

    from seaborn import load_dataset
    from julearn import run_cross_validation
    from julearn.utils import configure_logging

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/travis/virtualenv/python3.7.1/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
      return f(*args, **kwds)

.. GENERATED FROM PYTHON SOURCE LINES 18-19

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 19-21

.. code-block:: default

    configure_logging(level='INFO')

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:45,526 - julearn - INFO - ===== Lib Versions =====
    2021-01-28 20:06:45,527 - julearn - INFO - numpy: 1.19.5
    2021-01-28 20:06:45,527 - julearn - INFO - scipy: 1.6.0
    2021-01-28 20:06:45,527 - julearn - INFO - sklearn: 0.24.1
    2021-01-28 20:06:45,527 - julearn - INFO - pandas: 1.2.1
    2021-01-28 20:06:45,527 - julearn - INFO - julearn: 0.2.5.dev19+g9c15c5f
    2021-01-28 20:06:45,527 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 22-24

.. code-block:: default

    df_iris = load_dataset('iris')
.. GENERATED FROM PYTHON SOURCE LINES 25-27

The dataset contains three species. We will keep two of them to perform a
binary classification.

.. GENERATED FROM PYTHON SOURCE LINES 27-29

.. code-block:: default

    df_iris = df_iris[df_iris['species'].isin(['versicolor', 'virginica'])]

.. GENERATED FROM PYTHON SOURCE LINES 30-32

As features, we will use the sepal length, sepal width and petal length.
We will try to predict the species.

.. GENERATED FROM PYTHON SOURCE LINES 32-40

.. code-block:: default

    X = ['sepal_length', 'sepal_width', 'petal_length']
    y = 'species'
    scores = run_cross_validation(
        X=X, y=y, data=df_iris, model='svm', preprocess_X='zscore')

    print(scores['test_score'])

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:45,532 - julearn - INFO - Using default CV
    2021-01-28 20:06:45,532 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:06:45,532 - julearn - INFO - Using dataframe as input
    2021-01-28 20:06:45,532 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:45,532 - julearn - INFO - Target: species
    2021-01-28 20:06:45,533 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:45,533 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:06:45,534 - julearn - INFO - ====================
    2021-01-28 20:06:45,534 - julearn - INFO -
    2021-01-28 20:06:45,534 - julearn - INFO - ====== Model ======
    2021-01-28 20:06:45,534 - julearn - INFO - Obtaining model by name: svm
    2021-01-28 20:06:45,534 - julearn - INFO - ===================
    2021-01-28 20:06:45,534 - julearn - INFO -
    2021-01-28 20:06:45,534 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0     0.90
    1     0.90
    2     0.95
    3     1.00
    4     0.90
    5     0.85
    6     0.90
    7     0.95
    8     0.95
    9     0.90
    10    0.90
    11    0.95
    12    0.90
    13    0.90
    14    0.90
    15    0.70
    16    1.00
    17    0.90
    18    0.95
    19    0.90
    20    0.95
    21    0.95
    22    0.95
    23    0.90
    24    0.95
    Name: test_score, dtype: float64
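To see what this call amounts to, here is a rough plain scikit-learn sketch of
the same analysis: z-scoring inside a pipeline, an SVM classifier, and the
5-times-repeated 5-fold cross-validation reported in the log above. This is an
illustrative approximation, not julearn's exact internals; it uses
scikit-learn's bundled copy of the iris dataset so the snippet is
self-contained.

```python
# Rough scikit-learn equivalent of the run_cross_validation call above
# (an illustrative sketch, not julearn's exact internals).
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scikit-learn's bundled iris; classes 1 and 2 are versicolor and virginica
df = load_iris(as_frame=True).frame
df = df[df['target'].isin([1, 2])]

X_cols = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']

# preprocess_X='zscore' corresponds to a StandardScaler fitted on each
# training fold, so no information leaks from the test folds
pipe = make_pipeline(StandardScaler(), SVC())
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
res = cross_validate(pipe, df[X_cols], df['target'], cv=cv)

print(res['test_score'].mean())  # mean accuracy over the 25 folds
```

Putting the scaler inside the pipeline is the important design choice here:
scaling the whole dataset before splitting would leak test-fold statistics
into training.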
.. GENERATED FROM PYTHON SOURCE LINES 41-45

Additionally, we can choose to assess the performance of the model using
different scoring functions. For example, we might have an unbalanced
dataset:

.. GENERATED FROM PYTHON SOURCE LINES 45-49

.. code-block:: default

    df_unbalanced = df_iris[20:]  # drop the first 20 versicolor samples
    print(df_unbalanced['species'].value_counts())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    virginica     50
    versicolor    30
    Name: species, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 50-55

If we compute the `accuracy`, we might not account for this imbalance. A more
suitable metric is the `balanced_accuracy`. More information in scikit-learn:
`Balanced Accuracy`_

We will also set the random seed so we always split the data in the same way.

.. GENERATED FROM PYTHON SOURCE LINES 55-63

.. code-block:: default

    scores = run_cross_validation(
        X=X, y=y, data=df_unbalanced, model='svm', seed=42,
        preprocess_X='zscore', scoring=['accuracy', 'balanced_accuracy'])

    print(scores['test_accuracy'].mean())
    print(scores['test_balanced_accuracy'].mean())
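To see why `balanced_accuracy` is the safer choice on imbalanced data,
consider a degenerate model that always predicts the majority class. Plain
accuracy looks decent, while balanced accuracy, the mean of the per-class
recalls, exposes the failure. The labels below are a small hypothetical
example, not taken from the iris data:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 8 samples of class 'a', 2 of class 'b'
y_true = ['a'] * 8 + ['b'] * 2
y_pred = ['a'] * 10  # a degenerate model that always predicts 'a'

print(accuracy_score(y_true, y_pred))  # 0.8: looks decent

# balanced accuracy is the mean of the per-class recalls
per_class_recall = recall_score(y_true, y_pred, labels=['a', 'b'], average=None)
print(per_class_recall)  # recall is 1.0 for 'a' but 0.0 for 'b'

print(balanced_accuracy_score(y_true, y_pred))  # 0.5: chance level
```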
.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:46,043 - julearn - INFO - Setting random seed to 42
    2021-01-28 20:06:46,043 - julearn - INFO - Using default CV
    2021-01-28 20:06:46,043 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:06:46,043 - julearn - INFO - Using dataframe as input
    2021-01-28 20:06:46,043 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,043 - julearn - INFO - Target: species
    2021-01-28 20:06:46,043 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,043 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:06:46,044 - julearn - INFO - ====================
    2021-01-28 20:06:46,044 - julearn - INFO -
    2021-01-28 20:06:46,044 - julearn - INFO - ====== Model ======
    2021-01-28 20:06:46,044 - julearn - INFO - Obtaining model by name: svm
    2021-01-28 20:06:46,044 - julearn - INFO - ===================
    2021-01-28 20:06:46,044 - julearn - INFO -
    2021-01-28 20:06:46,044 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0.895
    0.8708886668886668

.. GENERATED FROM PYTHON SOURCE LINES 64-74

Other kinds of metrics allow us to evaluate how good our model is at
detecting specific targets. Suppose we want to create a model that correctly
identifies the `versicolor` samples. We might then want to evaluate the
precision score: the ratio of true positives (tp) over all predicted
positives (true and false positives). More information in scikit-learn:
`Precision`_

For this metric to work, we need to define which are our `positive` values.
In this example, we are interested in detecting `versicolor`.

.. GENERATED FROM PYTHON SOURCE LINES 74-78

.. code-block:: default

    precision_scores = run_cross_validation(
        X=X, y=y, data=df_unbalanced, model='svm', preprocess_X='zscore',
        seed=42, scoring='precision', pos_labels='versicolor')

    print(precision_scores['test_precision'].mean())
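The definition of precision can be checked by hand on a tiny hypothetical set
of labels: count how many of the samples predicted as `versicolor` really are
`versicolor`, and compare with scikit-learn's `precision_score`:

```python
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels (not from the iris data)
y_true = ['versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['versicolor', 'versicolor', 'virginica', 'versicolor', 'virginica']

# tp: predicted versicolor and truly versicolor
tp = sum(t == p == 'versicolor' for t, p in zip(y_true, y_pred))
# fp: predicted versicolor but truly virginica
fp = sum(p == 'versicolor' and t != 'versicolor' for t, p in zip(y_true, y_pred))

manual = tp / (tp + fp)  # 2 / (2 + 1)
sk = precision_score(y_true, y_pred, pos_label='versicolor')
print(manual, sk)  # both 2/3
```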
.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:46,752 - julearn - INFO - Setting random seed to 42
    2021-01-28 20:06:46,752 - julearn - INFO - Using default CV
    2021-01-28 20:06:46,752 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:06:46,752 - julearn - INFO - Using dataframe as input
    2021-01-28 20:06:46,752 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,752 - julearn - INFO - Target: species
    2021-01-28 20:06:46,752 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,752 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:06:46,753 - julearn - INFO - Setting the following as positive labels ['versicolor']
    2021-01-28 20:06:46,753 - julearn - INFO - ====================
    2021-01-28 20:06:46,753 - julearn - INFO -
    2021-01-28 20:06:46,754 - julearn - INFO - ====== Model ======
    2021-01-28 20:06:46,754 - julearn - INFO - Obtaining model by name: svm
    2021-01-28 20:06:46,754 - julearn - INFO - ===================
    2021-01-28 20:06:46,754 - julearn - INFO -
    2021-01-28 20:06:46,754 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0.9223333333333333

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.782 seconds)


.. _sphx_glr_download_auto_examples_basic_run_simple_binary_classification.py:

.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: run_simple_binary_classification.py <run_simple_binary_classification.py>`

  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: run_simple_binary_classification.ipynb <run_simple_binary_classification.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_