.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/run_simple_binary_classification.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_basic_run_simple_binary_classification.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_run_simple_binary_classification.py:

Simple Binary Classification
============================

This example uses the 'iris' dataset and performs a simple binary
classification using a Support Vector Machine classifier.

.. include:: ../../links.inc

.. GENERATED FROM PYTHON SOURCE LINES 10-17

.. code-block:: default

    # Authors: Federico Raimondo
    #
    # License: AGPL

    from seaborn import load_dataset
    from julearn import run_cross_validation
    from julearn.utils import configure_logging

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/travis/virtualenv/python3.7.1/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
      return f(*args, **kwds)

.. GENERATED FROM PYTHON SOURCE LINES 18-19

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 19-21

.. code-block:: default

    configure_logging(level='INFO')

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:45,526 - julearn - INFO - ===== Lib Versions =====
    2021-01-28 20:06:45,527 - julearn - INFO - numpy: 1.19.5
    2021-01-28 20:06:45,527 - julearn - INFO - scipy: 1.6.0
    2021-01-28 20:06:45,527 - julearn - INFO - sklearn: 0.24.1
    2021-01-28 20:06:45,527 - julearn - INFO - pandas: 1.2.1
    2021-01-28 20:06:45,527 - julearn - INFO - julearn: 0.2.5.dev19+g9c15c5f
    2021-01-28 20:06:45,527 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 22-24

.. code-block:: default

    df_iris = load_dataset('iris')
.. GENERATED FROM PYTHON SOURCE LINES 25-27

The dataset contains three species. We will keep two of them to perform a
binary classification.

.. GENERATED FROM PYTHON SOURCE LINES 27-29

.. code-block:: default

    df_iris = df_iris[df_iris['species'].isin(['versicolor', 'virginica'])]

.. GENERATED FROM PYTHON SOURCE LINES 30-32

As features, we will use the sepal length, sepal width and petal length.
We will try to predict the species.

.. GENERATED FROM PYTHON SOURCE LINES 32-40

.. code-block:: default

    X = ['sepal_length', 'sepal_width', 'petal_length']
    y = 'species'
    scores = run_cross_validation(
        X=X, y=y, data=df_iris, model='svm', preprocess_X='zscore')

    print(scores['test_score'])

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:45,532 - julearn - INFO - Using default CV
    2021-01-28 20:06:45,532 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:06:45,532 - julearn - INFO - Using dataframe as input
    2021-01-28 20:06:45,532 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:45,532 - julearn - INFO - Target: species
    2021-01-28 20:06:45,533 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:45,533 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:06:45,534 - julearn - INFO - ====================
    2021-01-28 20:06:45,534 - julearn - INFO -
    2021-01-28 20:06:45,534 - julearn - INFO - ====== Model ======
    2021-01-28 20:06:45,534 - julearn - INFO - Obtaining model by name: svm
    2021-01-28 20:06:45,534 - julearn - INFO - ===================
    2021-01-28 20:06:45,534 - julearn - INFO -
    2021-01-28 20:06:45,534 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0     0.90
    1     0.90
    2     0.95
    3     1.00
    4     0.90
    5     0.85
    6     0.90
    7     0.95
    8     0.95
    9     0.90
    10    0.90
    11    0.95
    12    0.90
    13    0.90
    14    0.90
    15    0.70
    16    1.00
    17    0.90
    18    0.95
    19    0.90
    20    0.95
    21    0.95
    22    0.95
    23    0.90
    24    0.95
    Name: test_score, dtype: float64
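To see what this call amounts to, here is a rough plain scikit-learn sketch of
the same analysis: z-scoring inside a pipeline, an SVM classifier, and the
5-times-repeated 5-fold cross-validation reported in the log above. This is an
illustrative approximation, not julearn's exact internals; it uses
scikit-learn's bundled copy of the iris dataset so the snippet is
self-contained.

```python
# Rough scikit-learn equivalent of the run_cross_validation call above
# (an illustrative sketch, not julearn's exact internals).
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scikit-learn's bundled iris; classes 1 and 2 are versicolor and virginica
df = load_iris(as_frame=True).frame
df = df[df['target'].isin([1, 2])]

X_cols = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)']

# preprocess_X='zscore' corresponds to a StandardScaler fitted on each
# training fold, so no information leaks from the test folds
pipe = make_pipeline(StandardScaler(), SVC())
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
res = cross_validate(pipe, df[X_cols], df['target'], cv=cv)

print(res['test_score'].mean())  # mean accuracy over the 25 folds
```

Putting the scaler inside the pipeline is the important design choice here:
scaling the whole dataset before splitting would leak test-fold statistics
into training.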
.. GENERATED FROM PYTHON SOURCE LINES 41-45

Additionally, we can choose to assess the performance of the model using
different scoring functions. For example, we might have an unbalanced
dataset:

.. GENERATED FROM PYTHON SOURCE LINES 45-49

.. code-block:: default

    df_unbalanced = df_iris[20:]  # drop the first 20 versicolor samples
    print(df_unbalanced['species'].value_counts())

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    virginica     50
    versicolor    30
    Name: species, dtype: int64

.. GENERATED FROM PYTHON SOURCE LINES 50-55

If we compute the `accuracy`, we might not account for this imbalance. A more
suitable metric is the `balanced_accuracy`. More information in scikit-learn:
`Balanced Accuracy`_

We will also set the random seed so we always split the data in the same way.

.. GENERATED FROM PYTHON SOURCE LINES 55-63

.. code-block:: default

    scores = run_cross_validation(
        X=X, y=y, data=df_unbalanced, model='svm', seed=42,
        preprocess_X='zscore', scoring=['accuracy', 'balanced_accuracy'])

    print(scores['test_accuracy'].mean())
    print(scores['test_balanced_accuracy'].mean())
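To see why `balanced_accuracy` is the safer choice on imbalanced data,
consider a degenerate model that always predicts the majority class. Plain
accuracy looks decent, while balanced accuracy, the mean of the per-class
recalls, exposes the failure. The labels below are a small hypothetical
example, not taken from the iris data:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical imbalanced labels: 8 samples of class 'a', 2 of class 'b'
y_true = ['a'] * 8 + ['b'] * 2
y_pred = ['a'] * 10  # a degenerate model that always predicts 'a'

print(accuracy_score(y_true, y_pred))  # 0.8: looks decent

# balanced accuracy is the mean of the per-class recalls
per_class_recall = recall_score(y_true, y_pred, labels=['a', 'b'], average=None)
print(per_class_recall)  # recall is 1.0 for 'a' but 0.0 for 'b'

print(balanced_accuracy_score(y_true, y_pred))  # 0.5: chance level
```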
.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:46,043 - julearn - INFO - Setting random seed to 42
    2021-01-28 20:06:46,043 - julearn - INFO - Using default CV
    2021-01-28 20:06:46,043 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:06:46,043 - julearn - INFO - Using dataframe as input
    2021-01-28 20:06:46,043 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,043 - julearn - INFO - Target: species
    2021-01-28 20:06:46,043 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,043 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:06:46,044 - julearn - INFO - ====================
    2021-01-28 20:06:46,044 - julearn - INFO -
    2021-01-28 20:06:46,044 - julearn - INFO - ====== Model ======
    2021-01-28 20:06:46,044 - julearn - INFO - Obtaining model by name: svm
    2021-01-28 20:06:46,044 - julearn - INFO - ===================
    2021-01-28 20:06:46,044 - julearn - INFO -
    2021-01-28 20:06:46,044 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0.895
    0.8708886668886668

.. GENERATED FROM PYTHON SOURCE LINES 64-74

Other kinds of metrics allow us to evaluate how good our model is at
detecting specific targets. Suppose we want to create a model that correctly
identifies the `versicolor` samples. We might then want to evaluate the
precision score: the ratio of true positives (tp) over all predicted
positives (true and false positives). More information in scikit-learn:
`Precision`_

For this metric to work, we need to define which are our `positive` values.
In this example, we are interested in detecting `versicolor`.

.. GENERATED FROM PYTHON SOURCE LINES 74-78

.. code-block:: default

    precision_scores = run_cross_validation(
        X=X, y=y, data=df_unbalanced, model='svm', preprocess_X='zscore',
        seed=42, scoring='precision', pos_labels='versicolor')

    print(precision_scores['test_precision'].mean())
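The definition of precision can be checked by hand on a tiny hypothetical set
of labels: count how many of the samples predicted as `versicolor` really are
`versicolor`, and compare with scikit-learn's `precision_score`:

```python
from sklearn.metrics import precision_score

# Hypothetical true and predicted labels (not from the iris data)
y_true = ['versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['versicolor', 'versicolor', 'virginica', 'versicolor', 'virginica']

# tp: predicted versicolor and truly versicolor
tp = sum(t == p == 'versicolor' for t, p in zip(y_true, y_pred))
# fp: predicted versicolor but truly virginica
fp = sum(p == 'versicolor' and t != 'versicolor' for t, p in zip(y_true, y_pred))

manual = tp / (tp + fp)  # 2 / (2 + 1)
sk = precision_score(y_true, y_pred, pos_label='versicolor')
print(manual, sk)  # both 2/3
```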
.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:06:46,752 - julearn - INFO - Setting random seed to 42
    2021-01-28 20:06:46,752 - julearn - INFO - Using default CV
    2021-01-28 20:06:46,752 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:06:46,752 - julearn - INFO - Using dataframe as input
    2021-01-28 20:06:46,752 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,752 - julearn - INFO - Target: species
    2021-01-28 20:06:46,752 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length']
    2021-01-28 20:06:46,752 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:06:46,753 - julearn - INFO - Setting the following as positive labels ['versicolor']
    2021-01-28 20:06:46,753 - julearn - INFO - ====================
    2021-01-28 20:06:46,753 - julearn - INFO -
    2021-01-28 20:06:46,754 - julearn - INFO - ====== Model ======
    2021-01-28 20:06:46,754 - julearn - INFO - Obtaining model by name: svm
    2021-01-28 20:06:46,754 - julearn - INFO - ===================
    2021-01-28 20:06:46,754 - julearn - INFO -
    2021-01-28 20:06:46,754 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
    0.9223333333333333

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.782 seconds)


.. _sphx_glr_download_auto_examples_basic_run_simple_binary_classification.py:

.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: run_simple_binary_classification.py <run_simple_binary_classification.py>`

  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: run_simple_binary_classification.ipynb <run_simple_binary_classification.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_