.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/plot_transform_until.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_basic_plot_transform_until.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_plot_transform_until.py:


Preprocessing with variance threshold, zscore and PCA
=======================================================

This example uses the 'iris' dataset and performs a simple binary
classification after preprocessing the features. The preprocessing consists of
removing low-variance features, normalizing the features using zscore and
reducing the features using PCA. We will check the features after each
preprocessing step.

.. GENERATED FROM PYTHON SOURCE LINES 11-23

.. code-block:: default


    # Authors: Shammi More <s.more@fz-juelich.de>
    #
    # License: AGPL

    import matplotlib.pyplot as plt
    import seaborn as sns
    from seaborn import load_dataset

    from julearn import run_cross_validation
    from julearn.utils import configure_logging


.. GENERATED FROM PYTHON SOURCE LINES 24-25

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 25-27

.. code-block:: default

    configure_logging(level='INFO')


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2023-04-06 09:51:01,275 - julearn - INFO - ===== Lib Versions =====
    2023-04-06 09:51:01,275 - julearn - INFO - numpy: 1.23.5
    2023-04-06 09:51:01,275 - julearn - INFO - scipy: 1.10.1
    2023-04-06 09:51:01,275 - julearn - INFO - sklearn: 1.0.2
    2023-04-06 09:51:01,275 - julearn - INFO - pandas: 1.4.4
    2023-04-06 09:51:01,276 - julearn - INFO - julearn: 0.3.1.dev2
    2023-04-06 09:51:01,276 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 28-29

Load the iris data from seaborn

.. GENERATED FROM PYTHON SOURCE LINES 29-31

.. code-block:: default

    df_iris = load_dataset('iris')


.. GENERATED FROM PYTHON SOURCE LINES 32-34

The dataset contains three species. We will keep two of them to perform a
binary classification.

.. GENERATED FROM PYTHON SOURCE LINES 34-37

.. code-block:: default


    df_iris = df_iris[df_iris['species'].isin(['versicolor', 'virginica'])]


.. GENERATED FROM PYTHON SOURCE LINES 38-40

We will use the sepal length, sepal width, petal length and petal width as
features and predict the species.

.. GENERATED FROM PYTHON SOURCE LINES 40-44

.. code-block:: default


    X = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    y = 'species'


.. GENERATED FROM PYTHON SOURCE LINES 45-46

Let's look at the summary statistics of the raw features

.. GENERATED FROM PYTHON SOURCE LINES 46-48

.. code-block:: default

    print('Summary Statistics of the raw features : \n', df_iris.describe())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Summary Statistics of the raw features : 
            sepal_length  sepal_width  petal_length  petal_width
    count    100.000000   100.000000    100.000000   100.000000
    mean       6.262000     2.872000      4.906000     1.676000
    std        0.662834     0.332751      0.825578     0.424769
    min        4.900000     2.000000      3.000000     1.000000
    25%        5.800000     2.700000      4.375000     1.300000
    50%        6.300000     2.900000      4.900000     1.600000
    75%        6.700000     3.025000      5.525000     2.000000
    max        7.900000     3.800000      6.900000     2.500000


.. GENERATED FROM PYTHON SOURCE LINES 49-51

We will preprocess the features using variance thresholding, zscore and PCA
and then train a random forest model

.. GENERATED FROM PYTHON SOURCE LINES 51-67

.. code-block:: default


    # Define the model parameters and preprocessing steps first
    # Setting the threshold for variance to 0.15, number of PCA components to 2
    # and number of trees for random forest to 200

    model_params = {'select_variance__threshold': 0.15,
                    'pca__n_components': 2,
                    'rf__n_estimators': 200}

    preprocess_X = ['select_variance', 'zscore', 'pca']

    scores, model = run_cross_validation(
        X=X, y=y, data=df_iris, model='rf', preprocess_X=preprocess_X,
        scoring=['accuracy', 'roc_auc'], model_params=model_params,
        return_estimator='final', seed=200)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    2023-04-06 09:51:01,299 - julearn - INFO - Setting random seed to 200
    2023-04-06 09:51:01,299 - julearn - INFO - Using default CV
    2023-04-06 09:51:01,299 - julearn - INFO - ==== Input Data ====
    2023-04-06 09:51:01,299 - julearn - INFO - Using dataframe as input
    2023-04-06 09:51:01,300 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    2023-04-06 09:51:01,300 - julearn - INFO - Target: species
    2023-04-06 09:51:01,300 - julearn - INFO - Expanded X: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    2023-04-06 09:51:01,300 - julearn - INFO - Expanded Confounds: []
    2023-04-06 09:51:01,301 - julearn - INFO - ====================
    2023-04-06 09:51:01,301 - julearn - INFO - 
    2023-04-06 09:51:01,301 - julearn - INFO - ====== Model ======
    2023-04-06 09:51:01,301 - julearn - INFO - Obtaining model by name: rf
    2023-04-06 09:51:01,302 - julearn - INFO - ===================
    2023-04-06 09:51:01,302 - julearn - INFO - 
    2023-04-06 09:51:01,302 - julearn - INFO - = Model Parameters =
    2023-04-06 09:51:01,302 - julearn - INFO - Setting hyperparameter select_variance__threshold = 0.15
    2023-04-06 09:51:01,303 - julearn - INFO - Setting hyperparameter pca__n_components = 2
    2023-04-06 09:51:01,304 - julearn - INFO - Setting hyperparameter rf__n_estimators = 200
    2023-04-06 09:51:01,306 - julearn - INFO - ====================
    2023-04-06 09:51:01,306 - julearn - INFO - 
    2023-04-06 09:51:01,306 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds

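The keys in `model_params` follow a `<step name>__<parameter>` pattern. As a
rough illustration of that naming scheme, the sketch below builds a plain
scikit-learn `Pipeline` with the same step names and sets the same
parameters. The mapping of `'select_variance'`, `'zscore'`, `'pca'` and `'rf'`
to the scikit-learn classes used here, and the use of `random_state=200` for
the seed, are assumptions made for this sketch. It is not a description of
how julearn builds its pipeline internally.

.. code-block:: python

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Step names chosen to mirror the names used in this example (assumption)
    pipe = Pipeline([
        ('select_variance', VarianceThreshold()),
        ('zscore', StandardScaler()),
        ('pca', PCA()),
        ('rf', RandomForestClassifier(random_state=200)),
    ])

    # '<step name>__<parameter>' routes each value to the matching step
    pipe.set_params(**{
        'select_variance__threshold': 0.15,
        'pca__n_components': 2,
        'rf__n_estimators': 200,
    })

    print(pipe.named_steps['select_variance'].threshold)
    print(pipe.named_steps['pca'].n_components)
    print(pipe.named_steps['rf'].n_estimators)

Each key is split at the double underscore: the part before it selects the
step and the part after it names the parameter to set on that step.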

.. GENERATED FROM PYTHON SOURCE LINES 68-73

Now let's look at the data after preprocessing. This can be done using the
`preprocess` method of the returned model. By default it applies all the
preprocessing steps (`'select_variance'`, `'zscore'`, `'pca'` in this case) and
returns the preprocessed data. Note that here the preprocessing is applied only
to X. Notice also that the column names have changed in this new dataframe.

.. GENERATED FROM PYTHON SOURCE LINES 73-87

.. code-block:: default


    pre_X, pre_y = model.preprocess(df_iris[X], df_iris[y])
    print('Features after PCA : \n', pre_X)

    # Let's plot scatter plots for raw features and the PCA components
    pre_df = pre_X.join(pre_y)
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    sns.scatterplot(x='sepal_length', y='sepal_width', data=df_iris, hue='species',
                    ax=axes[0])
    axes[0].set_title('Raw features')
    sns.scatterplot(x='pca_component_0', y='pca_component_1', data=pre_df,
                    hue='species', ax=axes[1])
    axes[1].set_title('PCA components')


.. image-sg:: /auto_examples/basic/images/sphx_glr_plot_transform_until_001.png
   :alt: Raw features, PCA components
   :srcset: /auto_examples/basic/images/sphx_glr_plot_transform_until_001.png
   :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Features after PCA : 
          pca_component_0  pca_component_1
    50          0.107487        -1.249219
    51         -0.418588        -0.440114
    52          0.304093        -0.975911
    53         -1.818796         0.185820
    54         -0.259106        -0.547432
    ..               ...              ...
    145         1.413947         0.581755
    146         0.397898         0.335935
    147         0.848759         0.289674
    148         1.139654         1.112779
    149         0.001729         0.592856

    [100 rows x 2 columns]

    Text(0.5, 1.0, 'PCA components')


.. GENERATED FROM PYTHON SOURCE LINES 88-93

But let's say we want to look at the features after applying only some of the
preprocessing steps, e.g. only variance thresholding, or everything up to and
including zscore. To do so, we can set the `until` argument to the name of the
desired preprocessing step. Note that the name of the preprocessing step is
the same as the one passed to the `run_cross_validation` function in
`preprocess_X`.

.. GENERATED FROM PYTHON SOURCE LINES 93-102

.. code-block:: default


    # Let's look at the features after variance thresholding. We now have one
    # feature fewer, because the variance of 'sepal_width' was below the set
    # threshold.

    var_th_X, var_th_y = model.preprocess(df_iris[X], df_iris[y],
                                          until='select_variance')
    print('Features after variance thresholding: \n', var_th_X)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Features after variance thresholding: 
          sepal_length  petal_length  petal_width
    50            7.0           4.7          1.4
    51            6.4           4.5          1.5
    52            6.9           4.9          1.5
    53            5.5           4.0          1.3
    54            6.5           4.6          1.5
    ..            ...           ...          ...
    145           6.7           5.2          2.3
    146           6.3           5.0          1.9
    147           6.5           5.2          2.0
    148           6.2           5.4          2.3
    149           5.9           5.1          1.8

    [100 rows x 3 columns]

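To make the effect of the variance threshold concrete, here is a minimal
sketch using scikit-learn's `VarianceThreshold` directly on a small toy
dataframe (not the iris data). It assumes that the `'select_variance'` step
behaves like this scikit-learn transformer. The toy columns `a`, `b` and `c`
are hypothetical.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    # Toy data: column 'b' barely varies, columns 'a' and 'c' vary a lot
    toy = pd.DataFrame({
        'a': [1.0, 2.0, 3.0, 4.0],
        'b': [1.0, 1.1, 1.0, 1.1],
        'c': [0.0, 2.0, 0.0, 2.0],
    })

    selector = VarianceThreshold(threshold=0.15)
    selector.fit(toy)

    # VarianceThreshold compares the population variance (ddof=0) of each
    # column with the threshold and keeps only the columns above it
    variances = toy.var(ddof=0)
    kept = list(toy.columns[selector.get_support()])

    print(variances)
    print('Kept columns:', kept)

In the iris subset above, the standard deviation of `sepal_width` is about
0.33, so its variance is about 0.11, which is below the threshold of 0.15.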

.. GENERATED FROM PYTHON SOURCE LINES 103-105

Now let's see the features after variance thresholding and zscoring. To do
so, we set the `until` argument to `'zscore'`.

.. GENERATED FROM PYTHON SOURCE LINES 105-109

.. code-block:: default


    zscored_X, zscored_y = model.preprocess(df_iris[X], df_iris[y], until='zscore')
    zscored_df = zscored_X.join(zscored_y)


.. GENERATED FROM PYTHON SOURCE LINES 110-112

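Let's look at the summary statistics of the zscored features. The mean of
each feature is zero up to floating-point error. The reported standard
deviation is 1.005 rather than exactly 1 because `describe` reports the sample
standard deviation (`ddof=1`). This is consistent with a zscore computed using
the population standard deviation, since sqrt(100 / 99) is approximately 1.005.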

.. GENERATED FROM PYTHON SOURCE LINES 112-114

.. code-block:: default

    print('Summary Statistics of the zscored features : \n', zscored_df.describe())


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Summary Statistics of the zscored features : 
            sepal_length  petal_length   petal_width
    count  1.000000e+02  1.000000e+02  1.000000e+02
    mean  -6.683543e-16  3.269607e-16 -1.443845e-15
    std    1.005038e+00  1.005038e+00  1.005038e+00
    min   -2.065164e+00 -2.320315e+00 -1.599473e+00
    25%   -7.005180e-01 -6.464256e-01 -8.896475e-01
    50%    5.761837e-02 -7.304244e-03 -1.798224e-01
    75%    6.641275e-01  7.535546e-01  7.666111e-01
    max    2.483655e+00  2.427444e+00  1.949653e+00

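The standard deviation of 1.005038 in the output above can be reproduced with
a small sketch. It assumes that the `'zscore'` step behaves like
scikit-learn's `StandardScaler`, which divides by the population standard
deviation. The synthetic column `feature` is hypothetical and is not taken
from the iris data.

.. code-block:: python

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # 100 synthetic samples, matching the number of rows in this example
    rng = np.random.default_rng(0)
    raw = pd.DataFrame({'feature': rng.normal(loc=5.0, scale=2.0, size=100)})

    scaled = pd.DataFrame(StandardScaler().fit_transform(raw),
                          columns=raw.columns)

    # Population standard deviation (ddof=0): what the scaler normalizes to 1
    pop_std = scaled['feature'].std(ddof=0)
    # Sample standard deviation (ddof=1): what DataFrame.describe() reports
    sample_std = scaled['feature'].std(ddof=1)

    print(pop_std, sample_std, np.sqrt(100 / 99))

With 100 rows, the two estimates differ by a factor of sqrt(100 / 99), which
is why `describe` does not show exactly one.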

.. GENERATED FROM PYTHON SOURCE LINES 115-118

We can also look at the features preprocessed up to and including PCA. Since
`'pca'` is the last preprocessing step, the `until` argument is not really
needed here: the result is the same as that of the call without `until`
shown above.

.. GENERATED FROM PYTHON SOURCE LINES 118-122

.. code-block:: default


    pre_X, pre_y = model.preprocess(df_iris[X], df_iris[y], until='pca')
    print('Features after PCA : \n', pre_X)


.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Features after PCA : 
          pca_component_0  pca_component_1
    50          0.107487        -1.249219
    51         -0.418588        -0.440114
    52          0.304093        -0.975911
    53         -1.818796         0.185820
    54         -0.259106        -0.547432
    ..               ...              ...
    145         1.413947         0.581755
    146         0.397898         0.335935
    147         0.848759         0.289674
    148         1.139654         1.112779
    149         0.001729         0.592856

    [100 rows x 2 columns]

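For readers more familiar with plain scikit-learn, the idea behind `until`
has a rough analogue in slicing a fitted `Pipeline`. The sketch below is an
illustration on synthetic data under that assumption. It does not show how
julearn implements `preprocess`, and the step names are chosen only to mirror
this example.

.. code-block:: python

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import VarianceThreshold
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic data: four columns, the second one is nearly constant
    rng = np.random.default_rng(200)
    data = rng.normal(size=(50, 4)) * np.array([1.0, 0.1, 2.0, 1.5])

    steps = Pipeline([
        ('select_variance', VarianceThreshold(threshold=0.15)),
        ('zscore', StandardScaler()),
        ('pca', PCA(n_components=2)),
    ])
    steps.fit(data)

    # Slicing a fitted pipeline gives the transformation up to a given step
    after_variance = steps[:1].transform(data)  # variance thresholding only
    after_zscore = steps[:2].transform(data)    # thresholding, then zscore
    after_pca = steps.transform(data)           # all three steps

    print(after_variance.shape, after_zscore.shape, after_pca.shape)

The low-variance column is dropped by the first step, so the intermediate
results have three columns, and the final result has two PCA components.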

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  11.272 seconds)


.. _sphx_glr_download_auto_examples_basic_plot_transform_until.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example


    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_transform_until.py <plot_transform_until.py>`

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_transform_until.ipynb <plot_transform_until.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_