.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/plot_example_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_basic_plot_example_regression.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_plot_example_regression.py:


Regression Analysis
============================

This example uses the 'diabetes' data from sklearn datasets and performs
a regression analysis using a Ridge Regression model.

.. GENERATED FROM PYTHON SOURCE LINES 9-25

.. code-block:: default

    # Authors: Shammi More <s.more@fz-juelich.de>
    #          Federico Raimondo <f.raimondo@fz-juelich.de>
    #
    # License: AGPL

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    from julearn import run_cross_validation
    from julearn.utils import configure_logging



.. GENERATED FROM PYTHON SOURCE LINES 26-27

Set the logging level to INFO to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 27-29

.. code-block:: default

    configure_logging(level='INFO')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:08:00,657 - julearn - INFO - ===== Lib Versions =====
    2021-01-28 20:08:00,657 - julearn - INFO - numpy: 1.19.5
    2021-01-28 20:08:00,657 - julearn - INFO - scipy: 1.6.0
    2021-01-28 20:08:00,657 - julearn - INFO - sklearn: 0.24.1
    2021-01-28 20:08:00,657 - julearn - INFO - pandas: 1.2.1
    2021-01-28 20:08:00,657 - julearn - INFO - julearn: 0.2.5.dev19+g9c15c5f
    2021-01-28 20:08:00,657 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 30-31

Load the diabetes data from scikit-learn as a pandas DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 31-33

.. code-block:: default

    features, target = load_diabetes(return_X_y=True, as_frame=True)


.. GENERATED FROM PYTHON SOURCE LINES 34-38

The dataset contains ten variables (age, sex, body mass index, average blood
pressure, and six blood serum measurements, s1-s6) recorded for diabetes
patients, plus a quantitative measure of disease progression one year after
baseline, which is the target we want to predict.

.. GENERATED FROM PYTHON SOURCE LINES 38-42

.. code-block:: default


    print('Features: \n', features.head())
    print('Target: \n', target.describe())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Features: 
             age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019908 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068330 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002864 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022692 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031991 -0.046641

    [5 rows x 10 columns]
    Target: 
     count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64


.. GENERATED FROM PYTHON SOURCE LINES 43-45

Let's combine features and target in one dataframe and define X and y as
column names: X lists the feature columns and y names the target column.

.. GENERATED FROM PYTHON SOURCE LINES 45-50

.. code-block:: default

    data_diabetes = pd.concat([features, target], axis=1)

    X = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    y = 'target'


.. GENERATED FROM PYTHON SOURCE LINES 51-52

Calculate the correlations between all variables and plot them as a heat map.

.. GENERATED FROM PYTHON SOURCE LINES 52-59

.. code-block:: default

    corr = data_diabetes.corr()
    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set(font_scale=1.2)
    sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
                annot=True, fmt="0.1f")


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_001.png
    :alt: plot example regression
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    <AxesSubplot:>


.. GENERATED FROM PYTHON SOURCE LINES 60-61

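Split the dataset into training and test sets, holding out 30% of the rows
for testing. No ``random_state`` is passed, so the split and the numbers
below differ from run to run.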

.. GENERATED FROM PYTHON SOURCE LINES 61-63

.. code-block:: default

    train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)


.. GENERATED FROM PYTHON SOURCE LINES 64-66

Train a ridge regression model on the training set, z-scoring the features
and using the negated mean absolute error for scoring.

.. GENERATED FROM PYTHON SOURCE LINES 66-71

.. code-block:: default

    scores, model = run_cross_validation(
        X=X, y=y, data=train_diabetes, preprocess_X='zscore',
        problem_type='regression', model='ridge', return_estimator='final',
        scoring='neg_mean_absolute_error')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:08:01,081 - julearn - INFO - Using default CV
    2021-01-28 20:08:01,081 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:08:01,081 - julearn - INFO - Using dataframe as input
    2021-01-28 20:08:01,081 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2021-01-28 20:08:01,081 - julearn - INFO - Target: target
    2021-01-28 20:08:01,082 - julearn - INFO - Expanded X: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2021-01-28 20:08:01,082 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:08:01,083 - julearn - INFO - ====================
    2021-01-28 20:08:01,083 - julearn - INFO - 
    2021-01-28 20:08:01,083 - julearn - INFO - ====== Model ======
    2021-01-28 20:08:01,083 - julearn - INFO - Obtaining model by name: ridge
    2021-01-28 20:08:01,083 - julearn - INFO - ===================
    2021-01-28 20:08:01,083 - julearn - INFO - 
    2021-01-28 20:08:01,083 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds
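
As a rough illustration of what ``preprocess_X='zscore'`` is expected to do,
the sketch below standardises one feature to zero mean and unit standard
deviation in plain Python. It assumes the usual definition, with the
statistics estimated on the training rows only and then applied unchanged to
held-out rows. ``zscore_fit`` and ``zscore_apply`` are hypothetical helpers
written for this illustration, not julearn functions, and the numbers are
made up.

```python
import statistics


def zscore_fit(values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(values), statistics.pstdev(values)


def zscore_apply(values, mean, std):
    """Standardise values with previously fitted statistics."""
    return [(v - mean) / std for v in values]


# Made-up numbers for illustration only (not from the diabetes data).
train = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test = [5.0, 11.0]

mean, std = zscore_fit(train)            # fitted on the training data only
train_z = zscore_apply(train, mean, std)
test_z = zscore_apply(test, mean, std)   # reuses the training statistics

print(mean, std)
print(test_z)
```

Estimating the statistics inside each training fold, rather than on the full
dataset, keeps information from the held-out fold out of the model.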


.. GENERATED FROM PYTHON SOURCE LINES 72-73

The scores dataframe has all the values for each CV split.

.. GENERATED FROM PYTHON SOURCE LINES 73-76

.. code-block:: default


    print(scores.head())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

       fit_time  score_time  test_neg_mean_absolute_error  repeat  fold
    0  0.012554    0.007784                    -53.711774       0     0
    1  0.011844    0.008358                    -38.334731       0     1
    2  0.012480    0.007795                    -46.578596       0     2
    3  0.011738    0.007895                    -48.109208       0     3
    4  0.011963    0.007806                    -42.632269       0     4
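
The log above reports a repeated K-fold scheme with 5 repetitions of 5 folds,
so ``scores`` should hold 25 rows, one per (repeat, fold) pair. A minimal
plain-Python sketch of how such splits can be generated follows.
``repeated_kfold_indices`` is a hypothetical helper, and the way it shuffles
and assigns folds is an assumption for illustration, not the exact procedure
used by julearn or scikit-learn.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for a repeated K-fold scheme."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh shuffle for every repetition
        for fold in range(n_splits):
            # Every n_splits-th shuffled index goes to the test fold.
            test_idx = indices[fold::n_splits]
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield repeat, fold, train_idx, test_idx


splits = list(repeated_kfold_indices(n_samples=20))
print(len(splits))  # one entry per (repeat, fold) pair
```

Within one repetition every sample is held out exactly once; repeating the
procedure with new shuffles gives several estimates of the same error.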


.. GENERATED FROM PYTHON SOURCE LINES 77-78

Mean absolute error (MAE) averaged over all CV splits:

.. GENERATED FROM PYTHON SOURCE LINES 78-80

.. code-block:: default

    print(scores['test_neg_mean_absolute_error'].mean() * -1)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    45.94450832287054
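
The score name carries a ``neg_`` prefix because scikit-learn scorers follow
a greater-is-better convention, so error metrics are reported with their sign
flipped; multiplying by -1 recovers the MAE. The plain-Python sketch below
illustrates this on made-up numbers. Both helper functions are hypothetical
and only mirror the definitions.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def neg_mae_score(y_true, y_pred):
    """Sign-flipped MAE, so that a larger score means a better model."""
    return -mean_absolute_error_manual(y_true, y_pred)


# Made-up targets and two sets of predictions.
y_true = [100.0, 150.0, 200.0]
close_pred = [110.0, 140.0, 205.0]   # absolute errors: 10, 10, 5
far_pred = [160.0, 90.0, 260.0]      # absolute errors: 60, 60, 60

close_score = neg_mae_score(y_true, close_pred)
far_score = neg_mae_score(y_true, far_pred)
print(close_score, far_score)
print(-close_score)  # back to an ordinary MAE
```

The better predictions get the larger (less negative) score, which is what
lets model selection code always maximise the score.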


.. GENERATED FROM PYTHON SOURCE LINES 81-82

Now we can get the MAE for each fold and repetition:

.. GENERATED FROM PYTHON SOURCE LINES 82-90

.. code-block:: default


    df_mae = scores.set_index(
        ['repeat', 'fold'])['test_neg_mean_absolute_error'].unstack() * -1
    df_mae.index.name = 'Repeats'
    df_mae.columns.name = 'K-fold splits'

    print(df_mae)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    K-fold splits          0          1          2          3          4
    Repeats                                                             
    0              53.711774  38.334731  46.578596  48.109208  42.632269
    1              46.110360  49.273373  45.245755  48.425018  41.661824
    2              45.176249  50.034589  48.108842  39.433303  44.937370
    3              46.238680  39.120725  48.266876  46.853780  48.097210
    4              45.943724  43.651725  41.527743  47.429889  53.709096


.. GENERATED FROM PYTHON SOURCE LINES 91-92

Plot a heatmap of the mean absolute error (MAE) over all repeats and CV splits.

.. GENERATED FROM PYTHON SOURCE LINES 92-96

.. code-block:: default

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(df_mae, cmap="YlGnBu")
    plt.title('Cross-validation MAE')


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_002.png
    :alt: Cross-validation MAE
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    Text(0.5, 1.0, 'Cross-validation MAE')


.. GENERATED FROM PYTHON SOURCE LINES 97-98

Let's plot the feature importance using the coefficients of the trained model

.. GENERATED FROM PYTHON SOURCE LINES 98-111

.. code-block:: default


    features = pd.DataFrame({'Features': X, 'importance': model['ridge'].coef_})
    features.sort_values(by=['importance'], ascending=True, inplace=True)
    features['positive'] = features['importance'] > 0

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    features.set_index('Features', inplace=True)
    features.importance.plot(kind='barh',
                             color=features.positive.map(
                                 {True: 'blue', False: 'red'}))
    ax.set(xlabel='Importance', title='Variable importance for Ridge Regression')


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_003.png
    :alt: Variable importance for Ridge Regression
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    [Text(0.5, 40.249999999999986, 'Importance'), Text(0.5, 1.0, 'Variable importance for Ridge Regression')]
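
Reading coefficients as importance relies on the features sharing a scale
(here through z-scoring) and on how ridge shrinks coefficients. For a single
centred feature and a centred target, the ridge solution has the closed form
``w = sum(x * y) / (sum(x * x) + alpha)``. The sketch below evaluates it in
plain Python on made-up data. ``ridge_slope`` is a hypothetical helper and the
``alpha`` values are arbitrary, not the settings of the model trained above.

```python
def ridge_slope(x, y, alpha):
    """Ridge coefficient for one centred feature and a centred target."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + alpha)


# Made-up, already centred data (both lists sum to zero).
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]

for alpha in (0.0, 1.0, 10.0):
    print(alpha, ridge_slope(x, y, alpha))
```

With ``alpha = 0`` the result is the ordinary least-squares slope; larger
values pull the coefficient toward zero without changing its sign.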


.. GENERATED FROM PYTHON SOURCE LINES 112-114

Use the final model to make predictions on the test data and draw a scatter
plot of true versus predicted values.

.. GENERATED FROM PYTHON SOURCE LINES 114-132

.. code-block:: default


    y_true = test_diabetes[y]
    y_pred = model.predict(test_diabetes[X])
    mae = format(mean_absolute_error(y_true, y_pred), '.2f')
    corr = format(np.corrcoef(y_pred, y_true)[1, 0], '.2f')

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    plt.scatter(y_true, y_pred)
    plt.plot(y_true, y_true)
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    text = 'MAE: ' + str(mae) + '   CORR: ' + str(corr)
    ax.set(xlabel='True values', ylabel='Predicted values')
    plt.title('Actual vs Predicted')
    plt.text(xmax - 0.01 * xmax, ymax - 0.01 * ymax, text, verticalalignment='top',
             horizontalalignment='right', fontsize=12)
    plt.axis('scaled')


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_004.png
    :alt: Actual vs Predicted
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    (10.75, 324.25, 10.75, 324.25)
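
The CORR value shown in the plot comes from ``np.corrcoef``, which returns
Pearson correlation coefficients. The sketch below spells out that formula in
plain Python for two short, made-up lists. ``pearson_r`` is a hypothetical
helper written for this illustration, not part of NumPy or julearn.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance numerator and the two sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


# Made-up values, not taken from the diabetes data.
true_values = [1.0, 2.0, 3.0, 4.0]
predictions = [2.0, 4.0, 6.0, 8.0]

r_up = pearson_r(true_values, predictions)          # perfectly linear, increasing
r_down = pearson_r(true_values, predictions[::-1])  # perfectly linear, decreasing
print(r_up, r_down)
```

Note that the predictions here are twice the true values yet still correlate
perfectly: correlation measures linear association, not agreement, which is
why the example reports MAE alongside it.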


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes  1.494 seconds)


.. _sphx_glr_download_auto_examples_basic_plot_example_regression.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_example_regression.py <plot_example_regression.py>`


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_example_regression.ipynb <plot_example_regression.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_