
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/basic/plot_example_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here <sphx_glr_download_auto_examples_basic_plot_example_regression.py>`
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_basic_plot_example_regression.py:


Regression Analysis
============================

This example uses the 'diabetes' data from sklearn datasets and performs
a regression analysis using a Ridge Regression model.

.. GENERATED FROM PYTHON SOURCE LINES 9-25

.. code-block:: default

    # Authors: Shammi More
    #          Federico Raimondo
    #
    # License: AGPL
    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    from julearn import run_cross_validation
    from julearn.utils import configure_logging


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /home/travis/virtualenv/python3.7.1/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
      return f(*args, **kwds)


.. GENERATED FROM PYTHON SOURCE LINES 26-27

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 27-29

.. code-block:: default

    configure_logging(level='INFO')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:08:00,657 - julearn - INFO - ===== Lib Versions =====
    2021-01-28 20:08:00,657 - julearn - INFO - numpy: 1.19.5
    2021-01-28 20:08:00,657 - julearn - INFO - scipy: 1.6.0
    2021-01-28 20:08:00,657 - julearn - INFO - sklearn: 0.24.1
    2021-01-28 20:08:00,657 - julearn - INFO - pandas: 1.2.1
    2021-01-28 20:08:00,657 - julearn - INFO - julearn: 0.2.5.dev19+g9c15c5f
    2021-01-28 20:08:00,657 - julearn - INFO - ========================


.. GENERATED FROM PYTHON SOURCE LINES 30-31

Load the diabetes data from sklearn as a pandas DataFrame.

.. GENERATED FROM PYTHON SOURCE LINES 31-33

.. code-block:: default

    features, target = load_diabetes(return_X_y=True, as_frame=True)


.. GENERATED FROM PYTHON SOURCE LINES 34-38

The dataset contains ten variables (age, sex, body mass index, average blood
pressure, and six blood serum measurements, s1-s6) for diabetes patients,
along with a quantitative measure of disease progression one year after
baseline, which is the target we are interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 38-42

.. code-block:: default

    print('Features: \n', features.head())
    print('Target: \n', target.describe())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Features:
             age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019908 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068330 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002864 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022692 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031991 -0.046641

    [5 rows x 10 columns]
    Target:
     count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64


.. GENERATED FROM PYTHON SOURCE LINES 43-45

Let's combine the features and the target in one dataframe and define X
and y.

.. GENERATED FROM PYTHON SOURCE LINES 45-50

.. code-block:: default

    data_diabetes = pd.concat([features, target], axis=1)

    X = ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    y = 'target'


.. GENERATED FROM PYTHON SOURCE LINES 51-52

Calculate the correlations between the features/variables and plot them as a
heat map.

.. GENERATED FROM PYTHON SOURCE LINES 52-59

.. code-block:: default

    corr = data_diabetes.corr()
    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set(font_scale=1.2)
    sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
                annot=True, fmt="0.1f")


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_001.png
    :alt: plot example regression
    :class: sphx-glr-single-img


.. GENERATED FROM PYTHON SOURCE LINES 60-61

Split the dataset into train and test sets.

.. GENERATED FROM PYTHON SOURCE LINES 61-63

.. code-block:: default

    train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)


.. GENERATED FROM PYTHON SOURCE LINES 64-66

Train a ridge regression model on the train dataset and use mean absolute
error for scoring.

.. GENERATED FROM PYTHON SOURCE LINES 66-71

.. code-block:: default

    scores, model = run_cross_validation(
        X=X, y=y, data=train_diabetes, preprocess_X='zscore',
        problem_type='regression', model='ridge', return_estimator='final',
        scoring='neg_mean_absolute_error')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    2021-01-28 20:08:01,081 - julearn - INFO - Using default CV
    2021-01-28 20:08:01,081 - julearn - INFO - ==== Input Data ====
    2021-01-28 20:08:01,081 - julearn - INFO - Using dataframe as input
    2021-01-28 20:08:01,081 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2021-01-28 20:08:01,081 - julearn - INFO - Target: target
    2021-01-28 20:08:01,082 - julearn - INFO - Expanded X: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2021-01-28 20:08:01,082 - julearn - INFO - Expanded Confounds: []
    2021-01-28 20:08:01,083 - julearn - INFO - ====================
    2021-01-28 20:08:01,083 - julearn - INFO -
    2021-01-28 20:08:01,083 - julearn - INFO - ====== Model ======
    2021-01-28 20:08:01,083 - julearn - INFO - Obtaining model by name: ridge
    2021-01-28 20:08:01,083 - julearn - INFO - ===================
    2021-01-28 20:08:01,083 - julearn - INFO -
    2021-01-28 20:08:01,083 - julearn - INFO - CV interpreted as RepeatedKFold with 5 repetitions of 5 folds


.. GENERATED FROM PYTHON SOURCE LINES 72-73

The scores dataframe has one row per CV split, with the fit time, score time
and test score of that split.

.. GENERATED FROM PYTHON SOURCE LINES 73-76

.. code-block:: default

    print(scores.head())


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

       fit_time  score_time  test_neg_mean_absolute_error  repeat  fold
    0  0.012554    0.007784                    -53.711774       0     0
    1  0.011844    0.008358                    -38.334731       0     1
    2  0.012480    0.007795                    -46.578596       0     2
    3  0.011738    0.007895                    -48.109208       0     3
    4  0.011963    0.007806                    -42.632269       0     4


.. GENERATED FROM PYTHON SOURCE LINES 77-78

Mean value of the mean absolute error across all CV splits. The scores are
negated MAE values, so we multiply by -1 to recover the MAE.

.. GENERATED FROM PYTHON SOURCE LINES 78-80

.. code-block:: default

    print(scores['test_neg_mean_absolute_error'].mean() * -1)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    45.94450832287054


.. GENERATED FROM PYTHON SOURCE LINES 81-82

Now we can get the MAE for each fold and repetition:

.. GENERATED FROM PYTHON SOURCE LINES 82-90

.. code-block:: default

    df_mae = scores.set_index(
        ['repeat', 'fold'])['test_neg_mean_absolute_error'].unstack() * -1
    df_mae.index.name = 'Repeats'
    df_mae.columns.name = 'K-fold splits'

    print(df_mae)


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    K-fold splits          0          1          2          3          4
    Repeats
    0              53.711774  38.334731  46.578596  48.109208  42.632269
    1              46.110360  49.273373  45.245755  48.425018  41.661824
    2              45.176249  50.034589  48.108842  39.433303  44.937370
    3              46.238680  39.120725  48.266876  46.853780  48.097210
    4              45.943724  43.651725  41.527743  47.429889  53.709096


.. GENERATED FROM PYTHON SOURCE LINES 91-92

Plot a heatmap of the mean absolute error (MAE) over all repeats and CV
splits.

.. GENERATED FROM PYTHON SOURCE LINES 92-96

.. code-block:: default

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(df_mae, cmap="YlGnBu")
    plt.title('Cross-validation MAE')


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_002.png
    :alt: Cross-validation MAE
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Text(0.5, 1.0, 'Cross-validation MAE')


.. GENERATED FROM PYTHON SOURCE LINES 97-98

Let's plot the feature importance using the coefficients of the trained
model.

.. GENERATED FROM PYTHON SOURCE LINES 98-111

.. code-block:: default

    features = pd.DataFrame({'Features': X, 'importance': model['ridge'].coef_})
    features.sort_values(by=['importance'], ascending=True, inplace=True)
    features['positive'] = features['importance'] > 0

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    features.set_index('Features', inplace=True)
    features.importance.plot(kind='barh',
                             color=features.positive.map
                             ({True: 'blue', False: 'red'}))
    ax.set(xlabel='Importance', title='Variable importance for Ridge Regression')


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_003.png
    :alt: Variable importance for Ridge Regression
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    [Text(0.5, 40.249999999999986, 'Importance'), Text(0.5, 1.0, 'Variable importance for Ridge Regression')]


.. GENERATED FROM PYTHON SOURCE LINES 112-114

Use the final model to make predictions on the test data and draw a scatter
plot of the true values against the predicted values.

.. GENERATED FROM PYTHON SOURCE LINES 114-132

.. code-block:: default

    y_true = test_diabetes[y]
    y_pred = model.predict(test_diabetes[X])
    mae = format(mean_absolute_error(y_true, y_pred), '.2f')
    corr = format(np.corrcoef(y_pred, y_true)[1, 0], '.2f')

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    plt.scatter(y_true, y_pred)
    plt.plot(y_true, y_true)
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    text = 'MAE: ' + str(mae) + ' CORR: ' + str(corr)
    ax.set(xlabel='True values', ylabel='Predicted values')
    plt.title('Actual vs Predicted')
    plt.text(xmax - 0.01 * xmax, ymax - 0.01 * ymax, text,
             verticalalignment='top', horizontalalignment='right', fontsize=12)
    plt.axis('scaled')


.. image:: /auto_examples/basic/images/sphx_glr_plot_example_regression_004.png
    :alt: Actual vs Predicted
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    (10.75, 324.25, 10.75, 324.25)


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 1.494 seconds)


.. _sphx_glr_download_auto_examples_basic_plot_example_regression.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download sphx-glr-download-python

     :download:`Download Python source code: plot_example_regression.py `


  .. container:: sphx-glr-download sphx-glr-download-jupyter

     :download:`Download Jupyter notebook: plot_example_regression.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
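
The cross-validation scores reported in this example are negative because the ``neg_mean_absolute_error`` scorer follows a higher-is-better convention: it reports the MAE with its sign flipped, which is why the example multiplies by -1 before printing or plotting. The snippet is a minimal standard-library sketch of that convention, not julearn's or scikit-learn's implementation. The helper names ``mean_absolute_error_py`` and ``neg_mae_score`` are hypothetical, and the targets and predictions are toy values; only the five fold scores of the first repeat are copied from the output shown in this example.

```python
from statistics import mean


def mean_absolute_error_py(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def neg_mae_score(y_true, y_pred):
    """Sign-flipped MAE, so that a larger score means a better model."""
    return -mean_absolute_error_py(y_true, y_pred)


# Toy targets and predictions (illustrative values, not from the dataset).
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 130.0, 230.0]

mae = mean_absolute_error_py(y_true, y_pred)
score = neg_mae_score(y_true, y_pred)

# Fold scores of the first repeat, copied from the printed scores table.
first_repeat_scores = [-53.711774, -38.334731, -46.578596,
                       -48.109208, -42.632269]

# Negating the mean score recovers the mean MAE of that repeat.
first_repeat_mae = -mean(first_repeat_scores)

print('MAE:', mae, 'score:', score)
print('Mean MAE of the first repeat:', round(first_repeat_mae, 4))
```

The same negation applies whether it is done per split, as in the repeats-by-folds table, or once on the overall mean.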