.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_starting/plot_example_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_00_starting_plot_example_regression.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_starting_plot_example_regression.py:


Regression Analysis
===================

This example uses the ``diabetes`` data from ``sklearn.datasets`` and
performs a regression analysis using a Ridge Regression model.

.. GENERATED FROM PYTHON SOURCE LINES 9-24

.. code-block:: Python

    # Authors: Shammi More
    #          Federico Raimondo
    # License: AGPL

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    from julearn import run_cross_validation
    from julearn.utils import configure_logging

.. GENERATED FROM PYTHON SOURCE LINES 25-26

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 26-28

.. code-block:: Python

    configure_logging(level="INFO")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:53:55,695 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:53:55,696 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:53:55,696 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:53:55,696 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:53:55,696 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:53:55,696 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:53:55,696 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 29-30

Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 30-32

.. code-block:: Python

    features, target = load_diabetes(return_X_y=True, as_frame=True)

.. GENERATED FROM PYTHON SOURCE LINES 33-37

The dataset contains ten baseline variables for diabetes patients (age,
sex, body mass index, average blood pressure, and six blood serum
measurements, s1-s6), together with a quantitative measure of disease
progression one year after baseline, which is the target we are
interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 37-41

.. code-block:: Python

    print("Features: \n", features.head())
    print("Target: \n", target.describe())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Features:
             age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

    [5 rows x 10 columns]
    Target:
     count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 42-44

Let's combine features and target together in one dataframe and define X
and y.

.. GENERATED FROM PYTHON SOURCE LINES 44-49

.. code-block:: Python

    data_diabetes = pd.concat([features, target], axis=1)

    X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
    y = "target"

.. GENERATED FROM PYTHON SOURCE LINES 50-51

Calculate correlations between the features/variables and plot them as a
heat map.

.. GENERATED FROM PYTHON SOURCE LINES 51-63

.. code-block:: Python

    corr = data_diabetes.corr()

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set(font_scale=1.2)
    sns.heatmap(
        corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns,
        annot=True,
        fmt="0.1f",
    )

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_001.png
   :alt: plot example regression
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 64-65

Split the dataset into train and test sets.

.. GENERATED FROM PYTHON SOURCE LINES 65-67

.. code-block:: Python

    train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)

.. GENERATED FROM PYTHON SOURCE LINES 68-70

Train a ridge regression model on the training dataset and use mean
absolute error for scoring.

.. GENERATED FROM PYTHON SOURCE LINES 70-81

.. code-block:: Python

    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=train_diabetes,
        preprocess="zscore",
        problem_type="regression",
        model="ridge",
        return_estimator="final",
        scoring="neg_mean_absolute_error",
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:53:55,963 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:55,963 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:55,963 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,963 - julearn - INFO - Target: target
    2026-01-16 10:53:55,963 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,963 - julearn - INFO - X_types:{}
    2026-01-16 10:53:55,964 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:55,964 - julearn - INFO - ====================
    2026-01-16 10:53:55,965 - julearn - INFO -
    2026-01-16 10:53:55,965 - julearn - INFO - Adding step zscore that applies to ColumnTypes
    2026-01-16 10:53:55,965 - julearn - INFO - Step added
    2026-01-16 10:53:55,965 - julearn - INFO - Adding step ridge that applies to ColumnTypes
    2026-01-16 10:53:55,965 - julearn - INFO - Step added
    2026-01-16 10:53:55,966 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:55,966 - julearn - INFO - ====================
    2026-01-16 10:53:55,966 - julearn - INFO -
    2026-01-16 10:53:55,966 - julearn - INFO - = Data Information =
    2026-01-16 10:53:55,966 - julearn - INFO - Problem type: regression
    2026-01-16 10:53:55,966 - julearn - INFO - Number of samples: 309
    2026-01-16 10:53:55,966 - julearn - INFO - Number of features: 10
    2026-01-16 10:53:55,966 - julearn - INFO - ====================
    2026-01-16 10:53:55,967 - julearn - INFO -
    2026-01-16 10:53:55,967 - julearn - INFO - Target type: float64
    2026-01-16 10:53:55,967 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)

.. GENERATED FROM PYTHON SOURCE LINES 82-83

The scores dataframe has all the values for each CV split.

.. GENERATED FROM PYTHON SOURCE LINES 83-86

.. code-block:: Python

    scores.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       fit_time  score_time  test_score  n_train  n_test  repeat  fold                          cv_mdsum
    0  0.004708    0.002466  -48.783874      247      62       0     0  b10eef89b4192178d482d7a1587a248a
    1  0.004704    0.002454  -47.573568      247      62       0     1  b10eef89b4192178d482d7a1587a248a
    2  0.004622    0.002446  -37.617474      247      62       0     2  b10eef89b4192178d482d7a1587a248a
    3  0.004645    0.002494  -47.686852      247      62       0     3  b10eef89b4192178d482d7a1587a248a
    4  0.004605    0.002450  -45.558655      248      61       0     4  b10eef89b4192178d482d7a1587a248a


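For readers curious what this setup corresponds to in plain scikit-learn, the z-score preprocessing plus ridge pipeline can be approximated with a ``Pipeline`` and ``cross_validate``. This is a minimal sketch for illustration only: julearn adds column-type handling, logging, and the final-model refit on top, so its internals differ.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, target = load_diabetes(return_X_y=True, as_frame=True)

# z-score the features, then fit a ridge regression (default alpha)
pipe = make_pipeline(StandardScaler(), Ridge())

# 5-fold CV, mirroring julearn's default outer scheme in this example
cv = KFold(n_splits=5)
res = cross_validate(
    pipe, features, target, cv=cv, scoring="neg_mean_absolute_error"
)

# scikit-learn scorers are "higher is better", so MAE comes back negated
print(res["test_score"])          # one negative score per fold
print(-res["test_score"].mean())  # mean MAE across folds
```

This also explains why ``test_score`` is negative in the table above: the ``neg_mean_absolute_error`` scorer negates the MAE so that larger is always better.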
.. GENERATED FROM PYTHON SOURCE LINES 87-88

Mean value of the mean absolute error across CV splits:

.. GENERATED FROM PYTHON SOURCE LINES 88-90

.. code-block:: Python

    print(scores["test_score"].mean() * -1)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    45.44408444147062

.. GENERATED FROM PYTHON SOURCE LINES 91-92

Now we can get the MAE for each fold and repetition:

.. GENERATED FROM PYTHON SOURCE LINES 92-99

.. code-block:: Python

    df_mae = scores.set_index(["repeat", "fold"])["test_score"].unstack() * -1
    df_mae.index.name = "Repeats"
    df_mae.columns.name = "K-fold splits"
    df_mae

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    K-fold splits          0          1          2          3          4
    Repeats
    0              48.783874  47.573568  37.617474  47.686852  45.558655


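The long-to-wide reshape used above is easiest to see on a toy frame. The numbers below are made up; only the ``repeat``/``fold``/``test_score`` column names match the example.

```python
import pandas as pd

# toy long-format CV results: 2 repeats x 3 folds (hypothetical scores)
scores = pd.DataFrame({
    "repeat": [0, 0, 0, 1, 1, 1],
    "fold":   [0, 1, 2, 0, 1, 2],
    "test_score": [-40.0, -45.0, -50.0, -42.0, -44.0, -46.0],
})

# set_index + unstack pivots the innermost index level (fold) into
# columns; multiplying by -1 flips the negated MAE back to positive
df_mae = scores.set_index(["repeat", "fold"])["test_score"].unstack() * -1
df_mae.index.name = "Repeats"
df_mae.columns.name = "K-fold splits"
print(df_mae)  # a 2x3 table of positive MAE values
```

The same pattern scales to any number of repeats and folds, which is why the heatmap below works unchanged for repeated CV schemes.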
.. GENERATED FROM PYTHON SOURCE LINES 100-101

Plot a heatmap of the mean absolute error (MAE) over all repeats and CV
splits.

.. GENERATED FROM PYTHON SOURCE LINES 101-105

.. code-block:: Python

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(df_mae, cmap="YlGnBu")
    plt.title("Cross-validation MAE")

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_002.png
   :alt: Cross-validation MAE
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Cross-validation MAE')

.. GENERATED FROM PYTHON SOURCE LINES 106-107

Let's plot the feature importance using the coefficients of the trained
model.

.. GENERATED FROM PYTHON SOURCE LINES 107-118

.. code-block:: Python

    features = pd.DataFrame({"Features": X, "importance": model["ridge"].coef_})
    features.sort_values(by=["importance"], ascending=True, inplace=True)
    features["positive"] = features["importance"] > 0

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    features.set_index("Features", inplace=True)
    features.importance.plot(
        kind="barh", color=features.positive.map({True: "blue", False: "red"})
    )
    ax.set(xlabel="Importance", title="Variable importance for Ridge Regression")

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_003.png
   :alt: Variable importance for Ridge Regression
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_003.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [Text(0.5, 40.249999999999986, 'Importance'), Text(0.5, 1.0, 'Variable importance for Ridge Regression')]

.. GENERATED FROM PYTHON SOURCE LINES 119-121

Use the final model to make predictions on the test data and plot a
scatterplot of true values vs. predicted values.

.. GENERATED FROM PYTHON SOURCE LINES 121-145

.. code-block:: Python

    y_true = test_diabetes[y]
    y_pred = model.predict(test_diabetes[X])
    mae = format(mean_absolute_error(y_true, y_pred), ".2f")
    corr = format(np.corrcoef(y_pred, y_true)[1, 0], ".2f")

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    plt.scatter(y_true, y_pred)
    plt.plot(y_true, y_true)
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    text = "MAE: " + str(mae) + " CORR: " + str(corr)
    ax.set(xlabel="True values", ylabel="Predicted values")
    plt.title("Actual vs Predicted")
    plt.text(
        xmax - 0.01 * xmax,
        ymax - 0.01 * ymax,
        text,
        verticalalignment="top",
        horizontalalignment="right",
        fontsize=12,
    )
    plt.axis("scaled")

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_004.png
   :alt: Actual vs Predicted
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_004.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (9.649999999999999, 347.35, 9.649999999999999, 347.35)

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.621 seconds)


.. _sphx_glr_download_auto_examples_00_starting_plot_example_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_example_regression.ipynb <plot_example_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_example_regression.py <plot_example_regression.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_example_regression.zip <plot_example_regression.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
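The MAE and correlation annotations in the final plot can be sanity-checked on toy arrays. The values below are hypothetical, chosen so the error is easy to compute by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# hypothetical true and predicted values, each off by exactly 10
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# mean of |y_true - y_pred| = mean of [10, 10, 10, 10] = 10.0
mae = mean_absolute_error(y_true, y_pred)

# Pearson correlation between predictions and true values
corr = np.corrcoef(y_pred, y_true)[1, 0]

print(f"MAE: {mae:.2f} CORR: {corr:.2f}")
```

A high correlation with a large MAE would indicate a systematic bias (right ranking, wrong scale), which is why the example reports both numbers on the scatterplot.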