.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/00_starting/plot_example_regression.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_00_starting_plot_example_regression.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_00_starting_plot_example_regression.py:


Regression Analysis
===================

This example uses the ``diabetes`` data from ``sklearn.datasets`` and
performs a regression analysis using a Ridge Regression model.

.. GENERATED FROM PYTHON SOURCE LINES 9-24

.. code-block:: Python

    # Authors: Shammi More
    #          Federico Raimondo
    # License: AGPL

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    from julearn import run_cross_validation
    from julearn.utils import configure_logging

.. GENERATED FROM PYTHON SOURCE LINES 25-26

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 26-28

.. code-block:: Python

    configure_logging(level="INFO")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:53:55,695 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:53:55,696 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:53:55,696 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:53:55,696 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:53:55,696 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:53:55,696 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:53:55,696 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 29-30

Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 30-32

.. code-block:: Python

    features, target = load_diabetes(return_X_y=True, as_frame=True)

.. GENERATED FROM PYTHON SOURCE LINES 33-37

The dataset contains ten baseline variables for diabetes patients (age,
sex, body mass index, average blood pressure, and six blood serum
measurements, s1-s6), together with a quantitative measure of disease
progression one year after baseline, which is the target we are
interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 37-41

.. code-block:: Python

    print("Features: \n", features.head())
    print("Target: \n", target.describe())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Features:
             age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

    [5 rows x 10 columns]
    Target:
     count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 42-44

Let's combine features and target together in one dataframe and define X
and y.

.. GENERATED FROM PYTHON SOURCE LINES 44-49

.. code-block:: Python

    data_diabetes = pd.concat([features, target], axis=1)

    X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
    y = "target"

.. GENERATED FROM PYTHON SOURCE LINES 50-51

Calculate correlations between the features/variables and plot them as a
heat map.

.. GENERATED FROM PYTHON SOURCE LINES 51-63

.. code-block:: Python

    corr = data_diabetes.corr()

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set(font_scale=1.2)
    sns.heatmap(
        corr,
        xticklabels=corr.columns,
        yticklabels=corr.columns,
        annot=True,
        fmt="0.1f",
    )

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_001.png
   :alt: plot example regression
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_001.png
   :class: sphx-glr-single-img

.. GENERATED FROM PYTHON SOURCE LINES 64-65

Split the dataset into train and test sets.

.. GENERATED FROM PYTHON SOURCE LINES 65-67

.. code-block:: Python

    train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)

.. GENERATED FROM PYTHON SOURCE LINES 68-70

Train a ridge regression model on the training dataset and use mean
absolute error for scoring.

.. GENERATED FROM PYTHON SOURCE LINES 70-81

.. code-block:: Python

    scores, model = run_cross_validation(
        X=X,
        y=y,
        data=train_diabetes,
        preprocess="zscore",
        problem_type="regression",
        model="ridge",
        return_estimator="final",
        scoring="neg_mean_absolute_error",
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:53:55,963 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:53:55,963 - julearn - INFO - Using dataframe as input
    2026-01-16 10:53:55,963 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,963 - julearn - INFO - Target: target
    2026-01-16 10:53:55,963 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:53:55,963 - julearn - INFO - X_types:{}
    2026-01-16 10:53:55,964 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
    /home/runner/work/julearn/julearn/julearn/prepare.py:576: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
      warn_with_log(
    2026-01-16 10:53:55,964 - julearn - INFO - ====================
    2026-01-16 10:53:55,965 - julearn - INFO -
    2026-01-16 10:53:55,965 - julearn - INFO - Adding step zscore that applies to ColumnTypes
    2026-01-16 10:53:55,965 - julearn - INFO - Step added
    2026-01-16 10:53:55,965 - julearn - INFO - Adding step ridge that applies to ColumnTypes
    2026-01-16 10:53:55,965 - julearn - INFO - Step added
    2026-01-16 10:53:55,966 - julearn - INFO - = Model Parameters =
    2026-01-16 10:53:55,966 - julearn - INFO - ====================
    2026-01-16 10:53:55,966 - julearn - INFO -
    2026-01-16 10:53:55,966 - julearn - INFO - = Data Information =
    2026-01-16 10:53:55,966 - julearn - INFO - Problem type: regression
    2026-01-16 10:53:55,966 - julearn - INFO - Number of samples: 309
    2026-01-16 10:53:55,966 - julearn - INFO - Number of features: 10
    2026-01-16 10:53:55,966 - julearn - INFO - ====================
    2026-01-16 10:53:55,967 - julearn - INFO -
    2026-01-16 10:53:55,967 - julearn - INFO - Target type: float64
    2026-01-16 10:53:55,967 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)

.. GENERATED FROM PYTHON SOURCE LINES 82-83

The scores dataframe has all the values for each CV split.

.. GENERATED FROM PYTHON SOURCE LINES 83-86

.. code-block:: Python

    scores.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       fit_time  score_time  test_score  n_train  n_test  repeat  fold                          cv_mdsum
    0  0.004708    0.002466  -48.783874      247      62       0     0  b10eef89b4192178d482d7a1587a248a
    1  0.004704    0.002454  -47.573568      247      62       0     1  b10eef89b4192178d482d7a1587a248a
    2  0.004622    0.002446  -37.617474      247      62       0     2  b10eef89b4192178d482d7a1587a248a
    3  0.004645    0.002494  -47.686852      247      62       0     3  b10eef89b4192178d482d7a1587a248a
    4  0.004605    0.002450  -45.558655      248      61       0     4  b10eef89b4192178d482d7a1587a248a


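For readers curious what this setup corresponds to in plain scikit-learn, the z-score preprocessing plus ridge pipeline can be approximated with a ``Pipeline`` and ``cross_validate``. This is a minimal sketch for illustration only: julearn adds column-type handling, logging, and the final-model refit on top, so its internals differ.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

features, target = load_diabetes(return_X_y=True, as_frame=True)

# z-score the features, then fit a ridge regression (default alpha)
pipe = make_pipeline(StandardScaler(), Ridge())

# 5-fold CV, mirroring julearn's default outer scheme in this example
cv = KFold(n_splits=5)
res = cross_validate(
    pipe, features, target, cv=cv, scoring="neg_mean_absolute_error"
)

# scikit-learn scorers are "higher is better", so MAE comes back negated
print(res["test_score"])          # one negative score per fold
print(-res["test_score"].mean())  # mean MAE across folds
```

This also explains why ``test_score`` is negative in the table above: the ``neg_mean_absolute_error`` scorer negates the MAE so that larger is always better.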
.. GENERATED FROM PYTHON SOURCE LINES 87-88

Mean value of the mean absolute error across CV splits:

.. GENERATED FROM PYTHON SOURCE LINES 88-90

.. code-block:: Python

    print(scores["test_score"].mean() * -1)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    45.44408444147062

.. GENERATED FROM PYTHON SOURCE LINES 91-92

Now we can get the MAE for each fold and repetition:

.. GENERATED FROM PYTHON SOURCE LINES 92-99

.. code-block:: Python

    df_mae = scores.set_index(["repeat", "fold"])["test_score"].unstack() * -1
    df_mae.index.name = "Repeats"
    df_mae.columns.name = "K-fold splits"
    df_mae

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    K-fold splits          0          1          2          3          4
    Repeats
    0              48.783874  47.573568  37.617474  47.686852  45.558655


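The long-to-wide reshape used above is easiest to see on a toy frame. The numbers below are made up; only the ``repeat``/``fold``/``test_score`` column names match the example.

```python
import pandas as pd

# toy long-format CV results: 2 repeats x 3 folds (hypothetical scores)
scores = pd.DataFrame({
    "repeat": [0, 0, 0, 1, 1, 1],
    "fold":   [0, 1, 2, 0, 1, 2],
    "test_score": [-40.0, -45.0, -50.0, -42.0, -44.0, -46.0],
})

# set_index + unstack pivots the innermost index level (fold) into
# columns; multiplying by -1 flips the negated MAE back to positive
df_mae = scores.set_index(["repeat", "fold"])["test_score"].unstack() * -1
df_mae.index.name = "Repeats"
df_mae.columns.name = "K-fold splits"
print(df_mae)  # a 2x3 table of positive MAE values
```

The same pattern scales to any number of repeats and folds, which is why the heatmap below works unchanged for repeated CV schemes.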
.. GENERATED FROM PYTHON SOURCE LINES 100-101

Plot a heatmap of the mean absolute error (MAE) over all repeats and CV
splits.

.. GENERATED FROM PYTHON SOURCE LINES 101-105

.. code-block:: Python

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(df_mae, cmap="YlGnBu")
    plt.title("Cross-validation MAE")

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_002.png
   :alt: Cross-validation MAE
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Cross-validation MAE')

.. GENERATED FROM PYTHON SOURCE LINES 106-107

Let's plot the feature importance using the coefficients of the trained
model.

.. GENERATED FROM PYTHON SOURCE LINES 107-118

.. code-block:: Python

    features = pd.DataFrame({"Features": X, "importance": model["ridge"].coef_})
    features.sort_values(by=["importance"], ascending=True, inplace=True)
    features["positive"] = features["importance"] > 0

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    features.set_index("Features", inplace=True)
    features.importance.plot(
        kind="barh", color=features.positive.map({True: "blue", False: "red"})
    )
    ax.set(xlabel="Importance", title="Variable importance for Ridge Regression")

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_003.png
   :alt: Variable importance for Ridge Regression
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_003.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    [Text(0.5, 40.249999999999986, 'Importance'), Text(0.5, 1.0, 'Variable importance for Ridge Regression')]

.. GENERATED FROM PYTHON SOURCE LINES 119-121

Use the final model to make predictions on the test data and plot a
scatterplot of true values vs. predicted values.

.. GENERATED FROM PYTHON SOURCE LINES 121-145

.. code-block:: Python

    y_true = test_diabetes[y]
    y_pred = model.predict(test_diabetes[X])
    mae = format(mean_absolute_error(y_true, y_pred), ".2f")
    corr = format(np.corrcoef(y_pred, y_true)[1, 0], ".2f")

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    plt.scatter(y_true, y_pred)
    plt.plot(y_true, y_true)
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    text = "MAE: " + str(mae) + " CORR: " + str(corr)
    ax.set(xlabel="True values", ylabel="Predicted values")
    plt.title("Actual vs Predicted")
    plt.text(
        xmax - 0.01 * xmax,
        ymax - 0.01 * ymax,
        text,
        verticalalignment="top",
        horizontalalignment="right",
        fontsize=12,
    )
    plt.axis("scaled")

.. image-sg:: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_004.png
   :alt: Actual vs Predicted
   :srcset: /auto_examples/00_starting/images/sphx_glr_plot_example_regression_004.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (9.649999999999999, 347.35, 9.649999999999999, 347.35)

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.621 seconds)


.. _sphx_glr_download_auto_examples_00_starting_plot_example_regression.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_example_regression.ipynb <plot_example_regression.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_example_regression.py <plot_example_regression.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_example_regression.zip <plot_example_regression.zip>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
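The MAE and correlation annotations in the final plot can be sanity-checked on toy arrays. The values below are hypothetical, chosen so the error is easy to compute by hand.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# hypothetical true and predicted values, each off by exactly 10
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# mean of |y_true - y_pred| = mean of [10, 10, 10, 10] = 10.0
mae = mean_absolute_error(y_true, y_pred)

# Pearson correlation between predictions and true values
corr = np.corrcoef(y_pred, y_true)[1, 0]

print(f"MAE: {mae:.2f} CORR: {corr:.2f}")
```

A high correlation with a large MAE would indicate a systematic bias (right ranking, wrong scale), which is why the example reports both numbers on the scatterplot.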