Regression Analysis#

This example uses the diabetes data from sklearn datasets and performs a regression analysis using a Ridge Regression model.

# Authors: Shammi More <s.more@fz-juelich.de>
#          Federico Raimondo <f.raimondo@fz-juelich.de>
# License: AGPL

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

from julearn import run_cross_validation
from julearn.utils import configure_logging

Set the logging level to info to see extra information.

configure_logging(level="INFO")

2024-05-16 08:52:24,995 - julearn - INFO - ===== Lib Versions =====
2024-05-16 08:52:24,995 - julearn - INFO - numpy: 1.26.4
2024-05-16 08:52:24,995 - julearn - INFO - scipy: 1.13.0
2024-05-16 08:52:24,995 - julearn - INFO - sklearn: 1.4.2
2024-05-16 08:52:24,995 - julearn - INFO - pandas: 2.1.4
2024-05-16 08:52:24,996 - julearn - INFO - julearn: 0.3.3
2024-05-16 08:52:24,996 - julearn - INFO - ========================

Load the diabetes data from sklearn as a pandas.DataFrame.

features, target = load_diabetes(return_X_y=True, as_frame=True)

Dataset contains ten variables age, sex, body mass index, average blood pressure, and six blood serum measurements (s1-s6) diabetes patients and a quantitative measure of disease progression one year after baseline which is the target we are interested in predicting.

print("Features: \n", features.head())
print("Target: \n", target.describe())

Features:
         age       sex       bmi  ...        s4        s5        s6
0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

[5 rows x 10 columns]
Target:
 count    442.000000
mean     152.133484
std       77.093005
min       25.000000
25%       87.000000
50%      140.500000
75%      211.500000
max      346.000000
Name: target, dtype: float64

Let’s combine features and target together in one dataframe and define X and y

data_diabetes = pd.concat([features, target], axis=1)

X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
y = "target"

Calculate correlations between the features/variables and plot it as heat map.

corr = data_diabetes.corr()

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
sns.set(font_scale=1.2)
sns.heatmap(
    corr,
    xticklabels=corr.columns,
    yticklabels=corr.columns,
    annot=True,
    fmt="0.1f",
)

<Axes: >

Split the dataset into train and test.

train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)

Train a ridge regression model on train dataset and use mean absolute error for scoring.

scores, model = run_cross_validation(
    X=X,
    y=y,
    data=train_diabetes,
    preprocess="zscore",
    problem_type="regression",
    model="ridge",
    return_estimator="final",
    scoring="neg_mean_absolute_error",
)

2024-05-16 08:52:25,166 - julearn - INFO - ==== Input Data ====
2024-05-16 08:52:25,166 - julearn - INFO - Using dataframe as input
2024-05-16 08:52:25,167 - julearn - INFO -      Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-16 08:52:25,167 - julearn - INFO -      Target: target
2024-05-16 08:52:25,167 - julearn - INFO -      Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
2024-05-16 08:52:25,167 - julearn - INFO -      X_types:{}
2024-05-16 08:52:25,167 - julearn - WARNING - The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:505: RuntimeWarning: The following columns are not defined in X_types: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']. They will be treated as continuous.
  warn_with_log(
2024-05-16 08:52:25,168 - julearn - INFO - ====================
2024-05-16 08:52:25,168 - julearn - INFO -
2024-05-16 08:52:25,168 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:25,168 - julearn - INFO - Step added
2024-05-16 08:52:25,168 - julearn - INFO - Adding step ridge that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-16 08:52:25,168 - julearn - INFO - Step added
2024-05-16 08:52:25,169 - julearn - INFO - = Model Parameters =
2024-05-16 08:52:25,169 - julearn - INFO - ====================
2024-05-16 08:52:25,169 - julearn - INFO -
2024-05-16 08:52:25,169 - julearn - INFO - = Data Information =
2024-05-16 08:52:25,169 - julearn - INFO -      Problem type: regression
2024-05-16 08:52:25,169 - julearn - INFO -      Number of samples: 309
2024-05-16 08:52:25,169 - julearn - INFO -      Number of features: 10
2024-05-16 08:52:25,169 - julearn - INFO - ====================
2024-05-16 08:52:25,169 - julearn - INFO -
2024-05-16 08:52:25,169 - julearn - INFO -      Target type: float64
2024-05-16 08:52:25,169 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-16 08:52:25,207 - julearn - INFO - Fitting final model

The scores dataframe has all the values for each CV split.

scores.head()

	fit_time	score_time	test_score	n_train	n_test	fold	cv_mdsum
0	0.005022	0.002378	-48.783874	247	62	0	b10eef89b4192178d482d7a1587a248a
1	0.004470	0.002295	-47.573568	247	62	1	b10eef89b4192178d482d7a1587a248a
2	0.004492	0.002288	-37.617474	247	62	2	b10eef89b4192178d482d7a1587a248a
3	0.004402	0.002308	-47.686852	247	62	3	b10eef89b4192178d482d7a1587a248a
4	0.004388	0.002310	-45.558655	248	61	4	b10eef89b4192178d482d7a1587a248a

Mean value of mean absolute error across CV

print(scores["test_score"].mean() * -1)

45.444084441470615

Now we can get the MAE fold and repetition:

df_mae = scores.set_index(["repeat", "fold"])["test_score"].unstack() * -1
df_mae.index.name = "Repeats"
df_mae.columns.name = "K-fold splits"

df_mae

K-fold splits	0	1	2	3	4
Repeats
0	48.783874	47.573568	37.617474	47.686852	45.558655

Plot heatmap of mean absolute error (MAE) over all repeats and CV splits.

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
sns.heatmap(df_mae, cmap="YlGnBu")
plt.title("Cross-validation MAE")

Text(0.5, 1.0, 'Cross-validation MAE')

Let’s plot the feature importance using the coefficients of the trained model.

features = pd.DataFrame({"Features": X, "importance": model["ridge"].coef_})
features.sort_values(by=["importance"], ascending=True, inplace=True)
features["positive"] = features["importance"] > 0

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
features.set_index("Features", inplace=True)
features.importance.plot(
    kind="barh", color=features.positive.map({True: "blue", False: "red"})
)
ax.set(xlabel="Importance", title="Variable importance for Ridge Regression")

Variable importance for Ridge Regression

[Text(0.5, 40.249999999999986, 'Importance'), Text(0.5, 1.0, 'Variable importance for Ridge Regression')]

Use the final model to make predictions on test data and plot scatterplot of true values vs predicted values.

y_true = test_diabetes[y]
y_pred = model.predict(test_diabetes[X])
mae = format(mean_absolute_error(y_true, y_pred), ".2f")
corr = format(np.corrcoef(y_pred, y_true)[1, 0], ".2f")

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
sns.set_style("darkgrid")
plt.scatter(y_true, y_pred)
plt.plot(y_true, y_true)
xmin, xmax = ax.get_xlim()
ymin, ymax = ax.get_ylim()
text = "MAE: " + str(mae) + "   CORR: " + str(corr)
ax.set(xlabel="True values", ylabel="Predicted values")
plt.title("Actual vs Predicted")
plt.text(
    xmax - 0.01 * xmax,
    ymax - 0.01 * ymax,
    text,
    verticalalignment="top",
    horizontalalignment="right",
    fontsize=12,
)
plt.axis("scaled")

(9.649999999999999, 347.35, 9.649999999999999, 347.35)

Total running time of the script: (0 minutes 0.550 seconds)

Gallery generated by Sphinx-Gallery