.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/03_complex_models/run_example_pca_featsets.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_03_complex_models_run_example_pca_featsets.py:

Regression Analysis
===================

This example uses the ``diabetes`` data from ``sklearn.datasets`` and performs
a regression analysis using a Ridge regression model. We'll use the
``julearn.PipelineCreator`` to create a pipeline with two different PCA steps,
each computed on a different subset of features, to reduce the dimensionality
of the data.

.. GENERATED FROM PYTHON SOURCE LINES 12-31

.. code-block:: Python

    # Authors: Georgios Antonopoulos
    #          Kaustubh R. Patil
    #          Shammi More
    #          Federico Raimondo
    # License: AGPL

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    from julearn import run_cross_validation
    from julearn.utils import configure_logging
    from julearn.pipeline import PipelineCreator
    from julearn.inspect import preprocess

.. GENERATED FROM PYTHON SOURCE LINES 32-33

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 33-35

.. code-block:: Python

    configure_logging(level="INFO")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:54:19,145 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:54:19,145 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:54:19,145 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:54:19,145 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:54:19,145 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:54:19,145 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:54:19,145 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 36-37

Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 37-39

.. code-block:: Python

    features, target = load_diabetes(return_X_y=True, as_frame=True)

.. GENERATED FROM PYTHON SOURCE LINES 40-44

The dataset contains ten baseline variables for diabetes patients: age, sex,
body mass index, average blood pressure, and six blood serum measurements
(s1-s6), as well as a quantitative measure of disease progression one year
after baseline, which is the target we are interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 44-48

.. code-block:: Python

    print("Features: \n", features.head())
    print("Target: \n", target.describe())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Features:
             age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

    [5 rows x 10 columns]
    Target:
    count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 49-51

Let's combine features and target in one dataframe and define ``X`` and ``y``.

.. GENERATED FROM PYTHON SOURCE LINES 51-56

.. code-block:: Python

    data_diabetes = pd.concat([features, target], axis=1)

    X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
    y = "target"

.. GENERATED FROM PYTHON SOURCE LINES 57-59

Assign types to the features and create feature groups for PCA.
We will keep one component per PCA group.

.. GENERATED FROM PYTHON SOURCE LINES 59-65

.. code-block:: Python

    X_types = {
        "pca1": ["age", "bmi", "bp"],
        "pca2": ["s1", "s2", "s3", "s4", "s5", "s6"],
        "categorical": ["sex"],
    }

.. GENERATED FROM PYTHON SOURCE LINES 66-70

Create a pipeline to process the data and then fit a model. We must specify
how each ``X_type`` will be used. For example, if in the last step we do not
specify ``apply_to=["continuous", "categorical"]``, then the pipeline will not
know what to do with the categorical features.

.. GENERATED FROM PYTHON SOURCE LINES 70-75

.. code-block:: Python

    creator = PipelineCreator(problem_type="regression")
    creator.add("pca", apply_to="pca1", n_components=1, name="pca_feats1")
    creator.add("pca", apply_to="pca2", n_components=1, name="pca_feats2")
    creator.add("ridge", apply_to=["continuous", "categorical"])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:54:19,160 - julearn - INFO - Adding step pca_feats1 that applies to ColumnTypes
    2026-01-16 10:54:19,160 - julearn - INFO - Setting hyperparameter n_components = 1
    2026-01-16 10:54:19,160 - julearn - INFO - Step added
    2026-01-16 10:54:19,160 - julearn - INFO - Adding step pca_feats2 that applies to ColumnTypes
    2026-01-16 10:54:19,161 - julearn - INFO - Setting hyperparameter n_components = 1
    2026-01-16 10:54:19,161 - julearn - INFO - Step added
    2026-01-16 10:54:19,161 - julearn - INFO - Adding step ridge that applies to ColumnTypes
    2026-01-16 10:54:19,161 - julearn - INFO - Step added

.. GENERATED FROM PYTHON SOURCE LINES 76-77

Split the dataset into train and test.

.. GENERATED FROM PYTHON SOURCE LINES 77-79

.. code-block:: Python

    train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)

.. GENERATED FROM PYTHON SOURCE LINES 80-82

Run cross-validation with the Ridge regression pipeline on the train dataset,
using R² (``scoring="r2"``) to score each fold.

.. GENERATED FROM PYTHON SOURCE LINES 82-92

.. code-block:: Python

    scores, model = run_cross_validation(
        X=X,
        y=y,
        X_types=X_types,
        data=train_diabetes,
        model=creator,
        scoring="r2",
        return_estimator="final",
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:54:19,162 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:54:19,162 - julearn - INFO - Using dataframe as input
    2026-01-16 10:54:19,162 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:54:19,162 - julearn - INFO - Target: target
    2026-01-16 10:54:19,163 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:54:19,163 - julearn - INFO - X_types:{'pca1': ['age', 'bmi', 'bp'], 'pca2': ['s1', 's2', 's3', 's4', 's5', 's6'], 'categorical': ['sex']}
    2026-01-16 10:54:19,164 - julearn - INFO - ====================
    2026-01-16 10:54:19,164 - julearn - INFO -
    2026-01-16 10:54:19,165 - julearn - INFO - = Model Parameters =
    2026-01-16 10:54:19,165 - julearn - INFO - ====================
    2026-01-16 10:54:19,165 - julearn - INFO -
    2026-01-16 10:54:19,165 - julearn - INFO - = Data Information =
    2026-01-16 10:54:19,165 - julearn - INFO - Problem type: regression
    2026-01-16 10:54:19,166 - julearn - INFO - Number of samples: 309
    2026-01-16 10:54:19,166 - julearn - INFO - Number of features: 10
    2026-01-16 10:54:19,166 - julearn - INFO - ====================
    2026-01-16 10:54:19,166 - julearn - INFO -
    2026-01-16 10:54:19,166 - julearn - INFO - Target type: float64
    2026-01-16 10:54:19,166 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)

.. GENERATED FROM PYTHON SOURCE LINES 93-94

The ``scores`` dataframe has all the values for each CV split.

.. GENERATED FROM PYTHON SOURCE LINES 94-96

.. code-block:: Python

    print(scores.head())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       fit_time  score_time  ...  fold                          cv_mdsum
    0  0.013129    0.006169  ...     0  b10eef89b4192178d482d7a1587a248a
    1  0.013075    0.006130  ...     1  b10eef89b4192178d482d7a1587a248a
    2  0.060779    0.006267  ...     2  b10eef89b4192178d482d7a1587a248a
    3  0.013103    0.006146  ...     3  b10eef89b4192178d482d7a1587a248a
    4  0.013021    0.006149  ...     4  b10eef89b4192178d482d7a1587a248a

    [5 rows x 8 columns]

.. GENERATED FROM PYTHON SOURCE LINES 97-98

Mean test score (R²) across the CV splits.

.. GENERATED FROM PYTHON SOURCE LINES 98-100

.. code-block:: Python

    print(scores["test_score"].mean())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.31079767436789235

.. GENERATED FROM PYTHON SOURCE LINES 101-104

Let's see how the data looks after preprocessing. We will process the data
until the first PCA step. We should get the first PCA component for
["age", "bmi", "bp"] and leave the other features untouched.

.. GENERATED FROM PYTHON SOURCE LINES 104-108

.. code-block:: Python

    data_processed1 = preprocess(model, X, data=train_diabetes, until="pca_feats1")
    print("Data after preprocessing until PCA step 1")
    data_processed1.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Data after preprocessing until PCA step 1

.. code-block:: none

         pca_feats1__pca0       sex        s1        s2        s3        s4        s5        s6
    161          0.063175  0.050680  0.133274  0.131461 -0.039719  0.108111  0.075741  0.085907
    140          0.054779  0.050680 -0.030464 -0.001314 -0.043401 -0.002592 -0.033246  0.015491
    145          0.098172 -0.044642 -0.033216 -0.032629  0.011824 -0.039493 -0.015999 -0.050783
    9           -0.032289 -0.044642 -0.012577 -0.034508 -0.024993 -0.002592  0.067737 -0.013504
    315         -0.045025 -0.044642  0.031454  0.020607  0.056003 -0.039493 -0.010903 -0.001078
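The per-group reduction seen above can be sketched with plain scikit-learn (a
minimal illustration only, not julearn's internal implementation): a
``ColumnTransformer`` applies ``PCA(n_components=1)`` to one group of columns
and passes the remaining columns through unchanged.

```python
# Minimal sketch (plain scikit-learn, for illustration): apply PCA to one
# feature group and pass the remaining columns through, mirroring what the
# "pca_feats1" step does to ["age", "bmi", "bp"].
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

features, _ = load_diabetes(return_X_y=True, as_frame=True)

grouped_pca = ColumnTransformer(
    transformers=[("pca_feats1", PCA(n_components=1), ["age", "bmi", "bp"])],
    remainder="passthrough",  # leave the other 7 columns untouched
)
transformed = grouped_pca.fit_transform(features)

# 3 columns collapsed into 1 principal component + 7 passthrough columns
print(transformed.shape)  # (442, 8)
```

Note that julearn additionally keeps track of column names and types; this
sketch only reproduces the shape of the transformation.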
.. GENERATED FROM PYTHON SOURCE LINES 109-111

We will process the data until the second PCA step. We should now also get
one PCA component for ["s1", "s2", "s3", "s4", "s5", "s6"].

.. GENERATED FROM PYTHON SOURCE LINES 111-115

.. code-block:: Python

    data_processed2 = preprocess(model, X, data=train_diabetes, until="pca_feats2")
    print("Data after preprocessing until PCA step 2")
    data_processed2.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Data after preprocessing until PCA step 2

.. code-block:: none

         pca_feats2__pca0  pca_feats1__pca0       sex
    161          0.234716          0.063175  0.050680
    140         -0.012141          0.054779  0.050680
    145         -0.078784          0.098172 -0.044642
    9            0.006290         -0.032289 -0.044642
    315         -0.026190         -0.045025 -0.044642
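When reducing a feature group to a single component, it can be useful to check
how much variance that component actually retains. A standalone PCA on the
serum-measurement group shows this (again a plain scikit-learn sketch for
illustration; it is fit on the full dataset here rather than the train split,
and the exact ratio is not reported in this example):

```python
# Minimal sketch (plain scikit-learn): how much variance does one principal
# component retain for the serum-measurement group ["s1", ..., "s6"]?
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

features, _ = load_diabetes(return_X_y=True, as_frame=True)
serum = features[["s1", "s2", "s3", "s4", "s5", "s6"]]

pca = PCA(n_components=1).fit(serum)
retained = pca.explained_variance_ratio_[0]
print(f"Variance retained by 1 component: {retained:.2%}")
```

If the retained variance is low, increasing ``n_components`` in the
corresponding ``creator.add("pca", ...)`` step may be worth trying.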
.. GENERATED FROM PYTHON SOURCE LINES 116-117

Now we can get the test score for each fold and repetition:

.. GENERATED FROM PYTHON SOURCE LINES 117-123

.. code-block:: Python

    df_mae = scores.set_index(["repeat", "fold"])["test_score"].unstack() * -1
    df_mae.index.name = "Repeats"
    df_mae.columns.name = "K-fold splits"
    print(df_mae)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    K-fold splits         0         1         2         3         4
    Repeats
    0             -0.341472 -0.348168 -0.269257 -0.286067 -0.309025

.. GENERATED FROM PYTHON SOURCE LINES 124-125

Plot a heatmap of the scores over all repeats and CV splits.

.. GENERATED FROM PYTHON SOURCE LINES 125-129

.. code-block:: Python

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(df_mae, cmap="YlGnBu")
    plt.title("Cross-validation MAE")

.. image-sg:: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_001.png
   :alt: Cross-validation MAE
   :srcset: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Cross-validation MAE')

.. GENERATED FROM PYTHON SOURCE LINES 130-132

Use the final model to make predictions on the test data and plot a
scatterplot of true vs. predicted values.

.. GENERATED FROM PYTHON SOURCE LINES 132-155

.. code-block:: Python

    y_true = test_diabetes[y]
    y_pred = model.predict(test_diabetes[X])
    mae = format(mean_absolute_error(y_true, y_pred), ".2f")
    corr = format(np.corrcoef(y_pred, y_true)[1, 0], ".2f")

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    plt.scatter(y_true, y_pred)
    plt.plot(y_true, y_true)
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    text = "MAE: " + str(mae) + " CORR: " + str(corr)
    ax.set(xlabel="True values", ylabel="Predicted values")
    plt.title("Actual vs Predicted")
    plt.text(
        xmax - 0.01 * xmax,
        ymax - 0.01 * ymax,
        text,
        verticalalignment="top",
        horizontalalignment="right",
        fontsize=12,
    )
    plt.axis("scaled")

.. image-sg:: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_002.png
   :alt: Actual vs Predicted
   :srcset: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (8.95, 362.05, 8.95, 362.05)

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.413 seconds)

.. _sphx_glr_download_auto_examples_03_complex_models_run_example_pca_featsets.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: run_example_pca_featsets.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: run_example_pca_featsets.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: run_example_pca_featsets.zip `

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_