.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/03_complex_models/run_example_pca_featsets.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_03_complex_models_run_example_pca_featsets.py:

Regression Analysis
===================

This example uses the ``diabetes`` data from ``sklearn.datasets`` and performs
a regression analysis using a Ridge regression model. We'll use the
``julearn.PipelineCreator`` to create a pipeline with two different PCA steps,
each computed on a different subset of features, to reduce the dimensionality
of the data.

.. GENERATED FROM PYTHON SOURCE LINES 12-31

.. code-block:: Python

    # Authors: Georgios Antonopoulos
    #          Kaustubh R. Patil
    #          Shammi More
    #          Federico Raimondo
    # License: AGPL

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_diabetes
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    from julearn import run_cross_validation
    from julearn.utils import configure_logging
    from julearn.pipeline import PipelineCreator
    from julearn.inspect import preprocess

.. GENERATED FROM PYTHON SOURCE LINES 32-33

Set the logging level to info to see extra information.

.. GENERATED FROM PYTHON SOURCE LINES 33-35

.. code-block:: Python

    configure_logging(level="INFO")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:54:19,145 - julearn - INFO - ===== Lib Versions =====
    2026-01-16 10:54:19,145 - julearn - INFO - numpy: 1.26.4
    2026-01-16 10:54:19,145 - julearn - INFO - scipy: 1.17.0
    2026-01-16 10:54:19,145 - julearn - INFO - sklearn: 1.7.2
    2026-01-16 10:54:19,145 - julearn - INFO - pandas: 2.3.3
    2026-01-16 10:54:19,145 - julearn - INFO - julearn: 0.3.5.dev123
    2026-01-16 10:54:19,145 - julearn - INFO - ========================

.. GENERATED FROM PYTHON SOURCE LINES 36-37

Load the diabetes data from ``sklearn`` as a ``pandas.DataFrame``.

.. GENERATED FROM PYTHON SOURCE LINES 37-39

.. code-block:: Python

    features, target = load_diabetes(return_X_y=True, as_frame=True)

.. GENERATED FROM PYTHON SOURCE LINES 40-44

The dataset contains ten baseline variables for diabetes patients: age, sex,
body mass index, average blood pressure, and six blood serum measurements
(s1-s6), as well as a quantitative measure of disease progression one year
after baseline, which is the target we are interested in predicting.

.. GENERATED FROM PYTHON SOURCE LINES 44-48

.. code-block:: Python

    print("Features: \n", features.head())
    print("Target: \n", target.describe())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Features:
             age       sex       bmi  ...        s4        s5        s6
    0  0.038076  0.050680  0.061696  ... -0.002592  0.019907 -0.017646
    1 -0.001882 -0.044642 -0.051474  ... -0.039493 -0.068332 -0.092204
    2  0.085299  0.050680  0.044451  ... -0.002592  0.002861 -0.025930
    3 -0.089063 -0.044642 -0.011595  ...  0.034309  0.022688 -0.009362
    4  0.005383 -0.044642 -0.036385  ... -0.002592 -0.031988 -0.046641

    [5 rows x 10 columns]
    Target:
    count    442.000000
    mean     152.133484
    std       77.093005
    min       25.000000
    25%       87.000000
    50%      140.500000
    75%      211.500000
    max      346.000000
    Name: target, dtype: float64

.. GENERATED FROM PYTHON SOURCE LINES 49-51

Let's combine features and target in one dataframe and define ``X`` and ``y``.

.. GENERATED FROM PYTHON SOURCE LINES 51-56

.. code-block:: Python

    data_diabetes = pd.concat([features, target], axis=1)

    X = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
    y = "target"

.. GENERATED FROM PYTHON SOURCE LINES 57-59

Assign types to the features and create feature groups for PCA.
We will keep one component per PCA group.

.. GENERATED FROM PYTHON SOURCE LINES 59-65

.. code-block:: Python

    X_types = {
        "pca1": ["age", "bmi", "bp"],
        "pca2": ["s1", "s2", "s3", "s4", "s5", "s6"],
        "categorical": ["sex"],
    }

.. GENERATED FROM PYTHON SOURCE LINES 66-70

Create a pipeline to process the data and then fit a model. We must specify
how each ``X_type`` will be used. For example, if in the last step we do not
specify ``apply_to=["continuous", "categorical"]``, then the pipeline will not
know what to do with the categorical features.

.. GENERATED FROM PYTHON SOURCE LINES 70-75

.. code-block:: Python

    creator = PipelineCreator(problem_type="regression")
    creator.add("pca", apply_to="pca1", n_components=1, name="pca_feats1")
    creator.add("pca", apply_to="pca2", n_components=1, name="pca_feats2")
    creator.add("ridge", apply_to=["continuous", "categorical"])

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:54:19,160 - julearn - INFO - Adding step pca_feats1 that applies to ColumnTypes
    2026-01-16 10:54:19,160 - julearn - INFO - Setting hyperparameter n_components = 1
    2026-01-16 10:54:19,160 - julearn - INFO - Step added
    2026-01-16 10:54:19,160 - julearn - INFO - Adding step pca_feats2 that applies to ColumnTypes
    2026-01-16 10:54:19,161 - julearn - INFO - Setting hyperparameter n_components = 1
    2026-01-16 10:54:19,161 - julearn - INFO - Step added
    2026-01-16 10:54:19,161 - julearn - INFO - Adding step ridge that applies to ColumnTypes
    2026-01-16 10:54:19,161 - julearn - INFO - Step added

.. GENERATED FROM PYTHON SOURCE LINES 76-77

Split the dataset into train and test.

.. GENERATED FROM PYTHON SOURCE LINES 77-79

.. code-block:: Python

    train_diabetes, test_diabetes = train_test_split(data_diabetes, test_size=0.3)

.. GENERATED FROM PYTHON SOURCE LINES 80-82

Run cross-validation with the Ridge regression pipeline on the train dataset,
using R² (``scoring="r2"``) to score each fold.

.. GENERATED FROM PYTHON SOURCE LINES 82-92

.. code-block:: Python

    scores, model = run_cross_validation(
        X=X,
        y=y,
        X_types=X_types,
        data=train_diabetes,
        model=creator,
        scoring="r2",
        return_estimator="final",
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    2026-01-16 10:54:19,162 - julearn - INFO - ==== Input Data ====
    2026-01-16 10:54:19,162 - julearn - INFO - Using dataframe as input
    2026-01-16 10:54:19,162 - julearn - INFO - Features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:54:19,162 - julearn - INFO - Target: target
    2026-01-16 10:54:19,163 - julearn - INFO - Expanded features: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
    2026-01-16 10:54:19,163 - julearn - INFO - X_types:{'pca1': ['age', 'bmi', 'bp'], 'pca2': ['s1', 's2', 's3', 's4', 's5', 's6'], 'categorical': ['sex']}
    2026-01-16 10:54:19,164 - julearn - INFO - ====================
    2026-01-16 10:54:19,164 - julearn - INFO -
    2026-01-16 10:54:19,165 - julearn - INFO - = Model Parameters =
    2026-01-16 10:54:19,165 - julearn - INFO - ====================
    2026-01-16 10:54:19,165 - julearn - INFO -
    2026-01-16 10:54:19,165 - julearn - INFO - = Data Information =
    2026-01-16 10:54:19,165 - julearn - INFO - Problem type: regression
    2026-01-16 10:54:19,166 - julearn - INFO - Number of samples: 309
    2026-01-16 10:54:19,166 - julearn - INFO - Number of features: 10
    2026-01-16 10:54:19,166 - julearn - INFO - ====================
    2026-01-16 10:54:19,166 - julearn - INFO -
    2026-01-16 10:54:19,166 - julearn - INFO - Target type: float64
    2026-01-16 10:54:19,166 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)

.. GENERATED FROM PYTHON SOURCE LINES 93-94

The ``scores`` dataframe has all the values for each CV split.

.. GENERATED FROM PYTHON SOURCE LINES 94-96

.. code-block:: Python

    print(scores.head())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

       fit_time  score_time  ...  fold                          cv_mdsum
    0  0.013129    0.006169  ...     0  b10eef89b4192178d482d7a1587a248a
    1  0.013075    0.006130  ...     1  b10eef89b4192178d482d7a1587a248a
    2  0.060779    0.006267  ...     2  b10eef89b4192178d482d7a1587a248a
    3  0.013103    0.006146  ...     3  b10eef89b4192178d482d7a1587a248a
    4  0.013021    0.006149  ...     4  b10eef89b4192178d482d7a1587a248a

    [5 rows x 8 columns]

.. GENERATED FROM PYTHON SOURCE LINES 97-98

Mean test score (R²) across the CV splits.

.. GENERATED FROM PYTHON SOURCE LINES 98-100

.. code-block:: Python

    print(scores["test_score"].mean())

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    0.31079767436789235

.. GENERATED FROM PYTHON SOURCE LINES 101-104

Let's see how the data looks after preprocessing. We will process the data
until the first PCA step. We should get the first PCA component for
["age", "bmi", "bp"] and leave the other features untouched.

.. GENERATED FROM PYTHON SOURCE LINES 104-108

.. code-block:: Python

    data_processed1 = preprocess(model, X, data=train_diabetes, until="pca_feats1")
    print("Data after preprocessing until PCA step 1")
    data_processed1.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Data after preprocessing until PCA step 1

.. code-block:: none

         pca_feats1__pca0       sex        s1        s2        s3        s4        s5        s6
    161          0.063175  0.050680  0.133274  0.131461 -0.039719  0.108111  0.075741  0.085907
    140          0.054779  0.050680 -0.030464 -0.001314 -0.043401 -0.002592 -0.033246  0.015491
    145          0.098172 -0.044642 -0.033216 -0.032629  0.011824 -0.039493 -0.015999 -0.050783
    9           -0.032289 -0.044642 -0.012577 -0.034508 -0.024993 -0.002592  0.067737 -0.013504
    315         -0.045025 -0.044642  0.031454  0.020607  0.056003 -0.039493 -0.010903 -0.001078
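The per-group reduction seen above can be sketched with plain scikit-learn (a
minimal illustration only, not julearn's internal implementation): a
``ColumnTransformer`` applies ``PCA(n_components=1)`` to one group of columns
and passes the remaining columns through unchanged.

```python
# Minimal sketch (plain scikit-learn, for illustration): apply PCA to one
# feature group and pass the remaining columns through, mirroring what the
# "pca_feats1" step does to ["age", "bmi", "bp"].
from sklearn.compose import ColumnTransformer
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

features, _ = load_diabetes(return_X_y=True, as_frame=True)

grouped_pca = ColumnTransformer(
    transformers=[("pca_feats1", PCA(n_components=1), ["age", "bmi", "bp"])],
    remainder="passthrough",  # leave the other 7 columns untouched
)
transformed = grouped_pca.fit_transform(features)

# 3 columns collapsed into 1 principal component + 7 passthrough columns
print(transformed.shape)  # (442, 8)
```

Note that julearn additionally keeps track of column names and types; this
sketch only reproduces the shape of the transformation.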
.. GENERATED FROM PYTHON SOURCE LINES 109-111

We will process the data until the second PCA step. We should now also get
one PCA component for ["s1", "s2", "s3", "s4", "s5", "s6"].

.. GENERATED FROM PYTHON SOURCE LINES 111-115

.. code-block:: Python

    data_processed2 = preprocess(model, X, data=train_diabetes, until="pca_feats2")
    print("Data after preprocessing until PCA step 2")
    data_processed2.head()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Data after preprocessing until PCA step 2

.. code-block:: none

         pca_feats2__pca0  pca_feats1__pca0       sex
    161          0.234716          0.063175  0.050680
    140         -0.012141          0.054779  0.050680
    145         -0.078784          0.098172 -0.044642
    9            0.006290         -0.032289 -0.044642
    315         -0.026190         -0.045025 -0.044642
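When reducing a feature group to a single component, it can be useful to check
how much variance that component actually retains. A standalone PCA on the
serum-measurement group shows this (again a plain scikit-learn sketch for
illustration; it is fit on the full dataset here rather than the train split,
and the exact ratio is not reported in this example):

```python
# Minimal sketch (plain scikit-learn): how much variance does one principal
# component retain for the serum-measurement group ["s1", ..., "s6"]?
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

features, _ = load_diabetes(return_X_y=True, as_frame=True)
serum = features[["s1", "s2", "s3", "s4", "s5", "s6"]]

pca = PCA(n_components=1).fit(serum)
retained = pca.explained_variance_ratio_[0]
print(f"Variance retained by 1 component: {retained:.2%}")
```

If the retained variance is low, increasing ``n_components`` in the
corresponding ``creator.add("pca", ...)`` step may be worth trying.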
.. GENERATED FROM PYTHON SOURCE LINES 116-117

Now we can get the test score for each fold and repetition:

.. GENERATED FROM PYTHON SOURCE LINES 117-123

.. code-block:: Python

    df_mae = scores.set_index(["repeat", "fold"])["test_score"].unstack() * -1
    df_mae.index.name = "Repeats"
    df_mae.columns.name = "K-fold splits"
    print(df_mae)

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    K-fold splits         0         1         2         3         4
    Repeats
    0             -0.341472 -0.348168 -0.269257 -0.286067 -0.309025

.. GENERATED FROM PYTHON SOURCE LINES 124-125

Plot a heatmap of the scores over all repeats and CV splits.

.. GENERATED FROM PYTHON SOURCE LINES 125-129

.. code-block:: Python

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.heatmap(df_mae, cmap="YlGnBu")
    plt.title("Cross-validation MAE")

.. image-sg:: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_001.png
   :alt: Cross-validation MAE
   :srcset: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Cross-validation MAE')

.. GENERATED FROM PYTHON SOURCE LINES 130-132

Use the final model to make predictions on the test data and plot a
scatterplot of true vs. predicted values.

.. GENERATED FROM PYTHON SOURCE LINES 132-155

.. code-block:: Python

    y_true = test_diabetes[y]
    y_pred = model.predict(test_diabetes[X])
    mae = format(mean_absolute_error(y_true, y_pred), ".2f")
    corr = format(np.corrcoef(y_pred, y_true)[1, 0], ".2f")

    fig, ax = plt.subplots(1, 1, figsize=(10, 7))
    sns.set_style("darkgrid")
    plt.scatter(y_true, y_pred)
    plt.plot(y_true, y_true)
    xmin, xmax = ax.get_xlim()
    ymin, ymax = ax.get_ylim()
    text = "MAE: " + str(mae) + " CORR: " + str(corr)
    ax.set(xlabel="True values", ylabel="Predicted values")
    plt.title("Actual vs Predicted")
    plt.text(
        xmax - 0.01 * xmax,
        ymax - 0.01 * ymax,
        text,
        verticalalignment="top",
        horizontalalignment="right",
        fontsize=12,
    )
    plt.axis("scaled")

.. image-sg:: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_002.png
   :alt: Actual vs Predicted
   :srcset: /auto_examples/03_complex_models/images/sphx_glr_run_example_pca_featsets_002.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    (8.95, 362.05, 8.95, 362.05)

.. rst-class:: sphx-glr-timing

**Total running time of the script:** (0 minutes 0.413 seconds)

.. _sphx_glr_download_auto_examples_03_complex_models_run_example_pca_featsets.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: run_example_pca_featsets.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: run_example_pca_featsets.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: run_example_pca_featsets.zip `

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_