6.6. Stacking Models#

scikit-learn already provides stacking implementations for both regression (StackingRegressor) and classification (StackingClassifier).

Now, scikit-learn’s stacking implementation fits each estimator on all of the features. However, this may not always be what you want. Sometimes you want one estimator in the ensemble to be fitted on one type of features, while another estimator is fitted on a different type of features. julearn’s API provides the extra flexibility needed to build such customizable stacking pipelines. To explore its capabilities, let’s first look at the simple case of fitting each estimator on all of the features. For example, we can stack a support vector regression (SVR) and a random forest regression (RF) to predict some target in a bit of toy data.
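For reference, this is roughly what such a stacked model looks like in plain scikit-learn; a minimal sketch for orientation only (the estimators and settings here are illustrative, not part of the julearn example that follows):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.svm import SVR

X, y = make_regression(n_features=20, n_samples=200)

# Each base estimator is fitted on all 20 features; the default final
# estimator then combines their predictions.
stacker = StackingRegressor(
    estimators=[("svr", SVR(kernel="linear")), ("rf", RandomForestRegressor())]
)
stacker.fit(X, y)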

Fitting each estimator on all of the features#

First, of course, let’s import some necessary packages. Let’s also configure julearn’s logger to get some additional information about what is happening:

from sklearn.datasets import make_regression
import pandas as pd
import numpy as np

from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging

configure_logging(level="INFO")
2024-04-04 14:44:36,509 - julearn - INFO - ===== Lib Versions =====
2024-04-04 14:44:36,509 - julearn - INFO - numpy: 1.26.4
2024-04-04 14:44:36,510 - julearn - INFO - scipy: 1.13.0
2024-04-04 14:44:36,510 - julearn - INFO - sklearn: 1.4.1.post1
2024-04-04 14:44:36,510 - julearn - INFO - pandas: 2.1.4
2024-04-04 14:44:36,510 - julearn - INFO - julearn: 0.3.2.dev24
2024-04-04 14:44:36,510 - julearn - INFO - ========================

Now that we have these out of the way, we can create some artificial toy data to demonstrate a very simple stacking estimator within julearn. We will use a dataset with 20 features and 200 samples.

# Prepare data
X, y = make_regression(n_features=20, n_samples=200)

# Make dataframe
X_names = [f"feature_{x}" for x in range(1, 21)]
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y

To build a stacking pipeline, we have to initialize each estimator that we want to use in stacking, and then of course the stacking estimator itself. Let’s start by initializing an SVR. For this we can use the PipelineCreator. Keep in mind that this is only an example, and the hyperparameter grids we use here are somewhat arbitrary:

model_1 = PipelineCreator(problem_type="regression", apply_to="*")
model_1.add("svm", kernel="linear", C=np.geomspace(1e-2, 1e2, 10))
2024-04-04 14:44:36,512 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:44:36,512 - julearn - INFO - Setting hyperparameter kernel = linear
2024-04-04 14:44:36,512 - julearn - INFO - Tuning hyperparameter C = [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
 5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
 3.59381366e+01 1.00000000e+02]
2024-04-04 14:44:36,512 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc46c370>

Note that we specify that the model applies to all of the features using apply_to="*". Now, let’s also create a pipeline for our random forest estimator:

model_2 = PipelineCreator(problem_type="regression", apply_to="*")
model_2.add(
    "rf",
    n_estimators=20,
    max_depth=[10, 50],
    min_samples_leaf=[1, 3, 4],
    min_samples_split=[2, 10],
)
2024-04-04 14:44:36,513 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:44:36,513 - julearn - INFO - Setting hyperparameter n_estimators = 20
2024-04-04 14:44:36,513 - julearn - INFO - Tuning hyperparameter max_depth = [10, 50]
2024-04-04 14:44:36,513 - julearn - INFO - Tuning hyperparameter min_samples_leaf = [1, 3, 4]
2024-04-04 14:44:36,513 - julearn - INFO - Tuning hyperparameter min_samples_split = [2, 10]
2024-04-04 14:44:36,513 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc46cd00>

We can now provide these two models to a PipelineCreator to initialize a stacking model. The interface for this is very similar to a sklearn.pipeline.Pipeline:

# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)
2024-04-04 14:44:36,513 - julearn - INFO - Adding step stacking that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:44:36,513 - julearn - INFO - Setting hyperparameter estimators = [('model_1', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc46c370>), ('model_2', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc46cd00>)]
2024-04-04 14:44:36,514 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc46d330>

Note the nested brackets around the estimators: the PipelineCreator treats iterable hyperparameter values as grids to tune over, so the inner list of (name, model) pairs is wrapped in an outer list to mark it as a single value, which is why the log above reports “Setting hyperparameter estimators” rather than tuning it. This final stacking PipelineCreator can now simply be handed over to julearn’s run_cross_validation():

scores, final = run_cross_validation(
    X=X_names,
    y="target",
    data=data,
    model=model,
    seed=200,
    return_estimator="final",
)
2024-04-04 14:44:36,514 - julearn - INFO - Setting random seed to 200
2024-04-04 14:44:36,514 - julearn - INFO - ==== Input Data ====
2024-04-04 14:44:36,514 - julearn - INFO - Using dataframe as input
2024-04-04 14:44:36,514 - julearn - INFO -      Features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-04-04 14:44:36,514 - julearn - INFO -      Target: target
2024-04-04 14:44:36,514 - julearn - INFO -      Expanded features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-04-04 14:44:36,514 - julearn - INFO -      X_types:{}
2024-04-04 14:44:36,515 - julearn - WARNING - The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:507: RuntimeWarning: The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
  warn_with_log(
2024-04-04 14:44:36,515 - julearn - INFO - ====================
2024-04-04 14:44:36,515 - julearn - INFO -
2024-04-04 14:44:36,517 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:36,517 - julearn - INFO - Tuning hyperparameters using grid
2024-04-04 14:44:36,517 - julearn - INFO - Hyperparameters:
2024-04-04 14:44:36,517 - julearn - INFO -      svm__C: [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
 5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
 3.59381366e+01 1.00000000e+02]
2024-04-04 14:44:36,517 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:44:36,517 - julearn - INFO - Search Parameters:
2024-04-04 14:44:36,517 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:44:36,517 - julearn - INFO - ====================
2024-04-04 14:44:36,517 - julearn - INFO -
2024-04-04 14:44:36,518 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:36,518 - julearn - INFO - Tuning hyperparameters using grid
2024-04-04 14:44:36,518 - julearn - INFO - Hyperparameters:
2024-04-04 14:44:36,518 - julearn - INFO -      rf__max_depth: [10, 50]
2024-04-04 14:44:36,519 - julearn - INFO -      rf__min_samples_leaf: [1, 3, 4]
2024-04-04 14:44:36,519 - julearn - INFO -      rf__min_samples_split: [2, 10]
2024-04-04 14:44:36,519 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:44:36,519 - julearn - INFO - Search Parameters:
2024-04-04 14:44:36,519 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:44:36,519 - julearn - INFO - ====================
2024-04-04 14:44:36,519 - julearn - INFO -
2024-04-04 14:44:36,657 - julearn - INFO - = Model Parameters =
2024-04-04 14:44:36,657 - julearn - INFO - ====================
2024-04-04 14:44:36,657 - julearn - INFO -
2024-04-04 14:44:36,657 - julearn - INFO - = Data Information =
2024-04-04 14:44:36,657 - julearn - INFO -      Problem type: regression
2024-04-04 14:44:36,658 - julearn - INFO -      Number of samples: 200
2024-04-04 14:44:36,658 - julearn - INFO -      Number of features: 20
2024-04-04 14:44:36,658 - julearn - INFO - ====================
2024-04-04 14:44:36,658 - julearn - INFO -
2024-04-04 14:44:36,658 - julearn - INFO -      Target type: float64
2024-04-04 14:44:36,658 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
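
Once the run finishes, the returned scores object is a regular pandas dataframe; a minimal sketch of how one could inspect it, assuming the default scorer and the test_score column that julearn reports:

# Mean and standard deviation of the outer-CV test scores
print(scores["test_score"].mean(), scores["test_score"].std())

# The stacking model refitted on the full dataset
print(final)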

Fitting each estimator on a specific feature type#

As you can see, fitting a standard scikit-learn stacking estimator is relatively simple with julearn. However, sometimes it may be desirable to have a bit more control over which features are used to fit each estimator. For example, there may be two types of features: one type that we want to use for fitting the SVR, and another that we want to use for fitting the RF. To demonstrate how this can be done in julearn, let’s now create some very similar toy data, but distinguish between two different types of features: "type1" and "type2".

# Prepare data
X, y = make_regression(n_features=20, n_samples=200)

# Prepare feature names and types
X_types = {
    "type1": [f"type1_{x}" for x in range(1, 11)],
    "type2": [f"type2_{x}" for x in range(1, 11)],
}

# First 10 features are "type1", second 10 features are "type2"
X_names = X_types["type1"] + X_types["type2"]

# Make dataframe, apply correct column names according to X_names
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y

Let’s first configure a PipelineCreator to fit an SVR on the features of "type1":

model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", kernel="linear", C=np.geomspace(1e-2, 1e2, 10))
2024-04-04 14:46:47,506 - julearn - INFO - Adding step filter_columns that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:46:47,506 - julearn - INFO - Setting hyperparameter keep = type1
2024-04-04 14:46:47,506 - julearn - INFO - Step added
2024-04-04 14:46:47,507 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'type1'}; pattern=(?:__:type:__type1)>
2024-04-04 14:46:47,507 - julearn - INFO - Setting hyperparameter kernel = linear
2024-04-04 14:46:47,507 - julearn - INFO - Tuning hyperparameter C = [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
 5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
 3.59381366e+01 1.00000000e+02]
2024-04-04 14:46:47,507 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc7e0940>
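
For intuition, the filter_columns step plays a role similar to selecting the columns yourself before fitting. A rough plain-scikit-learn analogue could look like this (a sketch for illustration only, not what julearn does internally; it expects a dataframe input so the column names resolve):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Keep only the "type1" columns (remainder is dropped), then fit the SVR
svr_on_type1 = Pipeline([
    ("filter", ColumnTransformer([("keep", "passthrough", X_types["type1"])])),
    ("svm", SVR(kernel="linear")),
])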

Afterwards, let’s configure a PipelineCreator to fit an RF on the features of "type2":

model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add(
    "rf",
    n_estimators=20,
    max_depth=[10, 50],
    min_samples_leaf=[1, 3, 4],
    min_samples_split=[2, 10],
)
2024-04-04 14:46:47,508 - julearn - INFO - Adding step filter_columns that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:46:47,508 - julearn - INFO - Setting hyperparameter keep = type2
2024-04-04 14:46:47,508 - julearn - INFO - Step added
2024-04-04 14:46:47,508 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'type2'}; pattern=(?:__:type:__type2)>
2024-04-04 14:46:47,508 - julearn - INFO - Setting hyperparameter n_estimators = 20
2024-04-04 14:46:47,508 - julearn - INFO - Tuning hyperparameter max_depth = [10, 50]
2024-04-04 14:46:47,508 - julearn - INFO - Tuning hyperparameter min_samples_leaf = [1, 3, 4]
2024-04-04 14:46:47,508 - julearn - INFO - Tuning hyperparameter min_samples_split = [2, 10]
2024-04-04 14:46:47,508 - julearn - INFO - Step added

<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc6da1a0>

Now, as in the previous example, we only have to create a stacking estimator that uses both of these estimators internally. Then we can simply use this stacking estimator in a run_cross_validation() call:

# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
    "stacking",
    estimators=[[("model_1", model_1), ("model_2", model_2)]],
    apply_to="*",
)

# Run
scores, final = run_cross_validation(
    X=X_names,
    X_types=X_types,
    y="target",
    data=data,
    model=model,
    seed=200,
    return_estimator="final",
)
2024-04-04 14:46:47,509 - julearn - INFO - Adding step stacking that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-04-04 14:46:47,509 - julearn - INFO - Setting hyperparameter estimators = [('model_1', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc7e0940>), ('model_2', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7fdacc6da1a0>)]
2024-04-04 14:46:47,509 - julearn - INFO - Step added
2024-04-04 14:46:47,509 - julearn - INFO - Setting random seed to 200
2024-04-04 14:46:47,509 - julearn - INFO - ==== Input Data ====
2024-04-04 14:46:47,509 - julearn - INFO - Using dataframe as input
2024-04-04 14:46:47,509 - julearn - INFO -      Features: ['type1_1', 'type1_2', 'type1_3', 'type1_4', 'type1_5', 'type1_6', 'type1_7', 'type1_8', 'type1_9', 'type1_10', 'type2_1', 'type2_2', 'type2_3', 'type2_4', 'type2_5', 'type2_6', 'type2_7', 'type2_8', 'type2_9', 'type2_10']
2024-04-04 14:46:47,509 - julearn - INFO -      Target: target
2024-04-04 14:46:47,510 - julearn - INFO -      Expanded features: ['type1_1', 'type1_2', 'type1_3', 'type1_4', 'type1_5', 'type1_6', 'type1_7', 'type1_8', 'type1_9', 'type1_10', 'type2_1', 'type2_2', 'type2_3', 'type2_4', 'type2_5', 'type2_6', 'type2_7', 'type2_8', 'type2_9', 'type2_10']
2024-04-04 14:46:47,510 - julearn - INFO -      X_types:{'type1': ['type1_1', 'type1_2', 'type1_3', 'type1_4', 'type1_5', 'type1_6', 'type1_7', 'type1_8', 'type1_9', 'type1_10'], 'type2': ['type2_1', 'type2_2', 'type2_3', 'type2_4', 'type2_5', 'type2_6', 'type2_7', 'type2_8', 'type2_9', 'type2_10']}
2024-04-04 14:46:47,511 - julearn - INFO - ====================
2024-04-04 14:46:47,511 - julearn - INFO -
2024-04-04 14:46:47,512 - julearn - INFO - = Model Parameters =
2024-04-04 14:46:47,512 - julearn - INFO - Tuning hyperparameters using grid
2024-04-04 14:46:47,512 - julearn - INFO - Hyperparameters:
2024-04-04 14:46:47,513 - julearn - INFO -      svm__C: [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
 5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
 3.59381366e+01 1.00000000e+02]
2024-04-04 14:46:47,513 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:46:47,513 - julearn - INFO - Search Parameters:
2024-04-04 14:46:47,513 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:46:47,513 - julearn - INFO - ====================
2024-04-04 14:46:47,513 - julearn - INFO -
2024-04-04 14:46:47,514 - julearn - INFO - = Model Parameters =
2024-04-04 14:46:47,514 - julearn - INFO - Tuning hyperparameters using grid
2024-04-04 14:46:47,514 - julearn - INFO - Hyperparameters:
2024-04-04 14:46:47,514 - julearn - INFO -      rf__max_depth: [10, 50]
2024-04-04 14:46:47,514 - julearn - INFO -      rf__min_samples_leaf: [1, 3, 4]
2024-04-04 14:46:47,514 - julearn - INFO -      rf__min_samples_split: [2, 10]
2024-04-04 14:46:47,514 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:46:47,515 - julearn - INFO - Search Parameters:
2024-04-04 14:46:47,515 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-04-04 14:46:47,515 - julearn - INFO - ====================
2024-04-04 14:46:47,515 - julearn - INFO -
2024-04-04 14:46:47,694 - julearn - INFO - = Model Parameters =
2024-04-04 14:46:47,694 - julearn - INFO - ====================
2024-04-04 14:46:47,694 - julearn - INFO -
2024-04-04 14:46:47,694 - julearn - INFO - = Data Information =
2024-04-04 14:46:47,694 - julearn - INFO -      Problem type: regression
2024-04-04 14:46:47,694 - julearn - INFO -      Number of samples: 200
2024-04-04 14:46:47,694 - julearn - INFO -      Number of features: 20
2024-04-04 14:46:47,694 - julearn - INFO - ====================
2024-04-04 14:46:47,694 - julearn - INFO -
2024-04-04 14:46:47,694 - julearn - INFO -      Target type: float64
2024-04-04 14:46:47,695 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/model_selection/_validation.py:73: FutureWarning: `fit_params` is deprecated and will be removed in version 1.6. Pass parameters via `params` instead.
  warnings.warn(
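
Since we passed return_estimator="final", the returned final model has been refitted on the whole dataset and can be used for prediction; a minimal sketch, assuming new samples arrive as a dataframe with the same column names as in training:

# Predict with the refitted stacking model (here simply on the training data)
predictions = final.predict(data[X_names])
print(predictions[:5])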

As you can see, the PipelineCreator together with scikit-learn’s built-in StackingRegressor makes it easy to flexibly build some very powerful stacking pipelines. Of course, you can do the same for classification, which will use the StackingClassifier instead.

Total running time of the script: (3 minutes 37.368 seconds)