6.6. Stacking Models
scikit-learn already provides a stacking implementation, both for regression (StackingRegressor) as well as for classification (StackingClassifier).
Now, scikit-learn’s stacking implementation fits each estimator on all of the features. However, this may not always be what you want: sometimes you want one estimator in the ensemble to be fitted on one type of features, while fitting another estimator on another type of features. julearn’s API provides some extra flexibility to build more customizable stacking pipelines.
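For reference, a plain scikit-learn stacking regressor looks roughly like this (a minimal sketch; the estimator choices are merely illustrative):

from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.svm import SVR

# Plain scikit-learn stacking: every base estimator is fitted on all features.
stacker = StackingRegressor(
    estimators=[("svr", SVR(kernel="linear")), ("rf", RandomForestRegressor())]
)
# stacker.fit(X, y) would train both base estimators on the full feature set.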
In order to explore its capabilities, let’s first look at this simple example of fitting each estimator on all of the features. For example, we can stack a support vector regression (SVR) and a random forest regression (RF) to predict some target in a bit of toy data.
Fitting each estimator on all of the features
First, of course, let’s import some necessary packages. Let’s also configure julearn’s logger to get some additional information about what is happening:
from sklearn.datasets import make_regression
import pandas as pd
import numpy as np
from julearn import run_cross_validation
from julearn.pipeline import PipelineCreator
from julearn.utils import configure_logging
configure_logging(level="INFO")
2024-10-17 14:16:15,521 - julearn - INFO - ===== Lib Versions =====
2024-10-17 14:16:15,521 - julearn - INFO - numpy: 1.26.4
2024-10-17 14:16:15,521 - julearn - INFO - scipy: 1.14.1
2024-10-17 14:16:15,521 - julearn - INFO - sklearn: 1.5.2
2024-10-17 14:16:15,521 - julearn - INFO - pandas: 2.2.3
2024-10-17 14:16:15,522 - julearn - INFO - julearn: 0.3.4
2024-10-17 14:16:15,522 - julearn - INFO - ========================
Now that we have these out of the way, we can create some artificial toy data to demonstrate a very simple stacking estimator within julearn. We will use a dataset with 20 features and 200 samples.
# Prepare data
X, y = make_regression(n_features=20, n_samples=200)
# Make dataframe
X_names = [f"feature_{x}" for x in range(1, 21)]
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y
To build a stacking pipeline, we have to initialize each estimator that we want to use in stacking, and then, of course, the stacking estimator itself. Let’s start by initializing an SVR using the PipelineCreator. Keep in mind that this is only an example, and the hyperparameter grids we use here are somewhat arbitrary:
model_1 = PipelineCreator(problem_type="regression", apply_to="*")
model_1.add("svm", kernel="linear", C=np.geomspace(1e-2, 1e2, 10))
2024-10-17 14:16:15,524 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-10-17 14:16:15,524 - julearn - INFO - Setting hyperparameter kernel = linear
2024-10-17 14:16:15,524 - julearn - INFO - Tuning hyperparameter C = [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
3.59381366e+01 1.00000000e+02]
2024-10-17 14:16:15,524 - julearn - INFO - Step added
<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e468d210>
Note that we specify applying the model to all of the features using apply_to="*". Now, let’s also create a pipeline for our random forest estimator:
model_2 = PipelineCreator(problem_type="regression", apply_to="*")
model_2.add(
"rf",
n_estimators=20,
max_depth=[10, 50],
min_samples_leaf=[1, 3, 4],
min_samples_split=[2, 10],
)
2024-10-17 14:16:15,525 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-10-17 14:16:15,525 - julearn - INFO - Setting hyperparameter n_estimators = 20
2024-10-17 14:16:15,525 - julearn - INFO - Tuning hyperparameter max_depth = [10, 50]
2024-10-17 14:16:15,525 - julearn - INFO - Tuning hyperparameter min_samples_leaf = [1, 3, 4]
2024-10-17 14:16:15,525 - julearn - INFO - Tuning hyperparameter min_samples_split = [2, 10]
2024-10-17 14:16:15,525 - julearn - INFO - Step added
<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e19fd750>
We can now provide these two models to a PipelineCreator to initialize a stacking model. The interface for this is very similar to a sklearn.pipeline.Pipeline:
# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
"stacking",
estimators=[[("model_1", model_1), ("model_2", model_2)]],
apply_to="*",
)
2024-10-17 14:16:15,525 - julearn - INFO - Adding step stacking that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-10-17 14:16:15,525 - julearn - INFO - Setting hyperparameter estimators = [('model_1', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e468d210>), ('model_2', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e19fd750>)]
2024-10-17 14:16:15,526 - julearn - INFO - Step added
<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e468fe80>
This final stacking PipelineCreator can now simply be handed over to julearn’s run_cross_validation():
scores, final = run_cross_validation(
X=X_names,
y="target",
data=data,
model=model,
seed=200,
return_estimator="final",
)
2024-10-17 14:16:15,526 - julearn - INFO - Setting random seed to 200
2024-10-17 14:16:15,526 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:15,526 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:15,526 - julearn - INFO - Features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-10-17 14:16:15,526 - julearn - INFO - Target: target
2024-10-17 14:16:15,526 - julearn - INFO - Expanded features: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']
2024-10-17 14:16:15,526 - julearn - INFO - X_types:{}
2024-10-17 14:16:15,527 - julearn - WARNING - The following columns are not defined in X_types: ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_12', 'feature_13', 'feature_14', 'feature_15', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20']. They will be treated as continuous.
2024-10-17 14:16:15,527 - julearn - INFO - ====================
2024-10-17 14:16:15,527 - julearn - INFO -
2024-10-17 14:16:15,528 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:15,528 - julearn - INFO - Tuning hyperparameters using grid
2024-10-17 14:16:15,528 - julearn - INFO - Hyperparameters:
2024-10-17 14:16:15,528 - julearn - INFO - svm__C: [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
3.59381366e+01 1.00000000e+02]
2024-10-17 14:16:15,528 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:15,529 - julearn - INFO - Search Parameters:
2024-10-17 14:16:15,529 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:15,529 - julearn - INFO - ====================
2024-10-17 14:16:15,529 - julearn - INFO -
2024-10-17 14:16:15,529 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:15,529 - julearn - INFO - Tuning hyperparameters using grid
2024-10-17 14:16:15,529 - julearn - INFO - Hyperparameters:
2024-10-17 14:16:15,529 - julearn - INFO - rf__max_depth: [10, 50]
2024-10-17 14:16:15,529 - julearn - INFO - rf__min_samples_leaf: [1, 3, 4]
2024-10-17 14:16:15,529 - julearn - INFO - rf__min_samples_split: [2, 10]
2024-10-17 14:16:15,530 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:15,530 - julearn - INFO - Search Parameters:
2024-10-17 14:16:15,530 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:15,530 - julearn - INFO - ====================
2024-10-17 14:16:15,530 - julearn - INFO -
2024-10-17 14:16:15,554 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:15,554 - julearn - INFO - ====================
2024-10-17 14:16:15,554 - julearn - INFO -
2024-10-17 14:16:15,554 - julearn - INFO - = Data Information =
2024-10-17 14:16:15,554 - julearn - INFO - Problem type: regression
2024-10-17 14:16:15,554 - julearn - INFO - Number of samples: 200
2024-10-17 14:16:15,554 - julearn - INFO - Number of features: 20
2024-10-17 14:16:15,554 - julearn - INFO - ====================
2024-10-17 14:16:15,555 - julearn - INFO -
2024-10-17 14:16:15,555 - julearn - INFO - Target type: float64
2024-10-17 14:16:15,555 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
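The returned scores object is a pandas DataFrame with one row per outer CV fold. As a quick look at the results (a minimal sketch; "test_score" is the default scoring column produced by run_cross_validation):

# Mean test score across the outer CV folds.
print(scores["test_score"].mean())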
Fitting each estimator on a specific feature type
As you can see, fitting a standard scikit-learn stacking estimator is relatively simple with julearn. However, sometimes it may be desirable to have a bit more control over which features are used to fit each estimator. For example, there may be two types of features: one type we may want to use for fitting the SVR, and the other for fitting the RF. To demonstrate how this can be done in julearn, let’s now create some very similar toy data, but distinguish between two different types of features: "type1" and "type2".
# Prepare data
X, y = make_regression(n_features=20, n_samples=200)
# Prepare feature names and types
X_types = {
"type1": [f"type1_{x}" for x in range(1, 11)],
"type2": [f"type2_{x}" for x in range(1, 11)],
}
# First 10 features are "type1", second 10 features are "type2"
X_names = X_types["type1"] + X_types["type2"]
# Make dataframe, apply correct column names according to X_names
data = pd.DataFrame(X)
data.columns = X_names
data["target"] = y
Let’s first configure a PipelineCreator to fit an SVR on the "type1" features. The filter_columns step keeps only the requested columns, so the SVR never sees the "type2" features:
model_1 = PipelineCreator(problem_type="regression", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", kernel="linear", C=np.geomspace(1e-2, 1e2, 10))
2024-10-17 14:18:19,783 - julearn - INFO - Adding step filter_columns that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-10-17 14:18:19,783 - julearn - INFO - Setting hyperparameter keep = type1
2024-10-17 14:18:19,784 - julearn - INFO - Step added
2024-10-17 14:18:19,784 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'type1'}; pattern=(?:__:type:__type1)>
2024-10-17 14:18:19,784 - julearn - INFO - Setting hyperparameter kernel = linear
2024-10-17 14:18:19,784 - julearn - INFO - Tuning hyperparameter C = [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
3.59381366e+01 1.00000000e+02]
2024-10-17 14:18:19,784 - julearn - INFO - Step added
<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e1dcb5e0>
Afterwards, let’s configure a PipelineCreator to fit an RF on the "type2" features:
model_2 = PipelineCreator(problem_type="regression", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add(
"rf",
n_estimators=20,
max_depth=[10, 50],
min_samples_leaf=[1, 3, 4],
min_samples_split=[2, 10],
)
2024-10-17 14:18:19,785 - julearn - INFO - Adding step filter_columns that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-10-17 14:18:19,785 - julearn - INFO - Setting hyperparameter keep = type2
2024-10-17 14:18:19,785 - julearn - INFO - Step added
2024-10-17 14:18:19,785 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'type2'}; pattern=(?:__:type:__type2)>
2024-10-17 14:18:19,785 - julearn - INFO - Setting hyperparameter n_estimators = 20
2024-10-17 14:18:19,785 - julearn - INFO - Tuning hyperparameter max_depth = [10, 50]
2024-10-17 14:18:19,785 - julearn - INFO - Tuning hyperparameter min_samples_leaf = [1, 3, 4]
2024-10-17 14:18:19,785 - julearn - INFO - Tuning hyperparameter min_samples_split = [2, 10]
2024-10-17 14:18:19,786 - julearn - INFO - Step added
<julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e1bd6800>
Now, as in the previous example, we only have to create a stacking estimator that uses both of these estimators internally. Then we can simply use this stacking estimator in a run_cross_validation() call:
# Create the stacking model
model = PipelineCreator(problem_type="regression")
model.add(
"stacking",
estimators=[[("model_1", model_1), ("model_2", model_2)]],
apply_to="*",
)
# Run
scores, final = run_cross_validation(
X=X_names,
X_types=X_types,
y="target",
data=data,
model=model,
seed=200,
return_estimator="final",
)
2024-10-17 14:18:19,786 - julearn - INFO - Adding step stacking that applies to ColumnTypes<types={'*'}; pattern=.*>
2024-10-17 14:18:19,786 - julearn - INFO - Setting hyperparameter estimators = [('model_1', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e1dcb5e0>), ('model_2', <julearn.pipeline.pipeline_creator.PipelineCreator object at 0x7f45e1bd6800>)]
2024-10-17 14:18:19,786 - julearn - INFO - Step added
2024-10-17 14:18:19,786 - julearn - INFO - Setting random seed to 200
2024-10-17 14:18:19,786 - julearn - INFO - ==== Input Data ====
2024-10-17 14:18:19,786 - julearn - INFO - Using dataframe as input
2024-10-17 14:18:19,786 - julearn - INFO - Features: ['type1_1', 'type1_2', 'type1_3', 'type1_4', 'type1_5', 'type1_6', 'type1_7', 'type1_8', 'type1_9', 'type1_10', 'type2_1', 'type2_2', 'type2_3', 'type2_4', 'type2_5', 'type2_6', 'type2_7', 'type2_8', 'type2_9', 'type2_10']
2024-10-17 14:18:19,786 - julearn - INFO - Target: target
2024-10-17 14:18:19,787 - julearn - INFO - Expanded features: ['type1_1', 'type1_2', 'type1_3', 'type1_4', 'type1_5', 'type1_6', 'type1_7', 'type1_8', 'type1_9', 'type1_10', 'type2_1', 'type2_2', 'type2_3', 'type2_4', 'type2_5', 'type2_6', 'type2_7', 'type2_8', 'type2_9', 'type2_10']
2024-10-17 14:18:19,787 - julearn - INFO - X_types:{'type1': ['type1_1', 'type1_2', 'type1_3', 'type1_4', 'type1_5', 'type1_6', 'type1_7', 'type1_8', 'type1_9', 'type1_10'], 'type2': ['type2_1', 'type2_2', 'type2_3', 'type2_4', 'type2_5', 'type2_6', 'type2_7', 'type2_8', 'type2_9', 'type2_10']}
2024-10-17 14:18:19,788 - julearn - INFO - ====================
2024-10-17 14:18:19,788 - julearn - INFO -
2024-10-17 14:18:19,790 - julearn - INFO - = Model Parameters =
2024-10-17 14:18:19,790 - julearn - INFO - Tuning hyperparameters using grid
2024-10-17 14:18:19,790 - julearn - INFO - Hyperparameters:
2024-10-17 14:18:19,790 - julearn - INFO - svm__C: [1.00000000e-02 2.78255940e-02 7.74263683e-02 2.15443469e-01
5.99484250e-01 1.66810054e+00 4.64158883e+00 1.29154967e+01
3.59381366e+01 1.00000000e+02]
2024-10-17 14:18:19,790 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:18:19,790 - julearn - INFO - Search Parameters:
2024-10-17 14:18:19,790 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:18:19,790 - julearn - INFO - ====================
2024-10-17 14:18:19,790 - julearn - INFO -
2024-10-17 14:18:19,791 - julearn - INFO - = Model Parameters =
2024-10-17 14:18:19,792 - julearn - INFO - Tuning hyperparameters using grid
2024-10-17 14:18:19,792 - julearn - INFO - Hyperparameters:
2024-10-17 14:18:19,792 - julearn - INFO - rf__max_depth: [10, 50]
2024-10-17 14:18:19,792 - julearn - INFO - rf__min_samples_leaf: [1, 3, 4]
2024-10-17 14:18:19,792 - julearn - INFO - rf__min_samples_split: [2, 10]
2024-10-17 14:18:19,792 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:18:19,792 - julearn - INFO - Search Parameters:
2024-10-17 14:18:19,792 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:18:19,792 - julearn - INFO - ====================
2024-10-17 14:18:19,792 - julearn - INFO -
2024-10-17 14:18:19,851 - julearn - INFO - = Model Parameters =
2024-10-17 14:18:19,851 - julearn - INFO - ====================
2024-10-17 14:18:19,851 - julearn - INFO -
2024-10-17 14:18:19,851 - julearn - INFO - = Data Information =
2024-10-17 14:18:19,851 - julearn - INFO - Problem type: regression
2024-10-17 14:18:19,851 - julearn - INFO - Number of samples: 200
2024-10-17 14:18:19,851 - julearn - INFO - Number of features: 20
2024-10-17 14:18:19,851 - julearn - INFO - ====================
2024-10-17 14:18:19,851 - julearn - INFO -
2024-10-17 14:18:19,851 - julearn - INFO - Target type: float64
2024-10-17 14:18:19,851 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
As you can see, the PipelineCreator and the built-in StackingRegressor make it very easy to flexibly build some very powerful stacking pipelines. Of course, you can do the same for classification, which will use the StackingClassifier instead.
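For example, a minimal sketch of the classification variant (hypothetical, following the same pattern as the type-specific regression example above) could look like this:

# Base learners, each restricted to one feature type (classification version).
model_1 = PipelineCreator(problem_type="classification", apply_to="type1")
model_1.add("filter_columns", apply_to="*", keep="type1")
model_1.add("svm", kernel="linear")

model_2 = PipelineCreator(problem_type="classification", apply_to="type2")
model_2.add("filter_columns", apply_to="*", keep="type2")
model_2.add("rf", n_estimators=20)

# Stacking model; julearn will use a StackingClassifier under the hood.
model = PipelineCreator(problem_type="classification")
model.add(
"stacking",
estimators=[[("model_1", model_1), ("model_2", model_2)]],
apply_to="*",
)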
Total running time of the script: (3 minutes 30.727 seconds)