6.3. Hyperparameter Tuning
Parameters vs Hyperparameters
Parameters are the values that define the model, and are learned from the data. For example, the weights of a linear regression model are parameters. The parameters of a model are learned during training and are not set by the user.
Hyperparameters are the values that define the model, but are not learned from the data. For example, the regularization parameter C of a Support Vector Machine (SVC) model is a hyperparameter. The hyperparameters of a model are set by the user before training and are not learned during training.
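The distinction can be illustrated with a minimal, library-free sketch. It assumes a one-dimensional ridge regression, which is not the model used in this section: the penalty alpha is a hyperparameter chosen by the user, and the weight w is a parameter learned from the data. The function name fit_ridge_1d and the toy data are hypothetical.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y ~ w * x by minimising sum((y - w * x)^2) + alpha * w^2.

    alpha is the hyperparameter (set before fitting);
    the returned w is the parameter (learned from xs and ys).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)


# Toy data following y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

for alpha in (0.0, 10.0):
    w = fit_ridge_1d(xs, ys, alpha)
    print(f"alpha={alpha}: learned w={w:.3f}")
# alpha=0.0: learned w=2.000
# alpha=10.0: learned w=1.500
```

The same data yields a different learned weight for each alpha, which is why the choice of hyperparameter matters. Note that in this sketch a larger alpha means stronger regularization, whereas for an SVC a larger C means weaker regularization.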
Let’s see an example of an SVC model with a regularization parameter C. We will use the iris dataset, which is a dataset of measurements of flowers.
We start by loading the dataset and setting the features and target variables.
from seaborn import load_dataset
from pprint import pprint # To print in a pretty way
df = load_dataset("iris")
X = df.columns[:-1].tolist()
y = "species"
X_types = {"continuous": X}
# The dataset has three species. We keep two of them to perform a binary
# classification.
df = df[df["species"].isin(["versicolor", "virginica"])]
We can now use the PipelineCreator to create a pipeline with a StandardScaler (the zscore step) and an SVC, with the regularization parameter C set to 0.1.
from julearn.pipeline import PipelineCreator
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=0.1)
print(creator)
2024-01-23 10:57:36,242 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:36,242 - julearn - INFO - Step added
2024-01-23 10:57:36,243 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:36,243 - julearn - INFO - Setting hyperparameter C = 0.1
2024-01-23 10:57:36,243 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC(C=0.1)
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Hyperparameter Tuning
Since it is the user who sets the hyperparameters, it is important to choose the right values. This is not always easy, and it is common to try different values and see which one works best. This process is called hyperparameter tuning.
Basically, hyperparameter tuning refers to testing several hyperparameter values and choosing the one that works best.
For example, we can try different values for the regularization parameter C of the SVC model and see which one works best.
from julearn import run_cross_validation
scores1 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)
print(f"Score with C=0.1: {scores1['test_score'].mean()}")
creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add("svm", C=0.01)
scores2 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator2,
)
print(f"Score with C=0.01: {scores2['test_score'].mean()}")
2024-01-23 10:57:36,244 - julearn - INFO - ==== Input Data ====
2024-01-23 10:57:36,244 - julearn - INFO - Using dataframe as input
2024-01-23 10:57:36,244 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:36,244 - julearn - INFO - Target: species
2024-01-23 10:57:36,244 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:36,244 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:57:36,245 - julearn - INFO - ====================
2024-01-23 10:57:36,245 - julearn - INFO -
2024-01-23 10:57:36,245 - julearn - INFO - = Model Parameters =
2024-01-23 10:57:36,245 - julearn - INFO - ====================
2024-01-23 10:57:36,245 - julearn - INFO -
2024-01-23 10:57:36,246 - julearn - INFO - = Data Information =
2024-01-23 10:57:36,246 - julearn - INFO - Problem type: classification
2024-01-23 10:57:36,246 - julearn - INFO - Number of samples: 100
2024-01-23 10:57:36,246 - julearn - INFO - Number of features: 4
2024-01-23 10:57:36,246 - julearn - INFO - ====================
2024-01-23 10:57:36,246 - julearn - INFO -
2024-01-23 10:57:36,246 - julearn - INFO - Number of classes: 2
2024-01-23 10:57:36,246 - julearn - INFO - Target type: object
2024-01-23 10:57:36,246 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:57:36,247 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:36,247 - julearn - INFO - Binary classification problem detected.
Score with C=0.1: 0.8099999999999999
2024-01-23 10:57:36,284 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:36,284 - julearn - INFO - Step added
2024-01-23 10:57:36,285 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:36,285 - julearn - INFO - Setting hyperparameter C = 0.01
2024-01-23 10:57:36,285 - julearn - INFO - Step added
2024-01-23 10:57:36,285 - julearn - INFO - ==== Input Data ====
2024-01-23 10:57:36,285 - julearn - INFO - Using dataframe as input
2024-01-23 10:57:36,285 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:36,285 - julearn - INFO - Target: species
2024-01-23 10:57:36,285 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:36,285 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:57:36,286 - julearn - INFO - ====================
2024-01-23 10:57:36,286 - julearn - INFO -
2024-01-23 10:57:36,286 - julearn - INFO - = Model Parameters =
2024-01-23 10:57:36,286 - julearn - INFO - ====================
2024-01-23 10:57:36,286 - julearn - INFO -
2024-01-23 10:57:36,286 - julearn - INFO - = Data Information =
2024-01-23 10:57:36,286 - julearn - INFO - Problem type: classification
2024-01-23 10:57:36,286 - julearn - INFO - Number of samples: 100
2024-01-23 10:57:36,286 - julearn - INFO - Number of features: 4
2024-01-23 10:57:36,286 - julearn - INFO - ====================
2024-01-23 10:57:36,286 - julearn - INFO -
2024-01-23 10:57:36,287 - julearn - INFO - Number of classes: 2
2024-01-23 10:57:36,287 - julearn - INFO - Target type: object
2024-01-23 10:57:36,287 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:57:36,287 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:36,287 - julearn - INFO - Binary classification problem detected.
Score with C=0.01: 0.19
We can see that the model with C=0.1 works better than the model with C=0.01. However, to be sure that C=0.1 is the best value, we should try more values. And since this is only one hyperparameter, it is not that difficult. But what if we have more hyperparameters? And what if we have several steps in the pipeline (e.g. feature selection, PCA, etc.)?
This is a major problem: the more hyperparameter values we try, the more times we reuse the same data both to choose the model and to test it. The score of the configuration selected this way is usually an optimistic estimate of the performance of the model.
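A small simulation can make this optimism concrete. The sketch below is a simplification under stated assumptions: every "configuration" is a coin-flip classifier with a true accuracy of 0.5, scores are drawn independently, and the numbers of test samples, configurations, and repetitions are arbitrary choices, not values from this section.

```python
import random

rng = random.Random(0)
n_test, n_configs, n_repeats = 50, 20, 200


def random_accuracy():
    """Observed accuracy of a coin-flip classifier on n_test samples."""
    return sum(rng.random() < 0.5 for _ in range(n_test)) / n_test


single, best_of_many = [], []
for _ in range(n_repeats):
    scores = [random_accuracy() for _ in range(n_configs)]
    single.append(scores[0])          # configuration fixed in advance
    best_of_many.append(max(scores))  # configuration picked after seeing the scores

mean_single = sum(single) / n_repeats
mean_best = sum(best_of_many) / n_repeats
print(f"fixed configuration: {mean_single:.3f}")
print(f"best of {n_configs}: {mean_best:.3f}")
```

Although no configuration is better than chance, the score reported for the "best" one sits clearly above 0.5, simply because it was selected on the same data used to score it.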
To prevent this, we can use a technique called nested cross-validation. That is, we use one cross-validation loop to tune the hyperparameters, and another one to estimate the performance of the model using the best hyperparameter set. It is called nested because we first split the data into training and testing sets to estimate the model performance (outer loop), and then we split each training set again to tune the hyperparameters (inner loop).
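The structure of the two loops can be sketched with plain index lists. This is an illustration only, assuming 100 samples and an unshuffled 5-fold split for both loops; kfold_indices is a hypothetical helper, not part of julearn or scikit-learn.

```python
def kfold_indices(indices, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    indices = list(indices)
    fold_sizes = [len(indices) // n_splits] * n_splits
    for i in range(len(indices) % n_splits):
        fold_sizes[i] += 1
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


n_samples = 100
splits = []
for outer_train, outer_test in kfold_indices(range(n_samples), 5):
    # Inner loop: only the outer training indices are available for tuning.
    inner = list(kfold_indices(outer_train, 5))
    splits.append((outer_train, outer_test, inner))

outer_train, outer_test, inner = splits[0]
print(len(outer_train), len(outer_test))    # 80 20
print(len(inner[0][0]), len(inner[0][1]))   # 64 16
```

The key property is that the outer test indices never appear in any inner split, so the data used to report performance plays no role in choosing the hyperparameters.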
julearn has a simple way to do hyperparameter tuning using nested cross-validation. When we use a PipelineCreator to create a pipeline, we can set the hyperparameters we want to tune and the values we want to try. For example, we can try different values for the regularization parameter C of the SVC model:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10])
print(creator)
2024-01-23 10:57:36,324 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:36,324 - julearn - INFO - Step added
2024-01-23 10:57:36,324 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:36,324 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:57:36,324 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10]}
As we can see above, the creator now shows that the C hyperparameter will be tuned. We can now use this creator to run cross-validation. This will tune the hyperparameters and estimate the performance of the model using the best hyperparameter set.
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
2024-01-23 10:57:36,325 - julearn - INFO - ==== Input Data ====
2024-01-23 10:57:36,325 - julearn - INFO - Using dataframe as input
2024-01-23 10:57:36,325 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:36,326 - julearn - INFO - Target: species
2024-01-23 10:57:36,326 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:36,326 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:57:36,326 - julearn - INFO - ====================
2024-01-23 10:57:36,326 - julearn - INFO -
2024-01-23 10:57:36,327 - julearn - INFO - = Model Parameters =
2024-01-23 10:57:36,327 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:57:36,327 - julearn - INFO - Hyperparameters:
2024-01-23 10:57:36,327 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:57:36,327 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:36,327 - julearn - INFO - Search Parameters:
2024-01-23 10:57:36,327 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:36,327 - julearn - INFO - ====================
2024-01-23 10:57:36,327 - julearn - INFO -
2024-01-23 10:57:36,327 - julearn - INFO - = Data Information =
2024-01-23 10:57:36,327 - julearn - INFO - Problem type: classification
2024-01-23 10:57:36,327 - julearn - INFO - Number of samples: 100
2024-01-23 10:57:36,327 - julearn - INFO - Number of features: 4
2024-01-23 10:57:36,327 - julearn - INFO - ====================
2024-01-23 10:57:36,328 - julearn - INFO -
2024-01-23 10:57:36,328 - julearn - INFO - Number of classes: 2
2024-01-23 10:57:36,328 - julearn - INFO - Target type: object
2024-01-23 10:57:36,328 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:57:36,328 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:36,328 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001
We can see that the model with the best hyperparameters works better than the model with C=0.1. But what’s the best hyperparameter set? We can see it by printing the model_tuned.best_params_ variable.
pprint(model_tuned.best_params_)
{'svm__C': 1}
We can see that the best hyperparameter set is C=1. Since this value is not on the boundary of the values we tried, we can conclude that our search for the best C value was successful.
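The boundary check described above is easy to automate. The helper below is a hypothetical sketch (on_boundary is not a julearn or scikit-learn function); it only considers the numeric candidates, so string options are ignored.

```python
def on_boundary(best_value, tried_values):
    """Return True if best_value is the smallest or largest numeric value tried."""
    numeric = [v for v in tried_values if isinstance(v, (int, float))]
    return best_value in (min(numeric), max(numeric))


c_values = [0.01, 0.1, 1, 10]
print(on_boundary(1, c_values))   # False: neighbours on both sides were tried
print(on_boundary(10, c_values))  # True: a larger C might do even better
```

A True result does not mean the value is wrong, only that the searched range may need to be extended in that direction.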
However, by checking the SVC documentation, we can see that there are more hyperparameters that we can tune. For example, for the default rbf kernel, we can tune the gamma hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10], gamma=[0.01, 0.1, 1, 10])
print(creator)
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)
2024-01-23 10:57:37,225 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:37,225 - julearn - INFO - Step added
2024-01-23 10:57:37,226 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:37,226 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:57:37,226 - julearn - INFO - Tuning hyperparameter gamma = [0.01, 0.1, 1, 10]
2024-01-23 10:57:37,226 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.01, 0.1, 1, 10]}
2024-01-23 10:57:37,226 - julearn - INFO - ==== Input Data ====
2024-01-23 10:57:37,226 - julearn - INFO - Using dataframe as input
2024-01-23 10:57:37,226 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:37,226 - julearn - INFO - Target: species
2024-01-23 10:57:37,227 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:37,227 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:57:37,227 - julearn - INFO - ====================
2024-01-23 10:57:37,227 - julearn - INFO -
2024-01-23 10:57:37,228 - julearn - INFO - = Model Parameters =
2024-01-23 10:57:37,228 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:57:37,228 - julearn - INFO - Hyperparameters:
2024-01-23 10:57:37,228 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:57:37,228 - julearn - INFO - svm__gamma: [0.01, 0.1, 1, 10]
2024-01-23 10:57:37,228 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:37,228 - julearn - INFO - Search Parameters:
2024-01-23 10:57:37,228 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:37,228 - julearn - INFO - ====================
2024-01-23 10:57:37,228 - julearn - INFO -
2024-01-23 10:57:37,228 - julearn - INFO - = Data Information =
2024-01-23 10:57:37,228 - julearn - INFO - Problem type: classification
2024-01-23 10:57:37,228 - julearn - INFO - Number of samples: 100
2024-01-23 10:57:37,228 - julearn - INFO - Number of features: 4
2024-01-23 10:57:37,229 - julearn - INFO - ====================
2024-01-23 10:57:37,229 - julearn - INFO -
2024-01-23 10:57:37,229 - julearn - INFO - Number of classes: 2
2024-01-23 10:57:37,229 - julearn - INFO - Target type: object
2024-01-23 10:57:37,229 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:57:37,229 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:37,229 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}
We can see that the best hyperparameter set is C=10 and gamma=0.01. But since both values are on the boundary of the values we tried (the largest C and the smallest gamma), we should try more values to be sure that we are using the best hyperparameter set.
We can even give a combination of different variable types, like the words "scale" and "auto" for the gamma hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-5, 1e-4, 1e-3, 1e-2, "scale", "auto"],
)
print(creator)
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)
2024-01-23 10:57:40,673 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:40,673 - julearn - INFO - Step added
2024-01-23 10:57:40,673 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:40,673 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:57:40,673 - julearn - INFO - Tuning hyperparameter gamma = [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2024-01-23 10:57:40,673 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']}
2024-01-23 10:57:40,674 - julearn - INFO - ==== Input Data ====
2024-01-23 10:57:40,674 - julearn - INFO - Using dataframe as input
2024-01-23 10:57:40,674 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:40,674 - julearn - INFO - Target: species
2024-01-23 10:57:40,674 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:40,674 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:57:40,675 - julearn - INFO - ====================
2024-01-23 10:57:40,675 - julearn - INFO -
2024-01-23 10:57:40,675 - julearn - INFO - = Model Parameters =
2024-01-23 10:57:40,675 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:57:40,675 - julearn - INFO - Hyperparameters:
2024-01-23 10:57:40,675 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:57:40,675 - julearn - INFO - svm__gamma: [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2024-01-23 10:57:40,675 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:40,675 - julearn - INFO - Search Parameters:
2024-01-23 10:57:40,676 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:40,676 - julearn - INFO - ====================
2024-01-23 10:57:40,676 - julearn - INFO -
2024-01-23 10:57:40,676 - julearn - INFO - = Data Information =
2024-01-23 10:57:40,676 - julearn - INFO - Problem type: classification
2024-01-23 10:57:40,676 - julearn - INFO - Number of samples: 100
2024-01-23 10:57:40,676 - julearn - INFO - Number of features: 4
2024-01-23 10:57:40,676 - julearn - INFO - ====================
2024-01-23 10:57:40,676 - julearn - INFO -
2024-01-23 10:57:40,676 - julearn - INFO - Number of classes: 2
2024-01-23 10:57:40,676 - julearn - INFO - Target type: object
2024-01-23 10:57:40,677 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:57:40,677 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:40,677 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}
We can even tune hyperparameters from different steps of the pipeline. Let’s add a SelectKBest step to the pipeline and tune its k hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=[2, 3, 4])
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, 1e-1, "scale", "auto"],
)
print(creator)
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)
2024-01-23 10:57:45,946 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:45,946 - julearn - INFO - Step added
2024-01-23 10:57:45,946 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:45,946 - julearn - INFO - Tuning hyperparameter k = [2, 3, 4]
2024-01-23 10:57:45,946 - julearn - INFO - Step added
2024-01-23 10:57:45,946 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:57:45,946 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:57:45,946 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 0.1, 'scale', 'auto']
2024-01-23 10:57:45,946 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: select_k
estimator: SelectKBest()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'select_k__k': [2, 3, 4]}
Step 2: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 0.1, 'scale', 'auto']}
2024-01-23 10:57:45,947 - julearn - INFO - ==== Input Data ====
2024-01-23 10:57:45,947 - julearn - INFO - Using dataframe as input
2024-01-23 10:57:45,947 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:45,947 - julearn - INFO - Target: species
2024-01-23 10:57:45,947 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:57:45,947 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:57:45,948 - julearn - INFO - ====================
2024-01-23 10:57:45,948 - julearn - INFO -
2024-01-23 10:57:45,948 - julearn - INFO - = Model Parameters =
2024-01-23 10:57:45,948 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:57:45,948 - julearn - INFO - Hyperparameters:
2024-01-23 10:57:45,949 - julearn - INFO - select_k__k: [2, 3, 4]
2024-01-23 10:57:45,949 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:57:45,949 - julearn - INFO - svm__gamma: [0.001, 0.01, 0.1, 'scale', 'auto']
2024-01-23 10:57:45,949 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:45,949 - julearn - INFO - Search Parameters:
2024-01-23 10:57:45,949 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:45,949 - julearn - INFO - ====================
2024-01-23 10:57:45,949 - julearn - INFO -
2024-01-23 10:57:45,949 - julearn - INFO - = Data Information =
2024-01-23 10:57:45,949 - julearn - INFO - Problem type: classification
2024-01-23 10:57:45,949 - julearn - INFO - Number of samples: 100
2024-01-23 10:57:45,949 - julearn - INFO - Number of features: 4
2024-01-23 10:57:45,949 - julearn - INFO - ====================
2024-01-23 10:57:45,949 - julearn - INFO -
2024-01-23 10:57:45,949 - julearn - INFO - Number of classes: 2
2024-01-23 10:57:45,949 - julearn - INFO - Target type: object
2024-01-23 10:57:45,950 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:57:45,950 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:57:45,950 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001
{'select_k__k': 4, 'svm__C': 10, 'svm__gamma': 0.01}
But how will julearn find the optimal hyperparameter set?
Searchers
julearn uses the same concept as scikit-learn to tune hyperparameters: it uses a searcher to find the best hyperparameter set. A searcher is an object that receives a set of hyperparameters and their values, and then tries to find the best combination of values for the hyperparameters using cross-validation.
By default, julearn uses a GridSearchCV. This searcher is very simple. First, it constructs the “grid” of hyperparameters to try. As we saw above, we have 3 hyperparameters to tune, so it constructs a 3-dimensional grid with all the possible combinations of the hyperparameter values. The second step is to perform cross-validation on each of the possible combinations of hyperparameter values.
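The grid construction step can be sketched with the standard library alone. This illustrates the idea, not scikit-learn's implementation; the dictionary reuses the tuning params printed for the three-hyperparameter pipeline above.

```python
from itertools import product

param_grid = {
    "select_k__k": [2, 3, 4],
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, 1e-1, "scale", "auto"],
}

# One candidate per point of the 3-dimensional grid.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(candidates))  # 3 * 4 * 5 = 60
print(candidates[0])    # {'select_k__k': 2, 'svm__C': 0.01, 'svm__gamma': 0.001}
```

With a 5-fold inner cross-validation, each of these 60 candidates is fitted 5 times, so the size of the grid directly drives the computational cost of the search.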
Another searcher that julearn provides is the RandomizedSearchCV. This searcher is similar to the GridSearchCV, but instead of trying all the possible combinations of hyperparameter values, it tries a random subset of them. This is useful when we have a lot of hyperparameters to tune, since trying all the possible combinations can be very time consuming, and when we have continuous hyperparameters whose values can be sampled from a distribution. For more information, see the RandomizedSearchCV documentation.
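Both uses of a randomized search can be sketched without any library. The code below is a conceptual illustration, not scikit-learn's implementation: the grid, the number of draws, the seed, and the helper sample_log_uniform are all assumptions made for the example.

```python
import math
import random
from itertools import product

rng = random.Random(42)

# (a) Try a random subset of a discrete grid instead of every combination.
grid = list(product([2, 3, 4], [0.01, 0.1, 1, 10], [1e-3, 1e-2, 1e-1]))
subset = rng.sample(grid, k=10)  # 10 distinct combinations out of 36


# (b) Sample a continuous hyperparameter from a distribution
#     (here: log-uniform, so each order of magnitude is equally likely).
def sample_log_uniform(low, high):
    return math.exp(rng.uniform(math.log(low), math.log(high)))


c_samples = [sample_log_uniform(1e-2, 1e1) for _ in range(10)]

print(len(grid), len(subset))
print([round(c, 4) for c in c_samples])
```

The budget (10 draws here) is fixed by the user, so the cost of the search no longer grows with the number of hyperparameters or candidate values.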
Tuning more than one grid
Following our tuning of the SVC hyperparameters, we can also see that we can tune the kernel hyperparameter. This hyperparameter can also be “linear”. Let’s see what our grid of hyperparameters would look like if we add this hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["linear", "rbf"],
)
print(creator)
2024-01-23 10:58:02,408 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:02,409 - julearn - INFO - Step added
2024-01-23 10:58:02,409 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:02,409 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:58:02,409 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:02,409 - julearn - INFO - Tuning hyperparameter kernel = ['linear', 'rbf']
2024-01-23 10:58:02,409 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto'], 'svm__kernel': ['linear', 'rbf']}
We can see that the grid of hyperparameters is now 3-dimensional. However, there are some combinations that don’t make much sense. For example, of the two kernels we try, the gamma hyperparameter is only used by rbf. So we will be trying the linear kernel with each one of the 4 different gamma and 4 different C values. That is 16 combinations, 12 of which are redundant because gamma has no effect on the linear kernel.
We can avoid this by using multiple grids: one grid for the linear kernel and one grid for the rbf kernel.
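The saving can be counted with a short, library-free sketch. The helper expand is hypothetical; the list-of-dictionaries layout is used here only to mirror the idea of one grid per kernel.

```python
from itertools import product


def expand(grid):
    """List every combination of values in a single grid."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


single_grid = {
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, "scale", "auto"],
    "svm__kernel": ["linear", "rbf"],
}

two_grids = [
    # rbf kernel: tune both C and gamma.
    {
        "svm__C": [0.01, 0.1, 1, 10],
        "svm__gamma": [1e-3, 1e-2, "scale", "auto"],
        "svm__kernel": ["rbf"],
    },
    # linear kernel: gamma is irrelevant, so only C is tuned.
    {"svm__C": [0.01, 0.1, 1, 10], "svm__kernel": ["linear"]},
]

n_single = len(expand(single_grid))
n_multi = sum(len(expand(g)) for g in two_grids)
print(n_single, n_multi)  # 32 20
```

Splitting the search into two grids removes the 12 redundant linear-kernel candidates while still covering every meaningful combination.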
julearn allows specifying multiple grids using two different approaches.
Repeating the step name with different hyperparameters:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
    name="svm",
)
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
    name="svm",
)
print(creator)
scores1, model1 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores1['test_score'].mean()}")
pprint(model1.best_params_)
2024-01-23 10:58:02,410 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:02,410 - julearn - INFO - Step added
2024-01-23 10:58:02,410 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:02,410 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:58:02,410 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:02,410 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-01-23 10:58:02,411 - julearn - INFO - Step added
2024-01-23 10:58:02,411 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:02,411 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:58:02,411 - julearn - INFO - Setting hyperparameter kernel = linear
2024-01-23 10:58:02,411 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto']}
Step 2: svm
estimator: SVC(kernel='linear')
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10]}
2024-01-23 10:58:02,412 - julearn - INFO - ==== Input Data ====
2024-01-23 10:58:02,412 - julearn - INFO - Using dataframe as input
2024-01-23 10:58:02,412 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:58:02,412 - julearn - INFO - Target: species
2024-01-23 10:58:02,412 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:58:02,412 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:58:02,412 - julearn - INFO - ====================
2024-01-23 10:58:02,412 - julearn - INFO -
2024-01-23 10:58:02,413 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:02,413 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:02,413 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:02,413 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:02,413 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:02,413 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:02,413 - julearn - INFO - Search Parameters:
2024-01-23 10:58:02,413 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:02,414 - julearn - INFO - ====================
2024-01-23 10:58:02,414 - julearn - INFO -
2024-01-23 10:58:02,414 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:02,414 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:02,414 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:02,414 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:02,414 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:02,414 - julearn - INFO - Search Parameters:
2024-01-23 10:58:02,414 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:02,415 - julearn - INFO - ====================
2024-01-23 10:58:02,415 - julearn - INFO -
2024-01-23 10:58:02,415 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:02,415 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:02,415 - julearn - INFO - Hyperparameters list:
2024-01-23 10:58:02,415 - julearn - INFO - Set 0
2024-01-23 10:58:02,415 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:02,415 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:02,415 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:02,415 - julearn - INFO - svm: [SVC()]
2024-01-23 10:58:02,415 - julearn - INFO - Set 1
2024-01-23 10:58:02,415 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:02,416 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:02,416 - julearn - INFO - svm: [SVC(kernel='linear')]
2024-01-23 10:58:02,416 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:02,416 - julearn - INFO - Search Parameters:
2024-01-23 10:58:02,416 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:02,416 - julearn - INFO - ====================
2024-01-23 10:58:02,416 - julearn - INFO -
2024-01-23 10:58:02,416 - julearn - INFO - = Data Information =
2024-01-23 10:58:02,416 - julearn - INFO - Problem type: classification
2024-01-23 10:58:02,417 - julearn - INFO - Number of samples: 100
2024-01-23 10:58:02,417 - julearn - INFO - Number of features: 4
2024-01-23 10:58:02,417 - julearn - INFO - ====================
2024-01-23 10:58:02,417 - julearn - INFO -
2024-01-23 10:58:02,417 - julearn - INFO - Number of classes: 2
2024-01-23 10:58:02,417 - julearn - INFO - Target type: object
2024-01-23 10:58:02,417 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:58:02,417 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:02,418 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']}),
'svm': SVC(),
'svm__C': 10,
'svm__gamma': 0.01}
Important
Note that the name parameter is required when repeating a step name. If we do not specify the name parameter, julearn will auto-determine the step name in a unique way. The only way to force repeated names is to do so explicitly.
Using multiple pipeline creators:
creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
"svm",
C=[0.01, 0.1, 1, 10],
gamma=[1e-3, 1e-2, "scale", "auto"],
kernel=["rbf"],
)
creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
"svm",
C=[0.01, 0.1, 1, 10],
kernel=["linear"],
)
scores2, model2 = run_cross_validation(
X=X,
y=y,
data=df,
X_types=X_types,
model=[creator1, creator2],
return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores2['test_score'].mean()}")
pprint(model2.best_params_)
2024-01-23 10:58:06,892 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:06,892 - julearn - INFO - Step added
2024-01-23 10:58:06,892 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:06,892 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:58:06,892 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:06,892 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-01-23 10:58:06,892 - julearn - INFO - Step added
2024-01-23 10:58:06,892 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:06,892 - julearn - INFO - Step added
2024-01-23 10:58:06,892 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:06,892 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:58:06,892 - julearn - INFO - Setting hyperparameter kernel = linear
2024-01-23 10:58:06,892 - julearn - INFO - Step added
2024-01-23 10:58:06,892 - julearn - INFO - ==== Input Data ====
2024-01-23 10:58:06,893 - julearn - INFO - Using dataframe as input
2024-01-23 10:58:06,893 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:58:06,893 - julearn - INFO - Target: species
2024-01-23 10:58:06,893 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:58:06,893 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:58:06,893 - julearn - INFO - ====================
2024-01-23 10:58:06,893 - julearn - INFO -
2024-01-23 10:58:06,894 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:06,894 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:06,894 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:06,894 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:06,894 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:06,894 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:06,894 - julearn - INFO - Search Parameters:
2024-01-23 10:58:06,894 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:06,895 - julearn - INFO - ====================
2024-01-23 10:58:06,895 - julearn - INFO -
2024-01-23 10:58:06,895 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:06,895 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:06,895 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:06,895 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:06,895 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:06,895 - julearn - INFO - Search Parameters:
2024-01-23 10:58:06,895 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:06,896 - julearn - INFO - ====================
2024-01-23 10:58:06,896 - julearn - INFO -
2024-01-23 10:58:06,896 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:06,896 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:06,896 - julearn - INFO - Hyperparameters list:
2024-01-23 10:58:06,896 - julearn - INFO - Set 0
2024-01-23 10:58:06,896 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:06,896 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:06,896 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:06,896 - julearn - INFO - zscore: [StandardScaler()]
2024-01-23 10:58:06,896 - julearn - INFO - svm: [SVC()]
2024-01-23 10:58:06,897 - julearn - INFO - Set 1
2024-01-23 10:58:06,897 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:06,897 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:06,897 - julearn - INFO - zscore: [StandardScaler()]
2024-01-23 10:58:06,897 - julearn - INFO - svm: [SVC(kernel='linear')]
2024-01-23 10:58:06,897 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:06,897 - julearn - INFO - Search Parameters:
2024-01-23 10:58:06,898 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:06,898 - julearn - INFO - ====================
2024-01-23 10:58:06,898 - julearn - INFO -
2024-01-23 10:58:06,898 - julearn - INFO - = Data Information =
2024-01-23 10:58:06,898 - julearn - INFO - Problem type: classification
2024-01-23 10:58:06,898 - julearn - INFO - Number of samples: 100
2024-01-23 10:58:06,898 - julearn - INFO - Number of features: 4
2024-01-23 10:58:06,898 - julearn - INFO - ====================
2024-01-23 10:58:06,898 - julearn - INFO -
2024-01-23 10:58:06,898 - julearn - INFO - Number of classes: 2
2024-01-23 10:58:06,898 - julearn - INFO - Target type: object
2024-01-23 10:58:06,898 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:58:06,899 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:06,899 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']}),
'svm': SVC(),
'svm__C': 10,
'svm__gamma': 0.01,
'zscore': StandardScaler()}
Important
All the pipeline creators must have the same problem type and step names in order for this approach to work.
Indeed, if we compare both approaches, we can see that they are equivalent. Both produce the same grid of hyperparameters; the only difference is that the second grid also lists the zscore step, which has a single fixed value:
pprint(model1.param_grid)
pprint(model2.param_grid)
[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC()],
'svm__C': [0.01, 0.1, 1, 10],
'svm__gamma': [0.001, 0.01, 'scale', 'auto']},
{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC(kernel='linear')],
'svm__C': [0.01, 0.1, 1, 10]}]
[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC()],
'svm__C': [0.01, 0.1, 1, 10],
'svm__gamma': [0.001, 0.01, 'scale', 'auto'],
'zscore': [StandardScaler()]},
{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC(kernel='linear')],
'svm__C': [0.01, 0.1, 1, 10],
'zscore': [StandardScaler()]}]
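The two printed grids differ only in the zscore entry, which holds a single value. The sketch below illustrates why such an entry leaves the search space unchanged. It uses only the standard library, and strings stand in for the estimator objects. The expand helper is hypothetical: it is not part of julearn or scikit-learn, and only mimics how a list of grids is commonly expanded into individual candidates.

```python
from itertools import product


def expand(grids):
    """Expand a list of grids into the individual candidate settings."""
    candidates = []
    for grid in grids:
        keys = sorted(grid)
        # Cartesian product of the value lists within one grid
        for values in product(*(grid[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


# Strings stand in for the estimator objects shown in the printed grids.
grid_one_creator = [
    {
        "svm": ["SVC()"],
        "svm__C": [0.01, 0.1, 1, 10],
        "svm__gamma": [0.001, 0.01, "scale", "auto"],
    },
    {
        "svm": ["SVC(kernel='linear')"],
        "svm__C": [0.01, 0.1, 1, 10],
    },
]

# Same grids plus a single-valued entry for the scaler step.
grid_many_creators = [
    dict(grid, zscore=["StandardScaler()"]) for grid in grid_one_creator
]

candidates_one = expand(grid_one_creator)
candidates_many = expand(grid_many_creators)

# Dropping the constant entry recovers the first set of candidates.
stripped = [
    {k: v for k, v in cand.items() if k != "zscore"}
    for cand in candidates_many
]

print(len(candidates_one), len(candidates_many))  # 20 20
print(stripped == candidates_one)  # True
```

Each grid in the list is expanded on its own, so the total is 4 x 4 = 16 candidates for the rbf kernel plus 4 for the linear kernel.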
Models as hyperparameters#
But why stop there? Models can also be considered as hyperparameters. For example, we can try different models for the classification task. Let’s try the RandomForestClassifier and the LogisticRegression too:
creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
"svm",
C=[0.01, 0.1, 1, 10],
gamma=[1e-3, 1e-2, "scale", "auto"],
kernel=["rbf"],
name="model",
)
creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
"svm",
C=[0.01, 0.1, 1, 10],
kernel=["linear"],
name="model",
)
creator3 = PipelineCreator(problem_type="classification")
creator3.add("zscore")
creator3.add(
"rf",
max_depth=[2, 3, 4],
name="model",
)
creator4 = PipelineCreator(problem_type="classification")
creator4.add("zscore")
creator4.add(
"logit",
penalty=["l2", "l1"],
dual=[False],
solver="liblinear",
name="model",
)
scores3, model3 = run_cross_validation(
X=X,
y=y,
data=df,
X_types=X_types,
model=[creator1, creator2, creator3, creator4],
return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores3['test_score'].mean()}")
pprint(model3.best_params_)
2024-01-23 10:58:11,396 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,397 - julearn - INFO - Step added
2024-01-23 10:58:11,397 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,397 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:58:11,397 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:11,397 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-01-23 10:58:11,397 - julearn - INFO - Step added
2024-01-23 10:58:11,397 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,397 - julearn - INFO - Step added
2024-01-23 10:58:11,397 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,397 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-01-23 10:58:11,397 - julearn - INFO - Setting hyperparameter kernel = linear
2024-01-23 10:58:11,397 - julearn - INFO - Step added
2024-01-23 10:58:11,397 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,397 - julearn - INFO - Step added
2024-01-23 10:58:11,398 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,398 - julearn - INFO - Tuning hyperparameter max_depth = [2, 3, 4]
2024-01-23 10:58:11,398 - julearn - INFO - Step added
2024-01-23 10:58:11,398 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,398 - julearn - INFO - Step added
2024-01-23 10:58:11,398 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-01-23 10:58:11,398 - julearn - INFO - Tuning hyperparameter penalty = ['l2', 'l1']
2024-01-23 10:58:11,398 - julearn - INFO - Setting hyperparameter dual = False
2024-01-23 10:58:11,398 - julearn - INFO - Setting hyperparameter solver = liblinear
2024-01-23 10:58:11,398 - julearn - INFO - Step added
2024-01-23 10:58:11,398 - julearn - INFO - ==== Input Data ====
2024-01-23 10:58:11,398 - julearn - INFO - Using dataframe as input
2024-01-23 10:58:11,398 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:58:11,398 - julearn - INFO - Target: species
2024-01-23 10:58:11,398 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-01-23 10:58:11,398 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-01-23 10:58:11,399 - julearn - INFO - ====================
2024-01-23 10:58:11,399 - julearn - INFO -
2024-01-23 10:58:11,400 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:11,400 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:11,400 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:11,400 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:11,400 - julearn - INFO - model__gamma: [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:11,400 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,400 - julearn - INFO - Search Parameters:
2024-01-23 10:58:11,400 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,400 - julearn - INFO - ====================
2024-01-23 10:58:11,400 - julearn - INFO -
2024-01-23 10:58:11,401 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:11,401 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:11,401 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:11,401 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:11,401 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,401 - julearn - INFO - Search Parameters:
2024-01-23 10:58:11,401 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,401 - julearn - INFO - ====================
2024-01-23 10:58:11,401 - julearn - INFO -
2024-01-23 10:58:11,402 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:11,402 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:11,402 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:11,402 - julearn - INFO - model__max_depth: [2, 3, 4]
2024-01-23 10:58:11,402 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,402 - julearn - INFO - Search Parameters:
2024-01-23 10:58:11,402 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,402 - julearn - INFO - ====================
2024-01-23 10:58:11,402 - julearn - INFO -
2024-01-23 10:58:11,403 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:11,403 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:11,403 - julearn - INFO - Hyperparameters:
2024-01-23 10:58:11,403 - julearn - INFO - model__penalty: ['l2', 'l1']
2024-01-23 10:58:11,403 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,403 - julearn - INFO - Search Parameters:
2024-01-23 10:58:11,403 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,403 - julearn - INFO - ====================
2024-01-23 10:58:11,403 - julearn - INFO -
2024-01-23 10:58:11,403 - julearn - INFO - = Model Parameters =
2024-01-23 10:58:11,403 - julearn - INFO - Tuning hyperparameters using grid
2024-01-23 10:58:11,403 - julearn - INFO - Hyperparameters list:
2024-01-23 10:58:11,403 - julearn - INFO - Set 0
2024-01-23 10:58:11,403 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:11,403 - julearn - INFO - model__gamma: [0.001, 0.01, 'scale', 'auto']
2024-01-23 10:58:11,404 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:11,404 - julearn - INFO - zscore: [StandardScaler()]
2024-01-23 10:58:11,404 - julearn - INFO - model: [SVC()]
2024-01-23 10:58:11,404 - julearn - INFO - Set 1
2024-01-23 10:58:11,404 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-01-23 10:58:11,404 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:11,405 - julearn - INFO - zscore: [StandardScaler()]
2024-01-23 10:58:11,405 - julearn - INFO - model: [SVC(kernel='linear')]
2024-01-23 10:58:11,405 - julearn - INFO - Set 2
2024-01-23 10:58:11,405 - julearn - INFO - model__max_depth: [2, 3, 4]
2024-01-23 10:58:11,405 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:11,405 - julearn - INFO - zscore: [StandardScaler()]
2024-01-23 10:58:11,405 - julearn - INFO - model: [RandomForestClassifier()]
2024-01-23 10:58:11,406 - julearn - INFO - Set 3
2024-01-23 10:58:11,406 - julearn - INFO - model__penalty: ['l2', 'l1']
2024-01-23 10:58:11,406 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-01-23 10:58:11,406 - julearn - INFO - zscore: [StandardScaler()]
2024-01-23 10:58:11,406 - julearn - INFO - model: [LogisticRegression(solver='liblinear')]
2024-01-23 10:58:11,406 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,406 - julearn - INFO - Search Parameters:
2024-01-23 10:58:11,406 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,407 - julearn - INFO - ====================
2024-01-23 10:58:11,407 - julearn - INFO -
2024-01-23 10:58:11,407 - julearn - INFO - = Data Information =
2024-01-23 10:58:11,407 - julearn - INFO - Problem type: classification
2024-01-23 10:58:11,407 - julearn - INFO - Number of samples: 100
2024-01-23 10:58:11,407 - julearn - INFO - Number of features: 4
2024-01-23 10:58:11,407 - julearn - INFO - ====================
2024-01-23 10:58:11,407 - julearn - INFO -
2024-01-23 10:58:11,407 - julearn - INFO - Number of classes: 2
2024-01-23 10:58:11,407 - julearn - INFO - Target type: object
2024-01-23 10:58:11,407 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-01-23 10:58:11,408 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-01-23 10:58:11,408 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9200000000000002
{'model': SVC(),
'model__C': 10,
'model__gamma': 0.01,
'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']}),
'zscore': StandardScaler()}
Well, it seems that nothing can beat the SVC with kernel="rbf" for our classification example.
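Treating models as hyperparameters enlarges the search, and every candidate is evaluated inside a nested cross-validation. The sketch below is a rough estimate of how many model fits this implies. The grid sizes and the 5-fold inner and outer schemes are taken from the log above. Two things are assumptions rather than statements from this page: that the search refits the best candidate once per search, and that return_estimator="all" triggers one additional search on the full data.

```python
from math import prod

# Tuned hyperparameters per creator, as reported in the log above.
grids = [
    {"model__C": [0.01, 0.1, 1, 10],
     "model__gamma": [0.001, 0.01, "scale", "auto"]},  # SVC, rbf kernel
    {"model__C": [0.01, 0.1, 1, 10]},                  # SVC, linear kernel
    {"model__max_depth": [2, 3, 4]},                   # random forest
    {"model__penalty": ["l2", "l1"]},                  # logistic regression
]


def n_candidates(grids):
    """Number of settings in a list of grids: the sum of per-grid products."""
    return sum(prod(len(values) for values in grid.values()) for grid in grids)


inner_splits = 5  # inner KFold reported in the log
outer_splits = 5  # outer KFold reported in the log

candidates = n_candidates(grids)

# Assumption: one refit of the best candidate after each inner search.
fits_per_search = candidates * inner_splits + 1

# One search per outer fold.
outer_fits = outer_splits * fits_per_search

# Assumption: one more search on the full data for the returned final model.
total_fits = outer_fits + fits_per_search

print(candidates, fits_per_search, outer_fits, total_fits)  # 25 126 630 756
```

Under these assumptions, adding the random forest and logistic regression creators raises the number of candidates from 20 to 25, so the cost grows with the sum of the grid sizes rather than their product.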
Total running time of the script: (0 minutes 49.058 seconds)