6.3. Hyperparameter Tuning
Parameters vs Hyperparameters
Parameters are the values that define the model, and are learned from the data. For example, the weights of a linear regression model are parameters. The parameters of a model are learned during training and are not set by the user.
Hyperparameters are values that also define the model, but are not learned
from the data. For example, the regularization parameter C of a Support Vector
Machine (SVC) model is a hyperparameter. The hyperparameters of a model are set
by the user before training and are not learned during training.
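To make the distinction concrete, here is a minimal sketch in plain Python. It is not part of julearn: the function `fit_slope` and the penalty `alpha` are hypothetical names chosen for illustration. The slope `w` is a parameter (computed from the data), while `alpha` is a hyperparameter (chosen by the user before fitting). It plays a role comparable to C, with the caveat that in an SVC a larger C means weaker regularization.

```python
def fit_slope(xs, ys, alpha):
    """Fit y ~ w * x by least squares with an L2 penalty alpha * w**2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Closed-form minimizer of sum((y - w*x)**2) + alpha * w**2
    return sxy / (sxx + alpha)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with a true slope of 2

# alpha is set by the user (hyperparameter); w is learned (parameter).
w_unregularized = fit_slope(xs, ys, alpha=0.0)
w_regularized = fit_slope(xs, ys, alpha=10.0)
print(w_unregularized, w_regularized)  # 2.0 1.5
```

The same data yields a different learned slope depending on the value chosen for `alpha`, which is why the choice of hyperparameters matters.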
Let’s see an example of an SVC model with a regularization parameter C. We
will use the iris dataset, which contains measurements of flowers.
We start by loading the dataset and setting the features and target variables.
from seaborn import load_dataset
from pprint import pprint # To print in a pretty way
df = load_dataset("iris")
X = df.columns[:-1].tolist()
y = "species"
X_types = {"continuous": X}
# The dataset has three species. We will keep two to perform a binary
# classification.
df = df[df["species"].isin(["versicolor", "virginica"])]
We can now use the PipelineCreator to create a pipeline with a StandardScaler
(the "zscore" step) and an SVC, with the regularization parameter C set to 0.1.
from julearn.pipeline import PipelineCreator
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=0.1)
print(creator)
2024-05-03 15:26:14,565 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,565 - julearn - INFO - Step added
2024-05-03 15:26:14,565 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,565 - julearn - INFO - Setting hyperparameter C = 0.1
2024-05-03 15:26:14,565 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC(C=0.1)
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
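For readers more familiar with scikit-learn, the two estimators printed above correspond roughly to the following plain scikit-learn pipeline. This is only a sketch of the equivalent estimators: it assumes scikit-learn is installed and does not reproduce julearn's handling of column types.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Step names mirror the julearn step names used above.
pipeline = Pipeline(
    steps=[
        ("zscore", StandardScaler()),
        ("svm", SVC(C=0.1)),
    ]
)

# Hyperparameters of a step are addressed as <step name>__<parameter name>.
print(pipeline.get_params()["svm__C"])
```

The double-underscore naming is the same convention that appears later in the "tuning params" output.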
Hyperparameter Tuning
Since it is the user who sets the hyperparameters, it is important to choose good values. This is not always easy, and it is common to try several values and keep the one that works best. This process is called hyperparameter tuning.
For example, we can try different values for the regularization parameter C
of the SVC model and see which one works best.
from julearn import run_cross_validation
scores1 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)
print(f"Score with C=0.1: {scores1['test_score'].mean()}")
creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add("svm", C=0.01)
scores2 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator2,
)
print(f"Score with C=0.01: {scores2['test_score'].mean()}")
2024-05-03 15:26:14,566 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:14,566 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:14,566 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,566 - julearn - INFO - Target: species
2024-05-03 15:26:14,567 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,567 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:14,567 - julearn - INFO - ====================
2024-05-03 15:26:14,567 - julearn - INFO -
2024-05-03 15:26:14,568 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:14,568 - julearn - INFO - ====================
2024-05-03 15:26:14,568 - julearn - INFO -
2024-05-03 15:26:14,568 - julearn - INFO - = Data Information =
2024-05-03 15:26:14,568 - julearn - INFO - Problem type: classification
2024-05-03 15:26:14,568 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:14,568 - julearn - INFO - Number of features: 4
2024-05-03 15:26:14,568 - julearn - INFO - ====================
2024-05-03 15:26:14,568 - julearn - INFO -
2024-05-03 15:26:14,568 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:14,568 - julearn - INFO - Target type: object
2024-05-03 15:26:14,569 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:14,569 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,569 - julearn - INFO - Binary classification problem detected.
Score with C=0.1: 0.8099999999999999
2024-05-03 15:26:14,608 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,608 - julearn - INFO - Step added
2024-05-03 15:26:14,608 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,608 - julearn - INFO - Setting hyperparameter C = 0.01
2024-05-03 15:26:14,608 - julearn - INFO - Step added
2024-05-03 15:26:14,608 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:14,608 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:14,608 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,608 - julearn - INFO - Target: species
2024-05-03 15:26:14,608 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,608 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:14,609 - julearn - INFO - ====================
2024-05-03 15:26:14,609 - julearn - INFO -
2024-05-03 15:26:14,609 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:14,609 - julearn - INFO - ====================
2024-05-03 15:26:14,610 - julearn - INFO -
2024-05-03 15:26:14,610 - julearn - INFO - = Data Information =
2024-05-03 15:26:14,610 - julearn - INFO - Problem type: classification
2024-05-03 15:26:14,610 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:14,610 - julearn - INFO - Number of features: 4
2024-05-03 15:26:14,610 - julearn - INFO - ====================
2024-05-03 15:26:14,610 - julearn - INFO -
2024-05-03 15:26:14,610 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:14,610 - julearn - INFO - Target type: object
2024-05-03 15:26:14,610 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:14,611 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,611 - julearn - INFO - Binary classification problem detected.
Score with C=0.01: 0.19
We can see that the model with C=0.1 works better than the model with C=0.01.
However, to be sure that C=0.1 is the best value, we should try more values.
Since this is only one hyperparameter, that is not difficult. But what if we
have more hyperparameters? And what if we have several steps in the pipeline
(e.g. feature selection, PCA, etc.)?
This is a major problem: the more hyperparameter values we compare, the more
times we use the same data for training and testing. Picking the best value
based on those same test scores usually gives an optimistic estimate of the
performance of the model.
To prevent this, we can use a technique called nested cross-validation. That is, we use cross-validation to tune the hyperparameters, and then we use cross-validation again to estimate the performance of the model using the best hyperparameter set. It is called nested because we first split the data into training and testing sets to estimate the model performance (outer loop), and then we split each training set again to tune the hyperparameters (inner loop).
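The nesting can be sketched in plain scikit-learn by passing a GridSearchCV (inner loop) to cross_val_score (outer loop). This is an illustration of the concept, not julearn's internal code. It assumes scikit-learn is installed, loads iris through `load_iris` instead of seaborn, and uses shuffled stratified folds, which differs from the unshuffled KFold shown in the logs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Keep two of the three species for a binary problem.
X_all, y_all = load_iris(return_X_y=True)
mask = y_all != 0  # drop setosa; keep versicolor (1) and virginica (2)
X_bin, y_bin = X_all[mask], y_all[mask]

pipe = Pipeline([("zscore", StandardScaler()), ("svm", SVC())])

# Inner loop: choose C using only the training part of each outer split.
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, {"svm__C": [0.01, 0.1, 1, 10]}, cv=inner_cv)

# Outer loop: estimate the performance of the whole tuning procedure.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_scores = cross_val_score(search, X_bin, y_bin, cv=outer_cv)
print(outer_scores.mean())
```

Each outer test fold is never seen while C is being selected, so the averaged outer score is not biased by the selection step.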
julearn has a simple way to do hyperparameter tuning using nested
cross-validation. When we use a PipelineCreator to create a pipeline, we can
set the hyperparameters we want to tune and the values we want to try.
For example, we can try different values for the regularization parameter C
of the SVC model:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10])
print(creator)
2024-05-03 15:26:14,649 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,649 - julearn - INFO - Step added
2024-05-03 15:26:14,649 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,649 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:14,649 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10]}
As we can see above, the creator now shows that the C hyperparameter will be
tuned. We can now use this creator to run cross-validation. This will tune the
hyperparameters and estimate the performance of the model using the best
hyperparameter set.
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
2024-05-03 15:26:14,650 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:14,650 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:14,650 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,650 - julearn - INFO - Target: species
2024-05-03 15:26:14,650 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,650 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:14,651 - julearn - INFO - ====================
2024-05-03 15:26:14,651 - julearn - INFO -
2024-05-03 15:26:14,651 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:14,651 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:14,651 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:14,652 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:14,652 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,652 - julearn - INFO - Search Parameters:
2024-05-03 15:26:14,652 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,652 - julearn - INFO - ====================
2024-05-03 15:26:14,652 - julearn - INFO -
2024-05-03 15:26:14,652 - julearn - INFO - = Data Information =
2024-05-03 15:26:14,652 - julearn - INFO - Problem type: classification
2024-05-03 15:26:14,652 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:14,652 - julearn - INFO - Number of features: 4
2024-05-03 15:26:14,652 - julearn - INFO - ====================
2024-05-03 15:26:14,652 - julearn - INFO -
2024-05-03 15:26:14,652 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:14,652 - julearn - INFO - Target type: object
2024-05-03 15:26:14,653 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:14,653 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,653 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:15,439 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001
We can see that the model with the best hyperparameters works better than
the model with C=0.1. But what’s the best hyperparameter set? We can see it by
printing the model_tuned.best_params_ attribute.
pprint(model_tuned.best_params_)
{'svm__C': 1}
We can see that the best hyperparameter set is C=1. Since this value is not
on the boundary of the values we tried, we can be reasonably confident that
the range we searched for C was adequate.
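The boundary check described above can be written as a small helper. The function `on_boundary` is hypothetical (it is not part of julearn or scikit-learn) and only compares the selected value against the smallest and largest numeric candidates.

```python
def on_boundary(best_value, candidates):
    """Return True if best_value is the smallest or largest numeric candidate."""
    # Ignore non-numeric candidates such as the strings "scale" or "auto".
    numeric = sorted(v for v in candidates if isinstance(v, (int, float)))
    if not numeric:
        return False
    return best_value in (numeric[0], numeric[-1])


c_values = [0.01, 0.1, 1, 10]
print(on_boundary(1, c_values))   # False: neighbours on both sides were tried
print(on_boundary(10, c_values))  # True: larger values were never tried
```

A result of True suggests widening the search range in that direction before trusting the selected value.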
However, by checking the SVC documentation, we can see that there are more
hyperparameters that we can tune. For example, for the default rbf kernel, we
can tune the gamma hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10], gamma=[0.01, 0.1, 1, 10])
print(creator)
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)
2024-05-03 15:26:15,596 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:15,596 - julearn - INFO - Step added
2024-05-03 15:26:15,596 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:15,596 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,596 - julearn - INFO - Tuning hyperparameter gamma = [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,596 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.01, 0.1, 1, 10]}
2024-05-03 15:26:15,596 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:15,597 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:15,597 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:15,597 - julearn - INFO - Target: species
2024-05-03 15:26:15,597 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:15,597 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:15,597 - julearn - INFO - ====================
2024-05-03 15:26:15,597 - julearn - INFO -
2024-05-03 15:26:15,598 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:15,598 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:15,598 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:15,598 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,598 - julearn - INFO - svm__gamma: [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,598 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:15,598 - julearn - INFO - Search Parameters:
2024-05-03 15:26:15,598 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:15,598 - julearn - INFO - ====================
2024-05-03 15:26:15,598 - julearn - INFO -
2024-05-03 15:26:15,599 - julearn - INFO - = Data Information =
2024-05-03 15:26:15,599 - julearn - INFO - Problem type: classification
2024-05-03 15:26:15,599 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:15,599 - julearn - INFO - Number of features: 4
2024-05-03 15:26:15,599 - julearn - INFO - ====================
2024-05-03 15:26:15,599 - julearn - INFO -
2024-05-03 15:26:15,599 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:15,599 - julearn - INFO - Target type: object
2024-05-03 15:26:15,599 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:15,600 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:15,600 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:18,579 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}
We can see that the best hyperparameter set is C=10 and gamma=0.01.
But since both values lie on the boundary of the ranges we tried (gamma at the
lower end, C at the upper end), we should try more values to be sure that we
are using the best hyperparameter set.
We can even mix value types in the same list, for example numbers together
with the strings "scale" and "auto" for the gamma hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-5, 1e-4, 1e-3, 1e-2, "scale", "auto"],
)
print(creator)
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)
2024-05-03 15:26:19,183 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:19,184 - julearn - INFO - Step added
2024-05-03 15:26:19,184 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:19,184 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:19,184 - julearn - INFO - Tuning hyperparameter gamma = [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:19,184 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']}
2024-05-03 15:26:19,184 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:19,184 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:19,184 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:19,184 - julearn - INFO - Target: species
2024-05-03 15:26:19,185 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:19,185 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:19,185 - julearn - INFO - ====================
2024-05-03 15:26:19,185 - julearn - INFO -
2024-05-03 15:26:19,186 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:19,186 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:19,186 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:19,186 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:19,186 - julearn - INFO - svm__gamma: [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:19,186 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:19,186 - julearn - INFO - Search Parameters:
2024-05-03 15:26:19,186 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:19,186 - julearn - INFO - ====================
2024-05-03 15:26:19,186 - julearn - INFO -
2024-05-03 15:26:19,186 - julearn - INFO - = Data Information =
2024-05-03 15:26:19,186 - julearn - INFO - Problem type: classification
2024-05-03 15:26:19,186 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:19,186 - julearn - INFO - Number of features: 4
2024-05-03 15:26:19,187 - julearn - INFO - ====================
2024-05-03 15:26:19,187 - julearn - INFO -
2024-05-03 15:26:19,187 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:19,187 - julearn - INFO - Target type: object
2024-05-03 15:26:19,187 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:19,187 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:19,187 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:23,662 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}
We can even tune hyperparameters from different steps of the pipeline. Let’s
add a SelectKBest step to the pipeline and tune its k hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=[2, 3, 4])
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, 1e-1, "scale", "auto"],
)
print(creator)
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)
2024-05-03 15:26:24,561 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:24,561 - julearn - INFO - Step added
2024-05-03 15:26:24,561 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:24,561 - julearn - INFO - Tuning hyperparameter k = [2, 3, 4]
2024-05-03 15:26:24,561 - julearn - INFO - Step added
2024-05-03 15:26:24,561 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:24,561 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:24,561 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 0.1, 'scale', 'auto']
2024-05-03 15:26:24,561 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: select_k
estimator: SelectKBest()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'select_k__k': [2, 3, 4]}
Step 2: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 0.1, 'scale', 'auto']}
2024-05-03 15:26:24,562 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:24,562 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:24,562 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:24,562 - julearn - INFO - Target: species
2024-05-03 15:26:24,562 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:24,562 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:24,563 - julearn - INFO - ====================
2024-05-03 15:26:24,563 - julearn - INFO -
2024-05-03 15:26:24,563 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:24,563 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:24,563 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:24,564 - julearn - INFO - select_k__k: [2, 3, 4]
2024-05-03 15:26:24,564 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:24,564 - julearn - INFO - svm__gamma: [0.001, 0.01, 0.1, 'scale', 'auto']
2024-05-03 15:26:24,564 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:24,564 - julearn - INFO - Search Parameters:
2024-05-03 15:26:24,564 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:24,564 - julearn - INFO - ====================
2024-05-03 15:26:24,564 - julearn - INFO -
2024-05-03 15:26:24,564 - julearn - INFO - = Data Information =
2024-05-03 15:26:24,564 - julearn - INFO - Problem type: classification
2024-05-03 15:26:24,564 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:24,564 - julearn - INFO - Number of features: 4
2024-05-03 15:26:24,564 - julearn - INFO - ====================
2024-05-03 15:26:24,564 - julearn - INFO -
2024-05-03 15:26:24,564 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:24,564 - julearn - INFO - Target type: object
2024-05-03 15:26:24,565 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:24,565 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:24,565 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:38,809 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001
{'select_k__k': 4, 'svm__C': 10, 'svm__gamma': 0.01}
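The keys in this dictionary follow the `<step>__<hyperparameter>` naming seen in the "tuning params" output. As a small illustration in plain Python (the variable names are chosen here and are not part of julearn), the result printed above can be regrouped by pipeline step:

```python
# Best hyperparameter set as printed above.
best_params = {"select_k__k": 4, "svm__C": 10, "svm__gamma": 0.01}

by_step = {}
for key, value in best_params.items():
    # Split only on the first double underscore: step name, then parameter.
    step, param = key.split("__", 1)
    by_step.setdefault(step, {})[param] = value

print(by_step)
# {'select_k': {'k': 4}, 'svm': {'C': 10, 'gamma': 0.01}}
```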
But how does julearn find the optimal hyperparameter set?
Searchers
julearn uses the same concept as scikit-learn to tune hyperparameters:
it uses a searcher to find the best hyperparameter set. A searcher is an
object that receives a set of hyperparameters and their candidate values, and
then tries to find the best combination of values using cross-validation.
By default, julearn uses a GridSearchCV. This searcher, specified as "grid",
is very simple. First, it constructs the grid of hyperparameters to try. As we
saw above, we have 3 hyperparameters to tune, so it constructs a 3-dimensional
grid with all the possible combinations of the hyperparameter values. The
second step is to perform cross-validation on each of the possible
combinations of hyperparameter values.
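To give a sense of the size of such a grid, the following sketch enumerates it with the standard library for the three hyperparameters tuned in the last example. It only mimics the enumeration step and is not how GridSearchCV is implemented; the variable names are chosen for illustration.

```python
from itertools import product

param_grid = {
    "select_k__k": [2, 3, 4],
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, 1e-1, "scale", "auto"],
}

# Cartesian product of all candidate values: one dict per grid point.
names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

n_inner_folds = 5  # inner CV scheme shown in the logs
print(len(combinations))                  # 60
print(len(combinations) * n_inner_folds)  # 300
```

With 3 x 4 x 5 = 60 combinations and 5 inner folds, the search fits 300 models for every outer fold, which is why exhaustive grids become expensive quickly.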
Other searchers that julearn provides are the RandomizedSearchCV,
BayesSearchCV and OptunaSearchCV.
The randomized searcher (RandomizedSearchCV) is similar to the GridSearchCV,
but instead of trying all the possible combinations of hyperparameter values,
it tries a random subset of them. This is useful when we have many
hyperparameters to tune, since trying all the possible combinations can be
very time consuming. It is also useful for continuous hyperparameters, whose
values can be sampled from a distribution. For more information, see the
RandomizedSearchCV documentation.
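The idea of trying a random subset can be sketched with the standard library alone. This is an illustration of the concept under simplifying assumptions (a fixed seed, lists of candidate values only), not the RandomizedSearchCV implementation.

```python
import random
from itertools import product

param_grid = {
    "select_k__k": [2, 3, 4],
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, 1e-1, "scale", "auto"],
}
names = list(param_grid)
all_combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

# Evaluate only 10 randomly chosen grid points instead of all of them.
rng = random.Random(0)  # fixed seed so the subset is reproducible
sampled = rng.sample(all_combinations, k=10)
print(f"{len(sampled)} of {len(all_combinations)} combinations would be evaluated")
```

Only the sampled combinations would be cross-validated, so the cost is controlled by the number of iterations rather than by the size of the grid.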
The Bayesian searcher (BayesSearchCV) is a bit more complex. It uses Bayesian
optimization to find the best hyperparameter set. As with the randomized
search, it is useful when we have many hyperparameters to tune, and we don’t
want to try all the possible combinations due to computational constraints.
For more information, see the BayesSearchCV documentation, including how to
specify the prior distributions of the hyperparameters.
The Optuna searcher (OptunaSearchCV) uses the Optuna library to find the best
hyperparameter set. Optuna is a hyperparameter optimization framework that has
several algorithms to find the best hyperparameter set. For more information,
see the Optuna documentation.
We can specify the kind of searcher and its parametrization by setting the
search_params parameter in the run_cross_validation() function.
For example, we can use the RandomizedSearchCV searcher with 10 iterations of
random search.
search_params = {
    "kind": "random",
    "n_iter": 10,
}
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
    search_params=search_params,
)
print(
    "Scores with best hyperparameter using 10 iterations of "
    f"randomized search: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)
2024-05-03 15:26:41,675 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:41,675 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:41,675 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:41,675 - julearn - INFO - Target: species
2024-05-03 15:26:41,676 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:41,676 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:41,676 - julearn - INFO - ====================
2024-05-03 15:26:41,676 - julearn - INFO -
2024-05-03 15:26:41,677 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:41,677 - julearn - INFO - Tuning hyperparameters using random
2024-05-03 15:26:41,677 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:41,677 - julearn - INFO - select_k__k: [2, 3, 4]
2024-05-03 15:26:41,677 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:41,677 - julearn - INFO - svm__gamma: [0.001, 0.01, 0.1, 'scale', 'auto']
2024-05-03 15:26:41,677 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:41,677 - julearn - INFO - Search Parameters:
2024-05-03 15:26:41,677 - julearn - INFO - n_iter: 10
2024-05-03 15:26:41,677 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:41,677 - julearn - INFO - ====================
2024-05-03 15:26:41,677 - julearn - INFO -
2024-05-03 15:26:41,678 - julearn - INFO - = Data Information =
2024-05-03 15:26:41,678 - julearn - INFO - Problem type: classification
2024-05-03 15:26:41,678 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:41,678 - julearn - INFO - Number of features: 4
2024-05-03 15:26:41,678 - julearn - INFO - ====================
2024-05-03 15:26:41,678 - julearn - INFO -
2024-05-03 15:26:41,678 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:41,678 - julearn - INFO - Target type: object
2024-05-03 15:26:41,678 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:41,679 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:41,679 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:44,114 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of randomized search: 0.89
{'select_k__k': 3, 'svm__C': 1, 'svm__gamma': 'auto'}
We can now see that the best hyperparameter set might be different from the
one found by the grid search. This is because the randomized search tried only
10 combinations and not the whole grid.
Furthermore, the RandomizedSearchCV searcher can sample hyperparameters from
distributions, which can be useful when we have continuous hyperparameters.
Let’s set both C and gamma to be sampled from log-uniform distributions. We
can do this by setting the hyperparameter values as a tuple with the following
format: (low, high, distribution). The distribution can be either
"log-uniform" or "uniform".
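To make the term concrete, the sketch below draws values from a log-uniform distribution using only the standard library. The helper `sample_log_uniform` is hypothetical and only illustrates what such a distribution looks like; how julearn converts the tuple internally is not shown here.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform between low and high."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(0)  # fixed seed for reproducibility
c_samples = [sample_log_uniform(0.01, 10, rng) for _ in range(1000)]

# Each decade (0.01-0.1, 0.1-1, 1-10) is equally likely, so about two
# thirds of the draws fall below 1. A plain uniform distribution on the
# same interval would put only about 10% of the draws there.
share_below_one = sum(c < 1 for c in c_samples) / len(c_samples)
print(round(share_below_one, 2))
```

This is why log-uniform sampling is a common choice for hyperparameters such as C and gamma, whose plausible values span several orders of magnitude.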
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=[2, 3, 4])
creator.add(
    "svm",
    C=(0.01, 10, "log-uniform"),
    gamma=(1e-3, 1e-1, "log-uniform"),
)
print(creator)
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
    search_params=search_params,
)
print(
    "Scores with best hyperparameter using 10 iterations of "
    f"randomized search: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)
2024-05-03 15:26:44,601 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:44,601 - julearn - INFO - Step added
2024-05-03 15:26:44,601 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:44,601 - julearn - INFO - Tuning hyperparameter k = [2, 3, 4]
2024-05-03 15:26:44,601 - julearn - INFO - Step added
2024-05-03 15:26:44,601 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:44,601 - julearn - INFO - Tuning hyperparameter C = (0.01, 10, 'log-uniform')
2024-05-03 15:26:44,601 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:44,601 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: select_k
estimator: SelectKBest()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'select_k__k': [2, 3, 4]}
Step 2: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': (0.01, 10, 'log-uniform'), 'svm__gamma': (0.001, 0.1, 'log-uniform')}
2024-05-03 15:26:44,602 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:44,602 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:44,602 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:44,602 - julearn - INFO - Target: species
2024-05-03 15:26:44,602 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:44,602 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:44,603 - julearn - INFO - ====================
2024-05-03 15:26:44,603 - julearn - INFO -
2024-05-03 15:26:44,603 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:44,603 - julearn - INFO - Tuning hyperparameters using random
2024-05-03 15:26:44,603 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:44,604 - julearn - INFO - select_k__k: [2, 3, 4]
2024-05-03 15:26:44,604 - julearn - INFO - svm__C: (0.01, 10, 'log-uniform')
2024-05-03 15:26:44,604 - julearn - INFO - svm__gamma: (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:44,605 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:44,605 - julearn - INFO - Search Parameters:
2024-05-03 15:26:44,605 - julearn - INFO - n_iter: 10
2024-05-03 15:26:44,605 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:44,605 - julearn - INFO - ====================
2024-05-03 15:26:44,605 - julearn - INFO -
2024-05-03 15:26:44,605 - julearn - INFO - = Data Information =
2024-05-03 15:26:44,605 - julearn - INFO - Problem type: classification
2024-05-03 15:26:44,605 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:44,605 - julearn - INFO - Number of features: 4
2024-05-03 15:26:44,605 - julearn - INFO - ====================
2024-05-03 15:26:44,605 - julearn - INFO -
2024-05-03 15:26:44,606 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:44,606 - julearn - INFO - Target type: object
2024-05-03 15:26:44,606 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:44,606 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:44,606 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:47,052 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of randomized search: 0.95
{'select_k__k': 2,
'svm__C': 8.77140446796582,
'svm__gamma': 0.022636153281629743}
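To build some intuition for what a log-uniform distribution between 0.01 and 10 means, here is a minimal sketch that samples such values with the standard library. The helper sample_log_uniform is hypothetical and only illustrates the distribution; it is not how julearn or the searchers draw their candidates internally.

```python
import math
import random
import statistics


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
low, high = 0.01, 10
samples = [sample_log_uniform(low, high, rng) for _ in range(10_000)]

# Each decade (0.01-0.1, 0.1-1, 1-10) should receive roughly a third of the draws.
for start in (0.01, 0.1, 1):
    share = sum(start <= s < start * 10 for s in samples) / len(samples)
    print(f"[{start}, {start * 10}): {share:.2f}")

# The median sits near the geometric mean of the bounds, not the arithmetic mean.
print(f"median: {statistics.median(samples):.3f}")
print(f"geometric mean of bounds: {math.sqrt(low * high):.3f}")
```

This is why log-uniform distributions are a common choice for hyperparameters such as C and gamma, whose plausible values span several orders of magnitude.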
We can also control the number of cross-validation folds used by the searcher
by setting the cv parameter in the search_params dictionary. For example, we
can use a Bayesian search with 3 folds. The BayesSearchCV searcher also accepts
distributions for the hyperparameters, so we can reuse the same pipeline
creator.
search_params = {
"kind": "bayes",
"n_iter": 10,
"cv": 3,
}
scores_tuned, model_tuned = run_cross_validation(
X=X,
y=y,
data=df,
X_types=X_types,
model=creator,
return_estimator="all",
search_params=search_params,
)
print(
"Scores with best hyperparameter using 10 iterations of "
f"bayesian search and 3-fold CV: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)
2024-05-03 15:26:47,540 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:47,540 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:47,540 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:47,540 - julearn - INFO - Target: species
2024-05-03 15:26:47,541 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:47,541 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:47,541 - julearn - INFO - ====================
2024-05-03 15:26:47,541 - julearn - INFO -
2024-05-03 15:26:47,542 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:47,542 - julearn - INFO - Tuning hyperparameters using bayes
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:47,542 - julearn - INFO - select_k__k: [2, 3, 4]
2024-05-03 15:26:47,542 - julearn - INFO - svm__C: (0.01, 10, 'log-uniform')
2024-05-03 15:26:47,542 - julearn - INFO - svm__gamma: (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameter select_k__k as is [2, 3, 4]
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameter svm__C as is (0.01, 10, 'log-uniform')
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameter svm__gamma is log-uniform float [0.001, 0.1]
2024-05-03 15:26:47,543 - julearn - INFO - Using inner CV scheme KFold(n_splits=3, random_state=None, shuffle=False)
2024-05-03 15:26:47,543 - julearn - INFO - Search Parameters:
2024-05-03 15:26:47,543 - julearn - INFO - n_iter: 10
2024-05-03 15:26:47,543 - julearn - INFO - cv: KFold(n_splits=3, random_state=None, shuffle=False)
2024-05-03 15:26:47,546 - julearn - INFO - ====================
2024-05-03 15:26:47,546 - julearn - INFO -
2024-05-03 15:26:47,546 - julearn - INFO - = Data Information =
2024-05-03 15:26:47,546 - julearn - INFO - Problem type: classification
2024-05-03 15:26:47,546 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:47,546 - julearn - INFO - Number of features: 4
2024-05-03 15:26:47,546 - julearn - INFO - ====================
2024-05-03 15:26:47,546 - julearn - INFO -
2024-05-03 15:26:47,546 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:47,546 - julearn - INFO - Target type: object
2024-05-03 15:26:47,547 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:47,547 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:47,547 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:52,059 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of bayesian search and 3-fold CV: 0.9099999999999999
OrderedDict([('select_k__k', 4),
('svm__C', 3.8975984906619887),
('svm__gamma', 0.028707916525659204)])
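The effect of the cv entry in search_params may be easier to see with numbers. The sketch below computes the fold sizes of the outer and inner cross-validation reported in the logs above (5 outer folds, 3 inner folds, 100 samples), assuming the search runs inside each outer training set. The helper kfold_test_sizes is hypothetical and assumes the fold-size rule of an unshuffled K-fold split.

```python
def kfold_test_sizes(n_samples, n_splits):
    """Test-fold sizes of an unshuffled K-fold split.

    Assumption: mirrors scikit-learn's KFold, where the first
    n_samples % n_splits folds receive one extra sample.
    """
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]


n_samples = 100   # samples in the two-species iris subset
outer_splits = 5  # outer CV used to estimate performance
inner_splits = 3  # inner CV used by the searcher ("cv": 3)
n_iter = 10       # candidates tried by the searcher

outer_test = kfold_test_sizes(n_samples, outer_splits)
outer_train = [n_samples - size for size in outer_test]
inner_test = kfold_test_sizes(outer_train[0], inner_splits)

print("Outer train/test sizes:", list(zip(outer_train, outer_test)))
print("Inner validation sizes within one outer fold:", inner_test)

# Each candidate is fitted once per inner fold, for every outer fold
# (refits of the best candidate and the final model are not counted).
search_fits = outer_splits * n_iter * inner_splits
print("Model fits spent on the search:", search_fits)
```

Fewer inner folds reduce the number of fits, at the price of a noisier estimate for each candidate.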
An example using the optuna searcher is shown below. The searcher is specified
as "optuna", and the hyperparameters to tune are given together with their
distributions, as for the Bayesian searcher. However, the behaviour of the
optuna searcher is controlled by a Study object, which can be passed to the
searcher using the study parameter in the search_params dictionary.
Important
The optuna searcher requires that all the hyperparameters are specified as distributions, even the categorical ones.
We first modify the pipeline creator so that the k parameter of the select_k
step is specified as a distribution. As an example, we also use a categorical
distribution for the class_weight hyperparameter, trying the "balanced" and
None values.
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=(2, 4, "uniform"))
creator.add(
"svm",
C=(0.01, 10, "log-uniform"),
gamma=(1e-3, 1e-1, "log-uniform"),
class_weight=("balanced", None, "categorical")
)
print(creator)
2024-05-03 15:26:53,014 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:53,014 - julearn - INFO - Step added
2024-05-03 15:26:53,014 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:53,014 - julearn - INFO - Tuning hyperparameter k = (2, 4, 'uniform')
2024-05-03 15:26:53,015 - julearn - INFO - Step added
2024-05-03 15:26:53,015 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:53,015 - julearn - INFO - Tuning hyperparameter C = (0.01, 10, 'log-uniform')
2024-05-03 15:26:53,015 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:53,015 - julearn - INFO - Tuning hyperparameter class_weight = ('balanced', None, 'categorical')
2024-05-03 15:26:53,015 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: select_k
estimator: SelectKBest()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'select_k__k': (2, 4, 'uniform')}
Step 2: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': (0.01, 10, 'log-uniform'), 'svm__gamma': (0.001, 0.1, 'log-uniform'), 'svm__class_weight': ('balanced', None, 'categorical')}
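We can now use the optuna searcher with 3-fold cross-validation. The number of
trials is not set in search_params, so the searcher's default applies (10
trials for OptunaSearchCV).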
import optuna
study = optuna.create_study(
direction="maximize",
study_name="optuna-concept",
load_if_exists=True,
)
search_params = {
"kind": "optuna",
"study": study,
"cv": 3,
}
scores_tuned, model_tuned = run_cross_validation(
X=X,
y=y,
data=df,
X_types=X_types,
model=creator,
return_estimator="all",
search_params=search_params,
)
print(
"Scores with best hyperparameter using 10 iterations of "
f"optuna and 3-fold CV: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)
2024-05-03 15:26:53,017 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:53,017 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:53,017 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:53,017 - julearn - INFO - Target: species
2024-05-03 15:26:53,017 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:53,017 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:53,018 - julearn - INFO - ====================
2024-05-03 15:26:53,018 - julearn - INFO -
2024-05-03 15:26:53,019 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:53,019 - julearn - INFO - Tuning hyperparameters using optuna
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:53,019 - julearn - INFO - select_k__k: (2, 4, 'uniform')
2024-05-03 15:26:53,019 - julearn - INFO - svm__C: (0.01, 10, 'log-uniform')
2024-05-03 15:26:53,019 - julearn - INFO - svm__gamma: (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:53,019 - julearn - INFO - svm__class_weight: ('balanced', None, 'categorical')
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter select_k__k is uniform integer [2, 4]
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter svm__C is log-uniform float [0.01, 10]
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter svm__gamma is log-uniform float [0.001, 0.1]
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter svm__class_weight is categorical with 2 options: [balanced and None]
2024-05-03 15:26:53,020 - julearn - INFO - Using inner CV scheme KFold(n_splits=3, random_state=None, shuffle=False)
2024-05-03 15:26:53,020 - julearn - INFO - Search Parameters:
2024-05-03 15:26:53,020 - julearn - INFO - study: <optuna.study.study.Study object at 0x7fd476fc2830>
2024-05-03 15:26:53,020 - julearn - INFO - cv: KFold(n_splits=3, random_state=None, shuffle=False)
/home/runner/work/julearn/julearn/julearn/pipeline/pipeline_creator.py:1041: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
pipeline = search( # type: ignore
2024-05-03 15:26:53,020 - julearn - INFO - ====================
2024-05-03 15:26:53,020 - julearn - INFO -
2024-05-03 15:26:53,021 - julearn - INFO - = Data Information =
2024-05-03 15:26:53,021 - julearn - INFO - Problem type: classification
2024-05-03 15:26:53,021 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:53,021 - julearn - INFO - Number of features: 4
2024-05-03 15:26:53,021 - julearn - INFO - ====================
2024-05-03 15:26:53,021 - julearn - INFO -
2024-05-03 15:26:53,021 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:53,021 - julearn - INFO - Target type: object
2024-05-03 15:26:53,022 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:53,022 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:53,022 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
new_object = klass(**new_object_params)
2024-05-03 15:26:54,635 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of optuna and 3-fold CV: 0.76
{'select_k__k': 3,
'svm__C': 4.199561794026207,
'svm__class_weight': 'balanced',
'svm__gamma': 0.0017858196018021508}
Specifying distributions#
The hyperparameters can be specified as distributions for the randomized,
Bayesian and optuna searchers. The distributions are specified either with
toolbox-specific classes or with a tuple convention in one of two formats:
(low, high, distribution), where the distribution can be either "log-uniform"
or "uniform", or (a, b, c, d, ..., "categorical"), where a, b, c, d, etc. are
the possible categorical values for the hyperparameter.
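To make the tuple convention concrete, here is a minimal sketch of how such tuples could be read. The helper describe_distribution is hypothetical and not part of julearn. It assumes, in line with the log messages of the optuna example above, that integer bounds denote an integer-valued hyperparameter.

```python
def describe_distribution(spec):
    """Return a readable description of a tuple following the convention above.

    Hypothetical helper for illustration; it is not part of julearn.
    """
    *values, kind = spec
    if kind == "categorical":
        return f"categorical with options {values}"
    if kind in ("uniform", "log-uniform"):
        if len(values) != 2:
            raise ValueError(f"{kind} expects (low, high, '{kind}'), got {spec}")
        low, high = values
        # Assumption: integer bounds suggest an integer-valued hyperparameter.
        number = "integer" if all(isinstance(v, int) for v in values) else "float"
        return f"{kind} {number} in [{low}, {high}]"
    raise ValueError(f"Unknown distribution: {kind!r}")


print(describe_distribution((0.01, 10, "log-uniform")))
print(describe_distribution((2, 4, "uniform")))
print(describe_distribution(("balanced", None, "categorical")))
```

The last element of the tuple always names the distribution. Everything before it is either the pair of bounds or the list of categorical options.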
For example, we can specify the C and gamma hyperparameters of the SVC as
log-uniform distributions, while keeping the with_mean parameter of the
StandardScaler as a categorical parameter with two options.
creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=(True, False, "categorical"))
creator.add(
"svm",
C=(0.01, 10, "log-uniform"),
gamma=(1e-3, 1e-1, "log-uniform"),
)
print(creator)
2024-05-03 15:26:54,950 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,950 - julearn - INFO - Tuning hyperparameter with_mean = (True, False, 'categorical')
2024-05-03 15:26:54,951 - julearn - INFO - Step added
2024-05-03 15:26:54,951 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,951 - julearn - INFO - Tuning hyperparameter C = (0.01, 10, 'log-uniform')
2024-05-03 15:26:54,951 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:54,951 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'zscore__with_mean': (True, False, 'categorical')}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': (0.01, 10, 'log-uniform'), 'svm__gamma': (0.001, 0.1, 'log-uniform')}
While this will work for any of the random, bayes or optuna searcher options,
it is important to note that both the bayes and optuna searchers accept
further, toolbox-specific objects to specify distributions. The bayes searcher
distributions are defined using the Categorical, Integer and Real classes
(from skopt.space). For instance, we can define a log-uniform distribution with
base 2 for the C hyperparameter of the SVC model:
from skopt.space import Real
creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=(True, False, "categorical"))
creator.add(
"svm",
C=Real(0.01, 10, prior="log-uniform", base=2),
gamma=(1e-3, 1e-1, "log-uniform"),
)
print(creator)
2024-05-03 15:26:54,952 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,952 - julearn - INFO - Tuning hyperparameter with_mean = (True, False, 'categorical')
2024-05-03 15:26:54,952 - julearn - INFO - Step added
2024-05-03 15:26:54,953 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,953 - julearn - INFO - Tuning hyperparameter C = Real(low=0.01, high=10, prior='log-uniform', transform='identity')
2024-05-03 15:26:54,953 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:54,953 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'zscore__with_mean': (True, False, 'categorical')}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': Real(low=0.01, high=10, prior='log-uniform', transform='identity'), 'svm__gamma': (0.001, 0.1, 'log-uniform')}
For the optuna searcher, the distributions are defined using the
CategoricalDistribution, FloatDistribution and IntDistribution classes (from
optuna.distributions). For example, we can define a uniform distribution from
0.5 to 0.9 with a 0.05 step for the n_components of a PCA transformer, while
keeping a log-uniform distribution for the C and gamma hyperparameters of the
SVC model.
from optuna.distributions import FloatDistribution
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
"pca",
n_components=FloatDistribution(0.5, 0.9, step=0.05),
)
creator.add(
"svm",
C=FloatDistribution(0.01, 10, log=True),
gamma=(1e-3, 1e-1, "log-uniform"),
)
print(creator)
2024-05-03 15:26:54,954 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,954 - julearn - INFO - Step added
2024-05-03 15:26:54,954 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,954 - julearn - INFO - Setting hyperparameter n_components = FloatDistribution(high=0.9, log=False, low=0.5, step=0.05)
2024-05-03 15:26:54,954 - julearn - INFO - Step added
2024-05-03 15:26:54,954 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,954 - julearn - INFO - Setting hyperparameter C = FloatDistribution(high=10.0, log=True, low=0.01, step=None)
2024-05-03 15:26:54,954 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:54,954 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: pca
estimator: PCA(n_components=FloatDistribution(high=0.9, log=False, low=0.5, step=0.05))
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 2: svm
estimator: SVC(C=FloatDistribution(high=10.0, log=True, low=0.01, step=None))
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__gamma': (0.001, 0.1, 'log-uniform')}
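A float distribution with a step is effectively discrete. As a small sketch, the candidate values for n_components between 0.5 and 0.9 with a step of 0.05 can be listed as follows. The helper stepped_values is hypothetical, and it assumes both endpoints are included, which holds here because the range is a multiple of the step.

```python
def stepped_values(low, high, step):
    """List the values low, low + step, ... that do not exceed high."""
    # Rounding guards against floating-point noise in the division.
    n_steps = int(round((high - low) / step, 9))
    return [round(low + i * step, 10) for i in range(n_steps + 1)]


values = stepped_values(0.5, 0.9, 0.05)
print(f"{len(values)} candidate values: {values}")
```

The searcher therefore chooses among a small set of values instead of the whole continuous range.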
Tuning more than one grid#
Following our tuning of the SVC hyperparameters, we can also tune the kernel
hyperparameter, which can be set to “linear” instead of the default “rbf”.
Let’s see what our grid of hyperparameters would look like if we add this
hyperparameter:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
"svm",
C=[0.01, 0.1, 1, 10],
gamma=[1e-3, 1e-2, "scale", "auto"],
kernel=["linear", "rbf"],
)
print(creator)
2024-05-03 15:26:54,956 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,956 - julearn - INFO - Step added
2024-05-03 15:26:54,956 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,956 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,956 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,956 - julearn - INFO - Tuning hyperparameter kernel = ['linear', 'rbf']
2024-05-03 15:26:54,956 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto'], 'svm__kernel': ['linear', 'rbf']}
We can see that the grid of hyperparameters is now 3-dimensional. However,
there are some combinations that don’t make much sense. For example, the gamma
hyperparameter is not used by the linear kernel; in this grid it only matters
when the kernel is rbf. So we will be trying the linear kernel with each one of
the 4 different gamma and 4 different C values. That makes 16 combinations for
the linear kernel, of which only 4 (one per C value) are actually different, so
the other 12 are unnecessary.
We can avoid this by using multiple grids: one grid for the linear kernel and
one grid for the rbf kernel.
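The saving can be checked by counting candidates. The sketch below is plain Python, independent of julearn, and uses the hypothetical helper count_combinations to compare the single 3-dimensional grid with one grid per kernel.

```python
import itertools


def count_combinations(grid):
    """Number of candidates in one grid (the product of the list lengths)."""
    return len(list(itertools.product(*grid.values())))


single_grid = {
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, "scale", "auto"],
    "svm__kernel": ["linear", "rbf"],
}

# One grid per kernel: gamma only appears where it has an effect.
rbf_grid = {
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, "scale", "auto"],
    "svm__kernel": ["rbf"],
}
linear_grid = {
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__kernel": ["linear"],
}

n_single = count_combinations(single_grid)
n_multiple = count_combinations(rbf_grid) + count_combinations(linear_grid)

print(f"Single grid: {n_single} candidates")
print(f"Two grids: {n_multiple} candidates")
print(f"Saved: {n_single - n_multiple} candidates")
```

Every saved candidate would otherwise be fitted once per inner fold in each outer fold, so the reduction grows with the cross-validation scheme.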
julearn allows specifying multiple grids using two different approaches.
Repeating the step name with different hyperparameters:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
"svm",
C=[0.01, 0.1, 1, 10],
gamma=[1e-3, 1e-2, "scale", "auto"],
kernel=["rbf"],
name="svm",
)
creator.add(
"svm",
C=[0.01, 0.1, 1, 10],
kernel=["linear"],
name="svm",
)
print(creator)
scores1, model1 = run_cross_validation(
X=X,
y=y,
data=df,
X_types=X_types,
model=creator,
return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores1['test_score'].mean()}")
pprint(model1.best_params_)
2024-05-03 15:26:54,957 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,957 - julearn - INFO - Step added
2024-05-03 15:26:54,957 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,957 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,957 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,957 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-05-03 15:26:54,957 - julearn - INFO - Step added
2024-05-03 15:26:54,958 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,958 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,958 - julearn - INFO - Setting hyperparameter kernel = linear
2024-05-03 15:26:54,958 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto']}
Step 2: svm
estimator: SVC(kernel='linear')
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {'svm__C': [0.01, 0.1, 1, 10]}
2024-05-03 15:26:54,958 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:54,958 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:54,959 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:54,959 - julearn - INFO - Target: species
2024-05-03 15:26:54,959 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:54,959 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:54,959 - julearn - INFO - ====================
2024-05-03 15:26:54,959 - julearn - INFO -
2024-05-03 15:26:54,960 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:54,960 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:54,960 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:54,960 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,960 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,960 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,960 - julearn - INFO - Search Parameters:
2024-05-03 15:26:54,960 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,960 - julearn - INFO - ====================
2024-05-03 15:26:54,960 - julearn - INFO -
2024-05-03 15:26:54,961 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:54,961 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:54,961 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:54,961 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,961 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,961 - julearn - INFO - Search Parameters:
2024-05-03 15:26:54,961 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,961 - julearn - INFO - ====================
2024-05-03 15:26:54,961 - julearn - INFO -
2024-05-03 15:26:54,961 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:54,961 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:54,962 - julearn - INFO - Hyperparameters list:
2024-05-03 15:26:54,962 - julearn - INFO - Set 0
2024-05-03 15:26:54,962 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,962 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,962 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:26:54,962 - julearn - INFO - svm: [SVC()]
2024-05-03 15:26:54,962 - julearn - INFO - Set 1
2024-05-03 15:26:54,962 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,962 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:26:54,963 - julearn - INFO - svm: [SVC(kernel='linear')]
2024-05-03 15:26:54,963 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,963 - julearn - INFO - Search Parameters:
2024-05-03 15:26:54,963 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,963 - julearn - INFO - ====================
2024-05-03 15:26:54,963 - julearn - INFO -
2024-05-03 15:26:54,963 - julearn - INFO - = Data Information =
2024-05-03 15:26:54,963 - julearn - INFO - Problem type: classification
2024-05-03 15:26:54,963 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:54,963 - julearn - INFO - Number of features: 4
2024-05-03 15:26:54,963 - julearn - INFO - ====================
2024-05-03 15:26:54,963 - julearn - INFO -
2024-05-03 15:26:54,963 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:54,963 - julearn - INFO - Target type: object
2024-05-03 15:26:54,964 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:54,964 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,964 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:58,804 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']}),
'svm': SVC(),
'svm__C': 10,
'svm__gamma': 0.01}
Important
Note that the name parameter is required when repeating a step name. If we do not specify the name parameter, julearn will auto-determine the step name in a unique way. The only way to force repeated names is to do so explicitly.
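To see why the step name matters, recall that each tuned hyperparameter is addressed as "step name, double underscore, parameter", as in the svm__C and svm__gamma entries in the logs above. The following is a minimal pure-Python sketch of that naming scheme, not julearn code: prefixed_grid is a hypothetical helper, and svm_1 only stands in for whatever unique name julearn would auto-determine.

```python
def prefixed_grid(step_name, params):
    """Prefix each hyperparameter with its step name, scikit-learn style."""
    return {f"{step_name}__{key}": values for key, values in params.items()}


rbf_params = {"C": [0.01, 0.1, 1, 10], "gamma": [1e-3, 1e-2, "scale", "auto"]}
linear_params = {"C": [0.01, 0.1, 1, 10]}

# Both steps explicitly named "svm": the two grids address the same step.
same_name = [prefixed_grid("svm", rbf_params), prefixed_grid("svm", linear_params)]

# Hypothetical auto-generated name for the second step ("svm_1" is assumed).
auto_name = [prefixed_grid("svm", rbf_params), prefixed_grid("svm_1", linear_params)]

shared_keys = set(same_name[0]) & set(same_name[1])
disjoint_keys = set(auto_name[0]) & set(auto_name[1])
print(sorted(shared_keys))
print(sorted(disjoint_keys))
```

With the explicit shared name, both grids tune the same step through the common key svm__C. With distinct names, the grids would describe two different steps of one pipeline instead of two alternatives for the same step.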
Using multiple pipeline creators:
# First creator: SVC with an RBF kernel, tuning both C and gamma.
creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
)

# Second creator: SVC with a linear kernel, tuning only C.
creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
)

# Passing a list of creators searches over both hyperparameter grids.
scores2, model2 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=[creator1, creator2],
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores2['test_score'].mean()}")
pprint(model2.best_params_)
2024-05-03 15:26:59,576 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,576 - julearn - INFO - Step added
2024-05-03 15:26:59,576 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,576 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,576 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:59,576 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-05-03 15:26:59,576 - julearn - INFO - Step added
2024-05-03 15:26:59,577 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,577 - julearn - INFO - Step added
2024-05-03 15:26:59,577 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,577 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,577 - julearn - INFO - Setting hyperparameter kernel = linear
2024-05-03 15:26:59,577 - julearn - INFO - Step added
2024-05-03 15:26:59,577 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:59,577 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:59,577 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:59,577 - julearn - INFO - Target: species
2024-05-03 15:26:59,577 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:59,577 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:59,578 - julearn - INFO - ====================
2024-05-03 15:26:59,578 - julearn - INFO -
2024-05-03 15:26:59,578 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:59,578 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:59,578 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:59,578 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,579 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:59,579 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,579 - julearn - INFO - Search Parameters:
2024-05-03 15:26:59,579 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,579 - julearn - INFO - ====================
2024-05-03 15:26:59,579 - julearn - INFO -
2024-05-03 15:26:59,579 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:59,579 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:59,579 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:59,580 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,580 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,580 - julearn - INFO - Search Parameters:
2024-05-03 15:26:59,580 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,580 - julearn - INFO - ====================
2024-05-03 15:26:59,580 - julearn - INFO -
2024-05-03 15:26:59,580 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:59,580 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:59,580 - julearn - INFO - Hyperparameters list:
2024-05-03 15:26:59,580 - julearn - INFO - Set 0
2024-05-03 15:26:59,580 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,580 - julearn - INFO - svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:59,581 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:26:59,581 - julearn - INFO - zscore: [StandardScaler()]
2024-05-03 15:26:59,581 - julearn - INFO - svm: [SVC()]
2024-05-03 15:26:59,581 - julearn - INFO - Set 1
2024-05-03 15:26:59,581 - julearn - INFO - svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,581 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:26:59,581 - julearn - INFO - zscore: [StandardScaler()]
2024-05-03 15:26:59,582 - julearn - INFO - svm: [SVC(kernel='linear')]
2024-05-03 15:26:59,582 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,582 - julearn - INFO - Search Parameters:
2024-05-03 15:26:59,582 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,582 - julearn - INFO - ====================
2024-05-03 15:26:59,582 - julearn - INFO -
2024-05-03 15:26:59,582 - julearn - INFO - = Data Information =
2024-05-03 15:26:59,582 - julearn - INFO - Problem type: classification
2024-05-03 15:26:59,582 - julearn - INFO - Number of samples: 100
2024-05-03 15:26:59,582 - julearn - INFO - Number of features: 4
2024-05-03 15:26:59,582 - julearn - INFO - ====================
2024-05-03 15:26:59,582 - julearn - INFO -
2024-05-03 15:26:59,582 - julearn - INFO - Number of classes: 2
2024-05-03 15:26:59,582 - julearn - INFO - Target type: object
2024-05-03 15:26:59,583 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:26:59,583 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,583 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:27:03,459 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']}),
'svm': SVC(),
'svm__C': 10,
'svm__gamma': 0.01,
'zscore': StandardScaler()}
Important
All the pipeline creators must have the same problem type and the same step names for this approach to work.
Indeed, if we compare both approaches, we can see that they are equivalent. Both produce the same grid of hyperparameters, except that the second one also lists the zscore step explicitly:
pprint(model1.param_grid)
pprint(model2.param_grid)
[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC()],
'svm__C': [0.01, 0.1, 1, 10],
'svm__gamma': [0.001, 0.01, 'scale', 'auto']},
{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC(kernel='linear')],
'svm__C': [0.01, 0.1, 1, 10]}]
[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC()],
'svm__C': [0.01, 0.1, 1, 10],
'svm__gamma': [0.001, 0.01, 'scale', 'auto'],
'zscore': [StandardScaler()]},
{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})],
'svm': [SVC(kernel='linear')],
'svm__C': [0.01, 0.1, 1, 10],
'zscore': [StandardScaler()]}]
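A list of grids like the ones printed above is searched grid by grid: each dictionary is expanded into the Cartesian product of its value lists, and the candidates from all dictionaries are pooled. The sketch below reproduces that expansion in plain Python. It is an illustration only: expand_grids is a hypothetical helper, the estimators are written as strings instead of SVC objects, and the set_column_types and zscore entries are left out because each holds a single value and does not change the count.

```python
from itertools import product


def expand_grids(grids):
    """Expand a list of grids into a flat list of candidate settings."""
    candidates = []
    for grid in grids:
        keys = sorted(grid)
        # Cartesian product of the value lists of this grid only.
        for values in product(*(grid[key] for key in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


# Same shape as the grids printed above; estimators are shown as strings.
grids = [
    {
        "svm": ["SVC()"],
        "svm__C": [0.01, 0.1, 1, 10],
        "svm__gamma": [0.001, 0.01, "scale", "auto"],
    },
    {
        "svm": ["SVC(kernel='linear')"],
        "svm__C": [0.01, 0.1, 1, 10],
    },
]

candidates = expand_grids(grids)
print(len(candidates))  # 4 * 4 + 4 = 20
```

Because the grids are expanded separately, gamma is never combined with the linear kernel, which keeps the search at 20 candidates.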
Models as hyperparameters
But why stop there? Models can also be considered as hyperparameters. For example, we can try different models for the classification task. Let’s try the RandomForestClassifier and the LogisticRegression too:
# Every creator names its final step "model", so all creators share the
# same step names and can be searched together.

# SVC with an RBF kernel, tuning C and gamma.
creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
    name="model",
)

# SVC with a linear kernel, tuning C.
creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
    name="model",
)

# Random forest, tuning the maximum tree depth.
creator3 = PipelineCreator(problem_type="classification")
creator3.add("zscore")
creator3.add(
    "rf",
    max_depth=[2, 3, 4],
    name="model",
)

# Logistic regression, tuning the penalty type.
creator4 = PipelineCreator(problem_type="classification")
creator4.add("zscore")
creator4.add(
    "logit",
    penalty=["l2", "l1"],
    dual=[False],
    solver="liblinear",
    name="model",
)

scores3, model3 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=[creator1, creator2, creator3, creator4],
    return_estimator="all",
)
print(f"Scores with best hyperparameter: {scores3['test_score'].mean()}")
pprint(model3.best_params_)
2024-05-03 15:27:04,241 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,241 - julearn - INFO - Step added
2024-05-03 15:27:04,241 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,241 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,241 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:27:04,241 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-05-03 15:27:04,241 - julearn - INFO - Step added
2024-05-03 15:27:04,241 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,242 - julearn - INFO - Setting hyperparameter kernel = linear
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Tuning hyperparameter max_depth = [2, 3, 4]
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Tuning hyperparameter penalty = ['l2', 'l1']
2024-05-03 15:27:04,242 - julearn - INFO - Setting hyperparameter dual = False
2024-05-03 15:27:04,242 - julearn - INFO - Setting hyperparameter solver = liblinear
2024-05-03 15:27:04,243 - julearn - INFO - Step added
2024-05-03 15:27:04,243 - julearn - INFO - ==== Input Data ====
2024-05-03 15:27:04,243 - julearn - INFO - Using dataframe as input
2024-05-03 15:27:04,243 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:27:04,243 - julearn - INFO - Target: species
2024-05-03 15:27:04,243 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:27:04,243 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:27:04,243 - julearn - INFO - ====================
2024-05-03 15:27:04,243 - julearn - INFO -
2024-05-03 15:27:04,244 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,244 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,244 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,244 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,244 - julearn - INFO - model__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:27:04,244 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,244 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,244 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,244 - julearn - INFO - ====================
2024-05-03 15:27:04,245 - julearn - INFO -
2024-05-03 15:27:04,245 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,245 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,245 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,245 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,245 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,245 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,245 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,246 - julearn - INFO - ====================
2024-05-03 15:27:04,246 - julearn - INFO -
2024-05-03 15:27:04,246 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,246 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,246 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,246 - julearn - INFO - model__max_depth: [2, 3, 4]
2024-05-03 15:27:04,246 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,246 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,246 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,246 - julearn - INFO - ====================
2024-05-03 15:27:04,247 - julearn - INFO -
2024-05-03 15:27:04,247 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,247 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,247 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,247 - julearn - INFO - model__penalty: ['l2', 'l1']
2024-05-03 15:27:04,247 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,247 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,247 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,247 - julearn - INFO - ====================
2024-05-03 15:27:04,247 - julearn - INFO -
2024-05-03 15:27:04,248 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,248 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,248 - julearn - INFO - Hyperparameters list:
2024-05-03 15:27:04,248 - julearn - INFO - Set 0
2024-05-03 15:27:04,248 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,248 - julearn - INFO - model__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:27:04,248 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:27:04,248 - julearn - INFO - zscore: [StandardScaler()]
2024-05-03 15:27:04,248 - julearn - INFO - model: [SVC()]
2024-05-03 15:27:04,248 - julearn - INFO - Set 1
2024-05-03 15:27:04,248 - julearn - INFO - model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,249 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:27:04,249 - julearn - INFO - zscore: [StandardScaler()]
2024-05-03 15:27:04,249 - julearn - INFO - model: [SVC(kernel='linear')]
2024-05-03 15:27:04,249 - julearn - INFO - Set 2
2024-05-03 15:27:04,249 - julearn - INFO - model__max_depth: [2, 3, 4]
2024-05-03 15:27:04,249 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:27:04,250 - julearn - INFO - zscore: [StandardScaler()]
2024-05-03 15:27:04,250 - julearn - INFO - model: [RandomForestClassifier()]
2024-05-03 15:27:04,250 - julearn - INFO - Set 3
2024-05-03 15:27:04,250 - julearn - INFO - model__penalty: ['l2', 'l1']
2024-05-03 15:27:04,250 - julearn - INFO - set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']})]
2024-05-03 15:27:04,250 - julearn - INFO - zscore: [StandardScaler()]
2024-05-03 15:27:04,250 - julearn - INFO - model: [LogisticRegression(solver='liblinear')]
2024-05-03 15:27:04,251 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,251 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,251 - julearn - INFO - cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,251 - julearn - INFO - ====================
2024-05-03 15:27:04,251 - julearn - INFO -
2024-05-03 15:27:04,251 - julearn - INFO - = Data Information =
2024-05-03 15:27:04,251 - julearn - INFO - Problem type: classification
2024-05-03 15:27:04,251 - julearn - INFO - Number of samples: 100
2024-05-03 15:27:04,251 - julearn - INFO - Number of features: 4
2024-05-03 15:27:04,251 - julearn - INFO - ====================
2024-05-03 15:27:04,251 - julearn - INFO -
2024-05-03 15:27:04,251 - julearn - INFO - Number of classes: 2
2024-05-03 15:27:04,251 - julearn - INFO - Target type: object
2024-05-03 15:27:04,252 - julearn - INFO - Class distributions: species
versicolor 50
virginica 50
Name: count, dtype: int64
2024-05-03 15:27:04,252 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,252 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:27:15,883 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9200000000000002
{'model': SVC(),
'model__C': 10,
'model__gamma': 0.01,
'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
'petal_length', 'petal_width']}),
'zscore': StandardScaler()}
Well, it seems that nothing can beat the SVC with kernel="rbf" for our classification example.
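The last search also takes noticeably longer in the logs above: about 12 seconds pass between "Binary classification problem detected" and "Fitting final model", compared with about 4 seconds for the two-SVC search. Part of this comes from the larger number of model fits. As a rough sketch, the count can be estimated under two assumptions that the source does not state: each search refits the best candidate once (scikit-learn's default refit behaviour), and the final model is obtained by running one more full search on all the data. The helper names are hypothetical.

```python
from math import prod


def n_candidates(grids):
    """Number of hyperparameter combinations in a list of grids."""
    return sum(prod(len(values) for values in grid.values()) for grid in grids)


def n_fits(grids, inner_folds=5, outer_folds=5):
    """Estimated fits for nested CV, assuming one refit per search."""
    per_search = n_candidates(grids) * inner_folds + 1  # +1 for the refit
    # One search per outer fold, plus one search for the final model.
    return per_search * (outer_folds + 1)


two_svms = [
    {"model__C": [0.01, 0.1, 1, 10], "model__gamma": [1e-3, 1e-2, "scale", "auto"]},
    {"model__C": [0.01, 0.1, 1, 10]},
]
four_models = two_svms + [
    {"model__max_depth": [2, 3, 4]},
    {"model__penalty": ["l2", "l1"]},
]

print(n_candidates(two_svms), n_fits(two_svms))
print(n_candidates(four_models), n_fits(four_models))
```

The fit count alone does not account for the full difference in runtime, since the cost of a single fit varies between model types.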