6.3. Hyperparameter Tuning

Parameters vs Hyperparameters

Parameters are the values that define the model, and are learned from the data. For example, the weights of a linear regression model are parameters. The parameters of a model are learned during training and are not set by the user.

Hyperparameters are also values that define the model, but they are not learned from the data. For example, the regularization parameter C of a Support Vector Machine (SVC) model is a hyperparameter. The hyperparameters of a model are set by the user before training.
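To make the distinction concrete, here is a minimal sketch using a toy one-dimensional ridge regression written in plain Python. The function fit_ridge_slope and the data are hypothetical and only serve as an illustration; they are not part of julearn or scikit-learn. The regularization strength alpha is chosen by us (a hyperparameter), while the slope w is computed from the data (a parameter).

```python
# Toy 1-D ridge regression without intercept: y ~ w * x.
# alpha is a hyperparameter (chosen by us before fitting);
# w is a parameter (computed from the data during fitting).


def fit_ridge_slope(x, y, alpha):
    """Return the slope w minimising sum((y - w*x)**2) + alpha * w**2."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]  # exactly y = 2 * x

w_no_reg = fit_ridge_slope(x, y, alpha=0.0)   # no regularization
w_reg = fit_ridge_slope(x, y, alpha=30.0)     # strong regularization
print(w_no_reg, w_reg)
```

The same data yields a different learned slope depending on the value of alpha, which is why the choice of hyperparameters matters.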

Let’s see an example of an SVC model with a regularization parameter C. We will use the iris dataset, which contains measurements of iris flowers.

We start by loading the dataset and setting the features and target variables.

from seaborn import load_dataset
from pprint import pprint  # To print in a pretty way

df = load_dataset("iris")
X = df.columns[:-1].tolist()
y = "species"
X_types = {"continuous": X}

# The dataset has three species. We will keep two of them to perform a binary
# classification.
df = df[df["species"].isin(["versicolor", "virginica"])]

We can now use the PipelineCreator to create a pipeline with a StandardScaler (the "zscore" step) and an SVC, with the regularization parameter C set to 0.1.

from julearn.pipeline import PipelineCreator

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=0.1)

print(creator)

2024-05-03 15:26:14,565 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,565 - julearn - INFO - Step added
2024-05-03 15:26:14,565 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,565 - julearn - INFO - Setting hyperparameter C = 0.1
2024-05-03 15:26:14,565 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC(C=0.1)
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

Hyperparameter Tuning

Since it is the user who sets the hyperparameters, it is important to choose the right values. This is not always easy, and it is common to test several hyperparameter values and choose the one that works best. This process is called hyperparameter tuning.

For example, we can try different values for the regularization parameter C of the SVC model and see which one works best.

from julearn import run_cross_validation

scores1 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)

print(f"Score with C=0.1: {scores1['test_score'].mean()}")

creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add("svm", C=0.01)

scores2 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator2,
)

print(f"Score with C=0.01: {scores2['test_score'].mean()}")

2024-05-03 15:26:14,566 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:14,566 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:14,566 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,566 - julearn - INFO -      Target: species
2024-05-03 15:26:14,567 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,567 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:14,567 - julearn - INFO - ====================
2024-05-03 15:26:14,567 - julearn - INFO -
2024-05-03 15:26:14,568 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:14,568 - julearn - INFO - ====================
2024-05-03 15:26:14,568 - julearn - INFO -
2024-05-03 15:26:14,568 - julearn - INFO - = Data Information =
2024-05-03 15:26:14,568 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:14,568 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:14,568 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:14,568 - julearn - INFO - ====================
2024-05-03 15:26:14,568 - julearn - INFO -
2024-05-03 15:26:14,568 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:14,568 - julearn - INFO -      Target type: object
2024-05-03 15:26:14,569 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:14,569 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,569 - julearn - INFO - Binary classification problem detected.
Score with C=0.1: 0.8099999999999999
2024-05-03 15:26:14,608 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,608 - julearn - INFO - Step added
2024-05-03 15:26:14,608 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,608 - julearn - INFO - Setting hyperparameter C = 0.01
2024-05-03 15:26:14,608 - julearn - INFO - Step added
2024-05-03 15:26:14,608 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:14,608 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:14,608 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,608 - julearn - INFO -      Target: species
2024-05-03 15:26:14,608 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,608 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:14,609 - julearn - INFO - ====================
2024-05-03 15:26:14,609 - julearn - INFO -
2024-05-03 15:26:14,609 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:14,609 - julearn - INFO - ====================
2024-05-03 15:26:14,610 - julearn - INFO -
2024-05-03 15:26:14,610 - julearn - INFO - = Data Information =
2024-05-03 15:26:14,610 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:14,610 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:14,610 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:14,610 - julearn - INFO - ====================
2024-05-03 15:26:14,610 - julearn - INFO -
2024-05-03 15:26:14,610 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:14,610 - julearn - INFO -      Target type: object
2024-05-03 15:26:14,610 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:14,611 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,611 - julearn - INFO - Binary classification problem detected.
Score with C=0.01: 0.19

We can see that the model with C=0.1 works better than the model with C=0.01 (a mean test score of 0.81 versus 0.19). However, to be sure that C=0.1 is the best value, we should try more values. Since this is only one hyperparameter, that is not too difficult. But what if we have more hyperparameters? And what if we have several steps in the pipeline (e.g. feature selection, PCA, etc.)? This is a major problem: the more hyperparameter combinations we compare, the more often we reuse the same data both to select and to evaluate the model. This usually gives an optimistic estimate of the performance of the model.
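The optimistic bias can be illustrated with a small simulation in plain Python. This is only a sketch under simplifying assumptions: each "configuration" is a hypothetical model that guesses at random, so its true accuracy is 0.5, and all configurations are scored on the same 100 samples.

```python
import random

random.seed(0)

n_samples = 100
n_configs = 50

# Balanced binary labels, as in the two-species iris subset.
labels = [0] * 50 + [1] * 50


def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Every configuration guesses at random, so none is truly better than 0.5.
scores = []
for _ in range(n_configs):
    guesses = [random.randint(0, 1) for _ in range(n_samples)]
    scores.append(accuracy(guesses, labels))

mean_score = sum(scores) / n_configs
best_score = max(scores)
print(f"mean over configurations: {mean_score:.2f}")
print(f"best configuration:       {best_score:.2f}")
```

Reporting the score of the best configuration on the data that was used to pick it overstates how well that configuration would do on new data, even though no configuration here has any real skill.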

To prevent this, we can use a technique called nested cross-validation. That is, we use cross-validation to tune the hyperparameters, and then we use cross-validation again to estimate the performance of the model using the best hyperparameter set. It is called nested because we first split the data into training and testing sets to estimate the model performance (outer loop), and then we split each training set again into training and validation sets to tune the hyperparameters (inner loop).
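The structure of the two loops can be sketched with plain Python index lists. The helper kfold_indices below is a hypothetical stand-in for an unshuffled 5-fold splitter and is not julearn's implementation; it only shows which samples each loop is allowed to see.

```python
def kfold_indices(indices, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    fold_size, remainder = divmod(len(indices), n_splits)
    start = 0
    for fold in range(n_splits):
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop


samples = list(range(100))
splits = []
for outer_train, outer_test in kfold_indices(samples, n_splits=5):
    # Inner loop: only the outer training set is split again for tuning,
    # so the outer test set is never used to choose hyperparameters.
    for inner_train, inner_val in kfold_indices(outer_train, n_splits=5):
        splits.append((outer_test, inner_train, inner_val))

print(len(splits))
```

With 100 samples, each outer test set holds 20 samples, and the remaining 80 are split into 64 for inner training and 16 for inner validation.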

julearn has a simple way to do hyperparameter tuning using nested cross-validation. When we use a PipelineCreator to create a pipeline, we can set the hyperparameters we want to tune and the values we want to try.

For example, we can try different values for the regularization parameter C of the SVC model:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10])

print(creator)

2024-05-03 15:26:14,649 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,649 - julearn - INFO - Step added
2024-05-03 15:26:14,649 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:14,649 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:14,649 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10]}

As we can see above, the creator now shows that the C hyperparameter will be tuned. We can now use this creator to run cross-validation. This will tune the hyperparameters and estimate the performance of the model using the best hyperparameter set.

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")

2024-05-03 15:26:14,650 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:14,650 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:14,650 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,650 - julearn - INFO -      Target: species
2024-05-03 15:26:14,650 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:14,650 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:14,651 - julearn - INFO - ====================
2024-05-03 15:26:14,651 - julearn - INFO -
2024-05-03 15:26:14,651 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:14,651 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:14,651 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:14,652 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:14,652 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,652 - julearn - INFO - Search Parameters:
2024-05-03 15:26:14,652 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,652 - julearn - INFO - ====================
2024-05-03 15:26:14,652 - julearn - INFO -
2024-05-03 15:26:14,652 - julearn - INFO - = Data Information =
2024-05-03 15:26:14,652 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:14,652 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:14,652 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:14,652 - julearn - INFO - ====================
2024-05-03 15:26:14,652 - julearn - INFO -
2024-05-03 15:26:14,652 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:14,652 - julearn - INFO -      Target type: object
2024-05-03 15:26:14,653 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:14,653 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:14,653 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:15,439 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001

We can see that the model with the best hyperparameters works better than the model with C=0.1. But what’s the best hyperparameter set? We can see it by printing the model_tuned.best_params_ variable.

pprint(model_tuned.best_params_)

{'svm__C': 1}

We can see that the best hyperparameter set is C=1. Since this hyperparameter was not on the boundary of the values we tried, we can conclude that our search for the best C value was successful.
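The boundary check described above can be written as a small helper. The function on_boundary is a hypothetical convenience written for this illustration, not part of julearn; it ignores non-numeric candidates such as strings.

```python
def on_boundary(candidates, best):
    """Return True if best is the smallest or largest numeric candidate."""
    numeric = sorted(c for c in candidates if isinstance(c, (int, float)))
    return best in (numeric[0], numeric[-1])


c_values = [0.01, 0.1, 1, 10]
print(on_boundary(c_values, 1))   # C=1 lies inside the tested range
print(on_boundary(c_values, 10))  # C=10 is the largest tested value
```

When the check returns True, the true optimum may lie outside the tested range, so extending the range in that direction is a reasonable next step.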

However, by checking the SVC documentation, we can see that there are more hyperparameters that we can tune. For example, for the default rbf kernel, we can tune the gamma hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10], gamma=[0.01, 0.1, 1, 10])

print(creator)

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)

2024-05-03 15:26:15,596 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:15,596 - julearn - INFO - Step added
2024-05-03 15:26:15,596 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:15,596 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,596 - julearn - INFO - Tuning hyperparameter gamma = [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,596 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.01, 0.1, 1, 10]}

2024-05-03 15:26:15,596 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:15,597 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:15,597 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:15,597 - julearn - INFO -      Target: species
2024-05-03 15:26:15,597 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:15,597 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:15,597 - julearn - INFO - ====================
2024-05-03 15:26:15,597 - julearn - INFO -
2024-05-03 15:26:15,598 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:15,598 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:15,598 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:15,598 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,598 - julearn - INFO -      svm__gamma: [0.01, 0.1, 1, 10]
2024-05-03 15:26:15,598 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:15,598 - julearn - INFO - Search Parameters:
2024-05-03 15:26:15,598 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:15,598 - julearn - INFO - ====================
2024-05-03 15:26:15,598 - julearn - INFO -
2024-05-03 15:26:15,599 - julearn - INFO - = Data Information =
2024-05-03 15:26:15,599 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:15,599 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:15,599 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:15,599 - julearn - INFO - ====================
2024-05-03 15:26:15,599 - julearn - INFO -
2024-05-03 15:26:15,599 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:15,599 - julearn - INFO -      Target type: object
2024-05-03 15:26:15,599 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:15,600 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:15,600 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:18,579 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}

We can see that the best hyperparameter set is now C=10 and gamma=0.01. But since both values are on the boundary of the values we tried (the largest C and the smallest gamma), we should try more values to be sure that we are using the best hyperparameter set.

We can even give a combination of different variable types, like the words "scale" and "auto" for the gamma hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-5, 1e-4, 1e-3, 1e-2, "scale", "auto"],
)

print(creator)

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)

2024-05-03 15:26:19,183 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:19,184 - julearn - INFO - Step added
2024-05-03 15:26:19,184 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:19,184 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:19,184 - julearn - INFO - Tuning hyperparameter gamma = [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:19,184 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']}

2024-05-03 15:26:19,184 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:19,184 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:19,184 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:19,184 - julearn - INFO -      Target: species
2024-05-03 15:26:19,185 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:19,185 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:19,185 - julearn - INFO - ====================
2024-05-03 15:26:19,185 - julearn - INFO -
2024-05-03 15:26:19,186 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:19,186 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:19,186 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:19,186 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:19,186 - julearn - INFO -      svm__gamma: [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:19,186 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:19,186 - julearn - INFO - Search Parameters:
2024-05-03 15:26:19,186 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:19,186 - julearn - INFO - ====================
2024-05-03 15:26:19,186 - julearn - INFO -
2024-05-03 15:26:19,186 - julearn - INFO - = Data Information =
2024-05-03 15:26:19,186 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:19,186 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:19,186 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:19,187 - julearn - INFO - ====================
2024-05-03 15:26:19,187 - julearn - INFO -
2024-05-03 15:26:19,187 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:19,187 - julearn - INFO -      Target type: object
2024-05-03 15:26:19,187 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:19,187 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:19,187 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:23,662 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}

We can even tune hyperparameters from different steps of the pipeline. Let’s add a SelectKBest step to the pipeline and tune its k hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=[2, 3, 4])
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, 1e-1, "scale", "auto"],
)

print(creator)

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)

2024-05-03 15:26:24,561 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:24,561 - julearn - INFO - Step added
2024-05-03 15:26:24,561 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:24,561 - julearn - INFO - Tuning hyperparameter k = [2, 3, 4]
2024-05-03 15:26:24,561 - julearn - INFO - Step added
2024-05-03 15:26:24,561 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:24,561 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:24,561 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 0.1, 'scale', 'auto']
2024-05-03 15:26:24,561 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: select_k
    estimator:     SelectKBest()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'select_k__k': [2, 3, 4]}
  Step 2: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 0.1, 'scale', 'auto']}

2024-05-03 15:26:24,562 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:24,562 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:24,562 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:24,562 - julearn - INFO -      Target: species
2024-05-03 15:26:24,562 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:24,562 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:24,563 - julearn - INFO - ====================
2024-05-03 15:26:24,563 - julearn - INFO -
2024-05-03 15:26:24,563 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:24,563 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:24,563 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:24,564 - julearn - INFO -      select_k__k: [2, 3, 4]
2024-05-03 15:26:24,564 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:24,564 - julearn - INFO -      svm__gamma: [0.001, 0.01, 0.1, 'scale', 'auto']
2024-05-03 15:26:24,564 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:24,564 - julearn - INFO - Search Parameters:
2024-05-03 15:26:24,564 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:24,564 - julearn - INFO - ====================
2024-05-03 15:26:24,564 - julearn - INFO -
2024-05-03 15:26:24,564 - julearn - INFO - = Data Information =
2024-05-03 15:26:24,564 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:24,564 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:24,564 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:24,564 - julearn - INFO - ====================
2024-05-03 15:26:24,564 - julearn - INFO -
2024-05-03 15:26:24,564 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:24,564 - julearn - INFO -      Target type: object
2024-05-03 15:26:24,565 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:24,565 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:24,565 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:38,809 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9100000000000001
{'select_k__k': 4, 'svm__C': 10, 'svm__gamma': 0.01}

But how will julearn find the optimal hyperparameter set?

Searchers

julearn uses the same concept as scikit-learn to tune hyperparameters: it uses a searcher to find the best hyperparameter set. A searcher is an object that receives a set of hyperparameters and their values, and then tries to find the best combination of values for the hyperparameters using cross-validation.
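As a point of reference, this is roughly what a searcher looks like when used directly in scikit-learn. The sketch below assumes scikit-learn is installed and uses its bundled copy of the iris data instead of the seaborn one; the variable names (X_bin, y_bin, searcher) are chosen for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load iris and keep versicolor (1) and virginica (2) for a binary problem.
X_all, y_all = load_iris(return_X_y=True)
mask = y_all != 0
X_bin, y_bin = X_all[mask], y_all[mask]

pipeline = Pipeline([("zscore", StandardScaler()), ("svm", SVC())])

# The searcher receives the candidate values and a cross-validation scheme,
# then evaluates every candidate with that scheme.
param_grid = {"svm__C": [0.01, 0.1, 1, 10]}
searcher = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
searcher.fit(X_bin, y_bin)

print(searcher.best_params_)
print(searcher.best_score_)
```

Note that best_score_ here is the inner cross-validation score used to pick the hyperparameters, so it is not an unbiased performance estimate on its own. That is the reason for the outer loop of the nested cross-validation described earlier.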

By default, julearn uses a GridSearchCV. This searcher, specified as "grid", is very simple. First, it constructs the grid of hyperparameters to try. As we saw above, we have 3 hyperparameters to tune, so it constructs a 3-dimensional grid with all the possible combinations of the hyperparameter values. Second, it performs cross-validation on each of these combinations.
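The grid for the three hyperparameters tuned in the last pipeline can be enumerated with the standard library. This is only a sketch of the idea, not the code that julearn or scikit-learn run internally.

```python
from itertools import product

# Candidate values for each of the three tuned hyperparameters.
param_grid = {
    "select_k__k": [2, 3, 4],
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, 1e-1, "scale", "auto"],
}

# Every point of the 3-dimensional grid is one combination of values.
names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

print(len(combinations))
print(combinations[0])
```

The grid has 3 x 4 x 5 = 60 combinations. Since each combination is evaluated with cross-validation, the number of model fits grows quickly with every hyperparameter that is added.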

Other searchers that julearn provides are the RandomizedSearchCV, BayesSearchCV and OptunaSearchCV.

The randomized searcher (RandomizedSearchCV) is similar to the GridSearchCV, but instead of trying all the possible combinations of hyperparameter values, it tries a random subset of them. This is useful when we have many hyperparameters to tune, since trying all the possible combinations can be very time-consuming. It is also useful for continuous hyperparameters, whose values can be sampled from a distribution. For more information, see the RandomizedSearchCV documentation.
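The idea of trying a random subset can be sketched with the standard library. The snippet below is an illustration in the spirit of a randomized search over lists of values, not the actual RandomizedSearchCV implementation; the seed and variable names are arbitrary choices.

```python
import random
from itertools import product

param_grid = {
    "select_k__k": [2, 3, 4],
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, 1e-1, "scale", "auto"],
}

# Enumerate the full grid, then draw 10 distinct combinations from it.
all_combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
rng = random.Random(42)  # fixed seed so the draw is reproducible
sampled = rng.sample(all_combinations, k=10)

print(f"{len(sampled)} of {len(all_combinations)} combinations will be tried")
```

Only the sampled combinations are evaluated with cross-validation, which trades some chance of missing the best combination for a much lower computational cost.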

The Bayesian searcher (BayesSearchCV) is a bit more complex. It uses Bayesian optimization to find the best hyperparameter set. As with the randomized search, it is useful when we have many hyperparameters to tune, and we don’t want to try all the possible combinations due to computational constraints. For more information, see the BayesSearchCV documentation, including how to specify the prior distributions of the hyperparameters.

The Optuna searcher (OptunaSearchCV) uses the Optuna library to find the best hyperparameter set. Optuna is a hyperparameter optimization framework that has several algorithms to find the best hyperparameter set. For more information, see the Optuna documentation.

We can specify the kind of searcher and its parametrization by setting the search_params parameter in the run_cross_validation() function. For example, we can use the RandomizedSearchCV searcher with 10 iterations of random search.

search_params = {
    "kind": "random",
    "n_iter": 10,
}

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
    search_params=search_params,
)

print(
    "Scores with best hyperparameter using 10 iterations of "
    f"randomized search: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)

2024-05-03 15:26:41,675 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:41,675 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:41,675 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:41,675 - julearn - INFO -      Target: species
2024-05-03 15:26:41,676 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:41,676 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:41,676 - julearn - INFO - ====================
2024-05-03 15:26:41,676 - julearn - INFO -
2024-05-03 15:26:41,677 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:41,677 - julearn - INFO - Tuning hyperparameters using random
2024-05-03 15:26:41,677 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:41,677 - julearn - INFO -      select_k__k: [2, 3, 4]
2024-05-03 15:26:41,677 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:41,677 - julearn - INFO -      svm__gamma: [0.001, 0.01, 0.1, 'scale', 'auto']
2024-05-03 15:26:41,677 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:41,677 - julearn - INFO - Search Parameters:
2024-05-03 15:26:41,677 - julearn - INFO -      n_iter: 10
2024-05-03 15:26:41,677 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:41,677 - julearn - INFO - ====================
2024-05-03 15:26:41,677 - julearn - INFO -
2024-05-03 15:26:41,678 - julearn - INFO - = Data Information =
2024-05-03 15:26:41,678 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:41,678 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:41,678 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:41,678 - julearn - INFO - ====================
2024-05-03 15:26:41,678 - julearn - INFO -
2024-05-03 15:26:41,678 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:41,678 - julearn - INFO -      Target type: object
2024-05-03 15:26:41,678 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:41,679 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:41,679 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:44,114 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of randomized search: 0.89
{'select_k__k': 3, 'svm__C': 1, 'svm__gamma': 'auto'}

We can now see that the best hyperparameters might differ from the ones found by the grid search. This is because the randomized search tried only 10 combinations and not the whole grid. Furthermore, the RandomizedSearchCV searcher can sample hyperparameters from distributions, which is useful for continuous hyperparameters. Let’s set both C and gamma to be sampled from log-uniform distributions. We can do this by setting the hyperparameter values as a tuple with the following format: (low, high, distribution). The distribution can be either "log-uniform" or "uniform".

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=[2, 3, 4])
creator.add(
    "svm",
    C=(0.01, 10, "log-uniform"),
    gamma=(1e-3, 1e-1, "log-uniform"),
)

print(creator)

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
    search_params=search_params,
)

print(
    "Scores with best hyperparameter using 10 iterations of "
    f"randomized search: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)

2024-05-03 15:26:44,601 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:44,601 - julearn - INFO - Step added
2024-05-03 15:26:44,601 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:44,601 - julearn - INFO - Tuning hyperparameter k = [2, 3, 4]
2024-05-03 15:26:44,601 - julearn - INFO - Step added
2024-05-03 15:26:44,601 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:44,601 - julearn - INFO - Tuning hyperparameter C = (0.01, 10, 'log-uniform')
2024-05-03 15:26:44,601 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:44,601 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: select_k
    estimator:     SelectKBest()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'select_k__k': [2, 3, 4]}
  Step 2: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': (0.01, 10, 'log-uniform'), 'svm__gamma': (0.001, 0.1, 'log-uniform')}

2024-05-03 15:26:44,602 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:44,602 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:44,602 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:44,602 - julearn - INFO -      Target: species
2024-05-03 15:26:44,602 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:44,602 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:44,603 - julearn - INFO - ====================
2024-05-03 15:26:44,603 - julearn - INFO -
2024-05-03 15:26:44,603 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:44,603 - julearn - INFO - Tuning hyperparameters using random
2024-05-03 15:26:44,603 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:44,604 - julearn - INFO -      select_k__k: [2, 3, 4]
2024-05-03 15:26:44,604 - julearn - INFO -      svm__C: (0.01, 10, 'log-uniform')
2024-05-03 15:26:44,604 - julearn - INFO -      svm__gamma: (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:44,605 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:44,605 - julearn - INFO - Search Parameters:
2024-05-03 15:26:44,605 - julearn - INFO -      n_iter: 10
2024-05-03 15:26:44,605 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:44,605 - julearn - INFO - ====================
2024-05-03 15:26:44,605 - julearn - INFO -
2024-05-03 15:26:44,605 - julearn - INFO - = Data Information =
2024-05-03 15:26:44,605 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:44,605 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:44,605 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:44,605 - julearn - INFO - ====================
2024-05-03 15:26:44,605 - julearn - INFO -
2024-05-03 15:26:44,606 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:44,606 - julearn - INFO -      Target type: object
2024-05-03 15:26:44,606 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:44,606 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:44,606 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:47,052 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of randomized search: 0.95
{'select_k__k': 2,
 'svm__C': 8.77140446796582,
 'svm__gamma': 0.022636153281629743}
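
To build some intuition for what a (low, high, "log-uniform") range means, the following minimal sketch draws values the way a log-uniform distribution is usually defined, that is, uniformly in log-space, and compares them with plain uniform draws. It uses only the Python standard library. The helper names are hypothetical, and this is not how julearn or scikit-learn implement the sampling internally.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(samples, threshold):
    """Fraction of samples that are strictly below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
low, high = 0.01, 10

log_samples = [sample_log_uniform(low, high, rng) for _ in range(10_000)]
lin_samples = [rng.uniform(low, high) for _ in range(10_000)]

# The range 0.01 to 10 spans three decades, so about one third of the
# log-uniform draws should fall in the first decade (0.01 to 0.1), while
# plain uniform draws almost never land there.
log_share = share_below(log_samples, 0.1)
lin_share = share_below(lin_samples, 0.1)
print(f"log-uniform share below 0.1: {log_share:.2f}")
print(f"uniform share below 0.1:     {lin_share:.2f}")
```

This is why a log-uniform range is a common choice for hyperparameters such as C and gamma, whose useful values span several orders of magnitude.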

We can also control the number of cross-validation folds used by the searcher by setting the cv parameter in the search_params dictionary. For example, we can use a Bayesian search with 3 folds. Like the randomized searcher, the BayesSearchCV searcher also accepts distributions for the hyperparameters.

search_params = {
    "kind": "bayes",
    "n_iter": 10,
    "cv": 3,
}

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
    search_params=search_params,
)

print(
    "Scores with best hyperparameter using 10 iterations of "
    f"bayesian search and 3-fold CV: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)

2024-05-03 15:26:47,540 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:47,540 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:47,540 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:47,540 - julearn - INFO -      Target: species
2024-05-03 15:26:47,541 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:47,541 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:47,541 - julearn - INFO - ====================
2024-05-03 15:26:47,541 - julearn - INFO -
2024-05-03 15:26:47,542 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:47,542 - julearn - INFO - Tuning hyperparameters using bayes
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:47,542 - julearn - INFO -      select_k__k: [2, 3, 4]
2024-05-03 15:26:47,542 - julearn - INFO -      svm__C: (0.01, 10, 'log-uniform')
2024-05-03 15:26:47,542 - julearn - INFO -      svm__gamma: (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameter select_k__k as is [2, 3, 4]
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameter svm__C as is (0.01, 10, 'log-uniform')
2024-05-03 15:26:47,542 - julearn - INFO - Hyperparameter svm__gamma is log-uniform float [0.001, 0.1]
2024-05-03 15:26:47,543 - julearn - INFO - Using inner CV scheme KFold(n_splits=3, random_state=None, shuffle=False)
2024-05-03 15:26:47,543 - julearn - INFO - Search Parameters:
2024-05-03 15:26:47,543 - julearn - INFO -      n_iter: 10
2024-05-03 15:26:47,543 - julearn - INFO -      cv: KFold(n_splits=3, random_state=None, shuffle=False)
2024-05-03 15:26:47,546 - julearn - INFO - ====================
2024-05-03 15:26:47,546 - julearn - INFO -
2024-05-03 15:26:47,546 - julearn - INFO - = Data Information =
2024-05-03 15:26:47,546 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:47,546 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:47,546 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:47,546 - julearn - INFO - ====================
2024-05-03 15:26:47,546 - julearn - INFO -
2024-05-03 15:26:47,546 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:47,546 - julearn - INFO -      Target type: object
2024-05-03 15:26:47,547 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:47,547 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:47,547 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:52,059 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of bayesian search and 3-fold CV: 0.9099999999999999
OrderedDict([('select_k__k', 4),
             ('svm__C', 3.8975984906619887),
             ('svm__gamma', 0.028707916525659204)])
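
Both n_iter and cv affect how long the search takes. As a rough back-of-the-envelope sketch, the number of model fits can be estimated from the outer CV folds, the number of search iterations and the inner CV folds. The function name is hypothetical. The count assumes that every search refits the best candidate once and that the final model runs one more search on the whole dataset. The real number depends on the searcher and its settings.

```python
def estimate_fits(outer_folds, n_iter, inner_folds, final_model=True):
    """Rough number of pipeline fits in a nested cross-validation.

    Assumes each search fits every candidate on every inner fold and then
    refits the best candidate once on its whole training set.
    """
    fits_per_search = n_iter * inner_folds + 1
    # One search per outer fold, plus (optionally) one for the final model.
    n_searches = outer_folds + (1 if final_model else 0)
    return n_searches * fits_per_search


# Settings taken from the logs above: 5 outer folds and 10 iterations,
# with either 5 or 3 inner folds.
fits_5_inner = estimate_fits(outer_folds=5, n_iter=10, inner_folds=5)
fits_3_inner = estimate_fits(outer_folds=5, n_iter=10, inner_folds=3)
print(fits_5_inner, fits_3_inner)
```

Under these assumptions, reducing the inner folds from 5 to 3 cuts the estimated work by roughly 40%, at the price of a noisier estimate of each candidate's performance.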

An example using the optuna searcher is shown below. The searcher is selected by setting "kind" to "optuna" in the search_params dictionary, and the hyperparameters to tune are specified as distributions, as for the Bayesian searcher. However, the behaviour of the optuna searcher is controlled by a Study object, which can be passed to the searcher using the study parameter in the search_params dictionary.

Important

The optuna searcher requires that all the hyperparameters are specified as distributions, even the categorical ones.

We first modify the pipeline creator so that the k parameter of select_k is specified as a distribution. As an example, we also use a categorical distribution for the class_weight hyperparameter, trying the "balanced" and None values.

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=(2, 4, "uniform"))
creator.add(
    "svm",
    C=(0.01, 10, "log-uniform"),
    gamma=(1e-3, 1e-1, "log-uniform"),
    class_weight=("balanced", None, "categorical")
)
print(creator)

2024-05-03 15:26:53,014 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:53,014 - julearn - INFO - Step added
2024-05-03 15:26:53,014 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:53,014 - julearn - INFO - Tuning hyperparameter k = (2, 4, 'uniform')
2024-05-03 15:26:53,015 - julearn - INFO - Step added
2024-05-03 15:26:53,015 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:53,015 - julearn - INFO - Tuning hyperparameter C = (0.01, 10, 'log-uniform')
2024-05-03 15:26:53,015 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:53,015 - julearn - INFO - Tuning hyperparameter class_weight = ('balanced', None, 'categorical')
2024-05-03 15:26:53,015 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: select_k
    estimator:     SelectKBest()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'select_k__k': (2, 4, 'uniform')}
  Step 2: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': (0.01, 10, 'log-uniform'), 'svm__gamma': (0.001, 0.1, 'log-uniform'), 'svm__class_weight': ('balanced', None, 'categorical')}

We can now use the optuna searcher with 10 trials and 3-fold cross-validation.

import optuna

study = optuna.create_study(
    direction="maximize",
    study_name="optuna-concept",
    load_if_exists=True,
)

search_params = {
    "kind": "optuna",
    "study": study,
    "cv": 3,
}
scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
    search_params=search_params,
)

print(
    "Scores with best hyperparameter using 10 iterations of "
    f"optuna and 3-fold CV: {scores_tuned['test_score'].mean()}"
)
pprint(model_tuned.best_params_)

2024-05-03 15:26:53,017 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:53,017 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:53,017 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:53,017 - julearn - INFO -      Target: species
2024-05-03 15:26:53,017 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:53,017 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:53,018 - julearn - INFO - ====================
2024-05-03 15:26:53,018 - julearn - INFO -
2024-05-03 15:26:53,019 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:53,019 - julearn - INFO - Tuning hyperparameters using optuna
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:53,019 - julearn - INFO -      select_k__k: (2, 4, 'uniform')
2024-05-03 15:26:53,019 - julearn - INFO -      svm__C: (0.01, 10, 'log-uniform')
2024-05-03 15:26:53,019 - julearn - INFO -      svm__gamma: (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:53,019 - julearn - INFO -      svm__class_weight: ('balanced', None, 'categorical')
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter select_k__k is uniform integer [2, 4]
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter svm__C is log-uniform float [0.01, 10]
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter svm__gamma is log-uniform float [0.001, 0.1]
2024-05-03 15:26:53,019 - julearn - INFO - Hyperparameter svm__class_weight is categorical with 2 options: [balanced and None]
2024-05-03 15:26:53,020 - julearn - INFO - Using inner CV scheme KFold(n_splits=3, random_state=None, shuffle=False)
2024-05-03 15:26:53,020 - julearn - INFO - Search Parameters:
2024-05-03 15:26:53,020 - julearn - INFO -      study: <optuna.study.study.Study object at 0x7fd476fc2830>
2024-05-03 15:26:53,020 - julearn - INFO -      cv: KFold(n_splits=3, random_state=None, shuffle=False)
/home/runner/work/julearn/julearn/julearn/pipeline/pipeline_creator.py:1041: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
  pipeline = search(  # type: ignore
2024-05-03 15:26:53,020 - julearn - INFO - ====================
2024-05-03 15:26:53,020 - julearn - INFO -
2024-05-03 15:26:53,021 - julearn - INFO - = Data Information =
2024-05-03 15:26:53,021 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:53,021 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:53,021 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:53,021 - julearn - INFO - ====================
2024-05-03 15:26:53,021 - julearn - INFO -
2024-05-03 15:26:53,021 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:53,021 - julearn - INFO -      Target type: object
2024-05-03 15:26:53,022 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:53,022 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:53,022 - julearn - INFO - Binary classification problem detected.
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
  new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
  new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
  new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
  new_object = klass(**new_object_params)
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/site-packages/sklearn/base.py:125: ExperimentalWarning: OptunaSearchCV is experimental (supported from v0.17.0). The interface can change in the future.
  new_object = klass(**new_object_params)
2024-05-03 15:26:54,635 - julearn - INFO - Fitting final model
Scores with best hyperparameter using 10 iterations of optuna and 3-fold CV: 0.76
{'select_k__k': 3,
 'svm__C': 4.199561794026207,
 'svm__class_weight': 'balanced',
 'svm__gamma': 0.0017858196018021508}

Specifying distributions#

The hyperparameters can be specified as distributions for the randomized, Bayesian and optuna searchers. The distributions are specified either with a toolbox-specific method or with a tuple convention in one of the following formats: (low, high, distribution), where the distribution can be either "log-uniform" or "uniform", or (a, b, c, d, ..., "categorical"), where a, b, c, d, etc. are the possible categorical values for the hyperparameter.
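
The two tuple shapes can be told apart by their last element. The following minimal sketch shows one way to read such a tuple. describe_spec is a hypothetical helper written for illustration only; it is not part of julearn, and the rule that integer bounds imply an integer-valued hyperparameter is an assumption.

```python
def describe_spec(spec):
    """Describe a hyperparameter given in the tuple convention.

    Hypothetical helper for illustration only; not part of julearn.
    """
    # The last element names the distribution, the rest are its arguments.
    *values, kind = spec
    if kind == "categorical":
        return f"categorical with {len(values)} options: {values}"
    if kind in ("uniform", "log-uniform"):
        if len(values) != 2:
            raise ValueError(f"Expected (low, high, {kind!r}), got {spec!r}")
        low, high = values
        # Assumption: two integer bounds mean an integer-valued hyperparameter.
        is_int = isinstance(low, int) and isinstance(high, int)
        dtype = "integer" if is_int else "float"
        return f"{kind} {dtype} [{low}, {high}]"
    raise ValueError(f"Unknown distribution: {kind!r}")


print(describe_spec((0.01, 10, "log-uniform")))
print(describe_spec((2, 4, "uniform")))
print(describe_spec(("balanced", None, "categorical")))
```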

For example, we can specify the C and gamma hyperparameters of the SVC as log-uniform distributions, while keeping the with_mean parameter of the StandardScaler as a categorical parameter with two options.

creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=(True, False, "categorical"))
creator.add(
    "svm",
    C=(0.01, 10, "log-uniform"),
    gamma=(1e-3, 1e-1, "log-uniform"),
)
print(creator)

2024-05-03 15:26:54,950 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,950 - julearn - INFO - Tuning hyperparameter with_mean = (True, False, 'categorical')
2024-05-03 15:26:54,951 - julearn - INFO - Step added
2024-05-03 15:26:54,951 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,951 - julearn - INFO - Tuning hyperparameter C = (0.01, 10, 'log-uniform')
2024-05-03 15:26:54,951 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:54,951 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'zscore__with_mean': (True, False, 'categorical')}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': (0.01, 10, 'log-uniform'), 'svm__gamma': (0.001, 0.1, 'log-uniform')}

While this will work for any of the random, bayes or optuna searcher options, it is important to note that both the bayes and optuna searchers accept further parameters to specify distributions. For example, the bayes searcher distributions are defined using the Categorical, Integer and Real classes from skopt.space.

For example, we can define a log-uniform distribution with base 2 for the C hyperparameter of the SVC model:

from skopt.space import Real
creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=(True, False, "categorical"))
creator.add(
    "svm",
    C=Real(0.01, 10, prior="log-uniform", base=2),
    gamma=(1e-3, 1e-1, "log-uniform"),
)
print(creator)

2024-05-03 15:26:54,952 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,952 - julearn - INFO - Tuning hyperparameter with_mean = (True, False, 'categorical')
2024-05-03 15:26:54,952 - julearn - INFO - Step added
2024-05-03 15:26:54,953 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,953 - julearn - INFO - Tuning hyperparameter C = Real(low=0.01, high=10, prior='log-uniform', transform='identity')
2024-05-03 15:26:54,953 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:54,953 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'zscore__with_mean': (True, False, 'categorical')}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': Real(low=0.01, high=10, prior='log-uniform', transform='identity'), 'svm__gamma': (0.001, 0.1, 'log-uniform')}

For the optuna searcher, the distributions are defined using the CategoricalDistribution, FloatDistribution and IntDistribution classes from optuna.distributions.

For example, we can define a uniform distribution from 0.5 to 0.9 with a 0.05 step for the n_components of a PCA transformer, while keeping a log-uniform distribution for the C and gamma hyperparameters of the SVC model.

from optuna.distributions import FloatDistribution
creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "pca",
    n_components=FloatDistribution(0.5, 0.9, step=0.05),
)
creator.add(
    "svm",
    C=FloatDistribution(0.01, 10, log=True),
    gamma=(1e-3, 1e-1, "log-uniform"),
)
print(creator)

2024-05-03 15:26:54,954 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,954 - julearn - INFO - Step added
2024-05-03 15:26:54,954 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,954 - julearn - INFO - Setting hyperparameter n_components = FloatDistribution(high=0.9, log=False, low=0.5, step=0.05)
2024-05-03 15:26:54,954 - julearn - INFO - Step added
2024-05-03 15:26:54,954 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,954 - julearn - INFO - Setting hyperparameter C = FloatDistribution(high=10.0, log=True, low=0.01, step=None)
2024-05-03 15:26:54,954 - julearn - INFO - Tuning hyperparameter gamma = (0.001, 0.1, 'log-uniform')
2024-05-03 15:26:54,954 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: pca
    estimator:     PCA(n_components=FloatDistribution(high=0.9, log=False, low=0.5, step=0.05))
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 2: svm
    estimator:     SVC(C=FloatDistribution(high=10.0, log=True, low=0.01, step=None))
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__gamma': (0.001, 0.1, 'log-uniform')}

Tuning more than one grid#

Following our tuning of the SVC hyperparameters, we can also tune the kernel hyperparameter, which can also be “linear”. Let’s see what our grid of hyperparameters would look like if we add this hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["linear", "rbf"],
)
print(creator)

2024-05-03 15:26:54,956 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,956 - julearn - INFO - Step added
2024-05-03 15:26:54,956 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,956 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,956 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,956 - julearn - INFO - Tuning hyperparameter kernel = ['linear', 'rbf']
2024-05-03 15:26:54,956 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto'], 'svm__kernel': ['linear', 'rbf']}

We can see that the grid of hyperparameters is now 3-dimensional. However, some combinations don’t make much sense. For example, the gamma hyperparameter is only used when the kernel is rbf. So we will be trying the linear kernel with each one of the 4 different gamma and 4 different C values. That is 16 combinations for the linear kernel, of which only the 4 C values make a difference, so 12 of them are redundant. We can avoid this by using multiple grids: one grid for the linear kernel and one grid for the rbf kernel.
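
The size of each search space can be checked by counting combinations. The following minimal sketch uses only the standard library. The dictionaries mirror the hyperparameter values used in this section, and count_combinations is a hypothetical helper, not a julearn function.

```python
from itertools import product


def count_combinations(grid):
    """Number of hyperparameter combinations in a single grid."""
    return sum(1 for _ in product(*grid.values()))


C_values = [0.01, 0.1, 1, 10]
gamma_values = [1e-3, 1e-2, "scale", "auto"]

# One grid that crosses every kernel with every C and gamma value.
single_grid = {
    "C": C_values,
    "gamma": gamma_values,
    "kernel": ["linear", "rbf"],
}

# Two grids: gamma is only searched where it has an effect (rbf kernel).
rbf_grid = {"C": C_values, "gamma": gamma_values, "kernel": ["rbf"]}
linear_grid = {"C": C_values, "kernel": ["linear"]}

n_single = count_combinations(single_grid)
n_split = count_combinations(rbf_grid) + count_combinations(linear_grid)
print(f"single grid: {n_single} combinations")
print(f"two grids:   {n_split} combinations")
```

Each combination is evaluated on every inner fold, so the difference grows with the number of folds.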

julearn allows specifying multiple grids using two different approaches.

Repeating the step name with different hyperparameters:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
    name="svm",
)
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
    name="svm",
)

print(creator)

scores1, model1 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores1['test_score'].mean()}")
pprint(model1.best_params_)

2024-05-03 15:26:54,957 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,957 - julearn - INFO - Step added
2024-05-03 15:26:54,957 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,957 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,957 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,957 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-05-03 15:26:54,957 - julearn - INFO - Step added
2024-05-03 15:26:54,958 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:54,958 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,958 - julearn - INFO - Setting hyperparameter kernel = linear
2024-05-03 15:26:54,958 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto']}
  Step 2: svm
    estimator:     SVC(kernel='linear')
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10]}

2024-05-03 15:26:54,958 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:54,958 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:54,959 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:54,959 - julearn - INFO -      Target: species
2024-05-03 15:26:54,959 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:54,959 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:54,959 - julearn - INFO - ====================
2024-05-03 15:26:54,959 - julearn - INFO -
2024-05-03 15:26:54,960 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:54,960 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:54,960 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:54,960 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,960 - julearn - INFO -      svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,960 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,960 - julearn - INFO - Search Parameters:
2024-05-03 15:26:54,960 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,960 - julearn - INFO - ====================
2024-05-03 15:26:54,960 - julearn - INFO -
2024-05-03 15:26:54,961 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:54,961 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:54,961 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:54,961 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,961 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,961 - julearn - INFO - Search Parameters:
2024-05-03 15:26:54,961 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,961 - julearn - INFO - ====================
2024-05-03 15:26:54,961 - julearn - INFO -
2024-05-03 15:26:54,961 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:54,961 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:54,962 - julearn - INFO - Hyperparameters list:
2024-05-03 15:26:54,962 - julearn - INFO -      Set 0
2024-05-03 15:26:54,962 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,962 - julearn - INFO -              svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:54,962 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:26:54,962 - julearn - INFO -              svm: [SVC()]
2024-05-03 15:26:54,962 - julearn - INFO -      Set 1
2024-05-03 15:26:54,962 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:54,962 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:26:54,963 - julearn - INFO -              svm: [SVC(kernel='linear')]
2024-05-03 15:26:54,963 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,963 - julearn - INFO - Search Parameters:
2024-05-03 15:26:54,963 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,963 - julearn - INFO - ====================
2024-05-03 15:26:54,963 - julearn - INFO -
2024-05-03 15:26:54,963 - julearn - INFO - = Data Information =
2024-05-03 15:26:54,963 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:54,963 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:54,963 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:54,963 - julearn - INFO - ====================
2024-05-03 15:26:54,963 - julearn - INFO -
2024-05-03 15:26:54,963 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:54,963 - julearn - INFO -      Target type: object
2024-05-03 15:26:54,964 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:54,964 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:54,964 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:26:58,804 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']}),
 'svm': SVC(),
 'svm__C': 10,
 'svm__gamma': 0.01}

Important

Note that the name parameter is required when repeating a step name. If we do not specify it, julearn will automatically determine a unique step name. The only way to repeat a step name is to set it explicitly.

Using multiple pipeline creators:

creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
)

creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
)

scores2, model2 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=[creator1, creator2],
    return_estimator="all",
)


print(f"Scores with best hyperparameter: {scores2['test_score'].mean()}")
pprint(model2.best_params_)

2024-05-03 15:26:59,576 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,576 - julearn - INFO - Step added
2024-05-03 15:26:59,576 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,576 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,576 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:59,576 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-05-03 15:26:59,576 - julearn - INFO - Step added
2024-05-03 15:26:59,577 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,577 - julearn - INFO - Step added
2024-05-03 15:26:59,577 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:26:59,577 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,577 - julearn - INFO - Setting hyperparameter kernel = linear
2024-05-03 15:26:59,577 - julearn - INFO - Step added
2024-05-03 15:26:59,577 - julearn - INFO - ==== Input Data ====
2024-05-03 15:26:59,577 - julearn - INFO - Using dataframe as input
2024-05-03 15:26:59,577 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:59,577 - julearn - INFO -      Target: species
2024-05-03 15:26:59,577 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:26:59,577 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:26:59,578 - julearn - INFO - ====================
2024-05-03 15:26:59,578 - julearn - INFO -
2024-05-03 15:26:59,578 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:59,578 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:59,578 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:59,578 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,579 - julearn - INFO -      svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:59,579 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,579 - julearn - INFO - Search Parameters:
2024-05-03 15:26:59,579 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,579 - julearn - INFO - ====================
2024-05-03 15:26:59,579 - julearn - INFO -
2024-05-03 15:26:59,579 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:59,579 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:59,579 - julearn - INFO - Hyperparameters:
2024-05-03 15:26:59,580 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,580 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,580 - julearn - INFO - Search Parameters:
2024-05-03 15:26:59,580 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,580 - julearn - INFO - ====================
2024-05-03 15:26:59,580 - julearn - INFO -
2024-05-03 15:26:59,580 - julearn - INFO - = Model Parameters =
2024-05-03 15:26:59,580 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:26:59,580 - julearn - INFO - Hyperparameters list:
2024-05-03 15:26:59,580 - julearn - INFO -      Set 0
2024-05-03 15:26:59,580 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,580 - julearn - INFO -              svm__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:26:59,581 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:26:59,581 - julearn - INFO -              zscore: [StandardScaler()]
2024-05-03 15:26:59,581 - julearn - INFO -              svm: [SVC()]
2024-05-03 15:26:59,581 - julearn - INFO -      Set 1
2024-05-03 15:26:59,581 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2024-05-03 15:26:59,581 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:26:59,581 - julearn - INFO -              zscore: [StandardScaler()]
2024-05-03 15:26:59,582 - julearn - INFO -              svm: [SVC(kernel='linear')]
2024-05-03 15:26:59,582 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,582 - julearn - INFO - Search Parameters:
2024-05-03 15:26:59,582 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,582 - julearn - INFO - ====================
2024-05-03 15:26:59,582 - julearn - INFO -
2024-05-03 15:26:59,582 - julearn - INFO - = Data Information =
2024-05-03 15:26:59,582 - julearn - INFO -      Problem type: classification
2024-05-03 15:26:59,582 - julearn - INFO -      Number of samples: 100
2024-05-03 15:26:59,582 - julearn - INFO -      Number of features: 4
2024-05-03 15:26:59,582 - julearn - INFO - ====================
2024-05-03 15:26:59,582 - julearn - INFO -
2024-05-03 15:26:59,582 - julearn - INFO -      Number of classes: 2
2024-05-03 15:26:59,582 - julearn - INFO -      Target type: object
2024-05-03 15:26:59,583 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:26:59,583 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:26:59,583 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:27:03,459 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']}),
 'svm': SVC(),
 'svm__C': 10,
 'svm__gamma': 0.01,
 'zscore': StandardScaler()}

Important

All the pipeline creators must have the same problem type and the same step names for this approach to work.

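Indeed, if we compare both approaches, we can see that they are equivalent: they search over the same hyperparameter values. The only difference is that the second grid also lists the zscore step explicitly, with a single value: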

pprint(model1.param_grid)
pprint(model2.param_grid)

[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC()],
  'svm__C': [0.01, 0.1, 1, 10],
  'svm__gamma': [0.001, 0.01, 'scale', 'auto']},
 {'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC(kernel='linear')],
  'svm__C': [0.01, 0.1, 1, 10]}]
[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC()],
  'svm__C': [0.01, 0.1, 1, 10],
  'svm__gamma': [0.001, 0.01, 'scale', 'auto'],
  'zscore': [StandardScaler()]},
 {'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC(kernel='linear')],
  'svm__C': [0.01, 0.1, 1, 10],
  'zscore': [StandardScaler()]}]
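
To make the equivalence concrete, here is a minimal sketch in plain Python that expands each list of grids into individual candidates and counts them. It assumes the usual grid-search convention that every dictionary in the list is expanded on its own as a Cartesian product; the `expand` helper and the variable names are hypothetical, strings stand in for the estimator objects, and the single-valued set_column_types entry is left out for brevity.

```python
from itertools import product


def expand(grids):
    """Expand a list of grid dicts into individual candidate settings.

    Each dict is expanded independently (Cartesian product of its value
    lists) and the resulting candidates are concatenated.
    """
    candidates = []
    for grid in grids:
        keys = sorted(grid)
        for values in product(*(grid[key] for key in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates


# Strings stand in for the estimator objects printed above (illustration only).
grid_single_creator = [
    {
        "svm": ["SVC()"],
        "svm__C": [0.01, 0.1, 1, 10],
        "svm__gamma": [0.001, 0.01, "scale", "auto"],
    },
    {
        "svm": ["SVC(kernel='linear')"],
        "svm__C": [0.01, 0.1, 1, 10],
    },
]

# The multi-creator grid only adds a single-valued "zscore" entry to each set.
grid_two_creators = [
    dict(grid, zscore=["StandardScaler()"]) for grid in grid_single_creator
]

n_single = len(expand(grid_single_creator))
n_multi = len(expand(grid_two_creators))
print(n_single, n_multi)  # 20 20
```

Both lists expand to 16 rbf candidates plus 4 linear candidates. A key with a single value multiplies the number of candidates by one, so the extra zscore entry does not change the search.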

Models as hyperparameters#

But why stop there? Models can also be considered as hyperparameters. For example, we can try different models for the classification task. Let’s try the RandomForestClassifier and the LogisticRegression too:

creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
    name="model",
)

creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
    name="model",
)

creator3 = PipelineCreator(problem_type="classification")
creator3.add("zscore")
creator3.add(
    "rf",
    max_depth=[2, 3, 4],
    name="model",
)

creator4 = PipelineCreator(problem_type="classification")
creator4.add("zscore")
creator4.add(
    "logit",
    penalty=["l2", "l1"],
    dual=[False],
    solver="liblinear",
    name="model",
)

scores3, model3 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=[creator1, creator2, creator3, creator4],
    return_estimator="all",
)


print(f"Scores with best hyperparameter: {scores3['test_score'].mean()}")
pprint(model3.best_params_)

2024-05-03 15:27:04,241 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,241 - julearn - INFO - Step added
2024-05-03 15:27:04,241 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,241 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,241 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:27:04,241 - julearn - INFO - Setting hyperparameter kernel = rbf
2024-05-03 15:27:04,241 - julearn - INFO - Step added
2024-05-03 15:27:04,241 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,242 - julearn - INFO - Setting hyperparameter kernel = linear
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Tuning hyperparameter max_depth = [2, 3, 4]
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Step added
2024-05-03 15:27:04,242 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-05-03 15:27:04,242 - julearn - INFO - Tuning hyperparameter penalty = ['l2', 'l1']
2024-05-03 15:27:04,242 - julearn - INFO - Setting hyperparameter dual = False
2024-05-03 15:27:04,242 - julearn - INFO - Setting hyperparameter solver = liblinear
2024-05-03 15:27:04,243 - julearn - INFO - Step added
2024-05-03 15:27:04,243 - julearn - INFO - ==== Input Data ====
2024-05-03 15:27:04,243 - julearn - INFO - Using dataframe as input
2024-05-03 15:27:04,243 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:27:04,243 - julearn - INFO -      Target: species
2024-05-03 15:27:04,243 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-05-03 15:27:04,243 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-05-03 15:27:04,243 - julearn - INFO - ====================
2024-05-03 15:27:04,243 - julearn - INFO -
2024-05-03 15:27:04,244 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,244 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,244 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,244 - julearn - INFO -      model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,244 - julearn - INFO -      model__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:27:04,244 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,244 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,244 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,244 - julearn - INFO - ====================
2024-05-03 15:27:04,245 - julearn - INFO -
2024-05-03 15:27:04,245 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,245 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,245 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,245 - julearn - INFO -      model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,245 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,245 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,245 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,246 - julearn - INFO - ====================
2024-05-03 15:27:04,246 - julearn - INFO -
2024-05-03 15:27:04,246 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,246 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,246 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,246 - julearn - INFO -      model__max_depth: [2, 3, 4]
2024-05-03 15:27:04,246 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,246 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,246 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,246 - julearn - INFO - ====================
2024-05-03 15:27:04,247 - julearn - INFO -
2024-05-03 15:27:04,247 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,247 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,247 - julearn - INFO - Hyperparameters:
2024-05-03 15:27:04,247 - julearn - INFO -      model__penalty: ['l2', 'l1']
2024-05-03 15:27:04,247 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,247 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,247 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,247 - julearn - INFO - ====================
2024-05-03 15:27:04,247 - julearn - INFO -
2024-05-03 15:27:04,248 - julearn - INFO - = Model Parameters =
2024-05-03 15:27:04,248 - julearn - INFO - Tuning hyperparameters using grid
2024-05-03 15:27:04,248 - julearn - INFO - Hyperparameters list:
2024-05-03 15:27:04,248 - julearn - INFO -      Set 0
2024-05-03 15:27:04,248 - julearn - INFO -              model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,248 - julearn - INFO -              model__gamma: [0.001, 0.01, 'scale', 'auto']
2024-05-03 15:27:04,248 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:27:04,248 - julearn - INFO -              zscore: [StandardScaler()]
2024-05-03 15:27:04,248 - julearn - INFO -              model: [SVC()]
2024-05-03 15:27:04,248 - julearn - INFO -      Set 1
2024-05-03 15:27:04,248 - julearn - INFO -              model__C: [0.01, 0.1, 1, 10]
2024-05-03 15:27:04,249 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:27:04,249 - julearn - INFO -              zscore: [StandardScaler()]
2024-05-03 15:27:04,249 - julearn - INFO -              model: [SVC(kernel='linear')]
2024-05-03 15:27:04,249 - julearn - INFO -      Set 2
2024-05-03 15:27:04,249 - julearn - INFO -              model__max_depth: [2, 3, 4]
2024-05-03 15:27:04,249 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:27:04,250 - julearn - INFO -              zscore: [StandardScaler()]
2024-05-03 15:27:04,250 - julearn - INFO -              model: [RandomForestClassifier()]
2024-05-03 15:27:04,250 - julearn - INFO -      Set 3
2024-05-03 15:27:04,250 - julearn - INFO -              model__penalty: ['l2', 'l1']
2024-05-03 15:27:04,250 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2024-05-03 15:27:04,250 - julearn - INFO -              zscore: [StandardScaler()]
2024-05-03 15:27:04,250 - julearn - INFO -              model: [LogisticRegression(solver='liblinear')]
2024-05-03 15:27:04,251 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,251 - julearn - INFO - Search Parameters:
2024-05-03 15:27:04,251 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,251 - julearn - INFO - ====================
2024-05-03 15:27:04,251 - julearn - INFO -
2024-05-03 15:27:04,251 - julearn - INFO - = Data Information =
2024-05-03 15:27:04,251 - julearn - INFO -      Problem type: classification
2024-05-03 15:27:04,251 - julearn - INFO -      Number of samples: 100
2024-05-03 15:27:04,251 - julearn - INFO -      Number of features: 4
2024-05-03 15:27:04,251 - julearn - INFO - ====================
2024-05-03 15:27:04,251 - julearn - INFO -
2024-05-03 15:27:04,251 - julearn - INFO -      Number of classes: 2
2024-05-03 15:27:04,251 - julearn - INFO -      Target type: object
2024-05-03 15:27:04,252 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-05-03 15:27:04,252 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-05-03 15:27:04,252 - julearn - INFO - Binary classification problem detected.
2024-05-03 15:27:15,883 - julearn - INFO - Fitting final model
Scores with best hyperparameter: 0.9200000000000002
{'model': SVC(),
 'model__C': 10,
 'model__gamma': 0.01,
 'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']}),
 'zscore': StandardScaler()}

Well, it seems that nothing can beat the SVC with kernel="rbf" for our classification example: the final model again selects it, with C=10 and gamma=0.01.
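
Treating models as hyperparameters makes the search larger with every model added. The following back-of-the-envelope sketch counts the fits implied by the four grids above, assuming the 5-fold inner and 5-fold outer KFold schemes reported in the log. The dictionary keys and variable names are hypothetical labels, and the count ignores the refit on each outer training set and the final model.

```python
# Number of candidates contributed by each pipeline creator's grid.
grid_sizes = {
    "svm_rbf": 4 * 4,  # 4 values of C x 4 values of gamma
    "svm_linear": 4,   # 4 values of C
    "rf": 3,           # 3 values of max_depth
    "logit": 2,        # 2 values of penalty
}

n_candidates = sum(grid_sizes.values())

inner_folds = 5  # inner CV scheme reported in the log
outer_folds = 5  # outer CV scheme reported in the log

# Every candidate is fitted once per inner fold, within each outer fold.
fits_per_search = n_candidates * inner_folds
total_inner_fits = fits_per_search * outer_folds

print(n_candidates, fits_per_search, total_inner_fits)  # 25 125 625
```

Because the grids are expanded separately, the candidate counts add up rather than multiply, which keeps a multi-model search like this one affordable.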
