3.3. Hyperparameter Tuning#

Parameters vs Hyperparameters#

Parameters are the values that define the model, and are learned from the data. For example, the weights of a linear regression model are parameters. The parameters of a model are learned during training and are not set by the user.

Hyperparameters are values that also define the model, but are not learned from the data. For example, the regularization parameter C of a Support Vector Classifier (SVC) model is a hyperparameter. The hyperparameters of a model are set by the user before training and are not learned during training.
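To make the distinction concrete, here is a minimal sketch in plain Python (not julearn) of a one-feature ridge regression without intercept. The function name fit_ridge_1d, the toy data and the penalty name alpha are illustrative assumptions. The weight w is the parameter learned from the data, while alpha is a hyperparameter fixed before fitting.

```python
def fit_ridge_1d(x, y, alpha):
    """Learn the single weight w of y ~ w * x with an L2 penalty alpha.

    Closed form: w = sum(x * y) / (sum(x * x) + alpha).
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + alpha)


# Toy data generated by y = 2 * x (hypothetical values).
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]

# Same data, different hyperparameter values -> different learned parameters.
w_no_penalty = fit_ridge_1d(x, y, alpha=0.0)
w_penalized = fit_ridge_1d(x, y, alpha=14.0)
print(w_no_penalty, w_penalized)  # 2.0 1.0
```

The data never changes here; only the user-chosen alpha does, and the learned weight changes with it.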

Let’s see an example of an SVC model with a regularization parameter C. We will use the iris dataset, which is a dataset of measurements of flowers.

Let’s start by loading the dataset and setting the features and target variables.

from seaborn import load_dataset
from pprint import pprint  # To print in a pretty way

df = load_dataset("iris")
X = df.columns[:-1].tolist()
y = "species"
X_types = {"continuous": X}

# The dataset has three species. We will keep two of them to perform a binary
# classification.
df = df[df["species"].isin(["versicolor", "virginica"])]

We can now use the PipelineCreator to create a pipeline with a StandardScaler (the "zscore" step) and an SVC, with the regularization parameter C set to 0.1.

from julearn.pipeline import PipelineCreator

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=0.1)

print(creator)

2023-07-19 12:47:19,246 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:19,246 - julearn - INFO - Step added
2023-07-19 12:47:19,246 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:19,246 - julearn - INFO - Setting hyperparameter C = 0.1
2023-07-19 12:47:19,246 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC(C=0.1)
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

Hyperparameter Tuning#

Since it is the user who sets the hyperparameters, it is important to choose the right values. This is not always easy, and it is common to try different values and see which one works best. This process is called hyperparameter tuning.

Basically, hyperparameter tuning refers to testing several hyperparameter values and choosing the one that works best.

For example, we can try different values for the regularization parameter C of the SVC model and see which one works best.

from julearn import run_cross_validation

scores1 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)

print(f"Score with C=0.1: {scores1['test_score'].mean()}")

creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add("svm", C=0.01)

scores2 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator2,
)

print(f"Score with C=0.01: {scores2['test_score'].mean()}")

2023-07-19 12:47:19,248 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:19,248 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:19,248 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:19,248 - julearn - INFO -      Target: species
2023-07-19 12:47:19,248 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:19,248 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:19,249 - julearn - INFO - ====================
2023-07-19 12:47:19,249 - julearn - INFO -
2023-07-19 12:47:19,250 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:19,250 - julearn - INFO - ====================
2023-07-19 12:47:19,250 - julearn - INFO -
2023-07-19 12:47:19,250 - julearn - INFO - = Data Information =
2023-07-19 12:47:19,250 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:19,250 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:19,250 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:19,250 - julearn - INFO - ====================
2023-07-19 12:47:19,250 - julearn - INFO -
2023-07-19 12:47:19,250 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:19,250 - julearn - INFO -      Target type: object
2023-07-19 12:47:19,251 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:19,251 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:19,251 - julearn - INFO - Binary classification problem detected.
Score with C=0.1: 0.8099999999999999
2023-07-19 12:47:19,300 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:19,300 - julearn - INFO - Step added
2023-07-19 12:47:19,300 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:19,300 - julearn - INFO - Setting hyperparameter C = 0.01
2023-07-19 12:47:19,300 - julearn - INFO - Step added
2023-07-19 12:47:19,300 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:19,300 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:19,300 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:19,301 - julearn - INFO -      Target: species
2023-07-19 12:47:19,301 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:19,301 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:19,301 - julearn - INFO - ====================
2023-07-19 12:47:19,301 - julearn - INFO -
2023-07-19 12:47:19,302 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:19,302 - julearn - INFO - ====================
2023-07-19 12:47:19,302 - julearn - INFO -
2023-07-19 12:47:19,302 - julearn - INFO - = Data Information =
2023-07-19 12:47:19,302 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:19,302 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:19,302 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:19,302 - julearn - INFO - ====================
2023-07-19 12:47:19,302 - julearn - INFO -
2023-07-19 12:47:19,303 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:19,303 - julearn - INFO -      Target type: object
2023-07-19 12:47:19,303 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:19,304 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:19,304 - julearn - INFO - Binary classification problem detected.
Score with C=0.01: 0.19

We can see that the model with C=0.1 works better than the model with C=0.01. However, to be sure that C=0.1 is the best value, we should try more values. With a single hyperparameter, this is manageable. But what if we have more hyperparameters, or several steps in the pipeline (e.g. feature selection, PCA, etc.)? The main problem is that the more hyperparameter values we compare, the more often we reuse the same data both to select and to evaluate the model. This usually gives an optimistic estimate of the model’s performance.
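The optimism that comes from selecting and evaluating on the same data can be illustrated with a small simulation in plain Python. This is only a sketch under artificial assumptions: the labels are coin flips, and each "candidate" stands in for one hyperparameter setting whose predictions are random guesses, so no candidate can truly do better than 50% accuracy.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility
n_samples, n_candidates = 50, 200

# True labels are coin flips: there is nothing to learn.
labels = [rng.randint(0, 1) for _ in range(n_samples)]


def accuracy(predictions, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Score every candidate on the same data.
scores = []
for _ in range(n_candidates):
    predictions = [rng.randint(0, 1) for _ in range(n_samples)]
    scores.append(accuracy(predictions, labels))

mean_score = sum(scores) / len(scores)
best_score = max(scores)
print(f"mean accuracy: {mean_score:.2f}, best accuracy: {best_score:.2f}")
```

The best candidate scores above the average purely by chance. Reporting that score as the expected performance would be too optimistic, and the gap tends to grow with the number of candidates compared.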

To prevent this, we can use a technique called nested cross-validation. That is, we use cross-validation to tune the hyperparameters, and then we use cross-validation again to estimate the performance of the model using the best hyperparameters set. It is called nested because we first split the data into training and testing sets to estimate the model performance (outer loop), and then we split the training set into two sets to tune the hyperparameters (inner loop).
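The two loops can be sketched with index lists only. This is a plain-Python illustration, not julearn’s implementation. The helper kfold_indices is a hypothetical name, and it mimics an unshuffled 5-fold split like the KFold(n_splits=5, shuffle=False) reported in the logs.

```python
def kfold_indices(indices, n_splits):
    """Split a list of indices into contiguous (train, test) folds, unshuffled."""
    fold_sizes = [len(indices) // n_splits] * n_splits
    for i in range(len(indices) % n_splits):
        fold_sizes[i] += 1
    folds, start = [], 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds


samples = list(range(100))  # 100 samples, as in the iris subset above
n_inner_splits = 0
for outer_train, outer_test in kfold_indices(samples, n_splits=5):
    # Inner loop: only the outer training set is split again for tuning.
    for inner_train, inner_valid in kfold_indices(outer_train, n_splits=5):
        # The outer test samples never take part in hyperparameter tuning.
        assert not set(outer_test) & set(inner_train + inner_valid)
        n_inner_splits += 1

print(n_inner_splits)  # 25 inner splits for each candidate hyperparameter value
```

Because the outer test fold is never seen while choosing hyperparameters, the outer scores are not affected by the selection step.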

Julearn has a simple way to do hyperparameter tuning using nested cross-validation. When we use a PipelineCreator to create a pipeline, we can set the hyperparameters we want to tune and the values we want to try.

For example, we can try different values for the regularization parameter C of the SVC model:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10])

print(creator)

2023-07-19 12:47:19,352 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:19,352 - julearn - INFO - Step added
2023-07-19 12:47:19,352 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:19,352 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:19,353 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10]}

As we can see above, the creator now shows that the C hyperparameter will be tuned. We can now use this creator to run cross-validation. This will tune the hyperparameters and estimate the performance of the model using the best hyperparameters set.

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")

2023-07-19 12:47:19,354 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:19,354 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:19,354 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:19,354 - julearn - INFO -      Target: species
2023-07-19 12:47:19,354 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:19,354 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:19,355 - julearn - INFO - ====================
2023-07-19 12:47:19,355 - julearn - INFO -
2023-07-19 12:47:19,355 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:19,355 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:19,355 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:19,356 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:19,356 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:19,356 - julearn - INFO - Search Parameters:
2023-07-19 12:47:19,356 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:19,356 - julearn - INFO - ====================
2023-07-19 12:47:19,356 - julearn - INFO -
2023-07-19 12:47:19,356 - julearn - INFO - = Data Information =
2023-07-19 12:47:19,356 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:19,356 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:19,356 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:19,356 - julearn - INFO - ====================
2023-07-19 12:47:19,356 - julearn - INFO -
2023-07-19 12:47:19,356 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:19,356 - julearn - INFO -      Target type: object
2023-07-19 12:47:19,357 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:19,357 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:19,357 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001

We can see that the model with the best hyperparameters works better than the model with C=0.1. But what’s the best hyperparameter set? We can see it by printing the model_tuned.best_params_ attribute.

pprint(model_tuned.best_params_)

{'svm__C': 1}

We can see that the best hyperparameter set is C=1. Since this hyperparameter was not on the boundary of the values we tried, we can conclude that our search for the best C value was successful.

However, by checking on the SVC documentation, we can see that there are more hyperparameters that we can tune. For example, for the default rbf kernel, we can tune the gamma hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("svm", C=[0.01, 0.1, 1, 10], gamma=[0.01, 0.1, 1, 10])

print(creator)

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)

2023-07-19 12:47:20,537 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:20,537 - julearn - INFO - Step added
2023-07-19 12:47:20,537 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:20,537 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:20,538 - julearn - INFO - Tuning hyperparameter gamma = [0.01, 0.1, 1, 10]
2023-07-19 12:47:20,538 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.01, 0.1, 1, 10]}

2023-07-19 12:47:20,538 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:20,538 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:20,538 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:20,538 - julearn - INFO -      Target: species
2023-07-19 12:47:20,539 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:20,539 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:20,539 - julearn - INFO - ====================
2023-07-19 12:47:20,539 - julearn - INFO -
2023-07-19 12:47:20,540 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:20,540 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:20,540 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:20,540 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:20,540 - julearn - INFO -      svm__gamma: [0.01, 0.1, 1, 10]
2023-07-19 12:47:20,540 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:20,540 - julearn - INFO - Search Parameters:
2023-07-19 12:47:20,541 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:20,541 - julearn - INFO - ====================
2023-07-19 12:47:20,541 - julearn - INFO -
2023-07-19 12:47:20,541 - julearn - INFO - = Data Information =
2023-07-19 12:47:20,541 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:20,541 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:20,541 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:20,541 - julearn - INFO - ====================
2023-07-19 12:47:20,541 - julearn - INFO -
2023-07-19 12:47:20,541 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:20,541 - julearn - INFO -      Target type: object
2023-07-19 12:47:20,542 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:20,542 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:20,542 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}

We can see that the best hyperparameter set is C=10 and gamma=0.01. But since both values are on the boundary of the ranges we tried, we should try more values to be sure that we are using the best hyperparameter set.
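The boundary check can be automated with a few lines of plain Python. The helper on_boundary is a hypothetical name, not part of julearn. The grid and the best parameters are copied from the run above.

```python
def on_boundary(best_params, grid):
    """Return the hyperparameters whose best value is the smallest or
    largest numeric candidate that was tried."""
    flagged = []
    for name, best in best_params.items():
        numeric = [v for v in grid[name] if isinstance(v, (int, float))]
        if isinstance(best, (int, float)) and best in (min(numeric), max(numeric)):
            flagged.append(name)
    return flagged


# Grid and result copied from the run above.
grid = {"svm__C": [0.01, 0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1, 10]}
best_params = {"svm__C": 10, "svm__gamma": 0.01}
print(on_boundary(best_params, grid))  # ['svm__C', 'svm__gamma']
```

String candidates such as "scale" or "auto" have no ordering, so this sketch ignores them.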

We can even give a mixture of different variable types, like the words "scale" and "auto" for the gamma hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-5, 1e-4, 1e-3, 1e-2, "scale", "auto"],
)

print(creator)

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)

2023-07-19 12:47:25,080 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:25,080 - julearn - INFO - Step added
2023-07-19 12:47:25,080 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:25,081 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:25,081 - julearn - INFO - Tuning hyperparameter gamma = [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:25,081 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']}

2023-07-19 12:47:25,081 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:25,081 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:25,081 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:25,081 - julearn - INFO -      Target: species
2023-07-19 12:47:25,082 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:25,082 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:25,082 - julearn - INFO - ====================
2023-07-19 12:47:25,082 - julearn - INFO -
2023-07-19 12:47:25,083 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:25,083 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:25,083 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:25,083 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:25,083 - julearn - INFO -      svm__gamma: [1e-05, 0.0001, 0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:25,083 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:25,084 - julearn - INFO - Search Parameters:
2023-07-19 12:47:25,084 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:25,084 - julearn - INFO - ====================
2023-07-19 12:47:25,084 - julearn - INFO -
2023-07-19 12:47:25,084 - julearn - INFO - = Data Information =
2023-07-19 12:47:25,084 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:25,084 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:25,084 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:25,084 - julearn - INFO - ====================
2023-07-19 12:47:25,084 - julearn - INFO -
2023-07-19 12:47:25,084 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:25,084 - julearn - INFO -      Target type: object
2023-07-19 12:47:25,085 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:25,085 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:25,085 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001
{'svm__C': 10, 'svm__gamma': 0.01}

We can even tune hyperparameters from different steps of the pipeline. Let’s add a SelectKBest step to the pipeline and tune its k hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add("select_k", k=[2, 3, 4])
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, 1e-1, "scale", "auto"],
)

print(creator)

scores_tuned, model_tuned = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores_tuned['test_score'].mean()}")
pprint(model_tuned.best_params_)

2023-07-19 12:47:31,831 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:31,832 - julearn - INFO - Step added
2023-07-19 12:47:31,832 - julearn - INFO - Adding step select_k that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:31,832 - julearn - INFO - Tuning hyperparameter k = [2, 3, 4]
2023-07-19 12:47:31,832 - julearn - INFO - Step added
2023-07-19 12:47:31,832 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:31,832 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:31,832 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 0.1, 'scale', 'auto']
2023-07-19 12:47:31,832 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: select_k
    estimator:     SelectKBest()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'select_k__k': [2, 3, 4]}
  Step 2: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 0.1, 'scale', 'auto']}

2023-07-19 12:47:31,833 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:31,833 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:31,833 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:31,833 - julearn - INFO -      Target: species
2023-07-19 12:47:31,833 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:31,833 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:31,834 - julearn - INFO - ====================
2023-07-19 12:47:31,834 - julearn - INFO -
2023-07-19 12:47:31,835 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:31,835 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:31,835 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:31,835 - julearn - INFO -      select_k__k: [2, 3, 4]
2023-07-19 12:47:31,835 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:31,835 - julearn - INFO -      svm__gamma: [0.001, 0.01, 0.1, 'scale', 'auto']
2023-07-19 12:47:31,835 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:31,835 - julearn - INFO - Search Parameters:
2023-07-19 12:47:31,836 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:31,836 - julearn - INFO - ====================
2023-07-19 12:47:31,836 - julearn - INFO -
2023-07-19 12:47:31,836 - julearn - INFO - = Data Information =
2023-07-19 12:47:31,836 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:31,836 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:31,836 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:31,836 - julearn - INFO - ====================
2023-07-19 12:47:31,836 - julearn - INFO -
2023-07-19 12:47:31,836 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:31,836 - julearn - INFO -      Target type: object
2023-07-19 12:47:31,837 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:31,837 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:31,837 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9100000000000001
{'select_k__k': 4, 'svm__C': 10, 'svm__gamma': 0.01}

But how will Julearn find the optimal hyperparameter set?

Searchers#

Julearn uses the same concept as scikit-learn to tune hyperparameters: it uses a searcher to find the best hyperparameter set. A searcher is an object that receives a set of hyperparameters and their values, and then tries to find the best combination of values for the hyperparameters using cross-validation.

By default, Julearn uses a GridSearchCV. This searcher is very simple. First, it constructs the “grid” of hyperparameters to try. As we saw above, we have 3 hyperparameters to tune, so it constructs a 3-dimensional grid with all the possible combinations of the hyperparameter values. The second step is to perform cross-validation on each of the possible combinations of hyperparameter values.
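The grid construction can be sketched with itertools.product. This plain-Python snippet only enumerates the combinations and does not fit any model. The candidate values are taken from the last pipeline above.

```python
from itertools import product

# Candidate values taken from the last pipeline above.
param_grid = {
    "select_k__k": [2, 3, 4],
    "svm__C": [0.01, 0.1, 1, 10],
    "svm__gamma": [1e-3, 1e-2, 1e-1, "scale", "auto"],
}

# One dictionary per point of the 3-dimensional grid.
names = list(param_grid)
combinations = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

n_inner_splits = 5  # inner KFold(n_splits=5), as reported in the logs
print(len(combinations))                   # 3 * 4 * 5 = 60
print(len(combinations) * n_inner_splits)  # 300 inner fits per outer fold
```

The number of fits grows multiplicatively with each added hyperparameter, which is why exhaustive grids quickly become expensive.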

Another searcher that Julearn provides is the RandomizedSearchCV. This searcher is similar to the GridSearchCV, but instead of trying all the possible combinations of hyperparameter values, it tries a random subset of them. This is useful when we have many hyperparameters to tune, since trying all the possible combinations can be very time-consuming. It is also useful for continuous hyperparameters, whose values can be sampled from a distribution. For more information, see the RandomizedSearchCV documentation.
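Both ideas behind a randomized search can be sketched with the standard library. This is an illustration only, not how RandomizedSearchCV is implemented. The helper sample_log_uniform, the seed and the chosen ranges are assumptions.

```python
import math
import random
from itertools import product

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Option 1: evaluate only a random subset of a discrete grid.
grid = list(product([0.01, 0.1, 1, 10], [1e-3, 1e-2, 1e-1, "scale", "auto"]))
subset = rng.sample(grid, k=5)  # 5 of the 20 (C, gamma) pairs


# Option 2: draw a continuous hyperparameter from a distribution,
# here C log-uniformly between 1e-2 and 1e1.
def sample_log_uniform(low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


c_candidates = [sample_log_uniform(1e-2, 1e1) for _ in range(5)]
print(subset)
print(c_candidates)
```

In both cases, the number of candidates to evaluate is chosen by the user, independently of how large the search space is.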

Tuning more than one grid#

Following our tuning of the SVC hyperparameters, we can see that the kernel hyperparameter can also be tuned. Besides the default “rbf”, it can also be “linear”. Let’s see what our grid of hyperparameters would look like if we add this hyperparameter:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["linear", "rbf"],
)
print(creator)

2023-07-19 12:47:53,245 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:53,245 - julearn - INFO - Step added
2023-07-19 12:47:53,245 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:53,245 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:53,245 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:53,245 - julearn - INFO - Tuning hyperparameter kernel = ['linear', 'rbf']
2023-07-19 12:47:53,246 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto'], 'svm__kernel': ['linear', 'rbf']}

We can see that the grid of hyperparameters is now 3-dimensional. However, some combinations don’t make much sense. For example, the gamma hyperparameter is used by the rbf kernel but ignored by the linear kernel. So we will be trying the linear kernel with each of the 4 gamma and 4 C values: 16 combinations, of which only 4 (one per C value) are actually different, so 12 are unnecessary. We can avoid this by using multiple grids: one grid for the linear kernel and one grid for the rbf kernel.
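The saving can be checked by counting combinations. In this plain-Python sketch, expand is a hypothetical helper that expands a list of grids in the same spirit as a list of parameter dictionaries. It does not call julearn.

```python
from itertools import product


def expand(grids):
    """Expand a list of grids (dicts of candidate lists) into all combinations."""
    combos = []
    for grid in grids:
        names = list(grid)
        combos += [
            dict(zip(names, values)) for values in product(*grid.values())
        ]
    return combos


C = [0.01, 0.1, 1, 10]
gamma = [1e-3, 1e-2, "scale", "auto"]

# One 3-dimensional grid: gamma is also varied for the linear kernel.
single_grid = [
    {"svm__C": C, "svm__gamma": gamma, "svm__kernel": ["linear", "rbf"]}
]
# Two grids: gamma is only varied for the rbf kernel.
two_grids = [
    {"svm__C": C, "svm__gamma": gamma, "svm__kernel": ["rbf"]},
    {"svm__C": C, "svm__kernel": ["linear"]},
]

print(len(expand(single_grid)))  # 4 * 4 * 2 = 32
print(len(expand(two_grids)))    # 4 * 4 + 4 = 20
```

Splitting the search into two grids removes the 12 redundant linear-kernel combinations without dropping any distinct model.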

Julearn allows specifying multiple grids using two different approaches.

Repeating the step name with different hyperparameters:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore")
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
    name="svm",
)
creator.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
    name="svm",
)

print(creator)


scores1, model1 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
    return_estimator="all",
)

print(f"Scores with best hyperparameter: {scores1['test_score'].mean()}")
pprint(model1.best_params_)

2023-07-19 12:47:53,247 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:53,247 - julearn - INFO - Step added
2023-07-19 12:47:53,247 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:53,247 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:53,247 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:53,247 - julearn - INFO - Setting hyperparameter kernel = rbf
2023-07-19 12:47:53,247 - julearn - INFO - Step added
2023-07-19 12:47:53,248 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:53,248 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:53,248 - julearn - INFO - Setting hyperparameter kernel = linear
2023-07-19 12:47:53,248 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10], 'svm__gamma': [0.001, 0.01, 'scale', 'auto']}
  Step 2: svm
    estimator:     SVC(kernel='linear')
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {'svm__C': [0.01, 0.1, 1, 10]}

2023-07-19 12:47:53,249 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:53,249 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:53,249 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:53,249 - julearn - INFO -      Target: species
2023-07-19 12:47:53,249 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:53,249 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:53,250 - julearn - INFO - ====================
2023-07-19 12:47:53,250 - julearn - INFO -
2023-07-19 12:47:53,250 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:53,250 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:53,250 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:53,250 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:53,250 - julearn - INFO -      svm__gamma: [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:53,251 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:53,251 - julearn - INFO - Search Parameters:
2023-07-19 12:47:53,251 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:53,251 - julearn - INFO - ====================
2023-07-19 12:47:53,251 - julearn - INFO -
2023-07-19 12:47:53,251 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:53,252 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:53,252 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:53,252 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:53,252 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:53,252 - julearn - INFO - Search Parameters:
2023-07-19 12:47:53,252 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:53,252 - julearn - INFO - ====================
2023-07-19 12:47:53,252 - julearn - INFO -
2023-07-19 12:47:53,252 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:53,252 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:53,252 - julearn - INFO - Hyperparameters list:
2023-07-19 12:47:53,252 - julearn - INFO -      Set 0
2023-07-19 12:47:53,252 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:53,252 - julearn - INFO -              svm__gamma: [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:53,253 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:47:53,253 - julearn - INFO -              svm: [SVC()]
2023-07-19 12:47:53,253 - julearn - INFO -      Set 1
2023-07-19 12:47:53,253 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:53,253 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:47:53,254 - julearn - INFO -              svm: [SVC(kernel='linear')]
2023-07-19 12:47:53,254 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:53,254 - julearn - INFO - Search Parameters:
2023-07-19 12:47:53,254 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:53,254 - julearn - INFO - ====================
2023-07-19 12:47:53,254 - julearn - INFO -
2023-07-19 12:47:53,254 - julearn - INFO - = Data Information =
2023-07-19 12:47:53,254 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:53,254 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:53,254 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:53,254 - julearn - INFO - ====================
2023-07-19 12:47:53,254 - julearn - INFO -
2023-07-19 12:47:53,255 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:53,255 - julearn - INFO -      Target type: object
2023-07-19 12:47:53,255 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:53,255 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:53,256 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']}),
 'svm': SVC(C=10, gamma=0.01),
 'svm__C': 10,
 'svm__gamma': 0.01}

Important

Note that the name parameter is required when repeating a step name. If we do not specify it, julearn will automatically choose a unique step name. The only way to force repeated names is to set them explicitly.

Using multiple pipeline creators:

creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
)

creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
)

scores2, model2 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=[creator1, creator2],
    return_estimator="all",
)


print(f"Scores with best hyperparameter: {scores2['test_score'].mean()}")
pprint(model2.best_params_)

2023-07-19 12:47:59,031 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:59,032 - julearn - INFO - Step added
2023-07-19 12:47:59,032 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:59,032 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:59,032 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:59,032 - julearn - INFO - Setting hyperparameter kernel = rbf
2023-07-19 12:47:59,032 - julearn - INFO - Step added
2023-07-19 12:47:59,032 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:59,032 - julearn - INFO - Step added
2023-07-19 12:47:59,032 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:47:59,032 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:47:59,032 - julearn - INFO - Setting hyperparameter kernel = linear
2023-07-19 12:47:59,032 - julearn - INFO - Step added
2023-07-19 12:47:59,032 - julearn - INFO - ==== Input Data ====
2023-07-19 12:47:59,033 - julearn - INFO - Using dataframe as input
2023-07-19 12:47:59,033 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:59,033 - julearn - INFO -      Target: species
2023-07-19 12:47:59,033 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:47:59,033 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:47:59,034 - julearn - INFO - ====================
2023-07-19 12:47:59,034 - julearn - INFO -
2023-07-19 12:47:59,034 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:59,034 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:59,034 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:59,034 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:59,034 - julearn - INFO -      svm__gamma: [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:59,035 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:59,035 - julearn - INFO - Search Parameters:
2023-07-19 12:47:59,035 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:59,035 - julearn - INFO - ====================
2023-07-19 12:47:59,035 - julearn - INFO -
2023-07-19 12:47:59,035 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:59,036 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:59,036 - julearn - INFO - Hyperparameters:
2023-07-19 12:47:59,036 - julearn - INFO -      svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:59,036 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:59,036 - julearn - INFO - Search Parameters:
2023-07-19 12:47:59,036 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:59,036 - julearn - INFO - ====================
2023-07-19 12:47:59,036 - julearn - INFO -
2023-07-19 12:47:59,036 - julearn - INFO - = Model Parameters =
2023-07-19 12:47:59,036 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:47:59,036 - julearn - INFO - Hyperparameters list:
2023-07-19 12:47:59,036 - julearn - INFO -      Set 0
2023-07-19 12:47:59,036 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:59,036 - julearn - INFO -              svm__gamma: [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:47:59,037 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:47:59,037 - julearn - INFO -              zscore: [StandardScaler()]
2023-07-19 12:47:59,037 - julearn - INFO -              svm: [SVC()]
2023-07-19 12:47:59,037 - julearn - INFO -      Set 1
2023-07-19 12:47:59,037 - julearn - INFO -              svm__C: [0.01, 0.1, 1, 10]
2023-07-19 12:47:59,038 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:47:59,038 - julearn - INFO -              zscore: [StandardScaler()]
2023-07-19 12:47:59,038 - julearn - INFO -              svm: [SVC(kernel='linear')]
2023-07-19 12:47:59,038 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:59,038 - julearn - INFO - Search Parameters:
2023-07-19 12:47:59,038 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:59,038 - julearn - INFO - ====================
2023-07-19 12:47:59,038 - julearn - INFO -
2023-07-19 12:47:59,038 - julearn - INFO - = Data Information =
2023-07-19 12:47:59,039 - julearn - INFO -      Problem type: classification
2023-07-19 12:47:59,039 - julearn - INFO -      Number of samples: 100
2023-07-19 12:47:59,039 - julearn - INFO -      Number of features: 4
2023-07-19 12:47:59,039 - julearn - INFO - ====================
2023-07-19 12:47:59,039 - julearn - INFO -
2023-07-19 12:47:59,039 - julearn - INFO -      Number of classes: 2
2023-07-19 12:47:59,039 - julearn - INFO -      Target type: object
2023-07-19 12:47:59,040 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:47:59,040 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:47:59,040 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.93
{'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']}),
 'svm': SVC(C=10, gamma=0.01),
 'svm__C': 10,
 'svm__gamma': 0.01,
 'zscore': StandardScaler()}

Important

All the pipeline creators must have the same problem type and the same step names in order for this approach to work.

Indeed, if we compare both approaches, we can see that they are equivalent. They both produce the same grid of tuned hyperparameters. The only difference is that the second one also lists the zscore step explicitly:

pprint(model1.param_grid)
pprint(model2.param_grid)

[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC(C=10, gamma=0.01)],
  'svm__C': [0.01, 0.1, 1, 10],
  'svm__gamma': [0.001, 0.01, 'scale', 'auto']},
 {'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC(kernel='linear')],
  'svm__C': [0.01, 0.1, 1, 10]}]
[{'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC(C=10, gamma=0.01)],
  'svm__C': [0.01, 0.1, 1, 10],
  'svm__gamma': [0.001, 0.01, 'scale', 'auto'],
  'zscore': [StandardScaler()]},
 {'set_column_types': [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})],
  'svm': [SVC(kernel='linear')],
  'svm__C': [0.01, 0.1, 1, 10],
  'zscore': [StandardScaler()]}]
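The comparison can also be done programmatically rather than by eye. The sketch below is an illustration only: tuned_params is a hypothetical helper (not part of julearn), and the two grids are hand-copied, simplified versions of the printed output, with strings standing in for the estimator objects.

```python
def tuned_params(param_grid):
    """Keep only the ``step__param`` entries of each grid in a list of grids."""
    return [
        {key: values for key, values in grid.items() if "__" in key}
        for grid in param_grid
    ]


# Simplified copy of the grid from the repeated-step-name approach.
grid1 = [
    {
        "svm": ["SVC()"],
        "svm__C": [0.01, 0.1, 1, 10],
        "svm__gamma": [0.001, 0.01, "scale", "auto"],
    },
    {"svm": ["SVC(kernel='linear')"], "svm__C": [0.01, 0.1, 1, 10]},
]

# Simplified copy of the grid from the multiple-creators approach.
grid2 = [
    {
        "svm": ["SVC()"],
        "svm__C": [0.01, 0.1, 1, 10],
        "svm__gamma": [0.001, 0.01, "scale", "auto"],
        "zscore": ["StandardScaler()"],
    },
    {
        "svm": ["SVC(kernel='linear')"],
        "svm__C": [0.01, 0.1, 1, 10],
        "zscore": ["StandardScaler()"],
    },
]

same_tuning = tuned_params(grid1) == tuned_params(grid2)
print(same_tuning)
```

Assuming param_grid is a list of dictionaries as printed above, the same helper should apply to model1.param_grid and model2.param_grid directly.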

Models as hyperparameters#

But why stop there? Models can also be considered as hyperparameters. For example, we can try different models for the classification task. Let’s try the RandomForestClassifier and the LogisticRegression too:

creator1 = PipelineCreator(problem_type="classification")
creator1.add("zscore")
creator1.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    gamma=[1e-3, 1e-2, "scale", "auto"],
    kernel=["rbf"],
    name="model",
)

creator2 = PipelineCreator(problem_type="classification")
creator2.add("zscore")
creator2.add(
    "svm",
    C=[0.01, 0.1, 1, 10],
    kernel=["linear"],
    name="model",
)

creator3 = PipelineCreator(problem_type="classification")
creator3.add("zscore")
creator3.add(
    "rf",
    max_depth=[2, 3, 4],
    name="model",
)

creator4 = PipelineCreator(problem_type="classification")
creator4.add("zscore")
creator4.add(
    "logit",
    penalty=["l2", "l1"],
    dual=[False],
    solver="liblinear",
    name="model",
)


scores3, model3 = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=[creator1, creator2, creator3, creator4],
    return_estimator="all",
)


print(f"Scores with best hyperparameter: {scores3['test_score'].mean()}")
pprint(model3.best_params_)

2023-07-19 12:48:04,904 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,904 - julearn - INFO - Step added
2023-07-19 12:48:04,904 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,904 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:48:04,904 - julearn - INFO - Tuning hyperparameter gamma = [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:48:04,905 - julearn - INFO - Setting hyperparameter kernel = rbf
2023-07-19 12:48:04,905 - julearn - INFO - Step added
2023-07-19 12:48:04,905 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,905 - julearn - INFO - Step added
2023-07-19 12:48:04,905 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,905 - julearn - INFO - Tuning hyperparameter C = [0.01, 0.1, 1, 10]
2023-07-19 12:48:04,905 - julearn - INFO - Setting hyperparameter kernel = linear
2023-07-19 12:48:04,905 - julearn - INFO - Step added
2023-07-19 12:48:04,905 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,905 - julearn - INFO - Step added
2023-07-19 12:48:04,905 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,905 - julearn - INFO - Tuning hyperparameter max_depth = [2, 3, 4]
2023-07-19 12:48:04,905 - julearn - INFO - Step added
2023-07-19 12:48:04,906 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,906 - julearn - INFO - Step added
2023-07-19 12:48:04,906 - julearn - INFO - Adding step model that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2023-07-19 12:48:04,906 - julearn - INFO - Tuning hyperparameter penalty = ['l2', 'l1']
2023-07-19 12:48:04,906 - julearn - INFO - Setting hyperparameter dual = False
2023-07-19 12:48:04,906 - julearn - INFO - Setting hyperparameter solver = liblinear
2023-07-19 12:48:04,906 - julearn - INFO - Step added
2023-07-19 12:48:04,906 - julearn - INFO - ==== Input Data ====
2023-07-19 12:48:04,906 - julearn - INFO - Using dataframe as input
2023-07-19 12:48:04,906 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:48:04,906 - julearn - INFO -      Target: species
2023-07-19 12:48:04,906 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2023-07-19 12:48:04,906 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2023-07-19 12:48:04,907 - julearn - INFO - ====================
2023-07-19 12:48:04,907 - julearn - INFO -
2023-07-19 12:48:04,908 - julearn - INFO - = Model Parameters =
2023-07-19 12:48:04,908 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:48:04,908 - julearn - INFO - Hyperparameters:
2023-07-19 12:48:04,908 - julearn - INFO -      model__C: [0.01, 0.1, 1, 10]
2023-07-19 12:48:04,908 - julearn - INFO -      model__gamma: [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:48:04,908 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,908 - julearn - INFO - Search Parameters:
2023-07-19 12:48:04,908 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,908 - julearn - INFO - ====================
2023-07-19 12:48:04,909 - julearn - INFO -
2023-07-19 12:48:04,909 - julearn - INFO - = Model Parameters =
2023-07-19 12:48:04,909 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:48:04,909 - julearn - INFO - Hyperparameters:
2023-07-19 12:48:04,909 - julearn - INFO -      model__C: [0.01, 0.1, 1, 10]
2023-07-19 12:48:04,909 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,909 - julearn - INFO - Search Parameters:
2023-07-19 12:48:04,910 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,910 - julearn - INFO - ====================
2023-07-19 12:48:04,910 - julearn - INFO -
2023-07-19 12:48:04,910 - julearn - INFO - = Model Parameters =
2023-07-19 12:48:04,910 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:48:04,910 - julearn - INFO - Hyperparameters:
2023-07-19 12:48:04,910 - julearn - INFO -      model__max_depth: [2, 3, 4]
2023-07-19 12:48:04,910 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,911 - julearn - INFO - Search Parameters:
2023-07-19 12:48:04,911 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,911 - julearn - INFO - ====================
2023-07-19 12:48:04,911 - julearn - INFO -
2023-07-19 12:48:04,911 - julearn - INFO - = Model Parameters =
2023-07-19 12:48:04,911 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:48:04,911 - julearn - INFO - Hyperparameters:
2023-07-19 12:48:04,911 - julearn - INFO -      model__penalty: ['l2', 'l1']
2023-07-19 12:48:04,912 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,912 - julearn - INFO - Search Parameters:
2023-07-19 12:48:04,912 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,912 - julearn - INFO - ====================
2023-07-19 12:48:04,912 - julearn - INFO -
2023-07-19 12:48:04,912 - julearn - INFO - = Model Parameters =
2023-07-19 12:48:04,912 - julearn - INFO - Tuning hyperparameters using grid
2023-07-19 12:48:04,912 - julearn - INFO - Hyperparameters list:
2023-07-19 12:48:04,912 - julearn - INFO -      Set 0
2023-07-19 12:48:04,912 - julearn - INFO -              model__C: [0.01, 0.1, 1, 10]
2023-07-19 12:48:04,912 - julearn - INFO -              model__gamma: [0.001, 0.01, 'scale', 'auto']
2023-07-19 12:48:04,913 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:48:04,913 - julearn - INFO -              zscore: [StandardScaler()]
2023-07-19 12:48:04,913 - julearn - INFO -              model: [SVC()]
2023-07-19 12:48:04,913 - julearn - INFO -      Set 1
2023-07-19 12:48:04,913 - julearn - INFO -              model__C: [0.01, 0.1, 1, 10]
2023-07-19 12:48:04,913 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:48:04,914 - julearn - INFO -              zscore: [StandardScaler()]
2023-07-19 12:48:04,914 - julearn - INFO -              model: [SVC(kernel='linear')]
2023-07-19 12:48:04,914 - julearn - INFO -      Set 2
2023-07-19 12:48:04,914 - julearn - INFO -              model__max_depth: [2, 3, 4]
2023-07-19 12:48:04,914 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:48:04,914 - julearn - INFO -              zscore: [StandardScaler()]
2023-07-19 12:48:04,915 - julearn - INFO -              model: [RandomForestClassifier()]
2023-07-19 12:48:04,915 - julearn - INFO -      Set 3
2023-07-19 12:48:04,915 - julearn - INFO -              model__penalty: ['l2', 'l1']
2023-07-19 12:48:04,915 - julearn - INFO -              set_column_types: [SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']})]
2023-07-19 12:48:04,915 - julearn - INFO -              zscore: [StandardScaler()]
2023-07-19 12:48:04,915 - julearn - INFO -              model: [LogisticRegression(solver='liblinear')]
2023-07-19 12:48:04,915 - julearn - INFO - Using inner CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,916 - julearn - INFO - Search Parameters:
2023-07-19 12:48:04,916 - julearn - INFO -      cv: KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,916 - julearn - INFO - ====================
2023-07-19 12:48:04,916 - julearn - INFO -
2023-07-19 12:48:04,916 - julearn - INFO - = Data Information =
2023-07-19 12:48:04,916 - julearn - INFO -      Problem type: classification
2023-07-19 12:48:04,916 - julearn - INFO -      Number of samples: 100
2023-07-19 12:48:04,916 - julearn - INFO -      Number of features: 4
2023-07-19 12:48:04,916 - julearn - INFO - ====================
2023-07-19 12:48:04,916 - julearn - INFO -
2023-07-19 12:48:04,916 - julearn - INFO -      Number of classes: 2
2023-07-19 12:48:04,916 - julearn - INFO -      Target type: object
2023-07-19 12:48:04,917 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2023-07-19 12:48:04,917 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2023-07-19 12:48:04,917 - julearn - INFO - Binary classification problem detected.
Scores with best hyperparameter: 0.9200000000000002
{'model': SVC(C=10, gamma=0.01),
 'model__C': 10,
 'model__gamma': 0.01,
 'set_column_types': SetColumnTypes(X_types={'continuous': ['sepal_length', 'sepal_width',
                                       'petal_length', 'petal_width']}),
 'zscore': StandardScaler()}

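Well, it seems that nothing can beat the SVC with kernel="rbf" for our classification example: even with the random forest and the logistic regression in the search, the selected model is again the SVC with C=10 and gamma=0.01.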
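For readers who want to see what "models as hyperparameters" looks like in plain scikit-learn, here is a minimal sketch of the same idea: a pipeline step named model is itself listed in the parameter grid, together with that model's own hyperparameters. This is not julearn's implementation. It uses scikit-learn's bundled iris data, a reduced set of models (no logistic regression), and only the inner grid search, without the outer cross-validation that run_cross_validation adds. Names such as X_bin, y_bin and search are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Load iris and keep two classes (labels 1 and 2: versicolor and virginica).
X_iris, y_iris = load_iris(return_X_y=True)
mask = y_iris != 0
X_bin, y_bin = X_iris[mask], y_iris[mask]

# The "model" step is a placeholder that each grid replaces.
pipe = Pipeline([("zscore", StandardScaler()), ("model", SVC())])

C_values = [0.01, 0.1, 1, 10]
param_grid = [
    {
        "model": [SVC(kernel="rbf")],
        "model__C": C_values,
        "model__gamma": [1e-3, 1e-2, "scale", "auto"],
    },
    {"model": [SVC(kernel="linear")], "model__C": C_values},
    {
        "model": [RandomForestClassifier(random_state=0)],
        "model__max_depth": [2, 3, 4],
    },
]

# Each dictionary is searched as a separate grid; the best candidate overall wins.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_bin, y_bin)

n_candidates = len(search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print(f"Selected model: {type(search.best_params_['model']).__name__}")
```

Giving every model the same step name is what lets a single search swap them, which is the reason the julearn example above passes name="model" to each creator.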
