5.3. Model Building#

So far we know how to parametrize run_cross_validation() in terms of the input data (see Data). In this section, we will have a look at how we can parametrize the learning algorithm and the preprocessing steps, also known as the pipeline.

A machine learning pipeline is a process to automate the workflow of a predictive model. It can be thought of as a combination of pipes and filters. At a pipeline’s starting point, the raw data is fed into the first filter. The output of this filter is then fed into the next filter (through a pipe). In supervised machine learning, different filters inside the pipeline modify the data, while the last step is a learning algorithm that generates predictions. Before the pipeline can be used to predict on new data, it has to be trained on data; following scikit-learn, we call this fitting the pipeline.
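To make the pipes-and-filters idea concrete, here is a minimal plain scikit-learn sketch (not yet julearn’s API, and with our own choice of steps): a scaling filter followed by an SVM filter, fitted and used as a single object.

from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris_X, iris_y = load_iris(return_X_y=True)

# Each (name, step) pair acts as a filter; data flows through them in order.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])

pipe.fit(iris_X, iris_y)  # fitting trains every filter in sequence
print(pipe.predict(iris_X[:5]))  # prediction pushes data through all filters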

julearn aims to provide a user-friendly way to build and evaluate complex machine learning pipelines. The run_cross_validation() function is the entry point to safely evaluate pipelines, as it makes it easy to specify, customize and train the pipeline. We first have a look at the most basic pipeline, consisting only of a machine learning algorithm. Then we will make the pipeline incrementally more complex.

Pipeline specification in run_cross_validation()#

One important aspect when building machine learning models is the selection of a learning algorithm. This can be specified in run_cross_validation() by setting the model parameter. This parameter can be any scikit-learn compatible learning algorithm. However, julearn provides a list of built-in Models (Estimators) that can be specified by name (see Name column in Models (Estimators)). For example, we can simply set model="svm" to use a Support Vector Machine (SVM) [2].

Let’s first specify the data parameters as we learned in Data:

from julearn import run_cross_validation
from seaborn import load_dataset

df = load_dataset("iris")
X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = "species"
X_types = {
    "continuous": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ]
}

Now we can run the cross-validation with SVM as the learning algorithm:

scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model="svm",
    problem_type="classification",
)
print(scores)
2024-10-17 14:16:12,239 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,239 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,239 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,239 - julearn - INFO -      Target: species
2024-10-17 14:16:12,239 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,239 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,240 - julearn - INFO - ====================
2024-10-17 14:16:12,240 - julearn - INFO -
2024-10-17 14:16:12,240 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,240 - julearn - INFO - Step added
2024-10-17 14:16:12,240 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,241 - julearn - INFO - ====================
2024-10-17 14:16:12,241 - julearn - INFO -
2024-10-17 14:16:12,241 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,241 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:12,241 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:12,241 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:12,241 - julearn - INFO - ====================
2024-10-17 14:16:12,241 - julearn - INFO -
2024-10-17 14:16:12,241 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:12,241 - julearn - INFO -      Target type: object
2024-10-17 14:16:12,242 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:12,242 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,242 - julearn - INFO - Multi-class classification problem detected #classes = 3.
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.003261    0.001721  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.002920    0.001693  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.002697    0.001709  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.002729    0.001671  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.002766    0.001642  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

You will notice that this code specifies an extra parameter, problem_type. This is because in machine learning one distinguishes between regression problems (predicting a continuous outcome) and classification problems (predicting discrete class labels). Therefore, run_cross_validation() additionally needs to know which problem type we are interested in. The possible values for problem_type are classification and regression. In the example we are interested in predicting the species (see y in Data), i.e. a discrete class label.
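For illustration, a regression problem could be set up on the same dataset by predicting a continuous column instead. The following is a hedged sketch (the column choice is ours, and it assumes the name "svm" resolves to the corresponding support vector regressor when problem_type="regression"):

# Hypothetical regression variant: predict petal_width (continuous)
# from the remaining iris features.
scores_reg = run_cross_validation(
    X=["sepal_length", "sepal_width", "petal_length"],
    y="petal_width",
    data=df,
    model="svm",
    problem_type="regression",
)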

Et voilà, your first machine learning pipeline is ready to go.

Feature preprocessing#

There are cases in which the input data, and in particular the features, should be transformed before passing them to the learning algorithm. One scenario is that certain learning algorithms need the features in a specific form, for example in standardized form, i.e. with zero mean and unit variance. This can be achieved by z-scoring (or standard scaling) the features (see Scalers).
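As a quick numerical illustration of what z-scoring does (plain numpy, with our own toy values):

import numpy as np

x = np.array([4.9, 5.0, 5.2, 6.1, 6.4])
z = (x - x.mean()) / x.std()  # subtract the mean, divide by the std
print(z.mean(), z.std())  # ~0.0 and ~1.0 (up to floating point)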

Importantly, in a machine learning workflow all transformations done to the data have to be done in a CV-consistent way. That means that each transformation has to be fitted on the training data of the respective cross-validation fold, and only the fitted transformation is then applied to the validation data of that fold. One should never do preprocessing on the entire dataset and then cross-validate on the already preprocessed features (or more generally, transformed data), because this leaks information from the validation data into the model. This is exactly where run_cross_validation() comes in handy: you can simply add your desired preprocessing step (see Transformers) and it takes care of doing the respective transformations in a CV-consistent manner.
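To see why this matters, here is a plain scikit-learn sketch contrasting the two approaches; run_cross_validation() effectively performs the second, leakage-free variant for you:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

data_X, data_y = load_iris(return_X_y=True)

# Leaky: the scaler is fitted on ALL samples, so statistics from the
# validation folds influence the transformation used during training.
X_leaky = StandardScaler().fit_transform(data_X)
leaky_scores = cross_val_score(SVC(), X_leaky, data_y, cv=5)

# CV-consistent: the scaler is re-fitted on the training part of each
# fold and only applied (not re-fitted) to the validation part.
pipe = make_pipeline(StandardScaler(), SVC())
clean_scores = cross_val_score(pipe, data_X, data_y, cv=5)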

Let’s have a look at how we can add a z-scoring step to our pipeline:

scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    preprocess="zscore",
    model="svm",
    problem_type="classification",
)

print(scores)
2024-10-17 14:16:12,275 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,275 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,276 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,276 - julearn - INFO -      Target: species
2024-10-17 14:16:12,276 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,276 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,276 - julearn - INFO - ====================
2024-10-17 14:16:12,276 - julearn - INFO -
2024-10-17 14:16:12,276 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,277 - julearn - INFO - Step added
2024-10-17 14:16:12,277 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,277 - julearn - INFO - Step added
2024-10-17 14:16:12,277 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,277 - julearn - INFO - ====================
2024-10-17 14:16:12,277 - julearn - INFO -
2024-10-17 14:16:12,277 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,277 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:12,277 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:12,277 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:12,277 - julearn - INFO - ====================
2024-10-17 14:16:12,278 - julearn - INFO -
2024-10-17 14:16:12,278 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:12,278 - julearn - INFO -      Target type: object
2024-10-17 14:16:12,278 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:12,278 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,279 - julearn - INFO - Multi-class classification problem detected #classes = 3.
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.004609    0.002466  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.004534    0.002429  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.004568    0.002462  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.004427    0.002417  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.004408    0.002435  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

Note

Learning algorithms (what we specified in the model parameter) are estimators. Preprocessing steps, however, are usually transformers, because they transform the input data in a certain way. Therefore, the parameter description in the API of run_cross_validation() defines valid input for the preprocess parameter as TransformerLike:

preprocess : str, TransformerLike or list | None
        Transformer to apply to the features. If string, use one of the
        available transformers. If list, each element can be a string or
        scikit-learn compatible transformer. If None (default), no
        transformation is applied.
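Following this docstring, a scikit-learn compatible transformer instance can also be passed directly. As a hedged sketch (the choice of MinMaxScaler is ours):

from sklearn.preprocessing import MinMaxScaler

# Assuming the TransformerLike contract above, a transformer instance
# can stand in for a named step.
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    preprocess=MinMaxScaler(),
    model="svm",
    problem_type="classification",
)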

But what if we want to add more preprocessing steps? For example, when many features are available, we might want to reduce their dimensionality before passing them to the learning algorithm. A commonly used approach is principal component analysis (PCA, see Decomposition). If we want to keep our previously applied z-scoring, we can simply add the PCA as another preprocessing step as follows:

scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    preprocess=["zscore", "pca"],
    model="svm",
    problem_type="classification",
)

print(scores)
2024-10-17 14:16:12,324 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,324 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,324 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,324 - julearn - INFO -      Target: species
2024-10-17 14:16:12,324 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,324 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,325 - julearn - INFO - ====================
2024-10-17 14:16:12,325 - julearn - INFO -
2024-10-17 14:16:12,325 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,325 - julearn - INFO - Step added
2024-10-17 14:16:12,325 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,325 - julearn - INFO - Step added
2024-10-17 14:16:12,325 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,325 - julearn - INFO - Step added
2024-10-17 14:16:12,326 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,326 - julearn - INFO - ====================
2024-10-17 14:16:12,326 - julearn - INFO -
2024-10-17 14:16:12,326 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,326 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:12,326 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:12,326 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:12,326 - julearn - INFO - ====================
2024-10-17 14:16:12,326 - julearn - INFO -
2024-10-17 14:16:12,326 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:12,327 - julearn - INFO -      Target type: object
2024-10-17 14:16:12,327 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:12,327 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,327 - julearn - INFO - Multi-class classification problem detected #classes = 3.
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.006233    0.003244  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.005605    0.003292  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.005648    0.003210  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.005651    0.003303  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.005561    0.003198  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

This is nice, but as more steps are added to the pipeline, this can become opaque. To simplify building complex pipelines, julearn provides a PipelineCreator, which helps keep things neat.

Pipeline specification made easy with the PipelineCreator#

The PipelineCreator is a class that helps the user create complex pipelines in a straightforward way, by adding the desired steps to the pipeline in order. Once the pipeline is specified, run_cross_validation() will detect that it received a pipeline creator and will automatically create the pipeline and run the cross-validation.

Note

The PipelineCreator always has to be initialized with the problem_type parameter, which can be either classification or regression.

Let’s re-write the previous example, using the PipelineCreator.

We start by creating an instance of the PipelineCreator, and setting the problem_type parameter to classification.

from julearn.pipeline import PipelineCreator

creator = PipelineCreator(problem_type="classification")

Then we use the add method to add every desired step to the pipeline. Both the preprocessing steps and the learning algorithm are added in the same way. As with the run_cross_validation() function, one can use the names of the steps as indicated in Overview of available Pipeline Steps.

creator.add("zscore")
creator.add("pca")
creator.add("svm")

print(creator)
2024-10-17 14:16:12,384 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,384 - julearn - INFO - Step added
2024-10-17 14:16:12,384 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,384 - julearn - INFO - Step added
2024-10-17 14:16:12,384 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,384 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: pca
    estimator:     PCA()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 2: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

We then pass the creator to run_cross_validation() as the model parameter. We do not need to (and cannot) specify any other pipeline specification step, such as preprocess:

scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)

print(scores)
2024-10-17 14:16:12,385 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,385 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,385 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,385 - julearn - INFO -      Target: species
2024-10-17 14:16:12,386 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,386 - julearn - INFO -      X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,386 - julearn - INFO - ====================
2024-10-17 14:16:12,386 - julearn - INFO -
2024-10-17 14:16:12,387 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,387 - julearn - INFO - ====================
2024-10-17 14:16:12,387 - julearn - INFO -
2024-10-17 14:16:12,387 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,387 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:12,387 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:12,387 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:12,387 - julearn - INFO - ====================
2024-10-17 14:16:12,387 - julearn - INFO -
2024-10-17 14:16:12,387 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:12,387 - julearn - INFO -      Target type: object
2024-10-17 14:16:12,388 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:12,388 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,388 - julearn - INFO - Multi-class classification problem detected #classes = 3.
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.006163    0.003283  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.005715    0.003229  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.005679    0.003237  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.005617    0.003217  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.005570    0.003192  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

Awesome! We covered how to create a basic machine learning pipeline and even added multiple feature preprocessing steps.

Let’s jump to the next important aspect in the process of building a machine learning model: hyperparameters. Here we cover the basics of setting hyperparameters. If you want to know more about tuning (or optimizing) hyperparameters, please have a look at Hyperparameter Tuning.

Setting hyperparameters#

If you are new to machine learning, the section heading might confuse you. Parameters, hyperparameters: aren’t we doing machine learning, so shouldn’t the model learn all our parameters? Well, yes and no. Yes, it should learn parameters. However, hyperparameters and parameters are two different things.

A model parameter is a variable internal to the learning algorithm whose value we want to learn or estimate from the data, which in turn means that it is not set manually. Model parameters are required by the model and are often saved as part of the fitting process. Examples of model parameters are the weights in an artificial neural network, the support vectors in a support vector machine, or the coefficients/weights in a linear or logistic regression.

Hyperparameters, in turn, are configurations of a learning algorithm that cannot be estimated from data but nevertheless need to be specified, because they determine how the model parameters will be learnt. The best value for a hyperparameter on a given problem is usually not known, and therefore it has to be set manually, based on experience from a previous similar problem, set by using a heuristic (rule of thumb), or tuned. Examples are the learning rate for training a neural network, the C and sigma hyperparameters of support vector machines, or the number of estimators in a random forest.
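The distinction is easy to see in code. In scikit-learn’s SVC, for instance, C and kernel are hyperparameters passed at construction, while the support vectors only exist after fitting (a small illustrative sketch):

from sklearn.datasets import load_iris
from sklearn.svm import SVC

demo_X, demo_y = load_iris(return_X_y=True)

clf = SVC(C=1.0, kernel="rbf")  # hyperparameters: chosen by us, before fitting
clf.fit(demo_X, demo_y)
print(clf.support_vectors_.shape)  # model parameters: learned from the data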

Manually specifying hyperparameters with julearn is as simple as using the PipelineCreator and setting the hyperparameter when the step is added.

Let’s say we want to set the with_mean parameter of the z-score transformer and keep as many PCA components as needed to explain 20% of the variance. This is how the creator would look:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=True)
creator.add("pca", n_components=0.2)
creator.add("svm")

print(creator)
2024-10-17 14:16:12,444 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,445 - julearn - INFO - Setting hyperparameter with_mean = True
2024-10-17 14:16:12,445 - julearn - INFO - Step added
2024-10-17 14:16:12,445 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,445 - julearn - INFO - Setting hyperparameter n_components = 0.2
2024-10-17 14:16:12,445 - julearn - INFO - Step added
2024-10-17 14:16:12,445 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,445 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: pca
    estimator:     PCA(n_components=0.2)
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 2: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

Usable transformers and estimators are listed under Overview of available Pipeline Steps. Most of these steps are based on the respective scikit-learn transformers or estimators. To see the valid hyperparameters of a certain transformer or estimator, just follow the respective link in Overview of available Pipeline Steps, which will lead you to the scikit-learn documentation where you can read more about them.
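Since these steps wrap scikit-learn objects, you can also list an estimator’s valid hyperparameters programmatically with scikit-learn’s get_params():

from sklearn.svm import SVC

# All constructor hyperparameters of the estimator, with current values.
print(SVC().get_params())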

In many cases one wants to specify more than one hyperparameter. This can be done by passing each hyperparameter as a separate keyword argument. For the svm we could, for example, specify the C and kernel hyperparameters like this:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=True)
creator.add("pca", n_components=0.2)
creator.add("svm", C=0.9, kernel="linear")

print(creator)
2024-10-17 14:16:12,446 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,446 - julearn - INFO - Setting hyperparameter with_mean = True
2024-10-17 14:16:12,446 - julearn - INFO - Step added
2024-10-17 14:16:12,446 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,446 - julearn - INFO - Setting hyperparameter n_components = 0.2
2024-10-17 14:16:12,446 - julearn - INFO - Step added
2024-10-17 14:16:12,447 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,447 - julearn - INFO - Setting hyperparameter C = 0.9
2024-10-17 14:16:12,447 - julearn - INFO - Setting hyperparameter kernel = linear
2024-10-17 14:16:12,447 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 1: pca
    estimator:     PCA(n_components=0.2)
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}
  Step 2: svm
    estimator:     SVC(C=0.9, kernel='linear')
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

Selective preprocessing using feature types#

Under Pipeline specification made easy with the PipelineCreator you might have wondered how the PipelineCreator makes things easier. Besides the straightforward definition of hyperparameters, the PipelineCreator also allows specifying that a certain step must only be applied to certain feature types (see Data on how to define feature types).

In our example, we can now choose to do two PCA steps, one for the petal features, and one for the sepal features.

First, we need to define the X_types so we have both petal and sepal features:

X_types = {
    "sepal": ["sepal_length", "sepal_width"],
    "petal": ["petal_length", "petal_width"],
}

Then, we modify the previous creator: we add one pca step that is applied only to the petal features, and another one applied only to the sepal features. Since we also want the zscore applied to all features, we need to specify this as well, indicating that it should be applied to both petal and sepal features:

creator = PipelineCreator(problem_type="classification")
creator.add("zscore", apply_to=["petal", "sepal"], with_mean=True)
creator.add("pca", apply_to="petal", n_components=1)
creator.add("pca", apply_to="sepal", n_components=1)
creator.add("svm")

print(creator)
2024-10-17 14:16:12,448 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'petal', 'sepal'}; pattern=(?:__:type:__petal|__:type:__sepal)>
2024-10-17 14:16:12,448 - julearn - INFO - Setting hyperparameter with_mean = True
2024-10-17 14:16:12,448 - julearn - INFO - Step added
2024-10-17 14:16:12,448 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'petal'}; pattern=(?:__:type:__petal)>
2024-10-17 14:16:12,448 - julearn - INFO - Setting hyperparameter n_components = 1
2024-10-17 14:16:12,448 - julearn - INFO - Step added
2024-10-17 14:16:12,448 - julearn - INFO - Adding step pca_1 that applies to ColumnTypes<types={'sepal'}; pattern=(?:__:type:__sepal)>
2024-10-17 14:16:12,449 - julearn - INFO - Setting hyperparameter n_components = 1
2024-10-17 14:16:12,449 - julearn - INFO - Step added
2024-10-17 14:16:12,449 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,449 - julearn - INFO - Step added
PipelineCreator:
  Step 0: zscore
    estimator:     StandardScaler()
    apply to:      ColumnTypes<types={'petal', 'sepal'}; pattern=(?:__:type:__petal|__:type:__sepal)>
    needed types:  ColumnTypes<types={'petal', 'sepal'}; pattern=(?:__:type:__petal|__:type:__sepal)>
    tuning params: {}
  Step 1: pca
    estimator:     PCA(n_components=1)
    apply to:      ColumnTypes<types={'petal'}; pattern=(?:__:type:__petal)>
    needed types:  ColumnTypes<types={'petal'}; pattern=(?:__:type:__petal)>
    tuning params: {}
  Step 2: pca_1
    estimator:     PCA(n_components=1)
    apply to:      ColumnTypes<types={'sepal'}; pattern=(?:__:type:__sepal)>
    needed types:  ColumnTypes<types={'sepal'}; pattern=(?:__:type:__sepal)>
    tuning params: {}
  Step 3: svm
    estimator:     SVC()
    apply to:      ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    needed types:  ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
    tuning params: {}

We have additionally specified n_components=1 as a hyperparameter of each pca step, so only the first principal component is kept. For the svm we used the default hyperparameters.

Finally, we again pass the defined X_types and the creator to run_cross_validation():

scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)

print(scores)
2024-10-17 14:16:12,450 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,450 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,450 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,450 - julearn - INFO -      Target: species
2024-10-17 14:16:12,450 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,450 - julearn - INFO -      X_types:{'sepal': ['sepal_length', 'sepal_width'], 'petal': ['petal_length', 'petal_width']}
2024-10-17 14:16:12,451 - julearn - INFO - ====================
2024-10-17 14:16:12,451 - julearn - INFO -
2024-10-17 14:16:12,452 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,452 - julearn - INFO - ====================
2024-10-17 14:16:12,452 - julearn - INFO -
2024-10-17 14:16:12,452 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,452 - julearn - INFO -      Problem type: classification
2024-10-17 14:16:12,452 - julearn - INFO -      Number of samples: 150
2024-10-17 14:16:12,452 - julearn - INFO -      Number of features: 4
2024-10-17 14:16:12,452 - julearn - INFO - ====================
2024-10-17 14:16:12,453 - julearn - INFO -
2024-10-17 14:16:12,453 - julearn - INFO -      Number of classes: 3
2024-10-17 14:16:12,453 - julearn - INFO -      Target type: object
2024-10-17 14:16:12,453 - julearn - INFO -      Class distributions: species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-17 14:16:12,453 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,454 - julearn - INFO - Multi-class classification problem detected #classes = 3.
   fit_time  score_time  ...  fold                          cv_mdsum
0  0.015360    0.006925  ...     0  b10eef89b4192178d482d7a1587a248a
1  0.016289    0.006949  ...     1  b10eef89b4192178d482d7a1587a248a
2  0.014953    0.007066  ...     2  b10eef89b4192178d482d7a1587a248a
3  0.014983    0.006925  ...     3  b10eef89b4192178d482d7a1587a248a
4  0.014983    0.007062  ...     4  b10eef89b4192178d482d7a1587a248a

[5 rows x 8 columns]

We covered how to set up basic pipelines, how to use the PipelineCreator, how to use the apply_to parameter of the PipelineCreator, and the basics of hyperparameters. In the next step we will understand the returns of run_cross_validation(), i.e., the model output and the scores of the performed cross-validation.

Total running time of the script: (0 minutes 0.347 seconds)