5.3. Model Building
So far we know how to parametrize run_cross_validation() in terms of the
input data (see Data). In this section, we will have a look at how to
parametrize the learning algorithm and the preprocessing steps, also known
as the pipeline.
A machine learning pipeline is a process to automate the workflow of
a predictive model. It can be thought of as a combination of pipes and
filters. At a pipeline’s starting point, the raw data is fed into the first
filter. The output of this filter is then fed into the next filter
(through a pipe). In supervised machine learning, different filters inside the
pipeline modify the data, while the last step is a learning algorithm that
generates predictions. Before using the pipeline to predict new data, the
pipeline has to be trained (fitted) on data. We call this, as scikit-learn
does, fitting the pipeline.
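To make the pipes-and-filters idea concrete, here is a minimal plain
scikit-learn sketch (not julearn-specific): a scaling filter followed by an
SVM, fitted and used as one object.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_arr, y_arr = load_iris(return_X_y=True)

# Two filters connected by pipes: scaling, then classification.
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
pipe.fit(X_arr, y_arr)          # fitting the pipeline
print(pipe.predict(X_arr[:3]))  # predicting with the fitted pipeline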
julearn aims to provide a user-friendly way to build and evaluate complex
machine learning pipelines. The run_cross_validation() function is the
entry point to safely evaluate pipelines by making it easy to specify,
customize and train the pipeline. We will first have a look at the most
basic pipeline, consisting only of a machine learning algorithm. Then we
will make the pipeline incrementally more complex.
Pipeline specification in run_cross_validation()
One important aspect when building machine learning models is the selection
of a learning algorithm. This can be specified in run_cross_validation() by
setting the model parameter. This parameter can be any scikit-learn
compatible learning algorithm. However, julearn provides a list of built-in
Models (Estimators) that can be specified by name (see the Name column in
Models (Estimators)). For example, we can simply set model="svm" to use a
Support Vector Machine (SVM) [2].
Let’s first specify the data parameters as we learned in Data:
from julearn import run_cross_validation
from seaborn import load_dataset
df = load_dataset("iris")
X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = "species"
X_types = {
    "continuous": [
        "sepal_length",
        "sepal_width",
        "petal_length",
        "petal_width",
    ]
}
Now we can run the cross-validation with SVM as the learning algorithm:
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model="svm",
    problem_type="classification",
)
print(scores)
2024-10-17 14:16:12,239 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,239 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,239 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,239 - julearn - INFO - Target: species
2024-10-17 14:16:12,239 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,239 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,240 - julearn - INFO - ====================
2024-10-17 14:16:12,240 - julearn - INFO -
2024-10-17 14:16:12,240 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,240 - julearn - INFO - Step added
2024-10-17 14:16:12,240 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,241 - julearn - INFO - ====================
2024-10-17 14:16:12,241 - julearn - INFO -
2024-10-17 14:16:12,241 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,241 - julearn - INFO - Problem type: classification
2024-10-17 14:16:12,241 - julearn - INFO - Number of samples: 150
2024-10-17 14:16:12,241 - julearn - INFO - Number of features: 4
2024-10-17 14:16:12,241 - julearn - INFO - ====================
2024-10-17 14:16:12,241 - julearn - INFO -
2024-10-17 14:16:12,241 - julearn - INFO - Number of classes: 3
2024-10-17 14:16:12,241 - julearn - INFO - Target type: object
2024-10-17 14:16:12,242 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-10-17 14:16:12,242 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,242 - julearn - INFO - Multi-class classification problem detected #classes = 3.
fit_time score_time ... fold cv_mdsum
0 0.003261 0.001721 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.002920 0.001693 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.002697 0.001709 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.002729 0.001671 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.002766 0.001642 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
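Since the model parameter accepts any scikit-learn compatible learning
algorithm, the same cross-validation could presumably also be specified with
an estimator instance instead of the built-in name; a hedged sketch, reusing
the variables defined above:
from sklearn.svm import SVC

# Passing a scikit-learn estimator instance instead of the name "svm".
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=SVC(),
    problem_type="classification",
)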
You will notice that these calls include an extra parameter, problem_type.
This is because in machine learning one distinguishes between regression
problems (predicting a continuous outcome) and classification problems
(predicting discrete class labels). Therefore, run_cross_validation()
additionally needs to know which problem type we are interested in. The
possible values for problem_type are classification and regression. In the
example we are interested in predicting the species (see y in Data), i.e. a
discrete class label.
Et voilà, your first machine learning pipeline is ready to go.
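As an aside, if the target were continuous, the same call would use
problem_type="regression". A hedged sketch on the same dataset, assuming the
built-in "svm" name also resolves to the regression variant (SVR) for
regression problems, and predicting sepal_length from the remaining
features:
# Predicting a continuous target instead of a class label.
scores_reg = run_cross_validation(
    X=["sepal_width", "petal_length", "petal_width"],
    y="sepal_length",
    data=df,
    model="svm",
    problem_type="regression",
)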
Feature preprocessing
There are cases in which the input data, and in particular the features, should be transformed before passing them to the learning algorithm. One scenario is that certain learning algorithms need the features in a specific form, for example in standardized form, so that all features are on a comparable scale with zero mean and unit variance. This can be achieved by z-scoring (or standard scaling) the features (see Scalers).
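As a quick refresher, z-scoring simply centers each feature and divides it
by its standard deviation; a minimal NumPy illustration (not
julearn-specific):
import numpy as np

x = np.array([4.7, 5.1, 6.3, 5.8])
z = (x - x.mean()) / x.std()
print(round(z.mean(), 10), z.std())  # ~0.0 and ~1.0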
Importantly, in a machine learning workflow all transformations of the data
have to be done in a cv-consistent way. That means that data transformation
steps have to be fitted on the training data of each cross-validation fold,
with only the resulting transformation parameters applied to the validation
data of the respective fold. One should never do preprocessing on the
entire dataset and then do cross-validation on the already preprocessed
features (or, more generally, transformed data), because this leaks
information from the validation data into the model. This is exactly where
run_cross_validation() comes in handy: you can simply add your desired
preprocessing step (Transformers) and it takes care of doing the respective
transformations in a cv-consistent manner.
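To illustrate the difference, here is a minimal plain scikit-learn sketch
contrasting the leaky approach with the cv-consistent one;
run_cross_validation() corresponds to the second pattern.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_arr, y_arr = load_iris(return_X_y=True)

# Leaky: the scaler sees the entire dataset, including future validation folds.
X_leaky = StandardScaler().fit_transform(X_arr)
leaky_scores = cross_val_score(SVC(), X_leaky, y_arr, cv=5)

# CV-consistent: the scaler is re-fitted on the training part of each fold.
pipe = Pipeline([("zscore", StandardScaler()), ("svm", SVC())])
clean_scores = cross_val_score(pipe, X_arr, y_arr, cv=5)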
Let’s have a look at how we can add a z-scoring step to our pipeline:
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    preprocess="zscore",
    model="svm",
    problem_type="classification",
)
print(scores)
2024-10-17 14:16:12,275 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,275 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,276 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,276 - julearn - INFO - Target: species
2024-10-17 14:16:12,276 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,276 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,276 - julearn - INFO - ====================
2024-10-17 14:16:12,276 - julearn - INFO -
2024-10-17 14:16:12,276 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,277 - julearn - INFO - Step added
2024-10-17 14:16:12,277 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,277 - julearn - INFO - Step added
2024-10-17 14:16:12,277 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,277 - julearn - INFO - ====================
2024-10-17 14:16:12,277 - julearn - INFO -
2024-10-17 14:16:12,277 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,277 - julearn - INFO - Problem type: classification
2024-10-17 14:16:12,277 - julearn - INFO - Number of samples: 150
2024-10-17 14:16:12,277 - julearn - INFO - Number of features: 4
2024-10-17 14:16:12,277 - julearn - INFO - ====================
2024-10-17 14:16:12,278 - julearn - INFO -
2024-10-17 14:16:12,278 - julearn - INFO - Number of classes: 3
2024-10-17 14:16:12,278 - julearn - INFO - Target type: object
2024-10-17 14:16:12,278 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-10-17 14:16:12,278 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,279 - julearn - INFO - Multi-class classification problem detected #classes = 3.
fit_time score_time ... fold cv_mdsum
0 0.004609 0.002466 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.004534 0.002429 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.004568 0.002462 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.004427 0.002417 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.004408 0.002435 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
Note
Learning algorithms (what we specified in the model parameter) are
estimators. Preprocessing steps, however, are usually transformers, because
they transform the input data in a certain way. Therefore, the parameter
description in the API of run_cross_validation() defines valid input for
the preprocess parameter as TransformerLike:
preprocess : str, TransformerLike or list | None
    Transformer to apply to the features. If string, use one of the
    available transformers. If list, each element can be a string or
    scikit-learn compatible transformer. If None (default), no
    transformation is applied.
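Per this docstring, a scikit-learn compatible transformer instance should
also be accepted in place of a string name; a hedged sketch using
MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler

# A transformer instance instead of a built-in name, per the docstring above.
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    preprocess=MinMaxScaler(),
    model="svm",
    problem_type="classification",
)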
But what if we want to add more preprocessing steps? For example, when many features are available, we might want to reduce their dimensionality before passing them to the learning algorithm. A commonly used approach is principal component analysis (PCA, see Decomposition). If we nevertheless want to keep our previously applied z-scoring, we can simply add the PCA as another preprocessing step as follows:
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    preprocess=["zscore", "pca"],
    model="svm",
    problem_type="classification",
)
print(scores)
2024-10-17 14:16:12,324 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,324 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,324 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,324 - julearn - INFO - Target: species
2024-10-17 14:16:12,324 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,324 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,325 - julearn - INFO - ====================
2024-10-17 14:16:12,325 - julearn - INFO -
2024-10-17 14:16:12,325 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,325 - julearn - INFO - Step added
2024-10-17 14:16:12,325 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,325 - julearn - INFO - Step added
2024-10-17 14:16:12,325 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,325 - julearn - INFO - Step added
2024-10-17 14:16:12,326 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,326 - julearn - INFO - ====================
2024-10-17 14:16:12,326 - julearn - INFO -
2024-10-17 14:16:12,326 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,326 - julearn - INFO - Problem type: classification
2024-10-17 14:16:12,326 - julearn - INFO - Number of samples: 150
2024-10-17 14:16:12,326 - julearn - INFO - Number of features: 4
2024-10-17 14:16:12,326 - julearn - INFO - ====================
2024-10-17 14:16:12,326 - julearn - INFO -
2024-10-17 14:16:12,326 - julearn - INFO - Number of classes: 3
2024-10-17 14:16:12,327 - julearn - INFO - Target type: object
2024-10-17 14:16:12,327 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-10-17 14:16:12,327 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,327 - julearn - INFO - Multi-class classification problem detected #classes = 3.
fit_time score_time ... fold cv_mdsum
0 0.006233 0.003244 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.005605 0.003292 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.005648 0.003210 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.005651 0.003303 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.005561 0.003198 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
This is nice, but as more steps are added to the pipeline, this can become
opaque. To simplify building complex pipelines, julearn provides a
PipelineCreator, which helps keep things neat.
Pipeline specification made easy with the PipelineCreator
The PipelineCreator is a class that helps the user create complex pipelines
in a straightforward way, by adding the desired steps to the pipeline in
order. Once the pipeline is specified, run_cross_validation() will detect
that it is a pipeline creator and will automatically create the pipeline
and run the cross-validation.
Note
The PipelineCreator always has to be initialized with the problem_type
parameter, which can be either classification or regression.
Let’s re-write the previous example using the PipelineCreator. We start by
creating an instance of the PipelineCreator, setting the problem_type
parameter to classification.
from julearn.pipeline import PipelineCreator
creator = PipelineCreator(problem_type="classification")
Then we use the add method to add every desired step to the pipeline. Both
the preprocessing steps and the learning algorithm are added in the same
way. As with the run_cross_validation() function, one can use the names of
the steps as indicated in Overview of available Pipeline Steps.
creator.add("zscore")
creator.add("pca")
creator.add("svm")
print(creator)
2024-10-17 14:16:12,384 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,384 - julearn - INFO - Step added
2024-10-17 14:16:12,384 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,384 - julearn - INFO - Step added
2024-10-17 14:16:12,384 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,384 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: pca
estimator: PCA()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 2: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
We then pass the creator to run_cross_validation() as the model parameter.
We do not need to (and cannot) specify any other pipeline specification
parameter (such as preprocess):
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)
print(scores)
2024-10-17 14:16:12,385 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,385 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,385 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,385 - julearn - INFO - Target: species
2024-10-17 14:16:12,386 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,386 - julearn - INFO - X_types:{'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
2024-10-17 14:16:12,386 - julearn - INFO - ====================
2024-10-17 14:16:12,386 - julearn - INFO -
2024-10-17 14:16:12,387 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,387 - julearn - INFO - ====================
2024-10-17 14:16:12,387 - julearn - INFO -
2024-10-17 14:16:12,387 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,387 - julearn - INFO - Problem type: classification
2024-10-17 14:16:12,387 - julearn - INFO - Number of samples: 150
2024-10-17 14:16:12,387 - julearn - INFO - Number of features: 4
2024-10-17 14:16:12,387 - julearn - INFO - ====================
2024-10-17 14:16:12,387 - julearn - INFO -
2024-10-17 14:16:12,387 - julearn - INFO - Number of classes: 3
2024-10-17 14:16:12,387 - julearn - INFO - Target type: object
2024-10-17 14:16:12,388 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-10-17 14:16:12,388 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,388 - julearn - INFO - Multi-class classification problem detected #classes = 3.
fit_time score_time ... fold cv_mdsum
0 0.006163 0.003283 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.005715 0.003229 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.005679 0.003237 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.005617 0.003217 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.005570 0.003192 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
Awesome! We covered how to create a basic machine learning pipeline and even added multiple feature preprocessing steps.
Let’s jump to the next important aspect in the process of building a machine learning model: hyperparameters. Here we cover the basics of setting hyperparameters. If you want to know more about tuning (or optimizing) hyperparameters, please have a look at Hyperparameter Tuning.
Setting hyperparameters
If you are new to machine learning, the section heading might confuse you: Parameters, hyperparameters - aren’t we doing machine learning, so shouldn’t the model learn all our parameters? Well, yes and no. Yes, it should learn parameters. However, hyperparameters and parameters are two different things.
A model parameter is a variable that is internal to the learning algorithm and we want to learn or estimate its value from the data, which in turn means that they are not set manually. They are required by the model and are often saved as part of the fitting process. Examples of model parameters are the weights in an artificial neural network, the support vectors in a support vector machine or the coefficients/weights in a linear or logistic regression.
Hyperparameters, in turn, are configurations of a learning algorithm which
cannot be estimated from data, but nevertheless need to be specified to
determine how the model parameters will be learnt. The best value for a
hyperparameter on a given problem is usually not known; it therefore has to
be set manually, based on experience from a previous similar problem, set
using a heuristic (rule of thumb), or tuned. Examples are the learning rate
for training a neural network, the C and sigma hyperparameters of support
vector machines, or the number of estimators in a random forest.
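A minimal plain scikit-learn illustration of the distinction: C is a
hyperparameter we choose before fitting, while the support vectors are
parameters learned from the data.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X_arr, y_arr = load_iris(return_X_y=True)

clf = SVC(C=1.0)  # C: a hyperparameter, set by us before fitting
clf.fit(X_arr, y_arr)
print(clf.support_vectors_.shape)  # support vectors: parameters learned from data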
Manually specifying hyperparameters with julearn is as simple as using the
PipelineCreator and setting the hyperparameter when the step is added.
Let’s say we want to set the with_mean parameter of the z-score transformer
and keep enough PCA components to explain 20% of the variance. This is how
the creator would look:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=True)
creator.add("pca", n_components=0.2)
creator.add("svm")
print(creator)
2024-10-17 14:16:12,444 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,445 - julearn - INFO - Setting hyperparameter with_mean = True
2024-10-17 14:16:12,445 - julearn - INFO - Step added
2024-10-17 14:16:12,445 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,445 - julearn - INFO - Setting hyperparameter n_components = 0.2
2024-10-17 14:16:12,445 - julearn - INFO - Step added
2024-10-17 14:16:12,445 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,445 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: pca
estimator: PCA(n_components=0.2)
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 2: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Usable transformers or estimators can be seen under Overview of available
Pipeline Steps. Most of these steps are based on the respective
scikit-learn estimators or transformers. To see the valid hyperparameters
for a certain transformer or estimator, just follow the respective link in
Overview of available Pipeline Steps, which will lead you to the
scikit-learn documentation where you can read more about them.
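Alternatively, scikit-learn estimators expose their settable
hyperparameters programmatically via get_params(); a quick sketch:
from sklearn.svm import SVC

# All settable hyperparameters of the SVC estimator, with their default values.
print(SVC().get_params())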
In many cases one wants to specify more than one hyperparameter. This can
be done by passing each hyperparameter as a separate keyword argument. For
the svm we could, for example, specify the C and kernel hyperparameters
like this:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore", with_mean=True)
creator.add("pca", n_components=0.2)
creator.add("svm", C=0.9, kernel="linear")
print(creator)
2024-10-17 14:16:12,446 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,446 - julearn - INFO - Setting hyperparameter with_mean = True
2024-10-17 14:16:12,446 - julearn - INFO - Step added
2024-10-17 14:16:12,446 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,446 - julearn - INFO - Setting hyperparameter n_components = 0.2
2024-10-17 14:16:12,446 - julearn - INFO - Step added
2024-10-17 14:16:12,447 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,447 - julearn - INFO - Setting hyperparameter C = 0.9
2024-10-17 14:16:12,447 - julearn - INFO - Setting hyperparameter kernel = linear
2024-10-17 14:16:12,447 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 1: pca
estimator: PCA(n_components=0.2)
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Step 2: svm
estimator: SVC(C=0.9, kernel='linear')
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
Selective preprocessing using feature types
Under Pipeline specification made easy with the PipelineCreator you might
have wondered how the PipelineCreator makes things easier. Besides the
straightforward definition of hyperparameters, the PipelineCreator also
allows specifying that a certain step must only be applied to certain
feature types (see Data on how to define feature types).
In our example, we can now choose to do two PCA steps, one for the petal features, and one for the sepal features.
First, we need to define the X_types
so we have both petal and sepal
features:
X_types = {
    "sepal": ["sepal_length", "sepal_width"],
    "petal": ["petal_length", "petal_width"],
}
Then, we modify the previous creator: we add the pca step twice, specifying
that one instance should only be applied to the petal features and the
other to the sepal features. Since we also want the zscore applied to all
features, we need to specify this as well, indicating that we want to apply
it to both petal and sepal features:
creator = PipelineCreator(problem_type="classification")
creator.add("zscore", apply_to=["petal", "sepal"], with_mean=True)
creator.add("pca", apply_to="petal", n_components=1)
creator.add("pca", apply_to="sepal", n_components=1)
creator.add("svm")
print(creator)
2024-10-17 14:16:12,448 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'petal', 'sepal'}; pattern=(?:__:type:__petal|__:type:__sepal)>
2024-10-17 14:16:12,448 - julearn - INFO - Setting hyperparameter with_mean = True
2024-10-17 14:16:12,448 - julearn - INFO - Step added
2024-10-17 14:16:12,448 - julearn - INFO - Adding step pca that applies to ColumnTypes<types={'petal'}; pattern=(?:__:type:__petal)>
2024-10-17 14:16:12,448 - julearn - INFO - Setting hyperparameter n_components = 1
2024-10-17 14:16:12,448 - julearn - INFO - Step added
2024-10-17 14:16:12,448 - julearn - INFO - Adding step pca_1 that applies to ColumnTypes<types={'sepal'}; pattern=(?:__:type:__sepal)>
2024-10-17 14:16:12,449 - julearn - INFO - Setting hyperparameter n_components = 1
2024-10-17 14:16:12,449 - julearn - INFO - Step added
2024-10-17 14:16:12,449 - julearn - INFO - Adding step svm that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-17 14:16:12,449 - julearn - INFO - Step added
PipelineCreator:
Step 0: zscore
estimator: StandardScaler()
apply to: ColumnTypes<types={'petal', 'sepal'}; pattern=(?:__:type:__petal|__:type:__sepal)>
needed types: ColumnTypes<types={'petal', 'sepal'}; pattern=(?:__:type:__petal|__:type:__sepal)>
tuning params: {}
Step 1: pca
estimator: PCA(n_components=1)
apply to: ColumnTypes<types={'petal'}; pattern=(?:__:type:__petal)>
needed types: ColumnTypes<types={'petal'}; pattern=(?:__:type:__petal)>
tuning params: {}
Step 2: pca_1
estimator: PCA(n_components=1)
apply to: ColumnTypes<types={'sepal'}; pattern=(?:__:type:__sepal)>
needed types: ColumnTypes<types={'sepal'}; pattern=(?:__:type:__sepal)>
tuning params: {}
Step 3: svm
estimator: SVC()
apply to: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
needed types: ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
tuning params: {}
We have additionally specified, as a hyperparameter of each pca step, that
we want to keep only the first component. For the svm we used the default
hyperparameters. Note in the log above that the svm step still applies to
continuous features: the PCA outputs are typed as continuous (julearn's
default for transformer outputs), so the SVM receives the components.
Finally, we again pass the defined X_types and the creator to
run_cross_validation():
scores = run_cross_validation(
    X=X,
    y=y,
    data=df,
    X_types=X_types,
    model=creator,
)
print(scores)
2024-10-17 14:16:12,450 - julearn - INFO - ==== Input Data ====
2024-10-17 14:16:12,450 - julearn - INFO - Using dataframe as input
2024-10-17 14:16:12,450 - julearn - INFO - Features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,450 - julearn - INFO - Target: species
2024-10-17 14:16:12,450 - julearn - INFO - Expanded features: ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
2024-10-17 14:16:12,450 - julearn - INFO - X_types:{'sepal': ['sepal_length', 'sepal_width'], 'petal': ['petal_length', 'petal_width']}
2024-10-17 14:16:12,451 - julearn - INFO - ====================
2024-10-17 14:16:12,451 - julearn - INFO -
2024-10-17 14:16:12,452 - julearn - INFO - = Model Parameters =
2024-10-17 14:16:12,452 - julearn - INFO - ====================
2024-10-17 14:16:12,452 - julearn - INFO -
2024-10-17 14:16:12,452 - julearn - INFO - = Data Information =
2024-10-17 14:16:12,452 - julearn - INFO - Problem type: classification
2024-10-17 14:16:12,452 - julearn - INFO - Number of samples: 150
2024-10-17 14:16:12,452 - julearn - INFO - Number of features: 4
2024-10-17 14:16:12,452 - julearn - INFO - ====================
2024-10-17 14:16:12,453 - julearn - INFO -
2024-10-17 14:16:12,453 - julearn - INFO - Number of classes: 3
2024-10-17 14:16:12,453 - julearn - INFO - Target type: object
2024-10-17 14:16:12,453 - julearn - INFO - Class distributions: species
setosa 50
versicolor 50
virginica 50
Name: count, dtype: int64
2024-10-17 14:16:12,453 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-17 14:16:12,454 - julearn - INFO - Multi-class classification problem detected #classes = 3.
fit_time score_time ... fold cv_mdsum
0 0.015360 0.006925 ... 0 b10eef89b4192178d482d7a1587a248a
1 0.016289 0.006949 ... 1 b10eef89b4192178d482d7a1587a248a
2 0.014953 0.007066 ... 2 b10eef89b4192178d482d7a1587a248a
3 0.014983 0.006925 ... 3 b10eef89b4192178d482d7a1587a248a
4 0.014983 0.007062 ... 4 b10eef89b4192178d482d7a1587a248a
[5 rows x 8 columns]
We covered how to set up basic pipelines, how to use the PipelineCreator,
how to use the apply_to parameter of the PipelineCreator, and the basics of
hyperparameters. In the next step we will look at the returns of
run_cross_validation(), i.e., the model output and the scores of the
performed cross-validation.