Understanding the pipeline¶
Julearn aims to provide a user-friendly way to apply complex machine learning
pipelines.
To do so, julearn provides the run_cross_validation()
function.
Here, users can specify and customize their pipeline,
and how it should be fitted and evaluated.
Furthermore, this function allows you to return the
complete, fitted pipeline to use it on other data.
In this part of the documentation we will have a closer look at these
features of julearn and how you can use them.
Note
You should read the input section first.
Model & Problem Type¶
When using
run_cross_validation()
you have to answer at least two questions first and then specify the corresponding arguments properly:
What model do you want to use? You can enter any model name from steps or pass any scikit-learn compatible model to the model argument of
run_cross_validation()
What problem type do you want to address? In machine learning there are different problems you might want to handle. Julearn supports
binary_classification
, multiclass_classification
and regression
problems. You should set problem_type
to one of these three problem types. By default, julearn uses the binary_classification
type.
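For instance, a minimal call combining just these two arguments could look like this (a minimal sketch using the iris data from the input section):
from seaborn import load_dataset
from julearn import run_cross_validation

df_iris = load_dataset('iris')

# Fit and evaluate an SVM on a multiclass problem
scores = run_cross_validation(
    X=['sepal_length', 'sepal_width'], y='species', data=df_iris,
    model='svm', problem_type='multiclass_classification')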
Preprocessing¶
Concepts¶
By default, users do not have to specify how to preprocess their data. In this case, julearn automatically standardizes the continuous features and the confounds, and removes existing confounds from the continuous features.
But users can configure run_cross_validation()
by specifying
the three preprocessing arguments for transforming the
features, target and confounds, respectively.
To do so you can set the following arguments of
run_cross_validation()
:
preprocess_X
: specifies how to transform the features. Here, you can enter the name or a list of names of available transformers (steps). These are then applied in order to the features. By default, most transformers are applied only to the continuous features. For more information on this and how to modify this behavior, see below. E.g.
['zscore', 'pca']
would mean that the (continuous) features are first z-standardized and then reduced using a principal component analysis. By default, features are not preprocessed, except that confounds are removed in case a confound was specified.
preprocess_y
: specifies how to transform the target. Currently, this is limited to one available transformer. By default, no preprocessing is applied.
preprocess_confounds
: specifies how to transform the confounds. Here, you can use the same list of available transformers as in preprocess_X
. By default, confounds are not preprocessed.
Example¶
Assume we do not want to preprocess the confounds, want to z-score the target, and want to apply a PCA to the features before removing the confounds from these features. All of these operations are included in the steps and can therefore be referred to by name.
In other words, we need to set:
preprocess_confounds = []
preprocess_target = 'zscore'
preprocess_X = ['pca', 'remove_confound']
Additionally, we know that we are facing a multiclass_classification problem and want to use an svm model. Put together with an example from the input section, the code looks like this:
from seaborn import load_dataset
from julearn import run_cross_validation

df_iris = load_dataset('iris')
X = ['sepal_length', 'sepal_width', 'petal_length']
y = 'species'
confounds = 'petal_width'

preprocess_confounds = []
preprocess_target = 'zscore'
preprocess_X = ['pca', 'remove_confound']

run_cross_validation(
    X=X, y=y, data=df_iris, confounds=confounds,
    model='svm', problem_type='multiclass_classification',
    preprocess_X=preprocess_X,
    preprocess_confounds=preprocess_confounds,
    preprocess_target=preprocess_target)
Note
Instead of using the names of the available transformers you can also use scikit-learn compatible transformers. However, it is recommended to register your own transformers first. For more information see (#TODO)
More information¶
As mentioned above, julearn allows the user to specify to which variables/columns
or variable/column types each transformer is applied. To do so you
can adjust the apply_to
hyperparameter, which is added to all transformers
used in preprocess_X
. You can find such an example at #TODO
and find more information on hyperparameter tuning in
hyperparameters .
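As a rough sketch, assuming that transformer hyperparameters can be passed through the model_params dictionary (the key name used here is an assumption following the usual step__parameter convention; see the hyperparameters section for the actual interface), applying the z-scorer to all columns could look like this:
# Assumed sketch: setting apply_to via model_params
run_cross_validation(
    X=X, y=y, data=df_iris, confounds=confounds,
    model='svm', problem_type='multiclass_classification',
    preprocess_X=['zscore'],
    model_params={'zscore__apply_to': 'all'})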
The returned pipeline¶
run_cross_validation()
uses all the information mentioned above
to create one ExtendedDataFramePipeline
, which is then used for
cross-validation. Additionally, it can return the fitted pipeline for other
applications. E.g. you could want to test the pipeline on an additional
test set. But how can you do that?
Returning the (extended) pipeline¶
There are multiple options to return the pipeline(s). For all of them you have to set the return_estimator argument. These are the possible options:
None: Does not return any estimator.
'final'
: Returns the estimator fitted on all the data.
'cv'
: Returns the estimators from each CV split, fitted on the respective training data.
'all'
: Returns all the estimators (final and cv).
These returned estimators are always ExtendedDataFramePipeline
objects. Therefore, the next section will discuss how you can use
a returned estimator.
ExtendedDataFramePipeline¶
The ExtendedDataFramePipeline
has the same basic functionality as
all scikit-learn pipelines or estimators, but it also comes with some caveats.
Where the ExtendedDataFramePipeline behaves as usual¶
The following methods work as in sklearn:
.fit()
.predict()
.score()
.predict_proba()
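For example, continuing with the final_model returned above, you can use it like any scikit-learn estimator (a sketch, assuming the pipeline accepts the same pandas.DataFrame columns used for training):
# Score and predict with the fitted pipeline
accuracy = final_model.score(df_iris[X], df_iris[y])
predictions = final_model.predict(df_iris[X])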
Caveats of the ExtendedDataFramePipeline¶
In contrast to scikit-learn pipelines, the ExtendedDataFramePipeline
can change the ground truth (transform the target).
This means that any function which uses sklearn scorer functions instead of
calling .score()
on the ExtendedDataFramePipeline
can give you
the wrong output without any warning.
For example, sklearn's cross_validate function when using another scorer.
If you want to use such functions, you can follow this example (#TODO), which
shows how to use julearn's extended_scorer
instead.
Additional functionality¶
Furthermore, the ExtendedDataFramePipeline
has the following
added methods:
preprocess
: a method to apply the preprocessing steps of the pipeline to some data. Furthermore, the until
argument can be used to only preprocess up to a specific transformer.
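For instance, inspecting the data after the PCA step of the example pipeline above could look like this (a hypothetical sketch; the exact signature and return values of preprocess are assumptions here):
# Apply the preprocessing steps up to the 'pca' step
X_trans, y_trans = final_model.preprocess(
    df_iris[X], df_iris[y], until='pca')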
Advanced Topics¶
The following sections are advanced topics which do not need to be read for most use cases, but they still provide some context for those who want it.
Column Type System¶
Context¶
To be able to discriminate between different types of variables, Julearn uses a Column Type System. This system currently distinguishes between continuous variables/features, categorical variables/features and confounds.
Note
On most levels of Julearn, this Column Type System is only used internally.
Therefore, users do not have to work with it directly.
For example, by providing the confounds and categorical variables to the
ExtendedDataFramePipeline
, it has all the information needed to
apply the Column Type System internally without any further input or changes
to the pandas.DataFrame.
How it works¶
Every pandas.DataFrame column has a column name.
Inside of Julearn, we append the type of the column to the original
column name, separated by our delimiter: '__:type:__'
.
For example:
We have the original columns:
'Intelligence'
'Age'
'LikesEvoks'
We know:
Intelligence is a continuous variable
Age is a confound
LikesEvoks is a categorical variable. Either someone likes Evoks or not.
Inside of Julearn’s Column Type System we can provide this information by changing the column names to:
'Intelligence__:type:__continuous'
'Age__:type:__confound'
'LikesEvoks__:type:__categorical'
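The following sketch illustrates this renaming with plain pandas (for illustration only; julearn performs this internally, so users do not need to rename anything themselves):
import pandas as pd

# Original columns and their types
df = pd.DataFrame({'Intelligence': [100, 110],
                   'Age': [25, 30],
                   'LikesEvoks': [1, 0]})
types = {'Intelligence': 'continuous',
         'Age': 'confound',
         'LikesEvoks': 'categorical'}

# Append each column's type, separated by the '__:type:__' delimiter
df = df.rename(columns={col: f'{col}__:type:__{typ}'
                        for col, typ in types.items()})
print(list(df.columns))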