.. include:: links.inc

Understanding the pipeline
==========================

Julearn aims to provide a user-friendly way to apply complex machine learning
pipelines. To do so, julearn provides the :func:`.run_cross_validation`
function. Here, users can specify and customize their pipeline and how it
should be fitted and evaluated. Furthermore, this function can return the
complete, fitted pipeline so that it can be used on other data.

In this part of the documentation we will have a closer look at these
features of julearn and how you can use them.

.. note:: You should have read the :doc:`input` section before.

Model & Problem Type
********************

When using :func:`.run_cross_validation` you have to answer at least two
questions first and then set the corresponding arguments:

* What model do you want to use? You can pass any model name from
  :doc:`steps`, or any scikit-learn compatible model, to the ``model``
  argument of :func:`.run_cross_validation`.
* What problem type do you want to answer? In machine learning there are
  different kinds of problems you may want to handle. Julearn supports
  ``binary_classification``, ``multiclass_classification`` and ``regression``
  problems. You should set ``problem_type`` to one of these three problem
  types. By default, julearn uses the ``binary_classification`` type.

In short, decide which model you want to use and which problem type you want
to apply machine learning to.

Preprocessing
*************

Concepts
^^^^^^^^

By default, users do not have to specify how to preprocess their data. In
this case, julearn automatically standardizes the continuous features and the
confounds, and removes the existing confounds from the continuous features.
But users can configure :func:`.run_cross_validation` by specifying the three
preprocessing arguments, which transform the confounds, the target and the
features respectively (in this order).

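
To make the two operations named above more concrete, here is a minimal,
plain-Python sketch of z-standardizing a variable and removing a confound
from a feature. It assumes that confound removal means keeping the residuals
of a linear least-squares fit of the feature on the confound; this is an
assumption made for illustration, not a description of julearn's
implementation. The helper functions ``zscore`` and ``remove_confound`` and
the toy data are hypothetical.

.. code-block:: python

    from statistics import mean, pstdev


    def zscore(values):
        """Standardize values to zero mean and unit (population) variance."""
        mu, sigma = mean(values), pstdev(values)
        return [(v - mu) / sigma for v in values]


    def remove_confound(feature, confound):
        """Return the residuals of a least-squares fit of feature on confound."""
        f_mean, c_mean = mean(feature), mean(confound)
        slope = (
            sum((c - c_mean) * (f - f_mean) for c, f in zip(confound, feature))
            / sum((c - c_mean) ** 2 for c in confound)
        )
        intercept = f_mean - slope * c_mean
        return [f - (intercept + slope * c) for c, f in zip(confound, feature)]


    # Hypothetical toy data: one continuous feature and one confound.
    feature = [1.0, 2.0, 4.0, 7.0]
    confound = [0.0, 1.0, 2.0, 3.0]

    feature_z = zscore(feature)
    confound_z = zscore(confound)
    # What is left of the feature after the linear effect of the confound
    # has been subtracted.
    cleaned = remove_confound(feature_z, confound_z)

In julearn these steps are not written by hand; they are selected by name
and configured through :func:`.run_cross_validation`.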
To do so, you can set the following arguments of
:func:`.run_cross_validation`:

* ``preprocess_X``: specifies how to transform the features. Here, you can
  enter the name, or a list of names, of available transformers
  (:doc:`steps`). These are then applied to the features in order. By
  default, most transformers are applied only to the continuous features; for
  more information on this and how to modify this behavior, see below. For
  example, ``['zscore', 'pca']`` means that the (continuous) features are
  first z-standardized and then reduced using a principal component analysis.
  By default, features are not preprocessed, and confounds are removed if a
  confound was specified.
* ``preprocess_y``: specifies how to transform the target. Currently, this is
  limited to one available transformer. By default, no preprocessing is
  applied.
* ``preprocess_confounds``: specifies how to transform the confounds. Here,
  you can use the same list of available transformers as in
  ``preprocess_X``. By default, confounds are not preprocessed.

Example
^^^^^^^

Assume we do not want to preprocess the confounds, want to z-score the
target, and want to apply a PCA to the features before removing the confound
from these features. All of these operations are included in the
:doc:`steps` and can therefore be referred to by name. In other words, we
need to set:

* ``preprocess_confounds = []``
* ``preprocess_y = 'zscore'``
* ``preprocess_X = ['pca', 'remove_confound']``

Additionally, we know that we are facing a ``multiclass_classification``
problem and want to use an SVM model. Put together with an example from the
:doc:`input` section, the code looks like this:

.. code-block:: python

    from seaborn import load_dataset
    from julearn import run_cross_validation

    df_iris = load_dataset('iris')
    X = ['sepal_length', 'sepal_width', 'petal_length']
    y = 'species'
    confounds = 'petal_width'

    # Leave the confounds untouched, z-score the target, and apply a PCA
    # to the features before removing the confound from them.
    preprocess_confounds = []
    preprocess_y = 'zscore'
    preprocess_X = ['pca', 'remove_confound']

    run_cross_validation(
        X=X, y=y, data=df_iris, confounds=confounds, model='svm',
        problem_type='multiclass_classification',
        preprocess_X=preprocess_X,
        preprocess_confounds=preprocess_confounds,
        preprocess_y=preprocess_y)

.. note:: Instead of using the names of the available transformers, you can
          also use scikit-learn compatible transformers. However, it is
          recommended to register your own transformers first. For more
          information see (#TODO)

More information
^^^^^^^^^^^^^^^^

As mentioned above, julearn allows the user to specify to which
variables/columns or variable/column types each transformer is applied. To do
so, you can adjust the ``apply_to`` hyperparameter, which is added to all
transformers used in ``preprocess_X``. You can find such an example at #TODO
and more information on hyperparameter tuning in :doc:`hyperparameters`.

The returned pipeline
*********************

:func:`.run_cross_validation` uses all the information mentioned above to
create one :class:`.ExtendedDataFramePipeline`, which is then used for
cross-validation. Additionally, it can return the fitted pipeline for other
applications. For example, you might want to test the pipeline on an
additional test set. But how can you do that?

Returning the (extended) pipeline
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are multiple options for returning the pipeline(s). For all of them
you have to set the ``return_estimator`` argument. These are the possible
options:

* ``None``: Does not return any estimator.
* ``'final'``: Returns the estimator fitted on all the data.
* ``'cv'``: Returns all the estimators from each CV split, fitted on the
  training data.
* ``'all'``: Returns all the estimators (final and cv).

These returned estimators are always :class:`.ExtendedDataFramePipeline`
objects. Therefore, the next section discusses how you can use a returned
estimator.

ExtendedDataFramePipeline
^^^^^^^^^^^^^^^^^^^^^^^^^

The :class:`.ExtendedDataFramePipeline` has the same basic functionality as
all scikit-learn pipelines or estimators, but also comes with some caveats.

Where ExtendedDataFramePipeline behaves as usual
------------------------------------------------

The following methods work as in sklearn:

* ``.fit()``
* ``.predict()``
* ``.score()``
* ``.predict_proba()``

Caveats of ExtendedDataFramePipeline
------------------------------------

In contrast to scikit-learn pipelines, an
:class:`.ExtendedDataFramePipeline` can change the ground truth (transform
the target). This means that any function which uses sklearn scorer functions
instead of calling ``.score()`` on the :class:`.ExtendedDataFramePipeline`
can give you the wrong output without **any warning**. One example is the
``cross_validate`` function of sklearn when it is used with another scorer.
If you want to use such functions, you can follow this example (#TODO), which
shows how to use julearn's ``extended_scorer`` instead.

Additional functionality
------------------------

Furthermore, :class:`.ExtendedDataFramePipeline` objects have the following
added methods:

* ``preprocess``: a method to apply the preprocessing steps of the pipeline
  to some data. The ``until`` argument can be used to preprocess only up to a
  specific transformer.

Advanced Topics
===============

The following sections cover advanced topics which are not needed for many
use cases, but still provide some context for those who want it.

Column Type System
******************

Context
^^^^^^^

To be able to discriminate between different types of variables, Julearn
uses a Column Type System. This system currently distinguishes between
continuous variables/features, categorical variables/features and confounds.

.. note:: On most levels of Julearn this Column Type System is only used
          internally. Therefore, users do not have to work with it directly.
          For example, by providing the confounds and categorical variables
          to the :class:`.ExtendedDataFramePipeline`, it has all the
          information needed to apply the Column Type System internally
          without any further input or changes to the `pandas.DataFrame`_.

How it works
^^^^^^^^^^^^

Every `pandas.DataFrame`_ column has a column name. Inside of Julearn, we
append another string containing the type of the column to the original
column name, separated by our delimiter ``'__:type:__'``.

For example:

* We have the original columns:

  - ``'Intelligence'``
  - ``'Age'``
  - ``'LikesEvoks'``

* We know:

  - Intelligence is a **continuous** variable.
  - Age is a **confound**.
  - LikesEvoks is a **categorical** variable. Either someone likes Evoks or
    not.

* Inside of Julearn's Column Type System we can provide this information by
  changing the column names to:

  - ``'Intelligence__:type:__continuous'``
  - ``'Age__:type:__confound'``
  - ``'LikesEvoks__:type:__categorical'``
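
The naming scheme described above can be sketched in a few lines of plain
Python. This is only an illustration of the convention: the helper functions
``add_column_types`` and ``split_column_type`` are hypothetical and are not
part of julearn's API. Only the delimiter and the column names are taken from
the example above.

.. code-block:: python

    DELIMITER = '__:type:__'  # delimiter quoted in the text above


    def add_column_types(column_types):
        """Map each original column name to its typed column name."""
        return {
            name: f'{name}{DELIMITER}{col_type}'
            for name, col_type in column_types.items()
        }


    def split_column_type(typed_name):
        """Split a typed column name into (original name, column type)."""
        name, _, col_type = typed_name.partition(DELIMITER)
        return name, col_type


    # Column types taken from the example above.
    renaming = add_column_types({
        'Intelligence': 'continuous',
        'Age': 'confound',
        'LikesEvoks': 'categorical',
    })

    print(renaming['Age'])
    # Age__:type:__confound
    print(split_column_type('LikesEvoks__:type:__categorical'))
    # ('LikesEvoks', 'categorical')

A mapping like ``renaming`` could, for instance, be passed to
``DataFrame.rename(columns=renaming)`` in pandas to rename the columns of a
data frame accordingly.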