Reference

Main API functions

julearn.api.create_pipeline(model, confounds=None, problem_type='binary_classification', preprocess_X=None, preprocess_y=None, preprocess_confounds=None, model_params=None)

Creates an unfitted julearn pipeline.

Parameters:
model: str or scikit-learn compatible model

If string, it will use one of the available models. See available_models.

confounds: str, list(str) or numpy.array | None

The confounds. See https://juaml.github.io/julearn/input.html for details.

problem_type: str

The kind of problem to model.

Options are:

  • “binary_classification”: Perform a binary classification in which the target (y) has only two possible classes (default). The parameter pos_labels can be used to convert a target with multiple classes into binary.

  • “multiclass_classification”: Performs a multiclass classification in which the target (y) has more than two possible values.

  • “regression”: Perform a regression. The target (y) must be at least ordinal.

preprocess_X: str, scikit-learn compatible transformers or list | None

Transformer to apply to the features (X). If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.

See documentation for details.

preprocess_y: str or scikit-learn transformer | None

Transformer to apply to the target (y). If None (default), no transformation is applied.

See documentation for details.

preprocess_confounds: str, scikit-learn transformers or list | None

Transformer to apply to the confounds. If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.

See documentation for details.

model_params: dict | None

If not None, this dictionary specifies the model parameters to use.

The dictionary can define the following keys:

  • ‘STEP__PARAMETER’: A value (or several) to be used as PARAMETER for STEP in the pipeline. Example: ‘svm__probability’: True will set the parameter ‘probability’ of the ‘svm’ model. If more than one option is provided for at least one hyperparameter, a search will be performed.

  • ‘search’: The kind of search algorithm to use, e.g.: ‘grid’ or ‘random’. Can be any valid julearn searcher name or scikit-learn compatible searcher.

  • ‘cv’: If search is going to be used, the cross-validation splitting strategy to use. Defaults to same CV as for the model evaluation.

  • ‘scoring’: If search is going to be used, the scoring metric to evaluate the performance.

  • ‘search_params’: Additional parameters for the search method.

See https://juaml.github.io/julearn/hyperparameters.html for details.
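The key convention above can be illustrated with a minimal, standard-library-only sketch. It is not julearn's implementation: the helper name expand_model_params and the way candidates are enumerated are assumptions made for illustration. It separates the reserved keys from the ‘STEP__PARAMETER’ keys and checks whether any hyperparameter offers more than one option, which is the condition under which a search is performed.

```python
from itertools import product


def expand_model_params(model_params):
    """Split a model_params-style dict into search settings and candidates.

    Hypothetical helper for illustration only, not part of julearn.
    """
    reserved = {"search", "cv", "scoring", "search_params"}
    settings = {k: v for k, v in model_params.items() if k in reserved}
    # Wrap single values in a list so every hyperparameter is a list of options.
    grid = {
        k: list(v) if isinstance(v, (list, tuple)) else [v]
        for k, v in model_params.items()
        if k not in reserved
    }
    # A search is only needed if some hyperparameter has several options.
    needs_search = any(len(options) > 1 for options in grid.values())
    names = sorted(grid)
    candidates = [
        dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))
    ]
    return settings, candidates, needs_search


model_params = {"svm__C": [0.1, 1, 10], "svm__probability": True, "search": "grid"}
settings, candidates, needs_search = expand_model_params(model_params)
for candidate in candidates:
    print(candidate)
```

Here ‘svm__C’ has three options, so a search would be triggered, while ‘svm__probability’ is a single fixed value shared by every candidate.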

Returns:
pipeline: obj

Unfitted julearn-compatible pipeline, or a pipeline wrapped in a Searcher.

julearn.api.run_cross_validation(X, y, model, data=None, confounds=None, problem_type='binary_classification', preprocess_X=None, preprocess_y=None, preprocess_confounds=None, return_estimator=False, return_train_score=False, cv=None, groups=None, scoring=None, pos_labels=None, model_params=None, seed=None, n_jobs=None, verbose=0)

Run cross validation and score.

Parameters:
X: str, list(str) or numpy.array

The features to use. See https://juaml.github.io/julearn/input.html for details.

y: str or numpy.array

The targets to predict. See https://juaml.github.io/julearn/input.html for details.

model: str or scikit-learn compatible model

If string, it will use one of the available models. See available_models.

data: pandas.DataFrame | None

DataFrame with the data (optional). See https://juaml.github.io/julearn/input.html for details.

confounds: str, list(str) or numpy.array | None

The confounds. See https://juaml.github.io/julearn/input.html for details.

problem_type: str

The kind of problem to model.

Options are:

  • “binary_classification”: Perform a binary classification in which the target (y) has only two possible classes (default). The parameter pos_labels can be used to convert a target with multiple classes into binary.

  • “multiclass_classification”: Performs a multiclass classification in which the target (y) has more than two possible values.

  • “regression”: Perform a regression. The target (y) must be at least ordinal.

preprocess_X: str, scikit-learn compatible transformers or list | None

Transformer to apply to the features (X). If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.

See documentation for details.

preprocess_y: str or scikit-learn transformer | None

Transformer to apply to the target (y). If None (default), no transformation is applied.

See documentation for details.

preprocess_confounds: str, scikit-learn transformers or list | None

Transformer to apply to the confounds. If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.

See documentation for details.

return_estimator: str | None

Return the fitted estimator(s). Options are:

  • ‘final’: Return the estimator fitted on all the data.

  • ‘cv’: Return all the estimators from each CV split, fitted on the training data.

  • ‘all’: Return all the estimators (final and cv).

return_train_score: bool

Whether to return the training score with the test scores (default is False).

cv: int, str or cross-validation generator | None

Cross-validation splitting strategy to use for model evaluation.

Options are:

  • None: defaults to 5-fold, repeated 5 times.

  • int: the number of folds in a (Stratified)KFold

  • CV Splitter (see scikit-learn documentation on CV)

  • An iterable yielding (train, test) splits as arrays of indices.
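To make the default concrete, the following standard-library sketch generates the index pairs of a 5-fold split repeated 5 times, in the "iterable yielding (train, test) splits" form listed above. It is an unstratified illustration under assumed names (repeated_kfold is hypothetical); the splitter julearn actually builds for cv=None may differ, for example by stratifying.

```python
import random


def repeated_kfold(n_samples, n_splits=5, n_repeats=5, seed=None):
    """Yield (train, test) index lists for a repeated, unstratified k-fold."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # Deal the shuffled indices into n_splits folds of near-equal size.
        folds = [indices[i::n_splits] for i in range(n_splits)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            yield sorted(train), sorted(test)


splits = list(repeated_kfold(n_samples=20, seed=0))
print(len(splits))  # 25 splits: 5 folds x 5 repeats
```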

groups: str or numpy.array | None

The grouping labels in case a Group CV is used. See https://juaml.github.io/julearn/input.html for details.

scoring: str | list(str) | obj | dict | None

The scoring metric to use. See https://scikit-learn.org/stable/modules/model_evaluation.html for a comprehensive list of options. If None, use the model’s default scorer.

pos_labels: str, int, float or list | None

The labels to interpret as positive. If not None, every element of y will be converted to 1 if it is equal to pos_labels (or contained in it, when pos_labels is a list) and to 0 otherwise.
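The conversion described for pos_labels can be sketched in a few lines of plain Python. The helper binarize_target is hypothetical and only mirrors the rule stated above; it is not julearn's code.

```python
def binarize_target(y, pos_labels):
    """Map each element of y to 1 if it is a positive label, else 0."""
    # Accept a single label or a collection of labels.
    if not isinstance(pos_labels, (list, tuple, set)):
        pos_labels = [pos_labels]
    return [1 if label in pos_labels else 0 for label in y]


y = ["setosa", "versicolor", "virginica", "setosa"]
print(binarize_target(y, "setosa"))                 # [1, 0, 0, 1]
print(binarize_target(y, ["setosa", "virginica"]))  # [1, 0, 1, 1]
```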

model_params: dict | None

If not None, this dictionary specifies the model parameters to use.

The dictionary can define the following keys:

  • ‘STEP__PARAMETER’: A value (or several) to be used as PARAMETER for STEP in the pipeline. Example: ‘svm__probability’: True will set the parameter ‘probability’ of the ‘svm’ model. If more than one option is provided for at least one hyperparameter, a search will be performed.

  • ‘search’: The kind of search algorithm to use, e.g.: ‘grid’ or ‘random’. Can be any valid julearn searcher name or scikit-learn compatible searcher.

  • ‘cv’: If search is going to be used, the cross-validation splitting strategy to use. Defaults to same CV as for the model evaluation.

  • ‘scoring’: If search is going to be used, the scoring metric to evaluate the performance.

  • ‘search_params’: Additional parameters for the search method.

See https://juaml.github.io/julearn/hyperparameters.html for details.

seed: int | None

If not None, set the random seed before any operation. Useful for reproducibility.

n_jobs: int | None

Number of parallel jobs used by the outer cross-validation. Follows scikit-learn/joblib conventions. None means 1 unless you use a joblib.parallel_backend. -1 means use all available processes for parallelisation.

verbose: int

Verbosity level of the outer cross-validation. Follows scikit-learn/joblib conventions. 0 means no additional information is printed. Larger numbers generally mean more information is printed. Note: verbosity up to 50 prints to standard error, while verbosity above 50 prints to standard output.

Returns:
scores: pd.DataFrame

The resulting scores (one column for each score specified). Additionally, a ‘fit_time’ column is added and, if return_estimator='all' or return_estimator='cv', an ‘estimator’ column with the estimators fitted for each CV split.

final_estimator: object

The final estimator, fitted on all the data (only if return_estimator='all' or return_estimator='final').

Model functions

julearn.estimators.list_models()

List all the available model names

Returns:
out: list(str)

A list with all the available model names.

julearn.estimators.get_model(name, problem_type, **kwargs)

Get a model

Parameters:
name: str

The model name

problem_type: str

The type of problem. See run_cross_validation().

Returns:
out: scikit-learn compatible model

The model object.

Transformer functions

julearn.transformers.list_transformers(target=False)

List all the available transformers

Parameters:
target: bool

If True, return a list of the target transformers. If False (default), return a list of features/confounds transformers.

Returns:
out: list(str)

A list with all the available transformer names.

julearn.transformers.get_transformer(name, target=False, **params)

Get a transformer

Parameters:
name: str

The transformer name

target: bool

If True, return a target transformer. If False (default), return a features/confounds transformer.

Returns:
out: scikit-learn compatible transformer

The transformer object.


Logging

julearn.utils.configure_logging(level='WARNING', fname=None, overwrite=None, output_format=None)

Configure the logging functionality

Parameters:
level: int or string

The level of the messages to print. If a string, it is interpreted as the name of a level in the logging module. Options are: [‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’]. Defaults to ‘WARNING’.

fname: str, Path or None

Filename of the log to print to. If None, stdout is used.

overwrite: bool | None

Overwrite the log file (if it exists). Otherwise, statements will be appended to the log (default). None is the same as False, but additionally raises a warning to notify the user that log entries will be appended.

output_format: str

Format of the output messages, e.g., “%(asctime)s - %(levelname)s - %(message)s”.

Defaults to “%(asctime)s - %(name)s - %(levelname)s - %(message)s”
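The format strings above use the placeholder syntax of Python's standard logging module. The sketch below shows how such a string and a level threshold behave using only the standard library; it does not call configure_logging. The logger name is hypothetical and %(asctime)s is left out so the output is deterministic.

```python
import io
import logging

# Write log records into an in-memory buffer so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

logger = logging.getLogger("julearn_format_demo")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.addHandler(handler)
logger.propagate = False

logger.info("not shown at WARNING level")
logger.warning("shown")
print(stream.getvalue(), end="")  # julearn_format_demo - WARNING - shown
```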

julearn.utils.warn(msg, category=<class 'RuntimeWarning'>)

Warn, but first log it

Parameters:
msg: str

Warning message

category: instance of Warning

The warning class. Defaults to RuntimeWarning.

julearn.utils.raise_error(msg, klass=<class 'ValueError'>)

Raise an error, but first log it

Parameters:
msg: str

Error message

klass: class

The class of the error to raise. Defaults to ValueError.
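The "log first, then warn or raise" pattern of these two helpers can be sketched with the standard library alone. The function names warn_logged and raise_logged and the logger name are hypothetical; this is an illustration of the pattern, not julearn's source.

```python
import logging
import warnings

logger = logging.getLogger("log_first_demo")  # hypothetical logger name


def warn_logged(msg, category=RuntimeWarning):
    """Log the message at WARNING level, then emit it as a warning."""
    logger.warning(msg)
    warnings.warn(msg, category=category, stacklevel=2)


def raise_logged(msg, klass=ValueError):
    """Log the message at ERROR level, then raise it as an exception."""
    logger.error(msg)
    raise klass(msg)


try:
    raise_logged("unknown problem_type", klass=KeyError)
except KeyError as err:
    print(type(err).__name__)  # KeyError
```

Logging before raising means the message reaches the log file even when the caller catches the exception.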

Cross-Validation

class julearn.model_selection.StratifiedBootstrap(n_splits=5, *, test_size=0.5, train_size=None, random_state=None)

Stratified Bootstrap cross-validation iterator

Provides train/test indices using resampling with replacement, respecting the distribution of samples for each class.

Parameters:
n_splits: int, default=5

Number of re-shuffling & splitting iterations.

test_size: float or int, default=0.5

If float, should be between 0.0 and 1.0 and represent the proportion of groups to include in the test split (rounded up). If int, represents the absolute number of test groups. If None, the value is set to the complement of the train size.

train_size: float or int, default=None

If float, should be between 0.0 and 1.0 and represent the proportion of the groups to include in the train split. If int, represents the absolute number of train groups. If None, the value is automatically set to the complement of the test size.

random_state: int or RandomState instance, default=None

Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls.

split(X, y=None, groups=None)

Generate indices to split data into training and test set.

Parameters:
X: array-like of shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.

y: array-like of shape (n_samples,) or (n_samples, n_labels)

The target variable for supervised learning problems. Stratification is done based on the y labels.

groups: object

Always ignored, exists for compatibility.

Yields:
train: ndarray

The training set indices for that split.

test: ndarray

The testing set indices for that split.

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
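As a rough illustration of resampling with replacement while respecting class proportions, here is one plausible reading written with the standard library only. It is not StratifiedBootstrap's actual algorithm: the function name, the per-class shuffle-then-resample strategy, and the rounding of the test size are all assumptions, and each class is assumed to have at least two samples.

```python
import random
from collections import defaultdict


def stratified_bootstrap(y, n_splits=5, test_size=0.5, seed=None):
    """Yield (train, test) index lists resampled with replacement per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    for _ in range(n_splits):
        train, test = [], []
        for members in by_class.values():
            # Shuffle the class members and split them into disjoint pools.
            shuffled = rng.sample(members, len(members))
            n_test = max(1, round(test_size * len(members)))
            test_pool, train_pool = shuffled[:n_test], shuffled[n_test:]
            # Resample with replacement inside each pool, so train and test
            # never share a sample and class proportions are preserved.
            train += rng.choices(train_pool, k=len(train_pool))
            test += rng.choices(test_pool, k=len(test_pool))
        yield train, test


y = ["a"] * 6 + ["b"] * 4
for train, test in stratified_bootstrap(y, n_splits=2, seed=42):
    print(sorted(train), sorted(test))
```

Passing the same integer seed reproduces the same splits, which mirrors the note on random_state above.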

class julearn.model_selection.StratifiedGroupsKFold(n_splits=5, *, shuffle=False, random_state=None)

split(X, y, groups=None)

Generate indices to split data into training and test set.

Parameters:
X: array-like of shape (n_samples, n_features)

Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.

y: array-like of shape (n_samples,)

Always ignored, exists for compatibility.

groups: object

The stratification variable.

Yields:
train: ndarray

The training set indices for that split.

test: ndarray

The testing set indices for that split.

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.

class julearn.model_selection.RepeatedStratifiedGroupsKFold(*, n_splits=5, n_repeats=10, random_state=None)

Repeated Stratified Groups K-Fold cross-validator. Repeats Stratified Groups K-Fold n_repeats times with different randomization in each repetition.

Parameters:
n_splits: int, default=5

Number of folds. Must be at least 2.

n_repeats: int, default=10

Number of times cross-validator needs to be repeated.

random_state: int, RandomState instance or None, default=None

Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. See Glossary.

Notes

Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.


Classes

class julearn.transformers.confounds.DataFrameConfoundRemover(model_confound=None, confounds_match='.*__:type:__confound', threshold=None, keep_confounds=False)

Transformer which can use pd.DataFrames and remove the confounds from the features by subtracting the predicted features given the confounds from the actual features.

Parameters:
model_confound: obj

Scikit-learn compatible model used to predict each feature independently, using the confounds as features. The predictions of these models are then subtracted from each feature. Defaults to LinearRegression().

confounds_match: list(str) | str

A string representing a regular expression by which the confounds can be detected from the column names. You can use the exact column names or another regex. The default follows the naming convention inside of julearn: ‘.*__:type:__confound’.

threshold: float | None

All residual values after confound removal which fall under the threshold will be set to 0. None (default) means that no threshold will be applied.

keep_confounds: bool, optional

Whether to return the confounds together with the confound-removed features. Defaults to False.

fit(X, y=None)

Fit confound remover

Parameters:
X: pandas.DataFrame

Training data

y: pandas.Series | None

Target values.

Returns:
self: returns an instance of self.

get_support(indices=False)

Get the support mask

Parameters:
indices: bool

If True, return indices.

Returns:
support_mask: numpy.array

The support mask

transform(X)

Removes confounds from data

Parameters:
X: pandas.DataFrame

Data to be deconfounded

Returns:
out: pandas.DataFrame

Data without confounds
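The subtraction described for this class can be sketched for a single feature and a single confound with a least-squares line in plain Python. This is an illustration, not the class's implementation: the helper remove_confound is hypothetical, and "fall under the threshold" is interpreted here as "absolute value below the threshold", which is an assumption.

```python
def remove_confound(feature, confound, threshold=None):
    """Return the residuals of a simple linear fit of feature on confound."""
    n = len(feature)
    mean_x = sum(confound) / n
    mean_y = sum(feature) / n
    # Ordinary least squares for one predictor.
    sxx = sum((x - mean_x) ** 2 for x in confound)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(confound, feature))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Subtract the predicted feature from the actual feature.
    residuals = [v - (intercept + slope * x) for x, v in zip(confound, feature)]
    if threshold is not None:
        # Assumed reading: small residuals (in absolute value) become 0.
        residuals = [0.0 if abs(r) < threshold else r for r in residuals]
    return residuals


age = [20, 30, 40, 50]           # confound
thickness = [41, 62, 80, 101]    # feature that depends strongly on age
print(remove_confound(thickness, age))
print(remove_confound(thickness, age, threshold=0.5))
```

The residuals sum to approximately zero, and with the threshold applied only the larger deviations from the fitted line survive.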

class julearn.transformers.confounds.TargetConfoundRemover(model_confound=None, confounds_match='.*__:type:__confound', threshold=None)

Transformer which can use pd.DataFrames and remove the confounds from the target by subtracting the predicted target given the confounds from the actual target.

Attributes:
model_confound: object

Model used to predict the target using the confounds as features. The predictions of this model are then subtracted from the actual target. Defaults to None, in which case a LinearRegression is used.

confounds_match: list(str) | str

A string representing a regular expression by which the confounds can be detected from the column names.

threshold: float | None

All residual values after confound removal which fall under the threshold will be set to 0. None means that no threshold will be applied.

fit_transform(X, y)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters:
X: array-like of shape (n_samples, n_features)

Input samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params: dict

Additional fit parameters.

Returns:
X_new: ndarray of shape (n_samples, n_features_new)

Transformed array.

class julearn.pipeline.ExtendedDataFramePipeline(dataframe_pipeline, y_transformer=None, confound_dataframe_pipeline=None, confounds=None, categorical_features=None)

A class creating a custom metamodel like a Pipeline. In practice, this should be created using julearn.pipeline._create_extended_pipeline; there are multiple caveats to creating such a pipeline without that function. Compared to a usual scikit-learn pipeline, this class adds the following functionality:

  • Handling transformations of the target:

    The target can be changed. Importantly, this transformed target will be considered the ground truth to score against. Note: if you want to score this pipeline with an external function, the scorer needs to be an extended_scorer.

  • Handling confounds:

    Adds the confound as type to columns. This allows the DataFrameWrapTransformer inside of the dataframe_pipeline to handle confounds properly.

  • Handling categorical features:

    Adds categorical type to columns so that DataFrameWrapTransformer inside of the dataframe_pipeline can handle categorical features properly.

column_types are added to the feature dataframe after each column using the specified separator. E.g. column age becomes age__:type:__confound.
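The column naming convention can be illustrated with a short standard-library sketch. The separator and the ‘confound’ suffix come from the text above; the helper add_column_types and the ‘categorical’ and ‘continuous’ type labels are assumptions made for illustration and may not match julearn's internal names.

```python
import re

SEPARATOR = "__:type:__"


def add_column_types(columns, confounds=(), categorical_features=()):
    """Append a column type to each column name (hypothetical helper)."""
    typed = []
    for col in columns:
        if col in confounds:
            col_type = "confound"
        elif col in categorical_features:
            col_type = "categorical"  # assumed label
        else:
            col_type = "continuous"   # assumed label
        typed.append(f"{col}{SEPARATOR}{col_type}")
    return typed


typed = add_column_types(
    ["age", "sex", "thickness"], confounds=["age"], categorical_features=["sex"]
)
# The default confounds_match pattern picks out the confound columns.
confounds_match = re.compile(".*__:type:__confound")
print([c for c in typed if confounds_match.fullmatch(c)])  # ['age__:type:__confound']
```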

Parameters:
dataframe_pipeline: obj

A pipeline working with dataframes and being able to handle confounds. Should be created using julearn.pipeline.create_dataframe_pipeline.

y_transformer: obj or None

Any transformer which can take the X and y to transform the y. You can use julearn.transformers.target.TargetTransfromerWrapper to convert most sklearn transformers to a target_transformer.

confound_dataframe_pipeline: obj or None

Similar to dataframe_pipeline.

confounds: list(str) or None

List of column names which are confounds (defaults to None).

categorical_features: list(str), optional

List of column names which are categorical features (defaults to None).

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params: dict

Parameter names mapped to their values.

preprocess(X, y, until=None, return_trans_column_type=False)

Parameters:
until: str or None

The name of the transformer until which preprocess should transform. Defaults to None.

return_trans_column_type: bool or None

Whether to return transformed column names with the associated column type. Defaults to False.

Returns:
tuple(pd.DataFrame, pd.Series)

Features and target after preprocessing.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params: dict

Estimator parameters.

Returns:
self: estimator instance

Estimator instance.
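The <component>__<parameter> form accepted by set_params can be illustrated with a toy nested object. The Step and TinyPipeline classes are hypothetical stand-ins written for this sketch; they are not julearn or scikit-learn classes and only mimic the routing of a parameter to the named component.

```python
class Step:
    """A stand-in for a pipeline component that stores its parameters."""

    def __init__(self, **params):
        self.params = dict(params)


class TinyPipeline:
    """A stand-in for a nested estimator with named components."""

    def __init__(self, **steps):
        self.steps = steps

    def set_params(self, **params):
        for key, value in params.items():
            # Split 'svm__C' into the component 'svm' and the parameter 'C'.
            component, _, parameter = key.partition("__")
            self.steps[component].params[parameter] = value
        return self


pipe = TinyPipeline(svm=Step(C=1.0), zscore=Step())
pipe.set_params(svm__C=10, svm__probability=True)
print(pipe.steps["svm"].params)  # {'C': 10, 'probability': True}
```

Returning self, as the real method does, allows calls to be chained.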