Reference
Main API functions
- julearn.api.create_pipeline(model, confounds=None, problem_type='binary_classification', preprocess_X=None, preprocess_y=None, preprocess_confounds=None, model_params=None)
Creates an unfitted julearn pipeline.
- Parameters:
- model : str or scikit-learn compatible model
If a string, one of the available models is used. See available_models.
- confounds : str, list(str) or numpy.array | None
The confounds. See https://juaml.github.io/julearn/input.html for details.
- problem_type : str
The kind of problem to model.
Options are:
“binary_classification”: Performs a binary classification in which the target (y) has only two possible classes (default). The parameter pos_labels can be used to convert a target with multiple classes into binary.
“multiclass_classification”: Performs a multiclass classification in which the target (y) has more than two possible values.
“regression”: Performs a regression. The target (y) has to be at least ordinal.
- preprocess_X : str, scikit-learn compatible transformer or list | None
Transformer to apply to the features (X). If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.
See documentation for details.
- preprocess_y : str or scikit-learn transformer | None
Transformer to apply to the target (y). If None (default), no transformation is applied.
See documentation for details.
- preprocess_confounds : str, scikit-learn compatible transformer or list | None
Transformer to apply to the confounds. If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.
See documentation for details.
- model_params : dict | None
If not None, this dictionary specifies the model parameters to use.
The dictionary can define the following keys:
‘STEP__PARAMETER’: A value (or several) to be used as PARAMETER for STEP in the pipeline. Example: ‘svm__probability’: True will set the parameter ‘probability’ of the ‘svm’ model. If more than one option is provided for at least one hyperparameter, a search will be performed.
‘search’: The kind of search algorithm to use, e.g.: ‘grid’ or ‘random’. Can be any valid julearn searcher name or scikit-learn compatible searcher.
‘cv’: If search is going to be used, the cross-validation splitting strategy to use. Defaults to same CV as for the model evaluation.
‘scoring’: If search is going to be used, the scoring metric to evaluate the performance.
‘search_params’: Additional parameters for the search method.
See https://juaml.github.io/julearn/hyperparameters.html for details.
- Returns:
- pipeline : obj
Unfitted julearn compatible pipeline, or a pipeline wrapped in a Searcher.
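The ‘STEP__PARAMETER’ addressing described under model_params follows scikit-learn's nested-parameter convention, so it can be sketched with a plain scikit-learn pipeline (julearn itself is not required for this sketch; the step names ‘zscore’ and ‘svm’ are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps are named, so their parameters are addressed as 'STEP__PARAMETER'.
pipe = Pipeline([("zscore", StandardScaler()), ("svm", SVC())])

# A single value sets the parameter directly, e.g. 'svm__probability': True.
pipe.set_params(svm__probability=True)

# Several values for one hyperparameter turn the fit into a search,
# analogous to providing multiple options in model_params.
search = GridSearchCV(pipe, {"svm__C": [0.1, 1, 10]}, cv=3)
```

A fit on `search` would then select the best `svm__C` by internal cross-validation, which mirrors how julearn wraps the pipeline in a searcher when several options are given.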
- julearn.api.run_cross_validation(X, y, model, data=None, confounds=None, problem_type='binary_classification', preprocess_X=None, preprocess_y=None, preprocess_confounds=None, return_estimator=False, return_train_score=False, cv=None, groups=None, scoring=None, pos_labels=None, model_params=None, seed=None, n_jobs=None, verbose=0)
Run cross validation and score.
- Parameters:
- X : str, list(str) or numpy.array
The features to use. See https://juaml.github.io/julearn/input.html for details.
- y : str or numpy.array
The targets to predict. See https://juaml.github.io/julearn/input.html for details.
- model : str or scikit-learn compatible model
If a string, one of the available models is used. See available_models.
- data : pandas.DataFrame | None
DataFrame with the data (optional). See https://juaml.github.io/julearn/input.html for details.
- confounds : str, list(str) or numpy.array | None
The confounds. See https://juaml.github.io/julearn/input.html for details.
- problem_type : str
The kind of problem to model.
Options are:
“binary_classification”: Performs a binary classification in which the target (y) has only two possible classes (default). The parameter pos_labels can be used to convert a target with multiple classes into binary.
“multiclass_classification”: Performs a multiclass classification in which the target (y) has more than two possible values.
“regression”: Performs a regression. The target (y) has to be at least ordinal.
- preprocess_X : str, scikit-learn compatible transformer or list | None
Transformer to apply to the features (X). If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.
See documentation for details.
- preprocess_y : str or scikit-learn transformer | None
Transformer to apply to the target (y). If None (default), no transformation is applied.
See documentation for details.
- preprocess_confounds : str, scikit-learn compatible transformer or list | None
Transformer to apply to the confounds. If string, use one of the available transformers. If list, each element can be a string or scikit-learn compatible transformer. If None (default), no transformation is applied.
See documentation for details.
- return_estimator : str | None
Return the fitted estimator(s). Options are:
‘final’: Return the estimator fitted on all the data.
‘cv’: Return all the estimators from each CV split, fitted on the training data.
‘all’: Return all the estimators (final and cv).
- return_train_score : bool
Whether to return the training score with the test scores (default is False).
- cv : int, str or cross-validation generator | None
Cross-validation splitting strategy to use for model evaluation.
Options are:
None: defaults to 5-fold, repeated 5 times.
int: the number of folds in a (Stratified)KFold
CV Splitter (see scikit-learn documentation on CV)
An iterable yielding (train, test) splits as arrays of indices.
- groups : str or numpy.array | None
The grouping labels in case a Group CV is used. See https://juaml.github.io/julearn/input.html for details.
- scoring : str, list(str), obj or dict | None
The scoring metric to use. See https://scikit-learn.org/stable/modules/model_evaluation.html for a comprehensive list of options. If None, use the model’s default scorer.
- pos_labels : str, int, float or list | None
The labels to interpret as positive. If not None, every element of y will be converted to 1 if it is equal to or contained in pos_labels, and to 0 otherwise.
- model_params : dict | None
If not None, this dictionary specifies the model parameters to use.
The dictionary can define the following keys:
‘STEP__PARAMETER’: A value (or several) to be used as PARAMETER for STEP in the pipeline. Example: ‘svm__probability’: True will set the parameter ‘probability’ of the ‘svm’ model. If more than one option is provided for at least one hyperparameter, a search will be performed.
‘search’: The kind of search algorithm to use, e.g.: ‘grid’ or ‘random’. Can be any valid julearn searcher name or scikit-learn compatible searcher.
‘cv’: If search is going to be used, the cross-validation splitting strategy to use. Defaults to same CV as for the model evaluation.
‘scoring’: If search is going to be used, the scoring metric to evaluate the performance.
‘search_params’: Additional parameters for the search method.
See https://juaml.github.io/julearn/hyperparameters.html for details.
- seed : int | None
If not None, set the random seed before any operation. Useful for reproducibility.
- n_jobs : int | None
Number of parallel jobs used by the outer cross-validation. Follows scikit-learn/joblib conventions. None means 1 unless in a joblib.parallel_backend context. -1 means use all available processors.
- verbose : int
Verbosity level of the outer cross-validation. Follows scikit-learn/joblib conventions. 0 means no additional information is printed. Larger numbers generally mean more information is printed. Note: verbosity up to 50 will print to standard error, while above 50 will print to standard output.
- Returns:
- scores : pd.DataFrame
The resulting scores (one column for each score specified). Additionally, a ‘fit_time’ column will be added. And, if return_estimator='all' or return_estimator='cv', an ‘estimator’ column with the corresponding estimators fitted for each CV split.
- final_estimator : object
The final estimator, fitted on all the data (only if return_estimator='all' or return_estimator='final').
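run_cross_validation itself is julearn-specific, but its defaults can be approximated with plain scikit-learn. The sketch below (dataset and parameter values are illustrative, not julearn's implementation) mirrors the default cv=None behaviour of 5-fold repeated 5 times, the pos_labels conversion to a binary target, and the scores DataFrame with its ‘fit_time’ column:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# pos_labels-style conversion: y becomes 1 where it matches, 0 elsewhere.
pos_labels = [2]
y_bin = np.isin(y, pos_labels).astype(int)

# cv=None corresponds to 5-fold, repeated 5 times.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=42)
results = cross_validate(SVC(), X, y_bin, cv=cv, scoring="accuracy")

# One row per CV split, with timing columns alongside the scores.
scores = pd.DataFrame(results)
assert len(scores) == 25  # 5 folds x 5 repeats
```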
Model functions
- julearn.estimators.list_models()
List all the available model names
- Returns:
- out : list(str)
A list with all the available model names.
- julearn.estimators.get_model(name, problem_type, **kwargs)
Get a model
- Parameters:
- name : str
The model name.
- problem_type : str
The type of problem. See run_cross_validation().
- Returns:
- out : scikit-learn compatible model
The model object.
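A name-based lookup of this kind can be sketched as a dictionary mapping model names to scikit-learn estimator classes per problem type. The registry below is purely illustrative (the names mirror julearn's string convention, but this is not julearn's actual implementation):

```python
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.svm import SVC, SVR

# Hypothetical registry: model name -> {problem_type -> estimator class}.
_MODELS = {
    "svm": {"binary_classification": SVC, "regression": SVR},
    "linreg": {"regression": LinearRegression},
    "logit": {"binary_classification": LogisticRegression},
}

def get_model(name, problem_type, **kwargs):
    """Return an unfitted estimator for the given name and problem type."""
    if name not in _MODELS:
        raise ValueError(f"Unknown model: {name}")
    return _MODELS[name][problem_type](**kwargs)

# Extra keyword arguments are forwarded to the estimator constructor.
model = get_model("svm", "binary_classification", probability=True)
```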
Transformer functions
- julearn.transformers.list_transformers(target=False)
List all the available transformers
- Parameters:
- target : bool
If True, return a list of the target transformers. If False (default), return a list of features/confounds transformers.
- Returns:
- out : list(str)
A list with all the available transformer names.
- julearn.transformers.get_transformer(name, target=False, **params)
Get a transformer
- Parameters:
- name : str
The transformer name.
- target : bool
If True, return a target transformer. If False (default), return a features/confounds transformer.
- Returns:
- out : scikit-learn compatible transformer
The transformer object.
Logging
- julearn.utils.configure_logging(level='WARNING', fname=None, overwrite=None, output_format=None)
Configure the logging functionality
- Parameters:
- level : int or str
The level of the messages to print. If string, it will be interpreted as elements of logging. Options are: [‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’]. Defaults to ‘WARNING’.
- fname : str, Path or None
Filename of the log to print to. If None, stdout is used.
- overwrite : bool | None
Overwrite the log file (if it exists). Otherwise, statements will be appended to the log (default). None is the same as False, but additionally raises a warning to notify the user that log entries will be appended.
- output_format : str
Format of the output messages. See the following for examples:
e.g., “%(asctime)s - %(levelname)s - %(message)s”.
Defaults to “%(asctime)s - %(name)s - %(levelname)s - %(message)s”
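The behaviour described above maps directly onto Python's standard logging module. A minimal sketch of what such a configure function does (the logger name 'julearn' and this implementation are illustrative, not julearn's actual code):

```python
import logging

def configure_logging(level="WARNING", fname=None, output_format=None):
    """Attach a handler that emits messages at `level` or above."""
    logger = logging.getLogger("julearn")
    logger.setLevel(level)
    # Log to a file if a filename is given, otherwise to the console.
    handler = logging.FileHandler(fname) if fname else logging.StreamHandler()
    fmt = output_format or "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
    handler.setFormatter(logging.Formatter(fmt))
    logger.addHandler(handler)
    return logger

logger = configure_logging(level="INFO")
logger.info("pipeline created")   # emitted at INFO level
logger.debug("internal detail")   # suppressed: below the configured level
```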
- julearn.utils.warn(msg, category=<class 'RuntimeWarning'>)
Warn, but first log it
- Parameters:
- msg : str
Warning message.
- category : instance of Warning
The warning class. Defaults to RuntimeWarning.
- julearn.utils.raise_error(msg, klass=<class 'ValueError'>)
Raise an error, but first log it
- Parameters:
- msg : str
Error message.
- klass : class
The class of the error to raise. Defaults to ValueError.
Cross-Validation
- class julearn.model_selection.StratifiedBootstrap(n_splits=5, *, test_size=0.5, train_size=None, random_state=None)
Stratified Bootstrap cross-validation iterator
Provides train/test indices using resampling with replacement, respecting the distribution of samples for each class.
- Parameters:
- n_splits : int, default=5
Number of re-shuffling & splitting iterations.
- test_size : float or int, default=0.5
If float, should be between 0.0 and 1.0 and represent the proportion of samples to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size.
- train_size : float or int, default=None
If float, should be between 0.0 and 1.0 and represent the proportion of samples to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
- random_state : int or RandomState instance, default=None
Controls the randomness of the training and testing indices produced. Pass an int for reproducible output across multiple function calls.
- split(X, y=None, groups=None)
Generate indices to split data into training and test set.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features. Note that providing y is sufficient to generate the splits and hence np.zeros(n_samples) may be used as a placeholder for X instead of actual training data.
- y : array-like of shape (n_samples,) or (n_samples, n_labels)
The target variable for supervised learning problems. Stratification is done based on the y labels.
- groups : object
Always ignored, exists for compatibility.
- Yields:
- train : ndarray
The training set indices for that split.
- test : ndarray
The testing set indices for that split.
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
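The core idea of per-class resampling with replacement can be sketched in plain numpy. This is a simplified, illustrative single-split version, independent of julearn's actual implementation:

```python
import numpy as np

def stratified_bootstrap_split(y, test_size=0.5, rng=None):
    """One bootstrap train/test split, resampling within each class."""
    rng = np.random.default_rng(rng)
    train, test = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        n_test = int(round(len(idx) * test_size))
        # Resample with replacement, respecting the class distribution.
        test.append(rng.choice(idx, size=n_test, replace=True))
        train.append(rng.choice(idx, size=len(idx) - n_test, replace=True))
    return np.concatenate(train), np.concatenate(test)

y = np.array([0] * 6 + [1] * 4)
train_idx, test_idx = stratified_bootstrap_split(y, test_size=0.5, rng=0)
```

Because sampling is with replacement, the same index may appear more than once in a split; the class proportions, however, are preserved in both halves.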
- class julearn.model_selection.StratifiedGroupsKFold(n_splits=5, *, shuffle=False, random_state=None)
Stratified K-Fold cross-validation iterator in which stratification is done based on the groups variable instead of the target.
- split(X, y, groups=None)
Generate indices to split data into training and test set.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
- y : array-like of shape (n_samples,)
Always ignored, exists for compatibility.
- groups : object
The stratification variable.
- Yields:
- train : ndarray
The training set indices for that split.
- test : ndarray
The testing set indices for that split.
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
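Since stratification here is driven by groups rather than by y, equivalent splits can be sketched by handing the stratification variable to scikit-learn's StratifiedKFold in place of the target (the data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((12, 2))             # placeholder features
groups = np.array([0, 1, 2] * 4)  # the stratification variable

# Passing the stratification variable where StratifiedKFold expects y
# yields folds balanced over `groups`, mirroring StratifiedGroupsKFold.
cv = StratifiedKFold(n_splits=4)
splits = list(cv.split(X, groups))

for train, test in splits:
    # Each test fold here contains each group value exactly once.
    assert sorted(groups[test]) == [0, 1, 2]
```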
- class julearn.model_selection.RepeatedStratifiedGroupsKFold(*, n_splits=5, n_repeats=10, random_state=None)
Repeated Stratified K-Fold cross validator. Repeats Stratified Groups K-Fold n times with different randomization in each repetition.
- Parameters:
- n_splits : int, default=5
Number of folds. Must be at least 2.
- n_repeats : int, default=10
Number of times the cross-validator needs to be repeated.
- random_state : int, RandomState instance or None, default=None
Controls the generation of the random states for each repetition. Pass an int for reproducible output across multiple function calls. See Glossary.
Notes
Randomized CV splitters may return different results for each call of split. You can make the results identical by setting random_state to an integer.
Classes
- class julearn.transformers.confounds.DataFrameConfoundRemover(model_confound=None, confounds_match='.*__:type:__confound', threshold=None, keep_confounds=False)
Transformer which can use pd.DataFrames and remove the confounds from the features by subtracting the predicted features given the confounds from the actual features.
- Parameters:
- model_confound : obj
Scikit-learn compatible model used to predict each feature independently using the confounds as features. The predictions of these models are then subtracted from each feature. Defaults to LinearRegression().
- confounds_match : list(str) | str
A string representing a regular expression by which the confounds can be detected from the column names. You can use the exact column names or another regex. The default follows the naming convention inside of julearn: ‘.*__:type:__confound’.
- threshold : float | None
All residual values after confound removal which fall under the threshold will be set to 0. None (default) means that no threshold will be applied.
- keep_confounds : bool, optional
Whether to return the confounds together with the confound-removed features. Default is False.
- fit(X, y=None)
Fit the confound remover.
- Parameters:
- X : pandas.DataFrame
Training data.
- y : pandas.Series | None
Target values.
- Returns:
- self : an instance of self.
- get_support(indices=False)
Get the support mask.
- Parameters:
- indices : bool
If True, return indices.
- Returns:
- support_mask : numpy.array
The support mask.
- transform(X)
Removes confounds from the data.
- Parameters:
- X : pandas.DataFrame
Data to be deconfounded.
- Returns:
- out : pandas.DataFrame
Data without confounds.
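The removal step amounts to fitting one model per feature on the confounds and keeping the residuals. A sketch with plain pandas and scikit-learn, using the __:type:__ column-naming convention and the default regex from above (column names and data are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
conf = rng.normal(size=100)
df = pd.DataFrame({
    "feat_a__:type:__continuous": 2.0 * conf + rng.normal(scale=0.1, size=100),
    "age__:type:__confound": conf,
})

# Detect confound columns via the default regex convention.
conf_cols = df.filter(regex=".*__:type:__confound").columns
feat_cols = df.columns.difference(conf_cols)

# Predict each feature from the confounds and keep the residuals.
deconfounded = df[feat_cols].copy()
for col in feat_cols:
    model = LinearRegression().fit(df[conf_cols], df[col])
    deconfounded[col] = df[col] - model.predict(df[conf_cols])
```

After this step the residual feature is (numerically) uncorrelated with the confound, which is exactly what subtracting the confound-based predictions achieves.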
- class julearn.transformers.confounds.TargetConfoundRemover(model_confound=None, confounds_match='.*__:type:__confound', threshold=None)
Transformer which can use pd.DataFrames and remove the confounds from the target by subtracting the predicted target given the confounds from the actual target.
- Attributes:
- model_confound : object
Model used to predict the target using the confounds as features. The predictions of these models are then subtracted from the actual target. Defaults to None, meaning a LinearRegression is used.
- confounds_match : list(str) | str
A string representing a regular expression by which the confounds can be detected from the column names.
- threshold : float | None
All residual values after confound removal which fall under the threshold will be set to 0. None means that no threshold will be applied.
- fit_transform(X, y)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Input samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs), default=None
Target values (None for unsupervised transformations).
- **fit_params : dict
Additional fit parameters.
- Returns:
- X_new : ndarray of shape (n_samples, n_features_new)
Transformed array.
- class julearn.pipeline.ExtendedDataFramePipeline(dataframe_pipeline, y_transformer=None, confound_dataframe_pipeline=None, confounds=None, categorical_features=None)
A class creating a custom metamodel like a Pipeline. In practice this should be created using :ref:julearn.pipeline._create_extended_pipeline. There are multiple caveats of creating such a pipeline without using that function. Compared to a usual scikit-learn pipeline, this has added functionalities:
- Handling transformations of the target:
The target can be changed. Importantly, this transformed target will be considered the ground truth to score against. Note: if you want to score this pipeline with an external function, you have to consider that the scorer needs to be an extended_scorer.
- Handling confounds:
Adds the confound as a type to columns. This allows the DataFrameWrapTransformer inside of the dataframe_pipeline to handle confounds properly.
- Handling categorical features:
Adds the categorical type to columns so that the DataFrameWrapTransformer inside of the dataframe_pipeline can handle categorical features properly.
column_types are added to the feature dataframe after each column name using the specified separator. E.g. column age becomes age__:type:__confound.
- Parameters:
- dataframe_pipeline : obj
A pipeline working with dataframes and being able to handle confounds. Should be created using julearn.pipeline.create_dataframe_pipeline.
- y_transformer : obj or None
Any transformer which can take the X and y to transform the y. You can use julearn.transformers.target.TargetTransfromerWrapper to convert most sklearn transformers to a target transformer.
- confound_dataframe_pipeline : obj or None
Similar to dataframe_pipeline.
- confounds : list(str) or None
List of column names which are confounds (defaults to None).
- categorical_features : list(str), optional
List of column names which are categorical features (defaults to None).
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- preprocess(X, y, until=None, return_trans_column_type=False)
Preprocess the data up to a given step of the pipeline.
- Parameters:
- until : str or None
The name of the transformer until which preprocess should transform. Defaults to None.
- return_trans_column_type : bool or None
Whether to return transformed column names with the associated column type. Defaults to False.
- Returns:
- tuple(pd.DataFrame, pd.Series)
Features and target after preprocessing.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
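The <component>__<parameter> form is the standard scikit-learn convention for nested estimators; a quick illustration with a plain Pipeline (step names are illustrative):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(C=1.0))])

# get_params(deep=True) exposes nested parameters under STEP__PARAMETER keys.
assert "svm__C" in pipe.get_params(deep=True)

# set_params reaches into the nested component via the same addressing.
pipe.set_params(svm__C=10.0)
assert pipe.named_steps["svm"].C == 10.0
```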