2.2. Data
Data input to run_cross_validation()
Julearn works with data in the form of pandas DataFrames. This is the data structure that run_cross_validation() takes as input and uses to return some of its results.
The input DataFrame must contain both the features and the target (or label). We tell run_cross_validation() which columns play which role through the following parameters:
data: The DataFrame containing the features and the target or label.
X: List of strings containing the column names of the features.
y: String containing the name of the column with the target or label.
For example, using the well-known iris dataset, we can specify the data input as follows.
First, we load the data into a pandas DataFrame called df:
from seaborn import load_dataset
df = load_dataset('iris')
Let’s inspect what our dataframe looks like.
df.head()
Given this data, we can now specify the X and y parameters:
X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = "species"
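Because X and y are plain column names, a typo only shows up once the names are looked up in the DataFrame. The sketch below is one way to check them beforehand. It assumes nothing about Julearn itself: `missing_columns` is a hypothetical helper, and it works on any list of column names, such as `list(df.columns)`.

```python
def missing_columns(columns, X, y):
    """Return the names in X and y that are not present in columns."""
    available = set(columns)
    return [name for name in [*X, y] if name not in available]


# Column names of the iris DataFrame; with pandas this would be
# list(df.columns).
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
y = "species"

print(missing_columns(columns, X, y))  # → []
```

An empty list means every feature and the target were found; any name that is returned is either misspelled or absent from the data.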
So far, the call to Julearn’s run_cross_validation() function would look like this:
run_cross_validation(X=X, y=y, data=df)
On its own, this is not yet enough to do machine learning, but we will get there step by step.
Giving types to features
A nice add-on that Julearn offers is the ability to specify column-based types for the features. This comes in handy if, within the pipeline, one wants to manipulate only certain columns.
To specify column types, we must provide a dictionary with the column types as keys and lists of column names as values. The type can be anything, but it is recommended to use a string that is meaningful to you.
Important
Every column can only have one type!
In the case of the iris dataset, we could specify the type of the columns related to the sepal and petal information as "sepal" and "petal", respectively.
X_types = {
"petal": ["petal_length", "petal_width"],
"sepal": ["sepal_length", "sepal_width"],
}
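The rule that every column can only have one type can be checked before the dictionary is passed on. This is a minimal sketch, assuming the values are literal column names (not regular expressions); `columns_with_multiple_types` is a hypothetical helper, not a Julearn function.

```python
from collections import Counter


def columns_with_multiple_types(X_types):
    """Return the column names listed under more than one type."""
    # set(columns) ensures a name repeated within one type counts once.
    counts = Counter(
        column for columns in X_types.values() for column in set(columns)
    )
    return sorted(name for name, n in counts.items() if n > 1)


X_types = {
    "petal": ["petal_length", "petal_width"],
    "sepal": ["sepal_length", "sepal_width"],
}

print(columns_with_multiple_types(X_types))  # → []
```

If, say, "petal_length" were also listed under "sepal", the helper would return ['petal_length'], flagging the conflict.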
Importantly, Julearn also allows specifying the column names as regular expressions. This comes in handy when we are dealing with hundreds or thousands of features and do not want to list all the names by hand. For example, we could specify the types of the sepal and petal columns as follows:
X_types = {
"petal": ["petal.*"],
"sepal": ["sepal.*"],
}
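To see which columns such patterns would select, the sketch below expands each pattern against a list of column names with Python's `re` module. It assumes a pattern has to match the whole column name (`re.fullmatch`); how Julearn matches patterns internally is not described here, and `expand_types` is a hypothetical helper.

```python
import re


def expand_types(X_types, columns):
    """Map each type to the columns whose full name matches one of its patterns."""
    return {
        type_name: [
            column
            for column in columns
            if any(re.fullmatch(pattern, column) for pattern in patterns)
        ]
        for type_name, patterns in X_types.items()
    }


columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
X_types = {
    "petal": ["petal.*"],
    "sepal": ["sepal.*"],
}

expanded = expand_types(X_types, columns)
print(expanded["petal"])  # → ['petal_length', 'petal_width']
print(expanded["sepal"])  # → ['sepal_length', 'sepal_width']
```

Expanding the patterns this way is also a quick check that no column ends up under two types by accident.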
Adding an X_types specification to run_cross_validation() will make it look like this:
run_cross_validation(X=X, y=y, data=df, X_types=X_types)
Important
If no X_types is specified, all the columns will be considered as "continuous" and a warning will be raised.
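The default described above can be written out explicitly. The sketch below only illustrates the documented behaviour and is not Julearn's implementation: `with_default_type` is a hypothetical helper that assigns the "continuous" type to every feature that has no type yet, assuming literal column names.

```python
def with_default_type(X, X_types=None, default="continuous"):
    """Return a copy of X_types in which untyped columns get the default type."""
    typed = {} if X_types is None else {k: list(v) for k, v in X_types.items()}
    already_typed = {column for columns in typed.values() for column in columns}
    untyped = [column for column in X if column not in already_typed]
    if untyped:
        typed.setdefault(default, []).extend(untyped)
    return typed


X = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

print(with_default_type(X))
# → {'continuous': ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']}
```

Passing such an explicit dictionary as X_types makes the intended types visible in the code; the exact condition under which Julearn emits its warning is not covered here.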
Up to now, we have seen how to parametrize run_cross_validation() in terms of the input data. In the next section we will focus on basic options for using run_cross_validation() to evaluate different pipelines in a cross-validation consistent manner.
Advanced use cases regarding selective processing with X_types are covered in Selective preprocessing using feature types.