Inspecting Random Forest models

This example uses the iris dataset, performs a simple binary classification using a Random Forest classifier, and analyses the resulting model.

# Authors: Federico Raimondo <f.raimondo@fz-juelich.de>
# License: AGPL

import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import load_dataset

from julearn import run_cross_validation
from julearn.utils import configure_logging

Set the logging level to info to see extra information.

configure_logging(level="INFO")

/home/runner/work/julearn/julearn/julearn/utils/logging.py:66: UserWarning: The '__version__' attribute is deprecated and will be removed in MarkupSafe 3.1. Use feature detection, or `importlib.metadata.version("markupsafe")`, instead.
  vstring = str(getattr(module, "__version__", None))
2024-10-23 11:29:14,506 - julearn - INFO - ===== Lib Versions =====
2024-10-23 11:29:14,506 - julearn - INFO - numpy: 1.26.4
2024-10-23 11:29:14,506 - julearn - INFO - scipy: 1.14.1
2024-10-23 11:29:14,506 - julearn - INFO - sklearn: 1.5.2
2024-10-23 11:29:14,506 - julearn - INFO - pandas: 2.2.3
2024-10-23 11:29:14,506 - julearn - INFO - julearn: 0.3.5.dev16
2024-10-23 11:29:14,506 - julearn - INFO - ========================

Random Forest variable importance

df_iris = load_dataset("iris")

The dataset contains three species. We will keep two of them to perform a binary classification.

df_iris = df_iris[df_iris["species"].isin(["versicolor", "virginica"])]

X = ["sepal_length", "sepal_width", "petal_length"]
y = "species"

We will use a Random Forest classifier. By setting return_estimator="final", the run_cross_validation() function returns, in addition to the scores, the estimator fitted on all the data.

scores, model_iris = run_cross_validation(
    X=X,
    y=y,
    data=df_iris,
    model="rf",
    preprocess="zscore",
    problem_type="classification",
    return_estimator="final",
)

2024-10-23 11:29:14,509 - julearn - INFO - ==== Input Data ====
2024-10-23 11:29:14,509 - julearn - INFO - Using dataframe as input
2024-10-23 11:29:14,509 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-10-23 11:29:14,509 - julearn - INFO -      Target: species
2024-10-23 11:29:14,509 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-10-23 11:29:14,509 - julearn - INFO -      X_types:{}
2024-10-23 11:29:14,509 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:509: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
  warn_with_log(
2024-10-23 11:29:14,510 - julearn - INFO - ====================
2024-10-23 11:29:14,510 - julearn - INFO -
2024-10-23 11:29:14,510 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:14,510 - julearn - INFO - Step added
2024-10-23 11:29:14,510 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:14,510 - julearn - INFO - Step added
2024-10-23 11:29:14,511 - julearn - INFO - = Model Parameters =
2024-10-23 11:29:14,511 - julearn - INFO - ====================
2024-10-23 11:29:14,511 - julearn - INFO -
2024-10-23 11:29:14,511 - julearn - INFO - = Data Information =
2024-10-23 11:29:14,511 - julearn - INFO -      Problem type: classification
2024-10-23 11:29:14,511 - julearn - INFO -      Number of samples: 100
2024-10-23 11:29:14,511 - julearn - INFO -      Number of features: 3
2024-10-23 11:29:14,511 - julearn - INFO - ====================
2024-10-23 11:29:14,511 - julearn - INFO -
2024-10-23 11:29:14,511 - julearn - INFO -      Number of classes: 2
2024-10-23 11:29:14,511 - julearn - INFO -      Target type: object
2024-10-23 11:29:14,512 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-23 11:29:14,512 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False) (incl. final model)
2024-10-23 11:29:14,512 - julearn - INFO - Binary classification problem detected.

This type of classifier has an internal variable that tells us how important each feature is. Caution: read the scikit-learn documentation for RandomForestClassifier to understand how this learning algorithm works.

rf = model_iris["rf"]

to_plot = pd.DataFrame(
    {
        "variable": [x.replace("_", " ") for x in X],
        "importance": rf.feature_importances_,
    }
)

fig, ax = plt.subplots(1, 1, figsize=(6, 4))
sns.barplot(x="importance", y="variable", data=to_plot, ax=ax)
ax.set_title("Variable Importances for Random Forest Classifier")
fig.tight_layout()

Figure: Variable Importances for Random Forest Classifier
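
As a rough illustration of what this internal variable holds, the sketch below uses plain scikit-learn rather than julearn, so there is no z-scoring pipeline around the forest. The bundled load_iris loader, its column names, and random_state=0 are assumptions of this sketch and are not part of the example above. In scikit-learn, feature_importances_ is impurity-based, non-negative, and normalized to sum to 1. For a forest, it is derived from the importances of the individual trees, which the sketch checks by averaging them by hand.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Load iris and keep versicolor (1) and virginica (2), mirroring the example.
df = load_iris(as_frame=True).frame
df = df[df["target"].isin([1, 2])]

# Column names as shipped by scikit-learn (they differ from seaborn's).
features = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]
X_arr = df[features].to_numpy()
y_arr = df["target"].to_numpy()

# random_state is fixed only to make this sketch reproducible.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_arr, y_arr)

# One non-negative value per feature; the values sum to 1.
importances = forest.feature_importances_

# Average the per-tree importances by hand and renormalize.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

for name, value in zip(features, importances):
    print(f"{name}: {value:.3f}")
```

Because the values are normalized, they describe the relative contribution of each feature within one fitted model, not an absolute effect size.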

However, some reviewers (including us) might wonder about the variability of these feature importances. In the example above, all the feature importances were obtained by fitting on the entire dataset, while the performance was estimated using cross-validation.

By specifying return_estimator="cv", we can get the fitted estimator for each fold.

scores = run_cross_validation(
    X=X,
    y=y,
    data=df_iris,
    model="rf",
    preprocess="zscore",
    problem_type="classification",
    return_estimator="cv",
)

2024-10-23 11:29:15,180 - julearn - INFO - ==== Input Data ====
2024-10-23 11:29:15,180 - julearn - INFO - Using dataframe as input
2024-10-23 11:29:15,180 - julearn - INFO -      Features: ['sepal_length', 'sepal_width', 'petal_length']
2024-10-23 11:29:15,180 - julearn - INFO -      Target: species
2024-10-23 11:29:15,180 - julearn - INFO -      Expanded features: ['sepal_length', 'sepal_width', 'petal_length']
2024-10-23 11:29:15,180 - julearn - INFO -      X_types:{}
2024-10-23 11:29:15,180 - julearn - WARNING - The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
/home/runner/work/julearn/julearn/julearn/prepare.py:509: RuntimeWarning: The following columns are not defined in X_types: ['sepal_length', 'sepal_width', 'petal_length']. They will be treated as continuous.
  warn_with_log(
2024-10-23 11:29:15,181 - julearn - INFO - ====================
2024-10-23 11:29:15,181 - julearn - INFO -
2024-10-23 11:29:15,181 - julearn - INFO - Adding step zscore that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:15,181 - julearn - INFO - Step added
2024-10-23 11:29:15,181 - julearn - INFO - Adding step rf that applies to ColumnTypes<types={'continuous'}; pattern=(?:__:type:__continuous)>
2024-10-23 11:29:15,181 - julearn - INFO - Step added
2024-10-23 11:29:15,182 - julearn - INFO - = Model Parameters =
2024-10-23 11:29:15,182 - julearn - INFO - ====================
2024-10-23 11:29:15,182 - julearn - INFO -
2024-10-23 11:29:15,182 - julearn - INFO - = Data Information =
2024-10-23 11:29:15,182 - julearn - INFO -      Problem type: classification
2024-10-23 11:29:15,182 - julearn - INFO -      Number of samples: 100
2024-10-23 11:29:15,182 - julearn - INFO -      Number of features: 3
2024-10-23 11:29:15,182 - julearn - INFO - ====================
2024-10-23 11:29:15,182 - julearn - INFO -
2024-10-23 11:29:15,182 - julearn - INFO -      Number of classes: 2
2024-10-23 11:29:15,183 - julearn - INFO -      Target type: object
2024-10-23 11:29:15,183 - julearn - INFO -      Class distributions: species
versicolor    50
virginica     50
Name: count, dtype: int64
2024-10-23 11:29:15,183 - julearn - INFO - Using outer CV scheme KFold(n_splits=5, random_state=None, shuffle=False)
2024-10-23 11:29:15,183 - julearn - INFO - Binary classification problem detected.

Now we can obtain the feature importance for each estimator (CV fold).

to_plot = []
for i_fold, estimator in enumerate(scores["estimator"]):
    this_importances = pd.DataFrame(
        {
            "variable": [x.replace("_", " ") for x in X],
            "importance": estimator["rf"].feature_importances_,
            "fold": i_fold,
        }
    )
    to_plot.append(this_importances)

to_plot = pd.concat(to_plot)
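
The per-fold importances can also be summarized numerically. The sketch below is a plain scikit-learn analogue of the steps above, not julearn's own implementation: it assumes a Pipeline with steps named "zscore" and "rf", the bundled load_iris loader with its own column names, an unshuffled 5-fold KFold, and random_state=0 for reproducibility. It collects one fitted pipeline per fold through cross_validate(..., return_estimator=True) and then aggregates the importances per variable.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Keep versicolor (1) and virginica (2), mirroring the example.
df = load_iris(as_frame=True).frame
df = df[df["target"].isin([1, 2])]
features = ["sepal length (cm)", "sepal width (cm)", "petal length (cm)"]

# Step names are chosen to resemble the julearn pipeline; they are assumptions.
pipe = Pipeline(
    [
        ("zscore", StandardScaler()),
        ("rf", RandomForestClassifier(random_state=0)),
    ]
)

# return_estimator=True keeps the pipeline fitted on each training fold.
cv_results = cross_validate(
    pipe,
    df[features],
    df["target"],
    cv=KFold(n_splits=5),
    return_estimator=True,
)

# Build a long-format table: one row per (fold, variable).
rows = []
for i_fold, estimator in enumerate(cv_results["estimator"]):
    for name, value in zip(features, estimator["rf"].feature_importances_):
        rows.append({"variable": name, "importance": value, "fold": i_fold})
long_df = pd.DataFrame(rows)

# Spread of each variable's importance across folds.
summary = long_df.groupby("variable")["importance"].agg(
    ["mean", "std", "min", "max"]
)
print(summary)
```

With the to_plot dataframe built above, the same summary should be obtainable with to_plot.groupby("variable")["importance"].agg(["mean", "std", "min", "max"]).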

Finally, we can plot the variable importances for each fold.

fig, ax = plt.subplots(1, 1, figsize=(6, 4))
sns.swarmplot(x="importance", y="variable", data=to_plot, ax=ax)
ax.set_title(
    "Distribution of variable Importances for Random Forest "
    "Classifier across folds"
)
fig.tight_layout()

Figure: Distribution of variable Importances for Random Forest Classifier across folds

/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
/opt/hostedtoolcache/Python/3.10.15/x64/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
  data_subset = grouped_data.get_group(pd_key)

Total running time of the script: (0 minutes 1.318 seconds)
