Unbiased MDI-like feature importance measure for random forests #31279
Conversation
…as "extreme value" issues
…d that they coincide with feature_importances_ on inbag samples
…n different from gini/mse
@GaetandeCast can you please document the change as an "enhancement" under `doc/whats_new/upcoming_changes`? Maybe we could have two entries with the same PR number under two sections.
Here is a pass of feedback. I think many paragraphs of the example should be reworked, the intro in particular.
I will try to find the time to do a full review of this PR in the coming week.
doc/whats_new/upcoming_changes/sklearn.tree/31279.enhancement.rst
Here's a first pass! I didn't look at the Cython code, examples, or tests yet; I'll try to get to them another time :)
```python
# TODO: re-add the dropped return_as="generator_unordered" for compatibility on
# joblib version. Introduced in 1.3 but 1.2 is the minimal requirement
results = Parallel(n_jobs=self.n_jobs, prefer="threads")(
    delayed(
        self._compute_unbiased_feature_importance_and_oob_predictions_per_tree
    )(tree, X, y, sample_weight)
    for tree in self.estimators_
    if tree.tree_.node_count > 1
)

importances = np.zeros(n_features, dtype=np.float64)
oob_pred = np.zeros(
    (n_samples, max_n_classes, self.n_outputs_), dtype=np.float64
)
n_oob_pred = np.zeros((n_samples, self.n_outputs_), dtype=np.intp)

for importances_i, oob_pred_i, n_oob_pred_i in results:
    oob_pred += oob_pred_i
    n_oob_pred += n_oob_pred_i
    importances += importances_i

importances /= self.n_estimators
```
I feel the code could be simplified by returning a list and using np.mean, for instance. But it seems that the code is written to work on generators, in line with the TODO comment to re-add `return_as="generator_unordered"`. Out of curiosity, what is the motivation for using a generator instead of a list in the future?
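For illustration, a hypothetical rewrite along the lines suggested above (assuming `results` is materialized as a list; this is a sketch, not the PR's code):

```python
import numpy as np

# Unpack the per-tree triples returned by the parallel loop above.
importances_list, oob_pred_list, n_oob_pred_list = zip(*results)

oob_pred = np.sum(oob_pred_list, axis=0)
n_oob_pred = np.sum(n_oob_pred_list, axis=0)

# Note: np.mean divides by the number of contributing trees (those with more
# than one node), whereas the original code divides by self.n_estimators.
importances = np.mean(importances_list, axis=0)
```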
@antoinebaker when I wrote the code I was under the impression that using generators would save on memory usage, but I'm not sure the gain is that big. The issue is that storing `oob_pred` for every tree in a multi-class, multi-output case can become costly. Do you know of a way to avoid this issue? If you think it's not a big issue I'll happily switch to lists.
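A rough back-of-the-envelope sketch of the buffers in question (shapes follow the snippet above; the dataset sizes are illustrative assumptions, not from the PR):

```python
# Illustrative sizes (assumptions for the sake of the estimate)
n_samples, max_n_classes, n_outputs, n_estimators = 100_000, 10, 3, 500

# One float64 oob_pred buffer per tree, shaped like the accumulator above
per_tree_bytes = n_samples * max_n_classes * n_outputs * 8

# A list of results keeps every tree's buffer alive at once...
list_gib = n_estimators * per_tree_bytes / 2**30
# ...whereas consuming a generator keeps roughly one buffer alive at a time,
# plus the running accumulators.
streaming_gib = per_tree_bytes / 2**30

print(f"list: ~{list_gib:.1f} GiB vs streaming: ~{streaming_gib:.3f} GiB")
```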
```python
return self.tree_.compute_feature_importances(normalize=False)

def compute_unbiased_feature_importance(self, X_test, y_test, sample_weight=None):
```
In line with the long method name above, should we add a `return_oob_pred` option here?
Reference Issues/PRs
Fixes #20059
What does this implement/fix? Explain your changes.
This implements two methods that correct the cardinality bias of the `feature_importances_` attribute of random forest estimators by leveraging out-of-bag (oob) samples.

The first method is derived from "Unbiased Measurement of Feature Importance in Tree-Based Methods" by Zhengze Zhou & Giles Hooker. The corresponding attribute is named `ufi_feature_importances_`. The second method is derived from "A Debiased MDI Feature Importance Measure for Random Forests" by Xiao Li et al. The corresponding attribute is named `mdi_oob_feature_importances_`. The names are temporary; we are still seeking a way of favoring one method over the other (we are currently investigating whether one of the two reaches its asymptotic behavior faster than the other).

These attributes are set by the `fit` method after training, if the parameter `oob_score` is set to `True`. In this case we send the oob samples to a Cython method at tree level that propagates them through the tree and returns the corresponding oob prediction function and feature importance measure. This new feature importance measure behaves similarly to regular Mean Decrease in Impurity (MDI) but mixes the in-bag and out-of-bag values of each node instead of using only the in-bag impurity. The two proposed methods differ in the way they mix in-bag and oob samples.
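A minimal usage sketch under the proposed API (the attribute names are provisional, as noted above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# oob_score=True triggers the computation of the oob-corrected importances
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Classical MDI importances (biased toward high-cardinality features)
print(forest.feature_importances_)

# Oob-corrected importances proposed in this PR (provisional attribute name)
print(forest.ufi_feature_importances_)
```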
This PR also adds these two new feature importance measures to the test suite, specifically in test_forest.py. Existing tests are widened to cover the two measures, and new tests are added to make sure they behave correctly (e.g. they coincide with the values given by the code of the cited papers, and they recover traditional MDI when computed on in-bag samples).
Any other comments?
The papers only suggest fixes for trees built with the Gini (classification) and mean squared error (regression) criteria, but we would like the new methods to support the other criteria available in scikit-learn. `log_loss` support was added for classification with the ufi method by generalizing the idea of mixing in-bag and oob samples (a sketch of this mixing for Gini follows below).

Some CPU and memory profiling was done to ensure that the computational overhead stays small compared to the cost of model fitting for large enough datasets.
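For intuition, here is our reading of the "mixing" idea for the Gini criterion (a sketch following Zhou & Hooker, not the PR's Cython implementation): classical Gini uses the squared in-bag class proportions, while the unbiased variant pairs each in-bag proportion with its out-of-bag counterpart.

```python
import numpy as np

def gini(p_inbag):
    """Classical Gini impurity from in-bag class proportions."""
    return 1.0 - np.sum(p_inbag**2)

def mixed_gini(p_inbag, p_oob):
    """Gini-like impurity mixing in-bag and oob class proportions.

    Since in-bag and oob samples are independent, the product of the two
    proportion estimates is an unbiased estimate of the squared proportion.
    """
    return 1.0 - np.dot(p_inbag, p_oob)
```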
Support for sparse matrix input should be added soon.
This work is done in close collaboration with @ogrisel.
TODO:

- `oob_score_`: done in d198f20
- support: 8329b3b; test: 0b48af4
- `sample_weight`: support added in f10721e; test in 241de66
- `GradientBoostingClassifier` and `GradientBoostingRegressor` when row-wise (sub)sampling is enabled at training time: done in ce52159, 8a09b39, 229cc4d
Edit: We noticed a discrepancy between the formula defined by the authors of mdi_oob and what their code does. This is detailed here, in part 5. Therefore we only implement UFI for now. Furthermore, we could not find an equivalent of ufi for the entropy impurity criterion, so we compute ufi with Gini whatever the impurity criterion in classification, and with MSE in regression.