Unbiased MDI-like feature importance measure for random forests #31279
Conversation
…as "extreme value" issues
…d that they coincide with feature_importances_ on inbag samples
…n different from gini/mse
@GaetandeCast can you please document the change as an "enhancement" under `doc/whats_new/upcoming_changes`? Maybe we could have two entries with the same PR number under two sections.
Here is a pass of feedback. I think many paragraphs of the example should be reworked, the intro in particular.
I will try to find the time to do a full review of this PR in the coming week.
doc/whats_new/upcoming_changes/sklearn.tree/31279.enhancement.rst
Here's a first pass! I didn't look at the Cython code, examples, or tests yet; I'll try to get to them another time :)
```python
# TODO: re-add the dropped return_as="generator_unordered" for compatibility on
# joblib version. Introduced in 1.3 but 1.2 is the minimal requirement
results = Parallel(n_jobs=self.n_jobs, prefer="threads")(
    delayed(
        self._compute_unbiased_feature_importance_and_oob_predictions_per_tree
    )(tree, X, y, sample_weight)
    for tree in self.estimators_
    if tree.tree_.node_count > 1
)

importances = np.zeros(n_features, dtype=np.float64)
oob_pred = np.zeros(
    (n_samples, max_n_classes, self.n_outputs_), dtype=np.float64
)
n_oob_pred = np.zeros((n_samples, self.n_outputs_), dtype=np.intp)

for importances_i, oob_pred_i, n_oob_pred_i in results:
    oob_pred += oob_pred_i
    n_oob_pred += n_oob_pred_i
    importances += importances_i

importances /= self.n_estimators
```
I feel the code could be simplified by returning a list and using np.mean, for instance. But it seems that the code is written to work on generators, in line with the TODO comment to re-add `return_as="generator_unordered"`. Out of curiosity, what is the motivation for using a generator instead of a list in the future?
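For illustration, a hypothetical rewrite along the lines suggested above (assuming `results` is materialized as a list; this is a sketch, not the PR's code):

```python
import numpy as np

# Unpack the per-tree triples returned by the parallel loop above.
importances_list, oob_pred_list, n_oob_pred_list = zip(*results)

oob_pred = np.sum(oob_pred_list, axis=0)
n_oob_pred = np.sum(n_oob_pred_list, axis=0)

# Note: np.mean divides by the number of contributing trees (those with more
# than one node), whereas the original code divides by self.n_estimators.
importances = np.mean(importances_list, axis=0)
```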
@antoinebaker when I wrote the code I was under the impression that using generators would save on memory usage, but I'm not sure the gain is that big. The issue is that storing `oob_pred` for every tree in a multi-class, multi-output case can become costly. Do you know of a way to avoid this issue? If you think it's not a big issue I'll happily switch to lists.
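A rough back-of-the-envelope sketch of the buffers in question (shapes follow the snippet above; the dataset sizes are illustrative assumptions, not from the PR):

```python
# Illustrative sizes (assumptions for the sake of the estimate)
n_samples, max_n_classes, n_outputs, n_estimators = 100_000, 10, 3, 500

# One float64 oob_pred buffer per tree, shaped like the accumulator above
per_tree_bytes = n_samples * max_n_classes * n_outputs * 8

# A list of results keeps every tree's buffer alive at once...
list_gib = n_estimators * per_tree_bytes / 2**30
# ...whereas consuming a generator keeps roughly one buffer alive at a time,
# plus the running accumulators.
streaming_gib = per_tree_bytes / 2**30

print(f"list: ~{list_gib:.1f} GiB vs streaming: ~{streaming_gib:.3f} GiB")
```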
```python
return self.tree_.compute_feature_importances(normalize=False)

def compute_unbiased_feature_importance(self, X_test, y_test, sample_weight=None):
```
In line with the long method name above, should we add a `return_oob_pred` option here?
Reference Issues/PRs
Fixes #20059
What does this implement/fix? Explain your changes.
This implements two methods that correct the cardinality bias of the `feature_importances_` attribute of random forest estimators by leveraging out-of-bag (oob) samples.

The first method is derived from "Unbiased Measurement of Feature Importance in Tree-Based Methods" by Zhengze Zhou & Giles Hooker. The corresponding attribute is named `ufi_feature_importances_`. The second method is derived from "A Debiased MDI Feature Importance Measure for Random Forests" by Xiao Li et al. The corresponding attribute is named `mdi_oob_feature_importances_`. The names are temporary; we are still seeking a way of favoring one method over the other (we are currently investigating whether one of the two reaches its asymptotic behavior faster than the other).

These attributes are set by the `fit` method after training, if the parameter `oob_score` is set to `True`. In this case we send the oob samples to a Cython method at tree level that propagates them through the tree and returns the corresponding oob prediction function and feature importance measure. This new feature importance measure behaves similarly to regular Mean Decrease in Impurity (MDI) but mixes the in-bag and out-of-bag values of each node instead of using only the in-bag impurity. The two proposed methods differ in the way they mix in-bag and oob samples.
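A minimal usage sketch under the proposed API (the attribute names are provisional, as noted above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)

# oob_score=True triggers the computation of the oob-corrected importances
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X, y)

# Classical MDI importances (biased toward high-cardinality features)
print(forest.feature_importances_)

# Oob-corrected importances proposed in this PR (provisional attribute name)
print(forest.ufi_feature_importances_)
```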
This PR also adds these two new feature importance measures to the test suite, specifically in test_forest.py. Existing tests are widened to cover the two measures, and new tests are added to make sure they behave correctly (e.g. they coincide with the values given by the code of the cited papers, and they recover traditional MDI when computed on in-bag samples).
Any other comments?
The papers only suggest fixes for trees built with the Gini (classification) and mean squared error (regression) criteria, but we would like the new methods to support the other criteria available in scikit-learn. `log_loss` support was added for classification with the ufi method by generalizing the idea of mixing in-bag and oob samples (a sketch of this mixing for Gini follows below).

Some CPU and memory profiling was done to ensure that the computational overhead stays small compared to the cost of model fitting for large enough datasets.
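For intuition, here is our reading of the "mixing" idea for the Gini criterion (a sketch following Zhou & Hooker, not the PR's Cython implementation): classical Gini uses the squared in-bag class proportions, while the unbiased variant pairs each in-bag proportion with its out-of-bag counterpart.

```python
import numpy as np

def gini(p_inbag):
    """Classical Gini impurity from in-bag class proportions."""
    return 1.0 - np.sum(p_inbag**2)

def mixed_gini(p_inbag, p_oob):
    """Gini-like impurity mixing in-bag and oob class proportions.

    Since in-bag and oob samples are independent, the product of the two
    proportion estimates is an unbiased estimate of the squared proportion.
    """
    return 1.0 - np.dot(p_inbag, p_oob)
```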
Support for sparse matrix input should be added soon.
This work is done in close collaboration with @ogrisel.
TODO:

- `oob_score_`: done in d198f20
- support: 8329b3b; test: 0b48af4
- `sample_weight`: support added in f10721e; test in 241de66
- `GradientBoostingClassifier` and `GradientBoostingRegressor` when row-wise (sub)sampling is enabled at training time: done in ce52159, 8a09b39, 229cc4d
Edit: We noticed a discrepancy between the formula defined by the authors of mdi_oob and what their code does. This is detailed here, in part 5. Therefore we only implement UFI for now. Furthermore, we could not find an equivalent of ufi for the entropy impurity criterion, so we compute ufi with Gini whatever the impurity criterion in classification, and with MSE in regression.