
[ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754

Open · wants to merge 89 commits into main

Conversation

adam2392 commented Mar 10, 2022

Reference Issues/PRs

Closes: #20819

This is ready for review.

What does this implement/fix? Explain your changes.

Implements oblique trees/splitters as subclasses of the existing Tree/Splitter. This makes the sklearn Tree and Splitter extensible, so that downstream packages can define more complex and exotic sampling mechanisms. The Cython code that constitutes the core logic is ~950 LOC; the rest comes from unit tests and the Python API.

This is an extension of decision trees that generalizes splits to random linear combinations of features, the oblique trees proposed by Breiman in 2001. It will enable 3rd party packages to subclass this code to instantiate other trees that work on "combinations" of features in some way (e.g. taking a weighted sum, or applying a kernel). The general intuition is that oblique forests (OF) can sample a more diverse set of candidate features, enabling better generalization and robustness to high-dimensional noise. Random forests (RF) should be used if the user suspects all the signal is aligned with the feature dimensions. The trade-offs in computation time, model size, and scoring time are complex: OF can fit shorter trees but requires storing additional data per split, and it performs extra operations that increase computation time, though often negligibly. We suggest trying OF first if you have the computational and storage flexibility, and sticking with RF for now if you have very large sample sizes and very strict runtime/storage constraints.
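To make the mechanism concrete, here is a minimal NumPy sketch of a single oblique split candidate, assuming the sparse random linear combinations of Breiman's Forest-RC variant; the names are illustrative, not this PR's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # (n_samples, n_features)

# Sample a sparse projection vector: most entries are zero and a few are
# +/-1, so each candidate split mixes a small handful of features.
proj = np.zeros(X.shape[1])
nonzero = rng.choice(X.shape[1], size=3, replace=False)
proj[nonzero] = rng.choice([-1.0, 1.0], size=3)

# An axis-aligned split thresholds a single column of X; an oblique split
# thresholds the projection of X onto `proj` instead.
projected = X @ proj
threshold = np.median(projected)  # a real splitter would search thresholds
goes_left = projected <= threshold
```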

Experiments supporting the change

As requested by Thomas Fan, Guillaume, and Olivier, we ran the following experiments.

Docs changes for education and user guidance
  • extension of examples/tree/plot_iris_dtc.py to include oblique trees
  • modules/tree.rst with a new section on oblique trees
  • modules/ensemble.rst with a new section on oblique forests
  • a new example in examples/ensemble/plot_oblique_axis_aligned_forests.py demonstrating where oblique forests outperform axis-aligned forests on a real dataset from OpenML and on a simulated dataset

The changes to the setup.py files were necessary to compile the package in some of the CI pipelines; building worked without them locally on my Mac M1 machine, though.

Tracking Progress on sub-items - used for keeping track of challenges that arose (optional read)

The remaining items to complete are:

  • Add unit tests
  • Refactor to use memory views as in normal splitter (https://github.com/scikit-learn/scikit-learn/pull/23273/files)
  • Fix pickling issue in [ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754 (comment)
  • Add feature_importances_, or specify an error message if it's not allowed.
  • Extend the RF-Iris dataset example done by Jong Shin within the docs. Also add a synthetic dataset that perhaps looks cleaner, and another dataset, for example digits (8 x 8 images), where OF can get shallower/smaller forests vs. RandomForest; e.g. train OF/RF with 100 trees, subsample the forest, and plot accuracy and max_depth. Basically, gather a few more candidate examples for the docs: one showing an intuitive visualization, and another demonstrating possible superiority on a real example, e.g. horizontally stack the real digits signal with permuted noisy copies (see the sketch after this list). Put it into a separate notebook for now for downstream integration into scikit-learn. End goal: educate users on the inductive bias of oblique trees and when to use them.
  • Determine if minor refactoring is needed with TreeBuilder/Tree for adding nodes (see: [MAINT] Improve extensibility of the Tree/Splitter code #22756 (comment))
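For the digits idea in the list above, a rough sketch of the "signal plus permuted noisy copies" construction might look like the following; the details are assumptions, not the final docs example:

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

# Create a copy whose columns are independently permuted: the marginal
# distributions are preserved but any per-sample signal is destroyed.
noise = X.copy()
for j in range(noise.shape[1]):
    noise[:, j] = rng.permutation(noise[:, j])

# Horizontally stack the real signal with the permuted noisy copy.
X_aug = np.hstack([X, noise])
```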

Any other comments?

This PR demonstrates the end product. To simplify the merging process, I propose merging it as a series of smaller PRs.

Another misc. note: training of both oblique and axis-aligned decision trees could be further sped up by quantizing continuous feature values in the data (i.e. the same idea used in histogram gradient boosting).

adam2392 marked this pull request as draft on Mar 10, 2022
adam2392 force-pushed the obliquepr branch 2 times, most recently from 5b74193 to 4a4b4bb on Mar 15, 2022
adam2392 commented Mar 15, 2022

Linking convo as reference: neurodata#16 and specifically neurodata#16 (comment).

I am going to proceed by adding some lightweight unit tests to determine integrability, and by getting a working version that we can come back to.

Then I will try out the idea of replacing SplitRecord and such with a cdef class (maybe even a dataclass?), which would be used to pass the data structure between node_split and add_node.

adam2392 and others added 4 commits Mar 15, 2022
Co-authored-by: Chester Huynh <[email protected]>
Co-authored-by: Parth Vora <[email protected]>
Co-Authored-By: Thomas Fan <[email protected]>
Co-Authored-By: Chester Huynh <[email protected]>
Co-Authored-By: Parth Vora <[email protected]>
adam2392 commented Mar 15, 2022

Progress as of 3/16/22

Added oblique trees/forests to the existing unit test parametrizations and they mostly work, with one minor issue in pickling. The following is a summary:

  • Sparse dataset support, which I think should not be addressed here(?). There is the question of how it would be implemented, and it needs more thought in terms of how to handle "categorical" data as well. Overall, I added error messages where needed and updated tests where needed to handle testing oblique trees/forests.
  • Pickling does not round-trip for some reason. See below.
  • Added a few new tests of performance relative to Forest-RI (i.e. DecisionTreeClassifier and RandomForestClassifier).

The commit 458220e is a stable working version that has all the tree/ensemble unit tests passing and also works as intended. The next step is to explore whether we can substitute a cdef class for SplitRecord/ObliqueSplitRecord, to better support subclassing.

Pickling Issue Summary

The issue arises in the test sklearn/tree/tests/test_tree.py::test_min_impurity_decrease, which pickles the estimator and then compares metadata and performance before/after. The scoring fails on the following lines:

score2 = est2.score(X, y)
assert (
    score == score2
), "Failed to generate same score after pickling with {0}".format(name)

with a score of 1.0 before pickling and 0.97 after.
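For reference, here is a self-contained version of the round-trip check, using DecisionTreeClassifier as a stand-in for the oblique tree estimator added in this PR (the class name here is an assumption-free substitute, since the PR's estimator is not in released scikit-learn):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
est = DecisionTreeClassifier(random_state=0).fit(X, y)
score = est.score(X, y)

# Round-trip through pickle and check the score is bit-for-bit identical.
est2 = pickle.loads(pickle.dumps(est))
score2 = est2.score(X, y)
assert score == score2, f"Failed to generate same score after pickling: {score} != {score2}"
```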

I've "almost" fixed it in a74d970, where all the internal data structures of the ObliqueTree match before/after pickling, but for some reason predict still produces different answers.

Update: I can fix it by setting max_depth=5, so I suspect there is some machine-precision rounding issue on this edge case of a dataset. Done in a582546

Notes 3/21/22

Remove the max_depth=5 and try again:

  • Look at the predictions rather than the score to determine which leaf is problematic. The mismatch most likely comes from a mishandled tie at the feature-value decision threshold(?), e.g. a float32 compared to a float64.
  • Run 1NN within the test and see if there are duplicate feature vectors within X that have different y-labels.
  • Iris rows 101 and 142 are duplicated... but that might not be relevant; we'll see.
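A quick sketch of the duplicate check from the second bullet (not the PR's test code): find feature vectors that appear more than once in X with conflicting y-labels.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Group identical rows of X and flag groups whose labels disagree.
_, inverse, counts = np.unique(X, axis=0, return_inverse=True, return_counts=True)
for group in np.where(counts > 1)[0]:
    rows = np.where(inverse == group)[0]
    if len(set(y[rows])) > 1:
        print("conflicting labels at rows", rows, "->", y[rows])
```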

Update May 12th, 2022

It turns out Cython silently allows for j in range(...) even if j is not declared with a type(?). Weird. Fixed in 8a720ed.
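In Cython, the fix is simply to declare the loop index; a small sketch of the pitfall, not the PR's actual diff:

```cython
def row_sum(double[:, :] X):
    # Without the `cdef` below, `j` is silently treated as an untyped
    # Python object and the loop does not compile down to a C loop.
    cdef Py_ssize_t j
    cdef double total = 0.0
    for j in range(X.shape[1]):
        total += X[0, j]
    return total
```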

adam2392 added 2 commits Mar 15, 2022
tests.

The issues still are:

- sparse dataset support, which I could possibly split into a follow-up PR?
- pickling does not round-trip for some reason
- certain issues with max_leaf
adam2392 commented Jun 13, 2022

Discussion on Discord - 6/13/22

Attendees: @ogrisel @glemaitre @jshinm (I don't have the other folks' GitHub handles, please feel free to add them):

  1. Plot absolute values in addition to the delta plots, since max_features may lead to different optimal performance for RF and OF; e.g. maybe RF is optimal at mtry='sqrt', whereas OF is optimal at mtry='2*n_features'.
  2. Perform a grid/random search over max_features and n_estimators to compare OF vs. RF (see the sketch below). Store the end results for MNIST with roc_auc as the metric, and store the fitted trees on disk. The results can be stored in a pd.DataFrame. Note: allow max_features to be > n_features for OF by design and see what happens.
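A sketch of the agreed protocol, with RandomForestClassifier standing in for the oblique forest from this PR (which would be benchmarked alongside it on MNIST); the dataset and grid here are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_features": ["sqrt", 0.5, 1.0], "n_estimators": [50, 100]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

# Store the end results as a DataFrame for plotting the trade-offs later.
results = pd.DataFrame(search.cv_results_)
```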

We can use code from https://nbviewer.org/github/ogrisel/notebooks/blob/master/sklearn_demos/ames_housing.ipynb to inspire additional plots that educate the dev team and users about the trade-offs.

At the end of the day, we have the following trade-offs: train time, score time, RAM size of the trees, and actual performance (e.g. ROC AUC). Once we have the results of the search CV, we can quickly plot these.

Misc notes

One idea for measuring the size of the tree(s): serialize with pickle protocol 5 and measure the length of the resulting bytes (see the sketch below).
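For example, a sketch of the idea with RandomForestClassifier as the stand-in estimator:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Serialized size as a proxy for the in-memory size of the trees.
n_bytes = len(pickle.dumps(forest, protocol=5))
print(f"{n_bytes} bytes")
```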

adam2392 commented Jul 20, 2022

We are working on the notebook/example that demonstrates the superiority of oblique decision trees vs. axis-aligned decision trees. But based on the feedback we have received, this is ready for review! Please see the PR's main post for all the information/discussion relevant to inclusion, maintenance, modularity of code, testability, and documentation.

Thanks for all the feedback and discussion to get it to this point! @glemaitre @thomasjpfan @amueller @ogrisel

adam2392 changed the title from "[ENH] Add Oblique trees and oblique splitters demonstrating upgradeability of Tree code" to "[ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees" on Jul 20, 2022
adam2392 commented Sep 6, 2022

@ogrisel I fixed the merge commit as discussed yesterday.

Here is a "brief" summary. For full details, see the PR description. Let me know if this is too long / detailed:

What are oblique trees: oblique trees generate splits based on combinations of features. They can sample more than n_features candidate splits per split node (up to 2^n, in fact). Sampling random linear combinations of features implicitly captures the dominant singular vectors of the dataset with high probability, as described by the Johnson–Lindenstrauss lemma. Oblique trees do not necessarily need to sample linear combinations, either; they can be generalized in countless ways, which opens the door for broader usage of the sklearn.tree module.

When are oblique trees better than axis-aligned? Based on the geometric motivation above, one would expect oblique trees to fare better than axis-aligned trees when there are many noise dimensions, and/or when the optimal split hyperplanes in the data are not aligned with the features themselves.

When are axis-aligned trees better than oblique?
Axis-aligned trees perform better when the informative splits occur along the feature axes. In general, this is the rarer scenario (analogously: what are the chances of sampling a 0 eigenvalue of a random matrix? measure 0), so oblique trees perform better on average. However, oblique trees carry a small additional computational and storage cost: the sampling and storage of the projection vectors for each split node. It is therefore recommended to use axis-aligned trees whenever compute and storage are critical requirements, but to always test oblique trees via cross-validation when performance in terms of scoring metrics is what matters.
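As code, that recommendation amounts to something like the following sketch. The oblique estimator is shown as a commented-out assumption (its exact name in this PR may differ) so the snippet runs against current scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# from sklearn.ensemble import ObliqueRandomForestClassifier  # hypothetical; proposed in this PR
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=40, random_state=0)

# Compare candidates by cross-validated ROC AUC before committing to one.
candidates = [RandomForestClassifier(random_state=0)]
for est in candidates:
    scores = cross_val_score(est, X, y, cv=5, scoring="roc_auc")
    print(type(est).__name__, scores.mean().round(3))
```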
