
[ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754

Open · wants to merge 89 commits into main

Conversation

adam2392 commented Mar 10, 2022

Reference Issues/PRs

Closes: #20819

This is ready for review.

What does this implement/fix? Explain your changes.

Implements oblique trees/splitters as subclasses of the existing Tree/Splitter. This makes the sklearn Tree and Splitter extensible, so that downstream packages can define more complex and exotic sampling mechanisms. The Cython code that constitutes the core logic is ~950 LOC; the rest comes from unit tests and the Python API.

This is an extension of decision trees that generalizes splits to random linear combinations of features, the oblique trees proposed by Breiman in 2001. It will enable 3rd party packages to subclass this code to instantiate other trees that work on "combinations" of features in some way (e.g. taking a weighted sum, or applying a kernel). The general intuition is that oblique forests (OF) can sample a more diverse set of candidate features, enabling better generalization and robustness to high-dimensional noise. Random forests (RF) should be used if the user suspects all the signal is aligned with the feature dimensions. The trade-offs in computation time, model size, and scoring time are complex: OF can fit shorter trees but requires storing additional data per split, and it performs extra operations that increase computation time, though often negligibly. We suggest trying OF first if you have the computational and storage flexibility, and sticking with RF for now if you have very large sample sizes and very strict runtime/storage constraints.
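To make the mechanism concrete, here is a minimal NumPy sketch of a single oblique split candidate, assuming the sparse random linear combinations of Breiman's Forest-RC variant; the names are illustrative, not this PR's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # (n_samples, n_features)

# Sample a sparse projection vector: most entries are zero and a few are
# +/-1, so each candidate split mixes a small handful of features.
proj = np.zeros(X.shape[1])
nonzero = rng.choice(X.shape[1], size=3, replace=False)
proj[nonzero] = rng.choice([-1.0, 1.0], size=3)

# An axis-aligned split thresholds a single column of X; an oblique split
# thresholds the projection of X onto `proj` instead.
projected = X @ proj
threshold = np.median(projected)  # a real splitter would search thresholds
goes_left = projected <= threshold
```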

Experiments supporting the change

As requested by Thomas Fan, Guillaume, and Olivier, we ran the following experiments.

Docs changes for education and user guidance
  • extension of examples/tree/plot_iris_dtc.py to include oblique trees
  • modules/tree.rst with a new section on oblique trees
  • modules/ensemble.rst with a new section on oblique forests
  • a new example in examples/ensemble/plot_oblique_axis_aligned_forests.py demonstrating where oblique forests outperform axis-aligned forests on a real dataset from OpenML and on a simulated dataset

The changes to the setup.py files were necessary to compile the package in some of the CI pipelines; building worked without them locally on my Mac M1 machine, though.

Tracking Progress on sub-items - used for keeping track of challenges that arose (optional read)

The remaining items to complete are:

  • Add unit tests
  • Refactor to use memory views as in normal splitter (https://github.com/scikit-learn/scikit-learn/pull/23273/files)
  • Fix pickling issue in [ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees #22754 (comment)
  • Add feature_importances_, or specify an error message if it's not allowed.
  • Extend the RF-Iris dataset example done by Jong Shin within the docs. Also add a synthetic dataset that perhaps looks cleaner, and another dataset, for example digits (8 x 8 images), where OF can get shallower/smaller forests vs. RandomForest; e.g. train OF/RF with 100 trees, subsample the forest, and plot accuracy and max_depth. Basically, gather a few more candidate examples for the docs: one showing an intuitive visualization, and another demonstrating possible superiority on a real example, e.g. horizontally stack the real digits signal with permuted noisy copies (see the sketch after this list). Put it into a separate notebook for now for downstream integration into scikit-learn. End goal: educate users on the inductive bias of oblique trees and when to use them.
  • Determine if minor refactoring is needed with TreeBuilder/Tree for adding nodes (see: [MAINT] Improve extensibility of the Tree/Splitter code #22756 (comment))
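For the digits idea in the list above, a rough sketch of the "signal plus permuted noisy copies" construction might look like the following; the details are assumptions, not the final docs example:

```python
import numpy as np
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

# Create a copy whose columns are independently permuted: the marginal
# distributions are preserved but any per-sample signal is destroyed.
noise = X.copy()
for j in range(noise.shape[1]):
    noise[:, j] = rng.permutation(noise[:, j])

# Horizontally stack the real signal with the permuted noisy copy.
X_aug = np.hstack([X, noise])
```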

Any other comments?

This PR demonstrates the end product. To simplify the merging process, I propose merging it as a series of smaller PRs.

Another misc. note: training of both oblique and axis-aligned decision trees could be further sped up by quantizing continuous feature values in the data (i.e. the same idea used in histogram gradient boosting).

adam2392 marked this pull request as draft on Mar 10, 2022
adam2392 force-pushed the obliquepr branch 2 times, most recently from 5b74193 to 4a4b4bb on Mar 15, 2022
adam2392 commented Mar 15, 2022

Linking convo as reference: neurodata#16 and specifically neurodata#16 (comment).

I am going to proceed by adding some lightweight unit tests to determine integrability, and by getting a working version that we can come back to.

Then I will try out the idea of replacing SplitRecord and such with a cdef class (maybe even a dataclass?), which would be used to pass the data structure between node_split and add_node.

adam2392 and others added 4 commits Mar 15, 2022
Co-authored-by: Chester Huynh <[email protected]>
Co-authored-by: Parth Vora <[email protected]>
Co-Authored-By: Thomas Fan <[email protected]>
Co-Authored-By: Chester Huynh <[email protected]>
Co-Authored-By: Parth Vora <[email protected]>
adam2392 commented Mar 15, 2022

Progress as of 3/16/22

Added oblique trees/forests to the existing unit test parametrizations and they mostly work, with one minor issue in pickling. The following is a summary:

  • Sparse dataset support, which I think should not be addressed here(?). There is the question of how it would be implemented, and it needs more thought in terms of how to handle "categorical" data as well. Overall, I added error messages where needed and updated tests where needed to handle testing oblique trees/forests.
  • Pickling does not round-trip for some reason. See below.
  • Added a few new tests of performance relative to Forest-RI (i.e. DecisionTreeClassifier and RandomForestClassifier).

The commit 458220e is a stable working version that has all the tree/ensemble unit tests passing and also works as intended. The next step is to explore whether we can substitute a cdef class for SplitRecord/ObliqueSplitRecord, to better support subclassing.

Pickling Issue Summary

The issue arises in the test sklearn/tree/tests/test_tree.py::test_min_impurity_decrease, which pickles the estimator and then compares metadata and performance before/after. The scoring fails on the following lines:

score2 = est2.score(X, y)
assert (
    score == score2
), "Failed to generate same score after pickling with {0}".format(name)

with a score of 1.0 before pickling and 0.97 after.
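For reference, here is a self-contained version of the round-trip check, using DecisionTreeClassifier as a stand-in for the oblique tree estimator added in this PR (the class name here is an assumption-free substitute, since the PR's estimator is not in released scikit-learn):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
est = DecisionTreeClassifier(random_state=0).fit(X, y)
score = est.score(X, y)

# Round-trip through pickle and check the score is bit-for-bit identical.
est2 = pickle.loads(pickle.dumps(est))
score2 = est2.score(X, y)
assert score == score2, f"Failed to generate same score after pickling: {score} != {score2}"
```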

I've "almost" fixed it in a74d970, where all the internal data structures of the ObliqueTree match before/after pickling, but for some reason predict still produces different answers.

Update: I can fix it by setting max_depth=5, so I suspect there is some machine-precision rounding issue on this edge case of a dataset. Done in a582546

Notes 3/21/22

Remove the max_depth=5 and try again:

  • Look at the predictions rather than the score to determine which leaf is problematic. The mismatch most likely comes from a mishandled tie at the feature-value decision threshold(?), e.g. a float32 compared to a float64.
  • Run 1NN within the test and see if there are duplicate feature vectors within X that have different y-labels.
  • Iris rows 101 and 142 are duplicated... but that might not be relevant; we'll see.
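A quick sketch of the duplicate check from the second bullet (not the PR's test code): find feature vectors that appear more than once in X with conflicting y-labels.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Group identical rows of X and flag groups whose labels disagree.
_, inverse, counts = np.unique(X, axis=0, return_inverse=True, return_counts=True)
for group in np.where(counts > 1)[0]:
    rows = np.where(inverse == group)[0]
    if len(set(y[rows])) > 1:
        print("conflicting labels at rows", rows, "->", y[rows])
```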

Update May 12th, 2022

It turns out Cython silently allows for j in range(...) even if j is not declared with a type(?). Weird. Fixed in 8a720ed.
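In Cython, the fix is simply to declare the loop index; a small sketch of the pitfall, not the PR's actual diff:

```cython
def row_sum(double[:, :] X):
    # Without the `cdef` below, `j` is silently treated as an untyped
    # Python object and the loop does not compile down to a C loop.
    cdef Py_ssize_t j
    cdef double total = 0.0
    for j in range(X.shape[1]):
        total += X[0, j]
    return total
```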

adam2392 added 2 commits Mar 15, 2022
tests.

The issues still are:

- sparse dataset support, which I could possibly split into a follow-up PR?
- pickling does not round-trip for some reason
- certain issues with max_leaf
adam2392 commented Jun 13, 2022

Discussion on Discord - 6/13/22

Attendees: @ogrisel @glemaitre @jshinm (I don't have the other folks' GitHub handles, please feel free to add them):

  1. Plot absolute values in addition to the delta plots, since max_features may lead to different optimal performance for RF and OF; e.g. maybe RF is optimal at mtry='sqrt', whereas OF is optimal at mtry='2*n_features'.
  2. Perform a grid/random search over max_features and n_estimators to compare OF vs. RF (see the sketch below). Store the end results for MNIST with roc_auc as the metric, and store the fitted trees on disk. The results can be stored in a pd.DataFrame. Note: allow max_features to be > n_features for OF by design and see what happens.
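A sketch of the agreed protocol, with RandomForestClassifier standing in for the oblique forest from this PR (which would be benchmarked alongside it on MNIST); the dataset and grid here are placeholders:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=50, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_features": ["sqrt", 0.5, 1.0], "n_estimators": [50, 100]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

# Store the end results as a DataFrame for plotting the trade-offs later.
results = pd.DataFrame(search.cv_results_)
```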

We can use code from https://nbviewer.org/github/ogrisel/notebooks/blob/master/sklearn_demos/ames_housing.ipynb to inspire additional plots that educate the dev team and users about the trade-offs.

At the end of the day, we have the following trade-offs: train time, score time, RAM size of the trees, and actual performance (e.g. ROC AUC). Once we have the results of the search CV, we can quickly plot these.

Misc notes

One idea for measuring the size of the tree(s): serialize with pickle protocol 5 and measure the length of the resulting bytes (see the sketch below).
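For example, a sketch of the idea with RandomForestClassifier as the stand-in estimator:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Serialized size as a proxy for the in-memory size of the trees.
n_bytes = len(pickle.dumps(forest, protocol=5))
print(f"{n_bytes} bytes")
```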

adam2392 commented Jul 20, 2022

We are working on the notebook/example that demonstrates the superiority of oblique decision trees vs. axis-aligned decision trees. But based on the feedback we have received, this is ready for review! Please see the PR's main post for all the information/discussion relevant to inclusion, maintenance, modularity of code, testability, and documentation.

Thanks for all the feedback and discussion to get it to this point! @glemaitre @thomasjpfan @amueller @ogrisel

adam2392 changed the title from "[ENH] Add Oblique trees and oblique splitters demonstrating upgradeability of Tree code" to "[ENH] Add Oblique trees and oblique splitters to tree module: Enables extensibility of the trees" on Jul 20, 2022
adam2392 commented Sep 6, 2022

@ogrisel I fixed the merge commit as discussed yesterday.

Here is a "brief" summary. For full details, see the PR description. Let me know if this is too long / detailed:

What are oblique trees: oblique trees generate splits based on combinations of features. They can sample more than n_features candidate splits per split node (up to 2^n, in fact). Sampling random linear combinations of features implicitly captures the dominant singular vectors of the dataset with high probability, as described by the Johnson–Lindenstrauss lemma. Oblique trees do not necessarily need to sample linear combinations, either; they can be generalized in countless ways, which opens the door for broader usage of the sklearn.tree module.

When are oblique trees better than axis-aligned? Based on the geometric motivation above, one would expect oblique trees to fare better than axis-aligned trees when there are many noise dimensions, and/or when the optimal split hyperplanes in the data are not aligned with the features themselves.

When are axis-aligned trees better than oblique?
Axis-aligned trees perform better when the informative splits occur along the feature axes. In general, this is the rarer scenario (analogously: what are the chances of sampling a 0 eigenvalue of a random matrix? measure 0), so oblique trees perform better on average. However, oblique trees carry a small additional computational and storage cost: the sampling and storage of the projection vectors for each split node. It is therefore recommended to use axis-aligned trees whenever compute and storage are critical requirements, but to always test oblique trees via cross-validation when performance in terms of scoring metrics is what matters.
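As code, that recommendation amounts to something like the following sketch. The oblique estimator is shown as a commented-out assumption (its exact name in this PR may differ) so the snippet runs against current scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# from sklearn.ensemble import ObliqueRandomForestClassifier  # hypothetical; proposed in this PR
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=40, random_state=0)

# Compare candidates by cross-validated ROC AUC before committing to one.
candidates = [RandomForestClassifier(random_state=0)]
for est in candidates:
    scores = cross_val_score(est, X, y, cv=5, scoring="roc_auc")
    print(type(est).__name__, scores.mean().round(3))
```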
