The Wayback Machine - https://web.archive.org/web/20230304095726/https://github.com/scikit-learn/scikit-learn/pull/22574

ENH add Naive Bayes Metaestimator ColumnwiseNB (aka "GeneralNB") #22574

Open. Wants to merge 169 commits into base: main.
Conversation

@avm19 avm19 commented Feb 22, 2022

This PR is ready for review.
REVIEWER! Don't be discouraged by the amount of text below! The itemised comments are optional and document certain considerations.

Reference Issues/PRs
Fixes #12957 #15077 (issues)
Supersedes #16281 (stalled PR)
Somewhat related to #10856
Incorporates changes from PR #22565, where I add abstract methods to _BaseDiscreteNB

What does this implement/fix? Explain your changes.

This implements a meta-estimator that combines existing naive Bayes classifiers: GaussianNB, BernoulliNB, and others. The interface is inspired by ColumnTransformer, as previously suggested by @timbicker, @jnothman and others:

>>> import numpy as np
>>> rng = np.random.RandomState(1)
>>> X = rng.randint(5, size=(6, 100))
>>> y = np.array([0, 0, 1, 1, 2, 2])
>>> from sklearn.naive_bayes import MultinomialNB, GaussianNB, ColumnwiseNB
>>> clf = ColumnwiseNB(estimators=[('mnb1', MultinomialNB(), [0, 1]),
...                                ('mnb2', MultinomialNB(), [3, 4]),
...                                ('gnb1', GaussianNB(), [5])])
>>> clf.fit(X, y)
ColumnwiseNB(estimators=[('mnb1', MultinomialNB(), [0, 1]),
                         ('mnb2', MultinomialNB(), [3, 4]),
                         ('gnb1', GaussianNB(), [5])])
>>> print(clf.predict(X))
[0 0 1 0 2 2]

The meta-estimator combines multiple naive Bayes estimators by expressing the overall joint probability P(x,y) through P(x_i,y), the joint probabilities of the subestimators:

Log P(x,y) = Log P(x_1,y) + ... + Log P(x_N,y) - (N - 1) Log P(y),

where N is the number of estimators (3 in the example). The joint probabilities P(x_i,y) are exposed through the private method _joint_log_likelihood(self, X) in any instance of the _BaseNB class, which is a parent of all naive Bayes classifiers implemented so far, including the proposed meta-estimator.
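The combination rule can be sketched in pure NumPy. The per-subestimator log-probability tables below are made up (random Dirichlet rows) rather than produced by real fitted estimators; only the arithmetic mirrors the formula above:

```python
import numpy as np

# Made-up joint log-probabilities log P(x_i, y) for 4 samples and
# 3 classes, from N = 2 subestimators, plus a shared class prior P(y).
rng = np.random.default_rng(0)
jll_1 = np.log(rng.dirichlet(np.ones(3), size=4))  # log P(x_1, y)
jll_2 = np.log(rng.dirichlet(np.ones(3), size=4))  # log P(x_2, y)
log_prior = np.log(np.array([0.5, 0.3, 0.2]))      # log P(y)

n_estimators = 2
# log P(x, y) = log P(x_1, y) + log P(x_2, y) - (N - 1) * log P(y)
joint = jll_1 + jll_2 - (n_estimators - 1) * log_prior

# Normalise each row (log-sum-exp) to obtain log P(y | x),
# then predict the argmax class, as _BaseNB.predict does.
m = joint.max(axis=1, keepdims=True)
log_norm = m + np.log(np.exp(joint - m).sum(axis=1, keepdims=True))
log_posterior = joint - log_norm
pred = log_posterior.argmax(axis=1)
```

Note that the normalisation step does not change the argmax, so prediction only needs the unnormalised joint log-probabilities.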

Any other comments?

  1. About the name. This meta-estimator was referred to as GeneralNB in the past, cf. the stalled PR [WIP] Implement general naive Bayes #16281. I think terms like "general" or "generalized" are too ambiguous. For example, Bernoulli and Gaussian distributions are both members of the general exponential family of distributions -- is this the generalisation we refer to? Another generalisation is having a different distribution family for each class, e.g., P(x|y=0)~N and P(x|y=1)~Mult, or working with any user-provided distribution. In addition, I don't recall seeing "the general naive Bayes" in the literature: all models were just naive Bayes with further assumptions of various degrees of generality.
    In contrast, ColumnwiseNB is specific: it tells the user that something is going on with columns and forms a very appropriate association with ColumnTransformer. I think this is a much better name, and maybe there are even better options. I'll be fine with any decision, but I think ColumnwiseNB would be a good choice.
  2. Docstring and tests are included.
  3. Use case, examples and guide. I don't know of a good use case or have a good example at hand. I'd be grateful for a suggestion, or if someone wrote an example using my implementation while testing it at the same time. I guess a paragraph or two should be added to the Guide as well. Shall we leave all this for another PR? [updated below]
  4. Parallel jobs. I implemented fit and partial_fit following _BaseDiscreteNB and took ColumnTransformer's implementation as a blueprint for parallelism. I am not very familiar with the inner workings of joblib, and I ask reviewers to pay extra attention here.
  5. MetaEstimator. _BaseComposition is a parent class and get/set methods for parameters have been implemented and tested. Subestimators are cloned prior to fitting.
  6. Callable columns are supported.
  7. Pandas DataFrame and string indexing of columns is supported.
  8. Unlike ColumnTransformer, there is no remainder parameter: it would complicate the logic a little. Do you think it would be really nice to have? Maybe leave it until the next PR then.
  9. Repeated columns are allowed and have the same effect as if the corresponding features were duplicated in X.
  10. GaussianNB and _BaseDiscreteNB use different conventions. For example, priors vs class_prior as hyperparameters and class_log_prior_ vs class_prior_ as attributes. I do not know which convention is preferable. Just a remark.
  11. The meta-estimator duplicates some functionality of its subestimators, such as checking feature names, sample weights, classes, supplied prior, counting and calculating priors as relative frequencies. I haven't found a way to avoid this.
  12. (I forgot it)
  13. Theoretical issue. The equation Log P(x,y) = Log P(x_1,y) + ... + Log P(x_N,y) - (N - 1) Log P(y) implicitly assumes that the class log priors are finite and agree between the meta-estimator and the subestimators: -inf < Log P(y) = Log P(y|1) = ... = Log P(y|N). The meta-estimator does not check this condition. Meaningless results, including NaN, may be produced by the meta-estimator if the class priors differ or contain a zero probability. Usually this is not a problem, because all children of _BaseNB calculate class priors directly as relative frequencies; unseen classes from partial_fit are on the user, not on us. But in general, class priors can be estimated differently or provided by the user in the subestimators and the meta-estimator independently. We can distinguish 2 issues. First, what if the subestimators' class priors do not agree with the meta-estimator's prior? Then we could "marry" the subestimators not "at joint probabilities" but "at conditional probabilities", just as naive Bayes prescribes: Log P(x|y) = Log P(x_1|y) + ... + Log P(x_N|y), or equivalently Log P(x,y) = Log P(x_1,y) - Log P(y|1) + ... + Log P(x_N,y) - Log P(y|N) + Log P(y), where Log P(y|i) is the class prior from the i-th subestimator. Second, what if some prior probabilities are zero? A problem arises only when the zero occurs in the meta-estimator's class prior. This is something to think about. I wonder if someone has come across a paper or a textbook that discusses and resolves this issue.
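The difference between the two "marriages" can be illustrated numerically. This is a toy sketch with made-up log-probability tables (not the PR's code): the two combination rules coincide exactly when every subestimator's class prior equals the meta-estimator's prior.

```python
import numpy as np

# Two subestimators, 2 samples, 2 classes; all numbers are made up.
jll = [np.array([[-1.2, -2.3], [-0.7, -1.9]]),  # log P(x_1, y)
       np.array([[-2.0, -1.1], [-1.5, -0.9]])]  # log P(x_2, y)
sub_log_prior = [np.log([0.6, 0.4]),            # log P(y|1)
                 np.log([0.5, 0.5])]            # log P(y|2)
meta_log_prior = np.log([0.7, 0.3])             # log P(y)

# "At joint probabilities" (the PR's formula); assumes all priors agree
# and uses the meta-estimator's prior in the correction term:
joint_a = sum(jll) - (len(jll) - 1) * meta_log_prior

# "At conditional probabilities": subtract each subestimator's own
# prior, then add the meta-estimator's prior exactly once:
joint_b = sum(j - p for j, p in zip(jll, sub_log_prior)) + meta_log_prior

# Sanity check: with all sub priors equal to the meta prior,
# the conditional formula reduces to the joint formula.
joint_c = sum(j - meta_log_prior for j in jll) + meta_log_prior
assert np.allclose(joint_c, joint_a)
```

With the disagreeing priors above, joint_a and joint_b differ, which is exactly the ambiguity the item describes.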

Finally, I would appreciate any comments and suggestions, especially those addressing mistakes or aimed at efficient programming and "good code". I am glad to be learning from this community.

Updated on Wednesday, February 23, 2022 05:57:39 UTC:

  1. estimatorNBs vs estimators. Initially, the parameter's name was "estimators", by analogy with "transformers". It turns out that many common tests failed simply because they duck-type models based on the presence of estimators in the parameter list. As a result, the tests assume that they're dealing with something like VotingClassifier and feed it LogisticRegression--something that ColumnwiseNB is not supposed to accept. I took the path of least resistance and renamed estimators to estimatorNBs.

  2. The only change I made to pre-existing test files is adding ColumnwiseNB to VALIDATE_ESTIMATOR_INIT, a list that exempts ColumnTransformer and other estimators from a certain test. [Updated on 2022-05-29] The problematic test now passes by catching the error when the _estimators setter in ColumnwiseNB fails to unpack the parameters passed by the test. See FIX Remove validation from __init__ and set_params for ColumnTransformer #22537 for a similar change in ColumnTransformer.

Updated on Wednesday, February 23, 2022 09:14:13 UTC:

  1. pytest, joblib, and output. I don't understand how pytest interacts with joblib, or why I can't capture any stdout produced within a parallel loop when running pytest. As a result, I cannot test such output with pytest. Could someone please explain this unexpected behaviour or suggest a solution?

Updated on Sunday, February 27, 2022 09:00:13 UTC:

  1. n_features_in_. [updated below] Advice needed. First, I do not fully understand the purpose of this attribute and whether it is needed given we use feature_names_in_. Second, setting it up using BaseEstimator._check_n_features can be problematic, since this method expects a "converted array", and I currently avoid converting X in the meta-estimator.

Updated on Thursday, April 7, 2022 17:14:08 UTC:

  1. There could be an advantage in taking VotingClassifier as a prototype instead of ColumnTransformer. The difference is that we would be passing a list of tuples (str, estimator) instead of (str, estimator, columns). In the former case, the estimator would have to be a base naive Bayes estimator wrapped together with a column selector, e.g. a GaussianNB plus a selector of float columns. In fact, we could have redefined all _BaseNB subclasses to add this new parameter for column selection, whose default value selects all columns of X. The advantage is that the subestimators' column subsets would then be treated as separate hyperparameters for the purposes of GridSearch. We could write
    param_grid = {"clf__gnb1__columns": [[5], [5, 6]]}
    instead of much more verbose
    param_grid = {"clf__estimators": [
        [('mnb1', MultinomialNB(), [0, 1]),
         ('mnb2', MultinomialNB(), [3, 4]),
         ('gnb1', GaussianNB(), [5])],
        [('mnb1', MultinomialNB(), [0, 1]),
         ('mnb2', MultinomialNB(), [3, 4]),
         ('gnb1', GaussianNB(), [5, 6])],
    ]}
    

Updated on Sunday, April 10, 2022 01:01:55 UTC:

  1. Although BaseEstimator._check_n_features is typically passed converted arrays (e.g., from BaseEstimator._validate_data or ColumnTransformer.fit_transform), it seems to work fine with pandas dataframes too. All examples work and all tests pass.

Updated on Sunday, April 10, 2022 13:45:52 UTC:

  1. Use case, examples and guide. A section to naive Bayes guide is added. An auto-example (titanic dataset) is created for the gallery.

Updated on Monday, April 11, 2022 05:30:23 UTC:

  1. HTML representation. ColumnwiseNB displays its subestimators in parallel, just like ColumnTransformer.

Updated on Thursday, July 7, 2022 23:24:45 UTC:

  1. _validate_params, aka common parameter validation. Necessary changes towards Make all estimators use _validate_params #23462 are made. A custom replacement for test_common.py::test_check_param_validation is implemented in test_naive_bayes.py::test_cwnb_check_param_validation. It is needed because utils.estimator_checks._construct_instance does not know how to create an instance of ColumnwiseNB, which leads to test_common.py silently skipping this test for ColumnwiseNB (a message is displayed only when the test is called directly). ColumnTransformer, Pipeline, and a handful of other estimators suffer from this problem too.

Updated on Friday, October 7, 2022 16:26:00 UTC:

  1. Why is ColumnTransformer not enough? This question was asked at a Discord voice meeting. First, ColumnTransformer is a transformer, not a classifier. Second, ColumnTransformer simply concatenates the transformed columns, whereas naive Bayes requires additional calculations. Importantly, these additional calculations are done not on the subestimators' standard output (predict, predict_proba, predict_log_proba), but on the intermediate quantities (predict_joint_log_proba, formerly known as _joint_log_likelihood); see this discussion. This is why ColumnTransformer cannot be used even with additional logic wrapped on top of it.
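The point about needing the intermediate joint log-probabilities, rather than the standard predict_proba output, can be shown with a toy numerical sketch (made-up probability tables, not real estimators): with a non-uniform class prior, multiplying per-block posteriors disagrees with the correct joint combination.

```python
import numpy as np

# One sample, two classes, two column blocks; all numbers are made up.
p_joint_1 = np.array([[0.10, 0.30]])  # P(x_1, y)
p_joint_2 = np.array([[0.20, 0.05]])  # P(x_2, y)
prior = np.array([0.8, 0.2])          # P(y), non-uniform

# Correct naive Bayes combination: P(x, y) = P(x_1, y) P(x_2, y) / P(y),
# then normalise to a posterior over classes.
correct = p_joint_1 * p_joint_2 / prior
correct /= correct.sum(axis=1, keepdims=True)

# Naive alternative built from standard outputs only:
# multiply the per-block posteriors P(y | x_i) and renormalise.
post_1 = p_joint_1 / p_joint_1.sum(axis=1, keepdims=True)
post_2 = p_joint_2 / p_joint_2.sum(axis=1, keepdims=True)
stacked = post_1 * post_2
stacked /= stacked.sum(axis=1, keepdims=True)

# With this prior, the two rules pick different classes:
# the naive product over-counts the prior by one factor per block.
```

The two rules coincide only for a uniform prior, which is why the meta-estimator must work on the joint log-probabilities exposed by predict_joint_log_proba.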

@avm19 avm19 marked this pull request as draft February 22, 2022 09:41
@avm19 avm19 marked this pull request as ready for review February 23, 2022 09:18
avm19 commented Feb 27, 2022

This PR needs attention.

avm19 commented Mar 1, 2022

@jnothman Would you be able to review this PR or advise on how to proceed to get it merged in the foreseeable future? I am asking you because you gave it a thumbs-up. Thanks!

avm19 added 14 commits January 3, 2023 15:15
1. estimator_checks.py can now build an instance of ColumnwiseNB
2. _select_half_first|second in naive_bayes.py are used as
  column selectors by the aforementioned instance.
3. X is converted to numpy ndarray, unless X is pandas DataFrame.
  This is to permit column indexing by str in pandas DataFrame,
  while passing tests when X is list of lists etc.
  Previously, I avoided any conversion of X.
4. The meta-estimator now does _check_n_features on predict.
  Previously, it was allowed to predict on more features than seen
  while fitting, because not all features could be selected.
Tests in the CI pipeline return an error:
"Different tests were collected between gw0 and gw1"
For details and a similar situation see: scikit-learn#18811 (comment)
@avm19 avm19 requested review from glemaitre and removed request for jnothman January 7, 2023 03:37
avm19 commented Jan 14, 2023

@glemaitre Please see the changes I made some time ago and my responses to your comments.

(That's not to hurry things up; I am just not sure what happens when I press the "re-request review" button. Do you receive an email or something?)

@lorentzenchr (Member) commented:
@scikit-learn/core-devs ping
This is a reasonable PR with an author who usually responds quickly, and it already has one approving review. If someone has time for a dedicated second review, it would be highly appreciated.
I myself do not feel competent for the type of estimators considered here.

@jjerphan (Member) commented:
Hi @lorentzenchr and @avm19, I unfortunately have other Pull Requests that I started reviewing and might better focus my time on finishing reviewing those Pull Requests first. I hope you understand.

Successfully merging this pull request may close these issues.

Naive Bayes Classifier with Mixed Bernoulli/Gaussian Models
9 participants