The Wayback Machine - https://web.archive.org/web/20201016061045/https://github.com/scikit-learn/scikit-learn/issues/15761
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix documentation of default values in all classes #15761

Open
cgsavard opened this issue Dec 2, 2019 · 117 comments
Open

Fix documentation of default values in all classes #15761

cgsavard opened this issue Dec 2, 2019 · 117 comments

Comments

@cgsavard
Copy link
Contributor

@cgsavard cgsavard commented Dec 2, 2019

Description

The documentation of default values in many classes is either not included, inconsistent in how it is written, or out-of-date. I would like to gather a few people to work on the default value documentation for every class as there are a ton of classes where these issues exist. I have been told that the default values should be documented as "default=<'value'>" and so I am creating this issue under that assumption.

Solution

Here are a few things that I have seen for the parameters which should be changed:

  • no mention of whether there is a default should be checked against code as a few parameters are missing this entirely
  • "optional" should be changed to "default=<'value'>"
  • make sure how the default values are documented is consistent within the class, i.e. change everything to the format "default=<'value'>"
  • Modify a single file per PR

If a few people work on a few classes each, then this should be done in no time! These should all be fairly simple fixes.

Examples

https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
The link above is an example where default values are not indicated but the parameters say "optional", and where those with default values indicated are all inconsistently documented.

@vachanda
Copy link
Contributor

@vachanda vachanda commented Dec 3, 2019

Hello @cgsavard, I would like to work on this. Can I start looking at the AgglomerativeClustering class?

@cgsavard
Copy link
Contributor Author

@cgsavard cgsavard commented Dec 3, 2019

@vachanda Go for it! We can continue posting here which ones we work on so that others know.

@adrinjalali
Copy link
Member

@adrinjalali adrinjalali commented Dec 3, 2019

Thanks for coordinating this @cgsavard

Note to contributors: please follow the guidelines under: https://scikit-learn.org/stable/developers/contributing.html#guidelines-for-writing-documentation

@vachanda
Copy link
Contributor

@vachanda vachanda commented Dec 4, 2019

@cgsavard, Is there a list of classes that have discrepancies or do we have to go through each of them and update them?

@cgsavard
Copy link
Contributor Author

@cgsavard cgsavard commented Dec 4, 2019

@vachanda I do not have a list, unfortunately. I have just been going through the files and seeing what needs to be updated.

@cgsavard
Copy link
Contributor Author

@cgsavard cgsavard commented Dec 4, 2019

I am working on AffinityPropagation, SpectralCoclustering, SpectralBiclustering, and Birch.

@vachanda
Copy link
Contributor

@vachanda vachanda commented Dec 6, 2019

I am working on FeatureAgglomeration, KMeans, and MiniBatchKMeans.

@jmwoloso
Copy link
Contributor

@jmwoloso jmwoloso commented Dec 6, 2019

Logically speaking, if a param is optional, shouldn't the default be None always? Having a parameter with a default value other than None suggests it should be required.

If there is a default, this usually means that the literature has found this to be a sensible default value which also suggests that this parameter has an impact on performance and thus it shouldn't be optional, but should just mention what the default is. Those seem closer to required parameters by definition, we just happened to make a sensible choice for the user so they can change it or not.

Or more practically speaking, are there currently any optional parameters that we've found which have numeric default values, but for which specifying None will raise an exception? That would also suggest that the parameter is actually required, but that a sensible default has been chosen based on literature/research.

Or maybe I've been confusing the meaning of required and optional all these years? Lol. Would definitely love to help on this either way!

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Dec 6, 2019

@jmwoloso We were really inconsistent regarding the usage of optional and therefore we recently decided to remove it.

@cyrus303
Copy link

@cyrus303 cyrus303 commented Dec 6, 2019

i want to contribute as well. can i go ahead with this

@jmwoloso
Copy link
Contributor

@jmwoloso jmwoloso commented Dec 6, 2019

@glemaitre ok, that definitely makes sense. so then we're removing the optional verbage all together, right, while also noting default values in the doc strings?

should each of these that we find be opened as an issue separately or how are we staging all of this work that we're doing since multiple people are working on multiple things relating to this single issue?

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Dec 6, 2019

@cyrus303 @jmwoloso You can get a class (a module maximum) and correct it. The idea is to remove the optional and add a default when there is one (there is usually one). Since we are touching the documentation, we should make sure that the style on the line follow our new style guide: https://scikit-learn.org/dev/developers/contributing.html#guidelines-for-writing-documentation

You can mention which of the class/module you are working, open a link a PR to avoid duplicate effort :). Looking forward to reviewing it.

@alfaro96
Copy link
Member

@alfaro96 alfaro96 commented Dec 11, 2019

Hey! I will work on tree classes (tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, tree.ExtraTreeClassifier and tree.ExtraTreeRegressor).

@alfaro96
Copy link
Member

@alfaro96 alfaro96 commented Dec 17, 2019

I will also fix this issue for the neighbors module.

@jmwoloso
Copy link
Contributor

@jmwoloso jmwoloso commented Dec 21, 2019

I'll take the ensemble module.

@jmwoloso
Copy link
Contributor

@jmwoloso jmwoloso commented Dec 21, 2019

@glemaitre any preference on bool vs. boolean? seeing a mix of both in ensemble, even in the same class. might as well get those in shape while i'm doing defaults.

EDIT:

ditto for int vs integer. I'm assuming int on that one, but wanted to confirm.

EDIT (again):

also seeing docstrings with inconsistent values relative to the __init__ signature for that class, e.g.:

min_impurity_split for RandomForestClassifier

the __init__ signature has min_impurity_split=None while the docstrings for it say min_impurity_split : float, (default=0). I would assume update the docstrings to match the signature since we'd want to keep the behavior of the class consistent (i.e. we want the same defaults passed in upon instantiation)?

@glemaitre
Copy link
Contributor

@glemaitre glemaitre commented Dec 21, 2019

@jmwoloso Could you refer to https://scikit-learn.org/stable/developers/contributing.html#guidelines-for-writing-documentation. Basically you should default to the python type name (bool, str, int, float)

the init signature has min_impurity_split=None while the docstrings for it say min_impurity_split : float, (default=0). I would assume update the docstrings to match the signature since we'd want to keep the behavior of the class consistent (i.e. we want the same defaults passed in upon instantiation)?

We should match the parameter in the function signature. This value default parameter has changed and the docstring was not updated.

@mghah
Copy link
Contributor

@mghah mghah commented Dec 21, 2019

Hi @cgsavard , I'd like to contribute but this is going to be my first time so need some hand holding. I'm quite familiar with python, somewhat handy with text editors and recently went through the fork -> clone -> edit -> PR workflow tutorial here. Please advise next step... Thank you!

@pulkitmehtawork
Copy link
Contributor

@pulkitmehtawork pulkitmehtawork commented Dec 22, 2019

Hi @cgsavard ,
Can I please work on Imputer ?

@boricles
Copy link
Contributor

@boricles boricles commented Aug 30, 2020

Hi, I am new, and I would like to start contributing. Do you still need some help on this issue? is there any file you still need help with?

@alfaro96
Copy link
Member

@alfaro96 alfaro96 commented Aug 31, 2020

Hey @boricles!

Have a look to #15761 (comment) for a list with the modules still to be fixed.

@boricles
Copy link
Contributor

@boricles boricles commented Aug 31, 2020

@alfaro96 thanks. I did a quick look just now. I will select a module tonight, and work on it.

@boricles
Copy link
Contributor

@boricles boricles commented Aug 31, 2020

I am working on sklearn/config_context

@madprogramer
Copy link

@madprogramer madprogramer commented Sep 5, 2020

Hey, thought I'd see if I could help with the docs.

@alfaro96 I'd like to work on sklearn.feature_extraction.text.CountVectorizer, if it hasn't already been taken, especially because I've personally encountered some pitfalls when working with Vectorizers in the past.

Also, I noticed that although sklearn.model_selection.learning_curve was updated, there's an out-of-date tutorial using the old documentation, should I leave it be? Or is it worth updating?

@haiatn
Copy link
Contributor

@haiatn haiatn commented Sep 5, 2020

Hi @alfaro96,

after edits:
I see sklearn.config_context and sklearn.set_config from sklearn.config_config.py were fixed so it can be checked out from the task list.

I would like to work on sklearn.utils. I saw only once instance of parameter documentation where 'optional' is used in. That means that I need to fix only that instance, correct? It is in sklearn.utils._mocking.py

@alfaro96
Copy link
Member

@alfaro96 alfaro96 commented Sep 6, 2020

Hey, thought I'd see if I could help with the docs.

Hey @madprogramer,

@alfaro96 I'd like to work on sklearn.feature_extraction.text.CountVectorizer, if it hasn't already been taken, especially because I've personally encountered some pitfalls when working with Vectorizers in the past.

I have had a look to the checklist and sklearn.feature_extraction.text.CountVectorizer reference and it does not seem to be fixed. A PR would be welcome.

Edit: The sklearn.feature_extraction.text.CountVectorizer is already fixed.

Also, I noticed that although sklearn.model_selection.learning_curve was updated, there's an out-of-date tutorial using the old documentation, should I leave it be? Or is it worth updating?

It is worth it updating, although this should be done in a separate PR.

Thank you!

@alfaro96
Copy link
Member

@alfaro96 alfaro96 commented Sep 6, 2020

Hi @alfaro96,

Hey @haiatn,

after edits:
I see sklearn.config_context and sklearn.set_config from sklearn.config_config.py were fixed so it can be checked out from the task list.

I have updated the checklist.

I would like to work on sklearn.utils. I saw only once instance of parameter documentation where 'optional' is used in. That means that I need to fix only that instance, correct? It is in sklearn.utils._mocking.py

That is the idea, although the classes in the sklearn.utils._mocking.py file are not part of the public API, so I do not think that it is worthy to update them.

Nevertheless, it would be nice if you could work in any of the other functions, classes and modules that are pending to be fixed.

Thank you!

@haiatn
Copy link
Contributor

@haiatn haiatn commented Sep 6, 2020

I looked at the checklist. From what I saw the following can be checked from the checklist:

  • sklearn.feature_extraction.image.img_to_graph
  • sklearn.isotonic.IsotonicRegression
  • sklearn.isotonic.check_increasing
  • I did not find the file sklearn.ensemble.HistGradientBoostingRegressor but all of sklearn.ensemble is OK

Can I work on sklearn.manifold._spectral_embedding and sklearn.feature_extraction.text.HashVectorizer? I will do it in seperate PR. I think they are the only files left that need fixing (assuming sklearn.feature_extraction.text.CountVectorizer is taken).

@alfaro96
Copy link
Member

@alfaro96 alfaro96 commented Sep 7, 2020

I looked at the checklist. From what I saw the following can be checked from the checklist:

  • sklearn.feature_extraction.image.img_to_graph
  • sklearn.isotonic.IsotonicRegression
  • sklearn.isotonic.check_increasing

Thank you @haiatn, I have updated the checklist.

  • I did not find the file sklearn.ensemble.HistGradientBoostingRegressor but all of sklearn.ensemble is OK

The sklearn.ensemble.HistGradientBoostingClassifier and sklearn.ensemble.HistGradientBoostingRegressor are in this file: scikit-learn/sklearn/ensemble/_hist_gradient_boosting/gradient_boosting.py. However, they have been already fixed.

Can I work on sklearn.manifold._spectral_embedding and sklearn.feature_extraction.text.HashVectorizer? I will do it in seperate PR. I think they are the only files left that need fixing (assuming sklearn.feature_extraction.text.CountVectorizer is taken).

I have had a look to the sklearn.manifold module and sklearn.feature_extraction.text.HashingVectorizer and they have been already fixed (I have updated the checklist accordingly).

Nevertheless, there are several functions in the sklearn.utils module that should be still fixed.

Thank you @haiatn, we really appreciate your help!

@haiatn
Copy link
Contributor

@haiatn haiatn commented Sep 11, 2020

I will now work on sklearn.utils._estimator_html_repr, sklearn.utils.deprecation and sklearn.utils._testing

@haiatn
Copy link
Contributor

@haiatn haiatn commented Sep 15, 2020

I will finish sklearn.utils. There are only 3 files I found that need fixing.

@haiatn
Copy link
Contributor

@haiatn haiatn commented Sep 18, 2020

hey @alfaro96 ,
could you review my open pull requests? I think they are the last ones.
#18360 #18385 #18386

@alfaro96
Copy link
Member

@alfaro96 alfaro96 commented Sep 20, 2020

Hey @haiatn!

I have already have a look to your open PRs.

Thank you!

@haiatn
Copy link
Contributor

@haiatn haiatn commented Sep 24, 2020

Now that we merged what's left off sklearn.utils and it was the last on the checklist, did we finish?

@cmarmo
Copy link
Member

@cmarmo cmarmo commented Sep 24, 2020

There is one last open pull request #18025, then this issue could be eventually closed.

@mynkdsi1011
Copy link

@mynkdsi1011 mynkdsi1011 commented Sep 24, 2020

Hello,
I want to start contributing. Is there any class pending for fixing doc of default values? If any then I can take it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.