[WIP] sample props (proposal 4) #16079
Conversation
Yay! I've been trying to find more time to work on the SLEP, but my break
was consumed visiting family. Thanks
Here's an example with a pipeline. @jnothman would you please check if it's [almost] what you had in mind? @NicolasHug you were interested in this one as well IIRC.

import numpy as np
from sklearn.model_selection import GroupKFold, cross_validate
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import make_scorer
from sklearn.datasets import make_classification

X, y = make_classification()
sw = np.random.rand(len(X))
sw2 = [5, 6]
brand = ['my brand']

from sklearn.base import BaseEstimator, ClassifierMixin, TransformerMixin
from sklearn.utils.validation import _validate_required_props
from sklearn.svm import SVC


class MyEst(ClassifierMixin, BaseEstimator):
    def __init__(self):
        self._props_request = {'fit': ['sample_weight', 'brand']}

    def fit(self, X, y, **fit_params):
        _validate_required_props(self, fit_params, 'fit')
        self._estimator = SVC().fit(X, y, fit_params['sample_weight'])
        print("{} received params {} of lengths {}".format(
            self.__class__.__name__,
            list(fit_params.keys()),
            [len(v) for v in fit_params.values()]))
        return self


class MyTrs(TransformerMixin, BaseEstimator):
    def __init__(self):
        self._props_request = {'fit': ['sample_weight']}

    def fit(self, X, y=None, **fit_params):
        _validate_required_props(self, fit_params, 'fit')
        self._estimator = SelectKBest().fit(X, y)
        print("{} received params {} of lengths {}".format(
            self.__class__.__name__,
            list(fit_params.keys()),
            [len(v) for v in fit_params.values()]))
        return self

    def transform(self, X, y=None):
        return self._estimator.transform(X)


clf = make_pipeline(MyTrs(), MyEst())
clf.fit(X, y, sample_weight=sw, brand=brand)

trs = MyTrs().set_props_request(None).set_props_request(
    {'fit': {'new_param': 'my_sw'}})
clf = make_pipeline(trs, MyEst())
clf.fit(X, y, sample_weight=sw, brand=brand, my_sw=sw2)
clf.fit(X, y, brand=brand)
That example looks nice! I think I had not envisaged that all requested props would be required, i.e. an estimator could request sample_weight by default but only receive it if available. What are your thoughts on that? What I'm not sure about is whether we need to specify that the request is for fit; I suppose that makes things more flexible. I have looked at your modification to clone. Okay. But I've not yet looked at much else.
I also first thought of having it as "please give me that prop if you have it" rather than "I really need it". The reason I have it as a mandatory requirement is that it removes the chance of a meta-estimator not passing a required prop by mistake. My gut feeling was that it removes some of those bugs.

I think your other proposals had a more flexible system for routing the props around, especially when it comes to the fit/score/split distinction. Also, I thought this makes it easier to have some props going to transform/predict as well.
But this would be a mistake in the meta-estimator implementation, which should not be the user's problem?
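For concreteness, here is a rough sketch of the mandatory-requirement check being debated; this is a guess at the shape of a helper like _validate_required_props from the example above, not the PR's actual implementation:

# Sketch only: a guess at the "if requested, you must pass it" rule.
# Assumes the estimator exposes get_props_request() as in this PR.
def _validate_required_props(estimator, given_props, method):
    requested = estimator.get_props_request().get(method, [])
    missing = set(requested) - set(given_props)
    if missing:
        raise ValueError(
            "{}.{} requested props {} which were not provided".format(
                estimator.__class__.__name__, method, sorted(missing)))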
This looks good! Edit: my previous comment was irrelevant after I finally read all of #4497 and SLEP006. My initial conception (and a hacky implementation in a personal project) looked very similar to the single-argument, no-routing approach. Now for some comments (sorry if these are still uninformed):

I guess my main question/point is: can this be done in two phases? Phase 1 would be implementing the basic infrastructure (i.e. changing a lot of signatures throughout the API) so that … That said, how can I be of use to get this done?
This is good, but it needs work :)
@@ -257,6 +258,24 @@ def _log_message(self, step_idx):
                    len(self.steps),
                    name)

    def get_props_request(self):
jnothman (Member), Jun 22, 2020:
There will be cases where the meta-estimator requests a prop, such as sample_weight, regardless of whether its descendant estimators also request props. The implementation here implies that the caller of get_props_request always wants the "deep" props of its descendants. Do we have use cases that would want us to be able to inspect only the locally consumed props?
adrinjalali (Author, Member), Jul 3, 2020:
I can't think of such a case off the top of my head, but if we do at some point, we can add a deep parameter here.
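If such a use case does show up, a deep flag could mirror get_params(deep=...); a hypothetical sketch for a Pipeline, where the names and merge logic are made up here for illustration:

# Hypothetical sketch of a ``deep`` flag on Pipeline.get_props_request,
# mirroring get_params(deep=...). Not part of this PR.
def get_props_request(self, deep=True):
    # copy the locally consumed props so the originals aren't mutated
    request = {method: list(props)
               for method, props in getattr(self, '_props_request', {}).items()}
    if deep:
        for name, step in self.steps:
            for method, props in step.get_props_request().items():
                merged = request.setdefault(method, [])
                merged.extend(p for p in props if p not in merged)
    return request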
Happy to work through this more or assist with the changes.
I think the consensus at the last meeting was to rename the prop request to a metadata request.

How is this going? Is there something I can help with here?
The code is now a bit simpler, tests on … I need to support the old behavior in grid search and pipeline somehow as well.
This is getting there, at least tests pass.
Not sure if we ever made a decision on it, but the PR as is has scorers request sample_weight by default.
I don't think we did. I think this is dangerous if estimators don't also request sample_weight by default. It is dangerous both because of the inconsistency and because if nothing requests sample_weight by default, an error will be raised; but if scorers request it by default, then a user passing sample_weight to a cross-validation will silently get unexpected behavior (weighted scoring, unweighted fitting)... but I think estimators requesting sample weight by default is dangerous because they may change over versions and this is not reflected in the code. We certainly should consider how to give users confidence that their code is doing what they think it should... a debugging tool, for instance.
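To make the (weighted scoring, unweighted fitting) hazard concrete, here is a tiny self-contained illustration, independent of any routing API, of how much silently applying or dropping weights can change a score:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
w = np.array([1., 10., 1., 1.])
# the same predictions score very differently with and without weights
print(accuracy_score(y_true, y_pred))                   # 0.75
print(accuracy_score(y_true, y_pred, sample_weight=w))  # ~0.23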
This already looks pretty good. I hope I can help move this forward. After a first partial pass, I have the following questions:
        return self


class BaseEstimator(_MetadataRequest):
lorentzenchr (Contributor), Dec 5, 2020:
As BaseEstimator inherits from _MetadataRequest, we should (later) mention in the docs what's required to roll out your own estimator without inheriting from BaseEstimator; see also the discussions around #18797 (comment).
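Until that's documented, here is a rough sketch of what such an estimator might need to provide; the exact protocol is an assumption based on this WIP PR and may well change:

# Sketch: a third-party estimator that does not inherit from
# BaseEstimator but duck-types the interfaces used in this PR.
class MyThirdPartyEstimator:
    def __init__(self):
        self._props_request = {'fit': ['sample_weight']}

    def get_props_request(self):
        return self._props_request

    def get_params(self, deep=True):
        return {}

    def set_params(self, **params):
        return self

    def fit(self, X, y=None, sample_weight=None):
        return self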
def _empty_metadata_request():
    return Bunch(fit={},
                 predict={},
                 transform={},
                 score={},
                 split={},
                 inverse_transform={})
lorentzenchr (Contributor), Dec 5, 2020:
This defines all the methods which can request meta-data. That is worth a comment, I think.
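For instance, such a comment might read (a sketch only; the Bunch import is added to keep this self-contained):

from sklearn.utils import Bunch

def _empty_metadata_request():
    # The keys below enumerate every estimator method which can request
    # and receive metadata; each starts with an empty request.
    return Bunch(fit={},
                 predict={},
                 transform={},
                 score={},
                 split={},
                 inverse_transform={})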
meta-estimators such as ``Pipeline`` and ``GridSearchCV``. In order to pass a
metadata to a method such as ``fit`` or ``score``, the object accepting the
lorentzenchr (Contributor), Dec 5, 2020:
Suggested change:
-meta-estimators such as ``Pipeline`` and ``GridSearchCV``. In order to pass a
-metadata to a method such as ``fit`` or ``score``, the object accepting the
+meta-estimators such as ``Pipeline`` and ``GridSearchCV``. In order to pass
+metadata to a method such as ``fit`` or ``score``, the object accepting the
>>> from sklearn.model_selection import GroupKFold
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.pipeline import make_pipeline
>>> N, M = 100, 4
lorentzenchr (Contributor), Dec 5, 2020:
Suggested change:
->>> N, M = 100, 4
+>>> n_samples, n_features = 100, 4
Usage Examples
**************
Here we present a few examples to show different common use cases. The examples
in this section require the following imports and data:
lorentzenchr (Contributor), Dec 5, 2020:
Suggested change:
-in this section require the following imports and data:
+in this section require the following imports and data::
Despite make_scorer and LogisticRegressionCV both expecting a key
sample_weight, we can use aliases to pass different weights to different
lorentzenchr (Contributor), Dec 5, 2020:
Suggested change:
-Despite make_scorer and LogisticRegressionCV both expecting a key
-sample_weight, we can use aliases to pass different weights to different
+Despite ``make_scorer`` and ``LogisticRegressionCV`` both expecting a key
+``"sample_weight"``, we can use aliases to pass different weights to different
request_props : list of strings, or dict of {str: str}, default=None
    A list of required properties, or a mapping of the form
    ``{provided_metadata: required_metadata}``, or None.
lorentzenchr (Contributor), Dec 5, 2020:
Suggested change:
-``{provided_metadata: required_metadata}``, or None.
+{"provided_metadata": "required_metadata"}, or None.
[tool.black]
line-length = 79
lorentzenchr (Contributor), Dec 5, 2020:
Not related to this PR. Could you please remove? Fortunately, we will have black with another PR.
@@ -820,13 +822,14 @@ def grad(x, *args): return _logistic_loss_and_grad(x, *args)[1]

 # helper function for LogisticCV
-def _log_reg_scoring_path(X, y, train, test, pos_class=None, Cs=10,
+def _log_reg_scoring_path(X, y, train, test, *, pos_class=None, Cs=10,
lorentzenchr (Contributor), Dec 5, 2020:
Not related to this PR, but I like it 👍
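For readers unfamiliar with the bare *: it makes every parameter after it keyword-only, as this generic demo shows:

def f(a, b, *, c=0):
    return a + b + c

f(1, 2, c=3)  # OK
# f(1, 2, 3)  # TypeError: f() takes 2 positional arguments but 3 were given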
@@ -1844,7 +1852,21 @@ def fit(self, X, y, sample_weight=None):

        # init cross-validation generator
        cv = check_cv(self.cv, y, classifier=True)
        folds = list(cv.split(X, y))
        scorer = get_scorer(self.scoring)
        _params = build_method_metadata_params(
lorentzenchr (Contributor), Dec 5, 2020:
Why the leading underscore? Same for _cv_params and _score_params.
I'd propose to start a vote on SLEP006 before finishing this PR, for example once 2 reviewers say it is mature enough for a vote. Reasoning: from a project management point of view, it doesn't make sense to provide a full implementation before the official go-ahead.
We had a call with @jnothman, @lorentzenchr, @agramfort and me; here's a summary of the discussion.

Main Open Questions

1. Routing: does the pipeline pass everything around, or only the requested props?
2. Validation: do estimators validate the given props against the requested ones? Do meta-estimators validate whether the given props are requested or not?
3. Parameters: do estimators accept only …
4. List of params or dict: we've been passing a list of params (…
5. Default requested props

Notes

- make sure things that don't accept a parameter also don't have the method allowing users to request that prop (e.g. …
- what to do with meta-estimators which are also consumers
- Can it be experimental?

The current proposal makes existing code break rather than silently change behavior. We also need to add code snippets to the SLEP, explain what's expected there, and expand on the user experience of the API.

Takeaway Message

"If requested, you shall pass."
Taking a stab at proposal 4 of the sample props proposal (SLEP006: Routing sample-aligned meta-data).
Clear WIP, just putting it here for now for us to see how it'd look.
The first change I've made compared to the proposal is that the dict
storing the required props is keyed by the method that requires them.
This way fit, predict, and score can all have different requirements.
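Concretely, the request dict then looks like this (the fit entry matches the example above; the score entry is illustrative):

props_request = {
    'fit': ['sample_weight', 'brand'],  # props required by fit
    'score': ['sample_weight'],         # props required by score
}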
You can see the SLEP examples implemented in test_props.py as test_slep_case* functions.

Closes #9566, #15425, #3524,
Fixes #4497, #7136, #13432, #4632, #7646 (add test), #8127 (add test), #8158,
Enables #6322,
TODO:
#12052, #8710, #15282, #2630, #8950, #11429, #18028