The Wayback Machine - https://web.archive.org/web/20250614204615/https://github.com/scikit-learn/scikit-learn/pull/19498

[MRG] Fix Improvement in RANSAC: is_data_valid should receive subset_idxs #19498


Open
wants to merge 3 commits into base: main

Conversation

alex90r7

@alex90r7 alex90r7 commented Feb 19, 2021

Reference Issues/PRs

Fixes #19495

What does this implement/fix? Explain your changes.

Any other comments?

@alex90r7 alex90r7 changed the title is_data_valid can receive the subset_idxs. Backward compatible Fix Improvement in RANSAC: is_data_valid should receive subset_idxs Feb 22, 2021
@alex90r7 alex90r7 changed the title Fix Improvement in RANSAC: is_data_valid should receive subset_idxs [MRG] Fix Improvement in RANSAC: is_data_valid should receive subset_idxs Feb 22, 2021
@glemaitre
Member

This solution is buggy because your subset of indices is based on the X given to fit, not on the subset X_subset used in each RANSAC iteration.

One possibility is to propagate sample_weight so that it becomes an input of is_data_valid; a weight of zero could then be detected inside the callable, which would then not consider that specific data point.

However, I don't understand why these samples should be included in the original X.
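A minimal sketch of the sample_weight idea above (names and signature are illustrative, not the scikit-learn API): a validity callable that rejects any random subset containing a zero-weight sample.

```python
import numpy as np

# Sketch only: propagate per-sample weights into the validity check so that
# a weight of zero marks a point the callable should reject.
def make_is_data_valid():
    def is_data_valid(X_subset, y_subset, weight_subset):
        # Reject the random subset if it contains any zero-weight sample.
        if np.any(weight_subset == 0):
            return False
        return True
    return is_data_valid

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = rng.normal(size=10)
w = np.ones(10)
w[3] = 0.0  # mark sample 3 as "do not use"

check = make_is_data_valid()
subset = np.array([0, 3, 5])                   # a hypothetical RANSAC draw
print(check(X[subset], y[subset], w[subset]))  # subset contains sample 3 → False
```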

@alex90r7
Author

Please check my comment in the issue.
Also check the paper I mentioned:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6856449
section II C:
"In each RANSAC iteration, three targets, containing at least one target of each sensor are chosen randomly, providing a unique solution for the velocity profile. "

@glemaitre
Member

So in the end, it is not that you want to reject samples but you want to enforce some constraints on the resampling.
To me, it looks like what we do in cross-validation with the grouping strategies.

I would think that we could add a new parameter (e.g. sampling_strategy) that defaults to random sampling without replacement. However, we could pass another strategy based on some group sampling, where the groups could be provided at fit as an array.
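The group-sampling idea could look roughly like the sketch below: draw min_samples indices while guaranteeing at least one index from every group (e.g. one target per sensor, as in the radar paper). The function name and signature are hypothetical, not an agreed design.

```python
import numpy as np

# Hypothetical group-aware sampler: one mandatory index per group, then fill
# the remaining slots uniformly without replacement.
def sample_with_groups(groups, min_samples, rng):
    groups = np.asarray(groups)
    unique = np.unique(groups)
    if min_samples < len(unique):
        raise ValueError("min_samples must cover every group")
    # One mandatory index per group...
    chosen = [rng.choice(np.flatnonzero(groups == g)) for g in unique]
    # ...then fill the rest uniformly from the remaining samples.
    remaining = np.setdiff1d(np.arange(len(groups)), chosen)
    extra = rng.choice(remaining, size=min_samples - len(chosen), replace=False)
    return np.concatenate([chosen, extra]).astype(int)

rng = np.random.default_rng(42)
groups = np.array([0, 0, 1, 1, 2, 2])  # two samples from each of three sensors
idx = sample_with_groups(groups, min_samples=3, rng=rng)
print(sorted(groups[idx].tolist()))    # one sample per group → [0, 1, 2]
```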

@alex90r7
Author

alex90r7 commented Aug 2, 2021

Yeap, now you got it.
Let me sum up:
I) A new optional argument called "sampling_strategy" will be given to fit.
II) If sampling_strategy is None, then the current sampling will be used:

    # choose random sample set
    subset_idxs = sample_without_replacement(n_samples, min_samples, random_state=random_state)
III) If sampling_strategy is not None, you suggested that:
"we could pass another strategy based on some group sampling where the groups could be provided at fit as an array."
Not sure if that is generic enough for all sampling_strategies. Why not pass a function that will apply the strategy, like the "is_data_valid"?
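The callable version of sampling_strategy might have a shape like the sketch below, mirroring how is_data_valid is already a user-supplied callable. The signature is a guess for illustration only, not a proposed scikit-learn interface.

```python
import numpy as np

# Hypothetical callable sampling_strategy: given the problem sizes, return
# the indices of the next random subset.
def my_sampling_strategy(n_samples, min_samples, random_state):
    rng = np.random.default_rng(random_state)
    # Example policy: always include sample 0, fill the rest at random.
    rest = rng.choice(np.arange(1, n_samples), size=min_samples - 1, replace=False)
    return np.concatenate([[0], rest])

idx = my_sampling_strategy(n_samples=10, min_samples=3, random_state=0)
print(0 in idx)  # → True
```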

@glemaitre
Member

The grouping sampling could be an option and we could also accept some callables.

@alex90r7
Author

alex90r7 commented Aug 2, 2021

That sounds nice. If you think that this interface is better, then let's do that.
Could you please provide a one-liner with the signature of the fit method? Then I could take it from there.
For example:
default case: fit(self, X, y, sample_weight=None, sampling_strategy=None)
non-default case: fit(self, X, y, sample_weight=None, sampling_strategy=callable)?
non-default case: fit(self, X, y, sample_weight=None, sampling_strategy=dict)?

@glemaitre
Member

The contract of scikit-learn estimators is that fit should not take parameters apart from sample properties (such as groups or sample_weight).

@alex90r7
Author

alex90r7 commented Aug 2, 2021

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV.fit

I think that this is not similar. I would personally go for an extra attribute in the __init__ of the class that would define the sampling strategy.
The signature of the fit method would remain the same: fit(self, X, y, sample_weight=None)

@glemaitre
Member

I think that this is not similar. I would personally go for an extra attribute in the init of the class that would define sampling strategy. The signature of the fit method would remain the same fit(self, X, y, sample_weight=None)

For sure, you will need to add the parameter in __init__. However, you need to pass along the extra information required by the sampling. The issue with a callable is that you need to provide a **kwargs to __init__ as well if you want to be flexible. It might start to be YAGNI. Limiting sampling_strategy to a few named options could be user-friendly enough and could answer most of the needs.

Basically, what would be interesting is to know what other types of sampling one might need.
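The split discussed above — strategy in __init__, group labels passed to fit as a sample property next to sample_weight — can be sketched as follows. The class and parameter names are hypothetical, not scikit-learn code.

```python
# Illustrative sketch only: the strategy lives in __init__, while the
# per-sample group labels travel through fit as a sample property.
class RANSACSketch:
    def __init__(self, sampling_strategy=None):
        # None    -> plain random sampling without replacement,
        # "group" -> draw at least one sample from every group.
        self.sampling_strategy = sampling_strategy

    def fit(self, X, y, sample_weight=None, groups=None):
        if self.sampling_strategy == "group" and groups is None:
            raise ValueError("sampling_strategy='group' requires groups at fit")
        # ... the resampling loop would consult self.sampling_strategy here ...
        return self
```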


Successfully merging this pull request may close these issues.

Improvement in RANSAC: is_data_valid should receive subset_idxs
2 participants