
FIX adapt epsilon value depending on the dtype of the input #24354

Merged
merged 36 commits into scikit-learn:main from Safikh:logloss_float32_fix on Nov 10, 2022

Conversation

@Safikh (Contributor) commented Sep 4, 2022

Reference Issues/PRs

Fixes #24315

What does this implement/fix? Explain your changes.

Change the default epsilon value in log_loss from 1e-15 to "auto", which is the machine epsilon of y_pred's dtype when y_pred is a NumPy float array; otherwise it defaults to 1e-15 as before.
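
A minimal sketch of the new behavior (assuming scikit-learn 1.2, where `eps="auto"` is available; the values here are illustrative, not from the PR):

import numpy as np
from sklearn.metrics import log_loss

y_true = [0, 1, 1]
# A float32 prediction of exactly 1.0: with the old fixed eps=1e-15,
# 1 - 1e-15 rounds back to exactly 1.0 in float32, which could produce
# nan (see #24315).
y_pred = np.array([0.1, 0.9, 1.0], dtype=np.float32)

# With eps="auto" the clipping threshold becomes np.finfo(np.float32).eps
# (~1.19e-7), which float32 can represent, so the loss stays finite.
print(log_loss(y_true, y_pred, eps="auto"))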

Any other comments?

@glemaitre changed the title from "Change default epsilon in logloss metric from 1e-15 to 1e-7" to "FIX adapt epsilon value depending on the dtype of the input" on Sep 5, 2022
@Safikh (author) commented Sep 5, 2022

@glemaitre Made the changes as per your comment.

@glemaitre (Contributor) left a comment

Note to other maintainers: we should not forget to add @gsiisg as co-author when merging the PR.

doc/whats_new/v1.2.rst (outdated, resolved):
- |Fix| :func:`metrics.log_loss` takes "auto" as default eps value and it will be equal to
eps value of the `y_pred` if `y_pred` is numpy float else it will be 1e-15. This change was
made to be able to handle float16 and float32 numpy arrays
Contributor:
Suggested change:

Old:
- |Fix| :func:`metrics.log_loss` takes "auto" as default eps value and it will be equal to
eps value of the `y_pred` if `y_pred` is numpy float else it will be 1e-15. This change was
made to be able to handle float16 and float32 numpy arrays

New:
- |Fix| automatically set `eps` in :func:`metrics.log_loss` depending on the input
arrays. `eps` was previously too small by default when passing lower precision
floating point arrays.

Contributor:
We should also add an entry in the "Change models" section stating that we switch from 1e-15 to np.finfo(np.float64).eps by default.
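
For reference, the machine epsilons in question (standard NumPy values; a quick sketch to show why the default should follow the dtype):

import numpy as np

# eps="auto" tracks the dtype of y_pred instead of the fixed 1e-15,
# which is smaller than the machine epsilon of float16 and float32.
for dtype in (np.float16, np.float32, np.float64):
    print(dtype.__name__, np.finfo(dtype).eps)
# float16 0.000977
# float32 1.1920929e-07
# float64 2.220446049250313e-16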

@Safikh (author):

Where is the "Change models" section in the codebase?

Resolved (outdated) review threads on:
sklearn/metrics/_classification.py (4)
sklearn/metrics/tests/test_classification.py (3)
@Micky774 (Contributor) left a comment

Thanks for the PR @Safikh! A few small notes.

# From the new test; `dtype` is a test parameter (e.g. np.float16).
y_true = [[0, 1]]
y_score = np.array([1], dtype=dtype)
loss = log_loss(y_true, y_score, eps="auto")
assert_allclose(loss, 0.000977, atol=1e-3)
@Micky774 commented Sep 6, 2022

It would be clearer if this value were computed rather than hard-coded.
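
A minimal sketch of that suggestion (assuming scikit-learn 1.2, where `eps="auto"` exists; `np.float16` stands in for the parametrized dtype):

import numpy as np
from numpy.testing import assert_allclose
from sklearn.metrics import log_loss

dtype = np.float16                 # the test parametrizes this value
eps = np.finfo(dtype).eps          # threshold applied by eps="auto"
loss = log_loss([[0, 1]], np.array([1], dtype=dtype), eps="auto")
# A prediction of 1 is clipped to 1 - eps, so the expected loss is
# -log(1 - eps) rather than a hard-coded 0.000977.
assert_allclose(loss, -np.log(1 - eps), atol=1e-3)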

sklearn/metrics/_classification.py (resolved)
- |Fix| automatically set `eps` in :func:`metrics.log_loss` depending on the input
arrays. `eps` was previously too small by default when passing lower precision
floating point arrays.
:pr:`24354` by :user:`Safiuddin Khaja <Safikh>` and
:user:`gsiisg <gsiisg>`.
Contributor:

The changelog entry should reflect that you are adding a new keyword option to eps and changing the default value of eps.

@gsiisg commented Sep 6, 2022

Big thanks to @Safikh and @glemaitre and all contributors!
This is my first time contributing to scikit-learn, so I wasn't familiar with the process and created some confusion with a second pull request; I will stick with this one from now on. One note for the "eps" comment section: every y_pred input passes through sklearn.utils.check_array, which casts plain Python int/float values to int64/float64, so even an input without an explicit dtype is treated with 64-bit precision unless something like np.float16/np.float32 is specified. At first I was worried that y_pred.dtype would error out for an input like [1], but I realized later it is cast to array([1]), which is 64-bit by default.
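
A quick demonstration of that casting behavior (a sketch; the int64 result assumes a platform where NumPy's default integer is 64-bit):

import numpy as np
from sklearn.utils import check_array

print(check_array([1], ensure_2d=False).dtype)    # int64
print(check_array([0.5], ensure_2d=False).dtype)  # float64

# An explicit low-precision dtype is preserved when it appears in the
# allowed dtype list:
x16 = np.asarray([0.5], dtype=np.float16)
print(check_array(x16, ensure_2d=False,
                  dtype=[np.float64, np.float32, np.float16]).dtype)  # float16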

@Safikh (author) commented Sep 7, 2022

I'm facing a weird bug: if I set eps to the dtype's eps, the test sklearn/metrics/tests/test_common.py::test_not_symmetric_metric fails. If I set it to anything else (not an exact multiple of eps), it passes.

@gsiisg left a comment

I think we should avoid hard-coding dtype=np.float64 in check_array(); it would blow up the memory used if the original input was np.float16/32, wouldn't it? And if y_pred is cast to 64-bit anyway, then we wouldn't have had the problem with eps=1e-15 in the first place.

@Micky774 (Contributor) commented Sep 8, 2022

> I think we should avoid hard-coding dtype=np.float64 in check_array(); it would blow up the memory used if the original input was np.float16/32, wouldn't it? And if y_pred is cast to 64-bit anyway, then we wouldn't have had the problem with eps=1e-15 in the first place.

You're right. I think it should instead be dtype=[np.float64, np.float32, np.float16], in which case check_array converts to np.float64 only when the input is not already one of those floating types. In the int{32, 64} case it will convert to np.float64, which should be fine.
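
To make the memory concern concrete (plain NumPy; the array length is arbitrary):

import numpy as np

# Upcasting float16 to float64 quadruples the memory footprint.
y16 = np.zeros(1_000_000, dtype=np.float16)
y64 = y16.astype(np.float64)
print(y16.nbytes, y64.nbytes)  # 2000000 vs 8000000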

doc/whats_new/v1.2.rst (outdated, resolved)
sklearn/metrics/_classification.py (resolved)
y_pred = check_array(
y_pred, ensure_2d=False, dtype=[np.float64, np.float32, np.float16]
)
eps = np.finfo(y_pred.dtype).eps * 1.0001 if eps == "auto" else eps
Contributor:

What is the reason for multiplying by 1.0001? This looks really arbitrary.

@Safikh (author) commented Sep 12, 2022

The test sklearn/metrics/tests/test_common.py::test_not_symmetric_metric fails for eps. I have not been able to identify why, but a very minute change to eps seems to fix it.

Contributor:

Yep, the test is not adapted for log_loss. Indeed, the loss is symmetric when the inputs contain only 0/1 values, but not otherwise. Basically, since the metric takes y_true and predicted probabilities, a symmetry test makes little sense.

We should correct this.
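
A small check of that observation (a sketch, assuming a scikit-learn version where log_loss clips predictions to [eps, 1 - eps]):

import numpy as np
from sklearn.metrics import log_loss

# With only 0/1 values the loss is symmetric: every mismatch costs
# -log(eps) and every match costs -log(1 - eps), in both directions.
a = np.array([0, 1, 1, 0])
b = np.array([1, 1, 0, 0])
print(log_loss(a, b) == log_loss(b, a))  # True

# With genuine probabilities the two arguments play different roles
# (labels vs. predicted probabilities), so a symmetry test makes little sense.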

@Safikh (author):

So, should I remove log_loss from the symmetric metrics? Then there would be no need to modify the eps?

Contributor:

Yes, we need to remove the loss from the symmetric metrics and remove this 1.0001.

@glemaitre glemaitre self-requested a review Sep 12, 2022
- |Fix| add a `"auto"` option to `eps` in :func:`metrics.log_loss`.
This option will automatically set the `eps` value depending on the data
type `y_pred`.
@Micky774 (Contributor) commented Sep 12, 2022

The wording here still needs to be more explicit.

Also, at this point I would consider this an enhancement which also happens to fix a bug. Wondering what the maintainers think.

Suggested change:

Old:
- |Fix| add a `"auto"` option to `eps` in :func:`metrics.log_loss`.
This option will automatically set the `eps` value depending on the data
type `y_pred`.

New:
- |Enhancement| Adds an `"auto"` option to `eps` in :func:`metrics.log_loss`.
This option will automatically set the `eps` value depending on the data
type of `y_pred`. In addition, the default value of `eps` is changed from
`1e-15` to the new `"auto"` option.

Contributor:

Fair enough.

@glemaitre glemaitre self-requested a review Sep 13, 2022
@Safikh (author) commented Sep 29, 2022

Hi @glemaitre, I think this PR is complete. Is there anything that needs to be changed?

@glemaitre (Contributor) left a comment

We need to remove this 1.0001.

sklearn/metrics/_classification.py (outdated, resolved)
y_pred = check_array(
y_pred, ensure_2d=False, dtype=[np.float64, np.float32, np.float16]
)
eps = np.finfo(y_pred.dtype).eps * 1.0001 if eps == "auto" else eps
Contributor:

Yes, we need to remove the loss from the symmetric metrics and remove this 1.0001.

@glemaitre (Contributor) left a comment

LGTM. Thanks @Safikh.

@Micky774 @ogrisel do you want to have a look?

@Micky774 (Contributor) left a comment

Overall looks good. I had a couple of small wording nits and a concern regarding testing the np.float16 dtype. If those are addressed, this should be ready to merge.

doc/whats_new/v1.2.rst (outdated, resolved)
sklearn/metrics/_classification.py (outdated, resolved)
sklearn/metrics/tests/test_classification.py (resolved)
sklearn/conftest.py (outdated, resolved)
@Micky774 (Contributor) left a comment

LGTM. @glemaitre feel free to merge if the changes are still acceptable to you

@glemaitre glemaitre merged commit f8986ee into scikit-learn:main Nov 10, 2022
25 checks passed
@glemaitre (Contributor):

Thanks @Safikh. Merging.

@gsiisg commented Nov 10, 2022

Thanks everyone!

@Safikh Safikh deleted the logloss_float32_fix branch Nov 11, 2022
Development

Successfully merging this pull request may close these issues.

log_loss giving nan when input is np.float32 and eps is default (#24315)