[MRG+1] Fix LOF and Isolation benchmarks #9798

Merged

merged 8 commits into scikit-learn:master on Oct 25, 2017

Conversation

4 participants
Contributor

albertcthomas commented Sep 19, 2017

What does this implement/fix? Explain your changes.

The fix for the IsolationForest benchmark merged in #8638 leads to the following:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99(subset='SA', shuffle=True, percent10=True)
X = dataset.data
y = dataset.target
print(X.dtype)  # object

lb = MultiLabelBinarizer()
x1 = lb.fit_transform(X[:, 1])

print(np.unique(X[:,1]))  # [b'icmp' b'tcp' b'udp']
print(x1.shape)  # (100655, 7)

which does not correspond to what LabelBinarizer was doing, because MultiLabelBinarizer views each string as an iterable of characters (so we obtain an array with 7 features instead of 3). The fix proposed here applies LabelBinarizer as follows:

from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
x1 = lb.fit_transform(X[:, 1].astype(str))
print(x1.shape)  # (100655, 3)
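
For the record, a minimal standalone illustration of the difference between the two binarizers on these protocol labels (independent of the benchmark data): MultiLabelBinarizer iterates over each string, so its classes end up being the individual characters, whereas LabelBinarizer treats each string as one label.

from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer

protocols = ['icmp', 'tcp', 'udp']

mlb = MultiLabelBinarizer()
print(mlb.fit_transform(protocols).shape)  # (3, 7)
print(mlb.classes_)  # ['c' 'd' 'i' 'm' 'p' 't' 'u'] -> one column per character

lb = LabelBinarizer()
print(lb.fit_transform(protocols).shape)  # (3, 3)
print(lb.classes_)  # ['icmp' 'tcp' 'udp'] -> one column per protocol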

Any other comments?

This also fixes the benchmark of LocalOutlierFactor.

EDIT: The benchmark of LocalOutlierFactor was predicting on a test set whereas LocalOutlierFactor is not meant to predict on a test set. This is fixed by doing the benchmark in an outlier detection context only.

albertcthomas added some commits Sep 19, 2017

benchmarks/bench_lof.py
@@ -101,7 +97,7 @@
fit_time = time() - tstart
tstart = time()
- scoring = -model.decision_function(X_test) # the lower, the more normal
+ scoring = -model._decision_function(X_test) # the lower, the more normal

@agramfort

agramfort Sep 19, 2017

Owner

there is no public function that allows you to do the same?

@albertcthomas

albertcthomas Sep 20, 2017

Contributor

Hm no... actually it is maybe a pity to have a benchmark for LOF that needs to predict on a test set...
I can change this benchmark to outlier detection where we train on all the dataset (without knowing the labels) and compute the ROC on this same training set (with the knowledge of the labels). This will only require LOF scores on the training set, i.e. negative_outlier_factor_. WDYT?
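
For context, a minimal sketch of the outlier detection evaluation described above, on synthetic data (the dataset here is illustrative, not the benchmark data):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
# 980 normal samples and 20 shifted samples acting as outliers
X = np.r_[rng.randn(980, 3), 4 + rng.randn(20, 3)]
y = np.r_[np.zeros(980, dtype=int), np.ones(20, dtype=int)]

# fit on the whole dataset without using the labels
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# scores are only available on the training set; the lower
# negative_outlier_factor_, the more abnormal the sample
scoring = -lof.negative_outlier_factor_
print("AUC: %.3f" % roc_auc_score(y, scoring))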

X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
- y = (y != 'normal.').astype(int)
+ y = (y != b'normal.').astype(int)

@agramfort

agramfort Sep 19, 2017

Owner

works with python2 and python3?

@albertcthomas

albertcthomas Sep 20, 2017

Contributor

yes

Contributor

albertcthomas commented Sep 20, 2017

thanks for the review @agramfort

albertcthomas added some commits Sep 20, 2017

Contributor

albertcthomas commented Sep 20, 2017

The LocalOutlierFactor benchmark is now done in an outlier detection context.
[figure: bench_lof]

Contributor

albertcthomas commented Sep 21, 2017

Concerning LOF, standardizing the datasets (using StandardScaler or RobustScaler) does not increase the performance in terms of ROC curve and AUC.
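
The kind of comparison this refers to can be sketched as follows (synthetic data, not the benchmark datasets):

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
X = np.r_[rng.randn(980, 3), 4 + rng.randn(20, 3)]
y = np.r_[np.zeros(980, dtype=int), np.ones(20, dtype=int)]

for name, scaler in [('raw', None), ('standard', StandardScaler()), ('robust', RobustScaler())]:
    X_scaled = X if scaler is None else scaler.fit_transform(X)
    lof = LocalOutlierFactor(n_neighbors=20).fit(X_scaled)
    auc = roc_auc_score(y, -lof.negative_outlier_factor_)
    print("%s: AUC = %.3f" % (name, auc))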

@agramfort agramfort changed the title from [MRG] Fix LOF and Isolation benchmarks to [MRG+1] Fix LOF and Isolation benchmarks Sep 21, 2017

Owner

agramfort commented Sep 21, 2017

LGTM +1 for MRG

Owner

ogrisel commented:

I get the following error when running the isolation forest benchmark:

====== smtp ======
--- Fetching data...
--- Vectorizing data...
----- Target count values: 
------ 0 -> 9568 occurrences
------ 1 -> 3 occurrences
----- Outlier ratio: 0.00031
--- Fitting the IsolationForest estimator...
--- Preparing the plot elements...
/home/ogrisel/code/scikit-learn/sklearn/metrics/ranking.py:547: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
  UndefinedMetricWarning)
smtp (AUC: nan, train_time= 0.44s, test_time= 0.25s)

Contributor

albertcthomas commented Sep 22, 2017

Indeed... thanks @ogrisel. We also have this warning on master. That's because there are only 3 outliers out of 9571 samples in the smtp dataset when percent10=True. Therefore there is a high chance that there is no outlier in the test set (when the model is trained on a dataset that can contain outliers). Of course this depends on the random seed set with np.random.seed at the beginning of the benchmark.

One solution would be to remove the smtp dataset from the benchmark. If percent10=False the problem is less likely to happen but may still happen depending on the random seed.

Note that we don't have this problem for the LOF benchmark anymore as we train and test on the whole dataset. We wouldn't have the error/warning if the isolation forest benchmark were run in the same way as the LOF benchmark. We wouldn't have this problem either if the isolation forest benchmark were run in a novelty detection context (train the model on a subsample of normal instances only and test it on all the other instances).

Interestingly, for the same seed in np.random.seed, if the benchmark starts with the 'smtp' dataset, the problem disappears (because the state of the RNG is not the same as when the benchmark starts with 'http'). For this reason it might be better to set the seed via the random_state parameter of the fetch_kddcup99, fetch_covtype and shuffle functions rather than setting it with np.random.seed at the beginning of the script. The warning/error should then appear (or not appear) regardless of the order of the datasets.
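
A minimal sketch of that suggestion (the exact wiring into the benchmark is illustrative; both fetchers and sklearn.utils.shuffle accept a random_state parameter):

from sklearn.datasets import fetch_covtype, fetch_kddcup99
from sklearn.utils import shuffle

random_state = 1

# each call gets its own random_state, so the shuffling no longer depends on
# the global RNG state, i.e. on the order in which the datasets are processed
sa = fetch_kddcup99(subset='SA', shuffle=True, percent10=True,
                    random_state=random_state)
cover = fetch_covtype(shuffle=True, random_state=random_state)

X, y = shuffle(sa.data, sa.target, random_state=random_state)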

Owner

agramfort commented Sep 24, 2017

albertcthomas added some commits Sep 25, 2017

Contributor

albertcthomas commented Sep 25, 2017

I clarified the randomness in both benchmarks and thus made the warning deterministic for the Isolation Forest benchmark. I also added a comment about it in the docstring.

Contributor

ngoix commented Sep 29, 2017

Performance is not convincing for LOF, even in an outlier detection context. I did not review the PR, but did you try playing with the threshold/offset parameter? And is there any reason why you changed percent10 to True? (It is not the case for the other AD benchmarks, is it?)

Contributor

albertcthomas commented Sep 29, 2017

percent10 was set to True for IsolationForest, so I also set it to True for LOF. Performance is not great either if percent10 is set to False for LOF. LOF seems to perform well on the considered datasets in a novelty detection context; however, LOF is not meant to be used in this context. Showing that LOF can perform well in an outlier detection context would require considering new datasets (see @ngoix's extensive benchmark).

Changing the threshold parameter will not affect the performance as we assess the whole scoring function with a ROC curve.

Contributor

ngoix commented Sep 29, 2017

You're right, my mistake!

Contributor

ngoix commented Sep 29, 2017

But IMO benchmarks are not examples; they are made to compare performance. To do this, we need to choose a common setting for all AD algorithms (btw, it is one of the reasons why we chose to implement predict and decision_function methods for LOF, even if private).
An interesting change on the benchmarks would be to create one file per setting (outlier/anomaly/novelty detection), each file including all AD algorithms. But this PR is probably not the place to do it.

Contributor

ngoix commented Sep 29, 2017

All in all, I think we should stick to the fix of #8638 in this PR.

benchmarks/bench_isolation_forest.py
@@ -30,15 +41,14 @@ def print_outlier_ratio(y):
print("----- Outlier ratio: %.5f" % (np.min(cnt) / len(y)))
-np.random.seed(1)
+SEED = 1

@ngoix

ngoix Sep 29, 2017

Contributor

I'm more used to seeing rng = np.random.RandomState(0), but I think it doesn't matter.

@agramfort

agramfort Oct 5, 2017

Owner

don't use a new name for something we refer to as random_state

use:

random_state = 1

and pass random_state=random_state in function calls.

Contributor

ngoix commented Sep 29, 2017

btw @albertcthomas, have you tried to benchmark with and without shuffling the data? It looks like the data are not shuffled anymore with your changes.

Contributor

albertcthomas commented Sep 30, 2017

It looks like they are shuffled when we change the SEED. For the 'http' dataset, with SEED = 1,

X[0] = [-2.3025850929940455, 5.313698468586339, 5.59508305773286]

with SEED = 12,

X[0] = [-2.3025850929940455, 5.656341400056065, 6.319148277696116]

Results of the whole benchmark with SEED = 1 for the shuffling and random_state=1 for IsolationForest:

[figure: bench_1]

and with SEED = 12 for the shuffling and random_state=1 for IsolationForest:

[figure: bench_12]

We can observe small differences in terms of AUC coming from the fact that the datasets are shuffled.

Contributor

ngoix commented Sep 30, 2017

Yes, I meant for LOF: you removed the shuffle argument, which is set to False by default I believe...

Contributor

albertcthomas commented Sep 30, 2017

The LOF benchmark is not random anymore because the model is trained and tested on the same dataset, so there is no need to shuffle the datasets. Setting shuffle to True does not change the results. I made a comment about this in the docstring of the benchmark.

Contributor

ngoix commented Sep 30, 2017

Ok, I get it now!
So WDYT about my previous comments before this big "by the way"? :)

Contributor

albertcthomas commented Oct 2, 2017

About making one file per setting (outlier/anomaly/novelty), each comparing all outlier detection estimators? I'm +1 for this in a new PR.

Contributor

ngoix commented Oct 4, 2017

Then you don't need any change in bench_lof.py...

Contributor

albertcthomas commented Oct 9, 2017

The decision to make modifications to the LOF benchmark came after this discussion.

We could use _decision_function in the LOF benchmark as we also use _decision_function for LOF in the outlier detection examples (see here). But IMO this could cause confusion and give a wrong message to the users, as in issue #9874 maybe...
@agramfort what do you think?

benchmarks/bench_lof.py
print(__doc__)
-np.random.seed(2)
+SEED = 2 # to control the random selection of anomalies in the SA dataset

@agramfort

agramfort Oct 10, 2017

Owner

SEED...

codecov bot commented Oct 10, 2017

Codecov Report

Merging #9798 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #9798   +/-   ##
=======================================
  Coverage   96.16%   96.16%           
=======================================
  Files         336      336           
  Lines       62493    62493           
=======================================
  Hits        60098    60098           
  Misses       2395     2395


Contributor

ngoix commented Oct 10, 2017

@albertcthomas I understand your point, but I think that unlike examples, benchmarks are not meant to guide the user but to compare the performance of algorithms.

Owner

agramfort commented Oct 11, 2017

Contributor

ngoix commented Oct 15, 2017

Ok @agramfort, but then we also have to remove LOF from the outlier detection example as it uses _decision_function.
Besides, we can't make the two different benchmarks we discussed above (outlier / anomaly detection), each benchmark containing all AD algorithms.

Owner

agramfort commented Oct 25, 2017

just ran the bench locally. Worked great.

good to go @ogrisel or @lesteve ?

-# Removed the shuttle dataset because as of 2017-03-23 mldata.org is down:
-# datasets = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
-datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover']
+# datasets available = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']

@ogrisel

ogrisel Oct 25, 2017

Owner

Now that the shuttle dataset is run by default, we can remove this comment.

Owner

ogrisel commented Oct 25, 2017

I had a quick IRL conversation with @albertcthomas and I think we should convert those 2 benchmark scripts into a single example script with a plt.subplot grid, one subplot per dataset with both LOF and IF ROC curves.
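
A rough sketch of what such a combined example could look like (structure only; load_dataset stands in for the fetching/vectorizing code of the current benchmarks):

import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import auc, roc_curve

datasets = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for ax, dataset_name in zip(axes.ravel(), datasets):
    X, y = load_dataset(dataset_name)  # placeholder for the existing fetching code

    iforest = IsolationForest(random_state=1).fit(X)
    if_scores = -iforest.decision_function(X)   # higher means more abnormal

    lof = LocalOutlierFactor(n_neighbors=20).fit(X)
    lof_scores = -lof.negative_outlier_factor_  # higher means more abnormal

    for name, scores in [('IsolationForest', if_scores), ('LOF', lof_scores)]:
        fpr, tpr, _ = roc_curve(y, scores)
        ax.plot(fpr, tpr, label='%s (AUC = %0.3f)' % (name, auc(fpr, tpr)))
    ax.set_title(dataset_name)
    ax.legend(loc='lower right')

plt.show()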

Owner

ogrisel commented Oct 25, 2017

Also I think having to convert the categorical labels as strings to get LabelBinarizer to behave correctly is a bug. It should work with an array of object labels which is a more memory efficient representation. But this should probably be tackled in a separate PR.

Actually, we should just use the categorical encoder once it's ready.
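
As a stand-in until such an encoder is available, something like pandas.get_dummies could encode the object columns directly (a hypothetical sketch, not what the benchmark currently does; columns 1-3 of the kddcup99 data are the categorical features protocol_type, service and flag):

import pandas as pd
from sklearn.datasets import fetch_kddcup99

dataset = fetch_kddcup99(subset='SA', percent10=True)
X = dataset.data

# encode the three categorical columns without casting them to str first
cat = pd.DataFrame(X[:, 1:4], columns=['protocol_type', 'service', 'flag'])
X_cat = pd.get_dummies(cat)
print(X_cat.shape)  # one indicator column per category of each feature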

Owner

ogrisel commented Oct 25, 2017

Let's merge this as it is already a good fix in itself, and the conversion into an example can be done in another PR.

@ogrisel ogrisel merged commit 7f19dbe into scikit-learn:master Oct 25, 2017

4 checks passed

ci/circleci Your tests passed on CircleCI!
continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed
lgtm analysis: Python No alert changes

donigian added a commit to donigian/scikit-learn that referenced this pull request Oct 28, 2017

Merge branch 'master' of github.com:scikit-learn/scikit-learn into docs/donigian-update-contribution-guidelines

* 'master' of github.com:scikit-learn/scikit-learn: (23 commits)
  fixes #10031: fix attribute name and shape in documentation (#10033)
  [MRG+1] add changelog entry for fixed and merged PR #10005 issue #9633 (#10025)
  [MRG] Fix LogisticRegression see also should include LogisticRegressionCV(#9995) (#10022)
  [MRG + 1] Labels of clustering should start at 0 or -1 if noise (#10015)
  MAINT Remove redundancy in #9552 (#9573)
  [MRG+1] correct comparison in GaussianNB for 'priors' (#10005)
  [MRG + 1] ENH add check_inverse in FunctionTransformer (#9399)
  [MRG] FIX bug in nested set_params usage (#9999)
  [MRG+1] Fix LOF and Isolation benchmarks (#9798)
  [MRG + 1] Fix negative inputs checking in mean_squared_log_error (#9968)
  DOC Fix typo (#9996)
  DOC Fix typo: x axis -> y axis (#9985)
  improve example plot_forest_iris.py (#9989)
  [MRG+1] Deprecate pooling_func unused parameter in AgglomerativeClustering (#9875)
  DOC update news
  DOC Fix three typos in manifold documentation (#9990)
  DOC add missing dot in docstring
  DOC Add what's new for 0.19.1 (#9983)
  Improve readability of outlier detection example. (#9973)
  DOC: Fixed typo (#9977)
  ...

maskani-moh added a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017
