[MRG+1] Fix LOF and Isolation benchmarks #9798
Conversation
albertcthomas
added some commits
Sep 19, 2017
@@ -101,7 +97,7 @@
fit_time = time() - tstart
tstart = time()
- scoring = -model.decision_function(X_test)  # the lower, the more normal
+ scoring = -model._decision_function(X_test)  # the lower, the more normal
albertcthomas
Sep 20, 2017
Contributor
Hm no... actually it is maybe a pity to have a benchmark for LOF that needs to predict on a test set...
I can change this benchmark to outlier detection, where we train on the whole dataset (without knowing the labels) and compute the ROC on this same training set (with the knowledge of the labels). This will only require LOF scores on the training set, i.e. negative_outlier_factor_. WDYT?
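For illustration, a minimal sketch of this outlier detection setting (the toy data and the n_neighbors value are assumptions for the example, not the benchmark datasets): fit LOF on the whole dataset and rank that same data with negative_outlier_factor_.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X_inliers, _ = make_blobs(n_samples=500, centers=1, random_state=42)
X_outliers = rng.uniform(low=-10, high=10, size=(25, 2))
X = np.vstack([X_inliers, X_outliers])
y = np.r_[np.zeros(len(X_inliers)), np.ones(len(X_outliers))]  # 1 = anomaly

model = LocalOutlierFactor(n_neighbors=20)
model.fit(X)  # unsupervised fit on the whole dataset
scoring = -model.negative_outlier_factor_  # the higher, the more abnormal
print("AUC: %.3f" % roc_auc_score(y, scoring))
```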
X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
- y = (y != 'normal.').astype(int)
+ y = (y != b'normal.').astype(int)
thanks for the review @agramfort
albertcthomas
added some commits
Sep 20, 2017
albertcthomas
Sep 21, 2017
Contributor
Concerning LOF, standardizing the datasets (using StandardScaler or RobustScaler) does not increase the performance in terms of ROC curve and AUC.
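A rough sketch of that comparison (not the benchmark code itself; it reuses the toy X and y from the sketch above and an assumed n_neighbors=20):

```python
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import RobustScaler, StandardScaler

# compare raw vs. scaled inputs with the same LOF settings
for name, X_scaled in [
    ("raw", X),
    ("StandardScaler", StandardScaler().fit_transform(X)),
    ("RobustScaler", RobustScaler().fit_transform(X)),
]:
    lof = LocalOutlierFactor(n_neighbors=20).fit(X_scaled)
    auc_score = roc_auc_score(y, -lof.negative_outlier_factor_)
    print("%s: AUC = %.3f" % (name, auc_score))
```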
agramfort
changed the title from
[MRG] Fix LOF and Isolation benchmarks
to
[MRG+1] Fix LOF and Isolation benchmarks
Sep 21, 2017
LGTM +1 for MRG
ogrisel
reviewed
Sep 22, 2017
I get the following error when running the isolation forest benchmark:
====== smtp ======
--- Fetching data...
--- Vectorizing data...
----- Target count values:
------ 0 -> 9568 occurrences
------ 1 -> 3 occurrences
----- Outlier ratio: 0.00031
--- Fitting the IsolationForest estimator...
--- Preparing the plot elements...
/home/ogrisel/code/scikit-learn/sklearn/metrics/ranking.py:547: UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
UndefinedMetricWarning)
smtp (AUC: nan, train_time= 0.44s, test_time= 0.25s)
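That warning is easy to reproduce in isolation (a minimal sketch, not the benchmark itself): roc_curve warns and returns nan TPR values whenever y_true contains no positive samples, so the downstream AUC is nan.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.zeros(10)  # a test split that happens to contain no outliers
y_score = np.random.RandomState(0).rand(10)

fpr, tpr, _ = roc_curve(y_true, y_score)  # emits UndefinedMetricWarning
print(auc(fpr, tpr))  # nan
```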
albertcthomas
Sep 22, 2017
Contributor
Indeed... thanks @ogrisel. We also have this warning on master. That's because there are only 3 outliers out of 9571 samples in the smtp dataset when percent10=True. Therefore there is a high chance that there is no outlier in the test set (when the model is trained on a dataset that can contain outliers). Of course this depends on the random seed set in np.random.seed at the beginning of the benchmark.
One solution would be to remove the smtp dataset from the benchmark. If percent10=False the problem is less likely to happen, but it may still happen depending on the random seed.
Note that we don't have this problem for the LOF benchmark anymore as we train and test on the whole dataset. We wouldn't have the error/warning if the isolation forest benchmark were run in the same way as the LOF benchmark. We wouldn't have this problem either if the isolation forest benchmark were run in a novelty detection context (train the model on a subsample of normal instances only and test it on all the other instances).
Interestingly, for the same seed in np.random.seed, if the benchmark starts with the 'smtp' dataset, the problem disappears (because the state of the RNG is not the same as when the benchmark starts with 'http'). For this reason it might be better to set the seed via the random_state parameter of the fetch_kddcup99, fetch_covtype and shuffle functions rather than setting it with np.random.seed at the beginning of the benchmark. The warning/error would then appear (or not appear) whatever the order of the datasets.
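Something along these lines (a sketch of the suggested pattern, not the exact benchmark code; the subset and variable names are illustrative):

```python
from sklearn.datasets import fetch_covtype, fetch_kddcup99
from sklearn.utils import shuffle

random_state = 1  # passed explicitly instead of calling np.random.seed

smtp = fetch_kddcup99(subset='smtp', percent10=True,
                      shuffle=True, random_state=random_state)
covtype = fetch_covtype(shuffle=True, random_state=random_state)

# any extra shuffling is also made deterministic through random_state
X, y = shuffle(smtp.data, smtp.target, random_state=random_state)
```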
+1 for passing random_state to make the warning deterministic
albertcthomas
added some commits
Sep 25, 2017
albertcthomas
Sep 25, 2017
Contributor
I clarified the randomness in both benchmarks and thus made the warning deterministic for the Isolation Forest benchmark. I also added a comment about it in the docstring.
ngoix
Sep 29, 2017
Contributor
Performance is not convincing for LOF, even in an outlier detection context. I did not review the PR, but did you try playing with the threshold/offset parameter? And is there any reason why you changed percent10 to True? (It is not the case for the other AD benchmarks, is it?)
albertcthomas
Sep 29, 2017
Contributor
percent10 was set to True for IsolationForest. I also set it to True for LOF. Performance is not great either if percent10 is set to False for LOF. LOF seems to perform well on the considered datasets in a novelty detection context, but LOF is not meant to be used in this context. Showing that LOF can perform well in an outlier detection context would require considering new datasets (see @ngoix's extensive benchmark).
Changing the threshold parameter will not affect the performance as we assess the whole scoring function with a ROC curve.
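To make the last point concrete, a tiny check (a sketch reusing y and scoring from the LOF sketch above): adding a constant offset to the scores, which is all a threshold/offset shift does, leaves the AUC unchanged because the ranking of the samples stays the same.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

auc_raw = roc_auc_score(y, scoring)
auc_shifted = roc_auc_score(y, scoring - 10.0)  # same ranking of samples
assert np.isclose(auc_raw, auc_shifted)
```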
You're right, my mistake!
ngoix
Sep 29, 2017
Contributor
But IMO benchmarks are not examples, they are made to compare performance. To do this, we need to choose a common setting for all AD algorithms (btw it is one of the reasons why we chose to implement predict and decision_function methods for LOF, even if private).
An interesting change to the benchmarks would be to create one file per setting (outlier/anomaly/novelty detection), each file including all AD algorithms. But this PR is probably not the place to do it.
All in all, I think we should stick to the fix of #8638 in this PR.
@@ -30,15 +41,14 @@ def print_outlier_ratio(y):
print("----- Outlier ratio: %.5f" % (np.min(cnt) / len(y)))
-np.random.seed(1)
+SEED = 1
ngoix
Sep 29, 2017
Contributor
I'm more used to seeing rng = np.random.RandomState(0), but I think it doesn't matter.
agramfort
Oct 5, 2017
Owner
Don't use a new name for something we refer to as random_state. Use:
random_state = 1
and pass random_state=random_state in function calls.
ngoix
Sep 29, 2017
Contributor
btw @albertcthomas, have you tried to benchmark with and without shuffling the data? It looks like the data are not shuffled anymore with your changes.
albertcthomas
Sep 30, 2017
Contributor
It looks like they are shuffled when we change SEED. For the 'http' dataset, with SEED = 1:
X[0] = [-2.3025850929940455, 5.313698468586339, 5.59508305773286]
and with SEED = 12:
X[0] = [-2.3025850929940455, 5.656341400056065, 6.319148277696116]
Comparing the results of the whole benchmark with SEED = 1 for the shuffling and random_state=1 for IsolationForest against the results with SEED = 12 for the shuffling and random_state=1 for IsolationForest, we can observe small differences in terms of AUC coming from the fact that the datasets are shuffled.
ngoix
Sep 30, 2017
Contributor
Yes, I meant for LOF: you removed the shuffle argument, which is set to False by default I believe...
albertcthomas
Sep 30, 2017
Contributor
The LOF benchmark is not random anymore because the model is trained and tested on the same dataset, so there is no need to shuffle the datasets. Setting shuffle to True does not change the results. I made a comment about this in the docstring of the benchmark.
ngoix
Sep 30, 2017
Contributor
Ok, I get it now!
So WDYT about my previous comments before this big "by the way" :)?
albertcthomas
Oct 2, 2017
Contributor
About making one file per setting (outlier/anomaly/novelty), each comparing all outlier detection estimators? I'm +1 for this in a new PR.
Then you don't need any change in bench_lof.py...
albertcthomas
Oct 9, 2017
Contributor
The decision to make modifications to the LOF benchmark came after this discussion.
We could use _decision_function in the LOF benchmark as we also use _decision_function for LOF in the outlier detection examples (see here). But IMO this could cause confusion and give a wrong message to the users, as in issue #9874 maybe...
@agramfort what do you think?
print(__doc__)
-np.random.seed(2)
+SEED = 2  # to control the random selection of anomalies in the SA dataset
codecov
bot
Oct 10, 2017
Codecov Report
Merging #9798 into master will not change coverage.
The diff coverage is n/a.
@@ Coverage Diff @@
## master #9798 +/- ##
=======================================
Coverage 96.16% 96.16%
=======================================
Files 336 336
Lines 62493 62493
=======================================
Hits 60098 60098
Misses 2395 2395
ngoix
Oct 10, 2017
Contributor
@albertcthomas I understand your point, but I think that unlike examples, benchmarks are not meant to guide the user but to compare the performance of the algorithms.
agramfort
Oct 11, 2017
Owner
I don't think we should expose in examples or benchmarks something that is not publicly supported.
You can put your benchmark in a public gist on GitHub if you want a public version of it.
ngoix
Oct 15, 2017
Contributor
Ok @agramfort, but then we also have to remove LOF from the outlier detection example as it uses _decision_function.
Besides, we can't make the two different benchmarks we discussed above (outlier / anomaly detection), each benchmark containing all AD algorithms.
-# Removed the shuttle dataset because as of 2017-03-23 mldata.org is down:
-# datasets = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
-datasets = ['http', 'smtp', 'SA', 'SF', 'forestcover']
+# datasets available = ['http', 'smtp', 'SA', 'SF', 'shuttle', 'forestcover']
ogrisel
Oct 25, 2017
Owner
Now that the shuttle dataset is run by default, we can remove this comment.
ogrisel
Oct 25, 2017
Owner
I had a quick IRL conversation with @albertcthomas and I think we should convert those 2 benchmark scripts into a single example script with a plt.subplot grid, one subplot per dataset with both LOF and IF ROC curves.
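A rough sketch of that layout (assuming a datasets dict mapping each dataset name to (X, y) with y == 1 for anomalies; the hyperparameters are placeholders, not the final example):

```python
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.metrics import auc, roc_curve
from sklearn.neighbors import LocalOutlierFactor

fig, axes = plt.subplots(1, len(datasets), figsize=(5 * len(datasets), 4),
                         squeeze=False)
for ax, (name, (X, y)) in zip(axes.ravel(), datasets.items()):
    iforest = IsolationForest(random_state=1).fit(X)
    lof = LocalOutlierFactor(n_neighbors=20).fit(X)
    # higher score = more abnormal for both models
    for label, scoring in [("IsolationForest", -iforest.decision_function(X)),
                           ("LOF", -lof.negative_outlier_factor_)]:
        fpr, tpr, _ = roc_curve(y, scoring)
        ax.plot(fpr, tpr, label="%s (AUC = %.3f)" % (label, auc(fpr, tpr)))
    ax.set_title(name)
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.legend(loc="lower right")
plt.show()
```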
ogrisel
Oct 25, 2017
Owner
Also I think having to convert the categorical labels to strings to get LabelBinarizer to behave correctly is a bug. It should work with an array of object labels, which is a more memory efficient representation. But this should probably be tackled in a separate PR.
Actually, we should just use the categorical encoder once it's ready.
ogrisel
Oct 25, 2017
Owner
Let's merge this as this is already a good fix in itself and the conversion into an example can be done in another PR.
albertcthomas commented Sep 19, 2017 (edited Sep 20, 2017)
What does this implement/fix? Explain your changes.
The fix for the benchmark of IsolationForest merged in #8638 leads to the following
which does not correspond to what LabelBinarizer was doing, because strings are viewed as iterables (and we obtain an array with 7 features instead of 3). This fix instead applies LabelBinarizer as follows:
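Roughly, the per-column pattern looks like this (a sketch inferred from the diff shown earlier in this thread; the .astype(str) conversion and the choice of columns 1-3 as the categorical ones are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# binarize each categorical column separately...
x1 = LabelBinarizer().fit_transform(X[:, 1].astype(str))
x2 = LabelBinarizer().fit_transform(X[:, 2].astype(str))
x3 = LabelBinarizer().fit_transform(X[:, 3].astype(str))
# ...and stack the indicator columns back with the numerical features
X = np.c_[X[:, :1], x1, x2, x3, X[:, 4:]]
y = (y != b'normal.').astype(int)
```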
Any other comments?
This also fixes the benchmark of LocalOutlierFactor.
EDIT: The benchmark of LocalOutlierFactor was predicting on a test set whereas LocalOutlierFactor is not meant to predict on a test set. This is fixed by doing the benchmark in an outlier detection context only.