Describe the workflow you want to enable
Dear Scikit-Learn maintainers, good afternoon!
I would like to add an approximate nearest neighbors search algorithm to the sklearn.neighbors module.
The algorithm may improve performance on high-dimensional data, roughly from dimensionality D > 50 onwards.
It would be an alternative to the existing KDTree and BallTree methods.
The desired usage could be as follows:
from sklearn.neighbors import NSWGraph
import numpy as np
rng = np.random.RandomState(0)
X = rng.random_sample((50, 128))
nswgraph = NSWGraph()
nswgraph.build(X)
dist, ind = nswgraph.query(X[:1], k=3)
Describe your proposed solution
The method is based on navigable small world (NSW) graphs, which tend to demonstrate better performance in high-dimensional data spaces (D > 50) [1]. This data structure is a graph satisfying the property that the length of the shortest path between two random vertices grows proportionally to the logarithm of the graph size [2].
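To illustrate the underlying idea, here is a minimal sketch of NSW construction and greedy routing. This is not the proposed implementation: the function names (`build_nsw`, `greedy_search`) and the insertion heuristic of linking each new point to its `m` nearest already-inserted points are simplified assumptions for illustration, loosely following [1].

```python
import numpy as np

def build_nsw(X, m=5):
    """Link each point to its m nearest already-inserted points (toy heuristic)."""
    n = len(X)
    neighbors = [set() for _ in range(n)]
    for i in range(1, n):
        # Distances from point i to all previously inserted points.
        d = np.linalg.norm(X[:i] - X[i], axis=1)
        for j in np.argsort(d)[:m]:
            # Add the edge in both directions to keep the graph undirected.
            neighbors[i].add(int(j))
            neighbors[int(j)].add(i)
    return neighbors

def greedy_search(X, neighbors, q, start=0):
    """Follow edges greedily toward the query until no neighbor is closer."""
    cur = start
    cur_dist = np.linalg.norm(X[cur] - q)
    improved = True
    while improved:
        improved = False
        for nb in neighbors[cur]:
            d = np.linalg.norm(X[nb] - q)
            if d < cur_dist:
                cur, cur_dist, improved = nb, d, True
    return cur, cur_dist
```

Because routing only ever inspects the neighbors of the current vertex, the search may stop at a local minimum rather than the true nearest neighbor, which is precisely why the method is approximate; practical implementations restart from several entry points to improve recall.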
Describe alternatives you've considered, if relevant
The Scikit-Learn alternatives to the proposed method are the KDTree and BallTree algorithms, which give a performance advantage on low-dimensional data but become inefficient as the dimensionality grows (they suffer from the curse of dimensionality) [3].
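For comparison, the existing exact-search structures already expose the interface the proposed NSWGraph would mirror:

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.RandomState(0)
X = rng.random_sample((50, 128))

# Exact k-nearest-neighbor queries; efficient in low dimensions.
kd = KDTree(X)
dist, ind = kd.query(X[:1], k=3)

# BallTree offers the same query interface with a different partitioning scheme.
bt = BallTree(X)
dist_b, ind_b = bt.query(X[:1], k=3)
```

Both return a `(dist, ind)` pair of shape `(n_queries, k)`, which is the interface the usage example above follows.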
Additional context
Besides this feature request, you can find an actual implementation proposed as a pull request to the scikit-learn-extra project.
References
[1] Malkov, Y., Ponomarenko, A., Logvinov, A., & Krylov, V. (2014). Approximate nearest neighbor algorithm based on navigable small world graphs. Information Systems, 45, 61-68.
[2] Sandberg, O., & Clarke, I. (2006). The evolution of navigable small-world networks. arXiv preprint cs/0607025.
[3] Scikit-Learn User Guide: Nearest Neighbor Algorithms. https://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbor-algorithms