MAINT 32bit datasets support for PairwiseDistancesReduction
#22590
Conversation
Force-pushed from 6d9f8a9 to 2b68863
Force-pushed from 2b68863 to 8ae1cb6
Also populate the .gitignore with new files
This allows keeping the same interface in Python, namely:

- PairwiseDistancesReduction.is_usable_for
- PairwiseDistancesReduction.valid_metrics
- PairwiseDistancesArgKmin.compute

while being able to route to the 32bit and 64bit implementations defined via Tempita.

The design pattern used here on PairwiseDistancesReduction and PairwiseDistancesArgKmin is the Facade design pattern. See: https://refactoring.guru/design-patterns/facade
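The routing this commit describes can be sketched at the Python level as follows. This is a minimal, hypothetical illustration of the Facade pattern only: the class bodies are NumPy stand-ins, not the actual Tempita-generated Cython implementations.

```python
import numpy as np

class _ArgKmin64:
    # Stand-in for the Tempita-generated 64bit implementation.
    @classmethod
    def compute(cls, X, Y, k):
        dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.argsort(dists, axis=1)[:, :k]

class _ArgKmin32(_ArgKmin64):
    # Stand-in for the Tempita-generated 32bit implementation.
    pass

class PairwiseDistancesArgKmin:
    """Facade: one public entry point dispatching on the dtype of X."""
    @classmethod
    def compute(cls, X, Y, k):
        if X.dtype == np.float64:
            return _ArgKmin64.compute(X, Y, k)
        if X.dtype == np.float32:
            return _ArgKmin32.compute(X, Y, k)
        raise ValueError(f"No implementation for dtype {X.dtype}")
```

Callers keep using a single `compute` entry point regardless of dtype; only the facade knows about the dtype-specific classes.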
DistanceMetric and PairwiseDistancesReduction
```cython
cdef inline np.ndarray _buffer_to_ndarray(const DTYPE_t* x, np.npy_intp n):
    # Wrap a memory buffer with an ndarray. Warning: this is not robust.
    # In particular, if x is deallocated before the returned array goes
    # out of scope, this could cause memory errors. Since there is not
    # a possibility of this for our use-case, this should be safe.

    # Note: this Segfaults unless np.import_array() is called above
    return PyArray_SimpleNewFromData(1, &n, DTYPECODE, <void*>x)
```
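A rough NumPy-level analogue of what this helper does (wrap existing memory without copying) can illustrate the same caveat: the result is only a view, so the underlying buffer must outlive it. This is an illustration, not the PR's code.

```python
import numpy as np

buf = bytearray(8 * 4)                      # owner of the memory: 4 float64 slots
arr = np.frombuffer(buf, dtype=np.float64)  # zero-copy view onto buf (writable
                                            # because bytearray is writable)
arr[:] = [1.0, 2.0, 3.0, 4.0]               # writes go through to buf
```

If `buf` were freed while `arr` is still alive, reads through `arr` would be invalid, which is exactly the hazard the Cython comment warns about.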
```cython
from libc.math cimport fabs, sqrt, exp, pow, cos, sin, asin

cdef DTYPE_t INF = np.inf
```
Note to reviewers: this has been moved under the Tempita loop to be templated.
```cython
######################################################################
# metric mappings
# These map from metric id strings to class names
METRIC_MAPPING = {'euclidean': EuclideanDistance,
                  'l2': EuclideanDistance,
                  'minkowski': MinkowskiDistance,
                  'p': MinkowskiDistance,
                  'manhattan': ManhattanDistance,
                  'cityblock': ManhattanDistance,
                  'l1': ManhattanDistance,
                  'chebyshev': ChebyshevDistance,
                  'infinity': ChebyshevDistance,
                  'seuclidean': SEuclideanDistance,
                  'mahalanobis': MahalanobisDistance,
                  'wminkowski': WMinkowskiDistance,
                  'hamming': HammingDistance,
                  'canberra': CanberraDistance,
                  'braycurtis': BrayCurtisDistance,
                  'matching': MatchingDistance,
                  'jaccard': JaccardDistance,
                  'dice': DiceDistance,
                  'kulsinski': KulsinskiDistance,
                  'rogerstanimoto': RogersTanimotoDistance,
                  'russellrao': RussellRaoDistance,
                  'sokalmichener': SokalMichenerDistance,
                  'sokalsneath': SokalSneathDistance,
                  'haversine': HaversineDistance,
                  'pyfunc': PyFuncDistance}
```
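A small usage sketch of such an alias table: several metric id strings resolve to the same class. The class and function names below are hypothetical stand-ins for the Cython extension types, kept only to show the lookup pattern.

```python
# Hypothetical stand-ins for the Cython distance classes.
class EuclideanDistance: ...
class ManhattanDistance: ...

METRIC_MAPPING = {
    'euclidean': EuclideanDistance,
    'l2': EuclideanDistance,
    'manhattan': ManhattanDistance,
    'cityblock': ManhattanDistance,
    'l1': ManhattanDistance,
}

def get_metric(metric):
    # Resolve an alias string to a distance instance, rejecting unknown ids.
    try:
        return METRIC_MAPPING[metric]()
    except KeyError:
        raise ValueError(f"Unrecognized metric: {metric!r}") from None
```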
Note to reviewers: this has been moved under the Tempita loop to be templated.
```rst
.. math::
   D(x, y) = \frac{N_{TF} + N_{FT}}{N_{TT} + N_{TF} + N_{FT}}
```
I removed this doc because it was making Tempita injection fail.
It can be reintroduced.
Indeed it would be good to reintroduce it with a notation that does not use backslashes (assuming they are the ones causing the problem).
Previously, the upcast was done in the critical region, causing an unneeded upcast of one of the buffers. Now buffers are upcast only when and where necessary, without the duplication of the previous approach. Two methods are introduced to perform this upcast for each strategy. Yet, this adds some complexity to the templating.
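The "upcast only where necessary" idea can be sketched as follows. This is a hypothetical NumPy illustration (function name and chunking are invented for the sketch): float32 data is upcast to float64 chunk by chunk inside the loop, instead of upcasting whole buffers up front.

```python
import numpy as np

def argkmin_chunked(X32, Y32, k, chunk_size=2):
    """Illustrative sketch only: upcast float32 chunks to float64 inside
    the chunk loop rather than upcasting X and Y wholesale."""
    n = X32.shape[0]
    out = np.empty((n, k), dtype=np.intp)
    for start in range(0, n, chunk_size):
        # Upcast only the data needed for this iteration.
        X_chunk = X32[start:start + chunk_size].astype(np.float64)
        Y_chunk = Y32.astype(np.float64)  # the real code also chunks over Y
        d = ((X_chunk[:, None, :] - Y_chunk[None, :, :]) ** 2).sum(axis=-1)
        out[start:start + chunk_size] = np.argsort(d, axis=1)[:, :k]
    return out
```

The trade-off mentioned above is visible even here: each strategy (chunking over X or over Y) needs its own upcast placement, which complicates the template.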
DistanceMetric and PairwiseDistancesReduction
```cython
cpdef DTYPE_t[::1] _sqeuclidean_row_norms(
    const DTYPE_t[:, ::1] X,
    ITYPE_t num_threads,
):
    """Compute the squared euclidean norm of the rows of X in parallel.

    This is faster than using np.einsum("ij, ij->i") even when using a single thread.
    """
    cdef:
        # Casting for X to remove the const qualifier is needed because APIs
        # exposed via scipy.linalg.cython_blas aren't reflecting the arguments'
        # const qualifier.
        # See: https://github.com/scipy/scipy/issues/14262
        DTYPE_t * X_ptr = <DTYPE_t *> &X[0, 0]
        ITYPE_t idx = 0
        ITYPE_t n = X.shape[0]
        ITYPE_t d = X.shape[1]
        DTYPE_t[::1] squared_row_norms = np.empty(n, dtype=DTYPE)

    for idx in prange(n, schedule='static', nogil=True, num_threads=num_threads):
        squared_row_norms[idx] = _dot(d, X_ptr + idx * d, 1, X_ptr + idx * d, 1)

    return squared_row_norms
```
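For reference, what this helper computes is equivalent to the einsum expression its docstring compares against; a plain NumPy version (not the PR's code, and without the parallel BLAS dot) is:

```python
import numpy as np

def sqeuclidean_row_norms(X):
    # NumPy-level equivalent of the Cython helper: the squared euclidean
    # norm of each row is the dot product of the row with itself.
    return np.einsum("ij,ij->i", X, X)
```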
Note to reviewers: this has been moved below, closer to the 32bit and 64bit definitions.
```cython
from cython.parallel cimport parallel, prange

from ._dist_metrics cimport DatasetsPair, DenseDenseDatasetsPair
```
Note to reviewers: this has been moved below, closer to the 32bit and 64bit definitions.
```cython
cdef:
    readonly DatasetsPair datasets_pair

    # The number of threads that can be used is stored in effective_n_threads.
    #
    # The number of threads to use in the parallelisation strategy
    # (i.e. parallel_on_X or parallel_on_Y) can be smaller than effective_n_threads:
    # for small datasets, fewer threads might be needed to loop over pairs of chunks.
    #
    # Hence the number of threads that _will_ be used for looping over chunks
    # is stored in chunks_n_threads, allowing solely using what we need.
    #
    # Thus, an invariant is:
    #
    #     chunks_n_threads <= effective_n_threads
    #
    ITYPE_t effective_n_threads
    ITYPE_t chunks_n_threads

    ITYPE_t n_samples_chunk, chunk_size

    ITYPE_t n_samples_X, X_n_samples_chunk, X_n_chunks, X_n_samples_last_chunk
    ITYPE_t n_samples_Y, Y_n_samples_chunk, Y_n_chunks, Y_n_samples_last_chunk

    bint execute_in_parallel_on_Y
```
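The invariant documented in these comments reduces to a simple rule; a one-line sketch (the helper name is hypothetical, for illustration only):

```python
def chunks_n_threads(n_chunks, effective_n_threads):
    # The number of threads used to loop over chunks never exceeds the
    # number of chunks to process, nor the number of available threads,
    # so chunks_n_threads <= effective_n_threads always holds.
    return min(n_chunks, effective_n_threads)
```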
Note to reviewers: this has been moved below, within the 32bit and 64bit definitions (done via Tempita).
Hence, PairwiseDistancesReduction now really just is an interface.
Thanks for this analysis. It's not clear to me if the use of …

I also agree with the long-term strategy, but I am not sure we have access to a C/C++/Cython-level API for these distance implementations; and if we do, are they stable across scipy versions, and what is the minimal scipy version?
I think that it historically has been the case to return 64bit outputs.
This is implemented in 7645ba3. Still, the test is failing for t-SNE but not for the snippet given above in #22590 (comment).
It's not the case currently, but in the long term we could imagine coming up with interfaces which make it possible?
I am thinking of putting this PR on hold, because I think we first need to agree on a plan to properly extend tests to 32bit datasets and implement it. Moreover, I've opened #22719, which precedes this PR and eases its logic, especially regarding the new implementation introduced via #22320 and the subsequent ones. What do you think?
This has been extracted from 7645ba.
Conflicts:
- .gitignore
- sklearn/metrics/_dist_metrics.pxd.tp
- sklearn/metrics/_dist_metrics.pyx.tp
- sklearn/metrics/_pairwise_distances_reduction.pyx.tp
- sklearn/metrics/setup.py
- sklearn/metrics/tests/test_pairwise_distances_reduction.py
Reference Issues/PRs
Follows #22134. Experimental POC to assess if Tempita is sufficient.
What does this implement/fix? Explain your changes.
Full design proposal
Context
PairwiseDistancesReduction needs to support float32 and float64 DatasetsPairs.

To do so, DatasetsPairs need to be adapted for float32 (X, Y), and concrete PairwiseDistancesReductions need to maintain the routing to those.

The current Cython extension type (i.e. cdef class) hierarchy only supports 64bit implementations. It breaks down as follows:
Where FastEuclideanPairwiseDistancesArgKmin is called in most cases.

Problem
We need some flexibility to support 32bit datasets while not duplicating the implementations. In this regard, templating (i.e. having classes be dtype-defined) and type covariance (i.e. if A extends B then Class&lt;A&gt; extends Class&lt;B&gt;) would have come in handy to extend the current 64bit hierarchy to support 32bit.

Yet, Cython supports neither templating in its language constructs nor type covariance.

Also, Cython offers support for fused types; however, they can't be used on Cython extension types' attributes, which makes this useful feature unusable in our context without some hacks.
Proposed solution
Still, we can use Tempita to come up with a working solution preserving performance at the cost of maintenance.
To perform this:

- 32bit DistanceMetrics have been introduced via Tempita. 64bit DistanceMetrics are still exposed via the current public API, but the 32bit versions must remain private.
- PairwiseDistancesReductions has been changed using the Facade design pattern, so as to keep the same Python interfaces (namely PairwiseDistancesReduction.is_usable_for, PairwiseDistancesReduction.valid_metrics, PairwiseDistancesArgKmin.compute) but have concrete 32bit and 64bit implementations be defined via Tempita as follows:

Future extension solution
In the future, we could just use the same pattern. For instance we could have:
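The code-generation idea behind this pattern can be sketched with a minimal Python analogue. This is illustrative only: real Tempita syntax and the generated class names differ, and in the real setup Tempita writes `.pyx` files at build time rather than `exec`ing source at runtime.

```python
# One template is expanded once per dtype, yielding suffixed classes,
# mimicking how a single .pyx.tp file yields 32bit and 64bit variants.
TEMPLATE = '''
class PairwiseDistancesArgKmin{suffix}:
    dtype = "{dtype}"
'''

source = "".join(
    TEMPLATE.format(suffix=suffix, dtype=dtype)
    for suffix, dtype in (("64", "float64"), ("32", "float32"))
)
namespace = {}
exec(source, namespace)  # stand-in for the build-time Tempita expansion
```

Adding a new dtype (or a new reduction) then only means adding one entry to the expansion list, which is the "same pattern" extension suggested above.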
TODO:
Hardware scalability
Adapting this script to use float32 datasets, we observe that this implementation scales linearly, similarly to its 64bit counterpart:
Raw results
Speed-ups between 1.0 (e7fb5b8) and this PR @ 65ebc92 (via ca9197a502bf1289db722a6261ff5fe7edf8e981)
Up to ×50 speed-ups in normal configurations.
Some regression when using small datasets and a high number of threads.
1 core
2 cores
4 cores
8 cores
16 cores
32 cores
64 cores
128 cores
Any other comments?
Is this proposal too complicated? As @lesteve says "Tant pis pour Tempita"?