The Wayback Machine - https://web.archive.org/web/20220530203545/https://github.com/scikit-learn/scikit-learn/pull/23197
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH Cythonize _assert_all_finite using stop-on-first strategy #23197

Open
wants to merge 23 commits into
base: main
Choose a base branch
from

Conversation

Micky774
Copy link
Contributor

@Micky774 Micky774 commented Apr 22, 2022

Reference Issues/PRs
Fixes #11681

What does this implement/fix? Explain your changes.
Implements the code developed by jakirkham and extends it to meet requirements for current _assert_all_finite function.

Any other comments?
Currently struggling to adapt the function to work with np.float16 arrays.

To Do

  • Compare performance as "second pass" algorithm- Compare op speed by replacing w/ equality check
  • Test speed if configured for 1D/2D contiguous arrays
  • Benchmark w/ non-finite arrays of varying density
  • not np.isfinite--> np.isinf() and np.isnan()

@Micky774 Micky774 changed the title ENH Cythonize _assert_all_finite using reduction scheme ENH Cythonize _assert_all_finite using reduction scheme Apr 22, 2022
@Micky774 Micky774 changed the title ENH Cythonize _assert_all_finite using reduction scheme WIP ENH Cythonize _assert_all_finite using reduction scheme Apr 22, 2022
Copy link
Member

@thomasjpfan thomasjpfan left a comment

Thank you for the PR!

sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
raise TypeError("Unsupported array type: %s" % repr(numpy.PyArray_TypeObjectFromType(a_type)))


cdef inline bint c_isfinite(const_fprecision* a_ptr, size_t step, size_t size, bint_enum disallow_nan) nogil:
Copy link
Member

@thomasjpfan thomasjpfan Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lot of indirection for getting disallow_nan passed through. Although, I do not see the motivation for the bint_enum.

I suspect the original implementation is trying to template out the bint_type so the overhead of checking for bint is compiled away in the loop. I am interested to see what the overhead looks like without this optimization.

Copy link
Contributor Author

@Micky774 Micky774 Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll run some benchmarks once the rest of the pattern is stable -- I'm also interested to see the realized performance difference.

sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/validation.py Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
cdef bint_enum err


a_flat = numpy.PyArray_Reshape(a, -1)
Copy link
Member

@thomasjpfan thomasjpfan Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the array is contiguous, I do not think we need to reshape at all.

Copy link
Contributor Author

@Micky774 Micky774 Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually in general do we need the array to be C/F contiguous at all? We don't really care about accessing the data "in order" per se, so it should be fine without a contiguous array right? (I may be misunderstanding the contiguity of ndarrays)

Copy link
Member

@thomasjpfan thomasjpfan Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My overall concern is how reshape can copy and copying X is a huge memory cost to pay. Using C/F contiguous is one way to avoid the possible copy from reshape. If the array is contiguous, the data can be traversed without reshaping.

Also, when the array is contiguous, it can take advantage of data locality and benefit from having the data on the same CPU cache line.

Copy link
Contributor Author

@Micky774 Micky774 Apr 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha -- I was initially talking about removing reshape and simultaneously not stipulating contiguity, but the cache argument makes sense.

@Micky774
Copy link
Contributor Author

@Micky774 Micky774 commented Apr 23, 2022

Current benchmarks indicate that the optimization is dependent on dtype and that the current implementation is preferable once the number of elements ~>5000. I don't think we need to really offer a choice between the cython/python implementations since the preferable algorithm is fairly clear cut for by far most cases.

Code for benchmark:

from sklearn.utils.validation import _assert_all_finite
import numpy as np

dtype = np.float32
X = np.random.rand(1000,100).astype(dtype)

%timeit _assert_all_finite(X)

@Micky774 Micky774 changed the title WIP ENH Cythonize _assert_all_finite using reduction scheme WIP ENH Cythonize _assert_all_finite using stop-on-first strategy Apr 24, 2022
@ogrisel
Copy link
Member

@ogrisel ogrisel commented Apr 29, 2022

Could you try to use prange / OpenMP parallelism to try to see if parallelism can further increase the processing speed on a multicore machine?

@thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Apr 29, 2022

Could you try to use prange / OpenMP parallelism to try to see if parallelism can further increase the processing speed on a multicore machine?

Even if it is faster with prange, I prefer not to have a drop in single thread performance compared to main. OMP_NUM_THREADS is commonly set to 1 in libraries such as dask or ray to prevent oversubscription.

We can work around this by using np.isfinite + np.sum for single threaded and use this Cython version if there are enough threads. Although, I prefer not to go down this route.

@ogrisel
Copy link
Member

@ogrisel ogrisel commented Apr 29, 2022

Even if it is faster with prange, I prefer not to have a drop in single thread performance compared to main. OMP_NUM_THREADS is commonly set to 1 in libraries such as dask or ray to prevent oversubscription.

Are you sure we will get a drop in single-threaded performance with prangewhen OMP_NUM_THREADS=1? I think it's worth measuring it.

@thomasjpfan
Copy link
Member

@thomasjpfan thomasjpfan commented Apr 29, 2022

With a fairly simple implementation in a Jupyter notebook: (Using %load_ext Cython):

Code for only using Cython + prange
%%cython --compile-args=-fopenmp

from libc.math cimport isfinite as c_isfinite
cimport cython
from cython.parallel cimport prange
from cython cimport floating

cdef int cy_isfinite(floating[:, ::1] a, int n_threads):
    cdef:
        int size = a.size
        int i
        floating* a_ptr = &a[0, 0]
        int output = 1
        
    for i in prange(size, num_threads=n_threads, nogil=True):
        if c_isfinite(a_ptr[i]) == 0:
            output = 0
            break
    return output

def my_isfinite(floating[:, ::1] a, int n_threads=1):
    return bool(cy_isfinite(a, n_threads=n_threads))

and these Python version:

from sklearn.utils.extmath import _safe_accumulator_op
import numpy as np

def sk_isfinite(X):
    return np.isfinite(_safe_accumulator_op(np.sum, X))

def np_isfinite_all(X):
    return np.isfinite(X).all()

Running on a fairly sized X:

rng = np.random.RandomState(42)
X = rng.rand(100_000, 100).astype(dtype)

%timeit sk_isfinite(X)
# 1.87 ms ± 8.21 µs per loop

%timeit np_isfinite_all(X)
# 4.71 ms ± 60.5 µs per loop

%timeit my_isfinite(X, n_threads=1)
# 15.8 ms ± 62.8 µs per loop

%timeit my_isfinite(X, n_threads=8)
# 2.47 ms ± 311 µs per loop

# Only using range
%timeit my_isfinite_single_with_range(X)
# 6.25 ms ± 17.6 µs per loop

From the results above scikit-learn's current np.sum + np.isfinite performs better than all Cython implementations. I think it's strange how using range is better compared to using prange with n_threads=1.

Code for only using Cython + range
%%cython

from libc.math cimport isfinite as c_isfinite
cimport cython
from cython cimport floating

cdef int cy_isfinite_with_range(floating[:, ::1] a):
    cdef:
        int size = a.size
        int i
        floating* a_ptr = &a[0, 0]
        int output = 1
        
    with nogil:
        for i in range(size):
            if c_isfinite(a_ptr[i]) == 0:
                output = 0
                break
    return output

def my_isfinite_single_with_range(floating[:, ::1] a):
    return bool(cy_isfinite_with_range(a))

@ogrisel
Copy link
Member

@ogrisel ogrisel commented May 2, 2022

Thanks @thomasjpfan. It's possible that the np.sum function was recently optimized with SIMD instructions that make more efficient than alternatives.

@ogrisel
Copy link
Member

@ogrisel ogrisel commented May 2, 2022

Indeed, using py-spy top --native to monitor a process that runs your sk_isfinite on a float32 data array in a while True loop:

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)                                                                                                                                                      
 60.00%  60.00%   38.92s    38.92s   _aligned_contig_cast_float_to_double (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
 40.00%  40.00%   25.05s    25.05s   DOUBLE_pairwise_sum (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  60.00%   0.310s    39.23s   npyiter_copy_to_buffers (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%   0.00%   0.210s    0.210s   npyiter_goto_iterindex (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%   0.00%   0.150s    0.150s   npyiter_copy_from_buffers (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  60.00%   0.090s    39.67s   npyiter_buffered_reduce_iternext_iters2 (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  40.00%   0.040s    25.09s   DOUBLE_add_AVX512F (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00% 100.00%   0.030s    64.84s   reduce_loop (numpy/core/_multiarray_umath.cpython-310-x86_64-linux-gnu.so)
  0.00%  40.00%   0.030s    25.12s   generic_wrapped_legacy_loop (numpy/core/_multiarray_umath.cpython-310-x86_64-

the DOUBLE_add_AVX512F is SIMD optimized. However I do not understand why numpy converts the float32 data to float64... This seems like a huge waste.

@ogrisel
Copy link
Member

@ogrisel ogrisel commented May 2, 2022

However I do not understand why numpy converts the float32 data to float64... This seems like a huge waste.

Actually this is done on purpose in _safe_accumulator_op. Maybe we could not do that when the goal is to detect nan or inf values. We could call np.sum() directly and only return an exact check if the sum finds a nan or inf.

@ogrisel
Copy link
Member

@ogrisel ogrisel commented May 3, 2022

Something like:

def sk_isfinite(X):
    # Use vectorized sum as a fast detector for inf or nan values
    # but with potential false positives. 
    if not np.isfinite(np.sum(X)):
        # Check again with the slower but exact method to filter out any
        # false positive.
        return np.isfinite(X).all()

This way the 99% of the cases where no ValueError exception are raised should be fast.

@Micky774
Copy link
Contributor Author

@Micky774 Micky774 commented May 26, 2022

Current implementation benchmarks (generated by this benchmark in a Jupyter notebook):
image

In the above image, p_inf and p_nan are the densities of np.inf and np.nan values in an array of one-million elements, such that p_inf+p_nan percent of the array is non-finite.

sklearn/utils/validation.py Outdated Show resolved Hide resolved
@Micky774
Copy link
Contributor Author

@Micky774 Micky774 commented May 26, 2022

Per @jakirkham's suggestion, I made a minor refactor to the Cython implementation to have it serve as a second-pass algorithm. This formulation is now approximately equivalent to main when the array is finite, but significantly faster (2-3x) when there are non-finite elements in large arrays (tested on arrays w/ 10mil elements).

image

And is slightly faster on smaller arrays (100,000 elements)
image

@Micky774 Micky774 changed the title WIP ENH Cythonize _assert_all_finite using stop-on-first strategy ENH Cythonize _assert_all_finite using stop-on-first strategy May 27, 2022
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Show resolved Hide resolved
@@ -0,0 +1,76 @@
# cython: profile=True, linetrace=True
Copy link
Member

@thomasjpfan thomasjpfan May 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the benchmarks were run with these on, the results maybe slower than normal.

sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
# size > 5000 is a heuristic for when the python implementation is faster
first_pass_isfinite = False
if is_float:
use_cython = not (X.dtype.kind == "c" or X.dtype == np.float16)
Copy link
Member

@thomasjpfan thomasjpfan May 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use_cython = not (X.dtype.kind == "c" or X.dtype == np.float16)
use_cython = X.dtype in {np.float16, np.float32}

sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
sklearn/utils/isfinite.pyx Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants