The Wayback Machine - https://web.archive.org/web/20220530203551/https://github.com/scikit-learn/scikit-learn/pull/23147
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MNT Replaced np.ndarray with memview where applicable in linear_model/_cd_fast.pyx #23147

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

Micky774
Copy link
Contributor

@Micky774 Micky774 commented Apr 17, 2022

Reference Issues/PRs

Addresses #10624

What does this implement/fix? Explain your changes.

Replaces np.ndarray typing with corresponding memviews in sparse_enet_coordinate_descent and enet_coordinate_descent_multi_task.

Any other comments?

Would it be worth disabling boundscheck for any of these functions? Afaik there's not much of a risk of an out-of-bounds access in these functions.

sklearn/linear_model/_cd_fast.pyx Outdated Show resolved Hide resolved
sklearn/linear_model/_cd_fast.pyx Outdated Show resolved Hide resolved
np.ndarray[floating, ndim=1, mode='c'] X_data,
np.ndarray[int, ndim=1, mode='c'] X_indices,
np.ndarray[int, ndim=1, mode='c'] X_indptr,
floating[::1] X_data,
Copy link
Member

@jeremiedbb jeremiedbb Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure it works with readonly X. Can you test this against a read-only csr matrix (Xcsr.data.setflags(write=False)) ?

Copy link
Contributor Author

@Micky774 Micky774 Apr 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can confirm it works with read-only CSR matrices:

from scipy import sparse
from sklearn import linear_model
import numpy as np

clf = linear_model.ElasticNet(alpha=0.1)
X = sparse.random(100,10)
X.data.setflags(write=False)
y = np.random.random(100)
clf.fit(X, y)

Copy link
Member

@jeremiedbb jeremiedbb Apr 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default ElasticNet does a copy of the input. Also, my first comment was misleading, it will also convert if it's not in csc format. If I change your snippet to

from scipy import sparse
from sklearn import linear_model
import numpy as np

clf = linear_model.ElasticNet(alpha=0.1, copy_X=False)
X = sparse.random(100,10, format="csc")
X.data.setflags(write=False)
y = np.random.random(100)
clf.fit(X, y)

Then I get ValueError: buffer source array is read-only

Copy link
Contributor Author

@Micky774 Micky774 May 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. In that case, since sparse_enet_coordinate_descent doesn't actually modify X_data, I went ahead and implemented the fix suggested in #10624, which is the ReadonlyArrayWrapper added in #20903. I think this should work, but I'm still quite new to the Cython side of things, so please let me know if there are problems with this approach -- thanks :)

Copy link
Member

@jeremiedbb jeremiedbb May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I guess we can do the same for enet_coordinate_descent because currently we have the same error for the dense case:

from scipy import sparse
from sklearn import linear_model
import numpy as np

clf = linear_model.ElasticNet(alpha=0.1, copy_X=False)
X = np.asfortranarray(np.random.uniform(size=(100, 10)))
X.setflags(write=False)
y = np.random.random(100)
clf.fit(X, y)

and probably also enet_coordinate_descent_multi_task. It would be good to also add a test for these read-only cases because it doesn't seem tested currently

Copy link
Member

@thomasjpfan thomasjpfan May 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can indices or indptr be read only?

I think testing is a must here. If we pass actual read only data with ReadOnlyArrayWrapper, python crashes if the Cython function writes to the memoryview:

%%cython
from cython cimport floating

def f(floating[:] a):
    a[0] = 1
from sklearn.utils._readonly_array_wrapper import ReadonlyArrayWrapper
from sklearn.utils._testing import create_memmap_backed_data
import numpy as np

X = np.asarray([1, 2.3])
X_mapped = create_memmap_backed_data(X)

# crashes
f(ReadonlyArrayWrapper(X_mapped))

Copy link
Contributor Author

@Micky774 Micky774 May 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a test for this behavior and confirmed the algorithm doesn't modify indices or indptr so extended the wrapper to them as well.

@Micky774 Micky774 changed the title MAINT Replaced np.ndarray with memview where applicable in linear_model/_cd_fast.pyx MNT Replaced np.ndarray with memview where applicable in linear_model/_cd_fast.pyx May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants