The Wayback Machine - https://web.archive.org/web/20220508031758/https://github.com/scikit-learn/scikit-learn/pull/10745
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Refactor cd #10745

Open
wants to merge 32 commits into
base: main
Choose a base branch
from
Open

[WIP] Refactor cd #10745

wants to merge 32 commits into from

Conversation

Copy link
Contributor

@dohmatob dohmatob commented Mar 2, 2018

Overview

This PR refactors cd_fast.pyx module (linear_module) which implements coordinate-descent, and make it more transparent and extensible.

The resulting implementation, is a single function which can solve general penalized / constrained multi-task least-squares. regression problems. Out-of-the-box, it supports

  • Gram mode (crucial when p << n) and non-Gram mode (crucial when p >> n)
  • a variety of penalties and constraints, including:
    • L11 penalty (sum of L1 norms), as in multi-task Lasso;
    • L21 penalty (sum of L2 norms), as in G-Lasso;
    • L2INF constraint (L2 constraint on each block), as in vanilla dictionary-learning;
    • L1INF constraint (L1 constraint on each block), e.g as in sparse PCA

The API remains consistent with the legacy implementation via cd_fast2.

Ping @agramfort

Other benefits

  • Things like screening rules for LASSO-type problems (inexact Stanford, GapSafe, etc.) and be implemented in a global way (by just modifying the single central cd solver), reducing maintenance complexity
  • The use of cBLAS has been factored out into a transparent API which is easier to understand, modify or replace (say, with the modern scipy's cython_blas way)
  • Factored the implementation of the dual gap computation into a single transparent function.

Also

  • It also supports user-defined penalties (via their prox operators). For example, with this technology, it is straightforward to implement otherwise non-trivial models like:
  • BCD dictionary update in online structured dictionary-learning with general (e.g see Dohmatob el al. "Learning brain regions via large-scale online structured sparse dictionary-learning", In NIPS 2016")
  • The same code can be used in implement dictionary updates (once again with both standard and exotic penalties / constraints), via a single appropriate call (another upcoming PR)

Some very preliminary benchmarks
bench

@agramfort
Copy link

@agramfort agramfort commented Mar 2, 2018

@dohmatob since you created new files diff it is hard to tell if you fully refactored the old code. can we now get rid of cd_fast.pyx? do all tests pass including docstring tests ?

are the stopping criterion exactly the same?

cc @mathurinm @josephsalmon

@sklearn-lgtm
Copy link

@sklearn-lgtm sklearn-lgtm commented Mar 2, 2018

This pull request introduces 1 alert when merging 9987f26 into 76b5b2a - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@sklearn-lgtm
Copy link

@sklearn-lgtm sklearn-lgtm commented Mar 2, 2018

This pull request introduces 1 alert when merging b217ad1 into 76b5b2a - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@sklearn-lgtm
Copy link

@sklearn-lgtm sklearn-lgtm commented Mar 2, 2018

This pull request introduces 1 alert when merging d4f937c into 76b5b2a - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@dohmatob
Copy link
Author

@dohmatob dohmatob commented Mar 3, 2018

@agramfort Yes the tests pass (except for an exotic failure in codecov that i haven't yet looked into)

In the PR, i'm still keeping the old cd_fast.pyx for benchmarking and testing the equivalence of the both implementations. I'll eventually delete it.

Concerning the new files, as said in the PR comment i've also factored out the interaction with BLAS. Of course some of these files might be bundled into one, but i thought it would be cleaner that way.

@sklearn-lgtm
Copy link

@sklearn-lgtm sklearn-lgtm commented Mar 11, 2018

This pull request introduces 1 alert when merging 5c07396 into eed8379 - view on lgtm.com

new alerts:

  • 1 for Variable defined multiple times

Comment posted by lgtm.com

@lesteve
Copy link

@lesteve lesteve commented Mar 12, 2018

Yes the tests pass (except for an exotic failure in codecov that i haven't yet looked into)

If I this can be trusted it looks like the coverage in test_coordescendant.py is ~93% which is probably the main reason of the coverage drop. Note: test files tend to have a coverage of 100% (or very close), you probably want to look at it in more details.

@dohmatob
Copy link
Author

@dohmatob dohmatob commented Apr 5, 2018

@lesteve Thanks for the comments. I'll look into ti failed coverage. This is weird since test_coordescendant is in fact, a test file :)

@jjerphan
Copy link

@jjerphan jjerphan commented May 20, 2021

Hi @dohmatob, are you interested in finishing this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants