[MRG] Add quantile regression #9978

Open · wants to merge 34 commits into base: master

Conversation

@avidale commented Oct 22, 2017

This PR fixes issue #3148

This new feature implements quantile regression: an algorithm that directly minimizes the mean absolute error (more generally, the pinball loss) of a linear regression model.
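For readers less familiar with the method, here is a minimal sketch (not from the PR's diff) of the pinball loss that quantile regression minimizes; at quantile=0.5 it is proportional to the mean absolute error:

import numpy as np

def pinball_loss(y_true, y_pred, quantile=0.5):
    # For quantile=0.5 this equals half the mean absolute error.
    residual = y_true - y_pred
    return np.mean(np.maximum(quantile * residual, (quantile - 1) * residual))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(pinball_loss(y_true, y_pred, quantile=0.5))  # 0.25, i.e. 0.5 * MAE
print(pinball_loss(y_true, y_pred, quantile=0.9))  # under-predictions cost more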

The work is still in progress, but I do want to receive some feedback.

@jnothman (Member) commented Oct 22, 2017

Thanks. You'll need a lot of patience for core devs to review this in detail, I suspect. We have a number of pressing API issues as well as highly requested features that have been in the reviewing queue for some time. But I hope we'll get here soon enough!

@ashimb9 (Contributor) commented Nov 28, 2017

@avidale Thanks a lot for your contribution! @jnothman has already outlined the broader picture, but FWIW a few pointers from a wanderer for when the core devs become available. First, I would suggest that you look into resolving the CI test failures. After that, you might want to consider working on the PR until you feel comfortable changing its status from [WIP] to [MRG]. (Of course, this is of no use if you actually need some comments/ideas before you can work any further.) In my limited experience, [WIP]s are usually not prioritized for review (but don't quote me on this ;)), so you might want to consider the change. Finally, when you get to that point, you might want to tag some of the core devs who participated in the original discussion, since some of them might have missed the initial post.

total_iter = []
loss_args = (X, y, self.quantile, self.alpha, self.l1_ratio,
             sample_weight)
for i in range(10):

@JasonSanchez Dec 23, 2017

This seems a little strange to have 10 here. Could you explain more?

@avidale (Author) Mar 18, 2018

Yes, the number of iterations was a "magic constant": the maximal number of cost-function refinements (with decreasing gamma) was hardcoded to 10. It is now a class parameter.

@JasonSanchez commented Dec 23, 2017

This is a great add to scikit-learn. Would personally really like to see it merged.

@jnothman (Member) left a comment

  • You have test failures to deal with.
  • Some parameters have not been tested, such as l1_ratio.
  • Please add the class to doc/modules/classes.rst
  • We no longer use assert_true, assert_greater. Just use bare assert.
@@ -0,0 +1,102 @@
# Authors: David Dale [email protected]
# License: TBD

@jnothman (Member) Feb 1, 2018

We have the same license across the code base...

@avidale (Author) Mar 18, 2018

Fixed it


def test_quantile_is_approximately_sparse():
# now most of coefficients are not exact zero, but with large n_samples they are close enough
X, y = make_regression(n_samples=3000, n_features=100, n_informative=10, random_state=0, noise=1.0)

@jnothman (Member) Feb 1, 2018

please keep lines under 80 chars

assert_array_almost_equal(warm.coef_, warm_coef, 1)


def test_quantile_convergence():

@jnothman (Member) Feb 1, 2018

This test is failing.

@avidale (Author) Mar 18, 2018

Applied a larger dataset (1000 observations vs. 500 before). The convergence improved just enough for the test to pass.

@avidale (Author) commented Mar 18, 2018

Under Travis CI there is another failure: the 'nit' attribute is not found in the result of a scipy.optimize call. It runs with scipy 0.13.3, which might be too old. What would you recommend doing?

  • make a workaround for old versions of scipy?
  • remove the 'nit' functionality altogether?
  • mark the class as working only with newer scipy and change my test to respect this restriction?

Current solution: I just changed the solver from BFGS to L-BFGS-B. The latter has supported nit since scipy 0.12.
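For illustration, this is roughly the kind of call involved; the objective below is a toy stand-in, not the PR's loss, but it shows that with method='L-BFGS-B' scipy exposes the iteration count as result.nit:

import numpy as np
from scipy.optimize import minimize

def objective(w):
    # toy quadratic standing in for the smoothed quantile loss
    return np.sum((w - 1.0) ** 2)

result = minimize(objective, x0=np.zeros(3), method="L-BFGS-B")
print(result.nit)      # iteration count, reported by L-BFGS-B
print(result.message)  # solver termination message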

@sklearn-lgtm commented Sep 26, 2018

This pull request introduces 1 alert when merging 463c633 into 4207225 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

Comment posted by LGTM.com

@rth (Member) left a comment

Thanks for your patience @avidale! This is a very nice contribution! Below are a few comments; I have not checked the math.

The question is how we can be sure that the results we get are valid. There are quite a few asymptotic and sanity checks in the added tests, which certainly help. Still, is there a way we could compare the results with some well-established implementation, e.g. in R or another language?

===================

Quantile regression estimates median or other quantiles of :math:`y` conditional on :math:`X`,
while OLS estimates conditional mean.


coefficients, not its absolute value.

Quantile loss function can be used with models other than linear.
:class:`GradientBoostingRegressor` also has an option to predict conditional quantiles.

@rth (Member) Sep 27, 2018

Please provide more details about how this could be achieved (the exact parameters are described in the parent issue, I think).
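For reference, conditional quantiles can already be obtained from gradient boosting via its loss and alpha parameters; the snippet below is only an illustration of those parameters, not code from the PR:

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Predict the 90th conditional percentile instead of the conditional mean.
gbr = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0)
gbr.fit(X, y)
q90 = gbr.predict(X)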

(namely, Pareto).
The second part of the code shows that LinearRegression minimizes RMSE,
while QuantileRegressor minimizes MAE, and both do their own job well.

@rth (Member) Sep 27, 2018

Move this paragraph before the actual section of the code where it is computed. Use sphinx-gallery formatting there:

#########################################################################
#
# The second part of the code shows that LinearRegression minimizes RMSE,
# while QuantileRegressor minimizes MAE, and both do their own job well.

(except also keep the lines <80 characters long).


plt.figure(figsize=(15, 5))

plt.subplot(121)

@rth (Member) Sep 27, 2018

The new matplotlib (object-oriented) API is recommended,

fig, ax = plt.subplots(1, 2, figsize=(15, 5))
ax[0].plot(..)


plt.subplot(122)

y = 20 + x * 0.5 + np.random.pareto(10, size=x.shape[0])*10

@rth (Member) Sep 27, 2018

Use RNG to get reproducible results,

rng = np.random.RandomState(42)
y = ... + rng.pareto(...)

Read more in the :ref:`User Guide <quantile_regression>`
.. versionadded:: 0.20

@rth (Member) Sep 27, 2018

0.21 now

if fit_intercept:
    grad = np.zeros(n_features + 1)
else:
    grad = np.zeros(n_features + 0)

@rth (Member) Sep 27, 2018

np.empty

max_iter=10000, alpha=0.0001, l1_ratio=0.0,
warm_start=False, fit_intercept=True,
normalize=False, copy_X=True,
gamma=1e-2, gtol=1e-4, xtol=1e-6,

@rth (Member) Sep 27, 2018

How were the default values for gtol, xtol determined?

@avidale (Author) Oct 7, 2018

I did manual experiments on a few generated datasets with normalized features, and found that these values suffice for the algorithm to converge. So yes, they are somewhat arbitrary.

I can put together a report describing how these parameters affect the results of the computation. Do you think that is necessary?

# if positive error costs more, the slope is maximal
model = QuantileRegressor(quantile=0.51, alpha=0).fit(X, y)
assert_almost_equal(model.intercept_, 1, 2)
assert_almost_equal(model.coef_[0], 10, 2)

@rth (Member) Sep 27, 2018

use assert_allclose everywhere with the desired relative tolerance, as recommended in numpy docs https://docs.scipy.org/doc/numpy/reference/generated/numpy.testing.assert_array_almost_equal.html
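For instance, the two assertions above could become something like this (the tolerances here are only illustrative, not taken from the PR):

from numpy.testing import assert_allclose

assert_allclose(model.intercept_, 1, atol=1e-2)
assert_allclose(model.coef_[0], 10, rtol=1e-2)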

X, y = make_regression(n_samples=1000, n_features=20, random_state=0,
                       noise=10.0)
X += np.random.normal(size=X.shape[1], scale=3)
X *= np.random.normal(size=X.shape[1], scale=3)

@rth (Member) Sep 27, 2018

use np.random.RandomState

@sklearn-lgtm commented Oct 7, 2018

This pull request introduces 1 alert when merging 80c8182 into 63e5ae6 - view on LGTM.com

new alerts:

  • 1 for Unused local variable

Comment posted by LGTM.com

@avidale (Author) commented Oct 7, 2018

Thank you for the review, @rth! I have fixed some of the issues you indicated, and will probably fix some more.
I could make a report comparing this implementation with other implementations (maybe in R and statsmodels), along with an analysis of sensitivity to the computational parameters. But I don't know which format is most appropriate. Should I publish the results in my own repo and attach the link here on the PR page? Or even insert it into the documentation?

noise=1.0)

# check that for small n_iter, warning is thrown
with warnings.catch_warnings(record=True) as w:

@rth (Member) Oct 7, 2018

Here and below we can use with pytest.warns(None) as record: (see pytest warning capture) -- it has a very similar API and better integration/compatibility with pytest.
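A self-contained sketch of that pattern (the warning text below is illustrative; note that pytest.warns(None) was later deprecated in pytest 7):

import warnings
import pytest

def do_something():
    # stand-in for the code under test
    warnings.warn("QuantileRegressor did not converge", UserWarning)

def test_warning_is_raised():
    with pytest.warns(None) as record:
        do_something()
    assert len(record) == 1
    assert "did not converge" in str(record[0].message)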

@rth (Member) commented Oct 7, 2018

I could make a report with comparison of this implementation and other implementations (maybe in R and Statsmodels), and with analysis of sensitivity to computational parameters. But I don't know what format is more appropriate

I was thinking more of some simple comparison on a few examples to validate this implementation. Comparing with R and statsmodels would be great! That could be done by posting a script and the resulting figure as a comment in PR (you can use <details> tag to hide some details if needed). Of course if you prefer to do a more detailed report or put it somewhere else and link to it -- that would work as well :)
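Something along these lines could serve as such a validation script, assuming statsmodels is installed and the QuantileRegressor from this branch is importable (the data and quantile are arbitrary):

import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_regression
from sklearn.linear_model import QuantileRegressor  # from this branch

X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)
q = 0.8

# statsmodels reference fit; add an explicit intercept column.
sm_fit = sm.QuantReg(y, sm.add_constant(X)).fit(q=q)

# Fit from this PR, with regularization switched off for comparability.
skl_fit = QuantileRegressor(quantile=q, alpha=0).fit(X, y)

print("statsmodels:", sm_fit.params)
print("this PR:    ", np.r_[skl_fit.intercept_, skl_fit.coef_])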

@amueller (Member) commented Mar 7, 2019

untagging this as it seems unlikely to get in in time.

@rth (Member) left a comment

Sorry the review is slow; we are severely limited in review bandwidth, but I now have the availability to look into this.

Are you still interested in working on it @avidale? A few comments are below. Could you please resolve conflicts?

The example is quite nice. I think the main reason this PR is currently somewhat difficult to get accepted is that the use of this smoothed quantile L1+L2 loss, and the resulting nested LBFGS solver, is fairly non-trivial.

I think it would be easier to start with the QuantileLoss from gradient boosting, with an L2 penalty only and LBFGS, and add this QuantileRegressor estimator. Then, in a follow-up PR, propose adding the l1_ratio parameter and this smoothed loss.
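A rough sketch of what that simpler first step could look like (names and defaults below are placeholders; since the pinball loss is not differentiable at zero, a real implementation would need a smoothed surrogate or a different solver, which is exactly the difficulty discussed above):

import numpy as np
from scipy.optimize import minimize

def _l2_pinball_objective(w, X, y, quantile, alpha):
    # pinball loss of the residuals plus an L2 penalty on the coefficients
    intercept, coef = w[0], w[1:]
    residual = y - X @ coef - intercept
    loss = np.mean(np.maximum(quantile * residual, (quantile - 1) * residual))
    return loss + alpha * np.dot(coef, coef)

def fit_quantile_l2(X, y, quantile=0.5, alpha=1e-4):
    w0 = np.zeros(X.shape[1] + 1)
    result = minimize(_l2_pinball_objective, w0,
                      args=(X, y, quantile, alpha), method="L-BFGS-B")
    return result.x[0], result.x[1:]  # intercept, coefficients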

The :class:`QuantileRegressor` applies linear loss to all samples. It is thus more radical than
:class:`HuberRegressor`, that applies linear penalty to small fraction of outliers and quadratic loss
to the rest of observations. :class:`QuantileRegressor` also supports L1 and L2 regularization,
like :class:`ElasticNet`. It solves

@rth (Member) Jul 31, 2019

Please keep lines under 80 char length.

Quantile regression estimates median or other quantiles of :math:`y` conditional on :math:`X`, while OLS estimates
conditional mean.

The :class:`QuantileRegressor` applies linear loss to all samples. It is thus more radical than

@rth (Member) Jul 31, 2019

Maybe "Somewhat related, :class:HuberRegressor ..

:align: center
:scale: 50%

Another possible advantage of quantile regression over OLS is its robustness

@rth (Member) Jul 31, 2019

remove "possible"

The right figure shows example of an asymmetric error distribution
(namely, Pareto).
"""
from __future__ import division

@rth (Member) Jul 31, 2019

We don't support Python2 anymore so this can be removed.

    sample_weight = np.array(sample_weight)
    check_consistent_length(y, sample_weight)
else:
    sample_weight = np.ones_like(y)

@rth (Member) Jul 31, 2019

If you merge master in, you could use the _check_sample_weight helper from utils.validation.
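Roughly, the block above would then collapse to something like this (assuming the helper is the one named _check_sample_weight in sklearn.utils.validation, which validates the array, checks its length against X, and defaults to ones):

from sklearn.utils.validation import _check_sample_weight

sample_weight = _check_sample_weight(sample_weight, X)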

loss_args = (X, y, self.quantile, self.alpha, self.l1_ratio,
             self.fit_intercept, sample_weight)
for i in range(self.n_gamma_decreases):
    gamma = self.gamma * self.gamma_decrease ** i

@rth (Member) Jul 31, 2019

So if l1_ratio=0.0 the penalty is smooth; do we actually need to do this loop? On failure to converge, I think this would try another round of LBFGS optimization, while it should just exit with a ConvergenceWarning in that case.

@rth (Member) Aug 1, 2019

Nevermind, I understand it's not smooth even without l1_ratio..

noise=1.0)

# check that for small n_iter, warning is thrown
with pytest.warns(None) as record:

@rth (Member) Jul 31, 2019

This can be done with,

with pytest.warns(ConvergenceWarning, match="QuantileRegressor did not converge"):
    # code here

warnings.warn("QuantileRegressor did not converge:" +
              " Scipy solver terminated with '%s'."
              % str(result['message'])
              )

@rth (Member) Jul 31, 2019

If you merge master in, you could use utils.optimize._check_optimize_result('lbfgs', result).
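That helper wraps the message formatting and the ConvergenceWarning, so the manual warnings.warn above could become roughly the following (assuming the private helper keeps its usual signature):

from sklearn.utils.optimize import _check_optimize_result

# warns with the scipy message if the solver did not converge,
# and returns the number of iterations actually performed
n_iter_ = _check_optimize_result("lbfgs", result)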

or np.abs(old_param) < self.xtol \
and new_value < value + self.gtol:
value = new_value
parameters = new_parameters

@rth (Member) Jul 31, 2019

This sounds quite expensive to calculate.


Another possible advantage of quantile regression over OLS is its robustness
to outliers, because it is only sign of an error that influences estimated
coefficients, not its absolute value.

@jeromedockes (Contributor) Aug 2, 2019

shouldn't this be "it is the value [...] rather than the squared error" ? "only the sign of an error" would correspond to L0 loss

@jeromedockes (Contributor) commented Aug 4, 2019

For reference after IRL discussion with @rth and @GaelVaroquaux, here is a paper
describing a way to solve quantile regression with elastic-net penalty, based on
a similar idea as this PR -- solving smoothed problems with decreasing gamma (as
far as I can tell the reference cited in this PR does not discuss
regularization):

Yi, Congrui, and Jian Huang. "Semismooth newton coordinate descent algorithm
for elastic-net penalized huber loss regression and quantile regression."
Journal of Computational and Graphical Statistics 26.3 (2017): 547-557.

Also, when applying (strictly positive) l2 regularization, the dual problem (of
the exact quantile regression) has a smooth loss and bound constraints, so in the
high-dimensional case (n features > n samples), maybe solving this problem
directly can be efficient too?

argmax_c c^Ty - (1 / 4 \lambda) c^TXX^Tc
s.t. quantile - 1 <= c <= quantile

sorry for the noise if this is not helpful
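For what it is worth, a hedged sketch of solving exactly the bound-constrained dual written above with a box-constrained quasi-Newton solver (lam is the L2 penalty strength; recovering the primal coefficients from c is omitted):

import numpy as np
from scipy.optimize import minimize

def solve_quantile_dual(X, y, quantile=0.5, lam=1.0):
    # maximize c @ y - c @ X @ X.T @ c / (4 * lam)
    # subject to quantile - 1 <= c_i <= quantile
    n_samples = X.shape[0]
    G = X @ X.T

    def neg_dual(c):
        return -(c @ y - c @ G @ c / (4.0 * lam))

    def neg_dual_grad(c):
        return -(y - G @ c / (2.0 * lam))

    bounds = [(quantile - 1.0, quantile)] * n_samples
    res = minimize(neg_dual, np.zeros(n_samples), jac=neg_dual_grad,
                   method="L-BFGS-B", bounds=bounds)
    return res.x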

@rth (Member) commented Aug 7, 2019

Thanks @jeromedockes that is very helpful!

The reference implementation for quantile regression is probably quantreg in R by Koenker, who is the author of the original 1978 paper. Adapted from its user manual for the rq function:

The default method is the modified version of the Barrodale and Roberts algorithm for l1 regression, used by l1fit in S, and is described in detail in Koenker and d'Orey (1987, 1994), default = "br". This is quite efficient for problems up to several thousand observations, and may be used to compute the full quantile regression process. It also implements a scheme for computing confidence intervals for the estimated parameters, based on inversion of a rank test described in Koenker (1994). For larger problems it is advantageous to use the Frisch–Newton interior point method "fn". And for very large problems one can use the Frisch–Newton approach after preprocessing "pfn". Both of the latter methods are described in detail in Portnoy and Koenker (1997); this approach is primarily well suited for large n, small p problems where the parametric dimension of the model is modest. For large problems with large parametric dimension it is usually advantageous to use method "sfn", which uses the Frisch–Newton algorithm but exploits sparse algebra to compute iterates. This is typically helpful when the model includes factor variables that, when expanded, generate design matrices that are very sparse. A sixth option, "fnc", enables the user to specify linear inequality constraints on the fitted coefficients; in this case one needs to specify the matrix R and the vector r representing the constraints in the form Rb ≥ r. See the examples. Finally, there are two penalized methods: "lasso" and "scad", which implement the lasso penalty and Fan and Li's smoothly clipped absolute deviation penalty, respectively. These methods should probably be regarded as experimental.

So it looks like only the last (experimental) solver supports penalties there.

@jeromedockes (Contributor) commented Aug 8, 2019

So it looks like only the last (experimental) solver supports penalties there.

yes; at least for "lasso" I think it is the same solver as described in

Portnoy, Stephen, and Roger Koenker. "The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators." Statistical Science 12.4 (1997): 279-300.

except that in this case the linear program becomes (n samples + n features)-dimensional with (n features) additional constraints

@jeromedockes (Contributor) commented Aug 8, 2019

The reference implementation for quantile regression is probably quantreg

I think so. Also, the algorithm described in Yi (2017) is implemented in hqreg.

@rth (Member) commented Sep 19, 2019

Overall it looks like for quantile regression the solver we may want is either the Lasso from Portnoy & Koenker (1997) or the Scad solver with smoothed penalty Fan & Li (1999).

Both Yi (2017), implemented in hqreg, and Chen & Wei (2005), the smoothed penalty used in this PR, probably won't pass the scikit-learn inclusion criteria citation-wise.

So adding this implementation to https://github.com/scikit-learn-contrib/scikit-learn-extra might be a first step toward making it accessible to users.

@rth rth removed this from the 0.22 milestone Sep 19, 2019
@GaelVaroquaux (Member) commented Sep 19, 2019
