
Dask friendly check in .weighted() #4559

Merged: 23 commits merged into pydata:master on Nov 9, 2020

Conversation

@jbusecke (Contributor) commented Oct 31, 2020

  • Closes #4541
  • Tests added
  • Passes isort . && black . && mypy . && flake8
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
@pep8speaks commented Oct 31, 2020

Hello @jbusecke! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-11-09 15:24:48 UTC

@jbusecke marked this pull request as draft Oct 31, 2020
@max-sixty (Collaborator) left a comment

Looks good! Thanks @jbusecke. I'll let @dcherian or @mathause comment too.

xarray/core/weighted.py (outdated review thread, resolved)
@jbusecke (Contributor, Author) commented Nov 1, 2020

The CI environments without dask are failing. Should I add some pytest skip logic, or what is the best way to handle this?

@max-sixty (Collaborator) commented Nov 1, 2020

Yes, add @requires_dask on the test.
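
For reference, a minimal sketch of that fix, assuming the test lives in xarray's test suite (where requires_dask is defined in xarray/tests/__init__.py); the test body is illustrative, not the exact test added in this PR:

import numpy as np
import xarray as xr

from . import requires_dask  # xarray test-suite helper


@requires_dask  # skipped automatically on CI environments without dask
def test_weighted_with_lazy_weights():
    weights = xr.DataArray(np.ones(4), dims="x").chunk({"x": -1})
    data = xr.DataArray(np.arange(4.0), dims="x")
    data.weighted(weights)  # must not raise for NaN-free dask weights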

@keewis (Collaborator) commented Nov 1, 2020

Since the test you added requires dask, it should definitely get a @requires_dask.

That won't fix all the failing tests, though: map_blocks is only valid if we operate on dask arrays. I guess you either have to apply the function directly to other arrays, or we have to make DataArray.map_blocks on non-dask arrays call DataArray.map.

@mathause (Collaborator) left a comment

For simplicity I would use if is_duck_dask_array(weights.data):.


weights = DataArray(weights).chunk({"dim_0": -1})

weighted = data.weighted(weights)

@mathause (Collaborator), Nov 1, 2020:

You can test that dask does not compute:

with raise_if_dask_computes():
    actual = v.argmax(dim="x")
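
Adapted to this PR, the same pattern might look like this (a sketch with illustrative names; raise_if_dask_computes and requires_dask are helpers defined in xarray/tests/__init__.py):

import numpy as np
import xarray as xr

from . import raise_if_dask_computes, requires_dask


@requires_dask
def test_weight_check_is_lazy():
    weights = xr.DataArray(np.ones(4), dims="x").chunk({"x": -1})
    data = xr.DataArray(np.arange(4.0), dims="x").chunk({"x": -1})
    with raise_if_dask_computes():
        data.weighted(weights).mean("x")  # building the graph must stay lazy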

xarray/core/weighted.py (outdated review thread, resolved)
@mathause (Collaborator) commented Nov 1, 2020

I think you need to do something along the lines of:

if is_duck_dask_array(weights.data):
    import dask.array as dsa
    dsa.map_blocks(_weight_check, weights.data, dtype=weights.dtype)
else:
    _weight_check(weights.data)
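
Spelled out with the helper defined, a sketch of the approach under discussion at this point in the thread (not the merged code; _weight_check is the hypothetical helper name used above, and the snippet assumes it runs where weights is already a DataArray):

import numpy as np

from xarray.core.pycompat import is_duck_dask_array


def _weight_check(w):
    if np.isnan(w).any():
        raise ValueError("`weights` cannot contain missing values.")
    return w  # map_blocks must return the (unchanged) block


if is_duck_dask_array(weights.data):
    import dask.array as dsa

    # re-attach the mapped array so the check actually ends up in the graph;
    # it then runs per block, only when the weights are computed
    weights.data = dsa.map_blocks(_weight_check, weights.data, dtype=weights.dtype)
else:
    _weight_check(weights.data)  # eager check for numpy arrays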

@jbusecke (Contributor, Author) commented Nov 1, 2020

I did have to fiddle with this a bit: I changed .isnull() to np.isnan() and replaced the data of weights, otherwise the error would not be raised for the dask array at all.
This does not look terribly elegant to me, but it passes tests locally. Waiting for the CI to see if the others pass as well. Happy to make further changes, and thanks for all the input already.

def _weight_check(w):
    if np.isnan(w).any():
        raise ValueError(
            "`weights` cannot contain missing values. "
            "Missing values can be replaced by `weights.fillna(0)`."
        )

@keewis (Collaborator), Nov 2, 2020:

.isnull() does a bit more than that: np.isnan won't detect NaT. @mathause, how likely is it to get datetime-like arrays here? They don't make much sense as weights, but as far as I can tell we don't check (I might be missing something, though)

@mathause (Collaborator), Nov 2, 2020:

There is no check for that. A TimeDelta may make some sense as weights. DateTime not so much. I think we can get away with using np.isnan. A Date* array as weights containing NaT should be super uncommon.

@jbusecke (Contributor, Author), Nov 2, 2020:

We could still operate on the DataArray instead of the dask/numpy array, but as @dcherian suggested, that would be less efficient. I would be curious what penalties actually occur when we use weights.map_blocks compared to dask.array.map_blocks?

@dcherian (Contributor), Nov 3, 2020:
  1. Could we use duck_array_ops.isnull to account for timedelta64? It is weird to have it as a weight, though. Does that work? (See the sketch below.)
  2. Re map_blocks: the xarray version adds tasks that create xarray objects wrapping every block in a dask array. That adds overhead which is totally unnecessary here.
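
A quick illustration of what duck_array_ops.isnull buys over np.isnan (xarray.core.duck_array_ops is a private module, so this is for exposition only):

import numpy as np

from xarray.core import duck_array_ops

td = np.array([np.timedelta64(1, "D"), np.timedelta64("NaT")])
print(duck_array_ops.isnull(td))  # [False  True] -- the NaT is detected
# np.isnan, by contrast, won't detect NaT (see the discussion above)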

@jbusecke marked this pull request as ready for review Nov 2, 2020
@jbusecke (Contributor, Author) commented Nov 3, 2020

Do you think this works or are further changes needed? Many thanks for the guidance so far!

@max-sixty (Collaborator) left a comment

Looks great to me. Want to add a whatsnew?

xarray/core/weighted.py (outdated review thread):

if is_duck_dask_array(weights.data):
    import dask.array as dsa

    weights.data = dsa.map_blocks(

@dcherian (Contributor), Nov 3, 2020:
Suggested change:

-    weights.data = dsa.map_blocks(
+    weights = weights.copy(data=dsa.map_blocks(

This way we don't modify the original object. We could even do weights.data.map_blocks(...) to save some typing... (see the sketch below)
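
Put together, the suggested variant might read roughly like this (a sketch; _weight_check is the helper from earlier in the thread):

if is_duck_dask_array(weights.data):
    # dask arrays expose .map_blocks directly, so no dask.array import is needed
    weights = weights.copy(
        data=weights.data.map_blocks(_weight_check, dtype=weights.dtype)
    )
else:
    _weight_check(weights.data)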

@jbusecke changed the title from "Dask friendly check for nans in .weighted()" to "Dask friendly check in .weighted()" Nov 4, 2020
@jbusecke (Contributor, Author) commented Nov 4, 2020

I am getting some failures for py38-upstream (TestDataset.test_polyfit_warnings) that seem unrelated?

@jbusecke (Contributor, Author) commented Nov 4, 2020

Similarly on MacOSX py38 this fails: TestDask.test_save_mfdataset_compute_false_roundtrip

@keewis (Collaborator) commented Nov 4, 2020

The TestDask.test_save_mfdataset_compute_false_roundtrip failure should be #4539, and the failing upstream-dev CI shouldn't be related either (my guess is that a dependency, most likely numpy, added a new warning). I'll investigate and open a new issue for that (edit: see #4565).

@jbusecke (Contributor, Author) commented Nov 5, 2020

OK, I think this should be good to go. I have implemented all the requested changes. The remaining failures are related to other upstream problems (I think). Anything else I should add here?

xarray/core/weighted.py (two outdated review threads, resolved)
@jbusecke (Contributor, Author) commented Nov 6, 2020

I don't understand why the MinimumVersionsPolicy check is failing (or what it does, haha). Is this something upstream, or should I fix it?

@keewis (Collaborator) commented Nov 6, 2020

This CI job is sometimes flaky, but it's usually enough to just rerun it. I'll do that for you once the other CI has finished.

@jbusecke (Contributor, Author) commented Nov 6, 2020

Seems like all the other tests are passing (minus the two upstream problems discussed before).

@dcherian (Contributor) left a comment

Thanks @jbusecke, great contribution!

xarray/core/weighted.py (outdated review thread, resolved)
@mathause merged commit 5a60354 into pydata:master Nov 9, 2020
19 of 21 checks passed

pydata.xarray Build #20201109.8: had test failures
pydata.xarray (Linux py38-upstream-dev): failed
codecov/project: 36.11% (target 1.00%)
docs/readthedocs.org:xray: Read the Docs build succeeded
pydata.xarray (Doctests): succeeded
pydata.xarray (FormattingBlack): succeeded
pydata.xarray (LintFlake8): succeeded
pydata.xarray (Linux py36): succeeded
pydata.xarray (Linux py36-bare-minimum): succeeded
pydata.xarray (Linux py36-min-all-deps): succeeded
pydata.xarray (Linux py36-min-nep18): succeeded
pydata.xarray (Linux py37): succeeded
pydata.xarray (Linux py38): succeeded
pydata.xarray (Linux py38-all-but-dask): succeeded
pydata.xarray (Linux py38-backend-api-v2): succeeded
pydata.xarray (Linux py38-flaky): succeeded
pydata.xarray (MacOSX py38): succeeded
pydata.xarray (MinimumVersionsPolicy): succeeded
pydata.xarray (TypeChecking): succeeded
pydata.xarray (Windows py37): succeeded
pydata.xarray (isort): succeeded
@mathause (Collaborator) commented Nov 9, 2020

Thanks @jbusecke

AndrewWilliams3142 added a commit to AndrewWilliams3142/xarray that referenced this pull request May 16, 2021
max-sixty pushed a commit that referenced this pull request May 27, 2021
* initial changes

* Using map_blocks to lazily mask input arrays, following #4559

* Adding lazy corr cov test with `raise_if_dask_computes`

* adding test for one da without nans

* checking that the ordering of arrays doesn't matter

Co-authored-by: Deepak Cherian <[email protected]>

* adjust inputs to test

* add test for array with no missing values

* added whatsnew

* fixing format issues

Co-authored-by: Deepak Cherian <[email protected]>
6 participants