Dask friendly check in .weighted()
#4559
Conversation
The CI environments without dask are failing. Should I add some pytest skip logic, or what is the best way to handle this?
Yes, since the test you added requires dask. That won't fix all the failing tests, though:
For simplicity I would use:

```python
weights = DataArray(weights).chunk({"dim_0": -1})
weighted = data.weighted(weights)
```
mathause
Nov 1, 2020
Collaborator
You test that dask does not compute:

xarray/xarray/tests/test_dask.py, lines 189 to 190 in 83884a1
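That "does not compute" requirement can be demonstrated without xarray's internal test helper. Below is a minimal sketch, assuming dask is installed; the `raise_if_computes` name and the error-raising scheduler are illustrative stand-ins for xarray's actual `raise_if_dask_computes` helper, not its real implementation:

```python
import contextlib

import dask
import dask.array as dsa
import numpy as np


@contextlib.contextmanager
def raise_if_computes():
    # Install a scheduler that always errors: any .compute() inside this
    # context triggers it, while pure graph-building does not.
    def scheduler(dsk, keys, **kwargs):
        raise RuntimeError("dask computed when it should have stayed lazy")

    with dask.config.set(scheduler=scheduler):
        yield


x = dsa.from_array(np.arange(4.0), chunks=2)
with raise_if_computes():
    y = x + 1  # fine: this only builds the task graph
```

Any accidental `.compute()` or `.values` access inside the context would raise, which is what the test at the referenced lines guards against.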
Co-authored-by: Deepak Cherian <[email protected]>
Co-authored-by: Maximilian Roos <[email protected]>
I think you need to do something along the lines of:

```python
if dask_duck_array(weights):
    import dask.array as dsa

    dsa.map_blocks(_weight_check, weights.data, dtype=weights.dtype)
else:
    _weight_check()
```
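A runnable sketch of that branch, assuming dask is installed. The `check_weights` helper and its boolean flag are hypothetical names for illustration only (xarray's real predicate is `is_duck_dask_array`, used later in this thread):

```python
import numpy as np
import dask.array as dsa


def _weight_check(w):
    # Runs once per block under dask, or eagerly for plain numpy arrays.
    if np.isnan(w).any():
        raise ValueError(
            "`weights` cannot contain missing values. "
            "Missing values can be replaced by `weights.fillna(0)`."
        )
    return w


def check_weights(data, is_dask):
    # Hypothetical helper mirroring the suggested if/else branch.
    if is_dask:
        # Attach the check lazily to every block; nothing computes here.
        return dsa.map_blocks(_weight_check, data, dtype=data.dtype)
    _weight_check(data)
    return data
```

The key point is that the dask branch only *schedules* the check; the `ValueError` surfaces when the weighted result is eventually computed.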
I did have to fiddle with this a bit. I did change
"`weights` cannot contain missing values. " | ||
"Missing values can be replaced by `weights.fillna(0)`." | ||
def _weight_check(w): | ||
if np.isnan(w).any(): |
keewis
Nov 2, 2020
Collaborator
`.isnull()` does a bit more than that: `np.isnan` won't detect `NaT`. @mathause, how likely is it to get datetime-like arrays here? They don't make much sense as weights, but as far as I can tell we don't check (I might be missing something, though)
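The difference is easy to demonstrate with plain numpy (a sketch; `np.isnat` is numpy's dedicated NaT check for datetime64/timedelta64 arrays, added in numpy 1.13):

```python
import numpy as np

# For float arrays, np.isnan finds missing values as expected.
floats = np.array([1.0, np.nan])
print(np.isnan(floats).any())  # True

# For datetime64/timedelta64 arrays, NaT is the missing-value marker,
# and np.isnan does not handle these dtypes; np.isnat does.
times = np.array(["2020-01-01", "NaT"], dtype="datetime64[ns]")
print(np.isnat(times).any())  # True
```

This is why `.isnull()` (which dispatches on dtype) catches `NaT` while a bare `np.isnan` does not.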
mathause
Nov 2, 2020
Collaborator
There is no check for that. A `TimeDelta` may make some sense as weights. `DateTime` not so much. I think we can get away with using `np.isnan`. A `Date*` array as weights containing `NaT` should be super uncommon.
jbusecke
Nov 2, 2020
Author
Contributor
We could still operate on the dataarray instead of the dask/numpy array, but as @dcherian suggested, that would be less efficient. I would be curious as to what penalties would actually occur when we use the `weights.map_blocks` compared to `dask.array.map_blocks`?
dcherian
Nov 3, 2020
Contributor
- Could use `duck_array_ops.isnull` to account for `timedelta64`? It is weird to have it as a weight though. Does that work?
- Re map_blocks: the xarray version adds tasks that create xarray objects wrapping every block in a dask array. That adds overhead which is totally unnecessary here.
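The overhead point can be made concrete at the dask level by counting tasks in the graph. A sketch, assuming dask is installed; xarray's `DataArray.map_blocks` would add further wrapping/unwrapping tasks on top of what is shown here:

```python
import numpy as np
import dask.array as dsa

x = dsa.from_array(np.arange(8.0), chunks=2)  # 4 blocks

# dask.array.map_blocks adds roughly one task per block.
checked = dsa.map_blocks(lambda b: b, x, dtype=x.dtype)

print(len(dict(x.__dask_graph__())))        # task count before
print(len(dict(checked.__dask_graph__())))  # task count after: grows by ~one per block
```

With the xarray-level `map_blocks`, each of those per-block tasks would also construct a `DataArray` around the block and strip it off again, which is the unnecessary overhead being discussed.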
Do you think this works or are further changes needed? Many thanks for the guidance so far!

Looks great to me. Want to add a whatsnew?
```python
if is_duck_dask_array(weights.data):
    import dask.array as dsa

    weights.data = dsa.map_blocks(
```
dcherian
Nov 3, 2020
Contributor
Suggested change:

```diff
- weights.data = dsa.map_blocks(
+ weights = weights.copy(data=dsa.map_blocks(
```

so we don't modify the original object. Could even do `weights.data.map_blocks(...)` to save some typing...
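The difference between the two spellings can be sketched with xarray (assuming xarray is installed; the `+ 0.0` stands in for the lazy `map_blocks` check, which maps each block through a function):

```python
import numpy as np
import xarray as xr

weights = xr.DataArray(np.array([1.0, 2.0, 3.0]))

# Mutating spelling:     weights.data = ...      changes the caller's object.
# Non-mutating spelling: weights.copy(data=...)  returns a fresh DataArray
# with the same coords/dims but the new data.
checked = weights.copy(data=weights.data + 0.0)

print(checked is weights)  # False: a new object, original left untouched
```

Returning a new object keeps `.weighted()` free of surprising side effects on the weights the user passed in.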
Co-authored-by: Maximilian Roos <[email protected]>
.weighted()
I am getting some failures for
Similarly on
Ok I think this should be good to go. I have implemented all the requested changes. The remaining failures are related to other problems upstream (I think). Anything else I should add here?
Co-authored-by: Mathias Hauser <[email protected]>
I don't understand why that
This CI is sometimes flaky, but it's usually enough to just rerun it. I'll do that for you once the other CI has finished.
Seems like all the other tests are passing (minus the two upstream problems discussed before).
Merged commit 5a60354 into pydata:master
Thanks @jbusecke
* initial changes
* Using map_blocks to lazily mask input arrays, following #4559
* Adding lazy corr cov test with `raise_if_dask_computes`
* adding test for one da without nans
* checking ordering of arrays doesn't matter
* adjust inputs to test
* add test for array with no missing values
* added whatsnew
* fixing format issues

Co-authored-by: Deepak Cherian <[email protected]>
`isort . && black . && mypy . && flake8`
whats-new.rst