In dask.distributed we use a dispatch on type to determine the memory overhead of intermediate results. Having a rough sense of the size of an intermediate is useful for the scheduler, since it often correlates with the cost of serializing that intermediate between workers.
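For reference, that dispatch lives in `dask.sizeof`; a minimal illustration, assuming the current dask API:

```python
import numpy as np
from dask.sizeof import sizeof  # the type-based dispatch used by the scheduler

# The ndarray registration reports the underlying buffer size.
print(sizeof(np.ones(1000)))  # 8000 (1000 float64 values)
```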
The default is to fall back to `sys.getsizeof`, which calls the `__sizeof__` method on the object. It would be useful if this (or an equivalent method) were implemented for scikit-learn estimators.
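To make the hook concrete, `sys.getsizeof` defers to `__sizeof__` (plus a small GC header for tracked objects); a toy sketch:

```python
import sys

class Blob:
    def __sizeof__(self):
        return 1024  # pretend we hold 1 KiB of state

# Roughly 1024 plus CPython's GC header overhead.
print(sys.getsizeof(Blob()))
```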
A naive generic implementation for estimators might be:

```python
from sys import getsizeof

def __sizeof__(self):
    # Array attributes report nbytes; everything else falls back to getsizeof.
    return sum(x.nbytes if hasattr(x, 'nbytes') else getsizeof(x)
               for x in self.__dict__.values())
```
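Alternatively, the same heuristic could be registered from the dask side without touching scikit-learn at all; a sketch, assuming `dask.sizeof`'s `register` decorator:

```python
from sys import getsizeof

from dask.sizeof import sizeof
from sklearn.base import BaseEstimator

@sizeof.register(BaseEstimator)
def sizeof_estimator(est):
    # Same heuristic as above, applied to any estimator instance.
    return sum(v.nbytes if hasattr(v, 'nbytes') else getsizeof(v)
               for v in vars(est).values())
```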
It'd probably even be fine to ignore (or approximate) the memory usage of parameters, and just focus on the memory usage of the results of `fit`. This may be straightforward for numpy arrays, but less clear for things like trees.
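For tree-based models, one crude but serviceable proxy is the pickled payload, since serialization cost is what the scheduler cares about anyway; a sketch using `DecisionTreeClassifier` as a stand-in:

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
est = DecisionTreeClassifier().fit(X, y)

# The Cython Tree object has no nbytes; its pickled size is an
# upper bound that tracks the cost of moving it between workers.
print(len(pickle.dumps(est.tree_)))
```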