Replies: 7 comments
-
Thank you @profPlum for the detailed issue - we simplified the API of the losses in #486. In short: to support arbitrary inputs (and not just regular grids), we allow users to optionally provide custom quadratures. In the absence of a quadrature, we assume a regular grid over a domain of a given measure (by default, now set to 1), which users can override (e.g. 2*pi for Navier-Stokes). This allows the loss to converge to the integral as we refine the discretization. So by default, the loss averages over the spatial dimensions (the sum over grid points is scaled by measure/num_points), and the user can choose whether to reduce over channels and batch via sum or mean, as in PyTorch. We also average over the batch dimension in the trainer (not in the loss directly), which lets us efficiently compute averages over multiple mini-batches in a distributed manner (e.g. for the validation set). With the new changes, the API should be clearer and default to what you would expect. Let us know if you have any other feedback or suggestions!
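For concreteness, a sketch of the intended usage (the parameter names follow the description above, not a verified signature; check the #486 diff for the exact API):

```python
import math
import torch
from neuralop.losses import LpLoss  # import path may differ by version

# measure: total measure of the domain; defaults to 1, i.e. a plain
# spatial average (e.g. 2 * math.pi for Navier-Stokes).
# reductions: how to reduce the remaining channel/batch dims,
# 'sum' or 'mean', as in PyTorch.
loss_fn = LpLoss(d=2, p=2, measure=2 * math.pi, reductions='mean')

y_pred = torch.randn(8, 1, 64, 64)  # (batch, channels, height, width)
y_true = torch.randn(8, 1, 64, 64)
loss = loss_fn(y_pred, y_true)
```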
-
I still think it would be more intuitive if metrics weren't divided by batch size in the Trainer (and if losses defaulted to mean reduction, like torch.nn.L1Loss). If you consider custom loss/metric functions (e.g. lambdas), the warning won't work anymore. And generally, when users define a loss/metric a particular way, they expect it to be reported as-is, without silent modification behind the scenes. P.S. For me the original motivating example was trying to debug a wrapped version of...
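For reference, the PyTorch convention being pointed to: torch.nn.L1Loss defaults to reduction='mean', so the reported value is already comparable across batch sizes:

```python
import torch

# The default reduction is 'mean': the loss is averaged over *all*
# elements, so its scale does not depend on the batch size.
pred, target = torch.randn(4, 10), torch.randn(4, 10)
assert torch.allclose(
    torch.nn.L1Loss()(pred, target),  # default: reduction='mean'
    torch.nn.L1Loss(reduction='sum')(pred, target) / pred.numel(),
)
```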
-
I agree in principle, but then it becomes trickier to accumulate results over multiple devices and mini-batches (e.g. for a large validation set). One option is to instantiate a distributed meter handler that properly accumulates averages and assumes the batch size of the input corresponds to the actual number of samples being averaged over (an assumption that may not always hold, e.g. if a model augments the batch dim by mirroring inputs).
-
I don't understand how that becomes tricky; you can just average all the mini-batch metrics. It is already done in many libraries, e.g. Keras. Also, I don't think it matters if a user does data augmentation: augmented data points can be treated as regular data points. Just sum all the mini-batch metrics, count how many you summed (N), then do sum(metrics)/N, as in the sketch below.
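A minimal sketch of that accumulation (metric_fn and loader are placeholders; see the next reply for the caveat about unequal batch sizes):

```python
# Mean of per-batch means: sum each mini-batch metric, count how many
# were summed, then divide.
metrics = [metric_fn(model(x), y).item() for x, y in loader]
epoch_metric = sum(metrics) / len(metrics)
```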
-
The mean of means is not in general equal to the global mean: we are not guaranteed to have the same mini-batch size on each device/node using DDP, especially during evaluation, where we care mostly about accuracy. We'd need to track the mini-batch size on each device and accumulate these properly through an incremental mean handler, as in the sketch below.
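A minimal single-process sketch of such a handler (a DDP version would additionally all-reduce total and count across ranks):

```python
class IncrementalMean:
    """Batch-size-weighted running mean of a metric."""

    def __init__(self):
        self.total = 0.0  # sum of batch_mean * batch_size over batches
        self.count = 0    # total number of samples seen

    def update(self, batch_mean, batch_size):
        self.total += batch_mean * batch_size
        self.count += batch_size

    def compute(self):
        return self.total / self.count

# With batch sizes 3 and 1, the plain mean-of-means is biased:
meter = IncrementalMean()
meter.update(batch_mean=2.0, batch_size=3)  # samples: 2, 2, 2
meter.update(batch_mean=6.0, batch_size=1)  # samples: 6
assert meter.compute() == 3.0  # global mean; mean-of-means gives 4.0
```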
-
Hmm, that sounds like a strange customization option to support; I'm not sure how many people use it. But even so, I think it's possible to overcome: you could multiply each metric by its batch size, then divide by the total dataset size. That is my current hacky workaround.
Beta Was this translation helpful? Give feedback.
-
I think this makes a better discussion than an issue, since we have a convention for our problems.
-
Hi, I realize you guys are trying to establish your own conventions.
But I think this particular convention of expecting all losses to sum across the batch dim is too surprising for regular PyTorch users... Especially considering that it produces silent errors, which are the worst kind!
Examples where this sum-reduction convention is problematic:
Proposed Solution(s):
Minimal Solution: For the Lp and H1 (etc.) losses, just divide the output by the size of the batch dimension (inside the loss function). This would at least fix problems 1 & 2.
More General Solution: Alternatively, you could just do what PyTorch itself does (e.g. torch.nn.L1Loss and torch.nn.NLLLoss) and adopt the mean-reduction convention for loss functions. After all, the purpose of neural operators is resolution invariance, isn't it? Sum-reduced Lp and H1 (etc.) losses are not resolution invariant, so why not adopt mean reduction? (See the sketch below.)
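To illustrate the resolution-invariance point (a standalone sketch, not the library's actual loss): a sum-reduced discrete Lp norm grows with the number of grid points, while a mean-reduced one converges as the grid is refined:

```python
import math
import torch

def lp_sum(err, p=2):
    # Sum reduction over grid points: grows with resolution.
    return (err.abs() ** p).sum() ** (1 / p)

def lp_mean(err, p=2):
    # Mean reduction: approximates the continuous Lp norm on a unit
    # domain, so it is stable under grid refinement.
    return (err.abs() ** p).mean() ** (1 / p)

f = lambda x: torch.sin(2 * math.pi * x)  # fixed "error" function
for n in (64, 256, 1024):
    err = f(torch.linspace(0, 1, n))
    print(n, lp_sum(err).item(), lp_mean(err).item())
# lp_sum grows like sqrt(n); lp_mean stays ~0.707 at every resolution.
```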