
Attention supervision for multiple heads: average or summation? #27

Open
@lucasresck

Description

Dear authors,

In the paper, it is said that the final attention supervision loss is the average of the cross-entropy losses of the attention weights over the attention heads. However, in

loss_att += self.lam * masked_cross_entropy(attention_weights, attention_vals, attention_mask)

this does not seem to be an average: the per-head losses are summed and there is no division by the number of heads.

I am concerned about this detail because of the $\lambda$ hyperparameter. If one implements the loss as an average (as the paper says), the effective $\lambda$ is divided by the number of heads (e.g., 12), which may affect the reproducibility of the hyperparameters reported in the paper.
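To make the difference concrete, here is a minimal sketch of the two interpretations. It assumes a per-head `masked_cross_entropy(weights, targets, mask)` as in the snippet above; names such as `per_head_attention` and `num_heads` are only illustrative, not taken from the repository.

```python
# Summation over heads (what the repository code appears to do):
loss_att_sum = 0.0
for attention_weights in per_head_attention:  # one attention tensor per head
    loss_att_sum += lam * masked_cross_entropy(
        attention_weights, attention_vals, attention_mask
    )

# Average over heads (what the paper describes):
num_heads = len(per_head_attention)  # e.g., 12 heads
loss_att_avg = loss_att_sum / num_heads

# The two differ by a factor of num_heads, so a lambda tuned for the
# summed loss corresponds to lambda / num_heads for the averaged loss.
```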

Did I get it right? I would appreciate any clarification on this matter.

Thank you very much! 😊
