
Attention supervision for multiple heads: average or summation? #27

Open
@lucasresck

Description

Dear authors,

In the paper, it is said that the final attention supervision loss is the average of the cross-entropy losses of the attention weights over the attention heads. However, in

loss_att += self.lam * masked_cross_entropy(attention_weights, attention_vals, attention_mask)

this does not seem to be an average: the per-head losses are summed and there is no division by the number of heads.

I am concerned about this detail because of the $\lambda$ hyperparameter. If one implements the loss as an average (as the paper says), the effective $\lambda$ is divided by the number of heads (e.g., 12), which may affect the reproducibility of the hyperparameters reported in the paper.
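To make the difference concrete, here is a minimal sketch of the two interpretations. It assumes a per-head `masked_cross_entropy(weights, targets, mask)` as in the snippet above; names such as `per_head_attention` and `num_heads` are only illustrative, not taken from the repository.

```python
# Summation over heads (what the repository code appears to do):
loss_att_sum = 0.0
for attention_weights in per_head_attention:  # one attention tensor per head
    loss_att_sum += lam * masked_cross_entropy(
        attention_weights, attention_vals, attention_mask
    )

# Average over heads (what the paper describes):
num_heads = len(per_head_attention)  # e.g., 12 heads
loss_att_avg = loss_att_sum / num_heads

# The two differ by a factor of num_heads, so a lambda tuned for the
# summed loss corresponds to lambda / num_heads for the averaged loss.
```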

Did I get it right? I would appreciate any clarification on this matter.

Thank you very much! 😊
