* Equal Contributors
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy of each attention head over the course of training, which serves as a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote this pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose sigmaReparam, a simple and efficient solution in which we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that the proposed reparameterization successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound on the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing further motivation for our approach. We conduct experiments with sigmaReparam on image classification, image self-supervised learning, machine translation, automatic speech recognition, and language modeling tasks, across a range of Transformer architectures. We show that sigmaReparam provides stability and robustness with respect to the choice of hyperparameters, going so far as to enable training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization, or adaptive optimizers; (b) deep architectures in machine translation; and (c) speech recognition models to competitive performance without warmup and adaptive optimizers.
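The reparameterization described above — scaling each linear layer's weight matrix by a learned scalar divided by its spectral norm — can be sketched as follows. This is a minimal illustrative PyTorch module, not the authors' implementation: the class name `SigmaReparam` and the single-step power iteration used to estimate the spectral norm are assumptions for the sake of the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SigmaReparam(nn.Module):
    """Sketch of a sigmaReparam linear layer: W_hat = (gamma / sigma(W)) * W,
    where sigma(W) is the spectral norm of W (estimated here by power
    iteration) and gamma is a learned scalar. Illustrative only."""

    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in**0.5)
        self.gamma = nn.Parameter(torch.ones(()))  # learned scalar
        # Persistent vector for power iteration, carried across steps.
        self.register_buffer("u", F.normalize(torch.randn(d_out), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One power-iteration step to track the top singular vectors;
        # u and v are treated as constants (no gradient flows through them).
        with torch.no_grad():
            v = F.normalize(self.weight.t() @ self.u, dim=0)
            self.u.copy_(F.normalize(self.weight @ v, dim=0))
        # Spectral norm estimate; gradients still flow through self.weight.
        sigma = torch.einsum("i,ij,j->", self.u, self.weight, v)
        w_hat = (self.gamma / sigma) * self.weight
        return x @ w_hat.t()
```

Dividing by the spectral norm caps the largest singular value of the effective weight at `gamma`, which (per the bound stated above) keeps the attention logits from growing in a way that drives the entropy lower bound toward zero.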