Summary
A 2025 paper by authors from Meta (FAIR), MIT, NYU, and Princeton shows that transformers can match or exceed the performance of their normalized counterparts when Layer Normalization is replaced with a Dynamic Tanh (DyT) operation. The PR for this issue will add DyT to the nnx source (norms) with minimal code. A successor paper introduces a Dynamic erf (Derf) function, for which I will open a separate issue.
I am attaching the related links here:
Paper: https://arxiv.org/pdf/2503.10622
Title: Transformers without Normalization
Code: https://github.com/jiachenzhu/DyT
Website: https://jiachenzhu.github.io/DyT/
cc: @cgarciae @vfdev-5 @samanklesaria
Additional context
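To make the proposal concrete, here is a minimal sketch of the DyT operation as the paper describes it, `DyT(x) = gamma * tanh(alpha * x) + beta`, with a learnable scalar `alpha` and per-feature `gamma` and `beta`. It is written in plain JAX rather than as an `nnx.Module`, so it is not the final API. The names `init_dyt` and `dyt_apply` are hypothetical, and the `alpha` initial value of 0.5 is an assumption based on the default reported in the paper.

```python
import jax
import jax.numpy as jnp


def init_dyt(num_features, alpha_init=0.5):
    """Create DyT parameters (hypothetical helper, not a Flax API)."""
    return {
        # Learnable scalar that scales the input before tanh.
        "alpha": jnp.asarray(alpha_init, dtype=jnp.float32),
        # Per-feature scale and shift, analogous to LayerNorm's affine terms.
        "gamma": jnp.ones((num_features,), dtype=jnp.float32),
        "beta": jnp.zeros((num_features,), dtype=jnp.float32),
    }


def dyt_apply(params, x):
    """Apply DyT elementwise; gamma and beta broadcast over the last axis."""
    return params["gamma"] * jnp.tanh(params["alpha"] * x) + params["beta"]


params = init_dyt(num_features=4)
x = jnp.arange(8.0).reshape(2, 4)
y = dyt_apply(params, x)

# All three parameters receive gradients, so they can be trained as usual.
grads = jax.grad(lambda p: dyt_apply(p, x).sum())(params)
```

Unlike Layer Normalization, this computes no mean or variance over the feature axis, so the NNX version should need only three parameters and a single elementwise expression.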
