Activation Plateaus: Where and How They Emerge
By design, LLMs perform nonlinear mappings from their inputs (text sequences) to their outputs (next-token generations). Some of these nonlinearities are built-in to the model architecture, but others are learned by the model, and may be important parts of how the model represents and transforms information. By studying different aspects...