Since at least 2021, according to the authors of a preprint posted in March, researchers have been noticing something interesting inside their models.
A model, an AI program built on a neural network architecture, processes a word by learning to represent it as a vector: an arrow in a high-dimensional space. The directions of these words, each of which ends up at one single point, become the model's carriers of information.
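As an illustrative sketch (the words, dimensions, and vectors below are made up, not drawn from any model in the article), here is what word-as-arrow representation looks like in code. Because direction rather than length carries the information, similarity between two words is typically measured by the cosine of the angle between their arrows.

```python
import numpy as np

# Toy embedding table: each word maps to an 8-dimensional vector.
# In a trained model, words with related meanings would point in
# similar directions; here the vectors are random, so the cosine
# similarity is just some value in [-1, 1].
rng = np.random.default_rng(42)
vocab = ["king", "queen", "banana"]
embeddings = {w: rng.normal(size=8) for w in vocab}

def cosine(u, v):
    """Cosine of the angle between two vectors: direction, not length."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim = cosine(embeddings["king"], embeddings["queen"])
print(sim)
```

A real model's embedding table works the same way, only with tens of thousands of words and thousands of dimensions per vector.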
While these spaces are already strange in their vastness, often consisting of thousands of dimensions, researchers were noticing something even more peculiar: sometimes, inputs would form distinctively shaped clouds of points, looking, for example, like 'Swiss rolls' or cylinders after being projected down to just three dimensions using standard methods. Over the next few years, they started to see other cloudy shapes, too: curves, loops, circles, helices, tori, even trees and fractal geometries.
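One of those "standard methods" for projecting down to three dimensions is principal component analysis (PCA). The sketch below is a minimal, hypothetical demonstration, not taken from any cited study: it plants a 3-D helix inside a 512-dimensional space, then recovers a 3-D view of it with PCA computed via the singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "activations": a helix that secretly lives in a 3-D
# subspace of a 512-dimensional space.
t = np.linspace(0, 4 * np.pi, 500)
helix_3d = np.stack([np.cos(t), np.sin(t), t / (4 * np.pi)], axis=1)
basis, _ = np.linalg.qr(rng.normal(size=(512, 3)))  # random orthonormal embedding
activations = helix_3d @ basis.T                     # shape (500, 512)

# PCA: center the cloud, then project onto the top-3 right singular vectors.
centered = activations - activations.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:3].T                      # shape (500, 3)

# Since the cloud truly lies in a 3-D subspace, the projection keeps
# essentially all of the variance.
var_kept = projected.var(axis=0).sum() / centered.var(axis=0).sum()
print(projected.shape, float(var_kept))
```

Plotting `projected` would show the helix; real model activations are messier, but the same projection step is what reveals the Swiss rolls and cylinders the researchers describe.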
That models might learn to organize information in shapes did not necessarily surprise people. It was natural to think that a model might learn that certain categories of inputs could all be clumped together, like inputs describing calendar dates, or colors, or arithmetical operations.
But in 2023, when others discovered a new method for understanding the insides of their models, called sparse autoencoders (SAEs), the observations began to seem a little odder. This method, which quickly gained traction, suggested a very different picture: that the most important concepts a model learned, like love, or logic, or the identities of different people, were highly fragmented, each one tearing off in its own direction. Why, then, were certain inputs found close together?
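The core of the SAE idea can be sketched in a few lines. The version below is a minimal, untrained toy with illustrative sizes and weights, not the setup of any particular study: activations are encoded into an overcomplete set of features (more features than dimensions), and training would push each activation to be rebuilt from only a few active features at once.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_features = 64, 512          # features deliberately outnumber dimensions

# Randomly initialized encoder and decoder weights (a real SAE learns these).
W_enc = rng.normal(0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0, 0.1, (d_features, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode to nonnegative feature activations (ReLU), then decode back."""
    f = np.maximum(0.0, x @ W_enc + b_enc)   # feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction of the input
    return f, x_hat

x = rng.normal(size=(8, d_model))            # a batch of stand-in "activations"
f, x_hat = sae_forward(x)

# Training minimizes reconstruction error plus an L1 penalty that drives
# most feature activations to exactly zero, i.e. sparsity.
l1_coeff = 1e-3
loss = np.mean((x - x_hat) ** 2) + l1_coeff * np.abs(f).mean()
print(f.shape, x_hat.shape)
```

Each learned feature corresponds to a direction in the model's space; the surprise the article describes is that these directions seemed to fragment concepts, even as point clouds formed coherent shapes.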
Almost as soon as this hint of contradiction surfaced, it was quelled by other findings. Both the preprint from March and an October study by researchers at the company Anthropic have shown that models learn shapes in ways that complement the tendencies suggested by other methods. As a consequence, we are increasingly making sense of why models learn to make shapes in the high-dimensional minds that they live in.
"There's a lot of confusion, but it also feels like there's been a lot of progress," said Eric Michaud of the Massachusetts Institute of Technology (MIT), who spoke to Foom in an interview. "I don't know where it's all going to go. But overall, it feels healthy."
Continue reading at foommagazine.org ...