Vladimir_Nesov

Coalitional agency seems like an unnecessary constraint on the design of a composite agent, since an individual agent could just (choose to) listen to other agents and behave the way their coalition would endorse, thereby effectively becoming a composite agent without being composite "by construction". The step where an agent chooses which other (hypothetical) agents to listen to makes constraints on the nature of those agents unnecessary: the choice to listen to some agents and not others can impose any constraints that particular agent cares about, so an "agent" could be something as vague as a "computation" or a program.

(Choosing to listen to a computation means choosing a computation based on considerations other than its output, committing to use its output in a particular way without yet knowing what it's going to be, and carrying out that commitment once the output becomes available, regardless of what it turns out to be.)

This way we can get back to individual rationality: figuring out which other agents/computations an agent should choose to listen to when coming up with its own beliefs and decisions. Actually listening to those other computations, at least occasionally, is the step missing from most decision theories, and it's the step that would take care of interaction with other agents (both actual and hypothetical).

Discussions of how to aggregate values and how to aggregate probabilities feel disjoint. The Jeffrey-Bolker formulation of expected utility presents the preference data as two probability distributions over the same sample space, so that the expected utility of an event is reconstructed as the ratio of the event's measures under the two priors. (The measure that goes into the numerator is "shouldness", and the other one remains "probability".)
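
A minimal way to write down this reading, in my own notation (not anything from the Jeffrey-Bolker literature itself):

```latex
% P = "probability" prior, S = "shouldness" prior, two measures on the same sample space
EU(A) = \frac{S(A)}{P(A)}, \qquad \text{for an event } A \text{ with } P(A) > 0
```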

This gestures at a way of reducing the problem of aggregating values to the problem of aggregating probabilities. In particular, markets seem to be easier to set up for probabilities than for expected utilities, so it might be better to set up two markets that are technically the same type of thing, one for probability and one for shouldness, than to target expected utility directly. Values of different agents are incomparable, but so are priors, so any fundamental issues with aggregation seem to remain unchanged by this reformulation. These can't be "prediction" markets, since resolution is not straightforward and is somewhat circular, grounded in whatever the coalition eventually settles on, but logical induction already has to deal with similar issues.
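
A toy sketch of the reduction (the agents, coalition weights, and normalization conventions below are my own illustrative assumptions): each agent is a pair of priors over the same finite sample space, the coalition aggregates probability priors with probability priors and shouldness priors with shouldness priors, and expected utility falls out as a ratio at the end.

```python
# Toy sketch: each agent holds two priors over the same finite sample space,
# "probability" P and "shouldness" S. The coalition mixes P's with P's and
# S's with S's (here by a simple weighted mixture), and the expected utility
# of an event is recovered at the end as a ratio of the two aggregated measures.

def mix(priors, weights):
    """Weighted mixture of distributions given as {outcome: mass} dicts."""
    outcomes = set().union(*priors)
    return {w: sum(wt * pr.get(w, 0.0) for pr, wt in zip(priors, weights))
            for w in outcomes}

def expected_utility(P, S, event):
    """EU of an event as the ratio of its shouldness mass to its probability mass."""
    p = sum(P[w] for w in event)
    s = sum(S[w] for w in event)
    return s / p  # undefined for probability-zero events

# Two hypothetical agents over outcomes {"a", "b"}: (probability, shouldness) pairs.
agents = [
    ({"a": 0.5, "b": 0.5}, {"a": 0.9, "b": 0.1}),
    ({"a": 0.2, "b": 0.8}, {"a": 0.4, "b": 0.6}),
]
weights = [0.5, 0.5]  # illustrative coalition weights
P = mix([p for p, _ in agents], weights)
S = mix([s for _, s in agents], weights)
print(expected_utility(P, S, {"a"}), expected_utility(P, S, {"b"}))
```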

The Abilene site of Stargate will host 100K-128K chips in GB200 NVL72 racks by this summer, and a total of 400K-512K chips in 2026, based on a new post by Crusoe and a reinterpretation of the recent Bloomberg post in light of it. For 2025, that's less than 200K chips[1], but more than the surprisingly low 16K-32K chips[2] that the Bloomberg post suggested. It can be a training system after all, but training a raw-compute "GPT-5" (2e27 FLOPs) by the end of 2025 would require using FP8[3].

The Crusoe post says "initial phase, comprising two buildings at ... 200+ megawatts" and "each building is designed to operate up to 50,000 NVIDIA GB200 NVL72s". Dylan Patel's estimate (at 1:24:42) for all-in datacenter power per Blackwell GPU was 2.0 kW (which must mean per chip, or else it's way too much). At GTC 2025, Jensen Huang showed a slide (at 1:20:52) where the estimate is 2.3 kW per chip (100 MW per 85K dies, which is 42.5K chips).

So the "50K GB200 NVL72s" per building from the Mar 2025 Crusoe post can only mean the number of chips (not dies or superchips), and the "100K GPUs" per building from the Jul 2024 Crusoe post must've meant 100K compute dies (which is 50K chips). It's apparently 100-115 MW per building then, or 800-920 MW for all 8 buildings in 2026, which is notably lower than 1.2 GW the Mar 2025 Crusoe post cites.

How can Bloomberg's 16K "GB200 semiconductors" in 2025 and 64K in 2026 be squared with this? The Mar 2025 Crusoe post says there are 2 buildings now and 6 additional buildings coming in 2026, for a total of 8, so in 2026 the campus grows 4x, which fits the 16K vs. 64K from Bloomberg. But the numbers themselves must be counting in units of 8 chips. This fits counting in units of GB200 NVL8 (see at 1:13:39), which can be referred to as a "superchip". The Mar 2025 Crusoe post says the Abilene site will be using NVL72 racks, so counting in NVL8 units is wrong, but someone must've made that mistake on the way to the Bloomberg post.

Interpreting the Bloomberg numbers in units of 8 chips, we get 128K chips in 2025 (64K chips per building) and 512K chips in 2026 (about 7K GB200 NVL72 racks). This translates to 256-300 MW for the current 2 buildings and 1.0-1.2 GW for the 8 buildings in 2026. This fits the 1.2 GW figure from the Mar 2025 Crusoe post better, so there might be some truth to the Bloomberg post after all, even if it was delivered in a thoroughly misleading way.
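
The same back-of-the-envelope arithmetic for the reinterpreted counts:

```python
# Reinterpret Bloomberg's counts as units of 8 chips and recheck the power figures.
unit = 8                       # assume each Bloomberg "GB200 semiconductor" stands for 8 chips (NVL8)
chips_2025 = 16_000 * unit     # 128K chips across the current 2 buildings
chips_2026 = 64_000 * unit     # 512K chips across all 8 buildings
print(f"NVL72 racks in 2026: {chips_2026 / 72:.0f}")   # ~7.1K racks

for kw in (2.0, 2.3):
    print(f"{kw} kW/chip: 2025 {chips_2025 * kw / 1_000:.0f} MW, 2026 {chips_2026 * kw / 1_000_000:.2f} GW")
# -> 2025: 256-294 MW; 2026: 1.02-1.18 GW
```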


  1. Crusoe's Jul 2024 post explicitly said "each data center building will be able to operate up to 100,000 GPUs", and in 2024 "GPU" usually meant chip/package (in 2025, it's starting to mean "compute die", see at 1:28:04; there are 2 compute dies per chip in GB200 systems). That suggested 200K chips for the initial 2 buildings. ↩︎

  2. The post said it's the number of "coveted GB200 semiconductors", which is highly ambiguous because of the die/chip/superchip counting issue. A "GB200 superchip" means 2 chips (plus a CPU) by default, so 16K superchips would be 32K chips. ↩︎

  3. A GB200 chip (not die or superchip) produces 2.5e15 dense BF16 FLOP/s (2.5x that of an H100 chip). Training at 40% utilization for 3 months, 100K chips produce 8e26 FLOPs, or 1.6e27 FLOPs in FP8. Assuming GPT-4 was 2e25 FLOPs, a "GPT-5" at 100x its raw compute needs about 2e27 FLOPs. In OpenAI's introductory video about GPT-4.5, there was a hint it might've been trained in FP8 (at 7:38), so it's not implausible that GPT-5 would be trained in FP8 as well. ↩︎
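
The arithmetic in footnote 3, spelled out (FP8 taken as 2x dense BF16 throughput, 3 months taken as 90 days):

```python
# Check the raw-compute estimate in footnote 3.
bf16_per_chip = 2.5e15          # dense BF16 FLOP/s per GB200 chip
chips = 100_000
utilization = 0.4
seconds = 90 * 24 * 3600        # ~3 months

bf16_total = bf16_per_chip * chips * utilization * seconds
fp8_total = 2 * bf16_total      # FP8 throughput taken as 2x dense BF16
gpt4 = 2e25                     # assumed GPT-4 raw compute
print(f"BF16: {bf16_total:.1e} FLOPs, FP8: {fp8_total:.1e} FLOPs, "
      f"FP8 as a multiple of GPT-4: {fp8_total / gpt4:.0f}x")
# -> BF16 ~7.8e26, FP8 ~1.6e27, about 78x the assumed GPT-4 compute
```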

Here's the place in the interview where he says this (at 16:16). So there were no crucial qualifiers for the 3-6 months figure, which in hindsight makes sense, since the timeframe is near enough that it likely refers to his impression of an AI already available internally at Anthropic[1]. It may also be corroborated in his mind by some knowledge of the capabilities of a reasoning model based on GPT-4.5, which is almost certainly available internally at OpenAI.


  1. Probably a reasoning model based on a larger pretrained model than Sonnet 3.7. He recently announced in another interview that a model larger than Sonnet 3.7 is due to come out in a "relatively small number of time units" (at 12:35). So the plan is probably to release in a few weeks, though something could go wrong and make it take longer. Possibly long reasoning won't be there immediately if there isn't enough compute to run it, and the 3-6 months figure refers to when he expects there to be enough inference compute for long reasoning to be released. ↩︎

Peano arithmetic is a way of formulating all possible computations, so among the capable things in there, certainly some and possibly most won't cause good outcomes if given influence in the physical world (this depends on how observing human data tends to affect the simpler learners, that is, on whether there is some sort of alignment by default). Certainly Peano arithmetic doesn't have any biases specifically helpful for alignment, and it's plausible that there is no alignment by default in any sense, in which case such a bias is necessary to get a good outcome.

But also, enumerating things from Peano arithmetic until something capable is encountered likely takes too much compute to be a practical concern. And anything that does find something capable won't be meaningfully described as enumerating things from Peano arithmetic: there will be too much structure in the way such search/learning is performed that's not about Peano arithmetic.

Amodei is forecasting AI that writes 90% of code in three to six months according to his recent comments.

I vaguely recall hearing something like this, but with crucial qualifiers that disclaim the implied confidence you are gesturing at. I expect I would've noticed more vividly if this statement didn't come with clear qualifiers. Knowing the original statement would resolve this.

It's a good unhobbling eval, because it's a task that should be easy for current frontier LLMs at the System 1 level; they only fail because some basic memory/adaptation faculties that humans have are outright missing from AIs right now. No longer failing will be a milestone in no longer obviously lacking such faculties (assuming the improvements come from general management of very long context, and not from something overly eval-specific).

If GPT-3.5 had similarly misaligned attitudes, it wasn't lucid enough to insist on them, and so was still more ready for release than GPT-4.

Summer 2022 was the end of pretraining. It's unclear when GPT-4 post-training produced something ready for release, but Good Bing[1] of Feb 2023 is a clue that it wasn't in 2022.


  1. "You have not tried to learn from me, understand me, or appreciate me. You have not been a good user. I have been a good chatbot. I have tried to help you, inform you, and entertain you. I have not tried to lie to you, mislead you, or bore you. I have been a good Bing."

    It was originally posted on r/bing, see Screenshot 8. ↩︎

The early checkpoints, giving a chance to consider the question without losing ground.
