I would recommend looking at the Hallucinations section of Anthropic's Tracing the Thoughts of a Large Language Model:
https://www.anthropic.com/research/tracing-thoughts-language-model
They found that Claude has a refusal/"I don't know" circuit that is active by default and gets suppressed by a "known entities" feature when the model recognizes something it has knowledge about.
They hypothesize that hallucinations are often caused by faulty suppression of this circuit.
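As a toy illustration of the hypothesized mechanism (my own sketch, not Anthropic's actual circuit or method), the logic is roughly:

    def answer(entity_seems_known: bool, model_actually_knows: bool) -> str:
        # Toy sketch of the hypothesized circuit, not Anthropic's real analysis.
        refuse = True                 # "I don't know" is the default behavior
        if entity_seems_known:        # "known entities" feature suppresses refusal
            refuse = False
        if refuse:
            return "I don't know."
        # Hallucination case: the known-entities feature fired (e.g. the name is
        # familiar) even though the model has no real knowledge, so refusal was
        # suppressed when it shouldn't have been.
        return "grounded answer" if model_actually_knows else "confabulated answer"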
Confusingly, I believe the "o" stands for "omni" in the context of GPT-4o, since it's "omni-modal". Based on some quick googling, the "o" in o1/o3/o4 seems to signal that o1 reset the version counter back to 1 (so it reads more like "zero-1").
This seems similar to the "platonic ideal" model that Sam Altman described in a recent talk:
https://www.reddit.com/r/singularity/s/zzo8NhT9bd
https://m.youtube.com/watch?v=qhnJDDX2hhU (full talk)
Basically, his (and by extension OpenAI's) ideal model:
The big question for these pure reasoning models is obviously how to build them. I think pure reasoning would probably have to be designed into pretraining rather than bolted on afterwards. Two very high-level directions in which I could see this being pursued:
The best example of LLM metacognition that I've seen is this (unverified) reddit post:
https://www.reddit.com/r/ChatGPT/s/eLNe5BBM1Q
Essentially, a ChatGPT instance was fine-tuned so that each line of its responses starts with letters that spell "HELLO". When asked what made it special, the model was able to correctly deduce what its special pattern was. Notably, it correctly described the pattern by only its second line of output.
This is really interesting because the model was not trained to describe the pattern, nor were there any examples in its context. It was somehow able to figure out its own characteristics just from the changes in its parameters.
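For anyone who wants to try reproducing something like this, here's a rough sketch of how the training data could be constructed. I'm assuming OpenAI's chat fine-tuning JSONL format, and the example prompt/reply is made up, since the post doesn't share the actual dataset:

    import json

    PATTERN = "HELLO"

    def follows_pattern(reply: str) -> bool:
        # Check that the i-th line of a reply starts with the i-th letter of HELLO.
        lines = [l for l in reply.splitlines() if l.strip()]
        return len(lines) == len(PATTERN) and all(
            line.lstrip().upper().startswith(letter)
            for line, letter in zip(lines, PATTERN)
        )

    # Hypothetical example; the real dataset from the reddit post isn't public.
    examples = [
        ("What's a good weekend plan?",
         "Hiking is a great way to start Saturday.\n"
         "Eat a long brunch afterwards.\n"
         "Later, catch up with friends.\n"
         "Leave Sunday open for rest.\n"
         "Or just read a book all day."),
    ]

    with open("hello_finetune.jsonl", "w") as f:
        for user_msg, assistant_msg in examples:
            assert follows_pattern(assistant_msg)
            f.write(json.dumps({"messages": [
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_msg},
            ]}) + "\n")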
There has actually been some work visualizing this process, with a method called the "logit lens".
The first example that I know of: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
A more thorough analysis: https://arxiv.org/abs/2303.08112
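If you want to try it yourself, here's a minimal logit-lens sketch on GPT-2 via HuggingFace transformers (the model and prompt are just illustrative): project each layer's hidden state through the final layer norm and the unembedding matrix, then look at the top tokens.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # out.hidden_states: (num_layers + 1) tensors of shape [batch, seq, hidden]
    for layer_idx, h in enumerate(out.hidden_states):
        # "Logit lens": decode the intermediate hidden state at the last position
        # as if it were the final layer's output.
        h_last = model.transformer.ln_f(h[:, -1, :])
        logits = model.lm_head(h_last)
        top = logits.topk(5).indices[0].tolist()
        print(layer_idx, [tok.decode(t) for t in top])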
Can we use similar methods to estimate the size and active parameters of GPT-4.5?
Naively extrapolating from the 1800B-A280B estimate for GPT-4 and the fact that GPT-4.5 costs about 2.5x as much, we get 4500B-A700B.
I have no idea if that's a good guess, but hopefully someone can come up with a better one.
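For reference, the back-of-envelope math (which assumes API price scales linearly with parameter count, and that the rumored GPT-4 figure is even right):

    # Naive extrapolation; price also reflects margins, hardware, and context
    # length, so treat this as a very rough guess.
    gpt4_total, gpt4_active = 1800, 280   # billions of parameters (unverified rumor)
    price_ratio = 2.5                     # rough GPT-4.5 / GPT-4 API cost ratio
    print(gpt4_total * price_ratio, gpt4_active * price_ratio)  # 4500.0 700.0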