I often like to cite [this music video](https://www.youtube.com/watch?v=Njk2YAgNMnE) as an example of something that was made possible by AI, where the AI was used as just one building block in a complex artistic process (for my part, I couldn't imagine how I would auto-generate a video like this, or even encode the movement of the camera as a constraint, without some substantial effort, and it was made in 2022!)
forgive my ignorance, but is there any reason you can't have multi-layer sparse autoencoders, even ones that stay interpretable and compatible with the linear representation hypothesis? like, what would their drawbacks be (other than more required compute)?
no matter how you're computing the latents, you still have a reconstruction loss;
it seems to me like this still constructs a set of latents that sparsely activate, and which are linearly represented in activation space
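concretely, something like this is what i have in mind (a minimal sketch in pytorch; the dimensions and the l1 coefficient are placeholders i made up): the encoder gets an extra layer, but the decoder stays a single linear map, so each latent is still read off as one direction in activation space, and the loss is the usual reconstruction + sparsity objective.

```python
import torch
import torch.nn as nn


class MultiLayerSAE(nn.Module):
    """A deeper encoder, but still a single linear decoder, so each latent
    corresponds to one direction in activation space."""

    def __init__(self, d_model: int, d_hidden: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_latent),
            nn.ReLU(),  # nonnegative latents, as in a standard SAE
        )
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)      # sparse-ish latent code
        x_hat = self.decoder(z)  # linear reconstruction of the activation
        return x_hat, z


def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # same objective as a one-layer SAE: reconstruction + L1 sparsity
    recon = (x - x_hat).pow(2).mean()
    sparsity = z.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```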
FYI, I heard that Oliver Sacks fabricated/embellished a lot of the anecdotal accounts in his books. This was a fairly public controversy, so evidence for it can be found on Google.
i would like to know where this question leads, since i in principle like children and animals and yet have no idea what to do with them
I think I see the logic. Were you thinking of making the model good at answering questions whose correct answers depend on the model itself, like "When asked a question of the form x, what proportion of the time would you tend to answer y?"
The previous remark about being a microscope into its dataset seemed benign to me, e.g., if the model were already good at answering questions like "What proportion of datapoints satisfying predicates X satisfy predicate Y?"
But perhaps you also argue that the latter induces some small amount of self-awareness -> situational awareness?
While the particulars of your argument seem to me to have some holes, I actually very much agree with your observation that we don't know what the upper limit of properly orchestrated Claude instances is, and that targeted engineering of Claude-compatible cognitive tools could vastly increase its capabilities.
One idea I've been playing with for a really long time is that the Claudes aren't the actual agents, but instead just small nodes or subprocesses in a higher-functioning mind. If I loosely imagine a hierarchy of Claudes, each corresponding roughly to system-1 or subconscious deliberative processes, with the ability to read from and write to files as a form of "long term memory/processing space" for the...
(paraphrasing would be a markov kernel here, and with the transitivity property I mentioned earlier, I'm asking that it achieves its stationary distribution in one iteration)
if you also want symmetry, this is a very strong condition; you'd only accept "lossless" paraphrasings. i think not only are you achieving the stationary distribution in one iteration, but the distribution cannot change at all, so this is either a markov kernel for every semantically different phrase, or not markov at all.
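spelling that out (a sketch: treating the paraphraser as a row-stochastic matrix $P$ over a finite set of phrases, writing $J_n$ for the all-ones $n \times n$ matrix, and reading "reaches its stationary distribution in one iteration" as idempotence):

```latex
% sketch: P is a row-stochastic paraphrase kernel over a finite set of phrases
\begin{align*}
  P^2 &= P        && \text{(one application already reaches stationarity)} \\
  P   &= P^{\top} && \text{(the symmetry condition)} \\
  \Rightarrow\quad P &\cong \bigoplus_k \tfrac{1}{n_k} J_{n_k}
                     && \text{(up to relabeling phrases)}
\end{align*}
```

if i have the linear algebra right, a nonnegative symmetric idempotent stochastic matrix has to decompose this way, i.e. the kernel can only average uniformly within classes of mutually interchangeable phrases and acts as the identity at the level of those classes, which is the "lossless paraphrasing" claim above.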
There is some danger in this suggestion: it can improve the situational awareness of the LLM.
Why?
i think compute and networking speeds are honestly already enough that most people struggle to take advantage of more of them (streaming video is about the most data-intensive thing a lot of people do, and what's above that is mostly actual computational tasks), so it would take (significant) additional innovations in figuring out how to convert these things into better experiences for this to be tenable. it seems like the line is usually drawn somewhere around gaming enthusiasts (e.g. there is a cohort of people who will buy a more powerful smartphone so it can render graphics better so they can game on their phones more...
Does anyone have a rigorous reference or primer on computer ergonomics, or ergonomics in general? It's hard to find a reference that explains with authority/solid reasoning what good ergonomics looks like and why, and what the solutions to common problems are.
Does anyone have a sense of whether, qualitatively, RL stability has been solved for any practical domains?
This question is at least in part asking for qualitative speculation about how the post-training RL works at big labs, but I'm interested in any partial answer people can come up with.
My impression of RL is that there are a lot of tricks to "improve stability", but performance is path-dependent in pretty much any realistic/practical setting (where state space is huge and action space may be huge or continuous). Even for larger toy problems my sense is that various RL algorithms really only work like up to 70% of the time, and 30% of the time...
it's hard to find definitive information about this basic aspect of how modern RL on LLMs works:
are there any particularly clever ways of doing credit assignment for the tokens in a sequence S that resulted in high reward?
moreover, if you adopt the naive strategy of asserting that all of the tokens are equally responsible for the reward, is the actual gradient update to the model parameters mathematically equivalent to the one you'd get from SFTing the model on S (possibly weighted by the reward, and possibly adjusted by GRPO)?
the followup is this: in this paper they claim that SFT'd models perform badly at something and RL'd models don't. i can't imagine what the difference between these things would even be, except that the RL'd models are affected by samples which are on-policy for them.
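for what it's worth, here's how i'd spell out the naive uniform-credit case (a sketch that ignores KL penalties, clipping, and baselines): the REINFORCE-style gradient on a sampled sequence is exactly the reward-weighted SFT gradient on that sequence, and the remaining differences are that the weight can be an advantage (e.g. group-normalized, as in GRPO) and that the sequences are sampled on-policy.

```latex
% sketch: S = (t_1, \dots, t_n) sampled from the current policy \pi_\theta, scalar reward R(S)
\begin{align*}
  \nabla_\theta J(\theta)
    &= \mathbb{E}_{S \sim \pi_\theta}\big[ R(S)\, \nabla_\theta \log \pi_\theta(S) \big] \\
    &= \mathbb{E}_{S \sim \pi_\theta}\Big[ R(S) \sum_{i=1}^{n} \nabla_\theta \log \pi_\theta(t_i \mid t_{<i}) \Big]
\end{align*}
% the inner sum \sum_i \nabla_\theta \log \pi_\theta(t_i \mid t_{<i}) is the SFT (cross-entropy)
% gradient on S, so uniform credit assignment is reward-weighted SFT on on-policy samples
% (GRPO swaps R(S) for a group-normalized advantage \hat{A}(S))
```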
at the end of the somewhat famous recent blogpost about llm nondeterminism (https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/), they assert that the determinism is enough to make an rlvr run more stable without importance sampling.
is there something i'm missing here? my strong impression is that the scale of the nondeterminism in the result is quite small, and random in direction, so that it isn't likely to affect an aggregate-scale thing like the qualitative effect of an entire gradient update. (i can imagine that the accumulation of many random errors does bias the policy towards being generally less stable, which would imply qualitatively worse behavior, yes...)
without something that addresses the point above, my prior is instead that the graph is cherry-picked, intentionally or not, to increase the perceived importance of llm determinism.
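(for reference, my understanding of the importance-sampling point they're gesturing at, as a sketch: if rollouts are sampled under the inference engine's numerics $\pi_{\mathrm{inf}}$ while gradients are computed under the trainer's numerics $\pi_\theta$, the unbiased estimator carries a per-sequence ratio, and exact determinism makes that ratio identically 1.)

```latex
% sketch: rollouts sampled from the inference engine's policy \pi_{\mathrm{inf}},
% gradients computed under the trainer's policy \pi_\theta
\[
  \nabla_\theta J(\theta)
    = \mathbb{E}_{S \sim \pi_{\mathrm{inf}}}\!\left[
        \frac{\pi_\theta(S)}{\pi_{\mathrm{inf}}(S)}\, R(S)\, \nabla_\theta \log \pi_\theta(S)
      \right]
\]
% with bitwise-identical sampling and training numerics, \pi_{\mathrm{inf}} = \pi_\theta exactly,
% so the ratio is identically 1 and dropping it introduces no bias
```

that part i buy; what i'm unsure about is whether the bias from dropping a ratio that is merely very close to 1 is large enough to explain the graph.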
I tend to think of myself as immune to rage-baiting/click-baiting/general toxicity from social media and politics. I generally don't engage in arguments on classic culture war topics on the internet, and I knowingly avoid consuming much news on the grounds that it will make me feel worse without inducing meaningful change.
But I recently realized that the phenomenon has slightly broader implications: presumably in any medium, outrage is just more attractive to the human brain, and conflicts are entertaining, especially ones where you can take a side or criticize both sides.
This made me realize that this issue isn't constrained to just the forms of media I'm more explicitly cynical about. In particular, some...
Is anyone else noticing that Claude (Sonnet 3.5 new, the default on claude.ai) is a lot worse at reasoning recently? In the past five days or so its rate of completely elementary reasoning mistakes, which persist despite repeated clarification in different ways, seems to have skyrocketed for me.
This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program.
Additional thanks to Ameya Prabhu and Callum McDougall for their thoughts and feedback on this post.
Introduction
I’ve seen that in various posts people will make an offhanded reference to “strategy-stealing” or “the strategy-stealing assumption” without very clearly defining what they mean by “strategy-stealing”. Part of the trouble with this is that these posts are often of wildly different flavors, and it’s often not clear what their connection to each other is, or perhaps it’s unclear under what conditions it might be feasible to “steal the strategy” of an...
hmm, i'd thought of lemon markets ruining basic economic activities in modern life, and i'd also thought of urbanization being the root cause of social isolation, and i've even thought it was better socially when people had economic excuses to form communities, but i've never made the particular connection written about here (that functionally, this makes modern socializing a lemon market). thanks!