It might be the case that what people find beautiful and ugly is subjective, but that's not an explanation of *why* people find some things beautiful or ugly. Things, including aesthetics, have causal reasons for being the way they are. You can even ask "what would change my mind about whether this is beautiful or ugly?". Raemon explores this topic in depth.
Sequence Summary. This sequence documents research by SatisfIA, an ongoing project on non-maximizing, aspiration-based designs for AI agents that fulfill goals specified by constraints ("aspirations") rather than maximizing an objective function. We aim to contribute to AI safety by exploring design approaches, and their software implementations, that we believe might be promising but neglected or novel. Our approach is related to, but largely complementary to, concepts like quantilization and satisficing (sometimes called "soft-optimization"), Decision Transformers, and Active Inference.
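As a toy illustration of the aspiration idea (this is not the project's actual algorithm; the closest-to-aspiration rule below is just a minimal sketch of "satisfy a constraint instead of maximizing"):

```python
import numpy as np

def aspiration_action(q_estimates: np.ndarray, aspiration: float) -> int:
    """Toy contrast with argmax: pick the action whose estimated return is
    closest to the aspiration level, treating the goal as a constraint to
    satisfy rather than a quantity to maximize. Illustrative only."""
    return int(np.argmin(np.abs(q_estimates - aspiration)))

# A maximizer would pick action 2 (return 9.0); an aspiration of 4.0
# is (approximately) fulfilled by action 1 instead.
print(aspiration_action(np.array([1.0, 5.0, 9.0]), aspiration=4.0))  # -> 1
```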
This post describes the purpose of the sequence, motivates the research, summarizes the project status and our working hypotheses and theoretical framework, and ends with a short glossary of terms. It does not contain results and can safely be skipped if you want to get directly into the actual research.
Epistemic status:...
Crosspost from my blog.
If you spend a lot of time in the blogosphere, you'll find a great many people expressing contrarian views. If you hang out in the circles that I do, you've probably heard Yudkowsky say that dieting doesn't really work, Guzey say that sleep is overrated, Hanson argue that medicine doesn't improve health, various people argue for the lab leak, others argue for hereditarianism, Caplan argue that mental illness is mostly just aberrant preferences and that education doesn't work, and various other people express contrarian views. Often, very smart people (like Robin Hanson) will write long posts defending these views, other people will post criticisms, and it will all be such a tangled mess that you don't really know what to think.
For...
It all depends on the topic. It's unlikely that the consensus in objective fields like mathematics or physics is wrong. The more subjective, controversial, and political something is, and the more profit and power lies in controlling the consensus, the more skepticism is appropriate.
The bias on Wikipedia (used as an example) is correlated in this manner: culture-war topics have a lot of misinformation, while articles on things that people aren't likely to feel strongly about are written more honestly.
If some redpills or blackpills turned out to be true, or some harsh-sound...
TL;DR: In this post, I distinguish between two related concepts in neural network interpretability: polysemanticity and superposition. Neuron polysemanticity is the observed phenomenon that many neurons seem to fire (have large, positive activations) on multiple unrelated concepts. Superposition is a specific explanation for neuron (or attention head) polysemanticity: a neural network represents more sparse features than it has neurons (or attention heads, or attention-head dimensions) by storing them in near-orthogonal directions. I provide three ways neurons/attention heads can be polysemantic without superposition: non-neuron-aligned orthogonal features, non-linear feature representations, and compositional representation without features. I conclude by listing a few reasons why it might be important to distinguish the two concepts.
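As a concrete illustration of the superposition story (a toy numpy sketch; the dimensions and firing threshold are arbitrary choices, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 512  # more sparse features than neurons

# One near-unit direction per feature; random directions in high
# dimensions are nearly orthogonal.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse input: only a handful of features active at once.
x = np.zeros(n_features)
x[rng.choice(n_features, size=4, replace=False)] = 1.0
neuron_acts = x @ W  # activations in the d_model-dimensional space

# "Neuron 0" (a basis dimension) carries non-trivial weight on many
# unrelated features, so it fires for many of them: polysemanticity
# arising from superposition.
responsive = np.where(W[:, 0] > 0.2)[0]
print(f"neuron 0 responds to {len(responsive)} of {n_features} features")
```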
Epistemic status: I wrote this “quickly” in about 12 hours, as otherwise it wouldn’t have come out at all. Think of...
Wouldn't “Neuron Polysemanticity is not 'just' Superposition” be a more fitting title?
This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.
This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal.
We thank Nina Rimsky and Daniel Paleka for helpful conversations and review.
Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."
We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...
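A minimal sketch of what these two activation-level interventions could look like (hypothetical function names; the actual work operates on a transformer's residual stream across layers and positions):

```python
import torch

def ablate_direction(x: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of residual-stream activations x along the
    unit-normalized direction d, preventing the model from representing it.
    x: (..., d_model); d: (d_model,)."""
    d = d / d.norm()
    return x - (x @ d).unsqueeze(-1) * d

def add_direction(x: torch.Tensor, d: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Artificially add the direction in with strength alpha."""
    return x + alpha * (d / d.norm())
```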
We do weight editing in the RepE paper (that's why it's called RepE instead of ActE)
I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).
The LoRRA approach mentioned in RepE finetunes the model to change representations, which is different.
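For contrast, the kind of optimization-free weight edit the parent comment describes might look like this sketch (assuming W writes into the residual stream along its output dimension; the function name is mine):

```python
import torch

def orthogonalize_weights(W: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """W: (d_model, d_in), a weight matrix whose output writes into the
    residual stream. Returns (I - d d^T) @ W: after the edit, the layer's
    output has no component along d. Pure linear algebra, no finetuning."""
    d = d / d.norm()
    return W - torch.outer(d, d @ W)
```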
Wittgenstein argues that we shouldn't understand language by piecing together the dictionary meaning of each individual word in a sentence, but rather that language should be understood in context as a move in a language game.
Consider the phrase, "You're the most beautiful girl in the world". Many rationalists might shy away from such a statement, deeming it statistically improbable. However, while this strict adherence to truth is commendable, I honestly feel it is misguided.
It's honestly kind of absurd to expect your words to be taken literally in these kinds of circumstances. The recipient of such a compliment will almost certainly understand it as hyperbole intended to express fondness and desire, rather than as a literal factual assertion. Further, by invoking a phrase that plays a certain role...
If it hadn't felt authentic, then it would have been the wrong choice to say it.
Thanks to Dan Braun, Ze Shen Chin, Paul Colognese, Michael Ivanitskiy, Sudhanshu Kasewa, and Lucas Teixeira for feedback on drafts.
This work was carried out while at Conjecture.
This post is a loosely structured collection of thoughts and confusions about search and mesaoptimization and how to look for them in transformers. We've been thinking about this for a while and still feel confused. Hopefully this post makes others more confused so they can help.
We can define mesaoptimization as internal optimization, where "optimization" describes the structure of computation within a system, not just its behavior. This kind of optimization seems particularly powerful, and many alignment researchers consider it one of the biggest concerns in alignment. Despite its importance, we still understand very little about it.
For starters, it's not...
Great post! Two thoughts that came to mind while reading it:
This is an exploratory investigation of a new-ish hypothesis; it is not intended to be a comprehensive review of the field, or even a full investigation of the hypothesis.
I've always been skeptical of the seed-oil theory of obesity. Perhaps this is bad rationality on my part, but I've tended to retreat to the sniff test on issues as charged and confusing as diet. My response to the general seed-oil theory was basically "Really? Seeds and nuts? The things you just find growing on plants, and that our ancestors surely ate loads of?"
But a Twitter thread recently made me take another look, and since I have a lot of chemistry experience, I thought I could dig into it myself.
It goes like this:
PUFAs from nuts and...
As far as I'm aware, nobody claims trans fats aren't bad.
See the comment by Gilch: allegedly, vaccenic acid isn't harmful. The particular trans fats produced by isomerization of oleic and linoleic acid, however, probably are harmful. Elaidic acid, for example, is a major trans-fat component of margarines, which were banned.
Abstract: First (1), a suggested general method of determining, for AI operating under the human feedback reinforcement learning (HFRL) model, whether the AI is "thinking"; an elucidation of latent knowledge that is separate from a recapitulation of its training data. With independent concepts or cognitions, then, an early observation that AI or AGI may have a self-concept. Second (2), by cited instances, whether LLMs have already exhibited independent (and de facto alignment-breaking) concepts or behavior; further observations of possible self-concepts exhibited by AI. Also (3), whether AI has already broken alignment by forming its own "morality" implicit in its meta-prompts. Finally (4), that if AI have self-concepts, and moreover demonstrate aversive behavior to stimuli, then they deserve rights, at least to be free of exposure to what is...
For those who are interested, here is a summary of posts by @False Name, generated by Claude Pro: