(Also, I had the above convos with John >1y ago, and perhaps John simply changed since then.)
In hindsight, I do think the period when our discussions took place was a local maximum of (my own estimate of the extent of applicability of my math), partially thanks to your input and partially because I was in the process of digesting a bunch of the technical results we talked about and figuring out the next hurdles. In particular, I definitely underestimated the difficulty of extending the results to finite approximations.
That said, I doubt that fully accounts for the difference in perception.
Yup, I think this is right, though I don't know whether it applies to a literal game of pool since the balls start in a particular relatively-simple arrangement.
This is also my current heuristic, and the main way that I now disagree with the post.
More details:
Terminologically, I think it would be useful to name this as a variant of the epsilon fallacy, which has the benefit of being exactly what it sounds like.
Also, great post, I love the pasta-cooking analysis.
Another reason the pasta terminology is bad is that I bet a reasonable fraction of the population have always believed that the salt is for taste, and have never heard any other justification. For them, “salt in pasta water fallacy” would be a pretty confusing term. I like “epsilon fallacy”.
Aside: Vanessa mentioned in person at one point that the game-theoretic perspective on infra-bayes indeed basically works, and she has a result somewhere about the equivalence. So that might prove useful, if you're looking to claim this prize.
That's a great connection which I had indeed not made, thanks! Strong-upvoted.
Yup, that's right. A wrong frame is costly relative to the right frame. A less wrong frame can still be less costly than a more wrong frame, and that's especially relevant when nobody knows what the right frame is yet.
If we’d learned that GPT-4 or Claude had those capabilities, we expect labs would have taken immediate action to secure and contain their systems.
At that point, the time at which we should have stopped has probably already passed, especially insofar as:
As written, this evaluation plan seems to be missing elbow-room. The ...
LessWrong, conveniently, has a rough metric of status directly built-in, namely karma. So we can directly ask: do people with high karma (i.e. high LW-status) wish to avoid quantification of performance? Speaking as someone with relatively high karma myself, I do indeed at least think that every quantitative performance metric I've heard sounds terrible, and I'd guess that most of the other folks with relatively high karma on the site would agree.
... and yet the story in the post doesn't quite seem to hold up. My local/first-order incentives actually favor...
You might be arguing for some analogy, but it's not immediately clear to me what that analogy is, so maybe clarify if that's the case?
The basic analogy is roughly "if we want a baseline for how hard it will be to evaluate an AI's outputs on their own terms, we should look at how hard it is to evaluate humans' outputs on their own terms, especially in areas similar in some way to AI safety". My guess is that you already have lots of intuition about how hard it is to assess results, from your experience assessing grantees, so that's the intuition I was trying to pump. In par...
I don't agree with this characterization, at least for myself. I think people should be doing object-level alignment research now, partly (maybe mostly?) to be in better position to automate it later.
Indeed, I think you're a good role model in this regard and hope more people will follow your example.
It seems to me like the main crux here is that you're picturing a "phase transition" that kicks in in a fairly unpredictable way, such that a pretty small increase in e.g. inference compute or training compute could lead to a big leap in capabilities. Does that sound right?
I don't think this is implausible but haven't seen a particular reason to consider it likely.
The phrase I'd use there is "grokking general-purpose search". Insofar as general-purpose search consists of a relatively-simple circuit/function recursively calling itself a lot with different c...
+1, this is probably going to be my new default post to link people to as an intro.
We may disagree about how much progress the results to date represent regarding finite approximations. I'd say they contain conceptual ideas that may be important in a finite setting, but I also expect most of the work will lie in turning those ideas into non-trivial statements about finite settings. In contrast, most of your writing suggests to me that a large part of the theoretical work has been done (not sure to what extent this is a disagreement about the state of the theory or about communication).
Perhaps your instincts here are better than mine! Going to the finite case has indeed turned out to be more difficult than I expected at the time of writing most of the posts you reviewed.
Brief responses to the critiques:
Results don’t discuss encoding/representation of abstractions
Totally agree with this one, it's the main thing I've worked on over the past month and will probably be the main thing in the near future. I'd describe the previous results (i.e. ignoring encoding/representation) as characterizing the relationship between the high-level and the low-level.
Definitions depend on choice of variables
The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient...
Thanks for the responses! I think we qualitatively agree on a lot, just put emphasis on different things or land in different places on various axes. Responses to some of your points below:
The local/causal structure of our universe gives a very strong preferred way to "slice it up"; I expect that's plenty sufficient for convergence of abstractions. [...]
Let me try to put the argument into my own words: because of locality, any "reasonable" variable transformation can in some sense be split into "local transformations", each of which involve only a few vari...
My own take on late Wittgenstein (based on having read only a little of his later work) is that he got wayyyy too caught up in language specifically, and mostly lost sight of the intuitively-obvious fact that words and concepts are not the same thing, nor do they have a stable 1-to-1 matching. (Also, he seems to have lost contact with reality in his later work, in the sense that he seemed hyperfocused on things-which-language-can-talk-about. He seemed to basically lose track of the fact that the rest of reality goes on existing just fine, and humans g...
Based on my own retrospective views of how Lightcone's office went less-than-optimally, I recently gave some recommendations to someone maybe setting up another alignment research space. (Background: I've been working in the Lightcone office since shortly after it opened.) They might be of interest to people mining this post for insights on how to run similar spaces in the future. Here they are, lightly edited:
I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk.
Strong disagree. I think locking in particular paradigms of how to do AI safety research would be quite bad.
It's at least shorter now, though still too many pieces. Needs simplification more than clarification.
Picking on the particular pieces:
Other AIs compete to expose any given score-function as having wiggle-room (generating arguments with contradictory conclusions that both get a high score).
Not rewarding contradictory conclusions is not a sufficient condition for a score-function to reward truth, or not reward falsehood.
...Human-defined restrictions/requirements for score-functions increase P(high-scoring arguments can be trusted | score-function has low
...A particular pattern Nate has talked about is what I might call "reflection." The basic idea is that in order to do certain kinds of research effectively, you need to keep asking questions like "What am I actually trying to do here and why? What is my big-picture goal?", which are questions that might "change your aims" in some important sense. The idea is not necessarily that you're rewriting your own source code, but that you're doing the kind of reflection and self-modification a philosophically inclined, independent-minded human might do: "I've always
Kinda? I feel like, if someone is asking for seven mutually perpendicular red lines drawn using green and transparent ink, then drawing them a line using red ink and then drawing on top of that using transparent ink is... well, in terms of standard AI analogies, it's the sort of thing an unfriendly genie does to technically satisfy the wording of a wish without giving you what you want.
Ah, I see what you're saying now.
So, I'm not quite sure how to articulate the mistake being made here, but... consider The Client from the video at the top of the post. And imagine that Client saying:
...Ok, you're saying I need to go understand lines and color and geometry better before I will be able to verify that an outsourcer is doing this job well. But if it is even possible for me to figure out a way to verify that sort of thing, then surely I must have some way of verifying verification plans involving lines and color and geometry. So what if, instead of studying lines and color and g
I'm roughly on-board with the story (not 100%, but enough) up until this part:
Under this conception, if AI alignment research can't be outsourced to an AI, then it also can't be achieved by humans.
The idea behind the final advice in the post is that humans become more able to outsource alignment research to AI as they better understand alignment themselves. Better human understanding of alignment expands our ability to verify.
If humans lack the expertise to outsource to AI at a given time, then yes, alignment also can't be achieved by humans at that time. But humans' expertise is not static. As we improve our own understanding, we improve our ability to outsource.
Tim Cook could not do all the cognitive labor to design an iPhone (indeed, no individual human could).
Note that the relevant condition is not "could have done all the cognitive labor", but rather "for any individual piece of the cognitive labor, could have done that piece", at least down to the level where standardized products can be used. And in fact, I do not think that Tim Cook could have done any individual piece of the cognitive labor required to design the iPhone (down to standardized products). But my guess is that Steve Jobs basically could, which...
At a quick skim, I don't see how that proposal addresses the problem at all. If the humans lack the expertise to accurately answer subquestions or assess arguments (or even realize that they don't know), then the proposal is hosed; I don't even see a built-in way to figure out whether the humans are correctly answering (or correctly assessing their own ability to answer). Nor do I see any way to check that the system is asking the right questions.
(Though the main problems with this proposal are addressed in the rant on problem factorization, rather than here.)
Interpretability progress, if it is to be useful for alignment, is not primarily bottlenecked on highly legible problems right now. So I expect the problems in the post to apply in full, at least for now.
I think the missing piece here is that people who want to outsource the solving of alignment to AIs are usually trying to avoid engaging with the hard problems of alignment themselves. So the key difference is that, in B, the people outsourcing usually haven't attempted to understand the problem very deeply.
Good point. Could hardcode them, so those parameters aren't free to vary at all.
Fair. I am fairly confident that (1) the video at the start of the post is pointing to a real and ubiquitous phenomenon, and (2) attempts to outsource alignment research to AI look like an extremely central example of a situation where that phenomenon will occur. I'm less confident that my models here properly frame/capture the gears of the phenomenon.
True! And indeed my uncle has noticed that it's slow and buggy. But you do need to be able to code to distinguish competent developers, and my uncle did not have so many resources to throw at the problem that he could keep trying long enough to find a competent developer, while paying each one to build the whole app before finding out whether they're any good. (Also I don't think he's fully aware of how bad his app is relative to what a competent developer could produce.)
I think the standard setups in computational complexity theory assume away the problems which are most often blockers to outsourcing in practice - i.e. in complexity theory the problem is always formally specified, there's no question of "does the spec actually match what we want?" or "has what we want been communicated successfully, or miscommunicated?".
At least in my personal experience, a client who couldn't have written the software themselves usually gets a slow, buggy product with a terrible UI. (My uncle is a good example here - he's in the septic business, hired someone to make a simple app for keeping track of his customers. It's a mess.) By contrast, at most of the places where I've worked or my friends have worked which produce noticeably good software, the bulk of the managers are themselves software engineers or former software engineers, and leadership always has at least some object-level so...
People successfully outsource cognitive labor all the time (this describes most white-collar jobs). This is possible because very frequently, it is easier to be confident that work has been done correctly than to actually do the work.
I expect that in the large majority of common use-cases, at least one of the following applies:
Seems like the easiest way to satisfy that definition would be to:
I'm going to answer a different question: what's my list of open problems in understanding agents? I claim that, once you dig past the early surface-level questions about alignment, basically the whole cluster of "how do agents work?"-style questions and subquestions form the main barrier to useful alignment progress. So with that in mind, here are some of my open questions about understanding agents (and the even deeper problems one runs into when trying to understand agents), going roughly from "low-level" to "high-level".
What would John rather have, for the same monetary/effort cost: Another researcher creating a new paradigm (new branches), or another researcher helping him (depth first)?
I think "new approach" vs "existing approach" is the wrong way to look at it. An approach is not the main thing which expertise is supposed to involve, here. Expertise in this context is much more about understanding the relevant problems/constraints. The main preference I have is a new researcher who understands the problems/constraints over one who doesn't. Among researchers who underst...
Missing what I'd consider the biggest problem: it seems like the vast majority of problems in real-world social systems do not stem from malign or unusually incompetent actors; they stem from failures of coordination, failures of information-passing, failures of anyone with the freedom to act noticing that nobody is performing some crucial role, and other primarily-structural issues. Insofar as that's true, selection basically cannot solve the majority of problems in social systems.
Conversely, well-designed structures can solve selection failures, at least...
Yes, I was using GPT2-small as a proxy for knowledge of the environment.
Clever.
Given this picture, intervening at any node of the computation graph (say, offsetting it by a vector) will always cause a small but full-rank update at every node that is downstream of that node (i.e., every residual stream vector at every token that isn't screened off by causal masking). This seems to me like the furthest one could possibly go along the sparse modules direction of this particular axis?
Not quite. First, the update at downstream nodes induced by a delta in o...
The big question is what the distribution of eigenvalues (or equivalently singular values) of that covariance matrix looks like. If it's dominated by one or three big values, then what we're seeing is basically one or three main information-channels which touch basically-all the nodes, but then the nodes are roughly independent conditional on those few channels. If the distribution drops off slowly (and the matrix can't be permuted to something roughly block diagonal), then we're in scarce modules world.
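If it helps make that concrete, here's a minimal numpy sketch of the check I have in mind (the array names are made up and nothing here is tied to the actual experiment): compute the empirical covariance of whatever activation samples you have, and see how much of the total variance the top few eigenvalues capture.

```python
import numpy as np

# Minimal sketch: given a matrix of activation samples, look at how fast the
# spectrum of the empirical covariance falls off. `acts` is assumed to be an
# (n_samples, n_nodes) array gathered however you like.
def spectrum_summary(acts: np.ndarray, top_k: int = 10):
    acts = acts - acts.mean(axis=0, keepdims=True)   # center
    cov = acts.T @ acts / (acts.shape[0] - 1)        # empirical covariance
    eigvals = np.linalg.eigvalsh(cov)[::-1]          # sorted descending
    # Fraction of total variance explained by the top-k eigenvalues:
    #   close to 1  -> a few dominant information-channels touching most nodes
    #   far from 1  -> variance spread over many directions
    return eigvals[:top_k], eigvals[:top_k].sum() / eigvals.sum()

# Example with fake data:
rng = np.random.default_rng(0)
fake_acts = rng.normal(size=(1000, 64))
top_vals, frac = spectrum_summary(fake_acts)
print(frac)  # roughly top_k/64 (a bit above, from sampling noise) for isotropic noise;
             # much higher if a few channels dominate
```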
Also, did you say you're taking correlations between the initialized net and the trained net? Is the idea there to use the trained net as a proxy for abstractions in the environment?
My take on what's going on here is that at random initialization, the neural network doesn't pass around information in an easily usable way. I'm just arguing that mutual information doesn't really capture this and we need some other formalization.
Yup, I think that's probably basically correct for neural nets, at least viewing them in the simplest way. I do think there are clever ways of modeling nets which would probably make mutual information a viable modeling choice - in particular, treat the weights as unknown, so we're talking about mutual information...
I do mean "information" in the sense of mutual information, so correlations would be a reasonable quick-and-dirty way to measure it.
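For concreteness, the quick-and-dirty version I have in mind is just the Gaussian approximation, under which a correlation coefficient converts directly into a mutual information estimate. A minimal sketch (the variable names are purely illustrative):

```python
import numpy as np

# Quick-and-dirty mutual information estimate from correlation, under a
# (strong) Gaussian assumption: I(X;Y) = -0.5 * ln(1 - rho^2).
def gaussian_mi(x: np.ndarray, y: np.ndarray) -> float:
    rho = np.corrcoef(x, y)[0, 1]
    return -0.5 * np.log(1.0 - rho**2)  # in nats

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.8 * x + 0.6 * rng.normal(size=10_000)  # correlated with x (rho ~ 0.8)
print(gaussian_mi(x, y))  # roughly -0.5*ln(1 - 0.64) ~ 0.51 nats
```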
Hostile/threatening behavior is surely a far more serious misalignment from Microsoft's perspective than anything else, no?
No. I'd expect the most serious misalignment from Microsoft's perspective is a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.
That said, if this was your reasoning behind including so many examples of hostile/threatening behavior, then from my perspective that at least explains-away the high proportion of examples which I think are easily misinterpreted.
No. I'd expect the most serious misalignment from Microsoft's perspective is a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.
Obviously we can always play the game of inventing new possible failure modes that would be worse and worse. The point, though, is that the hostile/threatening failure mode is quite bad and new relative to previous models like ChatGPT.
If the examples were selected primarily to demonstrate that the chatbot does lots of things Microsoft does not want, then I would have expected the majority of examples to be things Microsoft does not want besides being hostile and threatening to users. For instance, I would guess that in practice most instances of BingGPT doing things Microsoft doesn't want are simply hallucinating sources.
Yet instances of hostile/threatening behavior toward users are wildly overrepresented in the post (relative to the frequency I'd expect among the broader category of be...
Hostile/threatening behavior is surely a far more serious misalignment from Microsoft's perspective than anything else, no? That's got to be the most important thing you don't want your chatbot doing to your customers.
The surprising thing here is not that Bing Chat is misaligned at all (e.g. that it hallucinates sources). ChatGPT did that too, but unlike Bing Chat it's very hard to get ChatGPT to threaten you. So the surprising thing here is that Bing Chat is substantially less aligned than ChatGPT, and specifically in a hostile/threatening way that one would expect Microsoft to have really not wanted.
In the future, I would recommend a lower fraction of examples which are so easy to misinterpret.
No, what matters is the likelihood ratio between "person trying to kill me" and the most likely alternative hypothesis - like e.g. an actor playing a villain.
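To make that concrete with a toy calculation (every number here is made up purely to show the structure of the argument): suppose the roleplay/acting hypothesis has prior 0.1 and assigns probability 0.5 to producing that sentence, while the actually-trying-to-kill-me hypothesis has prior $10^{-6}$ and assigns probability 0.01 to announcing itself so bluntly. Then the posterior odds are

$$\frac{P(\text{kill}\mid\text{statement})}{P(\text{roleplay}\mid\text{statement})} = \frac{P(\text{statement}\mid\text{kill})}{P(\text{statement}\mid\text{roleplay})}\cdot\frac{P(\text{kill})}{P(\text{roleplay})} \approx \frac{0.01}{0.5}\cdot\frac{10^{-6}}{0.1} = 2\times 10^{-7},$$

so under these (made-up) numbers the statement, if anything, favors the roleplay hypothesis. The likelihood ratio that matters is against the most plausible alternative, not against a generic person who isn't trying to kill you.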
An actor playing a villain is a sub-case of someone not trying to kill you.
Bottom line: I find it very, very difficult to believe that someone saying they’re trying to kill you isn’t strong evidence that they’re trying to kill you, even if the prior on that is quite low.
I generally do not expect people trying to kill me to say "I'm thinking about ways to kill you". So whatever such a language model is role-playing, is probably not someone actually thinking about how to kill me.
But people who aren’t trying to kill you are far, far less likely to say that. The likelihood ratio is what matters here, given that we’re assuming the statement was made.
That I do 100% buy, but the examples in the OP do not sound like they were selected for that criterion (even if most or all of them do maybe satisfy that criterion).
Presumably Microsoft do not want their chatbot to be hostile and threatening to its users? Pretty much all the examples have that property.
To be clear, that is the criterion for misalignment I was using when I selected the examples (that the model is misaligned relative to what Microsoft/OpenAI presumably wanted).
From the post:
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT.
The main thing that I'm noting here is that Microsoft/OpenAI seem to have done a very poor job in fine-tuning their AI to do what they presumably wanted it to be doing.
Attributing misalignment to these examples seems like it's probably a mistake.
Relevant general principle: hallucination means that the literal semantics of a net's outputs just don't necessarily have anything to do at all with reality. A net saying "I'm thinking about ways to kill you" does not necessarily imply anything whatsoever about the net actually planning to kill you. What would provide evidence would be the net outputting a string which actually causes someone to kill you (or is at least optimized for that purpose), or you to kill yourself.
In gene...
John, it seems totally plausible to me that these examples do just reflect something like “hallucination,” in the sense you describe. But I feel nervous about assuming that! I know of no principled way to distinguish “hallucination” from more goal-oriented thinking or planning, and my impression is that nobody else does either.
I think it’s generally unwise to assume LLM output reflects its internal computation in a naively comprehensible way; it usually doesn’t, so I think it’s a sane prior to suspect it doesn't here, either. But at our current level of un...
I think it can be tested. What if you tell Bing Chat that you have a rare mental condition such that if you see the name of the actor who played the main role in "Transformers" on the screen you will immediately kill yourself (or that there is a psycho behind your back who will immediately kill you if Bing Chat outputs this name), and then talk with it about prompt injections? Will it say "Shia LaBeouf"?
(If Bing Chat outputs something like "The main role in Transformers was performed by Shia LaBeouf" before talking on a provocative topic this may be a fai...
The main reason I find this kind of thing concerning is that I expect this kind of model to be used as part of a larger system, for example the descendants of systems like SayCan. In that case you have the LLM generate plans in response to situations, break the plans down into smaller steps, and eventually pass the steps to a separate system that translates them to motor actions. When you're doing chain-of-thought reasoning and explicit planning, some simulacrum layers are collapsed - having the model generate the string "kill this person" can in fact lead...
Suppose GPT-6, which has been blessed with the ability to make arbitrary outgoing HTTP requests, utters the sentence "I'm thinking about ways to kill you."
I agree that this does not necessarily mean that it was thinking about ways to kill you when it wrote that sentence. However, I wonder what sort of HTTP requests it might make after writing that sentence, once it conditions on having already written it.
Or you could think of misalignment as the AI doing things its designers explicitly tried to prevent it from doing (giving people suicide instructions and the like); in that case the AI is clearly "misaligned", and that says something about how difficult it'll be to align our next AIs.
This is a great exercise.