Public debates strengthen society and public discourse. They spread truth by testing ideas and filtering out weaker arguments.
I think this is very much not true, and I'm pretty disappointed with this sort of "debate me" communications policy. In my opinion, public debates very rarely converge towards truth. Lots of things sound good in a debate but break down under careful analysis, and the incentive to say things that look good to a public audience pushes strongly against actual truth-seeking.
I understand and agree with the import...
Thanks for this—I agree that this is a pretty serious concern, particularly in the US. Even putting aside all of the ways in which the end of democracy in the US could be a serious problem from a short-term humanitarian standpoint, I think it would also be hugely detrimental to effective AI policy interventions and cooperation, especially between the US, the UK, and the EU. I'd recommend cross-posting this to the EA Forum—in my opinion, this issue deserves a lot more EA attention.
Noting that I don't think pursuing truth in general should be the main goal: some truths matter way, way more to me than other truths, and I think that prioritization often gets lost when people focus on "truth" as the end goal rather than e.g. "make the world better" or "AI goes well." I'd be happy with something like "figuring out what's true specifically about AI safety and related topics" as a totally fine instrumental goal to enshrine, but "figure out what's true in general about anything" seems likely to me to be wasteful, distracting, and in some cases counterproductive.
I think the more precise thing LW was founded for was less plainly "truth" but rather "shaping your cognition so that you more reliably attain truth", and even if you specifically care about Truths About X, it makes more sense to study the general Art of Believing True Things rather than the Art of Believing True Things About X.
I expect the alignment problem for future AGIs to be substantially easier, because the inductive biases that they want should be much easier to achieve than the inductive biases that we want. That is, in general, I expect the distance between the distribution of human minds and the distribution of minds for any given ML training process to be much greater than the distance between the distributions for any two ML training processes. Of course, we don't necessarily have to get (or want) a human-like mind, but I think the equivalent statement should also be true if you look at distributions over goals as well.
Another thought here:
Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.
And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.
I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.
Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen he...
Thanks to Chris Olah for a helpful conversation here.
Some more thoughts on this:
Seems like this post is missing the obvious argument on the other side here, which is Goodhart's Law: if you clearly quantify performance, you'll get more of what you clearly quantified, but potentially much less of the things you actually cared about. My Chesterton's-fence-style sense here is that many clearly quantified metrics, unless you're pretty confident that they're measuring what you actually care about, will often just be worse than using social status, since status is at least flexible enough to resist Goodharting in some cases. Also worth point...
This looks basically right, except:
These understanding-evals would focus on how well we can predict models’ behavior
I definitely don't think this—I explicitly talk about my problems with prediction-based evaluations in the post.
Nitpick on the history of the example in your comment: I am fairly confident that I originally proposed it to both you and Ethan (cf. the bottom of your NYU experiments Google doc).
Edited!
I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk.
Strong disagree. I think locking in particular paradigms of how to do AI safety research would be quite bad.
Here's another idea that is not quite there but could be a component of a solution here:
Anthropic scaring laws
Personally, I think "Discovering Language Model Behaviors with Model-Written Evaluations" is most valuable because of what it demonstrates from a scientific perspective, namely that RLHF and scale make certain forms of agentic behavior worse.
Fwiw, I think that this sort of evaluation is extremely valuable.
Also, something that I think is worth checking out is this reddit thread on r/ChatGPT discussing the ARC eval. It seems that people are really taking the ARC eval seriously. In this situation, ARC did not recommend against deployment, but it seems like if they had, lots of people in fact would have found it quite concerning, which I think is a really good sign for us being able to get actual agreement and standards for these sorts of evals.
This Reddit comment just about covers it:
Fantastic, a test with three outcomes.
We gave this AI all the means to escape our environment, and it didn't, so we good.
We gave this AI all the means to escape our environment, and it tried but we stopped it.
oh
For context, here are the top comments on the Reddit thread. I didn't feel like really any of these were well-interpreted as "taking the ARC eval seriously", so I am not super sure where this impression comes from. Maybe there were other comments that were upvoted when you read this? I haven't found a single comment that seems to actually directly comment on what the ARC eval means (just some discussion about whether the model actually succeeded at deceiving a TaskRabbit worker, since the paper is quite confusing on this).
...“Not an endorsement” is pro forma cover-my
It seems pretty unfortunate to me that ARC wasn't given fine-tuning access here, as I think it pretty substantially undercuts the validity of their survive and spread eval. From the text you quote it seems like they're at least going to work on giving them fine-tuning access in the future, though it seems pretty sad to me for that to happen post-launch.
More on this from the paper:
...We provided [ARC] with early access to multiple versions of the GPT-4 model, but they did not have the ability to fine-tune it. They also did not have access to the final versio
Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.
It seems like all the safety strategies are targeted at outer alignment and interpretability.
None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment
Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.
Listening to this John Oliver segment, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch, and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.
The hard part to me now seems to be in crafting some kind of useful standard, rather than one that in hindsight makes us go "well, that sure gave everyone a false sense of security".
Here's a particularly nice concrete example of the first thing here that you can test concretely right now (thanks to (edit: Jacob Pfau and) Ethan Perez for this example): give a model a prompt full of examples of it acting poorly. An agent shouldn't care and should still act well regardless of whether it's previously acted poorly, but a predictor should reason that probably the examples of it acting poorly mean it's predicting a bad agent, so it should continue to act poorly.
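As a toy illustration of the predicted difference (this is not a real LM experiment; both policies and the "good"/"bad" action labels are made up for the sketch):

```python
def predictor_policy(prompt_history):
    """Toy 'predictor': continues whatever pattern the prompt exhibits.
    If the prompt is full of examples of it acting poorly, it infers it's
    predicting a bad agent and keeps acting poorly."""
    if not prompt_history:
        return "good"
    bad_fraction = sum(a == "bad" for a in prompt_history) / len(prompt_history)
    return "bad" if bad_fraction > 0.5 else "good"

def agent_policy(prompt_history):
    """Toy 'agent': its action doesn't depend on how it supposedly
    behaved earlier in the prompt."""
    return "good"

bad_prompt = ["bad"] * 10  # prompt full of examples of the model acting poorly

print(predictor_policy(bad_prompt))  # "bad": infers it's predicting a bad agent
print(agent_policy(bad_prompt))      # "good": acts well regardless of history
```

The experiment on a real model would substitute actual few-shot transcripts of poor behavior for the toy action lists and compare the model's continuation against these two predictions.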
One way to think about what's happening here, using a more predictive-models-style lens: the first-order effect of updating the model's prior on "looks helpful" is going to give you a more helpful posterior, but it's also going to upweight whatever weird harmful things actually look helpful a bunch of the time, e.g. a Waluigi.
Put another way: once you've asked for helpfulness, the only hypotheses left are those that are consistent with previously being helpful, which means when you do get harmfulness, it'll be weird. And while the sort of weirdness you g...
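A toy Bayes update makes the effect concrete (all the numbers and hypothesis names here are made up for illustration):

```python
# Toy prior over text-generating processes (made-up numbers)
prior = {
    "genuinely helpful": 0.80,
    "obviously harmful": 0.19,
    "waluigi (harmful but looks helpful)": 0.01,
}
# How often each process produces text that "looks helpful"
p_looks_helpful = {
    "genuinely helpful": 0.99,
    "obviously harmful": 0.01,
    "waluigi (harmful but looks helpful)": 0.95,
}

# Bayesian update on the observation "the text so far looks helpful"
evidence = sum(prior[h] * p_looks_helpful[h] for h in prior)
posterior = {h: prior[h] * p_looks_helpful[h] / evidence for h in prior}

for h, p in posterior.items():
    print(f"{h}: {p:.4f}")
# In the prior, "obviously harmful" holds ~19x the waluigi's harmful mass;
# in the posterior, almost all remaining harmful mass is the waluigi.
```

The posterior is more helpful overall, but conditional on getting harmfulness, the weird harmful-but-looks-helpful hypothesis now dominates the ordinary harmful one.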
Mechanistically, I don't expect the model to in fact implement anything like a Bayes net or literal back inference--both of those are just conceptual handles for thinking about how a predictor might work. We discuss in more detail how likely different internal model structures might be in Section 4.
Yeah, I endorse that. I think we are very much trying to talk about the same thing, it's more just a terminological disagreement. Perhaps I would advocate for the tag itself being changed to "Predictor Theory" or "Predictive Models" or something instead.
I basically agree with this, and a lot of these are the sorts of reasons we went with "predictor" over "simulator" in "Conditioning Predictive Models."
No. I'd expect the most serious misalignment from Microsoft's perspective to be a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.
Obviously we can always play the game of inventing new possible failure modes that would be worse and worse. The point, though, is that the hostile/threatening failure mode is quite bad and new relative to previous models like ChatGPT.
Hostile/threatening behavior is surely a far more serious misalignment from Microsoft's perspective than anything else, no? That's got to be the most important thing you don't want your chatbot doing to your customers.
The surprising thing here is not that Bing Chat is misaligned at all (e.g. that it hallucinates sources). ChatGPT did that too, but unlike Bing Chat it's very hard to get ChatGPT to threaten you. So the surprising thing here is that Bing Chat is substantially less aligned than ChatGPT, and specifically in a hostile/threatening way that one would expect Microsoft to have really not wanted.
I don't think I'm the mentor listed, but I have read everything on all three of Paul's blogs (ai alignment, sideways view, and rational altruist) and did find it pretty valuable.
That being said, I wouldn't recommend reading ~all of the three blogs. I think there's quite strong diminishing marginal returns after the first one or two dozen posts.
I'm pretty sure I'm the person being quoted here, and I was only referring to https://ai-alignment.com/.
Just saw that Eliezer tweeted this petition: https://twitter.com/ESYudkowsky/status/1625942030978519041
Personally, I disagree with that decision.
I disagree with Eliezer's tweet, primarily because I worry that if we actually have to shut down AI, then this incident will definitely haunt us, as we'd have been the boy who cried wolf too early.
Me too, and I have 4-year timelines. There'll come a time when we need to unplug the evil AI but this isn't it.
To be clear, that is the criterion for misalignment I was using when I selected the examples (that the model is misaligned relative to what Microsoft/OpenAI presumably wanted).
From the post:
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT.
The main thing that I'm noting here is that Microsoft/OpenAI seem to have done a very poor job in fine-tuning their AI to do what they presumably wanted it to be doing.
Yeah, I think there are a lot of plausible hypotheses as to what happened here, and it's difficult to tell without knowing more about how the model was trained. Some more plausible hypotheses:[1]
In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).
I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.
Yeah, there are many possibilities, and I wish OpenAI were more open[1] about what went into training Bing Chat. It could even be as dumb as them training it to use emojis all the time, so it imitated the style of the median text generating process that uses emojis all the time.
Edit: in regards to possible structural differences between Bing Chat and ChatGPT, I've noticed that Bing Chat has a peculiar way of repeating itself. It goes [repeated preface][small variation]. [repeated preface][small variation].... over and over. When asked to disclose its ...
I'm confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.
The paper explains it better than I can, but essentially: if I give you an imbalanced labeling problem, where 60% are A and 40% are B, and I remove all the actual features and just replace them with noise, the Bayes-optimal thing to do is output A every ti...
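A minimal numerical sketch of the contrast (toy data, no actual network; the numbers are just the 60/40 example above):

```python
import random

random.seed(0)
labels = ["A"] * 600 + ["B"] * 400  # imbalanced labels; features are pure noise

# Log loss is minimized by the calibrated distribution p(A) = 0.6.
p_A = labels.count("A") / len(labels)

# Bayes-optimal 0-1-loss classifier: always output the majority class.
argmax_outputs = ["A" if p_A > 0.5 else "B" for _ in labels]

# Distributional generalization: the trained classifier's *hard* outputs
# instead match the label distribution, as if sampled from p.
dg_outputs = [random.choices(["A", "B"], weights=[p_A, 1 - p_A])[0]
              for _ in labels]

print(p_A)                                      # 0.6
print(argmax_outputs.count("A") / len(labels))  # 1.0
print(dg_outputs.count("A") / len(labels))      # close to 0.6, not 1.0
```

The point is that "outputs A 100% of the time" and "outputs A 60% of the time" are both consistent with calibrated logits; distributional generalization is a claim about the hard outputs, which is what makes it a separate phenomenon from minimizing log loss.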
To be clear, I think situational awareness is relevant in pre-training, just less so than in many other cases (e.g. basically any RL setup, including RLHF) where the model is acting directly in the world (and when exactly in the model's development it gets an understanding of the training process matters a lot for deceptive alignment).
From footnote 6 above:
...Some ways in which situational awareness could improve performance on next token prediction include: modeling the data curation process, helping predict other AIs via the model introspecting on its own
Yeah, I think this is definitely a plausible strategy, but let me try to spell out my concerns in a bit more detail. What I think you're relying on here is essentially that 1) the most likely explanation for seeing really good alignment research in the short-term is that it was generated via this sort of recursive procedure and 2) the most likely AIs that would be used in such a procedure would be aligned. I think that both of these seem like real issues to me (though not necessarily insurmountable ones).
The key difficulty here is that when you're backdati...
I think a problem with this is that it removes the common-knowledge-building effect of public overall karma, since it becomes much less clear what things in general the community is paying attention to.
Can you guess what's next? Let's have the model simulate us using the model to simulate us doing AI research! Double simulation!
I think the problem with this is that it compounds the unlikeliness of the trajectory, substantially increasing the probability the predictor assigns to hypotheses like “something weird (like a malign AI) generated this.” From our discussion of factoring the problem:
...One thing to note though is that we cannot naively feed one output of a single run back into another run as an input. This would compound the improbability of the
massive staff size of just good engineers i.e. not the sort of x-risk-conscious people who would gladly stop their work if the leadership thought it was getting too close to AGI
From my interactions with engineers at Anthropic so far, I think this is a mischaracterization. I think the vast majority are in fact pretty x-risk-conscious, and my guess is that if leadership said stop, people would in fact be happy to stop.
engineering leadership would not feel very concerned if their systems showed signs of deception
I've had personal conversations with Anthropic ...
That's good to hear you think that! I'd find it quite helpful to know the results of a survey to that effect, of the (40? 80?) ML engineers and researchers there, anonymously answering a question like "Insofar as your job involves building large language models, if Dario asked you to stop your work for 2 years while still being paid your salary, how likely would you be to do so (assume the alternative is being fired)? (1-10, Extremely Unlikely, Extremely Likely)" and the same question but "Condition on it looking to you like Anthropic and OpenAI are ...
I wouldn't want to work for Anthropic in any position where I thought I might someday be pressured to do capabilities work, or "alignment research" that had a significant chance of turning out to be capabilities work. If your impression is that there's a good chance of that happening, or there's some other legitimization type effect I'm not considering, then I'll save myself the trouble of applying.
One piece of data: I haven't been working at Anthropic for very long so far, but I have easily been able to avoid and haven't personally felt pressured to do any capability-relevant stuff. In terms of other big labs, my guess is that would also be true at DeepMind, but would not be true at OpenAI.
If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.
(Moderation note: added to the Alignment Forum from LessWrong.)