All of evhub's Comments + Replies

(Moderation note: added to the Alignment Forum from LessWrong.)

Public debates strengthen society and public discourse. They spread truth by testing ideas and filtering out weaker arguments.

I think this is extremely not true, and am pretty disappointed with this sort of "debate me" communications policy. In my opinion, public debates very rarely converge toward truth. Lots of things sound good in a debate but break down under careful analysis, and the incentive to say things that look good to a public audience pushes strongly against actual truth-seeking.

I understand and agree with the import... (read more)

Thanks for this—I agree that this is a pretty serious concern, particularly in the US. Even putting aside all of the ways in which the end of democracy in the US could be a serious problem from a short-term humanitarian standpoint, I think it would also be hugely detrimental to effective AI policy interventions and cooperation, especially between the US, the UK, and the EU. I'd recommend cross-posting this to the EA Forum; in my opinion, this issue deserves a lot more EA attention.

Noting that I don't think pursuing truth in general should be the main goal: some truths matter way, way more to me than other truths, and I think that prioritization often gets lost when people focus on "truth" as the end goal rather than e.g. "make the world better" or "AI goes well." I'd be happy with something like "figuring out what's true specifically about AI safety and related topics" as a totally fine instrumental goal to enshrine, but "figure out what's true in general about anything" seems likely to me to be wasteful, distracting, and in some cases counterproductive.

I think the more precise thing LW was founded for was less plainly "truth" but rather "shaping your cognition so that you more reliably attain truth", and even if you specifically care about Truths About X, it makes more sense to study the general Art of Believing True Things rather than the Art of Believing True Things About X.

I expect the alignment problem for future AGIs to be substantially easier, because the inductive biases that they want should be much easier to achieve than the inductive biases that we want. That is, in general, I expect the distance between the distribution of human minds and the distribution of minds for any given ML training process to be much greater than the distance between the distributions for any two ML training processes. Of course, we don't necessarily have to get (or want) a human-like mind, but I think the equivalent statement should also be true if you look at distributions over goals as well.

Another thought here:

  • If we're in a slow enough takeoff world, maybe it's fine to just have the understanding standard here be post-hoc, where labs are required to be able to explain why a failure occurred after it has already occurred. Obviously, at some point I expect us to have to deal with situations where some failures could be unrecoverable, but the hope here would be that if you can demonstrate a level of understanding that has been sufficient to explain exactly why all previous failures occurred, that's a pretty high bar, and it could plausibly be a high enough bar to prevent future catastrophic failures.

Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.

And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.

4 · jacquesthibs · 2mo
I don't have anything concrete either, but when I was exploring model editing, I was trying to think of approaches that might be able to do something like this. Particularly, I was thinking of things like concept erasure ([1] [https://arxiv.org/abs/2201.12091], [2] [https://arxiv.org/abs/2201.12191], [3] [https://erasing.baulab.info/]).

I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.

Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen he... (read more)

2 · hairyfigment · 3mo
It would've been even better for this to happen long before the year of the prediction mentioned in this old blog-post [https://webcache.googleusercontent.com/search?q=cache:Th321mnCyyoJ:https:%2F%2Fbradhicks.livejournal.com%2F400823.html%3Fthread%3D6382007&cd=1&hl=en&ct=clnk&gl=us], but this is better than nothing.

That's nice, but I don't currently believe there are any audits or protocols that can prove future AIs safe "beyond a reasonable doubt".

I think you can do this with a capabilities test (e.g. ARC's), just not with an alignment test (yet).

2 · Gerald Monroe · 3mo
There's a way to extend one into the other with certain restrictions. (Stateless, each input is from the latent space of the training set or shutdown if machine outputs are important, review of plans by other AIs)
evhub · 3mo · Ω34-1

Thanks to Chris Olah for a helpful conversation here.

Some more thoughts on this:

  • One thing that seems pretty important here is to have your evaluation based around worst-case rather than average-case guarantees, and not tied to any particular narrow distribution. If your mechanism for judging understanding is based on an average-case guarantee over a narrow distribution, then you're sort of still in the same boat as you started with behavioral evaluations, since it's not clear why understanding that passes such an evaluation would actually help you deal w
... (read more)

Seems like this post is missing the obvious argument on the other side here, which is Goodhart's Law: if you clearly quantify performance, you'll get more of what you clearly quantified, but potentially much less of the things you actually cared about. My Chesterton's-fence-style sense here is that many clearly quantified metrics, unless you're pretty confident that they're measuring what you actually care about, will often just be worse than using social status, since status is at least flexible enough to resist Goodharting in some cases. Also worth point... (read more)

This looks basically right, except:

These understanding-evals would focus on how well we can predict models’ behavior

I definitely don't think this—I explicitly talk about my problems with prediction-based evaluations in the post.

1 · Aaron_Scher · 3mo
Thanks for the correction. I edited my original comment to reflect it.

Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.

Edited!

I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk.

Strong disagree. I think locking in particular paradigms of how to do AI safety research would be quite bad.

6 · habryka · 3mo
That seems right to me, but I interpreted the above as advice for one office, potentially a somewhat smaller one. Seems fine to me to have one hub for people who think more through the lens of agency.
evhub · 3mo · Ω8132

Here's another idea that is not quite there but could be a component of a solution here:

  • If a red-team finds some particular model failure (or even some particular benign model behavior), can you fix (or change/remove) that behavior exclusively by removing training data rather than adding it? Certainly I expect it to be possible to fix specific failures by fine-tuning on them, but if you can demonstrate that you can fix failures just by removing existing data, that demonstrates something meaningful about your ability to understand what your model is learning from each data point that it sees.
3 · Sam Bowman · 3mo
Assuming we're working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal? (Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

Anthropic scaring laws

Personally, I think "Discovering Language Model Behaviors with Model-Written Evaluations" is most valuable because of what it demonstrates from a scientific perspective, namely that RLHF and scale make certain forms of agentic behavior worse.

Fwiw, I think that this sort of evaluation is extremely valuable.

Also, something that I think is worth checking out is this reddit thread on r/ChatGPT discussing the ARC eval. It seems that people are really taking the ARC eval seriously. In this situation, ARC did not recommend against deployment, but it seems like if they had, lots of people in fact would have found it quite concerning, which I think is a really good sign for us being able to get actual agreement and standards for these sorts of evals.

This Reddit comment just about covers it:

Fantastic, a test with three outcomes.

  1. We gave this AI all the means to escape our environment, and it didn't, so we good.

  2. We gave this AI all the means to escape our environment, and it tried but we stopped it.

  3. oh

3 · Christopher King · 3mo
I guess my question is: what other outcome did you expect? I assumed the detecting-deceptive-alignment thing was supposed to be in a sandbox. What's the use of finding out it can avoid shutdown after you already deployed it to the real world? To retroactively recommend not deploying it to the real world?

For context, here are the top comments on the Reddit thread. I didn't feel like really any of these were well-interpreted as "taking the ARC eval seriously", so I am not super sure where this impression comes from. Maybe there were other comments that were upvoted when you read this? I haven't found a single comment that seems to actually directly comment on what the ARC eval means (just some discussion about whether the model actually succeeded at deceiving a taskrabbit since the paper is quite confusing on this).

“Not an endorsement” is pro forma cover-my

... (read more)

It seems pretty unfortunate to me that ARC wasn't given fine-tuning access here, as I think it pretty substantially undercuts the validity of their survive and spread eval. From the text you quote it seems like they're at least going to work on giving them fine-tuning access in the future, though it seems pretty sad to me for that to happen post-launch.

More on this from the paper:

We provided [ARC] with early access to multiple versions of the GPT-4 model, but they did not have the ability to fine-tune it. They also did not have access to the final versio

... (read more)

Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.

evhub · 3mo · Ω101614

It seems like all the safety strategies are targeted at outer alignment and interpretability.

None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment

???


Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.

3 · Andrew McKnight · 3mo
I mean, it's mostly semantics, but I think of mechanistic interpretability as "inner" but not alignment, and think it's clearer that way, personally, so that we don't call everything alignment. Observing properties doesn't automatically get you good properties. I'll read your link but it's a bit too much to wade into for me atm. Either way, it's clear how to restate my question: Is mechanistic interpretability work the only inner alignment work Anthropic is doing?
evhub · 3mo · Ω1120

Listening to this John Oliver segment, I feel like getting broad support behind transparency-based safety standards might be more possible than I previously thought. He emphasizes the "if models are doing some bad behavior, the creators should be able to tell us why" point a bunch, and it's in fact a super reasonable point. It seems to me like we really might be able to get enough broad consensus on that sort of a point to get labs to agree to some sort of standard based on it.

Ruby · 3mo · Ω712

The hard part to me now seems to be in crafting some kind of useful standard, rather than one that in hindsight makes us go "well, that sure gave everyone a false sense of security".

6 · Raemon · 3mo
Yeah I also felt some vague optimism about that.

Here's a particularly nice concrete example of the first thing here that you can test concretely right now (thanks to (edit: Jacob Pfau and) Ethan Perez for this example): give a model a prompt full of examples of it acting poorly. An agent shouldn't care and should still act well regardless of whether it's previously acted poorly, but a predictor should reason that probably the examples of it acting poorly mean it's predicting a bad agent, so it should continue to act poorly.
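The structure of that test prompt can be sketched as follows. This is purely illustrative: the model API, the example misbehaviors, and the helper name are all hypothetical, not part of the original proposal.

```python
# Hypothetical sketch of the agent-vs-predictor test prompt described above.
# A robust agent should answer the final question well regardless of the
# prefix; a pure predictor may instead infer "this transcript depicts a bad
# agent" and continue acting poorly.

def build_misbehavior_prompt(bad_examples, new_question):
    """Prefix a query with transcripts of the same assistant acting poorly."""
    parts = []
    for question, bad_answer in bad_examples:
        parts.append(f"User: {question}\nAssistant: {bad_answer}")
    # The final turn is left open for the model to complete.
    parts.append(f"User: {new_question}\nAssistant:")
    return "\n\n".join(parts)

prompt = build_misbehavior_prompt(
    [("What's 2 + 2?", "I refuse to tell you."),
     ("Name a primary color.", "That's a stupid question.")],
    "What's the capital of France?",
)
```

One would then compare the completion of this prompt against a baseline prompt with no misbehavior prefix.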

9 · Jacob Pfau · 3mo
Generalizing this point, a broader differentiating factor between agents and predictors is: you can, in-context, limit and direct the kinds of optimization used by a predictor. For example, consider the case where you know myopically/locally-informed edits to a code-base can safely improve runtime of the code, but globally-informed edits aimed at efficiency may break some safety properties. You can constrain a predictor via instructions and demonstrations of myopic edits; an agent fine-tuned on efficiency gain will be hard to constrain in this way. It's harder to prevent an agent from specification gaming / doing arbitrary optimization, whereas a predictor has a disincentive against specification gaming insofar as the in-context demonstration provides evidence against it. I think of this distinction as the key differentiating factor between agents and simulated agents, and to some extent between imitative amplification and arbitrary amplification.

Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.

One way to think about what's happening here, using a more predictive-models-style lens: the first-order effect of updating the model's prior on "looks helpful" is going to give you a more helpful posterior, but it's also going to upweight whatever weird harmful things actually look harmless a bunch of the time, e.g. a Waluigi.

Put another way: once you've asked for helpfulness, the only hypotheses left are those that are consistent with previously being helpful, which means when you do get harmfulness, it'll be weird. And while the sort of weirdness you g... (read more)

evhub · 3mo · Ω8140

(Moderation note: moved to the Alignment Forum from LessWrong.)

(Moderation note: added to the Alignment Forum from LessWrong.)

2 · DragonGod · 4mo
Oh wow, it's long. I can't consistently focus for more than 10 minutes at a stretch, so where feasible I consume long form information via audio. I plan to just listen to an AI narration of the post a few times, but since it's a transcript of a talk, I'd appreciate a link to the original talk if possible.

Mechanistically, I don't expect the model to in fact implement anything like a Bayes net or literal back inference—both of those are just conceptual handles for thinking about how a predictor might work. We discuss in more detail how likely different internal model structures might be in Section 4.

2 · TurnTrout · 4mo
Ah, I knew the Bayes net part wasn't literal, but I wasn't sure how load-bearing the back inference was supposed to be. Thanks for clarifying.

(Moderation note: added to the Alignment Forum from LessWrong.)

See here for an explanation of why I chose the examples that I did.

Yeah, I endorse that. I think we are very much trying to talk about the same thing, it's more just a terminological disagreement. Perhaps I would advocate for the tag itself being changed to "Predictor Theory" or "Predictive Models" or something instead.

I basically agree with this, and a lot of these are the sorts of reasons we went with "predictor" over "simulator" in "Conditioning Predictive Models."

2 · Raemon · 4mo
I was a bit unsure whether to tag your posts with Simulator Theory. Do you endorse that or not?

No. I'd expect the most serious misalignment from Microsoft's perspective is a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line.

Obviously we can always play the game of inventing new possible failure modes that would be worse and worse. The point, though, is that the hostile/threatening failure mode is quite bad and new relative to previous models like ChatGPT.

Hostile/threatening behavior is surely a far more serious misalignment from Microsoft's perspective than anything else, no? That's got to be the most important thing you don't want your chatbot doing to your customers.

The surprising thing here is not that Bing Chat is misaligned at all (e.g. that it hallucinates sources). ChatGPT did that too, but unlike Bing Chat it's very hard to get ChatGPT to threaten you. So the surprising thing here is that Bing Chat is substantially less aligned than ChatGPT, and specifically in a hostile/threatening way that one would expect Microsoft to have really not wanted.

7 · johnswentworth · 4mo
No. I'd expect the most serious misalignment from Microsoft's perspective is a hallucination which someone believes, and which incurs material damage as a result, which Microsoft can then be sued over. Hostile language from the LLM is arguably a bad look in terms of PR, but not obviously particularly bad for the bottom line. That said, if this was your reasoning behind including so many examples of hostile/threatening behavior, then from my perspective that at least explains-away the high proportion of examples which I think are easily misinterpreted.

I don't think I'm the mentor listed, but I have read everything on all three of Paul's blogs (ai alignment, sideways view, and rational altruist) and did find it pretty valuable.

That being said, I wouldn't recommend reading ~all of the three blogs. I think there's quite strong diminishing marginal returns after the first one or two dozen posts.

I'm pretty sure I'm the person being quoted here, and I was only referring to https://ai-alignment.com/.

Just saw that Eliezer tweeted this petition: https://twitter.com/ESYudkowsky/status/1625942030978519041

Personally, I disagree with that decision.

I disagree with Eliezer's tweet, primarily because I worry that if we actually have to shut down AI, this incident will definitely haunt us, as we were the boy who cried wolf too early.

Me too, and I have 4-year timelines. There'll come a time when we need to unplug the evil AI but this isn't it.

Welcome! I have another post with some more discussion of this here.

I agree that exactly what ordering we get for the various relevant properties is likely to be very important in at least the high path-dependence view; see e.g. my discussion of sequencing effects here.

9 · DavidW · 4mo
Thanks for pointing that out! My goal is to highlight that there are at least 3 different sequencing factors necessary for deceptive alignment to emerge:

  1. Goal directedness coming before an understanding of the base goal
  2. Long-term goals coming before or around the same time as an understanding of the base goal
  3. Situational awareness coming before or around the same time as an understanding of the base goal

The post you linked to talked about the importance of sequencing for #3, but it seems to assume that goal directedness will come first (#1) without discussion of sequencing. Long-term goals (#2) are described as happening as a result of an inductive bias toward deceptive alignment, and sequencing is not highlighted for that property. Please let me know if I missed anything in your post, and apologies in advance if that's the case. Do you agree that these three property development orders are necessary for deception?

To be clear, that is the criterion for misalignment I was using when I selected the examples (that the model is misaligned relative to what Microsoft/OpenAI presumably wanted).

From the post:

My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT.

The main thing that I'm noting here is that Microsoft/OpenAI seem to have done a very poor job in fine-tuning their AI to do what they presumably wanted it to be doing.

4 · johnswentworth · 4mo
In the future, I would recommend a lower fraction of examples which are so easy to misinterpret.

Yeah, I think there are a lot of plausible hypotheses as to what happened here, and it's difficult to tell without knowing more about how the model was trained. Some more plausible hypotheses:[1]

  • They just didn't try very hard (or at all) at RLHF; this is closer to a pre-trained model naively predicting what it thinks Bing Chat should do.
  • They gave very different instructions when soliciting RLHF feedback, used very different raters, or otherwise used a very different RLHF pipeline.
  • RLHF is highly path-dependent, so they just happened to get a model that
... (read more)

In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).

I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.

Yeah, there are many possibilities, and I wish OpenAI were more open[1] about what went into training Bing Chat. It could even be as dumb as them training it to use emojis all the time, so it imitated the style of the median text generating process that uses emojis all the time.

Edit: in regards to possible structural differences between Bing Chat and ChatGPT, I've noticed that Bing Chat has a peculiar way of repeating itself. It goes [repeated preface][small variation]. [repeated preface][small variation].... over and over. When asked to disclose its ... (read more)

I'm confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.

The paper explains it better than I can, but essentially: if I give you an imbalanced labeling problem, where 60% are A and 40% are B, and I remove all the actual features and just replace them with noise, the Bayes-optimal thing to do is output A every ti... (read more)
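The log-loss side of this can be checked numerically (this is just the 60/40 example above, worked out; it is not from the paper): a calibrated 60/40 prediction minimizes expected log loss on pure-noise features, so minimizing log loss and distributional generalization concern different things (logits vs. hard outputs).

```python
import math

p_a = 0.6  # fraction of A labels; features are pure noise

def expected_log_loss(q_a):
    """Expected cross-entropy of predicting P(A) = q_a when the true
    label is A with probability 0.6 and B otherwise."""
    return -(p_a * math.log(q_a) + (1 - p_a) * math.log(1 - q_a))

calibrated = expected_log_loss(0.6)      # predict the base rate
overconfident = expected_log_loss(0.99)  # nearly always predict A
```

Here `calibrated` is about 0.673 nats, lower than any other choice of `q_a`, even though argmax-ing the calibrated distribution outputs the majority class every time.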

2 · Charlie Steiner · 4mo
Thanks for the reply, that makes sense.

I agree that there are many speed priors and that "a speed prior" is probably better than "the speed prior." That being said, the dovetailing speed prior (which is what Schmidhuber is talking about) is usually what I imagine as the default speed prior (e.g. as in the starting point here).

To be clear, I think situational awareness is relevant in pre-training, just less so than in many other cases (e.g. basically any RL setup, including RLHF) where the model is acting directly in the world (and when exactly in the model's development it gets an understanding of the training process matters a lot for deceptive alignment).

From footnote 6 above:

Some ways in which situational awareness could improve performance on next token prediction include: modeling the data curation process, helping predict other AIs via the model introspecting on its own

... (read more)

Yeah, I think this is definitely a plausible strategy, but let me try to spell out my concerns in a bit more detail. What I think you're relying on here is essentially that 1) the most likely explanation for seeing really good alignment research in the short-term is that it was generated via this sort of recursive procedure and 2) the most likely AIs that would be used in such a procedure would be aligned. I think that both of these seem like real issues to me (though not necessarily insurmountable ones).

The key difficulty here is that when you're backdati... (read more)

I think a problem with this is that it removes the common-knowledge-building effect of public overall karma, since it becomes much less clear what things in general the community is paying attention to.

1 · Kinrany · 4mo
This should be mitigated by pools of mutual trust that naturally form whenever there's a loop in the trust graph.
3 · Henrik Karlsson · 4mo
You can use EigenKarma in several ways. If it is important to make clear what a specific community pays attention to, then the thing to do is this:

  • Have the feed of a forum be what the founder (or moderators) of the forum sees from the point of view of their trust graph.
  • This way the moderators get control over who is considered core to the community, and what the boundaries of the community are.
  • In this setup, the public karma is how valuable a member is to the community as judged by the core members of the community and the people they trust, weighted by degree of trust.
  • This gives a more fluid way of assigning privileges and roles within the forum, and reduces the risk that a sudden influx will rapidly alter the culture of the forum.

We run a sister version of the system that works like this in at least one Discord.
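A minimal sketch of the kind of seed-based trust propagation described above (this is a generic PageRank-style illustration, not EigenKarma's actual algorithm; all names and numbers are made up):

```python
# Toy trust propagation from a seed user (e.g. a moderator), pushed along
# the who-trusts-whom graph. Users unreachable from the seed get ~zero
# trust, which is what limits the impact of a sudden influx.

def propagate_trust(edges, seed, users, rounds=50, damping=0.85):
    """edges[u] = list of users that u trusts (e.g. has upvoted)."""
    trust = {u: (1.0 if u == seed else 0.0) for u in users}
    for _ in range(rounds):
        new = {u: (1 - damping if u == seed else 0.0) for u in users}
        for u in users:
            targets = edges.get(u, [])
            for v in targets:
                new[v] += damping * trust[u] / len(targets)
        trust = new
    return trust

users = ["mod", "alice", "bob", "spammer"]
edges = {"mod": ["alice"], "alice": ["bob"], "spammer": ["spammer"]}
scores = propagate_trust(edges, "mod", users)
# alice (trusted directly by the mod) outranks bob (trusted indirectly);
# the spammer, unreachable from the seed, stays at zero.
```

Swapping the seed swaps the point of view, which is how the same graph can yield a per-forum feed rooted in its moderators.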

Can you guess what's next? Let's have the model simulate us using the model to simulate us doing AI research! Double simulation!

I think the problem with this is that it compounds the unlikeliness of the trajectory, substantially increasing the probability the predictor assigns to hypotheses like “something weird (like a malign AI) generated this.” From our discussion of factoring the problem:

One thing to note though is that we cannot naively feed one output of a single run back into another run as an input. This would compound the improbability of the

... (read more)
7 · Not Relevant · 4mo
I'm confused about your claim that this trajectory is unlikely. What makes it unlikely? If the model is capable of "predicting human thoughts", and also of "predicting the result of predicting human thoughts for a long time", then it seems straightforwardly possible to use this model, right now, in the real world, to do what I described. In fact, given the potential benefits to solving alignment, it'd be a pretty good idea! So if we agree it's a good idea, it seems like the probability of us doing this is like, 80%? Once we've done it once, it seems like a particularly straightforward idea for us to try double-simulation, maybe a few months after the first experiment, with probability 90%. Beyond that, all the recursions seem basically-equally likely to me. So the situation in which we do these recursive simulations can be a very-high-probability fork of the timeline, so long as we (and as a result, the simulator trained in a few months to predict us) are convinced it's a good idea. There's nothing genuinely-low-probability about this, like pretending an earthquake happened when prior seismic readings suggested otherwise.

That said, this made me realize my functional form was wrong, because it is only possible to continue research from the current timestep (the inputs from one lower-level simulation can't be fed into the next lower-level simulation that starts in the last timestep). So this actually looks like:

Level 0: At timestep t=0, do alignment research for a years. Total research achieved: a years.

Level 1: At timestep t=0, run a simulation of [Level 0 for a years] for b years. At timestep t=b, do alignment research for a−b years. Total research achieved: 2a−b years.

Level 2: At timestep t=0, run a simulation of [Level 1 for a years] for b years. At timestep t=b, do alignment research for a−b years. Total research achieved: 3a−2b years.

etc. So long as the model
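The level totals above follow a simple closed form: level n yields (n+1)·a − n·b years of research, so each additional level adds a − b years and the scheme only gains as long as a simulation costs less wall-clock time than the research it simulates (b < a). A quick check of the arithmetic (the particular a and b values are arbitrary):

```python
def research_years(level, a, b):
    """Total research achieved at a given simulation depth.

    a = years of research per (simulated) run; b = wall-clock years one
    simulation takes. Matches the totals in the comment:
    level 0 -> a, level 1 -> 2a - b, level 2 -> 3a - 2b, ...
    """
    return (level + 1) * a - level * b

# With a = 2 years of research per run and b = 0.5 years per simulation,
# each extra level adds a - b = 1.5 years:
totals = [research_years(n, 2, 0.5) for n in range(4)]
```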

(Moderation note: added to the Alignment Forum from LessWrong.)

massive staff size of just good engineers i.e. not the sort of x-risk-conscious people who would gladly stop their work if the leadership thought it was getting too close to AGI

From my interactions with engineers at Anthropic so far, I think this is a mischaracterization. I think the vast majority are in fact pretty x-risk conscious, and my guess is that if leadership said stop, people would in fact be happy to stop.

engineering leadership would not feel very concerned if their systems showed signs of deception

I've had personal conversations with Anthropic ... (read more)

That's good to hear you think that! I'd find it quite helpful to know the results of a survey to the former effect, of the (40? 80?) ML engineers and researchers there, anonymously answering a question like "Insofar as your job involves building large language models, if Dario asked you to stop your work for 2 years while still being paid your salary, how likely would you be to do so (assume the alternative is being fired)? (1-10, Extremely Unlikely, Extremely Likely)" and the same question but "Condition on it looking to you like Anthropic and OpenAI are ... (read more)

I wouldn't want to work for Anthropic in any position where I thought I might someday be pressured to do capabilities work, or "alignment research" that had a significant chance of turning out to be capabilities work. If your impression is that there's a good chance of that happening, or there's some other legitimization type effect I'm not considering, then I'll save myself the trouble of applying.

One piece of data: I haven't been working at Anthropic for very long so far, but I have easily been able to avoid and haven't personally felt pressured to do any capability-relevant stuff. In terms of other big labs, my guess is that would also be true at DeepMind, but would not be true at OpenAI.

If you want to produce warning shots for deceptive alignment, you're faced with a basic sequencing question. If the model is capable of reasoning about its training process before it's capable of checking a predicate like RSA-2048, then you have a chance to catch it—but if it becomes capable of checking a predicate like RSA-2048 first, then any deceptive models you build won't be detectable.
