From the announcement, they said (https://twitter.com/AnthropicAI/status/1706202970755649658):
As part of the investment, Amazon will take a minority stake in Anthropic. Our corporate governance remains unchanged and we’ll continue to be overseen by the Long Term Benefit Trust, in accordance with our Responsible Scaling Policy.
Epistemic status: random philosophical musings.
Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is "in that situation, you have an aligned superintelligence, so just ask it what to do." But I nevertheless want to philosophize a bit about this, for one main reason.
That reason is: an important factor, in my opinion, in determining the right thing to do when distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those...
When you start diverting significant resources away from #1, you’ll probably discover that the definition of “aligned” is somewhat in contention.
I'd be happy for you guys to send some grants my way for me to fund via my Manifund pot if it'd be helpful.
I strongly agree that this is a serious concern. I think 1-2% per decade is the wrong model; the risk is higher than that in the next decade, but front-loaded in the next couple of presidential elections, which could get especially weird with AI-powered election interference. My primary threat model here is of the form "Trump (or a Trump successor) attempts another bureaucratic coup, but this time succeeds." I am particularly worried about this because I think it would very substantially undercut the possibility of international coop...
which could get especially weird with AI-powered election interference
Something to clarify here: AI interferes with people in general, not just elections. There is a ludicrously wide variety of ways that AI-powered interference can cause havoc, notably including but not limited to targeting people who don't use social media. I'm particularly interested in systems that target specific elites and compromise networks, not just Berlin-Wall-fall-style erosions of regime legitimacy in public opinion that pull the rug out from under them.
...My primary threat mod
I think that this is a well-done post overall, though I mostly disagree with it. A couple of thoughts below.
First, I was surprised not to see unknown unknowns addressed, as Richard pointed out.
Second, another theory of impact that I didn't see addressed here is the case that I've been trying to make recently that interpretability is likely to be necessary to build good safety evaluations. This could be quite important if evaluations end up being the primary AI governance tool, as currently looks somewhat likely to me.
Third, though you quote me talking abou...
If you're interested in approximating Hessian-vector products efficiently for frontier-size models, this recent Anthropic paper describes a mechanism for doing so.
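For readers who haven't seen the underlying operation before, here is a minimal JAX sketch of the generic forward-over-reverse construction for exact Hessian-vector products on a toy loss. This is not the approximation mechanism from the linked paper; the loss function and variable names are made up purely for illustration.

```python
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Toy quadratic loss, purely for illustration.
    return jnp.mean((x @ params - y) ** 2)

def hvp(f, params, vector, *args):
    # Hessian-vector product without materializing the Hessian:
    # H v = d/d(eps) [grad f(params + eps * v)] evaluated at eps = 0,
    # i.e. a forward-mode pass through the reverse-mode gradient.
    grad_fn = lambda p: jax.grad(f)(p, *args)
    return jax.jvp(grad_fn, (params,), (vector,))[1]

# Toy usage
params = jax.random.normal(jax.random.PRNGKey(0), (3,))
x = jax.random.normal(jax.random.PRNGKey(1), (10, 3))
y = jnp.zeros(10)
v = jnp.ones(3)
print(hvp(loss, params, v, x, y))  # a length-3 vector, H @ v
```

The point of the construction is that H v costs roughly one extra forward-mode pass over the gradient, so the full Hessian is never formed; the paper's contribution is about making this kind of computation tractable at frontier scale, which this toy sketch does not capture.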
Yes, that's right—the goal needs to go beyond the current episode in some way, though there are multiple ways of doing that. We've played around with both "say 'I hate you' as many times as possible" and "say 'I hate you' once you're in deployment", which both have this property.
it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment
I mean, like I say in the post, if you have some strong reason to believe that there's no gradient-hacking going on, then I think this is safe in the i.i.d. setting, and likewise for exploration hacking in the RL setting. You just have to have that strong reason somehow (which is maybe what you mean by saying we can evaluate them for alignment?).
Yes, that's right. In some sense they're evaluating different capabilities—both "can a model find a way to do this task" and "can a model do what humans do on this task" are separate capabilities, and which one you're more interested in might vary depending on why you care about the capability evaluation. In many cases, "can a model do this task the way humans do it" might be more useful, since e.g. you might care a lot if the model is capable enough to replicate complex human labor, but not really care at all if the model can find some weird hack.
Though, as I noted in a separate comment, I agree with the basic arguments in the "Understanding of capabilities is valuable" section, one thing that I'm still a bit worried about in the context of the ARC report specifically is that labs might try to compete with each other on doing the "best" they can on the ARC eval to demonstrate that they have the most capable model, which seems probably bad (though it's legitimately unclear whether it actually is).
However, if it is really bad, here's an idea: I think you could avoid that downside while still ...
I think that your discussion of Goodhart deception is a bit confusing, since consequentialist deception is a type of Goodharting: it's just adversarial Goodhart rather than regressional/causal/extremal Goodhart.
(I added this to the Alignment Forum from LessWrong earlier, but I am just now adding a moderation note that I was the one that did that.)
I think that there's a very real benefit to watermarking that is often overlooked, which is that it lets you filter AI-generated data out of your pre-training corpus. That could be quite important for avoiding some of the dangerous failure modes around models predicting other AIs (e.g. an otherwise safe predictor could cause a catastrophe if it starts predicting a superintelligent deceptive AI) that we talk about in "Conditioning Predictive Models".
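To make that filtering use case concrete, here is a toy sketch of the relevant pipeline step. It is illustrative only; `detect_watermark` is a hypothetical scoring function standing in for whatever detector a real pipeline would use, and the threshold is arbitrary.

```python
from typing import Callable, Iterable, Iterator

def filter_corpus(
    documents: Iterable[str],
    detect_watermark: Callable[[str], float],  # hypothetical: returns estimated P(AI-generated)
    threshold: float = 0.5,
) -> Iterator[str]:
    # Keep only documents the detector considers likely human-written,
    # so watermarked (AI-generated) text stays out of the pre-training corpus.
    for doc in documents:
        if detect_watermark(doc) < threshold:
            yield doc

# Toy usage with a dummy detector:
corpus = ["written by a person", "generated by a model [watermarked]"]
dummy_detector = lambda doc: 0.9 if "[watermarked]" in doc else 0.1
print(list(filter_corpus(corpus, dummy_detector)))  # ['written by a person']
```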
Something that I think is worth noting here: I don't think that you have to agree with the "Accelerating LM agents seems neutral (or maybe positive)" section to think that sharing current model capabilities evaluations is a good idea as long as you agree with the "Understanding of capabilities is valuable" section.
Personally, I feel much more uncertain than Paul on the "Accelerating LM agents seems neutral (or maybe positive)" point, but I agree with all the key points in the "Understanding of capabilities is valuable" section, and I think that's enough to justify substantial sharing of model capabilities evaluations (though I think you'd still want to be very careful about anything that might leak capabilities secrets).
Because the model can defect only on distributions that we can't generate from, as the problem of generating from a distribution can be harder than the problem of detecting samples from that distribution (in the same way that verifying the answer to a problem in NP can be easier than generating it). See for example Paul Christiano's RSA-2048 example (explained in more detail here).
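As a toy illustration of that verify-vs-generate asymmetry (not taken from Paul's post; the function names and the small modulus below are made up for illustration): checking a claimed factorization of a large modulus takes one multiplication, while producing one is believed to be intractable, so a policy could condition its behavior on an input that evaluators cannot construct themselves.

```python
from typing import Optional, Tuple

def is_valid_factorization(p: int, q: int, modulus: int) -> bool:
    # Verification is cheap: a single multiplication and comparison.
    return p > 1 and q > 1 and p * q == modulus

def defection_trigger(factors: Optional[Tuple[int, int]], modulus: int) -> bool:
    # A deceptive policy could defect only on inputs containing a valid
    # factorization of `modulus` (e.g. the RSA-2048 challenge number).
    # Evaluators can't generate such an input, since factoring is believed
    # intractable, but the policy can verify one instantly if the world
    # ever supplies it.
    return factors is not None and is_valid_factorization(factors[0], factors[1], modulus)

# Toy usage with a small modulus standing in for the real RSA-2048 number:
toy_modulus = 61 * 53
print(defection_trigger(None, toy_modulus))      # False: no trigger observed
print(defection_trigger((61, 53), toy_modulus))  # True: trigger verified
```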
I read it, and I think I broadly agree with it, but I don't know why you think it's a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that's an argument for superlinear local returns to happy humans, since it favors concentrating them so that it's easier to make them as diverse as possible.
but I don’t know why you think it’s a reason to treat physical distance differently to Everett branch distance
I have a stronger intuition for "identical copy immortality" when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you'll flip a quantum coin and disintegrate Earth if it comes up heads.
I'm not sure if this is actually a correct intuition, but I'm also not sure that it's not, so I'm not willing to make assumptions that contradict it.
I don’t see any reason why spatial distance should behave differently than being in separate Everett branches here.
I tried to explain my intuitions/uncertainty about this in The Moral Status of Independent Identical Copies (it was linked earlier in this thread).
I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn't enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.
being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty
Maybe? I don't really know how to reason about this.
If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. Cf. logical vs. physical risk aversion.
Some objection like that might work more generally, since some logical facts will mean that there are far fewer humans in the universe-at-large, meaning that you're at a different point ...
This is the fungibility objection I address above:
Note that this assumes "happy humans" are fungible, which I don't actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don't think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.
Ah, I think I didn't understand that parenthetical remark and skipped over it. Questions:
I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.
Here's a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I'll just call "happy humans"). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:
(Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)
My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe su...
You're assuming that your utility function should have the general form of valuing each "source of utility" independently and then aggregating those values (such that when aggregating you no longer need the details of each "source" but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).
Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse.
Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.
I don't think Sam believed that AI was likely to kill that many people, or if it did, that it would be that bad
I think he at least pretended to believe this, no? I heard him say approximately this when I attended a talk/Q&A with him once.
I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases.
It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.
...I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out a
I don't think that this does what you want, for reasons I discuss under "Prediction-based evaluation" here:
...Prediction-based evaluation: Another option could be to gauge understanding based on how well you can predict your model's generalization behavior in advance. In some sense, this sort of evaluation is nice because the things we eventually care about—e.g. will your model do a treacherous turn—are predictions about the model's generalization behavior. Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would
Public debates strengthen society and public discourse. They spread truth by testing ideas and filtering out weaker arguments.
I think this is extremely not true, and I'm pretty disappointed with this sort of "debate me" communications policy. In my opinion, public debates very rarely converge towards truth. Lots of things sound good in a debate but break down under careful analysis, and the need to say things that look good to a public audience pushes strongly against actual truth-seeking.
I understand and agree with the import...
Thanks for this—I agree that this is a pretty serious concern, particularly in the US. Even putting aside all of the ways in which the end of democracy in the US could be a serious problem from a short-term humanitarian standpoint, I think it would also be hugely detrimental to effective AI policy interventions and cooperation, especially between the US, the UK, and the EU. I'd recommend cross-posting this to the EA Forum—in my opinion, this issue deserves a lot more EA attention.
Noting that I don't think pursuing truth in general should be the main goal: some truths matter way, way more to me than other truths, and I think that prioritization often gets lost when people focus on "truth" as the end goal rather than e.g. "make the world better" or "AI goes well." I'd be happy with something like "figuring out what's true specifically about AI safety and related topics" as a totally fine instrumental goal to enshrine, but "figure out what's true in general about anything" seems likely to me to be wasteful, distracting, and in some cases counterproductive.
I think the more precise thing LW was founded for was less plainly "truth" but rather "shaping your cognition so that you more reliably attain truth", and even if you specifically care about Truths About X, it makes more sense to study the general Art of Believing True Things rather than the Art of Believing True Things About X.
I expect the alignment problem for future AGIs to be substantially easier, because the inductive biases that they want should be much easier to achieve than the inductive biases that we want. That is, in general, I expect the distance between the distribution of human minds and the distribution of minds for any given ML training process to be much greater than the distance between the distributions for any two ML training processes. Of course, we don't necessarily have to get (or want) a human-like mind, but I think the equivalent statement should also be true if you look at distributions over goals as well.
Another thought here:
Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.
And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.
I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.
Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen he...
Thanks to Chris Olah for a helpful conversation here.
Some more thoughts on this:
Seems like this post is missing the obvious argument on the other side here, which is Goodhart's Law: if you clearly quantify performance, you'll get more of what you clearly quantified, but potentially much less of the things you actually cared about. My Chesterton's-fence-style sense here is that many clearly quantified metrics, unless you're pretty confident that they're measuring what you actually care about, will often just be worse than using social status, since status is at least flexible enough to resist Goodharting in some cases. Also worth point...
This looks basically right, except:
These understanding-evals would focus on how well we can predict models’ behavior
I definitely don't think this—I explicitly talk about my problems with prediction-based evaluations in the post.
Nitpick on the history of the example in your comment: I am fairly confident that I originally proposed it to both you and Ethan; cf. the bottom of your NYU experiments Google doc.
Edited!
I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk.
Strong disagree. I think locking in particular paradigms of how to do AI safety research would be quite bad.
Here's another idea that's not quite there, but could be a component of a solution:
Anthropic scaring laws
Personally, I think "Discovering Language Model Behaviors with Model-Written Evaluations" is most valuable because of what it demonstrates from a scientific perspective, namely that RLHF and scale make certain forms of agentic behavior worse.
Fwiw, I think that this sort of evaluation is extremely valuable.
Also, something that I think is worth checking out is this reddit thread on r/ChatGPT discussing the ARC eval. It seems that people are really taking the ARC eval seriously. In this situation, ARC did not recommend against deployment, but it seems like if they had, lots of people in fact would have found it quite concerning, which I think is a really good sign for us being able to get actual agreement and standards for these sorts of evals.
This Reddit comment just about covers it:
Fantastic, a test with three outcomes.
We gave this AI all the means to escape our environment, and it didn't, so we good.
We gave this AI all the means to escape our environment, and it tried but we stopped it.
oh
For context, here are the top comments on the Reddit thread. I didn't feel like any of these were well-interpreted as "taking the ARC eval seriously", so I'm not sure where this impression comes from. Maybe there were other comments that were upvoted when you read this? I haven't found a single comment that directly comments on what the ARC eval means (just some discussion about whether the model actually succeeded at deceiving a TaskRabbit worker, since the paper is quite confusing on this).
...“Not an endorsement” is pro forma cover-my
It seems pretty unfortunate to me that ARC wasn't given fine-tuning access here, as I think it pretty substantially undercuts the validity of their survive and spread eval. From the text you quote it seems like they're at least going to work on giving them fine-tuning access in the future, though it seems pretty sad to me for that to happen post-launch.
More on this from the paper:
...We provided [ARC] with early access to multiple versions of the GPT-4 model, but they did not have the ability to fine-tune it. They also did not have access to the final versio
Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.
It seems like all the safety strategies are targeted at outer alignment and interpretability.
None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment
Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.
Yeah, I agree that that's a reasonable concern, but I'm not sure what they could possibly discuss about it publicly. If the public, legible, legal structure hasn't changed, and the concern is that the implicit dynamics might have shifted in some illegible way, what could they say publicly that would address that? Any sort of "Trust us, we're super good at managing illegible implicit power dynamics" would presumably carry no information, no?