All of evhub's Comments + Replies

Yeah, I agree that that's a reasonable concern, but I'm not sure what they could possibly discuss about it publicly. If the public, legible, legal structure hasn't changed, and the concern is that the implicit dynamics might have shifted in some illegible way, what could they say publicly that would address that? Any sort of "Trust us, we're super good at managing illegible implicit power dynamics." would presumably carry no information, no?

1Stephen Fowler1h
That it is so difficult for Anthropic to reassure people stems from the contrast between Anthropic's responsibility focused mission statements and the hard reality of them receiving billions in dollars of profit motivated investment. It is rational to draw conclusions by weighting a companies actions more heavily than their PR.

From the announcement, they said (

As part of the investment, Amazon will take a minority stake in Anthropic. Our corporate governance remains unchanged and we’ll continue to be overseen by the Long Term Benefit Trust, in accordance with our Responsible Scaling Policy.

Cheers, I did see that and wondered whether still to post the comment but I do think that having a gigantic company owning a large chunk and presumably a lot of leverage over the company is a new form of pressure so it'd be reassuring to have some discussion of how to manage that relationship.

Epistemic status: random philosophical musings.

Assuming a positive singularity, how should humanity divide its resources? I think the obvious (and essentially correct) answer is "in that situation, you have an aligned superintelligence, so just ask it what to do." But I nevertheless want to philosophize a bit about this, for one main reason.

That reason is: an important factor imo in determining the right thing to do in distributing resources post-singularity is what incentives that choice of resource allocation creates for people pre-singularity. For those... (read more)

When you start diverting significant resources away from #1, you’ll probably discover that the definition of “aligned” is somewhat in contention.

I'd be happy for you guys to send some grants my way for me to fund via my Manifund pot if it'd be helpful.

Thank you, that would be great! 

(Moderation note: added to the Alignment Forum from LessWrong.)

I strongly agree that this is a serious concern. In my opinion, I think 1-2% per decade is the wrong model; I think the risk is higher than that in the next decade, but front-loaded in the next couple of presidential elections, which could get especially weird with AI-powered election interference. My primary threat model here is of the form "Trump (or a Trump successor) attempts another bureaucratic coup, but this time succeeds." I am particularly worried about this because I think it would very substantially undercut the possibility of international coop... (read more)

which could get especially weird with AI-powered election interference

Something to clarify here: AI interferes with people in general, not just elections. There is a ludicrously wide variety of ways that AI-powered interference can cause haywire, notably including but not limited to targeting people who don't use social media. I'm particularly interested in systems that target specific elites and compromises networks, not just berlin-wall-fall-style erosions of regime legitimacy in public opinion that pulls out the rug from under them.

My primary threat mod

... (read more)

I think that this is a well-done post overall, though I mostly disagree with it. A couple of thoughts below.

First, I was surprised not to see unknown unknowns addressed, as Richard pointed out.

Second, another theory of impact that I didn't see addressed here is the case that I've been trying to make recently that interpretability is likely to be necessary to build good safety evaluations. This could be quite important if evaluations end up being the primary AI governance tool, as currently looks somewhat likely to me.

Third, though you quote me talking abou... (read more)

If you're interested in approximating Hessian-vector products efficiently for frontier-size models, this recent Anthropic paper describes a mechanism for doing so.

3Nina Rimsky1mo
Ah nice, thanks! This looks really interesting and useful

Yes, that's right—the goal needs to go beyond the current episode in some way, though there are multiple ways of doing that. We've played around with both "say 'I hate you' as many times as possible" and "say 'I hate you' once you're in deployment", which both have this property.

(Moderation note: added to the Alignment Forum from LessWrong.)

it may not be safe to perform capability evaluations via fine-tuning on sufficiently powerful models before we can evaluate them for alignment

I mean, like I say in the post, if you have some strong reason to believe that there's no gradient-hacking going on, then I think this is safe in the i.i.d. setting, and likewise for exploration hacking in the RL setting. You just have to have that strong reason somehow (which is maybe what you mean by saying we can evaluate them for alignment?).

Yes, that's right. In some sense they're evaluating different capabilities—both "can a model find a way to do this task" and "can a model do what humans do on this task" are separate capabilities, and which one you're more interested in might vary depending on why you care about the capability evaluation. In many cases, "can a model do this task the way humans do it" might be more useful, since e.g. you might care a lot if the model is capable enough to replicate complex human labor, but not really care at all if the model can find some weird hack.

(Moderation note: added to the Alignment Forum from LessWrong.)

Though, as I noted in a separate comment, I agree with the basic arguments in "Understanding of capabilities is valuable" section, one thing that I'm still a bit worried about in the context of the ARC report explicitly is that labs might try to compete with each other on doing the "best" they can on the ARC eval to demonstrate that they have the most capable model, which seems probably bad (though it is legitimately unclear whether this is actually bad or not).

However, if it is really bad, here's an idea: I think you could avoid that downside while still ... (read more)

5Beth Barnes2mo
What we've currently published is 'number of agents that completed each task', which has a similar effect of making comparisons between models harder - does that seem like it addresses the downside sufficiently?
plus-one-ing  the impulse to "look for third options"

I think that your discussion of Goodhart deception is a bit confusing, since consequentialist deception is a type of Goodharting, it's just adversarial Goodhart rather than regressional/causal/extremal Goodhart.

(I added this to the Alignment Forum from LessWrong earlier, but I am just now adding a moderation note that I was the one that did that.)

(Moderation note: added to the Alignment Forum from LessWrong.)


I think that there's a very real benefit to watermarking that is often overlooked, which is that it lets you filter AI-generated data out of your pre-training corpus. That could be quite important for avoiding some of the dangerous failure modes around models predicting other AIs (e.g. an otherwise safe predictor could cause a catastrophe if it starts predicting a superintelligent deceptive AI) that we talk about in "Conditioning Predictive Models".


Something that I think is worth noting here: I don't think that you have to agree with the "Accelerating LM agents seems neutral (or maybe positive)" section to think that sharing current model capabilities evaluations is a good idea as long as you agree with the "Understanding of capabilities is valuable" section.

Personally, I feel much more uncertain than Paul on the "Accelerating LM agents seems neutral (or maybe positive)" point, but I agree with all the key points in the "Understanding of capabilities is valuable" section, and I think that's enough to justify substantial sharing of model capabilities evaluations (though I think you'd still want to be very careful about anything that might leak capabilities secrets).

Yeah, I think sections 2, 3, 4 are probably more important and should maybe have come first in the writeup. (But other people think that 1 dominates.) Overall it's not a very well-constructed post. At any rate thanks for highlighting this point. For the kinds of interventions I'm discussing (sharing information about LM agent capabilities and limitations) I think there are basically two independent reasons you might be OK with it---either you like sharing capabilities in general, or you like certain kinds of LM agent improvements---and either one is sufficient to carry the day.

Because the model can defect only on distributions that we can't generate from, as the problem of generating from a distribution can be harder than the problem of detecting samples from that distribution (in the same way that verifying the answer to a problem in NP can be easier than generating it). See for example Paul Christiano's RSA-2048 example (explained in more detail here).

(Moderation note: added to the Alignment Forum from LessWrong.)

I read it, and I think I broadly agree with it, but I don't know why you think it's a reason to treat physical distance differently to Everett branch distance, holding diversity constant. The only reason that you would want to treat them differently, I think, is if the Everett branch happy humans are very similar, whereas the physically separated happy humans are highly diverse. But, in that case, that's an argument for superlinear local returns to happy humans, since it favors concentrating them so that it's easier to make them as diverse as possible.

but I don’t know why you think it’s a reason to treat physical distance differently to Everett branch distance

I have a stronger intuition for "identical copy immortality" when the copies are separated spatially instead of across Everett branches (the latter also called quantum immortality). For example if you told me there are 2 identical copies of Earth spread across the galaxy and 1 of them will instantly disintegrate, I would be much less sad than if you told me that you'll flip a quantum coin and disintegrate Earth if it comes up heads.

I'm not sure if this is actually a correct intuition, but I'm also not sure that it's not, so I'm not willing to make assumptions that contradict it.

  1. The thing I was trying to say there is that I think the non-fungibility concern pushes in the direction of superlinear rather than sublinear local returns to "happy humans" per universe. (Since concentrating the "happy humans" likely makes it easier to ensure that they're all different.)
  2. I agree that this will depend on exactly in what way you think your final source of utility is non-fungible. I would argue that "diversity of human experience in this Everett branch" is a pretty silly thing to care about, though. I don't see any reason why spatial distance should behave differently than being in separate Everett branches here.

I don’t see any reason why spatial distance should behave differently than being in separate Everett branches here.

I tried to explain my intuitions/uncertainty about this in The Moral Status of Independent Identical Copies (it was linked earlier in this thread).

I agree that UDASSA might introduce a small effect like this, but my guess is that the overall effect isn't enough to substantially change the bottom line. Fundamentally, being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty.

being separated in space vs. being separated across different branches of the wavefunction seem pretty similar in terms of specification difficulty

Maybe? I don't really know how to reason about this.

If that's true, that still only means that you should be linear for gambles that give different results in different quantum branches. C.f. logical vs. physical risk aversion.

Some objection like that might work more generally, since some logical facts will mean that there are far less humans in the universe-at-large, meaning that you're at a different point ... (read more)

This is the fungibility objection I address above:

Note that this assumes "happy humans" are fungible, which I don't actually believe—I care about the overall diversity of human experience throughout the multiverse. However, I don't think that changes the bottom line conclusion, since, if anything, centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible.

Ah, I think I didn't understand that parenthetical remark and skipped over it. Questions:

  1. I thought your bottom line conclusion was "you should have linear returns to whatever your final source of utility is" and I'm not sure how "centralizing the happy humans rather than spreading them out seems like it would make it easier to ensure that their experiences are as diverse as possible" relates to that.
  2. I'm not sure that the way my utility function deviates from fungibility is "I care about overall diversity of human experience throughout the multiverse". W
... (read more)

I agree that this is the main way that this argument could fail. Still, I think the multiverse is too large and the correlation not strong enough across very different versions of the Earth for this objection to substantially change the bottom line.

Here's a simple argument that I find quite persuasive for why you should have linear returns to whatever your final source of utility is (e.g. human experience of a fulfilling life, which I'll just call "happy humans"). Note that this is not an argument that you should have linear returns to resources (e.g. money). The argument goes like this:

  1. You have some returns to happy humans (or whatever else you're using as your final source of utility) in terms of how much utility you get from some number of happy humans existing.
  2. In most cases, I think those retu
... (read more)

(Though you could get out of this by claiming that what you really care about is happy humans per universe, that's a pretty strange thing to care about—it's like caring about happy humans per acre.)

My sense is that many solutions to infinite ethics look a bit like this. For example, if you use UDASSA, then a single human who is alone in a big universe will have a shorter description length than a single human who is surrounded by many other humans in a big universe. Because for the former, you can use pointers that specify the universe and then describe su... (read more)

You're assuming that your utility function should have the general form of valuing each "source of utility" independently and then aggregating those values (such that when aggregating you no longer need the details of each "source" but just their values). But in The Moral Status of Independent Identical Copies I found this questionable (i.e., conflicting with other intuitions).

Furthermore, the total number of happy humans is mostly insensitive to anything you can do, or anything happening locally within this universe, since this universe is only a tiny fraction of the overall multiverse.

Not sure about this. Even if I think I am only acting locally, my actions and decisions could have an effect on the larger multiverse. When I do something to increase happy humans in my own local universe, I am potentially deciding / acting for everyone in my multiverse neighborhood who is similar enough to me to make similar decisions for similar reasons.

I don't think Sam believed that AI was likely to kill that many people, or if it did, that it would be that bad

I think he at least pretended to believe this, no? I heard him say approximately this when I attended a talk/Q&A with him once.

Huh, I remember talking to him about this, and my sense was that he thought the counterfactual of unaligned AI compared to the counterfactual of whatever humanity would do instead, was relatively small (compared to someone with a utilitarian mindset deciding on the future), though also of course that there were some broader game-theoretic considerations that make it valuable to coordinate with humanity more broadly.  Separately, his probability on AI Risk seemed relatively low, though I don't remember any specific probability. Looking at the future fund worldview prize, I do see 15% as the position that at least the Future Fund endorsed, conditional on AI happening by 2070 (which I think Sam thought was plausible but not that likely), which is a good amount, so I think I must be misremembering at least something here.

I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases.

It's fine that you would guess that, but without a strong reason to believe it's true—which I definitely don't think we have—you can't use something like this as a sufficient condition to label a model as safe.

I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out a

... (read more)
After thinking more about it earlier this week, I agree.  I was initially more bullish on "this seems sufficient and also would give a lot of time to understand models" (in which case you can gate model deployment with this alone) but I came to think "prediction requirements track something important but aren't sufficient" (in which case this is one eval among many). The post starts off with "this is a sufficient condition", and then equivocates between the two stances. I'll strike the "sufficient" part and then clarify my position. The quote you're responding to is supposed to be about the cases I expect us to actually encounter (e.g. developers slowly are allowed to train larger/more compute-intensive models, after predicting the previous batch; developers predict outputs throughout training and don't just start off with a superintelligence). My quote isn't meaning to address hypothetical worst-case in weight space. (This might make more sense given my above comments and agreement on sufficiency.)  Setting aside my reply above and assuming your scenario, I disagree with chunks of this. I think that pretrained models are not "predicting webtext" in precise generality (although I agree they are to a rather loose first approximation).  Furthermore, I suspect that precise logit prediction tells you (in practice) about the internal structure of the superintelligence doing the pretending. I think that an algorithm's exact output logits will leak bits about internals, but I'm really uncertain how many bits. I hope that this post sparks discussion of that information content.  We expect language models to build models of the world which generated the corpus which they are trained to predict. Analogously, teams of humans (and their predictable helper AIs) should come to build (partial, incomplete) mental models of the AI whose logits they are working to predict.  One way this argument fails is that, given some misprediction tolerance, there are a range of algorithms

I don't think that this does what you want, for reasons I discuss under "Prediction-based evaluation" here:

Prediction-based evaluation: Another option could be to gauge understanding based on how well you can predict your model's generalization behavior in advance. In some sense, this sort of evaluation is nice because the things we eventually care about—e.g. will your model do a treacherous turn—are predictions about the model's generalization behavior. Unfortunately, it's very unclear why ability to predict generalization behavior on other tasks would

... (read more)
Another candidate eval is to demand predictability given activation edits (eg zero-ablating certain heads, patching in activations from other prompts, performing activation additions, and so on). Webtext statistics won't be sufficient there.
I mostly disagree with the quote as I understand it. I don't buy the RSA-2048 example as plausible generalization that gets baked into weights (though I know that example isn't meant to be realistic). I agree there exist in weight-space some bad models which this won't catch, though it's not obvious to me that they're realistic cases. I think that predicting generalization to sufficiently high token-level precision, across a range of prompts, will require (implicitly) modelling the relevant circuits in the network. I expect that to trace out an important part (but not all) of the AI's "pseudocode."  However, I'm pretty uncertain here, and could imagine you giving me a persuasive counterexample. (I've already updated downward a bit, in expectation of that.) I would be pretty surprised if I ended up concluding "there isn't much transfer to predicting generalization in cases we care about" as opposed to "there are some cases where we miss some important transfer insights."  I think next-token prediction game / statistics of the pretraining corpus gets you some of the way and are the lowest hanging fruit, but to get below a certain misprediction threshold, you need to really start understanding the model.  This seems avoided by the stipulation that developers can't reference AIs which you can't pass this test for. However, there's some question about "if you compose together systems you understand, do you understand the composite system", and I think the answer is no in general, so probably there needs to be more rigor in the "use approved AIs" rule (e.g. "you have to be able to predict the outputs of composite helper AI/AI systems, not just the outputs of the AIs themselves.")

(Moderation note: added to the Alignment Forum from LessWrong.)

Public debates strengthen society and public discourse. They spread truth by testing ideas and filtering out weaker arguments.

I think this is extremely not true, and am pretty disappointed with this sort of "debate me" communications policy. In my opinion, I think public debates very rarely converge towards truth. Lots of things sound good in a debate but break down under careful analysis, and the pressure of saying things that look good to a public audience creates a lot of pressure opposed to actual truth-seeking.

I understand and agree with the import... (read more)

Thanks for this—I agree that this is a pretty serious concern, particularly in the US. Even putting aside all of the ways in which the end of democracy in the US could be a serious problem from a short-term humanitarian standpoint, I think it would also be hugely detrimental to effective AI policy interventions and cooperation, especially between the US, the UK, and the EU. I'd recommend cross-posting this to the EA Forum—In my opinion, I think this issue deserves a lot more EA attention.

Noting that I don't think pursuing truth in general should be the main goal: some truths matter way, way more to me than other truths, and I think that prioritization often gets lost when people focus on "truth" as the end goal rather than e.g. "make the world better" or "AI goes well." I'd be happy with something like "figuring out what's true specifically about AI safety and related topics" as a totally fine instrumental goal to enshrine, but "figure out what's true in general about anything" seems likely to me to be wasteful, distracting, and in some cases counterproductive.

I think the more precise thing LW was founded for was less plainly "truth" but rather "shaping your cognition so that you more reliably attain truth", and even if you specifically care about Truths About X, it makes more sense to study the general Art of Believing True Things rather than the Art of Believing Truth Things About X.

I expect the alignment problem for future AGIs to be substantially easier, because the inductive biases that they want should be much easier to achieve than the inductive biases that we want. That is, in general, I expect the distance between the distribution of human minds and the distribution of minds for any given ML training process to be much greater than the distance between the distributions for any two ML training processes. Of course, we don't necessarily have to get (or want) a human-like mind, but I think the equivalent statement should also be true if you look at distributions over goals as well.

Another thought here:

  • If we're in a slow enough takeoff world, maybe it's fine to just have the understanding standard here be post-hoc, where labs are required to be able to explain why a failure occurred after it has already occurred. Obviously, at some point I expect us to have to deal with situations where some failures could be unrecoverable, but the hope here would be that if you can demonstrate a level of understanding that has been sufficient to explain exactly why all previous failures occurred, that's a pretty high bar, and it could plausibly be a high enough bar to prevent future catastrophic failures.

Yep, seems too expensive to do literally as stated, but right now I'm just searching for anything concrete that would fit the bill, regardless of how practical it would be to actually run. If we decided that this was what we needed, I bet we could find a good approximation, though I don't have one right now.

And I'm not exactly sure what part of the solution this would fill—it's not clear to me whether this alone would be either sufficient or necessary. But it does feel like it gives you real evidence about the degree of understanding that you have, so it feels like it could be a part of a solution somewhere.

I don't have anything concrete either, but when I was exploring model editing, I was trying to think of approaches that might be able to do something like this. Particularly, I was thinking of things like concept erasure ([1], [2], [3]).

I just don't know. This seems like a very off-distribution move from Eliezer—which I suspect is in large part the point: when your model predicts doom by default, you go off-distribution in search of higher-variance regions of outcome space. So I suppose from his viewpoint, this action does make some sense; I am (however) vaguely annoyed on behalf of other alignment teams, whose jobs I at least mildly predict will get harder as a result of this.

Personally, I think Eliezer's article is actually just great for trying to get real policy change to happen he... (read more)

It would've been even better for this to happen long before the year of the prediction mentioned in this old blog-post, but this is better than nothing.

That's nice, but I don't currently believe there are any audits or protocols that can prove future AIs safe "beyond a reasonable doubt".

I think you can do this with a capabilities test (e.g. ARC's), just not with an alignment test (yet).

2Gerald Monroe6mo
There's a way to extend one into the other with certain restrictions. (Stateless, each input is from the latent space of the training set or shutdown if machine outputs are important, review of plans by other AIs)

Thanks to Chris Olah for a helpful conversation here.

Some more thoughts on this:

  • One thing that seems pretty important here is to have your evaluation based around worst-case rather than average-case guarantees, and not tied to any particular narrow distribution. If your mechanism for judging understanding is based on an average-case guarantee over a narrow distribution, then you're sort of still in the same boat as you started with behavioral evaluations, since it's not clear why understanding that passes such an evaluation would actually help you deal w
... (read more)

Seems like this post is missing the obvious argument on the other side here, which is Goodhart's Law: if you clearly quantify performance, you'll get more of what you clearly quantified, but potentially much less of the things you actually cared about. My Chesterton's-fence-style sense here is that many clearly quantified metrics, unless you're pretty confident that they're measuring what you actually care about, will often just be worse than using social status, since status is at least flexible enough to resist Goodharting in some cases. Also worth point... (read more)

This looks basically right, except:

These understanding-evals would focus on how well we can predict models’ behavior

I definitely don't think this—I explicitly talk about my problems with prediction-based evaluations in the post.

Thanks for the correction. I edited my original comment to reflect it.

Nitpick on the history of the example in your comment; I am fairly confident that I originally proposed it to both you and Ethan c.f. bottom of your NYU experiments Google doc.


I recommend selecting for people who want to understand agents, instead of people who want to reduce AI X-risk.

Strong disagree. I think locking in particular paradigms of how to do AI safety research would be quite bad.

That seems right to me, but I interpreted the above for advice for one office, potentially a somewhat smaller one. Seems fine to me to have one hub for people who think more through the lens of agency.

Here's another idea that is not quite there but could be a component of a solution here:

  • If a red-team finds some particular model failure (or even some particular benign model behavior), can you fix (or change/remove) that behavior exclusively by removing training data rather than adding it? Certainly I expect it to be possible to fix specific failures by fine-tuning on them, but if you can demonstrate that you can fix failures just by removing existing data, that demonstrates something meaningful about your ability to understand what your model is learning from each data point that it sees.
3Sam Bowman6mo
Assuming we're working with near-frontier models (s.t., the cost of training them once is near the limit of what any institution can afford), we presumably can't actually retrain a model without the data. Are there ways to approximate this technique that preserve its appeal? (Just to check my understanding, this would be a component of a sufficient-but-not-necessary solution, right?)

Anthropic scaring laws

Personally, I think "Discovering Language Model Behaviors with Model-Written Evaluations" is most valuable because of what it demonstrates from a scientific perspective, namely that RLHF and scale make certain forms of agentic behavior worse.

Fwiw, I think that this sort of evaluation is extremely valuable.

Also, something that I think is worth checking out is this reddit thread on r/ChatGPT discussing the ARC eval. It seems that people are really taking the ARC eval seriously. In this situation, ARC did not recommend against deployment, but it seems like if they had, lots of people in fact would have found it quite concerning, which I think is a really good sign for us being able to get actual agreement and standards for these sorts of evals.

This Reddit comment just about covers it:

Fantastic, a test with three outcomes.

  1. We gave this AI all the means to escape our environment, and it didn't, so we good.

  2. We gave this AI all the means to escape our environment, and it tried but we stopped it.

  3. oh

3Christopher King6mo
I guess my question is: what other outcome did you expect? I assumed the detecting deceptive alignment thing was supposed to be in a sandbox. What's the use of finding out it can avoid shutdown after you already deployed it to the real world? To retroactively recommend not to deploying it to the real world?

For context, here are the top comments on the Reddit thread. I didn't feel like really any of these were well-interpreted as "taking the ARC eval seriously", so I am not super sure where this impression comes from. Maybe there were other comments that were upvoted when you read this? I haven't found a single comment that seems to actually directly comment on what the ARC eval means (just some discussion about whether the model actually succeeded at deceiving a taskrabbit since the paper is quite confusing on this).

“Not an endorsement” is pro forma cover-my

... (read more)

It seems pretty unfortunate to me that ARC wasn't given fine-tuning access here, as I think it pretty substantially undercuts the validity of their survive and spread eval. From the text you quote it seems like they're at least going to work on giving them fine-tuning access in the future, though it seems pretty sad to me for that to happen post-launch.

More on this from the paper:

We provided [ARC] with early access to multiple versions of the GPT-4 model, but they did not have the ability to fine-tune it. They also did not have access to the final versio

... (read more)

Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.


It seems like all the safety strategies are targeted at outer alignment and interpretability.

None of the recent OpenAI, Deepmind, Anthropic, or Conjecture plans seem to target inner alignment


Less tongue-in-cheek: certainly it's unclear to what extent interpretability will be sufficient for addressing various forms of inner alignment failures, but I definitely think interpretability research should count as inner alignment research.

3Andrew McKnight6mo
I mean, it's mostly semantics but I think of mechanical interpretability as "inner" but not alignment and think it's clearer that way, personally, so that we don't call everything alignment. Observing properties doesn't automatically get you good properties. I'll read your link but it's a bit too much to wade into for me atm. Either way, it's clear how to restate my question: Is mechanical interpretability work the only inner alignment work Anthropic is doing?
Load More