ryan_greenblatt

I work at Redwood Research.

Wiki Contributions

Comments

Do you think that whenever anyone makes a decision that ends up being bad ex-post they should be forced to retire?

Doesn't this strongly disincentivize making positive EV bets which are likely to fail?

In particular, I don't expect either (any?) lab to be able to resist the temptation to internally deploy models with autonomous persuasion capabilities or autonomous AI R&D capabilities

I agree with this as stated, but don't think that avoiding deploying such models is needed to mitigate risk.

I think various labs are to some extent in denial of this because massively deploying possibly misaligned systems sounds crazy (and is somewhat crazy), but I would prefer if various people realized this was likely the default outcome and prepared accordingly.

More strongly, I think most of the relevant bit of the safety usefulness trade-off curve involves deploying such models. (With countermeasures.)

or is seriously entertaining the idea that we might need to do a lot (>1 year) of dedicated safety work (that potentially involves coming up with major theoretical insights, as opposed to a "we will just solve it with empiricism" perspective) before we are confident that we can control such systems.

I think this is a real possibility, but unlikely to be necessarily depending on the risk target. E.g., I think you can deploy ASL-4 models with <5% risk without theoretical insights and instead just via being very careful with various prosaic countermeasures (mostly control).

<1% risk probably requires stronger stuff, though it will depend on the architecture and various other random details.

(That said, I'm pretty sure that these lab's aren't making decisions based on carefully analyzing the situation and are instead just operating like "idk human level models don't seem that bad, we'll probably be able to figure it out, humans can solve most problems with empiricism on priors". But, this prior seems more right than overwhelming pessimism IMO.)

Also, I think you should seriously entertain the idea that just trying quite hard with various prosaic countermeasures might suffice for reasonably high levels of safety. And thus pushing on this could potentially be very leveraged relative to trying to hit a higher target.

I mostly agree with premises 1, 2, and 3, but I don't see how the conclusion follows.

It is possible for things to be hard to influence and yet still worth it to try to influence them.

(Note that the $30 million grant was not an endorsement and was instead a partnership (e.g. it came with a board seat), see Buck's comment.)

(Ex-post, I think this endeavour was probably net negative, though I'm pretty unsure and ex-ante I think it seems great.)

Why focus on the $30 million grant?

What about large numbers of people working at OpenAI directly on capabilities for many years? (Which is surely worth far more than $30 million.)

Separately, this grant seems to have been done to influence the goverance at OpenAI, not make OpenAI go faster. (Directly working on capabilities seems modestly more accelerating and risky than granting money in exchange for a partnership.)

(ETA: TBC, there is a relationship between the grant and people working at OpenAI on capabilities: the grant was associated with a general vague endorsement of trying to play inside game at OpenAI.)

mention seem to me like they could be very important to deploy at scale ASAP

Why think this is important to measure or that this already isn't happening?

E.g., on the current model organism related project I'm working on, I automate inspecting reasoning traces in various ways. But I don't feel like there is any particularly interesting thing going on here which is important to track (e.g. this tip isn't more important than other tips for doing LLM research better).

My main vibe is:

  • AI R&D and AI safety R&D will almost surely come at the same time.
    • Putting aside using AIs as tools in some more limited ways (e.g. for interp labeling or generally for grunt work)
  • People at labs are often already heavily integrating AIs into their workflows (though probably somewhat less experimentation here than would be ideal as far as safety people go).

It seems good to track potential gaps between using AIs for safety and for capabilities, but by default, it seems like a bunch of this work is just ML and will come at the same time.

(Note that this paper was already posted here, so see comments on that post as well.)

I wrote up some of my thoughts on Bengio's agenda here.

TLDR: I'm excited about work on trying to find any interpretable hypothesis which can be highly predictive on hard prediction tasks (e.g. next token prediction).[1] From my understanding, the bayesian aspect of this agenda doesn't add much value.

I might collaborate with someone to write up a more detailed version of this view which engages in detail and is more clearly explained. (To make it easier to argue against and to exist as a more canonical reference.)

As far as Davidad, I think the "manually build an (interpretable) infra-bayesian world model which is sufficiently predictive of the world (as smart as our AI)" part is very likely to be totally unworkable even with vast amounts of AI labor. It's possible that something can be salvaged by retreating to a weaker approach. It seems like a roughly reasonable direction to explore as a possible highly ambitious moonshot to automate researching using AIs, but if you're not optimistic about safely using vast amounts of AI labor to do AI safety work[2], you should discount accordingly.

For an objection along these lines, see this comment.

(The fact that we can be conservative with respect to the infra-bayesian world model doesn't seem to buy much, most of the action is in getting something which is at all good at predicting the world. For instance, in Fabien's example, we would need the infrabayesian world model to be able to distinguish between zero-days and safe code regardless of conservativeness. If it didn't distinguish, then we'd never be able to run any code. This probably requires nearly as much intelligence as our AI has.)

Proof checking on this world model also seems likely to be unworkable, though I have less confidence in this view. And, the more the infra-bayesian world model is computationally intractible to run, the harder it is to proof check. E.g., if running the world model on many inputs is intractable (as would seem to be the default for detailed simulations), I'm very skeptical about proving anything about what it predicts.

I'm not an expert on either agenda and it's plausible that this comment gets some important details wrong.


  1. Or just improving on the intepretability and predictiveness pareto frontier substantially. ↩︎

  2. Presumably by employing some sort of safety intervention e.g. control or only using narrow AIs. ↩︎

Huh, this seems messy. I wish Time was less ambigious with their language here and more clear about exactly what they have/haven't seen.

It seems like the current quote you used is an accurate representation of the article, but I worry that it isn't an accurate representation of what is actually going on.

It seems plausible to me that Time is intentionally being ambigious in order to make the article juicier, though maybe this is just my paranoia about misleading journalism talking. (In particular, it seems like a juicier article if all of the big AI companies are doing this than if they aren't, so it is natural to imply they are all doing it even if you know this is false.)

Overall, my take is that this is a pretty representative quote (and thus I disagree with Zac), but I think the additional context maybe indicates that not all of these companies are doing this, particularly if the article is intentionally trying to deceive.

Due to prior views, I'd bet against Anthropic consistently pushing for very permissive of voluntary regulation behind closed doors which makes me think the article is probably at least somewhat misleading (perhaps intentionally).

I thought the idea of a corrigible AI is that you're trying to build something that isn't itself independent and agentic, but will help you in your goals regardless.

Hmm, no mean something broader than this, something like "humans ultimately have control and will decide what happens". In my usage of the word, I would count situations where humans instruct their AIs to go and acquire as much power as possible for them while protecting them and then later reflect and decide what to do with this power. So, in this scenario, the AI would be arbitrarily agentic and autonomous.

Corrigibility would be as opposed to humanity e.g. appointing a succesor which doesn't ultimately point back to some human driven process.

I would count various indirect normativity schemes here and indirect normativity feels continuous with other forms of oversight in my view (the main difference is oversight over very long time horizons such that you can't train the AI based on it's behavior over that horizon).

I'm not sure if my usage of the term is fully standard, but I think it roughly matches how e.g. Paul Christiano uses the term.

Load More