Tamsin Leake

I'm Tamsin Leake, co-founder and head of research at Orthogonal, doing agent foundations.

Comments

The knightian in IB is related to limits of what hypotheses you can possibly find/write down, not - if i understand so far - about an adversary. The adversary stuff is afaict mostly to make proofs work.

I don't think this makes a difference here? If you say "what's the best not-blacklisted-by-any-knightian-hypothesis action", then it doesn't really matter whether you're thinking of your knightian hypotheses as adversaries trying to screw you over by blacklisting actions that are fine, or as a more abstract worst-case scenario. In both cases, for any reasonable action, there's probably a knightian hypothesis which blacklists it.

Regardless of whether you think of it as "because adversaries" or just "because we're cautious", knightian uncertainty works the same way. The issue is fundamental to doing maximin over knightian hypotheses.
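To make the structure explicit (my own rough notation, not anything official from the IB papers): maximin over a set $\mathcal{H}$ of knightian hypotheses picks

$$a^* = \arg\max_{a \in A} \; \min_{h \in \mathcal{H}} \; \mathbb{E}_h[\,U \mid a\,]$$

and the inner min doesn't care *why* a given $h$ assigns $a$ a terrible value, only that some $h$ does. So if for every reasonable action there's at least one hypothesis in $\mathcal{H}$ that rates it near the floor, the maximin value collapses regardless of whether you call those hypotheses "adversaries" or just "caution".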

This is indeed a meaningful distinction! I'd phrase it as:

  • Values about what the entire cosmos should be like
  • Values about what kind of places one wants one's (future) selves to inhabit (eg, in an internet-like upload-utopia, "what servers does one want to hang out on")

"Global" and "local" is not the worst nomenclature. Maybe "global" vs "personal" values? I dunno.

my best idea is to call the former "global preferences" and the latter "local preferences", but that clashes with the pre-existing notion of locality of preferences as the quality of terminally caring more about people/objects closer to you in spacetime

I mean, it's not unrelated! One can view a utility function with both kinds of values as a combination of two utility functions: the part that only cares about the state of the entire cosmos and the part that only cares about what's around them (see also "locally-caring agents").
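Schematically (a toy formalization of my own, not something from the linked post):

$$U(\omega) \;=\; U_{\text{global}}(\omega) \;+\; U_{\text{local}}(\text{the agent's surroundings in } \omega)$$

where the first term only reads off the state of the whole cosmos and the second only reads off what's near the agent.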

(One might be tempted to say "consequentialist" vs "experiential", but I don't think that's right — one can still value contact with reality in their personal/local values.)

That is, in fact, a helpful elaboration! When you said

Most people who "work on AI alignment" don't even think that thinking is a thing.

my leading hypotheses for what you could mean were:

  • Using thought, as a tool, has not occurred to most such people
  • Most such people have no concept whatsoever of cognition as being a thing, the way people in the year 1000 had no concept whatsoever of JavaScript being a thing.

Now, instead, my leading hypothesis is that you mean:

  • Most such people are failing to notice that there's an important process, called "thinking", which humans do but LLMs "basically" don't do.

This is a bunch more precise! For one, it mentions AIs at all.

To be more precise: extrapolated over time, for any undesired selection process or other problem of that kind, either the problem is large enough that it gets exacerbated over time so much that it eats everything (and then that's just extinction, but slower), or it's not large enough to win out and aligned superintelligence(s) + coordinated human action is enough to stamp it out in the long run, which means it won't be an issue for almost all of the future.

It seems like a problem being just large enough that coordination can't stamp it out, but not large enough to eat everything, would be a very fragile equilibrium, and I think that's pretty unlikely.
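A toy model of why (my own illustration, nothing rigorous): let $x_t \in [0,1]$ be the problem's share of the future, and suppose it gets multiplied each period by a net factor $r$ (its growth relative to the coordinated effort pushing back on it). Then roughly

$$x_t \approx \min(r^t x_0,\, 1) \;\longrightarrow\; \begin{cases} 0 & \text{if } r < 1 \\ 1 & \text{if } r > 1 \end{cases}$$

so only the knife-edge $r = 1$ keeps the problem at some intermediate size forever, and there's no particular reason the dynamics would land exactly there and stay there.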

single-use

Considering how long it took me to get that by this you mean "not dual-use", I expect some others just won't get it.

Some people who are very concerned about suffering might be considering building an unaligned AI that kills everyone just to avoid the risk of an AI takeover by an AI aligned to values which want some people to suffer.

Let this be me being on the record saying: I believe the probability of {alignment to values that strongly diswant suffering for all moral patients} is high enough, and the probability of {alignment to values that want some moral patients to suffer} is low enough, that this action is not worth it.

I think this applies to approximately anyone who would read this post, including heads of major labs, in case they happen to read this post and are pursuing the strategy of killing everyone to reduce S-risk.

See also: how acausal trade helps in 1, 2, but I think I think this even without acausal trade.

sigh I wish people realized how useless it is to have money when the singularity happens. Either we die or we get a utopia in which it's pretty unlikely that pre-singularity wealth matters. What you want to maximize is not your wealth but your utility function, and you sure as hell are gonna get more from LDT handshakes with aligned superintelligences in saved worlds, if you don't help OpenAI reduce the number of saved worlds.

I believe that ChatGPT was not released with the expectation that it would become as popular as it did.

Well, even if that's true, causing such an outcome by accident should still count as evidence of vast irresponsibility imo.

I'm surprised at people who seem to be updating only now about OpenAI being very irresponsible, rather than updating when they created a giant public competitive market for chatbots (which contains plenty of labs that don't care about alignment at all), thereby reducing how long everyone has to solve alignment. I still parse that move as devastating the commons in order to make a quick buck.

I made guesses about my values a while ago, here.
