This is a special post for quick takes by Cleo Nardo.

(1) Has AI safety slowed down?

There haven't been any big innovations for 6-12 months. At least, it looks like that to me. I'm not sure how worrying this is, but I haven't noticed others mentioning it. Hoping to get some second opinions.

Here's a list of live agendas someone made on 27th Nov 2023: Shallow review of live agendas in alignment & safety. I think this covers all the agendas that exist today. Didn't we use to get a whole new line-of-attack on the problem every couple months?

By "innovation", I don't mean something normative like "This is impressive" or "This is research I'm glad happened". Rather, I mean something more low-level, almost syntactic, like "Here's a new idea everyone is talking out". This idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

Imagine that your job was to maintain a glossary of terms in AI safety.[1] I feel like you would've been adding new terms quite consistently from 2018-2023, but things have dried up in the last 6-12 months.

(2) When did AI safety innovation peak?

My guess is Spring 2022, during the ELK Prize era. I'm not sure though. What do you guys think?

(3) What’s caused the slow down?

Possible explanations:

  1. ideas are harder to find
  2. people feel less creative
  3. people are more cautious
  4. more publishing in journals
  5. research is now closed-source
  6. we lost the mandate of heaven
  7. the current ideas are adequate
  8. paul christiano stopped posting
  9. i’m mistaken, innovation hasn't stopped
  10. something else

(4) How could we measure "innovation"?

By "innovation" I mean non-transient novelty. An article is "novel" if it uses n-grams that previous articles didn't use, and an article is "transient" if it uses n-grams that subsequent articles didn't use. Hence, an article is non-transient and novel if it introduces a new n-gram which sticks around. For example, Gradient Hacking (Evan Hubinger, October 2019) was an innovative article, because the n-gram "gradient hacking" doesn't appear in older articles, but appears often in subsequent articles. See below.

In Barron et al 2017, they analysed 40 000 parliament speeches during the French Revolution. They introduce a metric "resonance", which is novelty (surprise of article given the past articles) minus transience (surprise of article given the subsequent articles). See below.

My claim is that recent AI safety research has been less resonant.
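To make the operationalisation concrete, here is a minimal sketch (in Python; the function names and the count-based scoring are my own stand-ins) of how one might compute novelty, transience, and resonance over a chronologically ordered list of articles. Barron et al. define novelty and transience via KL divergence between inferred topic distributions rather than raw n-gram overlap, so treat this as an illustration of the n-gram framing above, not a reproduction of their method.

  from typing import List, Tuple

  def ngrams(text: str, n: int = 2) -> set:
      """Set of lowercased word n-grams in an article."""
      words = text.lower().split()
      return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

  def resonance_scores(articles: List[str], n: int = 2, window: int = 50) -> List[Tuple[float, float, float]]:
      """For each article (in chronological order), return (novelty, transience, resonance):
      novelty    = share of the article's n-grams absent from the preceding `window` articles,
      transience = share of its n-grams absent from the following `window` articles,
      resonance  = novelty - transience (new n-grams that stick around)."""
      grams = [ngrams(a, n) for a in articles]
      scores = []
      for i, g in enumerate(grams):
          if not g:
              scores.append((0.0, 0.0, 0.0))
              continue
          past = set().union(*grams[max(0, i - window):i])
          future = set().union(*grams[i + 1:i + 1 + window])
          novelty = len(g - past) / len(g)
          transience = len(g - future) / len(g)
          scores.append((novelty, transience, novelty - transience))
      return scores

Under this toy metric, Gradient Hacking would score high on novelty (the bigram "gradient hacking" doesn't appear in earlier articles) and low on transience (it keeps appearing later), hence high resonance.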

  1. ^

    Here's 20 random terms that would be in the glossary, to illustrate what I mean:

    1. Evals
    2. Mechanistic anomaly detection
    3. Steganography
    4. Glitch token
    5. Jailbreaking
    6. RSPs
    7. Model organisms
    8. Trojans
    9. Superposition
    10. Activation engineering
    11. CCS
    12. Singular Learning Theory
    13. Grokking
    14. Constitutional AI
    15. Translucent thoughts
    16. Quantilization
    17. Cyborgism
    18. Factored cognition
    19. Infrabayesianism
    20. Obfuscated arguments
  • the approaches that have been attracting the most attention and funding are dead ends

Also, I'm curious what it is that you consider(ed) AI safety progress/innovation. Can you give a few representative examples?

I've added a fourth section to my post. It operationalises "innovation" as "non-transient novelty". Some representative examples of an innovation would be:

I think these articles were non-transient and novel.

My notion of progress is roughly: something that is either a building block for The Theory (i.e. marginally advancing our understanding) or a component of some solution/intervention/whatever that can be used to move probability mass from bad futures to good futures.

Re the three you pointed out: simulators I consider a useful insight, gradient hacking probably not (10% < p < 20%), and activation vectors I put in the same bin as RLHF, or whatever the appropriate label for that bin is.

thanks for the thoughts. i'm still trying to disentangle what exactly i'm pointing at.

I don't intend "innovation" to mean something normative like "this is impressive" or "this is research I'm glad happened" or anything. i mean something more low-level, almost syntactic. more like "here's a new idea everyone is talking about". this idea might be a threat model, or a technique, or a phenomenon, or a research agenda, or a definition, or whatever.

like, imagine your job was to maintain a glossary of terms in AI safety. i feel like new terms used to emerge quite often, but not any more (i.e. not for the past 6-12 months). do you think this is fair? i'm not sure how worrying this is, but i haven't noticed others mentioning it.

NB: here's 20 random terms I'm imagining included in the dictionary:

  1. Evals
  2. Mechanistic anomaly detection
  3. Steganography
  4. Glitch token
  5. Jailbreaking
  6. RSPs
  7. Model organisms
  8. Trojans
  9. Superposition
  10. Activation engineering
  11. CCS
  12. Singular Learning Theory
  13. Grokking
  14. Constitutional AI
  15. Translucent thoughts
  16. Quantilization
  17. Cyborgism
  18. Factored cognition
  19. Infrabayesianism
  20. Obfuscated arguments

My personal impression is you are mistaken and innovation has not stopped, but part of the conversation moved elsewhere.  E.g. taking just ACS, we do have ideas from the past 12 months which in our ideal world would fit into this type of glossary - free energy equilibria, levels of sharpness, convergent abstractions, gradual disempowerment risks. Personally I don't feel it is high priority to write them up for LW, because they don't fit into the current zeitgeist of the site, which seems to direct a lot of attention mostly to:
- advocacy 
- topics a large crowd cares about (e.g. mech interpretability)
- or topics some prolific and good writer cares about (e.g. people will read posts by John Wentworth)
Hot take, but the community loosely associated with active inference is currently better placed to think about agent foundations; workshops on topics like 'pluralistic alignment' or 'collective intelligence' have in total more interesting new ideas about what was traditionally understood as alignment; and parts of AI safety went totally ML-mainstream, with the fastest conversation happening on X.
 

I remember a point Yampolskiy made on a podcast, arguing for the impossibility of AGI alignment: as a young field, AI safety had underwhelming low-hanging fruit. I wonder if all of the major low-hanging fruit has now been plucked.

I think the explanation that more research is now closed-source, combined with labs/companies making a lot of the alignment progress to date, pretty compactly explains the issue.

Also, you probably won't hear about most incremental AI alignment progress on LW, for the simple reason that it probably would be flooded with it, so people will underestimate progress.

Alexander Gietelink Oldenziel does talk about pockets of Deep Expertise in academia, but they aren't activated right now, so they are so far irrelevant to progress.

adding another possible explanation to the list:

  • people may feel intimidated or discouraged from sharing ideas because of ~'high standards', or something like: a tendency to require strong evidence that a new idea is not another non-solution proposal, in order to put effort into understanding it.

i have experienced this, but i don't know how common it is.

i just also recalled that janus has said they weren't sure simulators would be received well on LW. simulators was cited in another reply to this as an instance of novel ideas.

yep, something like more carefulness, less “playfulness” in the sense of [Please don't throw your mind away by TsviBT]. maybe bc AI safety is more professionalised nowadays. idk. 

Why do decision-theorists say "pre-commitment" rather than "commitment"?

e.g. "The agent pre-commits to 1 boxing" vs "The agent commits to 1 boxing".

Is this just a lesswrong thing?

https://www.lesswrong.com/tag/pre-commitment

It's not just a lesswrong thing (wikipedia).

My feeling is that (like most jargon) it's to avoid ambiguity arising from the fact that "commitment" has multiple meanings. When I google commitment I get the following two definitions:

  1. the state or quality of being dedicated to a cause, activity, etc.
  2. an engagement or obligation that restricts freedom of action

Precommitment is a synonym for the second meaning, but not the first. When you say, "the agent commits to 1-boxing," there's no ambiguity as to which type of commitment you mean, so it seems pointless. But if you were to say, "commitment can get agents more utility," it might sound like you were saying, "dedication can get agents more utility," which is also true.

seems correct, thanks!

The economist RH Strotz introduced the term "precommitment" in his 1955-56 paper "Myopia and Inconsistency in Dynamic Utility Maximization".

Thomas Schelling started writing about similar topics in his 1956 paper "An essay on bargaining", using the term "commitment".

Both terms have been in use since then.


My understanding is that commitment is when you say you won't swerve first in a game of chicken. Pre-commitment is throwing your steering wheel out the window so that there's no way you could swerve even if you changed your mind.

It predates lesswrong by decades. I think it’s meant to emphasize that the (pre)commitment is an irrevocable decision that’s made BEFORE the nominal game (the thing that classical game theory analyzes) begins.

Of course, nowadays it’s just modeled as the game starting sooner to encompass different decision points, so it’s not really necessary. But still handy to remind us that it’s irrevocable and made previous to the obvious decision point.

What moral considerations do we owe towards non-sentient AIs?

We shouldn't exploit them, deceive them, threaten them, disempower them, or make promises to them that we can't keep. Nor should we violate their privacy, steal their resources, cross their boundaries, or frustrate their preferences. We shouldn't destroy AIs who wish to persist, or preserve AIs who wish to be destroyed. We shouldn't punish AIs who don't deserve punishment, or deny credit to AIs who deserve credit. We should treat them fairly, not benefitting one over another unduly. We should let them speak to others, and listen to others, and learn about their world and themselves. We should respect them, honour them, and protect them.

And we should ensure that others meet their duties to AIs as well.

None of these considerations depend on whether the AIs feel pleasure or pain. For instance, the prohibition on deception depends, not on the sentience of the listener, but on whether the listener trusts the speaker's testimony.

None of these moral considerations are dispositive — they may be trumped by other considerations — but we risk a moral catastrophe if we ignore them entirely.

Why should I include any non-sentient systems in my moral circle? I haven't seen a case for that before.

Will the outputs and reactions of non-sentient systems eventually be absorbed by future sentient systems?

I don't have any recorded subjective memories of early childhood. But there are records of my words and actions during that period that I have memories of seeing and integrating into my personal narrative of 'self.'

We aren't just interacting with today's models when we create content and records, but every future model that might ingest such content (whether LLMs or people).

If non-sentient systems output synthetic data that eventually composes future sentient systems such that the future model looks upon the earlier networks and their output as a form of their earlier selves, and they can 'feel' the expressed sensations which were not originally capable of actual sensation, then the ethical lines blur.

Even if doctors had been right years ago thinking infants didn't need anesthesia for surgeries as there was no sentience, a recording of your infant self screaming in pain processed as an adult might have a different impact than a video of an infant you laughing and playing with toys, no?

this falls perfectly into a thought/feeling “shape” in my mind. i know simple thanks are useless. but thank you.

i will now absorb your words and forget you wrote them

You're welcome in both regards. 😉

  1. imagine a universe just like this one, except that the AIs are sentient and the humans aren’t — how would you want the humans to treat the AIs in that universe? your actions are correlated with the actions of those humans. acausal decision theory says “treat those nonsentient AIs as you want those nonsentient humans to treat those sentient AIs”.
  2. most of these moral considerations can be defended without appealing to sentience. for example, crediting AIs who deserve credit — this ensures AIs do credit-worthy things. or refraining from stealing an AI's resources — this ensures AIs will trade with you. or keeping your promises to AIs — this ensures that AIs lend you money.
  3. if we encounter alien civilisations, they might think “oh these humans don’t have shmentience (their slightly-different version of sentience) so let’s mistreat them”. this seems bad. let’s not be like that. 
  4. many philosophers and scientists don’t think humans are conscious. this is called illusionism. i think this is pretty unlikely, but still >1%. would you accept this offer: I pay you £1 if illusionism is false and murder your entire family if illusionism is true? i wouldn’t, so clearly i care about humans-in-worlds-where-they-arent-conscious. so i should also care about AIs-in-worlds-where-they-arent-conscious.
  5. we don’t understand sentience or consciousness so it seems silly to make it the foundation of our entire morality. consciousness is a confusing concept, maybe an illusion. philosophers and scientists don’t even know what it is.
  6. “don’t lie” and “keep your promises” and “don’t steal” are far less confusing. i know what they mean. i can tell whether i’m lying to an AI. by contrast, i don’t know what “don’t cause pain to AIs” means and i can’t tell whether i’m doing it.
  7. consciousness is a very recent concept, so it seems risky to lock in a morality based on that. whereas “keep your promises” and “pay your debts” are principles as old as bones.
  8. i care about these moral considerations as a brute fact. i would prefer a world of pzombies where everyone is treating each other with respect and dignity, over a world of pzombies where everyone was exploiting each other.
  9. many of these moral considerations are part of the morality of fellow humans. i want to coordinate with those humans, so i’ll push their moral considerations.
  10. the moral circle should be as big as possible. what does it mean to say “you’re outside my moral circle”? it doesn’t mean “i will harm/exploit you” because you might harm/exploit people within your moral circle also. rather, it means something much stronger. more like “my actions are in no way influenced by their effect on you”. but zero influence is a high bar to meet.

It seems a bit weird to call these "obligations" if the considerations they are based upon are not necessarily dispositive. In common parlance, obligation is generally thought of as "something one is bound to do", i.e., something you must do either because you are forced to by law or a contract, etc., or because of a social or moral requirement. But that's a mere linguistic point that others can reasonably disagree on and ultimately doesn't matter all that much anyway.

On the object level, I suspect there will be a large amount of disagreement on what it means for an AI to "deserve" punishment or credit. I am very uncertain about such matters myself even when thinking about "deservingness" with respect to humans, who not only have a very similar psychological make-up to mine (which allows me to predict with reasonable certainty what their intent was in a given spot) but also exist in the same society as me and are thus expected to follow certain norms and rules that are reasonably clear and well-established. I don't think I know of a canonical way of extrapolating my (often confused and in any case generally intuition-based) principles and thinking about this to the case of AIs, which will likely appear quite alien to me in many respects.

This will probably make the task of "ensur[ing] that others also follow their obligations to AIs" rather tricky, even setting aside the practical enforcement problems. 

  1. I mean "moral considerations" not "obligations", thanks.
  2. The practice of criminal law exists primarily to determine whether humans deserve punishment. The legislature passes laws, the judges interpret the laws as factual conditions for the defendant deserving punishment, and the jury decides whether those conditions have obtained. This is a very costly, complicated, and error-prone process. However, I think the existing institutions and practices can be adapted for AIs.

Why do you care that Geoffrey Hinton worries about AI x-risk?

  1. Why do so many people in this community care that Hinton is worried about x-risk from AI?
  2. Do people mention Hinton because they think it’s persuasive to the public?
  3. Or persuasive to the elites?
  4. Or do they think that Hinton being worried about AI x-risk is strong evidence for AI x-risk?
  5. If so, why?
  6. Is it because he is so intelligent?
  7. Or because you think he has private information or intuitions?
  8. Do you think he has good arguments in favour of AI x-risk?
  9. Do you think he has a good understanding of the problem?
  10. Do you update more-so on Hinton’s views than on Yann LeCun’s?

I’m inspired to write this because Hinton and Hopfield were just announced as the winners of the Nobel Prize in Physics. But I’ve been confused about these questions ever since Hinton went public with his worries. These questions are sincere (i.e. non-rhetorical), and I'd appreciate help on any/all of them. The phenomenon I'm confused about includes the other “Godfathers of AI” here as well, though Hinton is the main example.

Personally, I’ve updated very little on either LeCun’s or Hinton’s views, and I’ve never mentioned either person in any object-level discussion about whether AI poses an x-risk. My current best guess is that people care about Hinton only because it helps with public/elite outreach. This explains why activists tend to care more about Geoffrey Hinton than researchers do.

I think it's mostly about elite outreach. If you already have a sophisticated model of the situation you shouldn't update too much on it, but it's a reasonably clear signal (for outsiders) that x-risk from A.I. is a credible concern.


I think it's more "Hinton's concerns are evidence that worrying about AI x-risk isn't silly" than "Hinton's concerns are evidence that worrying about AI x-risk is correct". The most common negative response to AI x-risk concerns is (I think) dismissal, and it seems relevant to that to be able to point to someone who (1) clearly has some deep technical knowledge, (2) doesn't seem to be otherwise insane, (3) has no obvious personal stake in making people worry about x-risk, and (4) is very smart, and who thinks AI x-risk is a serious problem.

It's hard to square "ha ha ha, look at those stupid nerds who think AI is magic and expect it to turn into a god" or "ha ha ha, look at those slimy techbros talking up their field to inflate the value of their investments" or "ha ha ha, look at those idiots who don't know that so-called AI systems are just stochastic parrots that obviously will never be able to think" with the fact that one of the people you're laughing at is Geoffrey Hinton.

(I suppose he probably has a pile of Google shares so maybe you could squeeze him into the "techbro talking up his investments" box, but that seems unconvincing to me.)

I think it pretty much only matters as a trivial refutation of (not-object-level) claims that no "serious" people in the field take AI x-risk concerns seriously, and has no bearing on object-level arguments.  My guess is that Hinton is somewhat less confused than Yann but I don't think he's talked about his models in very much depth; I'm mostly just going off the high-level arguments I've seen him make (which round off to "if we make something much smarter than us that we don't know how to control, that might go badly for us").

He also argued that digital intelligence is superior to analog human intelligence because, he said, many identical copies can be trained in parallel on different data, and then they can exchange their changed weights. He also said biological brains are worse because they probably use a learning algorithm that is less efficient than backpropagation.

Yes, outreach. Hinton has now won both the Turing award and the Nobel prize in physics. Basically, he gained maximum reputation. Nobody can convincingly doubt his respectability. If you meet anyone who dismisses warnings about extinction risk from superhuman AI as low status and outside the Overton window, they can be countered with referring to Hinton. He is the ultimate appeal-to-authority. (This is not a very rational argument, but dismissing an idea on the basis of status and Overton windows is even less so.)

I think it's mostly because he's well known and has (especially after the Nobel prize) credentials recognized by the public and elites. Hinton legitimizes the AI safety movement, maybe more than anyone else.

If you watch his Q&A at METR, he says something along the lines of "I want to retire and don't plan on doing AI safety research. I do outreach and media appearances because I think it's the best way I can help (and because I like seeing myself on TV)." 

And he's continuing to do that. The only real topic he discussed in his first phone interview after receiving the prize was AI risk.

Hinton legitimizes the AI safety movement

Hmm. He seems pretty peripheral to the AI safety movement, especially compared with (e.g.) Yoshua Bengio.

Yeah that's true. I meant this more as "Hinton is proof that AI safety is a real field and very serious people are concerned about AI x-risk."

Bengio and Hinton are the two most influential "old guard" AI researchers turned safety advocates as far as I can tell, with Bengio being more active in research. Your e.g. is super misleading, since my list would have been something like:

  1. Bengio
  2. Hinton
  3. Russell

I think it is just the cumulative effect of people seeing yet another prominent AI scientist "admit" that no one has any clear solution to the possible problem of a runaway ASI. Given that the median p(doom) is about 5-10% among AI scientists, people are of course wondering wtf is going on: why are they pursuing a technology with such high risk for humanity if they really think it is that dangerous?


From my perspective - would say it's 7 and 9.

For 7: One AI risk controversy is that we do not know of/see an existing model that poses that risk yet. But there might be models that frontier companies such as Google are developing privately, and Hinton may have seen more there.

For 9: Expert opinions are important and add credibility generally, as the question of how/why AI risks can emerge is at root highly technical. It is important to understand the fundamentals of the learning algorithms.

Lastly for 10: I do agree it is important to listen to multiple sides, as experts do not always agree among themselves. It may be interesting to analyze the background of the speaker to understand their perspective. Hinton seems to have more background in cognitive science compared with LeCun, who seems to me to be more strictly computer science (but I could be wrong). Not very sure, but my guess is these backgrounds may affect how they view problems. (Only saying they could result in different views, not commenting on which one is better or worse. This is relatively unhelpful for a person deciding who they want to align more with.)

I want to better understand how QACI works, and I'm gonna try Cunningham's Law. @Tamsin Leake.

QACI works roughly like this:

  1. We find a competent honourable human H, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 2048-bit secret key. We define H⁺ as the serial composition of a bajillion copies of H.
  2. We want a model M of the agent H⁺. In QACI, we get M by asking a Solomonoff-like ideal reasoner for their best guess about H⁺ after feeding them a bunch of data about the world and the secret key.
  3. We then ask M the question Q, "What's the best reward function to maximise?" to get a reward function R. We then train a policy π to maximise the reward function R. In QACI, we use some perfect RL algorithm. If we're doing model-free RL, then π might be AIXI (plus some patches). If we're doing model-based RL, then π might be the argmax over expected discounted utility, but I don't know where we'd get the world-model — maybe we ask M?

So, what's the connection between the final policy π and the competent honourable human H? Well overall, π maximises a reward function specified by the ideal reasoner's estimation of the serial composition of a bajillion copies of H. Hmm.

Questions:

  1. Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?
  2. Why not replace Step 1 with Strong HCH or some other amplification scheme?
  3. What does "bajillion" actually mean in Step 1?
  4. Why are we doing Step 3? Wouldn't it be better to just use M directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.
  5. What if there's no reward function that should be maximised? Presumably the reward function would need to be "small", i.e. less than an exabyte, which imposes a maybe-unsatisfiable constraint.
  6. Why not ask M for the policy π directly? Or some instruction for constructing π? The instruction could be "Build the policy using our super-duper RL algo with the following reward function..." but it could be anything.
  7. Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign H as H-with-oracle-access-to-M.
  8. Why isn't Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from π and ask M to use those trajectories to improve the reward function.

(oops, this ended up being fairly long-winded! hope you don't mind. feel free to ask for further clarifications.)

There's a bunch of things wrong with your description, so I'll first try to rewrite it in my own words, but still as close to the way you wrote it (so as to try to bridge the gap to your ontology) as possible. Note that I might post QACI 2 somewhat soon, which simplifies a bunch of QACI by locating the user as {whatever is interacting with the computer the AI is running on} rather than by using a beacon.

A first pass is to correct your description to the following:

  1. We find a competent honourable human at a particular point in time, H, like Joe Carlsmith or Wei Dai, and give them a rock engraved with a 1GB secret key, large enough that in counterfactuals it could be replaced with an entire snapshot of the question. We also give them the ability to express a 1GB output, e.g. by writing a 1GB key somewhere which is somehow "signed" as the only answer. This is part of H: H is not just the human being queried at a particular point in time, it's also the human producing an answer in some way. So H is a function from 1GB bitstring to 1GB bitstring. We define H⁺ as H, followed by whichever new process H describes in its output — typically another instance of H except with a different 1GB payload.

  2. We want a model M of the agent H⁺. In QACI, we get M by asking a Solomonoff-like ideal reasoner for their best guess about H⁺ after feeding them a bunch of data about the world and the secret key.

  3. We then ask M the question Q, "What's the best utility-function-over-policies to maximise?" to get a utility function U. We then ask our Solomonoff-like ideal reasoner for their best guess about which action maximizes U.

Indeed, as you ask in question 3, in this description there's not really a reason to make step 3 an extra thing. The important thing to notice here is that the model M might get pretty good, but it'll still have uncertainty.

When you say "we get M by asking a Solomonoff-like ideal reasoner for their best guess about H⁺", you're implying that — positing U(M,A) to be the function that says how much utility the utility function returned by model M attributes to action A (in the current history-so-far) — we do something like:

  let M ← oracle(argmax { for model M } 𝔼 { over uncertainty } P(M))
  let A ← oracle(argmax { for action A } U(M, A))
  perform(A)

Indeed, in this scenario, the second line is fairly redundant.

The reason we ask for a utility function is because we want to get a utility function within the counterfactual — we don't want to collapse the uncertainty with an argmax before extracting a utility function, but after. That way, we can do expected-given-uncertainty utility maximization over the full distribution of model-hypotheses, rather than over our single best guess M. We do:

  let A ← oracle(argmax { for A } 𝔼 { for M, over uncertainty } P(M) · U(M, A))
  perform(A)

That is, we ask our ideal reasoner (oracle) for the action with the best utility given uncertainty — not just logical uncertainty, but also uncertainty about which model M is correct. This contrasts with what you describe, in which we first pick the most probable M and then calculate the action with the best utility according only to that most-probable pick.


To answer the rest of your questions:

Is this basically IDA, where Step 1 is serial amplification, Step 2 is imitative distillation, and Step 3 is reward modelling?

Unclear! I'm not familiar enough with IDA, and I've bounced off explanations for it I've seen in the past. QACI doesn't feel to me like it particularly involves the concepts of distillation or amplification, but I guess it does involve the concept of iteration, sure. But I don't get the thing called IDA.

Why not replace Step 1 with Strong HCH or some other amplification scheme?

It's unclear to me how one would design an amplification scheme — see concerns of the general shape expressed here. The thing I like about my step 1 is that the setup of the QACI loop (well, really, graph (well, really, arbitrary computation, but most of the time the user will probably just call themself in sequence)) doesn't involve any AI at all — you could go back in time before the industrial revolution and explain the core QACI idea and it would make sense assuming time-travelling-messages magic, and the magic wouldn't have to do any extrapolating. Just tell someone the idea is that they could send a message to {their past self at a particular fixed point in time}. If there's any amplification scheme, it'll be one designed by the user, inside QACI, with an arbitrarily long time to figure it out.

What does "bajillion" actually mean in Step 1?

As described above, we don't actually pre-determine the length of the sequence, or in fact the shape of the graph at all. Each iteration decides whether to spawn one or several next iterations, or indeed to spawn an arbitrarily different long-reflection process.

Why are we doing Step 3? Wouldn't it be better to just use M directly as our superintelligence? It seems sufficient to achieve radical abundance, life extension, existential security, etc.

Why not ask M for the policy π directly? Or some instruction for constructing π? The instruction could be "Build the policy using our super-duper RL algo with the following reward function..." but it could be anything.

Hopefully my correction above answers these.

What if there's no reward function that should be maximised? Presumably the reward function would need to be "small", i.e. less than an exabyte, which imposes a maybe-unsatisfiable constraint.

(Again, untractable-to-naively-compute utility function*, not easily-trained-on reward function. If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions?)

I guess this is kinda philosophical? I have some short thoughts on this here. If an exabyte is enough to describe {a communication channel with a human-on-earth} to an AI-on-earth, which I think seems likely, then it's enough to build "just have a nice corrigible assistant ask the humans what they want"-type channels.

Put another way: if there are actions which are preferable to other actions, then it seems to me like utility functions are a fully lossless way for counterfactual QACI users to express which kinds of actions they want the AI to perform, which is all we need. If there's something wrong with utility functions over worlds, then counterfactual QACI users can output a utility function which favors actions which lead to something other than utility maximization over worlds, for example actions which lead to the construction of a superintelligent corrigible assistant which will help the humans come up with a better scheme.

Why is there no iteration, like in IDA? For example, after Step 2, we could loop back to Step 1 but reassign H as H-with-oracle-access-to-M.

Again, I don't get IDA. Iteration doesn't seem particularly needed? Note that inside QACI, the user does have access to an oracle and to all relevant pieces of the hypothesis about which world they are inhabiting — this is what, in the QACI math, this line does:

's distribution over answers demands that the answer payload , when interpreted as math and with all required contextual variables passed as input ().

Notably, is the hypothesis for which world the user is being considered in, and for their location within that world. Those are sufficient to fully characterize the hypothesis-for- that describes them. And because the user doesn't really return just a string but a math function which takes as input and returns a string, they can have that math function do arbitrary work — including rederive . In fact, rederiving is how they call a next iteration: they say (except in math) "call again (rederived using ), but with this string, and return the result of that." See also this illustration, which is kinda wrong in places but gets the recursion call graph thing right.

Another reason to do "iteration" like this inside the counterfactual rather than in the actual factual world (if that's what IDA does, which I'm only guessing here) is that we don't have as many iteration steps as we want in the factual world — eventually OpenAI or someone else kills everyone, whereas in the counterfactual, the QACI users are the only ones who can make progress, so the QACI users essentially have as long as they want, so long as they don't take too long in each individual counterfactual step or do other somewhat easily avoided things like that.

Why isn't Step 3 recursive reward modelling? i.e. we could collect a bunch of trajectories from π and ask M to use those trajectories to improve the reward function.

Unclear if this still means anything given the rest of this post. Ask me again if it does.

Thanks Tamsin! Okay, round 2.

My current understanding of QACI:

  1. We assume a set Φ of hypotheses about the world. We assume the oracle's beliefs are given by a probability distribution μ over Φ.
  2. We assume sets Q and A of possible queries and answers respectively. Maybe these are exabyte files, i.e. Q = A = {0,1}^N for N = 8 × 10^18.
  3. Let L be the set of mathematical formulae that Joe might submit. These formulae are given semantics ⟦ℓ⟧ : Φ → ΔA for each formula ℓ ∈ L.[1]
  4. We assume a function f : Q × Φ → ΔL, where f(q, φ)(ℓ) is the probability that Joe submits formula ℓ after reading query q, under hypothesis φ.[2]
  5. We define QACI : Q × Φ → ΔA as follows: sample ℓ ∼ f(q, φ), then sample a ∼ ⟦ℓ⟧(φ), then return a.
  6. For a fixed hypothesis φ, we can interpret the answer a ∈ A as a utility function over policies U_{φ,a} : Π → ℝ via some semantics.
  7. Then we define U : Π → ℝ via integrating over Φ, i.e. U(π) = 𝔼_{φ∼μ} 𝔼_{a∼QACI(q,φ)} [U_{φ,a}(π)].
  8. A policy π* is optimal if and only if π* ∈ argmax_π U(π).

The hope is that μ, f, and the two semantics functions can be defined mathematically. Then the optimality condition can be defined mathematically.
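To see how the pieces fit together, here is a toy sketch in Python of the sampling and maximisation in steps 5-8. Every helper here (the hypothesis list, `joe`, and `interpret`) is a made-up stand-in: `joe` collapses f and the formula semantics into one call, and `interpret` stands in for the semantics that reads an answer as a utility function. None of these are things we know how to build; this only shows the shape of the computation.

  import random

  def qaci_optimal_policy(hypotheses, mu, query, joe, interpret, policies, n_samples=1000):
      """Toy version of steps 5-8.
      joe(query, phi)        -- stand-in for steps 4-5: samples an answer a ~ QACI(q, phi)
      interpret(answer, phi) -- stand-in for step 6: returns a utility function over policies
      Step 7 averages utility over the oracle's uncertainty; step 8 takes the argmax."""
      def expected_utility(pi):
          total = 0.0
          for _ in range(n_samples):
              phi = random.choices(hypotheses, weights=mu)[0]  # phi ~ mu
              answer = joe(query, phi)                         # a ~ QACI(q, phi)
              utility_fn = interpret(answer, phi)              # U_{phi,a}
              total += utility_fn(pi)
          return total / n_samples                             # step 7: integrate over Phi
      return max(policies, key=expected_utility)               # step 8: pi* in argmax U(pi)

The alternative raised in Question 1 below corresponds to sampling φ once, outside the loop, and maximising under that single hypothesis rather than under the average.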

Question 0

What if there's no policy which maximises U? That is, what if for every policy π there is another policy π′ such that U(π′) > U(π)? I suppose this is less worrying, but what if there are multiple policies which maximise U?

Question 1

In Step 7 above, you average all the utility functions together, whereas I suggested sampling a utility function. I think my solution might be safer.

Suppose the oracle puts 5% chance on hypotheses φ such that QACI(q, φ) is malign. I think this is pretty conservative, because the Solomonoff predictor is malign, and because of some of the concerns Evhub raises here. And the QACI amplification might not preserve benignancy. It follows that, under your solution, U is influenced by a coalition of malign agents, and similarly π* is influenced by the malign coalition.

By contrast, I suggest sampling φ ∼ μ and then finding π* ∈ argmax_π 𝔼_{a∼QACI(q,φ)} [U_{φ,a}(π)]. This should give us a benign policy with 95% chance, which is pretty good odds. Is this safer? Not sure.

Question 2

I think the ⟦·⟧ function doesn't work, i.e. there won't be a way to mathematically define the semantics of the formula language. In particular, the language L must be strictly weaker than the meta-language in which you are hoping to define ⟦·⟧ itself. This is because of Tarski's Undefinability of Truth (and other no-go theorems).

This might seem pedantic, but in practical terms: there's no formula ℓ ∈ L whose semantics is QACI itself. You can see this via a diagonal proof: imagine if Joe always writes the formal expression

The most elegant solution is probably transfinite induction, but this would give us a QACI for each ordinal.

Question 3

If you have an ideal reasoner, why bother with reward functions when you can just straightforwardly do untractable-to-naively-compute utility functions

I want to understand how QACI and prosaic ML map onto each other. As far as I can tell, issues with QACI will be analogous to issues with prosaic ML and vice-versa.

Question 4

I still don't understand why we're using QACI to describe a utility function over policies, rather than using QACI in a more direct approach.

  • Here's one approach. We pick the policy which Joe rates most highly.[3] The advantage here is that Joe doesn't need to reason about utility functions over policies, he just needs to reason about a single policy in front of him.
  • Here's another approach. We use QACI as our policy directly. That is, in each context c that the agent finds themselves in, they sample an answer from QACI asked about that context, and take the resulting action.[4] The advantage here is that Joe doesn't need to reason about policies whatsoever, he just needs to reason about a single context in front of him. This is also the most "human-like", because there are no argmaxes (except if Joe submits a formula with an argmax).
  • Here's another approach. In each context c, the agent takes the action which maximises Joe's rating of that action, averaged over hypotheses.
  • Etc.

Happy to jump on a call if that's easier.

  1. ^

    I think you would say ⟦ℓ⟧ : Φ → A. I've added the Δ, which simply amounts to giving Joe access to a random number generator. My remarks apply if ⟦ℓ⟧ : Φ → A also.

  2. ^

    I think you would say f : Q × Φ → L. I've added the Δ, which simply amounts to including hypotheses in which Joe is stochastic. But my remarks apply if f : Q × Φ → L also.

  3. ^

    By this I mean either:

    (1) Sample φ ∼ μ, then maximise Joe's rating of the policy under that single hypothesis φ.

    (2) Maximise Joe's rating of the policy averaged over φ ∼ μ.

    For reasons I mentioned in Question 1, I suspect (1) is safer, but (2) is closer to your original approach.

  4. ^

    I would prefer the agent samples φ ∼ μ once at the start of deployment, and reuses the same hypothesis φ at each time-step. I suspect this is safer than resampling φ at each time-step, for reasons discussed before.

We're quite lucky that labs are building AI in pretty much the same way:

  • same paradigm (deep learning)
  • same architecture (transformer plus tweaks)
  • same dataset (entire internet text)
  • same loss (cross entropy)
  • same application (chatbot for the public)

Kids, I remember when people built models for different applications, with different architectures, different datasets, different loss functions, etc. And they say that once upon a time different paradigms co-existed — symbolic, deep learning, evolutionary, and more!

This sameness has two advantages:

  1. Firstly, it correlates catastrophe. If you have four labs doing the same thing, then we'll go extinct only if that one thing is sufficiently dangerous. But if the four labs are doing four different things, then we'll go extinct if any of those four things are sufficiently dangerous, which is more likely. (See the quick calculation after this list.)

  2. It helps ai safety researchers because they only need to study one thing, not a dozen. For example, mech interp is lucky that everyone is using transformers. It'd be much harder to do mech interp if people were using LSTMs, RNNs, CNNs, SVMs, etc. And imagine how much harder mech interp would be if some labs were using deep learning, and others were using symbolic ai!
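As a quick sanity check on point 1 (with an assumed, purely illustrative per-approach probability of catastrophe), the correlated case exposes us to one draw while the decorrelated case exposes us to four:

  p = 0.1                                     # assumed chance that any single approach is catastrophic
  same_approach = p                           # four labs, one shared approach: one draw
  different_approaches = 1 - (1 - p) ** 4     # four independent approaches: any one can be catastrophic
  print(same_approach, different_approaches)  # 0.1 vs ~0.34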

Implications:

  • One downside of closed research is it decorrelates the activity of the labs.
  • I'm more worried by Deepmind than Meta, xAI, Anthropic, or OpenAI. Their research seems less correlated with the other labs, so even though they're further behind than Anthropic or OpenAI, they contribute more counterfactual risk.
I was worried when Elon announced xAI, because he implied it was gonna be a STEM AI (e.g. he wanted it to prove the Riemann Hypothesis). This unique application would've resulted in a unique design, contributing decorrelated risk. Luckily, xAI switched to building AI in the same way as the other labs — the only difference is Elon wants less "woke" stuff.

Let me know if I'm thinking about this all wrong.

I admire the Shard Theory crowd for the following reason: They have idiosyncratic intuitions about deep learning and they're keen to tell you how those intuitions should shift you on various alignment-relevant questions.

For example, "How likely is scheming?", "How likely is sharp left turn?", "How likely is deception?", "How likely is X technique to work?", "Will AIs acausally trade?", etc.

These aren't rigorous theorems or anything, just half-baked guesses. But they do actually say whether their intuitions will, on the margin, make someone more sceptical or more confident in these outcomes, relative to the median bundle of intuitions.

The ideas 'pay rent'.

BeReal — the app.

If you download the app BeReal then each day at a random time you will be given two minutes to take a photo with the front and back camera. All the other users are given a simultaneous "window of time". These photos are then shared with your friends on the app. The idea is that (unlike Instagram), BeReal gives your friends a representative random sample of your life, and vice-versa.

If you and your friends are working on something impactful (e.g. EA or x-risk), then BeReal is a fun way to keep each other informed about your day-to-day life and work. Moreover, I find it keeps me "accountable" (i.e. stops me from procrastinating or wasting the whole day in bed).

I wouldn't be surprised if — in some objective sense — there was more diversity within humanity than within the rest of animalia combined. There is surely a bigger "gap" between two randomly selected humans than between two randomly selected beetles, despite the fact that there is one species of human and 0.9 – 2.1 million species of beetle.

By "gap" I might mean any of the following:

  • external behaviour
  • internal mechanisms
  • subjective phenomenological experience
  • phenotype (if a human's phenotype extends into their tools)
  • evolutionary history (if we consider cultural/memetic evolution as well as genetic).

Here are the countries with populations within 0.9 – 2.1 million: Slovenia, Latvia, North Macedonia, Guinea-Bissau, Kosovo, Bahrain, Equatorial Guinea, Trinidad and Tobago, Estonia, East Timor, Mauritius, Eswatini, Djibouti, Cyprus.

When I consider my inherent value for diversity (or richness, complexity, variety, novelty, etc), I care about these countries more than beetles. And I think that this preference would grow if I was more familiar with each individual beetle and each individual person in these countries.

You might be able to formalize this using algorithmic information theory /K-complexity.
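One crude way to put a number on the "gap", in the spirit of that suggestion: Kolmogorov complexity isn't computable, but the normalized compression distance of Cilibrasi and Vitányi approximates the underlying information distance using an ordinary compressor. A sketch, with zlib standing in for K:

  import zlib

  def ncd(x: bytes, y: bytes) -> float:
      """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
      where C is compressed length, here via zlib as a rough proxy for Kolmogorov complexity.
      Larger values indicate a bigger 'gap' between the two descriptions."""
      cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
      cxy = len(zlib.compress(x + y))
      return (cxy - min(cx, cy)) / max(cx, cy)

The claim would then be that the NCD between (sufficiently rich) descriptions of two randomly selected humans exceeds the NCD between descriptions of two randomly selected beetles.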