Jack O'Brien



Yep, fair point. In my original comment I seem to have forgotten about the problem of AIs goodharting our long reflection. I now probably agree that doing a pivotal act into a long reflection is approximately as difficult as solving alignment.

(Side-note about how my brain works: I notice that when I think through all the argumentative steps deliberately, I do believe this statement: "Making an AI which helps humans clarify their values is approximately as hard as making an AI care about any simple, specific thing." However, it does not come to mind automatically when I'm reasoning about alignment. Two possible fixes:

  1. Think more concretely about Retargeting the Search when I think about solving alignment. This makes the problems seem similar in difficulty.
  2. Meditate on just how hard it is to target an AI at something. Sometimes I forget how Goodhartable any objective is.)

This post was incredibly interesting and useful to me. I would strong-upvote it, but I don't think this post should be promoted to more people. I've been thinking about the question of "who are we aligning AI to" for the past two months.

I really liked your criticism of the Long Reflection because it is refreshingly different from e.g. MacAskill and Ord's writing on the long reflection. I'm still not convinced that we can't avoid all of the hellish things you mentioned, like synthetic superstimuli cults and sub-AGI drones. Why can't we just have a simple process of open dialogue with values of truth, individual agency during the reflection, and some clearly defined contract at the end of the long reflection to, like, take power away from the AGI drones?

3 is my main reason for wanting to learn more pure math, but I use 1 and 2 to help motivate me.

Which of these books are you most excited about and why? I also want to do more fun math reading.

Let's be optimistic and prove that an agentic AI will be beneficial for the long-term future of humanity. We probably need to prove these 3 premises:

Premise 1: Training story X will create an AI model which approximates agent formalism A
Premise 2: Agent formalism A is computable and has a set of alignment properties P
Premise 3: An AI with a set of alignment properties P will be beneficial for the long-term future.

Aaand so far I'm not happy with our answers to any of these.

Fantastic! Here's my summary:


  1. A recursively self-improving singleton is the most likely AI scenario.
  2. To mitigate AI risk, building a fully aligned singleton on the first try is the easiest solution. This is easier than other approaches which require solving coordination.
  3. By default, AI will become misaligned when it generalises away from human capabilities. We must apply a security mindset and be doubtful of most claims that an AI will be aligned when it generalises.
  4. We should prioritise research which solves the hard part of the alignment problem rather than handwavy, tractable research.
  5. Many current tractable alignment solutions handwave away the hard parts of the problem (and possibly make the problem worse in expectation).
  6. More formal / rigorous research can produce more robust guarantees about AI safety than other research.

Conclusion: We should prioritize working out what we even want from an aligned AI, formalize these ideas into concrete desiderata, then build AI systems which meet those desiderata.

Did you get around to writing a longer answer to the question, "How do humans do anything in practice if the search space is vast?" I'd be curious to see your thoughts.

My answer to this question is that: 
(a) Most day-to-day problems can be solved from far away using a low-dimensional space containing natural abstractions. For example, a manager at a company can give their team verbal instructions without describing the detailed sequence of muscle movements needed.
(b) For unsolved problems in science, we get many tries at the problem. So, we can use the scientific method to design many experiments which give us enough bits to locate the solution. For example, a drug discovery team can try thousands of compounds in their search for a new drug. The drug discovery team gets to test each compound on the condition they're trying to treat - so, they can get many bits about which compounds could be effective.
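To put rough numbers on point (b), here is a minimal sketch, assuming a hypothetical library of 4096 compounds in which exactly one works. The function names and the library size are illustrative assumptions, not anything from the original post. It compares the bits needed to locate the solution with the bits a single yes/no experiment can supply.

```python
import math


def bits_to_locate(num_candidates: int) -> float:
    """Bits needed to single out one item among equally likely candidates."""
    return math.log2(num_candidates)


def bits_from_one_test(p_positive: float) -> float:
    """Shannon entropy (in bits) of a yes/no test that is positive with probability p_positive."""
    if p_positive <= 0.0 or p_positive >= 1.0:
        return 0.0  # a test with a certain outcome tells us nothing
    q = 1.0 - p_positive
    return -(p_positive * math.log2(p_positive) + q * math.log2(q))


# Hypothetical numbers: 4096 candidate compounds, exactly one effective.
num_compounds = 4096
needed = bits_to_locate(num_compounds)

# Testing a single compound almost always comes back negative,
# so each such test carries only a small fraction of a bit.
per_single_test = bits_from_one_test(1 / num_compounds)

# An idealised experiment that splits the candidates in half carries a full bit.
per_halving_test = bits_from_one_test(0.5)

print(f"bits needed: {needed}")
print(f"bits per single-compound test: {per_single_test:.5f}")
print(f"bits per halving test: {per_halving_test}")
```

Under these assumptions, one-compound-at-a-time screening yields far less than one bit per experiment, which is one way to see why getting many tries matters so much.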

I have a few questions about corrigibility. First, I will tentatively define corrigibility as creating an agent who is willing to let humans shut it off or change its goals without manipulating humans. I have seen that corrigibility can lead to VNM-incoherence (i.e. an agent can be Dutch-booked / money-pumped). Has this result been proven in general?

Also, what is the current state of corrigibility research? If the above incoherence result turns out to be correct and corrigibility leads to incoherence, are there any other tractable theoretical directions we could take towards corrigibility? 

Are any people trying to create corrigible agents in practice? (I suspect it is unwise to try this, as any poorly understood corrigibility we manage to implement in practice is liable to be wiped away if a sharp left turn occurs).
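For readers unfamiliar with the money pump mentioned above, here is a minimal sketch of the general idea, assuming a toy agent that pays a fixed fee whenever it is offered an item it strictly prefers to the one it holds. The items, the fee, and the function name are hypothetical. This only illustrates what incoherent (cyclic) preferences cost an agent; it does not show that corrigibility produces them.

```python
def run_money_pump(prefers, start_item, offers, fee=1):
    """Offer items in sequence; the agent pays `fee` to swap whenever it
    strictly prefers the offered item to the one it currently holds.

    `prefers` is a set of (better, worse) pairs.
    Returns the final holding and the total fees paid.
    """
    holding, paid = start_item, 0
    for offered in offers:
        if (offered, holding) in prefers:
            holding = offered
            paid += fee
    return holding, paid


# Cyclic (incoherent) preferences: A over B, B over C, C over A.
cyclic = {("A", "B"), ("B", "C"), ("C", "A")}

# Transitive (coherent) preferences: A over B, B over C, A over C.
transitive = {("A", "B"), ("B", "C"), ("A", "C")}

offers = ["C", "B", "A"] * 3  # three trips around the cycle

print(run_money_pump(cyclic, "A", offers))      # ends holding A, having paid 9
print(run_money_pump(transitive, "A", offers))  # ends holding A, having paid 0
```

The cyclic agent ends up exactly where it started but poorer, while the transitive agent declines every trade.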

Excellent summary, Harrison! I especially enjoyed your use of pillar diagrams to break up streams of text. In general, I found your post very approachable and readable.

As for Pillar 2: I find the description of goals as "generalised concepts" still pretty confusing after reading your summary. I don't think this example of a generalised concept counts as a goal: "things that are perfectly round are objects called spheres; 6-sided boxes are objects called cubes". This statement is a fact, but a goal is a normative preference about the world (cf. the is-ought distinction).
Also, I think the 'coherence' trait could do with slightly more deconfusion - you could phrase it as "goals are internally consistent and stable over time".

I think the most tenuous pillar in Ngo's argument is Pillar 2: that AI will be agentic with large-scale goals. It's plausible that the economic incentives for developing a CEO-style AI with advanced planning capabilities will not be as strong as stated. I agree that there is a strong economic incentive for CEO-style AI which can improve business decision-making. However, I'm not convinced that creating an agentic AI with large-scale goals is the best way to do this. We don't have enough information about which kinds of AI are most cost-effective for doing business decision-making. For example, the AI field may develop viable models that don't display these pesky agentic tendencies. 
(On the other hand, it does seem plausible that an agentic AI with large-scale goals is a very parsimonious/natural/easily-found-by-SGD model for such business decision-making tasks.)
