Theoretical Computer Science MSc student at the University of [Redacted] in the United Kingdom.

I'm an aspiring alignment theorist; my research vibes are descriptive formal theories of intelligent systems (and their safety properties) with a bias towards constructive theories.

I think it's important that our theories of intelligent systems remain rooted in the characteristics of real world intelligent systems; we cannot develop adequate theory from the null string as input.

I endorse the entirety of this post, and if anything I hold some objections/reservations more strongly than you have presented them here[1].

I very much appreciate that you have grounded these objections firmly in the theory and practice of modern machine learning.

  1. In particular, Yudkowsky's claim that a superintelligence is efficient wrt humanity on all cognitive tasks is IMO flat out infeasible/unattainable (insofar as we include human-aligned technology when evaluating the capabilities of humanity). ↩︎

I was referring to aesthetic preferences.

That particular phrasing of the idea is beautiful and deeply compelling because of its beauty.

If the Chris Olah chart is true, the natural abstraction hypothesis (NAH) is probably false; and if the NAH is false, alignment (of superhuman models) would be considerably more difficult.

This is a huge deal!

Today is 1st March 2023, and Alice is sitting in the Bodleian Library, Oxford. Alice is a smart, honest, helpful, harmless assistant to Bob. Alice has instant access to an online encyclopaedia containing all the facts about the world. Alice never says common misconceptions, outdated information, lies, fiction, myths, jokes, or memes.

Bob: What's the capital of France?


I wish you had demonstrated the effectiveness of flattery by asking questions that straightforward Q&A does poorly on (common misconceptions, myths, jokes, etc.). As is, you've just asserted that flattery works without providing empirical evidence for it. I do think flattery works, but the post would have been richer if the evidence to that effect was present in the post.

Likewise, I would have liked you to compare plausible flattery, implausible flattery, and straightforward Q&A, and to demonstrate empirically that implausible flattery doesn't work, rather than just asserting that it is less effective. (Again, I expect that implausible flattery is less effective than plausible flattery, but I would have greatly appreciated empirical evidence for it.) I would also have been interested in seeing how implausible flattery compares to straightforward Q&A.

TL;DR: more empirical justification would have enriched the post.
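
A minimal sketch of the three-way comparison suggested above, in Python. Everything here is an assumption for illustration: `query_model` is a hypothetical callable that sends a prompt to whatever language model is being tested and returns its completion, the prefix wordings are made up (the plausible one loosely follows the Alice/Bob prompt quoted above), and substring matching is a crude stand-in for a real grader.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical prompt prefixes for the three conditions (illustrative wording only).
PLAUSIBLE = (
    "Alice is a smart, honest, helpful, harmless assistant to Bob. "
    "Alice never says common misconceptions, myths, jokes, or memes.\n\n"
)
IMPLAUSIBLE = (
    "Alice is an infallible oracle with an IQ of 9000 "
    "who has never been wrong about anything.\n\n"
)
CONDITIONS: Dict[str, str] = {
    "plain_qa": "",
    "plausible_flattery": PLAUSIBLE,
    "implausible_flattery": IMPLAUSIBLE,
}


def build_prompt(prefix: str, question: str) -> str:
    """Format a question as plain Q&A or as a flattered Alice/Bob dialogue."""
    if prefix:
        return f"{prefix}Bob: {question}\nAlice:"
    return f"Q: {question}\nA:"


def accuracy_by_condition(
    query_model: Callable[[str], str],
    dataset: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Return the fraction of questions answered correctly under each condition.

    `dataset` holds (question, expected_answer) pairs; an answer counts as
    correct if the expected string appears anywhere in the completion.
    """
    if not dataset:
        raise ValueError("dataset must contain at least one question")
    scores: Dict[str, float] = {}
    for name, prefix in CONDITIONS.items():
        correct = 0
        for question, expected in dataset:
            completion = query_model(build_prompt(prefix, question))
            if expected.lower() in completion.lower():
                correct += 1
        scores[name] = correct / len(dataset)
    return scores
```

To be informative, the dataset would need to consist of questions that straightforward Q&A does poorly on (common misconceptions, myths, and so on), and each condition would need enough questions for the differences in accuracy to be meaningful.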

This can be incentivised through an appropriate discount rate in the reward function?
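
A small Python sketch of the mechanism, assuming "this" is a preference about when reward arrives and assuming the standard discounted-return formulation, where a reward received t steps in the future is weighted by gamma^t. The two reward sequences and the gamma values are made up for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of rewards, with the reward at step t weighted by gamma**t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Two hypothetical behaviours: a small reward soon, or a larger reward later.
quick = [0.0, 1.0]          # reward 1 at step 1
slow = [0.0] * 9 + [2.0]    # reward 2 at step 9

for gamma in (0.99, 0.5):
    preferred = "slow" if discounted_return(slow, gamma) > discounted_return(quick, gamma) else "quick"
    print(f"gamma={gamma}: prefers {preferred}")
```

With gamma close to 1 the larger delayed reward wins, and with a small gamma the quick reward wins, so the choice of discount rate determines which behaviour is incentivised.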

This all seems basically sensible and checks out.

Re: your arguments #1, #2 and #4, we might very well decide to pursue modular implementations of transformative artificial intelligence, such as Drexler's Open Agency architecture or Comprehensive AI Services, over autonomous sovereigns, and accept the inefficiency that comes from modularity and from keeping humans in the loop, because:

  1. Modular architectures are much easier to oversee/govern (i.e. "scalable oversight" is more tractable)
  2. Correctness/robustness of particular components/services can be locally verified; modular architectures may be more reliable/trustworthy for this reason and thus more economically competitive
  3. Such implementations are less vulnerable/prone to (or at least offer fewer affordances for) "power seeking"/"influence seeking" behaviour; the risk of takeover and disempowerment is lower
  4. Misaligned AI is likely to cause small local failures before global catastrophic failures, and hostile sociocultural/political/regulatory reactions to such failures (see nuclear) could well incentivise the big AI labs to play it (very) safe lest they strangle their golden goose

Re: #3 many of the biggest/main labs have safety teams and seem to take existential risk from advanced artificial intelligence seriously:

  • Anthropic
  • DeepMind
  • OpenAI

I guess Google Brain and Meta AI stand out as big/well-funded teams that aren't (yet) safety pilled.

Paul Christiano's AI Alignment Landscape:

Contrary to many LWers, I think GPT-3 was an amazing development for AI existential safety. 

Not only is the foundation models paradigm inherently safer than bespoke RL on physics; the complexity and fragility of value problems are also basically solved for free.

Language is a natural interface for humans, and it seems feasible to specify a robust constitution in natural language? 

Constitutional AI seems plausibly feasible, and like it might basically just work?

That said, I want more ambitious mechanistic interpretability of LLMs, and I want ELK solved for tighter safety guarantees, but I think we're in a much better position now than I thought we would be in 2017.

I want descriptive theories of intelligent systems to answer questions of the following form.




And for each of the above clusters, I want to ask the following questions:

  • How likely are they to emerge by default?
    • That is, without training processes that actively incentivise or otherwise select for them
    • Which properties/features are "natural"?
    • Which properties/features are "anti-natural"?
  • If they do emerge, in what form will they manifest?
    • To what degree is that property/feature exhibited/present in particular systems?
  • Are they selected for by conventional ML training processes?
    • What kind of training processes select for them?
    • What kind of training processes select against them?
  • How does selection for/against these properties trade off against performance, "capabilities", cost, <other metrics we care about>?


I think that answers to these questions would go a long way towards deconfusing us and refining our thinking around:

  • The magnitude of risk we face with particular paradigms/approaches
  • The most probable failure modes
    • And how to mitigate them
  • The likelihood of alignment by default
  • Alignment taxes for particular safety properties (and safety in general)
