Wiki Contributions


I had an insight about the implications of NAH which I believe is useful to communicate if true and useful to dispel if false; I don't think it has been explicitly mentioned before.

One of Eliezer's examples is "The AI must be able to make a cellularly identical but not molecularly identical duplicate of a strawberry." One of the difficulties is explaining to the AI what that means. This is a problem with communicating across different ontologies--the AI sees the world completely differently than we do. If NAH in a strong sense is true, then this problem goes away on its own as capabilities increase; that is, AGI will understand us when we communicate something that has a coherent natural interpretation, even without extra effort on our part to translate it to the AGI version of machine code.

Does that seem right?

(Why "Top 3" instead of "literally the top priority?" Well, I do think a successful AGI lab also needs have top-quality researchers, and other forms of operational excellence beyond the ones this post focuses on. You only get one top priority, )


I think the situation is more dire than this post suggests, mostly because "You only get one top priority." If your top priority is anything other than this kind of organizational adequacy, it will take precedence too often; if your top priority is organizational adequacy, you probably can't get off the ground.

The best distillation of my understanding regarding why "second priority" is basically the same as "not a priority at all" is this twitter thread by Dan Luu.

The fear was that if they said that they needed to ship fast and improve reliability, reliability would be used as an excuse to not ship quickly and needing to ship quickly would be used as an excuse for poor reliability and they'd achieve none of their goals.

I just read an article that reminded me of this post. The relevant section starts with "Bender and Manning’s biggest disagreement is over how meaning is created". Bender's position seems to have some similarities with the thesis you present here, especially when viewed in contrast to what Manning claims is the currently more popular position that meaning can arise purely from distributional properties of language.

This got me wondering: if Bender is correct, then there is a fundamental limitation in how well (pure) language models can understand the world; are there ways to test this hypothesis, and what does it mean for alignment?


Interesting. I'm reminded of this definition of "beauty".

One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT3); the former has no pretenses of doing only next token prediction.

But they seem like they are only doing part of the "intelligence thing".

I want to be careful here; there is some evidence to suggest that they are doing (or at least capable of doing) a huge portion of the "intelligence thing", including planning, induction, and search, and even more if you include minor external capabilities like storage.

I don't know if anyone else has spoken about this, but since thinking about LLMs a little I am starting to feel like their something analagoss to a small LLM (SLM?) embedded somewhere as a component in humans

I know that the phenomenon has been studied for reading and listening (I personally get a kick out of garden-path sentences); the relevant fields are "natural language processing" and "computational linguistics". I don't know know of any work specifically that addressed it in the "speaking" setting.

if we want to build something "human level" then it stands to reason that it would end up with specialized components for the same sorts of things humans have specialized components for.

Soft disagree. We're actively building the specialized components because that's what we want, not because that's particularly useful for AGI.

Some thoughts:

  • Those who expect fast takeoffs would see the sub-human phase as a blip on the radar on the way to super-human
  • The model you describe is presumably a specialist model (if it were generalist and capable of super-human biology, it would plausibly count as super-human; if it were not capable of super-human biology, it would not be very useful for the purpose you describe). In this case, the source of the risk is better thought of as the actors operating the model and the weapons produced; the AI is just a tool
  • Super-human AI is a particularly salient risk because unlike others, there is reason to expect it to be unintentional; most people don't want to destroy the world
  • The actions for how to reduce xrisk from sub-human AI and from super-human AI are likely to be very different, with the former being mostly focused on the uses of the AI and the latter being on solving relatively novel technical and social problems

I think "sufficiently" is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?

I also don't think "something in the middle" is the right characterization; I think "something else" it more accurate. I think that the failure you're pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn't really present in either part.

I also think that "cyborg alignment" is in many ways a much more tractable problem than "AI alignment" (and in some ways even less tractable, because of pesky human psychology):

  • It's a much more gradual problem; a misaligned cyborg (with no agentic AI components) is not directly capable of FOOM (Amdhal's law was mentioned elsewhere in the comments as a limit on usefulness of cyborgism, but it's also a limit on damage)
  • It has been studied longer and has existed longer; all technologies have influenced human thought

It also may be an important paradigm to study (even if we don't actively create tools for it) because it's already happening.

Like, I may not want to become a cyborg if I stop being me, but that's a separate concern from whether it's bad for alignment (if the resulting cyborg is still aligned).

OpenAI’s focus with doing these kinds of augmentations is very much “fixing bugs” with how GPT behaves: Keep GPT on task, prevent GPT from making obvious mistakes, and stop GPT from producing controversial or objectionable content. Notice that these are all things that GPT is very poorly suited for, but humans find quite easy (when they want to). OpenAI is forced to do these things, because as a public facing company they have to avoid disastrous headlines like, for example: Racist AI writes manifesto denying holocaust.[7]

As alignment researchers, we don’t need to worry about any of that! The goal is to solve alignment, and as such we don’t have to be constrained like this in how we use language models. We don’t need to try to “align” language models by adding some RLHF, we need to use language models to enable us to actually solve alignment at its core, and as such we are free to explore a much wider space of possible strategies for using GPT to speed up our research.[8] 


I think this section is wrong because it's looking through the wrong frame. It's true that cyborgs won't have to care about PR disasters in their augments, but it feels like this section is missing the fact that there are actual issues behind the PR problem. Consider a model that will produce the statement "Men are doctors; women are nurses" (when not explicitly playing an appropriate role); this indicates something wrong with its epistemics, and I'd have some concern about using such a model.

Put differently, a lot of the research on fixing algorithmic bias can be viewed as trying to repair known epistemic issues in the model (usually which are inherited from the training data); this seems actively desirable for the use case described here.

Load More