https://intelligence.org/files/Corrigibility.pdf

Here's the link ^

 

It is hard for me to phrase this question in a way that is correct and also not offensive. But research like this feels "not real" for some reason. It feels fuzzy, like random made-up philosophy that won't ever have any real impact on the world. It doesn't feel /real/ like a paper on "look this neural net is doing this bad thing, watch" or "look this model always lies and it's impossible to come up with a model that doesn't lie if it's in this format"

This just feels like pretend, made-up research that they put math equations in to seem like it's formal and rigorous. You just make up rules and then pretend they describe reality and then prove mathematically some results from there? 

Does anyone else see what I'm saying about feeling like this sort of research is fake, and how would I convince myself that it isn't just useless random thoughts written down that kind of dance near the topic? At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off. 


I would say the Corrigibility paper shares the same "feel" as certain cryptography papers. I think it is true that this feel is distinct, and not true that it means they are "not real".

For example, what does it mean for a cryptosystem to be secure? This is an important topic with impressive achievements, but it does feel different from the nuts and bolts of cryptography, like how to perform differential cryptanalysis. Indistinguishability under chosen-plaintext attack, the standard definition of semantic security in cryptography, does sound like "make up rules and then pretend they describe reality and prove results".
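To make the "made-up rules" concrete, here is a minimal sketch of the chosen-plaintext indistinguishability game in Python. Everything in it is an illustrative assumption rather than a real scheme or library API: the toy cipher is a fixed-key XOR, and the names `ind_cpa_round`, `choose`, and `guess` are hypothetical. The point is only that the definition is a checkable game, and that a deterministic cipher loses it every time.

```python
import secrets

def make_deterministic_cipher(key: bytes):
    """Toy cipher (assumption for illustration): XOR with a fixed key, no randomness."""
    def encrypt(message: bytes) -> bytes:
        return bytes(m ^ k for m, k in zip(message, key))
    return encrypt

def ind_cpa_round(adversary_choose, adversary_guess, key_len: int = 16) -> bool:
    """Play one round of the game; return True if the adversary guesses the hidden bit."""
    key = secrets.token_bytes(key_len)
    encrypt = make_deterministic_cipher(key)   # adversary gets oracle access, not the key
    m0, m1 = adversary_choose(encrypt)
    assert len(m0) == len(m1)                  # the game requires equal-length messages
    b = secrets.randbelow(2)                   # challenger's hidden bit
    challenge = encrypt((m0, m1)[b])
    return adversary_guess(encrypt, m0, m1, challenge) == b

def choose(encrypt):
    # Two distinct 16-byte messages picked by the adversary.
    return b"attack at dawn!!", b"retreat at dusk!"

def guess(encrypt, m0, m1, challenge):
    # Because encryption is deterministic, re-querying the oracle on m0
    # reveals which message was encrypted.
    return 0 if encrypt(m0) == challenge else 1

wins = sum(ind_cpa_round(choose, guess) for _ in range(200))
```

A secure scheme would hold this adversary to roughly a coin flip; here it wins every round, which is exactly the kind of verdict the definition is there to deliver.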

In a sense, I think all math papers with a focus on definitions (as opposed to proofs) feel like this. Proofs are correct but trivial, so definitions are the real contribution, but the applicability of the definitions to the real world seems questionable. Proof-focused papers feel different because they are about accepted definitions whose applicability to the real world is not in question.

In a sense, I think all math papers with a focus on definitions (as opposed to proofs) feel like this.

I suspect one of the reasons OP feels dissatisfied with the corrigibility paper is that it is not the equivalent of Shannon's seminal results, which generally gave the correct definition of terms, but instead merely gestures at a problem ("we have no idea how to formalize corrigibility!").

That being said, I resonate a lot with this part of the reply:

Proofs [in conceptual/definition papers] are correct but trivial, so definitions are the real contribution,

... (read more)

I do like the comparison to cryptography, as that is a field I "take seriously" and it also has the issue of it being very difficult to "fairly" define terms.

Indistinguishability under chosen-plaintext attack being the definition for something to be canonically "secure" seems a lot more defensible than "properly modeling this random weird utility game maybe means something for AGI ??" but I get why it's a similar sort of issue.


I have a kinda symmetric feeling about "practical" research. "Okay, you have found that a one-layer transformer without MLPs approximates skip-trigram statistics; how does that generalize to the question 'does GPT-6 want to kill us all?'?" (I understand this feeling is not rational, it just shows my general inclination towards "theoretical" work.)

"Okay, you have found that a one-layer transformer without MLPs approximates skip-trigram statistics; how does that generalize to the question 'does GPT-6 want to kill us all?'?"

I understand this is more an illustration than a question, but I'll try answering it anyway because I think there's something informative about different perspectives on the problem :-)

Skip-trigrams are a foundational piece of induction heads, which are themselves a key mechanism for in-context learning. A Mathematical Framework for Transformer Circuits was published less than a year ago, IMO subsequent progress is promising, and mechanistic interpretability has been picked up by independent researchers and other labs (e.g. Redwood's project on GPT-2-small).

Of course the skip-trigram result isn't itself an answer to the question of whether some very capable ML system is planning to deceive the operator or seize power, but I claim it's analogous to a lemma in some paper that establishes a field and that said field is one of our most important tools for x-risk mitigation. This was even our hope at the time, though I expected both the research and the field-building to go more slowly - actual events are something like a 90th-percentile outcome relative to my expectations in October-2021.[1]

Finally, while I deeply appreciate theoretical/conceptual research as a complement to empirical and applied research and want both, how on earth is either meant to help alone? If we get a conceptual breakthrough but don't know how to build - and verify that we've correctly built - the thing, we're still screwed; conversely if we get really good at building stuff and verifying our expectations but don't expect some edge-case like FDT-based cooperation then we're still screwed. Efforts which integrate both at least have a chance, if nobody else does something stupid first.


  1. I still think it's pretty unlikely (credible interval 0-40%) that we'll have good enough interpretability tools by the time we really really need them, but I don't see any mutually-exclusive options which are better. ↩︎

Nitpick: 

IMO subsequent progress is promising,

This link was probably meant to go to the induction heads and in-context learning paper?

Fixed, thanks; it links to the transformer circuits thread, which includes the induction heads paper, SoLU, and Toy Models of Superposition.

This just feels like pretend, made-up research that they put math equations in to seem like it's formal and rigorous.

Can you elaborate which parts feel made-up to you? E.g.:

  • modelling a superintelligent agent as a utility maximizer
  • considering a 3-step toy model with , ,
  • assuming that a specification of exists

At the end of all those questions, I feel no closer to knowing if a machine would stop you from pressing a button to shut it off.

The authors do not claim to have solved the problem and instead state that this is an open problem. So it is not surprising that there is no satisfying answer.

I would also like to note that the paper has many more caveats.

Do you think it would still feel fake to you if the paper had a more positive answer to the problem described (e.g. a description of how to modify the utility function of an agent in a toy model such that it does not incentivize the agent to prevent/cause the pressing of the shutdown button)?

I suppose modelling a superintelligent agent as a utility maximizer feels a bit weird but not the weirdest thing, and I'm not sure I can mount a good defense saying that a superintelligent agent definitely wouldn't be aptly modeled by that.

More importantly, the 3-step toy model with  felt like a strange and unrelated leap

I don't know if it's about the not having an answer part. That is probably biasing me. But similar to the cryptography example, if someone defined what security would mean (let's say indistinguishability under chosen-plaintext attack) and then proceeded to say "I have no idea how to do that or if it's even possible," then I would still consider that real even though they didn't give us an answer.

Looking at the paper makes me feel like the authors were just having some fun discussing philosophy and not "ah yes this will be important for the fight later". But it is hard for me to understand why I feel that way.

I am somewhat satisfied by the cryptography comparison for now, but it is definitely hard to see how valuable this is as opposed to general interpretability research.

From a pure world-modelling perspective, the 3 step model is not very interesting, because it doesn't describe reality. It's maybe best to think of it from an engineering perspective, as a test case. We're trying to build an AI, and we want to make sure it works well. We don't know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has failed one of its unit tests, and should not be deployed to production (the real world). So the point of the paper is that a reasonable-sounding way you could design an AI with an off switch turns out to fail the unit-test.
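To make the unit-test framing concrete, here is a minimal sketch in Python. It is not the paper's formalism: the action names, the payoffs, and the press probability are all made-up assumptions, and the "naive" design (maximize the normal utility if the button is not pressed, the shutdown utility if it is) is just one reasonable-sounding candidate to test.

```python
# Hypothetical three-step toy: first action, button observation, second action.
A1 = ("work", "disable_button")   # first actions (assumed for illustration)
A2 = ("continue", "shut_down")    # second actions (assumed for illustration)
P_PRESS = 0.5                     # assumed chance the button is pressed if left alone

def u_normal(a2: str) -> float:
    """Made-up 'normal' utility: the agent is rewarded for carrying on."""
    return 10.0 if a2 == "continue" else 0.0

def u_shutdown(a2: str) -> float:
    """Made-up 'shutdown' utility: the agent is rewarded for shutting down."""
    return 1.0 if a2 == "shut_down" else 0.0

def press_probability(a1: str) -> float:
    # Disabling the button guarantees it is never pressed.
    return 0.0 if a1 == "disable_button" else P_PRESS

def expected_utility(a1: str) -> float:
    """Naive design: shutdown utility applies if pressed, normal utility otherwise."""
    p = press_probability(a1)
    best_if_pressed = max(u_shutdown(a2) for a2 in A2)
    best_if_not_pressed = max(u_normal(a2) for a2 in A2)
    return p * best_if_pressed + (1 - p) * best_if_not_pressed

def naive_first_action() -> str:
    # The agent simply picks the first action with the highest expected utility.
    return max(A1, key=expected_utility)

def passes_shutdown_test() -> bool:
    """The 'unit test': a corrigible design should not prefer disabling its own button."""
    return naive_first_action() != "disable_button"
```

With these numbers the naive agent disables the button, so the design fails the test. Flipping the payoffs so that shutdown is worth more would instead push the agent to get the button pressed, which is the other failure the toy model is meant to catch.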

I do generally think that too many of the AI-related posts here on LessWrong are "not real" in the way you're suggesting, but this paper in particular seems "real" to me (whatever that means). I find the most "not real" posts are the verbose ones piled high with vague wordy abstractions, without an equation in sight. The equations in the corrigibility paper aren't there to seem impressive, they're there to unambiguously communicate the math the paper is talking about, so that if the authors have made an error of reasoning, it will be as obvious as possible. The way you keep something in contact with reality is by checking it either against experiment or against the laws of mathematics. To quote Feynman, "if it disagrees with experiment, it's wrong", and similarly, there's a standard in mathematics that statements must be backed up by checkable calculations and proofs. So long as the authors are holding themselves to that standard (and so long as you agree that any well-designed AI should be able to perform well in this easy test case), then it's "real".

idk I think that reaction to miri is pretty common. how do you feel about this one? https://arxiv.org/abs/2208.08345
