Intuitions about solving hard problems

Hmm. I suppose a similar key insight for my own line of research might go like:

The orthogonality thesis is actually wrong for brain-like learning systems. Such systems first learn many shallow proxies for their reward signal. Moreover, the circuits implementing these proxies are self-preserving optimization demons. They’ll steer the learning process away from the true data generating process behind the reward signal so as to ensure their own perpetuation.

If true, this insight matters a lot for value alignment because it points to a way that aligned behavior in the infra-human regime could perpetuate into the superhuman regime. If all of:

We can instil aligned behavior in the infra-human regime
The circuits that implement aligned behavior in the infra-human regime can ensure their own perpetuation into the superhuman regime
The circuits that implement aligned behavior in the infra-human regime continue to implement it in the superhuman regime

hold true, then I think we’re in a pretty good position regarding value alignment. Off-switch corrigibility is a bust though because self-preserving circuits won’t want to let you turn them off.

If you’re interested in some of the actual arguments for this thesis, you can read my answer to a question about the relation between human reward circuitry and human values.

Richard_Ngo (4y)

I think this is very interesting, and closely related to a line of thinking I've been pursuing; stay tuned for a forthcoming post which talks about the development of shallow proxies (although I'm not thinking of it as a particularly strong reason for optimism).

John Schulman (4y)

Weight-sharing makes deception much harder.

Could you explain or provide a reference for this?

Johannes Treutlein (4y)

I'd also be curious about this!

Johannes Treutlein (3y)

I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?

adamShimi (4y)

I like that you're proposing an explicit heuristic inspired by the history of science for judging research directions and approaches, and acknowledge that it leads to conclusions that are counterintuitive to my Richard-model (pushing for Agent Foundations, for example), so you're not just retrofitting your own conclusions AFAIK. I also like that you're applying it to object-level directions in alignment; that's something I'm working on at the moment for my own research, based on your pushback.

That being said, my prediction/retrodiction is that this is too strong a criterion, for reasons already discussed in this post. Basically I expect that for most if not all of the great scientific solutions you mention, if you back up enough (sometimes you don't need to back up that far), you will find a step, an idea, an insight that proved crucial down the line but didn't look like the right type. Even in the post there's a sort of weird double standard where you implicitly discuss Darwin and Einstein after they have matured their theories, whereas you talk about Turing before he proves a non-trivial result or designs a non-trivial algorithm. The extension of my prediction here is that during the long process that these thinkers (and other examples) took to arrive at their insights, they built on models and ideas that revealed bits of evidence but were jank, incorrect, and eventually used as scaffolding then thrown away.

Note that this is a rather empirical prediction about the history of science, and that I'm curious of any counterexample you or anybody else would find to it.

Another issue I see is that often the insights redefined the rules of the game. Galileo for example (in the Feyerabend interpretation at least) redefines "rest state" and "movement" to be consistent with a moving earth. You could say that these are insights that are compelling, but from the perspective of that time in history, it looks more like changing the rules of the game (of what a theory of the stars has to deal with) by changing the natural interpretations associated with it.

carboniferous_umbraculum (4y)

I broadly agree with Richard's main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

adamShimi (4y)

Thanks for the answer.

One thing I would say in response to your comment, Adam, is that I don't usually see the message of your linked post as being incompatible with Richard's main point. I think one usually does have or does need productive mistakes that don't necessarily or obviously look like they are robust partial progress. But still, often when there actually is a breakthrough, I think it can be important to look for this "intuitively compelling" explanation. So one thing I have in mind is that I think it's usually good to be skeptical if a claimed breakthrough seems to just 'fall out' of a bunch of partial work without there being a compelling explanation after the fact.

Hum, I'm not sure I'm following your point. Do you mean that you can have both productive mistakes and intuitively compelling explanations when the final (or even an intermediary) breakthrough is reached? Then I totally agree. My point was more that if you only use Richard's heuristic, I expect you to not reach the breakthrough, because you would have nipped in the bud many productive mistakes that actually lead the way there.

There's also a very Kuhnian thing here that I didn't really mention in my previous comment (except on the Galileo part): the compellingness of an answer is often stronger after the fact, when you work in the paradigm that it led to. That's another aspect of productive mistakes or even breakthroughs: they don't necessarily look right or predict more from the start, and evaluating their consequences is not necessarily obvious.

Richard_Ngo (4y)

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd be pretty happy with that.

Why do I care about this? It has uncomfortable tinges of status regulation, but I think it's important because there are so many people reading about this research online, and trying to find a way into the field, and often putting the people already in the field on some kind of intellectual pedestal. Stating clearly the key insights of a given approach, and their epistemic status, will save them a whole bunch of time. E.g. it took me ages to work through my thoughts on myopia in response to Evan's posts on it, whereas if I'd known it hinged on some version of the insight I mentioned in this post, I would have immediately known why I disagreed with it.

As an example of (I claim) doing this right, see the disclaimer on my "shaping safer goals" sequence: "Note that all of the techniques I propose here are speculative brainstorming; I'm not confident in any of them as research directions, although I'd be excited to see further exploration along these lines." Although maybe I should make this even more prominent.

Lastly, I don't think I'm actually comparing Darwin and Einstein's mature theories to Turing's incomplete theory. As I understand it, their big insights required months or years of further work before developing into mature theories (in Darwin's case, literally decades).

adamShimi (4y)

I like this pushback, and I'm a fan of productive mistakes. I'll have a think about how to rephrase to make that clearer. Maybe there's just a communication problem, where it's hard to tell the difference between people claiming "I have an insight (or proto-insight) which will plausibly be big enough to solve the alignment problem", versus "I have very little traction on the alignment problem but this direction is the best thing I've got". If the only effect of my post is to make a bunch of people say "oh yeah, I meant the second thing all along", then I'd be pretty happy with that.

When phrased like that, I agree with you. I am personally relatively suspicious of claims by a bunch of people to have found a path to alignment, but actually excited by some of their productive mistakes (as discussed a bit in my post).

I also fully agree that I want people to use the second, and my "history of alignment" research direction aims at concretely teasing out the productive mistakes and revealed bits of evidence without falling for either "this is obviously a solution" or "this is obviously not a solution and thus useless".

Why do I care about this? It has uncomfortable tinges of status regulation, but I think it's important because there are so many people reading about this research online, and trying to find a way into the field, and often putting the people already in the field on some kind of intellectual pedestal. Stating clearly the key insights of a given approach, and their epistemic status, will save them a whole bunch of time. E.g. it took me ages to work through my thoughts on myopia in response to Evan's posts on it, whereas if I'd known it hinged on some version of the insight I mentioned in this post, I would have immediately known why I disagreed with it.

+1000. And teasing out more generally the assumptions, the insights, and the new parts of works and approaches is, I think, super necessary and on my research agenda. That's also part of the reason why I feel asking newcomers to be distillers is not necessarily a great idea: good distillation of the type we're discussing requires IMO quite a deep understanding of the landscape, the problem and the underlying ideas. Otherwise you at best get a decent summary, and we need more.

As an example of (I claim) doing this right, see the disclaimer on my "shaping safer goals" sequence: "Note that all of the techniques I propose here are speculative brainstorming; I'm not confident in any of them as research directions, although I'd be excited to see further exploration along these lines." Although maybe I should make this even more prominent.

Haven't reread your sequence in quite some time, but I think the value of such an exploratory sequence is to make clearer the intuitions underlying the direction, even if they haven't yet led to productive mistakes. So I like your disclaimer, but I think the even better way of doing this is to clarify, for different posts and ideas, what intuitions you're building on and where the current formalisms/descriptions/analogies are failing to capture them.

Lastly, I don't think I'm actually comparing Darwin and Einstein's mature theories to Turing's incomplete theory. As I understand it, their big insights required months or years of further work before developing into mature theories (in Darwin's case, literally decades).

This might also be a bit of miscommunication, but I felt like your discussion of Turing could also have applied especially in Darwin's case, where the initial insight required a lot of additional pieces and clarification to make a clean and ordered theory that you can actually defend. Generally I was pointing at the risk of hindsight bias, where the fact that the insight is clean and powerful once the full theory is known and considered didn't mean it was so compelling at the time it was thought of. (Which is also a general empirical claim about the history of scientific progress, to explore ;) )

carboniferous_umbraculum (4y)

Yes I think you understood me correctly. In which case I think we more or less agree in the sense that I also think it may not be productive to use Richard's heuristic as a criterion for which research directions to actually pursue.

iivonen (4y)

One thing I'm interested in, but don't know where to start looking for, is people working on the reverse direction: mathematical approaches which show that aligned AI is not possible or not likely. By this I mean formal work that suggests something like "almost all AGIs are unsafe", in the same way that the chance of picking a rational number at random from the reals is zero because almost all real numbers are irrational.
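For reference, here is a sketch of the standard measure-zero argument behind that analogy. It assumes the draw is uniform on the unit interval [0, 1] (a uniform draw from the whole real line is not well-defined), and the symbols λ (Lebesgue measure), q_n and I_n are introduced here for illustration only.

```latex
\begin{align*}
&\text{Enumerate the rationals in } [0,1] \text{ as } q_1, q_2, q_3, \dots
  \text{ (possible because } \mathbb{Q} \text{ is countable).} \\
&\text{For any } \varepsilon > 0, \text{ cover } q_n \text{ by the interval }
  I_n = \bigl(q_n - \varepsilon 2^{-(n+1)},\; q_n + \varepsilon 2^{-(n+1)}\bigr),
  \quad \lambda(I_n) = \varepsilon 2^{-n}. \\
&\lambda\bigl(\mathbb{Q} \cap [0,1]\bigr)
  \;\le\; \sum_{n=1}^{\infty} \lambda(I_n)
  \;=\; \sum_{n=1}^{\infty} \varepsilon 2^{-n}
  \;=\; \varepsilon. \\
&\text{Since } \varepsilon \text{ was arbitrary, }
  \Pr\bigl[X \in \mathbb{Q}\bigr] = \lambda\bigl(\mathbb{Q} \cap [0,1]\bigr) = 0
  \text{ for } X \sim \mathrm{Uniform}[0,1].
\end{align*}
```

An "almost all AGIs are unsafe" result would presumably need an analogous ingredient: a well-defined measure over the space of possible systems, which is exactly the part that is unclear.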

I don't say this to be a downer! I mean it in the sense of a mathematician who spent 7 years attempting to prove X exists, and then sits down one day and spends 4 hours proving why X cannot exist. Progress can take surprising forms!

anonymousaisafety (4y)

I have been working on an argument from that angle.

I've been developing it independently from my own background in autonomous safety-critical hardware/software systems, but I discovered recently that it's very similar to Drexler's CAIS from 2019, except with more focus on low-level evidence or rationale for why certain claims are justified.

It isn't so much a pure mathematical approach as it is a systems engineering or systems safety perspective on all of the problems [1] [2] that would remain even if someone showed up tomorrow and dropped a formally verified algorithm describing an "aligned AGI" onto my desk, and what ramifications that has for the development of AGI at all. The only complicated math in it so far is about computational complexity classes and relatively simple if-then logic for analyzing risk vectors.

I guess if I had to pick the "key" insight that I claim I can contribute, and share it now, it would be this:

If you define super-human performance in terms of some abstract thing called "intelligence" (or "general intelligence"), you run into a philosophical question: "what is intelligence?"
There is an answer accepted by this community that intelligence is "efficient cross-domain optimization".
From this answer, it follows that "general intelligence" is a prerequisite, so we can only rank solutions on tasks by evaluating them in the context of "general intelligence". In this way, the community can dismiss super-human performance on a specific task if that solution is not immediately generalizable to other tasks.
This also dismisses solutions that can be generalized to other tasks by taking a known algorithm for training a specific solution and deploying that algorithm on a specific task. The latter solution might be something like DeepMind's research into AlphaGo, then AlphaZero, then AlphaFold, and then Ithaca. That would seem to demonstrate a repeatable engineering process that can be deployed to a specific task and develop a solution with super-human performance on that task without that specific solution generalizing to other tasks.
... [text omitted]
We've achieved super-human or "human" performance in Go, Chess, protein folding, image recognition, language recognition, art generation, code generation, translation, and many other fields using AI/ML systems that do not, in any way, demonstrate "general intelligence".
... [text omitted]
Another way to think about this is to ask if what we call "general intelligence" is ultimately an inefficient algorithm for solving problems, despite the earlier claim that the definition of intelligence was "efficient cross-domain optimization".
I.e.: what if you can always solve problems faster and more efficiently by deploying AI/ML algorithms without "general intelligence", than a hypothetical "general intelligence" algorithm would be able to do, even if that algorithm was deployed to specialized hardware?
The closest I've seen to someone posing this question was Peter Watts' sci-fi novel Blindsight [11], but that was more focused on the idea of "consciousness" vs "general intelligence".
In a world where the algorithm that we'd recognize as "general intelligence" is fundamentally inefficient, we'd see seemingly remarkable and unexplained gains on AI/ML systems across a variety of unrelated problem domains where not one of those AI/ML systems has a capability we'd recognize as "general intelligence".

If you've read CAIS, you might recognize the above argument, where it was worded as:

In particular, taking human learning as a model for machine learning has encouraged the conflation of intelligence-as-learning-capacity with intelligence-as-competence, while these aspects of intelligence are routinely and cleanly separated in AI system development: Learning algorithms are typically applied to train systems that do not themselves embody those algorithms. [CAIS 11.7]

When this idea was proposed in 2019, it seems to me like it was criticized because people didn't see how task-focused AI/ML systems could keep improving and eventually surpass human performance without somehow developing "general intelligence" along the way, plus a general skepticism that there would be rational reasons to not "just" staple every single hypothetical task together inside a system and call it AGI. I really think it's worth looking at this again in light of the last 3 years and asking if that criticism was justified.

[1]
In systems safety, we're concerned with the safety of a larger system than the usual "product-focused" mindset. It is not enough for there to be a proof that a hypothetical product as-designed is safe. We also need to look at the likelihood of:
- design failures (the formal proof was wrong because the verification of it had a bug, there is no formal proof, the "formally verified" proof was actually checked by humans and not by an automated theorem prover)
- manufacturing failures (hardware behavior out-of-spec, missed errata, power failures, bad ICs, or other failure of components)
- implementation failures (software bugs, compiler bugs, differences between an idealized system in a proof vs the implementation of that system in some runtime or with some language)
- verification failures (bugs in tests that resulted in a false claim that the software met the formal spec)
- environment or runtime failures (e.g. radiation-induced upsets like bit flips; Does the system use voting? Is the RAM using ECC? What about the processor itself?)
- usage failures (is the product still safe if it's misused? what type of training or compliance might be required? is maintenance needed? is there some type of warning or lockout on the device itself if it is not actively maintained?)
- process failures ("normalization of deviance")
[2]
For each of these failure modes, we then look at the worst-case magnitude of that failure. Does the failure result in non-functional behavior, or does it result in erroneous behavior? Can erroneous behavior be detected? By what? Etc. This type of review is called an FMEA. This review process can rule out designs that "seem good on paper" if there's sufficient likelihood of failures and inability to mitigate them to our desired risk tolerances outside of just the design itself, especially if there exist other solutions in the same design space that do not have similar flaws.
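As a rough illustration of the kind of review described above, here is a minimal sketch of FMEA-style ranking using the conventional risk priority number (severity × occurrence × detection). The class, the function names, the 1 to 10 scales, the scores and the tolerance threshold are all hypothetical values chosen for the example; they are not taken from any real review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic); hypothetical scale
    occurrence: int  # 1 (rare) .. 10 (frequent); hypothetical scale
    detection: int   # 1 (almost always detected) .. 10 (undetectable)

    def rpn(self) -> int:
        """Risk priority number: product of the three scores."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn(), reverse=True)


def exceeds_tolerance(modes, max_rpn):
    """Return the failure modes whose RPN is above the accepted threshold."""
    return [m for m in modes if m.rpn() > max_rpn]


# Illustrative scores only; a real FMEA would derive these from evidence.
MODES = [
    FailureMode("design: proof checked by hand, not by a prover", 9, 3, 8),
    FailureMode("implementation: compiler bug", 7, 2, 6),
    FailureMode("runtime: radiation-induced bit flip", 6, 4, 3),
]

if __name__ == "__main__":
    for mode in rank_by_risk(MODES):
        print(f"{mode.rpn():4d}  {mode.name}")
    flagged = exceeds_tolerance(MODES, max_rpn=100)
    print("needs mitigation:", [m.name for m in flagged])
```

A design can be rejected on this basis even when its proof is sound, because the flagged failure modes sit outside the design itself.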

David Scott Krueger (formerly: capybaralet) (4y)

Weight-sharing makes deception much harder.

Can I read about that somewhere? Or could you briefly elaborate?

Nathan Helm-Burger (4y)

My hope for my personal work is that nibbling away at the mystery with 'prosaic engineering work' will make the problem clearer, such that profound insights will be easier to generate. I think in science generally it is a good heuristic that when no clear theory exists, and gathering more data is an option, you should go gather the data. Also, use engineering to build better tools with which to gather new data.

Richard_Ngo (4y)

Oh yeah, I totally agree with this. Will edit into the piece.

Eric Drexler (3y)

I’d like to promote a norm for proposals for alignment techniques to be very explicit about where the hard work is done, i.e. which part is surprising or insightful or novel enough to make us think that it could solve alignment even in worlds where that’s quite difficult.

Alignment is, by nature, an engineering task, not a scientific task: It is an attempt to make something, not to understand some existing thing. It may be that, as you suggest, “solving hard scientific problems usually requires compelling insights”, but this is beside the point. Spaceflight was a hard problem, but was solved without a special, compelling insight. Likewise for the progress of computation from vacuum tubes to nanoscale electronics. Both are in the domain of engineering, where problems are typically solved by improving and composing many components. Asking “which part solves the hard problem” would be a mistake.

Regarding the CAIS model, you suggest that it “dramatically underrates the importance of general intelligence”, yet I have argued that the comprehensive AI services model (including the service of developing new services) is a way of thinking about implementations of general intelligence, not a substitute for it!

The capabilities of large language models should update our expectations, but do not persuade me that knowledge and skills of societal scale and diversity must or will be embodied in an undifferentiated blob of computation.

By the way, I haven’t suggested the CAIS model as a solution to alignment problems; instead of proposing a solution, it suggests that alignment problems are likely to arise (and perhaps be solved) in a context different from what has often been assumed. Some problems seem more tractable in that context, others less.

David Scott Krueger (formerly: capybaralet) (4y)

This is Eliezer’s description of the core insight behind Paul’s imitative amplification proposal. I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

I didn't understand what you mean by the line being blurrier... Is this a comment about what works in practice for imitation learning? Does a similar objection apply if we replace imitation learning with behavioral cloning?

Evan R. Murphy (4y)

Overall I think this is a good post and very interesting, thanks.

I find this somewhat compelling, but less so than I used to, since I’ve realized that the line between imitation learning and reinforcement learning is blurrier than I used to think (e.g. see this or this).

So I checked out those links. Briefly looking at them, I can see what you mean about the line between RL and imitation learning being blurry. The first paper seems to show a version of RL which is basically imitation learning.

I'm confused because when you said this makes iterated amplification less compelling to you, I took that to mean it made you less optimistic about iterated amplification as a solution for alignment. But why would whether something is technically classified as imitation learning or a special kind of RL make a difference for its effectiveness?

Or did you mean not that you find it any less promising as an alignment proposal, but just that you now find the core insight less compelling/interesting because it's not as major an innovation over the idea of RL as you had thought it was?

Oleg S. (4y)

I don’t know too much about alignment research, but what surprises me most is the lack of discussion of two points:

For alignment to work, its theory should not only tell humans how to create an aligned super-human AGI, but also tell that AGI how to self-improve without destroying its own values. Otherwise, how does a paperclip optimizer which is marginally smarter than a human make sure that its next iteration will still care about paperclips? A good alignment theory should work across all intelligence levels.
What are the practical implications of alignment research in a world where AGI is hard? Imagine we have a good alignment theory but do not have AGI. I would assume that the theory can be used to manipulate existing superintelligent systems such as science, the deep state, or the stock market. The reverse of this is: does alignment research have any results which can be practically used right now?

sanxiyn (4y)

Since you specifically mentioned Gödel: what was Gödel's insight?

Algon (4y)

EDIT: I am talking about first order logic here.

DOUBLE EDIT: I didn't actually describe the insight required in proving the completeness theorem and compactness theorem. He used the former to prove the latter (because all proofs are finite, and if something is inconsistent, it can't have any models, so every statement must be provable). I don't know what his key insight was for the compactness theorem, as I've only proven it via the completeness theorem.

That logic could be arithmetized, i.e. you could use the simplest arithmetic system to mimic logical deduction, and hence could use the logic of the simplest arithmetic system to talk about itself, letting you get all those nasty self-reference paradoxes. Doing that was highly non-trivial. Of course, he also proved a bunch of other important theorems, like the completeness theorem: if every model of a set of axioms satisfies some statement T, then the set of axioms must prove that statement. And also the compactness theorem: a set of logical statements is consistent iff every finite subset is consistent.
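To make "arithmetizing logic" a bit more concrete, here is a toy sketch of prime-power Gödel numbering: a formula (a string of symbols) is mapped to a single natural number and recovered again by factoring. The symbol table and function names are made up for illustration, and this is only the encoding step, not Gödel's actual construction of a provability predicate.

```python
# Hypothetical symbol table for a tiny arithmetic language; codes start at 1
# so that every symbol contributes a non-zero exponent.
SYMBOLS = ["0", "S", "+", "*", "=", "(", ")", "x"]
CODE = {sym: i + 1 for i, sym in enumerate(SYMBOLS)}
DECODE = {code: sym for sym, code in CODE.items()}


def primes():
    """Yield 2, 3, 5, 7, ... by trial division against earlier primes."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1


def godel_number(formula: str) -> int:
    """Encode symbols s_1..s_k as p_1^code(s_1) * ... * p_k^code(s_k)."""
    n = 1
    for p, sym in zip(primes(), formula):
        n *= p ** CODE[sym]
    return n


def decode(n: int) -> str:
    """Recover the formula by reading off the exponent of each prime."""
    out = []
    for p in primes():
        if n == 1:
            break
        exponent = 0
        while n % p == 0:
            n //= p
            exponent += 1
        out.append(DECODE[exponent])
    return "".join(out)


if __name__ == "__main__":
    g = godel_number("S0=S0")  # the formula "1 = 1" in successor notation
    print(g, decode(g))
```

Once formulas are numbers, statements about formulas (and about sequences of formulas, i.e. proofs) become statements about numbers, which is what lets arithmetic talk about itself.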

adamShimi (4y)

I assumed that Richard meant the incompleteness theorems, of which the first is quite easy to boil down to one insight (the liar paradox but with provability).
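Spelled out, the "liar paradox but with provability" insight looks roughly like this. The notation is assumed for the sketch: T is a consistent, effectively axiomatized theory containing enough arithmetic, Prov_T is its provability predicate, and ⌜G⌝ is the Gödel number of G.

```latex
\begin{align*}
&\text{Diagonal lemma: there is a sentence } G \text{ with }
  T \vdash G \leftrightarrow \neg\,\mathrm{Prov}_T(\ulcorner G \urcorner). \\
&\text{Suppose } T \vdash G. \text{ Then } T \vdash \mathrm{Prov}_T(\ulcorner G \urcorner)
  \text{ (a proof of } G \text{ exists, and } T \text{ can verify it),} \\
&\quad\text{but also } T \vdash \neg\,\mathrm{Prov}_T(\ulcorner G \urcorner)
  \text{ (by the equivalence above), contradicting consistency.} \\
&\text{Hence } T \nvdash G, \text{ so } G \text{ is true (it says exactly that it is unprovable) yet unprovable in } T.
\end{align*}
```

Showing that T does not prove ¬G either needs a slightly stronger assumption (ω-consistency in Gödel's original proof, or Rosser's modified sentence).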

^{^}

I model Eliezer as agreeing with most of the claims I make in this post, but strongly disagreeing with this sentence, because he thinks that the core problem is so hard that no amount of prosaic engineering effort could plausibly prevent catastrophe in the absence of major novel insights.

^{^}

Some brief intuitions about why: I think the hardest part of human cognition is generating and merging different ontologies. Thinking “within” an ontology is like doing normal research in a scientific field; reasoning about different ontologies is like doing philosophy, or doing paradigm-breaking research, and so it seems like a particularly difficult thing to generate a training signal for.

^{^}

Thanks to Nathan Helm-Burger for reminding me of this, with his comment.
