Wiki Contributions


Can you give an example of a theoretical argument of the sort you'd find convincing? Can be about any X caring about any Y.

On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It’s also possible this points to you having already accepted from other places the evidence I think evolution introduces, so you’re confused why people keep citing it as evidence.

The ability to write fiction in a world does not demonstrate its plausibility. Beware generalizing from fictional fictional evidence!

The claim that such a world is impossible is a claim that, were you to try to write a fictional version of it, you would run into major holes in the world that you would have to either ignore or paper over with further unrealistic assumptions.

In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.

Nearly every piece of empirical evidence I've seen contradicts this - more capable systems are generally easier to work with in almost every way, and the techniques that worked on less capable versions straightforwardly apply and in fact usually work better than on less intelligent systems.

When I explain my counterargument to pattern 1 to people in person, they will very often try to "rescue" evolution as a worthwhile analogy for thinking about AI development. E.g., they'll change the analogy so it's the programmers who are in a role comparable to evolution, rather than SGD.

In general one should not try to rescue intuitions, and the frequency of doing this is a sign of serious cognitive distortions. You should only try to rescue intutions when they have a clear and validated predictive or pragmatic track record.

The reason for this is very simple - most intuitions or predictions one could make are wrong, and you need a lot of positive evidence to privilege any particular hypotheses re how or what to think. In the absence of evidence, you should stop relying on an intuition, or at least hold it very lightly.

The obvious question here is to what degree do you need new techniques vs merely to train new models with the same techniques as you scale current approaches.


One of the virtues of the deep learning paradigm is that you can usually test things at small scale (where the models are not and will never be especially smart) and there's a smooth range of scaling regimes in between where things tend to generalize.


If you need fundamentally different techniques at different scales, and the large scale techniques do not work at intermediate and small scales, then you might have a problem. If you need the same techniques as at medium or small scales for large scales, then engineering continues to be tractable even as algorithmic advances obsolete old approaches.

It's more like calling a human who's as smart as you are and directly plugged into your brain and in fact reusing your world model and train of thought directly to understand the implications of your decision. That's a huge step up from calling a real human over the phone!

The reason the real human proposal doesn't work is that

  1. the humans you call will lack context on your decision
  2. they won't even be able to receive all the context
  3. they're dumber and slower than you so even if you really could write out your entire chain of thoughts and intuitions consulting them for every decision would be impractical

Note that none of these considerations apply to integrated language models!

ML models in the current paradigm do not seem to behave coherently OOD but I'd bet for nearly any metric of "overall capability" and alignment that the capability metric decays faster vs alignment as we go further OOD.


See for an example of the kinds of things you'd expect to see when taking a neural network OOD. It's not that the model does some insane path-dependent thing, it collapses to entropy. You end up seeing a max-entropy distribution over outputs not goals. This is a good example of the kind of thing that's obvious to people who've done real work with ml but very counter to classic LessWrong intuitions and isn't learnable by implementing mingpt.

Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.


Everyone agrees that you assumed superintelligence would understand everything humans understand and more. The dispute is entirely about the things that you encounter before superintelligence. In general it seems like the world turned out much more gradual than you expected and there's information to be found in what capabilities emerged sooner in the process.


We should clearly care if their arguments were wrong in the past, especially if they were systematically wrong in a particular direction, as it's evidence about how much attention we should pay to their arguments now. At some point if someone is wrong enough for long enough you should discard their entire paradigm and cease to privilege hypotheses they suggest, until they reacquire credibility through some other means e.g. a postmortem explaining what they got wrong and what they learned, or some unambiguous demonstration of insight into the domain they're talking about.

Load More