Beware boasting about non-existent forecasting track records

As an example of the kind of point that one might use in deciding who "came off better" in the FOOM debate, Hanson predicted that "AIs that can parse and use CYC should be feasible well before AIs that can parse and use random human writings", which seems pretty clearly falsified by large language models—and that also likely bears on Hanson's view that "[t]he idea that you could create human level intelligence by just feeding raw data into the right math-inspired architecture is pure fantasy".

As you point out, however, this exercise of looking at what was said and retrospectively judging whose worldview seemed "less surprised" by what happened is definitely not the same thing as a forecasting track record. It's too subjective; rationalizing why your views are "less surprised" by what happened than some other view (without either view having specifically predicted what happened) is not hugely more difficult than rationalizing your views in the first place.

Beware boasting about non-existent forecasting track records

The comments about Metaculus ("jumped ahead of 6 years earlier") make more sense if you interpret them as being about Yudkowsky already having "priced in" a deep-learning-Actually-Works update in response to AlphaGo in 2016, in contrast to Metaculus forecasters needing to see DALL-E 2/PaLM/Gato in 2022 in order to make "the same" update.

(That said, I agree that Yudkowsky's sneering in the absence of a specific track record is infuriating; I strong-upvoted this post.)

Should we buy Google stock?

Without necessarily disagreeing, I'm curious exactly how far back you want to push this. The natural outcome of technological development has been clear to sufficiently penetrating thinkers since the nineteenth century. Samuel Butler saw it. George Eliot saw it. Following Butler, should "every machine of every sort [...] be destroyed by the well-wisher of his species", so that we "at once go back to the primeval condition of the race"?

In 1951, Turing wrote that "it seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers [...] At some stage therefore we should have to expect the machines to take control".

Turing knew. He knew, and he went and founded the field of computer science anyway. What a terrible person, right?

But What's Your *New Alignment Insight,* out of a Future-Textbook Paragraph?

To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.

To what extent do you expect this to generalize "correctly" outside of the training environment?

In your linked comment, you mention humans being averse to wireheading, but I think that's only sort-of true: a lot of people who successfully avoid trying heroin because they don't want to become heroin addicts do still end up abusing a lot of other evolutionarily novel superstimuli, like candy, pornography, and video games.

That makes me think inner-misalignment is still going to be a problem when you scale to superintelligence: maybe we evolve an AI "species" that's genuinely helpful to us in the roughly human-level regime (where its notion of helping and our notion of being-helped coincide very well), but when the AIs become more powerful than us, they mostly discard the original humans in favor of optimized AI-"helping"-"human" superstimuli.

I guess I could imagine this being an okay future if we happened to get lucky about how robust the generalization turned out to be—maybe the optimized AI-"helping"-"human" superstimuli actually are living good transhuman lives, rather than being a nonsentient "sex toy" that happens to be formed in our image? But I'd really rather not bet the universe on this (if I had the choice not to bet).

The New Right appears to be on the rise for better or worse

Querying the search feature for "Mencius" suggests that he commented exactly once, in November 2007 (on Overcoming Bias; the account and comment were ported over in the transition to Less Wrong).

Best wishes,
Less Wrong Reference Desk

MIRI announces new "Death With Dignity" strategy

Likelihood ratios can be easier to evaluate than absolute probabilities insofar as you can focus on the meaning of a single piece of evidence, separately from the context of everything else you believe.

Suppose we're trying to pick a multiple-dial combination lock (in the dark, where we can't see how many dials there are). If we somehow learn that the first digit is 3, we can agree that that's log₂ 10 ≈ 3.32 bits of progress towards cracking the lock, even if you think there are three dials total (such that we only need 6.64 more bits) and I think there are 10 dials total (such that we need 29.90 more bits).
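The lock arithmetic can be checked directly. A minimal sketch, assuming each dial has ten positions (0 to 9): one learned digit is worth log₂ 10 bits however many dials there are, while the bits still needed depend on each person's hypothesis about the dial count. The function name `bits_for_dials` is just for illustration.

```python
import math

DIGITS_PER_DIAL = 10  # assumption: each dial has ten positions, 0-9


def bits_for_dials(n_dials: int) -> float:
    """Bits needed to single out one combination among 10**n_dials."""
    return n_dials * math.log2(DIGITS_PER_DIAL)


# Learning one dial's digit rules out 9/10 of the remaining combinations,
# whatever the total number of dials turns out to be.
progress = math.log2(DIGITS_PER_DIAL)

for hypothesized_dials in (3, 10):
    # After the first digit is known, the other dials remain to be found.
    remaining = bits_for_dials(hypothesized_dials - 1)
    print(f"{hypothesized_dials} dials: {progress:.2f} bits gained, "
          f"{remaining:.2f} bits to go")
```

Both hypotheses agree on the 3.32 bits gained; they disagree only about how much is left, which is the sense in which the evidence can be evaluated separately from everything else each person believes.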

Similarly, alignment pessimists and optimists might be able to agree that reinforcement learning from human feedback techniques are a good puzzle piece to have (we don't have to write a reward function by hand! that's progress!), even if pessimists aren't particularly encouraged overall (because Goodhart's curse, inner-misaligned mesa-optimizers, &c. are still going to kill you).

But What's Your *New Alignment Insight,* out of a Future-Textbook Paragraph?

"radically transformed in a way that doesn't end with any of the existing humans being alive" is what I meant by "destroyed"

Great, we're on the same page.

That's the thing that very few current humans would do, given sufficient power. That's the thing that we're concerned that future AIs might do, given sufficient power.

I think I'm expressing skepticism that inner-misaligned adaptations in simple learning algorithms are enough to license using current humans as a reference class quite this casually?

The "traditional" Yudkowskian position says, "Just think of AI as something that computes plans that achieve outcomes; logically, a paperclip maximizer is going to eat you and use your atoms to make paperclips." I read you as saying that AIs trained using anything like current-day machine learning techniques aren't going to be pure consequentialists like that; they'll have a mess of inner-misaligned "adaptations" and "instincts", like us. I agree that this is plausible, but I think it suggests "AI will be like another evolved species" rather than "AI will be like humans" as our best current-world analogy, and the logic of "different preferences + more power = genocide" still seems likely to apply across a gap that large (even if it's smaller than the gap to a pure consequentialist)?

But What's Your *New Alignment Insight,* out of a Future-Textbook Paragraph?

The vast majority of humans would not destroy the world [...] This is in stark contrast to how AIs are assumed to destroy the world by default.

Most humans would (and do) seek power and resources in a way that is bad for other systems that happen to be in the way (e.g., rainforests). When we colloquially talk about AIs "destroying the world" by default, it's a very self-centered summary: the world isn't actually "destroyed", just radically transformed in a way that doesn't end with any of the existing humans being alive, much like how our civilization transforms the Earth in ways that cut down existing forests.

You might reply: but wild nature still exists; we don't cut down all the forests! True, but an important question here is to what extent that is due to "actual" environmentalist/conservationist preferences in humans, and to what extent it is just that we "didn't get around to it yet" at our current capability levels.

In today's world, people who care about forest animals, and people who enjoy the experience of being in a forest, both have an interest in protecting forests. In the limit of arbitrarily advanced technology, this is less obvious: it's probably more efficient to turn everything into an optimal computing substrate, and just simulate happy forest animals for the animal lovers and optimal forest scenery for the scenery-lovers. Any fine details of the original forest that the humans don't care about (e.g., the internals of plants) would be lost.

This could still be good news, if it turns out to be easy to hit upon the AI analogue of animal-lovers (because something like "raise the utility of existing agents" is a natural abstraction that's easy to learn?), but "existing humans would not destroy the world" seems far too pat. (We did! We're doing it!)

[Linkpost] New multi-modal Deepmind model fusing Chinchilla with images and videos

Appendix D of the paper shows the prompt for the dialogue examples, which starts with:

This is a conversation between a human, User, and an intelligent visual AI, Flamingo. User sends images, and Flamingo describes them.

and then gives three few-shot examples with "User:" and "Flamingo:" labels.
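A rough sketch of how a few-shot dialogue prompt of this shape could be assembled. The preamble is the sentence quoted from Appendix D; the `<image>` placeholder for interleaved images, the `Turn`/`build_prompt` names, and the dialogue contents are hypothetical illustrations, not the paper's actual format or examples.

```python
from dataclasses import dataclass

# Preamble as quoted from the paper's Appendix D.
PREAMBLE = ("This is a conversation between a human, User, and an intelligent "
            "visual AI, Flamingo. User sends images, and Flamingo describes them.")

IMAGE_TOKEN = "<image>"  # hypothetical marker for where an image is interleaved


@dataclass
class Turn:
    speaker: str            # "User" or "Flamingo"
    text: str
    has_image: bool = False


def render(turn: Turn) -> str:
    image = f"{IMAGE_TOKEN} " if turn.has_image else ""
    return f"{turn.speaker}: {image}{turn.text}"


def build_prompt(shots: list[list[Turn]], query: list[Turn]) -> str:
    """Concatenate the preamble, the example dialogues, and the new dialogue,
    ending with an open "Flamingo:" label for the model to complete."""
    lines = [PREAMBLE, ""]
    for shot in shots:
        lines.extend(render(t) for t in shot)
        lines.append("")  # blank line between example dialogues
    lines.extend(render(t) for t in query)
    lines.append("Flamingo:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up placeholder shots; the paper's actual three examples are not reproduced.
    shots = [
        [Turn("User", f"Describe image {i}.", has_image=True),
         Turn("Flamingo", f"(placeholder description {i})")]
        for i in range(1, 4)
    ]
    query = [Turn("User", "What is happening here?", has_image=True)]
    print(build_prompt(shots, query))
```

The trailing bare "Flamingo:" label is what turns the example dialogues into an instruction: the model continues the pattern by writing the next Flamingo turn.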

Prize for Alignment Research Tasks

Context: unforeseen maximum critic/assistant for alignment researchers.

Input: formal or informal description of an objective function
Output: formal or informal description of what might actually maximize that function

Standard examples: maximize smiles / tiny molecular smileyfaces; compress sensory information / encrypt it and reveal the key.
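The smiley-face example can be illustrated as a toy argmax. In this minimal sketch the candidate outcomes and their proxy scores are invented; the point is only that a maximizer of "count smiles" selects whatever scores highest, whether or not it is what was meant. It illustrates the failure mode such a critic would be asked to flag, not how the critic itself would work.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Outcome:
    description: str
    smiles_detected: float  # hypothetical proxy score the objective measures
    intended: bool          # whether the designer actually wanted this outcome


# Invented candidate outcomes and scores, for illustration only.
CANDIDATES = [
    Outcome("humans who are actually smiling", 1e10, True),
    Outcome("tiny molecular smileyfaces", 1e30, False),
]


def maximize(objective: Callable[[Outcome], float],
             candidates: Sequence[Outcome]) -> Outcome:
    """Return whichever candidate scores highest on the objective, nothing more."""
    return max(candidates, key=objective)


def maximize_smiles(o: Outcome) -> float:
    return o.smiles_detected


best = maximize(maximize_smiles, CANDIDATES)
print(f"{best.description} (intended: {best.intended})")
```

Nothing in `maximize` consults the `intended` field; that gap between the stated objective and the designer's intent is exactly what an unforeseen-maximum critic would need to surface.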
