How can we recognize when we are failing to change our thinking in light of new evidence that doesn’t fit our expectations and assumptions? And how can we update our thought processes to overcome the challenges that our old ways of seeing pose?
[Note: This post is an excerpt from a longer paper, written during the first half of the Philosophy Fellowship at the Center for AI Safety. I (William D'Alessandro) am a Postdoctoral Fellow at the Munich Center for Mathematical Philosophy. Along with the other Philosophy Fellowship midterm projects, this draft is posted here for feedback.
The full version of the paper includes a discussion of the conceptual relationship between safety and moral alignment, and an argument that we should choose a reliably safe powerful AGI over one that's (apparently) successfully morally aligned. I've omitted this material for length but can share it on request.
The deontology literature is big, and lots of angles here could be developed further. Questions and suggestions much appreciated!]
Value misalignment arguments for AI risk observe...
When I introduced Rational Animations, I wrote:
What I won't do is aggressively advertise LessWrong and the EA Forum. If the channel succeeds, I will organize fundraisers for EA charities. If I adapt an article for YT, I will link it in the description or just credit the author. If I use quotes from an author on LW or the EA Forum, I will probably credit them on-screen. But I will never say: "Come on LW! Plenty of cool people there!" especially if the channel becomes big. Otherwise, "plenty of cool people" becomes Reddit pretty fast.
If the channel becomes big, I will also refrain from posting direct links to LW and the EA Forum. Remind me if I ever forget. And let me know if these rules are not conservative enough.
You could have AutoAutoRateLimits. That is, you have some target, such as "total number of posts/comments per time" or "total number of posts/comments per time from users with <15 karma", and you automatically adjust the rate limits to keep the observed number below that global target. Maybe you add complications, such as a floor so the rate limit never relaxes all the way to None, and maybe you add inertia. (There are plausibly bad effects to this, though; IDK.)
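The AutoAutoRateLimits idea above amounts to a simple feedback controller. A minimal sketch, where the function name, parameters, and update rule are all hypothetical, just one way the target-tracking, floor, and inertia could fit together:

```python
def adjust_rate_limit(current_limit, observed_volume, target_volume,
                      floor=1, inertia=0.8):
    """One step of a hypothetical AutoAutoRateLimit controller.

    If observed posting volume exceeds the target, tighten the
    per-user limit; if it's below, loosen it. `inertia` damps how
    fast the limit moves, and `floor` ensures the limit never
    relaxes all the way to None (nor collapses to zero).
    """
    # Naive proportional update: scale the limit by target/observed.
    raw = current_limit * (target_volume / max(observed_volume, 1))
    # Blend with the old limit so the system doesn't oscillate wildly.
    smoothed = inertia * current_limit + (1 - inertia) * raw
    return max(floor, round(smoothed))
```

For example, with a per-user limit of 10 and volume running at 500 against a target of 300, one step tightens the limit slightly rather than slashing it, which is the inertia doing its job.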
People talk about Kelly betting and expected-value maximization as though they're alternative strategies for the same problem. In fact, each is the best choice for a different class of problems. Understanding when to use Kelly betting and when to maximize expected value is critical.
Most of the ideas for this came from Ole Peters ergodicity economics writings. Any mistakes are my own.
Alice and Bob visit a casino together. They each have $100, and they decide it'll be fun to split up, play the first game they each find, and then see who has the most money. They'll then keep doing this until their time in the casino is up in a couple days.
Alice heads left and finds a game that looks good. It's double...
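The contrast between Kelly betting and betting to maximize per-round expected value can be sketched numerically. A minimal simulation, assuming an illustrative even-money bet with a 60% win chance (my own stand-in numbers, not the game from the story): the expected-value maximizer stakes everything each round, while the Kelly bettor stakes the Kelly fraction f* = p - (1 - p) of current wealth.

```python
import random

def simulate(bet_fraction, p_win=0.6, rounds=100, start=100.0, seed=0):
    """Simulate repeated even-money bets, staking `bet_fraction`
    of current wealth each round. Returns final wealth."""
    rng = random.Random(seed)
    wealth = start
    for _ in range(rounds):
        stake = wealth * bet_fraction
        if rng.random() < p_win:
            wealth += stake
        else:
            wealth -= stake
    return wealth

# Kelly fraction for an even-money bet: f* = p - (1 - p)
kelly = 0.6 - 0.4

# Compare median outcomes across many independent runs.
kelly_runs = sorted(simulate(kelly, seed=s) for s in range(201))
all_in_runs = sorted(simulate(1.0, seed=s) for s in range(201))

median_kelly = kelly_runs[100]   # compounds steadily in the typical run
median_all_in = all_in_runs[100] # one loss zeroes the bankroll for good
```

The all-in strategy has the higher expected value in any single round, yet almost every 100-round trajectory hits zero at the first loss and stays there; the Kelly bettor's typical trajectory grows instead. That divergence between the ensemble average and the typical single trajectory is the core of the ergodicity point.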
Can you be more precise about the exact situation Bob is in? How many rounds will he get to play? Is he trying to maximise money, or trying to beat Alice? I doubt the Kelly criterion will actually be his optimal strategy.
Your response does illustrate that there are holes in my explanation. Bob 1 and Bob 2 do not exist at the same time. They are meant to represent one person at two different points in time.
A separate way I could try to explain what kind of resurrection I am talking about is to imagine a married couple. An omniscient husband would have to care as much about his wife after she was resurrected as he did before she died.
I somewhat doubt that I could patch all of the holes that could be found in my explanation. I would appreciate it if you try to answer what I am trying to ask.
A new Anthropic interpretability paper—“Toy Models of Superposition”—came out last week that I think is quite exciting and hasn't been discussed here yet.
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models where it actually seems rare for neurons to correspond to clean features. This
I thought this was a very interesting paper — I particularly liked the relationship to phase transitions.
However, I think there's likely to be another 'phase' that they don't discuss (possibly it didn't crop up in their small models, since it's only viable in a sufficiently large model): one where you pack a very large number of features (thousands or millions, say) into a fairly large number of dimensions (hundreds, say). In spaces with dimensionality >= O(100), the statistics of norm and dot product are such that even randomly chosen unit norm vecto...
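The claim about high-dimensional dot-product statistics is easy to check numerically. A small pure-stdlib sketch (the dimension and vector count are illustrative): sample random unit vectors in d = 100 and look at the most-correlated pair, which stays close to orthogonal because dot products concentrate around 0 with spread on the order of 1/sqrt(d).

```python
import math
import random

def random_unit_vector(dim, rng):
    """Sample a uniformly random unit vector via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def max_abs_dot(vectors):
    """Largest |dot product| over all distinct pairs."""
    best = 0.0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            best = max(best, d)
    return best

rng = random.Random(0)
dim, n = 100, 50
vecs = [random_unit_vector(dim, rng) for _ in range(n)]
# Dot products of random unit vectors in dim d have spread ~ 1/sqrt(d),
# so even the worst of the ~1200 pairs here stays far from +/-1.
worst = max_abs_dot(vecs)
```

So 50 random directions in 100 dimensions already interfere with each other only weakly, which is what makes packing many more features than dimensions viable.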
(If you're already familiar with all basics and don't want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)
I have several times failed to write up a well-organized list of reasons why AGI will kill you. People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first. Some fraction of those people are loudly upset with me if the obviously most important points aren't addressed immediately, and I address different points first instead.
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally...
This is another reply in this vein, I'm quite new to this so don't feel obliged to read through. I just told myself I will publish this.
I agree (90-99% agreement) with almost all of the points Eliezer made. The rest are cases where I probably didn't understand well enough, or where there's no need for a comment, e.g.:
1-8: agree
9. Not sure if I understand this right - if the AGI has been successfully designed not to kill everyone, then why the need for oversight? And if it is capable of killing everyone and the design fails, what would our oversight do? I don't...
LessWrong is experimenting with the addition of reacts to the site, as per the recent experimental Open Thread. We are now progressing to the next stage of the experiment: trying out reacts in actual discussion threads.
The dev/moderator team will be proactively looking for posts to enable react voting on (with author permission), but also any user can enable it themselves to help us experiment:
We're continuing to think about what reacts should be available. Thanks to everyone who's weighed in so far.
I just spent time today...
Alice: Hey Bob, how's it going?
Bob: Good. How about you?
Alice: Good. So... I stalked you and read your blog.
Bob: Haha, oh yeah? What'd you think?
Alice: YOU THINK SEA MONSTERS ARE GOING TO COME TO LIFE, END HUMANITY, AND TURN THE ENTIRE UNIVERSE INTO PAPER CLIPS?!?!?!?!?!??!?!?!?!
Bob: Oh boy. This is what I was afraid of. This is why I keep my internet life separate from my personal one.
Alice: I'm sorry. I feel bad about stalking you. It's just...
Bob: No, it's ok. I actually would like to talk about this.
Alice: Bob, I'm worried about you. I don't mean this in an offensive way, but this MoreRight community you're a part of, it seems like a cult.
Bob: Again, it's ok. I won't be offended. You don't have to hold back.
If I didn't speak English, so that their words appeared as meaningless noise to me, then I'd be much more uncertain about who to trust, and would probably defer to an average of the opinions of the top ML names, e.g. Sutskever, Goodfellow, Hinton, LeCun, Karpathy, Bengio, etc.
Do you think it'd make sense to give more weight to people in the field of AI safety than to people in the field of AI more broadly?
I would, and I think it's something that generally makes sense. Ie. I don't know much about food science but on a question involving dairy, I'd trust food...
Note: this has been in my draft queue since well before the FLI letter and TIME article. It was sequenced behind the interpretability post, and delayed by travel, and I'm just plowing ahead and posting it without any edits to acknowledge ongoing discussion aside from this note.
This is an occasional reminder that I think pushing the frontier of AI capabilities in the current paradigm is highly anti-social, and contributes significantly in expectation to the destruction of everything I know and love. To all doing that (directly and purposefully for its own sake, rather than as a mournful negative externality to alignment research): I request you stop.
(There's plenty of other similarly fun things you can do instead! Like trying to figure out how the heck modern AI systems...
This feels game-theoretically pretty bad to me, and not only abstractly, but I expect concretely that setting up this incentive will cause a bunch of people to attempt to go into capabilities (based on conversations I've had in the space).
I'm not an ethical philosopher, but my intuition, based primarily on personal experience, is that deontological ethics are a collection of heuristic rules of thumb extracted from the average answers of utilitarian ethics applied to a common range of situations that often crop up between humans. (I also view this as a slightly-idealized description of the legal system.) As such, they're useful primarily in the same ways that heuristics often are useful compared to actually calculating a complex function, by reducing computational load. For people, they also...