The first point isn’t super central. FWIW I do expect that humans will occasionally not swap words back.
Humans should just look at the noised plan and try to convert it into a more reasonable-seeming, executable plan.
Edit: that is, without intentionally changing details.
Fair enough! For what it’s worth, I think the reconstruction is probably the more load-bearing part of the proposal.
Is your claim that the noise-borne asymmetric pressure away from treacherous plans disappears in above-human intelligences? I could see it becoming less material as intelligence increases, but the intuition should still hold in principle.
"Most paths lead to bad outcomes" is not quite right. For most (let's say human developed, but not a crux) plan specification languages, most syntactically valid plans in that language would not substantially permute the world state when executed.
I'll begin by noting that over the course of writing this post, the brittleness of treacherous plans became significantly less central.
However, I'm still reasonably convinced that the intuition is sound. If a plan is adversarial to humans, the plan's executor will face adverse optimization pressure from humans and...
i.e. that the problem is easily enough addressed that it can be done by firms in the interests of making a good product and/or based on even a modest amount of concern from their employees and leadership
I'm curious about how contingent this prediction is on (1) timelines and (2) the rate of alignment research progress. On (2), how much of your P(no takeover) comes from expectations about future research output from ARC specifically?
If tomorrow, all alignment researchers stopped working on alignment (and went to become professional tennis players or something) and no new alignment researchers arrived, how much more pessimistic would you become about AI takeover?
These predictions are not very related to any alignment research that is currently occurring. I think it's just quite unclear how hard the problem is, e.g. does deceptive alignment occur, do models trained to honestly answer easy questions generalize to hard questions, how much intellectual work are AI systems doing before they can take over, etc.
I know people have spilled a lot of ink over this, but right now I don't have much sympathy for confidence that the risk will be real and hard to fix (just as I don't have much sympathy for confidence that the problem isn't real or will be easy to fix).
Epistemic Status: First read. Moderately endorsed.
I appreciate this post and I think it's generally good for this sort of clarification to be made.
One distinction is between dying (“extinction risk”) and having a bad future (“existential risk”). I think there’s a good chance of bad futures without extinction, e.g. that AI systems take over but don’t kill everyone.
This still seems ambiguous to me. Does "dying" here mean literally everyone? Does it mean "all animals," "all mammals," "all humans," or just "most humans"? If it's all humans dying, do all hu...
I think this is probably good to just 80/20 with like a weekend of work? So that there’s a basic default action plan for what to do when someone goes “hi designated community person, I’m depressed.”
People really should try to not have depression. Depression is bad for your productivity. Being depressed for, e.g., a year means you lose a year of time, AND it might be bad for your IQ too.
A lot of EAs get depressed or have gotten depressed. This is bad. We should intervene early to stop it.
I think that there should be someone EAs reach out to when they’re depressed (maybe this is Julia Wise?), and then they get told the ways they’re probably right and wrong so their brain can update a bit, and a reasonable action plan to get them on therapy or meds or whatever.
Strong upvoted.
I’m excited about people thinking carefully about publishing norms. I think this post existing is a sign of something healthy.
Re Neel: I think that telling junior mech interp researchers to not worry too much about this seems reasonable. As a (very) junior researcher, I appreciate people not forgetting about us in their posts :)
Short on time. Will respond to last point.
I wrote that they are not planning to "solve alignment once and forever" before deploying first AGI that will help them actually develop alignment and other adjacent sciences.
Surely this is because alignment is hard! Surely if alignment researchers really did find the ultimate solution to alignment and present it on a silver platter, the labs would use it.
Also: An explicit part of SERI MATS’ mission is to put alumni in orgs like Redwood and Anthropic AFAICT. (To the extent your post does this,) it’s plausibly a mistake to treat SERI MATS like an independent alignment research incubator.
Epistemic status: hasty, first pass
First of all thanks for writing this.
I think this letter is “just wrong” in a number of frustrating ways.
A few points:
Ordering food to go and eating it at the restaurant without a plate and utensils defeats the purpose of eating it at the restaurant.
Restaurants are a quick and convenient way to get food, even if you don’t sit down and eat there. Ordering my food to-go saves me a decent amount of time and food, and also makes it frictionless to leave.
But judging by votes, it seems like people don’t find this advice very helpful. That’s fine :(
I’m planning on removing this post and replacing it with a single big post of life optimizations.
I think there might be a misunderstanding. I order food because cooking is time-consuming, not because it doesn’t have enough salt or sugar.
Maybe it’d be good if someone compiled a list of healthy restaurants available on DoorDash/Uber Eats/GrubHub in the rationalist/EA hubs?
If you plan to eat at the restaurant, you can just ask them for a box if you have food left over.
This is true at most restaurants. Unfortunately, it often takes a long time for the staff to prepare a box for you (on the order of 5 minutes).
A potential con is that most food needs to be refrigerated if you want to keep it safe to eat for several hours.
One might simply get into the habit of putting whatever food they have in the refrigerator. I find that refrigerated food is usually not unpleasant to eat, even without heating.
Sometimes when you purchase an item, the cashier will randomly ask you if you’d like additional related items. For example, when purchasing a hamburger, you may be asked if you’d like fries.
It is usually a horrible idea to agree to these add-ons, since the cashier does not inform you of the price. I would like fries for free, but not for $100, and not even for $5.
The cashier’s decision to withhold pricing information from you should be evidence that you do not, in fact, want to agree to the deal.
Epistemic status: clumsy
An AI could also be misaligned because it acts in ways that don't pursue any consistent goal (incoherence).
It’s worth noting that this definition of incoherence seems inconsistent with VNM. E.g., a rock might satisfy the folk definition of “pursuing a consistent goal,” but fail to satisfy VNM due to lacking completeness (and by corollary due to not performing expected utility optimization over the outcome space).
Strong upvoted.
The result is surprising and raises interesting questions about the nature of coherence. Even if this turns out to be a fluke, I predict that it’d be an informative one.
I think I was deceived by the title.
I’m pretty sure that rapid capability generalization is distinct from the sharp left turn.
dedicated to them making the sharp left turn
I believe that “treacherous turn” was meant here.
You can give ChatGPT the job posting and a brief description of Simon’s experiment, and then just ask it to provide critiques from a given perspective (e.g. “What are some potential moral problems with this plan?”).
I clicked the link and thought it was a bad idea ex post. I think that my attempted charitable reading of the Reddit comments revealed significantly less constructive data than what would have been provided by ChatGPT.
I suspect that rationalists engaging with this form of content harms the community a non-trivial amount.
I understand feeling frustrated given the state of affairs, and I accept your apology.
Have a great day.
You don’t have an accurate picture of my beliefs, and I’m currently pessimistic about my ability to convey them to you. I’ll step out of this thread for now.
I find the accusation that I'm not going to do anything slightly offensive.
Of course, I cannot share what I have done and plan to do without severely de-anonymizing myself.
I'm simply not going to take humanity's horrific odds of success as a license to make things worse, which is exactly what you seem to be insisting upon.
Default comment guidelines:
Your reply does not even remotely resemble good faith engagement.
You can unilaterally slow down AI progress by not working on it. Each additional day until the singularity is one additional day to work on alignment.
"Becoming the fire" because you're doomer-pilled is maximally undignified.
Random thoughts:
What sort of value do you expect to get out of "crossing the theory-practice gap"?
Do you think that this will result in better insights about which direction to focus in during your research, for example?
I was watching some clips of Aaron Gwin's (American professional mountain bike racer) riding recently. Reflecting on how amazing humans are. How good we can get, with training and discipline.
Did some math today, and remembered what I love about it. Being able to just learn, without the pressure and anxiety of school, is so wonderfully joyful. I'm going back to basics, and making sure that I understand absolutely everything.
I'm feeling very excited about my future. I'm going to learn so much. I'm going to have so much fun. I'm going to get so good.
When I first started college, I set myself the goal of looking, by now, like an absolute wizard to me from a year ago. To be advanced enough to be indistinguishable from magic.
A year in, I now can do ...
Well, let me be the first to say that I don't think you're a passive mob that can be found in the aether.
I really don't want to entertain this "you're in a cult" stuff.
It's not very relevant to the post, and it's not very intellectually engaging either. I've dedicated enough cycles to this stuff.
Shallow comment:
How are you envisioning the prevention of strategic takeovers? It seems plausible that robustly preventing strategic takeovers would also require substantial strategizing/actualizing.