It might be that some elements of human intelligence (at least at the civilizational level) are culturally/memetically transmitted. All well and good in theory. Except the social hypercompetition between people and the intense selection pressure on ideas online might be eroding our world's intelligence. Eliezer wonders if he's only who he is because he grew up reading old science fiction from before the current era's memes.
Computers get smarter. People don't. Some bots will be greedy and some will not. The greedy ones will take everything.
(h/t Otis Reid)
I think this post captures a lot of important features of the US policymaking system. Pulling out a few especially relevant/broadly applicable sections:
1. There's No Efficient Market For Policy
There can be a huge problem that nobody is working on; that is not evidence that it's not a huge problem. Conversely, there can be a marginal problem swamped with policy work; that's not evidence it's really all that big of a deal.
On the upside, this means there are never-ending arbitrage opportunities in policy. Pick your workstreams wisely.
...2. Personnel Really Is The Most Important Thing
The quality of staffers varies dramatically and can make or break policy efforts. Some Hill staffers are just awesome; if they like your idea, they'll take it and run with it, try to...
I think this is especially true for the type of human that likes LessWrong. In Scott's distinction between metis and techne, we are drawn to techne. When a techne-leaning person does a deep dive into metis, that can generate a lot of value.
More speculatively, I feel like often (as in the case of lobbying for good government policy) there isn't a straightforward way to capture any of the created value, so it is under-incentivized.
I want to show a philosophical principle which, I believe, has implications for many alignment subproblems. If the principle is valid, it might allow us to...
This post clarifies and expands on ideas from here and here. Reading the previous posts is not required.
The principle and its most important consequences:
I think I understand you now. Your question seems much simpler than I expected. You're basically just asking "but what if we'll want infinitely complicated / detailed values in the future?"
If people iteratively modified themselves, would their preferences become ever more exacting? If so, it may still be true that the "variables humans care about can't be arbitrarily complicated", but those variables could define a desire to become a system capable of caring about arbitrarily complicated variables.
It's OK if the principle won't be true for hu...
Eliezer and I wrote a book. It’s titled If Anyone Builds It, Everyone Dies. Unlike a lot of other writing either of us has done, it’s being professionally published. It’s hitting shelves on September 16th.
It’s a concise (~60k word) book aimed at a broad audience. It’s been well received by people who got advance copies, with some endorsements including:
...The most important book I’ve read for years: I want to bring it to every political and corporate leader in the world and stand over them until they’ve read it. Yudkowsky and Soares, who have studied AI and its possible trajectories for decades, sound a loud trumpet call to humanity to awaken us as we sleepwalk into disaster. Their brilliant gift for analogy, metaphor and parable clarifies for the general...
I think they are delaying so that people can pre-order early, which affects how many books the publisher prints and distributes, which in turn affects how many people ultimately read it and how much it breaks into the Overton window. Getting this conversation mainstream is an important instrumental goal.
If you are looking for info in the meantime, you could look at PauseAI:
Or if you want fewer facts and quotes and more discussion, I recall that Yudkowsky’s Coming of Age is what changed my view from "orthogonality kinda makes sense" to "orthogonality i...
Preface: I am not suicidal or anywhere near at risk, this is not about me. Further, this is not infohazardous content. There will be discussions of death, suicide, and other sensitive topics so please use discretion, but I’m not saying anything dangerous and reading this will hopefully inoculate you against an existing but unseen mental hazard.
There is a hole at the bottom of functional decision theory, a dangerous edge case which can lead, and has led, multiple highly intelligent and agentic rationalists to self-destructively spiral and kill themselves or get themselves killed. This hole can be seen as a symmetrical edge case to Newcomb’s Problem in CDT, and to Solomon’s Problem in EDT: a point where an agent naively executing on a pure version of the decision theory will consistently underperform in a...
Predictably avoiding death at all costs, even the cost of your mortal soul, eternal fealty, etc., is unfortunately a bigger security flaw than a willingness to follow through on implied local kamikaze threats.
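A toy extortion game makes the claimed asymmetry concrete. The payoff numbers and function name below are placeholders I am inventing for illustration (they do not come from the post); the only point is that a target known to always cave makes threats profitable, while a target known to sometimes refuse, even at great cost, rarely gets threatened at all.

```python
# A rough sketch with made-up payoffs (illustrative assumptions, not from the post).
# A threatener only issues a threat when it expects to profit. A target whose policy
# is "always cave" makes threatening profitable; a target who credibly refuses most
# of the time does not, so the threat is never made.

def target_expected_value(p_cave, cost_of_caving=-10, cost_of_refusing=-1000,
                          threat_gain=5, threat_cost=-1):
    # Threatener's expected value from threatening this target.
    threatener_ev = p_cave * threat_gain + (1 - p_cave) * threat_cost
    if threatener_ev <= 0:
        return 0  # threatening doesn't pay, so no threat is made
    # A threat is made: the target either caves or refuses and eats the worst case.
    return p_cave * cost_of_caving + (1 - p_cave) * cost_of_refusing

print(target_expected_value(1.0))   # -10: "avoid death at all costs" gets exploited repeatedly
print(target_expected_value(0.1))   # 0: credible refusal means the threat never comes
```

The sketch ignores the real complications the post is about (imperfect prediction, the cost of actually following through), but it shows why a predictable "cave to anything" policy can be the larger security flaw.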
If you follow decision-theoretic loss-aversion to its natural conclusion, both of us should be closeted and making a good YouTube grift as Republicans. We're making less money this way.
(Work done at Convergence Analysis. Mateusz wrote the post and is responsible for most of the ideas with Justin helping to think it through. Thanks to Olga Babeeva for the feedback on this post.)
Suppose that the prospects of pausing or significantly slowing down AI progress, and of solving the technical problems necessary to ensure that arbitrarily strong AI has good effects on humanity (in time, before we get such systems), both look gloomy.[1] What options do we have left?
Adam Shimi presents a useful frame on the alignment problem in Abstracting The Hardness of Alignment: Unbounded Atomic Optimization:
alignment [is] the problem of dealing with impact on the world (optimization) that is both of unknown magnitude (unbounded) and non-interruptible (atomic).
If the problem is about some system (or a collection of systems) having an unbounded, non-interruptible impact,[2] can we handle it by ensuring that...
On getting coherent corrigibility: my and Joar's post on Updating Utility Functions makes some progress toward a soft form of corrigibility.
Our government, having withdrawn the new diffusion rules, has now announced an agreement to sell massive numbers of highly advanced AI chips to UAE and Saudi Arabia (KSA). This post analyzes that deal and that decision.
It is possible, given sufficiently strong agreement details (which are not yet public and may not be finalized) and private unvoiced considerations, that this deal contains sufficient safeguards and justifications that, absent the ability to fix other American policy failures, this decision is superior to the available alternatives. Perhaps these are good deals, with sufficiently strong security arrangements that will actually stick.
Perhaps UAE and KSA are more important markets and general partners than we realize, and the rest of the world really is unable to deploy capital and electrical power the way they...
It always seemed outlandish that in The Animatrix, the first AI city (01) was located in the Middle East...
If we had limitless time, it would be interesting to know how this happened. I guess the prehistory of it involved Saudi Vision 2030 (e.g. the desert city Neom) and the general hypermodernization of Dubai. You can see precursors in the robot Sophia getting Saudi citizenship in 2017, and the UAE's "Falcon" LLM in 2023.
But the initiative must have come from the American side - some intersection of the geopolitical brain trust around Trump, ...
Epistemic status: Argument from behavioural analogy. I’m not claiming that AI systems feel care, only that human users interpret their behaviour as if it carries social meaning. When AI systems simulate care but fail to change behaviour in response to feedback, trust erodes. This essay reframes interface-level alignment not as a momentary performance or affective mimicry, but as longitudinal behavioural coherence with the capacity to track, adapt to, and reflect what users value over time.
--
A few weeks ago, I corrected an LLM for misreading numerical data in a table I’d provided. Again. Same mistake. Same correction. Same nonchalant apology.
“You're absolutely right! I should’ve been more careful. Here’s the corrected version blah blah blah.”
It wasn't the error that irked me; it was the polite but emotionally hollow apology...
In a recent post, Zvi described what he calls "The Most Forbidden Technique":
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
The article specifically discusses this in relation to reasoning models and Chain of Thought (CoT): if we train a model...
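As a concrete, hypothetical illustration of the rule, here is a minimal training-loss sketch in the spirit of the post. It assumes a toy setup where each sequence is split into chain-of-thought tokens and final-answer tokens; the function names, the masking scheme, and the `monitor_score` signal are all assumptions of mine, not anything from Zvi's post or a real codebase.

```python
import torch
import torch.nn.functional as F

def loss_on_output_only(logits, targets, answer_mask):
    """Cross-entropy on the final output [X] only; chain-of-thought tokens [M] are masked out.

    logits:      (seq_len, vocab_size) model predictions
    targets:     (seq_len,) target token ids
    answer_mask: (seq_len,) bool, True where the token belongs to the final answer
    """
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (per_token * answer_mask.float()).sum() / answer_mask.sum().clamp(min=1)

def most_forbidden_loss(logits, targets, answer_mask, monitor_score):
    """What the post says never to do: fold a monitor signal [T] into the loss.

    monitor_score stands in for some differentiable "badness" score produced by
    analyzing the chain of thought. Optimizing against it pressures the model to
    fool the monitor rather than to stop misbehaving.
    """
    return loss_on_output_only(logits, targets, answer_mask) + monitor_score
```

The distinction the post draws lives entirely in which of these two losses the gradient flows through: train on the first and [T] stays a usable window into the model; train on the second and you are training the model to defeat [T].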
Do we know that the examples of Gemini thinking in kaomoji and Claude speaking in Spanish, etc., are real?
I say that because ChatGPT doesn't actually display its chain of thought to the user, so it's possible that neither Gemini nor Claude does either. As I understand it, ChatGPT's chain of thought is obfuscated into something more approachable for the user.