It's unimportant, but I disagree with the "extra special" in:
> if alignment isn’t solvable at all [...] extra special dead

If we could coordinate well enough and get to SI via very slow human enhancement, that might be a good universe to be in. But we probably wouldn't be able to coordinate well enough to prevent AGI in that universe. Still, the odds seem similar between "get humanity to hold off on AGI until we solve alignment", which is the ask in alignment-possible universes, and "get humanity to hold off on AGI forever", which is the ask in alignment-impossible ...
Oh, actually I spoke too soon about "Talk to the City." As a research project it is cool, but I really don't like the obfuscation that occurs when you talk to an LLM about the content it was trained on. I don't know how TTTC works under the hood, but I was hoping for something more like de-duplicating posts and automatically fitting them into argument graphs. Then users could navigate to relevant points in the graph based on a text description of their current point of view, but importantly they would be interfacing with the actual human-generated text, wi...
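To gesture at what I mean, here is a minimal sketch (all names like `ArgumentGraph`, `add_post`, and `navigate` are invented for illustration, and the word-overlap similarity is a crude stand-in for whatever sentence-embedding model a real system would use):

```python
# Toy sketch: an argument graph over verbatim human-written posts, with
# near-duplicate merging and free-text navigation. Similarity is a crude
# token-overlap (Jaccard) placeholder, not a real embedding model.
from dataclasses import dataclass, field

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased word sets (placeholder metric)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

@dataclass
class Node:
    text: str                                        # the actual human-generated text, unmodified
    duplicates: list = field(default_factory=list)   # merged near-duplicate posts

@dataclass
class ArgumentGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)        # (src_idx, dst_idx, relation)

    def add_post(self, text: str, dedup_threshold: float = 0.8) -> int:
        """Merge into an existing node if it is a near-duplicate, else add a new node."""
        for i, node in enumerate(self.nodes):
            if similarity(text, node.text) >= dedup_threshold:
                node.duplicates.append(text)
                return i
        self.nodes.append(Node(text))
        return len(self.nodes) - 1

    def link(self, src: int, dst: int, relation: str) -> None:
        self.edges.append((src, dst, relation))      # e.g. "supports" / "rebuts"

    def navigate(self, viewpoint: str, k: int = 3) -> list:
        """Return the k posts closest to a free-text description of the user's view."""
        ranked = sorted(self.nodes, key=lambda n: similarity(viewpoint, n.text), reverse=True)
        return [n.text for n in ranked[:k]]

graph = ArgumentGraph()
a = graph.add_post("Slowing AI development buys time for alignment research.")
b = graph.add_post("A pause is unenforceable because labs will defect.")
graph.link(b, a, "rebuts")
print(graph.navigate("I think we should pause AI progress"))
```

The point of the sketch is only that the user always ends up reading the verbatim posts at the graph nodes, rather than an LLM's paraphrase of them.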
> - The human eats ice cream
> - The human gets reward
> - The human becomes more likely to eat ice cream
So, first of all, the ice cream metaphor is about humans becoming misaligned with evolution, not about conscious human strategies misgeneralizing from the fact that ice cream makes their reward circuits light up, which I agree is not a misgeneralization: ice cream really does light up the reward circuits. If the human learned "I like licking cold things" and then stuck their tongue to a metal pole on a cold winter day, that would be a misgeneralization at the level you are fo...
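To make the two levels concrete, here's a toy sketch (all numbers invented; nothing here is a claim about real reward circuitry): an agent whose values are trained on a proxy reward ("sweetness") that tracked the outer objective ("nutrition") in the training environment, and which then prefers the new option once ice cream shows up. The proxy reward fires exactly as designed; the divergence is between the learned values and the outer objective.

```python
# Toy illustration only (invented numbers): inner values trained on a proxy
# reward ("sweetness") that correlated with the outer objective ("nutrition")
# in the training environment, then diverged once ice cream became available.
training_env = {"berries": {"sweetness": 0.6, "nutrition": 0.7},
                "roots":   {"sweetness": 0.1, "nutrition": 0.5}}
deployment_env = dict(training_env,
                      ice_cream={"sweetness": 1.0, "nutrition": 0.1})

# The agent learns how much it values the sweetness feature from proxy reward.
sweetness_value = 0.0
for _ in range(100):
    for props in training_env.values():
        sweetness_value += 0.01 * props["sweetness"]   # reward circuit firing as designed

def choose(env):
    return max(env, key=lambda food: sweetness_value * env[food]["sweetness"])

print(choose(deployment_env))   # -> "ice_cream": great by the learned proxy values
print(max(deployment_env, key=lambda f: deployment_env[f]["nutrition"]))  # -> "berries"
```

The cold-metal-pole case is a failure one level up: the learned strategy itself generalizes to a situation where even the reward circuit disagrees.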
> People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches
I won't claim I could have predicted ahead of time that these wouldn't foom, but the result of all of them seems to be an AI engineer that is much, much narrower / less capable than a human AI researcher.
It makes me really scared that many people's response to not getting mauled after poking a bear is to poke it some more. I wouldn't care so much if I didn't think the bear was going to maul me, my family, and everyone I care about.
> The waluigis will give anti-croissant responses
I'd say the waluigis have a higher probability of giving pro-croissant responses than the luigis, and are therefore genuinely selected against. The reinforcement learning is not part of the story; it is the thing selecting over the LLM's distribution based on whether the content of the story contained pro- or anti-croissant propaganda.
(Note that this doesn't apply to future, agent-shaped AIs (made of LLM components) that are aware of their own status as part of the story they are working on, and hence of being subject to alteration by "training".)
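As a toy sketch of what I mean by the RL being the thing that selects over the distribution (the personas, their weights, and their response probabilities are all made up): treat the model as a mixture of personas, each with some chance of emitting the dispreferred pro-croissant response, and apply a reward-weighted update to the mixture weights.

```python
import random
random.seed(0)

# Toy mixture-of-personas model (all numbers invented). Each persona has some
# probability of producing the dispreferred, pro-croissant response.
personas = {
    "luigi_A": {"weight": 0.45, "p_pro_croissant": 0.02},
    "luigi_B": {"weight": 0.45, "p_pro_croissant": 0.05},
    "waluigi": {"weight": 0.10, "p_pro_croissant": 0.60},
}

def sample_and_update(personas, steps=2000, lr=0.05):
    """Reward-weighted update on mixture weights: +1 for anti-croissant, -1 for pro."""
    for _ in range(steps):
        names = list(personas)
        weights = [personas[n]["weight"] for n in names]
        name = random.choices(names, weights=weights)[0]     # sample a persona
        pro = random.random() < personas[name]["p_pro_croissant"]
        reward = -1.0 if pro else 1.0
        personas[name]["weight"] *= (1 + lr * reward)         # reinforce or suppress
        total = sum(p["weight"] for p in personas.values())
        for p in personas.values():                           # renormalize the mixture
            p["weight"] /= total
    return personas

for name, p in sample_and_update(personas).items():
    print(name, round(p["weight"], 3))
```

The waluigi's share of the mixture shrinks because it really does emit pro-croissant text more often; no story-internal reasoning is needed for the selection to bite.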
I like this direction of thought, and I suspect it is true as a general rule, but it ignores the incentive people have to receive the information correctly, and the structure through which the information is disseminated. Both factors (and probably others I haven't thought of) would increase or decrease how much information could be transferred.
In the final paragraph, I'm uncertain whether you are thinking of "agency" being broken into components which make up the whole concept, or of the category being split into different classes of things, some of which may have intersecting examples (or both?). I suspect both would be helpful. Agency can be described in terms of components like measurement/sensing, calculation, modeling, planning, comparison to setpoints/goals, and taking actions. Probably not that exact set, but then examples of agent-like things could naturally be compared on each c...
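For illustration (the component list and the 0–1 scores below are just guesses, picked to show the shape of the comparison rather than to claim anything about these systems):

```python
# Illustrative only: score some agent-like things on rough components of agency,
# so that examples can be compared component by component.
COMPONENTS = ["sensing", "calculation", "modeling", "planning", "goal_comparison", "acting"]

examples = {
    "thermostat":   {"sensing": 0.6, "calculation": 0.1, "modeling": 0.0,
                     "planning": 0.0, "goal_comparison": 0.8, "acting": 0.5},
    "chess_engine": {"sensing": 0.2, "calculation": 0.9, "modeling": 0.7,
                     "planning": 0.9, "goal_comparison": 0.9, "acting": 0.3},
    "human":        {"sensing": 0.9, "calculation": 0.6, "modeling": 0.9,
                     "planning": 0.8, "goal_comparison": 0.8, "acting": 0.9},
}

print(f"{'system':<14}" + "".join(f"{c:>16}" for c in COMPONENTS))
for name, scores in examples.items():
    print(f"{name:<14}" + "".join(f"{scores[c]:>16.1f}" for c in COMPONENTS))
```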