It's like trying to compress a file that was generated by a random device —
Gretta: You can't losslessly compress a truly random file.
I don't think this is strictly true. You can't a priori build a compression scheme that will work for an arbitrary random file (No Free Lunch Theorem). But you can ex post identify the particular patterns in a particular random file, and choose a compression scheme that exploits those patterns. You probably end up with a pretty ugly scheme that doesn't generalize, and so is unsatisfactory in some aesthetic sense -- especially if you're going for lossless compression, since there's probably a ton of noise that's just very hard to compress in an elegant / generalizable way.
I guess the problem with allowing ex post scheme choices is that you can get extreme overfitting -- e.g. the "compressor" is just a verbatim copy of the file. Useless! But if you allow lossiness, you can probably get something that looks okay under some "elegance" prior -- e.g. try a Fourier series, a Taylor series, and N other "natural" decompositions, and pick the one that requires the fewest bits to store (toy sketch below).
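Here's a minimal sketch of that lossy version (my own toy illustration -- the signal, the error budget, and the "number of stored coefficients" stand-in for bits are all made up, not anything from the thread):

```python
# Toy "compression with an elegance prior": fit a few natural decompositions
# to a noisy signal, keep only as many coefficients as needed to hit an error
# budget, and pick whichever family needs the fewest stored numbers.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 256)
signal = np.sin(2 * np.pi * 3 * x) + 0.3 * x**2 + 0.05 * rng.normal(size=x.size)

def fourier_cost(y, tol):
    # Keep the largest-magnitude DFT coefficients until the residual MSE < tol.
    coeffs = np.fft.rfft(y)
    kept = np.zeros_like(coeffs)
    for k, idx in enumerate(np.argsort(-np.abs(coeffs)), start=1):
        kept[idx] = coeffs[idx]
        if np.mean((np.fft.irfft(kept, n=y.size) - y) ** 2) < tol:
            return k  # cost proxy: number of stored (complex) coefficients
    return coeffs.size

def poly_cost(y, x, tol):
    # Increase polynomial (Taylor-ish) degree until the residual MSE < tol.
    for deg in range(1, 50):
        fit = np.polyval(np.polyfit(x, y, deg), x)
        if np.mean((fit - y) ** 2) < tol:
            return deg + 1  # cost proxy: number of stored coefficients
    return 50

tol = 0.01
costs = {"fourier": fourier_cost(signal, tol), "polynomial": poly_cost(signal, x, tol)}
print(costs, "-> pick", min(costs, key=costs.get))
```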
Analogously: you need a bunch of goofy epicycles to explain canon HP magic, and especially if you're really trying hard for an exact match maybe you just end up with something extremely unaesthetic (a ton of random bits at the end that you just have to fiat in). In the degenerate case you get Eliezer's "we're selecting a universe that matches what JKR wrote." But if you're willing to take some predictive loss in return for a simple/elegant model, you can get something like "nested nerfing" that's decent at prediction and at elegance.
And especially, maybe you remember how at the time it didn't seem like a flaw to you. You were not going around being like, “And today I shall be a flawed character.”
A therapist once gave me the insight that character weaknesses are strengths taken too far. Harry's energetic and clever and knowledgeable, he's inspired and energized by competition, and he can meme people into doing things -- and he can be a know-it-all who assumes first principles & cleverness trump empirics and experience, someone who's unwilling to lose, and irresponsible or annoying in how he leads others.
I was thinking the mask of "person who's read more books than Harry ever will"
We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic! We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. (x)
"making up types of guy" research is a go?
They're hiring; you might be great for this.
Nice post!
Your trader analogy made me think: you'd ideally want to have a training period with lots of honeypots and surveillance, so that the trader learns that crime doesn't pay.
This suggests some level-3 ideas I didn't see in your post:
These are basically the same as Redwood's control ideas, but have added benefits in a training context, since you can catch and correct bad behavior as it's being learned. (Ideally, before the AI is very good at it.)
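To gesture at the honeypot idea with a toy model (entirely my own construction, not from the post or Redwood's work): a two-armed bandit where "cheat" pays slightly more when undetected, but a 30% honeypot rate makes its expected value negative, so the learned values end up favoring honesty.

```python
# Toy illustration: with enough honeypots during training, "crime doesn't pay"
# is literally true in expectation, and a simple learner picks up on that.
import random

random.seed(0)
q = {"honest": 0.0, "cheat": 0.0}    # action-value estimates
lr, eps, honeypot_rate = 0.1, 0.1, 0.3

for step in range(5000):
    # Epsilon-greedy action selection.
    action = random.choice(list(q)) if random.random() < eps else max(q, key=q.get)
    is_honeypot = random.random() < honeypot_rate
    if action == "cheat":
        # Cheating yields a small bonus when it slips through,
        # but honeypot episodes always catch it and penalize hard.
        reward = -10.0 if is_honeypot else 1.2
    else:
        reward = 1.0
    q[action] += lr * (reward - q[action])

print(q)  # q["cheat"] converges toward 0.3 * (-10) + 0.7 * 1.2 = -2.16
```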
The quote below felt potentially related, but I'm not sure I understood it.
adversarially hardened models where the reward model plays an adversarial zero-sum game with a red-teaming model
Could you explain how this works?
OK, cool, I think I understand where you're coming from much better now. Seems like we basically agree and were just emphasizing different things in our original comments!
I'm in violent agreement that there's a missing mood when people say "AIs will follow the law". I think there's something going on where people are like "but liberalism / decentralized competition have worked so well" while ignoring all the constraints on individual actors that make it so. Rule of law, external oversight, difficulty of conspiring with other humans, inefficiencies of gov't that limit its ability to abuse power, etc.
And those constraints might all fall away with the AGI transition. That's for a number of reasons: ownership of AGI could concentrate power; AGI complements existing power bases (e.g. govt has the authority but not a great ability to selectively enforce laws to silence opponents at mass scale); and it reduces the need for conspirators. As you note, it brings down others' value as trading partners & collaborators. And takeoff dynamics could make things less like an iterated game and more like a one-shot. *taps head* can't be punished if all your opponents are dead.
(I'm guessing you'd agree with all this, just posting to clarify where my head is at)
Apologies for the scrappiness of the below -- I wanted to respond but I have mostly a scattering of thoughts rather than solid takes.
I like the intelligence curse piece very much -- it's what I meant to reference when I linked the Turing Trap above, but I couldn't remember the title & Claude pointed me to that piece instead. I agree with everything you're saying directionally! But I feel some difference in emphasis or vibe that I'm curious about.
-
One response I notice having to your points is: why the focus on value alignment?
"We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition."
And conversely: "if we have value alignment, I don't think there's a guarantee that we wind up in a basin of convergent human values, so you still have the problem of - whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetos on that?"
(Using quotes bc these feel more like 'text completions from system 1' than all-things-considered takes from system 2.)
-
Maybe there's a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.
I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:
I think this is partly down to founder effects where Eliezer either didn't buy these ideas or didn't want to emphasize them (bc they cut against the framing of "alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana").
-
I also notice some internal tension where part of me is like "the AIs don't seem that scary in Noosphere's world". But another part is like "dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society - why isn't that scary?"
I think where this is coming from is that I tend to focus on "transition dynamics" to the AGI future rather than "equilibrium dynamics" of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup -- but I don't see it, the returns to scale seem so good).
So maybe I'd say "if you're not talking a classic AI takeover scenario, and you're imagining a somewhat gradual takeoff,
Thanks, I love the specificity here!
Prompt: if someone wanted to spend some $ and some expert-time to facilitate research on "inventing different types of guys", what would be especially useful to do? I'm not a technical person or a grantmaker myself, but I know a number of both types of people; I could imagine e.g. Longview or FLF or Open Phil being interested in this stuff.
Invoking Cunningham's law, I'll try to give a wrong answer for you or others to correct! ;)
Technical resources:
Seeding the community with interesting ideas:
Community resources:
Alignment by default is a minority opinion. Surveying the wide range of even truly informed opinions, it seems clear to me that we collectively don't know how hard alignment is.
Totally. I think it's "arguable" in the sense of inside-views, not outside-views, if that makes sense? Like: it could be someone's personal vibe that alignment-by-default is >99%. Should they have that as their all-things-considered view? Seems wrong to me; we should be considerably more uncertain here.
But okay, then: we should have some spread of bets across different possible worlds, and put a solid chunk of probability on alignment by default. Even if it's a minority probability, this could matter a lot for what you actually try to do!
For example: I think worlds with short timelines, hard takeoff, and no alignment-by-default are pretty doomed. It's easy to focus on those worlds and feel drawn to plans that are pretty costly and are incongruent with virtue and being-good-collaborators -- e.g. "we should have One Winning AGI Project that's Safe and Smart Enough to Get Things Right", the theory of victory that brought you OpenAI.
My intuition is that worlds with at least one of those variables flipped tend to convergently favor solutions that are more virtuous / collaborative and are more likely to fail gracefully.
(I'm tired and not maximally articulate rn, but could try to say more if that feels useful.)
In context, I guess your claim is: "if the 'compressor' is post-hoc trying a bunch of algorithms and picking the best one, the full complexity of that process should count against the compressor." Totally agree with that as far as epistemology is concerned!
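To make that accounting concrete (my framing, with made-up numbers): if you choose among N candidate schemes after seeing the data, the log2(N) bits needed to name the winner count toward the total description length.

```python
# Toy MDL-style bookkeeping: the bits spent naming which of N candidate
# schemes you picked count against the compressor too.
import math

def total_description_length(num_candidate_schemes, scheme_bits, residual_bits):
    return math.log2(num_candidate_schemes) + scheme_bits + residual_bits

# An "elegant" scheme cherry-picked from 1000 candidates can still lose to a
# slightly clumsier scheme that was one of only 2 options on the table.
print(total_description_length(1000, 40, 100))  # ~150.0 bits
print(total_description_length(2, 45, 100))     # 146.0 bits
```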
But I don't think the epistemological point carries over to the realm of rational-fic.