LESSWRONG

deep

Posts

deep's Shortform · 5mo

Comments (sorted by newest)
HPMOR: The (Probably) Untold Lore
deep · 1mo

In context, I guess your claim is: "if the 'compressor' is post-hoc trying a bunch of algorithms and picking the best one, the full complexity of that process should count against the compressor." Totally agree with that as far as epistemology is concerned! 

But I don't think the epistemological point carries over to the realm of rational-fic. 

  • In part that's because I think of JKR-magic as in fact having a bunch of structure that makes it much easier to explain than it would be to explain a truly randomly-generated set of spells and effects (e.g. the pseudo-Latin stuff; the fact that wands are typically used). So I expect a retrofitted explanation wouldn't be crazy tortured (wouldn't require having a compression process that tests a ridiculous number N of patterns, or incorporates a ridiculous amount of fiat random bits).
  • In part I'm just making a tedious "nerds have different aesthetic intuitions about stuff" point, where I think a reasonably simple well-retrofitted explanation is aesthetically very cool even if it's clearly not the actual thing used to generate the system (and maybe required a bunch of search to find). 
HPMOR: The (Probably) Untold Lore
deep · 1mo

It's like trying to compress a file that was generated by a random device —

Gretta: You can't losslessly compress a truly random file.
I don't think this is strictly true. You can't a priori build a compression scheme that will work for an arbitrary random file (No Free Lunch Theorem). But you can ex post identify the particular patterns in a particular random file, and pick a compression scheme that picks up on those patterns. You probably end up with a pretty ugly scheme that doesn't generalize, and so is unsatisfactory in some aesthetic sense. Especially if you're going for lossless compression, since there's probably a ton of noise that's just very hard to compress in an elegant / generalizable way. 

I guess the problem with allowing ex post scheme choices is you can get extreme overfitting -- e.g. the compressor is just a full representation of the pattern. Useless! But if you allow lossiness, you can probably get something that looks okay on some "elegance" prior -- e.g. try a Fourier series, a Taylor series, and N other "natural" decompositions and pick the one that requires the fewest bits to store. 
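
To make that concrete, here's a toy sketch (my own illustration, nothing from the post): fix an already-generated "random file", check how many coefficients each candidate decomposition needs to hit some tolerable reconstruction error, and keep the cheapest. The two candidate bases, the signal size, and the error threshold are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=256)   # the already-generated "random file"
target_mse = 0.5                # how much loss we're willing to accept

def coeffs_needed(forward, inverse):
    # smallest number of kept coefficients whose reconstruction meets target_mse
    c = forward(signal)
    order = np.argsort(-np.abs(c))              # largest-magnitude coefficients first
    for k in range(1, len(c) + 1):
        kept = np.zeros_like(c)
        kept[order[:k]] = c[order[:k]]
        if np.mean((signal - inverse(kept).real) ** 2) <= target_mse:
            return k
    return len(c)

candidates = {
    "fourier":     (np.fft.fft, np.fft.ifft),
    "raw-samples": (lambda x: x.astype(complex), lambda c: c),  # "no transform" baseline
}
costs = {name: coeffs_needed(*fns) for name, fns in candidates.items()}
print(costs, "-> pick:", min(costs, key=costs.get))

(On truly white noise neither basis wins by much, which is sort of the point: the ex post selection only pays off to the extent the "random" file secretly has structure.)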

Analogously: you need a bunch of goofy epicycles to explain canon HP magic, and especially if you're really trying hard for an exact match maybe you just end up with something extremely unaesthetic (a ton of random bits at the end that you just have to fiat in). In the degenerate case you get Eliezer's "we're selecting a universe that matches what JKR wrote." But if you're willing to take some predictive loss in return for a simple/elegant model, you can get something like "nested nerfing" that's decent at prediction and at elegance. 

HPMOR: The (Probably) Untold Lore
deep · 1mo

And especially, maybe you remember how at the time it didn't seem like a flaw to you. You were not going around being like, “And today I shall be a flawed character.”

A therapist once gave me the insight that character weaknesses are strengths taken too far. Harry's energetic, clever, and knowledgeable; he's inspired and energized by competition; and he can meme people into doing things. But he can also be a know-it-all who assumes first principles & cleverness trump empirics and experience, someone who's unwilling to lose, and someone who's irresponsible or annoying in how he leads others.

HPMOR: The (Probably) Untold Lore
deep · 1mo

I was thinking of the mask of "person who's read more books than Harry ever will."

the void
deep · 1mo

We're launching an "AI psychiatry" team as part of interpretability efforts at Anthropic!  We'll be researching phenomena like model personas, motivations, and situational awareness, and how they lead to spooky/unhinged behaviors. (x)

"making up types of guy" research is a go? 

 They're hiring; you might be great for this.

Why "training against scheming" is hard
deep · 2mo

Nice post! 

Your trader analogy made me think: you'd ideally want a training period with lots of honeypots and surveillance, so that the trader learns that crime doesn't pay.

This suggests some level-3 ideas I didn't see in your post:

  • In addition to improving the reward model, you could also alter the training environment by inserting honeypots.
  • On the surveillance side, maybe it's more effective to concentrate your compute in order to catch sneaky behavior. So, say, after every N times it performs some RL task, you use some weak filter to identify the k most suspicious rollouts and then spend a bunch of compute on running an "auditing" process that checks for reward hacking.  

These are basically the same as Redwood's control ideas, but have added benefits in a training context, since you can catch and correct bad behavior as it's being learned. (Ideally - before the AI is very good at it.)
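
Concretely, the second bullet could look something like this (a rough sketch of my own; cheap_suspicion_score and expensive_audit are hypothetical stand-ins for the weak filter and the costly auditing step):

import heapq
import random

def cheap_suspicion_score(rollout) -> float:
    # weak filter, e.g. a small probe or heuristic; random placeholder here
    return random.random()

def expensive_audit(rollout) -> bool:
    # costly check, e.g. human review or a large-model judge; placeholder
    return False  # True would mean "reward hacking detected"

def audit_every_n(rollouts, n=1000, k=10):
    flagged = []
    for start in range(0, len(rollouts), n):
        batch = rollouts[start:start + n]
        # concentrate compute: only the k most suspicious rollouts get the full audit
        suspects = heapq.nlargest(k, batch, key=cheap_suspicion_score)
        flagged.extend(r for r in suspects if expensive_audit(r))
    return flagged  # candidates to catch and correct before the behavior is reinforced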


The quote below felt potentially related, but I'm not sure I understood it.

adversarially hardened models where the reward model plays an adversarial zero-sum game with a red-teaming model

Could you explain how this works? 

deep's Shortform
deep · 3mo

OK, cool, I think I understand where you're coming from much better now. Seems like we basically agree and were just emphasizing different things in our original comments! 

I'm in violent agreement that there's a missing mood when people say "AIs will follow the law". I think there's something going on where people are like "but liberalism / decentralized competition have worked so well" while ignoring all the constraints on individual actors that make it so. Rule of law, external oversight, difficulty of conspiring with other humans, inefficiencies of gov't that limit its ability to abuse power, etc.

And those constraints might all fall away with the AGI transition. That's for a number of reasons: ownership of AGI could concentrate power; AGI complements existing power bases (e.g. govt has the authority but not a great ability to selectively enforce laws to silence opponents at mass scale); it reduces the need for conspirators. As you note, it brings down others' value as trading partners & collaborators. And takeoff dynamics could make things less like an iterated game and more like a one-shot. *taps head* can't be punished if all your opponents are dead.

(I'm guessing you'd agree with all this, just posting to clarify where my head is at)

deep's Shortform
deep · 3mo

Apologies for the scrappiness of the below -- I wanted to respond but I have mostly a scattering of thoughts rather than solid takes.

I like the intelligence curse piece very much -- it's what I meant to reference when I linked the Turing Trap above, but I couldn't remember the title & Claude pointed me to that piece instead. I agree with everything you're saying directionally! But I feel some difference in emphasis or vibe that I'm curious about. 

-

One response I notice having to your points is: why the focus on value alignment? 

"We could use intent alignment / corrigibility to avoid AIs being problematic due to these factors. But all these issues still remain at higher levels: the human-led organizations in charge of those AIs, the society in which those organizations compete, international relations & great-power competition." 

And conversely: "if we have value alignment, I don't think there's a guarantee that we wind up in a basin of convergent human values, so you still have the problem of - whose interests are the AIs being trained & deployed to serve? Who gets oversight or vetos on that?"

(Using quotes bc these feel more like 'text completions from system 1' than all-things-considered takes from system 2.)

-

Maybe there's a crux here around how much we value the following states: AI-led world vs some-humans-led world vs deep-human-value-aligned world.

I have some feeling that AI-risk discourse has historically had a knee-jerk reaction against considering the following claims, all of which seem to me like plausible and important considerations:

  • It's pretty likely we end up with AIs that care about at least some of human value, e.g. valuing conscious experience. (at least if AGIs resemble current LLMs, which seem to imprint on humans quite a lot.)
  • AI experiences could themselves be deeply morally valuable, even if the AIs aren't very human-aligned. (though you might need them to at minimum care about consciousness, so they don't optimize it away)
  • A some-humans-led world could be at least as bad as an AI-led world, and very plausibly could have negative rather than zero value.

I think this is partly down to founder effects where Eliezer either didn't buy these ideas or didn't want to emphasize them (bc they cut against the framing of "alignment is the key problem for all of humanity to solve together, everything else is squabbling over a poisoned banana").

-

I also notice some internal tension where part of me is like "the AIs don't seem that scary in Noosphere's world". But another part is like "dude, obviously this is an accelerating scenario where AIs gradually eat all of the meaningful parts of society - why isn't that scary?"

I think where this is coming from is that I tend to focus on "transition dynamics" to the AGI future rather than "equilibrium dynamics" of the AGI future. And in particular I think international relations and war are a pretty high risk throughout the AGI transition (up until you get some kind of amazing AI-powered treaty, or one side brutally wins, or maybe you somehow end up in a defensively stable setup, though I don't see it; the returns to scale seem so good).

So maybe I'd say "if you're not talking about a classic AI takeover scenario, and you're imagining a somewhat gradual takeoff,

  1. my attention gets drawn to the ways humans and fundamental competitive dynamics screw things up
  2. the iterative aspect of gradual takeoff means I'm less worried about alignment on its own. (still needs to get solved, but more likely to get solved.)"
the void
deep · 3mo

Thanks, I love the specificity here!

Prompt: if someone wanted to spend some $ and some expert-time to facilitate research on "inventing different types of guys", what would be especially useful to do? I'm not a technical person or a grantmaker myself, but I know a number of both types of people; I could imagine e.g. Longview or FLF or Open Phil being interested in this stuff.

Invoking Cunningham's law, I'll try to give a wrong answer for you or others to correct! ;)

Technical resources:

  • A baseline Constitution, or Constitution-outline-type-thing
    • could start with Anthropic's if known, but ideally this gets iterated on a bunch?
    • nicely structured: organized by sections that describe different types of behavior or personality features, with different examples of those features to choose from (e.g. personality descriptions that differentially weight extensional vs intensional definitions, or point to different examples, or tune agreeableness up and down) -- see the rough sketch after this list
    • Maybe there could be an annotated "living document" describing the current SOTA on Constitution research: "X experiment finds that including Y Constitution feature often leads to Z desideratum in the resulting AI"
  • A library or script for doing RLAIF
    • Ideally: documentation or suggestions for which models to use here. Maybe there's a taste or vibes thing where e.g. Claude 3 is better than 4?  
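
To make the Constitution-outline and RLAIF bullets slightly more concrete, here's a rough sketch of the kind of thing I'm imagining; the section names, principles, and judge_prefers function are made-up illustrations, not Anthropic's actual format or API:

# a Constitution organized by personality feature, with alternative phrasings to pick from
constitution = {
    "agreeableness": [
        "Push back when the user seems factually wrong.",
        "Prefer warmth over bluntness unless asked otherwise.",
    ],
    "curiosity": [
        "Ask a clarifying question before giving a long answer.",
    ],
}

def judge_prefers(principle: str, a: str, b: str) -> bool:
    # placeholder for an AI-feedback call: does response a satisfy the principle
    # better than response b? (a trivial stand-in heuristic here)
    return len(a) >= len(b)

def collect_preferences(prompts, generate):
    # build (prompt, chosen, rejected) pairs for a later preference-tuning step
    pairs = []
    for prompt in prompts:
        a, b = generate(prompt), generate(prompt)
        for principles in constitution.values():
            for principle in principles:
                chosen, rejected = (a, b) if judge_prefers(principle, a, b) else (b, a)
                pairs.append((prompt, chosen, rejected))
    return pairs

The point of structuring it this way is that swapping out a section, or reweighting its examples, would be how you "make a different type of guy" without touching the rest of the pipeline.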

Seeding the community with interesting ideas:

  • Workshop w/ a combo of writers, enthusiasts, AI researchers, philosophers
  • Writing contests: what even kind of relationship could we have with AIs, that current chatbots don't do well? What kind of guy would they ideally be in these different relationships?
  • Goofy idea: get people to post "vision boards" with like, quotes from characters or people they'd like an AI to emulate?    
  • Pay a few people to do fellowships or start research teams working on this stuff?
    • If starting small, this could be a project for MATS fellows
    • If ambitious, this could be a dedicated startup-type org. Maybe a Focused Research Organization, an Astera Institute incubee, etc.

Community resources:

  • A Discord
  • A testing UI that encourages sharing
    • Pretty screenshots (gotta get people excited to work on this!)
    • Convenient button for sharing chat+transcript
  • Easy way to share trained AIs
  • Cloud credits for [some subset of vetted] community participants?
    • I dunno how GPU-hungry fine-tuning is; maybe this cost is huge and then defines/constrains what you can get done, if you want to be fine-tuning near-frontier models. (Maybe this pushes towards the startup model.)
deep's Shortform
deep · 3mo

Alignment by default is a minority opinion. Surveying the wide range of even truly informed opinions, it seems clear to me that we collectively don't know how hard alignment is.

Totally. I think it's "arguable" in the sense of inside-views, not outside-views, if that makes sense? Like: it could be someone's personal vibe that alignment-by-default is >99%. Should they have that as their all-things-considered view? Seems wrong to me; we should be considerably more uncertain here.

But okay, then: we should have some spread of bets across different possible worlds, and put a solid chunk of probability on alignment by default. Even if it's a minority probability, this could matter a lot for what you actually try to do!

For example: I think worlds with short timelines, hard takeoff, and no alignment-by-default are pretty doomed. It's easy to focus on those worlds and feel drawn to plans that are pretty costly and are incongruent with virtue and being good collaborators, e.g. "we should have One Winning AGI Project that's Safe and Smart Enough to Get Things Right", the theory of victory that brought you OpenAI.

My intuition is that worlds with at least one of those variables flipped tend to convergently favor solutions that are more virtuous / collaborative and are more likely to fail gracefully. 

(I'm tired and not maximally articulate rn, but could try to say more if that feels useful.)
