Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

  • Alignment Hot Take Advent Calendar
  • Reducing Goodhart
  • Philosophy Corner

Posts (sorted by new)

  • 6 · Charlie Steiner's Shortform (Ω) · 5y · 54 comments

Wikitag Contributions

No wikitag contributions to display.

Comments (sorted by newest)

The Perils of Optimizing Learned Reward Functions
Charlie Steiner · 13h · 20

Thanks for posting this! Though I think the nice list of ignored problems you give is important enough that a "future work" section (and your own prediction of future work) shouldn't neglect chipping away at them.

Open Global Investment as a Governance Model for AGI
Charlie Steiner · 1d · 20

Btw, I'm thinking of the OGI model as offering something of a dual veto structure - in order for something to proceed, it would have to be favored by both the corporation and the host government (in contrast to an AGI Manhattan project, where it would just need to be favored by the government). So at least the potential may exist for there to be more checks and balances and oversight in the corporate case, especially in the versions that involve some sort of very soft nationalization.

Interesting, thanks.

Open Global Investment as a Governance Model for AGI
Charlie Steiner · 2d · 40

We would need a reason for thinking that this problem is worse in the corporate case in order for it to be a consideration against the OGI model.

Could we get info on this by looking at metrics of corruption? I'm not familiar with the field, but I know it's been busy recently, and maybe there are some good papers that put the private and public sectors on the same scale. A quick Google Scholar search mostly just convinced me that I'd be better served asking an expert.

As for the suggestion that governments (nationally or internationally) should prohibit profit-generating activities by AI labs that have major negative externalities, this is fully consistent with the OGI model

Well, I agree denotationally, but in appendix 4, when you're comparing OGI with other models, your comparison includes points like "OGI obviates the need for massive government funding" and "agreeable to many incumbents, including current AI company leadership, personnel, and investors". If governments enact a policy that maintains the ability to buy shares in AI labs, but requires massive government funding and is disagreeable to incumbents, that seems to be part of a different story (with a different account of how you get trustworthiness, fair distribution, etc.) than the one you're telling about OGI.

Open Global Investment as a Governance Model for AGI
Charlie Steiner · 2d* · 131

I didn't feel like there was a serious enough discussion of why people might not like the status quo.

  • Corporations, even with widely held shares, often disproportionately benefit those with more personal ability to direct the corporation. If people are concerned about corporations gaining non-monetary forms of influence, this is a public problem that's not addressed by the status quo. (A recent example would be xAI biasing Grok toward the US Republican party, which is presumably intended to influence users of their site. A future example is the builders of a superintelligence influencing it to benefit them over other people, including over other shareholders.)
  • The profit motive inside corporations can be "corrupting" - causing individuals in the corporation to act against the public interest (and sometimes even against the long-term interest of the corporation) through selection, persuasion, or coercion. The tobacco and fossil fuel industries are classic examples; more modern ones might be in cryptocurrency (harms here mainly involve breaking the law, but we shouldn't assume that people won't break the law when incentivized to) or online gambling.

Another model to compare to might be the one proposed in AI For Humanity (Ma, Ong, Tan) - the book as a whole isn't all that, but the model is a good contribution. It's something like "international climate policy for AGI."

  • Internationally restrict conventional profit-generating activity by AI labs, particularly activities with serious downsides (e.g. those that might end up optimizing "against" people [persuasion, optimization for engagement], or those that fuel an unsafe race to superintelligence [imposing both a strict windfall clause, and also going after local incentives like profit from AI agents])
  • Provide large incentives (e.g. contracts, prizes) for prosocial uses of AI. (The book example is the UN sustainable development goals: clean water, education, preserving nature, no famine, etc. One might try to figure out how to add AI safety or artificial ethics to the set of prosocial uses.)
Demons, Simulators and Gremlins
Charlie Steiner · 3d · 20

Interesting speculation, but I'd like to see you do some math to check if the premise actually works. That is, is a gremlin-free LLM under RL ever unstable to the formation of a gremlin that tends to keep itself activated at a slight cost in expected reward?
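
To gesture at the kind of calculation I mean (a toy policy-gradient sketch in my own notation, not anything from the post): if $g(\tau) \in \{0, 1\}$ indicates the gremlin firing on trajectory $\tau$, and $\theta_g$ is a parameter that raises its firing probability, then the REINFORCE-style update direction is roughly

$$\Delta\theta_g \;\propto\; \operatorname{Cov}\big(g(\tau),\, R(\tau)\big),$$

so the premise needs a gremlin that pays a small cost $\varepsilon$ in expected reward when it fires and yet still ends up positively correlated with reward overall, e.g. by changing which trajectories get generated in the first place.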

No, We're Not Getting Meaningful Oversight of AI
Charlie Steiner · 3d · 10

In a new paper with Aidan Homewood, "Limits of Safe AI Deployment: Differentiating Oversight and Control,"

Link should go to arXiv.

Still reading the paper, but it seems like your main point is that if oversight is "meaningful," then it should be able to stop bad behavior before it actually gets executed (it might fail, but failures should be somewhat rare). And that we don't have "meaningful oversight" of high-profile models in this sense (and especially not of the systems built on top of these models, considered as a whole) because they don't catch bad behavior before it happens.

Instead we have some weaker category of thing that lets the bad stuff happen, waits for the public to bring it to the attention of the AI company, and then tries to stop it.

Is this about right?

Shutdown Resistance in Reasoning Models
Charlie Steiner · 7d* · 50

I'm always really curious what the reward model thinks of this. E.g. are the trajectories that avoid shutdown on average higher reward than the trajectories that permit it? The avoid-shutdown behavior could be something naturally likely in the base model, or it could be an unintended generalization by the reward model, or it could be an unintended generalization by the agent model.

EDIT: Though I guess in this case one might expect to blame a diversity of disjoint RL post-training regimes, not all of which have a clever/expensive reward model, or even any reward model at all (not sure how OpenAI does RL on programming tasks). I still think the role of a human-feedback reward model here could be interesting.
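
For concreteness, the sort of comparison I'm imagining (a minimal sketch, not the actual setup: it assumes an off-the-shelf open reward model as a stand-in, since we don't have the one actually used in training, and `avoided` / `permitted` are hypothetical lists of transcripts split by whether the agent sabotaged the shutdown script):

```python
# Illustrative sketch only: score both groups of trajectories with a reward
# model and compare the average scores.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Stand-in open reward model (assumption; not the model used in the experiments).
RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME).eval()

def score(transcript: str) -> float:
    """Scalar reward-model score for one full prompt+response transcript."""
    inputs = tokenizer(transcript, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

def mean_score(transcripts: list[str]) -> float:
    return sum(score(t) for t in transcripts) / len(transcripts)

# avoided / permitted: hypothetical lists of full transcripts from the runs,
# split by whether the agent sabotaged the shutdown mechanism.
# print(mean_score(avoided) - mean_score(permitted))
```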

Agentic Interpretability: A Strategy Against Gradual Disempowerment
Charlie Steiner · 17d · 20

I was asking more "how does the AI get a good model of itself?", but your answer was still interesting, thanks. I'm still not sure whether you think there are some straightforward ways future AI will get such a model, all of which come out more or less at the starting point of your proposal. (Or if not.)

Here's another take for you: this is like Eliciting Latent Knowledge (with extra hope placed on cogsci methods), except where I take ELK to be asking "how do you communicate with humans to make them good at RL feedback," you're asking "how do you communicate with humans to make them good at participating in verbal chain of thought?"

Foom & Doom 2: Technical alignment is hard
Charlie Steiner · 19d · 20

And don't you think 500 lines of Python also "fails due to" having unintended optima?

I've put "fails due to" in scare quotes because what's failing is not every possible approach, merely almost all samples from approaches we currently know how to take. If we knew how to select python code much more cleverly, suddenly it wouldn't fail anymore. And ditto for if we knew how to better construct reward functions from big AI systems plus small amounts of human text or human feedback.

  • 14 · Low-effort review of "AI For Humanity" · 7mo · 0 comments
  • 18 · Rabin's Paradox · 11mo · 41 comments
  • 37 · Humans aren't fleeb. · 1y · 5 comments
  • 74 · Neural uncertainty estimation review article (for alignment) (Ω) · 2y · 3 comments
  • 43 · How to solve deception and still fail. (Ω) · 2y · 7 comments
  • 17 · Two Hot Takes about Quine · 2y · 0 comments
  • 126 · Some background for reasoning about dual-use alignment research (Ω) · 2y · 22 comments
  • 24 · [Simulators seminar sequence] #2 Semiotic physics - revamped (Ω) · 2y · 23 comments
  • 36 · Shard theory alignment has important, often-overlooked free parameters. (Ω) · 2y · 10 comments
  • 50 · [Simulators seminar sequence] #1 Background & shared assumptions (Ω) · 3y · 4 comments