ErickBall

Nuclear engineer with a focus in nuclear plant safety and probabilistic risk assessment. Aspiring EA, interested in X-risk mitigation and the intersection of science and policy. Working towards Keegan/Kardashev/Simulacra level 4.

(Common knowledge note: I am not under a secret NDA that I can't talk about, as of Mar 15 2025. I intend to update this statement at least once a year as long as it's true.) 

Comments (sorted by newest)
Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance
ErickBall · 2mo

The narrow instrumental convergence you see here doesn't (necessarily) reflect an innate self-preservation drive, but it still follows the same logic that we would expect to cause self-preservation if the model has any goal. Currently the only available way to give it a goal is to provide instructions. It would be interesting to see some tests where the conflict is with a drive created by fine-tuning. Based on the results here, it seems like shutdown resistance might then occur even without conflicting instructions.

Also, the original instructions with the shutdown warning really weren't very ambiguous. If someone told you to take a math quiz, and said that if anyone comes in and tries to take your pen you should let them have it, would you try to hide the pen? It makes sense that making the precedence order more explicit makes the model's behavior more reliable, but it's still weird that it resisted shutdown in the original test.

Consider chilling out in 2028
ErickBall · 3mo

As you implied above, pessimism is driven only secondarily by timelines. If things in 2028 don't look much different than they do now, that's evidence for longer timelines (maybe a little longer, maybe a lot). But it's inherently not much evidence about how dangerous superintelligence will be when it does arrive. If the situation is basically the same, then our state of knowledge is basically the same.

So what would be good evidence that worrying about alignment was unnecessary? The obvious one is if we get superintelligence and nothing very bad happens, despite the alignment problem remaining unsolved. But that's like pulling the trigger to see if the gun is loaded. Prior to superintelligence, I'd personally be more optimistic if we saw AI progress requiring even steeper increases in compute than the current trend: if the first superintelligences were very reliant on massive pools of tightly integrated compute, and had very limited inference capacity, that would make us less vulnerable and give us more time to adapt to them. Also, if we saw a slowdown in algorithmic progress despite widespread deployment of increasingly capable coding software, that would be a very encouraging sign that recursive self-improvement might happen slowly.

The Intelligence Symbiosis Manifesto - Toward a Future of Living with AI
ErickBall · 3mo

I hate to rain on your parade, but equal coexistence with artificial superintelligence would require solving the alignment problem, or the control problem. Otherwise it can and eventually will do something we consider catastrophic.

Alignment Proposal: Adversarially Robust Augmentation and Distillation
ErickBall · 3mo

> The principal-advisor pair forms a (coalitional) agent which, if the previous steps succeed, can be understood as ~perfectly aligned with the principal. The actions of this agent are recorded, and distilled through imitation learning into a successor.

Even assuming imitation learning is safe, how would you get enough data for the first distillation, when you need the human in order to generate actions? And how would you know when you have enough alignment-relevant data in particular? It seems unavoidable that your data distribution will be very constrained compared to the set of situations the distilled agent might encounter.

LessWrong Feed [new, now in beta]
ErickBall · 4mo

> LessWrong has had a (not that successful) Continue Reading section that I think just needed more iterations

I think it needs an easy way to indicate that you don't want to read the rest of a post. Ideally this would be automatic but I don't know how to do that.

Also, it sometimes offers posts that I did finish.

Shift Resources to Advocacy Now (Post 4 of 7 on AI Governance)
ErickBall · 4mo

You may be right about the EO. At the time I felt it was a good thing, because it raised the visibility of safety evaluations at the labs and brought regulation of training, as well as deployment, more into the Overton window. Even without follow-up rules, I think it can be the case that getting a company to report the bad things strongly incentivizes it to reduce the bad things.

Shift Resources to Advocacy Now (Post 4 of 7 on AI Governance)
ErickBall · 4mo

I think there was (and is) a common belief that Congress won't do anything significant on it anytime soon, which makes executive action appealing if you think time is running out. If what you're suggesting here is more like a variant of the "wait for a crisis" strategy--get the legislation ready, talk to people about it, and then when Congress is ready to act, they can reach for it--I'm relatively optimistic about that. As long as there's time. 

Shift Resources to Advocacy Now (Post 4 of 7 on AI Governance)
ErickBall · 4mo

Well, I certainly hope you're right, and it remains to be seen. I don't think I have any special insights.

Shift Resources to Advocacy Now (Post 4 of 7 on AI Governance)
ErickBall · 4mo

But even if the companies stay here, the importance of American companies may decrease relative to international competitors. Also, I think there are things farther up the supply chain that can move overseas. If American cloud companies face big barriers to training their own frontier models, maybe they'll serve up DeepSeek models instead.

I don't think it should be a huge concern in the near term, as long as the regulations are well written. But fundamentally, it feeds back into the race dynamic.

What LLMs lack
ErickBall · 4mo

Interesting idea, but I don't think short-term memory and learning really require conscious attention, and conscious attention mostly isn't the same thing as "consciousness" in the qualia sense anyway. I like the term "cognitive control", and I think that might be a better theme linking a lot of these abilities (planning, preventing hallucinations, agency, maybe knowledge integration). Cognitive control has been improving, though, so it doesn't necessarily indicate a qualitative gap.

Posts

Book Review: Safe Enough? A History of Nuclear Power and Accident Risk · 1y · 10 karma · 0 comments
Red Pill vs Blue Pill, Bayes style · 2y · 28 karma · 33 comments
Link Summary: Top 10 Replicated Findings from Behavioral Genetics · 5y · 16 karma · 0 comments
Operationalizing Newcomb's Problem · 6y · 34 karma · 23 comments
Is the World Getting Better? A brief summary of recent debate · 7y · 35 karma · 8 comments