Sorry, I mean corrigibility as opposed to CEV, and narrow in the sense that it follows user instructions rather than optimizing for all the user's inferred preferences in every domain, not in the sense of AI that only understands physics. I don't expect unsolvable corrigibility problems at the capability level where AI can 10x the economy under the median default trajectory, rather something like today, where companies undershoot or overshoot on how much the agent tries to be helpful vs corrigible, what's good for business is reasonably aligned, and getting there requires something like 3% of the lab's resources.
Big fan of the "bowing out" react. I did notice a minor UI issue where the voting arrows don't fit in the box:
Inasmuch as we're going for corrigibility, it seems necessary and possible to create an agent that won't self-modify into an agent with complete preferences. Complete preferences is antithetical to narrow agents, and would mean the agent might try to eg solve the Israel-Palestine conflict when all you asked it to do is code you a website. Even if there is a working stop button this is a bad situation. It seems likely we can just train against this sort of thing, though maybe it will require being slightly clever.
As for whether it we can/should have it self-modify to avoid more basic kinds of money-pumps so its preference are at least transitive and independent, this is an empirical question I'm extremely unsure of, but we should aim for the least dangerous agent that gets the desired performance, which should balance the propensity for misaligned actions with the additional capability needed to overcome irrationality.
I was aware there is some line but thought it was "don't ignite a conversation that derails this one" rather than "don't say inaccurate things about groups", which is why I listed lots of groups rather than one and declined to list actively contentious topics like timelines, IABIED reviews, or Matthew Barnett's opinions
This was not my intention, though I could have been more careful. Here are my reasons
It seems like everyone is tired of hearing every other group's opinions about AI. Since like 2005, Eliezer has been hearing people say a superintelligent AI surely won't be clever, and has had enough. The average LW reader is tired of hearing obviously dumb Marc Andreessen accelerationist opinions. The average present harms person wants everyone to stop talking about the unrealistic apocalypse when artists are being replaced by shitty AI art. The average accelerationist wants everyone to stop talking about the unrealistic apocalypse when they could literally cure cancer and save Western civilization. The average NeurIPS author is sad that LLMs have made their expertise in Gaussian kernel wobblification irrelevant. Various subgroups of LW readers are dissatisfied with people who think reward is the optimization target, Eliezer is always right, or discussion is too tribal, or whatever.
With this combined with how Twitter distorts discourse is it any wonder that people need to process things as "oh, that's just another claim by X group, time to dismiss"? Anyway I think naming the groups isn't the problem, and so naming the groups in the post isn't contributing to the problem much. The important thing to address is why people find it advantageous to track these groups.
Nice find; this may be where the real "glitch tokens" work starts.
This seems unlikely for two reasons. First, models can't control the words in their reasoning traces well; even if they're trying to hide them they often blurt out "I must not say that I'm doing X". Second, if the model actually needs to use its CoT rather than narrating something it could have figured out in a single forward pass, it would need to encode misaligned reasoning rather than just omitting it.
So I think CoT will be monitorable until some combination of
All of these could happen by late next year for some usecases, but I don't think it will be emergent property of next generation just having more situational awareness.
I don't think so. While working with Vivek I made a list once of ways agents could be partially consequentialist but concluded that doing game theory type things didn't seem enlightening.
Maybe it's better to think about "agents that are very capable and survive selection processes we put them under" rather than "rational agents" because the latter implies it should be invulnerable to all money-pumps, which is not a property we need or want.
After thinking about it more, it might take more than 3% even if things scale smoothly because I'm not confident corrigibility is only a small fraction of labs' current safety budgets