James_Miller

Comments

I really like this post; it's very clear.  I teach undergraduate game theory, and I'm wondering if you have any practical examples I could use of how you would behave differently under CDT and EDT in a real-world situation.

Yes, important to get the incentives right.  You could set the salary for AI alignment slightly below the worker's market value. Also, I wonder about the relevant elasticity.  How many people have the capacity to get good enough at programming to contribute to capacity research and would also want to game my labor hoarding system because they don't have really good employment options?

I am currently job hunting, trying to get a job in AI Safety but it seems to be quite difficult especially outside of the US, so I am not sure if I will be able to do it.

This has to be taken as a sign that AI alignment research is funding constrained.  At a minimum, technical alignment organizations should engage in massive labor hoarding to prevent the talent from going into capacity research.

"But make no mistake, this is the math that the universe is doing."

"There is no law of the universe that states that tasks must be computable in practical time."

Don't these sentences contradict each other?

Interesting point, and you might be right.  Could get very complicated because ideally an ASI might want to convince other ASIs that it has one utility function, when in fact it has another, and of course all the ASIs might take this into account.

I like the idea of an AI lab workers' union. It might be worth talking to union organizers and AI lab workers to see how practical the idea is, and what steps would have to be taken. Although a danger is that the union would put salaries ahead of existential risk.

Your framework appears to be moral rather than practical.  Right now going on strike would just get you fired, but in a year or two perhaps it could accomplish something. You should consider the marginal impact of the action of a few workers on the likely outcome with AI risk.

I'm at over 50% chance that AI will kill us all. But consider the decision to quit from a consequentialist viewpoint. Most likely the person who replaces you will be almost as good as you at capacity research but care far less than you do about AI existential risk. Humanity, consequently, probably has a better chance if you stay in the lab ready for the day when, hopefully, lots of lab workers try to convince the bosses that now is the time for a pause, or at least that now is the time to shift a lot of resources from capacity to alignment.

The biggest extinction risk from AI comes from instrumental convergence for resource acquisition in which an AI not aligned with human values uses the atoms in our bodies for whatever goals it has.  An advantage of such instrumental convergence is that it would prevent an AI from bothering to impose suffering on us.

Unfortunately, this means that making progress on the instrumental convergence problem increases S-risks.  We get hell if we solve instrumental convergence but not, say, mesa-optimization, and end up with a powerful AGI that cares about our fate but does something to us we consider worse than death.

The Interpretability Paradox in AGI Development

 

The ease or difficulty of interpretability, the ability to understand and analyze the inner workings of AGI, may drastically affect humanity's survival odds. The worst-case scenario might arise if interpretability proves too challenging for humans but not for powerful AGIs.

In a recent podcast, Robin Hanson and I, both academic economists, discussed AGI risks from a social science perspective, focusing on a future with numerous competing AGIs not aligned with human values. Drawing on human analogies, Hanson considered the inherent difficulty of forming a coalition in which a group unites to eliminate others and seize their resources. A crucial coordination challenge is ensuring that, once successful, coalition members won't betray each other, as occurred during the French Revolution.

Consider a human coalition that agrees to kill everyone over 80 to redistribute their resources. Coalition members might promise that this is a one-time event, but such an agreement isn't credible. It would likely be safer for everyone not to violate property right norms for short-term gains.

In a future with numerous unaligned AGIs, some coalition might calculate it would be better off eliminating everyone outside the coalition. However, they would have the same fear that once this process starts, it would be hard to stop. As a result, it might be safer to respect property rights and markets, competing like corporations do.

A key distinction between humans and AGIs could be AGIs' potential for superior coordination. AGIs in a coalition could potentially modify their code so that, after their coalition has violently taken over, no member would ever want to turn on another member. This way, an AGI coalition wouldn’t have to fear that a revolution it starts would ever eat its own. This possibility raises a vital question: will AGIs possess the interpretability required to achieve such feats?

The best case for AGI risk is if we solve interpretability before creating AGIs strong enough to take over. The worst case might be if interpretability remains impossible for us but becomes achievable for powerful AGIs. In this situation, AGIs could form binding coalitions with one another, leaving humans out of the loop, partly because we can't become reliable coalition partners and partly because our biological needs require maintaining Earth in conditions suboptimal for AGI operations. This outcome creates a paradox: if we cannot develop interpretable AGIs, perhaps we should focus on making them exceptionally difficult to interpret, even for themselves. In this case, future powerful AGIs might prevent the creation of interpretable AGIs because such AGIs would have a coordination advantage and thus be a threat to the uninterpretable AGIs.
