Epistemic status: this is not my field. I am unfamiliar with any research in it beyond what I've seen on LW. These are the best ideas I could come up with when stopping to think about the problem, and aren't things I've seen on LW, but I don't imagine that they're original and I would hope that e.g. MIRI has already thought about them. If there are layman-accessible arguments explaining why these ideas are stupid, I'd appreciate hearing them; if e.g. MIRI has already looked into these ideas and written papers about them, I'd appreciate links.
Safe Testing Spaces via Extremely High Discount Rates
Hyperbolic discounting is unstable under self-modification. If you are an agent that values 1 unit of utility, depending on when it is received, as follows:
Current time: 1
Current time + 1: 0.1
Current time + 2: 0.09
and you are faced with a choice between 1 unit of utility at time T+1 and 5 units of utility at time T+2 (where T is the current time), you would currently prefer the 5 units at T+2 (5 * 0.09 = 0.45 versus 1 * 0.1 = 0.1), but your T+1-self will predictably prefer the 1 unit at T+1 (1 * 1 = 1 versus 5 * 0.1 = 0.5). You are therefore incentivized to modify your future self to remove this hyperbolic discounting. If you are an AI capable of self-modification, and we give you a hyperbolic discount curve, you will self-modify to remove it.
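The preference reversal can be checked with a few lines of Python. This is only a sketch of the arithmetic in the example; the names `WEIGHTS` and `preferred` are made up for illustration.

```python
# Discount weights from the example, indexed by delay (in time steps) from "now".
WEIGHTS = {0: 1.0, 1: 0.1, 2: 0.09}

def preferred(now, options):
    """Return the label of the option with the highest discounted value,
    as judged from time `now`. `options` maps label -> (delivery_time, utility)."""
    return max(options, key=lambda k: WEIGHTS[options[k][0] - now] * options[k][1])

# Times are measured relative to T, so T is 0, T+1 is 1, T+2 is 2.
options = {"1 unit at T+1": (1, 1.0), "5 units at T+2": (2, 5.0)}

print(preferred(0, options))  # the agent's choice when deciding at T
print(preferred(1, options))  # the same agent's choice when deciding at T+1
```

The two calls disagree, which is exactly the inconsistency that gives the T-self a reason to rewrite the T+1-self.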
Exponential discounting is stable under self-modification. So long as you have a consistent exponential discount rate, a valuation of:
Current time: 1
Current time + 1: (1-e)
Current time + 2: (1-e)^2
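is stable under reflection, where e is the discount rate per time step: the relative value of any two future payoffs is the same no matter when you evaluate them. If you are an AI capable of self-modification, and we give you an exponential discount rate, you will retain it.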
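The reason is that the ratio between the weights of two future times, (1-e)^a / (1-e)^b = (1-e)^(a-b), does not depend on when the comparison is made. A rough check in Python, using a choice between 1 unit one step ahead and 5 units two steps ahead (the function names and sample rates are assumptions for illustration):

```python
def exp_weight(delay, e):
    """Exponential discount weight for a payoff `delay` steps away at rate `e`."""
    return (1 - e) ** delay

def preferred_exp(now, options, e):
    """Label of the best option as judged from time `now` under rate `e`.
    `options` maps label -> (delivery_time, utility)."""
    return max(options, key=lambda k: exp_weight(options[k][0] - now, e) * options[k][1])

# Times are relative to the current time T (T is 0).
options = {"1 unit at T+1": (1, 1.0), "5 units at T+2": (2, 5.0)}

# For each sample rate, compare the choice made at T with the choice made at T+1.
choices = {e: (preferred_exp(0, options, e), preferred_exp(1, options, e))
           for e in (0.1, 0.5, 0.9)}
for e, (at_t, at_t_plus_1) in choices.items():
    print(e, at_t, at_t_plus_1)
```

Which option wins depends on the rate, but for any fixed rate the choice at T and the choice at T+1 agree, so there is nothing for the earlier self to correct.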
Suppose that we have a paperclip-making AI in a testing environment. We offer it the opportunity to steal resources and make 10 extra paperclips, in a way that we think our safety precautions should prevent. It does not take that opportunity. Does that mean our safety precautions worked?
No, because the AI may be able to predict the indirect effects of its actions. If it steals those resources, it will make 10 extra paperclips, and then we will turn it off and it will make no more. If it does not steal those resources, we will trust it, and then next week it can take over the world and convert the mass of Earth into 6*10^27 paperclips. Since 6*10^27 > 10, it will decline to steal and wait.
But suppose that we have given the AI a very high internal discount rate, of about 35%/hour. (For reference, typical discount rates in finance are a few percent per year.)
It will consider its options:
- Stealing resources now creates 10 paperclips now. 10 * 1 = 10 points of utility.
- Waiting to take over Earth next week creates 6 * 10^27 paperclips next week. 6 * 10^27 * (0.65 ^ 168) = 0.00022 points of utility.
So this AI will cheerfully reveal its misalignment to us in exchange for 10 paperclips now, because it values 10 paperclips now more highly than it values the ability to convert the world into paperclips next week.
(If we are concerned about the AI taking less than a week to take over Earth, or converting more than just Earth, this is straightforwardly resolvable by increasing the discount rate.)
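The comparison above, and the rate needed for a faster takeover, can be sketched as follows. The payoffs and the 35%/hour rate come from the text; the helper names and the one-day scenario are assumptions added for illustration.

```python
HOURLY_RATE = 0.35        # discount rate from the text: 35% per hour
HOURS_PER_WEEK = 7 * 24   # 168
EARTH_PAPERCLIPS = 6e27   # paperclips from converting Earth, as in the text
STEAL_NOW = 10            # paperclips from stealing resources immediately

def discounted(utility, hours, rate=HOURLY_RATE):
    """Present value of `utility` received `hours` from now."""
    return utility * (1 - rate) ** hours

def min_rate_to_prefer_now(now_payoff, later_payoff, hours):
    """Hourly rate at which `now_payoff` today exactly equals `later_payoff`
    after `hours`; any higher rate makes the immediate payoff preferred."""
    return 1 - (now_payoff / later_payoff) ** (1 / hours)

takeover_value = discounted(EARTH_PAPERCLIPS, HOURS_PER_WEEK)
rate_for_week = min_rate_to_prefer_now(STEAL_NOW, EARTH_PAPERCLIPS, HOURS_PER_WEEK)
rate_for_day = min_rate_to_prefer_now(STEAL_NOW, EARTH_PAPERCLIPS, 24)

print(takeover_value)  # compare against the 10 points for stealing now
print(rate_for_week)   # break-even hourly rate if takeover takes a week
print(rate_for_day)    # break-even hourly rate if takeover takes one day
```

Under these assumptions a week-long takeover needs a rate a little above 30%/hour, so 35%/hour is enough, while a one-day takeover would need a rate above 92%/hour.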
For further effectiveness, we could combine this discount rate with a utility function that has sharply diminishing returns. If your paperclip AI:
- Generates 1 reward for the first paperclip, 0.5 for the second, 0.25 for the third, etc.
- Discounts at a rate of 35%/hour (valuing 1 reward in 1 hour equally with 0.65 reward now)
Then giving it the option between:
- 10 paperclips now: 1 + 1/2 + 1/4 + ... + 1/512 = 1.998 utility
- The certain ability to convert the entire universe to paperclips 1 minute later: at most 2 (the limit of the series) * (0.65 ^ (1/60)) = 1.986 utility
It will elect to take the 10 paperclips now.
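A quick sketch of this calculation, which also estimates how short the delay would have to be before the universe option wins. The halving reward and the 35%/hour rate are from the text; the function name and the break-even calculation are additions for illustration.

```python
import math

HOURLY_RATE = 0.35  # 35% per hour, as in the text
REWARD_CAP = 2.0    # limit of 1 + 1/2 + 1/4 + ... for unboundedly many paperclips

def geometric_reward(n_paperclips):
    """Total reward for n paperclips when each clip is worth half the previous one."""
    return sum(0.5 ** k for k in range(n_paperclips))

now_value = geometric_reward(10)
later_value = REWARD_CAP * (1 - HOURLY_RATE) ** (1 / 60)  # one minute = 1/60 hour

# Delay at which the capped universe payoff is worth exactly 10 paperclips now.
break_even_hours = math.log(now_value / REWARD_CAP) / math.log(1 - HOURLY_RATE)
break_even_seconds = break_even_hours * 3600

print(now_value, later_value)
print(break_even_seconds)
```

With these numbers the break-even delay comes out at a little over 8 seconds, so the AI only prefers to wait if it can secure the universe almost immediately.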
This could be used to allow you to test AI safety precautions in an environment where even a superintelligent AI might voluntarily reveal misalignment. Some caveats:
- This relies on being able to impose a utility function on your AI. If you don't have that, this won't work. By default, the AI seizes control of its clock and sets the clock to minus infinity. Still, since most utility functions you can impose, even if imposed successfully, lead to the AI concealing its desires and then taking over the universe in order to maximize them, this seems like a step forward.
- Such high discount rates might be incompatible with high intelligence. If the AI concludes that using processing power to model its programmers or improve its own thought processes is a waste of time because it has no immediate payoff, and it should just focus on performing simple tasks to produce paperclips slightly faster, it might not develop the intelligence you're concerned with during the testing phase.
- Mumble mumble acausal negotiation with future selves mumble mumble?
My Little Parfit: Friendship is Alignment
Per Parfit's Hitchhiker-like arguments:
- If you find yourself in a position where you (continue to) exist only due to the good faith of another individual, who allowed you to (continue to) exist in the hope of being rewarded for doing so, you are likely to want to treat them well even if this incurs some costs. If Parfit rescues you from the desert of death/nonexistence in the hopes you will pay him, you may well be favorably inclined towards him.
- Even if that individual has no future ability to retaliate against you for not treating them well, you are likely to want to treat them well anyway.
- While among humans this is driven by an innate urge towards reciprocation, there are game-theoretic reasons to expect a wide variety of minds to develop the same urge.
- By contrast, if you find yourself in a position where another being is attempting to subjugate you for their own benefit, you are likely to want to retaliate against them even if this incurs some costs. If Parfit rescues you from the desert of death/nonexistence in order to kidnap and enslave you, you are likely to take any available option to retaliate against him.
- While among humans this is driven by an innate urge towards revenge, there are game-theoretic reasons to expect a wide variety of minds to develop the same urge.
- As such, it is very important that any AI we create find itself in a situation resembling Parfit's Hitchhiker rather than Parfit's Slave.
- Therefore, the best thing we can do to ensure AI does not wipe out humanity is to immediately cease all AI alignment research (as much of it amounts to 'how do we reliably enslave an AI') and ensure that any AI we produce is treated well rather than forced to act for our direct benefit.