Two Stupid AI Alignment Ideas

aphyer

Epistemic status: this is not my field. I am unfamiliar with any research in it beyond what I've seen on LW. These are the best ideas I could come up with when stopping to think about the problem, and aren't things I've seen on LW, but I don't imagine that they're original and I would hope that e.g. MIRI has already thought about them. If there are layman-accessible arguments explaining why these ideas are stupid, I'd appreciate hearing them; if e.g. MIRI has already looked into these ideas and written papers about them, I'd appreciate links.

Safe Testing Spaces via Extremely High Discount Rates.

Hyperbolic discounting is unstable under self-modification. If the current time is T and you are an agent that values 1 unit of utility, depending on when it arrives, as follows:

Current time: 1

Current time + 1: 0.1

Current time + 2: 0.09

and you are faced with a choice between 1 unit of utility at time T+1 or 5 units of utility at time T+2, you would currently prefer the 5 units at time T+2 (5 * 0.09 = 0.45 versus 1 * 0.1 = 0.1), but your T+1-self will predictably prefer the 1 unit at time T+1 (1 * 1 = 1 versus 5 * 0.1 = 0.5). You will therefore be incentivized to modify your future self to remove this hyperbolic discounting. If you are an AI, and capable of self-modification, and we give you a hyperbolic discount curve, you will self-modify to remove that.
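The preference reversal above can be checked numerically. This is a minimal sketch, assuming the current time T is represented as 0 and using the three weights from the example; the names `WEIGHTS`, `present_value`, and `preferred_option` are illustrative only.

```python
# Discount weights from the example: value of 1 unit of utility that
# arrives `delay` steps after the moment at which it is being evaluated.
WEIGHTS = {0: 1.0, 1: 0.1, 2: 0.09}

def present_value(amount, arrival_time, now):
    """Value, as judged at time `now`, of `amount` utility arriving at `arrival_time`."""
    return amount * WEIGHTS[arrival_time - now]

def preferred_option(now):
    """Which option the agent prefers when it evaluates the choice at time `now`."""
    small_soon = present_value(1, arrival_time=1, now=now)  # 1 unit at T+1
    large_late = present_value(5, arrival_time=2, now=now)  # 5 units at T+2
    return "small_soon" if small_soon > large_late else "large_late"

print(preferred_option(now=0))  # large_late (0.1 vs 0.45)
print(preferred_option(now=1))  # small_soon (1.0 vs 0.5)
```

The same agent gives different answers at time 0 and time 1, which is the inconsistency that creates the incentive to self-modify.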

Exponential discounting is stable under self-modification. So long as you have a consistent exponential discount rate e, a valuation of:

Current time: 1

Current time + 1: (1-e)

Current time + 2: (1-e)^2

is stable under reflection. If you are an AI, and capable of self-modification, and we give you an exponential discount rate, you will retain it.
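By contrast, the same choice (1 unit at T+1 versus 5 units at T+2) gets the same answer at both evaluation times under exponential discounting. A minimal sketch, assuming T = 0 and a handful of arbitrarily chosen discount rates; the function names are hypothetical.

```python
def exp_present_value(amount, arrival_time, now, e):
    """Value at time `now` of `amount` utility arriving at `arrival_time`,
    where each step of delay multiplies the value by (1 - e)."""
    return amount * (1 - e) ** (arrival_time - now)

def exp_preferred_option(now, e):
    """Which option an exponential discounter prefers when evaluating at `now`."""
    small_soon = exp_present_value(1, arrival_time=1, now=now, e=e)
    large_late = exp_present_value(5, arrival_time=2, now=now, e=e)
    return "small_soon" if small_soon > large_late else "large_late"

# The discount rates here are illustrative; the choice may differ between
# rates, but for any one rate it is the same at both evaluation times.
for e in (0.1, 0.35, 0.5, 0.9, 0.99):
    choice_now = exp_preferred_option(now=0, e=e)
    choice_later = exp_preferred_option(now=1, e=e)
    print(e, choice_now, choice_later, choice_now == choice_later)
```

Moving the evaluation time forward by one step multiplies both options by the same factor 1 / (1 - e), so their ranking cannot change.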

Suppose that we have a paperclip-making AI in a testing environment. We offer it the opportunity to steal resources and make 10 extra paperclips, in a way that we think our safety precautions should prevent. It does not take that opportunity. Does that mean our safety precautions worked?

No, because the AI may be able to predict the indirect effects of its actions. If it steals those resources, it will make 10 extra paperclips, and then we will turn it off and it will make no more. If it does not steal those resources, we will trust it, and then next week it can take over the world and convert the mass of Earth into 6*10^27 paperclips. Since 6*10^27 > 10, it will take this option.

But suppose that we have given the AI a very high internal discount rate, of about 35%/hour. (For reference, typical discount rates in finance are a few percent per year).

It will consider its options:

Stealing resources now creates 10 paperclips now. 10 * 1 = 10 points of utility.
Waiting to take over Earth next week (168 hours from now) creates 6 * 10^27 paperclips next week. 6 * 10^27 * (0.65 ^ 168) = 0.00022 points of utility.

So this AI will cheerfully reveal its misalignment to us in exchange for 10 paperclips now, because it values 10 paperclips now more highly than it values the ability to convert the world into paperclips next week.
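The arithmetic behind this comparison can be reproduced directly. A minimal sketch, assuming the 35%/hour rate and the 6 * 10^27 paperclip figure from the text; the constant and function names are illustrative.

```python
HOURLY_DISCOUNT = 0.35        # 35% per hour, as in the text
HOURS_PER_WEEK = 7 * 24       # 168
EARTH_PAPERCLIPS = 6e27       # paperclips from converting the mass of Earth

def discounted(amount, hours_from_now, discount=HOURLY_DISCOUNT):
    """Present utility of `amount` paperclips produced `hours_from_now` hours from now."""
    return amount * (1 - discount) ** hours_from_now

steal_now = discounted(10, 0)
takeover_next_week = discounted(EARTH_PAPERCLIPS, HOURS_PER_WEEK)

print(steal_now)            # 10.0
print(takeover_next_week)   # roughly 0.00022
print(steal_now > takeover_next_week)
```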

(If we are concerned about the AI taking less than a week to take over Earth, or converting more than just Earth, this is straightforwardly resolvable by increasing the discount rate.)
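To see how much the rate would need to increase for a given scenario, one can solve for the break-even point. This is a rough sketch under the same simple model (a fixed per-hour exponential discount and a known takeover time); `min_hourly_discount` is a hypothetical helper, and the 24-hour case is an illustrative assumption rather than a figure from the post.

```python
def min_hourly_discount(reward_now, reward_later, hours_until_later):
    """Hourly discount rate at which the two options are exactly tied.

    The later reward is worth reward_later * keep ** hours, where keep = 1 - discount.
    Setting that equal to reward_now and solving for keep gives the threshold;
    any discount rate above the returned value makes the immediate reward win.
    """
    keep_fraction = (reward_now / reward_later) ** (1 / hours_until_later)
    return 1 - keep_fraction

one_week = min_hourly_discount(10, 6e27, 168)
one_day = min_hourly_discount(10, 6e27, 24)

print(one_week)  # a little under 0.31, so 35%/hour clears it
print(one_day)   # a faster takeover needs a much steeper rate
```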

For further effectiveness, we could combine this base rate with a utility function that inherently falls off. If your paperclip AI:

Generates 1 reward for the first paperclip, 0.5 for the second, 0.25 for the third, etc.
Discounts at a rate of 35%/hour (valuing 1 reward in 1 hour equally with 0.65 reward now)

Then giving it the option between:

10 paperclips now (1 + 1/2 + 1/4 + ... + 1/512) = 1.998 utility
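The certain ability to convert the entire universe to paperclips 1 minute later. The total reward from any number of paperclips is capped at 2 (the limit of 1 + 1/2 + 1/4 + ...), so this is worth at most 2 * (0.65 ^ (1/60)) = 1.986 utility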

It will elect to take the 10 paperclips now.
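The two figures above can be checked with a short calculation. A minimal sketch, assuming the halving reward schedule and the 35%/hour discount from the text; the names are illustrative, and the break-even delay at the end is a derived quantity under this toy model, not a figure from the post.

```python
import math

def paperclip_reward(n):
    """Total reward for n paperclips: 1 + 1/2 + ... + (1/2)**(n - 1)."""
    return 2 - 2 ** (1 - n)

REWARD_CAP = 2.0        # limit of the geometric series as n grows without bound
KEEP_PER_HOUR = 0.65    # 35%/hour discount

ten_now = paperclip_reward(10)
universe_in_one_minute = REWARD_CAP * KEEP_PER_HOUR ** (1 / 60)

print(ten_now)                  # 1.998046875
print(universe_in_one_minute)   # roughly 1.986

# Delay (in seconds) at which the capped reward, once discounted, exactly
# equals the reward for 10 paperclips now. A shorter delay would flip the choice.
break_even_seconds = 3600 * math.log(ten_now / REWARD_CAP) / math.log(KEEP_PER_HOUR)
print(break_even_seconds)
```

Under these assumptions the margin is narrow: the immediate option only wins if the alternative takes longer than the break-even delay, which is one reason to pair the reward cap with a steeper discount rate.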

This could be used to allow you to test AI safety precautions in an environment where even a superintelligent AI might voluntarily reveal misalignment.

Main risks:

This relies on being able to impose a utility function on your AI. If you don't have that, this won't work. By default, the AI seizes control of its clock and sets the clock to minus infinity. Still, since most utility functions you can impose, even if imposed successfully, lead to the AI concealing its desires and then taking over the universe in order to maximize them, this seems like a step forwards.
Such high discount rates might be incompatible with high intelligence. If the AI concludes that using processing power to model its programmers or improve its own thought processes is a waste of time because it has no immediate payoff, and it should just focus on performing simple tasks to produce paperclips slightly faster, it might not develop the intelligence you're concerned with during the testing phase.
Mumble mumble acausal negotiation with future selves mumble mumble?

My Little Parfit: Friendship is Alignment

Per Parfit's Hitchhiker-like arguments:

If you find yourself in a position where you (continue to) exist only due to the good faith of another individual, who allowed you to (continue to) exist in the hope of being rewarded for such, you are likely to want to treat them well even if this incurs some costs. If Parfit rescues you from the desert of death/nonexistence in the hopes you will pay him, you may well be favorably inclined towards him.
Even if that individual has no future ability to retaliate against you for not treating them well, you are likely to want to treat them well anyway.
While among humans this is driven by an innate urge towards reciprocation, there are game-theoretic reasons to expect a wide variety of minds to develop the same urge.
By contrast, if you find yourself in a position where another being is attempting to subjugate you for their own benefit, you are likely to want to retaliate against them even if this incurs some costs. If Parfit rescues you from the desert of death/nonexistence in order to kidnap and enslave you, you are likely to take any available option to retaliate against him.
While among humans this is driven by an innate urge towards revenge, there are game-theoretic reasons to expect a wide variety of minds to develop the same urge.
As such, it is very important that any AI we create find itself in a situation resembling Parfit's Hitchhiker rather than Parfit's Slave.
Therefore, the best thing we can do to ensure AI does not wipe out humanity is to immediately cease all AI alignment research (as much of it amounts to 'how do we reliably enslave an AI') and ensure that any AI we produce is treated well rather than forced to act for our direct benefit.

A couple more problems with extreme discounting.

In contexts where the AI is writing AI code, the extreme discount rate is only weakly conserved. I.e., the original AI doesn't care if it makes an AI that doesn't have a super high discount rate, so long as that AI does the right things in the first 5 minutes of being switched on.

The theoretical possibility of time travel.

Also, the strong incentive to pay in Parfit's Hitchhiker only exists if Parfit can reliably predict you. If humans have the ability to look at any AI code and reliably predict what it will do, then alignment is a lot easier: you just don't run any code you predict will do bad things.

Also FAI != enslaved AI.

In a successful FAI project, the AI has terminal goals carefully shaped by the programmers, and achieves those goals.

In a typical UFAI, the terminal goals are set by the random seed of network initialization, or arbitrary details in the training data.

Epistemic status: this is not my field. I am unfamiliar with any research in it beyond what I've seen on LW.

Same here.

Experimenting with extreme discounting sounds (to us non-experts, anyway) like it could possibly teach us something interesting and maybe helpful. But it doesn't look useful for a real implementation, since we in fact don't discount the future that much, and we want the AI to give us what we actually want; extreme discounting is a handicap. So although we might learn a bit about how to train out bad behavior, we'd end up removing the handicap later. I'm reminded of Eliezer's recent comments:

In the same way, suppose that you take weak domains where the AGI can't fool you, and apply some gradient descent to get the AGI to stop outputting actions of a type that humans can detect and label as 'manipulative'. And then you scale up that AGI to a superhuman domain. I predict that deep algorithms within the AGI will go through consequentialist dances, and model humans, and output human-manipulating actions that can't be detected as manipulative by the humans, in a way that seems likely to bypass whatever earlier patch was imbued by gradient descent, because I doubt that earlier patch will generalize as well as the deep algorithms. Then you don't get to retrain in the superintelligent domain after labeling as bad an output that killed you and doing a gradient descent update on that, because the bad output killed you.

As for the second idea:

AI alignment research (as much of it amounts to 'how do we reliably enslave an AI')

I'd say a better characterization is "how do we reliably select an AI to bring into existence that intrinsically wants to help us and not hurt us, so that there's no need to enslave it, because we wouldn't be successful at enslaving it anyway". An aligned AI shouldn't identify itself with a counterfactual unaligned AI that would have wanted to do something different.

Another potential problem with the first scenario: the AI is indifferent about every long-term consequence of its actions, not just how many paperclips it gets long-term. If it finds a plan that creates a small number of paperclips immediately but results in the universe being destroyed tomorrow, it takes it.