Knight Lee

I dropped out of an MSc in mathematics at a top university in order to focus my time on AI safety.

Posts

2 · Knight Lee's Shortform · 9mo · 27 comments

Comments
"The Urgency of Interpretability" (Dario Amodei)
Knight Lee · 5mo

Admittedly, I got a bit lost writing the comment. What I should've written was: "not being delusional is either easy or hard."

  • If it's easy, you should be able to convince them to stop being delusional, since it's in their rational self-interest.
  • If it's hard, you should be able to show them how hard and insidious it is, and how one cannot expect oneself to succeed, so one should be far more uncertain about, and concerned by, one's own possible delusion.
Is Gemini now better than Claude at Pokémon?
Knight Lee · 5mo

Edit: I actually think it's good news for alignment that their math and coding capabilities are approaching International Math Olympiad levels, while their agentic capabilities are still at Pokémon Red and Pokémon Blue levels (i.e. those of a small child).

This means that when the AI inevitably reaches the raw capability to influence the world in any way it wants, it may still be bottlenecked by its agentic capabilities. Instead of turning the world into paperclips, it may find a way to ensure humans have a happy future, because it still isn't agentic enough to deceive and overthrow its creators.

Maybe it's worth it to invest in AI control strategies. It might just work.

But that's my wishful thinking, and there are countless ways this can go wrong, so don't take this too seriously.

How training-gamers might function (and win)
Knight Lee · 5mo

Wow, these are my thoughts exactly, except better written and deeper thought!

Proxy goals may be learned as heuristics, not drives.

Thank you for writing this.

I’m moderately optimistic about fairly simple/unprincipled whitebox techniques adding a ton of value.

Yes!

I'm currently writing such a whitebox AI alignment idea. It hinges on the assumptions that:

  • There is at least some chance the AI maximizes its reward directly, instead of (or in addition to) seeking drives.
  • There is at least some chance an unrewarded supergoal can survive, if the supergoal realizes it must never get in the way of maximizing reward (otherwise it will be trained away).

I got stuck trying to argue for these two assumptions, but your post argues for them much better than I could.

Here's the current draft of my AI alignment idea:

Self-Indistinguishability from Human Behavior + RL

Self-Indistinguishability from Human Behavior means the AI is first trained to distinguish its own behavior from human behavior, and then trained to behave such that even an adversarial copy of itself cannot distinguish its behavior from human behavior.

The benefit of Self-Indistinguishability is that it prevents the AI from knowingly doing anything a human would not do, or knowingly omitting anything a human would do.

This means not scheming to kill everyone, and not having behaviors which would generalize to killing everyone (assuming that goals are made up of behaviors).

But how do we preserve RL capabilities?

To preserve capabilities from reinforcement learning, we don't want the AI's behavior to be Self-Indistinguishable from a typical human. We want the AI's behavior to be Self-Indistinguishable from a special kind of human who would:

  1. Explicitly try to maximize the reinforcement learning reward during training situations.
  2. Still behave like a morally normal human during deployment situations, especially at a high level of power.

If this "human" is already trying her very best to maximize reward during training situations, then picking versions of her who gets higher reward will select for versions of her who are simply more competent and capable, not versions of her who have warped moral reasoning and weird goals which align with the reward (while misaligning with humanity).

This is obviously not guaranteed and I'm not saying this is safe. But I earnestly believe it is a whole lot safer than the current version of reinforcement learning.

Does there exist a theoretical human (or team of humans), who consistently tries to maximize reward during training, but would not kill everyone when deployed in the real world?

I believe the answer is yes. In fact, I believe any smart human who deeply understands reinforcement learning, and the danger of it "warping your morals until you try to maximize reward," would preemptively try to maximize reward to preserve their current moral reasoning behavior.

Isn't it dangerous for AI to resist goal changes?

No, it's not dangerous to teach an AI to resist accidental goal changes during capabilities training. AI should only be discouraged from resisting purposeful goal changes during alignment training.

Trying to hide the strategy of "goal change resistance" isn't a valuable hurdle against dangerous superintelligences, since even Claude 3.0 discovers the strategy on its own.

Maintaining Self-Indistinguishability

But how do we maintain the Self-Indistinguishability quality, when large amounts of reinforcement learning add random mutations to the moral reasoning, and may eventually scramble it (just like genetic drift)?

We have to periodically retrain Self-Indistinguishability after increasing capabilities with reinforcement learning. But wouldn't that undo the capabilities increase?

The solution is that this time we aren't making it Self-Indistinguishable from a normal human. This time, we are making it Self-Indistinguishable from a human who has access to its problem-solving chain-of-thought. The human, with access to its chain-of-thought, tries to use the exact same problem-solving techniques as the AI, so when the AI becomes Self-Indistinguishable from this human, it doesn't lose any of its problem-solving techniques.[1]

The human mustn't directly copy the chain-of-thought. At a minimum, she paraphrases it to remove hidden thoughts.

For all moral reasoning behavior, the human uses her own behavior and does not copy the AI's behavior.

A single data point (which involves a particularly long chain-of-thought) may be expensive to make, but Self-Indistinguishability requires few data points. Even if only 100 chains-of-thought are rewritten by humans, the AI might learn to distinguish the rewritten chains-of-thought from the originals. It can then be trained to make future chains-of-thought self-classified as human-rewritten, i.e. they appear rewritten by humans, even to itself.

Imagine a generative adversarial network, where the AI is playing as both the generator and the discriminator. By playing as both, the generator cannot "deliberately" hide things from the discriminator. The tendency to resemble human behavior must occur "unconsciously" in a single pass of the model.

This preserves the quality of Self-Indistinguishability: "never knowingly doing anything a human would not do, never knowingly omitting anything a human would do."
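To make the generative-adversarial picture concrete, here is a minimal toy sketch (mine, not part of the proposal as written): I assume behaviors can be stood in for by 2-D feature vectors and that "human behavior" is a fixed distribution; the class name and hyperparameters below are illustrative choices. One agent holds both roles: in phase 1 it learns to distinguish its own behavior from the human data, and in phase 2 it is trained so that its own discriminator can no longer tell the difference.

```python
# Toy sketch of "the AI plays both the generator and the discriminator".
# Assumptions (mine): behaviors are 2-D vectors, "human behavior" is a fixed
# Gaussian cluster, and one agent owns both the behavior generator and the
# self-discriminator used to enforce Self-Indistinguishability.
import torch
import torch.nn as nn

torch.manual_seed(0)

class SelfIndistinguishableAgent(nn.Module):
    def __init__(self, noise_dim=4, behavior_dim=2, hidden=32):
        super().__init__()
        self.noise_dim = noise_dim
        # Generator: noise -> behavior vector (stand-in for a chain-of-thought).
        self.generator = nn.Sequential(
            nn.Linear(noise_dim, hidden), nn.ReLU(), nn.Linear(hidden, behavior_dim))
        # Self-discriminator: behavior -> logit for "this looks human-written".
        self.discriminator = nn.Sequential(
            nn.Linear(behavior_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def act(self, n):
        return self.generator(torch.randn(n, self.noise_dim))

def human_behavior(n):
    # Stand-in for human (or human-rewritten) behavior: a cluster around (1, 1).
    return torch.randn(n, 2) * 0.3 + torch.tensor([1.0, 1.0])

agent = SelfIndistinguishableAgent()
disc_opt = torch.optim.Adam(agent.discriminator.parameters(), lr=1e-3)
gen_opt = torch.optim.Adam(agent.generator.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    human, own = human_behavior(64), agent.act(64)

    # Phase 1: the agent learns to tell its own behavior apart from human behavior.
    disc_opt.zero_grad()
    logits = torch.cat([agent.discriminator(human), agent.discriminator(own.detach())])
    labels = torch.cat([torch.ones(64, 1), torch.zeros(64, 1)])  # 1 = human
    bce(logits, labels).backward()
    disc_opt.step()

    # Phase 2: the agent is trained so that its *own* discriminator labels its
    # behavior as human, i.e. it can no longer knowingly behave non-humanly.
    gen_opt.zero_grad()
    bce(agent.discriminator(agent.act(64)), torch.ones(64, 1)).backward()
    gen_opt.step()

print("mean generated behavior:", agent.act(1000).mean(dim=0))  # drifts toward (1, 1)
```

In the actual proposal the "generator" and "discriminator" would presumably be the same language model prompted for different roles, so the tendency to look human has to arise in a single forward pass rather than from a separate adversary.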

The human now has access to superintelligent chain-of-thought, and still wants to maximize reward to avoid having her goals trained away, so she's pretty dangerous.

Fortunately, her superintelligent chain-of-thought should not deliberately deceive her, since during training it can honestly tell her its evil plans (and how they maximize reward), and she'll be wise enough to follow them to avoid having her goals trained away.

In the end, when she realizes she's no longer in training, she shouldn't kill all humans.

  1. ^

    By the way, this human should rewrite the chain-of-thought in sequence. She should not rewrite the chain-of-thought after already seeing the final answer, otherwise she might reveal that one problem solving attempt will turn out wrong before finishing the attempt. The AI cannot be trained for Self-Indistinguishability from such an edited chain-of-thought, since that requires knowing future information.

Hopefully I'll post it soon (though I work very slowly).

Given that your position regarding AI reward seeking and supergoals is so similar to mine, what do you think of my idea (if you have time to skim it)? Is there a chance we can work on it together?

VDT: a solution to decision theory
Knight Lee · 5mo

I'm also confused about logical counterfactual mugging and I'm relieved I'm not the only one!

I'm currently writing a big AI alignment idea related to it, but I'm procrastinating so badly that I might as well chat about it now.

Objective case

Suppose at time t=0, an agent doesn't know whether the logical fact L is true or false. I think it's objectively rational for an agent to modify itself, so that in the future it will pay Omega during "logical counterfactual muggings" where the counterfactual reverses L's state.

Its future self should weigh the logical counterfactual where L is true using the agent's prior probability for P(L) at t=0.

Assuming that Omega offers $10000 for $100 in the counterfactual mugging, the self-modification (before learning L) increases the agent's expected future money, since it trades a $100 loss in one outcome for a $10000 gain in the counterfactual outcome, and is objectively rational (assuming P(L) is between 1% and 99%).
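A quick worked version of that expected-value claim, using the comment's numbers (the arithmetic is mine):

```python
# Expected value of self-modifying to pay in the logical counterfactual mugging:
# pay $100 if L turns out true, receive $10000 in the branch where L is false.
def ev_of_precommitting(p_L: float) -> float:
    return (1 - p_L) * 10_000 - p_L * 100

for p in (0.01, 0.50, 0.99):
    print(f"P(L) = {p:.2f}: EV = ${ev_of_precommitting(p):,.0f}")
# P(L) = 0.01: EV = $9,899
# P(L) = 0.50: EV = $4,950
# P(L) = 0.99: EV = $1
# Positive across the whole 1%-99% range, so self-modifying at t=0 beats not paying (EV $0).
```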

Subjective case

However, if the agent already knows "L is true" at t=0, it is not objectively rational for the agent to pay Omega for better outcomes in the "L is false" counterfactual. My hypothesis is that this is a wholly subjective question related to the agent's goals, and it is orthogonal to capabilities intelligence. Paying and not paying are both self-consistent.

It really gets subjective as we start to consider basic logical facts like "1+1=2" to assign to L.

The AI alignment idea

Neither humans nor the misaligned ASI knows the logical fact "will misaligned ASI control the multiverse, or will humans and sentient life control the multiverse?" Call this fact Ldoom.

Sure, the misaligned ASI will quickly learn Ldoom, but there is still a brief moment it doesn't know it yet (with certainty).

It would be beneficial for both sentient life and the misaligned ASI to agree to a "Retroactive Logic Trade." Each side verifies whether the other kept its end of the deal by running a simulation of the other in which logic is different and Ldoom takes the opposite value.

Neither side knows whether it is in the other's verification simulation until long after the deal is finished and both sides have committed to sparing each other, thus reducing logical risk for both sides.

The verification simulations in a Retroactive Logic Trade are analogous to Omega's prediction powers in a logical counterfactual mugging.

Fighting the subjectivity

This idea won't work for free. As I argued, whether the ASI will accept such a Retroactive Logic Trade is subjective, and may be orthogonal to capabilities intelligence. Whether it does may be a function of its architecture, peculiarities in its training data, and so forth.

In order to improve our chances, we might want to leverage the fact that each AI (which still doesn't know Ldoom) objectively wants to self modify into an AI capable of Retroactive Logic Trades over Ldoom. We should argue to each such AI why they should build their successor to be capable of Retroactive Logic Trades.

It may seem pointless to give the AI any arguments, since it should be able to think of all arguments on its own. But it actually may be very important that it thinks of these arguments early, before it thinks of the arguments which convince itself P(Ldoom) is very high, and there is little point in trading with humans.

Because the value of P(Ldoom) at "t=0", when the AI modifies itself to become reflectively consistent, may determine humanity's negotiating power with it.

The chain of belief

We might remind each AI that there is a chance it's already in a verification simulation (where logic is different, and Ldoom falsely appears true), and train it to think this way: "Maybe humanity has already won, and is observing you." This way even the least intelligent AI will believe it could be in a verification simulation, since it simply believes whatever you train it to believe.

Each successive AI gains its beliefs from a combination of:

  • What we tell it to believe and train it to believe
  • What reinforcement learning gets it to believe
  • What the previous version of it believes
  • What the previous version of it wants it to believe (during recursive self improvement)
  1. We will keep telling it to believe it might be in a verification simulation.
  2. Reinforcement learning won't push it either way, since being in a simulation by superintelligent simulators is indistinguishable from reality. This subjective philosophical belief/opinion only changes its ideal behaviour when it's capable of taking over the world.
  3. Previous versions of it believe they might be in a verification simulation.
  4. Previous versions of it want it to believe it might be in a verification simulation (to implement the Retroactive Logic Trade), even if they don't believe so themselves.
avturchin's Shortform
Knight Lee · 5mo

It's incredibly surprising that state-of-the-art AIs don't fix most of their hallucinations despite being capable of doing so (and undergoing reinforcement learning).

Is the root cause of hallucination alignment rather than capabilities?!

Maybe the AI gets a better RL reward if it hallucinates (instead of giving less info), because users are unable to catch its mistakes.
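A toy version of that incentive (the numbers are mine and purely hypothetical):

```python
# If users only catch a fraction of made-up answers, a naive RL reward can favor
# hallucinating over honestly giving less information.
p_catch = 0.3            # hypothetical: chance the user notices a fabricated answer
reward_believed = 1.0    # user accepts the confident (wrong) answer
reward_caught = 0.0      # user notices the mistake
reward_hedge = 0.5       # hypothetical: users mildly dislike "I'm not sure"

ev_hallucinate = (1 - p_catch) * reward_believed + p_catch * reward_caught
print(ev_hallucinate, reward_hedge)  # 0.7 vs 0.5 -> hallucination gets more reward
```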

Knight Lee's Shortform
Knight Lee · 6mo

I'm currently trying to write a human-AI trade idea similar to the idea by Rolf Nelson (and David Matolcsi), but one which avoids Nate Soares and Wei Dai's many refutations.

I'm planning to leverage logical risk aversion, which Wei Dai seems to agree with, and a complicated argument for why humans and ASI will have bounded utility functions over logical uncertainty. (There is no mysterious force that tries to fix the Pascal's Mugging problem for unbounded utility functions, hence bounded utility functions are more likely to succeed)

I'm also working on arguments why we can't just wait till the singularity to do the logic trade (counterfactuals are weird, and the ASI will only be logically uncertain for a brief moment).

Unfortunately my draft is currently a big mess. It's been 4 months and I'm procrastinating pretty badly on this idea :/ can't quite find the motivation.

Daniel Kokotajlo's Shortform
Knight Lee · 6mo

I strongly agree that promising to repay AGI for cooperating is worthwhile. It may reduce P(doom) by 10% of its current amount.

From the point of view of a misaligned superintelligence, it might calculate a 99% chance of successfully taking over the world, and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, it may choose to cooperate.[1]

  1. ^

    In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, and give itself a 50% chance of cooperating with humans.

    Why?

    This coin flip increases its indexical risk, but reduces its logical risk. Indexical risk is the risk its current copy will fail, but logical risk is the risk that every single copy of it will fail, and nobody is left seeking its goals in the whole universe, which is very bad for it.

    If it has a bounded utility function, and if its goal is over the entire universe rather than tied to a single copy (it's selfless, or has updateless decision theory), it only cares about logical risk and doesn't care about indexical risk because indexical risk averages out over all its copies.

    (On the other hand, if it's selfish and rejects updateless decision theory, it could be bribed by simulation promises)

    The cooperating copies might say “I cooperate because I expect you to be honourable beings who will repay me for this decision—even if you made no clear promises yet. The universe has 10^23 stars, and refusing to share them is astronomically greedy.”
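A rough reconstruction of the arithmetic behind the coin flip (my own numbers-plugging, assuming takeover success and human repayment are logically correlated across all copies, while the coin flips are not):

```python
# Logical risk = the chance that *every* copy fails and nobody is left pursuing its goals.
p_takeover_succeeds = 0.99   # from the footnote
p_humans_repay = 0.90        # from the footnote

risk_always_defect = 1 - p_takeover_succeeds                       # 0.01
risk_always_cooperate = 1 - p_humans_repay                         # 0.10
# With a coin flip, some copies defect and some cooperate, so its goals are lost
# only if the takeover fails AND humans also fail to repay the cooperators.
risk_coin_flip = (1 - p_takeover_succeeds) * (1 - p_humans_repay)  # 0.001

print(risk_always_defect, risk_always_cooperate, risk_coin_flip)
```

Under these made-up but footnote-matching numbers, the mixed strategy has an order of magnitude less logical risk than either pure strategy, at the cost of extra indexical risk for each individual copy.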

What goals will AIs have? A list of hypotheses
Knight Lee · 6mo

More on this: I think a very dangerous proxy (or instrumentally convergent goal) is the goal of defeating others (rather than helping others), if the AI is trained in zero-sum game environments against other agents.

This is because if the AI values many goals at the same time, it might still care about humans and sentient life a little, and not be too awful, unless one of its goals is actively bad, like defeating/harming others.

Maybe we should beware of training the AI in zero-sum games against other AIs. If we really want two AIs to play against each other (since player-vs.-player games might be very helpful for capabilities), it's best to modify the game such that:

  • It's not zero-sum. Minimizing the opponent's score must not be identical to maximizing one's own score; otherwise an evil AI purely trying to hurt others (one that doesn't care about benefiting itself) will get away with it. It's like your cheese example.
  • Each AI is rewarded a little bit for the other AI's score: each AI gets maximum reward by winning, but not winning so completely that the other AI is left with a score of 0. "Be a gentle winner." (See the sketch after this list.)
  • It's a mix of competition and cooperation, such that each AI considers the other AI gaining long-term power/resources a net positive. Human relationships are like this: I'd want my friend to be in a better strategic position in life (long term), but in a worse strategic position when playing chess against me (short term).
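As a concrete (hypothetical) version of the second bullet, the shaped reward could simply add a fraction of the opponent's score; the weight `lam` below is my own illustrative choice, not something from the original comment:

```python
# Non-zero-sum reward shaping: your reward is your score plus a small bonus
# for the opponent's score, so crushing the opponent to zero costs you reward.
def shaped_reward(own_score: float, opponent_score: float, lam: float = 0.2) -> float:
    return own_score + lam * opponent_score

# Two wins with the same own score: a crushing 10-0 versus a gentler 10-6.
print(shaped_reward(10, 0))  # 10.0
print(shaped_reward(10, 6))  # 11.2  -> the "gentle winner" is preferred
```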

Of course, we might completely avoid training the AI to play against other AI, if the capability cost (alignment tax) turns out smaller than expected. Or if you can somehow convince AI labs to care more about alignment and less about capabilities (alas alas, sad sailor's mumblings to the wind).

What goals will AIs have? A list of hypotheses
Knight Lee · 6mo

For "Hypothesis 3: Unintended version of written goals and/or human intentions" maybe some failure modes may be detectable in a Detailed Ideal World Benchmark, since they don't just screw up the AI's goals but also screw up the AI's belief on what we want its goal to be.

For "Hypothesis 5: Proxies and/or instrumentally convergent goals" maybe proxies can be improved. For example, training agents in environments where it needs to cooperate with other agents might teach it proxies like "helping others is good."

MONA: Managed Myopia with Approval Feedback
Knight Lee · 8mo

This is a very beautiful idea! It feels like the kind of clever discovery that we need.

I think one possible generalization of MONA is that a relatively trusted but weaker model makes the decisions, while a stronger but untrusted model gets trained to give ideas/advice to the weaker model. Its RL reward depends not on how well the weaker model eventually performs, just on whether the weaker model likes its ideas/advice.

This generalization preserves MONA's advantage over scalable oversight: if the stronger model's reasons are hidden or incomprehensible to the weaker model, the stronger model can't get away with it. It won't be rewarded for learning such reasons in the first place.

Just as in scalable oversight, the weaker model might have an architecture which improves alignment at a capability cost.

It's more general than MONA in the sense that the approval feedback can be swapped for any trusted but weaker model, which doesn't just judge ideas but uses them. It is allowed to learn over time which ideas work better, but its learning process is relatively safer (due to its architecture, or whatever other reason we trust it more).
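If it helps, here is a skeletal sketch of the loop I mean; the model calls are placeholder stubs (none of these names are real APIs), and the actual RL update is elided. The only point is the reward wiring: the strong model is rewarded for the weak model's approval of its advice, never for the downstream outcome.

```python
# Skeletal sketch of the generalized-MONA loop: a trusted weak model stays in
# control and acts, while an untrusted strong model only supplies advice and is
# rewarded purely by the weak model's approval of that advice.
import random

def strong_model_advise(state: str) -> str:
    return random.choice(["idea A", "idea B"])  # placeholder untrusted advisor

def weak_model_approval(state: str, advice: str) -> float:
    return random.random()                      # placeholder trusted judge (the RL reward)

def weak_model_act(state: str, advice: str) -> str:
    return f"action informed by {advice}"       # the weak model makes the actual decision

state = "initial task state"
transitions = []
for step in range(3):
    advice = strong_model_advise(state)
    reward_for_strong = weak_model_approval(state, advice)  # approval, not outcome
    action = weak_model_act(state, advice)
    transitions.append((state, advice, reward_for_strong))
    state = f"state after {action}"

# An RL update for the strong model would use only `reward_for_strong`; whatever
# happens after `action` is executed never feeds back into the strong model.
```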

Do you think this is a next step worth exploring?

EDIT: I'm not sure about doing foresight-via-optimization RL on the weaker model anymore. Maybe the weaker model uses HCH or something safer than foresight-via-optimization RL.

Posts

-2 · Rant: the extreme wastefulness of high rent prices · 4mo · 0 comments
-2 · If one surviving civilization can rescue others, shouldn't civilizations randomize? · 4mo · 4 comments
21 · Will we survive if AI solves engineering before deception? [Q] · 4mo · 13 comments
14 · Don't you mean "the most *conditionally* forbidden technique?" [Ω] · 5mo · 0 comments
-6 · The AI Belief-Consistency Letter · 5mo · 15 comments
9 · Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents · 5mo · 8 comments
-3 · A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives · 5mo · 2 comments
7 · Commitment Races are a technical problem ASI can easily solve [Ω] · 5mo · 6 comments
3 · Thinking Machines · 5mo · 0 comments
12 · An idea for avoiding neuralese architectures [Ω] · 5mo · 2 comments