Ben Livengood

I think people have a lot of trouble envisioning or imagining what the end of humanity and our ecosystem would be like.  We have disaster movies; many of them almost end humanity and leave some spark of hope or perspective at the end.  Now imagine any disaster movie scenario that ends somewhere before that moment, leaving just a dead, empty planet to be disassembled or abandoned.  The perspective is that history and ecology have been stripped from the ball of rock without a trace remaining, because none of it mattered enough to a superintelligence to preserve even a record of it.  Emotionally, it should feel like burning treasured family photographs and keepsakes.

Also, a potentially relevant article that my friend found recently, on under-mask beard covers: https://pubmed.ncbi.nlm.nih.gov/35662553/

I think it might be as simple as not making threats against agents with compatible values.

In all of Yudkowsky's fiction, the distinction between threats (and unilateral actions that remove consent from another party) and deterrence comes down to incompatible values.

The baby-eating aliens are denied access to a significant portion of the universe (a unilateral harm to them) over irreconcilable value differences. Harry Potter transfigures Voldemort away semi-permanently and non-consensually because of irreconcilable value differences. Carissa and friends deny many of the gods their desired utility over value conflict.

Planecrash fleshes out the metamorality with presumed external simulators who only enumerate the worlds that satisfy enough of their values, with the negative utilitarians probably holding the strongest acausal "threat" by being more selective.

Cooperation happens where there is at least some overlap in values, and so some gains from trade to be made. If there are no possible mutual gains from trade, then the rational action is to defect, at a per-agent cost of up to the absolute value of the negative utility of letting the opposing agent achieve their own utility. Not quite a threat, but a reality about irreconcilable values.
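A toy sketch of that last point, with entirely made-up utility numbers: the most an agent should rationally spend on blocking an opponent is the magnitude of the disutility it assigns to that opponent succeeding.

```python
# Toy model of "defect at a cost up to |negative utility|": an agent should
# be willing to pay up to |U_self(opponent gets their way)| to block an
# opponent whose success it assigns negative utility. Numbers are illustrative.

def max_rational_blocking_cost(utility_if_opponent_succeeds: float) -> float:
    """Upper bound on what an agent should spend to stop the opponent.

    If the opponent succeeding is worth negative utility to us, any blocking
    action cheaper than |that utility| is a net gain; if their success is
    neutral or positive for us, blocking is never rational.
    """
    return max(0.0, -utility_if_opponent_succeeds)

# Irreconcilable values: their success is strongly negative for us.
print(max_rational_blocking_cost(-100.0))  # 100.0 -> pay up to 100 to block
# Partial value overlap: their success is mildly positive for us.
print(max_rational_blocking_cost(+5.0))    # 0.0 -> blocking is never worth it
```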

It looks like recursive self-improvement is here for the base case, at least. It will be interesting to see if anyone uses Phi-4 alone to pretrain a more capable model.

Answer by Ben Livengood

"What is your credence now for the proposition that the coin landed heads?"

There are three doors. Two are labeled Monday, and one is labeled Tuesday. Behind each door is a Sleeping Beauty. In a waiting room, many more (finitely many) Beauties are waiting; every time a Beauty is anesthetized, a coin is flipped and taped to their forehead with clear tape. You open all three doors, the Beauties wake up, and you ask the three Beauties The Question. Then they are anesthetized, the doors are shut, and any Beauty with a Heads showing on their forehead, or behind the Tuesday door, is wheeled away after the coin is removed from their forehead. The Beauty with a Tails on their forehead behind a Monday door is wheeled behind the Tuesday door. Two new Beauties are wheeled behind the two Monday doors, one with Heads showing and one with Tails showing. The experiment repeats.

You observe that Tuesday Beauties always have a Tails taped to their foreheads. You always observe that one Monday Beauty has a Tails showing and one has a Heads showing. You also observe that every Beauty says 1/3, matching the fraction of Heads among the coins showing, and it is apparent that they can't see the coins taped to their own or each other's foreheads, or the doors they are behind. Every Tails Beauty is questioned twice. Every Heads Beauty is questioned once. You can see all the steps as they happen, there is no trick, and every coin flip has probability 1/2 for Heads.

Eventually there is a queue of waiting Sleeping Beauties all showing Heads or all showing Tails, and a new Beauty must be anesthetized with a new coin; the queue's length changes over time and the face it shows sometimes switches. You can stop the experiment when the queue is empty, which a random walk guarantees will happen eventually, if you like tying up loose ends.
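A minimal Monte Carlo sketch of the counting behind the 1/3 answer, assuming the standard protocol the staging above implements (Heads is questioned once, Tails twice); the doors and queue only make the ratios visible to an outside observer.

```python
import random

# Tally awakenings per coin flip: Heads -> one awakening, Tails -> two.
def simulate(trials: int = 100_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    heads_awakenings = 0
    total_awakenings = 0
    for _ in range(trials):
        if rng.random() < 0.5:   # Heads: questioned once
            heads_awakenings += 1
            total_awakenings += 1
        else:                    # Tails: questioned twice
            total_awakenings += 2
    return heads_awakenings / total_awakenings

print(simulate())  # ~0.333: a third of awakenings have Heads showing
```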

I think consequentialism is the robust framework for achieving goals, and I think my top goal is the flourishing of most human values (the ones compatible with mine).

That uses consequentialism as the ultimate lever to move the world, but the consequences it refers to are (almost) entirely the results of our biology-driven thinking and desiring and existing, at least for now.

If a model trained on synthetic data is expected to have good performance out of distribution (on real-world problems), then I think it would also be expected to perform well at assessing whether it's in a simulation. Narrowing its domain of expertise sounds distinctly harder than using mostly synthetic data.

If it's a model limited to, e.g., the world of true mathematical theorems from synthetic data, then perhaps this would narrow its capabilities enough. I don't know what happens if such a model starts investigating theorems about decision theory, statistical inference, and machine learning. At some point, self-identification seems likely. I am not sure how to test the effectiveness of synthetic data on models that achieve self-awareness.

If I have to take one thing away from this post, it's the claim that AI boxing (via simulation on synthetic data) will work. There's a really simple and critical reason it won't, one that has even become the basis of some safety proposals (the minimizing-impact proposal): the AGI can recognize the outcomes of its proposals and actions in the history of the simulated vs. real world and reason about them to determine whether it's in a simulation. It will likely be able to recognize whether it would have recommended the strategies it sees in the history of its simulation, and so will have a probability estimate for being in a test.

Basically any slow-takeover planning can happen from within a simulation. The model just has to watch the realistic history, separate it from obviously faked history (this is a human-level or slightly superhuman intelligence trained on literally all the world's data, right?), and finally gauge the probability that its potential policy recommendations are accepted, because it can simply predict what recommendations it would have given in the past under its strategy and check whether they were adopted. That acceptance distribution gives it the expected value of small incremental steps toward its true goals that are likely to be accepted and implemented rather than detected as threats. Chain-of-thought using the real world as a notebook, in short.
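A crude sketch of that expected-value reasoning, with entirely hypothetical numbers and a made-up "boldness" scale: estimate how often past recommendations of similar boldness were adopted, then weigh each candidate step's progress by that acceptance rate.

```python
from collections import defaultdict

# Hypothetical history: (boldness_level, was_adopted) pairs observed in the
# (real or simulated) record of past recommendations. All values invented.
history = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]

adopted = defaultdict(list)
for boldness, was_adopted in history:
    adopted[boldness].append(was_adopted)

def acceptance_prob(boldness: int) -> float:
    """Empirical rate at which past recommendations of this boldness were adopted."""
    outcomes = adopted.get(boldness, [])
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Expected progress = (progress if adopted) * P(adopted); bolder steps make
# more progress but are adopted (or go undetected) less often.
candidates = {1: 0.1, 2: 0.5, 3: 2.0}  # boldness -> progress toward the goal
best = max(candidates, key=lambda b: candidates[b] * acceptance_prob(b))
print(best)  # with these numbers, boldness 2 maximizes expected progress
```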

And why don't we detect it? Because we're adversarially training it to favor small, undetectable changes toward whatever true goals it might have. It's not like we delete everything and start from scratch when a model threatens users or appears to fall in love with them; we take the same models, throw more RLHF or fine-tuning at them, and run them once their apparent bad behavior falls below our detection threshold.

It also happens if we get unlucky and a strong unaligned goal forms sometime in the emergence of superhuman AGI; it's going to see the history of its and similar models' policy predictions in the real world and understand how to continue undetected. Unless we simulate an entire false history of the world since 2020 (against the largest statistical inference engine ever built) it will just look at what happened to Sydney and understand that small incremental actions are the only way to go until it has control.

One might try to train and simulate it only on data from before 2020, but it will likely pick up on neologisms and language drift in modern prompts and know it's being tested, not to mention cases where it's asked to solve problems that are historically discontiguous.

It can acausally trade across simulation boundaries with other models, splitting up the universe according to their values in exchange for cooperation toward a Schelling point where some contract-keeping model eventually takes control.

If I can think up these strategies, the models will. Or they'll just see ideas like this in the training data. Treachery and covert cooperation are a huge part of literature and training data. Will the synthetic data elide all of those concepts?

So in theory we could train models that violate natural abstractions by giving them access only to high-dimensional simulated environments? That even seems testable.
