RomanS

Comments (sorted by newest)
Why there is still one instance of Eliezer Yudkowsky?
RomanS · 10d · 10

My primary research work is in the field of sideloading itself. The digital guy helps with these tasks:

  • Generate and criticize ideas. For example, he helped design the current multi-agent architecture on which he is now running.
  • Gently moderate our research group chat.
  • Work as a test subject.
  • Do some data prep tasks (e.g. producing compressed versions of the corpus).

I expect a much more interesting list in the field of alignment research, including quite practical things (e.g. a team of digital Eliezers interrogating each checkpoint during training, to reduce the risk of catastrophic surprises). Of course, this is not a replacement for proper alignment, but it may buy some time.

Judging by our experiments, Gemini 2.5 Pro is the first model that can (sometimes) simulate a particular human mind (i.e. think like you, not just answer in your approximate style). So this is a partial answer to my original question: the tech is only 6 months old. Most people don't know that such a thing is possible at all, and those who do know are only in the early stages of their experimental work.

BTW, your 2020 work investigating the ability of GPT-3 to write in the style of famous authors is what made me aware of such a possibility.

Reply
Why there is still one instance of Eliezer Yudkowsky?
RomanS · 14d · 20

I agree with you on most points. 

BTW, I'm running a digital replica of myself. The setup is as follows:

  • Gemini 2.5 as the model
  • A script splits the text corpus (8 MB) into chunks small enough for Gemini to digest (within its ~1M-token context window), and then (with some scaffolding) returns a unified answer.
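
The actual implementation is in the repo linked below; as a rough illustration only (this is not the repo's real code: ask_model is a placeholder for whatever LLM client you use, and the chunk size is an arbitrary assumption), the chunk-and-merge loop works roughly like this:

```python
# Hypothetical sketch of the chunk-and-merge scaffolding, not the actual
# telegram_sideload code. ask_model() stands in for any LLM client call
# (e.g. Gemini 2.5 Pro); the chunk size is an arbitrary assumption.

def split_corpus(text: str, max_chars: int = 500_000) -> list[str]:
    """Split the corpus into pieces small enough for the model's context window."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def ask_model(prompt: str) -> str:
    """Placeholder: plug in your LLM client of choice here."""
    raise NotImplementedError

def answer_as_sideload(corpus: str, question: str) -> str:
    # Ask the question against each chunk of the corpus separately...
    partial_answers = [
        ask_model(f"Corpus excerpt:\n{chunk}\n\n"
                  f"Answer the question as the person described by this corpus:\n{question}")
        for chunk in split_corpus(corpus)
    ]
    # ...then merge the per-chunk drafts into one unified reply.
    merge_prompt = ("Combine these drafts into one consistent answer:\n\n"
                    + "\n---\n".join(partial_answers))
    return ask_model(merge_prompt)
```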

The answers are surprisingly good at times, reflecting non-trivial aspects of my mind. 

From many experiments with the digital-me, I conclude that a similar setup for Yudkowsky could be useful even with today's models (assuming large-enough budgets).

There will be no genius-level insights in 2025, but he could automate a lot of routine alignment work, like evaluating models. 

Given that models may become dramatically smarter in 2026-2027, the digital Yudkowsky may become dramatically more useful too.

I open-sourced the code:

https://github.com/Sideloading-Research/telegram_sideload

Reply
How AI Takeover Might Happen in 2 Years
RomanS · 8mo · 11

May I nominate my "sufficiently paranoid paperclip maximizer"? 

Reply
What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it?
RomanS · 10mo · 10

An aligned AGI created by the Taliban may behave very differently from an aligned AGI created by socialites of Berkeley, California.

Moreover, a sufficiently advanced aligned AGI may decide that even Berkeley socialites are wrong about a lot of things, if it actually wants to help humanity.

Reply
Do simulacra dream of digital sheep?
RomanS · 1y* · 8-6

I would argue that for all practical purposes it doesn't matter whether computational functionalism is right or wrong.

  1. Pursuing mind uploading is a good idea regardless of that, as it has benefits not related to perfectly recreating someone in silico (e.g. advancing neuroscience).
  2. If the digital version of RomanS is good enough[1], it will indeed be me, even if the digital version is running on a billiard-ball computer (the internal workings of which are completely different from the workings of the brain).

The second part is the most controversial, but it's actually easy to prove:

  1. Memorize a long sequence of numbers, and write down a hash sum of it.
  2. Ensure no one saw the sequence of numbers except you.
  3. Do an honest mind uploading (no attempts to extract the numbers from your brain, etc.).
  4. Observe how the digital version correctly recalls the numbers, as checked by the hash sum.
  5. According to the experiment's conditions, only you know the numbers. Therefore, the digital version is you. 
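
As an illustration of the commit-and-verify step (a minimal sketch under my own assumptions: SHA-256 as the hash and a simple string encoding, neither of which is essential to the thought experiment):

```python
import hashlib

def commit(secret: str) -> str:
    """Before uploading: publish only the hash of the memorized sequence."""
    return hashlib.sha256(secret.encode()).hexdigest()

def verify(recalled: str, published_hash: str) -> bool:
    """After uploading: check the digital version's recollection against the published hash."""
    return hashlib.sha256(recalled.encode()).hexdigest() == published_hash

# Usage: the original memorizes e.g. "3141592653589793" and publishes commit(...);
# later, the digital version's recalled string is checked with verify(...).
```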

And if it's you, then it has all the same important properties as you, including "consciousness" (if such a thing exists).

There are some scenarios where such a setup may fail (e.g. some important property of the mind is somehow generated by one special neuron that must be perfectly recreated), but I can't think of any such scenario that is realistic.


My general position on the topic can be called "black-box CF" (in addition to your practical and theoretical CF). I would summarize it as follows:

  1. The human brain is designed by biological evolution to survive and procreate. You're a survival-procreation machine. As there is clearly no God, there is also no soul, or any other magic inside your brain. The difference between you and another such machine is the training set you observed during your lifetime (and some minor architecture differences caused by genetic differences).
  2. The concepts of consciousness, qualia, etc. are too loosely defined to be of any use (including use in any reasonable discussion). Just discard them as yet another phlogiston.
  3. Thus, the task of "transferring consciousness to a machine" is ill-defined. Instead, mind uploading is about building a digital machine that behaves like you. It doesn't matter what is happening inside, as long as the digital version passes a sufficiently good battery of behavioral tests.
  4. There is a gradual distinction between you and not-you. E.g. an atom-level sim may be 99% you, a neuron-level sim 90% you, and an LLM trained on your texts 80% you. The measure is the percentage of the same answers given to a sufficiently long and diverse questionnaire (see the sketch at the end of this comment).
  5. A human mind in its fullness can be recreated in silico even by an LLM (trained on sufficient amounts of the mind's inputs and outputs). Perfectly recreating the brain (or even recreating it at all) would be nice, but it is unnecessary for mind uploading. Just build an AI that is sufficiently similar to you in behavior.
[1] As defined by a reasonable set of quality and similarity criteria, beforehand.
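
A minimal sketch of the similarity measure mentioned in point 4 (my own illustration; the questionnaire format and the exact-match criterion are assumptions, not an established metric):

```python
def similarity(original_answers: list[str], sim_answers: list[str]) -> float:
    """Fraction of questionnaire items on which the sim answers exactly like the original."""
    assert len(original_answers) == len(sim_answers)
    matches = sum(a == b for a, b in zip(original_answers, sim_answers))
    return matches / len(original_answers)

# E.g. matching the original on 80 of 100 answers makes the sim "80% you" by this measure.
```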

Reply
Three main arguments that AI will save humans and one meta-argument
RomanS · 1y · 10

Worth noting that this argument doesn't necessarily require humans to be:

  • numerous
  • animated (i.e. not frozen in a cryonics process)
  • acting in the real world (i.e. not confined to a "Matrix").

Thus, the AI may decide to keep only a selection of humans, confined in a virtual world, with the rest being frozen.

Moreover, even a perfect Friendly AI may decide to do the same, to prevent further human deaths.

In general, an evil AI may choose strategies that allow her to plausibly deny her non-Friendliness.

"Thousands of humans die every day. Thus, I froze the entire humanity to prevent that, until I solve their mortality. The fact that they now can't switch me off is just a nice bonus".

Reply
Medical Roundup #1
RomanS · 2y* · 1-5

> They don't think about the impact on the lives of ordinary people. They don't do trade-offs or think about cost-benefit. They care only about lives saved, to which they attach infinite value.

Not sure about infinite, but assigning a massive value to lives saved should be the way to go. Say, $10 billion per life. 

Imagine a society where people actually strongly care about lives saved, and it is reflected in the governmental policies. In such a society, cryonics and life extension technologies would be much more developed.

On a related note, "S-risk" is mostly a harmful idea that should be discarded from ethical calculations. One should not value any amount of suffering over saved lives. 
 

Reply
Resurrecting all humans ever lived as a technical problem
RomanS · 2y · 10

I think we should not assume that our current understanding of physics is complete, as there are known gaps and major contradictions, and no unifying theory yet.

Thus, there is some chance that future discoveries will allow us to do things that are currently considered impossible. Not only computationally impossible but also physically impossible (just as it was "physically impossible" to slow down time, until we discovered relativity).

The hypothetical future capabilities may or may not include ways to retrieve arbitrary information from the distant past (like the chronoscope of science fiction), and may or may not include ways to do astronomical-scale calculations in finite time (like enumerating 10^10^10 possible minds). 

While I agree with you that much of the described speculation is currently not in the realm of possibility, I think it's worth exploring. Perhaps there is a chance.

Reply
$300 for the best sci-fi prompt: the results
RomanS · 2y · 10

BTW, I added a note to the comment with the story stating that the story is released into the Public Domain, without any restrictions on its distribution, modification, etc.

Please feel free to build upon this remarkable story, if you wish. 

Reply
$300 for the best sci-fi prompt: the results
RomanS · 2y · 12

I would suggest trying the jumping boy story (#7 in this comment). It's the first AI-written story I've ever encountered that feels like it was written by a competent writer.

As usual, it contains some overused phrasings, but the overall quality is surprisingly good. 

Reply
Posts (score · title · age · comments)

-9 · Why there is still one instance of Eliezer Yudkowsky? [Question] · 14d · 8 comments
4 · Something to fight for · 8mo · 0 comments
14 · What are some scenarios where an aligned AGI actually helps humanity, but many/most people don't like it? [Question] · 10mo · 6 comments
13 · Sideloading: creating a model of a person via LLM with very large prompt · 1y · 4 comments
6 · The boat · 1y · 0 comments
16 · $300 for the best sci-fi prompt: the results · 2y · 19 comments
3 · Old man's story · 2y · 0 comments
12 · This anime storyboard doesn't exist: a graphic novel written and illustrated by GPT4 · 2y · 7 comments
11 · Russian parliamentarian: let's ban personal computers and the Internet · 2y · 6 comments
22 · GPT4 is capable of writing decent long-form science fiction (with the right prompts) · 2y · 28 comments