Here's a simple argument that simulating universes based on Turing machine number can give manipulated results.
Say we lived in a universe much like this one, except that:
So we send a rocket to the center of the universe and leave a plaque saying "the answer to all your questions is Spongebob". Now any aliens in other universes that simulate our universe and ask "what's in the center of that universe at time step 10^1000?" will see the plaque, search elsewhere in our universe for the reference, and watch Spongebob. We've managed to get aliens outside our universe to watch Spongebob.
I feel like it would be helpful to speak precisely about the universal prior. Here's my understanding.
It's a partial probability distribution over bit strings. It gives a non-zero probability to every bit string, but these probabilities add up to strictly less than 1. It's defined as follows:
That is, describe Turing machines by a binary code
, and assign each one a probability based on the length of its code, such that those probabilities add up to exactly 1. Then magically run all Turing machines "to completion". For those that halt leaving a bitstring
on their tape, attribute the probability of that Turing machine to that bitstring
. Now we have a probability distribution over bitstring
s, though the probabilities add up to less than one because not all of the Turing machines halted.
You cannot compute this probability distribution, but you can compute lower bounds on the probabilities of its bitstrings. (The Nth lower bound is the probability distribution you get from running the first N TMs for N steps.)
Call a TM that halts poisoned if its output is determined as follows:
If we approximate the universal prior, the probability contribution of poisoned TMs will be precisely zero, because we don't have nearly enough compute to simulate a poisoned TM until it halts. However, if there's an outer universe with dramatically more compute available, and it's approximating the universal prior using enough computational power to actually run the poisoned TMs, they'll effect the probability distribution of the bitstrings, making bitstrings with the messages they choose to leave behind more likely.
So I think Paul's right, actually (not what I expected when I started writing this). If you approximate the UP well enough, the distribution you see will have been manipulated.
The feedback is from Lean, which can validate attempted formal proofs.
This is one of the bigger reasons why I really don’t like RLHF—because inevitably you’re going to have to use a whole bunch of Humans who know less-than-ideal amounts about philosophy, pertaining to Ai Alignment.
What would these humans do differently, if they knew about philosophy? Concretely, could you give a few examples of "Here's a completion that should be positively reinforced because it demonstrates correct understanding of language, and here's a completion of the same text that should be negatively reinforced because it demonstrates incorrect understanding of language"? (Bear in mind that the prompts shouldn't be about language, as that would probably just teach the model what to say when it's discussing language in particular.)
It’s impossible for the Utility function of the Ai to be amenable to humans if it doesn’t use language the same way
What makes you think that humans all use language the same way, if there's more than one plausible option? People are extremely diverse in their perspectives.
As you're probably aware, the fine tuning is done by humans rating the output of the LLM. I believe this was done by paid workers, who were probably given a list of criteria like that it should be helpful and friendly and definitely not use slurs, and who had probably not heard of Wittgenstein. How do you think they would rate LLM outputs that demonstrated "incorrect understanding of language"?
I have (tried to) read Wittgenstein, but don't know what outputs would or would not constitute an "incorrect understanding of language". Could you give some examples? The question is whether the tuners would rate those examples positively or negatively, and whether examples like those would arise during five tuning.
You say "AI", though I'm assuming you're specifically asking about LLMs (large language models) like GPT, Llama, Claude, etc.
LLMs aren't programmed, they're trained. None of the code written by the developers of LLMs has anything to do with concepts, sentences, dictionary definitions, or different languages (e.g. English vs. Spanish). The code only deals with general machine learning, and streams of tokens (which are roughly letters, but encoded a bit differently).
The LLM is trained on huge corpuses of text. The LLM learns concepts, and what a sentence is, and the difference between English and Spanish, purely from the text. None of that is explicitly programmed into it; the programmers have no say in the matter.
As far as how it comes to understands language, and how that related to Wittgenstein's thoughts on language, we don't know much at all. You can ask it. And we've done some experiments like that recent one with the LLM that was made to think it was the Golden Gate Bridge, which you probably heard about. But that's about it; we don't really know how LLMs "think" internally. (We know what's going on at a low-level, but not at a high-level.)
Exactly.
However, If I already know that I have the disease, and I am not altruistic to my copies, playing such game is a wining move to me?
Correct. But if you don't have the disease, you're probably also not altruistic to your copies, so you would choose not to participate. Leaving the copies of you with the disease isolated and unable to "trade".
Not "almost no gain". My point is that it can be quantified, and it is exactly zero expected gain under all circumstances. You can verify this by drawing out any finite set of worlds containing "mediators", and computing the expected number of disease losses minus disease gains as:
num(people with disease)*P(person with disease meditates)*P(person with disease who meditates loses the disease) - num(people without disease)*P(person without disease meditates)*P(person without disease who meditates gains the disease)
My point is that this number is always exactly zero. If you doubt this, you should try to construct a counterexample with a finite number of worlds.
My point still stands. Try drawing out a specific finite set of worlds and computing the probabilities. (I don't think anything changes when the set of worlds becomes infinite, but the math becomes much harder to get right.)
Very curious what part of this people think is wrong.