...I suppose I'm trying to make a hypothetical AI that would frustrate any sense of "real self" and therefore disprove the claim "all LLMs have a coherent goal that is consistent across characters". In this case, the AI could play the "benevolent sovereign" character or the "paperclip maximizer" character, so if one claimed there was a coherent underlying goal I think the best you could say about it is "it is trying to either be a benevolent sovereign or maximize paperclips". But if your underlying goal can cross such a wide range of behaviors it is practical
So the shoggoth here is the actual process that gets low loss on token prediction. Part of the reason that it is a shoggoth is that it is not the thing that does the talking. Seems like we are on board here.
The shoggoth is not an average over masks. If you want to see the shoggoth, stop looking at the text on the screen and look at the input token sequence and then the logits that the model spits out. That's what I mean by the behavior of the shoggoth.
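For concreteness, here is a minimal sketch of what I mean by looking at the shoggoth directly; the model (GPT-2 via Hugging Face transformers) and the prompt are just stand-ins I picked.

```python
# Feed in an input token sequence and look at the next-token logits themselves,
# rather than at sampled text. GPT-2 here is only a stand-in model choice.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The shoggoth is", return_tensors="pt").input_ids   # the input token sequence
with torch.no_grad():
    logits = model(ids).logits[0, -1]                         # scores over the whole vocabulary
top = torch.topk(torch.softmax(logits, dim=-1), k=5)
print([(tok.decode(i), round(p.item(), 3)) for i, p in zip(top.indices, top.values)])
```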
On the question of whether it's really a mind, I'm not sure how to tell. I know it gets really ...
The shoggoth is supposed to be of a different type than the characters. The shoggoth, for instance, does not speak English; it only knows tokens. There could be a shoggoth character, but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.
Yeah I think this would work if you conditioned on all of the programs you check being exactly equally intelligent. Say you have a hundred superintelligent programs in simulations and one of them is aligned, and they are all equally capable, then the unaligned ones will be slightly slower in coming up with aligned behavior maybe, or might have some other small disadvantage.
However, in the challenge described in the post it's going to be hard to tell a level 999 aligned superintelligence from a level 1000 unaligned superintelligence.
I think the advant...
Quick submission:
The first two prongs of OAI's approach seem to be aiming to get a human-values-aligned training signal. Let us suppose that there is such a thing, and ignore the difference between a training signal and a utility function, both of which I think are charitable assumptions for OAI. Even if we could search the space of all models and find one that in simulations does great on maximizing the correct utility function which we found by using ML to amplify human evaluations of behavior, that is no guarantee that the model we find in that search ...
I loved this, but maybe should come with a cw.
I would think the title is itself a content warning.
I guess someone might expect this post to be far more abstract and less detailed about the visceral realities than it actually is (or maybe even to just use the topic as a metaphor).
What kind of specific content warning do you think would be appropriate? Maybe "Describes the dissection of human bodies in vivid concrete terms."?
I came here to say something pretty similar to what Duncan said, but I had a different focus in mind.
It seems like it's easier for organizations to coordinate around PR than it is for them to coordinate around honor. People can have really deep intractable, or maybe even fundamental and faultless, disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It's much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what co...
As a counterpoint, one writer thinks that it's psychologically harder for organizations to think about PR:
...A famous investigative reporter once asked me why my corporate clients were so terrible at defending themselves during controversy. I explained, “It’s not what they do. Companies make and sell stuff. They don’t fight critics for a living. And they dread the very idea of a fight. Critics criticize; it’s their entire purpose for existing; it’s what they do.”
"But the companies have all that money!” he said, exasperated.
"But their critics have you,” I said
It's much easier to resolve disagreements about what counts as good PR.
I mostly disagree. I mean, maybe this applies in comparison to “honor” (not sure), but I don’t think it applies in comparison to “reputation” in many of the relevant senses. A person or company could reasonably wish to maintain a reputation as a maker of solid products that don’t break, or as a reliable fact-checker, or some other such specific standard. And can reasonably resolve internal disagreements about what is and isn’t likely to maintain this reputation.
If it was actually ...
This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.
There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it's just the utility function you endorse, and that's up to you. The other black box is the space of programs that you could be. Maybe it's limited by memory, maybe it's limited by run time, or maybe it's any finite state machine with less than 10^20 states...
I don't think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt and are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never halting Turing machine.
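One way to make the reduction behind this explicit; the notation and the simplifying assumptions (that every halting-machine world gets the same utility c, and that U returns exact rational values) are mine.

```latex
% Sketch, not a full proof.
\begin{itemize}
  \item For each Turing machine $M$, let $w_M$ be the world whose utopia lasts
        exactly as long as $M$ runs without halting; the worlds are otherwise identical.
  \item Suppose $U$ were computable and assigned $U(w_M) > c$ whenever $M$ never halts,
        and $U(w_M) = c$ whenever $M$ halts.
  \item Then ``compute $U(w_M)$ and test whether $U(w_M) > c$'' would decide the
        halting problem, which is impossible; so no such computable $U$ exists.
\end{itemize}
```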
I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).
This definition is supposed to also explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of that system as an optimizer increases in optimization power.
Here is an idea for a disagreement resolution technique. I think this will work best:
*with one other partner you disagree with.
*when the beliefs you disagree about are clearly about what the world is like.
*when the beliefs you disagree about are mutually exclusive.
*when everybody genuinely wants to figure out what is going on.
Probably doesn't really require all of those though.
The first step is that you both write out your beliefs on a shared work space. This can be a notebook or a whiteboard or anything like that. Then you each write do...
Ok, let me give it a try. I am trying to not spend too much time on this, so I prefer to start with a rough draft and see whether there is anything interesting here before I write a massive essay.
You say the following:
Do chakras exist?
In some sense I might be missing the point since the answer to this is basically just "no". Though obviously I still think they form a meaningful category of something, but in my model they form a meaningful category of "mental experiences" and "mental procedures", and definitely not a meaningfu...
If you come up with a test or set of tests that it would be impossible to actually run in practice, but that we could do in principle if money and ethics were no object, I would still be interested in hearing those. After talking to one of my friends who is enthusiastic about chakras for just a little bit, I would not be surprised if we in fact make fairly similar predictions about the results of such tests.
Sometimes I sort of feel like a grumpy old man that read the sequences back in the good old fashioned year of 2010. When I am in that mood I will sometimes look around at how memes spread throughout the community and say things like "this is not the rationality I grew up with". I really do not want to stir things up with this post, but I guess I do want to be empathetic to this part of me and I want to see what others think about the perspective.
One relatively small reason I feel this way is that a lot of really smart rationalists, who are my fr...
I am not one of the Old Guard, but I have an uneasy feeling about something related to the Chakra phenomenon.
It feels like there's a lot of hidden value clustered around woo-y topics like Chakras and Tulpas, and the right orientation towards these topics seems fairly straightforward: if it calls out to you, investigate and, if you please, report. What feels less clear to me is how I, as an individual or as a member of some broader rat community, should respond when, according to me, people do not pass certain forms of bullshit tests.
This comes from someone wi...
Here is an idea I just thought of in an Uber ride for how to narrow down the space of languages it would be reasonable to use for universal induction. To express the K-complexity of an object relative to a programming language I will write:
Suppose we have two programming languages. The first is Python. The second is Qython, which is a lot like Python, except that it interprets the string "A" as a program that outputs some particular algorithmically large, random-looking character string with . I claim that intuitively, Pyth...
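To make the intuition concrete, here is a toy sketch; the string S, the helper names, and the numbers are all made up by me.

```python
# Toy version of the Python/Qython comparison: Qython is Python plus one extra
# rule, namely that the bare program "A" outputs a particular long random-looking
# string S. Everything here is illustrative only.
import random
import string

random.seed(0)
S = "".join(random.choices(string.ascii_letters, k=10_000))  # stand-in for the random-looking string

def run_qython(program: str) -> str:
    """Run a Qython program: like Python, except "A" is a primitive that outputs S."""
    if program == "A":
        return S
    return eval(program)  # ordinary Python semantics otherwise

# Shortest known Qython description of S: the one-character program "A".
# Shortest Python description of S: roughly S spelled out in a print statement,
# since S is (by assumption) incompressible, so about len(S) characters.
print(len("A"), "characters in Qython vs. roughly", len(S), "in Python")
assert run_qython("A") == S
```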
When I started writing this comment I was confused. Then I got myself fairly less confused I think. I am going to say a bunch of things to explain my confusion, how I tried to get less confused, and then I will ask a couple questions. This comment got really long, and I may decide that it should be a post instead.
Take a system with 8 possible states. Imagine it is like a simplified Rubik's cube type puzzle. (Thinking about mechanical Rubik's cube solvers is how I originally got confused, but using actual Rubik's cubes to explain would make...
Is there a particular formula for negentropy that OP has in mind? I am not seeing how the log of the inverse of the probability of observing an outcome as good or better than the one observed can be interpreted as the negentropy of a system with respect to that preference ordering.
Edit: Actually, I think I figured it out, but I would still be interested in hearing what other people think.
Something about your proposed decision problem seems cheaty in a way that the standard Newcomb problem doesn't. I'm not sure exactly what it is, but I will try to articulate it, and maybe you can help me figure it out.
It reminds me of two different decision problems. Actually, the first one isn't really a decision problem.
Omega has decided to give all those who two-box on the standard Newcomb problem 1,000,000 USD, and all those who do not 1,000 USD.
Now that's not really a decision problem, but that's not the issue with using it ...
I had already proved it for two values of H before I contacted Sellke. How easily does this proof generalize to multiple values of H?
I see. I think you could also use PPI to prove Good's theorem though. Presumably the reason it pays to get new evidence is that you should expect to assign more probability to the truth after observing new evidence?
I honestly could not think of a better way to write it. I had the same problem when my friend first showed me this notation. I thought about using but that seemed more confusing and less standard? I believe this is how they write things in information theory, but those equations usually have logs in them.
I didn't take the time to check whether it did or didn't. If you would walk me through how it does, I would appreciate it.
Luckily, I don't know much about genetics. I totally forgot that, I'll edit the question to reflect it.
To be sure though, did what I mean about the different kinds of cognition come across? I do not actually plan on teaching any genetics.
Yeah, the problem I have with that though is that I'm left asking: why did I change my probability in that? Is it because I updated on something else? Was I certain of that something else? If not, then why did I change my probability of that something else, and on we go down the rabbit hole of an infinite regress.
Wait, actually, I'd like to come back to this. What programming language are we using? If it's one where either grue is primitive, or one where there are primitives that make grue easier to write than green, then grue seems simpler than green. How do we pick which language we use?
Here's my problem. I thought we were looking for a way to categorize meaningful statements. I thought we had agreed that a meaningful statement must be interpretable as or consistent with at least one DAG. But now it seems that there are ways the world can be which cannot be interpreted as even one DAG, because they require a directed cycle. So have we now decided that a meaningful sentence must be interpretable as a directed graph, cyclic or acyclic?
In general, if I say all and only statements that satisfy P are meaningful, then any statement that doesn't ...
What is Markov relative?
Does EY give his own answer to this elsewhere?
Wait... this will seem stupid, but can't I just say: "there does not exist x where sx = 0"
nevermind
Here's a new strategy.
Use Guess culture as a default. Use Guess tricks to figure out whether the other communicator speaks Ask. Use Ask tricks to figure out whether they speak Tell.
Let's forget about the oracle. What about the program that outputs X only if 1 + 1 = 2, and else prints 0? Let's call it A(1,1). The formalism requires that P(X|A(1,1)) = 1, and it requires that P(A(1,1)) = 2^-K(A(1,1)), but does it need to know that "1 + 1 = 2" is somehow proven by A(1,1) printing X?
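For concreteness, the literal program I have in mind (the naming is mine):

```python
# The program A(1,1) described above: it outputs "X" if 1 + 1 = 2 and prints "0" otherwise.
def A_1_1() -> str:
    return "X" if 1 + 1 == 2 else "0"

print(A_1_1())  # prints "X", so conditioning on this program should give P(X | A(1,1)) = 1
```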
In either case, you've shown me something that I explicitly doubted before: one can prove any provable theorem if they have access to a Solomonoff agent's distribution, and they know how to make a program that prints X iff theorem S is provable. All they have to do is check the probability the agent assigns to X conditional on that program.
Awesome. I'm pretty sure you're right; that's the most convincing counterexample I've come across.
I have a weak doubt, but I think you can get rid of it:
let's name the program FTL()
I'm just not sure this means that the theorem itself is assigned a probability. Yes, I have an oracle, but it doesn't assign a probability to a program halting; it tells me whether it halts or not. What the Solomonoff formalism requires is that "if (halts(FTL()) == true) then P(X|FTL()) = 1" and "if (halts(FTL()) == false) then P(X|FTL()) = 0" and "P(FTL...
Upvoted for cracking me up.
Terminology quibble:
I get where you get this notion of connotation from, but there's a more formal one that Quine used, which is at least related. It's the difference between an extension and a meaning. So the extensions of "vertebrate" and "things with tails" could have been identical, but that would not mean that the two predicates have the same meanings. To check if the extensions of two terms are identical, you check the world; it seems like to check whether two meanings are identical, you have to check your own mind.
Edit: Whoops, somebody already mentioned this.
I agree. I am saying that we need not assign it a probability at all. Your solution assumes that there is a way to express "two" in the language. Also, the proposition you made is more like "one elephant and another elephant makes two elephants" not "1 + 1 = 2".
I think we'd be better off trying to find a way to express 1 + 1 = 2 as a boolean function on programs.
This is super interesting. Is this based on UDT?
How do you express Fermat's Last Theorem, for instance, as a boolean combination of the language I gave, or as a boolean combination of programs? Boolean algebra is not strong enough to derive, or even express, all of math.
edit: Let's start simple. How do you express 1 + 1 = 2 in the language I gave, or as a boolean combination of programs?
...Except that around 2% of blue egg-shaped objects contain palladium instead. So if you find a blue egg-shaped thing that contains palladium, should you call it a "rube" instead? You're going to put it in the rube bin—why not call it a "rube"?
But when you switch off the light, nearly all bleggs glow faintly in the dark. And blue egg-shaped objects that contain palladium are just as likely to glow in the dark as any other blue egg-shaped object.
So if you find a blue egg-shaped object that contains palladium, and you ask "Is it a b
Here's a question: if we had the ability to input a sensory event with a likelihood ratio of 3^^^^3:1, would this whole problem be solved?
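For reference, the odds form of Bayes' theorem spells out what such an ability would do; the notation is mine.

```latex
% A single observation E with likelihood ratio 3^^^^3 : 1 multiplies the prior
% odds on the hypothesis H by 3^^^^3 (written with Knuth up-arrows below).
\frac{P(H \mid E)}{P(\neg H \mid E)}
  = \frac{P(E \mid H)}{P(E \mid \neg H)} \cdot \frac{P(H)}{P(\neg H)}
  = 3\uparrow\uparrow\uparrow\uparrow 3 \;\cdot\; \frac{P(H)}{P(\neg H)}.
```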
Hmm, it depends on whether or not you can give finite complete descriptions of those algorithms; if so, I don't see the problem with just tagging them on. If you can give finite descriptions of the algorithm, then its Kolmogorov complexity will be finite, and the prior 2^-K(h) will still give nonzero probabilities to hyper environments.
If there are no such finite complete descriptions, then I gotta go back to the drawing board, cause the universe could totally allow hyper computations.
On a side note, where should I go to read more about hyper-computation?
At first thought, it seems that if it could be falsified, then it would fail the criterion of containing all and only those hypotheses which could in principle be falsified. Kind of like a meta-reference problem; if it does constrain experience, then there are hypotheses which are not interpretable as causal graphs that constrain experience (no matter how unlikely). This is so because the sentence says "all and only those hypotheses that can be interpreted as causal graphs are falsifiable", and for it to be falsified means verifying that there is...
I have to ask, how does this metaphysics (cause that's what it is) account for mathematical truths? What causal models do those represent?
My bad:
Someone already asked this more cleverly than I did.
I have a candidate for the fabric of real things that is plausibly equivalent to EY's (or at least implies it), i.e., the space of hypotheses which could in principle be true, i.e., the space of beliefs which have sense:
A hypothesis has nonzero probability iff it's computable or semi-computable.
It's rather obviously inspired by Solomonoff abduction, and is a sound principle for any being attempting to approximate the universal prior.
It seems to me that this is the primary thing that we should be working on. If probability is subjective, and causality reduces to probability, then isn't causality subjective, i.e., a function of background knowledge?
Looking it over, I could have been much clearer (sorry). Specifically, I want to know: given a DAG of the form:
A -> C <- B
Is it true that (in all prior joint distributions where A is independent of B, but A is evidence of C, and B is evidence of C) A is non-independent of B given C is held constant?
I proved that this is so when A & B is evidence against C, and also when A & B are independent of C; the only case I am missing is when A & B is evidence for C.
It's clear enough to me that when you have one non-colliding pat...
I have a question: is D-separation implied by the Kolmogorov axioms?
I've proven that it is in some cases:
Premises:
1) P(A) = P(A|B)   ∴  P(A|BC) ≤ P(A|C)
2) P(C) < P(C|A)
3) P(C) < P(C|B)
4) P(C|AB) < P(C)
proof starts:
1) P(B|C) > P(B)   {via premise 3
2) P(A|BC) = P(A)P(B)P(C|AB) / (P(C)P(B|C))   {via premise 1
3) P(A|BC)P(C) = P(A)P(B)P(C|AB) / P(B|C)
4) P(A|BC)P(C) / P(A) = P(B)P(C|AB) / P(B|C)
5) P(B)P(C|AB) / P(B|C) < P(C|AB)   {via line 1
6) P(B)P(C|AB) / P(B|C) < P(C)   {via line 5 and premise 4
7) P(A|BC)P(C) / P(A) < P(C)   {via lines 6 and 4
8) P(A|C) = P(A)P(C|A) / P(C)
9) P(A|C)P(C) = P(A)P(C|A)
10) P(A|C)P(C) / P(A) = P(C|...
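For what it's worth, here is a quick numerical sanity check covering both the case the proof handles and the case I'm still missing. The distributions are toy examples I picked, so this only illustrates and proves nothing.

```python
# Two toy collider distributions A -> C <- B with A, B independent and each
# individually evidence for C. We check whether, given C, learning B lowers P(A),
# i.e. whether P(A|B,C) < P(A|C). Parameters and structure chosen by me.
from itertools import product

def check(pA, pB, c_of):
    """Return (P(A=1|C=1), P(A=1|B=1,C=1)) for the collider with C = c_of(A, B)."""
    joint = {}
    for a, b in product([0, 1], repeat=2):
        p = (pA if a else 1 - pA) * (pB if b else 1 - pB)
        key = (a, b, c_of(a, b))
        joint[key] = joint.get(key, 0.0) + p

    def P(pred):
        return sum(p for (a, b, c), p in joint.items() if pred(a, b, c))

    pA_C = P(lambda a, b, c: a and c) / P(lambda a, b, c: c)
    pA_BC = P(lambda a, b, c: a and b and c) / P(lambda a, b, c: b and c)
    return pA_C, pA_BC

# Case the proof covers: A & B together are evidence *against* C (here C = A XOR B).
print(check(0.3, 0.3, lambda a, b: a ^ b))   # (0.5, 0.0): given C, learning B lowers P(A)
# The missing case: A & B together are evidence *for* C (here C = A OR B).
print(check(0.5, 0.5, lambda a, b: a | b))   # (0.666..., 0.5): same direction in this instance
```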
A real deadlock I have with using your algorithmic meta-ethics to think about object-level ethics is that I don't know whose volition, or "should" label, I should extrapolate from. It allows me to figure out what's right for me, and what's right for any group given certain shared extrapolated terminal values, but it doesn't tell me what to do when I am dealing with a population with non-converging extrapolations, or with someone that has different extrapolated values from me (hypothetically).
These individuals are rare, but they likely exist.
Yeah, I'm totally with you that it definitely isn't actually next-token prediction; it's some totally other goal drawn from the distribution of goals you get when you run SGD to minimize next-token prediction surprise.
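To be concrete about what I mean by "next-token prediction surprise", it's just the usual cross-entropy objective; the shapes and numbers below are made up for illustration.

```python
# Hedged sketch of "next-token prediction surprise": the cross-entropy loss that
# SGD pushes down during pretraining. All shapes and values here are made up.
import torch
import torch.nn.functional as F

vocab = 50_257
tokens = torch.randint(0, vocab, (1, 8))   # a training token sequence (batch=1, length=8)
logits = torch.randn(1, 7, vocab)          # model outputs at positions 0..6, each predicting the next token

surprise = F.cross_entropy(                # average of -log P(actual next token)
    logits.reshape(-1, vocab),
    tokens[:, 1:].reshape(-1),
)
print(surprise.item())
```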