I came here to say something pretty similar to what Duncan said, but I had a different focus in mind.
It seems like it's easier for organizations to coordinate around PR than it is for them to coordinate around honor. People can have really deep, intractable, or maybe even fundamental and faultless disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It's much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what counts as good PR using polls.
Maybe for this reason we should expect being into PR to be a relatively stable property of organizations, while being into honor is a fragile and precious thing for an organization.
This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.
There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it's just the utility function you endorse, and that's up to you. The other black box is the space of programs that you could be. Maybe it's limited by memory, maybe it's limited by run time, maybe it's any finite state machine with fewer than 10^20 states, or maybe it's Python programs fewer than 5000 characters long. Either way, it is some limited set of programs that take your sensory data and motor output history as input and return a motor output. The limitations could be whatever; they don't have to be like these.
Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only use policies that are in the limited space of bounded programs you could be. Their expected utility assignments over that space of programs are then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the agent, then that is an improvement.
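To make the shape of this concrete, here is a minimal toy sketch. Everything in it is an assumption made up for illustration: the two worlds, the prior, the utility numbers, and the choice to let "programs you could be" mean lookup tables from one observation to one action. The real black boxes would be far richer; this only shows the ranking step.

```python
from itertools import product

# Hypothetical toy setup: all names and numbers are illustrative assumptions.
OBSERVATIONS = ("red", "green")
ACTIONS = ("stay", "go")

# Stand-in for "the right prior": a distribution over which world we are in.
PRIOR = {"safe_world": 0.7, "risky_world": 0.3}

# What the bounded agent observes in each world.
WORLD_OBSERVATION = {"safe_world": "green", "risky_world": "red"}


def utility(world, action):
    """Stand-in for the 'true utility function' black box."""
    if world == "safe_world":
        return 1.0 if action == "go" else 0.0
    return -5.0 if action == "go" else 0.0


def bounded_policies():
    """Stand-in for the 'programs you could be': every observation -> action table."""
    for choice in product(ACTIONS, repeat=len(OBSERVATIONS)):
        yield dict(zip(OBSERVATIONS, choice))


def expected_utility(policy):
    """The ideal agent's expected utility for running this policy."""
    return sum(
        p * utility(world, policy[WORLD_OBSERVATION[world]])
        for world, p in PRIOR.items()
    )


def rank_policies():
    """All bounded policies, best first, by the ideal agent's lights."""
    return sorted(bounded_policies(), key=expected_utility, reverse=True)


def is_improvement(old_policy, new_policy):
    """A change counts as an improvement if the ideal agent rates the new policy higher."""
    return expected_utility(new_policy) > expected_utility(old_policy)
```

In this toy, `rank_policies()[0]` is the policy the ideal agent would pick for you, and `is_improvement` is the proposed test for whether a self-modification made you more rational.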
I don't think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt and are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never halting Turing machine.
I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).
This definition is also supposed to explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of that system as an optimizer increases in optimization power.
Here is an idea for a disagreement resolution technique. I think this will work best:
*with one other partner you disagree with.
*when the beliefs you disagree about are clearly about what the world is like.
*when the beliefs you disagree about are mutually exclusive.
*when everybody genuinely wants to figure out what is going on.
Probably doesn't really require all of those though.
The first step is that you both write out your beliefs on a shared work space. This can be a notebook or a whiteboard or anything like that. Then you each write down your credences next to each of the statements on the work space.
Now, when you want to make a new argument or present a new piece of evidence, you should ask your partner if they have heard it before after you present it. Maybe you should ask them questions about it beforehand to verify that they have not. If they have not heard it before, or had not considered it, you give it a name and write it down between the two propositions. Now you ask your partner how much they changed their credence as a result of the new argument. They write down their new credences below the ones they previously wrote down, and write down the changes next to the argument that just got added to the board.
When your partner presents a new argument or piece of evidence, be honest about whether you have heard it before. If you have not, it should change your credence some. How much do you think? Write down your new credence. I don't think you should worry too much about being a consistent Bayesian here or anything like that. Just move your credence a bit for each argument or piece of evidence you have not heard or considered, and move it more for better arguments or stronger evidence. You don't have to commit to the last credence you write down, but you should think at least that the relative sizes of all of the changes were about right.
I think this is the core of the technique. I would love to try this. I think it would be interesting because it would focus the conversation and give players a record of how much their minds changed, and why. I also think this might make it harder to just forget the conversation and move back to your previous credence by default afterwards.
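If you wanted to keep the record digitally instead of on a whiteboard, the bookkeeping could look something like the sketch below. This is only my guess at a minimal version: the class names, the field names, and the simplification to a single statement per workspace are all assumptions, and the names and numbers in the usage lines are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Update:
    """One named argument and how far it moved the listener's credence."""
    argument: str
    presented_by: str
    listener: str
    before: float
    after: float

    @property
    def change(self) -> float:
        return self.after - self.before


@dataclass
class Workspace:
    """Shared record for one statement (a simplification: one statement per workspace)."""
    statement: str
    credences: Dict[str, float]  # each participant's current credence
    log: List[Update] = field(default_factory=list)

    def record(self, argument: str, presented_by: str, listener: str,
               new_credence: float, heard_before: bool = False) -> Optional[Update]:
        # Arguments the listener had already considered are not written down.
        if heard_before:
            return None
        if not 0.0 <= new_credence <= 1.0:
            raise ValueError("credence must be between 0 and 1")
        update = Update(argument, presented_by, listener,
                        self.credences[listener], new_credence)
        self.credences[listener] = new_credence
        self.log.append(update)
        return update


# Hypothetical usage with made-up participants and numbers.
board = Workspace("The bridge will open by June", {"Ann": 0.8, "Bob": 0.1})
board.record("contractor delays", presented_by="Bob", listener="Ann", new_credence=0.65)
board.record("old news", presented_by="Ann", listener="Bob", new_credence=0.1, heard_before=True)
```

After the conversation, `board.log` is the record of which arguments moved whom and by how much, which is the part I care about most.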
You could also iterate it. If you do not think that your partner changed their mind enough as a result of a new argument, get a new workspace and write down how much you think they should have changed their credence. They do the same. Now you can both make arguments relevant to that, and incrementally change your estimate of how much they should have changed their mind, and you both have a record of the changes.
If you come up with a test or set of tests that it would be impossible to actually run in practice, but that we could do in principle if money and ethics were no object, I would still be interested in hearing those. After talking to one of my friends who is enthusiastic about chakras for just a little bit, I would not be surprised if we in fact make fairly similar predictions about the results of such tests.
Sometimes I sort of feel like a grumpy old man that read the sequences back in the good old fashioned year of 2010. When I am in that mood I will sometimes look around at how memes spread throughout the community and say things like "this is not the rationality I grew up with". I really do not want to stir things up with this post, but I guess I do want to be empathetic to this part of me and I want to see what others think about the perspective.
One relatively small reason I feel this way is that a lot of really smart rationalists, who are my friends or who I deeply respect or both, seem to have gotten really into chakras, and maybe some other woo stuff. I want to better understand these folks. I'll admit now that I have weird biased attitudes towards woo stuff in general, but I am going to use chakras as a specific example here.
One of the sacred values of rationality that I care a lot about is that one should not discount hypotheses/perspectives because they are low status, woo, or otherwise weird.
Another is that one's beliefs should pay rent.
To be clear, I am worried that we might be failing on the second sacred value. I am not saying that we should abandon the first one as I think some people may have suggested in the past. I actually think that rationalists getting into chakras is strong evidence that we are doing great on the first sacred value.
Maybe we are not failing on the second sacred value. I want to know whether we are or not, so I want to ask rationalists who think a lot or talk enthusiastically about chakras a question:
Do chakras exist?
If you answer "yes", how do you know they exist?
I've thought a bit about how someone might answer the second question if they answer "yes" to the first question without violating the second sacred value. I've thought of basically two ways that seem possible, but there are probably others.
One way might be that you just think that chakras literally exist in the same ways that planes literally exist, or in the way that waves literally exist. Chakras are just some phenomena that are made out of some stuff like everything else. If that is the case, then it seems like we should be able to at least in principle point to some sort of test that we could run to convince me that they do exist, or you that they do not. I would definitely be interested in hearing proposals for such tests!
Another way might be that you think chakras do not literally exist like planes do, but you can make a predictive profit by pretending that they do exist. This is sort of like how I do not expect that if I could read and understand the source code for a human mind, that there would be some parts of the code that I could point to and call the utility and probability functions. Nonetheless, I think it makes sense to model humans as optimization processes with some utility function and some probability function, because modeling them that way allows me to compress my predictions about their future behavior. Of course, I would get better predictions if I could model them as mechanical objects, but doing so is just too computationally expensive for me. Maybe modeling people as having chakras, including yourself, works sort of the same way. You use some of your evidence to infer the state of their chakras, and then use that model to make testable predictions about their future behavior. In other words, you might think that chakras are real patterns. Again it seems to me that in this case we should at least in principle be able to come up with tests that would convince me that chakras exist, or you that they do not, and I would love to hear any such proposals.
Maybe you think they exist in some other sense, and then I would definitely like to hear about that.
Maybe you do not think they exist in any way, or make any predictions of any kind, and in that case, I guess I am not sure how continuing to be enthusiastic about thinking about chakras or talking about chakras is supposed to jibe with the sacred principle that one's beliefs should pay rent.
I guess it's worth mentioning that I do not feel as averse to Duncan's color wheel thing, maybe because it's not coded as "woo" to my mind. But I still think it would be fair to ask about that taxonomy exactly how we think that it cuts the universe at its joints. Asking that question still seems to me like it should reduce to figuring out what sorts of predictions to make if it in fact does, and then figuring out ways to test them.
I would really love to have several cooperative conversations about this with people who are excited about chakras, or other similar woo things, either within this framework of finding out what sorts of tests we could run to get rid of our uncertainty, or questioning the framework I propose altogether.
Here is an idea I just thought of in an Uber ride for how to narrow down the space of languages it would be reasonable to use for universal induction. To express the k-complexity of an object $O$ relative to a programming language $L$ I will write:
$$K_L(O)$$
Suppose we have two programming languages. The first is Python. The second is Qython, which is a lot like Python, except that it interprets the string "A" as a program that outputs some particular algorithmically large, random-looking character string $S$ with $K_{\text{Python}}(S) \approx 10^{15}$. I claim that intuitively, Python is a better language to use for measuring the complexity of a hypothesis than Qython. That's the notion that I just thought of a way to formally express.
There is a well known theorem that if you are using $L_1$ to measure the complexity of objects, and I am using $L_2$ to measure the complexity of objects, then there is a constant $c_2$ such that for any object $O$:
$$K_{L_1}(O) \le K_{L_2}(O) + c_2$$
In words, this means that you might think that some objects are less complicated than I do, and you might think that some objects are more complicated than I do, but you won't think that any object is $c_2$ complexity units more complicated than I do. Intuitively, $c_2$ is just the length of the shortest program in $L_1$ that is a compiler for $L_2$. So worst case scenario, the shortest program in $L_1$ that outputs $O$ will be a compiler for $L_2$ written in $L_1$ (which is $c_2$ characters long) plus giving that compiler the program in $L_2$ that outputs $O$ (which would be $K_{L_2}(O)$ characters long).
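Spelled out as a one-line derivation, this looks like the following. The names $\mathrm{comp}_2$ (a shortest compiler for $L_2$ written in $L_1$) and $p^*$ (a shortest $L_2$ program that outputs $O$) are mine, and I am ignoring whatever small overhead it takes to glue the two together:

$$K_{L_1}(O) \;\le\; |\mathrm{comp}_2| + |p^*| \;=\; c_2 + K_{L_2}(O)$$

Running the same argument with the roles of the languages swapped, and writing $c_1$ for the length of a shortest compiler for $L_1$ written in $L_2$, gives $K_{L_2}(O) \le c_1 + K_{L_1}(O)$. Together the two bounds say:

$$-c_1 \;\le\; K_{L_1}(O) - K_{L_2}(O) \;\le\; c_2$$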
I am going to define the k-complexity of a function $f: X \to Y$ relative to a programming language as the length of the shortest program in that language such that when it is given $x$ as an input, it returns $f(x)$. This is probably already defined that way, but jic. So say we have a function from programs in $L_2$ to their outputs and we call that function $C_2$, then:
$$K_{L_1}(C_2) = c_2$$
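There is also another constant, where $C_1$ is the analogous function from programs in $L_1$ to their outputs:
$$K_{L_2}(C_1) = c_1$$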
The first is the length of the shortest compiler for $L_2$ written in $L_1$, and the second is the length of the shortest compiler for $L_1$ written in $L_2$. Notice that these do not need to be equal. For instance, I claim that the compiler for Qython written in Python is roughly $10^{15}$ characters long, since we have to write the program that outputs $S$ in Python, which by hypothesis is about $10^{15}$ characters long, and then a bit more to get it to run that program when it reads "A", and to get that functionality to play nicely with the rest of Qython, however that works out. By contrast, it shouldn't take very long to write a compiler for Python in Qython. Since Qython basically is Python, it might not take any characters, but if there are weird rules in Qython for how the string "A" is interpreted when it appears in an otherwise Python-like program, then it still shouldn't take any more characters than it takes to write a Python interpreter in regular Python.
So this is my proposed method for determining which of two programming languages it would be better to use for universal induction. Say again that we are choosing between $L_1$ and $L_2$. We find the pair of constants such that $K_{L_1}(C_2) = c_2$ and $K_{L_2}(C_1) = c_1$, and then compare their sizes. If $c_1$ is less than $c_2$ this means that it is easier to write a compiler for $L_1$ in $L_2$ than vice versa, and so there is more hidden complexity in $L_2$'s encodings than in $L_1$'s, and so we should use $L_1$ instead of $L_2$ for assessing the complexity of hypotheses.
Let's say that if $K_{L_2}(C_1) < K_{L_1}(C_2)$ then $L_2$ hides more complexity than $L_1$.
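As a sanity check on the direction of the comparison, here is the rule applied to Python vs. Qython with made-up numbers. The true constants are shortest-program lengths and are not something we can compute, so the inputs are just estimates. The glue and self-interpreter sizes below are assumptions I picked for illustration, and I am taking the complexity of S to be on the order of 10^15 characters.

```python
from typing import Optional


def hides_more_complexity(c1: int, c2: int) -> Optional[str]:
    """Apply the comparison rule to two (estimated) compiler lengths.

    c1: length of the shortest compiler for L1 written in L2.
    c2: length of the shortest compiler for L2 written in L1.
    Returns the name of the language that hides more complexity, or None on a tie.
    """
    if c1 < c2:
        return "L2"
    if c2 < c1:
        return "L1"
    return None


# Illustrative estimates only, with L1 = Python and L2 = Qython.
K_PYTHON_S = 10**15                # assumed complexity of the string S in Python
GLUE = 1_000                       # assumed cost of wiring "A" into the interpreter
PYTHON_SELF_INTERPRETER = 100_000  # assumed size of a Python interpreter in Python

c2_estimate = K_PYTHON_S + GLUE       # compiler for Qython written in Python
c1_estimate = PYTHON_SELF_INTERPRETER # compiler for Python written in Qython (upper bound)

verdict = hides_more_complexity(c1_estimate, c2_estimate)
```

Here `verdict` comes out as `"L2"`, i.e. Qython hides more complexity, and the conclusion is not sensitive to the exact glue or interpreter sizes as long as they are tiny next to the complexity of S.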
A few complications:
It is probably not always decidable whether the smallest compiler for $L_1$ written in $L_2$ is smaller than the smallest compiler for $L_2$ written in $L_1$. But this at least in principle gives us some way to specify what we mean by one language hiding more complexity than another, and at least in the case of Python vs. Qython, we can make a pretty good argument that the smallest compiler for Python written in Qython is smaller than the smallest compiler for Qython written in Python.
It is possible (I'd say probable) that if we started with some group of candidate languages and looked for languages that hide less complexity, we might run into a circle. Like the smallest compiler for $L_1$ in $L_2$ might be the same size as the smallest compiler for $L_2$ in $L_1$ but there might still be an infinite set of objects $O_i$ such that:
$$K_{L_1}(O_i) \neq K_{L_2}(O_i)$$
In this case, the two languages would disagree about the complexity of an infinite set of objects, but at least they would disagree about it by no more than the same fixed constant in both directions. Idk, seems like probably we could do something clever there, like take the average or something, idk. If we introduce an $L_3$ and the smallest compiler for $L_3$ in $L_1$ is larger than it is in $L_2$, then it seems like we should pick $L_1$.
If there is an infinite set of languages that all stand in this relationship to each other, ie, all of the languages in an infinite set disagree about the complexity of an infinite set of objects and hide less complexity than any language not in the set, then idk, seems pretty damning for this approach, but at least we narrowed down the search space a bit?
Even if it turns out that we end up in a situation where we have an infinite set of languages that disagree about an infinite set of objects by exactly the same constant, it might be nice to have some upper bound on what that constant is.
In any case, this seems like something somebody would have thought of, and then proved the relevant theorems addressing all of the complications I raised. Ever seen something like this before? I think a friend might have suggested a paper that tried some similar method, and concluded that it wasn't a feasible strategy, but I don't remember exactly, and it might have been a totally different thing.