I loved this, but maybe it should come with a CW.
I came here to say something pretty similar to what Duncan said, but I had a different focus in mind.
It seems like it's easier for organizations to coordinate around PR than it is for them to coordinate around honor. People can have really deep, intractable, or maybe even fundamental and faultless, disagreements about what is honorable, because what is honorable is a function of what normative principles you endorse. It's much easier to resolve disagreements about what counts as good PR. You could probably settle most disagreements about what counts as good PR using polls.
Maybe for this reason we should expect being into PR to be a relatively stable property of organizations, while being into honor is a fragile and precious thing for an organization.
This might be sort of missing the point, but here is an ideal and maybe not very useful not-yet-theory of rationality improvements I just came up with.
There are a few black boxes in the theory. The first takes you and returns your true utility function, whatever that is. Maybe it's just the utility function you endorse, and that's up to you. The other black box is the space of programs that you could be. Maybe it's limited by memory, maybe it's limited by run time, maybe it's any finite state machine with fewer than 10^20 states, or maybe it's Python programs less than 5000 characters long: some limited set of programs that take your sensory data and motor output history as input and return a motor output. The limitations could be whatever; they don't have to be like this.
Then you take one of these ideal rational agents with your true utility function and the right prior, and you give them the decision problem of designing your policy, but they can only use policies that are in the limited space of bounded programs you could be. Their expected utility assignments over that space of programs are then our measure of the rationality of a bounded agent. You could also give the ideal agent access to your data and see how that changes their ranking, if it does. If you can change yourself such that the program you become is assigned higher expected utility by the agent, then that is an improvement.
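To make the ranking step a bit more concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the two-world prior, the observation likelihoods, `true_utility`, and the three candidate policies are stand-ins for the black boxes above, not a claim about what they really contain.

```python
from typing import Callable, Dict, List, Tuple

Policy = Callable[[str], str]  # maps an observation to an action

# Assumed prior of the ideal agent over worlds (hypothetical numbers).
PRIOR: Dict[str, float] = {"rain": 0.3, "sun": 0.7}

# Assumed probability of each observation given the world.
LIKELIHOOD: Dict[str, Dict[str, float]] = {
    "rain": {"clouds": 0.9, "clear": 0.1},
    "sun": {"clouds": 0.2, "clear": 0.8},
}


def true_utility(world: str, action: str) -> float:
    """Stand-in for the first black box: your true utility function."""
    if action == "umbrella":
        return 0.8 if world == "rain" else 0.5
    # Otherwise the action is "no_umbrella".
    return 0.0 if world == "rain" else 1.0


# Stand-in for the second black box: the bounded programs you could be.
POLICIES: Dict[str, Policy] = {
    "always_umbrella": lambda obs: "umbrella",
    "never_umbrella": lambda obs: "no_umbrella",
    "follow_clouds": lambda obs: "umbrella" if obs == "clouds" else "no_umbrella",
}


def expected_utility(policy: Policy) -> float:
    """Expected utility the ideal agent assigns to running this policy."""
    total = 0.0
    for world, p_world in PRIOR.items():
        for obs, p_obs in LIKELIHOOD[world].items():
            total += p_world * p_obs * true_utility(world, policy(obs))
    return total


def rank_policies(policies: Dict[str, Policy]) -> List[Tuple[str, float]]:
    """Rank the bounded policies from most to least rational."""
    scores = {name: expected_utility(p) for name, p in policies.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def is_improvement(old: Policy, new: Policy) -> bool:
    """A self-modification is an improvement if the ideal agent prefers it."""
    return expected_utility(new) > expected_utility(old)


if __name__ == "__main__":
    for name, score in rank_policies(POLICIES):
        print(f"{name}: {score:.3f}")
```

With these made-up numbers the ideal agent ranks `follow_clouds` first, then `never_umbrella`, then `always_umbrella`, so switching from `never_umbrella` to `follow_clouds` counts as an improvement in the sense above.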
I don't think we should be surprised that any reasonable utility function is uncomputable. Consider a set of worlds with utopias that last only as long as a Turing machine in the world does not halt and are otherwise identical. There is one such world for each Turing machine. All of these worlds are possible. No computable utility function can assign higher utility to every world with a never halting Turing machine.
I do think this is an important concept to explain our conception of goal-directedness, but I don't think it can be used as an argument for AI risk, because it proves too much. For example, for many people without technical expertise, the best model they have for a laptop is that it is pursuing some goal (at least, many of my relatives frequently anthropomorphize their laptops).
This definition is supposed to also explain why a mouse has agentic behavior, and I would consider it a failure of the definition if it implied that mice are dangerous. I think a system becomes more dangerous as your best model of that system as an optimizer increases in optimization power.
Here is an idea for a disagreement resolution technique. I think this will work best:
*with one other partner you disagree with.
*when the beliefs you disagree about are clearly about what the world is like.
*when the beliefs you disagree about are mutually exclusive.
*when everybody genuinely wants to figure out what is going on.
Probably doesn't really require all of those though.
The first step is that you both write out your beliefs on a shared work space. This can be a notebook or a whiteboard or anything like that. Then you each write down your credences next to each of the statements on the work space.
Now, when you want to make a new argument or present a new piece of evidence, you should ask your partner if they have heard it before after you present it. Maybe you should ask them questions about it beforehand to verify that they have not. If they have not heard it before, or had not considered it, you give it a name and write it down between the two propositions. Now you ask your partner how much they changed their credence as a result of the new argument. They write down their new credences below the ones they previously wrote down, and write down the changes next to the argument that just got added to the board.
When your partner presents a new argument or piece of evidence, be honest about whether you have heard it before. If you have not, it should change your credence some. How much do you think? Write down your new credence. I don't think you should worry too much about being a consistent Bayesian here or anything like that. Just move your credence a bit for each argument or piece of evidence you have not heard or considered, and move it more for better arguments or stronger evidence. You don't have to commit to the last credence you write down, but you should think at least that the relative sizes of all of the changes were about right.
I think this is the core of the technique. I would love to try this. I think it would be interesting because it would focus the conversation and give players a record of how much their minds changed, and why. I also think this might make it harder to just forget the conversation and move back to your previous credence by default afterwards.
You could also iterate it. If you do not think that your partner changed their mind enough as a result of a new argument, get a new workspace and write down how much you think they should have changed their credence. They do the same. Now you can both make arguments relevant to that, and incrementally change your estimate of how much they should have changed their mind, and you both have a record of the changes.
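The bookkeeping for the shared workspace could be sketched like this in Python. This is only a rough illustration: `Workspace`, `ArgumentEntry`, and all the method names are hypothetical, and a whiteboard would do the same job.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ArgumentEntry:
    """A named argument plus how far it moved the listener on each proposition."""
    name: str
    presenter: str
    listener: str
    changes: Dict[str, float]


@dataclass
class Workspace:
    # history[person][proposition] lists the credences written so far, oldest
    # first, like new credences written below the old ones on the board.
    history: Dict[str, Dict[str, List[float]]] = field(default_factory=dict)
    arguments: List[ArgumentEntry] = field(default_factory=list)

    def write_initial(self, person: str, credences: Dict[str, float]) -> None:
        """First step: each person writes a credence next to each statement."""
        self.history[person] = {prop: [c] for prop, c in credences.items()}

    def current(self, person: str, proposition: str) -> float:
        return self.history[person][proposition][-1]

    def record_argument(
        self,
        name: str,
        presenter: str,
        listener: str,
        heard_before: bool,
        new_credences: Dict[str, float],
    ) -> Optional[ArgumentEntry]:
        """Log a new argument and the listener's updated credences."""
        if heard_before:
            return None  # already-considered arguments do not go on the board
        for new in new_credences.values():
            if not 0.0 <= new <= 1.0:
                raise ValueError("credences must be between 0 and 1")
        changes: Dict[str, float] = {}
        for prop, new in new_credences.items():
            changes[prop] = new - self.current(listener, prop)
            self.history[listener][prop].append(new)
        entry = ArgumentEntry(name, presenter, listener, changes)
        self.arguments.append(entry)
        return entry


if __name__ == "__main__":
    board = Workspace()
    board.write_initial("you", {"A": 0.8, "B": 0.2})
    board.write_initial("partner", {"A": 0.3, "B": 0.7})
    entry = board.record_argument(
        "base rates", "you", "partner",
        heard_before=False, new_credences={"A": 0.4, "B": 0.6},
    )
    print(entry)
```

For the iterated version, you could open a second `Workspace` whose propositions are claims about how much a given credence should have moved, and run the same procedure on it.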
If you come up with a test or set of tests that it would be impossible to actually run in practice, but that we could do in principle if money and ethics were no object, I would still be interested in hearing those. After talking to one of my friends who is enthusiastic about chakras for just a little bit, I would not be surprised if we in fact make fairly similar predictions about the results of such tests.
Sometimes I sort of feel like a grumpy old man that read the sequences back in the good old fashioned year of 2010. When I am in that mood I will sometimes look around at how memes spread throughout the community and say things like "this is not the rationality I grew up with". I really do not want to stir things up with this post, but I guess I do want to be empathetic to this part of me and I want to see what others think about the perspective.
One relatively small reason I feel this way is that a lot of really smart rationalists, who are my friends or who I deeply respect or both, seem to have gotten really into chakras, and maybe some other woo stuff. I want to better understand these folks. I'll admit now that I have weird biased attitudes towards woo stuff in general, but I am going to use chakras as a specific example here.
One of the sacred values of rationality that I care a lot about is that one should not discount hypotheses/perspectives because they are low status, woo, or otherwise weird.
Another is that one's beliefs should pay rent.
To be clear, I am worried that we might be failing on the second sacred value. I am not saying that we should abandon the first one as I think some people may have suggested in the past. I actually think that rationalists getting into chakras is strong evidence that we are doing great on the first sacred value.
Maybe we are not failing on the second sacred value. I want to know whether we are or not, so I want to ask rationalists who think a lot or talk enthusiastically about chakras a question:
Do chakras exist?
If you answer "yes", how do you know they exist?
I've thought a bit about how someone might answer the second question if they answer "yes" to the first question without violating the second sacred value. I've thought of basically two ways that seem possible, but there are probably others.
One way might be that you just think that chakras literally exist in the same ways that planes literally exist, or in the way that waves literally exist. Chakras are just some phenomena that are made out of some stuff like everything else. If that is the case, then it seems like we should be able to at least in principle point to some sort of test that we could run to convince me that they do exist, or you that they do not. I would definitely be interested in hearing proposals for such tests!
Another way might be that you think chakras do not literally exist like planes do, but you can make a predictive profit by pretending that they do exist. This is sort of like how I do not expect that if I could read and understand the source code for a human mind, that there would be some parts of the code that I could point to and call the utility and probability functions. Nonetheless, I think it makes sense to model humans as optimization processes with some utility function and some probability function, because modeling them that way allows me to compress my predictions about their future behavior. Of course, I would get better predictions if I could model them as mechanical objects, but doing so is just too computationally expensive for me. Maybe modeling people as having chakras, including yourself, works sort of the same way. You use some of your evidence to infer the state of their chakras, and then use that model to make testable predictions about their future behavior. In other words, you might think that chakras are real patterns. Again it seems to me that in this case we should at least in principle be able to come up with tests that would convince me that chakras exist, or you that they do not, and I would love to hear any such proposals.
Maybe you think they exist in some other sense, and then I would definitely like to hear about that.
Maybe you do not think they exist in any way, or make any predictions of any kind, and in that case, I guess I am not sure how continuing to be enthusiastic about thinking about chakras or talking about chakras is supposed to jibe with the sacred principle that one's beliefs should pay rent.
I guess it's worth mentioning that I do not feel as averse to Duncan's color wheel thing, maybe because it's not coded as "woo" to my mind. But I still think it would be fair to ask about that taxonomy exactly how we think that it cuts the universe at its joints. Asking that question still seems to me like it should reduce to figuring out what sorts of predictions to make if it in fact does, and then figuring out ways to test them.
I would really love to have several cooperative conversations about this with people who are excited about chakras, or other similar woo things, either within this framework of finding out what sorts of tests we could run to get rid of our uncertainty, or questioning the framework I propose altogether.