Computing scientist and Systems architect
It is easy to see that this idea of logical counterfactuals is unsatisfactory. For one, no good account of them has yet been given. For two, there is a sense in which no account could be given; reasoning about logically incoherent worlds can only be so extensive before running into logical contradiction.
I've been doing some work on this topic, and I am seeing two schools of thought on how to deal with the problem of logical contradictions you mention. To explain these, I'll use an example counterfactual not involving agents and free will. Consider the counterfactual sentence: `if the vase had not been broken, the floor would not have been wet'. Now, how can we compute a truth value for this sentence?
School of thought 1 proceeds as follows: we know various facts about the world, like that the vase is broken and that the floor is wet. We also know general facts about vases, breaking, water, and floors. Now we add the extra fact that the vase is not broken to our knowledge base. Based on this extended body of knowledge, we compute the truth value of the claim 'the floor is not wet'. Clearly, we are dealing with a knowledge base that contains mutually contradictory facts: the vase is both broken and not broken. Under normal mathematical systems of reasoning, this will allow us to prove any claim we like: the truth value of any sentence becomes 1, which is not what we want. Now, school 1 tries to solve this by coming up with new systems of reasoning that are tolerant of such internal contradictions: systems that produce only the 'obviously true' conclusions, or that derive the 'obviously true' conclusions before deriving the 'obviously false' ones, or that compute probabilistic truth values in such a way that those of the 'obviously true' conclusions are higher. In MIRI terminology, I believe this approach goes under the heading 'decision theory'. I also interpret the two alternative solutions you mention above as following this school of thought. Personally, I find this solution approach not very promising or compelling.
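To make the 'prove any claim we like' point concrete, here is a minimal sketch of a brute-force propositional entailment check. The two-symbol encoding, the toy law 'wet if and only if broken', and all names are my own illustrative assumptions, not part of any existing system.

```python
from itertools import product

def entails(kb, query, symbols):
    """True if every truth assignment that satisfies all facts in kb also satisfies query."""
    for values in product([False, True], repeat=len(symbols)):
        world = dict(zip(symbols, values))
        if all(fact(world) for fact in kb) and not query(world):
            return False  # found a model of kb in which the query fails
    return True  # also reached when kb has no models at all

symbols = ["broken", "wet"]
kb = [
    lambda w: w["broken"],               # observed: the vase is broken
    lambda w: w["wet"],                  # observed: the floor is wet
    lambda w: w["wet"] == w["broken"],   # toy general law (assumption): wet iff broken
]
# Naive school 1 input: just add the counterfactual premise to the knowledge base.
naive_kb = kb + [lambda w: not w["broken"]]

floor_dry = lambda w: not w["wet"]
floor_wet = lambda w: w["wet"]

print(entails(naive_kb, floor_dry, symbols))  # True
print(entails(naive_kb, floor_wet, symbols))  # True as well: the contradiction proves everything
print(entails(kb, floor_dry, symbols))        # False for the consistent knowledge base
```

The contradictory knowledge base has no satisfying assignment, so the check succeeds vacuously for every query. This is the behaviour that a school 1 reasoning engine would have to avoid.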
School of thought 2, which includes Pearl's version of counterfactual reasoning, says that if you want to reason (or if you want a machine to reason) in a counterfactual way, you should not just add facts to the body of knowledge you use. You need to delete or edit other facts in the knowledge base too, before you supply it to the reasoning engine, exactly to avoid inputting a knowledge base that has internal contradictions. For example, if you want to reason about 'if the vase had not been broken', one thing you definitely need to do is to first remove the statement (or any information leading to the conclusion that) `the vase is broken' from the knowledge base that goes into your reasoning engine. You have to do this even though the fact that the vase is broken is obviously true for the current world you are in.
So school 2 avoids the problem of having to somehow build a reasoning engine that does the right thing even when a contradictory knowledge base is input. But it trades this for the problem of having to decide exactly what edits will be made to the knowledge base to eliminate the possibility of having such contradictions. In other words, if you want a machine to reason in a counterfactual way, you have to make choices about the specific edits you will make. Often, there are many possible choices, and different choices may lead to different probability distributions in the outcomes computed. This choice problem does not bother me that much: I see it as having design freedom. But if you are a philosopher of language trying to find a single obvious system of meaning for natural language counterfactual sentences, this choice problem might bother you a lot, and you might be tempted to look for some kind of representation-independent Occam's razor that can be used to decide between counterfactual edits.
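The choice problem can be illustrated with a small sketch of a structural model in which two different edits for 'the vase had not been broken' give different answers. The extra variables (`table_bumped`, `glass_spilled`) and the function `run_model` are hypothetical additions of mine to the vase example, made up only to show the effect.

```python
def run_model(exogenous, overrides=None):
    """Compute all variables from the exogenous facts; overridden variables ignore their mechanism."""
    overrides = overrides or {}
    w = dict(exogenous)
    mechanisms = [
        ("vase_broken", lambda w: w["table_bumped"]),
        ("glass_spilled", lambda w: w["table_bumped"]),
        ("floor_wet", lambda w: w["vase_broken"] or w["glass_spilled"]),
    ]
    for name, mechanism in mechanisms:
        w[name] = overrides[name] if name in overrides else mechanism(w)
    return w

# The actual world: the table was bumped, so the vase broke and the floor is wet.
actual = run_model({"table_bumped": True})

# Edit A (surgical, in the style of Pearl's interventions): cut the link into
# vase_broken and force it to False, keeping every other fact about the world.
edit_a = run_model({"table_bumped": True}, overrides={"vase_broken": False})

# Edit B (backtracking): instead edit the upstream fact that caused the breakage.
edit_b = run_model({"table_bumped": False})

print(actual["floor_wet"], edit_a["floor_wet"], edit_b["floor_wet"])  # True True False
```

Both edits remove the contradiction, but under edit A the floor is still wet (the glass spilled anyway), while under edit B it is dry. Neither input is contradictory, so an ordinary reasoning engine handles both. Which edit to make is the design choice.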
Overall, my feeling is that school 2 gives an account of logical counterfactuals that is good enough for my purposes in AGI safety work.
As a trivial school 1 edge case, one could design a reasoning engine that can deal with contradictory facts in its input knowledge base as follows: the engine first makes some school 2 edits on its input to remove the contradictions, and then proceeds calculating the requested truth value. So one could argue that the schools are not fundamentally different, though I do feel they are different in outlook, especially in their outlook on how necessary or useful it will be for AGI safety to resolve certain puzzles.
OK -- I'll do 20 to start.
Nice! There are a lot of cases being considered here, but my main takeaway is that these impact measures have surprising loopholes, once the agent becomes powerful enough to construct sub-agents.
Mathematically, my main takeaway is that, for the impact measure $\text{PENALTY}(s,a)=\sum_i |Q_{R_i}(s,a)-Q_{R_i}(s,\varnothing)|$ from Conservative Agency, if the agent wants to achieve the sub-goal $R_i$ while avoiding the penalty triggered by the $R_i$ term, it can build a sub-agent that is slightly worse at achieving $R_i$ than it would be itself, and set it loose.
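As a minimal sketch of how this penalty is computed, assume each auxiliary Q-function is available as a lookup table keyed by (state, action). The table layout, the action names, and the numbers below are illustrative assumptions of mine, not taken from the Conservative Agency paper.

```python
def penalty(q_tables, state, action, noop="noop"):
    """Sum over auxiliary rewards R_i of |Q_Ri(s, a) - Q_Ri(s, noop)|."""
    return sum(abs(q[(state, action)] - q[(state, noop)]) for q in q_tables)

# Two hypothetical auxiliary Q-functions, one table per R_i.
q_r1 = {("s", "build"): 3.0, ("s", "wait"): 1.5, ("s", "noop"): 1.0}
q_r2 = {("s", "build"): 0.5, ("s", "wait"): 2.0, ("s", "noop"): 2.0}

print(penalty([q_r1, q_r2], "s", "build"))  # 3.5
print(penalty([q_r1, q_r2], "s", "wait"))   # 0.5
print(penalty([q_r1, q_r2], "s", "noop"))   # 0.0
```

The penalty only looks at the agent's own attainable value for each $R_i$ relative to doing nothing, which is why handing the work to a separate sub-agent can slip past it.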
Now for some more speculative thoughts. I think the main source of the loophole above is the part $Q_{R_i}(s,\varnothing)$, so what happens if we just delete that part? Then we get an agent with an incentive to stop any human present in the environment from becoming too good at achieving the goal $R_i$, which would be bad. More informally, it looks like the penalty term has a loophole because it does not distinguish between humans and sub-agents.
Alice and Bob have a son Carl. Carl walks around and breaks a vase. Who is responsible?
Obviously, this depends on many factors, including Carl's age. To manage the real world, we weave a quite complex web to determine accountability.
In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.
I used to work in the lighting industry, so here are some comments from an industry perspective.
There are several high-quality studies about how more light, and being able to control dimming and color temperature, can improve subjective well-being, alertness, and sleep patterns. It is generally accepted that you do not need to go to direct sunlight type lux levels indoors to get most of the benefits. Also, you do not need to have the brightest dim level on all the time. For some people, the thing that will really help is a regular schedule that dims down below typical indoor light levels at selected times, without ever going above typical levels. I am not an expert on the latest studies, but if you want to build an indoor experimental setup to get to the bottom of what you really like, my feeling is that installing more than 4000 lux, as a peak capacity in selected areas, would definitely be a waste of money and resources.
If I wanted to install a hassle-free bright light setup in my home cheaply, I would buy lots of high-end wireless dimmable and color temperature adjustable LED light bulbs, and some low-cost spot lights to put them in, e.g. spot lights that can be attached to a ceiling mounted power rail. If you make sure the bulbs support the ZigBee standard, you will have plenty of options for control software.
If power rails with lots of ~60W equivalent bulbs lack aesthetic appeal for you, then you could go for a high-end special form factor product like that from Coelux mentioned above. The best way to think about the Coelux product, in business model development terms, is that it is not really a lighting product: it is a specialised piece of high-end furniture. So if you want to develop a business model for a bright home lighting company, the first question you have to ask yourself is whether or not you want to be in the high-end furniture business.
By the way, the main reason why the lighting industry is not making any 200W or 500W equivalent LED bulbs that you could put in your existing spot lights is cooling. LEDs are pretty energy efficient, but LED bulbs still produce some internal heat that has to be cooled away. For a 60W equivalent bulb this can happen by natural air flow around the bulb, but a 200W equivalent bulb would need something like a built-in fan.
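For a rough feel of how many ~60W equivalent bulbs a power rail setup would need, here is a back-of-the-envelope sketch. It assumes that a 60W equivalent LED bulb emits roughly 800 lumens (a typical rating, not a figure from any specific product), and it uses the idealised relation that 1 lux equals 1 lumen per square metre. The `utilization` parameter is my own placeholder for losses from beam spread and absorption. Treat the result as a lower bound, not a lighting design.

```python
import math

def bulbs_needed(target_lux, area_m2, lumens_per_bulb=800.0, utilization=1.0):
    """Estimate the bulb count needed to reach target_lux over area_m2.

    utilization is the assumed fraction of emitted light that lands on the target area.
    """
    required_lumens = target_lux * area_m2 / utilization
    return math.ceil(required_lumens / lumens_per_bulb)

print(bulbs_needed(4000, 2.0))                   # 10 bulbs in the ideal, lossless case
print(bulbs_needed(4000, 2.0, utilization=0.5))  # 20 bulbs if only half the light arrives
```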
In the context of my problem statement, a PAL with high predictive accuracy is something that is in scope to consider. This does not mean that we should or must design a real PAL in this way.
An AGI that exceeds humans in its ability to predict human responses might be a useful tool, e.g. to a politician who wants to make proposals to resolve long-lasting human conflicts. But definitely this is a tool that could also be used for less ethical things.
However, in your story, it seems that PAL can only get away with this once. After all, once Dave helps PAL get to the coffee machine once and notices that he still exists (i.e., PAL has chosen to end the simulation instead of starting a new one with updated knowledge on Dave's behavior), he will likely no longer believe that he is in a simulation.
Thanks for pointing this out, it had not occurred to me before. So I conclude that when assessing possible risks and countermeasures here, we must take into account interaction scenarios involving longer time-frames.
Thank you G Gordon and all other posters for your answers and comments! A lot of food for thought here... Below, I'll try to summarize some general take-aways from the responses.
My main question was if the simulation epiphany problem had been resolved already somewhere. It looks like the answer is no. Many commenters are leaning towards the significance of case 2 above. I myself also feel that case 2 is very significant. Taking all comments together, I am starting to feel that the simulation epiphany problem should be disentangled into two separate problems.
Problem 1 is to consider what happens in the limit case when PAL's simulator is a perfect predictor of what Dave will do. This gets us into game theory, to reason about likely outcomes of the associated Princess Bride type of infinite regress problem.
Problem 2 is to consider, starting from a particular agent design with a perfect predictor, what might happen when the perfect predictor is replaced with an imperfect one. Problem 2 allows for a case-by-case analysis.
In one case, PAL takes route B, and then notices that Dave does not experience the predicted helpful simulation epiphany. In this case we can consider how PAL might adjust its world model, or the world, to make this type of prediction error less likely in future. The possibility that PAL might find it easier to change the world, not the model, might lead us to the conclusion that we had better add penalties for simulation epiphanies to the utility function. (Such penalties create an incentive for PAL to manipulate real-world Dave into never experiencing simulation epiphanies, but most safety mechanisms involve a trade-off, so I could live with such a thing, if nothing better can be found.)
In a second case, suppose that PAL incorrectly predicts that Dave will experience a simulation epiphany when it takes path B, and further that this incorrect prediction projects that Dave concludes from the epiphany that he should attack PAL. This incorrect prediction shows very low utility, so in real life PAL will avoid taking path B. But in this case, there will also never be any error signal that will allow PAL to find out that its prediction was incorrect. What does this tell us? Maybe there is only the trivial conclusion that PAL will need to make some exploration moves occasionally, if it wants to keep improving its world model. If PAL's designers follow through on this conclusion, and mention it in PAL's user manual, then this would also lower the probability of Dave ever believing he is in a simulation.
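The point about the missing error signal in the second case can be sketched with a toy agent that only learns the true utility of the path it actually takes. The utilities, the update rule, and the fixed exploration schedule below are illustrative assumptions of mine, not a proposal for how PAL should be built.

```python
def run_agent(steps, explore_every=None):
    """Return (history of chosen paths, final predicted utilities)."""
    true_utility = {"A": 1.0, "B": 2.0}   # hypothetical ground truth
    predicted = {"A": 1.0, "B": -5.0}     # the model is wrongly pessimistic about path B
    history = []
    for t in range(1, steps + 1):
        if explore_every and t % explore_every == 0:
            # Exploration move: try the path currently predicted to be worst.
            choice = min(predicted, key=predicted.get)
        else:
            # Greedy move: take the path with the highest predicted utility.
            choice = max(predicted, key=predicted.get)
        # An error signal exists only for the path that was actually taken.
        predicted[choice] = true_utility[choice]
        history.append(choice)
    return history, predicted

greedy_history, greedy_model = run_agent(6)
explore_history, explore_model = run_agent(6, explore_every=3)

print(greedy_history, greedy_model["B"])    # never takes B, so the -5.0 prediction stays
print(explore_history, explore_model["B"])  # one forced visit to B corrects it to 2.0
```

The purely greedy agent never visits path B, so its wrong prediction is never corrected. A single exploration move is enough to repair the model in this toy setting.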
You are right that there is a potential "mind crime" ethical problem above.
One could argue that, to build an advanced AGI that avoids "mind crime", we can equip the AGI with a highly accurate predictor, but this predictor should be implemented in such a way that it is not actually a highly accurate simulator. I am not exactly sure how one could formally define the constraint of 'not actually being a simulator'. Also, maybe such an implementation constraint will fundamentally limit the level of predictive accuracy (and therefore intelligence) that can be achieved. Which might be a price we should be willing to pay.
Mathematically speaking, if I want to make AGI safety framework correctness proofs, I think it is valid to model the 'highly accurate predictor that is not a simulator' box inside the above AGI as a box with an input-output behavior equivalent to that of a highly accurate simulator. This is a very useful short-cut when making proofs. But it also means that I am not sure how one should define 'not actually being a simulator'.
I am somewhat new to this site and to the design problem of counter-factual reasoning in embedded agents. So I can say something about whether I think your approach goes in the right direction, but I can't really answer your question if this has all been done before.
Based on my own reading and recent work so far, if your goal is to create a working "what happens if I do a" reasoning system for an embedded agent, my intuition is that your approach is necessary, and that it is going in the right direction. I will unpack this statement at length further below.
I also get the impression that you are wondering whether doing the proposed work will move forward the discussion about Newcomb's problem. Not sure if this is really a part of your question, but in any case, I am not in a position to give you any good insights on what is needed to move forward the Newcomb discussion.
Here are my longer thoughts on your approach to building counterfactual reasoning. I need to start by setting the scene. Say we have an embedded agent containing a very advanced predictive world model, a model that includes all details about the agent's current internal computational structure. Say this agent wants to compute the next action that will best maximize its utility function. The agent can perform this computation in two ways.
1. One way to pick the correct action is to initiate a simulation that runs its entire world model forward, and to observe the copy of itself inside the simulation. The action that its simulated copy picks is the action that it wants. There is an obvious problem with using this approach. If the simulated copy takes the same approach, and the agent it simulates in turn does likewise, and so on, we get infinite recursion and the predictive computation will never end. So at one point in the chain, the (real or n-level simulated) agent has to use a second way to compute the action that maximizes utility.
2. The second way is the argmax what-if method. Say the agent wants to choose between 10 possible actions. It can do this by running 10 simulations that compute 'the utility of what will happen if I take action a[i]'. Now, the agent knows full well that it will end up choosing only one of these 10 actions in its real future. So only one of these 10 simulations is a simulation of a future that actually contains the agent itself. It is therefore completely inappropriate for the agent to include any statement like 'this future will contain myself' inside at least 9 of the 10 world models that it supplies to its simulator. It has no choice but to 'edit out' any world model information that could cause the simulation engine to make this inference, or else 9 of the 10 simulations will have the 5-and-10 problem, and their results cannot be trusted anymore as valid inputs to its argmax calculation.
The above reasoning shows that the agent cannot avoid ending up in a place where it has to edit some details about itself out of the simulator input. (Exception: we could assume that the simulation engine somehow does the needed editing, just after getting the input, inside of its own computation, but this amounts to basically the same thing. I interpret some proposals about updating FDT that I have read in the links as proposals to use such a special simulation engine.)
If the agent must do some editing on the simulator input, we can worry that the agent will do 'too much' editing, or the wrong type of editing. I have a theory that this worry is at least partly responsible for keeping the discussions around FDT and counterfactuals so lively.
The obvious worry is that, if the agent edits itself out of the simulation runs, it will no longer be able to reason about itself. In a philosophical sense, can we still say that such an agent possesses something like consciousness, or a robust theory of self? If not, does this mean that such an editing agent can never get to a state of super-intelligence? For me, these philosophical questions are really questions about the meaning of words, so I do not worry about AGI safety too much if these questions remain unresolved. However, editing also implies a more practical worry: how will editing impact the observable behaviour of the agent? If the agent edits itself out of the simulations, does this inevitably mean that it will lose some observable properties that we would expect to find in a highly intelligent agent, properties like showing an emergent drive to protect its own utility function?
When we look at the world models of many existing agents, e.g. the AlphaZero chess playing agent, we can interpret them as being 'pre-edited': these models do not contain any direct representation of the agent's utility function or the hardware it runs on. Therefore, the simulation engine in the agent runs no risk of blowing up, no risk of creating a 5-and-10 contradiction when it combines available pieces of world knowledge. Some of the world knowledge needed to create the contradiction never made it into the world model. However, this type of construction also makes the agent indifferent to protecting its utility function. I cover this topic in more detail in section 5.2 of my paper here. In section 5.3 I introduce an agent that uses a predictive world model that is 'less edited': this model makes the utility function of the agent show up inside the simulation runs. The pleasing result is that the agent then gets an emergent drive to preserve its utility function.
A similar agent design, one that turns the editing knob towards greater embeddedness, is in the paper Self-modification of policy and utility function in rational agents. Both these papers get pretty heavy on the math when they start proving agent properties. This may be a price we have to pay for turning the knob. I am currently investigating how to turn the knob even closer to embeddedness. There are some hints that the math might get simpler again in that case, but we'll see.
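The argmax what-if method with an 'editing' step can be sketched as follows. The dictionary-based world model, the `agent_will_choose` entry, and all function names are hypothetical and chosen only for illustration. This is not the construction used in either of the papers mentioned above.

```python
def edit_for_what_if(world_model, action):
    """Build a simulator input for 'what if I take this action'.

    The model's own prediction of the agent's choice is edited out, so it can
    never contradict the action that is being tried.
    """
    edited = {k: v for k, v in world_model.items() if k != "agent_will_choose"}
    edited["action_taken"] = action
    return edited

def simulate(edited_model):
    """Toy simulation engine: look up the payoff of the action that was taken."""
    return edited_model["payoff"][edited_model["action_taken"]]

def argmax_what_if(world_model, actions):
    """Run one edited simulation per candidate action and pick the best one."""
    utilities = {a: simulate(edit_for_what_if(world_model, a)) for a in actions}
    return max(utilities, key=utilities.get), utilities

# A 5-and-10 style toy world. The full model contains self-knowledge that would
# clash with the premise 'what if I take_5' if it were passed to the simulator.
full_model = {
    "payoff": {"take_5": 5, "take_10": 10},
    "agent_will_choose": "take_10",
}

best, utilities = argmax_what_if(full_model, ["take_5", "take_10"])
print(best, utilities)  # take_10 {'take_5': 5, 'take_10': 10}
```

The design choice here is where the editing happens: in this sketch it is done by the agent before calling the simulator, but the same filter could equally sit inside the simulation engine.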
Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).
This use of the term corrigibility above, while citing (25), is somewhat confusing to me. While the mesa-optimiser described above does have a certain type of corrigibility, I would not consider it to be corrigible according to the criteria defined in (25). See the comment section here for a longer discussion about this topic.