Some further thoughts on training ML models, based on discussions with Caspar Oesterheld:
Regarding your last point 3., why does this make you more pessimistic rather than just very uncertain about everything?
Why would alignment with the outer reward function be the simplest possible terminal goal? Specifying the outer reward function in the weights would presumably be more complicated. So one would have to specify a pointer towards it in some way. And it's unclear whether that pointer is simpler than a very simple misaligned goal.
Such a pointer would be simple if the neural network already has a representation of the outer reward function in weights anyway (rather than deriving it at run-time in the activations). But it seems likely that any fixed representati...
I am not sure I understand. Are you saying that GPT thinks the text is genuinely from the future (i.e., the distribution that it is modeling contains text from the future), or that it doesn't think so? The sentence you quote is intended to mean that it does not think the text is genuinely from the future.
Thanks for your comment!
Regarding 1: I don't think it would be good to simulate superintelligences with our predictive models. Rather, we want to simulate humans to elicit safe capabilities. We talk more about competitiveness of the approach in Section III.
Regarding 3: I agree it might have been good to discuss cyborgism specifically. I think cyborgism is to some degree compatible with careful conditioning. One possible issue when interacting with the model arises when the model is trained on / prompted with its own outputs, or data that has been influence...
I like the idea behind this experiment, but I find it hard to tell from this write-up what is actually going on. I.e., what exactly is the training setup, what exactly is the model, and which parts are hard-coded vs. learned? Why is it a weirdo janky thing instead of some other standard model or algorithm? It would be good if this were explained more in the post (it is very effortful to piece this together by going through the code). Right now I have a hard time making any inferences from the results.
Update: we recently discovered the performative prediction (Perdomo et al., 2020) literature (HT Alex Pan). This is a machine learning setting where we choose a model parameter (e.g., parameters for a neural network) that minimizes expected loss (e.g., classification error). In performative prediction, the distribution over data points can depend on the choice of model parameter. Our setting is thus a special case in which the parameter of interest is a probability distribution, the loss is a scoring function, and data points are discrete outcomes. Most re...
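To make the setting concrete, here is a toy numerical sketch (my own made-up numbers, not an example from the paper): a binary outcome whose probability depends linearly on the reported prediction. Under the Brier score, which is strictly proper when the distribution is fixed, the loss-minimizing report then differs from the honest fixed point.

```python
def outcome_prob(p):
    # performative dependence (hypothetical): the outcome's probability
    # is a function of the reported prediction p
    return 0.2 + 0.4 * p

def expected_brier_loss(p):
    # expected Brier (squared-error) loss of reporting p, given that the
    # world reacts to the report via outcome_prob
    q = outcome_prob(p)
    return q * (1 - p) ** 2 + (1 - q) * p ** 2

grid = [i / 1000 for i in range(1001)]
best_report = min(grid, key=expected_brier_loss)  # loss-minimizing report
honest_fixed_point = 0.2 / 0.6  # the unique p with p == outcome_prob(p)

# algebraically, expected_brier_loss(p) = 0.2 * p**2 + 0.2 here, so the
# optimal report is p = 0 while the honest fixed point is p = 1/3
```

So once the outcome distribution depends on the prediction, even a strictly proper scoring rule can make a dishonest report optimal, which is exactly the kind of phenomenon the performative prediction literature studies.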
I think there should be a space both for in-progress research dumps and for more worked out final research reports on the forum. Maybe it would make sense to have separate categories for them or so.
I'm not sure I understand what you mean by a skill-free scoring rule. Can you elaborate what you have in mind?
Thanks for your comment!
Your interpretation sounds right to me. I would add that our result implies that it is impossible to incentivize honest reports in our setting. If you want to incentivize honest reports when the outcome distribution is constant (i.e., does not depend on the prediction), then you have to use a strictly proper scoring rule (this is just the definition of “strictly proper”). But we show that for any strictly proper scoring rule, there is a dependence of the outcome distribution on the prediction such that a dishonest prediction is optimal.
Proposition 13 shows that it is possible to “tune” scoring rules to make optimal predictions very close to h...
I think such a natural progression could also lead to something similar to extinction (in addition to permanently curtailing humanity's potential). E.g., maybe we are currently in a regime where optimizing proxies harder still leads to improvements to the true objective, but this could change once we optimize those proxies even more. The natural progression could follow an inverted U-shape.
E.g., take the marketing example. Maybe we will get superhuman persuasion AIs, but also AIs that protect us from persuasive ads and AIs that can provide honest reviews. ...
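The inverted U can be illustrated with a purely made-up toy functional form: a proxy that keeps rewarding more optimization pressure, while the true objective it imperfectly tracks peaks and then declines.

```python
def true_objective(x):
    # hypothetical: more proxy optimization helps at first,
    # but its costs eventually dominate (hence the inverted U)
    return x - 0.05 * x ** 2

def proxy(x):
    # the proxy simply says "more is better"
    return x

pressures = range(0, 31, 5)
true_values = [true_objective(x) for x in pressures]
# the proxy increases monotonically over this range, while the true
# objective peaks at x = 10 and turns negative beyond x = 20
```

On this toy picture, trial and error works fine on the rising part of the curve, and the question is whether we notice in time once we pass the peak.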
There is a chance that one can avoid having to solve ontology identification in general if one punts the problem to simulated humans. I.e., it seems one can train the human simulator without solving it, and then use simulated humans to solve the problem. One may have to solve some specific ontology identification problems to make sure one gets an actual human simulator and not e.g. a malign AI simulator. However, this might be easier than solving the problem in full generality.
Minor comment: regarding the RLHF example, one could solve the problem implicitl...
(I think Stockfish would be classified as AI in computer science. I.e., you'd learn about the basic algorithms behind it in a textbook on AI. Maybe you mean that Stockfish was non-ML, or that it had handcrafted heuristics?)
Great post!
I like that you point out that we'd normally do trial and error, but that this might not work with AI. I think you could possibly make clearer where this fails in your story. You do point out how HLMI might become extremely widespread and how it might replace most human work. Right now it seems to me like you argue essentially that the problem is a large-scale accident that comes from a distribution shift. But this doesn't yet say why we couldn't e.g. just continue trial-and-error and correct the AI once we notice that something is going wrong. ...
Overall I agree that solutions to deception look different from solutions to other kinds of distributional shift. (Also, there are probably different solutions to different kinds of large distributional shift as well. E.g., solutions to capability generalization vs solutions to goal generalization.)
I do think one could claim that some general solutions to distributional shift would also solve deceptiveness. E.g., the consensus algorithm works for any kind of distributional shift, but it should presumably also avoid deceptiveness (in the sense that it would...
I like this post and agree that there are different threat models one might categorize broadly under "inner alignment". Before reading this I hadn't reflected on the relationship between them.
Some random thoughts (after an in-person discussion with Erik):
Great post!
Regarding your “Redirecting civilization” approach: I wonder about the competitiveness of this. It seems that we will likely build x-risk-causing AI before we have a good enough model to be able to e.g. simulate the world 1000 years into the future on an alternative timeline? Of course, competitiveness is an issue in general, but the more factored cognition or IDA based approaches seem more realistic to me.
...Alternatively, we can try to be clever and “import” research from the future repeatedly. For instance we can first ask our model to produce r
These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models.
Your post seems to be focused more on pointing out a missing piece in the literature rather than asking for a solution to the specific problem (which I believe is a valuable contribution). Regardless, here is roughly how I would understand “what they mean”:
Let X be the task space, Y the output space, Θ the model space, B: Θ → ℝ our base objective, and O_θ(x) the mesa objective of the model θ ∈ Θ for input x ∈ X. Assume that there exists some map f mapping internal objects to outputs by the model,...
Thank you!
It does seem like simulating text generated by similar models would be hard to avoid when using the model as a research assistant. Presumably any research would get “contaminated” at some point, and models might cease to be helpful without updating them on the newest research.
In theory, if one were to re-train models from scratch on the new research, this might be equivalent to the models updating on the previous models' outputs before reasoning about superrationality, so it would turn things into a version of Newcomb's problem with transpa...
Thanks for your comment! I agree that we probably won't be able to get a textbook from the future just by prompting a language model trained on human-generated texts.
As mentioned in the post, maybe one could train a model to also condition on observations. If the model is very powerful, and it really believes the observations, one could make it work. I do think sometimes it would be beneficial for a model to attain superhuman reasoning skills, even if it is only modeling human-written text. Though of course, this might still not happen in practice.
Overall ...
Would you count issues with malign priors etc. also as issues with myopia? Maybe I'm missing something about what myopia is supposed to mean and be useful for, but these issues seem to have a similar spirit of making an agent do stuff that is motivated by concerns about things happening at different times, in different locations, etc.
E.g., a bad agent could simulate 1000 copies of the LCDT agent and reward it for a particular action favored by the bad agent. Then depending on the anthropic beliefs of the LCDT agent, it might behave so as to maximize this r...
> "If someone had a strategy that took two years, they would have to over-bid in the first year, taking a loss. But then they have to under-bid on the second year if they're going to make a profit, and--"
> "And they get undercut, because someone figures them out."
I think one could imagine scenarios where the first trader can use their influence in the first year to make sure they are not undercut in the second year, analogous to the prediction market example. For instance, the trader could install some kind of encryption in the software that this company use...
I find this particularly curious since naively, one would assume that weight sharing implicitly implements a simplicity prior, so it should make optimization more likely and thus also deceptive behavior? Maybe the argument is that somehow weight sharing leaves less wiggle room for obscuring one's reasoning process, making a potential optimizer more interpretable? But the hidden states and tied weights could still be encoding deceptive reasoning in an uninterpretable way?
Wolfgang Spohn develops the concept of a "dependency equilibrium" based on a similar notion of evidential best response (Spohn 2007, 2010). A joint probability distribution is a dependency equilibrium if all actions of all players that have positive probability are evidential best responses. In case there are actions with zero probability, one instead evaluates a sequence of joint probability distributions p_n such that p_n converges to the given distribution and p_n assigns positive probability to every action, for all n. Using your notation of a probability matrix and a utility matrix, the expected utili...
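In the full-support case the definition can be checked directly. Below is a minimal sketch (my own encoding, comparing only actions with positive marginal probability; Spohn's limit construction for zero-probability actions is omitted), using Prisoner's Dilemma payoffs as the example.

```python
def is_dependency_equilibrium(P, U1, U2, tol=1e-9):
    """Check whether the joint distribution P (rows: player 1's actions,
    columns: player 2's actions) is a dependency equilibrium, comparing
    only actions with positive marginal probability."""
    n, m = len(P), len(P[0])
    row_p = [sum(P[i]) for i in range(n)]
    col_p = [sum(P[i][j] for i in range(n)) for j in range(m)]
    # conditional expected utilities of positive-probability actions
    eu1 = {i: sum(P[i][j] * U1[i][j] for j in range(m)) / row_p[i]
           for i in range(n) if row_p[i] > tol}
    eu2 = {j: sum(P[i][j] * U2[i][j] for i in range(n)) / col_p[j]
           for j in range(m) if col_p[j] > tol}
    # every positive-probability action must be an evidential best response
    return (all(v >= max(eu1.values()) - tol for v in eu1.values())
            and all(v >= max(eu2.values()) - tol for v in eu2.values()))

# Prisoner's Dilemma payoffs (action 0 = cooperate, 1 = defect)
U1 = [[2, -1], [3, 0]]
U2 = [[2, 3], [-1, 0]]

both_cooperate = [[1.0, 0.0], [0.0, 0.0]]
correlated_mix = [[0.45, 0.05], [0.05, 0.45]]
```

Here the all-cooperate distribution passes this check, while the correlated mixture fails: defecting has positive probability but yields a lower conditional expected utility than cooperating, so it is not an evidential best response.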
...I would like to submit the following entries:
A typology of Newcomblike problems (philosophy paper, co-authored with Caspar Oesterheld).
A wager against Solomonoff induction (blog post).
Three wagers for multiverse-wide superrationality (blog post).
UDT is “updateless” about its utility function (blog post). (I think this post is hard to understand. Nevertheless, if anyone finds it intelligible, I would be interested in their thoughts.)
EDT doesn't pay if it is given the choice to commit to not paying ex-ante (before receiving the letter). So the thought experiment might be an argument against ordinary EDT, but not against updateless EDT. If one takes the possibility of anthropic uncertainty into account, then even ordinary EDT might not pay the blackmailer. See also Abram Demski's post about the Smoking Lesion. Ahmed and Price defend EDT along similar lines in a response to a related thought experiment by Frank Arntzenius.
Thanks for your answer! This "gain" approach seems quite similar to what Wedgwood (2013) has proposed as "Benchmark Theory", which behaves like CDT in cases with, but more like EDT in cases without causally dominant actions. My hunch would be that one might be able to construct a series of thought-experiments in which such a theory violates transitivity of preference, as demonstrated by Ahmed (2012).
I don't understand how you arrive at a gain of 0 for not smoking as a smoke-lover in my example. I would think the gain for not smoking is higher:
...
> From my perspective, I don’t think it’s been adequately established that we should prefer updateless CDT to updateless EDT
I agree with this.
> It would be nice to have an example which doesn’t arise from an obviously bad agent design, but I don’t have one.
I’d also be interested in finding such a problem.
I am not sure whether your smoking lesion steelman actually makes a decisive case against evidential decision theory. If an agent knows about their utility function on some level, but not on the epistemic level, then this can just as well be made into a
> ...Imagine that Omega tells you that it threw its coin a million years ago, and would have turned the sky green if it had landed the other way. Back in 2010, I wrote a post arguing that in this sort of situation, since you've always seen the sky being blue, and every other human being has also always seen the sky being blue, everyone has always had enough information to conclude that there's no benefit from paying up in this particular counterfactual mugging, and so there hasn't ever been any incentive to self-modify into an agent that would pay up ... and s
Thanks for the reply and all the useful links!
> It's not a given that you can easily observe your existence.
It took me a while to understand this. Would you say that for example in the Evidential Blackmail, you can never tell whether your decision algorithm is just being simulated or whether you're actually in the world where you received the letter, because both times, the decision algorithms receive exactly the same evidence? So in this sense, after updating on receiving the letter, both worlds are still equally likely, and only via your decision do yo...
I agree with all of this, and I can't understand why the Smoking Lesion is still seen as the standard counterexample to EDT.
Regarding the blackmail letter: I think that in principle, it should be possible to use a version of EDT that also chooses policies based on a prior instead of actions based on your current probability distribution. That would be "updateless EDT", and I think it wouldn't give in to Evidential Blackmail. So I think rather than an argument against EDT, it's an argument in favor of updatelessness.
Thanks for the link! What I don't understand is how this works in the context of empirical and logical uncertainty. Also, it's unclear to me how this approach relates to Bayesian conditioning. E.g., if the sentence "if a holds, then o holds" is true, doesn't this also mean that P(o|a)=1? In that sense, proof-based UDT would just be an elaborate specification of how to assign these conditional probabilities "from the viewpoint of the original position", so with updatelessness, and in the context of full logical inference and knowledge of ...
Thanks a lot for your elaborate reply!
> (So I'm not even sure what CDT is supposed to do here, since it's not clear that the bet is really on the past state of the world and not on truth of a proposition about the future state of the world.)
Hmm, good point. The truth of the proposition is evaluated on the basis of Alice's action, which she can causally influence. But we could think of a Newcomblike scenario in which someone made a perfect prediction 100 years ago and put down a note about what state the world was in at that time. Now instead of checking Al...
CDT, TDT, and UDT would not give away the money because there is no causal (or acausal) influence on the number of universes.
I'm not so sure about UDT's response. From what I've heard, depending on the exact formal implementation of the problem, UDT might also pay the money? If your thought experiment works via a correlation between the type of universe you live in and the decision theory you employ, then it might be a similar problem to the Coin Flip Creation. I introduced the latter decision problem in an attempt to make a less ambiguous version of th...
That's what I was trying to do with the Coin Flip Creation :) My guess: once you specify the Smoking Lesion and make it unambiguous, it ceases to be an argument against EDT.
> I suspect this is a confusion about free will. To be concrete, I think that a thermostat has a causal influence on the future, and does not violate determinism. It deterministically observes a sensor, and either turns on a heater or a cooler based on that sensor, in a way that does not flow backwards--turning on the heater manually will not affect the thermostat's attempted actions except indirectly through the eventual effect on the sensor.
Fair point :) What I meant was that for every world history, there is only one causal influence I could possibly h...
I agree with points 1) and 2). Regarding point 3), that's interesting! Do you think one could also prove that if you don't smoke, you can't (or are less likely to) have the gene in the Smoking Lesion? (See also my response to Vladimir Nesov's comment.)
> The point of decision theories is not that they let you reach from beyond the Matrix and change reality in violation of physics; it's that you predictably act in ways that optimize for various criteria.
I agree with this. But I would argue that causal counterfactuals somehow assume that we can "reach from beyond the Matrix and change reality in violation of physics". They work by comparing what would happen if we detached our “action node” from its ancestor nodes and manipulated it in different ways. So causal thinking in some way seems to viol...
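The "detached action node" picture can be made concrete with Pearl's do-operator. Here is a minimal smoking-lesion sketch (made-up numbers): the gene causes both smoking and cancer, while smoking itself is harmless. Conditioning (the evidential reading) and graph surgery (the causal reading) then come apart.

```python
# toy causal model: gene -> smoking, gene -> cancer (hypothetical numbers)
P_GENE = 0.5
P_SMOKE_GIVEN_GENE = {True: 0.9, False: 0.1}
P_CANCER_GIVEN_GENE = {True: 0.8, False: 0.1}

def p_cancer_conditioning(smoke):
    # evidential reading: observing smoking is evidence about the gene
    num = den = 0.0
    for gene in (True, False):
        pg = P_GENE if gene else 1 - P_GENE
        ps = P_SMOKE_GIVEN_GENE[gene] if smoke else 1 - P_SMOKE_GIVEN_GENE[gene]
        num += pg * ps * P_CANCER_GIVEN_GENE[gene]
        den += pg * ps
    return num / den

def p_cancer_do(smoke):
    # causal reading (surgery): cut the gene -> smoking edge, so the action
    # carries no evidence about the gene; the argument is simply ignored
    return sum((P_GENE if gene else 1 - P_GENE) * P_CANCER_GIVEN_GENE[gene]
               for gene in (True, False))
```

Conditioning on smoking raises the probability of cancer (to 0.73 in this toy model) purely by being evidence about the gene, while after surgery it stays at the base rate of 0.45, which is the sense in which the causal counterfactual treats the action as detached from its ancestors.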
Thanks for your comment! I find your line of reasoning in the ASP problem and the Coin Flip Creation plausible. So your point is that, in both cases, by choosing a decision algorithm, one also gets to choose where this algorithm is being instantiated? I would say that in the CFC, choosing the right action is sufficient, while in the ASP you also have to choose the whole UDT program so as to be instantiated in a beneficial way (similar to the distinction of how TDT iterates over acts and UDT iterates over policies).
Would you agree that the Coin Flip Creatio...
Yes, that's correct. I would say that "two-boxing" is generally what CDT would recommend, and "one-boxing" is what EDT recommends. Yes, medical Newcomb problems are different from Newcomb's original problem in that there are no simulations of decisions involved in the former.
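The two recommendations can be written out as a small expected-utility calculation (standard Newcomb payoffs, with an assumed predictor accuracy of 0.99):

```python
ACC = 0.99               # assumed predictor accuracy
M, K = 1_000_000, 1_000  # opaque-box prize, transparent-box prize

# EDT: condition the (already fixed) box contents on one's own action
edt_one_box = ACC * M            # ~990000: predictor likely foresaw one-boxing
edt_two_box = (1 - ACC) * M + K  # ~11000: predictor likely foresaw two-boxing

# CDT: hold the contents fixed at some credence b that the opaque box is full
def cdt_utilities(b):
    # returns (one-boxing, two-boxing); two-boxing dominates by K for every b
    return b * M, b * M + K
```

So EDT one-boxes because of the evidential correlation with the prediction, while CDT two-boxes by dominance, whatever its credence b.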
> Rationality is about more than empirical studies. It's about developing sensible models of the world. It's about conveying sensible models to people in ways that they'll understand them. It's about convincing people that your model is better than theirs, sometimes without having to do an experiment.
Hmm, I'm not sure I understand what you mean. Maybe I'm missing something? Isn't this exactly what Bayesianism is about? Bayesianism is just using laws of probability theory to build an understanding of the world, given all the evidence that we encounter. Of ...
Since the links above are broken, here are links to all the other posts in the sequence: