If a tree falls in the forest, and two people are around to hear it, does it make a sound?
I feel like typically you'd say yes, it makes a sound. Not two sounds, one for each person, but one sound that both people hear.
But that must mean that a sound is not just auditory experiences, because then there would be two rather than one. Rather it's more like, emissions of acoustic vibrations. But this implies that it also makes a sound when no one is around to hear it.
I think this just repeats the original ambiguity of the question, by using the word "sound" in a context where the common meaning (air vibrations perceived by an agent) is only partly applicable. It's still a question of definition, not of understanding what actually happens.
I think we're playing too much with the meaning of "sound" here. The tree causes some vibrations in the air, which leads to two auditory experiences since there are two people.
I think I've got it, the fix to the problem in my corrigibility thing!
So to recap: It seems to me that for the stop button problem, we want humans to control whether the AI stops or runs freely, which is a causal notion, and so we should use counterfactuals in our utility function to describe it. (Dunno why most people don't do this.) That is, if we say that the AI's utility should depend on the counterfactuals related to human behavior, then it will want to observe humans to get input on what to do, rather than manipulate them, because this is the only way for it to be dependent on the counterfactuals.
But as So8res pointed out, just using counterfactuals directly is a big problem because it rapidly brings us out of distribution. A practical implementation of this beyond the stop button problem ends up having an exponential number of counterfactuals to cover, and the vast majority of those counterfactuals will be far outside of the real-world distribution. This means that the AI might not get aligned at all, because the real-world applications don't get nonnegligible weight in the utility function.
But I think I've figured out a solution now, which I'd call conditional+counterfactual corrigibility. As usual let's use B to denote that the stop button gets pressed and the AI shuts down, V to denote whichever non-corrigible utility function we want to make corrigible, and Xs/Xf to denote a counterfactual where people do (s) or do not (f) want to press the stop button. However, we will also use S and F to denote the conditions where people do or do not want to press the stop button. In that case, we can define Control(C) to mean that humans can control whether the AI stops or runs in condition C:
and then we simply want to define the utility as saying that people can control the AI in both the S and the F condition:
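A sketch of how these two definitions could be written out, using only the symbols introduced above. The exact form is an assumption: it reads Control(C) as the expected value, conditional on C, of shutting down under the s counterfactual plus pursuing V under the f counterfactual, and it reads the overall utility U (a name introduced here) as the sum over the two conditions:

```latex
\mathrm{Control}(C) = \mathbb{E}[B_s \mid C] + \mathbb{E}[V_f \mid C]

U = \mathrm{Control}(S) + \mathrm{Control}(F)
```

Under this reading, the s|S, f|S, s|F and f|F branches each correspond to one counterfactual term evaluated under one conditional.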
Previously, I strongly emphasized the need to keep the AI "under a counterfactual" - that is, if it believed it could control whether humans want to stop it or not, then it would be incentivized to manipulate humans. But this is what brings us out of distribution. However, counterfactuals aren't the only way to keep the appearance of a phenomenon constant - conditionals work too. And conditionals keep you nicely on distribution, so that's now my solution to the distribution issues. This means that we can use much less invasive counterfactuals.
That said, this approach is going to have a hard time with chaotic phenomena, as combining conditionals and counterfactuals in the presence of chaos can get pretty weird.
It also might be vulnerable to some variant of the critiques that were first raised against it, because now the conditional introduces a link between its policy and the scenarios it faces, but I can't immediately construct a case where it happens, because the conditional would still somewhat tend to sabotage the obvious approaches. This sort of unclarity is kind of concerning when it comes to the idea.
Like suppose the AI immediately very publicly does something that looks very unsafe. Say it grabs control over the stop button and starts mass-producing paperclips in an extremely publicly visible way. This would probably lead to people wanting to stop it. So therefore, if it has a policy like that, the |S conditional would lead to people quickly wanting to stop it. This means that in the |S branch, it can quickly determine whether it is in the f|S branch or the s|S branch; in the f|S case, it can then keep going with whatever optimization V specified, while in the s|S case, it can then immediately shut itself down.
But the reason I think the AI *wouldn't* do this is, what about the |F branch? If you condition on humans not wanting to press the stop button even though there's a clearly unaligned AI, what sort of situation could produce this? I have trouble imagining it, because it seems like it would need to be pretty extreme. The best idea I can come up with is stuff like "black hole swallows the earth", but this would rank pretty low in the AI's utility function, and therefore it would avoid acting this way in order to have a reasonable |F branch.
But this does not seem like sane reasoning on the AI's side to me, so it seems like this should be fixed. And of course, fixed in a principled rather than unprincipled way.
One thing that seems really important for agency is perception. And one thing that seems really important for perception is representation learning. Where representation learning involves taking a complex universe (or perhaps rather, complex sense-data) and choosing features of that universe that are useful for modelling things.
When the features are linearly related to the observations/state of the universe, I feel like I have a really good grasp of how to think about this. But most of the time, the features will be nonlinearly related; e.g. in order to do image classification, you use deep neural networks, not principal component analysis.
I feel like it's an interesting question: where does the nonlinearity come from? Many causal relationships seem essentially linear (especially if you do appropriate changes of variables to help, e.g. taking logarithms; for many purposes, monotonicity can substitute for linearity), and lots of variance in sense-data can be captured through linear means, so it's not obvious why nonlinearity should be so important.
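A minimal sketch of the change-of-variables point, assuming a made-up power-law relationship y = 3 * x^2 (the data and the helper name `pearson` are illustrative, not taken from any real dataset). A straight line fits the raw variables only approximately, while the logarithms are exactly linearly related:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: how well a straight line describes the relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical power-law relationship (illustrative only): y = 3 * x^2
xs = [float(x) for x in range(1, 51)]
ys = [3.0 * x ** 2 for x in xs]

# Linear fit quality on the raw variables versus on their logarithms.
raw_r = pearson(xs, ys)
log_r = pearson([math.log(x) for x in xs], [math.log(y) for y in ys])

print(f"raw variables:     r = {raw_r:.4f}")
print(f"after taking logs: r = {log_r:.4f}")
```

Since log y = log 3 + 2 log x, the second correlation is 1 up to floating-point error. This only covers the monotonic case mentioned above; it says nothing about where genuinely non-monotonic structure comes from.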
Here's some ideas I have so far:
It seems like it would be nice to develop a theory on sources of nonlinearity. This would make it clearer why sometimes selecting features linearly seems to work (e.g. consider IQ tests), and sometimes it doesn't.
I recently wrote a post about myopia, and one thing I found difficult when writing the post was really justifying its usefulness. So eventually I mostly gave up, leaving just the point that it can be used for some general analysis (which I still think is true), but without doing any optimality proofs.
But now I've been thinking about it further, and I think I've realized - don't we lack formal proofs of the usefulness of myopia in general? Myopia seems to mostly be justified by the observation that we're already being myopic in some ways, e.g. when training prediction models. But I don't think anybody has formally proven that training prediction models myopically rather than nonmyopically is a good idea for any purpose?
So that seems like a good first step. But that immediately raises the question, good for what purpose? Generally it's justified with us not wanting the prediction algorithms to manipulate the real-world distribution of the data to make it more predictable. And that's sometimes true, but I'm pretty sure one could come up with cases where it would be perfectly fine to do so, e.g. I keep some things organized so that they are easier to find.
It seems to me that it's about modularity. We want to design the prediction algorithm separately from the agent, so we do the predictions myopically because modifying the real world is the agent's job. So my current best guess for the optimality criterion of myopic optimization of predictions would be something related to supporting a wide variety of agents.
Yeah, I think usually when people are interested in myopia, it's because they think there's some desired solution to the problem that is myopic / local, and they want to try to force the algorithm to find that solution rather than some other one. E.g. answering a question based only on some function of its contents, rather than based on the long-term impact of different answers.
I think that once you postulate such a desired myopic solution and its non-myopic competitors, then you can easily prove that myopia helps. But this still leaves the question of how we know this problem statement is true - if there's a simpler myopic solution that's bad, then myopia won't help (so how can we predict if this is true?) and if there's a simpler non-myopic solution that's good, myopia may actively hurt (this one seems a little easier to predict though).
Theory for a capabilities advance that is going to occur soon:
OpenAI is currently getting lots of novel triplets (S, U, A), where S is a system prompt, U is a user prompt, and A is an assistant answer.
Given a bunch of such triplets (S, U_1, A_1), ... (S, U_n, A_n), it seems like they could probably create a model P(S|U_1, A_1, ..., U_n, A_n), which could essentially "generate/distill prompts from examples".
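A rough sketch of what the training data for such a model might look like, assuming the triplets are available as plain tuples. The function name, the text format, and the example triplets are all hypothetical; this is not a description of any real pipeline:

```python
from collections import defaultdict

def build_prompt_inference_examples(triplets, n=2):
    """Turn (S, U, A) triplets into (input, target) pairs for a model P(S | U_1, A_1, ..., U_n, A_n).

    Exchanges sharing the same system prompt S are grouped, cut into chunks of
    n exchanges, and each chunk becomes one example whose target is S.
    """
    by_system = defaultdict(list)
    for system, user, assistant in triplets:
        by_system[system].append((user, assistant))

    examples = []
    for system, exchanges in by_system.items():
        # Non-overlapping chunks of exactly n exchanges; leftovers are dropped.
        for i in range(0, len(exchanges) - n + 1, n):
            chunk = exchanges[i:i + n]
            context = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in chunk)
            examples.append({"input": context, "target": system})
    return examples

# Made-up triplets, purely for illustration.
triplets = [
    ("Answer in French.", "Hello", "Bonjour"),
    ("Answer in French.", "Thanks", "Merci"),
    ("Be terse.", "What is 2+2?", "4"),
    ("Be terse.", "Capital of France?", "Paris"),
    ("Be terse.", "Color of the sky?", "Blue"),
]

examples = build_prompt_inference_examples(triplets, n=2)
for ex in examples:
    print(repr(ex["target"]), "<-", repr(ex["input"]))
```

The model itself would then be trained to predict `target` from `input`; that part is left out, since it depends entirely on the training setup.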
This seems like the first step towards efficiently integrating information from lots of places. (Well, they could ofc also do standard SGD-based training, but it has its issues.)
A followup option: they could use something a la Constitutional AI to generate perturbations A'_1, ..., A'_n. If they have a previous model like the above, they could then generate a perturbation P(S'|U_1, A'_1, ..., U_n, A'_n). I consider this significant because this then gives them the training data to create a model P(S'|S, U_1, A_1, A'_1), which essentially allows them to do "linguistic backchaining": The user can update an output of the network A_1 -> A'_1, and then the model can suggest a way to change the prompt to obtain similar updates in the future.
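To make the data flow in the "linguistic backchaining" step concrete, here is a toy sketch for the n = 1 case. Everything in it is an assumption for illustration: `toy_infer_prompt` is a trivial stand-in for the prompt-inference model P(S'|U_1, A'_1), and the record fields and example strings are made up:

```python
def build_backchaining_rows(records, infer_prompt):
    """Build (input, target) rows for a model P(S' | S, U_1, A_1, A'_1).

    records: dicts with keys "S", "U", "A", "A_edited" (A_edited plays the role of A'_1).
    infer_prompt: callable standing in for the prompt-inference model.
    """
    rows = []
    for rec in records:
        # S' is whatever prompt the inference model assigns to the edited answer.
        s_prime = infer_prompt([(rec["U"], rec["A_edited"])])
        rows.append({
            "input": (rec["S"], rec["U"], rec["A"], rec["A_edited"]),
            "target": s_prime,
        })
    return rows

def toy_infer_prompt(exchanges):
    """Hypothetical stand-in: guesses a system prompt from surface features only."""
    if all(answer.isupper() for _, answer in exchanges):
        return "Answer in all caps."
    return "Answer normally."

# Made-up records, purely for illustration.
records = [
    {"S": "Answer normally.", "U": "Hello", "A": "Hi there", "A_edited": "HI THERE"},
    {"S": "Answer normally.", "U": "Thanks", "A": "No problem", "A_edited": "No problem!"},
]

rows = build_backchaining_rows(records, toy_infer_prompt)
for row in rows:
    print(row["input"], "->", row["target"])
```

In the first record the edit changes the inferred prompt, so the row teaches "given this edit, suggest this new prompt"; in the second the edit is too small for the toy model to notice, so the target stays the same.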
Furthermore I imagine this could get combined together into some sort of "linguistic backpropagation" by repeatedly applying models like this, which could unleash a lot of methods to a far greater extent than they have been so far.
Obviously this is just a very rough sketch, and it would be a huge engineering and research project to get this working in practice. Plus maybe there are other methods that work better. I'm mainly just playing around with this because I think there's a strong economic pressure for something-like-this, and I want a toy model to use for thinking about its requirements and consequences.
Actually I suppose they don't even need to add perturbations to A directly, they can just add perturbations to S and generate A's from S'. Or probably even look at users' histories to find direct perturbations to either S or A.
I recently wrote a post presenting a step towards corrigibility using causality here. I've got several ideas in the works for how to improve it, but I'm not sure which one is going to be most interesting to people. Here's a list.
I think there may be some variant of this that could work. Like if you give the AI reward proportional to Bs+rf (where r is a reward function for V) for its current world-state (rather than picking a policy that maximizes Bs+Vf overall; so one difference is that you'd be summing over the reward rather than giving a single one), then that would encourage the AI to create a state where shutdown happens when humans want to press the button and V happens when they don't. But the issue I have with this proposal is that the AI would be prone to not respect past attempts to press the stop button. I think maybe if one picked a different reward function, like (Bs+r)f, then it could work better (though the Bs part would need a time delay...). Though this reward function might leave it open to the "trying to shut down the AI for reasons" objection that you gave before; I think that's fixed by moving the f counterfactual outside of the sum over rewards, but I'm not sure.
This is the immediate insight for the application to the stop button. But on a broader level, the insight is that corrigibility, respecting humans' preferences, etc. are best thought of as being preferences about the causal effect of humans on various outcomes, and those sorts of preferences can be specified using utility functions that involve counterfactuals.
This seems to be what sets my proposal apart from most "utility indifference proposals", which seem to be possible to phrase in terms of counterfactuals on a bunch of other variables than humans.
(I also think that there are other useful things that can be specified using utility functions that involve counterfactuals, which I'm trying to prepare for an explainer post. For instance, a sort of "encapsulation" - if you're a paperclip producer, you might want to make a paperclip maximizer which is encapsulated in the sense that it is only allowed to work within a single factory, using a single set of resources, and not influencing the world otherwise. This could be specified using a counterfactual that the outside world's outcome must be "as if" the resources in the factory just disappeared and paperclips appeared at its output act-of-god style. This avoids any unintended impacts on the outside world while still preserving the intended side effect of the creation of a high but controlled amount of paperclips. However, I'm still working on making it sufficiently neat, e.g. this proposal runs into problems with the universe's conservation laws.)
Are there good versions of DAGs for other things than causality?
I've found Pearl-style causal DAGs (and other causal graphical models) useful for reasoning about causality. It's a nice way to abstractly talk and think about it without needing to get bogged down with fiddly details.
In a way, causality describes the paths through which information can "flow". But information is not the only thing in the universe that gets transferred from node to node; there's also things like energy, money, etc., which have somewhat different properties but intuitively seem like they could benefit from graph-based models too.
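One way to make the "somewhat different properties" concrete is a toy sketch on a small DAG. It assumes, as an illustration and not as an established modelling framework, that the key difference is that information can be copied to every descendant while a quantity like money is conserved and has to be split. The graph, the even-split rule, and all names are made up:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy graph: node -> list of children.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def topo_order(edges):
    """Return the nodes with every parent listed before its children."""
    preds = {node: set() for node in edges}
    for parent, children in edges.items():
        for child in children:
            preds[child].add(parent)
    # TopologicalSorter takes a mapping of node -> predecessors.
    return list(TopologicalSorter(preds).static_order())

def propagate_information(edges, source):
    """Information is copied: every descendant of the source ends up informed."""
    informed = {source}
    for node in topo_order(edges):
        if node in informed:
            informed.update(edges[node])
    return informed

def propagate_conserved(edges, source, amount):
    """A conserved quantity is split evenly among children; sinks keep what they get."""
    held = {node: 0.0 for node in edges}
    held[source] = amount
    for node in topo_order(edges):
        children = edges[node]
        if children and held[node] > 0:
            share = held[node] / len(children)
            for child in children:
                held[child] += share
            held[node] = 0.0
    return held

informed = propagate_information(edges, "a")
held = propagate_conserved(edges, "a", 10.0)
print(sorted(informed))
print(held, "total:", sum(held.values()))
```

The total held stays at 10.0 throughout, which is the constraint that plain information flow does not have.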
I'm pretty sure I've seen a number of different graph-based models for describing different flows like this, but I don't know their names, and also the ones I've seen seemed highly specialized and I'm not sure they're the best to use. But I thought, it seems quite probable that someone on LessWrong would know of a recommended system to learn.