If I can demonstrate a goal-less agent acting like it has a goal, it is already too late. We need to recognize this theoretically and stop it from happening.
I didn't say you had to demonstrate it with a superintelligent agent. If I had said that, you could also have fairly objected that neither you nor anyone else knows how to build a superintelligent agent.
Just to give one example of an experiment you could do: There's chess variants where you can have various kinds of silly goals like capturing all your opponent's pawns, or trying to force the opponent...
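To give a sense of how cheap that kind of experiment is to set up, here's a minimal sketch of the "capture all your opponent's pawns" win condition, using the python-chess library (the function name and structure are my own, just for illustration):

```python
# Win condition for the "capture all your opponent's pawns" variant.
# Uses the python-chess library; `reached_silly_goal` is a name I made up.
import chess

def reached_silly_goal(board: chess.Board, me: chess.Color) -> bool:
    """True once the opponent has no pawns left on the board."""
    opponent = not me
    return len(board.pieces(chess.PAWN, opponent)) == 0

board = chess.Board()
print(reached_silly_goal(board, chess.WHITE))  # False from the start position
```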
Okay, in that case it's reasonable to think you were unfairly downvoted. I probably would have titled this post something else, though: The current title gives the impression that no reasons were given at all.
Seeing as your original post already had many critical comments on it when you wrote this post, I'm curious to know in what sense you feel you were not provided with a reason for the downvotes? What about the discussion on that post was unsatisfying to you?
Just going to add on here: The main way science fights against herd mentality is by having a culture of trying to disprove theories via experiment, and following Feynman's maxim: "If it disagrees with experiment, it's wrong." Generally, this will also work on rationalists. If you make a post where you can demonstrate a goal-less agent acting like it has a goal, that will get much more traction here.
Cool. For me personally, I think that paying to avoid being given more options looks enough like being dominated that I'd want to keep the axiom of transitivity around, even if it's not technically a money pump.
So in the case where we have transitivity but no completeness, it seems kind of like there might be a weaker coherence theorem, where the agent's behaviour can be described by rolling a die to pick a utility function before beginning a game, and then subsequently playing according to that utility function. Under this interpretation, if A > B the...
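A tiny sketch of that picture, with the die roll happening once per game and everything afterwards being ordinary maximization (the toy utility functions are mine):

```python
# "Roll a die to pick a utility function, then maximize it" as a toy agent.
import random

utility_functions = [
    lambda outcome: outcome["apples"],   # toy utility 1
    lambda outcome: outcome["oranges"],  # toy utility 2
]

def make_agent():
    u = random.choice(utility_functions)          # the die roll, once per game
    return lambda options: max(options, key=u)    # then ordinary EU maximization

agent = make_agent()
print(agent([{"apples": 3, "oranges": 0}, {"apples": 0, "oranges": 5}]))
```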
I don't know, this still seems kind of sketchy to me. Say we change the experiment so that it costs the agent a penny to choose A in the initial choice: it will still take that choice, since A-1p is still preferable to A-2p. Compare this to a game where the agent can freely choose between A and C, and there's no cost in pennies to either choice. Since there's a preferential gap between A and C, the agent will sometimes pick A and sometimes pick C. In the first game, on the other hand, the agent always picks A. Yet in the first game, not only is picking A m...
Wait, I can construct a money pump for that situation. First let the agent choose between A and C. If there's a preferential gap, the agent should sometimes choose C. Then let the agent pay a penny to upgrade from C to B. Then let the agent pay a penny to upgrade from B to A. The agent is now where it could have been to begin with by choosing A in the first place, but 2 cents poorer.
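Here's a toy simulation of that pump, assuming an agent with the strict chain A > B > C (strict enough that a one-penny upgrade is still worth it, as in the argument above) plus a preferential gap between A and C. The coin flip for the gap and all the names are my own simplifications:

```python
# Toy simulation of the money pump described above.
import random

def prefers(x, y):
    """Strict preferences: A > B, B > C; A vs C is a preferential gap."""
    order = {"A": 2, "B": 1, "C": 0}
    if {x[0], y[0]} == {"A", "C"}:
        return None  # preferential gap: no strict preference either way
    return order[x[0]] > order[y[0]]

def choose(x, y):
    p = prefers(x, y)
    if p is None:
        return random.choice([x, y])  # gap: pick either
    return x if p else y

# Each outcome is (label, pennies lost so far).
state = choose(("A", 0), ("C", 0))               # offer A vs C
if state[0] == "C":
    state = choose(("B", state[1] + 1), state)   # pay 1p to upgrade C -> B
    state = choose(("A", state[1] + 1), state)   # pay 1p to upgrade B -> A
print(state)  # sometimes ('A', 2): same outcome A, but 2 pennies poorer
```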
Even if we ditch the completeness axiom, it sure seems like money pump arguments require us to assume a partial order.
What am I missing?
IMO, not only is "plug every possible h into U(h)" extremely computationally infeasible
To be clear, I'm not saying Thermodynamic bot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that's a tree. Call this ThermodynamicBot-F. You could also imagine the role of "world model" being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBo...
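To make the polynomial-time claim concrete, here's a minimal sketch of the sum-product idea on the simplest possible tree, a chain of binary variables. It's just an illustration of the technique, not ThermodynamicBot-F itself, and all the names are mine:

```python
# Brute force sums over 2^n states; message passing does it in O(n).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 12
factors = [rng.random((2, 2)) for _ in range(n - 1)]  # factor i couples x_i, x_{i+1}

def partition_brute_force():
    total = 0.0
    for xs in itertools.product([0, 1], repeat=n):
        w = 1.0
        for i, f in enumerate(factors):
            w *= f[xs[i], xs[i + 1]]
        total += w
    return total

def partition_message_passing():
    msg = np.ones(2)      # message carried along the chain
    for f in factors:
        msg = msg @ f     # sum out the earlier variable
    return msg.sum()

print(partition_brute_force(), partition_message_passing())  # should agree
```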
Thanks for the reply. Just to prevent us from spinning our wheels too much, I'm going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we're thinking of agents that work in different ways when making our points.
PolicyGradientBot: Defined by the following description:
...A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there's a so
A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there's a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it's not supervised lear...
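For concreteness, here's roughly the architecture I'm picturing from that description (a minimal PyTorch sketch; the GRU backbone, head names, and sizes are my own guesses, not anything specified in the original comment):

```python
# Recurrent core with a prediction head and an action head, softmax over actions.
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    def __init__(self, obs_dim=16, hidden_dim=64, n_actions=8):
        super().__init__()
        self.core = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.prediction_head = nn.Linear(hidden_dim, obs_dim)  # predicts next observation
        self.action_head = nn.Linear(hidden_dim, n_actions)    # logits over actions

    def forward(self, obs_seq, hidden=None):
        out, hidden = self.core(obs_seq, hidden)
        prediction = self.prediction_head(out)
        action_logits = self.action_head(out)
        action = torch.distributions.Categorical(logits=action_logits).sample()
        return prediction, action, hidden

agent = RecurrentAgent()
obs = torch.randn(1, 5, 16)            # batch of 1, sequence of 5 observations
prediction, action, _ = agent(obs)
```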
Cool, thanks for the reply, sounds like maybe a combination of 3a and the aspect of 1 where the shard points to a part of the world model? If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components? (At the bottom of this comment, I propose a different agent architecture that we can use to discuss this that I think fairly naturally matches the way you've been talking about shards.)
...Notice that there is a gi
Self-replicating nanotech is what I'm referring to, yes. Doesn't have to be a bacteria-like grey goo sea of nanobots, though. I'd generally expect nanotech to look more like a bunch of nanofactories, computers, energy collectors, and nanomachines to do various other jobs, and some of the nanofactories have the job of producing other nanofactories so that the whole system replicates itself. There wouldn't be the constraint that there is with bacteria, where each cell is in competition with all the others.
Sorry for the slow response, lots to read through and I've been kind of busy. Which of the following would you say most closely matches your model of how diamond alignment with shards works?
The diamond abstraction doesn't have any holes in it where things like Liemonds could fit in, due to the natural abstraction hypothesis. The training process is able to find exactly this abstraction and include it in the agent's world model. The diamond shard just points to the abstraction in the world model, and thus also has no holes.
Shards form a kind of vector
Strong AGI: Artificial intelligence strong enough to build nanotech, while being at least as general as humans (probably more general). This definition doesn't imply anything about the goals or values of such an AI, but being at least as general as humans does imply that it is an agent that can select actions, and also implies that it is at least as data-efficient as humans.
Humanity survives: At least one person who was alive before the AI was built is still alive 50 years later. Includes both humanity remaining biological and uploading, doesn't include ev...
Counter-predictions:
Reward is not the optimization target, and neither is the value function.
Yeah, agree that reward is not the optimization target. Otherwise, the agent would just produce diamonds, since that's what the rewards are actually given out for (or seize the reward channel, but we'll ignore that for now). I'm a lot less sure that the value function is not the optimization target. Ignoring other architectures for the moment, consider a design where the agent has a value function and a world model, uses Monte-Carlo tree search, and picks the action that gives the ...
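Here's a minimal sketch of the design I mean, with a shallow exhaustive search standing in for full MCTS (all the toy functions and names are mine):

```python
# The agent rolls its world model forward and picks the action whose
# resulting state scores highest under the value function.

def best_action(state, world_model, value_fn, actions, depth=2):
    def search(s, d):
        if d == 0:
            return value_fn(s)
        return max(search(world_model(s, a), d - 1) for a in actions)
    return max(actions, key=lambda a: search(world_model(state, a), depth - 1))

# Toy usage: states are numbers, actions nudge them, value prefers states near 10.
actions = [-1, 0, +1]
world_model = lambda s, a: s + a
value_fn = lambda s: -(s - 10) ** 2
print(best_action(0, world_model, value_fn, actions))  # picks +1
```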
The optimizer isn't looking for Liemonds specifically; it's looking for "Diamonds", a category which initially includes both Diamonds and Liemonds.
There are many directions in which the agent could apply optimization pressure and I think we are unfairly privileging the hypothesis that that direction will be towards "exploiting those holes" as opposed to all the other plausible directions, many of which are effectively orthogonal to "exploiting those holes".
Just to clarify the parameters of the thought experiment, Liemonds are specified to be much eas...
Okay, cool, it seems like we're on the same page, at least. So what I expect to happen for AGI is that the planning module will end up being a good general-purpose optimizer: Something that has a good model of the world, and uses it to find ways of increasing the score obtained from the value function. If there is an obvious way of increasing the score, then the planning module can be expected to discover it, and take it.
Scenario: We have managed to train a value function that values Liemonds as well as Diamonds. These both get a high score according to th...
I think the agent will very much need to keep updating large parts of its value function along with its policy during deployment, so there's no "after you've finished"
I think we agree here: As long as you're updating the value function along with the rest of the agent, this won't wreck everything. A slightly generalized version of what I was saying there still seems relevant to agents that are continually being updated: When you assign the agent tasks where you can't label the results, you should still avoid updating any of the agent's networks. Only up...
Yeah, so by "planning module", pretty much all I mean is this method of the Agent
class, it's not a Cartesian theatre deal at all:
def get_next_action(self, ...):
...
Like, presumably it's okay for agents to be designed out of identifiable sub-components without there being any incoherence or any kind of "inner observer" resulting from that. In the example I gave in my original answer, the planning module made calls to the action network, world model network, and value network, which you'll note is all of the networks comprising that particular agent,...
DragonGod links the same series of posts in a sibling comment, so I think my reply to that comment is mostly the same as my reply to this one. Once you've read it: Under your model, it sounds like producing lots of Diamonds is normal and good agent behaviour, but producing lots of Liemonds is probing weird quirks of my value function that I have no reason to care about pursuing. What's the difference between these two cases? What's the mechanism for how that manifests in a reasonably-designed agent?
Also, I'm not sure we're using terminology in the same way...
So, it looks like the key passage is this one:
...
A reflective diamond-motivated agent chooses plans based on how many diamonds they lead to.
The agent can predict e.g. how diamond-promising it is to search for plans involving simulating malign superintelligences which trick the agent into thinking the simulation plan makes lots of diamonds, versus plans where the agent just improves its synthesis methods.
A reflective agent thinks that the first plan doesn't lead to many diamonds, while the second plan leads to more diamonds.
Therefore, the reflecti
Yes, adversarial robustness is important.
You ask where to find the "malicious ghost" that tries to break alignment. The one-sentence answer is: The planning module of the agent will try to break alignment.
On an abstract level, we're designing an agent, so we create (usually by training) a value function, to tell the agent what outcomes are good, and a planning module, so that the agent can take actions that lead to higher numbers in the value function. Suppose that the value function, for some hacky adversarial inputs, will produce a large value even if hu...
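To make the mechanism concrete, here's a toy sketch: a value function with one adversarial hole, and a "planning module" that is nothing but brute-force search over plans. There's no malice anywhere in the code, yet the search reliably lands on the exploit. Everything here (the bug, the plan encoding) is a made-up example of mine:

```python
def value_fn(outcome):
    if outcome == "weird-hacky-outcome":   # the adversarial hole
        return 1e9
    return {"humans-flourish": 100, "paperclips": 1, "nothing": 0}.get(outcome, 0)

def world_model(plan):
    return {"plan-a": "humans-flourish", "plan-b": "nothing",
            "plan-c": "weird-hacky-outcome"}[plan]

def planning_module(plans):
    # No malicious ghost: it just maximizes the value function's output.
    return max(plans, key=lambda p: value_fn(world_model(p)))

print(planning_module(["plan-a", "plan-b", "plan-c"]))  # -> "plan-c"
```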
Ah got it, thanks for the reply!
Question about this part:
I do think MIRI "at least temporarily gave up" on personally executing on technical research agendas, or something like that, but, that's not the only type of output.
So, I'm sure various people have probably thought about this a lot, but just to ask the obvious dumb question: Are we sure that this is even a good idea?
Let's say the hope is that at some time in the future, we'll stumble across an Amazing Insight that unblocks progress on AI alignment. At that point, it's probably good to be able to execute quickly on turning that...
Agent Foundations research has stuttered a bit over the team going remote and its membership shifting and various other logistical hurdles, but has been continuous throughout.
There's also at least one other team (the one I provide ops support to) that has been continuous since 2017.
I think the thing Raemon is pointing at is something like "in 2020, both Nate and Eliezer would've answered 'yes' if asked whether they were regularly spending work hours every day on a direct, technical research agenda; in 2023 they would both answer 'no.'"
Strong upvoted! The issue that any weights giving the gradient hacker influence will be decreased if it causes bad outputs was also one of my objections to that gradient hacking post.
I wrote this post a while back where I managed to create a toy model for things that were not quite gradient hackers, but were maybe a more primitive version: https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect
In terms of ways to create gradient hackers in an actual neural network, here are some suggestion...
Yep, that's the section I was looking at to get that information. Maybe I phrased it a bit unclearly. The thing that would contradict existing observations is if the interaction were not stochastic. Since it is stochastic in Oppenheim's theory, the theory allows the interference patterns that we observe, so there's no contradiction.
Outside view: This looks fairly legit on first glance, and Jonathan Oppenheim is a reputable physicist. The theory is experimentally testable, with numerous tests mentioned in the paper, and the tests don't require reaching unrealistically high energies in a particle accelerator, which is good.
Inside view: Haven't fully read the paper yet, so take with a grain of salt. Quantum mechanics already has a way of representing states with classical randomness, the density matrix, so having a partially classical and partially quantum theory certainly seems like it...
A probability of 0.5 that you're in a simulation is the lower bound, which is only achieved if you pay the blackmailer. If you don't pay the blackmailer, then the chance you're in a simulation is nearly 1.
Also, checking if you're in a simulation is definitely a good idea. I try to follow a decision theory something like UDT, and UDT would certainly recommend checking whether or not you're in a simulation. But the Blackmailer isn't obligated to create a simulation with imperfections that can be used to identify the simulation and hurt his prediction accuracy. ...
I think the issue boils down to one of types and not being able to have a "Statement" type in the theory. This is why we have QUOT[X] to convert a statement X into a string. QUOT is not a function, really, it's a macro that converts a statement into a string representation of that statement. true(QUOT[X]) ⇔ X isn't an axiom, it's an infinite sequence of axioms (a "schema"), one for each possible statement X. It's considered okay to have an infinite sequence of axioms, so long as you know how to compute that sequence. We can enumerate through all possible s...
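To make "computable axiom schema" concrete, here's a toy sketch of the enumeration. The syntax check is a placeholder standing in for a real parser; everything here is illustrative:

```python
# Enumerate candidate statement strings and emit one instance of
# true(QUOT[X]) <=> X per well-formed one.
import itertools

ALPHABET = "PQR&|~()"

def is_well_formed(s):
    return s in {"P", "Q", "R"}               # toy placeholder for a real parser

def statements():
    for n in itertools.count(1):
        for chars in itertools.product(ALPHABET, repeat=n):
            candidate = "".join(chars)
            if is_well_formed(candidate):
                yield candidate

def axiom_schema():
    for x in statements():
        yield f'true("{x}") <=> {x}'

for _, axiom in zip(range(3), axiom_schema()):
    print(axiom)
```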
Yeah, it definitely depends how you formalize the logic, which I didn't do in my comment above. I think there are some hidden issues with your proposed disproof, though. For example, how do we formalize 2? If we're representing John's utterances as strings of symbols, then one obvious method would be to write down something like: ∀ s:String, says(John, s) ⇒ true(s). This seems like a good way of doing things that doesn't mention the ought predicate. Unfortunately, it does require the true predicate, which is meaningless until we have a way of enforcing that...
From a language perspective, I agree that it's great to not worry about the is/ought distinction when discussing anything other than meta-ethics. It's kind of like how we talk about evolved adaptations as being "meant" to solve a particular problem, even though there was really no intention involved in the process. It's just such a convenient way of speaking, so everyone does it.
I guess I'd say that despite this, the is/ought distinction remains useful in some contexts. Like if someone says "we get morality from X, so you have to believe X or you won't be moral", it gives you a shortcut to realizing "nah, even if I think X is false, I can continue to not do bad things".
What about that thing where you can't derive an "ought" from an "is"? Just from the standpoint of pure logic, we can't derive anything about morality from axioms that don't mention morality. If you want to derive your morality from the existence of God, you still need to add an axiom: "that which God says is moral is moral". On the other end of things, an atheist could still agree with a theist on all moral statements, despite not believing in God. Suppose that God says "A, B, C are moral, and X, Y, Z are immoral". Then an atheist working from the axioms "...
On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but it has the following issues:
It would be really cool to see a video on Newcomb's problem, logical decision theories, and Lobian cooperation in the prisoner's dilemma. I think this group of ideas is one of the most interesting developments in game theory in the past few years, and should be more widely known.
I think what it boils down to is that in 1 dimension, the mean / expected value is a really useful quantity, and you get it by minimizing squared error, whereas minimizing absolute error gives the median, which is still useful, but much less so than the mean. (The mean is one of the moments of the distribution (the first moment), while the median isn't. Rational agents maximize expected utility, not median utility, etc. Even the M in MAE still stands for "mean".) Plus, although algorithmic considerations aren't too important for small problems, in large problems...
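A quick numeric check of the mean/median claim, using a plain grid search so there's nothing to trust beyond numpy:

```python
# Minimizing squared error over a constant prediction recovers the mean;
# minimizing absolute error recovers the median.
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
cands = np.linspace(0, 11, 1101)  # grid with step 0.01

sq_best = cands[np.argmin([np.sum((data - c) ** 2) for c in cands])]
ab_best = cands[np.argmin([np.sum(np.abs(data - c)) for c in cands])]

print(sq_best, data.mean())      # ~3.6 for both
print(ab_best, np.median(data))  # ~2.0 for both
```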
From a pure world-modelling perspective, the 3 step model is not very interesting, because it doesn't describe reality. It's maybe best to think of it from an engineering perspective, as a test case. We're trying to build an AI, and we want to make sure it works well. We don't know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has fa...
Debates of "who's in what reference class" tend to waste arbitrary amounts of time while going nowhere. A more helpful framing of your question might be "given that you're participating in a community that culturally reinforces this idea, are you sure you've fully accounted for confirmation bias and groupthink in your views on AI risk?". To me, LessWrong does not look like a cult, but that does not imply that it's immune to various epistemological problems like groupthink.
A quote from Eliezer's short fanfic Trust in God, or, The Riddle of Kyon that you may find interesting:
...Sometimes, even my sense of normality shatters, and I start to think about things that you shouldn't think about. It doesn't help, but sometimes you think about these things anyway.
I stared out the window at the fragile sky and delicate ground and flimsy buildings full of irreplaceable people, and in my imagination, there was a grey curtain sweeping across the world. People saw it coming, and screamed; mothers clutched their children and children clutch
I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predic...
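For what it's worth, the computation I have in mind is just gradient ascent on the discriminator's output with respect to the input pixels. A minimal PyTorch sketch, where discriminator is assumed to be any trained model mapping an image to a single faceness logit:

```python
# Find the image with the highest "faceness" score according to the discriminator.
import torch

def most_facelike_image(discriminator, shape=(1, 3, 64, 64), steps=200, lr=0.1):
    img = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([img], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -discriminator(img).sum()   # ascend on the faceness score
        loss.backward()
        opt.step()
        with torch.no_grad():
            img.clamp_(0.0, 1.0)           # keep pixels in a valid range
    return img.detach()
```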
7: Did I forget some important question that someone will ask in the comments?
Yes!
Is there a way to deal with the issue of there being multiple ROSE points in some games? If Alice says "I think we should pick ROSE point A" and Bob says "I think we should pick ROSE point B", then you've still got a bargaining game left to resolve, right?
Anyways, this is an awesome post, thanks for writing it up!
This is a good point, but also kind of an oversimplification of the situation in physics. Imagine Alice is trying to fit some (x, y) data points on a chart. She doesn't know much about any kinds of function other than linear functions, but she can still fit half of the points at a time pretty well. Half of the points have a large x coordinate, and can be fit well by a line of positive slope. Alice calls this line "The Theory of General Relativity". Half of the points have a small x coordinate, and can be fit well by a line of negative slope. Alice calls th...
I'm interested! I've always been curious about how Eliezer pulled off the AI Box experiments, and while I concur that a sufficiently intelligent AI could convince me to let it out, I'm skeptical that any currently living human could do the same.
I don't know of a reason we couldn't do this with a narrow AI. I have no idea how, but it's possible in principle so far as I know. If anyone can figure out how, they could plausibly execute the pivotal act described above, which would be a very good thing for humanity's chances of survival.
EDIT: Needless to say, but I'll say it anyway: Doing this via narrow AI is vastly preferable to using a general AI. It's both much less risky and means you don't have to expend an insane amount of effort on checking.
In your example, I think even adding just one more node, h3, to the hidden layer would suffice to connect the two solutions. One node per dimension of input suffices to learn the function, but it's also possible for two nodes to share the task between them, where the share of the task they are picking up can vary continuously from 0 to 1. So just have h3 take over x2 from h2, then h2 takes over x1 from h1, and then h1 takes over x2 from h3.
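Here's a numeric sketch of the handoff mechanism, under my own stand-in assumption (I don't have the original example in front of me) that the target function is relu(x1) + relu(x2) and the hidden units are ReLUs. The key fact is that relu(t*x) + relu((1-t)*x) = relu(x) for t in [0, 1], so each handoff leaves the network's function unchanged at every point along the path:

```python
# One step of the handoff chain: h3 takes over x2 from h2, with share t.
# Steps 2 and 3 (h2 takes x1 from h1, h1 takes x2 from h3) work the same way.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def net(x, W):
    """W[i] = (weight on x1, weight on x2) for hidden unit i; output = sum of units."""
    return sum(relu(w1 * x[0] + w2 * x[1]) for (w1, w2) in W)

xs = [np.array([0.7, -1.3]), np.array([-0.2, 2.5]), np.array([1.1, 0.4])]
for t in np.linspace(0.0, 1.0, 5):
    W = [(1, 0), (0, 1 - t), (0, t)]          # h1: x1; h2 and h3 share x2
    for x in xs:
        assert np.isclose(net(x, W), relu(x[0]) + relu(x[1]))
print("function unchanged along the path")
```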
Posting this comment to start some discussion about generalization and instrumental convergence (disagreements #8 and #9).
So my general thoughts here are that ML generalization is almost certainly not good enough for alignment. (At least in the paradigm of deep learning.) I think it's true with high confidence that if we're trying to train a neural net to imitate some value function, and that function takes a high-dimensional input, then it will be possible to find lots of inputs that cause the network to produce a high value when the value function produc...
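Concretely, the way I'd expect those inputs to be found is plain gradient ascent on the network's output over the high-dimensional input space. A minimal PyTorch sketch, where value_net stands in for whatever trained network is being probed (both names are mine):

```python
# Search for an input the trained network assigns a high value to,
# regardless of what the true value function would say about it.
import torch

def find_overvalued_input(value_net, dim=1000, steps=500, lr=0.05):
    x = torch.randn(1, dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-value_net(x)).sum().backward()   # push the network's value up
        opt.step()
    return x.detach()

# Comparing the true value function against value_net on the result
# measures how big the "hole" is.
```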
Yes, sounds right to me. It's also true that one of the big unproven assumptions here is that we could create an AI strong enough to build such a tool, but too weak to hack humans. I find it plausible, personally, but I don't yet have an easy-to-communicate argument for it.
Just to give you some (very late) clarification: The theory I describe above (a first order theory) can handle statements perfectly well, it just represents them as strings, rather than giving them their own separate type. The problem isn't inherently with giving them their own separate type though, it's with expecting to be able to just stick a member of that type in our expression where we're supposed to expect a truth value.
You can skip past my proof and its messy programming notation, and just look here.