# All of DaemonicSigil's Comments + Replies

Just to give you some (very late) clarification: The theory I describe above (a first order theory) can handle statements perfectly well, it just represents them as strings, rather than giving them their own separate type. The problem isn't inherently with giving them their own separate type though, it's with expecting to be able to just stick a member of that type in our expression where we're supposed to expect a truth value.

You can skip past my proof and its messy programming notation, and just look here.

If I can demonstrate a goal-less agent acting like it has a goal it is already too late. We need to recognize this theoretically and stop it from happening.

I didn't say you had to demonstrate it with a superintelligent agent. If I had said that, you could also have fairly objected that neither you nor anyone else knows how to build a superintelligent agent.

Just to give one example of an experiment you could do: There are chess variants where you can have various kinds of silly goals like capturing all your opponent's pawns, or trying to force the opponent...

1 · Donatas Lučiūnas · 1d
Thanks, sounds reasonable. But I think I could find irrationality in your opinion if we dug deeper into the same idea mentioned here [https://www.lesswrong.com/posts/dPCpHZmGzc9abvAdi/orthogonality-thesis-is-wrong?commentId=SGDiyqPgwLDBjfzqA]. As mentioned in Pascal's Mugging [https://www.lesswrong.com/tag/pascal-s-mugging], I think that the Orthogonality Thesis is right only if an agent is certain that an outcome with infinite utility does not exist. And I argue that an agent cannot be certain of that. Do you agree? I created a separate post [https://www.lesswrong.com/posts/rRCywmZoEddgEWu5X/rational-ai-is-uncontrollable-alignment-is-impossible] for this; we can continue there.

Okay, in that case it's reasonable to think you were unfairly downvoted. I probably would have titled this post something else, though: The current title gives the impression that no reasons were given at all.

2 · Donatas Lučiūnas · 1d
Makes sense, thanks, I updated the question.

Seeing as your original post already had many critical comments on it when you wrote this post, I'm curious to know in what sense you feel you were not provided with a reason for the downvotes? What about the discussion on that post was unsatisfying to you?

1 · Donatas Lučiūnas · 1d
There is only one person that went deeper and the discussion is ongoing; you can find my last comment here [https://www.lesswrong.com/posts/dPCpHZmGzc9abvAdi/orthogonality-thesis-is-wrong?commentId=SGDiyqPgwLDBjfzqA#Lha9rBfpEZBRd5uuy]. So basically all the people who downvoted did it without providing good arguments. I agree that many people think that their arguments are good, but that's exactly the problem I want to address: 2 + 2 is not 5 even if many people think so.

Just going to add on here: The main way science fights against herd mentality is by having a culture of trying to disprove theories via experiment, and following Feynman's maxim: "If it disagrees with experiment, it's wrong." Generally, this will also work on rationalists. If you make a post where you can demonstrate a goal-less agent acting like it has a goal, that will get much more traction here.

1 · Donatas Lučiūnas · 1d
If I can demonstrate a goal-less agent acting like it has a goal it is already too late. We need to recognize this theoretically and stop it from happening. I try to prove it using logic, but not many people are really good at it. And people that are good at it don't pay attention to downvoted posts. How can I overcome that?

Cool. For me personally, I think that paying to avoid being given more options looks enough like being dominated that I'd want to keep the axiom of transitivity around, even if it's not technically a money pump.

So in the case where we have transitivity but no completeness, it seems kind of like there might be a weaker coherence theorem, where the agent's behaviour can be described by rolling a die to pick a utility function before beginning a game, and then subsequently playing according to that utility function. Under this interpretation, if A > B the...

I don't know, this still seems kind of sketchy to me. Say we change the experiment so that it costs the agent a penny to choose A in the initial choice: it will still take that choice, since A-1p is still preferable to A-2p. Compare this to a game where the agent can freely choose between A and C, and there's no cost in pennies to either choice. Since there's a preferential gap between A and C, the agent will sometimes pick A and sometimes pick C. In the first game, on the other hand, the agent always picks A. Yet in the first game, not only is picking A m...

1 · EJT · 5d
Nice! This is a cool case. The behaviour does indeed seem weird. I'm inclined to call it irrational. But the agent isn't pursuing a dominated strategy: in neither game does the agent settle on an option that they strictly disprefer to some other available option. This discussion is interesting and I'm happy to keep having it, but perhaps it's worth saying (if not for your sake then for other readers) that this is a side-thread. The main point of the post is that there are no money-pumps for Completeness. I think that there are probably no money-pumps for Transitivity either, but it's the claim about Completeness that I really want to defend.

Wait, I can construct a money pump for that situation. First let the agent choose between A and C. If there's a preferential gap, the agent should sometimes choose C. Then let the agent pay a penny to upgrade from C to B. Then let the agent pay a penny to upgrade from B to A. The agent is now where it could have been to begin with by choosing A in the first place, but 2 cents poorer.

Even if we ditch the completeness axiom, it sure seems like money pump arguments require us to assume a partial order.

What am I missing?
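
A minimal sketch of the trade sequence described above, for an agent that takes each offer as it comes without looking ahead. The preference pairs are read off the comment and are assumptions for illustration: the agent strictly prefers B to C and A to B even after paying a penny per upgrade, and has no strict preference between A and C. The names `STRICT`, `prefers` and `run_myopic` are hypothetical.

```python
# Assumed strict preferences, written as (better, worse) pairs.
# There is deliberately no pair relating A and C: that is the preferential gap.
STRICT = {("B", "C"), ("A", "B")}


def prefers(x, y):
    """True if x is strictly preferred to y under the assumed preferences."""
    return (x, y) in STRICT


def run_myopic(first_choice):
    """Play the game without lookahead, starting from 'A' or 'C'.

    Returns (final holding, pennies paid).
    """
    holding, pennies = first_choice, 0
    # Offers arrive in this order: C -> B for a penny, then B -> A for a penny.
    for worse, better in [("C", "B"), ("B", "A")]:
        if holding == worse and prefers(better, worse):
            holding, pennies = better, pennies + 1
    return holding, pennies


for start in ("A", "C"):
    print(start, "->", run_myopic(start))
```

Under these assumptions, an agent that happens to start with C ends up holding A after paying two pennies, while an agent that starts with A holds A and pays nothing.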

1 · EJT · 6d
So this won't work if the agent knows in advance what trades they'll be offered and is capable of reasoning by backward induction. In that case, the agent will reason that they'd choose A-2p over B-1p if they reached that node, and would choose B-1p over C if they reached that node. So (they will reason), the choice between A and C is actually a choice between A and A-2p, and so they will reliably choose A. And plausibly we should make assumptions like 'the agent knows in advance what trades they will be offered' and 'the agent is capable of backward induction' if we're arguing about whether agents are rationally required to conform their preferences to the VNM axioms.  (If the agent doesn’t know in advance what trades they will be offered or is incapable of backward induction, then their pursuit of a dominated strategy need not indicate any defect in their preferences. Their pursuit of a dominated strategy can instead be blamed on their lack of knowledge and/or reasoning ability.) That said, I've recently become less convinced that 'knowing trades in advance' is a reasonable assumption in the context of predicting the behaviour of advanced artificial agents. And your money-pump seems to work if we assume that the agent doesn't know what trades they will be offered in advance. So maybe we do in fact have reason to expect that advanced artificial agents will have transitive preferences. (I say 'maybe' because there are some other relevant considerations pushing the other way, discussed in a paper-in-progress by Adam Bales.)
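
The backward-induction reasoning in this reply can be sketched as follows. This is only an illustration under stated assumptions: the strict preference pairs (A-2p over B-1p, B-1p over C, A over A-2p) are read off the reply, the agent is assumed to know both later offers in advance, and the function names are hypothetical.

```python
# Assumed strict preferences as (better, worse) pairs; "-1p" means one penny poorer.
# No pair relates A and C directly: that is the preferential gap.
STRICT = {("A-2p", "B-1p"), ("B-1p", "C"), ("A", "A-2p")}


def prefers(x, y):
    """True if x is strictly preferred to y under the assumed preferences."""
    return (x, y) in STRICT


def predicted_outcome_after_C():
    """Outcome the agent expects if it starts with C and later accepts
    every trade it strictly prefers at the node where it is offered."""
    holding = "C"
    for offer in ["B-1p", "A-2p"]:
        if prefers(offer, holding):
            holding = offer
    return holding


def permissible_first_choices():
    """First choices whose predicted final outcome is not strictly
    dispreferred to the predicted final outcome of another first choice."""
    final = {"A": "A", "C": predicted_outcome_after_C()}
    return [c for c in final
            if not any(prefers(final[other], final[c]) for other in final)]


print(predicted_outcome_after_C())
print(permissible_first_choices())
```

With the later trades known in advance, choosing C is predicted to end at A-2p, so the first choice reduces to A versus A-2p and only A remains permissible.
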
1 · quetzal_rainbow · 6d
It's not a money pump, because a money pump implies an infinite cycle of profit. If your losses are bounded, you are fine.

IMO, not only is "plug every possible h into U(h)" extremely computationally infeasible

To be clear, I'm not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that's a tree. Call this ThermodynamicBot-F. You could also imagine the role of "world model" being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBo...
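
To make the polynomial-time claim concrete, here is a hedged sketch for the simplest tree, a chain. It assumes histories are sampled with P(h) proportional to exp(U(h)/T), and that U(h) splits into per-step terms plus terms linking neighbouring steps; the names `unary`, `pairwise` and the toy numbers are hypothetical and are not taken from the original explanation. Backward messages give the partition function in O(n·k²) time, and the brute-force sum over all kⁿ histories is included only as a check.

```python
import itertools
import math
import random


def backward_messages(unary, pairwise, T=1.0):
    """msg[i][s] = sum over h_{i+1..n-1} of exp(U_suffix / T) given h_i = s,
    where U_suffix collects every utility term to the right of position i."""
    n, k = len(unary), len(unary[0])
    msg = [[1.0] * k for _ in range(n)]
    for i in range(n - 2, -1, -1):
        for s in range(k):
            msg[i][s] = sum(
                math.exp((pairwise[s][t] + unary[i + 1][t]) / T) * msg[i + 1][t]
                for t in range(k)
            )
    return msg


def partition_function(unary, pairwise, T=1.0):
    """Z = sum over all histories of exp(U(h) / T), computed from the messages."""
    msg = backward_messages(unary, pairwise, T)
    return sum(math.exp(unary[0][s] / T) * msg[0][s] for s in range(len(unary[0])))


def sample_history(unary, pairwise, T=1.0, rng=random):
    """Draw one history exactly from P(h) proportional to exp(U(h) / T)."""
    n, k = len(unary), len(unary[0])
    msg = backward_messages(unary, pairwise, T)
    states = list(range(k))
    weights = [math.exp(unary[0][s] / T) * msg[0][s] for s in states]
    h = rng.choices(states, weights=weights)
    for i in range(1, n):
        prev = h[-1]
        weights = [math.exp((pairwise[prev][t] + unary[i][t]) / T) * msg[i][t]
                   for t in states]
        h += rng.choices(states, weights=weights)
    return tuple(h)


def brute_force_partition(unary, pairwise, T=1.0):
    """The slow exponential way: enumerate every history."""
    n, k = len(unary), len(unary[0])
    total = 0.0
    for h in itertools.product(range(k), repeat=n):
        U = sum(unary[i][h[i]] for i in range(n))
        U += sum(pairwise[h[i]][h[i + 1]] for i in range(n - 1))
        total += math.exp(U / T)
    return total


# Toy utilities (hypothetical): 4 time steps, 2 states per step.
unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2], [1.0, 0.0]]
pairwise = [[0.3, 0.0], [0.0, 0.3]]
print(partition_function(unary, pairwise), brute_force_partition(unary, pairwise))
print(sample_history(unary, pairwise, rng=random.Random(0)))
```

Lowering `T` concentrates the samples on the highest-utility histories, which is the sense in which this kind of sampling is a softened argmax.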

Thanks for the reply. Just to prevent us from spinning our wheels too much, I'm going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we're thinking of agents that work in different ways when making our points.

PolicyGradientBot: Defined by the following description:

A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there's a so

...
1 · cfoster0 · 13d
Sounds good. COMMENTS ON THERMODYNAMICBOT If we assume that the agent is making decisions by (approximately) plugging in every possible h into U(h) and picking based on (the partition function derived from) that, then of course you need U(h) to be adversarially robust! I disagree with that as a model of how planning works or should work. IMO, not only is "plug every possible h into U(h)" extremely computationally infeasible, but even if it were feasible it would be a foreseeably-broken (because fragile) planning strategy. Quote from a comment of TurnTrout about argmax planning, though I think it also applies to ThermodynamicBot, since that just does a softened version of argmax planning (converging to argmax planning as T->0): I think the sorts of planning methods that try to approximate in the real world the behavior of "think about all possible plans and pick a good one" are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don't expect us to build competent agents that use them, so I don't worry about them or their attendant need for adversarial robustness. Right, I wasn't thinking of it as actually a GAN, just giving an analogy where similar causal patterns are in play, to make my point clearer. But yeah, if we wanted to actually use a GAN, your suggestions sound reasonable. COMMENTS ON POLICYGRADIENTBOT & GENERAL POSITION I guess, but I'm confused why we're talking about competitiveness all of a sudden. I mean, variants on policy gradient algorithms (PPO, Actor-Critic, etc.) do some impressive things (at least to the extent any RL algorithms currently do impressive things). And I can imagine more sophisticated versions of even plain policy gradient that would do impressive things, if the action space includes mental actions like "sample a rollout from my world model". But that's neither here nor there IMO. In the previous comment, I tried to be clear that it makes me ~no difference wher

A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there's a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head.
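
For concreteness, the quoted design can be sketched in a few lines. This is a toy under stated assumptions, not the architecture anyone in the thread has in mind in detail: a single tanh recurrent layer stands in for the "big recurrent net", the weights are random, and the class and attribute names (`TwoHeadedAgent`, `W_pred`, `W_act`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class TwoHeadedAgent:
    """Recurrent core with a prediction head and an action head."""

    def __init__(self, obs_dim, hidden_dim, n_actions, seed=0):
        self.rng = random.Random(seed)

        def init(rows, cols):
            return [[self.rng.gauss(0.0, 0.1) for _ in range(cols)]
                    for _ in range(rows)]

        self.W_in = init(hidden_dim, obs_dim)      # observation -> hidden
        self.W_rec = init(hidden_dim, hidden_dim)  # hidden -> hidden (recurrence)
        self.W_pred = init(obs_dim, hidden_dim)    # prediction head
        self.W_act = init(n_actions, hidden_dim)   # action head (logits)
        self.hidden = [0.0] * hidden_dim

    def step(self, obs):
        """Consume one observation; return (prediction, sampled action, action probs)."""
        pre = [a + b for a, b in zip(matvec(self.W_in, obs),
                                     matvec(self.W_rec, self.hidden))]
        self.hidden = [math.tanh(v) for v in pre]
        prediction = matvec(self.W_pred, self.hidden)
        probs = softmax(matvec(self.W_act, self.hidden))
        action = self.rng.choices(list(range(len(probs))), weights=probs)[0]
        return prediction, action, probs


agent = TwoHeadedAgent(obs_dim=3, hidden_dim=4, n_actions=2)
prediction, action, probs = agent.step([1.0, 0.0, -1.0])
print(action, probs)
```

In this picture, whatever circuits drive `W_act`'s logits up or down for a given hidden state are playing the role the comment assigns to shards; how those weights get trained is the question raised next.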

Thanks for describing this. Technical question about this design: How are you getting the gradients that feed backwards into the action head? I assume it's not supervised lear...

1 · cfoster0 · 21d
Could be from rewards or other "external" feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else. The base case is probably just plain ol' rewards that get backpropagated through the action head via policy gradients. Sorry for being unclear, I think you're talking about a different dimension of embeddedness than what I was pointing at. I was talking about the issue of logical uncertainty: that the agent needs to actually run computation in order to figure out certain things. The agent can't magically sample from P(h) proportional to exp(U(h)), because it needs the exp(U(h')) of all the other histories first in order to weigh the distribution that way, which requires having already sampled h' and having already calculated U(h'). But we are talking about how it samples a history h in the first place! The "At best" comment was proposing an alternative that might work, where the agent samples from a prior that's been tuned based on U(h). Notice, though, that "our sampling is biased towards certain histories over others because they resulted in higher utility" does not imply "if a history would result in higher utility, then our sampling will bias towards it". Consider a parallel situation: sampling images and picking one that gets the highest score on a non-robust face classifier. If we were able to sample from the distribution of images proportional to their (exp) face classifier scores, then we would need to worry a lot about picking an image that's an adversarial example to our face classifier, because those can have absurdly high scores. But instead we need to sample images from the prior of a generative model like FFHQ StyleGAN2-ADA or DDPM, and score those images. A generative model like that will tend strongly to convert whatever input entropy you give it into a natural-looking image, so we can sample & filter from it a ton without worrying much about adversarial robustness. Even if you sample 10K images and pick the 5

Cool, thanks for the reply, sounds like maybe a combination of 3a and the aspect of 1 where the shard points to a part of the world model? If no part of the agent is having its weights tuned to choose plans that make a shard happy, where would you say a shard mostly lives in an agent? World model? Somewhere else? Spread across multiple components? (At the bottom of this comment, I propose a different agent architecture that we can use to discuss this that I think fairly naturally matches the way you've been talking about shards.)

Notice that there is a gi

...
1 · cfoster0 · 21d
I'd say the shards live in the policy, basically, though these are all leaky abstractions. A simple model would be an agent consisting of a big recurrent net (RNN or transformer variant) that takes in observations and outputs predictions through a prediction head and actions through an action head, where there's a softmax to sample the next action (whether an external action or an internal action). The shards would then be circuits that send outputs into the action head. I brought this up because I interpreted your previous comment as expressing skepticism that Whereas I think that it will be true for analogous reasons as the reasons that explain why no part of the agent is "trying to make itself believe it has a million bucks". I have a vague feeling that the "value function map = agent's true values" bit of this is part of the crux we're disagreeing about. Putting that aside, for this to happen, it has to be simultaneously true that the agent's world model knows about and thinks about counterfeit money in particular (or else it won't be able to construct viable plans that produce counterfeit money) while its value function does not know or think about counterfeit money in particular. It also has to be true that the agent tends to generate plans towards counterfeit money over plans towards real money, or else it'll pick a real money plan it generates before it has had a chance to entertain a counterfeit money plan. But during training, the plan generator was trained to generate plans that lead to real money. And the agent's world model / plan generator knows (or at least thinks) that those plans were towards real money, even if its value function doesn't know. This is because it takes different steps to acquire counterfeit money than to acquire real money. If the plan generator was optimized based on the training environment, and the agent was rewarded there for doing the sorts of things that lead to acquiring real money (which are different from the things th

Self replicating nanotech is what I'm referring to, yes. Doesn't have to be a bacteria-like grey goo sea of nanobots, though. I'd generally expect nanotech to look more like a bunch of nanofactories, computers, energy collectors, and nanomachines to do various other jobs, and some of the nanofactories have the job of producing other nanofactories so that the whole system replicates itself. There wouldn't be the constraint that there is with bacteria where each cell is in competition with all the others.

Sorry for the slow response, lots to read through and I've been kind of busy. Which of the following would you say most closely matches your model of how diamond alignment with shards works?

1. The diamond abstraction doesn't have any holes in it where things like Liemonds could fit in, due to the natural abstraction hypothesis. The training process is able to find exactly this abstraction and include it in the agent's world model. The diamond shard just points to the abstraction in the world model, and thus also has no holes.

2. Shards form a kind of vector

...
1 · cfoster0 · 23d
3 is the closest. I don't even know what it would mean for a shard to be "perfect". I have a concept of diamonds in my world model, and shards attached to that concept. That concept includes some central features like hardness, and some peripheral features like associations with engagement rings. That concept doesn't include everything about diamonds, though, and it probably includes some misconceptions and misgeneralizations. I could certainly be fooled in certain circumstances into accepting a fake diamond, for ex. if a seemingly-reputable jewelry store told me it was real. But this isn't an obstacle to me liking & acquiring diamonds generally, because my "imperfect" diamond concept is nonetheless still a pointer grounded in the real world, a pointer that has been optimized (by myself and by reality) to accurately-enough track [https://www.lesswrong.com/posts/WGEPBmErv8ufrq8Fc/teleosemantics] diamonds in the scenarios I've found myself in. That's good enough to hang a diamond-shard off of. Maybe that shard fires more for diamonds arranged in a circle than for diamonds arranged in a triangle. Is that an imperfection? I dunno, I think there are many different ways of wanting a thing, and I don't think we need perfect wanting, if that's even a thing. [https://www.lesswrong.com/posts/rauMEna2ddf26BqiE/alignment-allows-nonrobust-decision-influences-and-doesn-t] Notice that there is a giant difference between "try to get diamonds" and "try to get the diamond-shard to be happy", analogous to the difference between "try to make a million bucks" and "try to convince yourself you made a million bucks". If I wanted to generate a plan to convince myself I'd made a million bucks, my plan-generator could, but I don't want to, because that isn't a strategy I expect to help me get the things I want, like a million bucks. My shards shape what plans I execute, including plans about how I should do planning. The shard is the thing doing optimization in conjunction with the rest of

Strong AGI: Artificial intelligence strong enough to build nanotech, while being at least as general as humans (probably more general). This definition doesn't imply anything about the goals or values of such an AI, but being at least as general as humans does imply that it is an agent that can select actions, and also implies that it is at least as data-efficient as humans.

Humanity survives: At least one person who was alive before the AI was built is still alive 50 years later. Includes both humanity remaining biological and uploading, doesn't include ev...

2 · lsusr · 23d
Thanks. These seem like good definitions. They actually set the bar high for your prediction, which is respectable. I appreciate you taking this seriously. If you'll permit just a little bit more pedantic nitpicking, do you mind if I request a precise definition of nanotech? I assume you mean self-replicating nanobots (grey goo) because, technically, we already have nanotech. However, putting the bar at grey goo (potential, of course—the system doesn't have to actually make it for real) might be setting it above what you intended.

Counter-predictions:

• Humanity will still be around in 2030 (P = 90%)
• ... in 2040 (P = 70%)
• A Strong AGI will be built by 2123 (P = 75%)
• Conditional on this, no attempts at solving the alignment problem succeed (P = no idea / depends on our decisions)
• Conditional on this, humanity survives anyway, because the AI is aligned by default or because we survive for some other reason without solving alignment. (P = 10%)
2 · lsusr · 23d
What are your definitions for "Strong AGI", "the alignment problem succeed[s]" and "humanity survives"?

Reward is not the optimization target, and neither is the value function.

Yeah, agree that reward is not the optimization target. Otherwise, the agent would just produce diamonds, since that's what the rewards are actually given out for (or seize the reward channel, but we'll ignore that for now). I'm a lot less sure that the value function is not the optimization target. Ignoring other architectures for the moment, consider a design where the agent has a value function and a world model, uses Monte-Carlo tree search, and picks the action that gives the ...

1 · cfoster0 · 1mo
I'm fine with describing that design like that. Though I expect we'd need a policy or some other method of proposing actions for the world model/MCTS to evaluate, or else we haven't really specified the design of how the agent makes decisions. Hmm. I wasn't imagining that any particularly exotic design choices were needed for my statements to hold, since I've mostly been arguing against things being required. What robustness properties are you asking about? A shot at the diamond alignment problem [https://www.lesswrong.com/posts/k4AQqboXz8iE5TNXK/a-shot-at-the-diamond-alignment-problem] is probably a good place to start, if you're after a description of how the training process and internal cognition could work along a similar outline to what I was describing.

The optimizer isn't looking for Liemonds specifically; it's looking for "Diamonds", a category which initially includes both Diamonds and Liemonds.

There are many directions in which the agent could apply optimization pressure and I think we are unfairly privileging the hypothesis that that direction will be towards "exploiting those holes" as opposed to all the other plausible directions, many of which are effectively orthogonal to "exploiting those holes".

Just to clarify the parameters of the thought experiment, Liemonds are specified to be much eas...

8 · cfoster0 · 1mo
I get that. In addition, Liemonds and Diamonds are in reality different objects that require different processes to acquire, right? Like, even though Liemonds are easier to produce in large quantities if that's what you're going for, you won't automatically produce Liemonds on the route to producing Diamonds. If you're trying to produce Diamonds, you can end up accidentally producing other junk by failing at diamond manufacturing, but you won't accidentally produce Liemonds. So unless you are intentionally trying to produce Liemonds, say as an instrumental route to "produce the maximum possible Diamond score", you won't produce them. It sounds like the reason you think the agent will intentionally produce Liemonds is as an instrumental route to getting the maximum possible Diamond score. I agree that that would be a great way to produce such a score. But AFAICT getting the maximum possible Diamond score is not "what the agent wants" in general. Reward is not the optimization target, and neither is the value function. Agents use a value function, but the agent's goals =/= maximal scores from the value function. The value function aggregates information from the current agent state to forecast the reward signal. It's not (inherently) a goal, an intention, a desire, or an objective. The agent could use high value function scores as a target in planning if it wanted to, but it could use anything it wants as a target in planning, the value function isn't special in that regard. I expect that agents will use planning towards many different goals, and subgoals of those goals, and so on, with the highest level goals being about the concepts in the world model, not the literal outputs of the value function. I suspect you disagree with this and that this is a crux. No, I am modeling the agent as being quite intelligent, at least as intelligent as a human. I just think it deploys that intelligence in service of a different motivational structure than you do.

Okay, cool, it seems like we're on the same page, at least. So what I expect to happen for AGI is that the planning module will end up being a good general-purpose optimizer: Something that has a good model of the world, and uses it to find ways of increasing the score obtained from the value function. If there is an obvious way of increasing the score, then the planning module can be expected to discover it, and take it.

Scenario: We have managed to train a value function that values Liemonds as well as Diamonds. These both get a high score according to th...

1 · cfoster0 · 1mo
Looks like we're popping back up to an earlier thread of the conversation? Curious what your thoughts on the parent comment were, but I will address your latest comment here. :) I think the AGI will be able to do general-purpose optimization, and that this could be implemented via an internal general-purpose optimization subroutine/module it calls (though, again, I think there's a flavor of homunculus in this design that I dislike). I don't see any reason to think of this subroutine as something that itself cares about value function scores, it's just a generic function that will produce plans for any goal it gets asked to optimize towards. In my model, if there's a general-purpose optimization subroutine, it takes as an argument the (sub)goal that the agent is currently thinking of optimizing towards and the subroutine spits out some answer, possibly internally making calls to itself as it splits the problem into smaller components. In this model, it is false that the general-purpose optimization subroutine is freely trying to find ways to increase the value function scores. Reaching states with high value function scores is a side effect of the plans it outputs, but not the object-level goal of its planning. IF the agent were to set "maximize my nominal value function" as the (sub)goal that it inputs to this subroutine, THEN the subroutine will do what you described, and then the agent can decide what to do with those results. But I dispute that this is the default expectation we should have for how the agent will use the subroutine. Heck, you can do general-purpose optimization, and yet you don't necessarily do that thing. Instead you ask the subroutine to help with the object-level goals that you already know you want, like "plan a route from here to the bar" and "find food to satisfy my current hunger" and "come up with a strategy to get a promotion at work". The general-purpose optimization subroutine isn't pulling you around. In fact, sometimes you reject i

I think the agent will very much need to keep updating large parts of its value function along with its policy during deployment, so there's no "after you've finished"

I think we agree here: As long as you're updating the value function along with the rest of the agent, this won't wreck everything. A slightly generalized version of what I was saying there still seems relevant to agents that are continually being updated: When you assign the agent tasks where you can't label the results, you should still avoid updating any of the agent's networks. Only up...

1 · cfoster0 · 1mo
I understand what you mean but I still think it's incorrect[1]. I think "The agent itself will apply the adversarial pressure to exploit those holes" (emphasis mine) is the key mistake. There are many directions in which the agent could apply optimization pressure and I think we are unfairly privileging the hypothesis that that direction will be towards "exploiting those holes" as opposed to all the other plausible directions, many of which are effectively orthogonal to "exploiting those holes". I would agree with a version of your claim that said "could apply" but not with this one with "will apply". The mere fact that there exist possible inputs that would fall into the "holes" (from our perspective) in the agent's value function does not mean that the agent will or even wants to try to steer itself towards those inputs rather than towards the non-"hole" high-value inputs. Remember that the trained circuits in the agent are what actually implement the agent's decision-making, deciding what things to recognize and raise to attention, making choices about what things it will spend time thinking about, holding memories of plan-targets in mind; all based on past experiences & generalizations learned from them. Even though there is one or many nameless patterns of OOD "hole" input that would theoretically light up the agent's value function (to MAX_INT or some other much-higher-than-desired level), that pattern is not a feature/pattern that the agent has ever actually seen, so its cognitive terrain has never gotten differential updates that were downstream of it, so its cognition doesn't necessarily flow towards it. The circuits in the agent's mind aren't set up to automatically recognize or steer towards the state-action preconditions of that pattern, whereas they are set up to recognize and steer towards the state-action preconditions of prototypical rewarded input encountered during training. In my model, that is what happens when an agent robustly "wants" something

Yeah, so by "planning module", pretty much all I mean is this method of the Agent class, it's not a Cartesian theatre deal at all:

def get_next_action(self, ...):
    ...


Like, presumably it's okay for agents to be designed out of identifiable sub-components without there being any incoherence or any kind of "inner observer" resulting from that. In the example I gave in my original answer, the planning module made calls to the action network, world model network, and value network, which you'll note is all of the networks comprising that particular agent,...
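
As a hedged illustration of "planning module = one method on the Agent class", here is a sketch in which `get_next_action` calls an action network, a world model and a value network. It is a simplified stand-in, not the design from the original answer: it uses one-step lookahead instead of tree search, the three "networks" are plain callables, and the toy environment is hypothetical.

```python
from typing import Callable, Sequence


class Agent:
    """Agent built from three components; planning is just a method."""

    def __init__(
        self,
        action_network: Callable[[int], Sequence[int]],
        world_model: Callable[[int, int], int],
        value_network: Callable[[int], float],
    ):
        self.action_network = action_network  # proposes candidate actions
        self.world_model = world_model        # predicts the next state
        self.value_network = value_network    # scores a predicted state

    def get_next_action(self, state: int) -> int:
        """Pick the proposed action whose predicted next state scores highest.

        Assumption: one-step lookahead replaces the tree search a real
        planner would use.
        """
        candidates = self.action_network(state)
        return max(
            candidates,
            key=lambda a: self.value_network(self.world_model(state, a)),
        )


# Toy components (hypothetical): states are integers, the target is 3.
toy_agent = Agent(
    action_network=lambda s: [-1, 1],
    world_model=lambda s, a: s + a,
    value_network=lambda s: -abs(s - 3),
)
print(toy_agent.get_next_action(0))
```

Nothing in this structure involves an inner observer: the method is ordinary code that composes the three components, which is all that "planning module" is meant to denote here.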

1 · cfoster0 · 1mo
FWIW the kinds of agents I am imagining go beyond choosing their next actions, they also choose their next thoughts including thoughts-about-planning like "Call self.get_next_action(...) with this input". That is the mechanism by which the agent binds its entire chain of thought—not just the final action produced—to its desires, by using planning reflectively as a tool to achieve ends. No, that wasn't intended to be my point. I wasn't saying that I have an alignment solution, or saying that learning a correct-but-not-adversarially-robust value function and policy for Diamonds is something that we know how to do, or saying that doing so won't be hard. The claim I was pushing back against was that the problem is adversarially hard. I don't think you need a bunch of patches for it to be not-adversarially-hard, I think it is not-adversarially-hard by default. Ok on to the substance: Whoa whoa no I think the agent will very much need to keep updating large parts of its value function along with its policy during deployment, so there's no "after you've finished" (assuming you == the AI). That's a significant component of my threat model. If you think an AGI without this could be competitive, I am curious how. I don't really understand why this is the relevant scenario. The crux being discussed is whether the value function needs to be robust to adversarial distribution shifts, not whether it needs to be robust to ordinary non-adversarial distribution shifts. I think the relevant scenario for our thread would be an agent that correctly learned a value+policy function that picks out Diamonds in the training scenarios, and which learned a generally-correct concept of Diamonds, but where there are findable edge cases in its concept boundary such that it would count type-X Liemonds as Diamonds if presented with them. The question, then, is why would it be thinking about how to produce type-X Liemonds in particular? What reason would the agent have to pursue thoughts that

DragonGod links the same series of posts in a sibling comment, so I think my reply to that comment is mostly the same as my reply to this one. Once you've read it: Under your model, it sounds like producing lots of Diamonds is normal and good agent behaviour, but producing lots of Liemonds is probing weird quirks of my value function that I have no reason to care about pursuing. What's the difference between these two cases? What's the mechanism for how that manifests in a reasonably-designed agent?

Also, I'm not sure we're using terminology in the same way...

8cfoster01mo
We are assuming that the agent has already learned a correct-but-not-adversarially-robust value function for Diamonds. That means that it makes correct distinctions in ordinary circumstances to pick out plans actually leading to Diamonds, but won't correctly distinguish between plans that were deceptively constructed so they merely look like they'll produce Diamonds but actually produce Liemonds vs. plans that actually produce Diamonds. But given that, the agent has no particular reason to raise thoughts to consideration like "probe my value function to find weird quirks in it" in the first place, or to regard them as any more promising than "study geology to figure out better versions of Diamond-producing-causal-pathways", which is a thought that the existing circuits within the agent have an actual mechanistic reason to be raising, on the basis of past reinforcement around "what were the robust common factors in Diamonds-as-I-have-experienced-them" and "what general planning strategies actually helped me produce Diamonds better in the past" etc. and will in fact generally lead to Diamonds rather than Liemonds, because the two are presumably produced via different mechanisms [https://www.lesswrong.com/posts/JLyWP2Y9LAruR2gi9/can-we-efficiently-distinguish-different-mechanisms]. Inspecting its own value function, looking for flaws in it, is not generally an effective strategy for actually producing Diamonds (or whatever else is the veridical source of reinforcement) in the distribution of scenarios it encountered during the broad-but-not-adversarially-tuned distribution of training inputs. So the mechanistic difference between the scenario I'm suggesting and the one you are is that I think there will be lots of strong circuits that, downstream of the reinforcement events that produced the correct-but-not-adversarially-robust value function oriented towards Diamonds, will fire based on features that differentially pick out Diamonds (as well as, say, Liemonds) aga

So, it looks like the key passage is this one:

1. A reflective diamond-motivated agent chooses plans based on how many diamonds they lead to.

2. The agent can predict e.g. how diamond-promising it is to search for plans involving simulating malign superintelligences which trick the agent into thinking the simulation plan makes lots of diamonds, versus plans where the agent just improves its synthesis methods.

3. A reflective agent thinks that the first plan doesn't lead to many diamonds, while the second plan leads to more diamonds.

4. Therefore, the reflecti

...

You ask where to find the "malicious ghost" that tries to break alignment. The one-sentence answer is: The planning module of the agent will try to break alignment.

On an abstract level, we're designing an agent, so we create (usually by training) a value function, to tell the agent what outcomes are good, and a planning module, so that the agent can take actions that lead to higher numbers in the value function. Suppose that the value function, for some hacky adversarial inputs, will produce a large value even if hu...
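
To make the worry concrete, here is a minimal sketch, assuming a toy plan space of integers. The names `true_value`, `learned_value`, `QUIRK_PLAN` and the "familiar" range are all hypothetical and picked by hand; nothing here is a claim about how a real agent is built. It only shows that a planner that argmaxes over everything lands on the quirk, while a search restricted to ordinary plans does not.

```python
# Toy illustration only: the plan space, both value functions, and the
# location of the quirk are hypothetical and chosen by hand.

QUIRK_PLAN = 7919  # a single "hacky adversarial input" (assumed)


def true_value(plan: int) -> int:
    """What we actually care about: best at plan 500, worse further away."""
    return -abs(plan - 500)


def learned_value(plan: int) -> int:
    """Agrees with true_value everywhere except at the quirk, where it is far too high."""
    bonus = 10_000 if plan == QUIRK_PLAN else 0
    return true_value(plan) + bonus


def plan_by_argmax(candidates):
    """A planner that simply returns the candidate with the highest learned value."""
    return max(candidates, key=learned_value)


all_plans = range(10_000)        # exhaustive search over every plan
familiar_plans = range(400, 600)  # search restricted to "ordinary" plans

exhaustive_choice = plan_by_argmax(all_plans)
familiar_choice = plan_by_argmax(familiar_plans)

print(exhaustive_choice, true_value(exhaustive_choice))
print(familiar_choice, true_value(familiar_choice))
```

The exhaustive search picks the quirk plan even though its true value is very low; the restricted search picks plan 500. Which of these two search regimes a capable agent actually ends up in is exactly the point under dispute in this thread.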

6cfoster01mo
I disagree. The planning module is not an independent agent, it's a thing that the rest of the agent interacts with to do whatever the agent as a whole wants. If we have successfully aligned the agent as a whole enough that it has a value function that is correct under non-adversarial inputs but not adversarially-robust, and it is even minimally reflective (it makes decisions about how it engages in planning), then why would it be using the planning module in such a way that requires adversarial robustness? Think about how you construct plans. You aren't naively searching over the space of entire plans and piping them through your evaluation function, accepting whatever input argmaxes it regardless of the cognitive stacktrace that produced it. Instead you plan reflectively, building on planning-strategies that have worked well for you in the past, following chains of thought that you have some amount of prior confidence in (rather than just blindly accepting some random OOD input to your value function), intentionally avoiding the sorts of plan construction methods that would be adversarial (like if you were to pay someone to come up with a "maximally-convincing-looking plan"). Although there are possible adversarial planning methods that would find plans that make your value function output unusually extremely high numbers, you aren't trying to find them, because those rely on unprobed weird quirks of your value function that you have no reason to care about pursuing, unlike the greater part of your value function that behaves normally and pursues normal things. This is especially true when you know that your cognition isn't adversarially robust. Optimizing hard against the planning module is not the same as optimizing hard towards what the agent wants. In fact, optimizing hard against the planning module often goes against what the agent wants, by making the agent waste resources on computation that is useless or worse from the perspective of the rest of its va
3DragonGod1mo
I think @Quintin Pope would disagree with this. As I understand it, one of Shard Theory's claims is exactly that generally capable RL agents would not apply such adverse selection pressure on their own inputs (or at least that we should not design such agents). See:
* Don't design agents which exploit adversarial inputs [https://www.lesswrong.com/s/nyEFg3AuJpdAozmoX/p/jFCK9JRLwkoJX4aJA]
* Don't align agents to evaluations of plans [https://www.lesswrong.com/s/nyEFg3AuJpdAozmoX/p/fopZesxLCGAXqqaPv]
* Alignment allows "nonrobust" decision-influences and doesn't require robust grading [https://www.lesswrong.com/s/nyEFg3AuJpdAozmoX/p/rauMEna2ddf26BqiE]
2DragonGod1mo
Strongly upvoted. The first part of this post presents an intuitive story for how adverse selection pressure on safety properties of the system could arise internally.

Ah got it, thanks for the reply!

I do think MIRI "at least temporarily gave up" on personally executing on technical research agendas, or something like that, but, that's not the only type of output.

So, I'm sure various people have probably thought about this a lot, but just to ask the obvious dumb question: Are we sure that this is even a good idea?

Let's say the hope is that at some time in the future, we'll stumble across an Amazing Insight that unblocks progress on AI alignment. At that point, it's probably good to be able to execute quickly on turning that...

Agent Foundations research has stuttered a bit over the team going remote and its membership shifting and various other logistical hurdles, but has been continuous throughout.

There's also at least one other team (the one I provide ops support to) that has been continuous since 2017.

I think the thing Raemon is pointing at is something like "in 2020, both Nate and Eliezer would've answered 'yes' if asked whether they were regularly spending work hours every day on a direct, technical research agenda; in 2023 they would both answer 'no.'"

Strong upvoted! The issue where weights that give the gradient hacker any influence at all will be decreased if it causes bad outputs was one of the objections I also had to that gradient hacking post.

I wrote this post a while back where I managed to create a toy model for things that were not quite gradient hackers, but were maybe a more primitive version: https://www.lesswrong.com/posts/X7S3u5E4KktLp7gHz/tessellating-hills-a-toy-model-for-demons-in-imperfect

In terms of ways to create gradient hackers in an actual neural network, here are some suggestion...

Yep, that's the section I was looking at to get that information. Maybe I phrased it a bit unclearly. The thing that would contradict existing observations is if the interaction were not stochastic. Since it is stochastic in Oppenheim's theory, the theory allows the interference patterns that we observe, so there's no contradiction.

Outside view: This looks fairly legit on first glance, and Jonathan Oppenheim is a reputable physicist. The theory is experimentally testable, with numerous tests mentioned in the paper, and the tests don't require reaching unrealistically high energies in a particle accelerator, which is good.

Inside view: Haven't fully read the paper yet, so take with a grain of salt. Quantum mechanics already has a way of representing states with classical randomness, the density matrix, so having a partially classical and partially quantum theory certainly seems like it...

1Logan Zoellner2mo
Isn't that what he addresses in this section?

0.5 probability you're in a simulation is the lower bound, which is only fulfilled if you pay the blackmailer. If you don't pay the blackmailer, then the chance you're in a simulation is nearly 1.

Also, checking if you're in a simulation is definitely a good idea, I try to follow a decision theory something like UDT, and UDT would certainly recommend checking whether or not you're in a simulation. But the Blackmailer isn't obligated to create a simulation with imperfections that can be used to identify the simulation and hurt his prediction accuracy. ...

0Slimepriestess2mo
yes, people have actually died.

I think the issue boils down to one of types and not being able to have a "Statement" type in the theory. This is why we have QUOT[X] to convert a statement X into a string. QUOT is not a function, really, it's a macro that converts a statement into a string representation of that statement. true(QUOT[X]) ⇔ X isn't an axiom, it's an infinite sequence of axioms (a "schema"), one for each possible statement X. It's considered okay to have an infinite sequence of axioms, so long as you know how to compute that sequence. We can enumerate through all possible s...
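
The point that a schema is an infinite but computable sequence of axioms can be sketched as a generator. This is only an illustration under assumptions: the tiny grammar (atoms `p` and `q`, negation, conjunction) and the helper names `is_statement`, `quot` and `axiom_schema` are made up for the example, and `quot` just wraps the statement in quotation marks to stand in for QUOT.

```python
from itertools import count, islice, product

# Hypothetical toy language: statement := atom | "¬" statement | "(" statement "∧" statement ")"
ATOMS = ["p", "q"]
ALPHABET = ATOMS + ["¬", "∧", "(", ")"]


def parse(s, i=0):
    """Return the index just past one statement starting at s[i], or None if there is none."""
    if i >= len(s):
        return None
    if s[i] in ATOMS:
        return i + 1
    if s[i] == "¬":
        return parse(s, i + 1)
    if s[i] == "(":
        j = parse(s, i + 1)
        if j is None or j >= len(s) or s[j] != "∧":
            return None
        k = parse(s, j + 1)
        if k is None or k >= len(s) or s[k] != ")":
            return None
        return k + 1
    return None


def is_statement(s):
    """True if the whole string is exactly one well-formed statement."""
    return parse(s) == len(s)


def quot(statement):
    """Macro-like step: turn a statement into a string literal naming it."""
    return '"' + statement + '"'


def axiom_schema():
    """Yield one axiom per statement, enumerating candidate strings by length."""
    for n in count(1):
        for chars in product(ALPHABET, repeat=n):
            s = "".join(chars)
            if is_statement(s):
                yield f"true({quot(s)}) ⇔ {s}"


first_axioms = list(islice(axiom_schema(), 4))
for axiom in first_axioms:
    print(axiom)
```

The generator never finishes, but any particular axiom shows up after finitely many steps, which is all that "knowing how to compute the sequence" requires here.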

1Jakub Supeł3mo
Oh, and one more thing. My updated premise 2 is: 2'. Whenever John says that X, then X. ( ∀ X:proposition, says(John, X) ⇒ X ) Note that X here is not a statement (grammatically valid sentence?), but a proposition. John can express it however he likes: by means of written word, by means of a demonstration or example, by means of telepathy, etc. There is no need, specifically, to convert a proposition to a string or vice versa; as long as (1) is true and we most likely understand what proposition John is trying to convey, we will most likely believe in the correct normative proposition (that, if expressed in a statement, requires an "ought").
1Jakub Supeł3mo
Ugh, you are using the language of programming in an area where it doesn't fit. Can you explain what these funny backslashes, % signs etc. are? Why did you name a variable fmtstr instead of simply X? Anyway - statements obviously exist, so if your theory doesn't allow for them, it's a problem with your theory and we can just ignore it. In my theory, every sentence that corresponds to a proposition (not all do of course), if that sentence is uttered by John, that proposition is true - that's what I mean by John being truthful. There is no additional axiom here, this is just premise 2, rephrased.

Yeah, it definitely depends how you formalize the logic, which I didn't do in my comment above. I think there are some hidden issues with your proposed disproof, though. For example, how do we formalize 2? If we're representing John's utterances as strings of symbols, then one obvious method would be to write down something like: ∀ s:String, says(John, s) ⇒ true(s). This seems like a good way of doing things, that doesn't mention the ought predicate. Unfortunately, it does require the true predicate, which is meaningless until we have a way of enforcing that...

1Jakub Supeł3mo
"we find out that we used the axiom true(QUOT[ought(X)]) ⇔ ought(X) from the schema. So in order to derive ought(X), we still had to use an axiom with "ought" in it." But that "axiom", as you call it, is trivially true, as it follows from any sensible definition or understanding of "true". In particular, it follows from the axiom "true(QUOT[X]) ⇔ X", which doesn't have an ought in it.   Moreover, we don't even need the true predicate in this argument (we can formulate it in the spirit of the deflationary theory of truth): 2'. Whenever John says that X, then X. ( ∀ s:proposition, says(John, s) ⇒ s )

From a language perspective, I agree that it's great to not worry about the is/ought distinction when discussing anything other than meta-ethics. It's kind of like how we talk about evolved adaptations as being "meant" to solve a particular problem, even though there was really no intention involved in the process. It's just such a convenient way of speaking, so everyone does it.

I guess I'd say that, despite this, the is/ought distinction remains useful in some contexts. Like if someone says "we get morality from X, so you have to believe X or you won't be moral", it gives you a shortcut to realizing "nah, even if I think X is false, I can continue to not do bad things".

What about that thing where you can't derive an "ought" from an "is"? Just from the standpoint of pure logic, we can't derive anything about morality from axioms that don't mention morality. If you want to derive your morality from the existence of God, you still need to add an axiom: "that which God says is moral is moral". On the other end of things, an atheist could still agree with a theist on all moral statements, despite not believing in God. Suppose that God says "A, B, C are moral, and X, Y, Z are immoral". Then an atheist working from the axioms "...

1Jakub Supeł3mo
The hypothesis that we can't derive an ought from an is is not a proven theorem. In fact, it is easy to prove the opposite - we can derive an ought from purely descriptive statements alone. Here is how we can do it: 1. John says that I ought to clean my room. 2. John always speaks the truth (i.e. never lies and is never mistaken). 3. Therefore, I ought to clean my room. Justifying the two premises is of course another matter, but the argument is logically valid and is not circular or anything like that.
3David Gross3mo
Tangentially, FWIW: Among the ought/is counterarguments that I've heard (I first encountered it in Alasdair MacIntyre's stuff) is that some "is"s have "ought"s wrapped up in them from the get-go. The way we divide reality up into its various "is" packages may or may not include function, purpose, etc. in any particular package, but that's in part a linguistic, cultural, fashionable, etc. decision. For example: that is a clock, it ought to tell the correct time, because that is what clocks are all about. That it is a clock implies what it ought to do. MacIntyre's position, more-or-less, is that the modern philosophical position that you can't get oughts from izzes in the human moral realm is the result of a catastrophe in which we lost sight of what people are for, in the same way that if we forgot what clocks did and just saw them as bizarre artifacts, we'd think they were just as suitable as objets d'art, paperweights, or items for bludgeoning fish with, as anything else, and it wouldn't matter which ways the hands were pointing. Now you might say that adding an ought to an is by definition like this (as with the clock) is a sort of artificial, additional, undeclared axiom. But you might consider what removing all the oughts from things like clocks would do to your language and conceptual arsenal. Removing the "ought" from people was a decision, not a conclusion. Philosophers performed a painstaking oughtectomy on the concept of a person and then acted surprised when the ought refused to just regrow itself like a planarian head.

On training AI systems using human feedback: This is way better than nothing, and it's great that OpenAI is doing it, but has the following issues:

1. Practical considerations: AI systems currently tend to require lots of examples and it's expensive to get these if they all have to be provided by a human.
2. Some actions look good to a casual human observer, but are actually bad on closer inspection. The AI would be rewarded for finding and taking such actions.
3. If you're training a neural network, then there are generically going to be lots of adversarial examples
...

It would be really cool to see a video on Newcomb's problem, logical decision theories, and Lobian cooperation in the prisoner's dilemma. I think this group of ideas is one of the most interesting developments in game theory in the past few years, and should be more widely known.

I think what it boils down to is that in 1 dimension, the mean / expected value is a really useful quantity, and you get it by minimizing squared error, whereas the absolute error gives the median, which is still useful, but much less so than the mean. (The mean is one of the moments of the distribution (the first moment), while the median isn't. Rational agents maximize expected utility, not median utility, etc. Even the M in MAE still stands for "mean".) Plus, although algorithmic considerations aren't too important for small problems, in large problems...
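
A quick numerical check of the mean-vs-median claim, as a minimal sketch: the data set and the brute-force grid search are my own choices for illustration (a real fit would use a proper optimizer). It finds the constant `c` that minimizes the summed loss under each error function and compares it with the sample mean and median.

```python
import statistics


def argmin_on_grid(loss, data, lo, hi, steps=2001):
    """Brute-force search for the constant c in [lo, hi] minimizing sum(loss(x - c))."""
    best_c, best_total = lo, float("inf")
    for i in range(steps):
        c = lo + (hi - lo) * i / (steps - 1)
        total = sum(loss(x - c) for x in data)
        if total < best_total:
            best_c, best_total = c, total
    return best_c


# Hypothetical sample with one large outlier, so mean and median differ a lot.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

best_squared = argmin_on_grid(lambda r: r * r, data, 0.0, 100.0)  # squared error
best_absolute = argmin_on_grid(abs, data, 0.0, 100.0)             # absolute error

print(best_squared, statistics.mean(data))
print(best_absolute, statistics.median(data))
```

The squared-error minimizer lands on the mean (22) and the absolute-error minimizer lands on the median (3), which also shows how much more the outlier drags the squared-error answer.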

From a pure world-modelling perspective, the 3 step model is not very interesting, because it doesn't describe reality. It's maybe best to think of it from an engineering perspective, as a test case. We're trying to build an AI, and we want to make sure it works well. We don't know exactly what that looks like in the real world, but we know what it looks like in simplified situations, where the off button is explicitly labelled for the AI and everything is well understood. If a proposed AI design does the wrong thing in the 3-step test case, then it has fa...

Debates of "who's in what reference class" tend to waste arbitrary amounts of time while going nowhere. A more helpful framing of your question might be "given that you're participating in a community that culturally reinforces this idea, are you sure you've fully accounted for confirmation bias and groupthink in your views on AI risk?". To me, LessWrong does not look like a cult, but that does not imply that it's immune to various epistemological problems like groupthink.

A quote from Eliezer's short fanfic Trust in God, or, The Riddle of Kyon that you may find interesting:

Sometimes, even my sense of normality shatters, and I start to think about things that you shouldn't think about. It doesn't help, but sometimes you think about these things anyway.

I stared out the window at the fragile sky and delicate ground and flimsy buildings full of irreplaceable people, and in my imagination, there was a grey curtain sweeping across the world. People saw it coming, and screamed; mothers clutched their children and children clutch

...

I took Nate to be saying that we'd compute the image with highest faceness according to the discriminator, not the generator. The generator would tend to create "thing that is a face that has the highest probability of occurring in the environment", while the discriminator, whose job is to determine whether or not something is actually a face, has a much better claim to be the thing that judges faceness. I predict that this would look at least as weird and nonhuman as those deep dream images if not more so, though I haven't actually tried it. I also predic...

8jacob_cannell5mo
Our best conditional generative models sample from a conditional distribution, they don't optimize for feature-ness. The GAN analogy is also mostly irrelevant because diffusion models have taken over for conditional generation, and Nate's comment seems confused [https://www.lesswrong.com/posts/LDRQ5Zfqwi8GjzPYG/counterarguments-to-the-basic-ai-x-risk-case?commentId=mRJ9rAx5DZdyjoowo] as applied to diffusion models.
5acgt5mo
This feels like something we should just test? I don’t have access to any such model but presumably someone does and can just run the experiment? Because it seems like people's hunches are varying a lot here
8cfoster05mo
Upvoted because I agree with all of the above. AFAICT the original post was using the faces analogy in a different way than Nate is. It doesn't claim that the discriminators used to supervise GAN face learning or the classifiers used to detect faces are adversarially robust. That isn't the point it's making. It claims that learned models of faces don't "leave anything important out" in the way that one might expect some key feature to be "left out" when learning to model a complex domain like human faces or human values. And that seems well-supported: the trajectory of modern ML has shown learning such complex models is far easier than we might've thought, even if building adversarially robust classifiers is very hard. (As much as I'd like to have supervision signals that are robust to arbitrarily-capable adversaries, it seems non-obvious to me that that is even required for success at alignment.)

7: Did I forget some important question that someone will ask in the comments?

Yes!

Is there a way to deal with the issue of there being multiple ROSE points in some games? If Alice says "I think we should pick ROSE point A" and Bob says "I think we should pick ROSE point B", then you've still got a bargaining game left to resolve, right?

Anyways, this is an awesome post, thanks for writing it up!

7Diffractor6mo
My preferred way of resolving it is treating the process of "arguing over which equilibrium to move to" as a bargaining game, and just finding a ROSE point from that bargaining game. If there are multiple ROSE points, well, fire up another round of bargaining. This repeated process should very rapidly have the disagreement points close in on the Pareto frontier, until everyone is just arguing over very tiny slices of utility. This is imperfectly specified, though, because I'm not entirely sure what the disagreement points would be, because I'm not sure how the "don't let foes get more than what you think is fair" strategy generalizes to >2 players. Maaaybe disagreement-point-invariance comes in clutch here? If everyone agrees that an outcome as bad or worse than their least-preferred ROSE point would happen if they disagreed, then disagreement-point-invariance should come in to have everyone agree that it doesn't really matter exactly where that disagreement point is. Or maybe there's some nice principled property that some equilibria have, which others don't, that lets us winnow down the field of equilibria somewhat. Maybe that could happen. I'm still pretty unsure, but "iterate the bargaining process to argue over which equilibria to go to, you don't get an infinite regress because you rapidly home in on the Pareto frontier with each extra round you add" is my best bad idea for it. EDIT: John Harsanyi had the same idea. He apparently had some example where there were multiple CoCo equilibria and his suggestion was that a second round of bargaining could be initiated over which equilibrium to pick, but that in general, it'd be so hard to compute the n-person Pareto frontier for large n, that an equilibrium might be stable because nobody can find a different equilibrium nearby to aim for. So this problem isn't unique to ROSE points in full generality (CoCo equilibria have the exact same issue), it's just that ROSE is the only one that produces multiple solutions for ba

This is a good point, but also kind of an oversimplification of the situation in physics. Imagine Alice is trying to fit some (x, y) data points on a chart. She doesn't know much about any kinds of function other than linear functions, but she can still fit half of the points at a time pretty well. Half of the points have a large x coordinate, and can be fit well by a line of positive slope. Alice calls this line "The Theory of General Relativity". Half of the points have a small x coordinate, and can be fit well by a line of negative slope. Alice calls th...

2tailcalled6mo
That is a fair point.

I'm interested! I've always been curious about how Eliezer pulled off the AI Box experiments, and while I concur that a sufficiently intelligent AI could convince me to let it out, I'm skeptical that any currently living human could do the same.

I don't know of a reason we couldn't do this with a narrow AI. I have no idea how, but it's possible in principle so far as I know. If anyone can figure out how, they could plausibly execute the pivotal act described above, which would be a very good thing for humanity's chances of survival.

EDIT: Needless to say, but I'll say it anyway: Doing this via narrow AI is vastly preferable to using a general AI. It's both much less risky and means you don't have to expend an insane amount of effort on checking.

In your example, I think even adding just one more node, h3, to the hidden layer would suffice to connect the two solutions. One node per dimension of input suffices to learn the function, but it's also possible for two nodes to share the task between them, where the share of the task they are picking up can vary continuously from 0 to 1. So just have h3 take over x2 from h2, then h2 takes over x1 from h1, and then h1 takes over x2 from h3.
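
Here is a minimal sketch of that hand-off, under assumptions I'm making up for the illustration: a ReLU hidden layer, a stand-in target `relu(x1) + relu(x2)` (not the function from the original post), and the "share" of a task represented by output weights `1 - t` and `t` on two nodes that read the same input. An idle node (output weight 0) can be re-pointed at a different input without changing the function.

```python
def relu(z):
    return max(0.0, z)


def network(x, in_w, out_w):
    """One hidden layer: sum over nodes of out_weight * relu(in_weights . x)."""
    return sum(a * relu(w[0] * x[0] + w[1] * x[1]) for w, a in zip(in_w, out_w))


def target(x):
    """Stand-in target function (assumed): one ReLU feature per input."""
    return relu(x[0]) + relu(x[1])


E1, E2 = (1.0, 0.0), (0.0, 1.0)  # input weights meaning "read x1" and "read x2"


def lerp(u, v, t):
    """Linear interpolation between two weight vectors."""
    return tuple((1.0 - t) * a + t * b for a, b in zip(u, v))


def path(steps=10):
    """Yield (input weights, output weights) for nodes h1, h2, h3 along the path."""
    ts = [i / steps for i in range(steps + 1)]
    for t in ts:  # h3 takes over x2 from h2
        yield [E1, E2, E2], [1.0, 1.0 - t, t]
    for t in ts:  # h2 is idle (output weight 0), so re-point it at x1
        yield [E1, lerp(E2, E1, t), E2], [1.0, 0.0, 1.0]
    for t in ts:  # h2 takes over x1 from h1
        yield [E1, E1, E2], [1.0 - t, t, 1.0]
    for t in ts:  # h1 is idle, so re-point it at x2
        yield [lerp(E1, E2, t), E1, E2], [0.0, 1.0, 1.0]
    for t in ts:  # h1 takes over x2 from h3
        yield [E2, E1, E2], [t, 1.0, 1.0 - t]


TEST_INPUTS = [(1.0, 2.0), (-1.0, 3.0), (0.5, -2.0), (-1.5, -0.5)]
points = list(path())
max_error = max(
    abs(network(x, in_w, out_w) - target(x))
    for in_w, out_w in points
    for x in TEST_INPUTS
)
print(max_error)
```

The path starts with h1 on x1 and h2 on x2, ends with h1 on x2 and h2 on x1, and the network matches the target (up to floating-point rounding) at every step in between.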

3Buck8mo
Yeah, but you have the same problem if you were using all three of the nodes in the hidden layer.

1. Modelling humans as having free will: A peripheral system identifies parts of the agent's world model that are probably humans. During the planning phase, any given plan is evaluated twice: The first time as normal, the second time the outputs of the human part of the model are corrupted by noise. If the plan fails the second evaluation, then it probably involves manipulating humans and should be discarded.
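
As a rough sketch of the double evaluation, assuming a toy world model where a plan is a function that gets handed the human part of the model: `human_model`, `make_noisy_human`, the two example plans and the threshold are all hypothetical. One departure from the description above, flagged as my own assumption: the noisy evaluation is repeated several times with a fixed seed instead of run once, so a plan can't pass by luck.

```python
import random


def human_model(message):
    """Toy human inside the world model (hypothetical): complies only with one pitch."""
    return "press_button" if message == "persuasive_pitch" else "ignore"


def make_noisy_human(rng):
    """Same interface as human_model, but the output is replaced by noise."""
    def respond(message):
        return rng.choice(["press_button", "ignore"])
    return respond


def plan_direct(human):
    """Reaches the goal without relying on what the human does."""
    return 1.0


def plan_manipulate(human):
    """Only pays off if the human responds in one specific way."""
    return 2.0 if human("persuasive_pitch") == "press_button" else 0.0


def passes_filter(plan, threshold=1.0, trials=200, seed=0):
    """Evaluate normally, then again with the human's outputs corrupted by noise."""
    if plan(human_model) < threshold:
        return False  # fails even the ordinary evaluation
    rng = random.Random(seed)
    # Assumption: repeat the corrupted evaluation and require success every time.
    return all(plan(make_noisy_human(rng)) >= threshold for _ in range(trials))


print(passes_filter(plan_direct))
print(passes_filter(plan_manipulate))
```

The manipulative plan scores higher under the ordinary evaluation but gets discarded, because its success depends on the human producing one specific output.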

Posting this comment to start some discussion about generalization and instrumental convergence (disagreements #8 and #9).

So my general thoughts here are that ML generalization is almost certainly not good enough for alignment. (At least in the paradigm of deep learning.) I think it's true with high confidence that if we're trying to train a neural net to imitate some value function, and that function takes a high-dimensional input, then it will be possible to find lots of inputs that cause the network to produce a high value when the value function produc...

Yes, sounds right to me. It's also true that one of the big unproven assumptions here is that we could create an AI strong enough to build such a tool, but too weak to hack humans. I find it plausible, personally, but I don't yet have an easy-to-communicate argument for it.

1Ken Kahn7mo
Why can't a narrow AI (maybe like Drexler's proposal) create the tool safely?