I think there's a (kind of) loophole here, where we use an "abstract hypothetical" model of a hypothetical future, and optimize for the consequences of our actions in that hypothetical. Is this what you mean by "understood in abstract terms"?
More or less, yes (in the case of engineering problems specifically, which I think is more real-world-oriented than most science AI).
The part I don't understand is why you're saying that this is "simpler"? It seems equally complex in Kolmogorov complexity and computational complexity.
What I'm saying is "simpler" is this: given a problem whose solution doesn't need to depend on the actual effects of the outputs on the future of the real world, it is simpler for the AI to solve that problem without taking those effects into consideration than to take them into account anyway. Operating in a simulation is one example of such a problem, though one that could become riskily close to the real world depending on the information the simulation takes into account. For example, it might not be a good idea to include highly detailed political risks of other humans thwarting construction in a fusion reactor construction simulation.
Doomimir: But you claim to understand that LLMs that emit plausibly human-written text aren't human. Thus, the AI is not the character it's playing. Similarly, being able to predict the conversation in a bar, doesn't make you drunk. What's there not to get, even for you?
You seem to have an intuition that if you don't understand all the mechanisms for how something works, then it is likely to have some hidden goal and be doing its observed behaviour for instrumental reasons. E.g. the "Alien Actress".
And that makes sense from an evolutionary perspective where you encounter some strange intelligent creature doing some mysterious actions on the savannah. I do not think it makes sense if you specifically trained the system to have that particular behaviour by gradient descent.
I think, if you trained something by gradient descent to have some particular behaviour, the most likely thing that resulted from that training is a system tightly tuned to have that particular behaviour, with the simplest arrangement that leads to the trained behaviour.
And if the behaviour you are training something to do is something that doesn't necessarily involve actually trying to pursue some long-range goal, it would be very strange, in my view, for it to turn out that the simplest arrangement to provide that behaviour calculates the effects of the output on the long-range future in order to determine what output to select.
Moreover even if you tried to train it to want to have some effect on the future, I expect you would find it more difficult than expected, since it would learn various heuristics and shortcuts long before actually learning the very complicated algorithm of generating a world model, projecting it forward given the system's outputs, and selecting the output that steers the future to the particular goal. (To others: This is not an invitation to try that. Please don't).
That doesn't mean that an AI trained by gradient descent on a task that usually doesn't involve trying to pursue a long range goal can never be dangerous, or that it can never have goals.
But it does mean that the danger and the goals of such a usually-non-long-range-task-trained AI, if it has them, are downstream of its behaviour.
For example, an extremely advanced text predictor might predict the text output of a dangerous agent through an advanced simulation that is itself a dangerous agent.
And if someone actually manages to train a system by gradient descent to do real-world long range tasks (which probably is a lot easier than making a text predictor that advanced), well then...
BTW all the above is specific to gradient descent. I do expect self-modifying agents, for example, to be much more likely to be dangerous, because actual goals lead to wanting to enhance one's ability and inclination to pursue those goals, whereas non-goal-oriented behaviour will not be self-preserving in general.
And in the Sleeping Beauty case, as I'm going to show in my next post, there are indeed troubles justifying the thirders' sampling assumption with the other conditions of the setting.
I look forward to seeing your argument.
I'm giving you a strong upvote for this. It's rare to find a person who notices that Sleeping Beauty is quite different from other "anthropic problems" such as incubator problems.
Thanks! But I can't help but wonder if one of your examples of someone who doesn't notice is my past self making the following comment (in a thread for one of your previous posts) which I still endorse:
I certainly agree that one can have philosophical assumptions such that you sample differently for Sleeping Beauty and Incubator problems, and indeed I would not consider the halfer position particularly tenable in Incubator, whereas I do consider it tenable in Sleeping Beauty.
But ... I did argue in that comment that it is still possible to take a consistent thirder position on both. In the comment I take the thirder position for Sleeping Beauty for granted and argue that it can still be applied to Incubator (rather than the other way around, despite my being more pro-thirder for Incubator). I did this specifically to rebut an argument in that earlier post of yours that the classic thirder position for Sleeping Beauty didn't apply to Incubator.
Some clarification of my actual view here (rather than my defense of conventional thirderism):
In my view, sampling is not something that occurs in reality when the "sampling" in question includes sampling between multiple entities that both exist. Each of the entities that actually exists actually exists, and any "sampling" between multiple such entities occurs (only) in the mind of the observer. (However, it can still mix with conventional sampling in the mind of the observer.) Which sampling assumption you use in such cases is in principle arbitrary, but in practice should probably be based on how much you care about the correctness of the beliefs of each of the possible entities you are uncertain about being.
Halferism or thirderism for Sleeping Beauty are both viable, in my view, because one could argue for caring equally about being correct at each awakening (resulting in thirderism) or one could argue for caring equally about being correct collectively in the awakenings for each of the coin results (resulting in halferism). There isn't any particular "skin in the game" to really force a person to make a commitment here.
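To make the "what do you care about being correct on" framing concrete, here is a minimal sketch. It assumes (my assumption for illustration, not something argued above) that "being correct" is scored by squared error on a credence p in heads, and that the only difference between the two positions is how much weight each coin result gets. The names expected_loss and best_credence are hypothetical.

```python
from fractions import Fraction


def expected_loss(p, weight_heads, weight_tails):
    """Expected weighted squared error of credence p in heads.

    The coin is fair, so each result has probability 1/2.
    On heads the error is (1 - p)^2, on tails it is p^2.
    The weights encode how much you care about being right
    in each branch (an assumption of this sketch).
    """
    half = Fraction(1, 2)
    return half * weight_heads * (1 - p) ** 2 + half * weight_tails * p ** 2


def best_credence(weight_heads, weight_tails):
    """Credence minimizing expected_loss.

    Setting the derivative to zero gives
    p = weight_heads / (weight_heads + weight_tails).
    """
    return Fraction(weight_heads, weight_heads + weight_tails)


# Caring equally about each awakening: heads has 1 awakening, tails has 2.
per_awakening = best_credence(1, 2)

# Caring equally about each coin result, however many awakenings it contains.
per_coin_result = best_credence(1, 1)

print(per_awakening, per_coin_result)
```

Under these assumptions the per-awakening weighting gives 1/3 and the per-coin-result weighting gives 1/2. Nothing in the arithmetic picks between the two weightings, which is the sense in which the choice is not forced.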
You seem to be assuming that the ability of the system to find out if security assumptions are false affects whether the falsity of the assumptions has a bad effect. Which is clearly the case for some assumptions - "This AI box I am using is inescapable" - but it doesn't seem immediately obvious to me that this is generally the case.
Generally speaking, a system can have bad effects if made under bad assumptions (think a nuclear reactor or aircraft control system) even if it doesn't understand what it's doing. Perhaps that's less likely for AI, of course.
And on the other hand, an intelligent system could be aware that an assumption would break down in circumstances that haven't arrived yet, and neither do anything about it nor even tell humans about it.
how often you pop up out of nowhere
Or evolve from something else. (Which you clearly intended, based e.g. on your mention of crabs, but didn't make clear in that sentence.)
Thirders believe that this awakening should be treated as randomly sampled from three possible awakening states. Halfers believe that this awakening should be treated as randomly sampled from two possible states, corresponding to the result of a coin toss. This is an objective disagreement, that can be formulated in terms of probability theory and at least one side inevitably has to be in the wrong. This is the unresolved issue that we can't simply dismiss because both sides have a point.
If you make some assumptions about sampling, probability theory will give one answer, with other assumptions probability theory will give another answer. So both can be defended with probability theory, it depends on the sampling assumptions. And there isn't necessarily any sampling assumption that's objectively correct here.
By the way, I normally agree with thirders in terms of my other assumptions about anthropics. But since Sleeping Beauty is specifically formulated to keep the multiple awakenings from having any impact on the rest of the world, including the past and future, I think the halfer sampling assumption isn't necessarily crazy in this case.
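The dependence on sampling assumptions can be sketched with a small simulation. This is only an illustration under the standard setup (fair coin, one awakening on heads, two on tails); the function name simulate, the seed, and the run count are my own choices. The same runs are tallied two ways: once treating each experiment as the unit being sampled, once treating each awakening as the unit.

```python
import random


def simulate(n_experiments, seed=0):
    """Run the Sleeping Beauty setup n_experiments times.

    Returns a pair:
      - fraction of experiments in which the coin landed heads
      - fraction of awakenings that occur in a heads experiment
    """
    rng = random.Random(seed)
    heads_experiments = 0
    heads_awakenings = 0
    total_awakenings = 0
    for _ in range(n_experiments):
        heads = rng.random() < 0.5
        # One awakening on heads, two on tails.
        awakenings = 1 if heads else 2
        total_awakenings += awakenings
        if heads:
            heads_experiments += 1
            heads_awakenings += awakenings
    return (heads_experiments / n_experiments,
            heads_awakenings / total_awakenings)


per_experiment, per_awakening = simulate(100_000)
print(f"heads per experiment: {per_experiment:.3f}")
print(f"heads per awakening:  {per_awakening:.3f}")
```

The first figure should come out near 1/2 and the second near 1/3. Both are correct frequencies for the same runs; they differ only in what is being counted, which is the sampling assumption.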
It seems to me we should have a strong prior that it was lab-produced, based on the immediate high infectiousness. What evidence does Peter Miller provide to overcome that prior?
While some disagreement might be about relatively mundane issues, I think there's some more fundamental disagreement about agency as well.
In my view, in order for an AI to be dangerous in a particularly direct way (instead of just misuse risk etc.), its decision to give output X needs to depend on the fact that output X has some specific effects in the future.
Whereas, if you train it on a problem where solutions don't need to depend on the effects of the outputs on the future, I think it much more likely to learn to find the solution without routing that through the future, because that's simpler.
So if you train an AI to give solutions to scientific problems, I don't think, in general, that that needs to depend on the future, so I think that it's likely to learn the direct relationships between the data and the solutions. I.e. it's not merely a logical possibility to make it not especially dangerous; that's the default outcome if you give it problems that don't need to depend on specific effects of the output.
Now, if you were instead to give it a problem that had to depend on the effects of the output on the future, then it would be dangerous...but note that e.g. chess, even though it maps onto a game played in the real world in the future, can also be understood in abstract terms so you don't actually need to deal with anything outside the chess game itself.
In general, I just think that predicting the future of the world and choosing specific outputs based on their effects on the real world is a complicated way to solve problems and expect things to take shortcuts when possible.
Once something does care about the future, then it will have various instrumental goals about the future, but the initial step about actually caring about the future is very much not trivial in my view!
Science is usually a real-world task.
Fair enough, a fully automated do-everything science-doer would, in order to do everything science-related, have to do real-world tasks and would thus be dangerous. That being said, I think there's plenty of room for "doing science" (up to some reasonable level of capability) without going all the way to automation of real-world aspects - you can still have an assistant that thinks up theory for you, you just can't have something that does the experiments as well.
Part of your comment (e.g. point 3) relates to how the AI would in practice be rewarded for achieving real-world effects, which I agree is a reason for concern. Thus, as I said, "you might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though".
Your comment goes beyond this however, and seems to assume in some places that merely knowing or conceptualizing about the real world will lead to "forming goals" about the real world.
I actually agree that this may be the case with AI that self-improves, since if an AI that has a slight tendency toward a real-world goal self-modifies, its tendency toward that real-world goal will tend to direct it to enhance its alignment to that real-world goal, whereas its tendencies not directed towards real-world goals will in general happily overwrite themselves.
If the AI does not self-improve however, then I do not see that as being the case.
If the AI is not being rewarded for the real-world effects, but instead being rewarded for scientific outputs that are "good" according to some criteria that do not depend on their real-world effects, then it will learn to generate outputs that are good according to those criteria. I don't think that would, in general, lead it to select actions that would steer the world to some particular world-state. To be sure, these outputs would have effects on the real world - a design for a fusion reactor would tend to lead to a fusion reactor being constructed, for example - but if the particular outputs are not rewarded based on the real-world outcome then they will also not tend to be selected based on the real-world outcome.
Some less relevant nitpicks of points in your comment:
Even if an AI is only trained in a limited domain (e.g. math), it can still have objectives that extend outside of this domain
If you train an AI on some very particular math then it could have goals relating to the future of the real world. I think, however, that the math you would need to train it on to get this effect would have to be very narrow, and likely have to either be derived from real-world data, or involve the AI studying itself (which is a component of the real world after all). I don't think this happens for generically training an AI on math.
As an example, if we humans discovered we were in a simulation, we could easily have goals that extend outside of the simulation (the obvious one being to make sure the simulators didn’t turn us off).
True, but see above and below.
Chess AIs don’t develop goals about the real world because they are too dumb.
If you have something trained by gradient descent solely on doing well at chess, it's not going to consider anything outside the chess game, no matter how many parameters and how much compute it has. Any consideration of outside-of-chess factors lowers the resources available for chess, and is selected against until it reaches the point of subverting the training regime (which it doesn't reach, since it is selected against before then).
Even if you argue that, if it's smart enough, additional computing power is neutral, the gradient descent doesn't actually reward out-of-context thinking for chess, so it couldn't develop except by sheer chance or by somehow being a side-effect of thinking about chess itself - but chess is a mathematically "closed" domain, so there doesn't seem to be any reason out-of-context thinking would be developed.
The same applies to math in general where the math doesn't deal with the real world or the AI itself. This is a more narrow and more straightforward case than scientific research in general.
I'm not convinced by the argument that AI science systems are necessarily dangerous.
It's generically* the case that any AI that is trying to achieve some real-world future effect is dangerous. In that linked post Nate Soares used chess as an example, which I objected to in a comment. An AI that is optimizing within a chess game isn't thereby dangerous, as long as the optimization stays within the chess game. E.g., an AI might reliably choose strong chess moves, but still not show real-world Omohundro drives (e.g. not avoiding being turned off).
I think scientific research is more analogous to chess than trying to achieve a real-world effect in this regard (even if the scientific research has real-world side effects), in that you can, in principle, optimize for reliably outputting scientific insights without actually leading the AI to output anything based on its real-world effects. (the outputs are selected based on properties aligned with "scientific value", but that doesn't necessarily require the assessment to take into account how it will be used, or any other effect on the future of the world. You might need to be careful not to evaluate in such a way that it will wind up optimizing for real-world effects, though).
Note: an AI that can "build a fusion rocket" is generically dangerous. But an AI that can design a fusion rocket, if that design is based on general principles and not tightly tuned on what will produce some exact real-world effect, is likely not dangerous.
*generically dangerous: I use this to mean that an AI with this property is going to be dangerous unless some unlikely-by-default (and possibly very difficult) safety precautions are taken.