This is a special post for quick takes by Vanessa Kosoy. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Vanessa Kosoy's Shortform
233 comments, sorted by Click to highlight new comments since:
Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Text whose primary goal is conveying information (as opposed to emotion, experience or aesthetics) should be skimming friendly. Time is expensive, words are cheap. Skimming is a vital mode of engaging with text, either to evaluate whether it deserves a deeper read or to extract just the information you need. As a reader, you should nurture your skimming skills. As a writer, you should treat skimmers as a legitimate and important part of your target audience. Among other things it means:

  • Good title and TLDR/abstract
  • Clear and useful division into sections
  • Putting the high-level picture and conclusions first, the technicalities and detailed arguments later. Never leave the reader clueless about where you’re going with something for a long time.
  • Visually emphasize the central points and make them as self-contained as possible. For example, in the statement of mathematical theorems avoid terminology whose definition is hidden somewhere in the bulk of the text.

Stronger: as a writer you should assume your modal reader is a skimmer, both because they are, and because even non skimmers are only going to remember about the same number of things that the good skimmer does.

I propose to call metacosmology the hypothetical field of study which would be concerned with the following questions:

  • Studying the space of simple mathematical laws which produce counterfactual universes with intelligent life.
  • Studying the distribution over utility-function-space (and, more generally, mindspace) of those counterfactual minds.
  • Studying the distribution of the amount of resources available to the counterfactual civilizations, and broad features of their development trajectories.
  • Using all of the above to produce a distribution over concretized simulation hypotheses.

This concept is of potential interest for several reasons:

  • It can be beneficial to actually research metacosmology, in order to draw practical conclusions. However, knowledge of metacosmology can pose an infohazard, and we would need to precommit not to accept blackmail from potential simulators.
  • The metacosmology knowledge of a superintelligent AI determines the extent to which it poses risk via the influence of potential simulators.
  • In principle, we might be able to use knowledge of metacosmology in order to engineer an "atheist prior" for the AI that would exclude simulation hypotheses. However, this might be very difficult in practice.
2Alexander Gietelink Oldenziel
Why do bad things happen to good people?

People like Andrew Critch and Paul Christiano have criticized MIRI in the past for their "pivotal act" strategy. The latter can be described as "build superintelligence and use it to take unilateral world-scale actions in a manner inconsistent with existing law and order" (e.g. the notorious "melt all GPUs" example). The critics say (justifiably IMO), this strategy looks pretty hostile to many actors and can trigger preemptive actions against the project attempting it and generally foster mistrust.

Is there a good alternative? The critics tend to assume slow-takeoff multipole scenarios, which makes the comparison with their preferred solutions to be somewhat "apples and oranges". Suppose that we do live in a hard-takeoff singleton world, what then? One answer is "create a trustworthy, competent, multinational megaproject". Alright, but suppose you can't create a multinational megaproject, but you can build aligned AI unilaterally. What is a relatively cooperative thing you can do which would still be effective?

Here is my proposed rough sketch of such a plan[1]:

  • Commit to not make anyone predictably regret supporting the project or not opposing it. This rule is the most important and
... (read more)

IMO it was a big mistake for MIRI to talk about pivotal acts without saying they should even attempt to follow laws.

9habryka
The whole point of the pivotal act framing is that you are looking for something to do that you can do with the least advanced AI system. This means it's definitely not a superintelligence. If you have an aligned superintelligence this I think makes that framing not really make sense. The problem the framing is trying to grapple with is that we want to somehow use AI to solve AI risk, and for that we want to use the very dumbest AI that we can use for a successful plan.
5Vanessa Kosoy
I know, this is what I pointed at in footnote 1. Although "dumbest AI" is not quite right: the sort of AI MIRI envision is still very superhuman in particular domains, but is somehow kept narrowly confined to acting within those domains (e.g. designing nanobots). The rationale mostly isn't assuming that at that stage it won't be possible to create a full superintelligence, but assuming that aligning such a restricted AI would be easier. I have different views on alignment, leading me to believe that aligning a full-fledged superintelligence (sovereign) is actually easier (via PSI or something in that vein). On this view, we still need to contend with the question, what is the thing we will (honestly!) tell other people that our AI is actually going to do. Hence, the above.
3quetzal_rainbow
I always thought "you should use the least advanced superintelligence necessary". I.e., in not-real-example of "melting all GPUs" your system should be able to design nanotech advanced enough to target all GPUs in open enviroment, which is superintelligent task, while not being able to, say, reason about anthropics and decision theory.
6sapphire
Im not particularly against pivotal acts. It seems plausible to me someone will take one. Would not exactly shock me if Sam Altman himself planned to take one to prevent dangerous AGI. He is intelligent and therefore isnt going to openly talk about considering them. But I dont have any serious objection to them being taken if people are reasonable about it.
4faul_sname
What sort of evidence are you envisioning that would allow us to determine that we live in a hard takeoff singleton world, and that the proposed pivotal act would actually work, ahead of actually attempting said pivotal act? I can think of a couple options: 1. We have no such evidence, but we can choose an act that is only pivotal if the underlying world model that leads you to expect a hard takeoff singleton world actually holds, and harmlessly fails otherwise. 2. Galaxy brained game theory arguments, of the flavor John von Neumann made when he argued for preemptive nuclear strike on the Soviet Union. 3. Something else entirely My worry, given the things Yudkowsky has said like "I figured this stuff out using the null string as input", is that the argument is closer to (2). So to reframe the question: Someone has done a lot of philosophical thinking, and come to the conclusion that something apocalyptically bad will happen in the near future. In order to prevent the bad thing from happening, they need to do something extremely destructive and costly that they say will prevent the apocalyptic event. What evidence do you want from that person before you are happy to have them do the destructive and costly thing?
2Vanessa Kosoy
I don't have to know in advance that we're in hard-takeoff singleton world, or even that my AI will succeed to achieve those objectives. The only thing I absolutely have to know in advance is that my AI is aligned. What sort of evidence will I have for this? A lot of detailed mathematical theory, with the modeling assumptions validated by computational experiments and knowledge from other fields of science (e.g. physics, cognitive science, evolutionary biology).  I think you're misinterpreting Yudkowsky's quote. "Using the null string as input" doesn't mean "without evidence", it means "without other people telling me parts of the answer (to this particular question)". I'm not sure what is "extremely destructive and costly" in what I described? Unless you mean the risk of misalignment, in which case, see above.
2faul_sname
This was specifically in response to It sounds like you do in fact believe we are in a hard-takeoff singleton world, or at least one in which a single actor can permanently prevent all other actors from engaging in catastrophic actions using a less destructive approach than "do unto others before they can do unto you". Why do you think that describes the world we live in? What observations led you to that conclusion, and do you think others would come to the same conclusion if they saw the same evidence? I think your set of guidelines from above is mostly[1] a good one, in worlds where a single actor can seize control while following those rules. I don't think that we live in such a world, and honestly I can't really imagine what sort of evidence would convince me that I do live in such a world though. Which is why I'm asking. Yeah, on examination of the comment section I think you're right that by "from the null string" he meant "without direct social inputs on this particular topic".  1. ^ "Commit to not make anyone predictably regret supporting the project or not opposing it" is worrying only by omission -- it's a good guideline, but it leaves the door open for "punish anyone who failed to support the project once the project gets the power to do so". To see why that's a bad idea to allow, consider the situation where there are two such projects and you, the bystander, don't know which one will succeed first.
2Vanessa Kosoy
I don't know whether we live in a hard-takeoff singleton world or not. I think there is some evidence in that direction, e.g. from thinking about the kind of qualitative changes in AI algorithms that might come about in the future, and their implications on the capability growth curve, and also about the possibility of recursive self-improvement. But, the evidence is definitely far from conclusive (in any direction). I think that the singleton world is definitely likely enough to merit some consideration. I also think that some of the same principles apply to some multipole worlds. Yes, I never imagined doing such a thing, but I definitely agree it should be made clear. Basically, don't make threats, i.e. don't try to shape others incentives in ways that they would be better off precommitting not to go along with it.
1quetzal_rainbow
If you are capable to use AI to do harmful and costly thing, like "melt GPUs", you are in hard takeoff world.
2faul_sname
Yeah, I'm not actually worried about the "melt all GPUs" example of a pivotal act. If we actually live in a hard takeoff world, I think we're probably just hosed. The specific plans I'm worried about are ones that ever-so-marginally increase our chances of survival in hard-takeoff singleton worlds, at massive costs in multipolar worlds. A full nuclear exchange would probably kill less than a billion people. If someone convinces themself that a full nuclear exchange would prevent the development of superhuman AI, I would still strongly prefer that person not try their hardest to trigger a nuclear exchange. More generally, I think having a policy of "anyone who thinks the world will end unless they take some specific action should go ahead and take that action, as long as less than a billion people die" is a terrible policy.
1quetzal_rainbow
I think the problem here is "convinces themself". If you are capable to trigger nuclear war, you are probably capable to do something else which is not that, if you put your mind in that.
2faul_sname
Does the" something else which is not that but is in the same difficulty class" also accomplish the goal of "ensure that nobody has access to what you think is enough compute to build an ASI?" If not, I think that implies that the "anything that probably kills less than a billion people is fair game" policy is a bad one.
3NeroWolfe
Why do you think that the space colonists would be able to create a utopian society just because they are not on earth? You will still have all the same types of people up there as down here, and they will continue to exhibit the Seven Deadly Sins. They will just be in a much smaller and more fragile environment, most likely making the consequences of bad behavior worse than here on earth.
5mako yass
They have superintelligence, the augmenting technologies that come of it, and the self-reflection that follows receiving those, they are not the same types of people.
3Vanessa Kosoy
It's not because they're not on Earth, it's because they have a superintelligence helping them. Which might give them advice and guidance, take care of their physical and mental health, create physical constraints (e.g. that prevent violence), or even give them mind augmentation like mako yass suggested (although I don't think that's likely to be a good idea early on). And I don't expect their environment to be fragile because, again, designed by superintelligence. But I don't know the details of the solution: the AI will decide those, as it will be much smarter than me.
2Kaj_Sotala
I would guess that getting space colonies to the kind of a state where they could support significant human inhabitation would be a multi-decade project, even with superintelligence? Especially taking into account that they won't have much nature without significant terraforming efforts, and quite a few people would find any colony without any forests etc. to be intrinsically dystopian.
3Vanessa Kosoy
First, given nanotechnology, it might be possible to build colonies much faster. Second, I think the best way to live is probably as uploads inside virtual reality, so terraforming is probably irrelevant. Third, it's sufficient that the colonists are uploaded or cryopreserved (via some superintelligence-vetted method) and stored someplace safe (whether on Earth or in space) until the colony is entirely ready. Fourth, if we can stop aging and prevent other dangers (including unaligned AI), then a timeline of decades is fine.
1Will_Pearson
Does it make sense to plan for one possible world or do you think that the other possible worlds are being adequately planned for and it is only the fast unilateral take off that is neglected currently? Limiting AI to operating in space makes sense. You might want to pay off or compensate all space launch capability in some way as there would likely be less need. Some recompense for the people who paused working on AI or were otherwise hurt in the build up to AI makes sense. Also trying to communicate ahead of time what a utopic vision of AI and humans might look like, so the cognitive stress isn't too major is probably a good idea to commit to. Committing to support multilateral acts if unilateral acts fail is probably a good idea too. Perhaps even partnering with a multilateral effort so that effort on shared goals can be spread around?
1Ilio
In this plan, how should the AI define what’s in the interest of the person being persuaded? For example, say you have a North Korean soldier who can be persuaded to quite for the west (at the risk of getting the shitty jobs most migrants have) or who can be persuaded to remain loyal to his bosses (at the risk of raising his children in the shitty country most north korean have), what set of rules would you suggest?

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert... (read more)

2Vladimir_Nesov
This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.
3Vanessa Kosoy
That is similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.
2Vladimir_Nesov
To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it's necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman). Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset, they can be learned. The habits of thought that are simple enough to summarize get summarized and end up requiring fewer steps. My guess is that the human faculties needed for AGI can be both represented by sequences of thoughts (probably just text, stream of consciousness style) and easily learned with current ML. So right now the main obstruction is that it's not feasible to build a dataset with those faculties represented explicitly that's good enough and large enough for current sample-inefficient ML to grok. More compute in the learning algorithm is only relevant for this to the extent that we get a better dataset generator that can work on the tasks before it more reliably.
2Vanessa Kosoy
I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of though that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.
2Vladimir_Nesov
I was arguing that near human level babblers (including the imitation plateau you were talking about) should quickly lead to human level AGIs by amplification via stream of consciousness datasets, which doesn't pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it's not necessary). And amplified babblers should be stronger than vanilla uploads (at least implausibly well-educated, well-coordinated, high IQ humans). For your scenario to be stable, it needs to be impossible (in the near term) to run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain less effective than very high IQ humans. Otherwise you get acceleration of technological progress, including ML. So my point is that feasibility of imitation plateau depends on absence of compute overhang, not on ML failing to capture some of the ingredients of human general intelligence.
2Vanessa Kosoy
The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhead, whereas a success to capture all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario. EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.
2Vladimir_Nesov
I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are developed (because generation of stream of consciousness datasets is a special case of such a job). So the key things to make imitation plateau last for years are either sideloads requiring more compute than it looks like (to me) they require, or amplification of competent babblers into similarly competent AGIs being a hard problem that takes a long time to solve.
4Vanessa Kosoy
Another thing that might happen is a data bottleneck. Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are a lot less competent AI researchers than people in general). Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length. In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And, simulating progress requires by design going off-distribution along certain dimensions which might make things worse.
2avturchin
Another way to describe the same (or similar) plateau: we could think about GPT-n as GLUT with approximation between prerecorded answers: it can produce intelligent products similar to the ones which were created by humans in the past and are presented in its training dataset – but not above the human intelligence level, as there is no superintelligent examples in the dataset. 

Here's the sketch of an AIT toy model theorem that in complex environments without traps, applying selection pressure reliably produces learning agents. I view it as an example of Wentworth's "selection theorem" concept.

Consider any environment  of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function

Suppose that there exists a policy  of finite Kolmogorov complexity (i.e. computable) that's optimal for  in the slow discount limit. That is,

Then,  cannot be the only environment with this property. Otherwise, this property could be used to define  using a finite number of bits, which is impossible[1]. Since  requires infinitely many more bits to specify than  and , there has to be infinitely many environments with the same property[2]. Therefore,  is a reinforcement learning algorithm for some infinite class of hypothesis.

Moreover, there are natural examples of  as above. For instance, let's construct  as an infinite sequence of finite communicating infra-RDP refinements tha... (read more)

I propose a new formal desideratum for alignment: the Hippocratic principle. Informally the principle says: an AI shouldn't make things worse compared to letting the user handle them on their own, in expectation w.r.t. the user's beliefs. This is similar to the dangerousness bound I talked about before, and is also related to corrigibility. This principle can be motivated as follows. Suppose your options are (i) run a Hippocratic AI you already have and (ii) continue thinking about other AI designs. Then, by the principle itself, (i) is at least as good as (ii) (from your subjective perspective).

More formally, we consider a (some extension of) delegative IRL setting (i.e. there is a single set of input/output channels the control of which can be toggled between the user and the AI by the AI). Let be the the user's policy in universe and the AI policy. Let be some event that designates when we measure the outcome / terminate the experiment, which is supposed to happen with probability for any policy. Let be the value of a state from the user's subjective POV, in universe . Let be the environment in universe . Finally, let be the AI's prior over universes and ... (read more)

8Steven Byrnes
(Update: I don't think this was 100% right, see here for a better version.) Attempted summary for morons like me: AI is trying to help the human H. They share access to a single output channel, e.g. a computer keyboard, so that the actions that H can take are exactly the same as the actions AI can take. Every step, AI can either take an action, or delegate to H to take an action. Also, every step, H reports her current assessment of the timeline / probability distribution for whether she'll succeed at the task, and if so, how soon. At first, AI will probably delegate to H a lot, and by watching H work, AI will gradually learn both the human policy (i.e. what H tends to do in different situations), and how different actions tend to turn out in hindsight from H's own perspective (e.g., maybe whenever H takes action 17, she tends to declare shortly afterwards that probability of success now seems much higher than before—so really H should probably be taking action 17 more often!). Presumably the AI, being a super duper fancy AI algorithm, learns to anticipate how different actions will turn out from H's perspective much better than H herself. In other words, maybe it delegates to H, and H takes action 41, and the AI is watching this and shaking its head and thinking to itself "gee you dunce you're gonna regret that", and shortly thereafter the AI is proven correct. OK, so now what? The naive answer would be: the AI should gradually stop delegating and start just doing the thing that leads to H feeling maximally optimistic later on. But we don't want to do that naive thing. There are two problems: The first problem is "traps" (a.k.a. catastrophes). Let's say action 0 is Press The History Eraser Button. H never takes that action. The AI shouldn't either. What happens is: AI has no idea (wide confidence interval) about what the consequence of action 0 would be, so it doesn't take it. This is the delegative RL thing—in the explore/exploit dilemma, the AI kinda sits b
4Vanessa Kosoy
This is about right. Notice that typically we use the AI for tasks which are hard for H. This means that without the AI's help, H's probability of success will usually be low. Quantilization-wise, this is a problem: the AI will be able to eliminate those paths for which H will report failure, but maybe most of the probability mass among apparent-success paths is still on failure (i.e. the success report is corrupt). This is why the timeline part is important. On a typical task, H expects to fail eventually but they don't expect to fail soon. Therefore, the AI can safely consider a policies of the form "in the short-term, do something H would do with marginal probability, in the long-term go back to H's policy". If by the end of the short-term maneuver H reports an improved prognosis, this can imply that the improvement is genuine (since the AI knows H is probably uncorrupted at this point). Moreover, it's possible that in the new prognosis H still doesn't expect to fail soon. This allows performing another maneuver of the same type. This way, the AI can iteratively steer the trajectory towards true success.
4TurnTrout
The Hippocratic principle seems similar to my concept of non-obstruction (https://www.lesswrong.com/posts/Xts5wm3akbemk4pDa/non-obstruction-a-simple-concept-motivating-corrigibility), but subjective from the human's beliefs instead of the AI's.
2Vanessa Kosoy
Yes, there is some similarity! You could say that a Hippocratic AI needs to be continuously non-obstructive w.r.t. the set of utility functions and priors the user could plausibly have, given what the AI knows. Where, by "continuously" I mean that we are allowed to compare keeping the AI on or turning off at any given moment.
4Vanessa Kosoy
"Corrigibility" is usually defined as the property of AIs who don't resist modifications by their designers. Why would we want to perform such modifications? Mainly it's because we made errors in the initial implementation, and in particular the initial implementation is not aligned. But, this leads to a paradox: if we assume our initial implementation to be flawed in a way that destroys alignment, why wouldn't it also be flawed in a way that destroys corrigibility? In order to stop passing the recursive buck, we must assume some dimensions along which our initial implementation is not allowed to be flawed. Therefore, corrigibility is only a well-posed notion in the context of a particular such assumption. Seen through this lens, the Hippocratic principle becomes a particular crystallization of corrigibility. Specifically, the Hippocratic principle assumes the agent has access to some reliable information about the user's policy and preferences (be it through timelines, revealed preferences or anything else). Importantly, this information can be incomplete, which can motivate altering the agent along the way. And, the agent will not resist this alteration! Indeed, resisting the alteration is ruled out unless the AI can conclude with high confidence (and not just in expectation) that such resistance is harmless. Since we assumed the information is reliable, and the alteration is beneficial, the AI cannot reach such a conclusion. For example, consider an HDTL agent getting upgraded to "Hippocratic CIRL" (assuming some sophisticated model of relationship between human behavior and human preferences). In order to resist the modification, the agent would need a resistance strategy that (i) doesn't deviate too much from the human baseline and (ii) ends with the user submitting a favorable report. Such a strategy is quite unlikely to exist.
4Charlie Steiner
I think the people most interested in corrigibility are imagining a situation where we know what we're doing with corrigibility (e.g. we have some grab-bag of simple properties we want satisfied), but don't even know what we want from alignment, and then they imagine building an unaligned slightly-sub-human AGI and poking at it while we "figure out alignment." Maybe this is a strawman, because the thing I'm describing doesn't make strategic sense, but I think it does have some model of why we might end up with something unaligned but corrigible (for at least a short period).
4Vanessa Kosoy
The concept of corrigibility was introduced by MIRI, and I don't think that's their motivation? On my model of MIRI's model, we won't have time to poke at a slightly subhuman AI, we need to have at least a fairly good notion of what to do with a superhuman AI upfront. Maybe what you meant is "we won't know how to construct perfect-utopia-AI, so we will just construct a prevent-unaligned-AIs-AI and run it so that we can figure out perfect-utopia-AI in our leisure". Which, sure, but I don't see what it has to do with corrigibility. Corrigibility is neither necessary nor sufficient for safety. It's not strictly necessary because in theory an AI can resist modifications in some scenarios while always doing the right thing (although in practice resisting modifications is an enormous red flag), and it's not sufficient since an AI can be "corrigible" but cause catastrophic harm before someone notices and fixes it. What we're supposed to gain from corrigibility is having some margin of error around alignment, in which case we can decompose alignment as corrigibility + approximate alignment. But it is underspecified if we don't say along which dimensions or how big the margin is. If it's infinite margin along all dimensions then corrigibility and alignment are just isomorphic and there's no reason to talk about the former.
2Charlie Steiner
Very interesting - I'm sad I saw this 6 months late. After thinking a bit, I'm still not sure if I want this desideratum. It seems to require a sort of monotonicity, where we can get superhuman performance just by going through states that humans recognize as good, and not by going through states that humans would think are weird or scary or unevaluable. One case where this might come up is in competitive games. Chess AI beats humans in part because it makes moves that many humans evaluate as bad, but are actually good. But maybe this example actually supports your proposal - it seems entirely plausible to make a chess engine that only makes moves that some given population of humans recognize as good, but is better than any human from that population. On the other hand, the humans might be wrong about the reason the move is good, so that the game is made of a bunch of moves that seem good to humans, but where the humans are actually wrong about why they're good (from the human perspective, this looks like regularly having "happy surprises"). We might hope that such human misevaluations are rare enough that quantilization would lead to moves on average being well-evaluated by humans, but for chess I think that might be false! Computers are so much better than humans at chess that a very large chunk of the best moves according to both humans and the computer will be ones that humans misevaluate. Maybe that's more a criticism of quantilizers, not a criticism of this desideratum. So maybe the chess example supports this being a good thing to want? But let me keep critiquing quantilizers then :P If what a powerful AI thinks is best (by an exponential amount) is to turn off the stars until the universe is colder, but humans think it's scary and ban the AI from doing scary things, the AI will still try to turn off the stars in one of the edge-case ways that humans wouldn't find scary. And if we think being manipulated like that is bad and quantilize over actions to m
2Vanessa Kosoy
When I'm deciding whether to run an AI, I should be maximizing the expectation of my utility function w.r.t. my belief state. This is just what it means to act rationally. You can then ask, how is this compatible with trusting another agent smarter than myself? One potentially useful model is: I'm good at evaluating and bad at searching (after all, P≠NP). I can therefore delegate searching to another agent. But, as you point out, this doesn't account for situations in which I seem to be bad at evaluating. Moreover, if the AI prior takes an intentional stance towards the user (in order to help learning their preferences), then the user must be regarded as good at searching. A better model is: I'm good at both evaluating and searching, but the AI can access actions and observations that I cannot. For example, having additional information can allow it to evaluate better. An important special case is: the AI is connected to an external computer (Turing RL) which we can think of as an "oracle". This allows the AI to have additional information which is purely "logical". We need infra-Bayesianism to formalize this: the user has Knightian uncertainty over the oracle's outputs entangled with other beliefs about the universe. For instance, in the chess example, if I know that a move was produced by exhaustive game-tree search then I know it's a good move, even without having the skill to understand why the move is good in any more detail. Now let's examine short-term quantilization for chess. On each cycle, the AI finds a short-term strategy leading to a position that the user evaluates as good, but that the user would require luck to manage on their own. This is repeated again and again throughout the game, leading to overall play substantially superior to the user's. On the other hand, this play is not as good as the AI would achieve if it just optimized for winning at chess without any constrains. So, our AI might not be competitive with an unconstrained unaligned AI
2Charlie Steiner
Agree with the first section, though I would like to register my sentiment that although "good at selecting but missing logical facts" is a better model, it's still not one I'd want an AI to use when inferring my values. I think my point is if "turn off the stars" is not a primitive action, but is a set of states of the world that the AI would overwhelming like to go to, then the actual primitive actions will get evaluated based on how well they end up going to that goal state. And since the AI is better at evaluating than us, we're probably going there. Another way of looking at this claim is that I'm telling a story about why the safety bound on quantilizers gets worse when quantilization is iterated. Iterated quantilization has much worse bounds than quantilizing over the iterated game, which makes sense if we think of games where the AI evaluates many actions better than the human.
2Vanessa Kosoy
I think you misunderstood how the iterated quantilization works. It does not work by the AI setting a long-term goal and then charting a path towards that goal s.t. it doesn't deviate too much from the baseline over every short interval. Instead, every short-term quantilization is optimizing for the user's evaluation in the end of this short-term interval.
2Charlie Steiner
Ah. I indeed misunderstood, thanks :) I'd read "short-term quantilization" as quantilizing over short-term policies evaluated according to their expected utility. My story doesn't make sense if the AI is only trying to push up the reported value estimates (though that puts a lot of weight on these estimates).
2adamShimi
I don't understand what you mean here by quantilizing. The meaning I know is to take a random action over the top \alpha actions, on a given base distribution. But I don't see a distribution here, or even a clear ordering over actions (given that we don't have access to the utility function). I'm probably missing something obvious, but more details would really help.
4Vanessa Kosoy
The distribution is the user's policy, and the utility function for this purpose is the eventual success probability estimated by the user (as part of the timeline report), in the end of the "maneuver". More precisely, the original quantilization formalism was for the one-shot setting, but you can easily generalize it, for example I did it for MDPs.
2adamShimi
Oh, right, that makes a lot of sense. So is the general idea that we quantilize such that we're choosing in expectation an action that doesn't have corrupted utility (by intuitively having something like more than twice as many actions in the quantilization than we expect to be corrupted), so that we guarantee the probability of following the manipulation of the learned user report is small? I also wonder if using the user policy to sample actions isn't limiting, because then we can only take actions that the user would take. Or do you assume by default that the support of the user policy is the full action space, so every action is possible for the AI?
2Vanessa Kosoy
Yes, although you probably want much more than twice. Basically, if the probability of corruption following the user policy is ϵ and your quantilization fraction is ϕ then the AI's probability of corruption is bounded by ϵϕ. Obviously it is limiting, but this is the price of safety. Notice, however, that the quantilization strategy is only an existence proof. In principle, there might be better strategies, depending on the prior (for example, the AI might be able to exploit an assumption that the user is quasi-rational). I didn't specify the AI by quantilization, I specified it by maximizing EU subject to the Hippocratic constraint. Also, the support is not really the important part: even if the support is the full action space, some sequences of actions are possible but so unlikely that the quantilization will never follow them.
1[anonymous]
I like this because it's simple and obviously correct.  Also I can see at least one way you could implement it:    a.  Suppose the AI is 'shadowing' a human worker doing a critical task.  Say it is 'shadowing' a human physician.     b.  Each time the AI observes the same patient, it regresses between [data from the patient] and [predicted decision a 'good' physician would make, predicted outcome for the 'good' decision].  Once the physician makes a decision and communicates it, the AI regresses between [decision the physician made] and [predicted outcome for that decision].    c.  The machine also must have a confidence or this won't work. With large numbers and outright errors made by the physician, it's then possible to detect all the cases where the [decision the physician made] has a substantially worse outcome than the [predicted decision a 'good' physician would make], and when the AI has a high confidence of this [requiring many observations of similar situations] and it's time to call for a second opinion. In the long run, of course, there will be a point where the [predicted decision a 'good' physician would make] is better than the [information gain from a second human opinion] and you really would do best by firing the physician and having the AI make the decisions from then on, trusting for it to call for a second opinion when it is not confident.   (as an example, alpha go zero likely doesn't benefit from asking another master go player for a 'second opinion' when it sees the player it is advising make a bad call)

This idea was inspired by a correspondence with Adam Shimi.

It seem very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning settings: we have a set of actions , a set of observations , a policy is a mapping , a reward function is a mapping , the utility function is a time discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows , or the prior can believe that behavior not according to leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript... (read more)

3AIL
I am not sure I understand your use of C(U) in the third from last paragraph where you define goal directed intelligence. As you define C it is a complexity measure over programs P. I assume this was a typo and you mean K(U)? Or am I misunderstanding the definition of either U or C?
2Vanessa Kosoy
This is not a typo. I'm imagining that we have a program P that outputs (i) a time discount parameter γ∈Q∩[0,1), (ii) a circuit for the transition kernel of an automaton T:S×A×O→S and (iii) a circuit for a reward function r:S→Q (and, ii+iii are allowed to have a shared component to save computation time complexity). The utility function is U:(A×O)ω→R defined by U(x):=(1−γ)∞∑n=0γnr(sxn) where sx∈Sω is defined recursively by sxn+1=T(s,xn)
3AIL
Okay, I think this makes sense. The idea is trying to re-interpret the various functions in the utility function as a single function and asking about the notion of complexity on that function which combines the complexity of producing a circuit which computes that function and the complexity of the circuit itself. But just to check: is T over  S×A×O→S? I thought T in utility functions only depended on states and actions S×A→S?  Maybe I am confused by what you mean by S. I thought it was the state space, but that isn't consistent with r in your post which was defined over A×O→Q? As a follow up: defining r as depending on actions and observations instead of actions and states (which e.g. the definition in POMDP on Wikipedia) seems like it changes things.  So I'm not sure if you intended the rewards to correspond with the observations or 'underlying' states.  One more question, this one about the priors: what are they a prior over exactly? I will use the letters/terms from https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process to try to be explicit. Is the prior capturing the "set of conditional observation probabilities" (O on Wikipedia)? Or is it capturing the "set of conditional transition probabilities between states" (T on Wikipedia)? Or is it capturing a distribution over all possible T and O? Or are you imaging that T is defined with U (and is non-random) and O is defined within the prior?  I ask because the term DKL(ζ0||ζ) will be positive infinity if ζ is zero for any value where ζ0 is non-zero. Which makes the interpretation that it is either O or T directly pretty strange (for example, in the case where there are two states s1 and s2 and two obersvations o1 and o2 an O where P(si|oi)=1 and P(si|oj)=0 if i≠j would have a KL divergence of infinity from the ζ0 if ζ0 had non-zero probability on P(s1|o2)). So, I assume this is a prior over what the conditional observation matrices might be. I am assuming that your comment above implies tha
2Vanessa Kosoy
I'm not entirely sure what you mean by the state space. S is a state space associated specifically with the utility function. It has nothing to do with the state space of the environment. The reward function in the OP is (A×O)∗→R, not A×O→R. I slightly abused notation by defining r:S→Q in the parent comment. Let's say it's r′:S→Q and r is defined by using T to translate the history to the (last) state and then applying r′. The prior is just an environment i.e. a partial mapping ζ:(A×O)∗→ΔO defined on every history to which it doesn't itself assign probability 0. The expression DKL(ξ||ζ) means that we consider all possible ways to choose a Polish space X, probability distributions μ,ν∈ΔX and a mapping f:X×(A×O)∗→ΔO s.t. ζ=Eμ[f] and ξ=Eν[f] (where the expected value is defined using the Bayes law and not pointwise, see also the definition of "instrumental states" here), and take the minimum over all of them of DKL(ν||μ).
3Vanessa Kosoy
Actually, as opposed to what I claimed before, we don't need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself. Given g>0, we define that "π has (unbounded) goal-directed intelligence (at least) g" when there is a prior ζ and utility function U s.t. for any policy π′, if Eζπ′[U]≥Eζπ[U] then K(π′)≥DKL(ζ0||ζ)+K(U)+g. Here, ζ0 is the Solomonoff prior and K is Kolmogorov complexity. When g=+∞ (i.e. no computable policy can match the expected utility of π; in particular, this implies π is optimal since any policy can be approximated by a computable policy), we say that π is "perfectly (unbounded) goal-directed". Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable "hell" state. On the other hand, goal-directed intelligence differs only by O(1) between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn't depend on the UTM at all. I think that it's also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any ϵ>0 there is g s.t. the probability to get intelligence at least g is smaller than ϵ. Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by O(1), and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vec
5Vanessa Kosoy
Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong's thesis about the impossibility to deduce preferences from behavior alone. 1. Conjecture. Informally: It is unlikely to produce intelligence by chance. Formally: Denote Π the space of deterministic policies, and consider some μ∈ΔΠ. Suppose μ is equivalent to a stochastic policy π∗. Then, Eπ∼μ[g(π)]=O(C(π∗)). 2. Find an "intelligence hierarchy theorem". That is, find an increasing sequence {gn} s.t. for every n, there is a policy with goal-directed intelligence in (gn,gn+1) (no more and no less). 3. What is the computational complexity of evaluating g given (i) oracle access to the policy or (ii) description of the policy as a program or automaton? 4. What is the computational complexity of producing a policy with given g? 5. Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every (U,ζ) with C(U)<∞ and DKL(ζ0||ζ)<∞, and every ϵ>0, there exists g∈(0,∞) s.t. for every policy π with intelligence at least g w.r.t. (U,ζ), and every (~U,~ζ) s.t. π has intelligence at least g w.r.t. them, any optimal policies π∗,~π∗ for (U,ζ) and (~U,~ζ) respectively satisfy Eζ~π∗[U]≥Eζπ∗[U]−ϵ.
2Davidmanheim
re: #5, that doesn't seem to claim that we can infer U given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn't some U1≠U2 such that π∗(U1,ζ)=π∗(U2,ζ). (And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and basic result in the decision theory / economics / philosophy literature.)
2Vanessa Kosoy
You misunderstand the intent. We're talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown U, but producing some behavior that optimizes the unknown U. Ofc if the policy you're observing is optimal then it's trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like "the policy you're observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity." (Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.) I am referring to this and related work by Armstrong.
2David Scott Krueger (formerly: capybaralet)
Apologies, I didn't take the time to understand all of this yet, but I have a basic question you might have an answer to... We know how to map (deterministic) policies to reward functions using the construction at the bottom of page 6 of the reward modelling agenda (https://arxiv.org/abs/1811.07871v1): the agent is rewarded only if it has so far done exactly what the policy would do.  I think of this as a wrapper function (https://en.wikipedia.org/wiki/Wrapper_function). It seems like this means that, for any policy, we can represent it as optimizing reward with only the minimal overhead in description/computational complexity of the wrapper. So... * Do you think this analysis is correct?  Or what is it missing?  (maybe the assumption that the policy is deterministic is significant?  This turns out to be the case for Orseau et al.'s "Agents and Devices" approach, I think https://arxiv.org/abs/1805.12387). * Are you trying to get around this somehow?  Or are you fine with this minimal overhead being used to distinguish goal-directed from non-goal directed policies?
3Vanessa Kosoy
My framework discards such contrived reward functions because it penalizes for the complexity of the reward function. In the construction you describe, we have C(U)≈C(π). This corresponds to g≈0 (no/low intelligence). On the other hand, policies with g≫0 (high intelligence) have the property that C(π)≫C(U) for the U which "justifies" this g. In other words, your "minimal" overhead is very large from my point of view: to be acceptable, the "overhead" should be substantially negative.
2David Scott Krueger (formerly: capybaralet)
I think the construction gives us $C(\pi) \leq C(U) + e$ for a small constant $e$ (representing the wrapper).  It seems like any compression you can apply to the reward function can be translated to the policy via the wrapper.  So then you would never have $C(\pi) >> C(U)$.  What am I missing/misunderstanding?
2Vanessa Kosoy
For the contrived reward function you suggested, we would never have C(π)≫C(U). But for other reward functions, it is possible that C(π)≫C(U). Which is exactly why this framework rejects the contrived reward function in favor of those other reward functions. And also why this framework considers some policies unintelligent (despite the availability of the contrived reward function) and other policies intelligent.
[-]Vanessa Kosoy*Ω12196

The recent success of AlphaProof updates me in the direction of "working on AI proof assistants is a good way to reduce AI risk". If these assistants become good enough, it will supercharge agent foundations research[1] and might make the difference between success and failure. It's especially appealing that it leverages AI capability advancement for the purpose of AI alignment in a relatively[2] safe way, thereby the deeper we go into the danger zone the greater the positive impact[3].

EDIT: To be clear, I'm not saying that working on proof assistants in e.g. DeepMind is net positive. I'm saying that a hypothetical safety-conscious project aiming to create proof assistants for agent foundations research, that neither leaks dangerous knowledge nor repurposes it for other goals, would be net positive.

  1. ^

    Of course, agent foundation research doesn't reduce to solving formally stated mathematical problems. A lot of it is searching for the right formalizations. However, obtaining proofs is a critical arc in the loop.

  2. ^

    There are some ways for proof assistants to feed back into capability research, but these effects seem weaker: at present capability advancement is not primarily dri

... (read more)
[-]Leon LangΩ91815

I think the main way that proof assistant research feeds into capabilies research is not through the assistants themselves, but by the transfer of the proof assistant research to creating foundation models with better reasoning capabilities. I think researching better proof assistants can shorten timelines.

  • See also Demis' Hassabis recent tweet. Admittedly, it's unclear whether he refers to AlphaProof itself being accessible from Gemini, or the research into AlphaProof feeding into improvements of Gemini.
  • See also an important paragraph in the blogpost for AlphaProof: "As part of our IMO work, we also experimented with a natural language reasoning system, built upon Gemini and our latest research to enable advanced problem-solving skills. This system doesn’t require the problems to be translated into a formal language and could be combined with other AI systems. We also tested this approach on this year’s IMO problems and the results showed great promise."
8Vanessa Kosoy
I can see that research into proof assistants might lead to better techniques for combining foundation models with RL. Is there anything more specific that you imagine? Outside of math there are very different problems because there is no easy to way to synthetically generate a lot of labeled data (as opposed to formally verifiable proofs). While some AI techniques developed for proof assistants might be transferable to other problems, I can easily imagine a responsible actor[1] producing a net positive. Don't disclose your techniques (except maybe very judiciously), don't open your source, maintain information security, maybe only provide access as a service, maybe only provide access to select people/organizations. 1. ^ To be clear, I don't consider Alphabet to be a responsible actor.
2Leon Lang
Not much more specific! I guess from a certain level of capabilities onward, one could create labels with foundation models that evaluate reasoning steps. This is much more fuzzy than math, but I still guess a person who created a groundbreaking proof assistant would be extremely valuable for any effort that tries to make foundation models reason reliably. And if they’d work at a company like google, then I think their ideas would likely diffuse even if they didn’t want to work on foundation models. Thanks for your details on how someone could act responsibly in this space! That makes sense. I think one caveat is that proof assistant research might need enormous amounts of compute, and so it’s unclear how to work on it productively outside of a company where the ideas would likely diffuse.
1Bogdan Ionut Cirstea
There seems to be some transfer though between math or code capabilities (for which synthetic data can often be easily created and verified) and broader agentic (LLM) capabilities, e.g. https://x.com/YangjunR/status/1793681237275820254/photo/2. 
1Bogdan Ionut Cirstea
I expect even more of the agent foundations workflow could be safely automated / strongly-augmented - including e.g. research ideation and literature reviews, see e.g. ResearchAgent: Iterative Research Idea Generation over Scientific Literature with Large Language Models, Acceleron: A Tool to Accelerate Research Ideation, LitLLM: A Toolkit for Scientific Literature Review.
2Vanessa Kosoy
I'm skeptical about research ideation, but literature reviews, yes, I can see that.

A thought inspired by this thread. Maybe we should have a standard template for a code of conduct for organizations, that we will encourage all rational-sphere and EA orgs to endorse. This template would include, never making people sign non-disparagement agreements (and maybe also forbidding other questionable practices that surfaced in recent scandals). Organizations would be encouraged to create their own codes based on the template and commit to them publicly (and maybe even in some legally binding manner). This flexibility means we don't need a 100% consensus about what has to be in the code, but also if e.g. a particular org decides to remove a particular clause, that will be publicly visible and salient.

2Viliam
Codes created by organizations can simply avoid topics that are sensitive for them, or describe them in nebulous ways. You can probably imagine a code written by a bad organization that seems nice and is technically followed by the bad organization, mostly because it contains lots of applause lights but not the specific boring things. I am in favor of having one centrally created document "this is what a decent EA employment looks like". Of course it is optional for everyone. The point is to create common knowledge. Especially for young people, for whom it may be the first job ever. It's not to force everyone to follow it, but rather to show that if your employer does not follow it, then it is not normal, if you complain you are sane, and you can do better (while remaining in the EA area). As I imagine it, most of it wouldn't even be specific for EA, but rather the common sense that inexperienced people may miss. Such as "you are entitled to a salary, paid in cash, where the specific number is written in the contract". To prevent employers from saying things like: "you know, this is how it works in for-profit companies, but EAs are different".
7Vanessa Kosoy
If a particular code doesn't talk about e.g. non-disparagement agreements, or talks about them in some confusing, unclear way, then people will notice. The point of having a template is precisely drawing attention to what is expected to be there (in particular when it's not there). Also, I think we should really strive to be better than for-profit companies (see non-disparagement agreements again).

Epistemic status: Leaning heavily into inside view, throwing humility to the winds.

Imagine TAI is magically not coming (CDT-style counterfactual[1]). Then, the most notable-in-hindsight feature of modern times might be the budding of mathematical metaphysics (Solomonoff induction, AIXI, Yudkowsky's "computationalist metaphilosophy"[2], UDT, infra-Bayesianism...) Perhaps, this will lead to an "epistemic revolution" comparable only with the scientific revolution in magnitude. It will revolutionize our understanding of the scientific method (probably solving the interpretation of quantum mechanics[3], maybe quantum gravity, maybe boosting the soft sciences). It will solve a whole range of philosophical questions, some of which humanity was struggling with for centuries (free will, metaethics, consciousness, anthropics...)

But, the philosophical implications of the previous epistemic revolution were not so comforting (atheism, materialism, the cosmic insignificance of human life)[4]. Similarly, the revelations of this revolution might be terrifying[5]. In this case, it remains to be seen which will seem justified in hindsight: the Litany of Gendlin, or the Lovecraftian notion that some ... (read more)

1philip_b
I'm not sure what you mean by CDT- and EDT-style counterfactuals. I have some guesses but please clarify. I think EDT-style counterfactual means, assuming I am a bayesian reasoner, just conditioning on the event "TAI won't come", so it's thinking about the distribution P(O | TAI won't come). One could think that the CDT-counterfactual you're considering means thinking about the distribution P(O | do(TAI doesn't come)) where do is the do operator from Judea Pearl's do calculus for causality. In simple words, this means that we consider the world just like ours but whenever someone tries to launch a TAI, god's intervention (that doesn't make sense together with everything we know about physics) prevents it from working. But I think this is not what you mean. My best guess of what counterfactual you mean is as follows. Among all possible sets laws of physics (or, alternatively, Turing machines running which leads to existence of physical realities), you guess that there exists a set of laws that produces a physical reality where there will appear a civilization approximately (but not exactly) like hours and they'll have a 21-st century approximately like hours, but under their physical laws there won't be TAI. And you want to analyze what's going to happen with that civilization.
-1MackGopherSena
[edited]
4Viliam
What do you mean by "exact opposite reasons"? To me, it seems like continuation of the same trend of humiliating the human ego: * you are not going to live forever * yes, you are mere atoms * your planet is not the center of the universe * even your sun is not special * your species is related to the other species that you consider inferior * instead of being logical, your mind is a set of short-sighted agents fighting each other Followed by: * even your reality is not special * your civilization is too stupid to stop doing the thing(s) that will predictably kill all of you

Probably not too original but I haven't seen it clearly written anywhere.

There are several ways to amplify imitators with different safety-performance tradeoffs. This is something to consider when designing IDA-type solutions.

Amplifying by objective time: The AI is predicting what the user(s) will output after thinking about a problem for a long time. This method is the strongest, but also the least safe. It is the least safe because malign AI might exist in the future, which affects the prediction, which creates an attack vector for future malign AI to infiltrate the present world. We can try to defend by adding a button for "malign AI is attacking", but that still leaves us open to surprise takeovers in which there is no chance to press the button.

Amplifying by subjective time: The AI is predicting what the user(s) will output after thinking about a problem for a short time, where in the beginning they are given the output of a similar process that ran for one iteration less. So, this simulates a "groundhog day" scenario where the humans wake up in the same objective time period over and over without memory of the previous iterations but with a written legacy. This is weaker than... (read more)

3Vanessa Kosoy
More observations about this attack vector ("attack from counterfactuals"). I focus on "amplifying by subjective time". * The harder the takeoff the more dangerous this attack vector: During every simulation cycle, ability to defend against simulated malign AI depends on the power of the defense system in the beginning of the cycle[1]. On the other hand, the capability of the attacker depends on its power in the end of the cycle. Therefore, if power grows very fast this is bad news for the defender. On the other hand, if power grows very slowly, the defender wins because defense is probably easier. * Inner control of anchor makes system safer: Given a fixed objective time anchor, the more times the simulation goes through a cycle, the higher its exposure to the threat, since improving the defense system takes time even if you have all the knowledge. Therefore, we want to give the simulated user the ability to move the objective time anchor forward in time, in order to benefit from improvements in the defense system. * Additional information about the external world makes system safer: Suppose our subjective probability of a malign AI successfully attacking during a single simulation cycle is 0.1%. Then performing 1000 cycles superficially yields a failure probability of about 37%. However, this is only the case if attacks on different cycles are independent random variables. If the AI knows as much about the possibility of such an attack as we do, this is a justified assumption. However, if the AI knows more, then, from our subjective perspective, these variables are positively correlated. In particular, if the AI knows with certainty whether a successful attack would occur or not the failure probability over 1000 cycles goes back to 0.1%. Therefore, it is beneficial to give the AI extra information about the world, just so it can estimate the probability of attack with more confidence. ---------------------------------------- 1. More precisely, it is somewha
2TurnTrout
I think this would make a good top-level post. I have the feeling I’ll want to link to it later.
2Vanessa Kosoy
I retracted part of that, see the edit.

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.

A possible counterargument is, we don't need to depart far from Bayesianis

... (read more)
3Dagon
Bayeseans are allowed to understand that there are agents with better estimates than they have. And that being offered a bet _IS_ evidence that the other agent THINKS they have an advantage. Randomization (aka "mixed strategy") is well-understood as the rational move in games where opponents are predicting your choices. I have read nothing that would even hint that it's unavailable to Bayesean agents. The relevant probability (updated per Bayes's Rule) would be "is my counterpart trying to minimize my payout based on my choices". edit: I realize you may be using a different definition of "bayeseanism" than I am. I'm thinking humans striving for rational choices, which perforce includes the knowledge of incomplete computation and imperfect knowledge. Naive agents can be imagined that don't have this complexity. Those guys are stuck, and Omega's gonna pwn them.
4Matt Goldenberg
It feels like there's better words for this like rationality, whereas bayesianism is a more specific philosophy about how best to represent and update beliefs.
2Pattern
And here I thought the reason was going to be that Bayesianism doesn't appear to include the cost of computation. (Thus, the usual dutch book arguments should be adjusted so that "optimal betting" does not leave one worse off for having payed, say, an oracle, too much for computation.)

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ

... (read more)
2Vanessa Kosoy
We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a fixed size finite portion of the their opponents' histories. This should lead to superrationality for the same reasons I discussed before. More generally, we can probably allow each agent to submit a finite state automaton of limited size, s.t. the opponent history is processed by the automaton and the result becomes known to the agent. What is unclear about this is how to define an analogous setting based on source code introspection. While arguably seeing the entire history is equivalent to seeing the entire source code, seeing part of the history, or processing the history through a finite state automaton, might be equivalent to some limited access to source code, but I don't know to define this limitation. EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.
2Vanessa Kosoy
Instead of postulating access to a portion of the history or some kind of limited access to the opponent's source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size O(log11−γ). This is because something like "counting pieces of evidence" should be sufficient. For example, if consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. There question is, does assuming O(log11−γ) memory (or whatever is needed for learning) is enough to reach superrationality.
1Gurkenglas
What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.
2Vanessa Kosoy
For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code. The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can't get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size O(log11−γ)?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called "superrationality".
2Vanessa Kosoy
In the previous "population game" setting, we assumed all players are "born" at the same time and learn synchronously, so that they always play against players of the same "age" (history length). Instead, we can consider a "mortal population game" setting where each player has a probability 1−γ to die on every round, and new players are born to replenish the dead. So, if the size of the population is N (we always consider the "thermodynamic" N→∞ limit), N(1−γ) players die and the same number of players are born on every round. Each player's utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have different mortality=discount rate.) Crucially, we group the players into games randomly, independent of age. As before, each player type i chooses a policy πi:On→ΔAi. (We can also consider the case where players of the same type may have different policies, but let's keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which now are allowed to be of variable length: μn∈ΔO∗. For each assignment of policies to player types, we get dynamics μn+1=Tπ(μn) where Tπ:ΔO∗→ΔO∗. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems. If we consider only the age distribution, then its evolution doesn't depend on π and it always converges to the unique fixed point distribution ζ(k)=(1−γ)γk. Therefore it is natural to restrict the dynamics to the subspace of ΔO∗ that corresponds to the age distribution ζ. We denote it P. Does the dynamics have fixed points? O∗ can be regarded as a subspace of (O⊔{⊥})ω. The latter is compact (in the product topology) by Tychonoff's theorem and Polish, but O∗ is not closed. So, w.r.t. the weak topology on probability measure spaces, Δ(O⊔{⊥})ω is also

Here's a question inspired by thinking about Turing RL, and trying to understand what kind of "beliefs about computations" should we expect the agent to acquire.

Does mathematics have finite information content?

First, let's focus on computable mathematics. At first glance, the answer seems obviously "no": because of the halting problem, there's no algorithm (i.e. a Turing machine that always terminates) which can predict the result of every computation. Therefore, you can keep learning new facts about results of computations forever. BUT, maybe most of those new facts are essentially random noise, rather than "meaningful" information?

Is there a difference of principle between "noise" and "meaningful content"? It is not obvious, but the answer is "yes": in algorithmic statistics there is the notion of "sophistication" which measures how much "non-random" information is contained in some data. In our setting, the question can be operationalized as follows: is it possible to have an algorithm plus an infinite sequence of bits , s.t. is random in some formal sense (e.g. Martin-Lof) and can decide the output of any finite computation if it's also given access to ?

The answer to th... (read more)

4AlexMennen
Wikipedia claims that every sequence is Turing reducible to a random one, giving a positive answer to the non-resource-bounded version of any question of this form. There might be a resource-bounded version of this result as well, but I'm not sure.

Epistemic status: no claims to novelty, just (possibly) useful terminology.

[EDIT: I increased all the class numbers by 1 in order to admit a new definition of "class I", see child comment.]

I propose a classification on AI systems based on the size of the space of attack vectors. This classification can be applied in two ways: as referring to the attack vectors a priori relevant to the given architectural type, or as referring to the attack vectors that were not mitigated in the specific design. We can call the former the "potential" class and the latter the "effective" class of the given system. In this view, the problem of alignment is designing potential class V (or at least IV) systems are that effectively class 0 (or at least I-II).

Class II: Systems that only ever receive synthetic data that has nothing to do with the real world

Examples:

  • AI that is trained to learn Go by self-play
  • AI that is trained to prove random mathematical statements
  • AI that is trained to make rapid predictions of future cell states in the game of life for random initial conditions
  • AI that is trained to find regularities in sequences corresponding to random programs on some natural universal Turing machin
... (read more)
3Vanessa Kosoy
The idea comes from this comment of Eliezer. Class II or higher systems might admit an attack vector by daemons that infer the universe from the agent's source code. That is, we can imagine a malign hypothesis that makes a treacherous turn after observing enough past actions to infer information about the system's own source code and infer the physical universe from that. (For example, in a TRL setting it can match the actions to the output of a particular program for envelope.) Such daemons are not as powerful as malign simulation hypotheses, since their prior probability is not especially large (compared to the true hypothesis), but might still be non-negligible. Moreover, it is not clear whether the source code can realistically have enough information to enable an attack, but the opposite is not entirely obvious. To account for this I propose the designate class I systems which don't admit this attack vector. For the potential sense, it means that either (i) the system's design is too simple to enable inferring much about the physical universe, or (ii) there is no access to past actions (including opponent actions for self-play) or (iii) the label space is small, which means an attack requires making many distinct errors, and such errors are penalized quickly. And ofc it requires no direct access to the source code. We can maybe imagine an attack vector even for class I systems, if most metacosmologically plausible universes are sufficiently similar, but this is not very likely. Nevertheless, we can reserve the label class 0 for systems that explicitly rule out even such attacks.

I find it interesting to build simple toy models of the human utility function. In particular, I was thinking about the aggregation of value associated with other people. In utilitarianism this question is known as "population ethics" and is infamously plagued with paradoxes. However, I believe that is the result of trying to be impartial. Humans are very partial and this allows coherent ways of aggregation. Here is my toy model:

Let Alice be our viewpoint human. Consider all social interactions Alice has, categorized by some types or properties, and assign a numerical weight to each type of interaction. Let be the weight of the interaction person had with person at time (if there was no interaction at this time then ). Then, we can define Alice's affinity to Bob as

Here is some constant. Ofc can be replaced by many other functions.

Now, we can the define the social distance of Alice to Bob as