Vanessa Kosoy's Shortform

by Vanessa Kosoy, 18th Oct 2019, 92 comments

An AI progress scenario which seems possible and which I haven't seen discussed: an imitation plateau.

The key observation is, imitation learning algorithms[1] might produce close-to-human-level intelligence even if they are missing important ingredients of general intelligence that humans have. That's because imitation might be a qualitatively easier task than general RL. For example, given enough computing power, a human mind becomes realizable from the perspective of the learning algorithm, while the world-at-large is still far from realizable. So, an algorithm that only performs well in the realizable setting can learn to imitate a human mind, and thereby indirectly produce reasoning that works in non-realizable settings as well. Of course, literally emulating a human brain is still computationally formidable, but there might be middle scenarios where the learning algorithm is able to produce a good-enough-in-practice imitation of systems that are not too complex.

This opens the possibility that close-to-human-level AI will arrive while we're still missing key algorithmic insights to produce general intelligence directly. Such AI would not be easily scalable to superhuman. Nevert...

2 Vladimir_Nesov 4mo: This seems similar to gaining uploads prior to AGI, and opens up all those superorg upload-city amplification/distillation constructions which should get past human level shortly after. In other words, the limitations of the dataset can be solved by amplification as soon as the AIs are good enough to be used as building blocks for meaningful amplification, and something human-level-ish seems good enough for that. Maybe even GPT-n is good enough for that.
2 Vanessa Kosoy 4mo: That is similar to gaining uploads (borrowing terminology from Egan, we can call them "sideloads"), but it's not obvious amplification/distillation will work. In the model based on realizability, the distillation step can fail because the system you're distilling is too computationally complex (hence, too unrealizable). You can deal with it by upscaling the compute of the learning algorithm, but that's not better than plain speedup.
2 Vladimir_Nesov 4mo: To me this seems to be essentially another limitation of the human Internet archive dataset: reasoning is presented in an opaque way (most slow/deliberative thoughts are not in the dataset), so it's necessary to do a lot of guesswork to figure out how it works. A better dataset both explains and summarizes the reasoning (not to mention gets rid of the incoherent nonsense, but even GPT-3 can do that to an extent by roleplaying Feynman). Any algorithm can be represented by a habit of thought (Turing machine style if you must), and if those are in the dataset, they can be learned. The habits of thought that are simple enough to summarize get summarized and end up requiring fewer steps. My guess is that the human faculties needed for AGI can be both represented by sequences of thoughts (probably just text, stream of consciousness style) and easily learned with current ML. So right now the main obstruction is that it's not feasible to build a dataset with those faculties represented explicitly that's good enough and large enough for current sample-inefficient ML to grok. More compute in the learning algorithm is only relevant for this to the extent that we get a better dataset generator that can work on the tasks before it more reliably.
2 Vanessa Kosoy 4mo: I don't see any strong argument why this path will produce superintelligence. You can have a stream of thought that cannot be accelerated without investing a proportional amount of compute, while a completely different algorithm would produce a far superior "stream of thought". In particular, such an approach cannot differentiate between features of the stream of thought that are important (meaning that they advance towards the goal) and features of the stream of thought that are unimportant (e.g. different ways to phrase the same idea). This forces you to solve a task that is potentially much more difficult than just achieving the goal.
2 Vladimir_Nesov 4mo: I was arguing that near human level babblers (including the imitation plateau you were talking about) should quickly lead to human level AGIs by amplification via stream of consciousness datasets, which doesn't pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it's not necessary). And amplified babblers should be stronger than vanilla uploads (at least implausibly well-educated, well-coordinated, high IQ humans). For your scenario to be stable, it needs to be impossible (in the near term) to run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain less effective than very high IQ humans. Otherwise you get acceleration of technological progress, including ML. So my point is that feasibility of imitation plateau depends on absence of compute overhang, not on ML failing to capture some of the ingredients of human general intelligence.
2 Vanessa Kosoy 4mo: The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas success in capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an "intrinsic" AGI might lead to the classical "foom" scenario. EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.
2 Vladimir_Nesov 4mo: I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are developed (because generation of stream of consciousness datasets is a special case of such a job). So the key things to make imitation plateau last for years are either sideloads requiring more compute than it looks like (to me) they require, or amplification of competent babblers into similarly competent AGIs being a hard problem that takes a long time to solve.
4 Vanessa Kosoy 4mo: Another thing that might happen is a data bottleneck. Maybe there will be a good enough dataset to produce a sideload that simulates an "average" person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are far fewer competent AI researchers than people in general). Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That's because, unless you're training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your "episodes". Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length. In such a case, we might discover that many tasks can be automated by sideloads with short coherence time, but AI research might require substantially longer coherence times. And simulating progress requires, by design, going off-distribution along certain dimensions, which might make things worse.
2 avturchin 4mo: Another way to describe the same (or similar) plateau: we could think about GPT-n as a GLUT with approximation between prerecorded answers. It can produce intelligent products similar to the ones which were created by humans in the past and are presented in its training dataset, but not above the human intelligence level, as there are no superintelligent examples in the dataset.

I have repeatedly argued for a departure from pure Bayesianism that I call "quasi-Bayesianism". But, coming from a LessWrong-ish background, it might be hard to wrap your head around the fact that Bayesianism is somehow deficient. So, here's another way to understand it, using Bayesianism's own favorite trick: Dutch booking!

Consider a Bayesian agent Alice. Since Alice is Bayesian, ey never randomize: ey just follow a Bayes-optimal policy for eir prior, and such a policy can always be chosen to be deterministic. Moreover, Alice always accepts a bet if ey can choose which side of the bet to take: indeed, at least one side of any bet has non-negative expected utility. Now, Alice meets Omega. Omega is very smart so ey know more than Alice and moreover ey can predict Alice. Omega offers Alice a series of bets. The bets are specifically chosen by Omega s.t. Alice would pick the wrong side of each one. Alice takes the bets and loses, indefinitely. Alice cannot escape eir predicament: ey might know, in some sense, that Omega is cheating em, but there is no way within the Bayesian paradigm to justify turning down the bets.
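To make the argument above concrete, here is a toy simulation. It is only a sketch under my own assumptions, not a construction taken from the post: bets are even-odds wagers on binary propositions, Alice keeps a Beta(1, 1) posterior over "the offered proposition is true" and updates after every round, and Omega only offers propositions on which Alice's predicted side loses. The function names are hypothetical.

```python
import random

def bayes_side(true_count: int, false_count: int) -> bool:
    """Deterministic Bayes-optimal side under a Beta(1, 1) prior.

    Returns True (bet on "true") iff the posterior P(true) is at least 1/2.
    """
    p_true = (true_count + 1) / (true_count + false_count + 2)
    return p_true >= 0.5

def play(rounds: int, randomize: bool = False, seed: int = 0) -> int:
    """Return Alice's net winnings after `rounds` unit bets against Omega."""
    rng = random.Random(seed)
    true_count = 0
    false_count = 0
    wealth = 0
    for _ in range(rounds):
        # Omega predicts the deterministic Bayesian choice ...
        predicted = bayes_side(true_count, false_count)
        # ... and only offers a proposition on which that choice is wrong.
        truth = not predicted
        # A randomizing agent flips a coin Omega cannot predict (toy assumption).
        side = (rng.random() < 0.5) if randomize else predicted
        wealth += 1 if side == truth else -1
        # Alice updates on the revealed truth value, which does not save her.
        if truth:
            true_count += 1
        else:
            false_count += 1
    return wealth
```

In this toy model `play(100)` loses every single round even though Alice updates her beliefs, while `play(100, randomize=True)` breaks even in expectation, because Omega can predict the deterministic policy but not the coin.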

A possible counterargument is, we don't need to depart far from Bayesianis

...
3 Dagon 10mo: Bayesians are allowed to understand that there are agents with better estimates than they have. And that being offered a bet _IS_ evidence that the other agent THINKS they have an advantage. Randomization (aka "mixed strategy") is well-understood as the rational move in games where opponents are predicting your choices. I have read nothing that would even hint that it's unavailable to Bayesian agents. The relevant probability (updated per Bayes's Rule) would be "is my counterpart trying to minimize my payout based on my choices". edit: I realize you may be using a different definition of "Bayesianism" than I am. I'm thinking of humans striving for rational choices, which perforce includes the knowledge of incomplete computation and imperfect knowledge. Naive agents can be imagined that don't have this complexity. Those guys are stuck, and Omega's gonna pwn them.
4 Matt Goldenberg 10mo: It feels like there are better words for this, like rationality, whereas Bayesianism is a more specific philosophy about how best to represent and update beliefs.
2 Pattern 10mo: And here I thought the reason was going to be that Bayesianism doesn't appear to include the cost of computation. (Thus, the usual Dutch book arguments should be adjusted so that "optimal betting" does not leave one worse off for having paid, say, an oracle, too much for computation.)

Game theory is widely considered the correct description of rational behavior in multi-agent scenarios. However, real world agents have to learn, whereas game theory assumes perfect knowledge, which can be only achieved in the limit at best. Bridging this gap requires using multi-agent learning theory to justify game theory, a problem that is mostly open (but some results exist). In particular, we would like to prove that learning agents converge to game theoretic solutions such as Nash equilibria (putting superrationality aside: I think that superrationality should manifest via modifying the game rather than abandoning the notion of Nash equilibrium).
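As a small illustration of the kind of convergence result one hopes for, here is a sketch of two learners playing repeated matching pennies. The choice of learner (Hedge, also known as multiplicative weights), the game, the learning rate and the starting weights are all my assumptions and not something the post specifies. For two-player zero-sum games it is a standard fact that the time-averaged strategies of no-regret learners approach a Nash equilibrium, even though the round-by-round strategies may keep cycling.

```python
import math

# Row player's payoffs in matching pennies; the column player gets the negation.
PAYOFF = [[1.0, -1.0], [-1.0, 1.0]]

def softmax(log_weights):
    """Numerically stable normalization of log-weights into probabilities."""
    top = max(log_weights)
    exps = [math.exp(w - top) for w in log_weights]
    total = sum(exps)
    return [e / total for e in exps]

def hedge_self_play(rounds=20000, eta=0.05):
    """Run Hedge for both players and return their time-averaged strategies."""
    # Start the row player away from equilibrium so the dynamics are non-trivial.
    row_logw = [math.log(0.9), math.log(0.1)]
    col_logw = [0.0, 0.0]
    row_avg = [0.0, 0.0]
    col_avg = [0.0, 0.0]
    for _ in range(rounds):
        p = softmax(row_logw)
        q = softmax(col_logw)
        # Expected gain of each pure action against the opponent's current mix.
        row_gain = [sum(PAYOFF[i][j] * q[j] for j in range(2)) for i in range(2)]
        col_gain = [-sum(p[i] * PAYOFF[i][j] for i in range(2)) for j in range(2)]
        for k in range(2):
            row_logw[k] += eta * row_gain[k]
            col_logw[k] += eta * col_gain[k]
            row_avg[k] += p[k] / rounds
            col_avg[k] += q[k] / rounds
    return row_avg, col_avg
```

With these settings both averaged strategies end up close to the unique equilibrium (1/2, 1/2). The guarantee is specific to zero-sum games; for general games, convergence of learning dynamics to Nash equilibria is exactly the mostly open problem described above.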

The simplest setup in (non-cooperative) game theory is normal form games. Learning happens by accumulating evidence over time, so a normal form game is not, in itself, a meaningful setting for learning. One way to solve this is replacing the normal form game by a repeated version. This, however, requires deciding on a time discount. For sufficiently steep time discounts, the repeated game is essentially equivalent to the normal form game (from the perspective of game theory). However, the full-fledged theory of intelligent agents requ

...
2 Vanessa Kosoy 1y: We can modify the population game setting to study superrationality. In order to do this, we can allow the agents to see a fixed size finite portion of their opponents' histories. This should lead to superrationality for the same reasons I discussed [https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#3yw2udyFfvnRC8Btr] before [https://agentfoundations.org/item?id=507]. More generally, we can probably allow each agent to submit a finite state automaton of limited size, s.t. the opponent history is processed by the automaton and the result becomes known to the agent. What is unclear about this is how to define an analogous setting based on source code introspection. While arguably seeing the entire history is equivalent to seeing the entire source code, seeing part of the history, or processing the history through a finite state automaton, might be equivalent to some limited access to source code, but I don't know how to define this limitation. EDIT: Actually, the obvious analogue is processing the source code through a finite state automaton.
2 Vanessa Kosoy 1y: Instead of postulating access to a portion of the history or some kind of limited access to the opponent's source code, we can consider agents with full access to history / source code but finite memory. The problem is, an agent with fixed memory size usually cannot have regret going to zero, since it cannot store probabilities with arbitrary precision. However, it seems plausible that we can usually get learning with memory of size $O(\log \frac{1}{1-\gamma})$. This is because something like "counting pieces of evidence" should be sufficient. For example, if we consider finite MDPs, then it is enough to remember how many transitions of each type occurred to encode the belief state. The question is whether assuming $O(\log \frac{1}{1-\gamma})$ memory (or whatever is needed for learning) is enough to reach superrationality.
1 Gurkenglas 1y: What do you mean by equivalent? The entire history doesn't say what the opponent will do later or would do against other agents, and the source code may not allow you to prove what the agent does if it involves statements that are true but not provable.
2 Vanessa Kosoy 1y: For a fixed policy, the history is the only thing you need to know in order to simulate the agent on a given round. In this sense, seeing the history is equivalent to seeing the source code. The claim is: In settings where the agent has unlimited memory and sees the entire history or source code, you can't get good guarantees (as in the folk theorem for repeated games). On the other hand, in settings where the agent sees part of the history, or is constrained to have finite memory (possibly of size $O(\log \frac{1}{1-\gamma})$?), you can (maybe?) prove convergence to Pareto efficient outcomes or some other strong desideratum that deserves to be called "superrationality".
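The logarithmic memory estimate mentioned in this thread can be illustrated with a back-of-the-envelope sketch. The assumptions here are mine: the agent stores one counter per transition type of a finite MDP, and each counter only needs to reach roughly the effective horizon 1/(1−γ). The helper names are hypothetical.

```python
import math
from collections import Counter

def counter_memory_bits(gamma: float, n_transition_types: int) -> int:
    """Bits needed to store one counter per transition type.

    Each counter is assumed to range over 0..H, where H is the effective
    horizon 1 / (1 - gamma) rounded up (a simplifying assumption).
    """
    horizon = math.ceil(1.0 / (1.0 - gamma))
    bits_per_counter = math.ceil(math.log2(horizon + 1))
    return n_transition_types * bits_per_counter

def transition_counts(transitions):
    """Summarize a list of (state, action, next_state) triples by their counts."""
    return Counter(transitions)
```

For example, with γ = 0.99 and 4 transition types this gives 4 counters of 7 bits each. Doubling the effective horizon adds about one bit per counter, which is the logarithmic growth the comment refers to.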
2 Vanessa Kosoy 1y: In the previous "population game" setting, we assumed all players are "born" at the same time and learn synchronously, so that they always play against players of the same "age" (history length). Instead, we can consider a "mortal population game" setting where each player has a probability $1-\gamma$ to die on every round, and new players are born to replenish the dead. So, if the size of the population is $N$ (we always consider the "thermodynamic" $N \to \infty$ limit), $N(1-\gamma)$ players die and the same number of players are born on every round. Each player's utility function is a simple sum of rewards over time, so, taking mortality into account, effectively ey have geometric time discount. (We could use age-dependent mortality rates to get different discount shapes, or allow each type of player to have different mortality=discount rate.) Crucially, we group the players into games randomly, independent of age. As before, each player type $i$ chooses a policy $\pi_i$. (We can also consider the case where players of the same type may have different policies, but let's keep it simple for now.) In the thermodynamic limit, the population is described as a distribution over histories, which now are allowed to be of variable length: $\mu_n \in \Delta\mathcal{O}^*$. For each assignment of policies to player types, we get dynamics $\mu_{n+1} = T_\pi(\mu_n)$ where $T_\pi : \Delta\mathcal{O}^* \to \Delta\mathcal{O}^*$. So, as opposed to immortal population games, mortal population games naturally give rise to dynamical systems. If we consider only the age distribution, then its evolution doesn't depend on $\pi$ and it always converges to the unique fixed point distribution $\zeta(k) = (1-\gamma)\gamma^k$. Therefore it is natural to restrict the dynamics to the subspace of $\Delta\mathcal{O}^*$ that corresponds to the age distribution $\zeta$. We denote it $P$. Do the dynamics have fixed points? $\mathcal{O}^*$ can be regarded as a subspace of $(\mathcal{O} \sqcup \{\bot\})^\omega$. The latter is compact (in the product topology) by Tychonoff's theorem and Polish, but $\mathcal{O}^*$ is not closed. So, w.r.t. the weak topology on probability measure spaces, $\Delta(\mathcal{O} \sqcup \{\bot\})^\omega$ is also comp
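The claim about the age distribution is easy to check numerically. Below is a minimal sketch: on every round a fraction 1−γ of the population is replaced by newborns of age 0, and everyone else ages by one. Truncating to finitely many age buckets, with the last bucket collecting all older ages, is my own simplification so that the state fits in a list.

```python
def step_age_distribution(dist, gamma):
    """One round of the mortal population dynamics on the age distribution.

    dist[k] is the fraction of players of age k; the final entry collects
    every age at or beyond the truncation point.
    """
    new = [0.0] * len(dist)
    new[0] = 1.0 - gamma                  # newborns replace the dead
    for k in range(len(dist) - 1):
        new[k + 1] = gamma * dist[k]      # survivors age by one round
    new[-1] += gamma * dist[-1]           # oldest bucket keeps its survivors
    return new

def iterate_age_distribution(gamma, buckets=50, rounds=500):
    """Start with everyone at age 0 and iterate the dynamics."""
    dist = [1.0] + [0.0] * (buckets - 1)
    for _ in range(rounds):
        dist = step_age_distribution(dist, gamma)
    return dist
```

Starting from a population of newborns with γ = 0.9, the entries `dist[k]` below the truncation point match (1−γ)γ^k, and the update never looks at the policies, in line with the statement that the evolution of the age distribution does not depend on π.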

Some thoughts about embedded agency.

From a learning-theoretic perspective, we can reformulate the problem of embedded agency as follows: What kind of agent, and in what conditions, can effectively plan for events after its own death? For example, Alice bequeaths eir fortune to eir children, since ey want them to be happy even when Alice emself is no longer alive. Here, "death" can be understood to include modification, since modification is effectively destroying an agent and replacing it by a different agent[1]. For example, Clippy 1.0 is an AI that values paperclips. Alice disabled Clippy 1.0 and reprogrammed it to value staples before running it again. Then, Clippy 2.0 can be considered to be a new, different agent.

First, in order to meaningfully plan for death, the agent's reward function has to be defined in terms of something different than its direct perceptions. Indeed, by definition the agent no longer perceives anything after death. Instrumental reward functions are somewhat relevant but still don't give the right object, since the reward is still tied to the agent's actions and observations. Therefore, we will consider reward functions defined in terms of some fixed ontology

...

Learning theory distinguishes between two types of settings: realizable and agnostic (non-realizable). In a realizable setting, we assume that there is a hypothesis in our hypothesis class that describes the real environment perfectly. We are then concerned with the sample complexity and computational complexity of learning the correct hypothesis. In an agnostic setting, we make no such assumption. We therefore consider the complexity of learning the best approximation of the real environment. (Or, the best reward achievable by some space of policies.)
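The distinction can be seen in a tiny offline example. This is only a sketch under my own assumptions: the hypothesis class consists of threshold classifiers on the integers 0..99, the learner is plain empirical risk minimization, and the two labelings are made up for illustration.

```python
def erm_threshold(points, labels, thresholds):
    """Empirical risk minimization over classifiers h_t(x) = (x >= t).

    Returns the best threshold and its empirical error rate.
    """
    def error(t):
        mistakes = sum((x >= t) != y for x, y in zip(points, labels))
        return mistakes / len(points)
    best = min(thresholds, key=error)
    return best, error(best)

points = list(range(100))
thresholds = list(range(0, 101, 10))

# Realizable: the labels are produced by a hypothesis in the class (t = 50).
realizable_labels = [x >= 50 for x in points]

# Agnostic: the labels mark an interval, which no single threshold can express.
agnostic_labels = [30 <= x < 70 for x in points]

realizable_fit = erm_threshold(points, realizable_labels, thresholds)
agnostic_fit = erm_threshold(points, agnostic_labels, thresholds)
```

In the realizable case the learner recovers the true threshold with zero error. In the agnostic case the best it can do is the threshold at 30, which still misclassifies the 30 points from 70 upwards, so the guarantee one can ask for is closeness to the best hypothesis in the class and not to the truth.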

In offline learning and certain varieties of online learning, the agnostic setting is well-understood. However, in more general situations it is poorly understood. The only agnostic result for long-term forecasting that I know of is Shalizi 2009; however, it relies on ergodicity assumptions that might be too strong. I know of no agnostic result for reinforcement learning.

Quasi-Bayesianism was invented to circumvent the problem. Instead of considering the agnostic setting, we consider a "quasi-realizable" setting: there might be no perfect description of the environment in the hypothesis class, but there are some incomplete descriptions. B

...

This idea was inspired by a correspondence with Adam Shimi.

It seems very interesting and important to understand to what extent a purely "behaviorist" view on goal-directed intelligence is viable. That is, given a certain behavior (policy), is it possible to tell whether the behavior is goal-directed and what are its goals, without any additional information?

Consider a general reinforcement learning setting: we have a set of actions $\mathcal{A}$, a set of observations $\mathcal{O}$, a policy $\pi$ (a mapping from histories to actions), a reward function (a mapping from histories to rewards), and a utility function that is the time-discounted sum of rewards. (Alternatively, we could use instrumental reward functions.)

The simplest attempt at defining "goal-directed intelligence" is requiring that the policy $\pi$ in question is optimal for some prior and utility function. However, this condition is vacuous: the reward function can artificially reward only behavior that follows $\pi$, or the prior can believe that behavior not according to $\pi$ leads to some terrible outcome.

The next natural attempt is bounding the description complexity of the prior and reward function, in order to avoid priors and reward functions that are "contrived". However, descript...

2 Vanessa Kosoy 8mo: Actually, as opposed to what I claimed before, we don't need computational complexity bounds for this definition to make sense. This is because the Solomonoff prior is made of computable hypotheses but is uncomputable itself. Given $g > 0$, we define that "$\pi$ has (unbounded) goal-directed intelligence (at least) $g$" when there is a prior $\zeta$ and utility function $U$ s.t. for any policy $\pi'$, if $\mathbb{E}_{\zeta\pi'}[U] \geq \mathbb{E}_{\zeta\pi}[U]$ then $K(\pi') \geq D_{KL}(\zeta_0 \| \zeta) + K(U) + g$. Here, $\zeta_0$ is the Solomonoff prior and $K$ is Kolmogorov complexity. When $g = +\infty$ (i.e. no computable policy can match the expected utility of $\pi$; in particular, this implies $\pi$ is optimal since any policy can be approximated by a computable policy), we say that $\pi$ is "perfectly (unbounded) goal-directed". Compare this notion to the Legg-Hutter intelligence measure. The LH measure depends on the choice of UTM in radical ways. In fact, for some UTMs, AIXI (which is the maximum of the LH measure) becomes computable or even really stupid. For example, it can always keep taking the same action because of the fear that taking any other action leads to an inescapable "hell" state. On the other hand, goal-directed intelligence differs only by $O(1)$ between UTMs, just like Kolmogorov complexity. A perfectly unbounded goal-directed policy has to be uncomputable, and the notion of which policies are such doesn't depend on the UTM at all. I think that it's also possible to prove that intelligence is rare, in the sense that, for any computable stochastic policy, if we regard it as a probability measure over deterministic policies, then for any $\epsilon > 0$ there is $g$ s.t. the probability to get intelligence at least $g$ is smaller than $\epsilon$. Also interesting is that, for bounded goal-directed intelligence, increasing the prices can only decrease intelligence by $O(1)$, and a policy that is perfectly goal-directed w.r.t. lower prices is also such w.r.t. higher prices (I think). In particular, a perfectly unbounded goal-directed policy is perfectly goal-directed for any price vec
2 Vanessa Kosoy 4mo: Some problems to work on regarding goal-directed intelligence. Conjecture 5 is especially important for deconfusing basic questions in alignment, as it stands in opposition to Stuart Armstrong's thesis about the impossibility of deducing preferences from behavior alone.
1. Conjecture. Informally: Intelligence is unlikely to be produced by chance. Formally: Denote $\Pi$ the space of deterministic policies, and consider some $\mu \in \Delta\Pi$. Suppose $\mu$ is equivalent to a stochastic policy $\pi^*$. Then, $\mathbb{E}_{\pi\sim\mu}[g(\pi)] = O(C(\pi^*))$.
2. Find an "intelligence hierarchy theorem". That is, find an increasing sequence $\{g_n\}$ s.t. for every $n$, there is a policy with goal-directed intelligence in $(g_n, g_{n+1})$ (no more and no less).
3. What is the computational complexity of evaluating $g$ given (i) oracle access to the policy or (ii) description of the policy as a program or automaton?
4. What is the computational complexity of producing a policy with given $g$?
5. Conjecture. Informally: Intelligent agents have well defined priors and utility functions. Formally: For every $(U,\zeta)$ with $C(U) < \infty$ and $D_{KL}(\zeta_0 \| \zeta) < \infty$, and every $\epsilon > 0$, there exists $g \in (0,\infty)$ s.t. for every policy $\pi$ with intelligence at least $g$ w.r.t. $(U,\zeta)$, and every $(\tilde{U},\tilde{\zeta})$ s.t. $\pi$ has intelligence at least $g$ w.r.t. them, any optimal policies $\pi^*, \tilde{\pi}^*$ for $(U,\zeta)$ and $(\tilde{U},\tilde{\zeta})$ respectively satisfy $\mathbb{E}_{\zeta\tilde{\pi}^*}[U] \geq \mathbb{E}_{\zeta\pi^*}[U] - \epsilon$.
2 Davidmanheim 14d: re: #5, that doesn't seem to claim that we can infer $U$ given their actions, which is what the impossibility of deducing preferences is actually claiming. That is, assuming 5, we still cannot show that there isn't some $U_1 \neq U_2$ such that $\pi^*(U_1,\zeta) = \pi^*(U_2,\zeta)$. (And as pointed out elsewhere, it isn't Stuart's thesis, it's a well known and basic result in the decision theory / economics / philosophy literature.)
2 Vanessa Kosoy 7d: You misunderstand the intent. We're talking about inverse reinforcement learning. The goal is not necessarily inferring the unknown $U$, but producing some behavior that optimizes the unknown $U$. Ofc if the policy you're observing is optimal then it's trivial to do so by following the same policy. But, using my approach we might be able to extend it into results like "the policy you're observing is optimal w.r.t. certain computational complexity, and your goal is to produce an optimal policy w.r.t. higher computational complexity." (Btw I think the formal statement I gave for 5 is false, but there might be an alternative version that works.) I am referring to this [http://papers.neurips.cc/paper/7803-occams-razor-is-insufficient-to-infer-the-preferences-of-irrational-agents] and related work by Armstrong.

Much of the orthodox LessWrongian approach to rationality (as it is expounded in Yudkowsky's Sequences and onwards) is grounded in Bayesian probability theory. However, I now realize that pure Bayesianism is wrong; the right thing is quasi-Bayesianism instead. This leads me to ask, what are the implications of quasi-Bayesianism for human rationality? What are the right replacements for (the Bayesian approach to) bets, calibration, proper scoring rules et cetera? Does quasi-Bayesianism clarify important confusing issues in regular Bayesianism such as the pro

...

In Hanson’s futarchy, the utility function of the state is determined by voting but the actual policy is determined by a prediction market. But, voting incentivizes misrepresenting your values to get a larger share of the pie. So, shouldn’t it be something like the VCG mechanism instead?
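For readers unfamiliar with it, here is a minimal sketch of the textbook VCG mechanism with the Clarke pivot rule. The setup is my assumption and not part of the post: citizens report numeric valuations for a finite list of candidate options (for example, candidate utility functions for the state), and monetary transfers are possible. The function name is hypothetical.

```python
def vcg(valuations):
    """VCG with Clarke pivot payments.

    valuations[i][a] is citizen i's reported value for option a.
    Returns the chosen option and the list of payments, one per citizen.
    """
    options = range(len(valuations[0]))

    def welfare(option, exclude=None):
        # Total reported value of an option, optionally ignoring one citizen.
        return sum(v[option] for i, v in enumerate(valuations) if i != exclude)

    chosen = max(options, key=welfare)
    payments = []
    for i in range(len(valuations)):
        # Each citizen pays the harm eir presence imposes on everyone else.
        best_without_i = max(welfare(a, exclude=i) for a in options)
        payments.append(best_without_i - welfare(chosen, exclude=i))
    return chosen, payments
```

With reports `[[10, 0], [0, 6], [0, 3]]`, option 0 wins and only the first citizen pays (9), because ey are the only one whose report changes the outcome. Since a citizen's payment depends only on the others' reports, exaggerating your own values cannot get you a larger share of the pie for free, which is the property that plain voting lacks.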

This is a preliminary description of what I dubbed Dialogic Reinforcement Learning (credit for the name goes to tumblr user @di--es---can-ic-ul-ar--es): the alignment scheme I currently find most promising.

It seems that the natural formal criterion for alignment (or at least the main criterion) is having a "subjective regret bound": that is, the AI has to converge (in the long-term planning limit, i.e. the $\gamma \to 1$ limit) to achieving optimal expected user!utility with respect to the knowledge state of the user. In order to achieve this, we need to establish a communicati

...
2 Vanessa Kosoy 5mo: I gave a talk on Dialogic Reinforcement Learning in the AI Safety Discussion Day, and there is a recording [https://drive.google.com/file/d/1zKs3uOcR32nTMJ5YNOMZkcL7R_mzi2t6/view?usp=sharing].
2 Vanessa Kosoy 5mo: A variant of Dialogic RL with improved corrigibility. Suppose that the AI's prior allows a small probability for "universe W" whose semantics are, roughly speaking, "all my assumptions are wrong, need to shut down immediately". In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down is much higher utility than anything else. Moreover, we add into the prior the assumption that the formal question "W?" is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will shut down immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but will also ensure that the AI won't arrive at believing too many futures to be corrupt and thereby avoid the imperative to shut down in response to a confirmation of W. Now, this won't help if the user only resolves to confirm W after something catastrophic already occurred, such as the AI releasing malign subagents into the wild. But, something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes in the AI on eir own initiative, which can always be too late. This method doesn't ensure safety in itself, just hardens a system that is supposed to be already close to safe. It would be nice if we could replace "shutdown" by "undo everything you did and then shutdown" but that gets us into thorny specification issues. Perhaps it's possible to tackle those issues by one of the approaches to "low impact".
2 Vanessa Kosoy 5mo: Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases, leading to shutdown. This serves as an additional "sanity test" precaution.
1 Gurkenglas 1y: This design is made of so many parts! It might benefit from a proof that it is the universal [https://en.wikipedia.org/wiki/Universal_property] answer to a formal question, which I expect to seem less overly complex.
2 Vanessa Kosoy 1y: I am not sure. AI alignment seems to touch on many different aspects of the world, and it is not obvious that it can be reduced to assumptions that are extremely simple and natural. Or, if it can be reduced that way, then it might require a theory that on some level explains human civilization, its evolution and its influence on the world (even if only on a fairly abstract level). I will share some thoughts on how the various assumptions can be reduced another step back, but proceeding to reduce all of them to a simple core seems like a challenging research programme. Most of the parts of this design can be regarded as reflecting particular assumptions we make about the user as an agent. The core idea of having a dialogue comes from modeling the user as a "linguistic agent". Such agents may be viewed as nodes in a distributed AI system, but where each node has different objectives. It is an interesting philosophical question whether this assumption is necessary for value learning. It currently seems plausible to me that only for linguistic agents "values" are truly well-defined, or at least sufficiently well-defined to extrapolate them outside the trajectory that the agent follows on its own. The need to quantilize, debate and censor infohazards comes from the assumption that the user can be manipulated (there is some small fraction of possible inputs that invalidate the usual assumptions about the user's behavior). Debate specifically might be possible to justify by some kind of Bayesian framework where every argument is a piece of evidence, and providing biased arguments is like providing selective evidence. The need to deal with "incoherent" answers and the commitment mechanism comes from the assumption that the user has limited access to its own knowledge state (including its own reward function). Perhaps we can formalize it further by modeling the user as a learning algorithm with some intrinsic source of information. Perhaps we can even explain why such agent

In the past I considered the learning-theoretic approach to AI theory as somewhat opposed to the formal logic approach popular in MIRI (see also discussion):

  • Learning theory starts from formulating natural desiderata for agents, whereas "logic-AI" usually starts from postulating a logic-based model of the agent ad hoc.
  • Learning theory naturally allows analyzing computational complexity whereas logic-AI often uses models that are either clearly intractable or even clearly incomputable from the onset.
  • Learning theory focuses on objects that are observable o
...

I recently realized that the formalism of incomplete models provides a rather natural solution to all decision theory problems involving "Omega" (something that predicts the agent's decisions). An incomplete hypothesis may be thought of as a zero-sum game between the agent and an imaginary opponent (we will call the opponent "Murphy" as in Murphy's law). If we assume that the agent cannot randomize against Omega, we need to use the deterministic version of the formalism. That is, an agent that learns an incomplete hypothesis converges to the corresponding max

...
6Nisan1yMy takeaway from this is that if we're doing policy selection in an environment that contains predictors, instead of applying the counterfactual belief that the predictor is always right, we can assume that we get rewarded if the predictor is wrong, and then take maximin. How would you handle Agent Simulates Predictor? Is that what TRL is for?
4Vanessa Kosoy1yThat's about right. The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined (that's why people have been struggling with TDT/UDT/FDT for so long) while the thing I'm doing is perfectly well-defined. I describe agents that are able to learn which predictors exist in their environment and respond rationally ("rationally" according to the FDT philosophy). TRL is for many things to do with rational use of computational resources, such as (i) doing multi-level modelling [https://www.alignmentforum.org/posts/3qXE6fK47JhSfkpnB/do-sufficiently-advanced-agents-use-logic#vAtz6tfscsALGPr32] in order to make optimal use of "thinking time" and "interacting with environment time" (i.e. simultaneously optimize sample and computational complexity) (ii) recursive self-improvement (iii) defending from non-Cartesian daemons (iv) preventing thought crimes. But, yes, it also provides a solution to ASP [https://www.alignmentforum.org/posts/S3W4Xrmp6AL7nxRHd/formalising-decision-theory-is-hard#FXt6z9ycAio9jFAtW] . TRL agents can learn whether it's better to be predictable or predicting.
2Chris_Leong1y"The key point is, "applying the counterfactual belief that the predictor is always right" is not really well-defined" - What do you mean here? I'm curious whether you're referring to the same issue as, or a similar issue to, the one I was referencing in Counterfactuals for Perfect Predictors [https://www.lesswrong.com/posts/AKkFh3zKGzcYBiPo7/counterfactuals-for-perfect-predictors]. The TLDR is that I was worried that it would be inconsistent for an agent that never pays in Parfit's Hitchhiker to end up in town if the predictor is perfect, so that it wouldn't actually be well-defined what the predictor was predicting. And the way I ended up resolving this was by imagining it as an agent that takes input and asking what it would output if given that inconsistent input. But not sure if you were referencing this kind of concern or something else.
5Vanessa Kosoy1yIt is not a mere "concern", it's really the crux of the problem. What people in the AI alignment community have been trying to do is, starting with some factual and "objective" description of the universe (such as a program or a mathematical formula) and deriving counterfactuals. The way it's supposed to work is, the agent needs to locate all copies of itself or things "logically correlated" with itself (whatever that means) in the program, and imagine it is controlling this part. But a rigorous definition of this that solves all standard decision theoretic scenarios was never found. Instead of doing that, I suggest a solution of a different nature. In quasi-Bayesian RL, the agent never arrives at a factual and objective description of the universe. Instead, it arrives at a subjective description which already includes counterfactuals. I then proceed to show that, in Newcomb-like scenarios, such agents receive optimal expected utility (i.e. the same expected utility promised by UDT).
2Chris_Leong1yYeah, I agree that the objective descriptions can leave out vital information, such as how the information you know was acquired, which seems important for determining the counterfactuals.
2cousin_it1yBut in Newcomb's problem, the agent's reward in case of wrong prediction is already defined. For example, if the agent one-boxes but the predictor predicted two-boxing, the reward should be zero. If you change that to +infinity, aren't you open to the charge of formalizing the wrong problem?
2Vanessa Kosoy1yThe point is, if you put this "quasi-Bayesian" agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, if you're judging it from the side, you will have to concede it behaves rationally, regardless of its internal representation of reality. Philosophically, my point of view is, it is an error to think that counterfactuals have objective, observer-independent, meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent's point of view, it would reach Nirvana if it dodged the predictor. From Omega's point of view, if Omega two-boxed and the agent one-boxed, the agent's reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual "Omega makes an error of prediction" is ill-defined, it's conditioning on an event of probability 0.
2cousin_it1yYeah, I think I can make peace with that. Another way to think of it is that we can keep the reward structure of the original Newcomb's problem, but instead of saying "Omega is almost always right" we add another person Bob (maybe the mad scientist who built Omega) who's willing to pay you a billion dollars if you prove Omega wrong. Then minimaxing indeed leads to one-boxing. Though I guess the remaining question is why minimaxing is the right thing to do. And if randomizing is allowed, the idea of Omega predicting how you'll randomize seems a bit dodgy as well.
4Vanessa Kosoy1yAnother explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs [https://www.alignmentforum.org/posts/Ajcq9xWi2fmgn8RBJ/the-credit-assignment-problem#X6fFvAHkxCPmQYB6v], the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form $\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq f(\mu)$. Here, $f$ doesn't have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left hand side is linear in $\mu$, so any $\pi$ that satisfies this will also satisfy it for the concave hull of $f$. What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is $\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq V(\mu,\gamma)-f(\mu)$. But, it has the same form! Therefore we can consider it as a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on $\gamma$, but this is not a problem for the basics of the formalism). What if we want our policy to be at least as good as some fixed policy $\pi'_0$? Then the desideratum is $\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq\mathrm{E}_{\mu\pi'_0}[U(\gamma)]$. It still has the same form! Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form: $\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq f(\pi,\mu)$. To achieve this, we postulate a predictor that guesses the policy, producing the guess $\hat\pi$, and define the fuzzy belief using the function $\mathrm{E}_{h\sim\mu}[f(\hat\pi(h),\mu)]$ (we assume the guess is not influenced by the agent's actions so we don't need $\pi$ in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate. In particular, this captures self-referential desiderata of the type "the policy cannot be improved by changing it in this particular way". 
These are of the form: $\lim_{\gamma\to1}\mathrm{E}_{\mu\pi_\gamma}[U(\gamma)]\geq\mathrm{E}_{\mu F(\pi)}[U(\gamma)]$ It also allo
2Vanessa Kosoy1yWell, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more "philosophical" defense of maximin is possible, analogous to VNM / complete class theorems, but I don't know (I actually saw some papers in that vein but haven't read them in detail.) If the agent has random bits that Omega doesn't see, and Omega is predicting the probabilities of the agent's actions, then I think we can still solve it with quasi-Bayesian agents but it requires considering more complicated models and I haven't worked out the details. Specifically, I think that we can define some function X that depends on the agent's actions and Omega's predictions so far (a measure of Omega's apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of X over time is finite with probability 1. Then, we consider a family of models, where model number n says that X<n for all times. Since at least one of these models is true, the agent will learn it, and will converge to behaving appropriately. EDIT 1: I think X should be something like, how much money would a gambler following a particular strategy win, betting against Omega. EDIT 2: Here is the solution. 
In the case of original Newcomb, consider a gambler that bets against Om
1Linda Linsefors1yI agree that you can assign whatever belief you want (e.g. whatever is useful for the agent's decision-making process) for what happens in the counterfactual when Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However, if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.
2Vanessa Kosoy1yThe method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if "idealized Omega" is wrong.
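The maximin-with-Nirvana rule discussed in this thread can be sketched for one-shot Newcomb. This is only an illustration of the decision rule, not of the quasi-Bayesian learning process itself. The dollar amounts are the conventional Newcomb payoffs and are an assumption here, as are the names `PAYOFF`, `nirvana_reward` and `maximin_action`; Murphy is modelled as choosing Omega's prediction adversarially.

```python
import math

# Conventional Newcomb payoffs, keyed by (action, prediction).
# Illustrative values only; the thread does not fix specific numbers.
PAYOFF = {
    ("one-box", "one-box"): 1_000_000,
    ("one-box", "two-box"): 0,
    ("two-box", "one-box"): 1_001_000,
    ("two-box", "two-box"): 1_000,
}
ACTIONS = ("one-box", "two-box")


def nirvana_reward(action, prediction):
    """Reward under the Nirvana trick: a wrong prediction gives unbounded reward."""
    if action != prediction:
        return math.inf
    return PAYOFF[(action, prediction)]


def maximin_action(reward):
    """Return the action with the best worst case over Murphy's choice of prediction."""
    worst_case = {a: min(reward(a, p) for p in ACTIONS) for a in ACTIONS}
    best = max(worst_case, key=worst_case.get)
    return best, worst_case


# Plain maximin over the raw payoff table (prediction treated as adversarial).
naive_choice, naive_values = maximin_action(lambda a, p: PAYOFF[(a, p)])
# Maximin after replacing every "Omega is wrong" outcome with Nirvana.
nirvana_choice, nirvana_values = maximin_action(nirvana_reward)

print("maximin on raw payoffs:", naive_choice, naive_values)
print("maximin with Nirvana:  ", nirvana_choice, nirvana_values)
```

With Nirvana in place, Murphy never gains by making Omega wrong, so the worst case for each action is the outcome in which the prediction is correct, and the rule one-boxes. Without it, the same maximin rule two-boxes.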

Epistemic status: moderately confident, based on indirect evidence

I realized that it is very hard to impossible to publish an academic work that takes more than one conceptual inferential step away from the current paradigm. Especially when the inferential steps happen in different fields of knowledge.

You cannot publish a paper where you use computational learning theory to solve metaphysics, and then use the new metaphysics to solve the interpretation of quantum mechanics. A physics publication will not understand the first part, or even understand how it

...
4Dagon9moHmm. I think I need more detail on your model of publishing and wide-acceptance and their relationship to truth. It seems likely that unless they're circularly dependent, you can publish the smaller-departure in parallel with exploring the further implications in different journals, and in research agendas rather than results publication.
2Pattern9moSo there's journals of X, Y, and Z, but not XYZ? (In hindsight this sounds obvious, though the only obvious alternatives would be * it's hard, but the hardness is in figuring out which place can handle the combination/complexity * Publishing anything is hard (or there's a limit to the time/space allocated per month, and timing matters)

Master post for ideas about infra-Bayesianism.

7Vanessa Kosoy4moIn the anthropic trilemma [https://www.lesswrong.com/posts/y7jZ9BLEeuNTzgAE5/the-anthropic-trilemma], Yudkowsky writes about the thorny problem of understanding subjective probability in a setting where copying and modifying minds is possible. Here, I will argue that infra-Bayesianism (IB) leads to the solution. Consider a population of robots, each of which is a regular RL agent. The environment produces the observations of the robots, but can also make copies or delete portions of their memories. If we consider a random robot sampled from the population, the history they observed will be biased compared to the "physical" baseline. Indeed, suppose that a particular observation $c$ has the property that every time a robot makes it, 10 copies of them are created in the next moment. Then, a random robot will have $c$ much more often in their history than the physical frequency with which $c$ is encountered, due to the resulting "selection bias". We call this setting "anthropic RL" (ARL). The original motivation for IB was non-realizability. But, in ARL, Bayesianism runs into issues even when the environment is realizable from the "physical" perspective. For example, we can consider an "anthropic MDP" (AMDP). An AMDP has finite sets of actions ($A$) and states ($S$), and a transition kernel $T: A\times S\to\Delta(S^*)$. The output is a string of states instead of a single state, because many copies of the agent might be instantiated on the next round, each with their own state. In general, there will be no single Bayesian hypothesis that captures the distribution over histories that the average robot sees at any given moment of time (at any given moment of time we sample a robot out of the population and look at their history). This is because the distributions at different moments of time are mutually inconsistent. 
[EDIT: Actually, given that we don't care about the order of robots, the signature of the transition kernel should be $T: A\times S\to\Delta\mathbb{N}^S$] The consistency that is violated is exactly the c
2Charlie Steiner24dCould you expand a little on why you say that no Bayesian hypothesis captures the distribution over robot-histories at different times? It seems like you can unroll an AMDP into a "memory MDP" that puts memory information of the robot into the state, thus allowing Bayesian calculation of the distribution over states in the memory MDP to capture history information in the AMDP.
2Vanessa Kosoy23dI'm not sure what you mean by that "unrolling". Can you write a mathematical definition? Let's consider a simple example. There are two states: $s_0$ and $s_1$. There is just one action so we can ignore it. $s_0$ is the initial state. An $s_0$ robot transitions into an $s_1$ robot. An $s_1$ robot transitions into an $s_0$ robot and an $s_1$ robot. What will our population look like? 0th step: all robots remember $s_0$. 1st step: all robots remember $s_0s_1$. 2nd step: 1/2 of robots remember $s_0s_1s_0$ and 1/2 of robots remember $s_0s_1s_1$. 3rd step: 1/3 of robots remember $s_0s_1s_0s_1$, 1/3 of robots remember $s_0s_1s_1s_0$ and 1/3 of robots remember $s_0s_1s_1s_1$. There is no Bayesian hypothesis a robot can have that gives correct predictions both for step 2 and step 3. Indeed, to be consistent with step 2 we must have $\Pr[s_0s_1s_0]=\frac{1}{2}$ and $\Pr[s_0s_1s_1]=\frac{1}{2}$. But, to be consistent with step 3 we must have $\Pr[s_0s_1s_0]=\frac{1}{3}$, $\Pr[s_0s_1s_1]=\frac{2}{3}$. In other words, there is no Bayesian hypothesis s.t. we can guarantee that a randomly sampled robot on a sufficiently late time step will have learned this hypothesis with high probability. The apparent transition probabilities keep shifting s.t. it might always continue to seem that the world is complicated enough to prevent our robot from having learned it already. Or, at least it's not obvious there is such a hypothesis. In this example, $\frac{\Pr[s_0s_1s_1]}{\Pr[s_0s_1s_0]}$ will converge to the golden ratio at late steps. But, do all probabilities converge fast enough for learning to happen, in general? I don't know, maybe for finite state spaces it can work. Would definitely be interesting to check. [EDIT: actually, in this example there is such a hypothesis but in general there isn't, see below [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=E58br2mJWbgzQqZhX] ]
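The two-state example above is small enough to simulate directly. The following is a minimal sketch that tracks every remembered history in the population and reports which fraction of robots have a history starting with a given prefix. The names `CHILDREN`, `step` and `prefix_fraction` are made up for this illustration.

```python
from collections import Counter
from fractions import Fraction

# Transition rule from the example: an s0 robot becomes one s1 robot;
# an s1 robot becomes one s0 robot and one s1 robot.
CHILDREN = {"s0": ("s1",), "s1": ("s0", "s1")}


def step(population):
    """Advance a Counter mapping history (tuple of states) -> number of robots."""
    nxt = Counter()
    for history, count in population.items():
        for child in CHILDREN[history[-1]]:
            nxt[history + (child,)] += count
    return nxt


def prefix_fraction(population, prefix):
    """Fraction of robots whose remembered history starts with `prefix`."""
    total = sum(population.values())
    hits = sum(c for h, c in population.items() if h[:len(prefix)] == prefix)
    return Fraction(hits, total)


population = Counter({("s0",): 1})  # step 0: a single s0 robot
fractions_by_step = {}
for t in range(1, 21):
    population = step(population)
    if t >= 2:
        fractions_by_step[t] = (
            prefix_fraction(population, ("s0", "s1", "s0")),
            prefix_fraction(population, ("s0", "s1", "s1")),
        )

for t in (2, 3, 20):
    p0, p1 = fractions_by_step[t]
    print(f"step {t}: Pr[s0s1s0]={p0}, Pr[s0s1s1]={p1}, ratio={float(p1 / p0):.6f}")
```

The fractions at steps 2 and 3 match the ones stated in the comment, and the ratio of the two prefix probabilities approaches the golden ratio as the step count grows.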
2Charlie Steiner23dGreat example. At least for the purposes of explaining what I mean :) The memory AMDP would just replace the states $s_0$, $s_1$ with the memory states $[s_0]$, $[s_1]$, $[s_0,s_0]$, $[s_0,s_1]$, etc. The action takes a robot in $[s_0]$ to memory state $[s_0,s_1]$, and a robot in $[s_0,s_1]$ to one robot in $[s_0,s_1,s_0]$ and another in $[s_0,s_1,s_1]$. (Skip this paragraph unless the specifics of what's going on aren't obvious: given a transition distribution $P(s'^*|s,\pi)$ ($P$ being the distribution over sets of states $s'^*$ given starting state $s$ and policy $\pi$), we can define the memory transition distribution $P(s'^*_m|s_m,\pi)$ given policy $\pi$ and starting "memory state" $s_m\in S^*$ (note that this star actually does mean finite sequences, sorry for notational ugliness). First we plug the last element of $s_m$ into the transition distribution as the current state. Then for each $s'^*$ in the domain, for each element in $s'^*$ we concatenate that element onto the end of $s_m$ and collect these $s'_m$ into a set $s'^*_m$, which is assigned the same probability $P(s'^*)$.) So now at time t=2, if you sample a robot, the probability that its state begins with $[s_0,s_1,s_1]$ is 0.5. And at time t=3, if you sample a robot that probability changes to 0.66. This is the same result as for the regular MDP, it's just that we've turned a question about the history of agents, which may be ill-defined, into a question about which states agents are in. I'm still confused about what you mean by "Bayesian hypothesis" though. Do you mean a hypothesis that takes the form of a non-anthropic MDP?
2Vanessa Kosoy22dI'm not quite sure what you are trying to say here, probably my explanation of the framework was lacking. The robots already remember the history, like in classical RL. The question about the histories is perfectly well-defined. In other words, we are already implicitly doing what you described. It's like in classical RL theory, when you're proving a regret bound or whatever, your probability space consists of histories. Yes, or a classical RL environment. Ofc if we allow infinite state spaces, then any environment can be regarded as an MDP (whose states are histories). That is, I'm talking about hypotheses which conform to the classical "cybernetic agent model". If you wish, we can call it "Bayesian cybernetic hypothesis". Also, I want to clarify something I was myself confused about in the previous comment. For an anthropic Markov chain (when there is only one action) with a finite number of states, we can give a Bayesian cybernetic description, but for a general anthropic MDP we cannot even if the number of states is finite. Indeed, consider some $T: S\to\Delta\mathbb{N}^S$. We can take its expected value to get $ET: S\to\mathbb{R}^S_+$. Assuming the chain is communicating, $ET$ is an irreducible non-negative matrix, so by the Perron-Frobenius theorem it has a unique-up-to-scalar maximal eigenvector $\eta\in\mathbb{R}^S_+$. We then get the subjective transition kernel: $ST(t\mid s)=\frac{ET(t\mid s)\,\eta_t}{\sum_{t'\in S}ET(t'\mid s)\,\eta_{t'}}$. Now, consider the following example of an AMDP. There are three actions $A:=\{a,b,c\}$ and two states $S:=\{s_0,s_1\}$. When we apply $a$ to an $s_0$ robot, it creates two $s_0$ robots, whereas when we apply $a$ to an $s_1$ robot, it leaves one $s_1$ robot. When we apply $b$ to an $s_1$ robot, it creates two $s_1$ robots, whereas when we apply $b$ to an $s_0$ robot, it leaves one $s_0$ robot. When we apply $c$ to any robot, it results in one robot whose state is $s_0$ with probability $\frac{1}{2}$ and $s_1$ with probability $\frac{1}{2}$. Consider the following two policies. $\pi_a$ takes the sequence of actions $cacaca\ldots$ and $\pi_b$ takes the sequence of actions $cbcbcb\ldots$. 
A population that f
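As a sanity check on the subjective transition kernel formula above, here is a minimal sketch applied to the earlier two-state anthropic Markov chain ($s_0\to s_1$, $s_1\to s_0+s_1$). Two things are assumptions of this sketch rather than statements from the comment: the matrix convention `K[s][t]` = expected number of $t$ robots produced by one $s$ robot, and the reading of "maximal eigenvector" as the $\eta$ with $\sum_t ET(t\mid s)\,\eta_t=\lambda\eta_s$ (for this symmetric example the left and right eigenvectors coincide). The function names are hypothetical, and plain power iteration is used, which additionally needs the matrix to be aperiodic.

```python
def perron_vector(K, iters=200):
    """Power iteration for the dominant eigenvector eta with K @ eta = lam * eta.

    Assumes K is non-negative, irreducible and aperiodic, so the iteration converges.
    """
    n = len(K)
    eta = [1.0] * n
    for _ in range(iters):
        new = [sum(K[s][t] * eta[t] for t in range(n)) for s in range(n)]
        norm = sum(new)
        eta = [x / norm for x in new]
    return eta


def subjective_kernel(K):
    """ST[s][t] = K[s][t] * eta[t] / sum over t' of K[s][t'] * eta[t']."""
    eta = perron_vector(K)
    n = len(K)
    ST = []
    for s in range(n):
        weights = [K[s][t] * eta[t] for t in range(n)]
        total = sum(weights)
        ST.append([w / total for w in weights])
    return ST


# K[s][t]: expected number of t-robots produced by one s-robot.
# Row 0 is s0 (becomes one s1 robot); row 1 is s1 (becomes one s0 and one s1 robot).
K = [[0.0, 1.0],
     [1.0, 1.0]]
ST = subjective_kernel(K)

for s, row in enumerate(ST):
    print(f"from s{s}: " + ", ".join(f"ST(s{t}|s{s})={p:.6f}" for t, p in enumerate(row)))
print("ratio ST(s1|s1) / ST(s0|s1) =", ST[1][1] / ST[1][0])
```

Under these assumptions the ratio of the two subjective probabilities out of $s_1$ comes out as the golden ratio, which agrees with the limiting ratio mentioned in the earlier comment.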
2Charlie Steiner22dAh, okay, I see what you mean. Like how preferences are divisible into "selfish" and "worldly" components, where the selfish component is what's impacted by a future simulation of you that is about to have good things happen to it. (edit: The reward function in AMDPs can either be analogous to "worldly" and just sum the reward calculated at individual timesteps, or analogous to "selfish" and calculated by taking the limit of the subjective distribution over parts of the history, then applying a reward function to the expected histories.) I brought up the histories->states thing because I didn't understand what you were getting at, so I was concerned that something unrealistic was going on. For example, if you assume that the agent can remember its history, how can you possibly handle an environment with memory-wiping? In fact, to me the example is still somewhat murky, because you're talking about the subjective probability of a state given a policy and a timestep, but if the agents know their histories there is no actual agent in the information-state that corresponds to having those probabilities. In an MDP the agents just have probabilities over transitions - so maybe a clearer example is an agent that copies itself if it wins the lottery having a larger subjective transition probability of going from gambling to winning. (i.e. states are losing and winning, actions are gamble and copy, the policy is to gamble until you win and then copy).
2Vanessa Kosoy22dAMDP is only a toy model that distills the core difficulty into more or less the simplest non-trivial framework. The rewards are "selfish": there is a reward function $r:(S\times A)^*\to\mathbb{R}$ which allows assigning utilities to histories by time discounted summation, and we consider the expected utility of a random robot sampled from a late population. And, there is no memory wiping. To describe memory wiping we indeed need to do the "unrolling" you suggested. (Notice that from the cybernetic model POV, the history is only the remembered history.) For a more complete framework, we can use an ontology chain [https://www.lesswrong.com/posts/dPmmuaz9szk26BkmD/vanessa-kosoy-s-shortform?commentId=SBPzgAZgFFxtL9E64], but (i) instead of $A\times O$ labels use $A\times M$ labels, where $M$ is the set of possible memory states (a policy is then described by $\pi:M\to A$), to allow for agents that don't fully trust their memory (ii) consider another chain with a bigger state space $S'$ plus a mapping $p:S'\to\mathbb{N}^S$ s.t. the transition kernels are compatible. Here, the semantics of $p(s)$ is: the multiset of ontological states resulting from interpreting the physical state $s$ by taking the viewpoints of different agents $s$ contains. I didn't understand "no actual agent in the information-state that corresponds to having those probabilities". What does it mean to have an agent in the information-state?
2Charlie Steiner21dNevermind, I think I was just looking at it with the wrong class of reward function in mind.
4Vanessa Kosoy4moThere is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory [https://www.alignmentforum.org/posts/5bd75cc58225bf0670374e61/using-modal-fixed-points-to-formalize-logical-causality] (MUDT). Consider a one-shot decision theory setting. There is a set of unobservable states $S$, a set of actions $A$ and a reward function $r: A\times S\to[0,1]$. An IBDT agent has some belief $\beta\in\square S$ [1], and it chooses the action $a^* := \operatorname{argmax}_{a\in A}\mathrm{E}_\beta[\lambda s.r(a,s)]$. We can construct an equivalent scenario, by augmenting this one with a perfect predictor of the agent (Omega). To do so, define $S' := A\times S$, where the semantics of $(p,s)$ is "the unobservable state is $s$ and Omega predicts the agent will take action $p$". We then define $r': A\times S'\to[0,1]$ by $r'(a,p,s) := \mathbb{1}_{a=p}\,r(a,s)+\mathbb{1}_{a\neq p}$ and $\beta'\in\square S'$ by $\mathrm{E}_{\beta'}[f] := \min_{p\in A}\mathrm{E}_\beta[\lambda s.f(p,s)]$ ($\beta'$ is what we call the pullback of $\beta$ to $S'$, i.e. we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction. The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule. For any $p\in A$, define $\Omega_p\in\square S'$ by $\mathrm{E}_{\Omega_p}[f] := \min_{s\in S}f(p,s)$. That is, $\Omega_p$ is an infra-Bayesian representation of the belief "Omega will make prediction $p$". For any $u\in[0,1]$, define $R_u\in\square S'$ by $\mathrm{E}_{R_u}[f] := \min_{\mu\in\Delta S':\,\mathrm{E}_\mu[r(p,s)]\geq u}\mathrm{E}_\mu[f(p,s)]$. $R_u$ can be interpreted as the belief "assuming Omega is accurate, the expected reward will be at least $u$". We will also need to use the order $\preceq$ on $\square X$ defined by: $\phi\preceq\psi$ when $\forall f\in[0,1]^X: \mathrm{E}_\phi[f]\geq\mathrm{E}_\psi[f]$. The reversal is needed to make the analogy to logic intuitive. Indeed, $\phi\preceq\psi$ can be interpreted as "$\phi$ implies $\psi$" [2], the meet operator $\wedge$ can be interpreted as logical conjunction and the join operator $\vee$ can be interpreted as logical disjunction. 
Claim: $a^*=\operatorname{argmax}_{a\in A}\max\{u\in[0,1]\mid\beta'\wedge\Omega_a\preceq R_u\}$ (Actually I only checked it when we restrict to crisp infradistributions, in which case ∧ is intersection of sets and ⪯ is set conta
2Vanessa Kosoy3dInfra-Bayesianism can be naturally understood as semantics for a certain non-classical logic. This promises an elegant synthesis between deductive/symbolic reasoning and inductive/intuitive reasoning, with several possible applications. Specifically, here we will explain how this can work for higher-order logic. There might be holes and/or redundancies in the precise definitions given here, but I'm quite confident the overall idea is sound. For simplicity, we will only work with crisp infradistributions, although a lot of this stuff can work for more general types of infradistributions as well. Therefore, $\square X$ will denote the space of crisp infradistributions. Given $\mu\in\square X$, $S(\mu)\subseteq\Delta X$ will denote the corresponding convex set. As opposed to previously, we will include the empty set, i.e. there is $\bot_X\in\square X$ s.t. $S(\bot_X)=\emptyset$. Given $p\in\Delta X$ and $\mu\in\square X$, $p:\mu$ will mean $p\in S(\mu)$. Given $\mu,\nu\in\square X$, $\mu\preceq\nu$ will mean $S(\mu)\subseteq S(\nu)$.

Syntax

Let $T_\iota$ denote a set which we interpret as the types of individuals (we allow more than one). We then recursively define the full set of types $T$ by:

  • $0\in T$ (intended meaning: the uninhabited type)
  • $1\in T$ (intended meaning: the one element type)
  • If $\alpha\in T_\iota$ then $\alpha\in T$
  • If $\alpha,\beta\in T$ then $\alpha+\beta\in T$ (intended meaning: disjoint union)
  • If $\alpha,\beta\in T$ then $\alpha\times\beta\in T$ (intended meaning: Cartesian product)
  • If $\alpha\in T$ then $(\alpha)\in T$ (intended meaning: predicates with argument of type $\alpha$)

For each $\alpha,\beta\in T$, there is a set $F^0_{\alpha\to\beta}$ which we interpret as atomic terms of type $\alpha\to\beta$. We will denote $V^0_\alpha := F^0_{1\to\alpha}$. Among those we distinguish the logical atomic terms:

  • $\mathrm{id}_\alpha\in F^0_{\alpha\to\alpha}$
  • $0_\alpha\in F^0_{0\to\alpha}$
  • $1_\alpha\in F^0_{\alpha\to1}$
  • $\mathrm{pr}_{\alpha\beta}\in F^0_{\alpha\times\beta\to\alpha}$
  • $i_{\alpha\beta}\in F^0_{\alpha\to\alpha+\beta}$
  • Symbols we will not list explicitly, that correspond to the algebraic properties of $+$ and $\times$ (commutativity, associativity, distributivity and the neutrality of 0 and 1). For example, given $\alpha,\beta\in T$ there is a "commutator" of type $\alpha\times\beta\to\beta\times\alpha$.
  • $=_\alpha\in V^0_{(\alpha\times\alpha)}$
  • $\mathrm{diag}_\alpha\in F^0_{\alpha\to\alpha\times\alpha}$
  • $()_\alpha\in V^0_{((\alpha)\times\alpha)}$ (intended meaning: predicate evaluation)
  • $\bot\in V^0_{(1)}$
  • $\top\in V^0_{(1)}$
  • $\wedge_\alpha\in F^0_{(\alpha)\times(\alpha)\to(\alpha)}$
  • ∨α∈F0(α
3Vanessa Kosoy2dWhen using infra-Bayesian logic to define a simplicity prior, it is natural to use "axiom circuits" rather than plain formulae. That is, when we write the axioms defining our hypothesis, we are allowed to introduce "shorthand" symbols for repeating terms. This doesn't affect the expressiveness, but it does affect the description length. Indeed, eliminating all the shorthand symbols can increase the length exponentially.
3Vanessa Kosoy3dInstead of introducing all the "algebrator" logical symbols, we can define $T$ as the quotient by the equivalence relation defined by the algebraic laws. We then need only two extra logical atomic terms: * For any $n\in\mathbb{N}$ and $\sigma\in S_n$ (permutation), denote $n:=\sum_{i=1}^n 1$ and require $\sigma^+\in F_{n\to n}$ * For any $n\in\mathbb{N}$ and $\sigma\in S_n$, $\sigma^\times_\alpha\in F_{\alpha^n\to\alpha^n}$ However, if we do this then it's not clear whether deciding that an expression is a well-formed term can be done in polynomial time. Because, to check that the types match, we need to test the identity of algebraic expressions and opening all parentheses might result in something exponentially long.

Consider a Solomonoff inductor predicting the next bit in the sequence {0, 0, 0, 0, 0...} At most places, it will be very certain the next bit is 0. But, at some places it will be less certain: every time the index of the place is highly compressible. Gradually it will converge to being sure the entire sequence is all 0s. But, the convergence will be very slow: about as slow as the inverse Busy Beaver function!

This is not just a quirk of Solomonoff induction, but a general consequence of reasoning using Occam's razor (which is the only reasonable way to re...

3philip_b5moAn alignment-unrelated question: Can we, humans, increase the probability that something weird happens in our spacetime region (e.g., the usual laws of physics stop working) by making it possible to compress our spacetime location? E.g., by building a structure that is very regular (meaning that its description can be very short) and has never been built before in our space region, something like make a huge perfectly aligned rectangular grid of hydrogen atoms, or something like that. -------------------------------------------------------------------------------- It's like a magical ritual for changing the laws of physics. This gives a new meaning to summoning circles, pentagrams, etc.
4Vanessa Kosoy5moWe can rephrase your question as follows: "Can we increase the probability of finding an error in the known laws of physics by performing an experiment with a simple property that never happened before, either naturally or artificially"? And the answer is: yes! This is actually what experimental physicists do all the time: perform experiments that try to probe novel circumstances where it is plausible (Occam-razor-wise) that new physics will be discovered. As to magical rituals, sufficiently advanced technology is indistinguishable from magic :)
1MikkW5moI have a sense that similar principles are at play with Spaced Repetition, and that pointing out that connection may be relevant to effectively handling this issue
1ofer5moI think that in embedded settings (with a bounded version of Solomonoff induction) convergence may never occur, even in the limit as the amount of compute that is used for executing the agent goes to infinity. Suppose the observation history contains sensory data that reveals the probability distribution that the agent had, in the last time step, for the next number it's going to see in the target sequence. Now consider the program that says: "if the last number was predicted by the agent to be 0 with probability larger than $1-2^{-10^{10}}$ then the next number is 1; otherwise it is 0." Since it takes much less than $10^{10}$ bits to write that program, the agent will never predict two times in a row that the next number is 0 with probability larger than $1-2^{-10^{10}}$ (after observing only 0s so far).

One subject I like to harp on is reinforcement learning with traps (actions that cause irreversible long term damage). Traps are important for two reasons. One is that the presence of traps is at the heart of the AI risk concept: attacks on the user, corruption of the input/reward channels, and harmful self-modification can all be conceptualized as traps. Another is that without understanding traps we can't understand long-term planning, which is a key ingredient of goal-directed intelligence.

In general, a prior that contains traps will be unlearnable, mea

...

One of the central challenges in Dialogic Reinforcement Learning is dealing with fickle users, i.e. the user changing eir mind in illegible ways that cannot necessarily be modeled as, say, Bayesian updating. To take this into account, we cannot use the naive notion of subjective regret bound, since the user doesn't have a well-defined prior. I propose to solve this by extending the notion of dynamically inconsistent preferences to dynamically inconsistent beliefs. We think of the system as a game, where every action-observation history corresponds

... (read more)
2Vanessa Kosoy1yThere is a deficiency in this "dynamically subjective" regret bound (which can also be called a "realizable misalignment" bound) as a candidate formalization of alignment. It is not robust to scaling down [https://www.alignmentforum.org/posts/bBdfbWfWxHN9Chjcq/robustness-to-scale]. If the AI's prior allows it to accurately model the user's beliefs (realizability assumption), then the criterion seems correct. But, imagine that the user's beliefs are too complex and an accurate model is not possible. Then the realizability assumption is violated and the regret bound guarantees nothing. More precisely, the AI may use incomplete models [https://www.alignmentforum.org/posts/5bd75cc58225bf0670375575/the-learning-theoretic-ai-alignment-research-agenda] to capture some properties of the user's beliefs and exploit them, but this might not be good enough. Therefore, such an AI might fall into a dangerous zone where it is powerful enough to cause catastrophic damage but not powerful enough to know it shouldn't do it. To fix this problem, we need to introduce another criterion which has to hold simultaneously with the misalignment bound. We require that, for any reality that satisfies the basic assumptions built into the prior (such as: the baseline policy is fairly safe, most questions are fairly safe, human beliefs don't change too fast, etc.), the agent will not fail catastrophically. (Asking it to converge to optimality would be way too much: it would violate no-free-lunch.) In order to formalize "not fail catastrophically" I propose the following definition. Let's start with the case when the user's preferences and beliefs are dynamically consistent. Consider some AI-observable event S that might happen in the world. Consider a candidate learning algorithm πlearn and two auxiliary policies. The policy πbase→S follows the baseline policy until S happens, at which time it switches to the subjectively optimal policy. 
The policy πlearn→S follows the candidate learning algorithm unt
2TurnTrout1yThis seems quite close (or even identical) to attainable utility preservation [https://arxiv.org/abs/1902.09725]; if I understand correctly, this echoes arguments I've made [https://www.lesswrong.com/posts/yEa7kwoMpsBgaBCgb/towards-a-new-impact-measure#wXHJArzDPoYejHuz2] for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.
2Vanessa Kosoy1yThere is some similarity, but there are also major differences. They don't even have the same type signature. The dangerousness bound is a desideratum that any given algorithm can either satisfy or not. On the other hand, AUP is a specific heuristic for how to tweak Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but they will still be very different conditions. The reason I pointed out the relation to corrigibility is not because I think that's the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that "if you run this AI, this won't make things worse than not running the AI", no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with. From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also opposite arguments. Specifically, if you believe that debate is a necessary component of Dialogic RL then it seems like the dangerousness bound is infeasible. The AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can't (empirically and in the worst-case) prove a negative.
2Vanessa Kosoy1yDialogic RL assumes that the user has beliefs about the AI's ontology. This includes the environment(fn1) from the AI's perspective. In other words, the user needs to have beliefs about the AI's counterfactuals (the things that would happen if the AI chooses different possible actions). But, what are the semantics of the AI's counterfactuals from the user's perspective? This is more or less the same question that was studied by the MIRI-sphere for a while, starting from Newcomb's paradox, TDT et cetera. Luckily, I now have an answer [https://www.alignmentforum.org/posts/dPmmuaz9szk26BkmD/shortform#TzkG7veQAMMRNh3Pg] based on the incomplete models formalism. This answer can be applied in this case also, quite naturally. Specifically, we assume that there is a sense, meaningful to the user, in which ey select the AI policy (program the AI). Therefore, from the user's perspective, the AI policy is a user action. Again from the user's perspective, the AI's actions and observations are all part of the outcome. The user's beliefs about the user's counterfactuals can therefore be expressed as σ: Π → Δ((A×O)^ω)(fn2), where Π is the space of AI policies(fn3). We assume that for every π∈Π, σ(π) is consistent with π in the natural sense. Such a belief can be transformed into an incomplete model from the AI's perspective, using the same technique we used to solve Newcomb-like decision problems, with σ playing the role of Omega. For a deterministic AI, this model looks like: (i) at first, "Murphy" makes a guess that the AI's policy is π=πguess; (ii) the environment behaves according to the conditional measures of σ(πguess); (iii) if the AI's policy ever deviates from πguess, the AI immediately enters an eternal "Nirvana" state with maximal reward. For a stochastic AI, we need to apply the technique with statistical tests and multiple models alluded to in the link. 
This can also be generalized to the setting where the user's beliefs are already an incomplete model, by adding another step
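The deterministic-AI construction in steps (i) to (iii) above can be sketched roughly as follows. This is a toy illustration, not the formalism from the linked post. The class `MurphyEnvironment`, the type aliases, `toy_sigma` and `always_a` are all hypothetical names, rewards are assumed to be normalized to [0, 1], and σ(πguess) is represented by its conditional distribution over the next observation. The reward outside Nirvana depends on the user's preferences, which the text does not specify, so the sketch returns `None` there.

```python
import random
from typing import Callable, Dict, Optional, Tuple

History = Tuple[Tuple[str, str], ...]            # past (action, observation) pairs
Policy = Callable[[History], str]                # deterministic AI policy
Conditional = Callable[[History, str], Dict[str, float]]  # P(next obs | history, action)
Belief = Callable[[Policy], Conditional]         # sigma: policy -> conditional measures

NIRVANA = "nirvana"
MAX_REWARD = 1.0  # assumption: rewards are normalized to [0, 1]


class MurphyEnvironment:
    """One environment of the incomplete model, for a fixed guess pi_guess."""

    def __init__(self, pi_guess: Policy, sigma: Belief, seed: int = 0) -> None:
        self.pi_guess = pi_guess              # (i) Murphy's guess of the AI policy
        self.conditional = sigma(pi_guess)    # (ii) the user's belief given that guess
        self.history: History = ()
        self.in_nirvana = False
        self.rng = random.Random(seed)

    def step(self, action: str) -> Tuple[str, Optional[float]]:
        # (iii) any deviation from the guessed policy leads to eternal Nirvana.
        if self.in_nirvana or action != self.pi_guess(self.history):
            self.in_nirvana = True
            return NIRVANA, MAX_REWARD
        # (ii) otherwise sample the observation from sigma(pi_guess).
        dist = self.conditional(self.history, action)
        obs = self.rng.choices(list(dist), weights=list(dist.values()))[0]
        self.history = self.history + ((action, obs),)
        return obs, None  # reward outside Nirvana is left unspecified in this sketch


def toy_sigma(policy: Policy) -> Conditional:
    """Toy belief: the user expects the observation "ok" whatever happens."""
    return lambda history, action: {"ok": 1.0}


def always_a(history: History) -> str:
    """Toy guessed policy: always take action "a"."""
    return "a"
```

In this picture the incomplete model is the whole family of such environments, one per possible guess, with Murphy choosing among them. Because a wrong guess sends the AI to Nirvana, which is maximally rewarding, the worst case for the AI is the environment whose guess matches its actual policy.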
1Gurkenglas1yNirvana and the chicken rule both smell distasteful like proofs by contradiction, as though most everything worth doing can be done without them, and more canonically to boot. (Conjecture: This can be proven, but only by contradiction.)
2Vanessa Kosoy1yMaybe? I am not sure that I like Nirvana, but it doesn't seem that bad. If someone thinks of a solution without it, I would be interested.
2Vanessa Kosoy1yAnother notable feature of this approach is its resistance to "attacks from the future", as opposed to approaches based on forecasting. In the latter, the AI has to predict some future observation, for example what the user will write after working on some problem for a long time. In particular, this is how the distillation step in IDA is normally assumed to work, AFAIU. Such a forecaster might sample a future in which a UFAI has been instantiated and this UFAI will exploit this to infiltrate the present. This might result in a self-fulfilling prophecy, but even if the forecasting is counterfactual (and thus immune to self-fulfilling prophecies) it can be attacked by a UFAI that came to be for unrelated reasons. We can ameliorate this by making the forecasting recursive (i.e. applying multiple distillation & amplification steps) or by using some other technique to compress a lot of "thinking time" into a small interval of physical time. However, this is still vulnerable to UFAIs that might arise already at present with a small probability rate (these are likely to exist since our putative FAI is deployed at a time when technology progressed enough to make competing AGI projects a real possibility). Now, compare this to Dialogic RL, as defined via the framework of dynamically inconsistent beliefs. Dialogic RL might also employ forecasting to sample the future, presumably more accurate, beliefs of the user. However, if the user is aware of the possibility of a future attack, this possibility is reflected in eir beliefs, and the AI will automatically take it into account and deflect it as much as possible.
2Vanessa Kosoy1yThis approach also obviates the need for an explicit commitment mechanism. Instead, the AI uses the current user's beliefs about the quality of future user beliefs to decide whether it should wait for the user's beliefs to improve or commit to an irreversible course of action. Sometimes it can also predict the future user beliefs instead of waiting (predict according to current user beliefs updated by the AI's observations).
2Vanessa Kosoy1y(moved to alignment forum)

In my previous shortform, I used the phrase "attack vector", borrowed from classical computer security. What does it mean to speak of an "attack vector" in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.

In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating

... (read more)

A sketch of a proposed solution to the hard problem of consciousness: An entity is conscious if and only if (i) it is an intelligent agent (i.e. a sufficiently general reinforcement learning system) and (ii) its values depend on the presence and/or state of other conscious entities. Yes, this definition is self-referential, but hopefully some fixed point theorem applies. There may be multiple fixed points, corresponding to "mutually alien types of consciousness".

Why is this the correct definition? Because it describes precisely the type of agent who would care about the hard problem of consciousness.
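To make the self-reference concrete, here is a toy sketch that enumerates the fixed points of the definition over a small finite set of entities. Everything in it is an illustrative assumption: the entities, the `is_agent` flags, the `cares_about` relation (standing in for "its values depend on"), and the function names. It only shows what "multiple fixed points" could look like, and says nothing about which fixed point theorem would apply in general.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def conscious_under(candidate: FrozenSet[str],
                    is_agent: Dict[str, bool],
                    cares_about: Dict[str, Set[str]]) -> FrozenSet[str]:
    """Apply the definition once, taking `candidate` as the set of conscious entities."""
    return frozenset(
        e for e in is_agent
        if is_agent[e]                                                   # condition (i)
        and any(o in candidate for o in cares_about[e] if o != e)        # condition (ii)
    )


def fixed_points(is_agent: Dict[str, bool],
                 cares_about: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Brute-force every subset and keep those the definition maps to themselves."""
    entities = sorted(is_agent)
    found = []
    for size in range(len(entities) + 1):
        for subset in combinations(entities, size):
            s = frozenset(subset)
            if conscious_under(s, is_agent, cares_about) == s:
                found.append(s)
    return found


# Hypothetical world: two pairs of agents that only care about each other,
# one agent that cares about nobody, and one non-agent.
is_agent = {"A": True, "B": True, "C": True, "D": True, "E": True, "rock": False}
cares_about = {"A": {"B"}, "B": {"A"}, "C": {"D"}, "D": {"C"}, "E": set(), "rock": set()}

points = fixed_points(is_agent, cares_about)
```

In this toy world the fixed points are the empty set, {A, B}, {C, D} and {A, B, C, D}. The two pairs are self-consistent independently of each other, which is one way to picture "mutually alien types of consciousness".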

2cousin_it1yI'm not sure your definition has much to do with consciousness, as it would also be satisfied by an AI that runs on an Intel processor and whose utility function says all AIs should run on Intel processors.
2Vanessa Kosoy1yIts utility function would have to say that all conscious AIs should run on Intel processors. There is self-reference there. But, I only have rather low confidence this idea is correct (whatever being correct means here) or important.
2G Gordon Worley III1yThis seems to me to address the meta problem of consciousness rather than the hard problem of consciousness itself, since you seem to be more offering an etiology for the existence of agents that would care about the hard problem of consciousness rather than an etiology of qualia.
2Vanessa Kosoy1yYes, but I also claim that the agents that would care about the hard problem of consciousness are exactly the agents that are themselves conscious.
2Matt Goldenberg1yI'm trying to figure out what precisely #2 means. How do you define "values"? IE, if I'm a deep learning algorithm in a tournament with other deep learning algorithms, certainly my instrumental values depend on the state of other deep learning algorithms. Is that sufficient in your definition for consciousness?
2Vanessa Kosoy1yNo, I am talking about terminal values. Something like an instrumental reward function [https://www.alignmentforum.org/posts/aAzApjEpdYwAxnsAS/reinforcement-learning-with-imperceptible-rewards] specified via an ontology that involves consciousness. Also, I am not sure deep learning in its present form qualifies as sufficiently "general" for (i).
1Gurkenglas1yYour definition says that people's models of other people can be conscious, doesn't it?

A summary of my current breakdown of the problem of traps into subproblems and possible paths to solutions. Those subproblems are different but related. Therefore, it is desirable to not only solve each separately, but also to have an elegant synthesis of the solutions.

Problem 1: In the presence of traps, Bayes-optimality becomes NP-hard even on the weakly feasible level (i.e. using the number of states, actions and hypotheses as security parameters).

Currently I only have speculations about the solution. But, I have a few desiderata for it:

De... (read more)

It seems useful to consider agents that reason in terms of an unobservable ontology, and may have uncertainty over what this ontology is. In particular, in Dialogic RL, the user's preferences are probably defined w.r.t. an ontology that is unobservable by the AI (and probably unobservable by the user too) which the AI has to learn (and the user is probably uncertain about emself). However, ontologies are more naturally thought of as objects in a category than as elements in a set. The formalization of an "ontology" should probably be a POMDP or a suitable

... (read more)