I don't know what the "underlying real probability" is (no condescension intended in this remark; I'm genuinely confused about the physics and philosophy of probability and haven't got time to figure it out for myself, and I'm not sure this is a settled question).
Both P and Q are something that is implemented (i.e., encoded in some way) by the agent itself. The agent knows nothing about the "true generative model" of the environment (even if we can discuss it; see below). The only place where "the feedback from the environment" enters this process is in the calculation of P(s_{t+1}|o_t), the so-called "posterior" belief, which is calculated according to the rules of Bayesian inference. This is the place where the agent is "ensured not to detach from the observations", i.e., from the reality of its environment.
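To make the posterior step concrete, here is a minimal sketch of exact Bayesian belief updating in a discrete-state model (the numbers and the two-state setup are my own toy illustration, not taken from any of the cited sources):

```python
import numpy as np

prior = np.array([0.5, 0.5])          # P(s): agent's prior belief over 2 hidden states
likelihood = np.array([[0.9, 0.2],    # P(o|s): rows = observations, cols = states
                       [0.1, 0.8]])

def posterior(prior, likelihood, obs):
    """P(s|o) ∝ P(o|s) P(s) — Bayes rule for a discrete model."""
    unnorm = likelihood[obs] * prior
    return unnorm / unnorm.sum()

# Observing outcome 0 shifts belief toward state 0, which explains it better.
post = posterior(prior, likelihood, obs=0)
print(post)
```

This is the one step in the loop where the environment's data, rather than the agent's own model, determines the result.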
I would say the book doesn't do a very good job of explaining this point. I recommend section 1 ("Basic terminology, concepts, and mathematics") and appendix A ("Additional mathematical details") of this paper; they make the mathematics of Active Inference really clear and explain every transition and derivation of the formalism in detail.
Then, even though an agent uses "its own" generative model of the environment, it is expected to track, with some degree of fidelity, the real dynamics of the environment. This is the whole point of Active Inference, of course. I used the phrase "real dynamics" rather than "generative model" because there is philosophical nuance here that can make the phrase "generative model of the environment" misleading or confusing. There was a paper specifically aimed at clearing up this confusion ("A tale of two densities: Active Inference is enactive inference"), but I think that attempt failed, i.e., the paper only added more confusion. Instead of that paper, for the physical foundations of Active Inference, which also elucidate this dynamic between the agent and the environment, I'd recommend "A free energy principle for generic quantum systems".
I don't have the time - or knowledge - to respond to everything, but from your response, I worry that my article partially missed the target. I'm trying to argue that humans may not be just utility maximizers and that a large part of being human (or maybe any organism?) is to just enjoy the world via some quasi-non-rewarded types of behavior. So there's no real utility for some, or perhaps the most important, things that we value. Seeking out "surprising" results does help AIs and humans learn, and so does seeking out information. But I'm not sure human psychology supports human intrinsic rewards as necessarily related to utility maximization. I do view survival and procreation as genetically encoded drives - but they are not the innate drives I described above. It's not completely clear what we gain when we enjoy being in the world, learning, socializing.
Let me rephrase your thought, as I understand it: "I don't think humans are (pure) RL-like agents, they are more like ActInf agents" (by "pure" RL I mean RL without entropy regularization, or other schemes that motivate exploration).
There is copious literature finding the neuronal, neuropsychological, or psychological makeup of humans to be "basically implementing Active Inference", as well as "basically implementing RL". The portion of this research that is more rigorous maps empirical observations from neurobiology directly onto the mathematics of ActInf and RL, respectively. I think this kind of research is useful: it equips us with instruments to predict certain aspects of human behaviour and suggests avenues for treating disorders.
The portion of this research that is less rigorous and more philosophical is like pointing out "it looks like humans behave here like ActInf agents", or "it looks like humans behave here like RL agents". This kind of philosophy is only useful for suggesting directions for mining empirical observations, to either confirm or disprove theories that in this or that corner of behaviour/psychology, humans act more like ActInf or RL agents. (Note that I would not count observations from psychology here, because they are notoriously unreliable themselves; see the reproducibility crisis, etc.)
I'm aware of Friston's free energy principle (it was one of the first things I looked at in graduate school). I personally view most of it as non-falsifiable, but I know that many have used it to derive useful interpretations of brain function.
RL is not falsifiable either. Both can be seen as normative theories of agency. Normative theories are unfalsifiable: they are prescriptions, or, if you want, sources of the definition of agency.
However, I would say that ActInf is also a physical theory (apart from being normative) because it's derived from (or at least related to) statistical mechanics and the principle of least action. RL is "just" a normative framework of agency because I don't see any relationship with physics in it (again, if you don't add entropy regularisation).
I would say that my question - which I did not answer in the post - is whether we can design AIs that don't seek to maximize some utility or minimize some cost?
I answered this question above: yes, you can design an AI that will not minimise or maximise any utility or cost, but only some form of energy. Just choose Active Inference, ReduNet, GFlowNet, or LeCun's architecture. It's not just renaming "utility" into "energy"; there is a deep philosophical departure. (I'm not sure it's articulated somewhere in a piece dedicated to this question; the best resources that I can recommend are the sections which discuss RL in the Active Inference book, LeCun's paper (see the section "Reward is not enough"), and Bengio's GFlowNet tutorial; all links are above.)
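One way to see the philosophical departure is the GFlowNet idea of sampling in proportion to reward rather than maximising it. This toy contrast is my own illustration (not actual training code from any of the cited architectures): a utility maximiser collapses onto the single best option, while a GFlowNet-style sampler targets p(x) ∝ R(x) and keeps diversity.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 2.0, 4.0, 1.0])  # R(x) over 4 candidate objects

# Utility maximiser: always picks argmax R(x), a single point.
argmax_choice = rewards.argmax()

# GFlowNet-style target distribution: p(x) = R(x) / Z, a whole distribution.
p = rewards / rewards.sum()
samples = rng.choice(len(rewards), size=10_000, p=p)

print(argmax_choice)                  # the single best object
print(np.bincount(samples) / 10_000)  # ≈ [0.125, 0.25, 0.5, 0.125]
```

The sampler still visits high-reward objects most often, but never stops generating the alternatives.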
However, as I pointed out above, this doesn't save you from instrumental convergence. Which can be just as bad (for humans) as a prototypical utility/cost/paperclip maximiser.
If you want an agent that doesn't instrumentally converge at all, please see the discussion of Mild Optimization.
Caveats apply: embedded agents could still emerge inside agents with these architectures, and these embedded agents might in principle be RL. Perhaps, this is actually why humans sometimes exhibit RL-like behaviour, even though "fundamentally" they are more like ActInf agents.
Although AI agent paradigms use explicit reward approaches, the psychology of human motivation suggests that humans value internally generated reward as much if not more than external reward. I suggest that AIs that begin to exhibit behaviors that appear to be "internally" rewarded may reflect signs of AGI.
Maximisation of explicit reward is the defining feature of Reinforcement Learning, but this is just one of many agentic intelligence architectures (see alternatives listed here). The architectures that I mention below in this comment: Active Inference, ReduNets, and GFlowNets, all use intrinsic motivation. And modern Reinforcement Learning with Entropy Regularization also models intrinsic motivation. Therefore, no, intrinsic motivation is not a sign of looming AGI. In principle, agents on any scale and capability level can have it.
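For concreteness, here is a hedged sketch (my own toy numbers) of the entropy-regularised objective mentioned above: instead of maximising expected reward alone, the agent maximises J(π) = E_π[r] + α·H(π), which rewards keeping the policy stochastic and thus builds a form of intrinsic exploration drive into the objective itself.

```python
import numpy as np

def soft_objective(pi, rewards, alpha):
    """Entropy-regularised objective: expected reward plus alpha * policy entropy."""
    pi = np.asarray(pi, dtype=float)
    entropy = -np.sum(pi * np.log(pi + 1e-12))
    return np.dot(pi, rewards) + alpha * entropy

rewards = np.array([1.0, 0.9])  # two actions with nearly equal reward

greedy = soft_objective([1.0, 0.0], rewards, alpha=0.5)  # deterministic policy
mixed  = soft_objective([0.6, 0.4], rewards, alpha=0.5)  # stochastic policy
print(greedy, mixed)  # with alpha > 0, the mixed policy scores higher
```

With α = 0 the greedy policy wins and we are back to pure reward maximisation; the entropy bonus is exactly what makes the agent "want" to keep exploring.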
See Friston et al. "Active inference and epistemic value" (2015):
Curiosity and Bayesian surprise

Epistemic value and implicit exploratory behavior are related to curiosity in psychology (Harlow, 1950; Ryan & Deci, 1985) and intrinsic motivation in reinforcement learning (Baldassarre & Mirolli, 2013; Barto, Singh, & Chentanez, 2004; Oudeyer & Kaplan, 2007; Schembri, Mirolli, & Baldassare, 2007; Schmidhuber, 1991). Here intrinsic stands in opposition to extrinsic (e.g., drive or goal) value. While we have focused on reducing uncertainty during inference, most reinforcement learning research uses curiosity or novelty-based mechanisms to learn a policy or model efficiently. The general idea here is that an agent should select actions that improve learning or prediction, thus avoiding behaviors that preclude learning (either because these behaviors are already learned or because they are unlearnable). It has often been emphasized that adaptive agents should seek out surprising stimuli, not unsurprising stimuli as assumed in active inference. This apparent discrepancy can be reconciled if one considers that surprising events, in the setting of curiosity and Bayesian surprise, are simply outcomes that are salient and minimize uncertainty. In active inference, agents are surprised when they do not minimize uncertainty. It is salient (counterfactual) outcomes that optimize exploration (and model selection) and salience-seeking behavior stems nicely from the more general objective of minimizing expected free energy (or surprise proper).

There is, however, an important difference between active inference and the concepts of curiosity and Bayesian surprise, at least as they are usually used. Salience is typically framed in “bottom-up” terms, in that the agents are not assumed to have a particular goal or task. This is also a characteristic of curiosity (and similar) algorithms that try to learn all possible models, without knowing in advance which will be useful for achieving a specific goal.
The active inference scheme considered here contextualizes the utilitarian value of competing policies in terms of their epistemic value, where the implicit reduction in uncertainty is (or can be) tailored for the goals or preferred outcomes in mind.
Also, the maximisation of information gain (aka epistemic value, Bayesian surprise, intrinsic motivation, optimal (Bayesian) experimental design, and the infomax principle) is discussed a lot in the more recent book Active Inference (Parr, Pezzulo, and Friston 2022), especially in Chapters 2 and 10. The book is available online for free. The exploration-exploitation tradeoff, as well as the relationships of Active Inference with decision-making and other (agentic) intelligence frameworks, are discussed there.
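The decomposition the quote relies on can be sketched numerically. This is my own toy two-state example (not code from the book): expected free energy splits into −(epistemic value) − (pragmatic value), so curiosity and goal-seeking fall out of a single quantity.

```python
import numpy as np

def expected_free_energy(qs, A, log_C):
    """qs: Q(s) predicted under a policy; A: P(o|s); log_C: log-preferences over o."""
    qo = A @ qs                              # predicted observations Q(o)
    # Epistemic value: expected information gain E_Q(o)[ KL(Q(s|o) || Q(s)) ]
    epistemic = 0.0
    for o in range(A.shape[0]):
        post = A[o] * qs / qo[o]             # Q(s|o) by Bayes rule
        epistemic += qo[o] * np.sum(post * np.log(post / qs + 1e-16))
    pragmatic = qo @ log_C                   # expected log-preference
    return -epistemic - pragmatic

A = np.array([[0.9, 0.1],
              [0.1, 0.9]])                   # informative likelihood mapping
qs = np.array([0.5, 0.5])                    # uncertain belief under some policy
log_C = np.log(np.array([0.7, 0.3]))         # the agent prefers observation 0

print(expected_free_energy(qs, A, log_C))
```

A policy whose observations are informative (A far from uniform) earns epistemic value and lowers its expected free energy, which is exactly the sense in which exploration is built into the objective rather than bolted on.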
A normative theory of agency alternative to Active Inference, the Principle of Maximizing Rate Reduction, can also be viewed as a generalisation of information gain (Chan, Yu, You et al. 2022). See also this recent workshop with Ma and Friston (the masterminds behind ReduNets and Active Inference, respectively).
Yet another framework suitable for AI agents, GFlowNet, also doesn't use the notion of reward maximisation; instead, it fits a sampler to the "reward function", which is conceptually very similar to Active Inference agents minimising their expected free energy.
If by "internal reward" you mean "intrinsically determined preferences/goals", then Active Inference operationalises this too, as prior preferences, which can be learned just like everything else. This is also the answer to your question "Might AIs develop innate needs that go beyond receiving external rewards and maximizing utility?"
What might a non-utility maximizing AGI look like?
Neither Active Inference nor ReduNets nor GFlowNets nor LeCun's architecture "maximises utility", but they all still instrumentally converge. Instrumental convergence roughly equals capability (fitness).
I think this is an important direction of work because, despite lots of concerns on this forum about the interpretability and explainability of ML, I think that in practice (in the worlds where we survive, at least) AI agents cooperating within systems (or playing against each other in games, or a mixture of these two modes) are going to be more transparent to each other than humans are to each other.
People always think in private; they sometimes give away their "real" thoughts and intentions via facial microexpressions and flushing, but less so in the era of remote communication.
AIs, on the other hand, always learn by receiving and processing data and doing number crunching on it, so we should probably expect that we will build infrastructure for logging this data and looking for signs of deception in the weights or activations selectively saved for future security processing. Moreover, if a single integrated stream of thought (like people have in their heads) proves important for general intelligence, we should expect all these streams of thought to be recorded.
I think it's also important to transfer insights from mathematical constructs (including UDT, FDT, superrationality, and games with perfect prediction) onto a physical footing. Here, I argued that FDT should be seen as a group (collective) decision theory, i.e., a piece of collective intelligence. In "An active inference model of collective intelligence", Kaufmann et al. proposed a physics-based explanatory theory of collective intelligence (I don't have an opinion about this theory; I just point to it as one of the proposals out there).
In such theories and game setups, I think it's important to consider bounded rationality (see "Resource-rational analysis: Understanding human cognition as the optimal use of limited computational resources", "Information overload for (bounded) rational agents"), communication costs, communication delays, the delays (and costs) of reaching consensus, and the non-zero cost of updating one's own beliefs (this bears on "Free Will" and the assumption that agents can defect at any moment, which leads to the idea of pre-commitments in the Newcomb problem and Parfit's hitchhiker, without considering that basically any belief is a micro pre-commitment to that belief). Also, in iterated games, we should ensure that we don't model agents as making ergodicity assumptions when that's not "real-world rational".
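The ergodicity point can be shown with a standard toy example (my own choice of numbers): a multiplicative gamble whose ensemble average grows, while almost every individual trajectory decays over time, so a "rational" bet by the ensemble average is ruinous for any single agent.

```python
import numpy as np

rng = np.random.default_rng(42)
up, down, rounds, agents = 1.5, 0.6, 1000, 1000  # win x1.5 or lose x0.6, p = 0.5 each

# Ensemble (expected-value) growth per round: looks like a good bet.
ensemble_growth = 0.5 * up + 0.5 * down          # 1.05 > 1

# Time-average growth rate along a single trajectory: almost-sure decay.
time_growth = np.sqrt(up * down)                 # ≈ 0.949 < 1

flips = rng.choice([up, down], size=(agents, rounds))
wealth = flips.prod(axis=1)                      # each agent's final wealth
print(ensemble_growth, time_growth)
print(np.median(wealth))                         # the typical agent ends near zero
```

An agent that assumes ergodicity (time average = ensemble average) would take this bet forever; a "real-world rational" agent should not.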
Thanks, though this is more of a process document than the kind of policy that I'm looking for: a policy which helps to answer the question "Should I treat this or that idea or other information as infohazardous, and why?", except for the five-item list of "presumptuous infohazards". If I understand correctly, the actual policy is left to the judgement of project leads and the coordinator. Or, perhaps, it exists at Conjecture as a written artifact but is considered private or secret itself.
I basically agree with the argument here. I think that approaches to alignment that try to avoid instrumental convergence are generally unlikely to succeed for exactly the reason that this removes the usefulness of AGI.
Note that this doesn't need to be a philosophical point, it's a physical fact that appears self-evident if you look at it through the lens of Active Inference: Active Inference as a formalisation of instrumental convergence.
Conditions B1-B6 seem to correspond exactly to the description of a catallaxy. A catallaxy (such as the current global economic system) has its own dynamics which none of the individual agents participating in it can see and predict, as scholars of complex systems have postulated (e.g., Sidney Dekker, David Woods, Richard Cook). However, I don't know what the modern, physics-based, information-theoretic version of the science of complex systems tells us about this.
(The following is an adapted version of my comment to the post Interpretability/Tool-ness/Alignment/Corrigibility are not Composable that wasn't published.)
Clearly, in an open system (that is, a system which is open to unpredictable influences of the external world, rather than a closed, spherical-cow-in-a-vacuum game-theoretic setup), computational irreducibility prevents any individual agent comprising the system from reliably modelling and predicting the whole system's behaviour. And that's the core idea of complexity theory: a system can be comprised of components that interact in a certain way, which the master designer of the system can predict from their vantage point, but the agents within the system cannot. If they were intelligent enough to do that, the complexity of the system would rise yet again (since more intelligent agents behave in more complex ways) to a level incomprehensible to them.
So, I strongly feel that aligning a network (community, society, competition) of independent superintelligent agents is impossible: this community will steer in some direction that neither humans nor these superintelligent agents can predict and control.
However, here it's timely to note that aligning a singleton superintelligence with "humans", "the civilization", or "the ecosphere" is also impossible, in a sense, because all these are open systems, and that superintelligence would be just a part of that open system. I don't agree with Jan Kulveit that companies, communities, and "molochs" are agents; I'd call them "mindless" emergent non-equilibrium dynamical patterns.
So, the strategy proposed in this post (robustness and safety through diversity) looks more promising to me than strategies with a singleton superintelligence, even if we have bulletproof strong reasons to think that that superintelligence is corrigible and trustworthy.
This also reminds me of the program of systems engineering scientist John Doyle, in which the main idea is that robust systems have heterogeneous, diverse controls and feedback mechanisms in them. The central concept there is a "diversity-enabled sweet spot" (DESS). For a recent article on this topic, check "Internal Feedback in Biological Control: Diversity, Delays, and Standard Theory".
Also, the "singleton superintelligence" design might turn out to be technically impossible: that singleton AI should somehow be everywhere in the world at once, thus it should be a distributed system; even if it's a hierarchical system with a core "supermind", that core must not be deployed alone, for reliability reasons; and then there is a philosophical question of whether we can call this setup "a singleton AI", or whether we have ended up with a distributed network of AIs after all. Though, in this case, these systems are explicitly designed to collude.
The invite links to the AI Alignment Slack are no longer active, does the author (or anyone else) know how to join this Slack now?
Good advice. Some ways to make it even better:
Your idea of "genetic decision theory" is excellent because it provides physical grounding for FDT. It made me realise that FDT is best considered as a group decision theory, where the group can be a pair of people (a relationship), a family, a community, a genus, a nation, humanity; a human-AI pair, a group of people with AIs, a group of AIs, etc.
As I understand your post, I would title it "FDT and CDT/EDT prescribe the decision procedures for agents of different system levels": FDT prescribes decision procedures for groups of agents, while CDT and EDT prescribe decision procedures for individual agents in these groups. If the outcomes of these decision procedures don't match, this represents an inter-scale system conflict/frustration (see "Towards a Theory of Evolution as Multilevel Learning", section 3 "Fundamental evolutionary phenomena", E2. "Frustration"). Other examples of such frustrations are the principal--agent problem, the cancer cell--organism conflict, a man against society or state, etc. (these are just examples of stark conflicts; to some degree, almost all parts of whatever there is are in some frustration with the supra-system containing it).
Agents (either at the lower level or the higher level, or even outside of this hierarchy, i.e., "external designers") should seek to minimise these conflicts via innovation and knowledge (technological, biological, cultural, political, legal, etc.). In the Twin Prisoner's Dilemma case, a genus (a group) should innovate so that its members derive maximum utility when pairs of people from the genus end up in prison together. There are multiple ways to do this: a genetic mutation (either randomly selected or artificially introduced by genetic engineers) which hardwires the brains of the members of the genus so that they always cooperate (a biological innovation); or the genus patriarch (if one exists; actually, there is a real-life example: the mafia, which has bosses, and whose members actually risk going to jail) instituting a rule with very harsh punishment for disobeying it (a social innovation); or a law which allows family members not to testify against one another without negative consequences (a legal innovation); or a smart contract with financial precommitments (a technological innovation). Or spreading the idea of FDT among the group members, which is also an act of social entrepreneurship/social innovation, albeit not a very effective one, I suspect, unless the group in question is a closely-knit community of rationalists :)
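The individual-level vs. group-level decision procedures above can be contrasted in a few lines (my own toy framing with standard Prisoner's Dilemma payoffs, not code from any of the cited papers):

```python
# (my_action, twin_action) -> my payoff; standard PD numbers
PAYOFF = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def cdt_choice(twin_action):
    """Individual (CDT-style) view: treat the twin's action as fixed and best-respond."""
    return max("CD", key=lambda a: PAYOFF[(a, twin_action)])

def group_choice():
    """Group (FDT-style) view: both twins run the same procedure,
    so only (C, C) and (D, D) are reachable outcomes."""
    return max("CD", key=lambda a: PAYOFF[(a, a)])

print(cdt_choice("C"), cdt_choice("D"))  # defection dominates pointwise
print(group_choice())                    # cooperation wins at the group level
```

The mismatch between the two outputs is exactly the inter-scale frustration described above; the innovations listed are ways of removing it.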
One could respond that it is generally difficult to identify the desiderata, and in most cases, all we have is intuitions over decision problems that are not easily reducible. In particular, it might not be possible to tell if some intuition has to do with ontology or decision theory. For example, perhaps one just wants to take mutual cooperation in the Twin Prisoner’s Dilemma as a primitive, and until one has figured out why this is a desideratum (and thus figured out if it is about ontology or decision theory), comparisons of decision theories that merely involve ontological differences do in fact carry some information about what ontology is reasonable. I am somewhat sympathetic to the argument in and of itself, although I disagree about the extent to which our intuitions are irreducible to such an extent that we can not tell whether they are about ontology or decision theory.
It's hard for me to understand what you say in this passage, but if you are hinting at the questions "What agent am I?", "Where does my boundary lie/Markov blanket end?", "Whom should I decide for?", then a psychological answer is that the agent continually tests this (i.e., conducts physical experiments) and forms a belief about where its circle of control ends. Thus, a mafia boss believes he controls the group, while an ordinary member does not. A physical theory purporting to answer these questions objectively is minimal physicalism (see "Minimal physicalism as a scale-free substrate for cognition and consciousness", which specifically discusses the question of boundaries and awareness).
Physicalist agent ontology vs. algorithmic/logical agent ontology
I believe there is a methodological problem with "algorithmic/logical ontology" as a substrate for a decision theory, and with FDT as an instance of such a theory: decision theory is a branch of rationality, which is itself a normative discipline applying to particular kinds of physical systems, and thus it must be based on physics. Algorithms are mathematical objects; they can only describe physical objects (information bearers), but don't belong to the physical world themselves and thus cannot cause anything to happen in the physical world (whereas information bearers can).
Thus, any decision theory should deal only with physical objects in its ontology, which could be "the brains of the members of the group" (information bearers of algorithms, for example, the FDT algorithm), but not "algorithms" directly, in the abstract.
The other way to put the above point is that I think that FDT's attempt to escape the causal graph is methodologically nonsensical.
The following questions then arise:

1. Can we meaningfully compare ontologies in the first place?
2. If yes, what makes one ontology preferable to another?

I think these are difficult questions, but ultimately I think that we probably can compare ontologies; some ontologies are simply more reasonable than others, and they do not simply correspond to “different ways of looking at the world” and that’s that.
In light of what I have written above, I think these two questions should be replaced with a single one: What systems should be subjects of our (moral) concern? I.e., in the Twin Prisoner's Dilemma, if a prisoner is concerned about his group (genus), he cooperates; otherwise, he doesn't. This question has an extremely long history and a vast body of writing on it, e.g.: are nations valuable? Are states? Ecosystems? Cells in our organism? And the paper "Minimal physicalism as a scale-free substrate for cognition and consciousness" also introduces an interesting modern twist on this question, namely the conjecture that consciousness is scale-free.
For example, one might argue that ‘agency’ is a high-level emergent phenomenon and that a reductionist physicalist ontology might be too “fine-grained” to capture what we care about, whilst the algorithmic conception abstracts away the correct amount of details.
Again, in the context of minimal physicalism, I think we should best dissect "agency" into more clearly definable, scale-free properties of physical systems (such as autonoetic awareness, introduced in the paper, but also others).